Conversation

Contributor

@mascarpon3 mascarpon3 commented Nov 18, 2025

Pull Request Overview

This pull request enhances the full-text search capabilities by adding multi-language support with language-specific analyzers and trigram-based fuzzy matching. The changes introduce support for French, English, German, and Dutch languages, with proper stemming, stop words, and accent folding for each language.

Note: this PR includes changes from the evaluation branch; I need to evaluate the impact of my improvements.

Key Changes

  • Multi-language support: Documents are now indexed with language-specific fields (e.g., title.en-us, content.fr-fr) using appropriate analyzers for each language
  • Added French analyzer with stemming, stop words, elision, and ASCII folding for better French language support
  • Implemented trigram analyzer for fuzzy matching to handle typos and partial word matches
  • Introduced min_max normalization in the search pipeline for consistent score interpretation (a sketch of how these pieces might fit together follows below)
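
As a rough illustration of how these pieces fit together (a sketch only; the analyzer, tokenizer, and pipeline names below are assumptions, not taken from the diff), the index settings and search pipeline might look like this:

INDEX_SETTINGS = {
    "analysis": {
        "tokenizer": {
            # Trigram tokenizer for fuzzy matching of typos and partial words.
            "trigram_tokenizer": {"type": "ngram", "min_gram": 3, "max_gram": 3},
        },
        "analyzer": {
            "trigram_analyzer": {
                "type": "custom",
                "tokenizer": "trigram_tokenizer",
                "filter": ["lowercase", "asciifolding"],
            },
        },
    },
}

INDEX_MAPPINGS = {
    "properties": {
        "title": {
            "properties": {
                # One sub-field per language, each with its own analyzer.
                "en-us": {"type": "text", "analyzer": "english"},
                "fr-fr": {"type": "text", "analyzer": "french_analyzer"},
                "de-de": {"type": "text", "analyzer": "german"},
                "nl-nl": {"type": "text", "analyzer": "dutch"},
            },
        },
        # "content" would mirror the same per-language structure.
    },
}

# A min_max normalization step in the search pipeline rescales per-query
# scores into [0, 1] so they can be interpreted and combined consistently.
SEARCH_PIPELINE = {
    "phase_results_processors": [
        {
            "normalization-processor": {
                "normalization": {"technique": "min_max"},
                "combination": {"technique": "arithmetic_mean"},
            },
        },
    ],
}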

@mascarpon3 mascarpon3 requested review from joehybird and qbey and removed request for joehybird November 18, 2025 14:49
@joehybird joehybird mentioned this pull request Nov 19, 2025
@qbey qbey changed the base branch from main to evaluate November 20, 2025 10:20
Comment on lines 287 to 297
"french_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"asciifolding",
"french_elision",
"french_stop",
"french_stemmer",
],
},
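
The three custom filters this analyzer references are outside the quoted hunk; for context, their declarations would typically follow OpenSearch's documented French-analyzer recipe, roughly like this (a sketch, not the PR's actual code):

"filter": {
    # Strips French elided articles (l', d', qu', ...) before stemming.
    "french_elision": {
        "type": "elision",
        "articles_case": True,
        "articles": [
            "l", "m", "t", "qu", "n", "s", "j", "d", "c",
            "jusqu", "quoiqu", "lorsqu", "puisqu",
        ],
    },
    # Built-in French stop-word list.
    "french_stop": {"type": "stop", "stopwords": "_french_"},
    # Light French stemmer, as in the default French analyzer.
    "french_stemmer": {"type": "stemmer", "language": "light_french"},
},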
Member

I see you override the default French analyzer to add "asciifolding" (https://docs.opensearch.org/latest/analyzers/language-analyzers/french/):

  • Does asciifolding really improve results compared to the default French analyzer? I'm asking because using the default one would improve readability and configuration. For sure, ignoring accents is a good idea.
  • I think this analyzer should be configurable for the internationalization of the product.
  • Do you know if we can stack several language analyzers? Is that a bad idea, or should we simply set custom filters that match several languages if needed?
  • Do you know if search time is increased, and by how much?

Contributor Author

I don't understand point 1. Even if I remove asciifolding, we would still need a custom french_analyzer.

Contributor Author

@mascarpon3 mascarpon3 Nov 26, 2025

I have just computed the impact of analyzers on search time. As I expected, it can be measured but is pretty low.

[Screenshot: search-time measurements, 2025-11-26]

It is two orders of magnitude lower than the vector computation.

@mascarpon3 mascarpon3 force-pushed the evaluate branch 2 times, most recently from 956f37a to 289b44b Compare November 21, 2025 10:23
@mascarpon3 mascarpon3 changed the title from Improve full text to Handle Multi-language Nov 25, 2025
@mascarpon3 mascarpon3 changed the base branch from evaluate to main November 25, 2025 16:19
@mascarpon3 mascarpon3 requested review from joehybird and qbey November 25, 2025 16:22
@mascarpon3 mascarpon3 force-pushed the improve-full-text branch 8 times, most recently from 6401738 to 4ee1530 Compare November 27, 2025 10:41
Member

@qbey qbey left a comment

Nice, almost there! Just a few questions (and probably one fix) :)

def detect_language_code(text):
    """Detect the language code of the document content."""

    code_mapping = {"fr": "fr-fr", "en": "en-us", "de": "de-de", "nl": "nl-nl"}
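
For readers following the thread, here is a minimal sketch of how this helper might be completed with py3langid (the dependency flagged by the Socket bot below); the "und" fallback mirrors the behaviour discussed later in this thread, and the real implementation may differ:

import py3langid as langid

def detect_language_code(text):
    """Detect the language code of the document content."""
    code_mapping = {"fr": "fr-fr", "en": "en-us", "de": "de-de", "nl": "nl-nl"}
    # classify() returns a (language, score) tuple, e.g. ("fr", -120.5).
    lang, _score = langid.classify(text)
    # Fall back to "und" (undetermined) for any unsupported language.
    return code_mapping.get(lang, "und")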
Member

This raises the question: do we need language variation in the index?

Contributor Author

We don't. I thought it was a standard used by la Suite.

def prepare_document_for_indexing(document):
    """Prepare document for indexing using nested language structure and handle embedding"""

    language_code = detect_language_code(f"{document['title']} {document['content']}")
Member

👍 you did well by combining title and content for language detection.

Comment on lines +290 to +293
f"title.{language_code}": document["title"],
f"content.{language_code}": document["content"],
Member

Should we remove other language content?

Example:

  • I start a new document and put a simple title like "2025-01 Planning"
  • Wait enough time for it to be indexed: the language code will be either "en-us" or "und" (also, py3langid will struggle on very small texts) => prepare_document_for_indexing will fill title.en-us...
  • Fill my document content: "Oui alors pour le planning de find on va faire ceci cela, avé des accents ça aide"
  • Wait again for indexation => prepare_document_for_indexing will fill title.fr-fr... but title.en-us will still exist.

In this example, this would not be a big deal, but you get the idea: if I start a document with a huge English copy/paste, then ask Docs to translate it into French, both "content" fields will be filled but only one will be kept up to date.

Contributor Author

I added a test, test_api_documents_index_and_reindex_same_document, to make sure we do not keep the former language code if it changes.

The way I built the pipeline, I think we cannot have several title or content fields with different language codes.
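
For illustration, here is a minimal sketch of why stale fields cannot survive, assuming the document body is rebuilt from scratch and indexed as a full overwrite rather than a partial update (names mirror the diff above; the real code may differ):

def prepare_document_for_indexing(document):
    """Build the full index body; only the currently detected language
    is populated, so a full-document index replaces any earlier fields."""
    language_code = detect_language_code(f"{document['title']} {document['content']}")
    return {
        f"title.{language_code}": document["title"],
        f"content.{language_code}": document["content"],
        # ...embedding and other fields would be added here.
    }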

I want to automate the search evaluation. This new command computes performance metrics.
I add more data to my evaluations.
Add changelog and various fixes.
I introduce two analyzers to improve the full-text search.
I update the changelog
I fix tests and linters
I fix a bunch of small things
I define settings to remove magic numbers
I copy the evaluation command
I index in multi-language
I changed my mind. I want a flat structure.
The search must be updated so everything works
I add more tests so the feature is tested more
I document so the feature is documented
I made many mistakes. They are now fixed.
Things were a bit broken but I fixed them
I change the logic.
I detect the language instead of receiving it as query params
Things are broken and I fix them here
I improve the documentation a little bit
More tests are better. I add tests.
@mascarpon3 mascarpon3 force-pushed the improve-full-text branch 3 times, most recently from 55d5a32 to 0804d56 Compare December 2, 2025 11:27

socket-security bot commented Dec 2, 2025

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Added py3langid@0.3.0 (Supply Chain Security: 99, Vulnerability: 100, Quality: 100, Maintenance: 100, License: 100)

View full report

@mascarpon3 mascarpon3 force-pushed the improve-full-text branch 2 times, most recently from 7b300ec to 61ff28a Compare December 2, 2025 13:23
Fix things
We do not need language variations
Things are broken. Now they are fixed.
@mascarpon3 mascarpon3 merged commit 8b4566b into main Dec 8, 2025
21 of 24 checks passed
@mascarpon3 mascarpon3 deleted the improve-full-text branch December 8, 2025 09:08