Handle Multi-language #24
"french_analyzer": {
    "type": "custom",
    "tokenizer": "standard",
    "filter": [
        "lowercase",
        "asciifolding",
        "french_elision",
        "french_stop",
        "french_stemmer",
    ],
},
I see you override the default French analyser to add "asciifolding" (https://docs.opensearch.org/latest/analyzers/language-analyzers/french/):
- Does asciifolding really improve results compared to the default French analyser? I'm asking because using the default one would make the configuration simpler and more readable. Ignoring accents is certainly a good idea, though.
- I think this analyzer should be configurable, for the internationalization of the product.
- Do you know if we can stack several language analyzers? Is it a bad idea? Or should we simply set custom filters that match several languages if needed?
- Do you know whether search time increases, and by how much?
I don't understand point 1. Even if I remove asciifolding we would need a custom french_analyser.
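For reference, here is a minimal sketch of how the two options discussed above might look as index settings, written as a Python dict. The filter definitions follow the documented decomposition of the built-in French analyzer (elision, stop words, light stemmer). The field names `title_builtin` and `title_custom` are hypothetical, and the project's actual settings may differ.

```python
# Sketch only: a body that could be passed when creating an index.
# Custom token filters referenced by the analyzer must be declared here;
# "lowercase" and "asciifolding" are built-in and need no declaration.
FRENCH_FILTERS = {
    "french_elision": {
        "type": "elision",
        "articles_case": True,
        "articles": ["l", "m", "t", "qu", "n", "s", "j", "d", "c",
                     "jusqu", "quoiqu", "lorsqu", "puisqu"],
    },
    "french_stop": {"type": "stop", "stopwords": "_french_"},
    "french_stemmer": {"type": "stemmer", "language": "light_french"},
}

INDEX_BODY = {
    "settings": {
        "analysis": {
            "filter": FRENCH_FILTERS,
            "analyzer": {
                # Custom analyzer from the diff, with asciifolding added.
                "french_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "asciifolding",
                        "french_elision",
                        "french_stop",
                        "french_stemmer",
                    ],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            # Hypothetical fields: built-in analyzer vs. the custom one.
            "title_builtin": {"type": "text", "analyzer": "french"},
            "title_custom": {"type": "text", "analyzer": "french_analyzer"},
        }
    },
}
```

One thing that may be worth checking with this filter order: asciifolding runs before french_stop, so accented stop words are folded before the stop filter sees them and may no longer match the stop list.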
956f37a to 289b44b
6401738 to 4ee1530
qbey
left a comment
Nice, almost there! Just a few questions (and probably one fix) :)
def detect_language_code(text):
    """Detect the language code of the document content."""

    code_mapping = {"fr": "fr-fr", "en": "en-us", "de": "de-de", "nl": "nl-nl"}
This raises the question: do we need language variation in the index?
We don't. I thought it was a standard used by la Suite.
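A minimal sketch of what detection without language variations could look like, assuming bare codes ("fr", "en", "de", "nl") and an "und" fallback. This is not the code from the PR: the `classify` parameter and the constant names are hypothetical, and the classifier is injected so that the sketch does not depend on a particular library.

```python
from typing import Callable, Tuple

# Assumption: the index only needs bare language codes, no regional variants.
SUPPORTED_LANGUAGES = ("fr", "en", "de", "nl")
UNDETERMINED = "und"


def detect_language_code(
    text: str, classify: Callable[[str], Tuple[str, float]]
) -> str:
    """Return a bare language code, or "und" for empty or unsupported text."""
    if not text or not text.strip():
        return UNDETERMINED
    # `classify` is expected to return a (language, score) pair.
    language, _score = classify(text)
    return language if language in SUPPORTED_LANGUAGES else UNDETERMINED
```

With py3langid installed, `py3langid.classify` should fit the `classify` parameter, since it returns a (language, score) pair.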
def prepare_document_for_indexing(document):
    """Prepare document for indexing using nested language structure and handle embedding"""

    language_code = detect_language_code(f"{document['title']} {document['content']}")
👍 you did well by combining title and content for language detection.
f"title.{language_code}": document["title"],
f"content.{language_code}": document["content"],
Should we remove other language content?
Example:
- I start a new document and give it a simple title like "2025-01 Planning".
- Wait long enough for it to be indexed: the language code will be either "en-us" or "und" (also, py3langid will struggle on very small texts) => prepare_document_for_indexing will fill title.en-us...
- Fill in my document content: "Oui alors pour le planning de find on va faire ceci cela, avé des accents ça aide"
- Wait again for indexing => prepare_document_for_indexing will fill title.fr-fr... but title.en-us will still exist.
In this example, this would not be a big deal, but you get the idea: if I start a document with a huge English copy/paste, then ask docs to translate it into French, both "content" fields will be filled but only one will be kept up to date.
I added to a test test_api_documents_index_and_reindex_same_document to make sure we do not keep the former language code if it changes.
Given the way I built the pipeline, I think we cannot have several title or content fields with different language_code values.
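The scenario above can be illustrated with a small in-memory sketch, assuming that indexing replaces the whole stored document rather than merging into it. The helper names (`build_index_body`, `index_replace`, `index_merge`) and the plain dict standing in for the index are hypothetical, not the project's API.

```python
def build_index_body(document: dict, language_code: str) -> dict:
    """Simplified stand-in for the body built at indexing time."""
    return {
        "id": document["id"],
        f"title.{language_code}": document["title"],
        f"content.{language_code}": document["content"],
    }


def index_replace(index: dict, document: dict, language_code: str) -> None:
    """Full replace: the previous body is dropped, so no stale field survives."""
    body = build_index_body(document, language_code)
    index[body["id"]] = body


def index_merge(index: dict, document: dict, language_code: str) -> None:
    """Partial update: fields from an earlier language code are kept."""
    body = build_index_body(document, language_code)
    index.setdefault(body["id"], {}).update(body)


first = {"id": "doc-1", "title": "2025-01 Planning", "content": ""}
second = {"id": "doc-1", "title": "2025-01 Planning", "content": "Oui alors ..."}

replaced, merged = {}, {}
for store, write in ((replaced, index_replace), (merged, index_merge)):
    write(store, first, "en-us")
    write(store, second, "fr-fr")

print(sorted(replaced["doc-1"]))  # only the fr-fr fields remain
print(sorted(merged["doc-1"]))    # en-us fields are still present
```

Under that assumption, the stale title.en-us field only appears when writes are merged into the existing document, which is the case the reindexing test is meant to rule out.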
I want to automate the search evaluation. This new command computes performance metrics.
I add more data to my evaluations.
add changelog and various fixes.
I introduce two analyzers to improve the full-text search.
I update the changelog
I fix tests and linters
I fix a bunch of small things
I define settings to remove magic numbers
I copy the evaluation command
I index in multiple languages
I changed my mind. I want a flat structure.
the search must be updated so everything works
I add more tests so the feature is better tested
I add documentation so the feature is documented
I made many mistakes. They are now fixed.
things were a bit broken but I fixed them
I change the logic. I detect the language instead of receiving it as query params
things are broken and I fix them here
I improve the documentation a little bit
more tests are better. I add tests.
55d5a32 to 0804d56
7b300ec to 61ff28a
fiiiiiiiiiiiiiiiiiiiiiiiix things
61ff28a to 30da7ed
we do not need language variations
things are broken. now they are fixed.

Pull Request Overview
This pull request enhances the full-text search capabilities by adding multi-language support with language-specific analyzers and trigram-based fuzzy matching. The changes introduce support for French, English, German, and Dutch languages, with proper stemming, stop words, and accent folding for each language.
Note: this PR includes changes from the evaluation branch, which I need to evaluate the impact of my improvements.
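To give an idea of how trigram-based fuzzy matching works in principle, here is a small pure-Python sketch. It is not the PR's implementation, which relies on the search engine's analysis chain; the function names are hypothetical and the similarity measure (Jaccard overlap of character trigrams) is chosen only for illustration.

```python
def trigrams(text: str) -> set[str]:
    """Return the set of lowercase character trigrams of `text`."""
    normalized = text.lower()
    return {normalized[i:i + 3] for i in range(len(normalized) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Jaccard overlap between the trigram sets of two strings."""
    grams_a, grams_b = trigrams(a), trigrams(b)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# A misspelled query still shares most trigrams with the indexed word.
print(trigram_similarity("planning", "planing"))
print(trigram_similarity("planning", "document"))
```

Because a typo only disturbs the few trigrams that overlap it, a misspelled query can still match the indexed term, at the cost of a larger index and some noisier matches.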
Key Changes