Conversation

Contributor

@mascarpon3 mascarpon3 commented Nov 18, 2025

Pull Request Overview

This pull request enhances the full-text search capabilities by adding multi-language support with language-specific analyzers and trigram-based fuzzy matching. The changes introduce support for French, English, German, and Dutch languages, with proper stemming, stop words, and accent folding for each language.

Note: this PR includes changes from the evaluation branch; I need to evaluate the impact of my improvements.

Key Changes

  • Multi-language support: Documents are now indexed with language-specific fields (e.g., title.en-us, content.fr-fr) using appropriate analyzers for each language
  • Added French analyzer with stemming, stop words, elision, and ASCII folding for better French language support
  • Implemented trigram analyzer for fuzzy matching to handle typos and partial word matches
  • Introduced min_max normalization in the search pipeline for consistent score interpretation (a sketch of how these pieces might fit together follows below)
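
As a rough illustration of how these pieces fit together (a sketch only; the analyzer, tokenizer, and pipeline names below are assumptions, not taken from the diff), the index settings and search pipeline might look like this:

INDEX_SETTINGS = {
    "analysis": {
        "tokenizer": {
            # Trigram tokenizer for fuzzy matching of typos and partial words.
            "trigram_tokenizer": {"type": "ngram", "min_gram": 3, "max_gram": 3},
        },
        "analyzer": {
            "trigram_analyzer": {
                "type": "custom",
                "tokenizer": "trigram_tokenizer",
                "filter": ["lowercase", "asciifolding"],
            },
        },
    },
}

INDEX_MAPPINGS = {
    "properties": {
        "title": {
            "properties": {
                # One sub-field per language, each with its own analyzer.
                "en-us": {"type": "text", "analyzer": "english"},
                "fr-fr": {"type": "text", "analyzer": "french_analyzer"},
                "de-de": {"type": "text", "analyzer": "german"},
                "nl-nl": {"type": "text", "analyzer": "dutch"},
            },
        },
        # "content" would mirror the same per-language structure.
    },
}

# A min_max normalization step in the search pipeline rescales per-query
# scores into [0, 1] so they can be interpreted and combined consistently.
SEARCH_PIPELINE = {
    "phase_results_processors": [
        {
            "normalization-processor": {
                "normalization": {"technique": "min_max"},
                "combination": {"technique": "arithmetic_mean"},
            },
        },
    ],
}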

@mascarpon3 mascarpon3 requested review from joehybird and qbey and removed request for joehybird November 18, 2025 14:49
@joehybird joehybird mentioned this pull request Nov 19, 2025
@qbey qbey changed the base branch from main to evaluate November 20, 2025 10:20
Comment on lines 287 to 297
"french_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"asciifolding",
"french_elision",
"french_stop",
"french_stemmer",
],
},
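
The three custom filters this analyzer references are outside the quoted hunk; for context, their declarations would typically follow OpenSearch's documented French-analyzer recipe, roughly like this (a sketch, not the PR's actual code):

"filter": {
    # Strips French elided articles (l', d', qu', ...) before stemming.
    "french_elision": {
        "type": "elision",
        "articles_case": True,
        "articles": [
            "l", "m", "t", "qu", "n", "s", "j", "d", "c",
            "jusqu", "quoiqu", "lorsqu", "puisqu",
        ],
    },
    # Built-in French stop-word list.
    "french_stop": {"type": "stop", "stopwords": "_french_"},
    # Light French stemmer, as in the default French analyzer.
    "french_stemmer": {"type": "stemmer", "language": "light_french"},
},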
Member

I see you override the default French analyzer to add "asciifolding" (https://docs.opensearch.org/latest/analyzers/language-analyzers/french/):

  • Does asciifolding really improve results compared to the default French analyzer? I'm asking because using the default one would improve readability and configuration. For sure, ignoring accents is a good idea.
  • I think this analyzer should be configurable for the internationalization of the product.
  • Do you know if we can stack several language analyzers? Is that a bad idea, or should we simply set custom filters that match several languages if needed?
  • Do you know if search time is increased, and by how much?

Contributor Author

I don't understand point 1. Even if I remove asciifolding, we would still need a custom french_analyzer.

Contributor Author

@mascarpon3 mascarpon3 Nov 26, 2025

I have just computed the impact of analyzers on search time. As I expected, it can be measured but is pretty low.

[Screenshot: search-time measurements, 2025-11-26]

It is two orders of magnitude lower than the vector computation.

@mascarpon3 mascarpon3 force-pushed the evaluate branch 2 times, most recently from 956f37a to 289b44b Compare November 21, 2025 10:23
@mascarpon3 mascarpon3 changed the title from Improve full text to Handle Multi-language Nov 25, 2025
@mascarpon3 mascarpon3 changed the base branch from evaluate to main November 25, 2025 16:19
@mascarpon3 mascarpon3 requested review from joehybird and qbey November 25, 2025 16:22
@mascarpon3 mascarpon3 force-pushed the improve-full-text branch 8 times, most recently from 6401738 to 4ee1530 Compare November 27, 2025 10:41
Member

@qbey qbey left a comment

Nice, almost there! Just a few questions (and probably one fix) :)

def detect_language_code(text):
    """Detect the language code of the document content."""

    code_mapping = {"fr": "fr-fr", "en": "en-us", "de": "de-de", "nl": "nl-nl"}
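
For readers following the thread, here is a minimal sketch of how this helper might be completed with py3langid (the dependency flagged by the Socket bot below); the "und" fallback mirrors the behaviour discussed later in this thread, and the real implementation may differ:

import py3langid as langid

def detect_language_code(text):
    """Detect the language code of the document content."""
    code_mapping = {"fr": "fr-fr", "en": "en-us", "de": "de-de", "nl": "nl-nl"}
    # classify() returns a (language, score) tuple, e.g. ("fr", -120.5).
    lang, _score = langid.classify(text)
    # Fall back to "und" (undetermined) for any unsupported language.
    return code_mapping.get(lang, "und")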
Member

This raises the question: do we need language variation in the index?

Contributor Author

We don't. I thought it was a standard used by la Suite.

def prepare_document_for_indexing(document):
    """Prepare document for indexing using nested language structure and handle embedding"""

    language_code = detect_language_code(f"{document['title']} {document['content']}")
Member

👍 you did well by combining title and content for language detection.

Comment on lines +290 to +293
f"title.{language_code}": document["title"],
f"content.{language_code}": document["content"],
Member

Should we remove other language content?

Example:

  • I start a new document and put a simple title like "2025-01 Planning"
  • Wait enough time for it to be indexed: the language code will be either "en-us" or "und" (also, py3langid will struggle on very small texts) => prepare_document_for_indexing will fill title.en-us...
  • Fill my document content: "Oui alors pour le planning de find on va faire ceci cela, avé des accents ça aide"
  • Wait again for indexation => prepare_document_for_indexing will fill title.fr-fr... but title.en-us will still exist.

In this example, this would not be a big deal, but you get the idea: if I start a document with a huge English copy/paste, then ask Docs to translate it into French, both "content" fields will be filled but only one will be kept up to date.

Contributor Author

I added a test, test_api_documents_index_and_reindex_same_document, to make sure we do not keep the former language code if it changes.

The way I built the pipeline, I think we cannot have several title or content fields with different language codes.
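
For illustration, here is a minimal sketch of why stale fields cannot survive, assuming the document body is rebuilt from scratch and indexed as a full overwrite rather than a partial update (names mirror the diff above; the real code may differ):

def prepare_document_for_indexing(document):
    """Build the full index body; only the currently detected language
    is populated, so a full-document index replaces any earlier fields."""
    language_code = detect_language_code(f"{document['title']} {document['content']}")
    return {
        f"title.{language_code}": document["title"],
        f"content.{language_code}": document["content"],
        # ...embedding and other fields would be added here.
    }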

I want to automate the search evaluation. This new command computes performance metrics.
I add more data to my evaluations.
Add changelog and various fixes.
I introduce two analyzers to improve the full-text search.
I update the changelog
I fix tests and linters
I fix a bunch of small things
I define settings to remove magic numbers
I copy the evaluation command
I index in multi-language
I changed my mind. I want a flat structure.
The search must be updated so everything works
I add more tests so the feature is tested more
I document so the feature is documented
I made many mistakes. They are now fixed.
Things were a bit broken but I fixed them
I change the logic.
I detect the language instead of receiving it as query params
Things are broken and I fix them here
I improve the documentation a little bit
More tests are better. I add tests.
@mascarpon3 mascarpon3 force-pushed the improve-full-text branch 3 times, most recently from 55d5a32 to 0804d56 Compare December 2, 2025 11:27

socket-security bot commented Dec 2, 2025

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Added py3langid@0.3.0 (Supply Chain Security: 99, Vulnerability: 100, Quality: 100, Maintenance: 100, License: 100)

View full report

@mascarpon3 mascarpon3 force-pushed the improve-full-text branch 2 times, most recently from 7b300ec to 61ff28a Compare December 2, 2025 13:23
Fix things
We do not need language variations
Things are broken. Now they are fixed.
@mascarpon3 mascarpon3 merged commit 8b4566b into main Dec 8, 2025
21 of 24 checks passed
@mascarpon3 mascarpon3 deleted the improve-full-text branch December 8, 2025 09:08