
feat: add SemanticTextSplitter using EmbeddingModel cosine similarity#5816

Open
anuragg-saxenaa wants to merge 1 commit into spring-projects:main from anuragg-saxenaa:feature/semantic-text-splitter

Conversation

@anuragg-saxenaa

Summary

Add SemanticTextSplitter — a new TextSplitter that uses semantic similarity rather than fixed token counts to determine chunk boundaries.

Closes #5464

Motivation

TokenTextSplitter splits at fixed token counts, often breaking sentences mid-thought and degrading RAG retrieval quality. Users frequently reach for external tools (e.g., Docling) or write custom solutions. A native semantic splitter addresses this gap without introducing new external dependencies.

Implementation

SemanticTextSplitter extends the existing TextSplitter base class:

  1. Sentence tokenisation — split on ., !, ? followed by whitespace.
  2. Embedding — embed each sentence via the injected EmbeddingModel.
  3. Cosine similarity — compute similarity between consecutive sentence embeddings.
  4. Chunking — start a new chunk when similarity drops below similarityThreshold or the buffer would exceed maxChunkSize characters.

Defaults: similarityThreshold = 0.5, maxChunkSize = 1000.
No new dependencies — reuses Spring AI's own EmbeddingModel.
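The four steps above can be sketched as plain Java. This is a minimal, self-contained illustration, not the PR's code: `embed()` is a toy stand-in for the injected `EmbeddingModel`, and the zero-vector convention in `cosineSimilarity` is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

public class SemanticSplitterSketch {

    static final double SIMILARITY_THRESHOLD = 0.5; // default from the PR
    static final int MAX_CHUNK_SIZE = 1000;         // default, in characters

    // Step 1: split on ., !, ? followed by whitespace.
    static String[] toSentences(String text) {
        return text.split("(?<=[.!?])\\s+");
    }

    // Step 2 stand-in: a toy character-histogram "embedding" so the
    // sketch runs without a real model (assumption of this sketch).
    static float[] embed(String sentence) {
        float[] v = new float[8];
        for (char c : sentence.toCharArray()) v[c % 8] += 1;
        return v;
    }

    // Step 3: cosine similarity between two embeddings.
    static double cosineSimilarity(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 0; // assumed zero-vector convention
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Step 4: start a new chunk on a similarity drop or a size overflow.
    static List<String> split(String text) {
        if (text == null || text.isBlank()) return List.of();
        List<String> chunks = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        float[] previous = null;
        for (String sentence : toSentences(text)) {
            float[] current = embed(sentence);
            boolean similarityDrop = previous != null
                    && cosineSimilarity(previous, current) < SIMILARITY_THRESHOLD;
            boolean sizeOverflow = buffer.length() + sentence.length() > MAX_CHUNK_SIZE;
            if (buffer.length() > 0 && (similarityDrop || sizeOverflow)) {
                chunks.add(buffer.toString().trim());
                buffer.setLength(0);
            }
            buffer.append(sentence).append(' ');
            previous = current;
        }
        if (buffer.length() > 0) chunks.add(buffer.toString().trim());
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(split("The cat sat. The cat slept. Quarterly revenue rose!"));
    }
}
```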

Example

SemanticTextSplitter splitter = SemanticTextSplitter.builder()
    .embeddingModel(embeddingModel)
    .similarityThreshold(0.6)
    .maxChunkSize(800)
    .build();

List<Document> chunks = splitter.split(document);

Tests (SemanticTextSplitterTests)

| Test | What it verifies |
| --- | --- |
| emptyTextReturnsEmptyList | blank/empty input returns an empty list |
| singleSentenceReturnsSingleChunk | single-sentence edge case |
| highSimilarityKeepsSentencesTogether | identical embeddings (sim = 1.0) → one chunk |
| lowSimilaritySplitsSentences | orthogonal embeddings (sim = 0.0) → separate chunks |
| maxChunkSizeForcesNewChunk | size cap overrides high similarity |
| thresholdExactlyAtSimilarityKeepsTogether | boundary: sim >= threshold keeps together |
| thresholdJustAboveSimilaritySplits | boundary: sim < threshold splits |
| splitDocumentPreservesMetadata | metadata propagation on document-level split |
| cosineSimilarity* (×3) | math helper: identical, orthogonal, zero vectors |
| Constructor validation (×4) | null model, threshold outside [0, 1], non-positive size |
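The three cosineSimilarity cases in the table follow standard cosine math; here is a self-contained check. `cosine()` is a local illustrative helper, not the PR's actual method, and returning 0 for a zero vector is an assumed convention.

```java
// Standalone check of the cosine-similarity cases:
// identical vectors, orthogonal vectors, and a zero vector.
public class CosineCases {
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        System.out.println(cosine(new double[]{1, 2}, new double[]{1, 2})); // ≈ 1.0 (identical)
        System.out.println(cosine(new double[]{1, 0}, new double[]{0, 1})); // ≈ 0.0 (orthogonal)
        System.out.println(cosine(new double[]{0, 0}, new double[]{1, 1})); // 0.0 (zero vector, by convention)
    }
}
```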



Development

Successfully merging this pull request may close these issues.

Add native semantic text chunking support
