
feat: add SemanticTextSplitter using EmbeddingModel cosine similarity#5816

Open
anuragg-saxenaa wants to merge 1 commit into spring-projects:main from anuragg-saxenaa:feature/semantic-text-splitter

Conversation

@anuragg-saxenaa

Summary

Add SemanticTextSplitter — a new TextSplitter that uses semantic similarity rather than fixed token counts to determine chunk boundaries.

Closes #5464

Motivation

TokenTextSplitter splits at fixed token counts, often breaking sentences mid-thought and degrading RAG retrieval quality. Users frequently reach for external tools (e.g., Docling) or write custom solutions. A native semantic splitter addresses this gap without introducing new external dependencies.

Implementation

SemanticTextSplitter extends the existing TextSplitter base class:

  1. Sentence tokenisation — split on ., !, ? followed by whitespace.
  2. Embedding — embed each sentence via the injected EmbeddingModel.
  3. Cosine similarity — compute similarity between consecutive sentence embeddings.
  4. Chunking — start a new chunk when similarity drops below similarityThreshold or the buffer would exceed maxChunkSize characters.

Defaults: similarityThreshold = 0.5, maxChunkSize = 1000.
No new dependencies — reuses Spring AI's own EmbeddingModel.
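The four steps above can be sketched as plain Java. This is a minimal, self-contained illustration, not the PR's code: `embed()` is a toy stand-in for the injected `EmbeddingModel`, and the zero-vector convention in `cosineSimilarity` is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

public class SemanticSplitterSketch {

    static final double SIMILARITY_THRESHOLD = 0.5; // default from the PR
    static final int MAX_CHUNK_SIZE = 1000;         // default, in characters

    // Step 1: split on ., !, ? followed by whitespace.
    static String[] toSentences(String text) {
        return text.split("(?<=[.!?])\\s+");
    }

    // Step 2 stand-in: a toy character-histogram "embedding" so the
    // sketch runs without a real model (assumption of this sketch).
    static float[] embed(String sentence) {
        float[] v = new float[8];
        for (char c : sentence.toCharArray()) v[c % 8] += 1;
        return v;
    }

    // Step 3: cosine similarity between two embeddings.
    static double cosineSimilarity(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 0; // assumed zero-vector convention
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Step 4: start a new chunk on a similarity drop or a size overflow.
    static List<String> split(String text) {
        if (text == null || text.isBlank()) return List.of();
        List<String> chunks = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        float[] previous = null;
        for (String sentence : toSentences(text)) {
            float[] current = embed(sentence);
            boolean similarityDrop = previous != null
                    && cosineSimilarity(previous, current) < SIMILARITY_THRESHOLD;
            boolean sizeOverflow = buffer.length() + sentence.length() > MAX_CHUNK_SIZE;
            if (buffer.length() > 0 && (similarityDrop || sizeOverflow)) {
                chunks.add(buffer.toString().trim());
                buffer.setLength(0);
            }
            buffer.append(sentence).append(' ');
            previous = current;
        }
        if (buffer.length() > 0) chunks.add(buffer.toString().trim());
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(split("The cat sat. The cat slept. Quarterly revenue rose!"));
    }
}
```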

Example

SemanticTextSplitter splitter = SemanticTextSplitter.builder()
    .embeddingModel(embeddingModel)
    .similarityThreshold(0.6)
    .maxChunkSize(800)
    .build();

List<Document> chunks = splitter.split(document);

Tests (SemanticTextSplitterTests)

| Test | What it verifies |
| --- | --- |
| emptyTextReturnsEmptyList | blank/empty input returns an empty list |
| singleSentenceReturnsSingleChunk | single-sentence edge case |
| highSimilarityKeepsSentencesTogether | identical embeddings (sim = 1.0) → one chunk |
| lowSimilaritySplitsSentences | orthogonal embeddings (sim = 0.0) → separate chunks |
| maxChunkSizeForcesNewChunk | size cap overrides high similarity |
| thresholdExactlyAtSimilarityKeepsTogether | boundary: sim >= threshold keeps together |
| thresholdJustAboveSimilaritySplits | boundary: sim < threshold splits |
| splitDocumentPreservesMetadata | metadata propagation on document-level split |
| cosineSimilarity* (×3) | math helper: identical, orthogonal, zero vectors |
| Constructor validation (×4) | null model, threshold outside [0, 1], non-positive size |
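The three cosineSimilarity cases in the table follow standard cosine math; here is a self-contained check. `cosine()` is a local illustrative helper, not the PR's actual method, and returning 0 for a zero vector is an assumed convention.

```java
// Standalone check of the cosine-similarity cases:
// identical vectors, orthogonal vectors, and a zero vector.
public class CosineCases {
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        System.out.println(cosine(new double[]{1, 2}, new double[]{1, 2})); // ≈ 1.0 (identical)
        System.out.println(cosine(new double[]{1, 0}, new double[]{0, 1})); // ≈ 0.0 (orthogonal)
        System.out.println(cosine(new double[]{0, 0}, new double[]{1, 1})); // 0.0 (zero vector, by convention)
    }
}
```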



Development

Successfully merging this pull request may close these issues.

Add native semantic text chunking support
