From beb314b8357772d8d050e4b044fe806e5b0986c0 Mon Sep 17 00:00:00 2001
From: diberry <41597107+diberry@users.noreply.github.com>
Date: Mon, 8 Jun 2026 15:23:45 -0700
Subject: [PATCH 1/2] feat: add Cosmos DB NoSQL create-index sample (Java)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 nosql-create-index-java/.env.example          |  16 ++
 nosql-create-index-java/README.md             |  98 ++++++++
 .../output/sample-output.txt                  |  39 +++
 nosql-create-index-java/pom.xml               |  80 ++++++
 nosql-create-index-java/sample.env            |  16 ++
 .../com/azure/cosmos/createindex/App.java     |  64 +++++
 .../com/azure/cosmos/createindex/Config.java  | 140 +++++++++++
 .../azure/cosmos/createindex/DataPlane.java   | 229 ++++++++++++++++++
 nosql-create-index-java/tests/.gitkeep        |   0
 9 files changed, 682 insertions(+)
 create mode 100644 nosql-create-index-java/.env.example
 create mode 100644 nosql-create-index-java/README.md
 create mode 100644 nosql-create-index-java/output/sample-output.txt
 create mode 100644 nosql-create-index-java/pom.xml
 create mode 100644 nosql-create-index-java/sample.env
 create mode 100644 nosql-create-index-java/src/main/java/com/azure/cosmos/createindex/App.java
 create mode 100644 nosql-create-index-java/src/main/java/com/azure/cosmos/createindex/Config.java
 create mode 100644 nosql-create-index-java/src/main/java/com/azure/cosmos/createindex/DataPlane.java
 create mode 100644 nosql-create-index-java/tests/.gitkeep

diff --git a/nosql-create-index-java/.env.example b/nosql-create-index-java/.env.example
new file mode 100644
index 0000000..733365c
--- /dev/null
+++ b/nosql-create-index-java/.env.example
@@ -0,0 +1,16 @@
+# Azure Cosmos DB for NoSQL
+AZURE_COSMOSDB_ENDPOINT="https://.documents.azure.com:443/"
+AZURE_COSMOSDB_DATABASENAME="Hotels"
+AZURE_COSMOSDB_CONTAINER_NAME=""
+
+# Azure OpenAI embedding configuration
+AZURE_OPENAI_EMBEDDING_ENDPOINT="https://.openai.azure.com/"
+AZURE_OPENAI_EMBEDDING_DEPLOYMENT="text-embedding-3-small"
+AZURE_OPENAI_EMBEDDING_API_VERSION="2024-08-01-preview"
+
+# Vector query selection
+# Set diskann or quantizedflat. Leave empty to run both containers.
+VECTOR_ALGORITHM=""
+
+# Shared repo-root dataset, referenced relative to this sample folder
+DATA_FILE_WITH_VECTORS="..\data\HotelsData_toCosmosDB_Vector.json"
diff --git a/nosql-create-index-java/README.md b/nosql-create-index-java/README.md
new file mode 100644
index 0000000..e27367f
--- /dev/null
+++ b/nosql-create-index-java/README.md
@@ -0,0 +1,98 @@
+# Azure Cosmos DB for NoSQL create-index sample with Java
+
+This sample shows how to load pre-vectorized hotel documents into existing Azure Cosmos DB for NoSQL containers and run vector similarity queries with Java.
+
+It uses:
+- `DefaultAzureCredential` for Azure Cosmos DB and the Azure OpenAI client
+- existing `Hotels` database resources created by `azd up`
+- the shared `..\data\HotelsData_toCosmosDB_Vector.json` dataset
+- bulk upsert operations for `hotels_diskann` and `hotels_quantizedflat`
+- `VectorDistance()` SQL queries for similarity search
+
+> [!IMPORTANT]
+> This sample is data-plane only. It does not create databases, containers, or vector indexes. Run `azd up` from the repo root before you run this sample.
+
+## Prerequisites
+
+- Java 17 LTS or later
+- Maven 3.9 or later
+- Azure CLI installed and signed in with `az login`
+- Azure resources already provisioned by `azd up`
+- Microsoft Entra ID roles:
+  - **Cosmos DB Built-in Data Contributor**
+  - **Cognitive Services OpenAI User**
+
+The sample expects these existing containers in the `Hotels` database:
+- `hotels_diskann`
+- `hotels_quantizedflat`
+
+## Set up the sample
+
+1. Copy the environment template.
+
+   ```powershell
+   Copy-Item sample.env .env
+   ```
+
+2. Update `.env` with your Azure Cosmos DB endpoint and Azure OpenAI settings.
+
+   Notes:
+   - Leave `AZURE_COSMOSDB_CONTAINER_NAME` empty to run all supported containers.
+   - Leave `VECTOR_ALGORITHM` empty to run both algorithms.
+   - Set `VECTOR_ALGORITHM` to `diskann` or `quantizedflat` to run one algorithm.
+   - Set `AZURE_COSMOSDB_CONTAINER_NAME` only if you want to target one container directly.
+
+3. Build the project.
+
+   ```powershell
+   mvn compile
+   ```
+
+## Run the sample
+
+Run from this directory:
+
+```powershell
+mvn exec:java
+```
+
+Examples:
+
+```powershell
+# Run both containers (default)
+mvn exec:java
+
+# Run only DiskANN
+$env:VECTOR_ALGORITHM = 'diskann'
+mvn exec:java
+
+# Run only QuantizedFlat
+$env:VECTOR_ALGORITHM = 'quantizedflat'
+mvn exec:java
+```
+
+## Expected output
+
+The sample prints:
+- configuration and target container selection
+- embedding dimension verification for `text-embedding-3-small`
+- bulk ingestion status for each container
+- top vector matches from each queried container
+
+See `output/sample-output.txt` for example console output.
+
+## Project structure
+
+```text
+nosql-create-index-java/
+├── .env.example
+├── output/
+│   └── sample-output.txt
+├── pom.xml
+├── README.md
+├── sample.env
+└── src/main/java/com/azure/cosmos/createindex/
+    ├── App.java
+    ├── Config.java
+    └── DataPlane.java
+```
diff --git a/nosql-create-index-java/output/sample-output.txt b/nosql-create-index-java/output/sample-output.txt
new file mode 100644
index 0000000..7b90c2a
--- /dev/null
+++ b/nosql-create-index-java/output/sample-output.txt
@@ -0,0 +1,39 @@
+========================================================================
+Azure Cosmos DB for NoSQL - create and query vector indexes with Java
+========================================================================
+Database: Hotels
+Data file: C:\project-dina-ai-dev-tools\repos\public-azure-samples-cosmos-db-vector-samples\data\HotelsData_toCosmosDB_Vector.json
+Target containers: hotels_diskann, hotels_quantizedflat
+
+=== Verify embedding dimensions ===
+Deployment: text-embedding-3-small
+Actual: 1536
+Expected: 1536
+
+=== Ingest documents: hotels_diskann ===
+Upserted 50/50 documents using bulk operations. RU: 6812.47
+
+=== Ingest documents: hotels_quantizedflat ===
+Upserted 50/50 documents using bulk operations. RU: 6810.92
+
+Query text: hotel near the ocean
+
+=== Query results: hotels_diskann (DiskANN) ===
+Request charge: 5.33 RUs
+1. HotelId=11 | HotelName=Royal Cottage Resort | score=0.4991 | Description=Your home away from home. Brand new fully equipped premium rooms, fast WiFi, full kitchen, washer & dryer...
+2. HotelId=47 | HotelName=Country Comfort Inn | score=0.4786 | Description=Situated conveniently at the north end of the village, the inn is just a short walk from the lake...
+3. HotelId=48 | HotelName=Nordick's Valley Motel | score=0.4635 | Description=Only 90 miles from the nation's capital and nearby most everything the historic valley has to offer...
+4. HotelId=19 | HotelName=Economy Universe Motel | score=0.4461 | Description=Local, family-run hotel in bustling downtown Redmond. We are a pet-friendly establishment...
+5. HotelId=7 | HotelName=Roach Motel | score=0.4388 | Description=Perfect Location on Main Street. Earn points while enjoying close proximity to the city's best shopping...
+
+=== Query results: hotels_quantizedflat (QuantizedFlat) ===
+Request charge: 5.35 RUs
+1. HotelId=11 | HotelName=Royal Cottage Resort | score=0.4991 | Description=Your home away from home. Brand new fully equipped premium rooms, fast WiFi, full kitchen, washer & dryer...
+2. HotelId=47 | HotelName=Country Comfort Inn | score=0.4786 | Description=Situated conveniently at the north end of the village, the inn is just a short walk from the lake...
+3. HotelId=48 | HotelName=Nordick's Valley Motel | score=0.4635 | Description=Only 90 miles from the nation's capital and nearby most everything the historic valley has to offer...
+4. HotelId=19 | HotelName=Economy Universe Motel | score=0.4461 | Description=Local, family-run hotel in bustling downtown Redmond. We are a pet-friendly establishment...
+5. HotelId=7 | HotelName=Roach Motel | score=0.4388 | Description=Perfect Location on Main Street. Earn points while enjoying close proximity to the city's best shopping...
+
+========================================================================
+Complete
+========================================================================
diff --git a/nosql-create-index-java/pom.xml b/nosql-create-index-java/pom.xml
new file mode 100644
index 0000000..0d8312a
--- /dev/null
+++ b/nosql-create-index-java/pom.xml
@@ -0,0 +1,80 @@
+<project xmlns="http://maven.apache.org/POM/4.0.0"
+         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+  <modelVersion>4.0.0</modelVersion>
+
+  <groupId>com.azure.samples.cosmosdb</groupId>
+  <artifactId>nosql-create-index-java</artifactId>
+  <version>1.0.0</version>
+  <name>Azure Cosmos DB NoSQL Create Index - Java</name>
+
+  <properties>
+    <maven.compiler.release>17</maven.compiler.release>
+    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
+  </properties>
+
+  <dependencyManagement>
+    <dependencies>
+      <dependency>
+        <groupId>com.azure</groupId>
+        <artifactId>azure-sdk-bom</artifactId>
+        <version>1.2.23</version>
+        <type>pom</type>
+        <scope>import</scope>
+      </dependency>
+    </dependencies>
+  </dependencyManagement>
+
+  <dependencies>
+    <dependency>
+      <groupId>com.azure</groupId>
+      <artifactId>azure-cosmos</artifactId>
+    </dependency>
+    <dependency>
+      <groupId>com.azure</groupId>
+      <artifactId>azure-identity</artifactId>
+    </dependency>
+    <dependency>
+      <groupId>com.azure</groupId>
+      <artifactId>azure-ai-openai</artifactId>
+      <version>1.0.0-beta.16</version>
+    </dependency>
+    <dependency>
+      <groupId>io.github.cdimascio</groupId>
+      <artifactId>dotenv-java</artifactId>
+      <version>3.0.0</version>
+    </dependency>
+    <dependency>
+      <groupId>com.fasterxml.jackson.core</groupId>
+      <artifactId>jackson-databind</artifactId>
+      <version>2.18.2</version>
+    </dependency>
+    <dependency>
+      <groupId>org.slf4j</groupId>
+      <artifactId>slf4j-nop</artifactId>
+      <version>2.0.17</version>
+      <scope>runtime</scope>
+    </dependency>
+  </dependencies>
+
+  <build>
+    <plugins>
+      <plugin>
+        <groupId>org.apache.maven.plugins</groupId>
+        <artifactId>maven-compiler-plugin</artifactId>
+        <version>3.14.0</version>
+        <configuration>
+          <release>${maven.compiler.release}</release>
+        </configuration>
+      </plugin>
+      <plugin>
+        <groupId>org.codehaus.mojo</groupId>
+        <artifactId>exec-maven-plugin</artifactId>
+        <version>3.5.0</version>
+        <configuration>
+          <mainClass>com.azure.cosmos.createindex.App</mainClass>
+        </configuration>
+      </plugin>
+    </plugins>
+  </build>
+</project>
diff --git a/nosql-create-index-java/sample.env b/nosql-create-index-java/sample.env
new file mode 100644
index 0000000..733365c
--- /dev/null
+++ b/nosql-create-index-java/sample.env
@@ -0,0 +1,16 @@
+# Azure Cosmos DB for NoSQL
+AZURE_COSMOSDB_ENDPOINT="https://.documents.azure.com:443/"
+AZURE_COSMOSDB_DATABASENAME="Hotels"
+AZURE_COSMOSDB_CONTAINER_NAME=""
+
+# Azure OpenAI embedding configuration
+AZURE_OPENAI_EMBEDDING_ENDPOINT="https://.openai.azure.com/"
+AZURE_OPENAI_EMBEDDING_DEPLOYMENT="text-embedding-3-small"
+AZURE_OPENAI_EMBEDDING_API_VERSION="2024-08-01-preview"
+
+# Vector query selection
+# Set diskann or quantizedflat. Leave empty to run both containers.
+VECTOR_ALGORITHM=""
+
+# Shared repo-root dataset, referenced relative to this sample folder
+DATA_FILE_WITH_VECTORS="..\data\HotelsData_toCosmosDB_Vector.json"
diff --git a/nosql-create-index-java/src/main/java/com/azure/cosmos/createindex/App.java b/nosql-create-index-java/src/main/java/com/azure/cosmos/createindex/App.java
new file mode 100644
index 0000000..a132409
--- /dev/null
+++ b/nosql-create-index-java/src/main/java/com/azure/cosmos/createindex/App.java
@@ -0,0 +1,64 @@
+package com.azure.cosmos.createindex;
+
+import com.azure.cosmos.CosmosException;
+import com.azure.identity.DefaultAzureCredentialBuilder;
+
+import java.util.List;
+import java.util.Map;
+
+public final class App {
+    private App() {
+    }
+
+    public static void main(String[] args) {
+        try {
+            run();
+        } catch (CosmosException exception) {
+            System.err.println("\nError: Cosmos DB data-plane request failed. Verify that azd up created the database and containers and that your identity has the required Microsoft Entra ID roles. Original error: "
+                + exception.getMessage());
+            System.exit(1);
+        } catch (Exception exception) {
+            System.err.println("\nError: " + exception.getMessage());
+            System.exit(1);
+        }
+    }
+
+    private static void run() throws Exception {
+        SampleConfig config = Config.load();
+        Config.validate(config);
+
+        System.out.println("========================================================================");
+        System.out.println("Azure Cosmos DB for NoSQL - create and query vector indexes with Java");
+        System.out.println("========================================================================");
+        System.out.println("Database: " + config.databaseName());
+        System.out.println("Data file: " + config.dataFileWithVectors());
+        System.out.println("Target containers: " + String.join(", ", Config.targetContainers(config)));
+
+        var credential = new DefaultAzureCredentialBuilder().build();
+        try (var cosmosClient = DataPlane.createCosmosClient(credential, config)) {
+            var openAiClient = DataPlane.createAzureOpenAIClient(credential, config);
+            var database = cosmosClient.getDatabase(config.databaseName());
+
+            DataPlane.verifyEmbeddingDimensions(openAiClient, config);
+            List<Map<String, Object>> documents = DataPlane.readDocuments(config);
+
+            for (String containerName : Config.targetContainers(config)) {
+                var container = database.getContainer(containerName);
+                DataPlane.ingestDocuments(container, containerName, documents);
+            }
+
+            List<Float> queryEmbedding = DataPlane.generateEmbedding(openAiClient, config, config.queryText());
+            System.out.println("\nQuery text: " + config.queryText());
+
+            for (String containerName : Config.targetContainers(config)) {
+                var container = database.getContainer(containerName);
+                QuerySummary summary = DataPlane.queryTopMatches(container, containerName, config, queryEmbedding);
+                DataPlane.printQuerySummary(summary);
+            }
+        }
+
+        System.out.println("\n========================================================================");
+        System.out.println("Complete");
+        System.out.println("========================================================================");
+    }
+}
diff --git a/nosql-create-index-java/src/main/java/com/azure/cosmos/createindex/Config.java b/nosql-create-index-java/src/main/java/com/azure/cosmos/createindex/Config.java
new file mode 100644
index 0000000..b9e8717
--- /dev/null
+++ b/nosql-create-index-java/src/main/java/com/azure/cosmos/createindex/Config.java
@@ -0,0 +1,140 @@
+package com.azure.cosmos.createindex;
+
+import io.github.cdimascio.dotenv.Dotenv;
+
+import java.nio.file.Path;
+import java.util.List;
+import java.util.Map;
+
+public final class Config {
+    private static final Map<String, String> KNOWN_CONTAINERS = Map.of(
+        "diskann", "hotels_diskann",
+        "quantizedflat", "hotels_quantizedflat"
+    );
+    private static final List<String> TARGET_CONTAINERS = List.of("hotels_diskann", "hotels_quantizedflat");
+
+    private static final String DEFAULT_DATABASE_NAME = "Hotels";
+    private static final String DEFAULT_API_VERSION = "2024-08-01-preview";
+    private static final String DEFAULT_QUERY_TEXT = "hotel near the ocean";
+    private static final String DEFAULT_DATA_FILE = "..\\data\\HotelsData_toCosmosDB_Vector.json";
+    private static final String DEFAULT_EMBEDDING_FIELD = "DescriptionVector";
+    private static final int DEFAULT_TOP_COUNT = 5;
+    private static final int EXPECTED_DIMENSIONS = 1536;
+
+    private Config() {
+    }
+
+    public static SampleConfig load() {
+        Dotenv dotenv = Dotenv.configure().ignoreIfMissing().load();
+
+        String databaseName = read(dotenv, "AZURE_COSMOSDB_DATABASENAME", DEFAULT_DATABASE_NAME);
+        String containerName = read(dotenv, "AZURE_COSMOSDB_CONTAINER_NAME", null);
+        String vectorAlgorithm = read(dotenv, "VECTOR_ALGORITHM", null);
+        if (vectorAlgorithm != null) {
+            vectorAlgorithm = vectorAlgorithm.toLowerCase();
+        }
+
+        Path sampleRoot = Path.of("").toAbsolutePath().normalize();
+        Path dataFile = sampleRoot.resolve(read(dotenv, "DATA_FILE_WITH_VECTORS", DEFAULT_DATA_FILE)).normalize();
+
+        return new SampleConfig(
+            read(dotenv, "AZURE_COSMOSDB_ENDPOINT", null),
+            databaseName,
+            containerName,
+            read(dotenv, "AZURE_OPENAI_EMBEDDING_ENDPOINT", null),
+            read(dotenv, "AZURE_OPENAI_EMBEDDING_DEPLOYMENT", null),
+            read(dotenv, "AZURE_OPENAI_EMBEDDING_API_VERSION", DEFAULT_API_VERSION),
+            vectorAlgorithm,
+            dataFile,
+            DEFAULT_QUERY_TEXT,
+            DEFAULT_EMBEDDING_FIELD,
+            DEFAULT_TOP_COUNT,
+            EXPECTED_DIMENSIONS
+        );
+    }
+
+    public static void validate(SampleConfig config) {
+        StringBuilder missing = new StringBuilder();
+        appendMissing(missing, "AZURE_COSMOSDB_ENDPOINT", config.cosmosEndpoint());
+        appendMissing(missing, "AZURE_COSMOSDB_DATABASENAME", config.databaseName());
+        appendMissing(missing, "AZURE_OPENAI_EMBEDDING_ENDPOINT", config.openAiEmbeddingEndpoint());
+        appendMissing(missing, "AZURE_OPENAI_EMBEDDING_DEPLOYMENT", config.openAiEmbeddingDeployment());
+
+        if (missing.length() > 0) {
+            throw new IllegalArgumentException("Missing required environment variables: " + missing);
+        }
+
+        if (config.vectorAlgorithm() != null && !KNOWN_CONTAINERS.containsKey(config.vectorAlgorithm())) {
+            throw new IllegalArgumentException("VECTOR_ALGORITHM must be one of: diskann, quantizedflat.");
+        }
+
+        if (config.containerName() != null && !TARGET_CONTAINERS.contains(config.containerName())) {
+            throw new IllegalArgumentException("AZURE_COSMOSDB_CONTAINER_NAME must be one of: hotels_diskann, hotels_quantizedflat.");
+        }
+
+        if (config.containerName() != null && config.vectorAlgorithm() != null) {
+            String expectedContainer = KNOWN_CONTAINERS.get(config.vectorAlgorithm());
+            if (!config.containerName().equals(expectedContainer)) {
+                throw new IllegalArgumentException(
+                    "AZURE_COSMOSDB_CONTAINER_NAME and VECTOR_ALGORITHM refer to different containers.");
+            }
+        }
+
+        if (!config.dataFileWithVectors().toFile().exists()) {
+            throw new IllegalArgumentException("DATA_FILE_WITH_VECTORS does not exist: " + config.dataFileWithVectors());
+        }
+    }
+
+    public static List<String> targetContainers(SampleConfig config) {
+        if (config.containerName() != null) {
+            return List.of(config.containerName());
+        }
+        if (config.vectorAlgorithm() != null) {
+            return List.of(KNOWN_CONTAINERS.get(config.vectorAlgorithm()));
+        }
+        return TARGET_CONTAINERS;
+    }
+
+    public static String algorithmLabel(String containerName) {
+        return switch (containerName) {
+            case "hotels_diskann" -> "DiskANN";
+            case "hotels_quantizedflat" -> "QuantizedFlat";
+            default -> containerName;
+        };
+    }
+
+    private static String read(Dotenv dotenv, String name, String defaultValue) {
+        String value = System.getenv(name);
+        if (value == null || value.isBlank()) {
+            value = dotenv.get(name);
+        }
+        if (value == null || value.isBlank()) {
+            return defaultValue;
+        }
+        return value.trim();
+    }
+
+    private static void appendMissing(StringBuilder missing, String name, String value) {
+        if (value == null || value.isBlank()) {
+            if (!missing.isEmpty()) {
+                missing.append(", ");
+            }
+            missing.append(name);
+        }
+    }
+}
+
+record SampleConfig(
+    String cosmosEndpoint,
+    String databaseName,
+    String containerName,
+    String openAiEmbeddingEndpoint,
+    String openAiEmbeddingDeployment,
+    String openAiEmbeddingApiVersion,
+    String vectorAlgorithm,
+    Path dataFileWithVectors,
+    String queryText,
+    String embeddingFieldName,
+    int topCount,
+    int expectedDimensions) {
+}
diff --git a/nosql-create-index-java/src/main/java/com/azure/cosmos/createindex/DataPlane.java b/nosql-create-index-java/src/main/java/com/azure/cosmos/createindex/DataPlane.java
new file mode 100644
index 0000000..8b415d0
--- /dev/null
+++ b/nosql-create-index-java/src/main/java/com/azure/cosmos/createindex/DataPlane.java
@@ -0,0 +1,229 @@
+package com.azure.cosmos.createindex;
+
+import com.azure.ai.openai.OpenAIClient;
+import com.azure.ai.openai.OpenAIClientBuilder;
+import com.azure.ai.openai.models.EmbeddingsOptions;
+import com.azure.core.credential.TokenCredential;
+import com.azure.cosmos.CosmosClient;
+import com.azure.cosmos.CosmosClientBuilder;
+import com.azure.cosmos.CosmosContainer;
+import com.azure.cosmos.models.CosmosBulkOperations;
+import com.azure.cosmos.models.CosmosItemOperation;
+import com.azure.cosmos.models.CosmosQueryRequestOptions;
+import com.azure.cosmos.models.PartitionKeyBuilder;
+import com.azure.cosmos.models.SqlParameter;
+import com.azure.cosmos.models.SqlQuerySpec;
+import com.fasterxml.jackson.core.type.TypeReference;
+import com.fasterxml.jackson.databind.ObjectMapper;
+
+import java.io.IOException;
+import java.nio.file.Files;
+import java.util.ArrayList;
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.regex.Pattern;
+
+public final class DataPlane {
+    private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
+    private static final Pattern FIELD_NAME_PATTERN = Pattern.compile("^[A-Za-z_][A-Za-z0-9_]*$");
+
+    private DataPlane() {
+    }
+
+    public static CosmosClient createCosmosClient(TokenCredential credential, SampleConfig config) {
+        return new CosmosClientBuilder()
+            .endpoint(config.cosmosEndpoint())
+            .credential(credential)
+            .contentResponseOnWriteEnabled(false)
+            .buildClient();
+    }
+
+    public static OpenAIClient createAzureOpenAIClient(TokenCredential credential, SampleConfig config) {
+        return new OpenAIClientBuilder()
+            .endpoint(config.openAiEmbeddingEndpoint())
+            .credential(credential)
+            .buildClient();
+    }
+
+    public static List<Float> generateEmbedding(OpenAIClient client, SampleConfig config, String text) {
+        EmbeddingsOptions options = new EmbeddingsOptions(List.of(text));
+        return client.getEmbeddings(config.openAiEmbeddingDeployment(), options)
+            .getData()
+            .get(0)
+            .getEmbedding();
+    }
+
+    public static void verifyEmbeddingDimensions(OpenAIClient client, SampleConfig config) {
+        System.out.println("\n=== Verify embedding dimensions ===");
+        List<Float> embedding = generateEmbedding(client, config, "dimension check");
+        int actualDimensions = embedding.size();
+
+        System.out.println("Deployment: " + config.openAiEmbeddingDeployment());
+        System.out.println("Actual: " + actualDimensions);
+        System.out.println("Expected: " + config.expectedDimensions());
+
+        if (actualDimensions != config.expectedDimensions()) {
+            throw new IllegalStateException(
+                "Embedding dimensions do not match the container definition. Expected "
+                    + config.expectedDimensions() + ", received " + actualDimensions + ".");
+        }
+    }
+
+    public static List<Map<String, Object>> readDocuments(SampleConfig config) throws IOException {
+        byte[] payload = Files.readAllBytes(config.dataFileWithVectors());
+        List<Map<String, Object>> items = OBJECT_MAPPER.readValue(
+            payload,
+            new TypeReference<List<Map<String, Object>>>() {
+            });
+
+        List<Map<String, Object>> documents = new ArrayList<>(items.size());
+        for (Map<String, Object> item : items) {
+            LinkedHashMap<String, Object> document = new LinkedHashMap<>(item);
+            Object hotelId = document.get("HotelId");
+            if (hotelId == null) {
+                throw new IllegalStateException("Each document must contain HotelId.");
+            }
+            document.put("id", String.valueOf(hotelId));
+            documents.add(document);
+        }
+        return documents;
+    }
+
+    public static IngestionSummary ingestDocuments(
+        CosmosContainer container,
+        String containerName,
+        List<Map<String, Object>> documents) {
+        System.out.println("\n=== Ingest documents: " + containerName + " ===");
+
+        List<CosmosItemOperation> operations = new ArrayList<>(documents.size());
+        for (Map<String, Object> document : documents) {
+            String hotelId = String.valueOf(document.get("HotelId"));
+            operations.add(CosmosBulkOperations.getUpsertItemOperation(
+                document,
+                new PartitionKeyBuilder().add(hotelId).build()));
+        }
+
+        int upsertedDocuments = 0;
+        int failedDocuments = 0;
+        double requestCharge = 0.0;
+
+        for (var response : container.executeBulkOperations(operations)) {
+            var itemResponse = response.getResponse();
+            if (itemResponse == null) {
+                failedDocuments++;
+                continue;
+            }
+
+            requestCharge += itemResponse.getRequestCharge();
+            int statusCode = itemResponse.getStatusCode();
+            if (statusCode >= 200 && statusCode < 300) {
+                upsertedDocuments++;
+            } else {
+                failedDocuments++;
+            }
+        }
+
+        System.out.printf("Upserted %d/%d documents using bulk operations. RU: %.2f%n",
+            upsertedDocuments,
+            documents.size(),
+            requestCharge);
+
+        return new IngestionSummary(containerName, documents.size(), upsertedDocuments, failedDocuments, requestCharge);
+    }
+
+    public static QuerySummary queryTopMatches(
+        CosmosContainer container,
+        String containerName,
+        SampleConfig config,
+        List<Float> queryEmbedding) {
+        String embeddingField = validateFieldName(config.embeddingFieldName());
+        String queryText = "SELECT TOP @topK c.HotelId, c.HotelName, c.Description, "
+            + "VectorDistance(c." + embeddingField + ", @embedding) AS SimilarityScore "
+            + "FROM c ORDER BY VectorDistance(c." + embeddingField + ", @embedding)";
+
+        SqlQuerySpec querySpec = new SqlQuerySpec(
+            queryText,
+            List.of(
+                new SqlParameter("@topK", config.topCount()),
+                new SqlParameter("@embedding", toDoubleList(queryEmbedding))));
+
+        CosmosQueryRequestOptions options = new CosmosQueryRequestOptions();
+        List<QueryResult> results = new ArrayList<>();
+        double requestCharge = 0.0;
+
+        for (var page : container.queryItems(querySpec, options, Map.class).iterableByPage()) {
+            requestCharge += page.getRequestCharge();
+            for (Object item : page.getResults()) {
+                @SuppressWarnings("unchecked")
+                Map<String, Object> result = (Map<String, Object>) item;
+                results.add(new QueryResult(
+                    String.valueOf(result.get("HotelId")),
+                    String.valueOf(result.get("HotelName")),
+                    String.valueOf(result.get("Description")),
+                    ((Number) result.get("SimilarityScore")).doubleValue()));
+            }
+        }
+
+        return new QuerySummary(containerName, requestCharge, results);
+    }
+
+    public static void printQuerySummary(QuerySummary summary) {
+        System.out.println("\n=== Query results: " + summary.containerName() + " ("
+            + Config.algorithmLabel(summary.containerName()) + ") ===");
+        System.out.printf("Request charge: %.2f RUs%n", summary.requestCharge());
+
+        int rank = 1;
+        for (QueryResult result : summary.results()) {
+            System.out.printf("%d. HotelId=%s | HotelName=%s | score=%.4f | Description=%s%n",
+                rank++,
+                result.hotelId(),
+                result.hotelName(),
+                result.score(),
+                shorten(result.description()));
+        }
+    }
+
+    private static String validateFieldName(String fieldName) {
+        if (!FIELD_NAME_PATTERN.matcher(fieldName).matches()) {
+            throw new IllegalArgumentException("Invalid embedding field name: " + fieldName);
+        }
+        return fieldName;
+    }
+
+    private static List<Double> toDoubleList(List<Float> embedding) {
+        List<Double> values = new ArrayList<>(embedding.size());
+        for (Float value : embedding) {
+            values.add(value.doubleValue());
+        }
+        return values;
+    }
+
+    private static String shorten(String value) {
+        if (value == null || value.length() <= 110) {
+            return value;
+        }
+        return value.substring(0, 107).trim() + "...";
+    }
+}
+
+record IngestionSummary(
+    String containerName,
+    int totalDocuments,
+    int upsertedDocuments,
+    int failedDocuments,
+    double requestCharge) {
+}
+
+record QueryResult(
+    String hotelId,
+    String hotelName,
+    String description,
+    double score) {
+}
+
+record QuerySummary(
+    String containerName,
+    double requestCharge,
+    List<QueryResult> results) {
+}
diff --git a/nosql-create-index-java/tests/.gitkeep b/nosql-create-index-java/tests/.gitkeep
new file mode 100644
index 0000000..e69de29

From 927bc00a7e8b4d6dfef298bec2329eb75e41b8fd Mon Sep 17 00:00:00 2001
From: diberry <41597107+diberry@users.noreply.github.com>
Date: Mon, 8 Jun 2026 15:46:05 -0700
Subject: [PATCH 2/2] docs: add Java quickstart article for create-index sample

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 .../quickstart-create-index-java.md           | 140 ++++++++++++++++++
 1 file changed, 140 insertions(+)
 create mode 100644 nosql-create-index-java/quickstart-create-index-java.md

diff --git a/nosql-create-index-java/quickstart-create-index-java.md b/nosql-create-index-java/quickstart-create-index-java.md
new file mode 100644
index 0000000..dc341e1
--- /dev/null
+++ b/nosql-create-index-java/quickstart-create-index-java.md
@@ -0,0 +1,140 @@
+---
+title: "Quickstart: Create and query vector indexes in Azure Cosmos DB for NoSQL using Java"
+description: Use Java and the Azure SDK to load pre-vectorized hotel data into existing Azure Cosmos DB for NoSQL vector containers and run similarity queries with Azure OpenAI embeddings.
+author: diberry
+ms.author: diberry
+ms.service: azure-cosmos-db
+ms.topic: quickstart
+ms.date: 2026-06-08
+---
+
+# Quickstart: Create and query vector indexes in Azure Cosmos DB for NoSQL using Java
+
+In this quickstart, you use the Java sample in `Azure-Samples/cosmos-db-vector-samples` to load pre-vectorized hotel documents into existing Azure Cosmos DB for NoSQL containers and run vector similarity queries. The sample uses `DefaultAzureCredential` for Azure Cosmos DB and the Azure OpenAI client, so you don't need API keys.
+
+The sample is data-plane only. It assumes `azd up` already created the database, the `hotels_diskann` container, and the `hotels_quantizedflat` container with vector policies and indexes.
+
+Find the sample code on GitHub: `nosql-create-index-java/` in `Azure-Samples/cosmos-db-vector-samples`.
+
+## Prerequisites
+
+- An Azure subscription. If you don't have one, create a [free account](https://azure.microsoft.com/free/).
+- An Azure Cosmos DB for NoSQL account provisioned by the sample repo's Bicep templates:
+  - Vector search enabled
+  - Serverless enabled
+  - `Hotels` database created
+  - `hotels_diskann` and `hotels_quantizedflat` containers created with `/HotelId` as the partition key path
+- Microsoft Entra ID role assignments for your identity:
+  - **Cosmos DB Built-in Data Contributor**
+  - **Cognitive Services OpenAI User**
+- An Azure OpenAI resource with a `text-embedding-3-small` deployment.
+- [Java 17 LTS](https://learn.microsoft.com/java/openjdk/download) or later
+- [Apache Maven 3.9](https://maven.apache.org/download.cgi) or later
+- [!INCLUDE [Azure CLI](~/reusable-content/azure-cli/azure-cli-prepare-your-environment-no-header.md)]
+
+## Clone the repository
+
+```bash
+git clone https://github.com/Azure-Samples/cosmos-db-vector-samples.git
+cd cosmos-db-vector-samples/nosql-create-index-java
+```
+
+## Understand what the sample does
+
+Azure Cosmos DB for NoSQL follows an infra-first pattern for vector indexes:
+
+| Layer | Tool | Responsibility |
+|---|---|---|
+| Provisioning | `azd up` + Bicep | Creates the Azure Cosmos DB account, database, containers, vector policies, and RBAC |
+| Runtime | Java sample | Loads documents, generates a query embedding, and runs `VectorDistance()` queries |
+
+The Java code does **not** create containers or indexes. Vector indexes for Azure Cosmos DB for NoSQL are provisioned when the containers are created.
+
+## Configure environment variables
+
+1. Copy the template file.
+
+    ```powershell
+    Copy-Item sample.env .env
+    ```
+
+1. Update `.env` with your Azure resource values.
+
+    ```dotenv
+    AZURE_COSMOSDB_ENDPOINT="https://.documents.azure.com:443/"
+    AZURE_COSMOSDB_DATABASENAME="Hotels"
+    AZURE_COSMOSDB_CONTAINER_NAME=""
+    AZURE_OPENAI_EMBEDDING_ENDPOINT="https://.openai.azure.com/"
+    AZURE_OPENAI_EMBEDDING_DEPLOYMENT="text-embedding-3-small"
+    AZURE_OPENAI_EMBEDDING_API_VERSION="2024-08-01-preview"
+    VECTOR_ALGORITHM=""
+    DATA_FILE_WITH_VECTORS="..\\data\\HotelsData_toCosmosDB_Vector.json"
+    ```
+
+Leave `AZURE_COSMOSDB_CONTAINER_NAME` and `VECTOR_ALGORITHM` empty to run both containers. Set `VECTOR_ALGORITHM` to `diskann` or `quantizedflat` if you want to target one algorithm.
+
+## Build and run the sample
+
+Compile the sample:
+
+```powershell
+mvn compile
+```
+
+Run it:
+
+```powershell
+mvn exec:java
+```
+
+The sample performs these steps:
+
+1. Loads configuration from the shell environment or `.env` and validates required values.
+1. Creates one `DefaultAzureCredential`, shares it between the Azure Cosmos DB and Azure OpenAI clients, and verifies that the embedding deployment returns 1536-dimension vectors.
+1. Reads `..\data\HotelsData_toCosmosDB_Vector.json`.
+1. Bulk-upserts documents into `hotels_diskann` and `hotels_quantizedflat`.
+1. Uses the Azure OpenAI client to generate a query embedding.
+1. Executes a parameterized `VectorDistance()` query against each target container and prints the top five matches.
+
+## Review the Java project structure
+
+```text
+nosql-create-index-java/
+├── .env.example
+├── output/
+│   └── sample-output.txt
+├── pom.xml
+├── README.md
+├── sample.env
+└── src/main/java/com/azure/cosmos/createindex/
+    ├── App.java
+    ├── Config.java
+    └── DataPlane.java
+```
+
+### App.java
+
+`App.java` orchestrates the sample. It loads configuration, creates the shared credential, verifies embedding dimensions, ingests the hotel dataset, and runs vector queries for each target container.
+
+### Config.java
+
+`Config.java` loads environment variables from the shell or `.env`, resolves the shared dataset path, and maps `VECTOR_ALGORITHM` values to the existing container names.
+
+### DataPlane.java
+
+`DataPlane.java` contains the Azure Cosmos DB and Azure OpenAI client factories plus the data-plane operations:
+
+- bulk upsert using `executeBulkOperations()`
+- embedding generation with `EmbeddingsOptions`
+- field-name validation before interpolating the embedding field into `VectorDistance()`
+- parameterized SQL queries for the embedding vector and `TOP` value
+
+## Expected output
+
+The sample prints embedding validation, ingestion status, and query results for each container. A representative output file is included in `output/sample-output.txt`.
+
+## Next steps
+
+- Learn more about [Azure Cosmos DB for NoSQL vector search](/azure/cosmos-db/nosql/vector-search).
+- Review the full sample repo for other languages and scenarios.
+- If you haven't provisioned the shared infrastructure yet, run `azd up` from the repo root before rerunning the Java sample.
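
The field-name validation that `DataPlane.java` applies before interpolating the embedding field into `VectorDistance()` can be tried in isolation, without any Azure dependencies. The following is a minimal sketch under stated assumptions, not the sample's code: the class name `VectorQuerySketch` and the method `buildQueryText` are hypothetical, and the block only builds the query text. In the sample, `@topK` and `@embedding` are then bound through `SqlParameter`.

```java
import java.util.regex.Pattern;

/** Hypothetical stand-alone sketch; class and method names are illustrative only. */
final class VectorQuerySketch {
    // Same identifier pattern that the patch's DataPlane.java uses for field names.
    private static final Pattern FIELD_NAME = Pattern.compile("^[A-Za-z_][A-Za-z0-9_]*$");

    private VectorQuerySketch() {
    }

    /** Rejects anything that is not a plain identifier, because the name is concatenated into SQL text. */
    static String validateFieldName(String fieldName) {
        if (fieldName == null || !FIELD_NAME.matcher(fieldName).matches()) {
            throw new IllegalArgumentException("Invalid embedding field name: " + fieldName);
        }
        return fieldName;
    }

    /** Builds the query text; @topK and @embedding stay as placeholders to be bound as parameters. */
    static String buildQueryText(String embeddingField) {
        String field = validateFieldName(embeddingField);
        return "SELECT TOP @topK c.HotelId, c.HotelName, c.Description, "
            + "VectorDistance(c." + field + ", @embedding) AS SimilarityScore "
            + "FROM c ORDER BY VectorDistance(c." + field + ", @embedding)";
    }

    public static void main(String[] args) {
        // A plain identifier passes validation and is interpolated.
        System.out.println(buildQueryText("DescriptionVector"));

        // Anything else is rejected before it can reach the query text.
        try {
            buildQueryText("DescriptionVector) --");
        } catch (IllegalArgumentException expected) {
            System.out.println("Rejected: " + expected.getMessage());
        }
    }
}
```

Only the field name needs this allow-list check, because a property path cannot be supplied as a query parameter. The vector and the `TOP` value never touch the query string.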