Enable internal text embedding API #3441
Open
ajtiwari07 wants to merge 67 commits intoAzure:mainfrom
Open
Conversation
Co-authored-by: JerryNixon <1749983+JerryNixon@users.noreply.github.com>
Co-authored-by: JerryNixon <1749983+JerryNixon@users.noreply.github.com>
…configuration Co-authored-by: JerryNixon <1749983+JerryNixon@users.noreply.github.com>
Co-authored-by: JerryNixon <1749983+JerryNixon@users.noreply.github.com>
Co-authored-by: JerryNixon <1749983+JerryNixon@users.noreply.github.com>
Co-authored-by: JerryNixon <1749983+JerryNixon@users.noreply.github.com>
Co-authored-by: JerryNixon <1749983+JerryNixon@users.noreply.github.com>
Co-authored-by: JerryNixon <1749983+JerryNixon@users.noreply.github.com>
Co-authored-by: JerryNixon <1749983+JerryNixon@users.noreply.github.com>
… and telemetry integration Co-authored-by: JerryNixon <1749983+JerryNixon@users.noreply.github.com>
Co-authored-by: JerryNixon <1749983+JerryNixon@users.noreply.github.com>
…ate empty embeddings Co-authored-by: JerryNixon <1749983+JerryNixon@users.noreply.github.com>
…oint/health sub-objects Co-authored-by: JerryNixon <1749983+JerryNixon@users.noreply.github.com>
…orization Co-authored-by: JerryNixon <1749983+JerryNixon@users.noreply.github.com>
Co-authored-by: JerryNixon <1749983+JerryNixon@users.noreply.github.com>
…lthCheckConfig in converter
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…dding controller tests, switching the dab schema for the embedding system to default to false
embeddings endpoint is now permanently fixed to /embed with no user-configurable path option. This removes unnecessary configuration surface since the feature has not been released yet, eliminating the need for backward compatibility. Changes: - Remove path property from dab.draft.schema.json - Remove Path, UserProvidedPath, and EffectivePath from EmbeddingsEndpointOptions - Remove EffectiveEndpointPath from EmbeddingsOptions - Remove path deserialization from EmbeddingsOptionsConverterFactory - Remove --runtime.embeddings.endpoint.path CLI option - Remove path configuration logic from ConfigGenerator - Remove endpoint path validation from RuntimeConfigValidator - Update Startup.cs logging to use DEFAULT_PATH constant - Update all tests to remove path references
|
Azure Pipelines successfully started running 6 pipeline(s). |
Contributor
|
/azp run |
|
Azure Pipelines successfully started running 6 pipeline(s). |
…om/ajtiwari07/data-api-builder into add-internal-text-embedding-system
Collaborator
|
/azp run |
|
Azure Pipelines successfully started running 6 pipeline(s). |
…om/ajtiwari07/data-api-builder into add-internal-text-embedding-system
Author
|
/azp run |
|
Commenter does not have sufficient privileges for PR 3441 in repo Azure/data-api-builder |
Collaborator
|
/azp run |
|
Azure Pipelines successfully started running 6 pipeline(s). |
Aniruddh25
requested changes
May 2, 2026
Collaborator
Aniruddh25
left a comment
There was a problem hiding this comment.
Main question around need for embedding cache options. Cant we reuse runtime cache options to decide what we do for embeddings, does it need to be granular?
…om/ajtiwari07/data-api-builder into add-internal-text-embedding-system
Aniruddh25
approved these changes
May 5, 2026
Collaborator
|
/azp run |
|
Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR adds configurable text chunking capabilities to the embeddings API, enabling automatic text segmentation before embedding generation. This feature supports both single-text and multi-document batch processing with runtime configuration and query parameter overrides.
Changes
Configuration
Added EmbeddingsChunkingOptions.cs - Configuration model for chunking behavior
Enabled (bool) - Enable/disable chunking
SizeChars (int) - Chunk size in characters (default: 800)
OverlapChars (int) - Overlap between chunks (default: 100)
EffectiveSizeChars property ensures minimum valid chunk size
Modified EmbeddingsOptions.cs - Added Chunking property and IsChunkingEnabled helper
Removed EmbeddingsCacheOptions.cs - Simplified configuration by removing unused cache feature
API Enhancements
Modified Controllers/EmbeddingController.cs
Auto-detects request type (single text vs. document array)
Implements overlapping text chunking algorithm
Supports query parameter overrides: $chunking.enabled, $chunking.size-chars, $chunking.overlap-chars
Returns multiple embeddings per document when chunking is enabled
Added Models/EmbedDocumentRequest.cs - Request model for document arrays
Added Models/EmbedDocumentResponse.cs - Response model with chunked embeddings
Schema (schemas)
Modified dab.draft.schema.json - Added chunking configuration schema with validation rules
Testing (UnitTests)
Added EmbeddingsChunkingOptionsTests.cs (13 tests) - Configuration validation
Added ChunkTextTests.cs (21 tests) - Chunking algorithm validation including edge cases
Modified EmbeddingControllerTests.cs (+18 tests) - API endpoint tests for chunking and document arrays
Total Test Coverage: 72 tests (48 existing + 24 new) - All passing## Why make this change?
Testing
All 72 unit tests passing
Edge cases covered: empty text, very small chunks, overlap larger than chunk size, Unicode text
Query parameter parsing validated
Backward compatibility verified
Breaking Changes
None - This is a backward-compatible addition. Existing single-text requests continue to work without modification.