TIKA-4727: Add experimental strongly-typed protobuf response to tika-grpc #2811
Draft
nddipiazza wants to merge 1 commit into
…grpc
Adds TikaTypedResponse as an experimental alternative to the flat
map<string,string> fields in FetchAndParseReply (TIKA-4722).
Motivation (raised by Kristian Rickert, ai-pipestream):
Tika's internal metadata model is already strongly typed — booleans,
integers, timestamps, and repeated values are all serialised to strings
in the current gRPC schema. That forces callers to parse them back,
wastes CPU on both sides, and makes cross-language consumption error-
prone ("true"/"false" vs. bool, ISO-8601 strings vs. Timestamp, etc.).
Design:
- New proto file tika_typed_response.proto adds TikaTypedResponse with:
  - TikaTextContent: plain-text body + summary fields
  - DublinCoreMetadata: strongly typed dc:/dcterms: fields
  - oneof document_metadata: PdfTypedMetadata, OfficeTypedMetadata,
    ImageTypedMetadata, EmailTypedMetadata, MediaTypedMetadata,
    GenericTypedMetadata (selected by Content-Type)
  - TikaTypedParseStatus: parse lifecycle info
  - repeated TikaEmbeddedDocument: embedded doc references
  - map<string,string> overflow_fields: fields not covered above
- FetchAndParseReply gains optional field 5 (typed_response).
Field is always populated alongside the existing fields map — no
breaking change for existing clients.
- TikaTypedMetadataMapper maps from List<Metadata> → TikaTypedResponse.
- TikaGrpcServerImpl wires the mapper in fetchAndParseImpl().
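The Content-Type dispatch and the overflow behaviour described in the list above could be sketched roughly as follows. This is a minimal illustration, not the code in TikaTypedMetadataMapper: the class name, the Branch enum, and the specific content-type rules are all assumptions made for the example, and plain Java maps stand in for the generated protobuf types.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class TypedDispatchSketch {

    /** Stand-ins for the oneof document_metadata alternatives (illustrative names). */
    public enum Branch { PDF, OFFICE, IMAGE, EMAIL, MEDIA, GENERIC }

    /**
     * Picks a branch from a Content-Type value.
     * The matching rules below are assumptions, not the PR's actual rules.
     */
    public static Branch selectBranch(String contentType) {
        if (contentType == null) {
            return Branch.GENERIC;
        }
        String ct = contentType.toLowerCase(Locale.ROOT);
        int semi = ct.indexOf(';');          // drop parameters such as "; charset=UTF-8"
        if (semi >= 0) {
            ct = ct.substring(0, semi);
        }
        ct = ct.trim();
        if (ct.equals("application/pdf")) {
            return Branch.PDF;
        }
        if (ct.equals("application/msword")
                || ct.startsWith("application/vnd.openxmlformats-officedocument.")) {
            return Branch.OFFICE;
        }
        if (ct.startsWith("image/")) {
            return Branch.IMAGE;
        }
        if (ct.equals("message/rfc822")) {
            return Branch.EMAIL;
        }
        if (ct.startsWith("audio/") || ct.startsWith("video/")) {
            return Branch.MEDIA;
        }
        return Branch.GENERIC;
    }

    /** Returns every entry whose key was not consumed by a typed branch. */
    public static Map<String, String> overflow(Map<String, String> all, Set<String> handledKeys) {
        Map<String, String> rest = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : all.entrySet()) {
            if (!handledKeys.contains(e.getKey())) {
                rest.put(e.getKey(), e.getValue());
            }
        }
        return rest;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("Content-Type", "application/pdf");
        metadata.put("custom:key", "kept as a string");   // hypothetical unmapped key

        System.out.println(selectBranch(metadata.get("Content-Type")));
        System.out.println(overflow(metadata, Set.of("Content-Type")));
    }
}
```

Keeping the overflow step separate from branch selection means a key that no branch understands is still returned to the caller, which is the "never lose data" property the design relies on.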
The typed schema is marked experimental. Feedback on field coverage,
naming conventions, and the oneof approach is welcome on the JIRA ticket.
Credit: Kristian Rickert's ai-pipestream/pipestream-protos served as
the reference design for the typed field mapping.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
Adds TikaTypedResponse as an experimental, opt-in alternative to the flat map<string,string> fields in FetchAndParseReply. The existing fields map is unchanged, so there is no breaking change for existing clients.
Sparked by a conversation with Kristian Rickert (@krickert), who built a comprehensive typed proto schema for Tika metadata at ai-pipestream/pipestream-protos and opened the discussion about whether tika-grpc should expose the same.
Motivation
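Tika's internal metadata model is already strongly typed, but the gRPC layer currently serialises everything to strings: booleans, integers, timestamps, and repeated values alike.
That forces callers to parse the values back, wastes CPU on both sides, and makes cross-language consumption error-prone ("true"/"false" vs. bool, ISO-8601 strings vs. Timestamp, etc.).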
As Kristian pointed out, this essentially gives you the same overhead as JSON — with none of the type safety benefits of protobuf.
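To make the cost concrete, here is a minimal sketch of what a Java client has to do with a flat string map today. The keys and values are hypothetical and chosen only for illustration; they are not taken from the tika-grpc schema or from real Tika output.

```java
import java.time.Instant;
import java.util.Map;

public class FlatFieldsExample {

    // Hypothetical keys and values standing in for the flat fields map.
    public static final Map<String, String> FIELDS = Map.of(
            "encrypted", "false",
            "page_count", "12",
            "created", "2024-01-15T10:30:00Z");

    /** Boolean.parseBoolean silently returns false for null or unexpected text. */
    public static boolean isEncrypted(Map<String, String> fields) {
        return Boolean.parseBoolean(fields.get("encrypted"));
    }

    /** Integer.parseInt throws NumberFormatException if the value is missing or malformed. */
    public static int pageCount(Map<String, String> fields) {
        return Integer.parseInt(fields.get("page_count"));
    }

    /** Instant.parse only accepts ISO-8601 instants; other date formats throw. */
    public static Instant created(Map<String, String> fields) {
        return Instant.parse(fields.get("created"));
    }

    public static void main(String[] args) {
        System.out.println(isEncrypted(FIELDS));
        System.out.println(pageCount(FIELDS));
        System.out.println(created(FIELDS));
    }
}
```

Every client language has to repeat this parsing and decide for itself how to handle missing or malformed values, which is the overhead a typed message is meant to remove.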
Changes
- tika-grpc/src/main/proto/tika_typed_response.proto: TikaTypedResponse, DublinCoreMetadata, PdfTypedMetadata, OfficeTypedMetadata, ImageTypedMetadata, EmailTypedMetadata, MediaTypedMetadata, GenericTypedMetadata, TikaTypedParseStatus, TikaEmbeddedDocument
- tika-grpc/src/main/proto/tika.proto: adds TikaTypedResponse typed_response = 5 to FetchAndParseReply
- tika-grpc/src/main/java/.../TikaTypedMetadataMapper.java: List<Metadata> → TikaTypedResponse; dispatch by Content-Type
- tika-grpc/src/main/java/.../TikaGrpcServerImpl.java: wires the mapper into fetchAndParseImpl()
Design
The oneof document_metadata branch is selected by the Content-Type of the primary metadata entry. Any metadata key not handled by the typed branch lands in overflow_fields, so callers never lose data.
Review Focus Areas
- Is the oneof document_metadata approach the right shape, or should we use a different extension strategy?
- TikaTypedMetadataMapper; spot-check against actual Metadata output for your document types
- Should typed_response be gated on a request flag rather than always-on?
Critical Files
- tika-grpc/src/main/proto/tika_typed_response.proto
- tika-grpc/src/main/java/org/apache/tika/pipes/grpc/TikaTypedMetadataMapper.java
Testing Instructions
- FetchAndParse on a PDF: reply.typed_response.pdf should contain typed fields
- reply.typed_response.office should have word_count, page_count, etc.
- reply.fields still contains the full flat map (no regression)
- tika-e2e-tests cover the base behaviour; no typed-response-specific tests yet (intentional for this draft)
Review Checklist
- fields map population
- TikaTypedMetadataMapper handles null / missing metadata gracefully
Potential Concerns
- The overflow_fields map ensures no data loss in the meantime.
Credit: Kristian Rickert, whose pipestream-protos served as the reference design.