TIKA-4727: Add experimental strongly-typed protobuf response to tika-grpc#2811

Draft
nddipiazza wants to merge 1 commit into apache:main from nddipiazza:TIKA-4731-typed-grpc-response

Conversation

@nddipiazza
Contributor

Summary

Adds TikaTypedResponse as an experimental, opt-in alternative to the flat map<string,string> fields in FetchAndParseReply. The existing fields map is unchanged — no breaking change for existing clients.

Sparked by a conversation with Kristian Rickert (@krickert), who built a comprehensive typed proto schema for Tika metadata at ai-pipestream/pipestream-protos and opened the discussion about whether tika-grpc should expose the same.

Motivation

Tika's internal metadata model is already strongly typed. The gRPC layer currently serialises everything to strings:

// before
map<string, string> fields = 2;   // "pdf:encrypted" -> "true", "xmpTPg:NPages" -> "3"

That forces callers to:

  • Re-parse booleans, integers, and timestamps from strings
  • Handle repeated values that are squashed to a single string
  • Spend CPU cycles on avoidable serialisation in both directions

As Kristian pointed out, this gives you essentially the same overhead as JSON, with none of the type-safety benefits of protobuf.
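To make the caller-side cost concrete, here is a minimal sketch of what a Java client has to do with the flat map today. The class and method names are hypothetical and not part of tika-grpc; the keys pdf:encrypted and xmpTPg:NPages come from the example above, while dcterms:created and its UTC ISO-8601 format are assumptions made for illustration.

```java
import java.time.Instant;
import java.util.Map;

// Illustrative client-side helper (hypothetical, not part of tika-grpc).
// It shows the manual re-parsing the flat map<string,string> forces on callers.
final class FlatFieldsClient {

    // "pdf:encrypted" -> "true" has to be turned back into a boolean.
    // Boolean.parseBoolean returns false for null, so a missing key reads as false.
    static boolean isEncrypted(Map<String, String> fields) {
        return Boolean.parseBoolean(fields.get("pdf:encrypted"));
    }

    // "xmpTPg:NPages" -> "3" has to be turned back into an int.
    // Throws NumberFormatException if the server sent something non-numeric.
    static int pageCount(Map<String, String> fields) {
        return Integer.parseInt(fields.getOrDefault("xmpTPg:NPages", "0"));
    }

    // Assumption: the creation date is under "dcterms:created" as a UTC
    // ISO-8601 instant such as "2024-01-15T10:30:00Z". Returns null if absent.
    static Instant created(Map<String, String> fields) {
        String raw = fields.get("dcterms:created");
        return raw == null ? null : Instant.parse(raw);
    }
}
```

With a typed response, each of these helpers would collapse to a plain getter on the generated message, and the string-format assumptions would move out of every client.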

Changes

| File | Description |
| --- | --- |
| tika-grpc/src/main/proto/tika_typed_response.proto | New proto: TikaTypedResponse, DublinCoreMetadata, PdfTypedMetadata, OfficeTypedMetadata, ImageTypedMetadata, EmailTypedMetadata, MediaTypedMetadata, GenericTypedMetadata, TikaTypedParseStatus, TikaEmbeddedDocument |
| tika-grpc/src/main/proto/tika.proto | Add TikaTypedResponse typed_response = 5 to FetchAndParseReply |
| tika-grpc/src/main/java/.../TikaTypedMetadataMapper.java | Maps List<Metadata> → TikaTypedResponse; dispatch by Content-Type |
| tika-grpc/src/main/java/.../TikaGrpcServerImpl.java | Wire mapper in fetchAndParseImpl() |

Design

message FetchAndParseReply {
  string fetch_key = 1;
  map<string, string> fields = 2;          // existing — unchanged
  string status = 3;
  string error_message = 4;
  TikaTypedResponse typed_response = 5;   // new (experimental)
}

message TikaTypedResponse {
  TikaTextContent content = 1;
  DublinCoreMetadata dublin_core = 2;
  oneof document_metadata {
    PdfTypedMetadata pdf = 3;
    OfficeTypedMetadata office = 4;
    ImageTypedMetadata image = 5;
    EmailTypedMetadata email = 6;
    MediaTypedMetadata media = 7;
    GenericTypedMetadata generic = 8;
  }
  TikaTypedParseStatus parse_status = 9;
  repeated TikaEmbeddedDocument embedded_documents = 10;
  map<string, string> overflow_fields = 11;   // unmapped fields
}

The oneof document_metadata branch is selected by the Content-Type of the primary metadata entry. Any metadata key not handled by the typed branch lands in overflow_fields so callers never lose data.
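The dispatch and overflow rules described above can be sketched in plain Java as follows. This is not the code in TikaTypedMetadataMapper: the class, the enum, and the specific media-type prefixes are assumptions chosen for illustration, and the real mapper may match a different set of types.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of Content-Type dispatch plus overflow collection.
final class ContentTypeDispatchSketch {

    // One constant per branch of the oneof document_metadata.
    enum Branch { PDF, OFFICE, IMAGE, EMAIL, MEDIA, GENERIC }

    // Picks the oneof branch from a Content-Type value (prefixes are assumed).
    static Branch selectBranch(String contentType) {
        if (contentType == null) {
            return Branch.GENERIC;
        }
        // Drop parameters such as "; charset=UTF-8" before matching.
        String base = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        if (base.equals("application/pdf")) {
            return Branch.PDF;
        }
        if (base.startsWith("image/")) {
            return Branch.IMAGE;
        }
        if (base.startsWith("audio/") || base.startsWith("video/")) {
            return Branch.MEDIA;
        }
        if (base.equals("message/rfc822")) {
            return Branch.EMAIL;
        }
        if (base.equals("application/msword")
                || base.startsWith("application/vnd.ms-")
                || base.startsWith("application/vnd.openxmlformats-officedocument.")) {
            return Branch.OFFICE;
        }
        return Branch.GENERIC;
    }

    // Every key the typed branch did not consume is copied to the overflow map,
    // preserving iteration order, so no metadata is dropped.
    static Map<String, String> overflow(Map<String, String> all, Set<String> handledKeys) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : all.entrySet()) {
            if (!handledKeys.contains(e.getKey())) {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }
}
```

Falling back to GENERIC for unknown or missing types keeps the oneof always set, which is one possible policy; leaving it unset for unrecognised types would be an equally valid choice for reviewers to weigh.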

Review Focus Areas

  • Proto field coverage — are there important Tika metadata fields missing from each typed message?
  • oneof vs. separate messages — is the oneof document_metadata approach the right shape, or should we use a different extension strategy?
  • Naming conventions — field names follow Kristian's reference design; are they consistent with existing Tika naming?
  • Mapper correctness — Tika metadata key strings are in TikaTypedMetadataMapper; spot-check against actual Metadata output for your document types
  • Experimental gate — should population of typed_response be gated on a request flag rather than always-on?

Critical Files

  • tika-grpc/src/main/proto/tika_typed_response.proto
  • tika-grpc/src/main/java/org/apache/tika/pipes/grpc/TikaTypedMetadataMapper.java

Testing Instructions

  1. Start the gRPC server with any fetcher config
  2. Call FetchAndParse on a PDF — reply.typed_response.pdf should contain typed fields
  3. Call on an Office document — reply.typed_response.office should have word_count, page_count, etc.
  4. Verify reply.fields still contains the full flat map (no regression)
  5. The existing e2e tests in tika-e2e-tests cover the base behaviour; no typed-response-specific tests yet (intentional for this draft)

Review Checklist

  • Proto backwards compatibility (new optional field 5 — OK per proto3 rules)
  • No change to existing fields map population
  • TikaTypedMetadataMapper handles null / missing metadata gracefully
  • Content-Type dispatch covers common document families
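
For the null / missing metadata item, one way a reviewer might picture graceful handling is a set of parse helpers that return an empty Optional instead of throwing. This is a sketch under assumptions, not the mapper's actual code: the class name, the helper names, and the policy of leaving unparseable values unset are all hypothetical.

```java
import java.time.Instant;
import java.time.format.DateTimeParseException;
import java.util.Locale;
import java.util.Optional;

// Hypothetical null-safe conversions from raw metadata strings.
final class SafeMetadataValues {

    // Empty for null, blank, or non-numeric input; never throws.
    static Optional<Integer> parseInt(String raw) {
        if (raw == null || raw.isBlank()) {
            return Optional.empty();
        }
        try {
            return Optional.of(Integer.parseInt(raw.trim()));
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }

    // Accepts only "true" or "false" (case-insensitive); anything else is empty,
    // so a garbled value is not silently read as false.
    static Optional<Boolean> parseBool(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        String v = raw.trim().toLowerCase(Locale.ROOT);
        if (v.equals("true")) {
            return Optional.of(Boolean.TRUE);
        }
        if (v.equals("false")) {
            return Optional.of(Boolean.FALSE);
        }
        return Optional.empty();
    }

    // Assumes UTC ISO-8601 instants such as "2024-01-15T10:30:00Z".
    static Optional<Instant> parseInstant(String raw) {
        if (raw == null || raw.isBlank()) {
            return Optional.empty();
        }
        try {
            return Optional.of(Instant.parse(raw.trim()));
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }
}
```

A mapper built on helpers like these could call the typed setter only when the Optional is present and otherwise leave the raw string available to callers, for example through overflow_fields.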

Potential Concerns

  • Maintenance burden: typed fields need updating if Tika adds new metadata keys. The overflow_fields map ensures no data loss in the meantime.
  • Field count: Kristian's full schema maps ~1500 fields; this PR covers the most common families. PRs for additional type branches (HTML, archive, font, WARC, etc.) can follow.
  • Proto stability: marking as experimental allows iteration on field names/numbers before the schema is frozen.

Credit: Kristian Rickert's ai-pipestream/pipestream-protos served as the reference design.

…grpc

Adds TikaTypedResponse as an experimental alternative to the flat
map<string,string> fields in FetchAndParseReply (TIKA-4722).

Motivation (raised by Kristian Rickert, ai-pipestream):
Tika's internal metadata model is already strongly typed — booleans,
integers, timestamps, and repeated values are all serialised to strings
in the current gRPC schema.  That forces callers to parse them back,
wastes CPU on both sides, and makes cross-language consumption error-
prone ("true"/"false" vs. bool, ISO-8601 strings vs. Timestamp, etc.).

Design:
- New proto file tika_typed_response.proto adds TikaTypedResponse with:
  - TikaTextContent       — plain-text body + summary fields
  - DublinCoreMetadata    — dc:/dcterms: fields strongly typed
  - oneof document_metadata — PdfTypedMetadata, OfficeTypedMetadata,
    ImageTypedMetadata, EmailTypedMetadata, MediaTypedMetadata,
    GenericTypedMetadata (selected by Content-Type)
  - TikaTypedParseStatus  — parse lifecycle info
  - repeated TikaEmbeddedDocument — embedded doc references
  - map<string,string> overflow_fields — fields not covered above
- FetchAndParseReply gains optional field 5 (typed_response).
  Field is always populated alongside the existing fields map — no
  breaking change for existing clients.
- TikaTypedMetadataMapper maps from List<Metadata> → TikaTypedResponse.
- TikaGrpcServerImpl wires the mapper in fetchAndParseImpl().

The typed schema is marked experimental.  Feedback on field coverage,
naming conventions, and the oneof approach is welcome on the JIRA ticket.

Credit: Kristian Rickert's ai-pipestream/pipestream-protos served as
the reference design for the typed field mapping.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>