
SOLR-18187: Document enrichment with LLMs #4259

Draft
nicolo-rinaldi wants to merge 24 commits into apache:main from SeaseLtd:llm-document-enrichment

Contributor

@nicolo-rinaldi nicolo-rinaldi commented Apr 1, 2026

https://issues.apache.org/jira/browse/SOLR-18187

Description

The goal of this PR is to integrate LLMs directly into Solr at index time, so that they can populate fields that might be useful (e.g., categories or tags).

Solution

This PR adds LLM-based document enrichment capabilities to Solr's indexing pipeline via a new DocumentEnrichmentUpdateProcessorFactory in the language-models module. The processor lets users enrich documents at index time by calling an LLM (via https://github.com/langchain4j/langchain4j) with a configurable prompt built from one or more existing document fields (inputFields), and storing the model's response in an output field. The output field can be of different types (i.e., string, text, int, long, float, double, boolean, and date) and can be single-valued or multi-valued. Structured output is used so that the model's response matches the output field type.

The implementation takes inspiration from the text-to-vector feature in the same module, to stay consistent with the conventions already used in the language-models module.
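
As a rough illustration of the prompt-building step described in this section, here is a minimal, self-contained sketch. It is not the code in this PR: the `{fieldName}` placeholder syntax, the class name `PromptInjectionSketch`, and the method `injectFields` are assumptions made only for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: NOT the processor from this PR.
class PromptInjectionSketch {

  // Assumed placeholder syntax: "{fieldName}". The real syntax may differ.
  private static final Pattern PLACEHOLDER = Pattern.compile("\\{(\\w+)\\}");

  // Replaces every placeholder in a single pass, so text coming from a
  // document field is never re-scanned for further placeholders.
  static String injectFields(String template, Map<String, String> fieldValues) {
    Matcher matcher = PLACEHOLDER.matcher(template);
    StringBuffer prompt = new StringBuffer();
    while (matcher.find()) {
      String fieldName = matcher.group(1);
      String value = fieldValues.get(fieldName);
      if (value == null) {
        throw new IllegalArgumentException(
            "no value for placeholder {" + fieldName + "}");
      }
      // quoteReplacement keeps '$' and '\' in the field value literal.
      matcher.appendReplacement(prompt, Matcher.quoteReplacement(value));
    }
    matcher.appendTail(prompt);
    return prompt.toString();
  }

  public static void main(String[] args) {
    Map<String, String> doc = new LinkedHashMap<>();
    doc.put("title", "Solr in Action");
    doc.put("body", "A guide to Apache Solr.");
    System.out.println(
        injectFields("Suggest a category for: {title}. Text: {body}", doc));
  }
}
```

The single-pass replacement is a design choice for the sketch: field content that happens to contain something looking like a placeholder is inserted verbatim rather than expanded again.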

Note: this PR was developed with assistance from Claude Code (Anthropic).

Tests

Tests covering configuration validation (missing required params, conflicting params, invalid field types, placeholder mismatches), and processor initialization.

Tests covering single-valued and multi-valued output fields of all supported types, multi-input-field prompts, prompt file loading, error handling (model exceptions, ambiguous/malformed JSON responses, unsupported model types), and skipNullOrMissingFieldValues behaviour. All the supported models have been tested.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended, not available for branches on forks living under an organisation)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide
  • I have added a changelog entry for my change

@github-actions (bot) added the documentation, dependencies, tool:build, and tests labels on Apr 1, 2026
Contributor

@aruggero aruggero left a comment

Left some comments

restTestHarness.delete(ManagedChatModelStore.REST_END_POINT + "/model1");
}

private UpdateRequestProcessor createUpdateProcessor(
Contributor

Can't this always be generalised and used for all the tests? In some of them, you are now repeating this code with small changes...

Contributor Author

Done

Contributor

This is the same as createUpdateProcessor, apart from the creation of the request and getInstance().
Maybe we can leave out the Solr request + getInstance() and use that method here too, calling it something like "initializeUpdateProcessorFactory"?
What do you think?

Contributor Author

I created a function initializeUpdateProcessorFactory that is used inside createUpdateProcessor, so the code in the former can be reused.

Contributor Author

Fixed tests

Contributor

Why couldn't some tests use these new functions?
e.g. init_multipleInputFields_shouldInitAllFields

Contributor Author

I kept them independent of the model creation, just to check that the factory is initialized properly. I can look into changing this if you want.

=== Models

* A model in this module is a chat model, that answers with text given a prompt.
Contributor Author

changed

=== Models

* A model in this module is a chat model, that answers with text given a prompt.
* A model in this Solr module is a reference to an external API that runs the Large Language Model responsible for chat
Contributor Author

changed


Exactly one of the following parameters is required: `prompt` or `promptFile`.

Another important feature of this module is that one (or more) `inputField` needs to be injected in the prompt. This is
Contributor Author

Done

@nicolo-rinaldi nicolo-rinaldi requested a review from aruggero April 17, 2026 08:29
.messages(UserMessage.from(prompt))
.build();
String rawJson = chatModel.chat(chatRequest).aiMessage().text();
Object parsed = Utils.fromJSONString(rawJson);
Contributor

Is parsing an 'Object' necessary?

public SolrChatModel getModel(String modelName) {
return store.getModel(modelName);
}

Contributor

This entire class feels exactly the same as the one I implemented for embedding models.
Can't we use the same class for multiple storage solutions?
So you instantiate different endpoints, but with the same class.
It feels like a lot of duplicated code.
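
To make the idea concrete, here is a minimal sketch of a single store class parameterised by model type. It is only an illustration under assumptions: `ModelStoreSketch` is a hypothetical name, the model types are replaced by plain strings, and the real store classes in the module (REST wiring, locking, persistence) are not reproduced here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: one store implementation reused for several model kinds.
class ModelStoreSketch<M> {

  private final Map<String, M> models = new LinkedHashMap<>();

  // Rejects duplicate names, mirroring the error message quoted in this review.
  void addModel(String name, M model) {
    if (models.containsKey(name)) {
      throw new IllegalArgumentException(
          "model '" + name + "' already exists. Please use a different name");
    }
    models.put(name, model);
  }

  // Returns the model registered under the given name, or null if absent.
  M getModel(String modelName) {
    return models.get(modelName);
  }

  public static void main(String[] args) {
    // Same class, two instances: each could back a different REST endpoint.
    ModelStoreSketch<String> embeddingModels = new ModelStoreSketch<>();
    ModelStoreSketch<String> llms = new ModelStoreSketch<>();
    embeddingModels.addModel("model1", "an embedding model");
    llms.addModel("model1", "a general purpose LLM");
    System.out.println(embeddingModels.getModel("model1"));
    System.out.println(llms.getModel("model1"));
  }
}
```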

"model '" + name + "' already exists. Please use a different name");
}
}
}
Contributor

See above regarding the duplicated code.

// as for now, only a plain text as prompt is sent to the model (no support for
// tools/skills/agents)
// chatModel.chat returns the parsed value from the structured JSON response
Object value = chatModel.chat(injectedPrompt, responseFormat);
Contributor

@alessandrobenedetti alessandrobenedetti Apr 17, 2026

value? Isn't it the output? Also, does langchain4j return an 'Object'? Isn't that weak typing?

Comment thread changelog/unreleased/SOLR-18187-llm-document-enrichment.yml Outdated
Contributor

@alessandrobenedetti alessandrobenedetti left a comment

Good work, but there's a lot to discuss and change!

Reasonable first draft though!

SolrException.ErrorCode.SERVER_ERROR,
"field type is not supported by Document Enrichment: "
+ fieldType.getClass().getSimpleName());
}
Contributor

I think this part would be more readable with a Java switch-case construct.

Monitor your indexing logs to detect documents that were not enriched as expected.
====

== Chat Model Setup
Contributor

"Chat Model" is LangChain4j naming; please remove it entirely from the doc and from Solr where possible.
Furthermore, we don't offer any chat-style interaction, so it can be misleading.

Let's just use 'general purpose LLM'.

…y reference to ChatModel when is not needed.
# Conflicts:
#	gradle/libs.versions.toml
#	solr/modules/language-models/gradle.lockfile
.build();
String rawJson = chatModel.chat(chatRequest).aiMessage().text();
Object parsed = Utils.fromJSONString(rawJson);
// It makes sense to keep this due to Ollama support
Contributor

I would change the comment
"Ollama support" is not clear to me

Contributor Author

Changed


// fall-back to SolrException
default -> throw unsupportedFieldTypeException;
};
Contributor

Can't we use

    return switch (fieldType) {
      case StrField _, TextField _, DatePointField _ -> new JsonStringSchema();
      case IntPointField _, LongPointField _         -> new JsonIntegerSchema();
      case FloatPointField _, DoublePointField _     -> new JsonNumberSchema();
      case BoolField _                               -> new JsonBooleanSchema();
      default                                        -> throw unsupportedFieldTypeException;
    };

with Java 22+?
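
For comparison, a sketch of what the same switch could look like if the build has to stay on Java 21: pattern matching for switch is available there, but unnamed `_` patterns and several type patterns in one `case` label only became final in Java 22, so each type gets its own case with a named binding. The nested types and the `SchemaKind` enum below are hypothetical stand-ins so the example compiles on its own; they are not the Solr field types or the langchain4j schema classes.

```java
// Hypothetical sketch (Java 21): stand-in types, not the real Solr/langchain4j classes.
class SchemaSwitchSketch {

  interface FieldType {}

  static final class StrField implements FieldType {}

  static final class IntPointField implements FieldType {}

  static final class FloatPointField implements FieldType {}

  static final class BoolField implements FieldType {}

  static final class BinaryField implements FieldType {}

  // Stand-in for the JSON schema element chosen for each field type.
  enum SchemaKind { STRING, INTEGER, NUMBER, BOOLEAN }

  static SchemaKind schemaFor(FieldType fieldType) {
    return switch (fieldType) {
      // One type pattern per case: Java 21 does not allow combining them.
      case StrField f -> SchemaKind.STRING;
      case IntPointField f -> SchemaKind.INTEGER;
      case FloatPointField f -> SchemaKind.NUMBER;
      case BoolField f -> SchemaKind.BOOLEAN;
      // Anything else is rejected, as in the code under review.
      default -> throw new IllegalArgumentException(
          "field type is not supported: " + fieldType.getClass().getSimpleName());
    };
  }

  public static void main(String[] args) {
    System.out.println(schemaFor(new StrField()));
    System.out.println(schemaFor(new FloatPointField()));
  }
}
```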

@@ -0,0 +1,85 @@
/*
Contributor

Didn't we remove any "chatModel" reference from the contribution?

Contributor Author

Yes, but this is a mock of the actual langchain4j ChatModel class: it implements the ChatModel interface, so I thought the name should contain some reference to the langchain4j library.

<field name="output_binary" type="binary" indexed="false" stored="true" multiValued="false"/>

<!-- output fields for unsupported types -->
<field name="output_bynary" type="binary" indexed="true" stored="true" multiValued="false"/>
Contributor

Why do we have both binary and bynary?

Contributor Author

I didn't even notice this. "output_bynary" is never used. Removed.

<!-- output fields for unsupported types -->
<field name="output_bynary" type="binary" indexed="true" stored="true" multiValued="false"/>

<!-- output fields for types explicitly unsupported (without they are supported via inheritance) -->
Contributor

What does this comment mean?

Contributor Author

Changed, see if it is clear now.

}

/* buildResponseFormat tests for field types from the Solr documentation */
/* buildResponseFormat tests for unsupported field types from the Solr documentation: 1 general (Binary) + 3 via inheritance */
Contributor

What does this comment mean?

Contributor Author

Changed, see if it is clear now.

@@ -0,0 +1,48 @@
/*
Contributor

Didn't we remove any "chatModel" reference from the contribution?

Contributor Author

same as DummyChatModel

@@ -0,0 +1,48 @@
/*
Contributor

Didn't we remove any "chatModel" reference from the contribution?

Contributor Author

same as DummyChatModel

…y reference to ChatModel when is not needed.