Skip to content

Camel-Milvus: create rag helpers to be used as beans#21994

Open
smongiar wants to merge 1 commit intoapache:mainfrom
smongiar:add-milvus-rag-features
Open

Camel-Milvus: create rag helpers to be used as beans#21994
smongiar wants to merge 1 commit intoapache:mainfrom
smongiar:add-milvus-rag-features

Conversation

@smongiar
Copy link
Contributor

#Description

Need to collect processors and beans for more comfortable usage in route DSL definitions. Some changes involve JBang as well to make it work with camel CLI

#Target

[ X] I checked that the commit is targeting the correct branch (Camel 4 uses the main branch)

[ X] I have run mvn clean install -DskipTests locally from root folder and I have committed all auto-generated changes.

@github-actions
Copy link
Contributor

🌟 Thank you for your contribution to the Apache Camel project! 🌟
🤖 CI automation will test this PR automatically.

🐫 Apache Camel Committers, please review the following items:

  • First-time contributors require MANUAL approval for the GitHub Actions to run
  • You can use the command /component-test (camel-)component-name1 (camel-)component-name2.. to request a test from the test bot although they are normally detected and executed by CI.
  • You can label PRs using build-all, build-dependents, skip-tests and test-dependents to fine-tune the checks executed by this PR.
  • Build and test logs are available in the summary page. Only Apache Camel committers have access to the summary.

⚠️ Be careful when sharing logs. Review their contents before sharing them publicly.

private String collectionName = "rag_collection";
private String collectionDescription = "RAG collection";
private String idFieldName = "id";
private String dimension = "768";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are int/long types Strings?

import io.milvus.response.QueryResultsWrapper;
import org.apache.camel.Exchange;

public class RAGResultExtractor {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why isn't it a processor?


FieldType vectorField = FieldType.newBuilder()
.withName(vectorFieldName)
.withDataType(DataType.FloatVector)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it be configurable?

Copy link
Contributor

@gnodet gnodet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few observations on this PR:

  1. Test changes modify what's being tested: The existing tests (MilvusCreateCollectionTest, MilvusComponentIT) previously tested the Milvus component directly using the Milvus Java SDK APIs. The refactored tests now route through the new RAG helpers, which means they're no longer testing the raw component behavior — they're testing the helpers + component together. This could mask issues in either layer. Consider keeping the original tests as-is and adding new tests for the RAG helpers separately.

  2. No input validation in processors: The RAG processors like RAGInsert and RAGUpsert split textFieldMappings on = and access pair[1] without bounds checking. A malformed mapping like "content" (missing =value) would throw an ArrayIndexOutOfBoundsException. Similar issue with Integer.parseInt(dimension) — invalid input gives an unhelpful error.

  3. dimension and limit as String fields: In RAGCreateCollection, dimension is stored as String but always parsed to int. Same for limit in RAGSearch and textFieldMaxLength. Using int/long fields directly would be more natural and catch invalid values at configuration time rather than at runtime.

  4. Unchecked cast warning: exchange.getIn().getBody(List.class) in RAGInsert/RAGUpsert/RAGSearch — the @SuppressWarnings("unchecked") suppresses the warning but getBody(List.class) could return null if the body isn't convertible. No null check before using the result.

These are minor issues. The overall approach of providing RAG helper beans is sound.

@gnodet
Copy link
Contributor

gnodet commented Mar 16, 2026

Code Review

Thanks for the contribution! The idea of providing RAG convenience beans is sound and follows the pattern established by camel-qdrant. Here's detailed feedback:

Critical: Existing tests should not be modified

The existing MilvusCreateCollectionTest and MilvusComponentIT previously validated the raw Milvus component behavior by constructing SDK objects directly. Replacing them with RAG helpers means:

  • The tests now test RAG helpers + component together, not the component in isolation
  • If a RAG helper has a bug, the test would pass/fail for the wrong reason
  • The original tests used CollectionSchemaParam with withShardsNum(2) and withEnableDynamicField(false) — the RAG helper doesn't expose these options, so the test is now covering a different code path

Recommendation: Revert the existing tests. Add separate unit tests for each RAG helper class.

Input validation and error handling

RAGInsert.java / RAGUpsert.java — The textFieldMappings parsing is fragile:

String[] pair = mapping.trim().split("=");
String variableName = pair[1].trim();  // ArrayIndexOutOfBoundsException if no '='

A malformed mapping like "content" (missing =value) throws ArrayIndexOutOfBoundsException with no useful context. Similarly, Integer.parseInt(dimension) and Long.parseLong(limit) throw unhelpful exceptions.

RAGInsert.java / RAGUpsert.java / RAGSearch.javaexchange.getIn().getBody(List.class) returns null if the body can't be converted, but the result is used without a null check, causing a NullPointerException deeper in the SDK.

Code duplication between RAGInsert and RAGUpsert

These two classes are nearly identical — the only differences are InsertParam vs UpsertParam and MilvusAction.INSERT vs MilvusAction.UPSERT. Consider extracting common logic into a shared base class or utility method.

Hardcoded values that should be configurable

  • RAGSearch hardcodes ConsistencyLevel.STRONG — this should be a configurable property
  • RAGSearch doesn't expose an offset property (the original test used .withOffset(0L))
  • RAGCreateIndex doesn't support setting an index name (the original IT test used .withIndexName("userFaceIndex"))
  • RAGCreateCollection — additional text fields always use VarChar and the same maxLength as the primary field

RAGDelete without filter may be dangerous

If filter is null (the default), the DeleteParam is built without an expression. Depending on the SDK behavior, this could be a no-op or could delete everything. Consider requiring a filter or logging a warning.

Numeric properties as Strings

dimension, textFieldMaxLength, and limit are stored as String but always parsed to int/long. Camel's property binding handles type conversion automatically, so using native types would catch invalid values earlier. If this is intentional for JBang/YAML compatibility (matching the Qdrant pattern), a brief comment would help.

Missing documentation

None of the new classes have Javadoc. For classes intended as public-facing beans, at minimum class-level Javadoc explaining purpose and usage would be helpful.

Summary

The concept is good but the PR needs some work before it's ready:

  1. Revert existing test modifications — add dedicated RAG helper tests instead
  2. Add input validation with descriptive error messages
  3. Add null checks for exchange body
  4. Make hardcoded values (consistency level, index name, offset) configurable
  5. Reduce duplication between RAGInsert/RAGUpsert
  6. Add Javadoc

@smongiar
Copy link
Contributor Author

Hi @gnodet ... thank you for your review and observations to the PR.

Anyway, let me answer first about "Test changes modify what's being tested" complain. The most general, important one.
These "RAG" classes are not implementing RAG logic. They are simple Processor implementations that can be used as bean ref in YAML DSL without defining any additional Java code.
They don't do retrieval-augmented generation,they don't call an LLM, they don't combine retrieved context with prompts, they don't do chunking or any RAG pipeline orchestration.
I agree that The "RAG" prefix is arguably a misnomer (maybe something like MilvusHelper would be less confusing), so I can change them in case, but it's the same we did for Qdrant example.
But the tests, are still testing the component in isolation.
The reason why I'm focusing on YAML DSL, is to make the AI components easily configurable so they can be well integrated in UI Camel Tools, used by customers. It something we will do for each AI component, to follow this direction.

Then, about other review observation (sorry but I need to explain on some points, there have been too much replies and this make the review a little bit confusing, so let me explain briefly some of them, for now):

Input validation and error handling:
It's a standard Camel pattern. Dozens of components across the codebase do split("=") and access [0]/[1] without bounds checking. Adding defensive validation here would make these helpers inconsistent with the rest of the project, so these properties are developer-configured, not user-supplied. Just to make an example, if a developer writes "content" instead of "content=text", they have a bug in their route definition (which is effectively the same as a startup-time configuration error). Wrapping these in custom exceptions adds complexity for no real benefit. A NumberFormatException from Integer.parseInt("abc") already tells you "abc" is not a valid integer. Wrapping it to say "dimension must be a valid integer" adds negligible value for a developer audience. Same for ArrayIndexOutOfBoundsException — the stack trace points directly at the split("=") line, making the cause obvious.
In case, I can consider a @param Javadoc note on the setter documenting the expected format (e.g., "field1=var1,field2=var2"), not runtime validation code. Tell me if you agree.

Hardcoded values:
"Make everything configurable", it's just the purpose of these helpers. Basically we are talking about reasonable defaults, that need to be changed, plus they involve optional parameters sometime.
As example, what are you saying about RAGCreateIndex, and .withIndexName("userFaceIndex"), it's supported and originally reported by the related IT code. But indexName is an optional parameter in Milvus. Omitting it just auto-generates a name, which is fine for most use cases.

Code duplication (INSERT/UPSERT):
These are two separate Processor beans meant to be independently referenced in YAML DSL. They happen to look similar today, but they map to fundamentally different Milvus operations (INSERT vs UPSERT) which have different semantics and could diverge in the future. It was so in the original implementation, why consider this as duplication of code, now, during the refactoring?

Numeric properties as Strings:
When you say "If this is intentional for JBang/YAML compatibility" ... So, yes, it is intentional, as already mentioned above. These are beans configured via YAML DSL property binding, where everything comes in as strings. Using String types is the pragmatic, consistent choice that aligns with how the existing Qdrant or Milvus RAG helpers already work. It's a deliberate pattern, not an oversight.

Revert existing test modifications — add dedicated RAG helper tests instead:
It sounds like you are treating the RAG helpers as a separate "system under test" that needs its own test suite. But as we've already established, these helpers are just convenience processors that build Milvus SDK parameter objects. They aren't a new feature with independent logic worth testing in isolation. Other than that I didn't changed those tests. Processor were just collected and refactored.
So, why this need to create RAG tests, since RAG itself is not tested as explained before?

Most of complains I suppose are due to misleading classes names. Or at least, I suppose. So I hope, this clarification, about the use of "RAG" prefix should address the review in a different context.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants