Enhancing VectorStoreWriter for better RAG support#7396
Enhancing VectorStoreWriter for better RAG support#7396Copilot wants to merge 14 commits intodata-ingestion-preview2from
Conversation
…tedChunkRecord base type Co-authored-by: adamsitnik <6011991+adamsitnik@users.noreply.github.com>
…toreWriter refactoring Co-authored-by: adamsitnik <6011991+adamsitnik@users.noreply.github.com>
…ion property names Co-authored-by: adamsitnik <6011991+adamsitnik@users.noreply.github.com>
Co-authored-by: adamsitnik <6011991+adamsitnik@users.noreply.github.com>
adamsitnik
left a comment
There was a problem hiding this comment.
@copilot address my feedback
src/Libraries/Microsoft.Extensions.DataIngestion/Writers/IngestedChunkRecord.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.DataIngestion/Writers/IngestedChunkRecord.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.DataIngestion/Writers/IngestedChunkRecord.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.DataIngestion/Writers/IngestedChunkRecord.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.DataIngestion/Writers/VectorStoreWriter.cs
Outdated
Show resolved
Hide resolved
...sions.AI.Templates/templates/AIChatWeb-CSharp/AIChatWeb-CSharp.Web/Services/IngestedChunk.cs
Outdated
Show resolved
Hide resolved
…al, throw in SetMetadata, make consts protected, remove JSON attributes Co-authored-by: adamsitnik <6011991+adamsitnik@users.noreply.github.com>
adamsitnik
left a comment
There was a problem hiding this comment.
@copilot please address my feedback
src/Libraries/Microsoft.Extensions.DataIngestion/Writers/IngestedChunkRecord.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.DataIngestion/Writers/VectorStoreWriterOptions.cs
Outdated
Show resolved
Hide resolved
test/Libraries/Microsoft.Extensions.DataIngestion.Tests/Writers/TestChunkRecordWithMetadata.cs
Outdated
Show resolved
Hide resolved
test/Libraries/Microsoft.Extensions.DataIngestion.Tests/Writers/VectorStoreWriterTests.cs
Outdated
Show resolved
Hide resolved
…oc reference, remove JsonPropertyName, add custom schema test Co-authored-by: adamsitnik <6011991+adamsitnik@users.noreply.github.com>
adamsitnik
left a comment
There was a problem hiding this comment.
@copilot address my feedback
src/Libraries/Microsoft.Extensions.DataIngestion/Writers/IngestedChunkRecord.cs
Outdated
Show resolved
Hide resolved
test/Libraries/Microsoft.Extensions.DataIngestion.Tests/Writers/VectorStoreWriterTests.cs
Outdated
Show resolved
Hide resolved
test/Libraries/Microsoft.Extensions.DataIngestion.Tests/Writers/VectorStoreWriterTests.cs
Outdated
Show resolved
Hide resolved
… custom storage names in test Co-authored-by: adamsitnik <6011991+adamsitnik@users.noreply.github.com>
adamsitnik
left a comment
There was a problem hiding this comment.
@copilot please address my feedback
src/Libraries/Microsoft.Extensions.DataIngestion/Writers/IngestedChunkRecord.cs
Outdated
Show resolved
Hide resolved
… for consistency Co-authored-by: adamsitnik <6011991+adamsitnik@users.noreply.github.com>
src/Libraries/Microsoft.Extensions.DataIngestion/Writers/IngestedChunkRecord.cs
Outdated
Show resolved
Hide resolved
Co-authored-by: Adam Sitnik <adam.sitnik@gmail.com>
adamsitnik
left a comment
There was a problem hiding this comment.
@copilot the CI has failed with following error:
test\Libraries\Microsoft.Extensions.DataIngestion.Tests\Writers\TestChunkRecordWithMetadata.cs(4,1): error S1128: (NETCORE_ENGINEERING_TELEMETRY=Build) Remove this unnecessary 'using'. (https://rules.sonarsource.com/csharp/RSPEC-1128
Perform full build and fix all the warnings and errors, run the tests locally before pushing changes
Co-authored-by: adamsitnik <6011991+adamsitnik@users.noreply.github.com>
This comment was marked as outdated.
This comment was marked as outdated.
…ot IngestedChunk.cs files Co-authored-by: adamsitnik <6011991+adamsitnik@users.noreply.github.com>
... Removed stale |
adamsitnik
left a comment
There was a problem hiding this comment.
LGTM, @roji the PR is ready for review.
| /// </summary> | ||
| /// <typeparam name="TChunk">The type of the chunk content.</typeparam> | ||
| /// <remarks> | ||
| /// When the vector dimension count is not known at compile time, use the <see cref="CreateCollectionDefinition"/> |
There was a problem hiding this comment.
To other reviewers: this is very important. My idea was following:
- introduce a non-sealed
IngestedChunkRecord<TChunk>type that comes with all the default properties. It gives us the ability to perform the query to get chunks that point to the same document but also allows the users for an easy RAG (they don't need to provide their own type) - those who don't need any custom schema, can just use this type and call
CreateCollectionDefinitionto create the definition (it's required, because they need so somehow provide thedimensionCount). - those who need a custom schema, can create it and pass to
VectoreStoreCollectionctor - the other way to customize the schema is to derive from
IngestedChunkRecord<TChunk>and override selected properties.
There was a problem hiding this comment.
OK... It has been a while since we discussed all this so I may have forgotten some considerations we talked about.
So If I understand correctly, this moves away from the previous dynamic approach (Dictionary<string, object?>) to a typed approach. Some comments:
- Importantly, typed mapping with MEVD is (currently) not trimming (and therefore NativeAOT)-compatible... Serializing/deserializing a dictionary is easy enough, but doing it with a .NET type requires a source generator which we don't yet have. I think that's going to be a problem.
- I admit it took me quite a while to understand how custom metadata is supposed to work (with SetMetadata()) and why; having a partly typed, partly dynamic story seems to introduce quite a bit of complexity/weirdness. Here's my understanding:
- A user that wants custom metadata must extend IngestedChunkRecord (this already feels a bit heavy compared to just having a
Dictionary<string, object?>as before). - On their CustomIngestedChunkRecord, they add .NET properties for the extra metadata.
- But they must also override SetMetadata(), to copy the dynamic metadata properties from the incoming IngestionChunk to the strongly-typed .NET properties on CustomIngestedChunkRecord. That's some various boilerplate-y, tedious glue between the dynamic nature of IngestionChunk and the static nature of IngestedChunkRecord.
- (aside from all the above, they must also call CreateCollectionDefinition() to get the VectorStoreCollectionDefinition, and mutate that to add their custom properties. But that's unrelated)
- A user that wants custom metadata must extend IngestedChunkRecord (this already feels a bit heavy compared to just having a
- Note that their code in SetMetadata() can get out of sync... Like I can see someone adding a new static property on CustomIngestedChunkRecord, but then forgetting to update SetMetadata() (and then we silently write empty properties).
* In other words, users need to keep (1) the incoming IngestionChunk's metadata (wherever it's populated), (2) CustomIngestedChunkRecord's .NET properties, (3) SetMetadata() and (4) the record definition in sync, which seems quite brittle... With the previous, fully dynamic model only (1) and (4) needed to be kept in sync (which is the absolute minimum) - Because of all this, I'm trying to understand the value we get out of this partly static/partly dynamic design, compared to simply continuing to map
Dictionary<string, object?>as before, and whether it's worth it...
src/Libraries/Microsoft.Extensions.DataIngestion/Writers/IngestedChunkRecord.cs
Outdated
Show resolved
Hide resolved
| /// <summary> | ||
| /// Gets or sets the name of the collection. When not provided, "chunks" will be used. | ||
| /// </summary> | ||
| public string CollectionName |
There was a problem hiding this comment.
To other reviewers: These settings are no longer configured by the writer, they are part of the collection creation process.
| [VectorStoreVector(VectorDimensions, DistanceFunction = VectorDistanceFunction, StorageName = "embedding")] | ||
| [JsonPropertyName("embedding")] | ||
| public string? Vector => Text; | ||
| [VectorStoreVector(VectorDimensions, DistanceFunction = VectorDistanceFunction, StorageName = EmbeddingStorageName)] |
There was a problem hiding this comment.
To other reviewers: this is the simplest way of configuring the dimension count (overriding virtual property and annotating it with the right attribute)
test/Libraries/Microsoft.Extensions.DataIngestion.Tests/Writers/TestChunkRecordWithMetadata.cs
Outdated
Show resolved
Hide resolved
|
|
||
| // User creates their own definition without using CreateCollectionDefinition, | ||
| // using custom storage names to prove they can map to a pre-existing collection schema. | ||
| VectorStoreCollectionDefinition definition = new() |
There was a problem hiding this comment.
To other reviewers: this is an example of providing custom schema (without the need to provide a dedicated mapper)
| /// When the vector dimension count is known at compile time, derive from this class and add | ||
| /// the <see cref="VectorStoreVectorAttribute"/> to the <see cref="Embedding"/> property. | ||
| /// </remarks> | ||
| public class IngestedChunkRecord<TChunk> |
There was a problem hiding this comment.
To other reviewers: I am going to work on removing this generic argument very soon (I want IngestionChunk to be able to represent any input without using generic argument). But it's out of the scope of this PR.
There was a problem hiding this comment.
I know you're going to remove the generic argument, but FYI the TChunk name is causing me a bit of confusion, also in IngestionChunk<TChunk> (as if it's a chunk over itself). When reading this code I wasn't sure if with IngestedChunkRecord<TChunk>, TChunk should be string or IngestionChunk<string>.
So maybe consider renaming TChunk to just T (or TContent) everywhere.
There was a problem hiding this comment.
Yes, I really want to remove it and now I think I even know how ( #7404)
Please keep in mind it's Preview2 branch, so whatever gets merged does not automatically gets released to nuget.org. So I would prefer to keep TChunk here and just remove it completely in next PR.
roji
left a comment
There was a problem hiding this comment.
Hey @adamsitnik here are some first thoughts/questions about the new design... We should probably get past these before me reviewing the rest of the PR in detail. Feels like maybe we should jump into a call to discuss this stuff.
| /// When the vector dimension count is known at compile time, derive from this class and add | ||
| /// the <see cref="VectorStoreVectorAttribute"/> to the <see cref="Embedding"/> property. | ||
| /// </remarks> | ||
| public class IngestedChunkRecord<TChunk> |
There was a problem hiding this comment.
I know you're going to remove the generic argument, but FYI the TChunk name is causing me a bit of confusion, also in IngestionChunk<TChunk> (as if it's a chunk over itself). When reading this code I wasn't sure if with IngestedChunkRecord<TChunk>, TChunk should be string or IngestionChunk<string>.
So maybe consider renaming TChunk to just T (or TContent) everywhere.
src/Libraries/Microsoft.Extensions.DataIngestion/Writers/IngestedChunkRecord.cs
Outdated
Show resolved
Hide resolved
| /// </summary> | ||
| /// <typeparam name="TChunk">The type of the chunk content.</typeparam> | ||
| /// <remarks> | ||
| /// When the vector dimension count is not known at compile time, use the <see cref="CreateCollectionDefinition"/> |
There was a problem hiding this comment.
OK... It has been a while since we discussed all this so I may have forgotten some considerations we talked about.
So If I understand correctly, this moves away from the previous dynamic approach (Dictionary<string, object?>) to a typed approach. Some comments:
- Importantly, typed mapping with MEVD is (currently) not trimming (and therefore NativeAOT)-compatible... Serializing/deserializing a dictionary is easy enough, but doing it with a .NET type requires a source generator which we don't yet have. I think that's going to be a problem.
- I admit it took me quite a while to understand how custom metadata is supposed to work (with SetMetadata()) and why; having a partly typed, partly dynamic story seems to introduce quite a bit of complexity/weirdness. Here's my understanding:
- A user that wants custom metadata must extend IngestedChunkRecord (this already feels a bit heavy compared to just having a
Dictionary<string, object?>as before). - On their CustomIngestedChunkRecord, they add .NET properties for the extra metadata.
- But they must also override SetMetadata(), to copy the dynamic metadata properties from the incoming IngestionChunk to the strongly-typed .NET properties on CustomIngestedChunkRecord. That's some various boilerplate-y, tedious glue between the dynamic nature of IngestionChunk and the static nature of IngestedChunkRecord.
- (aside from all the above, they must also call CreateCollectionDefinition() to get the VectorStoreCollectionDefinition, and mutate that to add their custom properties. But that's unrelated)
- A user that wants custom metadata must extend IngestedChunkRecord (this already feels a bit heavy compared to just having a
- Note that their code in SetMetadata() can get out of sync... Like I can see someone adding a new static property on CustomIngestedChunkRecord, but then forgetting to update SetMetadata() (and then we silently write empty properties).
* In other words, users need to keep (1) the incoming IngestionChunk's metadata (wherever it's populated), (2) CustomIngestedChunkRecord's .NET properties, (3) SetMetadata() and (4) the record definition in sync, which seems quite brittle... With the previous, fully dynamic model only (1) and (4) needed to be kept in sync (which is the absolute minimum) - Because of all this, I'm trying to understand the value we get out of this partly static/partly dynamic design, compared to simply continuing to map
Dictionary<string, object?>as before, and whether it's worth it...
So far, there was only one schema (created by us), so we knew how to do the mapping. If we continue to map to And once they want to do RAG, they need to re-map the dictionary using these magic names. With strongly typed approach, they don't need to. And RAG is much easier. But at a cost of complicated metadata story (which is not very common scenario). Let's have a call later today and discuss it. |
Isn't that still the case even after this PR? The vector collection definition still needs to be tweaked by the user to include the custom metadata properties, and these must correspond 100% to what will actually be coming in on IngestionChunk, right?
Right, I can see that. The basic problem here - and I think the source of the complexity - is that we have a dynamic metadata scheme on the ingestion framework side (on IngestionChunk), and we're trying to shoehorn that into a static, typed scheme for MEVD. Maybe an alternative here is to say that the built-in VectorStoreWriter only works with the universal/built-in fields (no custom metadata), and if you want custom metadata you need to do your own VectorStoreWriter; this effectively moves the complexity (like SetMetadata()) from IngestedChunkRecord into the writer. We may be able to think of making VectorStoreWriter extensible with hooks for this (again, instead of having an extensible IngestedChunkRecord with its SetMetadata() hook). Just some ideas I haven't fully thought through yet. Let's discuss. |
adamsitnik
left a comment
There was a problem hiding this comment.
@copilot please address my feedback
src/Libraries/Microsoft.Extensions.DataIngestion/Writers/IngestedChunkRecord.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.DataIngestion/Writers/IngestedChunkRecord.cs
Outdated
Show resolved
Hide resolved
…n, move SetMetadata to non-sealed VectorStoreWriter Co-authored-by: adamsitnik <6011991+adamsitnik@users.noreply.github.com>
…n with provided dimension count
| [EmbeddingName] = chunk.Content, | ||
| [ContextName] = chunk.Context, | ||
| [DocumentIdName] = chunk.Document.Identifier, | ||
| Key = Guid.NewGuid(), |
There was a problem hiding this comment.
We discussed it offline and agreed I am going to remove it in a separate PR (#7410), as it most likely going to require a NuGet update and mirror sync.
| /// <returns>A vector store collection configured for ingested chunk records.</returns> | ||
| [RequiresDynamicCode("This API is not compatible with NativeAOT. You can implement your own IngestionChunkWriter that uses dynamic mapping via VectorStore.GetCollectionDynamic().")] | ||
| [RequiresUnreferencedCode("This API is not compatible with trimming. You can implement your own IngestionChunkWriter that uses dynamic mapping via VectorStore.GetCollectionDynamic().")] | ||
| public static VectorStoreCollection<Guid, TRecord> GetIngestionRecordCollection<TRecord, TChunk>(this VectorStore vectorStore, |
There was a problem hiding this comment.
To other reviewers: This is an alternative to exposing entire schema. We can just expose a factory method that does the right thing.
The advantages;
- one method call instead of two
- clear message what needs to happen when you need NativeAOT.
The disadvantages:
- you need to know that it exists (other docs point to this method so I hope it won't be a big problem)
- using two generic arguments. In the near future ([MEDI] Make the IngestionChunk non-generic #7404) it will be only one.
| /// Override this method in derived classes to store metadata as typed properties with | ||
| /// <see cref="VectorStoreDataAttribute"/> attributes. | ||
| /// </remarks> | ||
| protected virtual void SetMetadata(TRecord record, string key, object? value) |
There was a problem hiding this comment.
To other reviewers: So far, we were optimized for very easy ingestion. Now, the RAG is way simpler but when you need to use metadata, you need to create a derived type and handle it on your own. We throw here to avoid silent errors.
This PR addresses two issues in the data ingestion layer:
#6967 - Allow VectorStoreWriter to accept a pre-created VectorStoreCollection:
VectorStoreWriternow takes aVectorStoreCollection<Guid, TRecord>directly, allowing users to configure the collection in specific ways. TheTKeygeneric parameter was removed since all MEVD vector collections supportGuidkeys. Users can also provide their ownVectorStoreCollectionDefinitionwith custom storage names to map to pre-existing collections with different schemas.#6968 - Figure out the metadata story:
SetMetadatanow lives onVectorStoreWriter(which is now non-sealed) and throwsNotSupportedExceptionby default so that metadata schema mismatches are caught early rather than silently ignored. Users who need metadata must explicitly overrideSetMetadatain a derivedVectorStoreWriter.Changes
IngestedChunkRecord.cs: Usenameof()for property names inCreateDefaultCollectionDefinition, assign storage names viaStorageNameinitializer. RemoveTKeygeneric parameter (Key is alwaysGuid). Make storage name constants private (using*StorageNamesuffix for consistency), exceptEmbeddingStorageNamewhich isprotectedso derived types can reference it in[VectorStoreVector]attributes. Remove all JSON attributes and make properties virtual so derived types can annotate as needed. RenameCreateCollectionDefinitiontoCreateDefaultCollectionDefinitionto clarify that the returned definition has no custom metadata properties.VectorStoreWriter.cs: RemoveTKeygeneric parameter, hardcodeGuidkey type, simplify key generation toGuid.NewGuid(). Make class non-sealed and add virtualSetMetadatamethod (moved fromIngestedChunkRecord) that throwsNotSupportedExceptionby default. Users who need metadata override this method in a derivedVectorStoreWriter.VectorStoreWriterOptions.cs: Fix XML doc reference to match updated generic signatureVectorStoreWriter{TChunk, TRecord}.Template and snapshot files: Simplify
IngestedChunkto always useGuidkey (removing#ifconditionals), useEmbeddingStorageNameforStorageNamein derived types. Remove staleusing System.Text.Json.Serializationfrom all snapshot files.Test files: Remove unnecessary
JsonPropertyNameattributes andusingdirectives. AddCanWriteChunksWithCustomDefinitiontest proving users can create their ownVectorStoreCollectionDefinitionwith custom storage names to map to pre-existing collections with different schemas. AddTestVectorStoreWriterWithMetadatato demonstrate metadata handling via a derivedVectorStoreWriter. UpdateTestChunkRecordWithMetadatato no longer overrideSetMetadata(now handled by the writer).Fix unnecessary
using Systemin TestChunkRecordWithMetadata.cs (CI failure S1128)Remove stale
using System.Text.Json.Serializationfrom all 5 snapshot IngestedChunk.cs files (template snapshot test failures)Rename
CreateCollectionDefinitiontoCreateDefaultCollectionDefinitionMake
VectorStoreWriternon-sealed and moveSetMetadatafromIngestedChunkRecordtoVectorStoreWriterBuild with
-warnaserror— 0 warnings, 0 errorsRun tests — 123 passed, 11 skipped, 0 failed
📍 Connect Copilot coding agent with Jira, Azure Boards or Linear to delegate work to Copilot in one click without leaving your project management tool.