Skip to content

API, Core: Introduce foundational types for V4 manifest support#15049

Open
anoopj wants to merge 23 commits intoapache:mainfrom
anoopj:v4-foundational-types
Open

API, Core: Introduce foundational types for V4 manifest support#15049
anoopj wants to merge 23 commits intoapache:mainfrom
anoopj:v4-foundational-types

Conversation

@anoopj
Copy link
Copy Markdown
Contributor

@anoopj anoopj commented Jan 14, 2026

These types will be used for the v4 adaptive metadata tree and will be used by subsequent PRs for manifest reading/writing.

For now, we are adding these as package-private interfaces in core, and eventually we will move them into api.

Prototype PR: #14533

Changes

API:

  • Add DATA_MANIFEST and DELETE_MANIFEST to FileContent enum for V4 manifest entry types

Core:

  • TrackedFile - Unified V4 manifest entry representation (equivalent of ContentFile for V4)
  • TrackingInfo - Entry status, snapshot ID, sequence numbers, and first-row-id
  • ContentInfo - Metadata for externally stored content (deletion vectors)
  • ManifestStats - Manifest-level statistics (file/row counts, min sequence number)

DATA(0),
POSITION_DELETES(1),
EQUALITY_DELETES(2);
EQUALITY_DELETES(2),
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to make this API change so that TrackingInfo can return the contentType. This is backward compatible since this is a new enum value. If we don't want to make any API changes, I will need to change TrackingInfo.contentType() to return an int instead.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems reasonable to me. We could also introduce a different enum for v4, but that doesn't seem worth it.

Copy link
Copy Markdown
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a cursory review, still need to review in more depth when I get a chance but thank you for publishing this @anoopj !

*
* <p>Used for applying manifest deletion vectors.
*/
Long pos();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a primitive long? as There would always be some non-null ordinal position

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor Preference: Would prefer to use the full name position()

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to be consistent with the existing ContentFile.pos() API that returns a Long.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that we will always project this.

Also, this is in TrackingInfo so I don't think we need it here.

Comment on lines +173 to +175
* <p>Must be defined if location is defined.
*/
Long fileSizeInBytes();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a miss on the proposal but this would be required always no? And as a result primitive long?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. I changed the proposal (left it as a suggestion). I think we had it optional when we had inline manifest DVs as a separate record, which is not the case anymore.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with long.

@nastra nastra changed the title Introduce foundational types for V4 manifest support API, Core: Introduce foundational types for V4 manifest support Jan 15, 2026
* @return a DataFile representation
* @throws IllegalStateException if content_type is not DATA
*/
DataFile asDataFile(PartitionSpec spec);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

according to #14533 (comment) this should probably not have any arguments

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

"referenced_file",
Types.StringType.get(),
"Location of data file that a DV references if content_type is POSITION_DELETES. "
+ "Location of affiliated data manifest if content_type is DELETE_MANIFEST, or null if unaffiliated");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These descriptions should be shorter. This explains what the field is, but it doesn't need to document all idiosyncrasies.

*/
interface TrackingInfo {
/** Status of an entry in a tracked file */
enum Status {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had to copy this because it was in ManifestEntry. It's probably better to make this top-level to avoid that problem in the future.

*
* <p>Must be defined if content_type is POSITION_DELETES, must be null otherwise.
*/
ContentInfo contentInfo();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a problem: I'm debating whether to have a separate struct for this. Primarily motivated by how awkward the name "content info" is. It's nice to have an optional struct with required fields, but that is a very minor benefit.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am neutral on this. Can change this if you feel strongly.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove it as part of the new DvInfo class we introduced as part of colocation

*
* <p>TODO: Define ContentStats structure per V4 proposal.
*/
Object contentStats();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a really important part of the implementation, so we should focus on getting this ready.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack. @nastra is working on this.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ContentStats is already in core: https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/stats/ContentStats.java#L25
But in order to use it here we would have to move it into the api module

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @nastra. Incorporated. We don't need to move it to api for now because this PR is currently against core.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah got it, great. Somehow I thought we're in the api module. All good then

@rdblue
Copy link
Copy Markdown
Contributor

rdblue commented Feb 6, 2026

Looking good overall, but I think we need to get ContentStats defined and then make sure the translation to DataFile and DeleteFile is reasonable.

* @return a DataFile representation
* @throws IllegalStateException if content_type is not DATA or partition spec is not set
*/
DataFile asDataFile();
Copy link
Copy Markdown
Contributor Author

@anoopj anoopj Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may not be useful in the colocated model. We can directly turn TrackedFile into scan tasks: @rdblue let me know if you have any objections if I remove it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed it.

/**
* Content type stored in a file.
*
* <p>For V1-V3 tables: DATA, POSITION_DELETES, or EQUALITY_DELETES.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

v1 tables contained only data files.

*
* <p>For V1-V3 tables: DATA, POSITION_DELETES, or EQUALITY_DELETES.
*
* <p>For V4 tables: DATA, EQUALITY_DELETES, DATA_MANIFEST, or DELETE_MANIFEST. Note that
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* <p>For V4 tables: DATA, EQUALITY_DELETES, DATA_MANIFEST, or DELETE_MANIFEST. Note that
* <p>For v4 tables: DATA, EQUALITY_DELETES, DATA_MANIFEST, or DELETE_MANIFEST. Note that

/**
* Content type stored in a file.
*
* <p>For V1-V3 tables: DATA, POSITION_DELETES, or EQUALITY_DELETES.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* <p>For V1-V3 tables: DATA, POSITION_DELETES, or EQUALITY_DELETES.
* <p>For v1-v3 tables: DATA, POSITION_DELETES, or EQUALITY_DELETES.

@anoopj anoopj requested a review from rdblue March 6, 2026 01:21
@anoopj anoopj requested a review from nastra March 6, 2026 15:29
EQUALITY_DELETES(2);
/** Equality delete file content. Added in v2. */
EQUALITY_DELETES(2),
/** Data manifest content, referencing data files in a root manifest. Added in v4. */
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Manifests that contain data files. Added in v4? The current wording makes it seem like the manifest entries themselves reference files in the root

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, this is confusing.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixing

EQUALITY_DELETES(2),
/** Data manifest content, referencing data files in a root manifest. Added in v4. */
DATA_MANIFEST(3),
/** Delete manifest content, referencing delete files in a root manifest. Added in v4. */
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same suggestion as above, but with delete files?

* <li>0: EXISTING - file was already in the table
* <li>1: ADDED - file newly added
* <li>2: DELETED - file removed
* <li>3: REPLACED - entry replaced by a column update or DV change (v4 only)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: I'm not sure we really need the (v4 only) in the comment considering this interface is only for V4 manifest entries anyways?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to duplicate docs in so many places. Small duplication is okay, but there's a lot here, which means it is a lot to keep up to date. In that case, I prefer keeping docs minimal and deciding carefully where things should be documented.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed most of the docs.

Comment on lines +143 to +144
* bitmap. Files that have been replaced will have a corresponding EXISTING entry for the same
* location.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar in theme to the other comments, I'm not sure we really need statements like "Files that have been replaced will have a corresponding EXISTING entry for the same location" in these java doc comments. Those are true from a proposed spec perspective but I don't think that at this level of abstraction in the code it's particularly useful context.

* entries that have had data column updates or deletion vector changes, and every REPLACED entry
* will have a corresponding EXISTING entry for the same location.
*/
public enum ManifestEntryStatus {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be public for this PR? I agree with the modeling but it looks like it could be package private for now. I think that's especially important since we'll want to get others input on statuses before exposing stuff and avoiding potential deprecation churn.

/** Content type stored in a file, one of DATA, POSITION_DELETES, or EQUALITY_DELETES. */
/** Content type stored in a file. */
public enum FileContent {
/** Data file content. Stored in data manifests since v1. */
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the docs here are too specific and aren't adding much value.

Here, for example, "data file content" duplicates "file content". A more useful description is "A data file", but I don't think that is really necessary. The content of the file this is attached to is "data" -- that's already clear.

The additional symbols are also good: file content is a "data manifest", or file content is a "delete manifest".

Also, this was not stored in data manifests since v1. All files in v1 were data files, but this was not stored anywhere. This is one reason why over-specifying in docs is a problem. The other problem is that you have to remember to come back here and update it if you include requirements or assumptions. The best practice is it be very simple and not add anything you don't need to.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the docs.

* Status of an entry in a manifest file.
*
* <p>This is a top-level enum to avoid duplication across manifest entry types (v3 ManifestEntry
* and v4 TrackingInfo).
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't replace ManifestEntry.Status so it does duplicate. I guess the idea is that it won't need to be duplicated again, but we don't really need to include the rationale for including it. I'd remove this paragraph.

* <p>For v4: Only ADDED and EXISTING entries are considered live for scan planning. DELETED and
* REPLACED entries are used for change detection but are not live. REPLACED may only be used for
* entries that have had data column updates or deletion vector changes, and every REPLACED entry
* will have a corresponding EXISTING entry for the same location.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I noted above, I don't think we want to document requirements or invariants in places like this. This should be clear about what the status represents, but we don't want to need to update it if the rules change.

import org.apache.iceberg.types.Types;

/**
* Metadata about a deletion vector (DV) associated with a data file entry.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove "associated with a data file entry". This is metadata about a deletion vector. How it is used doesn't need to be constrained here.

*
* @param requestedColumnIds column IDs for which to keep stats
*/
TrackedFile copyWithStats(Set<Integer> requestedColumnIds);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: we will need to clarify that these are table field IDs, not content stats field IDs.

@RussellSpitzer RussellSpitzer self-requested a review March 30, 2026 21:27

/** A file tracked by a v4 manifest. */
interface TrackedFile {
Types.NestedField TRACKING =
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fields aren't identical to DataFile: TrackedFile uses location instead of file_path, content_type includes manifest types , and stats are captured differently.

This makes sense.

Tracking info (status, sequence numbers, snapshot IDs) applies to the entry as a whole, not just to the content file , so it naturally lives at the same level.

Do sequence numbers and snapshot IDs apply to the entry as a whole? Use sequence number as an example, there are two sequence numbers in my mind.

  • Row level sequence number which maps to the data sequence number for the base file. Equality deletes matching should use this sequence number
  • last updated sequence number for row lineage purpose. It should be the sequence number of the latest column file was added (inheritance) or persisted values in the column file.

Types.NestedField FIRST_ROW_ID =
Types.NestedField.optional(
142, "first_row_id", Types.LongType.get(), "ID of the first row in the data file");
Types.NestedField DELETED_POSITIONS =
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are the purpose of deleted and replaced positions? are they for change detection?

We talked about diff DV for manifest DV before. are these two fields for that purpose?

For data DVs (in Puffin files), I assume we won't compute and store diff DVs during write.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, these are the diff bitmaps for change detection in the tracking struct. They record what changed in the current snapshot:

  • deleted_positions: positions that were deleted (via a new DV) in this snapshot
  • replaced_positions: positions that were replaced n this snapshot

Yes, we won't compute diff DVs for data DVs (because it will be prohibitively expensive).

"file_sequence_number",
Types.LongType.get(),
"File sequence number indicating when the file was added");
Types.NestedField DV_SNAPSHOT_ID =
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should DV_SNAPSHOT_ID be part of DeletionVector schema?

If we add column files in the future, where should we track the column file snapshot id? part of column file struct or the tracking struct here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really great questions again. DV_SNAPSHOT_ID is in Tracking rather than DeletionVector because the DV struct represents the serialized content (location, offset, size, cardinality) while tracking holds lifecycle metadata (snapshot IDs, sequence numbers, status).

For column files in the future, I'd expect each column file to carry its own snapshot ID within the column file struct, similar to how it would carry its own file_sequence_number. The tracking struct tracks the lifecycle of the entry as a whole, while per-component metadata (when was this specific column file added) would live with that component's struct.

@rdblue may have a more definitive opinion on this. I am happy to align on it when column files are designed.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may write a column file for DV column update.

  • not-modified rows should carry over the DV snapshot id and sequence number in the new column file
  • modified rows should inherit the DV snapshot id and sequence number from the new snapshot

As I mentioned in my PR approval message, non of my comments are blockers. We can always revisit these decisions for non-public classes.

Comment on lines +21 to +27
/** Content type stored in a file. */
public enum FileContent {
DATA(0),
POSITION_DELETES(1),
EQUALITY_DELETES(2);
EQUALITY_DELETES(2),
DATA_MANIFEST(3),
DELETE_MANIFEST(4);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably need DELETE_MANIFEST entries in the root manifest file (even without equality deletes), since V3 to V4 upgrade doesn't require rewriting existing data and delete manifest files to colocate the DVs.

Copy link
Copy Markdown
Contributor

@stevenzwu stevenzwu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a few questions. But I am fine with merging this now and iterates later, since most code are non-public classes and interfaces.

@anoopj anoopj force-pushed the v4-foundational-types branch from 7543189 to da8185b Compare March 30, 2026 22:22
import java.nio.ByteBuffer;
import org.apache.iceberg.types.Types;

/** Summary information about a manifest referenced by a v4 root manifest entry. */
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we don't need to say "v4 here", since it will go out of date

Suggested change
/** Summary information about a manifest referenced by a v4 root manifest entry. */
/** Summary information about a manifest referenced by a root manifest entry. */

ByteBuffer keyMetadata();

/** Returns the list of recommended split locations, or null. */
List<Long> splitOffsets();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is one of the re-usable container issues we have in the parquet pr. We end up making immutable lists and then the "BaseFile" is no longer directly compatible with "re-use" containers . I think this is fine for now though, this may not be an issue anymore

import java.nio.ByteBuffer;
import org.apache.iceberg.types.Types;

/** Tracking information for a v4 manifest entry. */
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Don't say V4, if we can help it

* <p>Tracks where a DV blob can be read. The DV blob follows the format defined by the
* deletion-vector-v1 blob type in the Puffin spec.
*/
interface DeletionVector {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe TrackedDeletionVector

We already have a few DeletionVector named things and this is not actually a DV, it's the reference too one

Types.ListType.ofRequired(136, Types.IntegerType.get()),
"Field ids used to determine row equality in equality delete files");

static Types.StructType schemaWithContentStats(Types.StructType contentStatsType) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a null check on contentStatsType?

/** Returns the list of recommended split locations, or null. */
List<Long> splitOffsets();

/** Returns the set of field IDs used for equality comparison in equality delete files. */
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:nit: Returns a list which we semantically hope is a set :)

5,
"dv_snapshot_id",
Types.LongType.get(),
"Snapshot ID where the DV was added; null if there is no DV");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: not sure we need to shorten deletion vector

/**
* Metadata about a deletion vector.
*
* <p>Tracks where a DV blob can be read. The DV blob follows the format defined by the
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely don't need to abbreviate in the doc, I get paid by the character

Copy link
Copy Markdown
Member

@RussellSpitzer RussellSpitzer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a lot of nits about naming, but I'm fine with changing those as we go. The Structure looks solid to me

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants