[Schema Testing] Modification of ingestion pipeline for testing full load. by gmechali · Pull Request #500 · datacommonsorg/import

gmechali · 2026-04-22T19:56:43Z

No description provided.

codacy-production · 2026-04-22T19:58:53Z

Not up to standards ⛔

🔴 Issues 1 critical

Alerts:
⚠ 1 issue (≤ 0 issues of at least minor severity)

Results:
1 new issue

Category Results

Security 1 critical


🟢 Metrics 20 complexity · 0 duplication

Metric Results

Complexity 20

Duplication 0


gemini-code-assist

Code Review

This pull request introduces a new schema for observation mutations and refactors the ingestion pipeline to handle nodes, edges, and observations separately. Key changes include the implementation of toNewSchemaMutations and filterNewSchemaMutations, as well as updating the pipeline to ensure edges are written only after nodes are committed. Feedback highlights a critical issue where database validation was commented out and a regression in the pipeline's return value that ignores node and edge write completions. Additionally, suggestions were made to avoid hardcoding table names and to centralize the logic for generating series DCIDs to prevent duplication.

gemini-code-assist · 2026-04-22T19:59:02Z


    LOGGER.info("Starting Spanner DDL creation...");
-    spannerClient.validateOrInitializeDatabase();
+    // spannerClient.validateOrInitializeDatabase();


Commenting out spannerClient.validateOrInitializeDatabase() disables critical database schema validation and initialization. If this was done for testing purposes, please ensure it is reverted before merging, or implement a conditional check if skipping is intended for specific environments.
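One way to read the "conditional check" suggested above is a small gate that runs validation by default and skips it only when a flag is set explicitly. The sketch below is a plain-Java illustration under that assumption; `SchemaValidationGate` and the `skipValidation` flag are hypothetical names, not part of this repository.

```java
/** Hypothetical helper: schema validation runs unless it is explicitly skipped. */
final class SchemaValidationGate {
  private SchemaValidationGate() {}

  /**
   * Runs the given validation step unless skipValidation is true.
   *
   * @return true if validation ran, false if it was skipped
   */
  static boolean validateUnlessSkipped(boolean skipValidation, Runnable validateOrInitialize) {
    if (skipValidation) {
      // Skipping is opt-in, so a default configuration still validates the schema.
      return false;
    }
    validateOrInitialize.run();
    return true;
  }
}
```

In the pipeline this might be called as `SchemaValidationGate.validateUnlessSkipped(skipFlag, spannerClient::validateOrInitializeDatabase)`, where `skipFlag` would come from a pipeline option that does not exist in the PR today.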

gemini-code-assist · 2026-04-22T19:59:02Z

+      var writtenNodes =
+          nodeMutations.apply("WriteNodesToSpanner", spannerClient.getWriteTransform());
+
+      var waitingEdges = edgeMutations.apply("EdgesWaitOnNodes", Wait.on(writtenNodes.getOutput()));
+
+      waitingEdges.apply("WriteEdgesToSpanner", spannerClient.getWriteTransform());
+
+      var writeObs =
+          observations.apply("WriteObsToSpanner", spannerClient.getWriteGroupedTransform());
+      return writeObs.getOutput();


The method now returns only the output of writeObs, so the completion of the node writes (writtenNodes) and the edge writes is no longer reflected in its result. This is a regression from the previous implementation, which merged all mutation writes. If a downstream transform depends on the output of groupByGraphOnly, it may start before nodes and edges are fully committed to Spanner. Merge the outputs of all write transforms before returning.

    var writtenNodes =
        nodeMutations.apply("WriteNodesToSpanner", spannerClient.getWriteTransform());
    var writtenEdges =
        edgeMutations
            .apply("EdgesWaitOnNodes", Wait.on(writtenNodes.getOutput()))
            .apply("WriteEdgesToSpanner", spannerClient.getWriteTransform());
    var writtenObs =
        observations.apply("WriteObsToSpanner", spannerClient.getWriteGroupedTransform());
    return PCollectionList.of(writtenNodes.getOutput())
        .and(writtenEdges.getOutput())
        .and(writtenObs.getOutput())
        .apply("MergeOutputs", Flatten.pCollections());
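The ordering concern can be illustrated outside Beam. The sketch below is only an analogy using `java.util.concurrent.CompletableFuture`, not the pipeline's actual code: edges wait on nodes, observations run independently, and the returned signal completes only once all three writes have finished. `WriteOrdering` and the log entries are hypothetical.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Plain-Java analogy (not Beam): return a signal that covers every write. */
final class WriteOrdering {
  private WriteOrdering() {}

  /** Appends to the (thread-safe) log as each simulated write finishes. */
  static CompletableFuture<Void> writeAll(List<String> log) {
    CompletableFuture<Void> nodes = CompletableFuture.runAsync(() -> log.add("nodes"));
    // Edges start only after the node write has completed, like Wait.on(...).
    CompletableFuture<Void> edges = nodes.thenRunAsync(() -> log.add("edges"));
    CompletableFuture<Void> obs = CompletableFuture.runAsync(() -> log.add("observations"));
    // Returning only obs would let callers proceed while nodes or edges are
    // still in flight; allOf plays the role of the merged (flattened) output.
    return CompletableFuture.allOf(nodes, edges, obs);
  }
}
```

A caller that joins on the returned future sees all three entries in the log, with "nodes" always ahead of "edges"; the position of "observations" is not fixed.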

gemini-code-assist · 2026-04-22T19:59:02Z

+
+      var nodeMutations =
+          graphMutations
+              .apply("FilterNodes", Filter.by(kv -> kv.getValue().getTable().equals("Node")))


Avoid hardcoding the table name "Node". Use spannerClient.getNodeTableName() to ensure the filter remains correct if the table name is customized via configuration.

Suggested change

-              .apply("FilterNodes", Filter.by(kv -> kv.getValue().getTable().equals("Node")))
+              .apply("FilterNodes", Filter.by(kv -> kv.getValue().getTable().equals(spannerClient.getNodeTableName())))

gemini-code-assist · 2026-04-22T19:59:02Z

+
+      var edgeMutations =
+          graphMutations
+              .apply("FilterEdges", Filter.by(kv -> kv.getValue().getTable().equals("Edge")))


Avoid hardcoding the table name "Edge". Use spannerClient.getEdgeTableName() to ensure the filter remains correct if the table name is customized via configuration.

Suggested change

-              .apply("FilterEdges", Filter.by(kv -> kv.getValue().getTable().equals("Edge")))
+              .apply("FilterEdges", Filter.by(kv -> kv.getValue().getTable().equals(spannerClient.getEdgeTableName())))
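The node and edge filter comments share one idea: build the predicate from the configured table name rather than a string literal. A minimal plain-Java sketch of that idea follows; `TableFilters` and `Row` are hypothetical stand-ins for the pipeline's mutation type, and the table names in the usage note are invented for illustration.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

/** Hypothetical sketch: filter rows by a configured table name, not a literal. */
final class TableFilters {
  private TableFilters() {}

  /** Minimal stand-in for a mutation that knows its target table. */
  static final class Row {
    final String table;
    final String key;

    Row(String table, String key) {
      this.table = table;
      this.key = key;
    }
  }

  /** Builds the predicate once from whatever name the configuration supplies. */
  static Predicate<Row> forTable(String configuredTableName) {
    return row -> configuredTableName.equals(row.table);
  }

  /** Returns only the rows targeting the configured table. */
  static List<Row> select(List<Row> rows, String configuredTableName) {
    return rows.stream().filter(forTable(configuredTableName)).collect(Collectors.toList());
  }
}
```

With a customized table name such as "NodeTest", `select(rows, "NodeTest")` keeps the node rows, whereas a filter hardcoded to "Node" would silently match nothing.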

gemini-code-assist · 2026-04-22T19:59:02Z

+    String seriesDcid =
+        "dc/os/"
+            + Joiner.on("_")
+                .join(
+                    obs.getVariableMeasured().replace('/', '_'),
+                    obs.getObservationAbout().replace('/', '_'),
+                    obs.getFacetId());


The logic for generating seriesDcid is duplicated from Observation.toObsGraph(). To ensure consistency and simplify maintenance, this logic should be centralized, for example by adding a public static method to the Observation class that returns the series DCID.
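A centralized helper could look roughly like the sketch below. It mirrors the string construction in the diff above but uses `String.join` instead of Guava's `Joiner` to stay self-contained. The class name `SeriesDcids`, the method name, and the assumption that the facet ID is a `String` are all hypothetical; the review suggests placing the method on the `Observation` class instead.

```java
/** Hypothetical single home for the series DCID format. */
final class SeriesDcids {
  private static final String PREFIX = "dc/os/";

  private SeriesDcids() {}

  /**
   * Builds "dc/os/<variable>_<place>_<facetId>", replacing '/' in the
   * variable and place with '_' as the code in the diff does.
   */
  static String seriesDcid(String variableMeasured, String observationAbout, String facetId) {
    return PREFIX
        + String.join(
            "_",
            variableMeasured.replace('/', '_'),
            observationAbout.replace('/', '_'),
            facetId);
  }
}
```

For example, `SeriesDcids.seriesDcid("Count_Person", "geoId/06", "12345")` returns `dc/os/Count_Person_geoId_06_12345`. Both call sites could then delegate to the same method so the format cannot drift.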

Diffs to support the new schema.

7f1e9ed

gemini-code-assist Bot reviewed Apr 22, 2026

gmechali wants to merge 1 commit into datacommonsorg:master from gmechali:fullload
