Skip to content

V0 version of embedding ingestion core.#1964

Open
shixiao-coder wants to merge 12 commits intodatacommonsorg:masterfrom
shixiao-coder:v0-embedding-ingestion-core-logic
Open

V0 version of embedding ingestion core.#1964
shixiao-coder wants to merge 12 commits intodatacommonsorg:masterfrom
shixiao-coder:v0-embedding-ingestion-core-logic

Conversation

@shixiao-coder
Copy link
Copy Markdown
Contributor

@shixiao-coder shixiao-coder commented Apr 17, 2026

It includes:
requirements.txt to install the related packages
placeholder for main.py function: placeholder to trigger various embedding ingestion logic
embedding_utils.py including function to:

  • read the latest timestamp for the lock, if no timestamp set None
  • function to find Node IDs to update based on timestamp, nodetype, validity by none empty
  • function to convert the Node and fields to ID and embedding_content
  • function to generate embeddings from the ID and embedding content in batch

It includes:
requirements.txt to install the related packages
embedding_utils.py including function to:
 - read the latest timestamp for the lock, if no timestamp set None
 - function to find Node IDs to update based on timestamp, nodetype, validity by none empty
 - function to convert the Node and fields to ID and embedding_content
 - function to generate embeddings from the ID and embedding content in batch
@shixiao-coder shixiao-coder requested a review from gmechali April 17, 2026 16:45
@shixiao-coder shixiao-coder requested a review from clincoln8 April 17, 2026 16:46
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a Cloud Function and helper utilities for automating node embedding workflows using Google Cloud Spanner. Key features include fetching updated nodes based on ingestion locks and processing embeddings in batches via ML.PREDICT. Feedback identifies critical issues such as a missing time import and the use of unsupported INSERT OR UPDATE syntax in Spanner SQL. Other improvements include correcting a typo in the initialization action string, removing duplicate imports and redundant initializations, and optimizing result set processing.

Comment thread import-automation/workflow/embedding-helper/embedding_utils.py
Comment thread import-automation/workflow/embedding-helper/embedding_utils.py Outdated
Comment thread import-automation/workflow/embedding-helper/embedding_utils.py Outdated
Comment thread import-automation/workflow/embedding-helper/embedding_utils.py Outdated
Comment thread import-automation/workflow/embedding-helper/embedding_utils.py Outdated
Comment thread import-automation/workflow/embedding-helper/main.py Outdated
…st on a request of 250. If the batch is smaller than 250 and or not divisible by 250. It actually send more requests to Embeddings models, with each batch containing a much smaller number.

Batch from 100 -> 500 changes QPM usage from 1000 to 700

Timeout is set since each request is now containing 250 data and will run longer
Comment thread import-automation/workflow/embedding-helper/embedding_utils.py
Comment thread import-automation/workflow/embedding-helper/embedding_utils.py
Copy link
Copy Markdown

@gmechali gmechali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Xiao, this looks great, left a coupe minor comments + one on the query and maybe how to optimize it.

My main feedback is that we need tests to go in this PR too! Can you add a tests/ directory and add unit tests for embedding_utils and e2e tests for main.py?

Comment thread import-automation/workflow/embedding-helper/embedding_utils.py
Comment thread import-automation/workflow/embedding-helper/embedding_utils.py
Yields:
Dictionaries containing subject_id and name.
"""
timestamp_condition = "update_timestamp > @timestamp" if timestamp else "TRUE"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to double check, did you get approval from @keyurva and the data team to make this change to the schema to support the timestamp on the Node Table?

Comment thread import-automation/workflow/embedding-helper/embedding_utils.py
@shixiao-coder
Copy link
Copy Markdown
Contributor Author

Thanks Xiao, this looks great, left a coupe minor comments + one on the query and maybe how to optimize it.

My main feedback is that we need tests to go in this PR too! Can you add a tests/ directory and add unit tests for embedding_utils and e2e tests for main.py?

Added @gmechali

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants