[DCP Ingestion] Automate the DB initialization, Dockerize Ingestion Helper, and fix lock acquisition issue #1971
Open: gmechali wants to merge 7 commits into datacommonsorg:master from gmechali:dockerize
Commits (7)

- e02f44f (gmechali) feat(ingestion): automate DB initialization and fix lock mechanism
- 92af9b1 (gmechali) Remove redundancy
- 54a701c (gmechali) Merge branch 'master' into dockerize
- eb9d02e (gmechali) Adds readme
- 30160d1 (gmechali) Merge branch 'master' into dockerize
- 644f8aa (shixiao-coder) Merge branch 'master' into dockerize
- 84b2caf (gmechali) Comments
New file (42 additions):

```dockerfile
# Copyright 2026 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

FROM python:3.12-slim

# Copy uv binary
COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/

# Allow statements and log messages to immediately appear in the logs
ENV PYTHONUNBUFFERED True

WORKDIR /app

# Install protobuf compiler and curl
RUN apt-get update && apt-get install -y protobuf-compiler curl && rm -rf /var/lib/apt/lists/*

# Install production dependencies.
COPY requirements.txt .
RUN uv pip install --system --no-cache -r requirements.txt

# Copy local code to the container image.
COPY . .

# Fetch proto file from GitHub
RUN curl -o storage.proto https://raw.githubusercontent.com/datacommonsorg/import/master/pipeline/data/src/main/proto/storage.proto

# Generate proto descriptor set
RUN protoc --include_imports --descriptor_set_out=storage.pb storage.proto

# Run the functions framework
CMD ["functions-framework", "--target", "ingestion_helper"]
```
import-automation/workflow/ingestion-helper/cloudbuild.yaml (48 additions, 0 deletions):
```yaml
# Copyright 2026 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

steps:
  # Build the container image
  - name: 'gcr.io/cloud-builders/docker'
    args: ['build', '-t', '$_AR_REGION-docker.pkg.dev/datcom-ci/$_AR_REPO/datacommons-ingestion-helper:$_TAG', '-t', '$_AR_REGION-docker.pkg.dev/datcom-ci/$_AR_REPO/datacommons-ingestion-helper:latest', '.']

  # Push the container image - needed before deployment
  - name: 'gcr.io/cloud-builders/docker'
    args: ['push', '$_AR_REGION-docker.pkg.dev/datcom-ci/$_AR_REPO/datacommons-ingestion-helper:$_TAG']

  # Deploy container image to Cloud Run (conditional)
  - name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
    entrypoint: 'bash'
    args:
      - '-c'
      - |
        if [ "$_DEPLOY" = "true" ]; then
          gcloud run deploy '$_SERVICE_NAME' \
            --image '$_AR_REGION-docker.pkg.dev/datcom-ci/$_AR_REPO/datacommons-ingestion-helper:$_TAG' \
            --region '$_AR_REGION' \
            --project '$PROJECT_ID'
        else
          echo "Skipping deployment as _DEPLOY is not set to true"
        fi

substitutions:
  _AR_REGION: 'us'
  _AR_REPO: 'gcr.io'
  _SERVICE_NAME: 'spanner-ingestion-helper'
  _DEPLOY: 'false'
  _TAG: 'latest'

images:
  - '$_AR_REGION-docker.pkg.dev/datcom-ci/$_AR_REPO/datacommons-ingestion-helper:$_TAG'
  - '$_AR_REGION-docker.pkg.dev/datcom-ci/$_AR_REPO/datacommons-ingestion-helper:latest'
```
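Cloud Build expands the `$_VAR` substitutions into the Artifact Registry image path at build time. As a quick sanity check of how the defaults above resolve, a small sketch (the helper function name is illustrative, not part of the PR):

```python
# Mirror Cloud Build's substitution expansion for the image reference.
def image_path(ar_region="us", ar_repo="gcr.io", tag="latest",
               project="datcom-ci", name="datacommons-ingestion-helper"):
    """Format the Artifact Registry path used in cloudbuild.yaml."""
    return f"{ar_region}-docker.pkg.dev/{project}/{ar_repo}/{name}:{tag}"

# With the default substitutions:
# us-docker.pkg.dev/datcom-ci/gcr.io/datacommons-ingestion-helper:latest
default_image = image_path()
```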
New file (110 additions):

```sql
-- Copyright 2026 Google LLC
--
-- Licensed under the Apache License, Version 2.0 (the "License")
-- you may not use this file except in compliance with the License.
-- You may obtain a copy of the License at
--
--     http://www.apache.org/licenses/LICENSE-2.0
--
-- Unless required by applicable law or agreed to in writing, software
-- distributed under the License is distributed on an "AS IS" BASIS,
-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-- See the License for the specific language governing permissions and
-- limitations under the License.

CREATE PROTO BUNDLE (
  `org.datacommons.Observations`
);

CREATE TABLE Node (
  subject_id STRING(1024) NOT NULL,
  value STRING(MAX),
  bytes BYTES(MAX),
  name STRING(MAX),
  types ARRAY<STRING(1024)>,
  name_tokenlist TOKENLIST AS (TOKENIZE_FULLTEXT(name)) HIDDEN,
) PRIMARY KEY(subject_id);

CREATE TABLE Edge (
  subject_id STRING(1024) NOT NULL,
  predicate STRING(1024) NOT NULL,
  object_id STRING(1024) NOT NULL,
  provenance STRING(1024) NOT NULL,
) PRIMARY KEY(subject_id, predicate, object_id, provenance),
  INTERLEAVE IN Node;

CREATE TABLE Observation (
  observation_about STRING(1024) NOT NULL,
  variable_measured STRING(1024) NOT NULL,
  facet_id STRING(1024) NOT NULL,
  observation_period STRING(1024),
  measurement_method STRING(1024),
  unit STRING(1024),
  scaling_factor STRING(1024),
  observations org.datacommons.Observations,
  import_name STRING(1024),
  provenance_url STRING(1024),
  is_dc_aggregate BOOL,
) PRIMARY KEY(observation_about, variable_measured, facet_id);

CREATE TABLE ImportStatus (
  ImportName STRING(MAX) NOT NULL,
  LatestVersion STRING(MAX),
  GraphPath STRING(MAX),
  State STRING(1024) NOT NULL,
  JobId STRING(1024),
  WorkflowId STRING(1024),
  ExecutionTime INT64,
  DataVolume INT64,
  DataImportTimestamp TIMESTAMP OPTIONS ( allow_commit_timestamp = TRUE ),
  StatusUpdateTimestamp TIMESTAMP OPTIONS ( allow_commit_timestamp = TRUE ),
  NextRefreshTimestamp TIMESTAMP,
) PRIMARY KEY(ImportName);

CREATE TABLE IngestionHistory (
  CompletionTimestamp TIMESTAMP NOT NULL OPTIONS ( allow_commit_timestamp = TRUE ),
  IngestionFailure BOOL NOT NULL,
  WorkflowExecutionID STRING(1024) NOT NULL,
  DataflowJobID STRING(1024),
  IngestedImports ARRAY<STRING(MAX)>,
  ExecutionTime INT64,
  NodeCount INT64,
  EdgeCount INT64,
  ObservationCount INT64,
) PRIMARY KEY(CompletionTimestamp DESC);

CREATE TABLE ImportVersionHistory (
  ImportName STRING(MAX) NOT NULL,
  Version STRING(MAX) NOT NULL,
  UpdateTimestamp TIMESTAMP NOT NULL OPTIONS (allow_commit_timestamp=true),
  Comment STRING(MAX),
) PRIMARY KEY (ImportName, UpdateTimestamp DESC);

CREATE TABLE IngestionLock (
  LockID STRING(1024) NOT NULL,
  LockOwner STRING(1024),
  AcquiredTimestamp TIMESTAMP OPTIONS ( allow_commit_timestamp = TRUE ),
) PRIMARY KEY(LockID);

CREATE PROPERTY GRAPH DCGraph
  NODE TABLES(
    Node
      KEY(subject_id)
      LABEL Node PROPERTIES(
        bytes,
        name,
        subject_id,
        types,
        value)
  )
  EDGE TABLES(
    Edge
      KEY(subject_id, predicate, object_id, provenance)
      SOURCE KEY(subject_id) REFERENCES Node(subject_id)
      DESTINATION KEY(object_id) REFERENCES Node(subject_id)
      LABEL Edge PROPERTIES(
        object_id,
        predicate,
        provenance,
        subject_id)
  );
```
I don't think the GCP functions-framework utility is the right tool for hosting a REST service here.
Is the plan to migrate all of these functions to Cloud Run services?
If so, I think it makes sense to put them in the DCP service as a new endpoint:
https://github.com/datacommonsorg/datacommons/tree/main/packages/datacommons-api/datacommons_api/
That way we won't need a new image or CI/CD pipeline for building the import automation Docker image.
Also, moving from functions-framework to FastAPI gives us better logging, request and response validation with Pydantic objects, and puts these APIs in a single place.
I agree that functions-framework is a bit limited and odd in this case, but this is essentially the minimum work to migrate the existing Cloud Run function to a Cloud Run service that executes the same function. It allows dockerization for DCP without bloating Base DC infra.
I think there needs to be much more discussion before moving this to the DCP API. It certainly has some pros, but also some major drawbacks, such as moving ingestion code used across Base DC into the new DCP repo. It might be premature to move to the DCP service, especially as the DCP service is out of scope for M1.
If we leave it like this, M1 ships 4 artifacts:
all connected through the Terraform deployment. If we move the ingestion helper to the DCP service, then we would just need to maintain the DCP service image. Agreed that functions-framework is limited and not ideal; I think that migration can happen at any point in the future. There is no need to do it now, and there would be no complexity in migrating it into the DCP service when we launch the DCP service (in Q3/Q4).
Especially as the ingestion helper is required for Base DC, I don't think it makes sense for us to take this on now; we would have to deploy the DCP service to Base DC just for ingestion helper support.
Happy to hop on a call to discuss further!