[DCP Ingestion] Automate the DB initialization, Dockerize Ingestion Helper, and fix lock acquisition issue #1971
gmechali wants to merge 7 commits into datacommonsorg:master
Conversation
Code Review
This pull request introduces a database initialization feature for the ingestion helper, including Docker and Cloud Build configurations, a Spanner schema, and logic to apply DDL statements. The locking mechanism was also updated to handle the creation of new lock rows. The feedback includes a suggestion for a better DatabaseAdminClient configuration.
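The lock-acquisition fix mentioned above (handling lock rows that do not exist yet) can be sketched as follows. This is an illustrative sketch only, not the PR's code: an in-memory dict stands in for the Spanner lock table, and all names and the timeout policy are assumptions.

```python
# Sketch of create-on-first-acquire locking. A dict stands in for the Spanner
# lock table; `lock_name`, `owner`, and the timeout policy are illustrative.
import time


def try_acquire_lock(lock_table: dict, lock_name: str, owner: str,
                     timeout_secs: float = 300.0) -> bool:
    """Acquire `lock_name` for `owner`, creating the lock row if missing."""
    now = time.time()
    row = lock_table.get(lock_name)
    if row is None:
        # The fix: instead of failing when no row exists, insert a new one.
        lock_table[lock_name] = {"owner": owner, "acquired_at": now}
        return True
    if row["owner"] == owner or now - row["acquired_at"] > timeout_secs:
        # Re-entrant acquisition, or take over an expired lock.
        lock_table[lock_name] = {"owner": owner, "acquired_at": now}
        return True
    return False
```

In the real implementation this read-then-write sequence would run inside a single Spanner read-write transaction so two workers cannot both create the row.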
dwnoble
left a comment
Thanks Gabe!
I think this tooling may be better suited to live in the DCP repo at https://github.com/datacommonsorg/datacommons
This PR converts the import automation cloud functions to a REST service, but if we're going to do that, I think it makes sense to consolidate it with the DCP FastAPI service so we don't end up adding a separate process.
Then this action could look something like: POST https://<dcp-service>/api/admin/initialize-database
Regardless, I think we should also:
- convert this toolset from requirements.txt files to uv so we can have consistent builds with the uv.lock files
- right now the functions in main.py are tied to Cloud Run functions. We should migrate these to a plain Python module, and the Cloud Run part should be a minimal wrapper on top of it
- we should expose these functions via a command-line tool. If we migrate to the DCP repo, this might look like:
`$ datacommons admin initialize-database` or `$ datacommons admin release-ingestion-lock`
cc @vish-cs
```dockerfile
RUN protoc --include_imports --descriptor_set_out=storage.pb storage.proto

# Run the functions framework
CMD ["functions-framework", "--target", "ingestion_helper"]
```
I don't think the GCP functions-framework utility is the right tool to use here for hosting a REST service.
Is the plan to migrate all of these functions to Cloud Run services?
If so, I think it makes sense to put them in the DCP service as a new endpoint:
https://github.com/datacommonsorg/datacommons/tree/main/packages/datacommons-api/datacommons_api/
This way we won't need a new image or CI/CD pipeline for building the import automation Docker image.
Also, moving from functions-framework to FastAPI gives us better logging, request and response validation with Pydantic models, and puts these APIs in a single place.
I agree that functions-framework is a bit limited and odd in this case, but this is essentially the minimum work needed to migrate the existing Cloud Run function to a Cloud Run service that executes it. It allows dockerization for DCP without bloating Base DC infra.
I think there needs to be much more discussion before moving this to the DCP API. It certainly has some pros, but also some major drawbacks, such as moving ingestion code used across Base DC into this new DCP repo. It might be premature to move to the DCP service, especially as the DCP service is out of scope for M1.
If we leave it like this, M1 ships 4 artifacts:
- MCP PyPi package
- Mixer image
- ingestion helper image
- dataflow template
all connected through the Terraform deployment. If we move the ingestion helper into the DCP service, then we would only need to maintain the DCP service image. Agreed that functions-framework is limited and not ideal, but I think that migration can happen at any point in the future. There is no need to do it now, and there would be no complexity in migrating it into the DCP service when we launch the DCP service (in Q3/Q4).
Especially as the ingestion helper is required for Base DC, I don't think it makes sense for us to take this on now; we would have to deploy the DCP service to Base DC just for ingestion helper support.
Happy to hop on a call to discuss further!
My other comment largely covers the question of moving to the DCP repo. I think maybe one day, but it's premature, and splitting up the ingestion code also seems controversial.
This PR does 3 main things:
As a follow up we should do the following: