| Key | Value |
|---|---|
| Services | EMR Serverless, S3, IAM |
| Integrations | Terraform, AWS CLI |
| Categories | Analytics; Big Data; Spark |
A demo application illustrating how to add Python dependencies to an EMR Serverless Spark job using LocalStack. This sample implements a workaround for mounting Python environments directly into the LocalStack container, enabling PySpark jobs with custom dependencies to run locally.
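The volume-mount workaround can be sketched as a docker-compose fragment. This is a hypothetical illustration only: the image, mount paths, and service name are assumptions, not taken from this repo's compose file.

```yaml
# Hypothetical docker-compose sketch (image and paths assumed):
# mount the locally built Python environment into the LocalStack
# container so the Spark job can use it as its interpreter.
services:
  localstack:
    image: localstack/localstack-pro
    environment:
      - LOCALSTACK_AUTH_TOKEN=${LOCALSTACK_AUTH_TOKEN}
    volumes:
      - ./pyspark_env:/tmp/environment            # assumed mount target
      - /var/run/docker.sock:/var/run/docker.sock
```

The key idea is that the dependency folder exists inside the container at a stable absolute path, so no archive needs to be unpacked at job start.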
- A valid LocalStack for AWS license. Your license provides a `LOCALSTACK_AUTH_TOKEN` to activate LocalStack.
- Docker
- `localstack` CLI
- `awslocal` CLI
- Terraform ~> 1.9.1
Verify your local setup:

```bash
make check
```

This initializes your Terraform workspaces:

```bash
make init
```

Build the Python dependencies for the Spark job. For LocalStack, we create a `/pyspark_env` folder that is mounted into the LocalStack container (rather than packaging the dependencies as a tarball, as on AWS):
```bash
# For LocalStack: creates /pyspark_env folder
make build

# For AWS: creates pyspark_deps.tar.gz
make build-aws
```

Set your auth token and start LocalStack:

```bash
export LOCALSTACK_AUTH_TOKEN=<your-auth-token>
make start
```

Deployment creates the following resources via Terraform: an IAM role, an IAM policy, an S3 bucket, and an EMR Serverless application.
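The Terraform resources might be declared roughly as follows. This is a sketch, not the repo's actual configuration: resource names, the release label, and the bucket name are assumptions.

```hcl
# Hypothetical sketch of the Terraform resources (names assumed).

data "aws_iam_policy_document" "emr_assume" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["emr-serverless.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "emr" {
  name               = "emr-serverless-job-role"
  assume_role_policy = data.aws_iam_policy_document.emr_assume.json
}

resource "aws_s3_bucket" "scripts" {
  bucket = "emr-serverless-demo"
}

resource "aws_emrserverless_application" "spark" {
  name          = "pyspark-deps-demo"
  release_label = "emr-6.15.0"
  type          = "spark"
}
```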
```bash
# Deploy to LocalStack (starts LocalStack via docker-compose and applies Terraform)
LOCALSTACK_AUTH_TOKEN=$LOCALSTACK_AUTH_TOKEN make deploy

# Deploy to AWS
make deploy-aws
```

We can finally run our Spark job. Note the difference in `start_job.sh` between LocalStack and AWS: for AWS, the packaged dependencies are shipped via `spark.archives` and the job uses `environment/bin/python`; for LocalStack, we rely on the volume mounted into the container and use the absolute path `/tmp/environment/bin/python`.
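As a rough illustration of that difference, the two sets of Spark submit parameters might be built like this. The helper name and bucket are hypothetical, and the conf keys follow the general EMR Serverless pattern for custom Python environments rather than this repo's exact `start_job.sh`:

```python
# Sketch (function name and bucket are assumptions): build the
# sparkSubmitParameters string that differs between AWS and LocalStack.

def spark_submit_params(target: str, bucket: str = "demo-bucket") -> str:
    """Return the --conf flags selecting the job's Python interpreter."""
    if target == "aws":
        # On AWS, the packed environment is shipped as an S3 archive and
        # unpacked into ./environment in the job's working directory.
        return (
            f"--conf spark.archives=s3://{bucket}/pyspark_deps.tar.gz#environment "
            "--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python "
            "--conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python"
        )
    # On LocalStack, the environment folder is volume-mounted into the
    # container, so we point at its absolute path instead.
    return (
        "--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=/tmp/environment/bin/python "
        "--conf spark.executorEnv.PYSPARK_PYTHON=/tmp/environment/bin/python"
    )
```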
```bash
# LocalStack
make run

# AWS
make run-aws
```

To clean up the resources:

```bash
# LocalStack
make destroy

# AWS
make destroy-aws
```

This code is available under the Apache 2.0 license.