# AWS Data Engineering Pipeline

This is a repository for the Duke University Cloud Computing course project on a Serverless Data Engineering Pipeline. For this project, I recreated the pipeline below in AWS Cloud9 (reference: https://github.com/noahgift/awslambda).

Below are the steps to build this pipeline in AWS:
- Create a `fang` table in DynamoDB and an SQS queue, using `name` as the unique id for your items in the `fang` table. You can check how to do it here, or see the sketch below.
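
  If you'd rather script this step than click through the console, a minimal boto3 sketch might look like this (the queue name `producer` is my assumption; use whatever name you prefer):

  ```python
  import boto3

  # Create the "fang" table with "name" as the partition key (the unique id)
  dynamodb = boto3.client("dynamodb")
  dynamodb.create_table(
      TableName="fang",
      KeySchema=[{"AttributeName": "name", "KeyType": "HASH"}],
      AttributeDefinitions=[{"AttributeName": "name", "AttributeType": "S"}],
      BillingMode="PAY_PER_REQUEST",
  )

  # Create the SQS queue (queue name "producer" is an assumption)
  sqs = boto3.client("sqs")
  sqs.create_queue(QueueName="producer")
  ```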
- In Cloud9, initialize a serverless application with the SAM template:

  ```bash
  sam init
  ```

  Inputs: 1, 2, 4, "producer"
- Set up a virtual environment and source it:

  ```bash
  # I called my virtual environment "comprehendProducer"
  python3 -m venv ~/.comprehendProducer
  source ~/.comprehendProducer/bin/activate
  ```
- Add the code for your application to `app.py` (a rough sketch of the producer is shown below).
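
  As a sketch of the producer's shape (not the course's exact code): it scans the `fang` table and fans each item out to SQS. The queue name `producer` is my assumption from the earlier step:

  ```python
  import json

  import boto3

  TABLE = "fang"
  QUEUE = "producer"  # assumption: the SQS queue created earlier

  def scan_table(table_name):
      """Return all items in the DynamoDB table."""
      table = boto3.resource("dynamodb").Table(table_name)
      return table.scan()["Items"]

  def send_sqs_msg(msg, queue_name):
      """Send a JSON payload to the SQS queue."""
      sqs = boto3.client("sqs")
      queue_url = sqs.get_queue_url(QueueName=queue_name)["QueueUrl"]
      return sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps(msg))

  def lambda_handler(event, context):
      """Entry point: fan each company record out to SQS."""
      for item in scan_table(TABLE):
          send_sqs_msg(item, QUEUE)
  ```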
- Add the relevant packages used in your app to the `requirements.txt` file (a minimal example is shown below).
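
  For the sketch above, the `requirements.txt` could be as small as:

  ```
  boto3
  ```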
- Install the requirements:

  ```bash
  cd hello_world/
  pip install -r requirements.txt
  cd ..
  ```
- Create a repository (`producer`) in Elastic Container Registry (ECR) and copy its URI.
- Build and deploy your serverless application:

  ```bash
  sam build
  sam deploy --guided
  ```

  When prompted to input a URI, paste the URI of the `producer` repository that you've just created.
- Create an IAM Role granting Administrator Access to the Producer Lambda function. If you'd rather script it, see the sketch after this step.

  🤔 Not sure how to create an IAM Role? Check out this video (17 min).
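
  A minimal boto3 sketch of this step; the role name `producer-admin` is made up for illustration:

  ```python
  import json

  import boto3

  iam = boto3.client("iam")

  # Trust policy that lets the Lambda service assume the role
  trust_policy = {
      "Version": "2012-10-17",
      "Statement": [{
          "Effect": "Allow",
          "Principal": {"Service": "lambda.amazonaws.com"},
          "Action": "sts:AssumeRole",
      }],
  }

  # Role name "producer-admin" is an assumption, not from the course
  iam.create_role(
      RoleName="producer-admin",
      AssumeRolePolicyDocument=json.dumps(trust_policy),
  )

  # Attach the AWS-managed AdministratorAccess policy
  iam.attach_role_policy(
      RoleName="producer-admin",
      PolicyArn="arn:aws:iam::aws:policy/AdministratorAccess",
  )
  ```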
- Add the execution role that you created to the Producer Lambda function. In case you forgot how to do it, in the AWS console: Lambda ➡️ click on the `producer` function ➡️ Configuration ➡️ Permissions ➡️ Edit ➡️ select the role under Existing role.
- You are all set with the `producer` function! Now deactivate the virtual environment:

  ```bash
  deactivate
  cd ..
  ```
- Repeat the steps above for the `consumer` function. In its `app.py`, make sure to replace `bucket="fangsentiment"` with the name of your S3 bucket. A rough sketch of the consumer is shown after this step.
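
  A hedged sketch of the consumer's shape, assuming it is triggered by SQS, scores each message with AWS Comprehend, and writes the result to the S3 bucket (the real course code may score scraped text rather than the company name itself):

  ```python
  import json

  import boto3

  BUCKET = "fangsentiment"  # replace with the name of your S3 bucket

  def get_sentiment(text):
      """Score a piece of text with AWS Comprehend."""
      comprehend = boto3.client("comprehend")
      return comprehend.detect_sentiment(Text=text, LanguageCode="en")["Sentiment"]

  def lambda_handler(event, context):
      """Triggered by SQS: score each message and store the result in S3."""
      s3 = boto3.resource("s3")
      for record in event["Records"]:
          payload = json.loads(record["body"])
          name = payload["name"]
          # Illustrative: score the name itself; the real app may score scraped text
          sentiment = get_sentiment(name)
          s3.Object(BUCKET, f"{name}.json").put(
              Body=json.dumps({"name": name, "sentiment": sentiment})
          )
  ```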
- Add triggers to your Lambda functions:
  - Producer Lambda Function: CloudWatch Event (30 min)
  - Consumer Lambda Function: SQS (42 min)
Whenever you change the application code, rebuild and redeploy:

```bash
sam build && sam deploy
```
Useful Docker commands for cleaning up old images in Cloud9:

```bash
# list images
docker image ls

# remove an image
docker image rm <imageId>
```

