A solution designed to extract, transform and load Chicago crime data from an RDS instance to other services in AWS.

Overview

Crime data- Batch Processing:

RDBMS Data Extraction Implementation

This project is intended to implement a solution designed to extract, transform and load Chicago crime data from an RDS instance to other services in AWS.

  • There is an airflow dag script, 2 pyspark application scripts, and a bootstrap actions script in this project which are explained below.

Deployment

Preparation:

  • An AWS RDS MySQL instance is created to store the batch of data.
    • An EC2 instance is created to communicate with the RDS instance.
    • The data is loaded onto the EC2 instance.
    • The database and table are created on the RDS instance with the help of the above created EC2 instance. The data is loaded in the table created above.
    • The create&Load.sql file contains the code for the above table data preparation step.
    • A secret on the Secrets Manager console is stored to communicate with the RDS instance secretly. Also, password rotation after 30 days has been configured for security purposes.
  • The following dag loads the data created from the above step into the AWS environment.

Implementation:

  • The airflow dag is put in the s3://yavula-da-capstone/dag/ location in the S3 bucket. An environment is created on the Amazon Managed Workflows for Apache Airflow(MWAA) console in a specific VPC.
  • The dag is scheduled to run on a daily basis along with SLA monitoring to trigger an alarm if the tasks take more than 36 minutes to finish the whole ETL process.
  • It usually takes 32-34 minutes to finish the dag processes. But if it takes, more than that, it means that something has interrupted the dag from finishing its process and we can check the logs accordingly.

emr_job_flow_manual_steps_dag.py

This script is used to create an airflow dag.

Description

  • The script has steps for the airflow to create an EMR cluster on AWS for a process which is explained later in the next steps.
  • It runs the STEPS that process the spark script on the EMR along with the bootstrap actions present in the bootstrap_actions.sh script which is in an s3 bucket that will install the required package like boto3 onto the EMR instance.
  • Then the step checker is also added to watch this process. This step sensor will periodically check if that last step is completed or skipped or terminated.

spark_ingest_script.py

The spark script which is put into S3 manually, is used to ingest the required data from a table which is present on an RDS isntance and store the data into a raw s3 bucket and catalog into Glue.

Description

  • The ingest script connects to the RDS instance using the mysql-connector.
  • It takes the required crime data from the table and puts it into a spark dataframe which is then written to the AWS S3 and Glue data catalog.
  • S3 File Structure where the snapshot data is saved
    • (bucket)
    • (key)
    • (db-name)
    • (table-name)
  • Glue Data Catalog table pointing to the latest partition

spark_process_script.py

The spark script which is put into S3 manually, is used to query the latest target table, filter required crime details from it, then store the query results into a new final table and further save it to a latest partition.

Description

  • The spark script uses the crime data and performs some query processing using it.
  • It queries the required crime data from the table, performs some processing and puts it into a spark dataframe which is then written to the AWS S3 and Glue data catalog.
  • S3 File Structure where the snapshot data is saved
    • (bucket)
    • (key)
    • (db-name)
    • (table-name)
  • Glue Data Catalog table pointing to the latest partition

bootstrap_actions.sh

Required for the bootstrap actions.

Description

Used to install the packages and dependencies on the cluster that are required for the processes inside the spark script to run.

Deployment

  • This bootstrap script is put manually in an S3 bucket.
  • The location of this bucket is used inside the airflow dag to mention in the bootstrap actions that the required actions are present in the script which is in this particular s3 location.

Business Analysis

The final processed table had the crime type details for all the crimes for which the arrest is not made yet. This business analysis can be viewed from Athena and also has been imported into QuickSight Spice to view the details of different types of crimes and their comparisions.

Owner
Yesaswi Avula
An Applied Data Science student with an escalating learning and performance graph Data analytics, Data engineering, Business Intelligence, ML, Big Data & Cloud
Yesaswi Avula
A small discord bot to interface with python-discord's snekbox.

A small discord bot to interface with python-discord's snekbox.

Hassan Abouelela 0 Oct 05, 2021
Techie Sneh 19 Dec 03, 2021
Widevine MPD Content Downloader & Decryptor

Widevine-DL Encrypted MPD Manifest Content Downloader + Decryptor (not a Widevine Key Extractor!) Requirements ffmpeg, yt-dlp, aria2, widevine-l3-decr

Vank0n (SJJeon) 170 Dec 30, 2022
SongBot2.0 With Python

SongBot2.0 Host 👨‍💻 Heroku 🚀 Manditary Vars BOT_TOKEN : Get It from @Botfather Special Feature Downloads Songs fastly and less errors as well as 0

Mr.Tanaji 5 Nov 19, 2021
GitHub Actions Docker training

GitHub-Actions-Docker-training Training exercise repository for GitHub Actions using a docker base. This repository should be cloned and used for trai

GitHub School 1 Jan 21, 2022
Auto Moderation is a powerfull moderation bot

Auto Moderation.py Auto Moderation a powerful Moderation Discord Bot 🎭 Futures Moderation Auto Moderation 🚀 Installation git clone https://github.co

G∙MAX 2 Apr 02, 2022
A bot created with Python that interacts with GroupMe

GroupMe_Bot This is a bot I'm working on a small groupme group I'm in. This is something I'll work on in my spare time. Nothing but just a fun little

0 May 19, 2022
Cytotron - A unique discord bot like never before. Add it to your server to keep it active, motiviated, and amazing!!

Cytotron - Take your server to the next level Most of the details are in the website. Go to https://cytotron-bot.gq for more information. If that link

LeviathanProgramming 6 Jun 13, 2021
Um script simples para consultar dados, com API's simples.

Info sobre o Script Esta é uma das mais simples ferramentas para consultar dados. Daqui um tempo eu farei um UPGRADE no painel, irei adicionar um banc

Crowley 6 Apr 11, 2022
An enhanced discord.py, based off of the now-archived discord.py project

enhanced-discord.py A modern, maintained, easy to use, feature-rich, and async ready API wrapper for Discord written in Python. The Future of enhanced

Devision 2 Dec 21, 2022
Morpheus is a telegram bot that helps to simplify the process of making custom telegram stickers.

😎 Morpheus is a telegram bot that helps to simplify the process of making custom telegram stickers. As you may know, Telegram's official Sti

Abhijith K S 1 Dec 14, 2022
checks anilist for available usernames (200rq/s)

Anilist checker Running the program Set a path to the extracted files Install the packages with pip install -r req.txt Run the script by typing python

gxzs 1 Oct 13, 2021
Short Program using Transavia's API to notify via email an user waiting for a flight at special dates and with the best price

Flight-Notifier Short Program using Transavia's API to notify via email an user waiting for a flight at special dates and with the best price Algorith

Wassim 2 Apr 10, 2022
A Python Script to scan through an Instagram account to find all the followers and followings.

Instagram Followers Scan A Python Script to scan through an Instagram account to find all the followers and followings. You can also get filtered list

Nityasmit Mallick 6 Oct 27, 2022
Tiktok-bot - A Simple Tiktok bot With Python

Install the requirements pip install selenium pip install pyfiglet==0.7.5 How ca

Muchlis Faroqi 5 Aug 23, 2022
Basic Discord python bot

#How to Create a Discord Bot Account In order to work with the Python library and the Discord API, we must first create a Discord Bot account. Here ar

Tustus 1 Oct 13, 2021
This is a simple bot for running Python code through Discord

Python Code Runner Discord Bot This is a simple bot for running Python code through Discord. It was originally developed for the Beginner.Codes Discor

beginner.py 1 Feb 14, 2022
Simple, yet effective moderator bot for telegram. With reports, logs, profanity filter and more :3

👹 Samurai Telegram Bot Simple, yet effective moderator bot for telegram. With reports, logs, profanity filter and more :3 Description Personal bot, m

Abraham Tugalov 106 Dec 13, 2022
Anti Spam/NSFW Telegram Bot Written In Python With Pyrogram.

✨ SpamProtectionRobot ✨ Anti Spam/NSFW Telegram Bot Written In Python With Pyrogram. Requirements Python = 3.7 Install Locally Or On A VPS $ git clon

Akshay Rajput 46 Dec 13, 2022
AWSXenos will list all the trust relationships in all the IAM roles and S3 buckets

AWS External Account Scanner Xenos, is Greek for stranger. AWSXenos will list all the trust relationships in all the IAM roles, and S3 buckets, in an

AirWalk 57 Nov 07, 2022