A solution designed to extract, transform and load Chicago crime data from an RDS instance to other services in AWS.

Overview

Crime data- Batch Processing:

RDBMS Data Extraction Implementation

This project is intended to implement a solution designed to extract, transform and load Chicago crime data from an RDS instance to other services in AWS.

  • There is an airflow dag script, 2 pyspark application scripts, and a bootstrap actions script in this project which are explained below.

Deployment

Preparation:

  • An AWS RDS MySQL instance is created to store the batch of data.
    • An EC2 instance is created to communicate with the RDS instance.
    • The data is loaded onto the EC2 instance.
    • The database and table are created on the RDS instance with the help of the above created EC2 instance. The data is loaded in the table created above.
    • The create&Load.sql file contains the code for the above table data preparation step.
    • A secret on the Secrets Manager console is stored to communicate with the RDS instance secretly. Also, password rotation after 30 days has been configured for security purposes.
  • The following dag loads the data created from the above step into the AWS environment.

Implementation:

  • The airflow dag is put in the s3://yavula-da-capstone/dag/ location in the S3 bucket. An environment is created on the Amazon Managed Workflows for Apache Airflow(MWAA) console in a specific VPC.
  • The dag is scheduled to run on a daily basis along with SLA monitoring to trigger an alarm if the tasks take more than 36 minutes to finish the whole ETL process.
  • It usually takes 32-34 minutes to finish the dag processes. But if it takes, more than that, it means that something has interrupted the dag from finishing its process and we can check the logs accordingly.

emr_job_flow_manual_steps_dag.py

This script is used to create an airflow dag.

Description

  • The script has steps for the airflow to create an EMR cluster on AWS for a process which is explained later in the next steps.
  • It runs the STEPS that process the spark script on the EMR along with the bootstrap actions present in the bootstrap_actions.sh script which is in an s3 bucket that will install the required package like boto3 onto the EMR instance.
  • Then the step checker is also added to watch this process. This step sensor will periodically check if that last step is completed or skipped or terminated.

spark_ingest_script.py

The spark script which is put into S3 manually, is used to ingest the required data from a table which is present on an RDS isntance and store the data into a raw s3 bucket and catalog into Glue.

Description

  • The ingest script connects to the RDS instance using the mysql-connector.
  • It takes the required crime data from the table and puts it into a spark dataframe which is then written to the AWS S3 and Glue data catalog.
  • S3 File Structure where the snapshot data is saved
    • (bucket)
    • (key)
    • (db-name)
    • (table-name)
  • Glue Data Catalog table pointing to the latest partition

spark_process_script.py

The spark script which is put into S3 manually, is used to query the latest target table, filter required crime details from it, then store the query results into a new final table and further save it to a latest partition.

Description

  • The spark script uses the crime data and performs some query processing using it.
  • It queries the required crime data from the table, performs some processing and puts it into a spark dataframe which is then written to the AWS S3 and Glue data catalog.
  • S3 File Structure where the snapshot data is saved
    • (bucket)
    • (key)
    • (db-name)
    • (table-name)
  • Glue Data Catalog table pointing to the latest partition

bootstrap_actions.sh

Required for the bootstrap actions.

Description

Used to install the packages and dependencies on the cluster that are required for the processes inside the spark script to run.

Deployment

  • This bootstrap script is put manually in an S3 bucket.
  • The location of this bucket is used inside the airflow dag to mention in the bootstrap actions that the required actions are present in the script which is in this particular s3 location.

Business Analysis

The final processed table had the crime type details for all the crimes for which the arrest is not made yet. This business analysis can be viewed from Athena and also has been imported into QuickSight Spice to view the details of different types of crimes and their comparisions.

Owner
Yesaswi Avula
An Applied Data Science student with an escalating learning and performance graph Data analytics, Data engineering, Business Intelligence, ML, Big Data & Cloud
Yesaswi Avula
allow windows programs to call dssp/mkdssp command from wsl; rework biopython on windows (PDB -> dssp -> fasta)

dssp-wsl Converting PDB (Protein Data Bank) file format to DSSP file format is required for generating datasets of peptides and their secondary struct

Taine Zhao 1 Feb 23, 2022
Boto3 code assistance for any API in any IDE, always up to date

botostubs Gives you code assistance for any boto3 API in any IDE. Get started by running pip install botostubs Demo Features PyPI package automaticall

Jeshan Giovanni BABOOA 94 Nov 14, 2022
Simple library for logging to Loggly

#Hoover A python wrapper used to hit the Loggly. API For more information on Hoover see http://wiki.loggly.com/hooverguide ##Install With this git rep

Hoover Loggly 34 May 19, 2021
Media Replay Engine (MRE) is a framework to build automated video clipping and replay (highlight) generation pipelines for live and video-on-demand content.

Media Replay Engine (MRE) is a framework for building automated video clipping and replay (highlight) generation pipelines using AWS services for live

Amazon Web Services - Labs 30 Nov 29, 2022
“ HOLA HUMANS 👋 I'M DAISYX 2.0 „ LATEST VERSION OF DAISYX.. Source Code of @Daisyxbot

DaisyX 2.0 A Powerful, Smart And Simple Group Manager ... Written with AioGram , Pyrogram and Telethon... The first AioGram based modified groupmanage

TeamDaisyX 153 Dec 06, 2022
DSAIL repos - DSAIL Repository Template

DSAIL Repository Template DSAIL @ KAIST . ├── configs ('--F', help='for configur

yunhak 2 Feb 14, 2022
Pdisk Link Converter Telegram Bot, Convert link in a single click

Pdisk Converter Bot Make short link by using Pdisk API key Installation The Easy Way Required Variables BOT_TOKEN: Create a bot using @BotFather, and

Ayush Kumar Jaiswal 6 Jul 28, 2022
A powerfull SMS Bomber for Bangladesh . NO limite .Unlimited SMS Spaming

RedBomberBD A powerfull SMS Bomber for Bangladesh . NO limite .Unlimited SMS Spaming Installation Install my-tool on termux by using thoes commands pk

Abdullah Al Redwan 3 Feb 16, 2022
Discord bot script for sending multiple media files to a discord channel according to discord limitations.

Discord Bulk Image Sending Bot Send bulk images to Discord channel. This is a bot script that will allow you to send multiple images to Discord channe

Nikola Arbov 1 Jan 13, 2022
This bot is made with Python and it is running using Docker container and is concentrated on heroku.

This bot is made with Python and it is running using Docker container and is concentrated on heroku.

Movindu Bandara 1 Nov 16, 2021
Modified Version of mega.py package for Pyrogram Bots

Pyro Mega.py Python library for the Mega.co.nz API, currently supporting: login uploading downloading deleting searching sharing renaming moving files

I'm Not A Bot #Left_TG 10 Aug 03, 2022
An incomplete add-on extension to Pyrogram, to create telegram bots a bit more easily

PyStark A star ⭐ from you means a lot An incomplete add-on extension to Pyrogram

Stark Bots 36 Dec 23, 2022
A simple script & container to pull COVID data from covidlive.com.au and post a summary to a slack channel

CovidLive AU Summary Slackbot This bot is a very simple slackbot that pulls data, summarises and posts up to date AU COVID stats to a provided slack c

James 3 Dec 18, 2021
YouTube-Discord-Bot - Discord Bot to Search YouTube

YouTube Bot Info YouTube Bot is a discord bot where you can search for anything

Riceblades11 10 Mar 05, 2022
Social Framework

Social Int Framework Social Int Framework its a Selenium script that scrape the IG photos and do a Reverse search on google and yandex for finding ano

29 Dec 06, 2022
A results generator and automatic token checker for Yandex Contest

Yandex contest Python checking tools A results generator and automatic token checker for Yandex Contest. Версия на русском языке Installation Clone th

Nikolay Chechulin 9 Dec 14, 2022
8300-account-nuker - A simple accoutn nuker or can use it full closing dm and leaving server

8300 ACCOUNT NUKER VERISON: its just simple accoutn nuker or can use it full clo

†† 5 Jan 26, 2022
Weee - Advanced project's versions bumper

Weee - Advanced project's versions bumper

Yan Kurbatov 2 Jun 06, 2022
veez music bot is a telegram music bot project, allow you to play music on voice chat group telegram.

🎶 Veez Music Bot Music bot for playing music on telegram voice chat group. Requirements 📝 FFmpeg NodeJS nodesource.com Python 3.7+ PyTgCalls 🧪 Get

levina 143 Jun 19, 2022
A program that generates discord.py code

discord-py-generator A program that generates discord.py code Setup in cmds.txt file add your user id, client id and bot token you can change the bot

3 Dec 15, 2022