A solution designed to extract, transform and load Chicago crime data from an RDS instance to other services in AWS.

Overview

Crime data- Batch Processing:

RDBMS Data Extraction Implementation

This project is intended to implement a solution designed to extract, transform and load Chicago crime data from an RDS instance to other services in AWS.

  • There is an airflow dag script, 2 pyspark application scripts, and a bootstrap actions script in this project which are explained below.

Deployment

Preparation:

  • An AWS RDS MySQL instance is created to store the batch of data.
    • An EC2 instance is created to communicate with the RDS instance.
    • The data is loaded onto the EC2 instance.
    • The database and table are created on the RDS instance with the help of the above created EC2 instance. The data is loaded in the table created above.
    • The create&Load.sql file contains the code for the above table data preparation step.
    • A secret on the Secrets Manager console is stored to communicate with the RDS instance secretly. Also, password rotation after 30 days has been configured for security purposes.
  • The following dag loads the data created from the above step into the AWS environment.

Implementation:

  • The airflow dag is put in the s3://yavula-da-capstone/dag/ location in the S3 bucket. An environment is created on the Amazon Managed Workflows for Apache Airflow(MWAA) console in a specific VPC.
  • The dag is scheduled to run on a daily basis along with SLA monitoring to trigger an alarm if the tasks take more than 36 minutes to finish the whole ETL process.
  • It usually takes 32-34 minutes to finish the dag processes. But if it takes, more than that, it means that something has interrupted the dag from finishing its process and we can check the logs accordingly.

emr_job_flow_manual_steps_dag.py

This script is used to create an airflow dag.

Description

  • The script has steps for the airflow to create an EMR cluster on AWS for a process which is explained later in the next steps.
  • It runs the STEPS that process the spark script on the EMR along with the bootstrap actions present in the bootstrap_actions.sh script which is in an s3 bucket that will install the required package like boto3 onto the EMR instance.
  • Then the step checker is also added to watch this process. This step sensor will periodically check if that last step is completed or skipped or terminated.

spark_ingest_script.py

The spark script which is put into S3 manually, is used to ingest the required data from a table which is present on an RDS isntance and store the data into a raw s3 bucket and catalog into Glue.

Description

  • The ingest script connects to the RDS instance using the mysql-connector.
  • It takes the required crime data from the table and puts it into a spark dataframe which is then written to the AWS S3 and Glue data catalog.
  • S3 File Structure where the snapshot data is saved
    • (bucket)
    • (key)
    • (db-name)
    • (table-name)
  • Glue Data Catalog table pointing to the latest partition

spark_process_script.py

The spark script which is put into S3 manually, is used to query the latest target table, filter required crime details from it, then store the query results into a new final table and further save it to a latest partition.

Description

  • The spark script uses the crime data and performs some query processing using it.
  • It queries the required crime data from the table, performs some processing and puts it into a spark dataframe which is then written to the AWS S3 and Glue data catalog.
  • S3 File Structure where the snapshot data is saved
    • (bucket)
    • (key)
    • (db-name)
    • (table-name)
  • Glue Data Catalog table pointing to the latest partition

bootstrap_actions.sh

Required for the bootstrap actions.

Description

Used to install the packages and dependencies on the cluster that are required for the processes inside the spark script to run.

Deployment

  • This bootstrap script is put manually in an S3 bucket.
  • The location of this bucket is used inside the airflow dag to mention in the bootstrap actions that the required actions are present in the script which is in this particular s3 location.

Business Analysis

The final processed table had the crime type details for all the crimes for which the arrest is not made yet. This business analysis can be viewed from Athena and also has been imported into QuickSight Spice to view the details of different types of crimes and their comparisions.

Owner
Yesaswi Avula
An Applied Data Science student with an escalating learning and performance graph Data analytics, Data engineering, Business Intelligence, ML, Big Data & Cloud
Yesaswi Avula
Changes the Telegram bio, profile picture, first and last name to the song that the user is currently listening to.

TGBIOFY - Telegram & Spotify integration Changes the Telegram bio, profile picture, first and last name to the song that the user is currently listeni

elpideus 26 Dec 07, 2022
Webservice that notifies users on Slack when a change in GitLab concern them.

Gitlab Slack Notifier Webservice that notifies users on Slack when a change in GitLab concern them. Setup Slack Create a Slack app, go to "OAuth & Per

Heuritech 2 Nov 04, 2021
Image captioning service for healthcare domains in Vietnamese using VLP

Image captioning service for healthcare domains in Vietnamese using VLP This service is a web service that provides image captioning services for heal

CS-UIT AI Club 2 Nov 04, 2021
This project uses Youtube data API's to do youtube tags analysis based on viewCount, comments etc.

Youtube video details analyser Steps to run this project Please set the AuthKey which you can fetch from google developer console and paste it in the

1 Nov 21, 2021
Fetch fund data from avanza.se using Python and some web scraping with bs4

Py(A)vanza Fetch fund data from avanza.se using Python and some web scraping with bs4. The default way is to display the data in the terminal, apply -

dunderrrrrr 1 Jan 27, 2022
OMDB-and-TasteDive-Mashup - Mashing up data from two different APIs to make movie recommendations.

OMDB-and-TasteDive-Mashup This hadns-on project is in the Python 3 Programming Specialization offered by University of Michigan via Coursera. Mashing

Eszter Pai 1 Jan 05, 2022
The gPodder podcast client.

___ _ _ ____ __ _| _ \___ __| |__| |___ _ _ |__ / / _` | _/ _ \/ _` / _` / -_) '_| |_ \ \__, |_| \___/\__,_\__,_\___|_| |_

gPodder and related projects 1.1k Jan 04, 2023
iCloudPy is a simple iCloud webservices wrapper library written in Python

iCloudPy 🤟 Please star this repository if you end up using the library. It will help me continue supporting this product. 🙏 iCloudPy is a simple iCl

Mandar Patil 49 Dec 26, 2022
Fun telegram bot =)

Recolor Bot About Fun telegram bot, that can change your hair color. Preparations Update package lists sudo apt-get update; Make sure Git and docker-c

Just Koala 4 Jul 09, 2022
Passive income method via SerpClix. Uses a bot to accept clicks.

SerpClixBotSearcher This bot allows you to get passive income from SerpClix. Each click is usually $0.10 (sometimes $0.05 if offer isnt from the US).

Jason Mei 3 Sep 01, 2021
A python script to acquire multiple aws ec2 instances in a forensically sound-ish way

acquire_ec2.py The script acquire_ec2.py is used to automatically acquire AWS EC2 instances. The script needs to be run on an EC2 instance in the same

Deutsche Telekom Security GmbH 31 Sep 10, 2022
Nflmetrics - Johns Hopkins Spring 2022 Sports Analytics research project about NFL Draft Metrics

nflmetrics GitHub repo for Johns Hopkins Spring 2022 Sports Analytics research p

Anish Kulkarni 4 Feb 24, 2022
discord.js nuker (50 bans a sec)

js-nuker discord.js nuker (50 bans a sec) I was to lazy to make the scraper in js, but this works too. DISCLAIMER This is tool was made for educationa

4 Sep 11, 2021
Buscar y descargar canciones de YouTube automáticamente desde la web

🎶 DescargarCanciones 🎶 Buscar y descargar canciones o playlist de Spotify o YouTube automáticamente con todos los metadatos de la canciones en forma

1 Dec 20, 2021
HASOKI DDOS TOOL- powerful DDoS toolkit for penetration tests

DDoS Attack Panel includes CloudFlare Bypass (UAM, CAPTCHA, GS ,VS ,BFM, etc..) This is open source code. I am not responsible if you use it for malic

Rebyc 1 Dec 02, 2022
Discord Bot that leverages the idea of nested containers using podman, runs untrusted user input, executes Quantum Circuits, allows users to refer to the Qiskit Documentation, and provides the ability to search questions on the Quantum Computing StackExchange.

Discord Bot that leverages the idea of nested containers using podman, runs untrusted user input, executes Quantum Circuits, allows users to refer to the Qiskit Documentation, and provides the abilit

Mehul 23 Oct 18, 2022
Wonderful Phoenix-Bot

Phoenix Bot Discord Phoenix Bot is inspired by Natewong1313's Bird Bot project yet due to lack of activity by their team. We have decided to revive th

Senior Developer 0 Aug 12, 2021
⬇️ Telegram Bot to download TikTok videos without watermark in a snap with Inline mode support.

⬇️ Tokmate - Telegram Bot to download TikTok videos ⛲ Features Superfast and supports all type of TikTok links Download any TikTok videos without mate

Hemanta Pokharel 35 Jan 05, 2023
A heraldry-related bot, designed for the Heraldry Community.

Heraldtron A heraldry-related bot, designed for the Heraldry Community. Requirements Python 3.9+ discord.py aiohttp (comes installed with discord.py)

1 Mar 31, 2022
AminoAutoRegFxck/AutoReg For AminoApps.com

AminoAutoRegFxck AminoAutoRegFxck/AutoReg For AminoApps.com Termux apt update -y apt upgrade -y pkg install python git clone https://github.com/deluvs

3 Jan 18, 2022