AWS Glue PySpark - Apache Hudi Quick Start Guide

Overview

AWS Glue PySpark - Apache Hudi Quick Start Guide

Disclaimer:

This is a quick start guide for the Apache Hudi Python Spark connector, running on AWS Glue.

It's also specifically configured for the following Glue version:

  • AWS Glue 3.0
    • Spark 3.1.1
    • Python 3.7

Glue Configuration Reference: https://docs.aws.amazon.com/glue/latest/dg/add-job.html

Apache Hudi Reference: https://hudi.apache.org/docs/quick-start-guide/ for more information

Prerequisites:

- Python 3.6 or higher
- AWS CLI - Profile named 'dev' with Administrator Access (https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-profiles.html)

Folder Structure:

glue-hudi-hello
├── README.md
├── cloud-formation
│   ├── command.md
│   └── GlueJobPySparkHudi.yaml
├── jars
│   ├── command.md
│   ├── hudi-spark3-bundle_2.12-0.9.0.jar
│   └── spark-avro_2.12-3.0.1.jar
├── job
│   ├── command.md
│   └── job.py
│   └── upload_job.py
├── requirements.txt

Step 1: Create and activate a virtualenv:

Create a new virtual environment for the project in its root directory:

python3 -m venv venv

Activate it:

source venv/bin/activate

Run from the root directory the pip install to get boto3.

pip install -r requirements.txt

Step 2: Create the AWS Resources:

Now, with a aws configured profile named as dev, cd into the cloud-formation folder and run the command in command.md.

As a AWS Cloud Formation exercise, read the command Parameters and how they are used on the GlueJobPySparkHudi.yaml file to dynamically create the Glue Job and S3 Bucket.

Step 3: Upload the Job and Jars to S3:

cd into the job folder and run the command in command.md.

cd into the jars folder and run the commands in command.md. Note: There is one command for each jar.

Step 4: Check AWS Resources results:

Log into aws console and check the Glue Job and S3 Bucket.

On the AWS Glue console, you can run the Glue Job by clicking on the job name.

After the job is finished, you can check the Glue Data Catalog and query the new database from AWS Athena.

On AWS Athena check for the database: hudi_demo and for the table: hudi_trips.

Owner
Gabriel Amazonas Mesquita
Gabriel Amazonas Mesquita
Predict the Site EUI, given the characteristics of the building and the weather data for the location of the building.

wids_datathon_2022 Description: Contains a data pipeline used to predict energy EUI Goals: Dataset exploration Automating the parameter fitting, gener

1 Mar 25, 2022
Senexia - A powerful telegram bot to manage your groups as effectively as possible

⚡ Kenechi bot ⚡ A Powerful, Smart And Simple Group Manager ... Written with AioG

Akhi 2 Jan 11, 2022
A mass account list editor for python

Account-List-Editor This is an mass account list editor Usage Run the editor.py file with python (python3 ./editor.py) Press a button (1/2) and drag &

ExtremeDev 1 Dec 20, 2021
Skautský discord bot

Jáchym 🤖 Open-source skautský discord bot postavený na discord.py O čem? • Funkce • TODO • Poděkování ❓ O čem? Jáchym vznikl jako projekt do odborky

10 May 12, 2022
API to retrieve the number of grades on the OGE website (Website listing the grades of students) to know if a new grade is available. If a new grade has been entered, the program sends a notification e-mail with the subject.

OGE-ESIREM-API Introduction API to retrieve the number of grades on the OGE website (Website listing the grades of students) to know if a new grade is

Benjamin Milhet 5 Apr 27, 2022
A Telegram bot for combining emojis.

combimoji combimoji is a Telegram bot for combining emojis. How can I use it? You can find combimoji at @combimoji_bot, however it is not up (as of No

Yarema Mishchenko 2 Dec 02, 2021
Discord-Mass-Mention - Yup the title says it all

Protocol - Mass Mention (i havent tested this with any token other than my own t

Mallowies 14 Nov 06, 2022
Un petit tool qui est la pour envoier des message avec des webhook en bêta

📎 Webhook-discord Le but de se tool c'est que tu peux envoier vos webhook discord sur vos serveur et les customiser Pour lancer le projet il faut avo

2 Oct 10, 2021
Telegram hack bot [ For Dev ]

Telegram hack bot [ For Dev ]

Alison Parker 1 Jul 04, 2022
LHXP・Official "LH - Cyber Security" Discord Leveling-Bot

LHXP・Official "LH - Cyber Security" Discord Leveling-Bot Based on nsde/NOVΛLIX Feature Overview /clear @user Requires admin permission Purges all XP

Felix・onlix 2 Mar 09, 2022
Scrapes an instagram user's photos and videos

Instagram Scraper instagram-scraper is a command-line application written in Python that scrapes and downloads an instagram user's photos and videos.

7.3k Nov 18, 2022
Instadev - Crack Instagram IqbalDev

Crack Instagram IqbalDev ⇨ Install Script Di Termux $ pkg update && upgrade $

Dicky Wahyudi 1 Feb 27, 2022
A powerful bot to copy your google drive data to your team drive

⚛️ Clonebot - Heroku version ⚡ CloneBot is a telegram bot that allows you to copy folder/team drive to team drives. One of the main advantage of this

MsGsuite 269 Dec 23, 2022
Schedule Twitter updates with easy

coo: schedule Twitter updates with easy Coo is an easy to use Python library for scheduling Twitter updates. To use it, you need to first apply for a

wilfredinni 46 Nov 03, 2022
Automatically updates the twitter banner with the images of 5 latest followers, using tweepy python

Auto twitter banner Automatically updates the twitter banner every few seconds with follower profile pics on it Here's how it looks! Installation git

Dhravya Shah 7 Jul 04, 2022
A simple Discord bot that notifies users of new Abitti versions

A simple Discord bot that notifies users of new Abitti versions. New features might be added later on. If you have good ideas, feel free to do a PR.

1 Feb 11, 2022
Neko is An Anime themed advance Telegram group management bot.

NekoRobot A modular telegram Python bot running on python3 with an sqlalchemy, mongodb database. ╒═══「 Status 」 Maintained Support Group Included Free

Lovely Prince 11 Oct 11, 2022
A reddit.com bot that will return reference links from official python documentation site for the standard library.

Python Docs Bot A reddit.com bot that will return documentation links for the library and language reference sections of the python docs website. The

Trevor Miller 2 Sep 14, 2021
A file-based quote bot written in Python

Let's Write a Python Quote Bot! This repository will get you started with building a quote bot in Python. It's meant to be used along with the Learnin

1 Feb 03, 2022
A template / demo bot for the Halcyon matrix bot library

Halcyon stock bot Hello! This is an example / template bot using the halcyon matrix bot library. Feel free to ask questions in the matrix chat #halcyo

Wes Ring 1 Feb 04, 2022