AWS Glue PySpark - Apache Hudi Quick Start Guide

Overview

AWS Glue PySpark - Apache Hudi Quick Start Guide

Disclaimer:

This is a quick start guide for the Apache Hudi Python Spark connector, running on AWS Glue.

It's also specifically configured for the following Glue version:

  • AWS Glue 3.0
    • Spark 3.1.1
    • Python 3.7

Glue Configuration Reference: https://docs.aws.amazon.com/glue/latest/dg/add-job.html

Apache Hudi Reference: https://hudi.apache.org/docs/quick-start-guide/ for more information

Prerequisites:

- Python 3.6 or higher
- AWS CLI - Profile named 'dev' with Administrator Access (https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-profiles.html)

Folder Structure:

glue-hudi-hello
├── README.md
├── cloud-formation
│   ├── command.md
│   └── GlueJobPySparkHudi.yaml
├── jars
│   ├── command.md
│   ├── hudi-spark3-bundle_2.12-0.9.0.jar
│   └── spark-avro_2.12-3.0.1.jar
├── job
│   ├── command.md
│   └── job.py
│   └── upload_job.py
├── requirements.txt

Step 1: Create and activate a virtualenv:

Create a new virtual environment for the project in its root directory:

python3 -m venv venv

Activate it:

source venv/bin/activate

Run from the root directory the pip install to get boto3.

pip install -r requirements.txt

Step 2: Create the AWS Resources:

Now, with a aws configured profile named as dev, cd into the cloud-formation folder and run the command in command.md.

As a AWS Cloud Formation exercise, read the command Parameters and how they are used on the GlueJobPySparkHudi.yaml file to dynamically create the Glue Job and S3 Bucket.

Step 3: Upload the Job and Jars to S3:

cd into the job folder and run the command in command.md.

cd into the jars folder and run the commands in command.md. Note: There is one command for each jar.

Step 4: Check AWS Resources results:

Log into aws console and check the Glue Job and S3 Bucket.

On the AWS Glue console, you can run the Glue Job by clicking on the job name.

After the job is finished, you can check the Glue Data Catalog and query the new database from AWS Athena.

On AWS Athena check for the database: hudi_demo and for the table: hudi_trips.

Owner
Gabriel Amazonas Mesquita
Gabriel Amazonas Mesquita
Desktop Backup Client for Borg

Vorta Backup Client Vorta is a backup client for macOS and Linux desktops. It integrates the mighty BorgBackup with your desktop environment to protec

BorgBase.com 1.5k Jan 03, 2023
Michelle is a Discord Bot coded in Python with Discord.py by Mudit07.

Michelle is a Discord Bot coded in Python with Discord.py by Mudit07.

Michelle 3 Oct 09, 2021
Remedy when Amazon ECR is not running basic scans for container CVEs.

Welcome to your CDK Python project! This is a blank project for Python development with CDK. The cdk.json file tells the CDK Toolkit how to execute yo

4n6ir 4 Nov 05, 2022
A powerfull Zee5 Downloader Bot With Permeneant Thumbnail Support 💯 With Love From NexonHex

Zᴇᴇ5 DL A ᴘᴏᴡᴇʀғᴜʟʟ Zᴇᴇ5 Dᴏᴡɴʟᴏᴀᴅᴇʀ Bᴏᴛ Wɪᴛʜ Pᴇʀᴍᴇɴᴇᴀɴᴛ Tʜᴜᴍʙɴᴀɪʟ Sᴜᴘᴘᴏʀᴛ 💯 Wɪᴛʜ Lᴏᴠᴇ Fʀᴏᴍ NᴇxᴏɴHᴇx Wʜᴀᴛ Cᴀɴ I Dᴏ ? • ɪ ᴄᴀɴ Uᴘʟᴏᴀᴅ ᴀs ғɪʟᴇ/ᴠɪᴅᴇᴏ ғʀᴏᴍ

Psycharmers 4 Jan 19, 2022
A Python interface to AFL, allowing for easy injection of testcases and other functionality.

Fuzzer This module provides a Python wrapper for interacting with AFL (American Fuzzy Lop: http://lcamtuf.coredump.cx/afl/). It supports starting an A

Shellphish 614 Dec 26, 2022
A Telegram user bot to count telegram channel subscriber or group member.

Subscriber Count Userbot A Telegram user bot to count telegram channel subscriber or group member. This tool is only for educational purpose. You coul

IDNCoderX 8 Nov 30, 2022
discord.py bot written in Python.

bakerbot Bakerbot is a discord.py bot written in Python :) Originally made as a learning exercise, now used by friends as a somewhat useful bot and us

8 Dec 04, 2022
Discord Bot Personnal Server - Ha-Neul

Haneul Bot, it's a discord for help me on my personnal discord, she do a lot of boring and repetitive stain. You can use on your own server if you want, you just need to find a host for the programm

Maxvyr 1 Feb 03, 2022
Python library to download market data via Bloomberg, Eikon, Quandl, Yahoo etc.

findatapy findatapy creates an easy to use Python API to download market data from many sources including Quandl, Bloomberg, Yahoo, Google etc. using

Cuemacro 1.3k Jan 04, 2023
A python client for the Software-Challenge Germany.

sc-client-python A python client for the Software-Challenge Germany. Creating a new project (Optional) Install virtualenv virtualenv is a tool that cr

rpkak 3 Jan 22, 2022
A virus/stealer made in py

python-virus A virus/stealer made in py. Features: Discord token stealer, Password stealer, Windows key stealer, Credit-card stealer, Image grab, Anti

SKYNETMARCI 5 Dec 12, 2022
Bot Maker For Discord - Python Edition

BMFD-PE Bot Maker For Discord - Python Edition BMFD-PE is a new version of BMFD write in Python The Version of BMFD-PE is : alpha0.1 Longer support :

Téo 2 Dec 22, 2021
OpenZeppelin Contracts written in Cairo for StarkNet, a decentralized ZK Rollup

OpenZeppelin Cairo Contracts A library for secure smart contract development written in Cairo for StarkNet, a decentralized ZK Rollup. ⚠️ WARNING! ⚠️

OpenZeppelin 592 Jan 04, 2023
A Python Discord bot project generator

Heater Heat up a Discord bot in a blink What is Heater? Heater is a Command Line Interface tool which allows you to generate a barebones Python Discor

DevGuyAhnaf 5 Jan 14, 2022
Python library to interact with a Z-Wave JS server.

zwave-js-server-python Python library for communicating with zwave-js-server. Goal for this library is to replicate the structure and the events of Z-

Home Assistant Libraries 54 Dec 18, 2022
A simple discord bot written in python which can surf subreddits, send a random meme, jokes and also weather of a given place

A simple Discord Bot A simple discord bot written in python which can surf subreddits, send a random meme, jokes and also weather of a given place. We

1 Jan 24, 2022
Tools for Twitter

Tools for Twitter Data This is a start of a collection of tools to use for collecting data via the Twitter API. If you do not have a Twitter Developer

DiscoverText 36 Oct 13, 2022
Telegram RAT written in Python

teleRAT Python based RAT that uses Telegram for sending commands and receiving data to and from a victim computer. Setup.py Insert your API key into t

96 Jan 01, 2023
Automatically Edits Videos and Uploads to Tiktok with 1 line of code.

TiktokAutoUploader - Open to code contributions Automatically Edits Videos and Uploads to Tiktok with 1 line of code. Setup pip install -r requirement

Michael Peres 199 Dec 27, 2022
Sadew Jayasekara 23 Oct 21, 2022