AWS Glue ETL Code Samples

This repository has samples that demonstrate various aspects of the new AWS Glue service, as well as various AWS Glue utilities.

You can find the AWS Glue open-source Python libraries in a separate repository at: awslabs/aws-glue-libs.

Content

FAQ and How-to

Helps you get started using the many ETL capabilities of AWS Glue, and answers some of the more common questions people have.

Examples

You can run these sample job scripts in AWS Glue ETL jobs, in a container, or in a local environment.

Join and Relationalize Data in S3

This sample ETL script shows you how to use AWS Glue to load, transform, and rewrite data in Amazon S3 so that it can be queried and analyzed easily and efficiently.
Clean and Process

This sample ETL script shows you how to take advantage of both Spark and AWS Glue features to clean and transform data for efficient analysis.
The resolveChoice Method

This sample explores all four of the ways you can resolve choice types in a dataset using DynamicFrame's resolveChoice method.
Converting character encoding

This sample ETL script shows you how to use an AWS Glue job to convert character encoding.
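
As a rough illustration of what converting character encoding means at the byte level, here is a minimal plain-Python sketch. It is not the sample's Glue or Spark code. The function name `convert_encoding` and the choice of Shift_JIS as the source encoding are assumptions made for this example only.

```python
def convert_encoding(data: bytes, source_encoding: str,
                     target_encoding: str = "utf-8") -> bytes:
    """Decode bytes from source_encoding and re-encode them as target_encoding.

    Hypothetical helper for illustration; the Glue sample itself works on
    whole datasets rather than on a single bytes object.
    """
    # Decoding raises UnicodeDecodeError if the bytes are not valid in
    # source_encoding, which surfaces a wrong guess early.
    text = data.decode(source_encoding)
    return text.encode(target_encoding)


if __name__ == "__main__":
    original = "こんにちは"
    sjis_bytes = original.encode("shift_jis")  # assumed legacy input encoding
    utf8_bytes = convert_encoding(sjis_bytes, "shift_jis")
    # The round trip should give back the original text.
    print(utf8_bytes.decode("utf-8") == original)
```

In a real job the same decode-then-encode step would be applied while reading and writing the dataset; see the sample script for how it does this with AWS Glue.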

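For the resolveChoice sample above, the sketch below shows one way to try several resolution strategies on the same column. It assumes the four ways are the `cast`, `project`, `make_cols`, and `make_struct` actions listed in the AWS Glue documentation for `resolveChoice`. The helper names, the column name `provider id`, and the target type `long` are hypothetical. The code is duck-typed: it accepts any object with a `resolveChoice(specs=...)` method, so it does not import `awsglue` itself.

```python
from typing import Any, Dict, List, Tuple

Spec = Tuple[str, str]  # (column path, action) pair as passed to resolveChoice


def choice_specs(column: str, target_type: str = "long") -> Dict[str, List[Spec]]:
    """Build one specs list per resolution action (hypothetical helper)."""
    return {
        "cast": [(column, f"cast:{target_type}")],
        "project": [(column, f"project:{target_type}")],
        "make_cols": [(column, "make_cols")],
        "make_struct": [(column, "make_struct")],
    }


def resolve_all_ways(dyf: Any, column: str,
                     target_type: str = "long") -> Dict[str, Any]:
    """Call dyf.resolveChoice once per action and collect the results.

    dyf is assumed to be a DynamicFrame, or anything else that exposes
    a compatible resolveChoice(specs=...) method.
    """
    return {
        name: dyf.resolveChoice(specs=specs)
        for name, specs in choice_specs(column, target_type).items()
    }
```

In a Glue job, `dyf` would typically be a DynamicFrame read from the Data Catalog, and you could compare the schema of each returned frame to decide which strategy suits your data.
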
Utilities

Hive metastore migration

This utility can help you migrate your Hive metastore to the AWS Glue Data Catalog.
Crawler undo and redo

These scripts can undo or redo the results of a crawl under some circumstances.
Spark UI

You can use this Dockerfile to run the Spark history server in your container. For details, see Launching the Spark History Server and Viewing the Spark UI Using Docker.
Use only IAM access controls

AWS Lake Formation applies its own permission model when you access data in Amazon S3 and metadata in the AWS Glue Data Catalog through services such as Amazon EMR and Amazon Athena. If you currently use Lake Formation and would instead like to use only IAM access controls, this tool helps you make that switch.

GlueCustomConnectors

AWS Glue provides built-in support for the most commonly used data stores, such as Amazon Redshift, MySQL, and MongoDB. With Glue ETL Custom Connectors, you can subscribe to a third-party connector from AWS Marketplace or build your own connector to connect to data stores that are not natively supported.

Development

Development guide with examples of connectors with simple, intermediate, and advanced functionalities. These examples demonstrate how to implement Glue Custom Connectors based on Spark Data Source or Amazon Athena Federated Query interfaces and plug them into Glue Spark runtime.
Local Validation Tests

This user guide describes validation tests that you can run locally on your laptop to integrate your connector with Glue Spark runtime.
Validation

This user guide shows how to validate connectors with Glue Spark runtime in a Glue job system before deploying them for your workloads.
Glue Spark Script Examples

Python script examples that use Spark, Amazon Athena, and JDBC connectors with the Glue Spark runtime.
Create and Publish Glue Connector to AWS Marketplace

If you would like to partner or publish your Glue custom connector to AWS Marketplace, please refer to this guide and reach out to us at [email protected] for further details on your connector.

License Summary

This sample code is made available under the MIT-0 license. See the LICENSE file.

Owner

AWS Samples