Free Data Engineering course!

Overview

Data Engineering Zoomcamp

Syllabus

Taking the course

Self-paced mode

All the materials of the course are freely available, so you can take the course at your own pace

  • Follow the suggested syllabus (see below) week by week
  • You don't need to fill in the registration form. Just start watching the videos and join Slack
  • Check FAQ if you have problems
  • If you can't find a solution to your problem in FAQ, ask for help in Slack

2022 Cohort

Asking for help in Slack

The best way to get support is to use DataTalks.Club's Slack. Join the #course-data-engineering channel.

To make discussions in Slack more organized:

Syllabus

Week 1: Introduction & Prerequisites

  • Course overview
  • Introduction to GCP
  • Docker and docker-compose
  • Running Postgres locally with Docker
  • Setting up infrastructure on GCP with Terraform
  • Preparing the environment for the course
  • Homework

More details

Week 2: Data ingestion

  • Data Lake
  • Workflow orchestration
  • Setting up Airflow locally
  • Ingesting data to GCP with Airflow
  • Ingesting data to local Postgres with Airflow
  • Moving data from AWS to GCP (Transfer service)
  • Homework

More details

Week 3: Data Warehouse

  • Data Warehouse
  • BigQuery
  • Partitoning and clustering
  • BigQuery best practices
  • Internals of BigQuery
  • Integrating BigQuery with Airflow
  • BigQuery Machine Learning

More details

Week 4: Analytics engineering

  • Basics of analytics engineering
  • dbt (data build tool)
  • BigQuery and dbt
  • Postgres and dbt
  • dbt models
  • Testing and documenting
  • Deployment to the cloud and locally
  • Visualising the data with google data studio and metabase

More details

Week 5: Batch processing

  • Batch processing
  • What is Spark
  • Spark Dataframes
  • Spark SQL
  • Internals: GroupBy and joins

More details

Week 6: Streaming

  • Introduction to Kafka
  • Schemas (avro)
  • Kafka Streams
  • Kafka Connect and KSQL

More details

Week 7, 8 & 9: Project

Putting everything we learned to practice

  • Week 7 and 8: working on your own project
  • Week 9: reviewing your peers

More details

Overview

Architecture diagram

Technologies

  • Google Cloud Platform (GCP): Cloud-based auto-scaling platform by Google
    • Google Cloud Storage (GCS): Data Lake
    • BigQuery: Data Warehouse
  • Terraform: Infrastructure-as-Code (IaC)
  • Docker: Containerization
  • SQL: Data Analysis & Exploration
  • Airflow: Pipeline Orchestration
  • dbt: Data Transformation
  • Spark: Distributed Processing
  • Kafka: Streaming

Prerequisites

To get most out of this course, you should feel comfortable with coding and command line, and know the basics of SQL. Prior experience with Python will be helpful, but you can pick Python relatively fast if you have experience with other programming languages.

Prior experience with data engineering is not required.

Instructors

Tools

For this course you'll need to have the following software installed on your computer:

  • Docker and Docker-Compose
  • Python 3 (e.g. via Anaconda)
  • Google Cloud SDK
  • Terraform

See Week 1 for more details about installing these tools

FAQ

  • Q: I registered, but haven't received a confirmation email. Is it normal? A: Yes, it's normal. It's not automated. But you will receive an email eventually
  • Q: At what time of the day will it happen? A: Office hours will happen on Mondays at 17:00 CET. But everything will be recorded, so you can watch it whenever it's convenient for you
  • Q: Will there be a certificate? A: Yes, if you complete the project
  • Q: I'm 100% not sure I'll be able to attend. Can I still sign up? A: Yes, please do! You'll receive all the updates and then you can watch the course at your own pace.
  • Q: Do you plan to run a ML engineering course as well? A: Glad you asked. We do :)
  • Q: I'm stuck! I've got a technical question! A: Ask on Slack! And check out the student FAQ; many common issues have been answered already. If your issue is solved, please add how you solved it to the document. Thanks!

Our friends

Big thanks to other communities for helping us spread the word about the course:

Check them out - they are cool!

Owner
DataTalksClub
The place to talk about data
DataTalksClub
Flask-built web application that simulates a time and cost calculator for charging Electric Vehicles.

ev_charging_calculator Flask-built web application that simulates a time and cost calculator for charging Electric Vehicles. The project aims to simul

1 Nov 03, 2021
Singularity Containers on Apple M1 (ARM64)

Singularity Containers on Apple M1 (ARM64) This is a repository containing a ready-to-use environment for singularity in arm64 (M1). It has been prepa

Manuel Parra 4 Nov 14, 2022
This is a backport of the BaseExceptionGroup and ExceptionGroup classes from Python 3.11.

This is a backport of the BaseExceptionGroup and ExceptionGroup classes from Python 3.11. It contains the following: The exceptiongroup.BaseExceptionG

Alex Grönholm 19 Dec 15, 2022
Use a real time weather API to apply wind to your mouse cursor.

wind-cursor Use a real time weather API to apply wind to your mouse cursor. Requirements PyAutoGUI pyowm Usage This program uses the OpenWeatherMap AP

Andreas Schmid 1 Feb 07, 2022
A webapp that timestamps key moments in a football clip

A look into what we're building Demo.mp4 Prerequisites Python 3 Node v16+ Steps to run Create a virtual environment. Activate the virtual environment.

Pranav 1 Dec 10, 2021
Repository specifically for tcss503-22-wi Students

TCSS503: Algorithms and Problem Solving for Software Developers Course Description Introduces advanced data structures and key algorithmic techniques

Kevin E. Anderson 3 Nov 08, 2022
Biohacking con Python honeycon21

biohacking-honeycon21 This repository includes the slides of the public presentation 'Biohacking con Python' in the Hack&Beers of HoneyCON21 (PPTX and

3 Nov 13, 2021
SpaCy3Urdu: run command to setup assets(dataset from UD)

Project setup run command to setup assets(dataset from UD) spacy project assets It uses project.yml file and download the data from UD GitHub reposito

Muhammad Irfan 1 Dec 14, 2021
Grouping nucleotide coordinate ranges.

NuclRanger Grouping nucleotide coordinate ranges. A quick pre-processing step for "bedtools getfasta":- https://bedtools.readthedocs.io/en/latest/cont

Sujanavan Tiruvayipati 1 Oct 04, 2022
Checks for Vaccine Availability at your district and notifies you using E-mail, subscribe to our website.

Vaccine Availability Notifier Project Description Checks for Vaccine Availability at your district and notifies you using E-mail every 10 mins. Kindly

Farhan Hai Khan 19 Jun 03, 2021
An open letter in support of Richard Matthew Stallman being reinstated by the Free Software Foundation

An open letter in support of RMS. To sign, click here and name the file username.yaml (replace username with your name) with the following content

2.4k Jan 07, 2023
Simple package to make requests throughout Tor with circuit renewal.

AutoTor Table of Contents About the Project Contents Dependencies Getting Started Installation Coding Contributing About the Project Simple package to

Salvador Belenguer 6 Jan 01, 2023
Online HackerRank problem solving challenges

LinkedListHackerRank Online HackerRank problem solving challenges This challenge is part of a tutorial track by MyCodeSchool You are given the pointer

Sefineh Tesfa 1 Nov 21, 2021
Proyecto desarrollado para el programa #FutureDevelopers, tabla periódica interactiva.

Tabla_Periodica Proyecto desarrollado para el programa #FutureDevelopers, tabla periódica interactiva. Descripcion primer entregable: Tabla periodica

1 Dec 04, 2021
gwcheck is a tool to check .gnu.warning.* sections in ELF object files and display their content.

gwcheck Description gwcheck is a tool to check .gnu.warning.* sections in ELF object files and display their content. For an introduction to .gnu.warn

Frederic Cambus 11 Oct 28, 2022
Simple programming language built on Python.

Serial Another programming language. Built on Python. Building and running program In order to run the program on serial, unfortunately you still need

Aleksey Demchenkov 1 Dec 09, 2021
An easy FASTA object handler, reader, writer and translator for small to medium size projects without dependencies.

miniFASTA An easy FASTA object handler, reader, writer and translator for small to medium size projects without dependencies. Installation Using pip /

Jules Kreuer 3 Jun 30, 2022
Parametric Bottle in CADQuery

Parametric Bottle using CADQuery The proposed code makes it possible to generate different types and sizes of 3D bottles in order to train Pixel2mesh

Ayoub EL HOUDRI 1 May 22, 2022
A calculator developed in Python.

Calculadora Uma simples calculadora... ( + − × ÷ ) 💻 Situação do projeto: Projeto finalizado ✔️ 🛠 Tecnologias: Python Tkinter (GUI) ⚙️ Pré-requisito

Arthur V.B.S. 1 Jan 27, 2022
Rick Astley Language is a rick roll oriented, dynamic, strong, esoteric programming language.

Rick Roll Language / Rick Astley Language A rick roll oriented, dynamic, strong, esoteric programming language. Prolegomenon The reasons that I made t

Rick Roll Programming Language 658 Jan 09, 2023