Reading streams of Twitter data, saving them to Kafka, then processing them with the Kafka Streams API and Spark Streaming

Overview

Using Streaming Twitter Data with Kafka and Spark

Read streams of Twitter data, publish them to a Kafka topic, and process the messages using the Kafka Streams API and Spark Streaming.

Twitter is blocked in some countries. If that applies to you, make sure a VPN is switched on so that you can reach Twitter.

You also need your own consumer_key, consumer_secret, and access_token with its secret, stored in the config.py file.
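A config.py along the following lines could hold these credentials. This is only a sketch: the name access_token_secret and the TWITTER_* environment variables are assumptions made for illustration, not names taken from the repository, and the fallback strings are placeholders to replace with your own values.

```python
# config.py - a minimal sketch; access_token_secret and the TWITTER_*
# environment variable names are assumptions, not taken from the repository.
import os

# Reading from environment variables keeps real secrets out of version control.
# The fallback strings are placeholders and will not authenticate.
consumer_key = os.environ.get("TWITTER_CONSUMER_KEY", "your-consumer-key")
consumer_secret = os.environ.get("TWITTER_CONSUMER_SECRET", "your-consumer-secret")
access_token = os.environ.get("TWITTER_ACCESS_TOKEN", "your-access-token")
access_token_secret = os.environ.get(
    "TWITTER_ACCESS_TOKEN_SECRET", "your-access-token-secret"
)
```

Plain string assignments work just as well if you prefer to keep the file local and untracked.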

  • Create environment using conda with Python 3.8:
    • conda create -n python38 python=3.8
    • conda activate python38
    • Check the requirements in requirements.txt and install them using conda:
      • conda install -c conda-forge tweepy==4.4.0
      • conda install -c conda-forge kafka-python==2.0.2
  • Kafka should be installed on your machine; check the documentation for installation. If you use brew on a Mac, you can run brew install kafka
  • Start ZooKeeper: zookeeper-server-start /usr/local/etc/kafka/zookeeper.properties, port: 2181
  • In another terminal window, start the broker: kafka-server-start /usr/local/etc/kafka/server.properties, port: 9092
  • In a terminal window, list the topics you have: kafka-topics --list --bootstrap-server localhost:9092
  • Create the Kafka topic "tweeter" with 1 partition and no replication, because we are using a local machine: kafka-topics --create --topic tweeter --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
  • Now list the topics again: kafka-topics --list --bootstrap-server localhost:9092
  • Check what is inside the "tweeter" topic: kafka-console-consumer --bootstrap-server localhost:9092 --topic tweeter --from-beginning. It is empty for now, but messages will appear once we start streaming.
  • Now run python kafka_producer.py to start streaming from Twitter and push the messages to the topic.
  • Finally, check that the data is inside the topic with kafka-console-consumer --bootstrap-server localhost:9092 --topic tweeter --from-beginning
  • Congrats! You have done it!
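
For orientation, a producer script for the steps above might look roughly like the sketch below. It is not the repository's actual kafka_producer.py: the helper build_message, the class KafkaStream, the function run, and the choice to keep only the tweet id and text are all assumptions for illustration. It relies on tweepy.Stream from tweepy 4.x and KafkaProducer from kafka-python, and expects a config.py with the four credentials (access_token_secret is an assumed name).

```python
# Illustrative producer sketch; names below are hypothetical, not the repo's code.
import json

TOPIC = "tweeter"                      # topic created in the steps above
BOOTSTRAP_SERVERS = "localhost:9092"   # broker address used in the steps above


def build_message(tweet_id, text):
    """Serialize the chosen tweet fields to the bytes that are sent to Kafka."""
    # Keeping only id and text is an arbitrary choice for this sketch.
    return json.dumps({"id": tweet_id, "text": text}).encode("utf-8")


def run(keywords):
    """Stream tweets matching `keywords` and publish each one to TOPIC."""
    # Imported lazily so build_message can be used without these packages.
    import tweepy
    from kafka import KafkaProducer
    import config  # your own credentials file

    producer = KafkaProducer(bootstrap_servers=BOOTSTRAP_SERVERS)

    class KafkaStream(tweepy.Stream):
        def on_status(self, status):
            # Called by tweepy for every tweet received on the stream.
            producer.send(TOPIC, build_message(status.id_str, status.text))

        def on_request_error(self, status_code):
            print("Twitter returned HTTP status", status_code)

    stream = KafkaStream(
        config.consumer_key,
        config.consumer_secret,
        config.access_token,
        config.access_token_secret,
    )
    stream.filter(track=keywords)  # blocks until interrupted
```

Calling run(["kafka"]) would start the stream for one example keyword; the console consumer from the last step should then begin printing JSON messages.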

So what's next?

You can now process the generated data with Kafka Streams and Spark Streaming, and practice more!
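
As one possible starting point on the Spark side, the sketch below reads the "tweeter" topic with Spark Structured Streaming and prints each message to the console. It is an assumption-laden example rather than part of this project: the function names kafka_source_options and run and the application name are hypothetical, pyspark must be installed, and the Spark Kafka connector (spark-sql-kafka-0-10) has to be available to Spark.

```python
# Illustrative Spark Structured Streaming sketch; function names are hypothetical.


def kafka_source_options(topic="tweeter", servers="localhost:9092"):
    """Options for Spark's Kafka source, matching the local setup above."""
    return {
        "kafka.bootstrap.servers": servers,
        "subscribe": topic,
        "startingOffsets": "earliest",  # read the topic from the beginning
    }


def run():
    """Print every message in the topic to the console as it arrives."""
    # Imported lazily so the options helper works without pyspark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("tweeter-console").getOrCreate()

    raw = spark.readStream.format("kafka").options(**kafka_source_options()).load()

    # Kafka values arrive as bytes; cast them to strings before printing.
    messages = raw.selectExpr("CAST(value AS STRING) AS message")

    query = messages.writeStream.format("console").outputMode("append").start()
    query.awaitTermination()  # blocks until the query is stopped
```

From here, the message column could be parsed as JSON and aggregated, for example to count tweets per time window.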

Owner
Rustam Zokirov
15x Engineer • Data Engineer