Scrapping malaysianpaygap & Extracting data from the Instagram posts

Overview

Scrapping malaysianpaygap & Extracting data from the posts

Recently @malaysianpaygap has gotten quite famous as a platform that enables workers throughout Malaysia to anonymously share their salaries amongst other Malaysians. Its a great initiative and I am fully supportive behind ensuring that Malaysians are not taken advantage of by companies and get a liveable wage(especially when inflation is sky high).

NOTE: If you just want the data then you can download the zipped folder from here.

How to run

  1. Run the following to get conda environment setup
  conda create --name pay python=3.7
  conda activate pay
  pip install -r requirements.txt
  1. Next we will need to scrape all the data from Instagram manually using BeautifulSoup! Just kidding I am too lazy so I will be using InstaLoader to do all the heavy lifting for me. The conda environment will have it installed for you already.
# you might need to pass in your username to login
instaloader --login=USERNAME profile malaysianpaygap --dirname-pattern={profile} --comments --no-profile-pic --post-metadata-txt="Caption: {caption}\n{likes} likes\n{comments} comments\n" --filename-pattern={date_utc:%Y}/{shortcode}

This should create the following directory structure:

|-- malaysianpaygap
|   |-- 2022
|   |   |-- CaRp-1uPh8l.jpg                    # image
|   |   |-- CaRp-1uPh8l.json.xz
|   |   |-- CaRp-1uPh8l.txt                    # text data which was specified under --post-metadata-txt
|   |   |-- CaRp-1uPh8l_comments.json          # all the comments
|   |   |-- CaT5MguPpDI.jpg
|   |   |-- CaT5MguPpDI.json.xz
|   |-- 2022-02-27_04-58-58_UTC_profile_pic.jpg
|   |-- id
|   `-- malaysianpaygap_47523401972.json.xz
|-- requirements.txt
|-- scripts
|   `-- entrypoint.sh
`-- src
    |-- __init__.py
    |-- extract_text_images.py
    |-- main.py
    |-- preprocess_comments.py
    `-- preprocess_images.py

NOTE: Please do NOT change the directory structure, it will break the entire pipeline.

  1. You should have everything ready to run the preprocessing scripts that I have made! I have a bash script that runs everything in the correct order.
# make bash script runnable
chmod +x scripts/entrypoint.sh
bash scripts/entrypoint.sh

You should see the following output:

2022-03-02 22:59:54.012 | INFO     | src.preprocess_comments:main_preprocess_comments:83 - Running preprocess_comments
2022-03-02 22:59:56.276 | INFO     | src.preprocess_comments:main_preprocess_comments:110 - DataFrame saved to /Users/yravindranath/pay/data/comments.csv
2022-03-02 22:59:56.277 | INFO     | src.preprocess_comments:main_preprocess_comments:111 - Completed preprocess_comments
2022-03-02 22:59:57.537 | INFO     | src.preprocess_images:main_preprocess_images:140 - Running preprocess_images
2022-03-02 22:59:57.840 | INFO     | src.preprocess_images:main_preprocess_images:160 - DataFrame saved to /Users/yravindranath/pay/data/posts.csv
2022-03-02 22:59:57.841 | INFO     | src.preprocess_images:main_preprocess_images:161 - Completed preprocess_images
2022-03-02 22:59:59.099 | INFO     | src.extract_text_images:main_extract_text_images:54 - Running extract_text_images
Pandas Apply: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 159/159 [02:09<00:00,  1.23it/s]
2022-03-02 23:02:25.087 | INFO     | src.extract_text_images:main_extract_text_images:70 - DataFrame saved to /Users/yravindranath/pay/data/posts_text.csv
2022-03-02 23:02:25.088 | INFO     | src.extract_text_images:main_extract_text_images:71 - Completed extract_text_images

A new directory data will be created like so:

|-- data
|   |-- comments.csv
|   |-- comments.json
|   |-- posts.csv
|   |-- posts_text.csv
|   `-- processed_images
|       |-- CaRp-1uPh8l.jpg
|       |-- CaT5MguPpDI.jpg
|       |-- CaT6d2Yve5X.jpg

In the next section I will go over the data that was created.

Data

comments.csv - Contains all the comments under a post

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2816 entries, 0 to 2815
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   image_ids        2816 non-null   object
 1   comment_paths    2816 non-null   object
 2   id               2814 non-null   float64
 3   created_at       2814 non-null   float64
 4   text             2814 non-null   object
 5   likes_count      2814 non-null   float64
 6   answers          2814 non-null   object
 7   id.1             2814 non-null   float64 # ID of the user who commented
 8   is_verified      2814 non-null   object
 9   profile_pic_url  2814 non-null   object
 10  username         2814 non-null   object
dtypes: float64(4), object(7)
memory usage: 242.1+ KB

posts_text.csv - Contains all the posts with their text extracted through their image using OCR(Optical Character Recognition)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159 entries, 0 to 158
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   hashtags     159 non-null    object
 1   captions     139 non-null    object
 2   likes        159 non-null    int64
 3   comments     159 non-null    int64
 4   image_ids    159 non-null    object
 5   image_paths  159 non-null    object
 6   image_text   159 non-null    object
dtypes: int64(2), object(5)
memory usage: 8.8+ KB

FAQ

I am getting a ModuleNotFoundError: No module named 'src' error what can I do?

This is an issue with your PYTHONPATH, setting it to something like export PYTHONPATH="${PYTHONPATH}:/Users/yravindranath/REPO" should fix it.

Optimizations

  1. So currently the entire project isn't repoducible therefore I will dockerise it soon and allow anyone to run it locally without any issues.
  2. If you notice there is a slow apply() used for binarizing the images and extracting the text from it using OCR. I am using swifter to speed it up as it is.
Owner
Yudhiesh Ravindranath
Data Scientist @MoneyLion
Yudhiesh Ravindranath
Wordy is a Wordle-like Discord bot but with a twist.

Wordy Discord Bot Wordy is a Wordle-like Discord bot but with a twist. It already supports 6 languages from the beginning: English, Italian, French, G

The Coding Channel 2 Sep 06, 2022
Wrapper around the UPS API for creating shipping labels and fetching a package's tracking status.

ClassicUPS: A Useful UPS Library ClassicUPS is an Apache2 Licensed wrapper around the UPS API for creating shipping labels and fetching a package's tr

Jay Goel 55 Dec 12, 2022
Python API Client for Twitter API v2

🐍 Python Client For Twitter API v2 🚀 Why Twitter Stream ? Twitter-Stream.py a python API client for Twitter API v2 now supports FilteredStream, Samp

Twitivity 31 Nov 19, 2022
Kang Sticker bot

Kang Sticker Bot A simple Telegram bot which creates sticker packs from other stickers, images, documents and URLs. Based on kangbot Deploy Credits: s

Hafitz Setya 11 Jan 02, 2023
Automatically pick a winner who Retweeted, Commented, and Followed your Twitter account!

AutomaticTwitterGiveaways automates selecting winners for "Retweet, Comment, Follow" type Twitter giveaways.

1 Jan 13, 2022
A simple versatile telgeram bot written in Python using pyTelegramBotAPI library.

A simple versatile telgeram bot written in Python using pyTelegramBotAPI library.

Benyamin Zojaji 15 Jun 17, 2022
A clean, easy to scale discord bot template

A clean, easy to scale discord bot template. Develope using nextcord library and can be use with any other discord.py forked library.

めがねこ 3 Mar 03, 2022
PancakeTrade - Limit orders and more for PancakeSwap on Binance Smart Chain

PancakeTrade helps you create limit orders and more for your BEP-20 tokens that swap against BNB on PancakeSwap. The bot is controlled by Telegram so you can interact from anywhere.

Valentin Bersier 187 Dec 20, 2022
Disqus API bindings for Python

disqus-python Let's start with installing the API: pip install disqus-python Use the API by instantiating it, and then calling the method through dott

DISQUS 163 Oct 14, 2022
A telegram bot writen in python for mirroring files on the internet to our beloved Google Drive

[] Mirror Bot This is a telegram bot writen in python for mirroring files on the internet to our beloved Google Drive. Deploying on Heroku Give Star &

43 Mar 06, 2022
Use Node JS Keywords In Python!!!

Use Node JS Keywords In Python!!!

Sancho Godinho 1 Oct 23, 2021
A python package that allows you to place automated trades using the TD Ameritrade API.

Template Repo Table of Contents Overview Setup Usage Support These Projects Overview Setup Setup - Requirements Install:* For this particular project,

Alex Reed 4 Jan 25, 2022
Oussama has taken his first dose of vaccine D days ago

Oussama has taken his first dose of vaccine D days ago. He may take the second dose no less than L days and no more than R days since his first dose. Determine if Oussama is too early, too late, or i

INDIA - ENSAM Rabat 2 Feb 01, 2022
OGE-2022-na-Python - Solving problems in python for the OGE 2022

OGE-2022-na-Python Решение задачек на питоне для ОГЭ 2022 Тут разобраны разные в

Slava 0 Oct 14, 2022
Python Library to Extract youtube video Tags without Youtube API

YoutubeTags Python Library to Extract youtube video Tags without Youtube API Installation pip install YoutubeTags Example import YoutubeTags from Yout

Nuhman Pk 17 Nov 12, 2022
A python script that changes our background based on current weather and time of the day.

Desktop background on Windows 10, based on current weather and time A python script that changes our background based on current weather and time of t

Maj Gaberšček 1 Nov 16, 2021
Telegram group manager moderen and simple.

Upin Robot A Advanced Powerful, Smart And Intelligent Group Management Bot With New And Powerful Features ... Written with Pyrogram and Telethon... If

Muhammad Nawawi 3 Dec 23, 2021
🥀 Find the start of the token !

Discord Token Finder Find half of your target's token with just their ID. Install 🔧 pip install -r requeriments.txt Gui Usage 💻 Go to Discord Setti

zeytroxxx 2 Dec 22, 2021
Insane Weather Bot is here! Give suggestions, fork, and do much more to help us enhance the abilities of Insane Weather Bot.

Insane_Weather_Bot Insane Weather Bot is here! Give suggestions, fork, and do much more to help us enhance the abilities of Insane Weather Bot. Weathe

1 Jan 02, 2022
Support for Competitive Coding badges to add in Github readme or portfolio websites.

Support for Competitive Coding badges to add in Github readme or portfolio websites.

Akshat Aggarwal 2 Feb 14, 2022