Scraping malaysianpaygap & Extracting data from the Instagram posts

Last update: Nov 09, 2022

Overview

Scraping malaysianpaygap & Extracting data from the posts

Recently, @malaysianpaygap has become well known as a platform that lets workers throughout Malaysia anonymously share their salaries with other Malaysians. It's a great initiative, and I fully support ensuring that Malaysians are not taken advantage of by companies and earn a liveable wage (especially when inflation is sky high).

NOTE: If you just want the data then you can download the zipped folder from here.

How to run

Run the following to set up the conda environment:

  conda create --name pay python=3.7
  conda activate pay
  pip install -r requirements.txt

Next we will need to scrape all the data from Instagram manually using BeautifulSoup! Just kidding, I am too lazy, so I will be using Instaloader to do all the heavy lifting for me. It is already installed in the conda environment.

# you might need to pass in your username to log in
instaloader --login=USERNAME profile malaysianpaygap --dirname-pattern={profile} --comments --no-profile-pic --post-metadata-txt="Caption: {caption}\n{likes} likes\n{comments} comments\n" --filename-pattern={date_utc:%Y}/{shortcode}

This should create the following directory structure:

|-- malaysianpaygap
|   |-- 2022
|   |   |-- CaRp-1uPh8l.jpg                    # image
|   |   |-- CaRp-1uPh8l.json.xz
|   |   |-- CaRp-1uPh8l.txt                    # text data which was specified under --post-metadata-txt
|   |   |-- CaRp-1uPh8l_comments.json          # all the comments
|   |   |-- CaT5MguPpDI.jpg
|   |   |-- CaT5MguPpDI.json.xz
|   |-- 2022-02-27_04-58-58_UTC_profile_pic.jpg
|   |-- id
|   `-- malaysianpaygap_47523401972.json.xz
|-- requirements.txt
|-- scripts
|   `-- entrypoint.sh
`-- src
    |-- __init__.py
    |-- extract_text_images.py
    |-- main.py
    |-- preprocess_comments.py
    `-- preprocess_images.py

NOTE: Please do NOT change the directory structure, it will break the entire pipeline.
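To make the layout above a bit more concrete, here is a minimal sketch of how the downloaded files could be grouped by post shortcode. It is not part of the repo: group_by_shortcode is a hypothetical helper, and it assumes one image per post, as in the listing above.

```python
from collections import defaultdict
from pathlib import Path


def group_by_shortcode(profile_dir):
    """Map each post shortcode to the sorted list of file names downloaded for it."""
    groups = defaultdict(list)
    # Posts live one level down, in per-year folders such as malaysianpaygap/2022/.
    for path in sorted(Path(profile_dir).glob("*/*")):
        if not path.is_file():
            continue
        # "CaRp-1uPh8l.json.xz" -> "CaRp-1uPh8l"; shortcodes never contain a dot.
        stem = path.name.split(".", 1)[0]
        # "CaRp-1uPh8l_comments" -> "CaRp-1uPh8l"
        if stem.endswith("_comments"):
            stem = stem[: -len("_comments")]
        groups[stem].append(path.name)
    return dict(groups)


if __name__ == "__main__":
    # Assumes the command is run from the directory that contains malaysianpaygap/.
    for shortcode, files in group_by_shortcode("malaysianpaygap").items():
        print(shortcode, files)
```

A post that is missing its .jpg or its _comments.json shows up here as a shorter file list, which is a quick way to spot incomplete downloads before running the pipeline.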

You should now have everything ready to run the preprocessing scripts that I have made! I have a bash script that runs everything in the correct order.

# make bash script runnable
chmod +x scripts/entrypoint.sh
bash scripts/entrypoint.sh

You should see the following output:

2022-03-02 22:59:54.012 | INFO     | src.preprocess_comments:main_preprocess_comments:83 - Running preprocess_comments
2022-03-02 22:59:56.276 | INFO     | src.preprocess_comments:main_preprocess_comments:110 - DataFrame saved to /Users/yravindranath/pay/data/comments.csv
2022-03-02 22:59:56.277 | INFO     | src.preprocess_comments:main_preprocess_comments:111 - Completed preprocess_comments
2022-03-02 22:59:57.537 | INFO     | src.preprocess_images:main_preprocess_images:140 - Running preprocess_images
2022-03-02 22:59:57.840 | INFO     | src.preprocess_images:main_preprocess_images:160 - DataFrame saved to /Users/yravindranath/pay/data/posts.csv
2022-03-02 22:59:57.841 | INFO     | src.preprocess_images:main_preprocess_images:161 - Completed preprocess_images
2022-03-02 22:59:59.099 | INFO     | src.extract_text_images:main_extract_text_images:54 - Running extract_text_images
Pandas Apply: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 159/159 [02:09<00:00,  1.23it/s]
2022-03-02 23:02:25.087 | INFO     | src.extract_text_images:main_extract_text_images:70 - DataFrame saved to /Users/yravindranath/pay/data/posts_text.csv
2022-03-02 23:02:25.088 | INFO     | src.extract_text_images:main_extract_text_images:71 - Completed extract_text_images

A new directory, data, will be created like so:

|-- data
|   |-- comments.csv
|   |-- comments.json
|   |-- posts.csv
|   |-- posts_text.csv
|   `-- processed_images
|       |-- CaRp-1uPh8l.jpg
|       |-- CaT5MguPpDI.jpg
|       |-- CaT6d2Yve5X.jpg

In the next section I will go over the data that was created.

Data

comments.csv - Contains all the comments under a post

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2816 entries, 0 to 2815
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   image_ids        2816 non-null   object
 1   comment_paths    2816 non-null   object
 2   id               2814 non-null   float64
 3   created_at       2814 non-null   float64
 4   text             2814 non-null   object
 5   likes_count      2814 non-null   float64
 6   answers          2814 non-null   object
 7   id.1             2814 non-null   float64 # ID of the user who commented
 8   is_verified      2814 non-null   object
 9   profile_pic_url  2814 non-null   object
 10  username         2814 non-null   object
dtypes: float64(4), object(7)
memory usage: 242.1+ KB
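As a rough illustration of working with this file, the sketch below reads comments.csv with the standard library and counts comments per post. load_comments and comments_per_post are hypothetical helpers, not part of the repo, and the code assumes created_at is a Unix timestamp in seconds, which the float64 dtype suggests but the summary above does not confirm.

```python
import csv
from collections import Counter
from datetime import datetime, timezone


def load_comments(csv_path):
    """Read comments.csv, skipping rows that carry no comment data."""
    rows = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # Two rows have image_ids but no comment fields (2816 vs 2814 non-null).
            if not row.get("created_at"):
                continue
            # Assumption: created_at is a Unix timestamp in seconds.
            row["created_at"] = datetime.fromtimestamp(
                float(row["created_at"]), tz=timezone.utc
            )
            rows.append(row)
    return rows


def comments_per_post(rows):
    """Count how many comments each post (image_ids) received."""
    return Counter(row["image_ids"] for row in rows)
```

Calling comments_per_post(load_comments("data/comments.csv")).most_common(5) would then list the five most commented posts.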

posts_text.csv - Contains all the posts with the text extracted from their images using OCR (Optical Character Recognition)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159 entries, 0 to 158
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   hashtags     159 non-null    object
 1   captions     139 non-null    object
 2   likes        159 non-null    int64
 3   comments     159 non-null    int64
 4   image_ids    159 non-null    object
 5   image_paths  159 non-null    object
 6   image_text   159 non-null    object
dtypes: int64(2), object(5)
memory usage: 8.8+ KB
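One simple thing to do with the OCR output is a keyword search. The sketch below is only an example of reading the image_text column; find_posts is a hypothetical helper, not something the repo provides.

```python
import csv


def find_posts(csv_path, keyword):
    """Return image_ids of posts whose OCR text contains keyword (case-insensitive)."""
    needle = keyword.lower()
    matches = []
    # newline="" lets the csv module handle OCR text that spans several lines.
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            if needle in (row.get("image_text") or "").lower():
                matches.append(row["image_ids"])
    return matches
```

For example, find_posts("data/posts_text.csv", "engineer") would return the IDs of posts whose image text mentions that word. OCR output is noisy, so exact matching like this will miss misread words.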

FAQ

I am getting a `ModuleNotFoundError: No module named 'src'` error, what can I do?

This is an issue with your PYTHONPATH. Setting it with something like export PYTHONPATH="${PYTHONPATH}:/Users/yravindranath/REPO" should fix it (replace the path with the location of the repo on your machine).
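If changing the shell environment is not an option (inside a notebook, for example), a rough Python-side equivalent is to put the repo root on sys.path before importing anything from src. add_repo_root is a hypothetical helper, not part of the repo.

```python
import sys
from pathlib import Path


def add_repo_root(repo_root):
    """Prepend the repository root to sys.path so that `import src...` resolves."""
    root = str(Path(repo_root).resolve())
    if root not in sys.path:
        sys.path.insert(0, root)
    return root
```

Call it with the folder that contains src/, for example add_repo_root("."), before the first import from src.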

Optimizations

Currently the entire project isn't reproducible, so I will dockerise it soon to allow anyone to run it locally without any issues.
If you look closely, there is a slow apply() used for binarizing the images and extracting the text from them using OCR. I am already using swifter to speed it up.
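swifter is what the project uses. Purely as an illustration of another way to spread a per-image function over several workers, here is a minimal sketch using the standard library's concurrent.futures instead. parallel_map and fake_ocr are hypothetical names, and the real OCR call is not shown in this README. Threads only help if the OCR call releases the GIL or shells out to an external program, which is an assumption about the OCR library.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_map(func, items, max_workers=4):
    """Apply func to every item using a thread pool, keeping the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(func, items))


def fake_ocr(image_path):
    # Hypothetical stand-in for the real per-image OCR call.
    return f"text from {image_path}"


if __name__ == "__main__":
    paths = ["CaRp-1uPh8l.jpg", "CaT5MguPpDI.jpg"]
    print(parallel_map(fake_ocr, paths))
```

Because pool.map returns results in input order, the output list can be assigned straight back to a DataFrame column.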

Owner

Yudhiesh Ravindranath
