Scrapping malaysianpaygap & Extracting data from the Instagram posts

Overview

Scrapping malaysianpaygap & Extracting data from the posts

Recently @malaysianpaygap has gotten quite famous as a platform that enables workers throughout Malaysia to anonymously share their salaries amongst other Malaysians. Its a great initiative and I am fully supportive behind ensuring that Malaysians are not taken advantage of by companies and get a liveable wage(especially when inflation is sky high).

NOTE: If you just want the data then you can download the zipped folder from here.

How to run

  1. Run the following to get conda environment setup
  conda create --name pay python=3.7
  conda activate pay
  pip install -r requirements.txt
  1. Next we will need to scrape all the data from Instagram manually using BeautifulSoup! Just kidding I am too lazy so I will be using InstaLoader to do all the heavy lifting for me. The conda environment will have it installed for you already.
# you might need to pass in your username to login
instaloader --login=USERNAME profile malaysianpaygap --dirname-pattern={profile} --comments --no-profile-pic --post-metadata-txt="Caption: {caption}\n{likes} likes\n{comments} comments\n" --filename-pattern={date_utc:%Y}/{shortcode}

This should create the following directory structure:

|-- malaysianpaygap
|   |-- 2022
|   |   |-- CaRp-1uPh8l.jpg                    # image
|   |   |-- CaRp-1uPh8l.json.xz
|   |   |-- CaRp-1uPh8l.txt                    # text data which was specified under --post-metadata-txt
|   |   |-- CaRp-1uPh8l_comments.json          # all the comments
|   |   |-- CaT5MguPpDI.jpg
|   |   |-- CaT5MguPpDI.json.xz
|   |-- 2022-02-27_04-58-58_UTC_profile_pic.jpg
|   |-- id
|   `-- malaysianpaygap_47523401972.json.xz
|-- requirements.txt
|-- scripts
|   `-- entrypoint.sh
`-- src
    |-- __init__.py
    |-- extract_text_images.py
    |-- main.py
    |-- preprocess_comments.py
    `-- preprocess_images.py

NOTE: Please do NOT change the directory structure, it will break the entire pipeline.

  1. You should have everything ready to run the preprocessing scripts that I have made! I have a bash script that runs everything in the correct order.
# make bash script runnable
chmod +x scripts/entrypoint.sh
bash scripts/entrypoint.sh

You should see the following output:

2022-03-02 22:59:54.012 | INFO     | src.preprocess_comments:main_preprocess_comments:83 - Running preprocess_comments
2022-03-02 22:59:56.276 | INFO     | src.preprocess_comments:main_preprocess_comments:110 - DataFrame saved to /Users/yravindranath/pay/data/comments.csv
2022-03-02 22:59:56.277 | INFO     | src.preprocess_comments:main_preprocess_comments:111 - Completed preprocess_comments
2022-03-02 22:59:57.537 | INFO     | src.preprocess_images:main_preprocess_images:140 - Running preprocess_images
2022-03-02 22:59:57.840 | INFO     | src.preprocess_images:main_preprocess_images:160 - DataFrame saved to /Users/yravindranath/pay/data/posts.csv
2022-03-02 22:59:57.841 | INFO     | src.preprocess_images:main_preprocess_images:161 - Completed preprocess_images
2022-03-02 22:59:59.099 | INFO     | src.extract_text_images:main_extract_text_images:54 - Running extract_text_images
Pandas Apply: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 159/159 [02:09<00:00,  1.23it/s]
2022-03-02 23:02:25.087 | INFO     | src.extract_text_images:main_extract_text_images:70 - DataFrame saved to /Users/yravindranath/pay/data/posts_text.csv
2022-03-02 23:02:25.088 | INFO     | src.extract_text_images:main_extract_text_images:71 - Completed extract_text_images

A new directory data will be created like so:

|-- data
|   |-- comments.csv
|   |-- comments.json
|   |-- posts.csv
|   |-- posts_text.csv
|   `-- processed_images
|       |-- CaRp-1uPh8l.jpg
|       |-- CaT5MguPpDI.jpg
|       |-- CaT6d2Yve5X.jpg

In the next section I will go over the data that was created.

Data

comments.csv - Contains all the comments under a post

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2816 entries, 0 to 2815
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   image_ids        2816 non-null   object
 1   comment_paths    2816 non-null   object
 2   id               2814 non-null   float64
 3   created_at       2814 non-null   float64
 4   text             2814 non-null   object
 5   likes_count      2814 non-null   float64
 6   answers          2814 non-null   object
 7   id.1             2814 non-null   float64 # ID of the user who commented
 8   is_verified      2814 non-null   object
 9   profile_pic_url  2814 non-null   object
 10  username         2814 non-null   object
dtypes: float64(4), object(7)
memory usage: 242.1+ KB

posts_text.csv - Contains all the posts with their text extracted through their image using OCR(Optical Character Recognition)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159 entries, 0 to 158
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   hashtags     159 non-null    object
 1   captions     139 non-null    object
 2   likes        159 non-null    int64
 3   comments     159 non-null    int64
 4   image_ids    159 non-null    object
 5   image_paths  159 non-null    object
 6   image_text   159 non-null    object
dtypes: int64(2), object(5)
memory usage: 8.8+ KB

FAQ

I am getting a ModuleNotFoundError: No module named 'src' error what can I do?

This is an issue with your PYTHONPATH, setting it to something like export PYTHONPATH="${PYTHONPATH}:/Users/yravindranath/REPO" should fix it.

Optimizations

  1. So currently the entire project isn't repoducible therefore I will dockerise it soon and allow anyone to run it locally without any issues.
  2. If you notice there is a slow apply() used for binarizing the images and extracting the text from it using OCR. I am using swifter to speed it up as it is.
Owner
Yudhiesh Ravindranath
Data Scientist @MoneyLion
Yudhiesh Ravindranath
News API consisting various sources from Tanzania

Tanzania News API News API consisting various sources from Tanzania. Fork the project Clone the project git clone https://github.com/username/news-a

Innocent Zenda 6 Oct 06, 2022
A Telegram bot that can stream Telegram files to users over HTTP.

T.ME_FILE_TO_LINK Hi iam a file to link bot....best Bot telegram Telegram File To Link Generation Bot A Telegram bot that can stream Telegram files to

1 Oct 24, 2021
🐲 Powerfull Discord Token Stealer made in python

🐲 Follow me here 🐲 Discord | YouTube | Github ☕ Usage 💻 Downloading git clone https://github.com/KanekiWeb/Powerfull-Token-Stealer

Kaneki 61 Dec 19, 2022
A Python Program to determine Degree of Profanity of Tweets

tweetx tweetx is a program to detect racial slurs in Twitter Tweets. Racial Abuse on Twitter is becoming quite a serious issue in recent times. tweetx

Kartik Poojari 3 Nov 11, 2021
Telegram Bot to learn English by words and more.. ( in Arabic )

Get the mp3 files Extract the mp3.rar on the same file that bot.py on install requirements pip install -r requirements.txt #Then enter you bot token

Plugin 10 Feb 19, 2022
Python package for Calendly API v2

PyCalendly Python package to use Calendly API-v2. Installation Install with pip $ pip install PyCalendly Usage Getting Started See Getting Started wi

Lakshmanan Meiyappan 20 Dec 05, 2022
PS3API - PS3 API for TMAPI and CCAPI in python.

PS3API PS3 API for TMAPI and CCAPI in python. Examples Connecting and Attaching from ps3api import PS3API PS3 = PS3API(PS3API.API_TMAPI) if PS3.Conn

Adam 9 Sep 01, 2022
SimpleDCABot is a simple bot that buys crypto with a dollar-cost averaging strategy.

Simple Open Dollar Cost Averaging (DCA) Bot SimpleDCABot is a simple bot that buys crypto on a selected exchange at regular intervals for a prescribed

4 Mar 28, 2022
Google Sheets Python API

Google Spreadsheets Python API v4 Simple interface for working with Google Sheets. Features: Open a spreadsheet by title, key or url. Read, write, and

Anton Burnashev 6.2k Dec 30, 2022
Бот для мини-игры "Рабы" ("Рабство") ВКонтакте.

vk-slaves-bot Бот для мини-игры "Рабы" ("Рабство") ВК Группа в ВК, в ней публикуются новости и другая полезная информация. У группы есть беседа, в кот

Almaz 80 Dec 17, 2022
BaiduPCS API & App 百度网盘客户端

BaiduPCS-Py A BaiduPCS API and An App BaiduPCS-Py 是百度网盘 pcs 的非官方 api 和一个命令行运用程序。

Peter Ding 450 Jan 05, 2023
Discord-disnake - This package allows to use disnake without changing the discord namespace

This package is a shim This module allows to use disnake using discord namespace. This is not an independent library. Installing Python 3.8 or higher

5 Dec 13, 2022
GUI Pancakeswap V2 and Uniswap V3 trading client (and bot)MOST ADVANCE TRADING BOT SUPPORT WINDOWS LINUX MAC

GUI Pancakeswap 2 and Uniswap 3 trading client (and bot) (MOST ADVANCE TRADING BOT SUPPORT WINDOWS LINUX MAC) UPDATE: MUTI TRADE TOKEN ENABLE ,TRADE 1

2 Dec 27, 2021
Revolt.py - An async library to interact with the https://revolt.chat api.

Revolt.py An async library to interact with the https://revolt.chat api. This library will be focused on making bots and i will not implement anything

Zomatree 0 Oct 08, 2022
PancakeTrade - Limit orders and more for PancakeSwap on Binance Smart Chain

PancakeTrade helps you create limit orders and more for your BEP-20 tokens that swap against BNB on PancakeSwap. The bot is controlled by Telegram so you can interact from anywhere.

Valentin Bersier 187 Dec 20, 2022
It is a temporary project to study discord interactions. You can set permissions conveniently when you invite a particular disk code bot.

Permission Bot 디스코드 내에 있는 message-components 를 연구하기 위하여 제작된 봇입니다. Setup /config/config_example.ini 파일을 /config/config.ini으로 변환합니다. config 파일의 기본 양식은 아

gunyu1019 4 Mar 07, 2022
Python client for Toyota North America service API

toyota-na Python client for Toyota North America service API Install pip install toyota-na[qt] [qt] is required for generating authorization code. Us

Gavin Ni 18 Sep 06, 2022
IACR Events Scraper

IACR Events Scraper This scrapes https://iacr.org/events/ and exports it as a calendar file. I host a version of this for myself under https://arrrr.c

Karolin Varner 6 May 28, 2022
ShadowMusic - A Telegram Music Bot with proper functions written in Python with Pyrogram and Py-Tgcalls.

⭐️ Shadow Music ⭐️ A Telegram Music Bot written in Python using Pyrogram and Py-Tgcalls Ready to use method A Support Group, Updates Channel and ready

TeamShadow 8 Aug 17, 2022