A web scraper that exports your entire WhatsApp chat history.

Overview

WhatSoup 🍲

A web scraper that exports your entire WhatsApp chat history.

Table of Contents

  1. Overview
  2. Demo
  3. Prerequisites
  4. Instructions
  5. Frequently Asked Questions

Overview

Problem

  1. Exports are limited up to a maximum of 40,000 messages
  2. Exports skip the text portion of media-messages by replacing the entire message with instead of for example My favorite selfie of us 😻🐶🤳
  3. Exports are limited to a .txt file format

Solution

WhatSoup solves these problems by loading the entire chat history in a browser, scraping the chat messages (only text, no media), and exporting it to .txt, .csv, or .html file formats.

Example output:

WhatsApp Chat with Bob Ross.txt

02/14/2021, 02:04 PM - Eddy Harrington: Hey Bob 👋 Let's move to Signal!
02/14/2021, 02:05 PM - Bob Ross: You can do anything you want. This is your world.
02/15/2021, 08:30 AM - Eddy Harrington: How about we use WhatSoup 🍲 to backup our cherished chats?
02/15/2021, 08:30 AM - Bob Ross: However you think it should be, that’s exactly how it should be.
02/15/2021, 08:31 AM - Eddy Harrington: You're the best, Bob ❤
02/19/2021, 11:24 AM - Bob Ross:  My latest happy 🌲 painting for you.

Demo

Watch the video on YouTube

Prerequisites

  • You have a WhatsApp account
  • You have Chrome browser installed
  • You have some familiarity with setting up and running Python scripts
  • Your terminal supports unicode (UTF-8) characters (for chat emoji's)

Instructions

  1. Make sure your WhatsApp chat settings are set to English language. This needs to be done on your phone (instructions here). You can change it back afterwards, but for now the script relies on certain HTML elements/attributes that contain English characters/words.

  2. Clone the repo:

    git clone https://github.com/eddyharrington/WhatSoup.git
    
  3. Create a virtual environment:

    # Windows
    python -m venv env
    
    # Linux & Mac
    python3 -m venv env
    
  4. Activate the virtual environment:

    # Windows
    env/Scripts/activate
    
    # Linux & Mac
    source env/bin/activate
    
  5. Install the dependencies:

    # Windows
    pip install -r requirements.txt
    
    # Linux & Mac
    python3 -m pip install -r requirements.txt
    
  6. Setup your environment

  • Download ChromeDriver and extract it to a local folder (such as the env folder)

  • Get your Chrome browser Profile Path by opening Chrome and entering chrome://version into the URL bar

  • Create an .env file with an entry for DRIVER_PATH and CHROME_PROFILE that specify the directory paths for your ChromeDriver and your Chrome Profile from above steps:

    # Windows
    DRIVER_PATH = 'C:\path-to-your-driver\chromedriver.exe'
    CHROME_PROFILE = 'C:\Users\your-username\AppData\Local\Google\Chrome\User Data'
    
    # Linux & Mac
    DRIVER_PATH = '/Users/your-username/path-to-your-driver/chromedriver'
    CHROME_PROFILE = '/Users/your-username/Library/Application Support/Google/Chrome/Default'
    
  1. Run the script

    # Windows
    python whatsoup.py
    
    # Linux & Mac
    python3 whatsoup.py
    

    Note for Mac users: you may get blocked when trying to run the script the first time with a message about chromedriver not being from an identified developer. This is normal. Follow these instructions to grant chromedriver an exception, then re-run the script.

Frequently Asked Questions

Does it download pictures / media?

No.

How large of chats can I load/export?

The most demanding part of the process is loading the entire chat in the browser, in which performance heavily depends on how much memory your computer has and how well Chrome handles the large DOM load. For reference, my largest chat (~50k messages) uses about 10GB of RAM. If you load more than the current record let me know and add yourself to the leader board.

WhatSoup Largest Chat Leader Board

# Name Date Message Count Time
🥇 Eddy 2021-02-28 47,550 28139 sec / 7.8 hrs
🥈 ? ? ? ?
🥉 ? ? ? ?

How long does it take to load/export?

Depends on the chat size and how performant your computer is, however below is a ballpark range to expect. For large chats, I recommend turning your PC's sleep/power settings to OFF and running the script in the evening or before bed so it loads over night.

# of msgs in chat history Load time
500 1 min
5,000 12 min
10,000 35 min
25,000 3.5 hrs
50,000 8 hrs

Why is it so slow?!

Basically, browsers become easily bottlenecked when loading massive amounts of rich data in WhatsApp, which is a WebSocket application and is constantly sending/receiving information and changing the HTML/DOM.

I'm open to ideas but most of the things I tried didn't help performance:

  • Chrome vs Firefox
  • Headless browsing
  • Disabling images
  • Removing elements from DOM
  • Changing 'experimental' browser settings to allocate more memory

Can I...

  1. Use Firefox instead of Chrome? Yes, not out of the box though. There are a few Selenium differences and nuances to get it working, which I can share if there's interest. TODO.

  2. Use headless? Yes, but I only got this to work with Firefox and not Chrome.

  3. Use WhatSoup to scrape a local WhatsApp HTML file? Yes, you'd just need to bypass a few functions from main() and load the HTML file into Selenium's driver, then run the scraping/exporting functions like the below. If there's enough interest I can look into adding this to WhatSoup myself. TODO.

    # Load and scrape data from local HTML file
    def local_scrape(driver):
        driver.get('C:\your-WhatSoup-dir\source.html')
        scraped = scrape_chat(driver)
        scrape_is_exported("source", scraped)
    
  4. Contribute to WhatSoup? Please do!

Owner
Eddy Harrington
Eddy Harrington
Web Scraping Framework

Grab Framework Documentation Installation $ pip install -U grab See details about installing Grab on different platforms here http://docs.grablib.

2.3k Jan 04, 2023
CreamySoup - a helper script for automated SourceMod plugin updates management.

CreamySoup/"Creamy SourceMod Updater" (or just soup for short), a helper script for automated SourceMod plugin updates management.

3 Jan 03, 2022
Extract embedded metadata from HTML markup

extruct extruct is a library for extracting embedded metadata from HTML markup. Currently, extruct supports: W3C's HTML Microdata embedded JSON-LD Mic

Scrapinghub 725 Jan 03, 2023
Bulk download tool for the MyMedia platform

MyMedia Bulk Content Downloader This is a bulk download tool for the MyMedia platform. USE ONLY WHERE ALLOWED BY THE COPYRIGHT OWNER. NOT AFFILIATED W

Ege Feyzioglu 3 Oct 14, 2022
Consulta de CPF e CNPJ na Receita Federal com Web-Scraping

Repositório contendo scripts Python que realizam a consulta de CPF e CNPJ diretamente no site da Receita Federal.

Josué Campos 5 Nov 29, 2021
Scrapy, a fast high-level web crawling & scraping framework for Python.

Scrapy Overview Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pag

Scrapy project 45.5k Jan 07, 2023
a way to scrape a database of all of the isef projects

ISEF Database This is a simple web scraper which gets all of the projects and abstract information from here. My goal for this is for someone to get i

William Kaiser 1 Mar 18, 2022
A command-line program to download media, like and unlike posts, and more from creators on OnlyFans.

onlyfans-scraper A command-line program to download media, like and unlike posts, and more from creators on OnlyFans. Installation You can install thi

185 Jul 23, 2022
A web crawler script that crawls the target website and lists its links

A web crawler script that crawls the target website and lists its links || A web crawler script that lists links by scanning the target website.

2 Apr 29, 2022
Web crawling framework based on asyncio.

Web crawling framework for everyone. Written with asyncio, uvloop and aiohttp. Requirements Python3.5+ Installation pip install gain pip install uvloo

Jiuli Gao 2k Jan 05, 2023
Amazon web scraping using Scrapy Framework

Amazon-web-scraping-using-Scrapy-Framework Scrapy Scrapy is an application framework for crawling web sites and extracting structured data which can b

Sejal Rajput 1 Jan 25, 2022
A simple code to fetch comments below an Instagram post and save them to a csv file

fetch_comments A simple code to fetch comments below an Instagram post and save them to a csv file usage First you have to enter your username and pas

2 Jul 14, 2022
Web-scraping - Program that scrapes a website for a collection of quotes, picks one at random and displays it

web-scraping Program that scrapes a website for a collection of quotes, picks on

Manvir Mann 1 Jan 07, 2022
Amazon scraper using scrapy, a python framework for crawling websites.

#Amazon-web-scraper This is a python program, which use scrapy python framework to crawl all pages of the product and scrap products data. This progra

Akash Das 1 Dec 26, 2021
The first public repository that provides free BUBT website scraping API script on Github.

BUBT WEBSITE SCRAPPING SCRIPT I think this is the first public repository that provides free BUBT website scraping API script on github. When I was do

Md Imam Hossain 3 Feb 10, 2022
Proxy scraper. Format: IP | PORT | COUNTRY | TYPE

proxy scraper 🔎 Installation: git clone https://github.com/ebankoff/proxy_scraper Required pip libraries (pip install library name): lxml beautifulso

Eban'ko 19 Dec 07, 2022
Web scraped S&P 500 Data from Wikipedia using Pandas and performed Exploratory Data Analysis on the data.

Web scraped S&P 500 Data from Wikipedia using Pandas and performed Exploratory Data Analysis on the data. Then used Yahoo Finance to get the related stock data and displayed them in the form of chart

Samrat Mitra 3 Sep 09, 2022
京东抢茅台,秒杀成功很多次讨论,天猫抢购,赚钱交流等。

Jd_Seckill 特别声明: 请添加个人微信:19972009719 进群交流讨论 目前群里很多人抢到【扫描微信添加群就好,满200关闭群,有喜欢薅信用卡羊毛的也可以找我交流】 本仓库发布的jd_seckill项目中涉及的任何脚本,仅用于测试和学习研究,禁止用于商业用途,不能保证其合法性,准确性

50 Jan 05, 2023
Scraping script for stats on covid19 pandemic status in Chiba prefecture, Japan

About 千葉県の地域別の詳細感染者統計(Excelファイル) をCSVに変換し、かつ地域別の日時感染者集計値を出力するスクリプトです。 Requirement POSIX互換なシェル, e.g. GNU Bash (1) curl (1) python = 3.8 pandas = 1.1.

Conv4Japan 1 Nov 29, 2021
A simple python web scraper.

Dissec A simple python web scraper. It gets a website and its contents and parses them with the help of bs4. Installation To install the requirements,

11 May 06, 2022