This is a web crawler that works on email data from gmane.org and visualizes it in different ways.

Overview

crawler_to_visual_gmane

Analyzing an email archive from gmane and visualizing the data using the D3 JavaScript library.

This is a set of tools that allow you to pull down an archive of a gmane repository using the instructions at:

http://gmane.org/export.php

In order not to overwhelm the gmane.org server, I have used a copy of the messages present at:

http://mbox.dr-chuck.net/

This server will be faster and take a lot of load off the gmane.org server.

You should install the SQLite browser to view and modify the databases from:

http://sqlitebrowser.org/

The first step is to spider the gmane repository. The base URL is hard-coded in gmane.py and points to the Sakai developer list. You can spider another repository by changing that base URL. Make sure to delete the content.sqlite file if you switch the base URL. The gmane.py file operates as a spider in that it runs slowly and retrieves one mail message per second so as to avoid getting throttled by gmane.org. It stores all of its data in a database and can be interrupted and re-started as often as needed. It may take many hours to pull all the data down, so you may need to restart several times.

The program scans content.sqlite from 1 up to the first message number not already spidered and starts spidering at that message. It continues spidering until it has spidered the desired number of messages or it reaches a page that does not appear to be a properly formatted message.
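
To make the resume behavior concrete, the core loop can be sketched roughly as follows. This is a minimal sketch, not the actual gmane.py: the Messages table name, its id and body columns, and the exact URL pattern for the mbox.dr-chuck.net mirror are all assumptions.

    import sqlite3
    import time
    import urllib.request

    conn = sqlite3.connect('content.sqlite')
    cur = conn.cursor()

    # Resume where the last run left off: find the highest message id
    # already stored and start one past it.
    cur.execute('SELECT max(id) FROM Messages')
    row = cur.fetchone()
    start = (row[0] or 0) + 1

    baseurl = 'http://mbox.dr-chuck.net/sakai.devel/'  # assumed mirror path for the Sakai developer list

    for msg_id in range(start, start + 100):            # pull a batch of messages per run
        url = baseurl + str(msg_id) + '/' + str(msg_id + 1)
        text = urllib.request.urlopen(url).read().decode(errors='replace')
        if not text.startswith('From '):                 # stop at a page that is not a formatted message
            break
        cur.execute('INSERT OR IGNORE INTO Messages (id, body) VALUES (?, ?)',
                    (msg_id, text))
        conn.commit()
        time.sleep(1)                                    # one message per second to avoid throttling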

Sometimes gmane.org is missing a message. Perhaps administrators can delete messages or perhaps they get lost - I don't know. If your spider stops and it seems it has hit a missing message, go into the SQLite browser, add a row with the missing id - leave all the other fields blank - and then restart gmane.py. This will unstick the spidering process and allow it to continue. These empty messages will be ignored in the next phase of the process.
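
If you would rather do this from Python than from the SQLite browser, a placeholder row can be inserted like this (a sketch, assuming the raw table is named Messages with an integer id column; 4223 is just a stand-in for whatever id your spider is stuck on):

    import sqlite3

    conn = sqlite3.connect('content.sqlite')
    cur = conn.cursor()

    missing_id = 4223   # replace with the id the spider is stuck on
    # Add a row with only the id set; every other column stays empty,
    # so the next phase will ignore this message.
    cur.execute('INSERT OR IGNORE INTO Messages (id) VALUES (?)', (missing_id,))
    conn.commit()
    conn.close()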

IMPORTANT

One nice thing is that once you have spidered all of the messages and have them in content.sqlite, you can run gmane.py again to get new messages as they get sent to the list. gmane.py will quickly scan to the end of the already-spidered pages and check if there are new messages and then quickly retrieve those messages and add them to content.sqlite.

The content.sqlite data is pretty raw, with an inefficient data model, and not compressed. This is intentional as it allows you to look at content.sqlite to debug the process. It would be a bad idea to run any queries against this database as they would be slow.

The second process is running the program gmodel.py. gmodel.py reads the rough/raw data from content.sqlite and produces a cleaned-up and well-modeled version of the data in the file index.sqlite. The file index.sqlite will be much smaller (often 10X smaller) than content.sqlite because it also compresses the header and body text.

Each time gmodel.py runs, it completely wipes out and rebuilds index.sqlite, allowing you to adjust its parameters and edit the mapping tables in content.sqlite to tweak the data-cleaning process.
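
The rebuild-and-compress step can be pictured roughly like this. It is only a sketch of the idea: the column names (headers, body) are assumptions that may not match gmodel.py exactly, and it skips the mapping and normalization work that gmodel.py actually does:

    import sqlite3
    import zlib

    src = sqlite3.connect('content.sqlite')
    dst = sqlite3.connect('index.sqlite')

    # Wipe and rebuild index.sqlite from scratch on every run.
    dst.execute('DROP TABLE IF EXISTS Messages')
    dst.execute('CREATE TABLE Messages (id INTEGER PRIMARY KEY, headers BLOB, body BLOB)')

    # Copy each raw message, compressing the large text columns on the way in;
    # this is the main reason index.sqlite ends up so much smaller.
    for msg_id, headers, body in src.execute('SELECT id, headers, body FROM Messages'):
        dst.execute('INSERT INTO Messages (id, headers, body) VALUES (?, ?, ?)',
                    (msg_id,
                     zlib.compress((headers or '').encode()),
                     zlib.compress((body or '').encode())))
    dst.commit()

Anything that reads index.sqlite would then call zlib.decompress() on those columns before using the text.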

You can re-run gmodel.py over and over as you look at the data, and add mappings to make the data cleaner and cleaner. When you are done, you will have a nicely indexed version of the email in index.sqlite. This is the file to use for data analysis. With this file, data analysis will be really quick.

You can look at the data in index.sqlite and if you find a problem, you can update the Mapping table and DNSMapping table in content.sqlite and re-run gmodel.py.
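
For example, suppose one person posts from two different addresses, or a mail domain should be credited to a different organization. Rows like the following (added through the SQLite browser or from Python) will be picked up the next time gmodel.py runs; the old/new column names and the example addresses are assumptions for illustration:

    import sqlite3

    conn = sqlite3.connect('content.sqlite')
    cur = conn.cursor()

    # Map an alternate email address to the person's canonical address.
    cur.execute('INSERT INTO Mapping (old, new) VALUES (?, ?)',
                ('jdoe@example.com', 'jane.doe@example.edu'))

    # Credit a sub-domain to its parent organization.
    cur.execute('INSERT INTO DNSMapping (old, new) VALUES (?, ?)',
                ('mail.example.edu', 'example.edu'))

    conn.commit()
    conn.close()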

  1. The first and simplest data analysis answers "who does the most?" and "which organization does the most?". This is done using gbasic.py; a sketch of the kind of counting involved appears after this list.

  2. There is a simple visualization of the word frequency in the subject lines in the file gword.py.

  3. A second visualization is in gline.py. It visualizes email participation by organizations over time.
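
As a rough illustration of what the gbasic.py counts boil down to, here is a sketch of the per-person and per-organization tallies against index.sqlite. The Messages/Senders table and column names are assumptions and may differ from the real schema:

    import sqlite3
    from collections import Counter

    conn = sqlite3.connect('index.sqlite')
    cur = conn.cursor()

    # One row per message, joined to the sender's email address
    # (assumed tables: Messages with a sender_id column, Senders with id and sender).
    cur.execute('''SELECT Senders.sender FROM Messages
                   JOIN Senders ON Messages.sender_id = Senders.id''')
    senders = [row[0] for row in cur]

    people = Counter(senders)                            # messages per person
    orgs = Counter(s.split('@')[-1] for s in senders)    # messages per organization (by mail domain)

    print('Top senders:', people.most_common(5))
    print('Top organizations:', orgs.most_common(5))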

Owner
Saim Zafar
I am a student at the University of Waterloo, currently studying Mathematics. My interests include computer science, problem analysis, algorithm design, and gaming.