A Telegram crawler to search groups and channels automatically and collect any type of data from them.

Overview

Introduction

This is a crawler I wrote in Python using the APIs of Telethon months ago. This tool was not intended to be publicly available for a number of reasons, but eventually I decided to distribute it "as it is". Any contribution to the project is more than welcome :)

Installation

Python 3.8.2 and Telethon 1.21.1 are required (along with other common packages, just read the imports), I don't guarantee it works with newer versions of Telethon.

To install Telethon just read their documentation, while to install this repository just git clone and run python3.8 scraper.py only after configurated the script properly (next section).

Configuration

To use this tool you have to first obtain an API ID and an API HASH from Telegram: you can do this by following this page. Once done the ID and the HASH can be inserted into the code and the script can be launched. The first time it runs, it will ask to insert the telephone number.

Usage

In the code, there are two methods to initialize the crawler: init_empty() and init(). The former is used for the very first time that the script has been launched, while the latter is needed only in specific situations (read the code for details). Once the crawler has been launched with init_empty() and terminated, it basically processed all the groups/channels where the account is already in, collecting all the links shared in the chats along with a number of other data such that:

  1. Name of the group/channel
  2. Username
  3. List of members (just for groups)
  4. List of the last n messages
  5. Other metadata...

groups

These information have been saved in a pickle file called groups. Other files given in output are to_be_processed and edges. The former is a list of links that will be processed in the next iteration (see later) and the other one is a list of tuples of the form (group id, [group id list]) where the first entry represents the so called destination vertex and the list represents the origin vertices. Indeed, this uncommon data structure is the edge list of the search graph produced by the crawler (this is useful for data mining purposes, for instance I used it to perform link prediction between groups/channels exploiting both the graph structure and the messages). Probably is not the best data structure, since you have to reverse it later on if you want to actually use it for other tasks, but it is faster than other solutions to update.

Once the initialization is complete, you can comment init_empty() and uncomment start() in main() to process the new links collected before. This will generate three new files: groups2, to_be_processed2 and edges2. Now you have to merge the old files with the new ones (I actually have a script that does that, but it is customized according to my environment so I will publish it as soon as I have time to generalize it): pay attention to the to_be_processed file, because you don't want to process it entirely in each run. You need to divide it since it will take too long, indeed there are some limitations to process a lot of groups, therefore you want to play smarter... next section.

Limitations

Telegram doesn't want that you play with users' data therefore for each account, if I recall correctly, you can join 25 groups per hour, then Telegram will stop you. The script handles this, so if your usage is not so intensive it will not make your task unfeasible. But if you need hundreds or thousands of groups, well, you have to parallelize the script. I won't publish the code for doing this, but it is not so difficult to make this idea practical.

Owner
Computer Science student at "La Sapienza"
茅台抢购最新优化版本,茅台秒杀,优化了抢购协程队列

茅台抢购最新优化版本,茅台秒杀,优化了抢购协程队列

MaoTai 33 Sep 03, 2022
Distributed Crawler Management Framework Based on Scrapy, Scrapyd, Django and Vue.js

Gerapy Distributed Crawler Management Framework Based on Scrapy, Scrapyd, Scrapyd-Client, Scrapyd-API, Django and Vue.js. Documentation Documentation

Gerapy 2.9k Jan 03, 2023
This is python to scrape overview and reviews of companies from Glassdoor.

Data Scraping for Glassdoor This is python to scrape overview and reviews of companies from Glassdoor. Please use it carefully and follow the Terms of

Houping 5 Jun 23, 2022
A Happy and lightweight Python Package that searches Google News RSS Feed and returns a usable JSON response and scrap complete article - No need to write scrappers for articles fetching anymore

GNews 🚩 A Happy and lightweight Python Package that searches Google News RSS Feed and returns a usable JSON response 🚩 As well as you can fetch full

Muhammad Abdullah 273 Dec 31, 2022
Bigdata - This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster

Scrapy Cluster This Scrapy project uses Redis and Kafka to create a distributed

Hanh Pham Van 0 Jan 06, 2022
A way to scrape sports streams for use with Jellyfin.

Sportyfin Description Stream sports events straight from your Jellyfin server. Sportyfin allows users to scrape for live streamed events and watch str

axelmierczuk 38 Nov 05, 2022
对于有验证码的站点爆破,用于安全合法测试

使用方法 python3 main.py + 配置好的文件 python3 main.py Verify.json python3 main.py NoVerify.json 以上分别对应有验证码的demo和无验证码的demo Tips: 你可以以域名作为配置文件名字加载:python3 main

47 Nov 09, 2022
Simple proxy scraper made by using ProxyScrape's api.

What is Moon? Moon is a lightweight and fast proxy scraper made by using ProxyScrape's api. What can i do with this? You can use proxies for varietys

1 Jul 04, 2022
爬取各大SRC当日公告 | 通过微信通知的小工具 | 赏金工具

OnTimeHacker V1.0 OnTimeHacker 是一个爬取各大SRC当日公告,并通过微信通知的小工具 OnTimeHacker目前版本为1.0,已支持24家SRC,列表如下 360、爱奇艺、阿里、百度、哔哩哔哩、贝壳、Boss、58、菜鸟、滴滴、斗鱼、 饿了么、瓜子、合合、享道、京东、

Bywalks 95 Jan 07, 2023
This is my CS 20 final assesment.

eeeeeSpider This is my CS 20 final assesment. How to use: Open program Run to your hearts content! There are no external dependancies that you will ha

1 Jan 17, 2022
Comment Webpage Screenshot is a GitHub Action that captures screenshots of web pages and HTML files located in the repository

Comment Webpage Screenshot is a GitHub Action that helps maintainers visually review HTML file changes introduced on a Pull Request by adding comments with the screenshots of the latest HTML file cha

Maksudul Haque 21 Sep 29, 2022
Web-scraping - A bot using Python with BeautifulSoup that scraps IRS website by form number and returns the results as json

Web-scraping - A bot using Python with BeautifulSoup that scraps IRS website (prior form publication) by form number and returns the results as json. It provides the option to download pdfs over a ra

1 Jan 04, 2022
This is a webscraper for a specific website

This is a webscraper for a specific website. It is tuned to extract the headlines of that website. With some little adjustments the webscraper is able to extract any part of the website.

Rahul Siyanwal 1 Dec 13, 2021
Pro Football Reference Game Data Webscraper

Pro Football Reference Game Data Webscraper Code Copyright Yeetzsche This is a simple Pro Football Reference Webscraper that can either collect all ga

6 Dec 21, 2022
Collection of code files to scrap different kinds of websites.

STW-Collection Scrap The Web Collection; blog posts. This repo contains Scrapy sample code to scrap the following kind of websites: Do you want to lea

Tapasweni Pathak 15 Jun 08, 2022
This is a web scraper, using Python framework Scrapy, built to extract data from the Deals of the Day section on Mercado Livre website.

Deals of the Day This is a web scraper, using the Python framework Scrapy, built to extract data such as price and product name from the Deals of the

David Souza 1 Jan 12, 2022
A Python web scraper to scrape latest posts from official Coinbase's Blog.

Coinbase Blog Scraper A Python web scraper to scrape latest posts from official Coinbase's Blog. IDEA It scrapes up latest blog posts from https://blo

Lucas Villela 3 Feb 18, 2022
A Python Oriented tool to Scrap WhatsApp Group Link using Google Dork it Scraps Whatsapp Group Links From Google Results And Gives Working Links.

WaGpScraper A Python Oriented tool to Scrap WhatsApp Group Link using Google Dork it Scraps Whatsapp Group Links From Google Results And Gives Working

Muhammed Rizad 27 Dec 18, 2022
A web scraper which checks price of a product regularly and sends price alerts by email if price reduces.

Amazon-Web-Scarper Created a web scraper using simple functions to check price of a product on amazon (can be duplicated to check price at other marke

Swaroop Todankar 1 Jan 17, 2022
Crawl the information of a given keyword on Google search engine

Crawl the information of a given keyword on Google search engine

4 Nov 09, 2022