PaperRobot: a paper crawler that can quickly download numerous papers, facilitating paper studying and management

Overview

PaperRobot

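PaperRobot is a paper-crawling tool that downloads large numbers of papers in bulk, making ongoing paper management and study easier.

PaperRobot fetches papers through multiple interfaces, and its fetch success rate currently stays above 90%. By editing the Config file, it can fetch papers from any computer-science conference.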


Installation

  • Download this tool
git clone https://github.com/mo-xiaoxi/PaperRobot.git
  • Install dependencies
sudo pip3 install -r requirements.txt

Python version: Python 3 (>=3.7).

Why build this tool?

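  1. Use it to build your own paper database. For details, see "How to build a paper database of your own".
  2. A handy literature-survey tool: Secpaper. Essential for surveying papers!
  3. Extract paper abstracts, translate them automatically, and push out compiled research briefs for selected conferences, so you can quickly skim what each conference published and then read the PDFs that interest you.
  4. Categorize and organize conference research hotspots, changes in authorship, and so on, as Computer Science Rankings does.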

Usage

$ python run.py --help
usage: run.py [-h] [-m {d,s}] [-c {ccs,uss,sp,ndss,dsn,raid,imc,asiaccs,acsac,sigcomm}] [-s YEAR_START] [-e YEAR_END] [-b BIBTEX] [-t TITLE] [-u URL] [--all {bibtex,pdf}]

OPTIONS:
  -h, --help            show this help message and exit
  -m {d,s}, --mode {d,s}
                        s:show info, d: download
  -c {ccs,uss,sp,ndss,dsn,raid,imc,asiaccs,acsac,sigcomm}, --conference {ccs,uss,sp,ndss,dsn,raid,imc,asiaccs,acsac,sigcomm}
                        The target conference.
  -s YEAR_START, --year_start YEAR_START
                        The start year of paper.
  -e YEAR_END, --year_end YEAR_END
                        The end year of paper.
  -b BIBTEX, --bibtex BIBTEX
                        Download with bibtex file.
  -t TITLE, --title TITLE
                        Download with Google search.
  -u URL, --url URL     Dowanload with url.
  --all {bibtex,pdf}    Download all bibbex or papers,2001-2022 by default

Example

  • Download a paper by title: python run.py -t "A Large-scale Analysis of Email Sender Spoofing Attacks"
  • Download a paper by URL: python run.py -u "https://www.usenix.org/conference/usenixsecurity21/presentation/shen-kaiwen"
  • Download papers from a bib file: python run.py -b bibtex/example.bib
  • Fetch the NDSS 2021 papers: python run.py -c ndss -s 2021 -e 2022 (in these examples the -e value is one past the last year fetched)
  • Fetch the NDSS 2001-2021 papers: python run.py -c ndss -s 2001 -e 2022
  • Fetch the bibtex files for all conferences: python run.py --all bibtex
  • Fetch the PDF files for all conferences: python run.py --all pdf
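
The per-conference command above can be repeated for several conferences with a small wrapper. The following is a minimal sketch, not part of PaperRobot: the helper name build_commands and the chosen conference list are assumptions made for illustration, and only the -c, -s, and -e flags shown in the help text are used.

```python
# Hypothetical helper: builds one documented run.py invocation per conference.
# It only assembles argument lists; nothing is downloaded by this script.

# A few keys taken from the --conference choices in the help text.
CONFERENCES = ["ccs", "uss", "sp", "ndss"]


def build_commands(conferences, year_start, year_end):
    """Return one argument list per conference, using the documented flags."""
    commands = []
    for conf in conferences:
        commands.append([
            "python", "run.py",
            "-c", conf,
            "-s", str(year_start),
            "-e", str(year_end),
        ])
    return commands


if __name__ == "__main__":
    for cmd in build_commands(CONFERENCES, 2019, 2022):
        # Print each command; pass cmd to subprocess.run(cmd, check=True)
        # from the repository root to actually execute it.
        print(" ".join(cmd))
```

Printing the commands first makes it easy to check the year range before starting a long download run.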

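Additional notes:

  • For generality, PaperRobot fetches each conference's bibtex from dblp, so in principle it supports any conference indexed by DBLP.

    New conferences can be added by extending the following data.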

    LIB = {
        "ccs": "CCS",
        "uss": "Usenix_Security",
        "sp": "S&P",
        "ndss": "NDSS",
        "dsn": "DSN",
        "raid": "RAID",
        "imc": "IMC",
        "asiaccs": "ASIACCS",
        "acsac": "ACSAC",
        "sigcomm": "SIGCOMM",
    }
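  • Multiple auxiliary interfaces for fetching PDFs:

    • Fetch the paper from SCI-HUB by its DOI (the method that works with Zotero)
    • Fetch the paper from its official website
    • Fetch the paper via Google search
    • Fetch the paper via the CrossRef website (this interface does not work particularly well)
  • keep_cookies.py maintains the login state for certain sites and must be run separately.

    • The login state is maintained because some sites (such as dl.acm) require login before a PDF can be downloaded.

      Users need to set the account and password in config themselves; these are their university account credentials.

    • When accessing from an education-network IP, no cookie maintenance is needed; PDFs can be downloaded directly.

    • Users can also maintain cookies manually: export them with a tool such as burpsuite and write them to data/cookie.json.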
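
Adding a conference to the LIB mapping from the notes above might look like the sketch below. This is an illustration under assumptions: the helper add_conference and the "pets"/"PETS" entry are hypothetical, and the source does not say what value the tool expects for a new conference or whether the -c choices must be updated as well.

```python
# Copy of the LIB mapping from the notes: CLI key -> conference name.
LIB = {
    "ccs": "CCS",
    "uss": "Usenix_Security",
    "sp": "S&P",
    "ndss": "NDSS",
    "dsn": "DSN",
    "raid": "RAID",
    "imc": "IMC",
    "asiaccs": "ASIACCS",
    "acsac": "ACSAC",
    "sigcomm": "SIGCOMM",
}


def add_conference(lib, key, name):
    """Return a copy of lib with one extra entry; refuse to overwrite a key."""
    if key in lib:
        raise ValueError(f"conference key already present: {key}")
    updated = dict(lib)
    updated[key] = name
    return updated


# "pets" -> "PETS" is a hypothetical entry used only for illustration.
LIB = add_conference(LIB, "pets", "PETS")
print(sorted(LIB))
```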

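Trying the auxiliary PDF interfaces one after another can be pictured as a simple fallback chain. The sketch below is not PaperRobot's code: the function names, the order of the sources, and the stub fetchers are all assumptions, and the real interfaces would perform network requests instead of returning fixed values.

```python
from typing import Callable, Iterable, Optional, Tuple

# A fetcher takes a paper identifier (for example a DOI) and returns the
# PDF bytes, or None when that source has nothing to offer.
Fetcher = Callable[[str], Optional[bytes]]


def fetch_with_fallback(
    paper_id: str, fetchers: Iterable[Tuple[str, Fetcher]]
) -> Optional[Tuple[str, bytes]]:
    """Try each source in order and return (source name, data) on first success."""
    for name, fetch in fetchers:
        try:
            data = fetch(paper_id)
        except Exception as exc:
            # A failing source should not stop the remaining ones.
            print(f"{name} failed: {exc}")
            continue
        if data:
            return name, data
    return None


# Stub fetchers standing in for the real interfaces (hypothetical).
def from_scihub(doi: str) -> Optional[bytes]:
    raise ConnectionError("source unreachable")


def from_official_site(doi: str) -> Optional[bytes]:
    return None


def from_google(doi: str) -> Optional[bytes]:
    return b"%PDF-1.4 stub"


result = fetch_with_fallback(
    "10.0000/example",  # placeholder identifier, not a real DOI
    [
        ("scihub", from_scihub),
        ("official", from_official_site),
        ("google", from_google),
    ],
)
print(result)
```

Keeping the sources in an ordered list means a weaker interface, such as the CrossRef one, can be placed last or removed without touching the loop.
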
TODO

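  • Better documentation, with separate Chinese and English docs.
  • Switch the log messages to English
  • Concurrent processing with multiple processes plus coroutines
  • Build a proxy pool
  • Rewrite the functions that need retries using a retry decorator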
Owner
moxiaoxi
CTF Player of Tea-Deliverers, Blue-Lotus. Ph.D. Student at Tsinghua University. Research on Protocol Security.