PaperRobot: a paper crawler that can quickly download numerous papers, facilitating paper studying and management

Overview

PaperRobot

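PaperRobot is a paper-crawling tool that can quickly batch-download large numbers of papers, making ongoing paper management and study easier.

PaperRobot fetches papers through multiple interfaces, and the download success rate currently stays above 90%. By editing the Config file, it can fetch papers from any computer science conference.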

Installation

  • Download this tool
git clone https://github.com/mo-xiaoxi/PaperRobot.git
  • Install dependencies
sudo pip3 install -r requirements.txt

Python version: Python 3 (>=3.7).

Why build this tool?

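  1. Build your own paper database with this tool. For details, see: How to Build Your Very Own Paper Database
  2. A convenient paper survey tool: Secpaper. Essential for literature surveys!
  3. Extract paper abstracts, then automatically translate, push, and compile research briefings for conferences, so you can skim each conference's papers quickly and read the PDFs of only those that interest you.
  4. Classify and organize conference research hotspots, author changes, and so on, as Computer Science Rankings does.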

Usage

$ python run.py --help
usage: run.py [-h] [-m {d,s}] [-c {ccs,uss,sp,ndss,dsn,raid,imc,asiaccs,acsac,sigcomm}] [-s YEAR_START] [-e YEAR_END] [-b BIBTEX] [-t TITLE] [-u URL] [--all {bibtex,pdf}]

OPTIONS:
  -h, --help            show this help message and exit
  -m {d,s}, --mode {d,s}
                        s:show info, d: download
  -c {ccs,uss,sp,ndss,dsn,raid,imc,asiaccs,acsac,sigcomm}, --conference {ccs,uss,sp,ndss,dsn,raid,imc,asiaccs,acsac,sigcomm}
                        The target conference.
  -s YEAR_START, --year_start YEAR_START
                        The start year of paper.
  -e YEAR_END, --year_end YEAR_END
                        The end year of paper.
  -b BIBTEX, --bibtex BIBTEX
                        Download with bibtex file.
  -t TITLE, --title TITLE
                        Download with Google search.
  -u URL, --url URL     Dowanload with url.
  --all {bibtex,pdf}    Download all bibbex or papers,2001-2022 by default

Example

  • Download a paper by title: python run.py -t "A Large-scale Analysis of Email Sender Spoofing Attacks"
  • Download a paper by URL: python run.py -u "https://www.usenix.org/conference/usenixsecurity21/presentation/shen-kaiwen"
  • Download papers from a bib file: python run.py -b bibtex/example.bib
  • Fetch the NDSS 2021 papers: python run.py -c ndss -s 2021 -e 2022
  • Fetch the NDSS 2001-2021 papers: python run.py -c ndss -s 2001 -e 2022
  • Fetch the bibtex files of all conferences: python run.py --all bibtex
  • Fetch the pdf files of all conferences: python run.py --all pdf
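
To fetch several conferences in one go, the per-conference command above can be wrapped in a small script. The following is only a sketch: it uses the -c, -s, and -e flags exactly as documented in the help output, while the helper names build_commands and run_all are hypothetical and not part of PaperRobot. It defaults to a dry run that prints the commands instead of executing them.

```python
import subprocess
import sys


def build_commands(conferences, year_start, year_end):
    """Return one run.py argument list per conference (hypothetical helper)."""
    return [
        [sys.executable, "run.py", "-c", conf,
         "-s", str(year_start), "-e", str(year_end)]
        for conf in conferences
    ]


def run_all(commands, dry_run=True):
    """Print each command, or execute it from the repository root when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=False: one failing conference does not abort the remaining ones.
            subprocess.run(cmd, check=False)


if __name__ == "__main__":
    run_all(build_commands(["ndss", "ccs"], 2020, 2022))
```

Call run_all(..., dry_run=False) from the directory that contains run.py to actually start the downloads.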

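Other notes:

  • PaperRobot fetches each conference's bibtex from dblp to stay general, so in theory it supports any conference indexed by DBLP.

    Support for new conferences can be added by extending the following data.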

    LIB = {
        "ccs": "CCS",
        "uss": "Usenix_Security",
        "sp": "S&P",
        "ndss": "NDSS",
        "dsn": "DSN",
        "raid": "RAID",
        "imc": "IMC",
        "asiaccs": "ASIACCS",
        "acsac": "ACSAC",
        "sigcomm": "SIGCOMM",
    }
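  • Multiple auxiliary PDF-fetching interfaces:

    • Fetch papers from SCI-HUB by DOI (the method applicable to Zotero)
    • Fetch papers from the official paper website
    • Fetch papers through Google search
    • Fetch papers through the CrossRef website (this interface does not work particularly well)
  • keep_cookies.py maintains the login state for certain sites and must be run separately.

    • The login state is maintained because some sites (such as dl.acm) require a login before PDFs can be downloaded.

      Users need to set the account and password in config separately; these are your university account and password.

    • When accessing from an education-network IP, no cookie maintenance is needed; PDFs can be downloaded directly.

    • Users can also maintain cookies manually: export them with a tool such as burpsuite and write them to the data/cookie.json file.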
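
The multiple PDF-fetching interfaces listed above suggest a fallback chain: try one source, and move on to the next if it fails. The sketch below illustrates that idea only; it is not PaperRobot's actual implementation. The function fetch_with_fallback and the three stub sources are hypothetical stand-ins that make no network requests.

```python
from typing import Callable, Iterable, Optional, Tuple

# A fetcher takes a paper identifier (e.g. a DOI) and returns PDF bytes or None.
Fetcher = Callable[[str], Optional[bytes]]


def fetch_with_fallback(
    paper_id: str, fetchers: Iterable[Tuple[str, Fetcher]]
) -> Tuple[Optional[str], Optional[bytes]]:
    """Try each fetcher in order; return (source_name, pdf_bytes) or (None, None)."""
    for name, fetch in fetchers:
        try:
            data = fetch(paper_id)
        except Exception as exc:  # a failing source must not stop the chain
            print(f"[{name}] failed: {exc}")
            continue
        # Accept the result only if it looks like a PDF (magic bytes "%PDF").
        if data and data.startswith(b"%PDF"):
            return name, data
    return None, None


# Hypothetical stub sources standing in for real interfaces.
def doi_lookup(paper_id: str) -> Optional[bytes]:
    raise ConnectionError("site unreachable")


def publisher_site(paper_id: str) -> Optional[bytes]:
    return None  # e.g. login required, nothing downloaded


def web_search(paper_id: str) -> Optional[bytes]:
    return b"%PDF-1.7 stub for " + paper_id.encode()


FETCHERS = [
    ("doi_lookup", doi_lookup),
    ("publisher_site", publisher_site),
    ("web_search", web_search),
]

if __name__ == "__main__":
    source, pdf = fetch_with_fallback("10.0000/example", FETCHERS)
    print(source, len(pdf) if pdf else 0)
```

Ordering the list from most to least reliable source keeps the weaker interfaces as a last resort.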

TODO

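  • Better documentation, with separate Chinese and English versions.
  • Switch the log messages to English
  • Concurrent processing with multiple processes and coroutines
  • Build a proxy pool
  • Rewrite the functions that need retries using a retry decorator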
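
As an illustration of the last TODO item, a retry decorator could look roughly like the sketch below. This is an assumption about one possible design, not code from the repository; the name retry and its parameters (times, delay, backoff, exceptions) are hypothetical.

```python
import functools
import time


def retry(times=3, delay=1.0, backoff=2.0, exceptions=(Exception,)):
    """Retry the wrapped function up to `times` attempts (times >= 1)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            wait = delay
            for attempt in range(1, times + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == times:
                        raise  # out of attempts: surface the last error
                    time.sleep(wait)
                    wait *= backoff  # wait longer before each further attempt
        return wrapper
    return decorator


# Usage: a flaky function that succeeds on its third call.
calls = {"n": 0}


@retry(times=3, delay=0)
def flaky_download():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"
```

Restricting the exceptions tuple to network errors avoids retrying on genuine bugs such as a TypeError.
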
Owner
moxiaoxi
CTF Player of Tea-Deliverers, Blue-Lotus. Ph.D. Student at Tsinghua University. Research on Protocol Security.