An Arxiv Spider

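As a CS student, 杰出男孩 ("Outstanding Boy") knows well that the efficient way for a kernel to manage the hardware devices attached to a computer is interrupts, not polling. Whenever a friend sends over a "hot" paper freshly posted on arXiv, he can only sigh: "You must be glued to arXiv every day to be this fast~". So he searched GitHub, borrowed from scripts written by others, and built a spider that emails him the papers he cares about from ('cs.CV','cs.AI','stat.ML','cs.LG','cs.RO') every day. It supports custom keywords and a list of authors of interest.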

How to run

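Set the email username and password in main.py, and remember to enable POP3 authentication for the mailbox
Edit run.sh so that it points to the code directory and to the path of the Python env you run with
Use crontab to set up a scheduled task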
```
crontab -e
```
The crontab entry is
```
0 10 * * 1,2,3,4,5 bash your_dir/arxiv_spider/run.sh
```
That is, every Monday to Friday at 10:00 in the morning, the day's arXiv updates are pushed to your mailbox
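
For the email step, here is a minimal sketch of how a digest could be sent with Python's standard library. It is not the code in main.py: the host, port, credentials, and the function names `build_message` and `send_digest` are all hypothetical placeholders, and it assumes your provider accepts SMTP over SSL.

```python
import smtplib
from email.message import EmailMessage

# Hypothetical placeholders: replace with your own provider and account.
SMTP_HOST = "smtp.example.com"
SMTP_PORT = 465
USERNAME = "you@example.com"
PASSWORD = "your-app-password"


def build_message(sender, recipient, subject, body):
    """Build a plain-text email message without sending it."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send_digest(body, recipient=USERNAME):
    """Log in over SSL and send the digest (assumes SMTP over SSL is available)."""
    msg = build_message(USERNAME, recipient, "arXiv daily digest", body)
    with smtplib.SMTP_SSL(SMTP_HOST, SMTP_PORT) as server:
        server.login(USERNAME, PASSWORD)
        server.send_message(msg)
```

Keeping message construction separate from sending makes it possible to check the digest text locally before any credentials are involved.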

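arXiv is a wonderful site, and hammering it with a high-frequency crawler is behavior that deserves condemnation. The listings are only updated once a day, so please run the script once a day; that is the same as browsing arXiv once a day~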

Result

Today arxiv has 338 new papers in ['cs.CV', 'cs.AI', 'stat.ML', 'cs.LG', 'cs.RO'] area, and 127 of them is about CV, 2/2 of them contain your keywords.

Ensure your keywords is ['(?i)offline.*(RL|reinforcement learning)', '(?i)(RL|reinforcement learning).*offline'].

This is your paperlist.Enjoy!

------------1------------
arXiv:2110.12468
Title: SCORE: Spurious COrrelation REduction for Offline Reinforcement Learning
['Machine Learning (cs.LG)', 'Artificial Intelligence (cs.AI)']
https://arxiv.org/abs/2110.12468

------------2------------
arXiv:2110.13060
Title: Safely Bridging Offline and Online Reinforcement Learning
['Machine Learning (cs.LG)', 'Machine Learning (stat.ML)']
https://arxiv.org/abs/2110.13060

Ensure your authors is ['Sergey Levine', 'Song Han'].

This is your paperlist.Enjoy!

------------1------------
arXiv:2110.12080
Title: C-Planning: An Automatic Curriculum for Learning Goal-Reaching Tasks
['Machine Learning (cs.LG)', 'Artificial Intelligence (cs.AI)']
https://arxiv.org/abs/2110.12080

------------2------------
arXiv:2110.12543
Title: Understanding the World Through Action
['Machine Learning (cs.LG)']
https://arxiv.org/abs/2110.12543
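
The keyword and author filtering behind the output above can be sketched roughly as follows. This is an illustration, not the repository's implementation: the function names and the `title`/`authors` record layout are assumptions, while the patterns and author names are copied from the sample output.

```python
import re

# Patterns and names copied from the sample output above.
KEYWORD_PATTERNS = [
    r"(?i)offline.*(RL|reinforcement learning)",
    r"(?i)(RL|reinforcement learning).*offline",
]
AUTHORS_OF_INTEREST = ["Sergey Levine", "Song Han"]


def matches_keywords(title, patterns=KEYWORD_PATTERNS):
    """Return True if any regex pattern is found anywhere in the title."""
    return any(re.search(p, title) for p in patterns)


def matches_authors(authors, wanted=AUTHORS_OF_INTEREST):
    """Return True if any author name equals a wanted name (case-insensitive)."""
    wanted_lower = {name.lower() for name in wanted}
    return any(a.lower() in wanted_lower for a in authors)


def filter_papers(papers):
    """Split papers (hypothetical dicts with 'title' and 'authors') into two hit lists."""
    by_keyword = [p for p in papers if matches_keywords(p["title"])]
    by_author = [p for p in papers if matches_authors(p["authors"])]
    return by_keyword, by_author
```

Because `(?i)` makes the whole pattern case-insensitive, `RL` also matches the letters "rl" inside ordinary words such as "World"; the patterns stay selective only because they additionally require "offline".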

Acknowledgement

This code is built upon the implementation from https://github.com/ZihaoZhao/Arxiv_daily

Owner

Jie Liu
