A spider for the Universal Online Judge (UOJ) system that converts problem pages to PDFs.

Universal Online Judge Spider

Introduction

This is a spider for the Universal Online Judge (UOJ) system (https://uoj.ac/).

It also works with any other online judge built on the UOJ system.

The spider is written in Python 3 and uses the Selenium WebDriver library together with ChromeDriver.

It has only been tested on Ubuntu 20.04, so the commands in the following sections are written for that system and may need adjusting elsewhere.

Features

  • Logs in automatically, so there is no need to obtain cookies manually.
  • Converts pages into PDFs with real, copyable text rather than plain screenshots.
  • Automatically waits for MathJax to finish loading so that mathematical formulas are rendered correctly in the output.
  • Automatically skips problems whose PDF file already exists locally.
  • Supports proxies.
  • Works with any website that uses the UOJ system.
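
One way to obtain PDFs with real text from Chrome is the DevTools command Page.printToPDF, which Selenium's Chromium-based drivers expose through execute_cdp_cmd. The sketch below assumes that approach; this README does not state which mechanism the program actually uses, and save_page_as_pdf is a hypothetical helper name. The driver is passed in as a parameter, so the function works with any object that offers execute_cdp_cmd.

```python
import base64


def save_page_as_pdf(driver, pdf_path):
    """Print the page currently loaded in `driver` to `pdf_path`.

    Assumption: `driver` is a Chromium-based Selenium WebDriver (or any
    object with a compatible `execute_cdp_cmd` method).
    Returns the number of bytes written.
    """
    # Page.printToPDF returns the PDF as a base64 string under "data".
    result = driver.execute_cdp_cmd("Page.printToPDF", {"printBackground": True})
    pdf_bytes = base64.b64decode(result["data"])
    with open(pdf_path, "wb") as pdf_file:
        pdf_file.write(pdf_bytes)
    return len(pdf_bytes)
```

Depending on the Chrome version, this command may require the browser to run in headless mode.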

Installation

1. Install python3 and ChromeDriver:

apt install python3 python3-pip chromium-browser chromium-chromedriver

2. Install the Selenium library for Python 3:

pip3 install selenium

3. Download this program

Usage

First, set the following variables:

# [Basic settings]
url = ""
username = ""
password = ""
start_number = 1
end_number = 100
save_dir = "downloads"

# [Advanced settings]
proxy = ""
page_404_title = "404 - "
max_login_time = 60
max_mathjax_start_time = 60
max_mathjax_load_time = 60

Basic settings

  • url: the index URL of your target, e.g. https://uoj.ac/. Note that the value must end with a slash (/).
  • username: your username.
  • password: your password.
  • start_number: the number of the first problem to crawl (lower bound).
  • end_number: the number of the last problem to crawl (upper bound).
  • save_dir: the name of the folder where the resulting PDFs are stored.
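
As an illustration of how these settings fit together, the sketch below turns them into a list of pages still to be downloaded. It is not taken from the program: the helper names are hypothetical, the `problem/<number>` path is an assumption based on how uoj.ac addresses problems, and naming each file `<number>.pdf` is likewise assumed.

```python
import os


def problem_url(base_url, number):
    """Build the URL of one problem page from the index URL."""
    if not base_url.endswith("/"):
        raise ValueError("url must end with a slash, e.g. https://uoj.ac/")
    # Assumption: problems are served at <url>problem/<number>.
    return f"{base_url}problem/{number}"


def pending_problems(base_url, start_number, end_number, save_dir):
    """Yield (number, url, pdf_path) for every problem not yet saved."""
    for number in range(start_number, end_number + 1):  # end_number is inclusive
        # Assumption: each problem is stored as <save_dir>/<number>.pdf.
        pdf_path = os.path.join(save_dir, f"{number}.pdf")
        if os.path.exists(pdf_path):
            continue  # already downloaded, skip it
        yield number, problem_url(base_url, number), pdf_path
```

With url = "https://uoj.ac/", start_number = 1 and end_number = 3, and a save_dir that already contains 2.pdf, this yields entries for problems 1 and 3 only.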

Advanced settings

If you don't know what the advanced settings are for, it is best to leave them unchanged.

  • proxy: the address of your proxy server, e.g. HTTP://127.0.0.1:1080 or SOCKS5://127.0.0.1:1081. Leave it as an empty string if you do not need a proxy.
  • page_404_title: the title of the OJ's 404 page. A substring of the title, such as "404 - ", is sufficient. If a page's title contains this string, that page is skipped.
  • max_login_time: the maximum wait time for a login attempt, in seconds.
  • max_mathjax_start_time: the maximum wait time for the MathJax loading message to appear, in seconds.
  • max_mathjax_load_time: the maximum wait time for the MathJax loading message to disappear (i.e. for MathJax rendering to finish), in seconds.
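
The two MathJax timeouts suggest a two-phase wait: first for the loading message to appear, then for it to disappear. A minimal sketch of that pattern, using only the standard library, might look like this. The names wait_for, wait_for_mathjax and message_visible are hypothetical and not taken from the program; message_visible stands for any callable that reports whether the MathJax message is currently shown (for example, a Selenium element lookup). Treating a message that never appears as "nothing to wait for" is also an assumption.

```python
import time


def wait_for(predicate, timeout, poll_interval=0.1):
    """Poll predicate() until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)


def wait_for_mathjax(message_visible, max_start_time, max_load_time,
                     poll_interval=0.1):
    """Return True if the page is considered ready, False on a load timeout."""
    # Phase 1: wait for the loading message to appear.
    if not wait_for(message_visible, max_start_time, poll_interval):
        # Assumption: no message means the page has no MathJax to render.
        return True
    # Phase 2: wait for the message to disappear (rendering finished).
    return wait_for(lambda: not message_visible(), max_load_time, poll_interval)
```

Separating the two phases avoids declaring the page ready before MathJax has even started, which a single "message is absent" check could do.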

After completing the setup, run:

python3 main.py

Sample result

page1

page2

License

MIT License.

Owner
TriNitroTofu