A spider for Universal Online Judge(UOJ) system, converting problem pages to PDFs.

Overview

Universal Online Judge Spider

Introduction

This is a spider for Universal Online Judge (UOJ) system (https://uoj.ac/).

It also works for all other Online Judges using the UOJ system.

This spider is written in python3, using python selenium webdriver library and ChromeDriver.

It is only tested on Ubuntu 20.04, so the commands in the following section are only available for this system as well.

Features

  • Automatic login, no need to obtain cookies manually.
  • Convert pages into PDFs with reproducible text rather than simple screenshots.
  • Automatically detects the loading of MathJax to ensure that the mathematical formula within the results are displayed correctly.
  • Automatically skips pages that already exist (if the corresponding PDF file already exists locally).
  • Support for proxy.
  • Support for all websites using the UOJ system.

Installation

1. Install python3 and ChromeDriver:

apt install python3 python-pip3 chromium-browser chromium-chromedriver

2. Install selenium library for python3

pip3 install selenium

3. Download this program

Usage

Firstly you have to set these variables:

# [Basic settings]
url = ""
username = ""
password = ""
start_number = 1
end_number = 100
save_dir = "downloads"

# [Advanced settings]
proxy = ""
page_404_title = "404 - "
max_login_time = 60
max_mathjax_start_time = 60
max_mathjax_load_time = 60

Basic settings

  • url: the index URL of your target, e.g. https://uoj.ac/. Please note that the value must end in a slash /.
  • username: your username.
  • password: your password.
  • start_number: the number of the first problem crawled (minimum).
  • end_number: the number of the last problem crawled (maximum).
  • save_dir: the name of the folder where the result will be stored.

Advanced settings

If you don't know what the advanced settings are for, you're probably better not to change them.

  • proxy: the address of your proxy server, e.g. HTTP://127.0.0.1:1080, or SOCKS5://127.0.0.1:1081. Leave it blank (empty string) if you do not need to use a proxy.
  • page_404_title: the title of OJ's 404 page. You may use a substring of the title, like 404 - . If the program gets a page title that contains this string, the download of that page will be skipped.
  • max_login_time: the maximum waiting time for a login attempt, in seconds.
  • max_mathjax_start_time: the maximum wait time for a MathJax loading message to appear, in seconds.
  • max_mathjax_load_time: the maximum wait time for a MathJax loading message to disappear (i.e. MathJax rendering is finished), in seconds.

After completing the setup, run:

python3 main.py

Sample result

page1

page2

License

MIT License.

Owner
TriNitroTofu
QAQ...
TriNitroTofu
👨🏼‍⚖️ reddit bot that turns comment chains into ace attorney scenes

Ace Attorney reddit bot 👨🏼‍⚖️ Reddit bot that turns comment chains into ace attorney scenes. You'll need to sign up for streamable and reddit and se

763 Nov 17, 2022
The open-source web scrapers that feed the Los Angeles Times California coronavirus tracker.

The open-source web scrapers that feed the Los Angeles Times' California coronavirus tracker. Processed data ready for analysis is available at datade

Los Angeles Times Data and Graphics Department 51 Dec 14, 2022
淘宝、天猫半价抢购,抢电视、抢茅台,干死黄牛党

taobao_seckill 淘宝、天猫半价抢购,抢电视、抢茅台,干死黄牛党 依赖 安装chrome浏览器,根据浏览器的版本找到对应的chromedriver下载安装 web版使用说明 1、抢购前需要校准本地时间,然后把需要抢购的商品加入购物车 2、如果要打包成可执行文件,可使用pyinstalle

2k Jan 05, 2023
爱奇艺会员,腾讯视频,哔哩哔哩,百度,各类签到

My-Actions 个人收集并适配Github Actions的各类签到大杂烩 不要fork了 ⭐️ star就行 使用方式 新建仓库并同步代码 点击Settings - Secrets - 点击绿色按钮 (如无绿色按钮说明已激活。直接到下一步。) 新增 new secret 并设置 Secr

280 Dec 30, 2022
Crawler do site Fundamentus.com com o uso do framework scrapy, tanto da aba detalhada como a de resumo.

Crawler do site Fundamentus.com com o uso do framework scrapy, tanto da aba detalhada como a de resumo. (Todas as infomações)

Guilherme Silva Uchoa 3 Oct 04, 2022
A simple django-rest-framework api using web scraping

Apicell You can use this api to search in google, bing, pypi and subscene and get results Method : POST Parameter : query Example import request url =

Hesam N 1 Dec 19, 2021
A web Scraper for CSrankings.com that scrapes University and Faculty list for a particular country

A look into what we're building Demo.mp4 Prerequisites Python 3 Node v16+ Steps to run Create a virtual environment. Activate the virtual environment.

2 Jun 06, 2022
🐞 Douban Movie / Douban Book Scarpy

Python3-based Douban Movie/Douban Book Scarpy crawler for cover downloading + data crawling + review entry.

Xingbo Jia 1 Dec 03, 2022
Web crawling framework based on asyncio.

Web crawling framework for everyone. Written with asyncio, uvloop and aiohttp. Requirements Python3.5+ Installation pip install gain pip install uvloo

Jiuli Gao 2k Jan 05, 2023
A Python web scraper to scrape latest posts from official Coinbase's Blog.

Coinbase Blog Scraper A Python web scraper to scrape latest posts from official Coinbase's Blog. IDEA It scrapes up latest blog posts from https://blo

Lucas Villela 3 Feb 18, 2022
A package that provides you Latest Cyber/Hacker News from website using Web-Scraping.

cybernews A package that provides you Latest Cyber/Hacker News from website using Web-Scraping. Latest Cyber/Hacker News Using Webscraping Developed b

Hitesh Rana 4 Jun 02, 2022
腾讯课堂,模拟登陆,获取课程信息,视频下载,视频解密。

腾讯课堂脚本 要学一些东西,但腾讯课堂不支持自定义变速,播放时有水印,且有些老师的课一遍不够看,于是这个脚本诞生了。 时间比较紧张,只会不定时修复重大bug。多线程下载之类的功能更新短期内不会有,如果你想一起完善这个脚本,欢迎pr 2020.5.22测试可用 使用方法 很简单,三部完成 下载代码,

163 Dec 30, 2022
python+selenium实现的web端自动打卡 + 每日邮件发送 + 金山词霸 每日一句 + 毒鸡汤(从2月份稳定运行至今)

python+selenium实现的web端自动打卡 说明 本打卡脚本适用于郑州大学健康打卡,其他web端打卡也可借鉴学习。(自己用的,从2月分稳定运行至今) 仅供学习交流使用,请勿依赖。开发者对使用本脚本造成的问题不负任何责任,不对脚本执行效果做出任何担保,原则上不提供任何形式的技术支持。 为防止

Sunday 1 Aug 27, 2022
Scrape puzzle scrambles from csTimer.net

Scroodle Selenium script to scrape scrambles from csTimer.net csTimer runs locally in your browser, so this doesn't strain the servers any more than i

Jason Nguyen 1 Oct 29, 2021
The core packages of security analyzer web crawler

Security Analyzer 🐍 A large scale web crawler (considered also as vulnerability scanner tool) to take an overview about security of Moroccan sites Cu

Security Analyzer 10 Jul 03, 2022
An introduction to free, automated web scraping with GitHub’s powerful new Actions framework.

An introduction to free, automated web scraping with GitHub’s powerful new Actions framework Published at palewi.re/docs/first-github-scraper/ Contrib

Ben Welsh 15 Nov 24, 2022
Bulk download tool for the MyMedia platform

MyMedia Bulk Content Downloader This is a bulk download tool for the MyMedia platform. USE ONLY WHERE ALLOWED BY THE COPYRIGHT OWNER. NOT AFFILIATED W

Ege Feyzioglu 3 Oct 14, 2022
A simplistic scraper made to download tons of random screenshots made by people.

printStealer 1.1 What is this tool? This tool is developed to show the insecurity of the screenshot utility called prnt sc. It is a site that stores s

appelsiensam 4 Jul 26, 2022
This script is intended to crawl license information of repositories through the GitHub API.

GithubLicenseCrawler This script is intended to crawl license information of repositories through the GitHub API. Taking a csv file with requirements.

schutera 4 Oct 25, 2022
Discord webhook spammer with proxy support and proxy scraper

Discord webhook spammer with proxy support and proxy scraper

3 Feb 27, 2022