This tool crawls a list of websites and downloads all PDF and office documents

Last update: Sep 30, 2022

Overview

simplA11yPDFCrawler

simplA11yPDFCrawler is a tool supporting the simplified accessibility monitoring method described in Commission Implementing Decision (EU) 2018/1524. It is used by SIP (Information and Press Service) in Luxembourg to monitor the websites of public sector bodies.

This tool crawls a list of websites and downloads all PDF and office documents. It then analyses the PDF documents and tries to detect accessibility issues. The generated files can be used by the tool simplA11yGenReport to give an overview of the state of document accessibility on the monitored websites.

Most of the accessibility reports (in French) published by SIP on data.public.lu have been generated using simplA11yGenReport and data coming from this tool.

Accessibility Tests

On all PDF files we execute the following tests:

Name	Description	WCAG SC	WCAG technique	EN 301 549
EmptyText	Does the file contain text, or only images (for example a scanned document)?	1.4.5 Images of Text (AA)?	PDF7	10.1.4.5
Tagged	Is the document tagged?
Protected	Is the document protected in a way that blocks screen readers?
hasTitle	Does the document have a title?	2.4.2 Page Titled (A)	PDF18	10.2.4.2
hasLang	Does the document have a default language?	3.1.1 Language of Page (A)	PDF16	10.3.1.1
hasBookmarks	Does the document have bookmarks?	2.4.1 Bypass Blocks (A)		10.2.4.1
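As a rough illustration of how results for these tests could be interpreted, the sketch below maps each test name to the WCAG success criterion listed in the table and returns the tests that signal a problem for one document. The shape of the `results` dictionary, the function name `detected_issues`, and the choice of which boolean value counts as a problem for each test are assumptions made for this example; the files actually produced by the tool may be structured differently.

```python
# Test name -> (value that signals a problem, WCAG SC from the table above).
# Which value counts as a problem is an assumption derived from the test names:
# for example, EmptyText=True is read as "no text found", Tagged=False as "untagged".
TESTS = {
    "EmptyText": (True, "1.4.5 Images of Text (AA)"),
    "Tagged": (False, None),
    "Protected": (True, None),
    "hasTitle": (False, "2.4.2 Page Titled (A)"),
    "hasLang": (False, "3.1.1 Language of Page (A)"),
    "hasBookmarks": (False, "2.4.1 Bypass Blocks (A)"),
}


def detected_issues(results):
    """Return (test name, WCAG SC or None) for every test that signals a problem.

    `results` is a hypothetical dict mapping test names to booleans.
    Tests missing from `results` are skipped rather than reported.
    """
    issues = []
    for name, (problem_value, criterion) in TESTS.items():
        if name in results and results[name] == problem_value:
            issues.append((name, criterion))
    return issues


if __name__ == "__main__":
    sample = {
        "EmptyText": False,
        "Tagged": False,
        "Protected": False,
        "hasTitle": True,
        "hasLang": False,
        "hasBookmarks": True,
    }
    for name, criterion in detected_issues(sample):
        print(name, "->", criterion or "no WCAG SC listed in the table")
```

Keeping the mapping in a single dictionary means the table and the code stay easy to compare line by line.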

Installation

git clone https://github.com/accessibility-luxembourg/simplA11yPDFCrawler.git
cd simplA11yPDFCrawler
npm install
pip install -r requirements.txt
mkdir crawled_files ; mkdir out 
chmod a+x *.sh

Usage

To be able to use this tool, you need a list of websites to crawl. Store this list in a file named list-sites.txt, one domain per line (without protocol and without path). Example of content for this file:

test.public.lu
etat.public.lu

Then the tool is used in two steps:

1. Crawl all the files. Run the command crawl.sh. It crawls all the sites listed in list-sites.txt. Each site is crawled for at most 4 hours (this limit can be adjusted in crawl.sh). The resulting files are placed in the crawled_files folder. This step can take a long time.
2. Analyse the files and detect accessibility issues. Run the command analyse.sh. The resulting files are placed in the out folder.
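The two steps can also be chained from a small wrapper so that the analysis only starts once the crawl has finished. The sketch below is an assumption-laden example rather than part of the tool: the function `run_pipeline` and its `runner` parameter are hypothetical, and it assumes both scripts are executable, take no arguments, and are launched from the repository folder.

```python
import subprocess
from pathlib import Path

# The two scripts named in the usage steps, in the order they must run.
STEPS = ["./crawl.sh", "./analyse.sh"]


def run_pipeline(workdir=".", runner=subprocess.run):
    """Run crawl.sh and then analyse.sh, stopping at the first failure.

    `runner` defaults to subprocess.run; it is a parameter only so the
    function can be exercised without launching a real crawl.
    """
    workdir = Path(workdir)
    if not (workdir / "list-sites.txt").is_file():
        raise FileNotFoundError(f"list-sites.txt not found in {workdir}")
    executed = []
    for script in STEPS:
        # check=True makes subprocess.run raise CalledProcessError on a
        # non-zero exit code, so analyse.sh never runs after a failed crawl.
        runner([script], cwd=workdir, check=True)
        executed.append(script)
    return executed


if __name__ == "__main__":
    # Expect this to run for a long time: each site may be crawled for hours.
    run_pipeline()
```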

License

This software is developed by the Information and Press Service of the Luxembourg government and licensed under the MIT license.

Owner

AccessibilityLU
