simplA11yPDFCrawler

Overview

simplA11yPDFCrawler is a tool supporting the simplified accessibility monitoring method described in Commission Implementing Decision (EU) 2018/1524. It is used by SIP (Information and Press Service) in Luxembourg to monitor the websites of public sector bodies.

This tool crawls a list of websites and downloads all the PDF and office documents it finds. It then analyses the PDF documents and tries to detect accessibility issues. The generated files can then be used by the tool simplA11yGenReport to give an overview of the state of document accessibility on the monitored websites.

Most of the accessibility reports (in French) published by SIP on data.public.lu have been generated using simplA11yGenReport and data coming from this tool.

Accessibility Tests

On all PDF files we execute the following tests:

name         | description                                                     | WCAG SC                    | WCAG technique | EN 301 549
-------------|-----------------------------------------------------------------|----------------------------|----------------|-----------
EmptyText    | Does the file contain text, or only images (scanned document)?  | 1.4.5 Image of Text (AA)   | PDF7           | 10.1.4.5
Tagged       | Is the document tagged?                                          |                            |                |
Protected    | Is the document protected in a way that blocks screen readers?  |                            |                |
hasTitle     | Does the document have a title?                                  | 2.4.2 Page Titled (A)      | PDF18          | 10.2.4.2
hasLang      | Does the document have a default language?                       | 3.1.1 Language of Page (A) | PDF16          | 10.3.1.1
hasBookmarks | Does the document have bookmarks?                                | 2.4.1 Bypass Blocks (A)    |                | 10.2.4.1
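
For illustration, most of these document-level checks boil down to inspecting the PDF catalog and metadata. The sketch below is an approximation using the pypdf library, which is an assumption, not the crawler's actual implementation; in particular, a real Protected check would need to read the encryption permission bits rather than just the encryption flag.

from pypdf import PdfReader  # assumption: pypdf, not necessarily what the crawler uses

def basic_checks(path):
    reader = PdfReader(path)
    root = reader.trailer["/Root"]  # the document catalog
    info = reader.metadata          # may be None
    return {
        # EmptyText: no extractable text usually means a scanned, image-only file
        "EmptyText": not any((page.extract_text() or "").strip() for page in reader.pages),
        # Tagged: a tagged PDF exposes a structure tree in its catalog
        "Tagged": "/StructTreeRoot" in root,
        # Protected: coarse approximation; access for assistive technology is
        # really governed by the encryption permission flags
        "Protected": reader.is_encrypted,
        # hasTitle: title from the document information dictionary
        "hasTitle": bool(info and info.title),
        # hasLang: default language declared in the catalog
        "hasLang": "/Lang" in root,
        # hasBookmarks: an /Outlines entry signals bookmarks
        "hasBookmarks": "/Outlines" in root,
    }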

Installation

git clone https://github.com/accessibility-luxembourg/simplA11yPDFCrawler.git
cd simplA11yPDFCrawler
npm install
pip install -r requirements.txt
mkdir crawled_files ; mkdir out 
chmod a+x *.sh

Usage

To be able to use this tool, you need a list of websites to crawl. Store this list in a file named list-sites.txt, one domain per line (without protocol and without path). Example of content for this file:

test.public.lu
etat.public.lu
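
To sanity-check that file, a hypothetical helper (not part of this repository) could verify that every entry is a bare domain:

# Hypothetical helper, not part of simplA11yPDFCrawler: check that
# list-sites.txt contains one bare domain per line.
def load_sites(path="list-sites.txt"):
    sites = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            domain = line.strip()
            if not domain:
                continue  # skip blank lines
            if "://" in domain or "/" in domain:
                raise ValueError(f"expected a bare domain, got: {domain}")
            sites.append(domain)
    return sites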

Then the tool is used in two steps:

  1. Crawl all the files. Launch the command ./crawl.sh. It will crawl all the sites listed in list-sites.txt. Each site is crawled for a maximum of 4 hours (this limit can be adjusted in crawl.sh). The resulting files are placed in the crawled_files folder. This step can be quite long.
  2. Analyse the files and detect accessibility issues. Launch the command ./analyse.sh. The resulting files are placed in the out folder. A combined run of both steps is shown below.
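
In short, a full run from the repository root looks like this:

./crawl.sh    # step 1: crawl every site listed in list-sites.txt into crawled_files/
./analyse.sh  # step 2: analyse the downloaded documents into out/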

License

This software is developed by the Information and Press Service of the Luxembourgish government and licensed under the MIT license.
