A Python module and command line utility for working with web archive data using the WACZ format specification

Overview

py-wacz

The py-wacz repository contains a Python module and command line utility for working with web archive data using the WACZ format specification. Web Archive Collection Zipped (WACZ) allows web archives to be shared and distributed by providing a predictable way of packaging up web archive data and metadata as a ZIP file. The wacz command line utility supports converting any WARC files into WACZ files, and optionally generating full-text search indices of pages.
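Because a WACZ file is an ordinary ZIP archive, its contents can be inspected with Python's standard library alone. The following is a minimal sketch and not part of the wacz module: the helper name list_wacz_entries is hypothetical, and it assumes the package metadata is stored as datapackage.json at the root of the archive.

```python
import json
import zipfile


def list_wacz_entries(path):
    """Return (entry names, parsed datapackage.json or None) for a WACZ file."""
    with zipfile.ZipFile(path) as zf:
        names = zf.namelist()
        metadata = None
        # Assumption: the package metadata lives in datapackage.json
        # at the root of the ZIP archive.
        if "datapackage.json" in names:
            with zf.open("datapackage.json") as fh:
                metadata = json.load(fh)
    return names, metadata
```

Calling list_wacz_entries("myfile.wacz") would return the list of files inside the package together with the metadata dictionary, which is a quick way to see what a given WACZ actually contains.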

Install

Use pip to install the module and the command line utility:

pip install wacz

Once installed you can use the wacz command line utility to create and validate WACZ files.

Create

To create a WACZ package you can point wacz at a WARC file and tell it where to write the WACZ with the -o option:

wacz create -o myfile.wacz tests/fixtures/example-collection.warc

The resulting myfile.wacz should be loadable via ReplayWeb.page.
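The same command can be driven from a Python script. The sketch below only assembles and runs the wacz create command line with the standard subprocess module; it does not call any function inside the wacz package, and the helper names build_create_command and create_wacz are hypothetical.

```python
import subprocess


def build_create_command(warc_path, output_path):
    """Assemble the argument list for `wacz create -o OUTPUT WARC`."""
    return ["wacz", "create", "-o", output_path, warc_path]


def create_wacz(warc_path, output_path):
    """Run `wacz create`; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(build_create_command(warc_path, output_path), check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when file names contain spaces.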

wacz accepts the following options for customizing how the WACZ file is assembled.

-f --file

Explicitly declare the file being passed to the create function.

wacz create -f tests/fixtures/example-collection.warc

-o --output

Explicitly declare the name of the WACZ file being created.

wacz create tests/fixtures/example-collection.warc -o mywacz.wacz

-t --text

Generates a pages.jsonl page index that includes a full-text index. This option must be used together with --detect-pages; it has no effect on its own.

wacz create tests/fixtures/example-collection.warc --detect-pages -t

--detect-pages

Generates a pages.jsonl page index without a full-text index.

wacz create tests/fixtures/example-collection.warc --detect-pages

-p --pages

Uses the pages listed in the supplied JSONL file instead of generating the pages index.

wacz create tests/fixtures/example-collection.warc -p passed_pages.jsonl

When passing pages with -p, you can also add a full-text index by including the --text flag:

wacz create tests/fixtures/example-collection.warc -p passed_pages.jsonl --text
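A pages file for -p is a JSON Lines document, with one JSON object per line. The sketch below writes such a file with the standard library. The record layout (a header line with format, id and title, followed by one object per page with url, ts and title) is an assumption based on the WACZ format specification and should be checked against it; the helper name write_pages_jsonl and the sample page are made up for illustration.

```python
import json


def write_pages_jsonl(path, pages):
    """Write a header line followed by one JSON object per page."""
    # Assumed header record; verify the exact fields against the WACZ specification.
    header = {"format": "json-pages-1.0", "id": "pages", "title": "All Pages"}
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(json.dumps(header) + "\n")
        for page in pages:
            fh.write(json.dumps(page) + "\n")


# Hypothetical sample page used only to illustrate the shape of a record.
sample_pages = [
    {"url": "https://example.com/", "ts": "2021-01-01T00:00:00Z", "title": "Example"},
]
```

Calling write_pages_jsonl("passed_pages.jsonl", sample_pages) would produce a two-line file that could then be handed to wacz create with -p.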

--ts

Overrides the ts metadata value in the datapackage.json file.

wacz create tests/fixtures/example-collection.warc --ts TIMESTAMP

--url

Overrides the url metadata value in the datapackage.json file.

wacz create tests/fixtures/example-collection.warc --url URL

--title

Overrides the title metadata value in the datapackage.json file.

wacz create tests/fixtures/example-collection.warc --title TITLE

--desc

Overrides the desc metadata value in the datapackage.json file.

wacz create tests/fixtures/example-collection.warc --desc DESC

--hash-type

Specifies the hash type to use, either sha256 or md5:

wacz create tests/fixtures/example-collection.warc --hash-type md5
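Both hash types are available in Python's hashlib. As a rough illustration of what choosing one or the other means, the sketch below computes a digest of a file in chunks. The function name digest_file and the "type:hexdigest" output format are assumptions made for this example and are not taken from the wacz source.

```python
import hashlib


def digest_file(path, hash_type="sha256", chunk_size=65536):
    """Return '<hash_type>:<hex digest>' for the file at path."""
    if hash_type not in ("sha256", "md5"):
        raise ValueError("hash_type must be 'sha256' or 'md5'")
    hasher = hashlib.new(hash_type)
    with open(path, "rb") as fh:
        # Read in fixed-size chunks so large WARC files are not loaded into memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            hasher.update(chunk)
    return f"{hash_type}:{hasher.hexdigest()}"
```

sha256 is the stronger of the two; md5 is faster to compute but is no longer considered collision resistant, so it is best kept for compatibility with tools that expect it.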

Validate

You can also validate an existing WACZ file by running:

wacz validate myfile.wacz

-f --file

Explicitly declare the file being passed to the validate function.

wacz validate -f myfile.wacz
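Before handing a file to wacz validate, a script can cheaply rule out files that are not ZIP archives at all, such as a raw WARC passed by mistake. The sketch below is only such a pre-check using the standard library; it does not reproduce the checks that wacz validate performs. The helper name looks_like_wacz is hypothetical, and the location of datapackage.json at the archive root is an assumption.

```python
import zipfile


def looks_like_wacz(path):
    """Return True if path is a ZIP archive that contains datapackage.json."""
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as zf:
        # Assumption: a WACZ package keeps datapackage.json at the ZIP root.
        return "datapackage.json" in zf.namelist()
```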

Testing

If you are developing wacz you can run the unit tests with pytest:

pytest tests

Owner

Webrecorder Project