A Python module and command line utility for working with web archive data using the WACZ format specification

Overview

py-wacz

The py-wacz repository contains a Python module and command line utility for working with web archive data using the WACZ format specification. Web Archive Collection Zipped (WACZ) allows web archives to be shared and distributed by providing a predictable way of packaging up web archive data and metadata as a ZIP file. The wacz command line utility supports converting any WARC files into WACZ files, and optionally generating full-text search indices of pages.

Install

Use pip to install the module and a command line utility:

pip install wacz

Once installed you can use the wacz command line utility to create and validate WACZ files.

Create

To create a WACZ package you can point wacz at a WARC file and tell it where to write the WACZ with the -o option:

wacz create -o myfile.wacz 
   

   

The resulting myfile.wacz should be loadable via ReplayWeb.page.

wacz accepts the following options for customizing how the WACZ file is assembled.

-f --file

Explicitly declare the file being passed to the create function.

wacz create -f tests/fixtures/example-collection.warc

-o --output

Explicitly declare the name of the wacz being created

wacz create tests/fixtures/example-collection.warc -o mywacz.wacz

-t --text

Generates pages.jsonl page index with a full-text index, must be run in conjunction with --detect-pages. Will have no effect if run alone

wacz create tests/fixtures/example-collection.warc -t

--detect-pages

Generates pages.jsonl page index without a full-text index

wacz create tests/fixtures/example-collection.warc --detect-pages

-p --pages

Overrides the pages index generation with the passed jsonl pages.

wacz create tests/fixtures/example-collection.warc -p passed_pages.jsonl

-t --text

You can add a full text index by including the --text tag

wacz create tests/fixtures/example-collection.warc -p passed_pages.jsonl --text

--ts

Overrides the ts metadata value in the datapackage.json file

wacz create tests/fixtures/example-collection.warc --ts TIMESTAMP

--url

Overrides the url metadata value in the datapackage.json file

wacz create tests/fixtures/example-collection.warc --url URL

--title

Overrides the titles metadata value in the datapackage.json file

wacz create tests/fixtures/example-collection.warc --title TITLE

--desc

Overrides the desc metadata value in the datapackage.json file

wacz create tests/fixtures/example-collection.warc --desc DESC

--hash-type

Allows the user to specify the hash type used: (sha256 or md5):

wacz create tests/fixtures/example-collection.warc --hash-type md5

Validate

You can also validate an existing WACZ file by running:

wacz validate myfile.wacz

-f --file

Explicitly declare the file being passed to the validate function.

wacz validate -f tests/fixtures/example-collection.warc

Testing

If you are developing wacz you can run the unit tests with pytest:

pytest tests
Owner
Webrecorder
Webrecorder Project
Webrecorder
A stupidly simple task list to keep you productive and focused.

StupidlySimple-TaskList A stupidly simple task list to keep you productive and focused. There is really nothing to it. This is a terminal-based script

Jack Soderstrom 1 Nov 28, 2021
A Python-based Wordle solver and CLI player

Wordle A Python-based Wordle solver and CLI player This was created using Python 3.9.7. SPOILER ALERT: the data directory contains spoilers for upcomi

Will Fitzgerald 1 Jul 24, 2022
Centauro - a command line tool with some network management functionality

Centauro Ferramenta de rede O Centauro é uma ferramenta de linha de comando com

1 Jan 01, 2022
A command-line based, minimal torrent streaming client made using Python and Webtorrent-cli.

ABOUT A command-line based, minimal torrent streaming client made using Python and Webtorrent-cli. Installation pip install -r requirements.txt It use

Janardon Hazarika 17 Dec 11, 2022
A command line utility for tracking a stock market portfolio. Primarily featuring high resolution braille graphs.

A command line stock market / portfolio tracker originally insipred by Ericm's Stonks program, featuring unicode for incredibly high detailed graphs even in a terminal.

Conrad Selig 51 Nov 29, 2022
CLI client for RFC 4226's HOTP and RFC 6238's TOTP.

One Time Password (OTP, TOTP/HOTP) OTP serves as additional protection in case of password leaks. onetimepass allows you to manage OTP codes and gener

Apptension 4 Jan 05, 2022
A Python package for a basic CLI and GUI user interface

Organizer CLI Organizer CLI is a python command line tool that goes through a given directory and organizes all un-folder bound files into folders by

Caltech Library 12 Mar 25, 2022
EODAG is a command line tool and a plugin-oriented Python framework for searching, aggregating results and downloading remote sensed images while offering a unified API for data access regardless of the data provider

EODAG (Earth Observation Data Access Gateway) is a command line tool and a plugin-oriented Python framework for searching, aggregating results and downloading remote sensed images while offering a un

CS GROUP 205 Jan 03, 2023
RSS reader client for CLI (Command Line Interface),

rReader is RSS reader client for CLI(Command Line Interface)

Lee JunHaeng 10 Dec 24, 2022
The Pythone Script will generate a (.)sh file with reverse shell codes then you can execute the script on the target

Pythone Script will generate a (.)sh file with reverse shell codes then you can execute the script on the targetPythone Script will generate a (.)sh file with reverse shell codes then you can execute

Boy From Future 15 Sep 16, 2022
Fylm is a wonderful automated command line app for organizing your film media.

Overview Fylm is a wonderful automated command line app for organizing your film media. You can pronounce it Film or File 'em, whichever you like! It

Brandon Shelley 30 Dec 05, 2022
Declarative CLIs with argparse and dataclasses

argparse_dataclass Declarative CLIs with argparse and dataclasses. Features Features marked with a ✓ are currently implemented; features marked with a

Mike DePalatis 29 Dec 06, 2022
A ZSH plugin that enables you to use OpenAI's powerful Codex AI in the command line.

A ZSH plugin that enables you to use OpenAI's powerful Codex AI in the command line.

Tom Dörr 976 Jan 03, 2023
Package installer for python

This is a package that adds a JSON file to your project that records all of the packages used in it and allows people to install it with a single command.

Anmol Malik 1 May 23, 2022
Program Command Line Interface (CLI) Sederhana: Pemesanan Nasi Goreng Hekel

Program ini merupakan aplikasi yang berjalan di dalam command line (terminal). Program ini menggunakan built-in library python yaitu argparse yang dapat menerima parameter saat program ini dijalankan

Habib Abdurrasyid 5 Nov 19, 2021
Ntfy - 🖥️📱🔔 A utility for sending notifications, on demand and when commands finish.

About ntfy ntfy brings notification to your shell. It can automatically provide desktop notifications when long running commands finish or it can send

Daniel Schep 4.5k Jan 01, 2023
kitty - the fast, feature-rich, cross-platform, GPU based terminal

kitty - the fast, feature-rich, cross-platform, GPU based terminal

Kovid Goyal 17.3k Jan 04, 2023
Pyrdle - Play Wordle in the CLI. Write an algorithm to play Wordle for you. Ruin all of the fun you've been having

Pyrdle - Play Wordle in the CLI. Write an algorithm to play Wordle for you. Ruin all of the fun you've been having

Charles Tapley Hoyt 11 Feb 11, 2022
Autosub - Command-line utility for auto-generating subtitles for any video file

Auto-generated subtitles for any video Autosub is a utility for automatic speech recognition and subtitle generation. It takes a video or an a

Anastasis Germanidis 3.9k Jan 05, 2023
A webmining CLI tool & library for python.

minet is a webmining command line tool & library for python (= 3.6) that can be used to collect and extract data from a large variety of web sources

médialab Sciences Po 165 Dec 17, 2022