catbridge_tools

Tools for working with MARC data in Catalogue Bridge.

Borrows heavily from pymarc (https://pypi.org/project/pymarc/).

Requirements

Requires the regex module from https://bitbucket.org/mrabarnett/mrab-regex. The built-in re module is not sufficient.

Also requires py2exe, which is used to build the stand-alone executables described below.

Installation

From GitHub:

git clone https://github.com/victoriamorris/catbridge_tools
cd catbridge_tools

To install as a Python package:

python setup.py install

To create stand-alone executable (.exe) files for individual scripts:

python setup.py py2exe 

Executable files will be created in the folder \dist, and should be copied to a folder on the system PATH.

Both of the above commands can be carried out by running the shell script:

compile_catbridge_tools.sh

Scripts

The scripts listed below can be run from anywhere, once the package is installed and the .exe files have been copied to a folder on the system PATH.

Correspondence with original Catalogue Bridge tools

Original Catalogue Bridge tool   New tool   Original syntax   Corresponding new syntax
cn-find                          cn-find    CN-FIND           cn_find -i <input_file> -o <output_file> -c <config_file>
cn-tidy                          cn-find    CN-FIND           cn_find -i <input_file> -o <output_file> -c <config_file> --tidy

Features common to all scripts

File formats

Unless otherwise specified, MARC files are in MARC 21 format, with .lex file extensions. Unless otherwise specified, text files are UTF-8-encoded, with .txt, .csv or .tsv file extensions. Config files are also text files, but may have the file extension .cfg for convenience.

Help

For any script, use the option --help to display help text.

Logs and debugging

Logs will be written to catbridge.log within the working directory. This is a UTF-8-encoded text file and can be read in any text editor. The default logging level is INFO; if option --debug is set, the logging level is changed to DEBUG. See https://docs.python.org/3/library/logging.html#levels for information about logging levels.
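
For reference, the behaviour described above corresponds to a logging setup along the following lines. This is a minimal sketch, not necessarily the exact configuration used inside the scripts:

    import logging

    # catbridge.log is written to the working directory, UTF-8 encoded
    # (the encoding parameter requires Python 3.9 or later).
    # The level is INFO by default, or DEBUG when --debug is set.
    debug_mode = False  # True when the --debug option is used
    logging.basicConfig(
        filename='catbridge.log',
        encoding='utf-8',
        level=logging.DEBUG if debug_mode else logging.INFO,
        format='%(asctime)s %(levelname)s %(message)s',
    )
    logging.info('Processing started')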

cn_find

cn_find is a utility which extracts control numbers from specified fields and subfields within a file of MARC records.

The fields and subfields to be extracted are specified in a config file.

Usage: cn_find -i <input_file> -o <output_file> -c <config_file> [options]

Options:
    --conv  Convert 10-digit ISBNs to 13-digit form where possible
    --rid   Include record ID as the first column of the output file
    --tidy  Sort and de-duplicate list

    --debug Debug mode.
    --help  Show help message and exit.

Files

<input_file> is the name of the input file, which must be a file of MARC 21 records.

<output_file> is the name of the file to which the control numbers will be written. This should be a text file.

<config_file> is the name of the file containing the configuration directives.

The config file

The format of the configuration file is as follows, with one entry per line:

FIELD TAG $ subfield character [tab] control number specification

Each line must match the regular expression

^([0-9A-Z]{3})\s*\$?\s*([a-z0-9]?)\s*\t(.*?)\s*$

The field tag is specified using three digits or UPPERCASE letters.

The subfield code is specified using a single digit or lowercase letter. If '$' appears without a following subfield code, all subfields will be searched for control numbers.

The control number specification tells the script what kind of control number to search for within the subfield. This can either take a value from a pre-defined list, or a regular expression can be used to search for control numbers with any other structure. Regular expressions are case-sensitive.
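
As an illustration, a config line matching the expression above can be split into its three parts as follows. This is a minimal sketch using the regex module; parse_config_line is a hypothetical helper, not part of the cn_find API:

    import regex

    CONFIG_LINE = regex.compile(r'^([0-9A-Z]{3})\s*\$?\s*([a-z0-9]?)\s*\t(.*?)\s*$')

    def parse_config_line(line):
        """Split one config line into (field tag, subfield code, spec)."""
        match = CONFIG_LINE.match(line)
        if match is None:
            raise ValueError(f'Invalid config line: {line!r}')
        return match.groups()

    print(parse_config_line('015$a\tBNB'))  # ('015', 'a', 'BNB')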

The pre-defined control number specifications, with the regular expression each one stands for, are:

ISBN - Any structurally plausible ISBN*
    \b(?=(?:[0-9]+[- ]?){10})[0-9]{9}[0-9Xx]\b|\b(?=(?:[0-9]+[- ]?){13})[0-9]{1,5}[- ][0-9]+[- ][0-9]+[- ][0-9Xx]\b|\b97[89][0-9]{10}\b|\b(?=(?:[0-9]+[- ]){4})97[89][- 0-9]{13}[0-9]\b

ISBN10 - Any structurally plausible 10-digit ISBN*
    \b(?=(?:[0-9]+[- ]?){10})[0-9]{9}[0-9Xx]\b|\b(?=(?:[0-9]+[- ]?){13})[0-9]{1,5}[- ][0-9]+[- ][0-9]+[- ][0-9Xx]\b

ISBN13 - Any structurally plausible 13-digit ISBN*
    \b97[89][0-9]{10}\b|\b(?=(?:[0-9]+[- ]){4})97[89][- 0-9]{13}[0-9]\b

ISSN - 8 digits, optionally separated in the middle by a space or hyphen, where the last digit may be an X
    \b[0-9]{4}[ -]?[0-9]{3}[0-9Xx]\b

BL001 - 9 digits
    \b[0-9]{9}\b

BNB - See https://www.bl.uk/collection-metadata/metadata-services/structure-of-the-bnb-number
    \bGB([0-9]{7}|[A-Z][0-9][A-Z0-9][0-9]{4})\b

LCCN - See https://www.loc.gov/marc/bibliographic/bd010.html
    \b[a-z][a-z ][a-z ]?[0-9]{2}[0-9]{6} ?\b

OCLC - "(OCoLC)" followed by digits
    \(OCoLC\)[0-9]+\b

ISNI - 16 digits in groups of 4, separated by spaces or hyphens, beginning 0000; the last digit may be an X
    \b[0]{4}[ -]?[0-9]{4}[ -]?[0-9]{4}[ -]?[0-9]{3}[0-9Xx]\b

FAST - "fst" followed by 8 digits
    \bfst[0-9]{8}\b

*Note: The ISBN check digit is not validated.
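
For example, applying the ISSN pattern from the list above to a fragment of subfield text might look like this (a sketch only; the matching strategy inside cn_find may differ):

    import regex

    ISSN = regex.compile(r'\b[0-9]{4}[ -]?[0-9]{3}[0-9Xx]\b')

    text = 'Continues: ISSN 0028-0836; online edition 1476-4687.'
    print(ISSN.findall(text))  # ['0028-0836', '1476-4687']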

Multiple fields and subfields may be specified. Fields may be repeated with different subfields.

Example:

001	BL001
015$a	BNB
020	ISBN
020$z	ISBN10
500$a	\b[a-z]{7}\b
035$a	OCLC

In the example above, field 500 subfield $a is being searched for 7-letter lowercase words.

Options

--conv

If option --conv is used, 10-digit ISBNs will be converted to 13-digit form whenever possible (i.e. whenever they are valid ISBNs).
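
The conversion itself is the standard one: strip hyphens and spaces, drop the ISBN-10 check digit, prefix 978, and recompute the EAN-13 check digit. A minimal sketch of the algorithm (not the exact code used by cn_find, which also needs to validate the ISBN-10 first):

    def isbn10_to_isbn13(isbn10: str) -> str:
        """Convert a 10-digit ISBN to 13-digit form (assumes a valid ISBN-10)."""
        digits = isbn10.replace('-', '').replace(' ', '')
        core = '978' + digits[:9]  # drop the old check digit, add the 978 prefix
        # EAN-13 check digit: weights alternate 1, 3 from the left
        total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
        return core + str((10 - total % 10) % 10)

    print(isbn10_to_isbn13('0-19-852663-6'))  # 9780198526636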

--rid

By default, the output file consists of a single column of strings. If option --rid is used, the output file will consist of two columns: the first column will be the record control number from field 001 and the second column will be as per the default output.
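
For instance, with --rid the output would be tab-separated, one control number per line; the values below are hypothetical:

    013123456	9780198526636
    013123457	9781234567897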

--tidy

If option --tidy is used, the list of control numbers in the output file will be sorted and de-duplicated. Any duplicate control numbers will be written to an additional output file named with the prefix "dp-".
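
A sketch of the assumed sort-and-deduplicate step, with duplicates diverted to the "dp-" prefixed file (function and file names here are illustrative only):

    from collections import Counter

    def tidy(numbers, out_path='output.txt', dup_path='dp-output.txt'):
        """Write sorted unique numbers to out_path; duplicates to dup_path."""
        counts = Counter(numbers)
        with open(out_path, 'w', encoding='utf-8') as out:
            out.writelines(n + '\n' for n in sorted(counts))
        with open(dup_path, 'w', encoding='utf-8') as dup:
            dup.writelines(n + '\n' for n in sorted(n for n, c in counts.items() if c > 1))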

Note: option --tidy cannot be used at the same time as option --rid.
