Catalogue data - Python scripts to prepare catalogue data

Last update: Mar 03, 2022

Overview

catalogue_data

Scripts to prepare catalogue data.

Setup

Clone this repo.

Install git-lfs: https://github.com/git-lfs/git-lfs/wiki/Installation

sudo apt-get install git-lfs
git lfs install

Install system dependencies (unrar, from the non-free repository):

sudo apt-add-repository non-free
sudo apt-get update
sudo apt-get install unrar

Create a virtual environment, activate it, and install the Python dependencies:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Create a User Access Token (with write access) on the Hugging Face Hub at https://huggingface.co/settings/token, then set the following environment variables in the .env file in the repository root directory:

HF_USERNAME=
HF_USER_ACCESS_TOKEN=
GIT_USER=
GIT_EMAIL=
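
Before running the scripts it can help to check that none of these values was left empty. The following is a minimal sketch, assuming the .env file contains plain KEY=VALUE lines as in the template above. The helper names `parse_env_text` and `load_env_file` are hypothetical and not part of this repo, and the repo's own scripts may load the file differently.

```python
import os
from pathlib import Path

# Variable names taken from the .env template above.
REQUIRED_VARS = ("HF_USERNAME", "HF_USER_ACCESS_TOKEN", "GIT_USER", "GIT_EMAIL")


def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env_file(path=".env"):
    """Hypothetical helper: copy .env values into os.environ.

    Returns the names of required variables that are still unset or empty.
    Values already present in the environment are not overwritten.
    """
    values = parse_env_text(Path(path).read_text(encoding="utf-8"))
    for key, value in values.items():
        os.environ.setdefault(key, value)
    return [name for name in REQUIRED_VARS if not os.environ.get(name)]


if __name__ == "__main__":
    if Path(".env").exists():
        missing = load_env_file()
        if missing:
            print("Missing values for:", ", ".join(missing))
```

Running this from the repository root prints the names of any variables that still need a value.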

Create metadata

To create the dataset metadata (in the file dataset_infos.json), run:

python create_metadata.py --repo <repo_id>

where you should replace <repo_id> with the ID of the dataset repository, e.g. bigscience-catalogue-lm-data/lm_ca_viquiquad
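
To create metadata for several repositories in one go, the command can be wrapped in a small driver. This is only a sketch: the functions `build_metadata_command` and `create_metadata_for` are hypothetical, the only flag used is the `--repo` flag shown above, and the check that a repo ID has the form namespace/name is an assumption based on the example ID.

```python
import subprocess
import sys


def build_metadata_command(repo_id):
    """Return the argument list for one create_metadata.py run."""
    # Assumption: repo IDs look like "<namespace>/<dataset_name>".
    if repo_id.count("/") != 1 or not all(repo_id.split("/")):
        raise ValueError(f"expected '<namespace>/<dataset_name>', got {repo_id!r}")
    return [sys.executable, "create_metadata.py", "--repo", repo_id]


def create_metadata_for(repo_ids, dry_run=True):
    """Run the command for each repo ID, or only print it when dry_run is set."""
    for repo_id in repo_ids:
        command = build_metadata_command(repo_id)
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops at the first failing repository.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    create_metadata_for(["bigscience-catalogue-lm-data/lm_ca_viquiquad"])
```

With `dry_run=True` (the default here) nothing is executed, so the commands can be reviewed first; pass `dry_run=False` from the repository root with the virtual environment active to actually run them.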

Aggregate datasets

To create an aggregated dataset from multiple datasets, and save it as sharded JSON Lines GZIP files, run:

python aggregate_datasets.py --dataset_ratios_path <path_to_file_with_dataset_ratios> --save_path <dir_path_to_save_aggregated_dataset>

where you should replace:

<path_to_file_with_dataset_ratios>: path to a JSON file containing a dict that maps dataset names (keys) to their ratios (values), each between 0 and 1.
<dir_path_to_save_aggregated_dataset>: path to the directory where the aggregated dataset is saved.
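
As an illustration of the ratios file, the sketch below validates a dict of dataset names and ratios and serializes it to JSON. It assumes that dataset names are Hub repo IDs like the one used earlier; the second dataset name, the output file name, and the helper functions are hypothetical. The exact sampling semantics of each ratio are defined by aggregate_datasets.py, not by this example.

```python
import json
from pathlib import Path


def validate_ratios(ratios):
    """Check that ratios is a dict of dataset name -> number between 0 and 1."""
    if not isinstance(ratios, dict):
        raise TypeError("dataset ratios must be a JSON object (dict)")
    for name, ratio in ratios.items():
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(ratio, bool) or not isinstance(ratio, (int, float)):
            raise TypeError(f"ratio for {name!r} must be a number")
        if not 0 <= ratio <= 1:
            raise ValueError(f"ratio for {name!r} must be between 0 and 1, got {ratio}")
    return ratios


def write_ratios_file(ratios, path):
    """Validate the ratios and write them to a JSON file."""
    validate_ratios(ratios)
    Path(path).write_text(json.dumps(ratios, indent=2), encoding="utf-8")


if __name__ == "__main__":
    example = {
        "bigscience-catalogue-lm-data/lm_ca_viquiquad": 1.0,
        "my-org/my_other_dataset": 0.25,  # hypothetical dataset name
    }
    # Print instead of writing, so running the sketch has no side effects.
    print(json.dumps(validate_ratios(example), indent=2))
```

Calling `write_ratios_file(example, "dataset_ratios.json")` would produce a file that can be passed to `--dataset_ratios_path`.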

Owner

BigScience Workshop
