Code related to "Have Your Text and Use It Too! End-to-End Neural Data-to-Text Generation with Semantic Fidelity" paper

Overview

DataTuner

You have just found the DataTuner. This repository provides tools for fine-tuning language models for a task.

Installation

Environment Creation

Assuming you have an existing conda installation, you can set up the environment with the following script. To activate the conda environment from within the bash script, you need to pass the location of the conda.sh file:

bash setup.sh  ~/miniconda3/etc/profile.d/conda.sh

You can update your existing environment:

conda env update -f=environment.yml

To start development, activate your environment:

conda activate finetune

Alternatively, you can always use the python binary with the absolute path, e.g.: ~/miniconda3/envs/finetune/bin/python.

Data

For any task you want to fine-tune on, you need the data to be a json file containing a list of json objects, one per data point. For example:

[
  {
    "question": "question text 1",
    "query": "query 1"
  },
  {
    "question": "question text 2",
    "query": "query 2 with [SpecialToken example]"
  }
]

The library assumes that you have placed your data in a single directory with three files: train.json, validation.json, and test.json.
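Before training, it can help to check that a dataset directory follows this layout. The following is a minimal sketch, not part of DataTuner: the function name check_dataset_dir and the required_keys argument are hypothetical, and the check only covers what is described above (three files, each a JSON list of objects).

```python
import json
import tempfile
from pathlib import Path

# File names the library expects inside the dataset directory.
EXPECTED_SPLITS = ("train.json", "validation.json", "test.json")


def check_dataset_dir(dataset_dir, required_keys):
    """Return a list of problems found in the directory (empty if none)."""
    problems = []
    for name in EXPECTED_SPLITS:
        path = Path(dataset_dir) / name
        if not path.is_file():
            problems.append(f"{name}: missing")
            continue
        with open(path, encoding="utf-8") as f:
            records = json.load(f)
        if not isinstance(records, list):
            problems.append(f"{name}: top level is not a list")
            continue
        for i, record in enumerate(records):
            if not isinstance(record, dict):
                problems.append(f"{name}[{i}]: not a JSON object")
                continue
            missing = [k for k in required_keys if k not in record]
            if missing:
                problems.append(f"{name}[{i}]: missing keys {missing}")
    return problems


if __name__ == "__main__":
    # Build a tiny throwaway dataset and check it.
    with tempfile.TemporaryDirectory() as tmp:
        sample = [{"question": "question text 1", "query": "query 1"}]
        for split in EXPECTED_SPLITS:
            (Path(tmp) / split).write_text(json.dumps(sample), encoding="utf-8")
        print(check_dataset_dir(tmp, ["question", "query"]))
```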

Configuration

Now that we have the data in shape, we need to create a new task configuration file that specifies how we want the data to be formatted and what fields should be considered. You can create new config files in the folder src/datatuner/lm/task_configs.

A typical config file would look as follows:

{
    "name": "dataset_name",
    "data_shape": [
        {
            "id": "<question>",
            "type": "special",
            "learn": false
        },
        {
            "id": "question",
            "type": "text",
            "learn": false
        },
        {
            "id": "<query>",
            "type": "special",
            "learn": false
        },
        {
            "id": "query",
            "type": "text",
            "learn": true,
            "metrics": [
                "match"
            ]
        }
    ],
    "extra_special_tokens": ["[SpecialToken"],
    "extra_fields": []
}

For each item in the data shape:

  • type (required): special if special token, text if normal text.
  • id (required): the special token ID if type is special; the key for the text in the json data if type is text
  • learn (required): whether to allow the model to learn this part of the text. If false, the model masks that part during fine-tuning.
  • metrics (optional): the list of metrics that the model should compute upon evaluation. Each metric should have a corresponding function with the same name in metrics.py.
  • converter (optional): the name of the converter function in converters.py to apply on that text field after reading the text from the file.

The value of extra_special_tokens is a list of special tokens to be added to the vocabulary. Alternatively (especially if the list is too long or is generated automatically), you can create a text file with one special token per line and pass that as an argument during training via the --special_tokens_file argument.
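If the special tokens are generated automatically, a short script can write them to a file for --special_tokens_file. This is a sketch only: the helper name write_special_tokens_file is hypothetical, and dropping duplicates and blank entries is an assumption about what a clean file should contain.

```python
import tempfile
from pathlib import Path


def write_special_tokens_file(tokens, path):
    """Write one special token per line and return the tokens written."""
    # Strip whitespace, drop empty entries, and remove duplicates
    # while preserving the original order.
    unique = list(dict.fromkeys(t.strip() for t in tokens if t.strip()))
    Path(path).write_text("\n".join(unique) + "\n", encoding="utf-8")
    return unique


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "special_tokens.txt"
        write_special_tokens_file(["<question>", "<query>", "<question>"], target)
        print(target.read_text(encoding="utf-8"))
```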

The value of extra_fields is a list of additional fields to include from the input json files to output during evaluation, aside from the main fields used as inputs/outputs.
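A config file can be sanity-checked against the rules above before starting a training run. The sketch below is not part of DataTuner; validate_task_config is a hypothetical helper that only checks the required keys of each data_shape item as described in this section.

```python
REQUIRED_ITEM_KEYS = {"id", "type", "learn"}
ALLOWED_TYPES = {"special", "text"}


def validate_task_config(config):
    """Return a list of error strings for a parsed task config (empty if valid)."""
    errors = []
    data_shape = config.get("data_shape")
    if not isinstance(data_shape, list):
        return ["data_shape: missing or not a list"]
    for i, item in enumerate(data_shape):
        missing = REQUIRED_ITEM_KEYS - item.keys()
        if missing:
            errors.append(f"data_shape[{i}]: missing {sorted(missing)}")
        if "type" in item and item["type"] not in ALLOWED_TYPES:
            errors.append(f"data_shape[{i}]: unknown type {item['type']!r}")
        if "learn" in item and not isinstance(item["learn"], bool):
            errors.append(f"data_shape[{i}]: learn must be true or false")
    return errors


if __name__ == "__main__":
    config = {
        "name": "dataset_name",
        "data_shape": [
            {"id": "<query>", "type": "special", "learn": False},
            {"id": "query", "type": "text", "learn": True, "metrics": ["match"]},
        ],
    }
    print(validate_task_config(config))
```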

Training

The training script train.py can be used in single GPU or multi GPU settings.

cd src/datatuner/lm

# single gpu
python train.py --model_checkpoint ~/data/openai-gpt/  --dataset_path ../../../data/my_dataset/  --task_config ./task_configs/my_task_config.json --n_epoch 3 --lr 1e-5

# multi gpu
python -m torch.distributed.launch --nproc_per_node=4 train.py --model_checkpoint ~/data/openai-gpt/  --dataset_path ../../../data/my_dataset/  --task_config ./task_configs/my_task_config.json --n_epoch 3 --lr 1e-5

Evaluating the Model

You can run the following to evaluate the model on any test set. The data format is the same as for the training data. Note that you currently have to specify a model_type parameter that matches the model you are loading:

cd src/datatuner/lm

python ./evaluate.py --task_config ./task_configs/my_task_config.json --model_checkpoint runs/2020-01-01_01-01-01  --filename ../../../data/my_dataset/test.json --max_length 200 --model_type gpt --top_k 1

# or if you just want to evaluate the latest model you trained 
RUN=$(ls -t ./runs | head -1) && python ./evaluate.py --task_config ./task_configs/my_task_config.json --model_checkpoint runs/$RUN  --filename ../../../data/my_dataset/test.json --max_length 200 --model_type gpt  --top_k 1

# or, to use the latest intermediate checkpoint while the model is still training, first copy it to pytorch_model.bin in the run folder:
RUN=$(ls -t ./runs | head -1) && CHECKPOINT=$(ls -t ./runs/$RUN/checkpoint* | head -1) && cp $CHECKPOINT runs/$RUN/pytorch_model.bin

During evaluation, the outputs that do not exactly match the expected outputs are printed. The metrics are also printed, as a dictionary with keys of the form <metric_name>_<field_name>. At the end of evaluation, all the generated outputs are saved in the file eval_results/<run_folder_name>/<task_name>_<test_file_name>_<model_type>_generated.json.
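The <metric_name>_<field_name> naming can be sketched as follows. This is an illustration, not the code in metrics.py: the signature of match, the averaging over examples, and the helper compute_metrics are all assumptions made for this example.

```python
def match(expected, generated):
    """Hypothetical exact-match metric: 1.0 if the strings are equal, else 0.0."""
    return float(expected.strip() == generated.strip())


# Maps metric names used in the task config to functions (assumed layout).
METRIC_FUNCTIONS = {"match": match}


def compute_metrics(data_shape, expected_records, generated_records):
    """Average each configured metric over all examples.

    Keys of the returned dictionary follow <metric_name>_<field_name>.
    """
    results = {}
    for item in data_shape:
        field = item["id"]
        for metric_name in item.get("metrics", []):
            fn = METRIC_FUNCTIONS[metric_name]
            scores = [
                fn(expected[field], generated[field])
                for expected, generated in zip(expected_records, generated_records)
            ]
            results[f"{metric_name}_{field}"] = sum(scores) / len(scores) if scores else 0.0
    return results


if __name__ == "__main__":
    shape = [{"id": "query", "type": "text", "learn": True, "metrics": ["match"]}]
    expected = [{"query": "query 1"}, {"query": "query 2"}]
    generated = [{"query": "query 1"}, {"query": "something else"}]
    print(compute_metrics(shape, expected, generated))
```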

Interacting with the model

You can also interact with the models. The client will ask you to input the fields required, and it will generate the fields it learnt.

cd src/datatuner/lm

python ./evaluate.py --task_config ./task_configs/my_task_config.json --model_checkpoint runs/2020-01-01_01-01-01  --max_length 200 --model_type gpt  --top_k 1 --input

# or if you just want to interact with the latest model you trained
RUN=$(ls -t ./runs | head -1) && python ./evaluate.py --task_config ./task_configs/my_task_config.json --model_checkpoint runs/$RUN  --max_length 200 --model_type gpt  --top_k 1 --input