Code and data for "TURL: Table Understanding through Representation Learning"

Related tags

Deep LearningTURL
Overview

TURL

This Repo contains code and data for "TURL: Table Understanding through Representation Learning".

overview_0

Environment and Setup

The model is mainly developped using PyTorch and Transformers. You can access the docker image we used here docker pull xdeng/transformers:latest

Data

Link for processed pretraining and evaluation data, as well as the model checkpoints can be accessed here. This is created based on the original WikiTables corpus (http://websail-fe.cs.northwestern.edu/TabEL/)

TODO: Instruction for preparing code from original WikiTable Corpus

Pretraining

Data

The [split]_tables.jsonl files are used for pretraining and creation of all test datasets, with 570171 / 5036 / 4964 tables for training/validation/testing.

'_id': '27289759-6', # table id
'pgTitle': '2010 Santos FC season', # page title
'sectionTitle': 'Out', # section title
'tableCaption': '', # table caption
'pgId': 27289759, # wikipedia page id
'tableId': 6, # index of the table in the wikipedia page
'tableData': [[{'text': 'DF', # cell value
    'surfaceLinks': [{'surface': 'DF',
      'locType': 'MAIN_TABLE',
      'target': {'id': 649702,
       'language': 'en',
       'title': 'Defender_(association_football)'},
      'linkType': 'INTERNAL'}] # urls in the cell
      } # one for each cell,...]
      ...]
'tableHeaders': [['Pos.', 'Name', 'Moving to', 'Type', 'Source']], # row headers
'processed_tableHeaders': ['pos.', 'name', 'moving to', 'type', 'source'], # processed headers that will be used
'merged_row': [], # merged rows, we identify them by comparing the cell values
'entityCell': [[1, 1, 1, 0, 0],...], # whether the cell is an entity cell, get by checking the urls inside
'entityColumn': [0, 1, 2], # whether the column is an entity column
'column_type': [0, 0, 0, 4, 2], # more finegrained column type for debug, here we only use 0: entity columns
'unique': [0.16, 1.0, 0.75, 0, 0], # the ratio of unique entities in that column
'entity_count': 72, # total number of entities in the table
'subject_column': 1 # the column index of the subject column

Each line represents a Wikipedia table. Table content is stored in the field tableData, where the target is the actual entity links to the cell, and is also the entity to retrieve. The id and title are the Wikipedia_id and Wikipedia_title of the entity. entityCell and entityColumn shows the cells and columns that pass our filtering and are identified to contain entity information.

There is also an entity_vocab.txt file contains all the entities we used in all experiments (these are the entities shown in pretraining). Each line contains vocab_id, Wikipedia_id, Wikipedia_title, freebase_mid, count of an entity.

Get representation for a given table To use the pretrained model as a table encoder, use the HybridTableMaskedLM model class. There is a example in evaluate_task.ipynb for cell filling task, which also shows how to get representation for arbitrary table.

Finetuning & Evaluation

To systematically evaluate our pre-trained framework as well as facilitate research, we compile a table understanding benchmark consisting of 6 widely studied tasks covering table interpretation (e.g., entity linking, column type annotation, relation extraction) and table augmentation (e.g., row population, cell filling, schema augmentation).

Please see evaluate_task.ipynb for running evaluation for different tasks.

Entity Linking

We use two datasets for evaluation in entity linking. One is based on our train/dev/test split, the linked entity to each cell is the target for entity linking. For the WikiGS corpus, please find the original release here http://www.cs.toronto.edu/~oktie/webtables/ .

We use entity name, together with entity description and entity type to get KB entity representation for entity linking. There are three variants for the entity linking: 0: name + description + type, 1: name + type, 2: name + description.

Evaluation

Please see EL in evaluate_task.ipynb

Data

Data are stored in [split].table_entity_linking.json

'23235546-1', # table id
'Ivan Lendl career statistics', # page title
'Singles: 19 finals (8 titles, 11 runner-ups)', # section title
'', # caption
['outcome', 'year', ...], # headers
[[[0, 4], 'Björn Borg'], [[9, 2], 'Wimbledon'], ...], # cells, [index, entity mention (cell text)]
[['Björn Borg', 'Swedish tennis player', []], ['Björn Borg', 'Swedish swimmer', ['Swimmer']], ...], # candidate entities, this the merged set for all cells. [entity name, entity description, entity types]
[0, 12, ...] # labels, this is the index of the gold entity in the candidate entities
[[0, 1, ...], [11, 12, 13, ...], ...] # candidates for each cell

Column Type Annotation

We divide the information available in the table for column type annotation as: entity mention, table metadata and entity embedding. We experiment under 6 settings: 0: all information, 1: only entity related, 2: only table metadata, 3: no entity embedding, 4: only entity mention, 5: only entity embedding.

Data

Data are stored in [split].table_col_type.json. There is a type_vocab.txt store the target types.

'27295818-29', # table id
 '2010–11 rangers f.c. season', # page title
 27295818, # Wikipedia page id
 'overall', # section title
 '', # caption
 ['competition', 'started round', 'final position / round'], # headers
 [[[[0, 0], [26980923, 'Scottish Premier League']],
   [[1, 0], [18255941, 'UEFA Champions League']],
   ...],
  ...,
  [[[1, 2], [18255941, 'Group stage']],
   [[2, 2], [20795986, 'Round of 16']],
   ...]], # cells, [index, [entity id, entity mention (cell text)]]
 [['time.event'], ..., ['time.event']] # column type annotations, a column may have multiple types.

Relation Extraction

There is a relation_vocab.txt store the target relations. In the [split].table_rel_extraction.json file, each example contains table_id, pgTitle, pgId, secTitle, caption, valid_headers, entities, relations similar to column type classification. Note here the relation is between the subject column (leftmost) and each of the object columns (the rest). We do this to avoid checking all column pairs in the table.

Row Population

For row population, the task is to predict the entities linked to the entity cells in the leftmost entity column. A small amount of tables is further filtered out from test_tables.jsonl which results in the final 4132 tables for testing.

Cell Filling

Please see Pretrained and CF in evaluate_task.ipynb. You can directly load the checkpoint under pretrained, as we do not finetune the model for cell filling.

We have three baselines for cell filling: Exact, H2H, H2V. The header vectors and co-occurrence statistics are pre-computed, please see baselines/cell_filling/cell_filling.py for details.

Schema Augmentation

TODO: Refactoring the evaluation scripts and add instruction.

Acknowledgement

We use the WikiTable corpus for developing the dataset for pretraining and most of the evaluation. We also adopt the WikiGS for evaluation of entity linking.

We use multiple existing systems as baseline for evaluation. We took the code released by the author and made minor changes to fit our setting, please refer to the paper for more details.

Owner
SunLab-OSU
SunLab-OSU
TorchIO is a Medical image preprocessing and augmentation toolkit for deep learning. Part of the PyTorch Ecosystem.

Medical image preprocessing and augmentation toolkit for deep learning. Part of the PyTorch Ecosystem.

Fernando Pérez-García 1.6k Jan 06, 2023
A large-scale video dataset for the training and evaluation of 3D human pose estimation models

ASPset-510 (Australian Sports Pose Dataset) is a large-scale video dataset for the training and evaluation of 3D human pose estimation models. It contains 17 different amateur subjects performing 30

Aiden Nibali 25 Jun 20, 2021
ZeroVL - The official implementation of ZeroVL

This repository contains source code necessary to reproduce the results presente

31 Nov 04, 2022
Audio2Face - Audio To Face With Python

Audio2Face Discription We create a project that transforms audio to blendshape w

FACEGOOD 724 Dec 26, 2022
Tensorflow Tutorials using Jupyter Notebook

Tensorflow Tutorials using Jupyter Notebook TensorFlow tutorials written in Python (of course) with Jupyter Notebook. Tried to explain as kindly as po

Sungjoon 2.6k Dec 22, 2022
FAIR's research platform for object detection research, implementing popular algorithms like Mask R-CNN and RetinaNet.

Detectron is deprecated. Please see detectron2, a ground-up rewrite of Detectron in PyTorch. Detectron Detectron is Facebook AI Research's software sy

Facebook Research 25.5k Jan 07, 2023
RetinaFace: Deep Face Detection Library in TensorFlow for Python

RetinaFace is a deep learning based cutting-edge facial detector for Python coming with facial landmarks.

Sefik Ilkin Serengil 512 Dec 29, 2022
Neural Oblivious Decision Ensembles

Neural Oblivious Decision Ensembles A supplementary code for anonymous ICLR 2020 submission. What does it do? It learns deep ensembles of oblivious di

25 Sep 21, 2022
OCRA (Object-Centric Recurrent Attention) source code

OCRA (Object-Centric Recurrent Attention) source code Hossein Adeli and Seoyoung Ahn Please cite this article if you find this repository useful: For

Hossein Adeli 2 Jun 18, 2022
Machine learning library for fast and efficient Gaussian mixture models

This repository contains code which implements the Stochastic Gaussian Mixture Model (S-GMM) for event-based datasets Dependencies CMake Premake4 Blaz

Omar Oubari 1 Dec 19, 2022
Reinforcement Learning for Portfolio Management

qtrader Reinforcement Learning for Portfolio Management Why Reinforcement Learning? Learns the optimal action, rather than models the market. Adaptive

Angelos Filos 406 Jan 01, 2023
Constraint-based geometry sketcher for blender

Constraint-based sketcher addon for Blender that allows to create precise 2d shapes by defining a set of geometric constraints like tangent, distance,

1.7k Dec 31, 2022
Interpretation of T cell states using reference single-cell atlases

Interpretation of T cell states using reference single-cell atlases ProjecTILs is a computational method to project scRNA-seq data into reference sing

Cancer Systems Immunology Lab 139 Jan 03, 2023
Jittor implementation of PCT:Point Cloud Transformer

PCT: Point Cloud Transformer This is a Jittor implementation of PCT: Point Cloud Transformer.

MenghaoGuo 547 Jan 03, 2023
The Incredible PyTorch: a curated list of tutorials, papers, projects, communities and more relating to PyTorch.

This is a curated list of tutorials, projects, libraries, videos, papers, books and anything related to the incredible PyTorch. Feel free to make a pu

Ritchie Ng 9.2k Jan 02, 2023
WHENet: Real-time Fine-Grained Estimation for Wide Range Head Pose

WHENet: Real-time Fine-Grained Estimation for Wide Range Head Pose Yijun Zhou and James Gregson - BMVC2020 Abstract: We present an end-to-end head-pos

368 Dec 26, 2022
QMagFace: Simple and Accurate Quality-Aware Face Recognition

Quality-Aware Face Recognition 26.11.2021 start readme QMagFace: Simple and Accurate Quality-Aware Face Recognition Research Paper Implementation - To

Philipp Terhörst 59 Jan 04, 2023
Tensorflow python implementation of "Learning High Fidelity Depths of Dressed Humans by Watching Social Media Dance Videos"

Learning High Fidelity Depths of Dressed Humans by Watching Social Media Dance Videos This repository is the official tensorflow python implementation

Yasamin Jafarian 287 Jan 06, 2023
MaskTrackRCNN for video instance segmentation based on mmdetection

MaskTrackRCNN for video instance segmentation Introduction This repo serves as the official code release of the MaskTrackRCNN model for video instance

411 Jan 05, 2023
Final project for Intro to CS class.

Financial Analysis Web App https://share.streamlit.io/mayurk1/fin-web-app-final-project/webApp.py 1. Project Description This project is a technical a

Mayur Khanna 1 Dec 10, 2021