BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents

Related tags

Text Data & NLPbros
Overview

BROS

Introduction

BROS (BERT Relying On Spatiality) is a pre-trained language model focusing on text and layout for better key information extraction from documents. Given the OCR results of the document image, which are text and bounding box pairs, it can perform various key information extraction tasks, such as extracting an ordered item list from receipts. For more details, please refer to our paper:

BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents
Teakgyu Hong, Donghyun Kim, Mingi Ji, Wonseok Hwang, Daehyun Nam, Sungrae Park
AAAI 2022 (to appear)

Pre-trained models

name # params Hugging Face - Models
bros-base-uncased < 110M naver-clova-ocr/bros-base-uncased
bros-large-uncased < 340M naver-clova-ocr/bros-large-uncased

Model usage

The example code below is written with reference to LayoutLM.

import torch
from bros import BrosTokenizer, BrosModel


tokenizer = BrosTokenizer.from_pretrained("naver-clova-ocr/bros-base-uncased")
model = BrosModel.from_pretrained("naver-clova-ocr/bros-base-uncased")


width, height = 1280, 720

words = ["to", "the", "moon!"]
quads = [
    [638, 451, 863, 451, 863, 569, 638, 569],
    [877, 453, 1190, 455, 1190, 568, 876, 567],
    [632, 566, 1107, 566, 1107, 691, 632, 691],
]

bbox = []
for word, quad in zip(words, quads):
    n_word_tokens = len(tokenizer.tokenize(word))
    bbox.extend([quad] * n_word_tokens)

cls_quad = [0.0] * 8
sep_quad = [width, height] * 4
bbox = [cls_quad] + bbox + [sep_quad]

encoding = tokenizer(" ".join(words), return_tensors="pt")
input_ids = encoding["input_ids"]
attention_mask = encoding["attention_mask"]

bbox = torch.tensor([bbox])
bbox[:, :, [0, 2, 4, 6]] = bbox[:, :, [0, 2, 4, 6]] / width
bbox[:, :, [1, 3, 5, 7]] = bbox[:, :, [1, 3, 5, 7]] / height

outputs = model(input_ids=input_ids, bbox=bbox, attention_mask=attention_mask)
last_hidden_state = outputs.last_hidden_state

print("- last_hidden_state")
print(last_hidden_state)
print()
print("- last_hidden_state.shape")
print(last_hidden_state.shape)

Result

- last_hidden_state
tensor([[[-0.0342,  0.2487, -0.2819,  ...,  0.1495,  0.0218,  0.0484],
         [ 0.0792, -0.0040, -0.0127,  ..., -0.0918,  0.0810,  0.0419],
         [ 0.0808, -0.0918,  0.0199,  ..., -0.0566,  0.0869, -0.1859],
         [ 0.0862,  0.0901,  0.0473,  ..., -0.1328,  0.0300, -0.1613],
         [-0.2925,  0.2539,  0.1348,  ...,  0.1988, -0.0148, -0.0982],
         [-0.4160,  0.2135, -0.0390,  ...,  0.6908, -0.2985,  0.1847]]],
       grad_fn=
   
    )

- last_hidden_state.shape
torch.Size([1, 6, 768])

   

Fine-tuning examples

Please refer to docs/finetuning_examples.md.

Acknowledgements

We referenced the code of LayoutLM when implementing BROS in the form of Hugging Face - transformers.
In this repository, we used two public benchmark datasets, FUNSD and SROIE.

License

Copyright 2022-present NAVER Corp.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
Owner
Clova AI Research
Open source repository of Clova AI Research, NAVER & LINE
Clova AI Research
API for the GPT-J language model 🦜. Including a FastAPI backend and a streamlit frontend

gpt-j-api 🦜 An API to interact with the GPT-J language model. You can use and test the model in two different ways: Streamlit web app at http://api.v

Víctor Gallego 276 Dec 31, 2022
Codename generator using WordNet parts of speech database

codenames Codename generator using WordNet parts of speech database References: https://possiblywrong.wordpress.com/2021/09/13/code-name-generator/ ht

possiblywrong 27 Oct 30, 2022
DataCLUE: 国内首个以数据为中心的AI测评(含模型分析报告)

DataCLUE 以数据为中心的AI测评(DataCLUE) DataCLUE: A Chinese Data-centric Language Evaluation Benchmark 内容导引 章节 描述 简介 介绍以数据为中心的AI测评(DataCLUE)的背景 任务描述 任务描述 实验结果

CLUE benchmark 135 Dec 22, 2022
EasyTransfer is designed to make the development of transfer learning in NLP applications easier.

EasyTransfer is designed to make the development of transfer learning in NLP applications easier. The literature has witnessed the success of applying

Alibaba 819 Jan 03, 2023
Named-entity recognition using neural networks. Easy-to-use and state-of-the-art results.

NeuroNER NeuroNER is a program that performs named-entity recognition (NER). Website: neuroner.com. This page gives step-by-step instructions to insta

Franck Dernoncourt 1.6k Dec 27, 2022
Deep learning for NLP crash course at ABBYY.

Deep NLP Course at ABBYY Deep learning for NLP crash course at ABBYY. Suggested textbook: Neural Network Methods in Natural Language Processing by Yoa

Dan Anastasyev 597 Dec 18, 2022
The official repository of the ISBI 2022 KNIGHT Challenge

KNIGHT The official repository holding the data for the ISBI 2022 KNIGHT Challenge About The KNIGHT Challenge asks teams to develop models to classify

Nicholas Heller 4 Jan 22, 2022
Extract rooms type, door, neibour rooms, rooms corners nad bounding boxes, and generate graph from rplan dataset

Housegan-data-reader House-GAN++ (data-reader) Code and instructions for converting rplan dataset (raster images) to housegan++ data format. House-GAN

Sepid Hosseini 13 Nov 24, 2022
Ukrainian TTS (text-to-speech) using Coqui TTS

title emoji colorFrom colorTo sdk app_file pinned Ukrainian TTS 🐸 green green gradio app.py false Ukrainian TTS 📢 🤖 Ukrainian TTS (text-to-speech)

Yurii Paniv 85 Dec 26, 2022
This repository contains all the source code that is needed for the project : An Efficient Pipeline For Bloom’s Taxonomy Using Natural Language Processing and Deep Learning

Pipeline For NLP with Bloom's Taxonomy Using Improved Question Classification and Question Generation using Deep Learning This repository contains all

Rohan Mathur 9 Jul 17, 2021
A python package to fine-tune transformer-based models for named entity recognition (NER).

nerblackbox A python package to fine-tune transformer-based language models for named entity recognition (NER). Resources Source Code: https://github.

Felix Stollenwerk 13 Jul 30, 2022
The source code of HeCo

HeCo This repo is for source code of KDD 2021 paper "Self-supervised Heterogeneous Graph Neural Network with Co-contrastive Learning". Paper Link: htt

Nian Liu 106 Dec 27, 2022
:id: A python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution.

Dedupe Python Library dedupe is a python library that uses machine learning to perform fuzzy matching, deduplication and entity resolution quickly on

Dedupe.io 3.6k Jan 02, 2023
Visual Automata is a Python 3 library built as a wrapper for Caleb Evans' Automata library to add more visualization features.

Visual Automata Copyright 2021 Lewi Lie Uberg Released under the MIT license Visual Automata is a Python 3 library built as a wrapper for Caleb Evans'

Lewi Uberg 55 Nov 17, 2022
GPT-2 Model for Leetcode Questions in python

Leetcode using AI 🤖 GPT-2 Model for Leetcode Questions in python New demo here: https://huggingface.co/spaces/gagan3012/project-code-py Note: the Ans

Gagan Bhatia 100 Dec 12, 2022
Resources for "Natural Language Processing" Coursera course.

Natural Language Processing course resources This github contains practical assignments for Natural Language Processing course by Higher School of Eco

Advanced Machine Learning specialisation by HSE 1.1k Jan 01, 2023
Code for Discovering Topics in Long-tailed Corpora with Causal Intervention.

Code for Discovering Topics in Long-tailed Corpora with Causal Intervention ACL2021 Findings Usage 0. Prepare environment Requirements: python==3.6 te

Xiaobao Wu 8 Dec 16, 2022
FactSumm: Factual Consistency Scorer for Abstractive Summarization

FactSumm: Factual Consistency Scorer for Abstractive Summarization FactSumm is a toolkit that scores Factualy Consistency for Abstract Summarization W

devfon 83 Jan 09, 2023
Python bindings to the dutch NLP tool Frog (pos tagger, lemmatiser, NER tagger, morphological analysis, shallow parser, dependency parser)

Frog for Python This is a Python binding to the Natural Language Processing suite Frog. Frog is intended for Dutch and performs part-of-speech tagging

Maarten van Gompel 46 Dec 14, 2022
AllenNLP integration for Shiba: Japanese CANINE model

Allennlp Integration for Shiba allennlp-shiab-model is a Python library that provides AllenNLP integration for shiba-model. SHIBA is an approximate re

Shunsuke KITADA 12 Feb 16, 2022