MMDA - multimodal document analysis

Related tags

Text Data & NLPmmda
Overview

MMDA - multimodal document analysis

This is work in progress...

Setup

conda create -n mmda python=3.8
pip install -r requirements.txt

Parsers

  • SymbolScraper - Apache 2.0

    • Quoted from their README: From the main directory, issue make. This will run the Maven build system, download dependencies, etc., compile source files and generate .jar files in ./target. Finally, a bash script bin/sscraper is generated, so that the program can be easily used in different directories.

Library walkthrough

1. Creating a Document for the first time

In this example, we use the SymbolScraperParser. Each parser implements its own .parse().

import os
from mmda.parsers.symbol_scraper_parser import SymbolScraperParser
from mmda.types.document import Document

ssparser = SymbolScraperParser(sscraper_bin_path='...')
doc: Document = ssparser.parse(infile='...pdf', outdir='...', outfname='...json')

Because we provided outdir and outfname, the document is also serialized for you:

assert os.path.exists(os.path.join(outdir, outfname))

2. Loading a serialized Document

Each parser implements its own .load().

doc: Document = ssparser.load(infile=os.path.join(outdir, outfname))

3. Iterating through a Document

The minimum requirement for a Document is its .text field, which is just a .

But the usefulness of this library really is when you have multiple different ways of segmenting the .text. For example:

for page in doc.pages:
    print(f'\n=== PAGE: {page.id} ===\n\n')
    for row in page.rows:
        print(row.text)

shows two nice aspects of this library:

  • Document provides iterables for different segmentations of text. Options include pages, tokens, rows, sents, blocks. Not every Parser will provide every segmentation, though. For example, SymbolScraperParser only provides pages, tokens, rows.

  • Each one of these segments (precisely, DocSpan objects) is aware of (and can access) other segment types. For example, you can call page.rows to get all Rows that intersect a particular Page. Or you can call sent.tokens to get all Tokens that intersect a particular Sentence. Or you can call sent.block to get the Block(s) that intersect a particular Sentence. These indexes are built dynamically when the Document is created and each time a new DocSpan type is loaded. In the extreme, one can do:

for page in doc.pages:
    for block in page.blocks:
        for sent in block.sents:
            for row in sent.rows:
                for token in sent.tokens:
                    pass

4. Loading new DocSpan type

Not all Documents will have all segmentations available at creation time. You may need to load new definitions to an existing Document.

It's strongly recommended to create the full Document using a Parser.load() but if you need to build it up step by step using the DocSpan class and Document.load() method:

from mmda.types.span import Span
from mmda.types.document import Document, DocSpan, Token, Page, Row, Sent, Block

doc: Document(text='I live in New York. I read the New York Times.')
page_jsons = [{'start': 0, 'end': 46, 'id': 0}]
sent_jsons = [{'start': 0, 'end': 19, 'id': 0}, {'start': 20, 'end': 46, 'id': 1}]

pages = [
    DocSpan.from_span(span=Span.from_json(span_json=page_json), 
                      doc=doc, 
                      span_type=Page)
    for page_json in page_jsons
]
sents = [
    DocSpan.from_span(span=Span.from_json(span_json=sent_json), 
                      doc=doc, 
                      span_type=Sent)
    for sent_json in sent_jsons
]

doc.load(sents=sents, pages=pages)

assert doc.sents
assert doc.pages

5. Changing the Document

We currently don't support any nice tools for mutating the data in a Document once it's been created, aside from loading new data. Do at your own risk.

But a note -- If you're editing something (e.g. replacing some DocSpan in tokens), always call:

Document._build_span_type_to_spans()
Document._build_span_type_to_index()

to keep the indices up-to-date with your modified DocSpan.

Comments
  • VILA predictor service

    VILA predictor service

    https://github.com/allenai/scholar/issues/29184

    REST service for VILA predictors

    Any code that I didn't comment on in this PR is s2age-maker boilerplate.

    opened by rodneykinney 12
  • cleanup Annotation class: remove `uuid`, pull `id` out of `metadata`, remove `dataclasses`, add `getter/setter` for `text` and `type`, make `Metadata()` take args

    cleanup Annotation class: remove `uuid`, pull `id` out of `metadata`, remove `dataclasses`, add `getter/setter` for `text` and `type`, make `Metadata()` take args

    @soldni im tryin to migrate off dataclass, but the tests are failing at:

    ERROR tests/test_eval/test_metrics.py - TypeError: add_deprecated_field only works on dataclasses
    ERROR tests/test_internal_ai2/test_api.py - TypeError: add_deprecated_field only works on dataclasses
    ERROR tests/test_parsers/test_grobid_header_parser.py - TypeError: add_deprecated_field only works on dataclasses
    ERROR tests/test_parsers/test_override.py - TypeError: add_deprecated_field only works on dataclasses
    ERROR tests/test_parsers/test_pdf_plumber_parser.py - TypeError: add_deprecated_field only works on dataclasses
    ERROR tests/test_predictors/test_bibentry_predictor.py - TypeError: add_deprecated_field only works on dataclasses
    ERROR tests/test_predictors/test_dictionary_word_predictor.py - TypeError: add_deprecated_field only works on dataclasses
    ERROR tests/test_predictors/test_span_group_classification_predictor.py - TypeError: add_deprecated_field only works on dataclasses
    ERROR tests/test_predictors/test_vila_predictors.py - TypeError: add_deprecated_field only works on dataclasses
    ERROR tests/test_types/test_document.py - TypeError: add_deprecated_field only works on dataclasses
    ERROR tests/test_types/test_indexers.py - TypeError: add_deprecated_field only works on dataclasses
    ERROR tests/test_types/test_json_conversion.py - TypeError: add_deprecated_field only works on dataclasses
    ERROR tests/test_types/test_metadata.py - TypeError: add_deprecated_field only works on dataclasses
    ERROR tests/test_types/test_span_group.py - TypeError: add_deprecated_field only works on dataclasses
    

    not sure how best to handle

    opened by kyleclo 7
  • Add attributes to API data classes

    Add attributes to API data classes

    This PR adds metadata to API data classes.

    API data classes and mmda types differ in a few significant aspects:

    • id and type (and text for SpanGroup) are stored in metadata for mmda types; in the APIs, they are part of the top-level attributes.
    • metadata can store arbitrary content in the mmda types; in the data API, all attributes that are not explicitly declared are dropped.
      • we expect applications that require specific fields to declare custom Metadata and SpanGroup/BoxGroup classes that inherit from their parent class. For an example, see tests/test_internal_ai2/test_api.py
      • metadata entries are mapped to an attributes field to match how data is stored in the Annotation Store
    opened by soldni 4
  • Egork/merge spans

    Egork/merge spans

    Adding a class to the utils for merging spans with optional parameter of x, y. x, y are used as a distance added to the boundaries of the boxes to decided if they overlap.

    Here is an example of tokens which are represented by list of spans, the task is to merge them into a single span and box image

    Result of merging tokens with x=0.04387334, y=0.01421097 being average size of the token in the document image

    Another example with same x=0.04387334, y=0.01421097

    image image

    opened by comorado 4
  • Bib Entry Parser/Predictor

    Bib Entry Parser/Predictor

    https://github.com/allenai/scholar/issues/32461

    Pretty standard model and interface implementation. You can see the result on http://bibentry-predictor.v0.dev.models.s2.allenai.org/ or

    curl -X 'POST' \
      'http://bibentry-predictor.v0.dev.models.s2.allenai.org/invocations' \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
      -d '{
      "instances": [
        {
          "bib_entry": "[4] Wei Zhuo, Qianyi Zhan, Yuan Liu, Zhenping Xie, and Jing Lu. Context attention heterogeneous network embed- ding. Computational Intelligence and Neuroscience , 2019. doi: 10.1155/2019/8106073."
        }
      ]
    }'
    

    Tests: integration test passed and dev deployment works

    TODO: release as version 0.0.10 after merge

    opened by stefanc-ai2 4
  • Dockerized pipeline

    Dockerized pipeline

    Pattern for wrapping services in a lighter-weight containers.

    • Replace full sub-projects in the services directory with a Dockerfile plus a xxx_api.py file for each service.
    • Add a docker-compose file that build the services plus a python container for running a REPL or scripts
    • Add pipeline.py sample end-to-end pipeline script using the services
    • Add run-pipeline.sh file to run pipeline

    @rauthur

    opened by rodneykinney 4
  • MMDA predictor evaluation

    MMDA predictor evaluation

    • [x] Grobid implemented as a Parser
    • [x] Script to obtain S2-VLUE (check w/ shannons on location)
    • [x] Script to run S2-VLUE through an mmda.Predictor & obtain evaluation metrics
    • [x] Definition/implementation of end-to-end evaluation metrics.
    S2-VLUE evaluation looks like this:
    [('VILA', 'title'), ('a', 'title'), ('new', 'title')...] 
    and evaluation associated with it in VILA paper is token-level F1.
    
    But if we want to compare against GROBID or other systsems that use different parsers (i.e. different ._symbols and .tokens), what we want for evaluation is
    {'title': 'VILA a new...', ...}
    
    Our evaluation metric is based on string match / edit-distance metrics.  
    

    Focus for now on title and abstract. Other S2-VLUE classes may not even exist in Grobid or the other tools we want to compare against. If we have time, other types of content-types we'd want are:

    • Section names
    • Author names
    • Bibliographies (split out; optionally, also-parsed)
    • Body text
    • Captions
    • Footnotes
    • Tables/Figures
    opened by kyleclo 4
  • Speed up vila pre-processing

    Speed up vila pre-processing

    From earlier testing, I remember that convert_document_page_to_pdf_dict takes a significant fraction of the total prediction time for vila. Here's a simple change that does all the work in a single iteration over tokens instead of multiple list comprehensions. I tested this on some production PDFs and saw a 3x speed-up.

    @cmwilhelm @yoganandc

    https://github.com/allenai/scholar/issues/32695

    opened by rodneykinney 3
  • Kylel/2022 09/span group utils

    Kylel/2022 09/span group utils

    minor PR: added tests for a pretty important functionality that was being used -- how to combine Spans that are next to each other into a single big Span. the key thing that was undocumented was that Boxes for the underlying Spans actually disappear after this merging functionality, which the tests now capture.

    dont think this is intended behavior we want to support in future, but for now, this is how this utility is being used

    opened by kyleclo 2
  • `Document._annotate_box_group` returns empty SpanGroups

    `Document._annotate_box_group` returns empty SpanGroups

    Very bizarre bug I've encounter when trying to annotate a document with blocks from layout parser. To reproduce, run following code:

    from mmda.parsers.pdfplumber_parser import PDFPlumberParser
    from mmda.rasterizers.rasterizer import PDF2ImageRasterizer
    from mmda.predictors.lp_predictors import LayoutParserPredictor
    
    import torch
    import warnings
    from cached_path import cached_path        # must install via `pip install cached_path`
    
    
    pdfplumber_parser = PDFPlumberParser()
    rasterizer = PDF2ImageRasterizer()
    layout_predictor = LayoutParserPredictor.from_pretrained(
        "lp://efficientdet/PubLayNet"
    )
    
    path = str(cached_path('https://arxiv.org/pdf/2110.08536.pdf'))
    doc = pdfplumber_parser.parse(path)
    images = rasterizer.rasterize(input_pdf_path=path, dpi=72)
    doc.annotate_images(images)
    
    with torch.no_grad(), warnings.catch_warnings():
        layout_regions = layout_predictor.predict(doc)
        doc.annotate(blocks=layout_regions)
    
    # these asserts should fail
    assert doc.blocks[0].spans == []
    assert doc.blocks[0].tokens == []
    

    I've done a bit of poking around and it seems to stem from the following snipped of code in mmda/types/document.py:

    derived_span_groups = sorted(
        derived_span_groups, key=lambda span_group: span_group.start
    )
    

    In particular, it seems like the spans attribute for each SpanGroup gets emptied after sorting.

    No clue what would be causing this, but perhaps there's an explanation?

    opened by soldni 2
  • VILA models crashing when bounding boxes are not int

    VILA models crashing when bounding boxes are not int

    VILA models crashing when bounding boxes are not int

    Because of changes in #69 , bounding boxes are now float instead of int, which VILA does not like:

      File "/Users/lucas/miniforge3/envs/pdod/lib/python3.10/site-packages/mmda/predictors/hf_predictors/vila_predictor.py", line 161, in predict
        model_outputs = self.model(**self.model_input_collator(model_inputs))
      File "/Users/lucas/miniforge3/envs/pdod/lib/python3.10/site-packages/mmda/predictors/hf_predictors/vila_predictor.py", line 178, in model_input_collator
        return {
      File "/Users/lucas/miniforge3/envs/pdod/lib/python3.10/site-packages/mmda/predictors/hf_predictors/vila_predictor.py", line 179, in <dictcomp>
        key: torch.tensor(val, dtype=torch.int64, device=self.device)
    TypeError: 'float' object cannot be interpreted as an integer
    

    This PR adds an explicit cast operation to get around this issue during pre-processing.

    opened by soldni 2
  • Bib predictor index error bug fix

    Bib predictor index error bug fix

    attempt at fix for part 1 of https://github.com/allenai/scholar/issues/34858 I think we can work around the index error that keeps popping up this way.

    tt verify integration test passes

    next steps:

    • [ ] tt push
    • [ ] update timo-services config for bib-predictor to use new code which will trigger new deployment
    opened by geli-gel 4
  • Add fontinfo to tokens without requiring word split

    Add fontinfo to tokens without requiring word split

    This PR appends font information (font name and size) as metadata to 'tokens' on a Document without requiring tokens to be split on that information (i.e., "best" effort if a token contains many font names or sizes). The code subclasses the WordExtractor provided by PDFPlumber.

    Currently, in a default configuration of PDFPlumberParser, tokens are already extracted with font name and size information (although it is discarded and only used for splitting). This could be added to the metadata as-is, however, the method used is the extra_attrs argument of PDFPlumber's extract_words which forces token splitting if font name and size does not match. I believe this is a bad default and further is not required (users can override this argument). The approach here guarantees this metadata will be captured.

    Re: "best effort" in case of many name/size options for a token: I have maintained the logic of just taking the font name and size from the first character in the token. Another approach provides the min, max and average font sizes (or others) as well as a set of font names. This could be provided in the future or by adapting the append_attrs argument. For current use cases (section nesting prediction) this approach is sufficient.

    opened by rauthur 0
  • Incomplete sentences in README

    Incomplete sentences in README

    I was going through the readme and noticed a couple sentences that start, but don't end. Since I'm new to the project, I don't know how to finish the sentences myself.

    Screen Shot 2022-11-22 at 8 57 37 AM Screen Shot 2022-11-22 at 8 57 17 AM documentation 
    opened by dmh43 0
  • cleanup JSON conversion for all data types

    cleanup JSON conversion for all data types

    Noticed JSON serialization had inconsistent behavior across various data types, especially in cases where certain fields were empty or None.

    This PR adds a set of comprehensive tests in tests/test_types/test_json_conversion.py that documents these behaviors. PR also includes resolving inconsistencies. For example, now Metadata that's attached to a SpanGroup or BoxGroup won't get accidentally serialized as an empty dictionary.

    opened by kyleclo 0
  • adding relations

    adding relations

    This PR extends this library functionality substantially -- Adding a new Annotation type called Relation. A Relation is a link between 2 annotations (e.g. a Citation linked to its Bib Entry). The input Annotations are called key and value.

    A few things needed to change to support Relations:

    Annotation Names

    Relations store references to Annotation objects. But we didn't want Relation.to_json() to also .to_json() those objects. We only want to store minimal identifiers of the key and value. Something short like bib_entry-5 or sentence-13. We call these short strings names.

    To do this, we added to Annotation class, an optional attribute called field: str which stores this name. It's automatically populated when you run Document.annotate(new_field = list_of_annotations); each of those input annotations will have the new field name stored under .field.

    We also added a method name() that returns the name of a particular Annotation object that is unique at the document-level. Names are a minimal class that basically stores .field and .id.

    In short, now after you annotate a Document with annotations, you can do stuff like:

    doc.tokens[15].name   ==   AnnotationName(field='tokens', id=15)
    str(annotation_name)  ==   'tokens-15'
    AnnotationName.from_str('tokens-15')  ==  AnnotationName(field='tokens', id=15)
    

    Lookups based on names

    To support reconstructing a Relation object given the names of key and value, we need the ability to lookup those involved Annotations. We introduce a new method to enable this:

    annotation_name = AnnotationName.from_str('paragraphs-99')
    a = document.locate_annotation( annotation_name )   -->  returns the specific Annotation object
    assert a.id == 99
    assert a.field == 'paragraphs'
    

    to and from JSON

    Finally, we need some way to serializing to JSON and reconstructing from JSON. For serialization, now that we have Names, this makes the JSON quite minimal:

    {'key': <name_of_key>, 'value': <name_of_value>, ...other stuff that all Annotation objects have,  like Metadata...}
    

    Reconstructing a Relation from JSON is more tricky because it's meaningless without a Document object. The Document object must also store the specific Annotations correctly so we can correctly perform the lookup based on these Names.

    The API for this is similar, but you must also pass in the Document object:

    relation = Relation.from_json(my_relation_dict, my_document_containing_necessary_fields)
    
    opened by kyleclo 2
Releases(0.2.7)
Code for ACL 2021 main conference paper "Conversations are not Flat: Modeling the Intrinsic Information Flow between Dialogue Utterances".

Conversations are not Flat: Modeling the Intrinsic Information Flow between Dialogue Utterances This repository contains the code and pre-trained mode

ICTNLP 90 Dec 27, 2022
A Survey of Natural Language Generation in Task-Oriented Dialogue System (TOD): Recent Advances and New Frontiers

A Survey of Natural Language Generation in Task-Oriented Dialogue System (TOD): Recent Advances and New Frontiers

Libo Qin 132 Nov 25, 2022
Idea is to build a model which will take keywords as inputs and generate sentences as outputs.

keytotext Idea is to build a model which will take keywords as inputs and generate sentences as outputs. Potential use case can include: Marketing Sea

Gagan Bhatia 364 Jan 03, 2023
Bot to connect a real Telegram user, simulating responses with OpenAI's davinci GPT-3 model.

AI-BOT Bot to connect a real Telegram user, simulating responses with OpenAI's davinci GPT-3 model.

Thempra 2 Dec 21, 2022
NLP Text Classification

多标签文本分类任务 近年来随着深度学习的发展,模型参数的数量飞速增长。为了训练这些参数,需要更大的数据集来避免过拟合。然而,对于大部分NLP任务来说,构建大规模的标注数据集非常困难(成本过高),特别是对于句法和语义相关的任务。相比之下,大规模的未标注语料库的构建则相对容易。为了利用这些数据,我们可以

Jason 1 Nov 11, 2021
Code for "Generating Disentangled Arguments with Prompts: a Simple Event Extraction Framework that Works"

GDAP The code of paper "Code for "Generating Disentangled Arguments with Prompts: a Simple Event Extraction Framework that Works"" Event Datasets Prep

45 Oct 29, 2022
基于pytorch_rnn的古诗词生成

pytorch_peot_rnn 基于pytorch_rnn的古诗词生成 说明 config.py里面含有训练、测试、预测的参数,更改后运行: python main.py 预测结果 if config.do_predict: result = trainer.generate('丽日照残春')

西西嘛呦 3 May 26, 2022
A Multi-modal Model Chinese Spell Checker Released on ACL2021.

ReaLiSe ReaLiSe is a multi-modal Chinese spell checking model. This the office code for the paper Read, Listen, and See: Leveraging Multimodal Informa

DaDa 106 Dec 29, 2022
This is the source code of RPG (Reward-Randomized Policy Gradient)

RPG (Reward-Randomized Policy Gradient) Zhenggang Tang*, Chao Yu*, Boyuan Chen, Huazhe Xu, Xiaolong Wang, Fei Fang, Simon Shaolei Du, Yu Wang, Yi Wu (

40 Nov 25, 2022
LightSpeech: Lightweight and Fast Text to Speech with Neural Architecture Search

LightSpeech UnOfficial PyTorch implementation of LightSpeech: Lightweight and Fast Text to Speech with Neural Architecture Search.

Rishikesh (ऋषिकेश) 54 Dec 03, 2022
Recognition of 38 speech commands in russian. Based on Yandex Cup 2021 ML Challenge: ASR

Speech_38_ru_commands Recognition of 38 speech commands in russian. Based on Yandex Cup 2021 ML Challenge: ASR Программа умеет распознавать 38 ключевы

Andrey 9 May 05, 2022
nlpcommon is a python Open Source Toolkit for text classification.

nlpcommon nlpcommon, Python Text Tool. Guide Feature Install Usage Dataset Contact Cite Reference Feature nlpcommon is a python Open Source

xuming 3 May 29, 2022
ElasticBERT: A pre-trained model with multi-exit transformer architecture.

This repository contains finetuning code and checkpoints for ElasticBERT. Towards Efficient NLP: A Standard Evaluation and A Strong Baseli

fastNLP 48 Dec 14, 2022
Natural Language Processing Best Practices & Examples

NLP Best Practices In recent years, natural language processing (NLP) has seen quick growth in quality and usability, and this has helped to drive bus

Microsoft 6.1k Dec 31, 2022
Partially offline multi-language translator built upon Huggingface transformers.

Translate Command-line interface to translation pipelines, powered by Huggingface transformers. This tool can download translation models, and then us

Richard Jarry 8 Oct 25, 2022
Named-entity recognition using neural networks. Easy-to-use and state-of-the-art results.

NeuroNER NeuroNER is a program that performs named-entity recognition (NER). Website: neuroner.com. This page gives step-by-step instructions to insta

Franck Dernoncourt 1.6k Dec 27, 2022
We have built a Voice based Personal Assistant for people to access files hands free in their device using natural language processing.

Voice Based Personal Assistant We have built a Voice based Personal Assistant for people to access files hands free in their device using natural lang

Rushabh 2 Nov 13, 2021
Chatbot with Pytorch, Python & Nextjs

Installation Instructions Make sure that you have Python 3, gcc, venv, and pip installed. Clone the repository $ git clone https://github.com/sahr

Rohit Sah 0 Dec 11, 2022
原神抽卡记录数据集-Genshin Impact gacha data

提要 持续收集原神抽卡记录中 可以使用抽卡记录导出工具导出抽卡记录的json,将json文件发送至[email protected],我会在清除个人信息后

117 Dec 27, 2022
fastNLP: A Modularized and Extensible NLP Framework. Currently still in incubation.

fastNLP fastNLP是一款轻量级的自然语言处理(NLP)工具包,目标是快速实现NLP任务以及构建复杂模型。 fastNLP具有如下的特性: 统一的Tabular式数据容器,简化数据预处理过程; 内置多种数据集的Loader和Pipe,省去预处理代码; 各种方便的NLP工具,例如Embedd

fastNLP 2.8k Jan 01, 2023