# MMDA - Multimodal Document Analysis
This is work in progress...
## Setup
```bash
conda create -n mmda python=3.8
pip install -r requirements.txt
```
## Parsers
- SymbolScraper (Apache 2.0)

  Quoted from their README: *From the main directory, issue `make`. This will run the Maven build system, download dependencies, etc., compile source files and generate `.jar` files in `./target`. Finally, a bash script `bin/sscraper` is generated, so that the program can be easily used in different directories.*
## Library walkthrough
### 1. Creating a Document for the first time
In this example, we use the `SymbolScraperParser`. Each parser implements its own `.parse()`.
```python
import os
from mmda.parsers.symbol_scraper_parser import SymbolScraperParser
from mmda.types.document import Document

ssparser = SymbolScraperParser(sscraper_bin_path='...')
doc: Document = ssparser.parse(infile='...pdf', outdir='...', outfname='...json')
```
Because we provided `outdir` and `outfname`, the document is also serialized for you:
```python
assert os.path.exists(os.path.join(outdir, outfname))
```
### 2. Loading a serialized Document

Each parser implements its own `.load()`.

```python
doc: Document = ssparser.load(infile=os.path.join(outdir, outfname))
```
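Conceptually, `.parse(..., outdir=..., outfname=...)` writes the document to a JSON file that `.load()` later reads back. A minimal, self-contained sketch of that round-trip pattern in plain Python (the schema here is hypothetical and only illustrates the idea, not mmda's actual serialization format):

```python
import json
import os
import tempfile

# Hypothetical serialized document: text plus one segmentation.
doc_json = {"text": "I live in New York.", "pages": [{"start": 0, "end": 19, "id": 0}]}

with tempfile.TemporaryDirectory() as outdir:
    path = os.path.join(outdir, "doc.json")
    with open(path, "w") as f:
        json.dump(doc_json, f)       # conceptually, what .parse() does with outdir/outfname
    with open(path) as f:
        loaded = json.load(f)        # conceptually, what .load(infile=...) does

assert loaded == doc_json            # round-trip preserves the document
```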
### 3. Iterating through a Document
The minimum requirement for a `Document` is its `.text` field, which is just a string.

But the real usefulness of this library comes when you have multiple different ways of segmenting `.text`. For example:
```python
for page in doc.pages:
    print(f'\n=== PAGE: {page.id} ===\n\n')
    for row in page.rows:
        print(row.text)
```
This snippet shows two nice aspects of this library:

- `Document` provides iterables for different segmentations of `text`. Options include `pages`, `tokens`, `rows`, `sents`, and `blocks`. Not every parser will provide every segmentation, though. For example, `SymbolScraperParser` only provides `pages`, `tokens`, and `rows`.
- Each of these segments (precisely, `DocSpan` objects) is aware of (and can access) other segment types. For example, you can call `page.rows` to get all Rows that intersect a particular Page. Or you can call `sent.tokens` to get all Tokens that intersect a particular Sentence. Or you can call `sent.block` to get the Block(s) that intersect a particular Sentence. These indexes are built dynamically when the `Document` is created and each time a new `DocSpan` type is loaded. In the extreme, one can do:
```python
for page in doc.pages:
    for block in page.blocks:
        for sent in block.sents:
            for row in sent.rows:
                for token in sent.tokens:
                    pass
```
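The cross-type lookups above boil down to span intersection: two segments are related when their character spans overlap. A minimal sketch of that idea in plain Python (independent of mmda; the class and function names here are illustrative, not mmda's actual internals):

```python
from dataclasses import dataclass


@dataclass
class Span:
    start: int
    end: int

    def intersects(self, other: "Span") -> bool:
        # Two half-open [start, end) spans overlap iff each starts before the other ends.
        return self.start < other.end and other.start < self.end


def find_intersecting(query: Span, candidates: list) -> list:
    """Return every candidate span that overlaps the query span."""
    return [c for c in candidates if query.intersects(c)]


page = Span(0, 46)                    # one page covering the whole text
sents = [Span(0, 19), Span(20, 46)]   # two sentences within it

# Analogous to page.sents: both sentences intersect the page.
assert find_intersecting(page, sents) == sents
```

mmda builds (and caches) these intersection lookups as indexes at `Document` creation time, so that `page.rows`, `sent.tokens`, etc. don't re-scan the whole document on every access.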
### 4. Loading a new DocSpan type
Not all `Document`s will have every segmentation available at creation time. You may need to load new definitions into an existing `Document`.

It's strongly recommended to create the full `Document` using a `Parser.load()`, but if you need to build it up step by step, use the `DocSpan` class and the `Document.load()` method:
```python
from mmda.types.span import Span
from mmda.types.document import Document, DocSpan, Token, Page, Row, Sent, Block

doc = Document(text='I live in New York. I read the New York Times.')
page_jsons = [{'start': 0, 'end': 46, 'id': 0}]
sent_jsons = [{'start': 0, 'end': 19, 'id': 0}, {'start': 20, 'end': 46, 'id': 1}]

pages = [
    DocSpan.from_span(span=Span.from_json(span_json=page_json),
                      doc=doc,
                      span_type=Page)
    for page_json in page_jsons
]
sents = [
    DocSpan.from_span(span=Span.from_json(span_json=sent_json),
                      doc=doc,
                      span_type=Sent)
    for sent_json in sent_jsons
]

doc.load(sents=sents, pages=pages)
assert doc.sents
assert doc.pages
```
### 5. Changing the Document
We currently don't provide any tools for mutating the data in a `Document` once it's been created, aside from loading new data. Do so at your own risk.

But a note: if you're editing something (e.g. replacing some `DocSpan` in `tokens`), always call:

```python
Document._build_span_type_to_spans()
Document._build_span_type_to_index()
```

to keep the indices up to date with your modified `DocSpan`.
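A toy illustration of why the rebuild step matters (plain Python, not mmda's actual internals — the class here is invented for the example): when lookups are served from a cached index, edits to the underlying data are invisible until the index is rebuilt.

```python
class TinyDoc:
    """Toy stand-in for a Document that caches a lookup index over its tokens."""

    def __init__(self, tokens):
        self.tokens = tokens
        self._index = {}
        self._build_index()

    def _build_index(self):
        # Map token text -> positions; analogous in spirit to mmda's span indexes.
        self._index = {}
        for i, tok in enumerate(self.tokens):
            self._index.setdefault(tok, []).append(i)

    def positions(self, tok):
        return self._index.get(tok, [])


doc = TinyDoc(["New", "York"])
doc.tokens.append("Times")              # mutate the underlying data...
assert doc.positions("Times") == []     # ...but the cached index is now stale
doc._build_index()                      # analogous to the Document._build_* calls above
assert doc.positions("Times") == [2]    # index reflects the edit again
```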


