A knowledge base construction engine for richly formatted data

Overview

Fonduer

Fonduer is a Python package and framework for building knowledge base construction (KBC) applications from richly formatted data.

Note that Fonduer is still actively under development, so feedback and contributions are welcome. Submit bugs in the Issues section or feel free to submit your contributions as a pull request.

Getting Started

Check out our Getting Started Guide to get up and running with Fonduer.

Learning how to use Fonduer

The Fonduer tutorials cover the Fonduer workflow, showing how to extract relations from hardware datasheets and scientific literature.

Reference

Fonduer: Knowledge Base Construction from Richly Formatted Data (blog):

@inproceedings{wu2018fonduer,
  title={Fonduer: Knowledge Base Construction from Richly Formatted Data},
  author={Wu, Sen and Hsiao, Luke and Cheng, Xiao and Hancock, Braden and Rekatsinas, Theodoros and Levis, Philip and R{\'e}, Christopher},
  booktitle={Proceedings of the 2018 International Conference on Management of Data},
  pages={1301--1316},
  year={2018},
  organization={ACM}
}

Acknowledgements

Fonduer leverages the work of Emmental and Snorkel.

Comments
  • Using candidates for prediction (Fonduer Prediction Pipeline)

    Scenario:

    For my use case I have a set of financial documents.

    The entire document set is divided into train, dev, and test. The documents are parsed, and the mentions and candidates are extracted with some rules.

    The featurized training candidates are used to train a Fonduer learning model, and the model is used to predict on the test candidates, following the normal Fonduer pipeline demonstrated in the hardware tutorial.

    Problems & Questions

    1. Is the Fonduer prediction pipeline production-ready? How can we fine-tune it to achieve better accuracy? Should the main focus be on the quality of the extracted mentions?

    In my initial analysis, following the hardware tutorial, I could not obtain good results.

    2. Can we separate the training and test pipelines?

    In the current scenario, whenever I feed in a new document for prediction, the entire corpus has to be parsed again to extract the mentions and candidates and to store the feature keys.

    Please correct me if that is not the case, and help me with a snippet that shows the separation.

    opened by atulgupta9 16
  • Add multiline Japanese strings support to HocrVisualParser() to fix #534 and redo #537

    Description of the problems or issues

    Is your pull request related to a problem? Please describe. See #534. This request redoes #537, which first required fixing #538 (fixed by #539).

    Does your pull request fix any issue. See #534

    Description of the proposed changes

    For multi-line Japanese strings such as 'AAAA\nBBBB', spacy[ja] sometimes generates tokens that span the line break, e.g. ['AAA', 'AB', 'B', 'BB']. This proposal defines the bbox of 'AB' as a multi-line word (i.e., left is the minimum left of ['A', 'B'], top is the top of 'A', right is the maximum right of ['A', 'B'], and bottom is the bottom of 'B').
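The bounding-box rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated, not the code in HocrVisualParser; the `Box` class and `merge_multiline` function are hypothetical names, and the coordinates are made up.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in page coordinates (y grows downward)."""

    left: int
    top: int
    right: int
    bottom: int


def merge_multiline(chars: Sequence[Box]) -> Box:
    """Merge character boxes, given in reading order, into one word box."""
    if not chars:
        raise ValueError("need at least one character box")
    return Box(
        left=min(c.left for c in chars),    # minimum left over all characters
        top=chars[0].top,                   # top of the first character ('A')
        right=max(c.right for c in chars),  # maximum right over all characters
        bottom=chars[-1].bottom,            # bottom of the last character ('B')
    )


# 'A' ends the first line and 'B' starts the second line.
a = Box(left=90, top=10, right=100, bottom=20)
b = Box(left=10, top=25, right=20, bottom=35)
word = merge_multiline([a, b])
```

The merged box covers both lines, so it is wider and taller than either character box on its own.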

    Test plan

    This is caused by Japanese morphological analysis, so I have added Japanese test data to 'tests/data/hocr_simple/japan.hocr' and test code to 'tests/parser/test_parser.py::test_parse_hocr'.

    Checklist

    • [x] I have updated the documentation accordingly.
    • [x] I have added tests to cover my changes.
    • [x] All new and existing tests passed.
    • [x] I have updated the CHANGELOG.rst accordingly.
    opened by YasushiMiyata 15
  • parser.apply does not return for a long time even though the progress bar indicates it finishes parsing

    Description of the bug

    This is not a bug, but a performance issue. It is not noticeable when parsing a small number of documents, but with many documents parser.apply does not return even though the progress bar indicated that parsing finished a long time ago (1 hour or more).

    To Reproduce

    Steps to reproduce the behavior:

    1. Parse many documents (my case: ~2500)

    Expected behavior

    parser.apply returns when the progress bar indicates it finished parsing all the documents.

    Environment (please complete the following information)

    • OS: Debian Buster
    • PostgreSQL Version: 12.1
    • Poppler Utils Version: N/A
    • Fonduer Version: 0.8.3+dev (01e0d9319b523aff7aa7f5c583a9f330b0705ecc)

    bug 
    opened by HiromuHota 14
  • Execute preprocessing and parsing in parallel

    Description of the problems or issues

    Is your pull request related to a problem? Please describe.

    Currently, the preprocessor and the parser are executed strictly sequentially, i.e., preprocess N docs (and load them into a queue), then parse N docs. This has two drawbacks:

    1. the progress bar shows nothing during preprocessing.
    2. the machine RAM has to be large enough to hold N preprocessed docs at a time.

    They become more serious when N is large and/or each HTML file is large.

    Does your pull request fix any issue.

    Fix #435

    Description of the proposed changes

    This PR

    • places a cap on in_queue so that only a limited number of documents are loaded into it at a time.
    • executes the preprocessor and the parser in parallel (i.e., the main process does preprocessing while child process(es) do parsing).
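The two changes listed above amount to a bounded producer-consumer queue. The sketch below illustrates the idea with threads and `queue.Queue` from the standard library; it is not Fonduer's implementation, which uses processes, and `preprocess`, `parse_worker`, `run`, and the `cap` parameter are hypothetical names.

```python
import queue
import threading

SENTINEL = None  # stop marker; real documents are never None


def preprocess(names):
    """Hypothetical preprocessor: yields one document at a time."""
    for name in names:
        yield f"<html>{name}</html>"


def parse_worker(in_queue, results, lock):
    """Hypothetical parser: consumes documents until it sees the sentinel."""
    while True:
        doc = in_queue.get()
        if doc is SENTINEL:
            break
        with lock:
            results.append(len(doc))  # stand-in for real parsing work


def run(names, parallelism=2, cap=4):
    # put() blocks once `cap` documents are waiting, which bounds memory use.
    in_queue = queue.Queue(maxsize=cap)
    results, lock = [], threading.Lock()
    workers = [
        threading.Thread(target=parse_worker, args=(in_queue, results, lock))
        for _ in range(parallelism)
    ]
    for worker in workers:
        worker.start()
    for doc in preprocess(names):  # the producer runs while workers consume
        in_queue.put(doc)
    for _ in workers:
        in_queue.put(SENTINEL)  # one stop marker per worker
    for worker in workers:
        worker.join()
    return results


parsed = run([f"doc{i}" for i in range(20)])
```

Because the workers start consuming before the producer finishes, progress can be reported from the first document onward, and at most `cap` preprocessed documents sit in memory at once.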

    Test plan

    For the 1st issue: I manually checked that the progress bar starts showing progress right after starting parser.apply.

    Checklist

    • [x] I have updated the documentation accordingly.
    • [ ] I have added tests to cover my changes.
    • [x] All new and existing tests passed.
    • [x] I have updated the CHANGELOG.rst accordingly.
    enhancement 
    opened by HiromuHota 13
  • [Errno 32] Broken pipe for Parser in parallel execution on OSX

    Hi,

    In fonduer-tutorials, after running cell:

    corpus_parser = OmniParser(structural=True, lingual=True, visual=True, pdf_path=pdf_path)
    %time corpus_parser.apply(doc_preprocessor, parallelism=PARALLEL)
    

    whenever PARALLEL is smaller than max_docs, I get:

    Traceback (most recent call last):
      File "/anaconda3/lib/python3.6/multiprocessing/queues.py", line 240, in _feed
        send_bytes(obj)
      File "/anaconda3/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
        self._send_bytes(m[offset:offset + size])
      File "/anaconda3/lib/python3.6/multiprocessing/connection.py", line 398, in _send_bytes
        self._send(buf)
      File "/anaconda3/lib/python3.6/multiprocessing/connection.py", line 368, in _send
        n = write(self._handle, buf)
    BrokenPipeError: [Errno 32] Broken pipe
    

    Otherwise (with PARALLEL greater than or equal to max_docs), the result is empty tables in PostgreSQL. When parallelization is turned off, it works.

    Best regards

    bug 
    opened by mladvladimir 13
  • Feat/multary candidates

    Description of the problems or issues

    Feature extraction only supports unary and binary candidates.

    Does your pull request fix any issue. Closes #455

    Description of the proposed changes

    Add new functions that support multary relations between spans in feature extraction.

    Test plan

    Use a candidate with more than two mentions, and try the feature extraction part.

    Checklist

    • [x] I have updated the documentation accordingly.
    • [x] I have added tests to cover my changes.
    • [x] All new and existing tests passed.
    • [x] I have updated the CHANGELOG.rst accordingly.

    Note:

    For multary candidates to work with textual features, we need a new version of treedlib based on this PR: treedlib#46. So if you can contact them, please do.

    Also, if someone can jump in to improve the coverage, please do; I can't get the tabular_features coverage up.

    enhancement 
    opened by wajdikhattel 12
  • Add HOCRDocProprocessor and HocrVisualParser

    Description of the problems or issues

    Is your pull request related to a problem? Please describe.

    This is the second patch that follows #518 .

    Does your pull request fix any issue.

    N/A.

    Description of the proposed changes

    Add HOCRDocProprocessor and HocrVisualParser

    Test plan

    I added a few real hOCR example files.

    Checklist

    • [x] I have updated the documentation accordingly.
    • [x] I have added tests to cover my changes.
    • [x] All new and existing tests passed.
    • [x] I have updated the CHANGELOG.rst accordingly.
    enhancement 
    opened by HiromuHota 9
  • Duplicate key error while adding two mentions which are same

    Suppose I have two mention classes (for example, zip code and tax code) whose matchers both return true for the same span in a document (both check for a 5-digit regex match). In that case, I think Fonduer throws the error below. Please help me resolve this.

    
    sqlalchemy.exc.IntegrityError: (psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "context_stable_id_key"
    DETAIL:  Key (stable_id)=(1443208965_10_subset::span_mention:23313:23321) already exists.
    
    [SQL: INSERT INTO context (type, stable_id) VALUES (%(type)s, %(stable_id)s) RETURNING context.id]
    
    opened by saikalyan9981 9
  • unable to read images in the pdf file

    Hi

    I am passing HTML to Fonduer, and it says it is unable to read an image from a figure. I took a PDF, converted it to HTML via pdftotree, and passed the HTML to Fonduer. Is this an issue with pdftotree, in that it is not able to render images? I want to know the mechanism for linking or embedding images in the HTML so that Fonduer can read them.

    Please help or advise, as I am stuck on this issue.

    opened by ashleo25 8
  • Non-deterministic behavior in featurization

    Describe the bug

    When working with a large (~7k docs) corpus of hardware datasheets and extracting multiple relations, we expect the features for each candidate to be deterministic between runs, even more so if parallelism=1 is set in the Featurizer. However, we find that there can be small (e.g., < 5) differences between feature tables, resulting in slightly different sparse matrices and thus slightly different results.

    To Reproduce

    Running on the HACK transistor dataset will reproduce the error. However, it will take a long time, and we haven't been able to get a minimal example that reproduces the error yet. Attached are two feature table dumps from two different runs with parallelism=1. Note that there is only a single difference, on line 65454.

    feature_table.tar.gz

    Note that it isn't always one difference, and the difference is not deterministic. The difference attached is just an example.

    Expected behavior

    We would expect these feature tables to be identical between runs.

    Error Logs/Screenshots

    For convenience, here is the differing line in screenshot form: [image]

    Additional context

    If the issue is in the UDF implementation, this might affect the Labeler in addition to the Featurizer, since they share a lot of the UDF code.

    bug 
    opened by lukehsiao 8
  • Type hints

    Is your feature request related to a problem? Please describe.

    I'm always frustrated when I have to look at the source code to check the types of arguments and return values.

    Describe the solution you'd like

    1. Type hints (PEP 484) are added to the source code, like
    def greeting(name: str) -> str:
        return 'Hello ' + name
    
    2. (Eventually) enforce type checking during pre-commit, for example with flake8-mypy.

    Describe alternatives you've considered

    Depending on the editor (PyCharm, etc.), type/rtype documentation like below gives you type hinting. However, I'm not sure this is equivalent to the type hints (PEP484).

    def greeting(name):
        """
        greeting
    
        :param name: description
        :type name: type description
        :return: description
        :rtype: type description
        """
        return 'Hello ' + name
    

    enhancement help wanted 
    opened by HiromuHota 8
  • CandidateExtractor doesn't scale for larger relations

    Hello, thanks for providing this framework. My group has run into a bit of a snag:

    For context, we've successfully completed candidate extraction & labeling for binary relations, with reasonable runtimes. With parallelism = 6, candidate extraction takes ~2 minutes per document.

    We've since moved on to a 3-ary relation that is very similar to the binary relation. This 3-ary relation shares some mentions with the binary relation, and uses a very similar candidate extractor. We have done performance testing for the 3-ary throttler function and found it to have a very similar runtime to the binary throttler. Candidate extraction now takes 4 hours per document. This immense slowdown is due to the fact that computational complexity scales exponentially for each entity added to a relation.

    Here are some numbers from our use case:

    • Mention A: 800 mentions found
    • Mention B: 140 mentions found
    • Mention C: 150 mentions found

    If our relation only includes (A,B), we have a total of 800*140 = 112,000 temporary candidates to evaluate with our throttler. Should we add mention C to form the relation (A,B,C), our total now grows to 800*140*150 = 16.8 million temporary candidates. We're unable to narrow our mention matchers further without excluding true positives.
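The arithmetic above can be checked with a short Python sketch. It uses only the mention counts quoted in this issue; `temporary_candidates` is a hypothetical helper, not a Fonduer function, and it assumes every combination of mentions is passed to the throttler.

```python
from math import prod

# Mention counts reported above.
mention_counts = {"A": 800, "B": 140, "C": 150}


def temporary_candidates(*mention_classes: str) -> int:
    """Size of the Cartesian product of the given mention classes."""
    return prod(mention_counts[m] for m in mention_classes)


binary = temporary_candidates("A", "B")        # 800 * 140
ternary = temporary_candidates("A", "B", "C")  # 800 * 140 * 150
growth = ternary // binary                     # factor added by mention C
```

Adding a mention class multiplies the number of temporary candidates by the size of that class, here a factor of 150.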

    This slowdown makes the Fonduer framework effectively unusable for any large-scale use case that requires relations with more than 2 entities. Can you provide guidance to address this issue?

    opened by robbieculkin 1
  • Tables aren't redefined for re-runs of UDF apply

    Description of the bug

    As part of iterative development in a Jupyter environment, apply may be re-run several times. The developer might need to update candidates or create a new labeling function, for example. When this happens, the corresponding Postgres table is cleared but not dropped. This means that the definition of the table cannot change to accommodate the updated parameters for apply.

    To Reproduce

    Steps to reproduce the behavior:

    1. Run the max_storage_temp_tutorial notebook in fonduer-tutorials, up to and including the Labeling Functions section.
    2. Add a new LF; it doesn't need to do anything in particular (it could return ABSTAIN every time). Add this to the stg_temp_lfs list.
    3. Re-run the remainder of cells in the section.

    Upon calling LFAnalysis, the following exception is thrown:

    ValueError: Number of LFs (7) and number of LF matrix columns (6) are different
    

    Expected behavior

    Underlying tables for a re-run of a UDF apply method should not only be cleared, but dropped.

    Error Logs/Screenshots

    Full stack trace:

    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-62-e005feee6300> in <module>
          5 sorted_lfs = sorted(lfs, key=lambda lf: lf.name)
          6 
    ----> 7 LFAnalysis(L=L_train[0], lfs=sorted_lfs).lf_summary(Y=L_gold_train[0].reshape(-1))
    
    ~/.venv/lib/python3.7/site-packages/snorkel/labeling/analysis.py in __init__(self, L, lfs)
         44             if len(lfs) != self._L_sparse.shape[1]:
         45                 raise ValueError(
    ---> 46                     f"Number of LFs ({len(lfs)}) and number of "
         47                     f"LF matrix columns ({self._L_sparse.shape[1]}) are different"
         48                 )
    
    ValueError: Number of LFs (7) and number of LF matrix columns (6) are different
    

    Environment (please complete the following information)

    • OS: Ubuntu 18.04
    • PostgreSQL Version: 12.1
    • Poppler Utils Version: 0.71.0-5
    • Fonduer Version: 0.8.3

    Additional context

    https://github.com/HazyResearch/fonduer/issues/263#issuecomment-527588765 advises restarting Python, but this does not appear to solve the problem.

    opened by robbieculkin 5
  • Parser is not splitting multiple lines sentences properly

    Description of the bug

    I'm trying to train a model that can build a knowledge base from the OPC UA Companions specification as part of my thesis. I have the dataset as PDFs and used a third-party program to convert them into HTML, trying my best to preserve the data structure information (I get the same result even if I parse the PDFs alone).

    Then I followed the hardware_fonduer_model tutorial to extract the candidates accordingly.

    The problem is that the parser splits the sentences incorrectly: it treats the end of a line as the end of a sentence. I tried to debug with SimpleParser.split_sentences(text), and it turned out that Python needs a backslash to split a statement across multiple lines.

    So I thought I could use the replacements=['[\n]', ' '] parameter so that the split would work better, but I get ValueError: too many values to unpack (expected 2). What is the default configuration for sentence segmentation?
    How can I get multiple sentences as one mention? (I tried MentionNgram up to n_max=100 and still get just one.)
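One possible explanation for the ValueError above, sketched in plain Python: if the replacements are iterated as (pattern, replacement) pairs, a flat list of two strings cannot be unpacked, while a list containing one tuple can. This is an assumption about how the parameter is consumed; `apply_replacements` is a hypothetical stand-in, not Fonduer's code.

```python
import re
from typing import List, Tuple


def apply_replacements(text: str, replacements: List[Tuple[str, str]]) -> str:
    """Apply each (pattern, replacement) pair in order."""
    for pattern, replacement in replacements:
        text = re.sub(pattern, replacement, text)
    return text


text = "generated by this move command\nrequest."

# Flat list of strings: the first item "[\n]" has three characters, so it
# cannot be unpacked into the two names `pattern` and `replacement`.
try:
    apply_replacements(text, ["[\n]", " "])
    error = None
except ValueError as exc:
    error = str(exc)

# A list holding one (pattern, replacement) tuple unpacks cleanly.
fixed = apply_replacements(text, [("[\n]", " ")])
```

Under this assumption, passing `replacements=[('[\n]', ' ')]` is the form to try.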

    I would really appreciate hearing back from you.

    Many thanks in advance.

    Example: Text to be parsed

    Boolean indicating if a profile /signature should be generated by this move command request.If the optional VariableSignatureRequestStatus is not provided on the Object, this parameter is ignored by the Server.

    Expected behavior

    sentence 1 : Boolean indicating if a profile /signature should be generated by this move command request.
    sentence 2 : If the optional VariableSignatureRequestStatus is not provided on the Object, this parameter is ignored by the Server.

    Actual behavior

    sentence 1 : Boolean indicating if a profile /signature should be generated by this move command
    sentence 2 : request.
    sentence 3 : request.If the optional VariableSignatureRequestStatus is not provided on the Object, this
    sentence 4 : parameter is ignored by the Server.

    Environment

    opened by eng-khaled1 3
  • Suggestion required: Getting error while applying Featurizer

    @SenWu @HiromuHota, can you please tell me whether my reasoning is right?

    I am getting this error:

    File "abcd./anaconda3/lib/python3.7/site-packages/fonduer/utils/data_model_utils/structural.py", line 55, in _get_node
        return doc_etree.xpath(sentence.xpath)[0]
    IndexError: list index out of range

    I am following the hardware tutorial on some email HTML messages and getting a mention count of nearly 4000.

    Also:

    train_cands = candidate_extractor.get_candidates(split=0)
    dev_cands = candidate_extractor.get_candidates(split=1)
    test_cands = candidate_extractor.get_candidates(split=2)

    The steps above returned outputs, but on applying the featurizer:

    featurizer.apply(split=0, train=True, parallelism=PARALLEL)

    I get the error mentioned at the top.

    I looked on Stack Overflow, but the reason suggested there, an HTML syntax issue, does not seem to apply, as the HTML renders fine in a browser. So can you share your thoughts on whether:

    1. it can be because no candidates are being generated, or
    2. something else is going on?

    Thanks.

    opened by AshutoshUpadhya 3
  • How can i extract a paragraph and all associated sentences in document

    How can I extract a paragraph and all its associated sentences from a document?
    Basically, I need paragraphs with their associated sentences. @lukehsiao @SenWu @vincentschen @ZZWENG @stephenbach

    Appreciate your help

    needs-info 
    opened by ashleo25 1
  • Featurizer.get_keys() does not honor candidate classes in context

    Description of the bug

    Unlike other methods (e.g., Featurizer.drop_keys() and Featurizer.upsert_keys()), Featurizer.get_keys() does not honor candidate classes in context but returns all feature keys no matter which candidate class they are associated with. This is confusing.

    See https://github.com/HazyResearch/fonduer/issues/511#issuecomment-696618392 for how this actually confused a user.

    To Reproduce

    This is a design error.

    Expected behavior

    These methods should behave similarly. Either

    • None of these honor candidate classes, or
    • All of these honor them.

    Error Logs/Screenshots

    N/A

    Environment (please complete the following information)

    • Fonduer Version: 0.8.3

    opened by HiromuHota 0
Releases(v0.9.0)
  • v0.9.0(Jun 23, 2021)

    0.9.0 - 2021-06-22

    This is a long-awaited release with some performance improvements and some breaking changes. See the changelog for details.

    Added

    Changed

    • @HiromuHota: Renamed VisualLinker to PdfVisualParser, which assumes the following: (#518)

      • pdf_path should be a directory path, where PDF files exist, and cannot be a file path.
      • The PDF file should have the same basename (os.path.basename) as the document. E.g., the PDF file should be either "123.pdf" or "123.PDF" for "123.html".
    • @HiromuHota: Changed Parser's signature as follows: (#518)

      • Renamed vizlink to visual_parser.
      • Removed pdf_path. Now this is required only by PdfVisualParser.
      • Removed visual. Provide visual_parser if visual information is to be parsed.
    • @YasushiMiyata: Changed UDFRunner's and UDF's data commit process as follows: (#545)

      • Removed add process on single-thread in _apply in UDFRunner.
      • Added UDFRunner._add of y on multi-threads to Parser, Labeler and Featurizer.
      • Removed y of document parsed result from out_queue in UDF.
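The basename rule listed above for PdfVisualParser (for "123.html", a file named "123.pdf" or "123.PDF" inside the pdf_path directory) can be illustrated with a small lookup. This is a sketch of the stated rule only, not the library's code; `find_pdf` is a hypothetical helper.

```python
import os
import tempfile
from typing import Optional


def find_pdf(pdf_dir: str, document_name: str) -> Optional[str]:
    """Return the PDF in `pdf_dir` whose basename matches the document, if any."""
    stem = os.path.splitext(os.path.basename(document_name))[0]
    for ext in (".pdf", ".PDF"):
        candidate = os.path.join(pdf_dir, stem + ext)
        if os.path.isfile(candidate):
            return candidate
    return None


with tempfile.TemporaryDirectory() as pdf_dir:
    # Create an empty placeholder file named after the document.
    open(os.path.join(pdf_dir, "123.pdf"), "w").close()
    found = find_pdf(pdf_dir, "123.html")
    missing = find_pdf(pdf_dir, "456.html")
    found_name = os.path.basename(found) if found else None
```

A check like this before parsing can show which documents have no matching PDF in the directory.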

    Fixed

    Source code(tar.gz)
    Source code(zip)
    fonduer-0.9.0-py3-none-any.whl(146.07 KB)
    fonduer-0.9.0.tar.gz(102.10 KB)
  • v0.8.3(Sep 11, 2020)

    0.8.3 - 2020-09-11

    This is a big release with a lot of changes. These changes are summarized here. Check the Changelog for more details.

    Added

    Changed

    • @YasushiMiyata: Enabled RegexMatchSpan to match against words concatenated with the sep="(separator)" option. (#270) (#492)
    • @HiromuHota: Enabled "Type hints (PEP 484) support for the Sphinx autodoc extension." (#421)
    • @HiromuHota: Switched the Cython wrapper for Mecab from mecab-python3 to fugashi. Since the Japanese tokenizer remains the same, there should be no impact on users. (#384) (#432)
    • @HiromuHota: Log a stack trace on parsing error for better debug experience. (#478) (#479)
    • @HiromuHota: get_cell_ngrams and get_neighbor_cell_ngrams yield nothing when the mention is not tabular. (#471) (#504)

    Deprecated

    Fixed

    • @senwu: Fix an issue where pdf_path could not be given without a trailing slash. (#442) (#459)
    • @kaikun213: Fix bug in table range difference calculations. (#420)
    • @HiromuHota: mention_extractor.apply with clear=True now works even if it's not the first run. (#424)
    • @HiromuHota: Fix get_horz_ngrams and get_vert_ngrams so that they work even when the input mention is not tabular. (#425) (#426)
    • @HiromuHota: Fix the order of args to Bbox. (#443) (#444)
    • @HiromuHota: Fix the non-deterministic behavior in VisualLinker. (#412) (#458)
    • @HiromuHota: Fix an issue that the progress bar shows no progress on preprocessing by executing preprocessing and parsing in parallel. (#439)
    • @HiromuHota: Adapt to mlflow>=1.9.0. (#461) (#463)
    • @HiromuHota: Correct the entity type for NumberMatcher from "NUMBER" to "CARDINAL". (#473) (#477)
    • @HiromuHota: Fix _get_axis_ngrams not to return None when the input is not tabular. (#481)
    • @HiromuHota: Fix Visualizer.display_candidates not to draw rectangles on wrong pages. (#488)
    • @HiromuHota: Persist doc only when no error happens during parsing. (#489) (#490)
    Source code(tar.gz)
    Source code(zip)
    fonduer-0.8.3-py3-none-any.whl(136.97 KB)
    fonduer-0.8.3.tar.gz(99.00 KB)
  • v0.8.2(Apr 29, 2020)

    0.8.2 - 2020-04-28

    A summary of the changes in this release is below. Check the Changelog for more details.

    Deprecated

    • @HiromuHota: Use of undecorated labeling functions is deprecated and will not be supported as of v0.9.0. Please decorate them with snorkel.labeling.labeling_function.

    Fixed

    • @HiromuHota: Labeling functions can now be decorated with snorkel.labeling.labeling_function. (#400) (#401)
    Source code(tar.gz)
    Source code(zip)
    fonduer-0.8.2-py3-none-any.whl(126.83 KB)
    fonduer-0.8.2.tar.gz(88.07 KB)
  • v0.8.1(Apr 13, 2020)

    0.8.1 - 2020-04-13

    A summary of the changes in this release is below. Check the Changelog for more details.

    Fonduer has a new mode argument to support switching between different learning modes (e.g., STL or MTL).

    Example usage:
    from fonduer.learning.task import create_task

    # Create a task for each relation.
    tasks = create_task(
        task_names=TASK_NAMES,
        n_arities=N_ARITIES,
        n_features=N_FEATURES,
        n_classes=N_CLASSES,
        emb_layer=EMB_LAYER,
        model="LogisticRegression",
        mode=MODE,
    )
    

    Added

    • @senwu: Add mode argument in create_task to support STL and MTL.
    Source code(tar.gz)
    Source code(zip)
    fonduer-0.8.1-py3-none-any.whl(128.52 KB)
    fonduer-0.8.1.tar.gz(87.80 KB)
  • v0.8.0(Apr 8, 2020)

    0.8.0 - 2020-04-07

    A summary of the changes in this release is below. Check the Changelog for more details.

    Rather than maintaining a separate learning engine, we switch to Emmental, a deep learning framework for multi-task learning. Switching to a more general learning framework allows Fonduer to support more applications and multi-task learning.

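    Example usage:
    # With Emmental, learning takes the following steps:
    # 1. Create a task for each relation and an EmmentalModel to learn those tasks.
    # 2. Wrap candidates into an EmmentalDataLoader for training.
    # 3. Train and run inference (prediction).

    import emmental
    import numpy as np
    from emmental.data import EmmentalDataLoader
    from emmental.learner import EmmentalLearner
    from emmental.model import EmmentalModel
    from emmental.modules.embedding_module import EmbeddingModule

    import fonduer
    from fonduer.learning.dataset import FonduerDataset
    from fonduer.learning.task import create_task
    from fonduer.learning.utils import collect_word_counter

    # Collect the word counter from candidates, which is used in the LSTM model.
    word_counter = collect_word_counter(train_cands)

    # Initialize Emmental. To customize Emmental, see:
    # https://emmental.readthedocs.io/en/latest/user/config.html
    emmental.init(fonduer.Meta.log_path)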
    
    #######################################################################
    # 1. Create a task for each relation and an EmmentalModel to learn those tasks.
    #######################################################################
    
    # Generate special tokens which are used for LSTM model to locate mentions.
    # In LSTM model, we pad sentence with special tokens to help LSTM to learn
    # those mentions. Example:
    # Original sentence: Then Barack married Michelle.
    # ->  Then ~~[[1 Barack 1]]~~ married ~~[[2 Michelle 2]]~~.
    arity = 2
    special_tokens = []
    for i in range(arity):
        special_tokens += [f"~~[[{i}", f"{i}]]~~"]
    
    # Generate word embedding module for LSTM.
    emb_layer = EmbeddingModule(
        word_counter=word_counter, word_dim=300, specials=special_tokens
    )
    
    # Create task for each relation.
    tasks = create_task(
        ATTRIBUTE,
        2,
        F_train[0].shape[1],
        2,
        emb_layer,
        mode="mtl",
        model="LogisticRegression",
    )
    
    # Create Emmental model to learn the tasks.
    model = EmmentalModel(name=f"{ATTRIBUTE}_task")
    
    # Add tasks into model
    for task in tasks:
        model.add_task(task)
    
    #######################################################################
    # 2. Wrap candidates into EmmentalDataLoader for training.
    #######################################################################
    
    # Here we only use the samples that have labels: we filter out the
    # samples that don't have significant marginals.
    diffs = train_marginals.max(axis=1) - train_marginals.min(axis=1)
    train_idxs = np.where(diffs > 1e-6)[0]

    # Create a dataloader with weakly supervised samples to learn the model.
    train_dataloader = EmmentalDataLoader(
        task_to_label_dict={ATTRIBUTE: "labels"},
        dataset=FonduerDataset(
            ATTRIBUTE,
            train_cands[0],
            F_train[0],
            emb_layer.word2id,
            train_marginals,
            train_idxs,
        ),
        split="train",
        batch_size=100,
        shuffle=True,
    )
    
    
    # Create the test dataloader for prediction.
    test_dataloader = EmmentalDataLoader(
        task_to_label_dict={ATTRIBUTE: "labels"},
        dataset=FonduerDataset(
            ATTRIBUTE, test_cands[0], F_test[0], emb_layer.word2id, 2
        ),
        split="test",
        batch_size=100,
        shuffle=False,
    )
    
    
    #######################################################################
    # 3. Training and inference (prediction).
    #######################################################################
    
    # Learn those tasks.
    emmental_learner = EmmentalLearner()
    emmental_learner.learn(model, [train_dataloader])

    # Predict with the learned model.
    test_preds = model.predict(test_dataloader, return_preds=True)
    

    Changed

    • @senwu: Switch to Emmental as the default learning engine.
    • @HiromuHota: Change ABSTAIN to -1 to be compatible with Snorkel of 0.9.X. Accordingly, user-defined labels should now be 0-indexed (used to be 1-indexed). (#310) (#320)
    • @HiromuHota: Use executemany_mode="batch" instead of deprecated use_batch_mode=True. (#358)
    • @HiromuHota: Use tqdm.notebook.tqdm instead of deprecated tqdm.tqdm_notebook. (#360)
    • @HiromuHota: To support ImageMagick7, expand the version range of Wand. (#373)
    • @HiromuHota: Comply with PEP 561 for type-checking codes that use Fonduer.
    • @HiromuHota: Make UDF.apply of all child classes unaware of the database backend, meaning PostgreSQL is not required if UDF.apply is directly used instead of UDFRunner.apply. (#316) (#368)
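For the ABSTAIN change listed above, a label migration could look like the following sketch. The new values (ABSTAIN = -1, labels 0-indexed) come from the entry above; the old convention of ABSTAIN = 0 with 1-indexed labels is an assumption, and `migrate_label` is a hypothetical helper rather than part of Fonduer or Snorkel.

```python
# New convention described in the changelog entry above.
ABSTAIN = -1
FALSE = 0
TRUE = 1

# Assumed old convention (not stated above): ABSTAIN = 0, labels 1-indexed.
OLD_ABSTAIN = 0


def migrate_label(old_label: int) -> int:
    """Map an old-style label to the 0-indexed, ABSTAIN = -1 convention."""
    if old_label < OLD_ABSTAIN:
        raise ValueError(f"not an old-style label: {old_label}")
    return old_label - 1  # shifts ABSTAIN 0 -> -1 and class k -> k - 1


old_votes = [0, 1, 2, 2, 0]
new_votes = [migrate_label(v) for v in old_votes]
```

Under that assumption every value shifts down by one, so stored labels and labeling-function return values need the same adjustment.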

    Fixed

    • @senwu: Fix mention extraction to return mention classes instead of data model classes.
Owner
HazyResearch
We are a CS research group led by Prof. Chris Ré.