Python package for performing Entity and Text Matching using Deep Learning.

Overview

DeepMatcher

DeepMatcher is a Python package for performing entity and text matching using deep learning. It provides built-in neural networks and utilities that enable you to train and apply state-of-the-art deep learning models for entity matching in less than 10 lines of code. The models are also easily customizable - the modular design allows any subcomponent to be altered or swapped out for a custom implementation.

As an example, given labeled tuple pairs such as the following:

https://raw.githubusercontent.com/anhaidgroup/deepmatcher/master/docs/source/_static/match_input_ex.png

DeepMatcher trains a neural network on these pairs to perform matching, i.e., to predict match / non-match labels. The trained network can then be used to predict labels for unlabeled tuple pairs.
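
To make the input format concrete, here is a minimal sketch that writes and reads back a small file of labeled tuple pairs using only the standard library. It assumes DeepMatcher's default column conventions (an `id` column, a `label` column, and attribute columns prefixed with `left_` and `right_`). The product rows, the attribute names and the helper functions `write_pairs` and `read_pairs` are made up for illustration and are not part of DeepMatcher.

```python
import csv
import os
import tempfile

# Assumed column layout: id, label, then the same attributes for the
# left and right tuple, distinguished by the "left_" / "right_" prefixes.
FIELDNAMES = ["id", "label",
              "left_name", "left_price",
              "right_name", "right_price"]

# Hypothetical rows. label "1" means the two tuples describe the same
# entity (match); label "0" means they do not (non-match).
ROWS = [
    {"id": "0", "label": "1",
     "left_name": "Acme USB-C cable 1m", "left_price": "9.99",
     "right_name": "ACME 1 m USB C cable", "right_price": "9.95"},
    {"id": "1", "label": "0",
     "left_name": "Acme USB-C cable 1m", "left_price": "9.99",
     "right_name": "Acme HDMI cable 2m", "right_price": "12.50"},
    {"id": "2", "label": "1",
     "left_name": "Zeta wireless mouse", "left_price": "24.00",
     "right_name": "Zeta mouse (wireless)", "right_price": "23.99"},
]


def write_pairs(path, rows):
    """Write tuple pairs to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)


def read_pairs(path):
    """Read tuple pairs back as a list of dicts keyed by column name."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "pairs.csv")
    write_pairs(path, ROWS)
    for pair in read_pairs(path):
        kind = "match" if pair["label"] == "1" else "non-match"
        print(pair["left_name"], "|", pair["right_name"], "->", kind)
```

Unlabeled data would follow the same layout without the `label` column.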

Paper and Data

For details on the architecture of the models used, take a look at our paper Deep Learning for Entity Matching (SIGMOD '18). All public datasets used in the paper can be downloaded from the datasets page.

Quick Start: DeepMatcher in 30 seconds

There are four main steps in using DeepMatcher:

  1. Data processing: Load and process labeled training, validation and test CSV data.

      import deepmatcher as dm
      train, validation, test = dm.data.process(path='data_directory',
          train='train.csv', validation='validation.csv', test='test.csv')

  2. Model definition: Specify neural network architecture. Uses the built-in hybrid model (as discussed in section 4.4 of our paper) by default. Can be customized to your heart's desire.

      model = dm.MatchingModel()

  3. Model training: Train neural network.

      model.run_train(train, validation, best_save_path='best_model.pth')

  4. Application: Evaluate model on test set and apply to unlabeled data.

      model.run_eval(test)

      unlabeled = dm.data.process_unlabeled(path='data_directory/unlabeled.csv', trained_model=model)
      model.run_prediction(unlabeled)
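
The data processing step expects three separate CSV files. If you start from a single labeled file, the following sketch shows one way to produce them with the standard library. The helpers `split_rows` and `split_csv`, the 3:1:1 ratio and the file names are assumptions for illustration, not part of the DeepMatcher API, and the split is a plain random shuffle that does not stratify by label.

```python
import csv
import os
import random
import tempfile


def split_rows(rows, ratios=(3, 1, 1), seed=0):
    """Shuffle rows and cut them into train / validation / test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # seeded for reproducibility
    total = sum(ratios)
    n_train = len(rows) * ratios[0] // total
    n_valid = len(rows) * ratios[1] // total
    # Whatever remains after train and validation goes to test.
    return (rows[:n_train],
            rows[n_train:n_train + n_valid],
            rows[n_train + n_valid:])


def split_csv(source_path, out_dir, ratios=(3, 1, 1), seed=0):
    """Split one labeled CSV into train.csv, validation.csv and test.csv."""
    with open(source_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = list(reader)
    parts = split_rows(rows, ratios, seed)
    names = ("train.csv", "validation.csv", "test.csv")
    for name, part in zip(names, parts):
        out_path = os.path.join(out_dir, name)
        with open(out_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(part)
    return tuple(len(part) for part in parts)


if __name__ == "__main__":
    # Build a tiny hypothetical labeled file, then split it.
    work_dir = tempfile.mkdtemp()
    source = os.path.join(work_dir, "labeled.csv")
    with open(source, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "label", "left_name", "right_name"])
        for i in range(10):
            writer.writerow([i, i % 2, "item {}".format(i), "item {}".format(i)])
    print(split_csv(source, work_dir))
```

The output directory can then be passed as `path` to `dm.data.process`, with the three file names as in step 1.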

Installation

We currently support only Python versions 3.5 and 3.6. Installing using pip is recommended:

pip install deepmatcher

Note that during installation you may see an error message that says "Failed building wheel for fasttextmirror". You can safely ignore this - it does NOT mean that there are any problems with installation.

Tutorials

Using DeepMatcher:

  1. Getting Started: A more in-depth guide to help you get familiar with the basics of using DeepMatcher.
  2. Data Processing: Advanced guide on what data processing involves and how to customize it.
  3. Matching Models: Advanced guide on neural network architecture for entity matching and how to customize it.

Entity Matching Workflow:

End to End Entity Matching: A guide to develop a complete entity matching workflow. The tutorial discusses how to use DeepMatcher with Magellan to perform blocking, sampling, labeling and matching to obtain matching tuple pairs from two tables.

DeepMatcher for other matching tasks:

Question Answering with DeepMatcher: A tutorial on how to use DeepMatcher for question answering. Specifically, we will look at WikiQA, a benchmark dataset for the task of Answer Selection.

API Reference

API docs are here.

Support

Take a look at the FAQ for common issues. If you run into a problem or have a question that the FAQ does not answer, please file a GitHub issue and we will address it as soon as possible.

The Team

DeepMatcher was developed by University of Wisconsin-Madison grad students Sidharth Mudgal and Han Li, under the supervision of Prof. AnHai Doan and Prof. Theodoros Rekatsinas.
