Facilitating the design, comparison and sharing of deep text matching models.


MatchZoo

MatchZoo is a general-purpose text matching toolkit that makes it easy to quickly implement, compare, and share the latest deep text matching models.


🔥 News: MatchZoo-py (PyTorch version of MatchZoo) is ready now.

The goal of MatchZoo is to provide a high-quality codebase for deep text matching research, such as document retrieval, question answering, conversational response ranking, and paraphrase identification. With a unified data processing pipeline, simplified model configuration, and automatic hyper-parameter tuning, MatchZoo is flexible and easy to use.

Tasks | Text 1 | Text 2 | Objective
Paraphrase Identification | string 1 | string 2 | classification
Textual Entailment | text | hypothesis | classification
Question Answer | question | answer | classification/ranking
Conversation | dialog | response | classification/ranking
Information Retrieval | query | document | ranking

Get Started in 60 Seconds

To train a Deep Structured Semantic Model (DSSM), import matchzoo and prepare the input data.

import matchzoo as mz

train_pack = mz.datasets.wiki_qa.load_data('train', task='ranking')
valid_pack = mz.datasets.wiki_qa.load_data('dev', task='ranking')

Preprocess your input data in three lines of code, and keep track of the parameters to be passed into the model.

preprocessor = mz.preprocessors.DSSMPreprocessor()
train_processed = preprocessor.fit_transform(train_pack)
valid_processed = preprocessor.transform(valid_pack)
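
To make the fit/transform split concrete, here is a minimal sketch of the letter-trigram "word hashing" idea described in the DSSM paper. It is plain Python and is not MatchZoo's implementation; the function names `letter_trigrams`, `fit_vocab`, and `transform` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def letter_trigrams(word):
    """Return the letter trigrams of a word padded with '#' boundary markers."""
    padded = f"#{word.lower()}#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def fit_vocab(texts):
    """Collect every trigram seen in the training texts and assign it an index."""
    grams = sorted({g for t in texts for w in t.split() for g in letter_trigrams(w)})
    return {g: i for i, g in enumerate(grams)}


def transform(text, vocab):
    """Map a text to a bag-of-trigrams count vector; unseen trigrams are dropped."""
    counts = Counter(g for w in text.split() for g in letter_trigrams(w))
    vec = [0] * len(vocab)
    for gram, count in counts.items():
        if gram in vocab:
            vec[vocab[gram]] = count
    return vec


vocab = fit_vocab(["good match", "bad match"])
print(len(vocab), transform("good", vocab))
```

The vocabulary is learned once on the training split and reused unchanged on the validation split, which mirrors calling fit_transform on the training pack and transform on the validation pack.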

Make use of MatchZoo customized loss functions and evaluation metrics:

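ranking_task = mz.tasks.Ranking(loss=mz.losses.RankCrossEntropyLoss(num_neg=4))
# example metrics; pick the ones your task needs
ranking_task.metrics = [mz.metrics.NormalizedDiscountedCumulativeGain(k=3), mz.metrics.MeanAveragePrecision()]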
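
As a rough illustration of what a rank cross-entropy loss with num_neg negatives computes, the sketch below assumes the loss is a softmax cross-entropy over one positive score and its sampled negatives. That reading is an assumption based on the loss name; the function `rank_cross_entropy` is hypothetical, and the library source should be checked for the exact definition.

```python
import math


def rank_cross_entropy(pos_score, neg_scores):
    """Negative log softmax probability of the positive among 1 + num_neg candidates."""
    scores = [pos_score] + list(neg_scores)
    m = max(scores)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - pos_score


# The loss shrinks as the positive score rises above the negatives.
print(rank_cross_entropy(2.0, [0.1, 0.3, -0.5, 0.0]))
print(rank_cross_entropy(0.0, [0.1, 0.3, -0.5, 0.0]))
```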

Initialize the model, fine-tune the hyper-parameters.

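model = mz.models.DSSM()
model.params['input_shapes'] = preprocessor.context['input_shapes']
model.params['task'] = ranking_task
model.guess_and_fill_missing_params()  # fill the remaining hyper-parameters with defaults
model.build()    # construct the underlying Keras model
model.compile()  # attach the task's loss and metrics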

Generate pair-wise training data on the fly, and evaluate model performance on the validation data using customized callbacks.

train_generator = mz.PairDataGenerator(train_processed, num_dup=1, num_neg=4, batch_size=64, shuffle=True)
valid_x, valid_y = valid_processed.unpack()
evaluate = mz.callbacks.EvaluateAllMetrics(model, x=valid_x, y=valid_y, batch_size=len(valid_x))
history = model.fit_generator(train_generator, epochs=20, callbacks=[evaluate], workers=5, use_multiprocessing=False)
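
The following sketch shows one plausible way pair-wise groups could be formed from labelled (query, document) relations, using the same num_dup and num_neg parameter names as above. It is an assumption-laden illustration in plain Python, not the generator's actual code; `make_pairwise_groups` and its tuple layout are hypothetical.

```python
import random
from collections import defaultdict


def make_pairwise_groups(relations, num_dup=1, num_neg=4, seed=0):
    """Build (query, positive_doc, [negative_docs]) groups.

    relations: list of (query_id, doc_id, label) tuples.
    For every document that has lower-labelled documents under the same
    query, sample num_neg of them (with replacement), num_dup times.
    """
    rng = random.Random(seed)
    by_query = defaultdict(list)
    for qid, doc_id, label in relations:
        by_query[qid].append((doc_id, label))

    groups = []
    for qid, docs in by_query.items():
        for doc_id, label in docs:
            lower = [d for d, other in docs if other < label]
            if not lower:
                continue  # nothing ranks below this document
            for _ in range(num_dup):
                groups.append((qid, doc_id, rng.choices(lower, k=num_neg)))
    return groups


relations = [("q1", "d1", 1), ("q1", "d2", 0), ("q1", "d3", 0), ("q2", "d4", 0)]
for group in make_pairwise_groups(relations):
    print(group)
```

Queries without any positive document (q2 here) contribute no training groups, which is worth keeping in mind when counting effective training examples.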



English Documentation


If you're interested in the cutting-edge research progress, please take a look at awesome neural models for semantic match.


MatchZoo depends on Keras and TensorFlow. There are two ways to install MatchZoo:

Install MatchZoo from PyPI:

pip install matchzoo

Install MatchZoo from the GitHub source:

git clone https://github.com/NTMC-Community/MatchZoo.git
cd MatchZoo
python setup.py install
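
After either install route, a quick way to confirm that the package is visible to the interpreter is to query its installed version. This is a small sketch using only the standard library (it assumes Python 3.8 or newer for importlib.metadata); the helper name `installed_version` is hypothetical.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string when MatchZoo is installed, otherwise None.
print(installed_version("matchzoo"))
```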


  1. DRMM: this model is an implementation of A Deep Relevance Matching Model for Ad-hoc Retrieval.

  2. MatchPyramid: this model is an implementation of Text Matching as Image Recognition

  3. ARC-I: this model is an implementation of Convolutional Neural Network Architectures for Matching Natural Language Sentences

  4. DSSM: this model is an implementation of Learning Deep Structured Semantic Models for Web Search using Clickthrough Data

  5. CDSSM: this model is an implementation of Learning Semantic Representations Using Convolutional Neural Networks for Web Search

  6. ARC-II: this model is an implementation of Convolutional Neural Network Architectures for Matching Natural Language Sentences

  7. MV-LSTM: this model is an implementation of A Deep Architecture for Semantic Matching with Multiple Positional Sentence Representations

  8. aNMM: this model is an implementation of aNMM: Ranking Short Answer Texts with Attention-Based Neural Matching Model

  9. DUET: this model is an implementation of Learning to Match Using Local and Distributed Representations of Text for Web Search

  10. K-NRM: this model is an implementation of End-to-End Neural Ad-hoc Ranking with Kernel Pooling

  11. CONV-KNRM: this model is an implementation of Convolutional neural networks for soft-matching n-grams in ad-hoc search

  12. Models under development: Match-SRNN, DeepRank, BiMPM, ...


If you use MatchZoo in your research, please use the following BibTeX entry.

 author = {Guo, Jiafeng and Fan, Yixing and Ji, Xiang and Cheng, Xueqi},
 title = {MatchZoo: A Learning, Practicing, and Developing System for Neural Text Matching},
 booktitle = {Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval},
 series = {SIGIR'19},
 year = {2019},
 isbn = {978-1-4503-6172-9},
 location = {Paris, France},
 pages = {1297--1300},
 numpages = {4},
 url = {http://doi.acm.org/10.1145/3331184.3331403},
 doi = {10.1145/3331184.3331403},
 acmid = {3331403},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {matchzoo, neural network, text matching},

Development Team

​ ​ ​ ​

  • Fan Yixing (Core Dev)
  • Wang Bo (Core Dev, M.S. TU Delft)
  • Wang Zeyi (Core Dev, B.S. UC Davis)
  • Pang Liang (Core Dev)
  • Yang Liu (Core Dev)
  • Wang Qinghua (B.S. Shandong Univ.)
  • Wang Zizhen
  • Su Lixin
  • Yang Zhou
  • Tian Junfeng

Please make sure to read the Contributing Guide before creating a pull request. If you have a MatchZoo-related paper/project/component/tool, send a pull request to this awesome list!

Thank you to all the people who already contributed to MatchZoo!

Jianpeng Hou, Lijuan Chen, Yukun Zheng, Niuguo Cheng, Dai Zhuyun, Aneesh Joshi, Zeno Gantner, Kai Huang, stanpcf, ChangQF, Mike Kellogg

Project Organizers

  • Jiafeng Guo
    • Institute of Computing Technology, Chinese Academy of Sciences
    • Homepage
  • Yanyan Lan
    • Institute of Computing Technology, Chinese Academy of Sciences
    • Homepage
  • Xueqi Cheng
    • Institute of Computing Technology, Chinese Academy of Sciences
    • Homepage



Copyright (c) 2015-present, Yixing Fan (faneshion)

  • Run aNMM

    I am new to MatchZoo and wonder how to run aNMM. The docs don't cover aNMM usage. It seems I have to run a script to calculate the bin_sizes for aNMM, but I cannot find where this script lives.

    Furthermore, does my training data need to follow the format shown here: https://github.com/NTMC-Community/MatchZoo/blob/master/matchzoo/datasets/toy/train.csv


    And where are the batches created? Since you have positive and negative documents for each query, each batch should contain both positive and negative samples, right?

    How can I load my own data?


    opened by ctrado18 26
  • Suggestions for MatchZoo 2.0

    Anybody who wants to make suggestions for MZ 2.0, please add them to this issue.

    Here are my suggestions:

    • [x] Add docstrings for all functions and classes
    • [ ] Make MZ OS independent
    • [ ] Make MZ usable by providing custom data
    • [ ] Allow External Benchmarking
    • [ ] Siamese Recurrent Networks (Proposed Model)
    • [ ] docker, conda, virtualenv support (wishlist)

    More details at https://github.com/faneshion/MatchZoo/issues/106

    2.0 discussion 
    opened by aneesh-joshi 24
  • Reproduction of Benchmark Results

    When running through the procedure described in the readme for the benchmark results of WikiQA, the reproduced values for NDCG@3, NDCG@5, and MAP are roughly half of the values shown in the table. Could you provide insight as to why this may be occurring?

    bug question 
    opened by ghost 21
  • External benchmarking on Match Zoo

    Hi, I am trying to establish benchmark results for all the document similarity models in MatchZoo. While there are some established benchmarks, it would be good to have a system for evaluating results that is independent of the MatchZoo code.


        input_data -> MZ -> result_data
        result_data -> independent_evaluation_code -> metric scores (for example: NDCG@k, MAP, etc.)

    Currently, the evaluation code is deeply embedded in the MZ code, which can cause problems across commits over time, as seen in https://github.com/faneshion/MatchZoo/issues/99

    1. Is there already a way to do this? I assume TREC is for that. Could someone direct me on how to use it?
    2. Could someone direct me on how to go about writing such evaluation code? (Once developed, I will push it back into MZ and it could serve as a Continuous Integration test.)
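
A minimal sketch of what such independent evaluation code could look like is given below. It assumes the results are available as (query_id, label, score) rows and uses common textbook definitions of average precision and NDCG (with gain 2^label - 1); these may differ in detail from the metrics inside MatchZoo, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def average_precision(labels):
    """labels: relevance values in ranked order; a value > 0 counts as relevant."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(labels, start=1):
        if rel > 0:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def ndcg_at_k(labels, k):
    """Normalized discounted cumulative gain over the top k ranked labels."""
    def dcg(ranked):
        return sum((2 ** rel - 1) / math.log2(i + 2) for i, rel in enumerate(ranked[:k]))

    ideal = dcg(sorted(labels, reverse=True))
    return dcg(labels) / ideal if ideal > 0 else 0.0


def evaluate(rows, k=3):
    """rows: iterable of (query_id, label, score). Returns query-averaged metrics."""
    by_query = defaultdict(list)
    for qid, label, score in rows:
        by_query[qid].append((score, label))
    if not by_query:
        return {"map": 0.0, f"ndcg@{k}": 0.0}

    aps, ndcgs = [], []
    for items in by_query.values():
        ranked = [label for _, label in sorted(items, key=lambda x: x[0], reverse=True)]
        aps.append(average_precision(ranked))
        ndcgs.append(ndcg_at_k(ranked, k))
    n = len(by_query)
    return {"map": sum(aps) / n, f"ndcg@{k}": sum(ndcgs) / n}


rows = [("q1", 1, 0.9), ("q1", 0, 0.1), ("q2", 0, 0.8), ("q2", 1, 0.2)]
print(evaluate(rows, k=3))
```

Because the functions depend only on the exported rows, the same script can score results produced by different commits or by other toolkits.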

    What do you think, @faneshion @yangliuy @bwanglzu @millanbatra @mandroid6?


    opened by aneesh-joshi 20
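  • Is TensorFlow 2.0 fully supported at the moment?


    As the title says, my current runtime environment uses TF 2.0 with Keras 2.3.0, but the code fails to run. The error message is as follows: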

    ~/anaconda3/lib/python3.7/site-packages/keras/engine/training.py in _prepare_total_loss(self, masks)
        691                     output_loss = loss_fn(
    --> 692                         y_true, y_pred, sample_weight=sample_weight)
        694                 if len(self.outputs) > 1:
    TypeError: __call__() got an unexpected keyword argument 'sample_weight'
    opened by hezhefly 18
  • DSSM returning NaN for loss when used with tensorflow-gpu backend.

    I have been running DSSM on quite a large dataset and was looking at tensorflow-gpu to speed up the training. However, the returned loss and MAE are always NaN for both the training and evaluation phases. I have tried a very basic TensorFlow model from their tutorials and it works fine.

    I'm not really sure where to start debugging this, so any help would be greatly appreciated.

    The model works fine with the cpu version of tensorflow. Example:

    model.fit(x,y, epochs=2)
    Epoch 1/2
    10000/10000 [==============================] - 1s 139us/step - loss: nan - mean_absolute_error: nan
    Epoch 2/2
    10000/10000 [==============================] - 1s 138us/step - loss: nan - mean_absolute_error: nan
    opened by MichaelStarkey 18
  • Using a model as a search engine

    I see that the models usually need a text1 and a text2 to perform training and prediction. In search engines, I usually just need the text2 (document) to perform the indexing step (training).

    How can I train the model like a search engine? That is, I don't have the text1 information (query/question) and I want to index my documents.

    Does using the same text for text1 and text2 work for training?

    opened by denisb411 18
  • add preparation data for TREC data set

    I've added all modules for processing the TREC dataset. Mainly, the modifications make it possible to produce a TREC-like run with the corresponding IDs for queries and documents, so evaluation with trec_eval is now possible, in addition to performing n-fold cross-validation with MatchZoo. Soon, I'll add programs for constructing the TREC input files needed by the added functions.
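
For readers unfamiliar with the run files that trec_eval consumes, each line has the form "query_id Q0 doc_id rank score run_tag". The sketch below writes such lines from per-query scores; it is an illustration only, not the code added in this pull request, and the function name `format_trec_run` and the default run tag are hypothetical.

```python
def format_trec_run(scores, run_tag="matchzoo"):
    """Turn {query_id: [(doc_id, score), ...]} into trec_eval run lines."""
    lines = []
    for qid in sorted(scores):
        # Rank documents by descending score; ranks start at 1.
        ranked = sorted(scores[qid], key=lambda pair: pair[1], reverse=True)
        for rank, (doc_id, score) in enumerate(ranked, start=1):
            lines.append(f"{qid} Q0 {doc_id} {rank} {score:.6f} {run_tag}")
    return lines


for line in format_trec_run({"q1": [("d1", 0.2), ("d2", 0.9)]}):
    print(line)
```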

    opened by thiziri 18
  • support keras 2.3 and tensorflow 2.0

    • update requirements.txt: keras=2.3.0 and tensorflow >= 2.0.0
    • upgrade pip in .travis.yml (tf 2.0 requires pip >= 19)
    • make ranking losses inherit keras.losses.Loss to support sample_weight keyword param
    • replace some keras.backend.tf with tf (K.tf does not exist anymore in 2.3.0 as keras is going to be synced with tf.keras and drop multi-backend)
    • add clear_session before prepare in model tests to prevent OOM during CI test
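
The sample_weight item in the list above comes down to a call-signature mismatch: the framework calls the loss with an extra keyword that a plain callable does not accept. The sketch below reproduces that mismatch without Keras; the classes `LegacyLoss` and `WeightAwareLoss` and the function `training_step` are hypothetical stand-ins, and the actual pull request relies on subclassing keras.losses.Loss rather than on code like this.

```python
class LegacyLoss:
    """A loss object whose __call__ only accepts (y_true, y_pred)."""

    def __call__(self, y_true, y_pred):
        return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


class WeightAwareLoss(LegacyLoss):
    """The same loss, but tolerant of the extra sample_weight keyword."""

    def __call__(self, y_true, y_pred, sample_weight=None):
        if sample_weight is None:
            return super().__call__(y_true, y_pred)
        total = sum(w * (t - p) ** 2 for w, t, p in zip(sample_weight, y_true, y_pred))
        return total / len(y_true)


def training_step(loss_fn, y_true, y_pred):
    """Stand-in for a framework call site that always passes sample_weight."""
    return loss_fn(y_true, y_pred, sample_weight=None)


print(training_step(WeightAwareLoss(), [1.0, 0.0], [0.5, 0.5]))
try:
    training_step(LegacyLoss(), [1.0, 0.0], [0.5, 0.5])
except TypeError as exc:
    print("legacy loss failed:", exc)
```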

    fix #789

    opened by matthew-z 17
  • A question about the manner of input data to model.fit_generator()

    I find that the input data is fed to the model through an outer iteration loop, as in the following plot.


    I am unsure why it is done this way, so I changed it as follows, because I want to use TensorBoard through a callback function. (image)

    I just use model.fit_generator() to handle the data and train. However, it raises an exception caused by validation_data (image: 2018-08-15 19-46-50). I traced it into the Keras internals and found that it occurs when the model starts to run evaluate_generator(). In evaluate_generator(), the eval data generator is empty, which leads to an exception at some epoch. Strangely, the exception does not occur in the first epoch. After tracing the code, I think it may be a bug in Keras; is that true? Also, is this the reason you use an outer iteration loop to train the model?


    opened by Star-in-Sky 17
  • Predict a new query

    I already searched here. I am using v1 right now. Is there any sample code (I only found a broken link)? I have a trained DRMM model and want to predict the document ranking for a new query.

    What is the current state of this in v2?

    I managed to train the model on my own custom text data with my own fast word embedding. Normally I would just predict a new query, but the outputs are the text IDs. So for DRMM, are new words that have no embedding in the dict ignored?

    Thank you very much!

    opened by datistiquo 15
  • TypeSpec error while DRMM model build

    Hello, I am getting the following error while I am trying to build the DRMM model for my ranking task at this line (here)

    TypeError: Could not build a TypeSpec for KerasTensor(type_spec=TensorSpec(shape=(1, None, 10, 1), dtype=tf.float32, name=None), name='tf.operators.add/AddV2:0', description="created by layer 'tf.operators.add'") of unsupported type <class 'keras.engine.keras_tensor.KerasTensor'>.

    Please note that I am using the following environment configuration on my local machine:

    • Python 3.8.11
    • MatchZoo 2.1.0
    • tensorflow 2.8.0

    Describe your attempts

    • I checked the documentation and found no answer
    • I checked to make sure that this is not a duplicate issue
    • Additionally, I also tried different kinds of solution like this here
    • And here


    • OS [macOS 12.4]:
    • Hardware [Metal M1]

    Thank you for your help and time.

    Regards, Govind Shukla

    opened by govind17 0
  • Bug/enhancement


    Describe the bug

    MatchZoo breaks when run in Google Colab because of deprecated dependencies in Keras.

    To Reproduce

    Attempt to import MatchZoo in Google Colab:

    !pip3 install matchzoo

    import tensorflow
    from tensorflow import keras
    import matchzoo as mz
    import nltk
    import pandas as pd

    Describe your attempts

    Attempted to run MatchZoo in Google Colab and fixed the dependency issues.


    Nine files needed edits.

    data_generator.py:
        import tensorflow  # Added to fix toolchain issues
        # import keras
        from tensorflow import keras  # Changed from previous line

    The following eight files all received the same change:
        attention_layer.py
        decaying_dropout_layer.py
        dynamic_pooling_layer.py
        matching_layer.py
        matching_tensor_layer.py
        multi_perspective_layer.py
        semantic_composite_layer.py
        spatial_gru.py

        # from keras.engine import Layer
        from keras.layers import Layer  # Changed from previous line to fix tensorflow toolchain
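
Instead of hard-coding one import path per Keras version, a module can try several locations in order. The helper below is a generic sketch of that pattern using only the standard library; `import_first` is a hypothetical name, not something MatchZoo or Keras provides, and the Keras paths in the usage comment are the ones quoted in this issue.

```python
import importlib


def import_first(candidates):
    """Return the first attribute importable from a list of (module, name) pairs."""
    errors = []
    for module_name, attr in candidates:
        try:
            module = importlib.import_module(module_name)
            return getattr(module, attr)
        except (ImportError, AttributeError) as exc:
            errors.append(f"{module_name}.{attr}: {exc}")
    raise ImportError("none of the candidates could be imported:\n" + "\n".join(errors))


# Hypothetical usage inside a layer module:
# Layer = import_first([("keras.layers", "Layer"), ("keras.engine", "Layer")])

# Self-contained demonstration with standard-library modules:
OrderedDict = import_first([("no_such_module_xyz", "A"), ("collections", "OrderedDict")])
print(OrderedDict)
```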

    Additional Information

    I cloned the repo and will push this update as a contribution to the code base.

    opened by jerrycearley 2
  • Is a Deep Component (representation-focused model) available in MatchZoo?

    Hello, I'm working on the SLGen paper (Structure Learning for Headline Generation, AAAI-20). In this paper, they use Deep and Wide components for the structural representation of text documents.

    My question is: does MatchZoo provide a way to obtain a Deep component (an interaction-focused model and a representation-focused model) for a text document?

    If anybody is working on this, please help me out!

    Thank you, Darshan Tank

    opened by Darshan2104 0
  • DSSM model.predict() scores rank does not match with the rank by dot layer cosine similarity

    Describe the Question

    I have a trained DSSM model and wanted to compare the item ranking based on the DSSM model.predict() scores against the ranking based on the cosine similarity scores after the model's dot layer. I would expect the two rankings to be the same, since model.predict() is just the final score after a linear activation. However, the results are completely opposite, and I'm trying to understand how that can be, given that the linear coefficient of the final dense layer is positive.

    Describe your attempts

    • [x] I walked through the tutorials
    • [x] I checked the documentation
    • [x] I checked to make sure that this is not a duplicate question

    1. DSSM model summary
    2. Predicted scores comparison
    3. Predicted dataframe with two sets of scores, sorted here by pred_score, which gives a completely opposite ranking compared to sorting by dot score

    opened by jchen0529 0
  • set_up.py missing tensorflow

    Describe the bug

    The project needs TensorFlow, but set_up.py does not list the package. Although requirements.txt contains it, running pip install -e . does not install it, and a "no module" error occurs. Is there any reason for not including TensorFlow in set_up.py?

    To Reproduce

    pip3 install -e .
    python3 -m pytest -v tests/unit_test/processor_units/test_processor_units.py

    ============================= test session starts ==============================
    platform linux -- Python 3.7.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /mnt/zejun/smp/data/python_star_2000repo/MatchZoo/venv_test_7/bin/python3.7
    cachedir: .pytest_cache
    rootdir: /mnt/zejun/smp/data/python_star_2000repo/MatchZoo
    plugins: cov-3.0.0, mock-3.6.1
    collecting ... collected 0 items / 1 error

    ==================================== ERRORS ====================================
    ___ ERROR collecting tests/unit_test/processor_units/test_processor_units.py ___
    ImportError while importing test module '/mnt/zejun/smp/data/python_star_2000repo/MatchZoo/tests/unit_test/processor_units/test_processor_units.py'.
    Hint: make sure your test modules/packages have valid Python names.
    Traceback:
    /usr/lib/python3.7/importlib/__init__.py:127: in import_module
        return _bootstrap._gcd_import(name[level:], package, level)
    tests/unit_test/processor_units/test_processor_units.py:4: in <module>
        from matchzoo.preprocessors import units
    matchzoo/__init__.py:20: in <module>
        from . import preprocessors
    matchzoo/preprocessors/__init__.py:1: in <module>
        from . import units
    matchzoo/preprocessors/units/__init__.py:13: in <module>
        from .tokenize import Tokenize
    matchzoo/preprocessors/units/tokenize.py:2: in <module>
        from matchzoo.utils.bert_utils import is_chinese_char,
    matchzoo/utils/__init__.py:4: in <module>
        from .make_keras_optimizer_picklable import make_keras_optimizer_picklable
    matchzoo/utils/make_keras_optimizer_picklable.py:1: in <module>
        import keras
    venv_test_7/lib/python3.7/site-packages/keras/__init__.py:21: in <module>
        from tensorflow.python import tf2
    E   ModuleNotFoundError: No module named 'tensorflow'
    =========================== short test summary info ============================
    ERROR tests/unit_test/processor_units/test_processor_units.py
    !!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!
    =============================== 1 error in 0.91s ===============================

    Describe your attempts

    • [x] I checked the documentation and found no answer
    • [x] I checked to make sure that this is not a duplicate issue


    • Ubuntu
    opened by idiomaticrefactoring 1
  • GPU-Utils is low 1%

    Describe the bug

    run the example in Get Started in 60 Seconds



    • OS Ubuntu18.04
    • Hardware Tesla 80k, cuda 10.1,cudnn7.0
    • matchzoo 2.2.0, tensorflow2.2.0, keras2.3.0


    opened by lonelydancer 0
  • v2.2 (Oct 9, 2019)

  • v2.1 (Apr 4, 2019)

    • add automation modules
      • mz.auto.tuner, which automatically searches for model hyper-parameters
      • mz.auto.preparer, which unifies model preprocessing and training processes
    • add QuoraQP dataset
    • rewrite mz.DataGenerator to be callback-based
    • fix models behaviors under classification tasks
    • reorganize project structure, the most significant one being moving processor_units to preprocessors.units
    • rename redundant names (e.g. NaiveModel -> Naive, TokenizeUnit -> Tokenize)
    • update the tutorials
    • various other updates
Neural Text Matching Community