For Tok-k passages that have passed through the Bi-Encoder Retrival, ReRank is performed using CrossEncoder.

Overview

Cross-Encoder-with-Bi-Encoder

For Tok-k passages that have passed through the Bi-Encoder Retrival, ReRank is performed using CrossEncoder.

Data

Data used by "Open-Domain Question Answering Competition" hosted by Aistages, and copyrights can be used under CC-BY-2.0.

+- data
|   +- train_dataset
    |   +- train
        |   +- dataset.arrow
        |   +- dataset_info.json
        |   +- indices.arrow
        |   +- state.json
    |   +- validataion
        |   +- dataset.arrow
        |   +- dataset_info.json
        |   +- indices.arrow
        |   +- state.json
    |   +- dataset_dict.json
|   +- test_dataset
    |   +- validation
        |   +- dataset.arrow
        |   +- dataset_info.json
        |   +- indices.arrow
        |   +- state.json
    |   +- dataset_dict.json
|   +- wikipedia_documents.json
  • Wikipedia data can be uploaded to the folder location above and used.
!git clone https://github.com/jjonhwa/Cross-Encoder-with-Bi-Encoder.git # git clone
% cd ./Cross-Encoder-with-Bi-Encoder/_data                              # change directory (input your own path)

!gdown --id 1O-kxt4DupOibNhkwmg3luTLt07faRgvO # wiki data upload        # download wikipedia data

Setup

Dependencies

  • datasets==1.5.0
  • transformers==4.5.0
  • tqdm==4.41.1
  • pandas==1.1.4
  • CUDA==11.0

Install Requirements

bash install_requirements.sh

Hardware

  • GPU : Tesla V100 (32GB)

Checkpoint

  • You can check the code in the Colab environment using Demo.
  • It does not work in Colab Basic.

What can we do to improve the performance of Retriever?

1. Explore the data set production process.

  • Sparse Embedding may be better in tasks for viewing Passage and creating a question (if there is an annotation bias), such as SQuAD.
  • In most other data, documents can be extracted with higher accuracy if Dense Passage Retreat is used.

2. Sparse Embedding & Dense Embedding

  • Most of the content was knowledge obtained by referring to Paper, and based on this, it led to improvement in Retriever performance.
  • Prior to the application of DPR, in the case of 'KLUE MRC database' in which datasets were configured in the same manner as SQuAD, it would be better to utilize techniques such as Sparse embedding technique BM25 compared to DPR.
  • Actually, until ReRank Strategy was applied, the highest performance was achieved with elastic search based on BM25.
  • When only biencoder was used, Retrieval accuracy was far below elastic search in the 'KLUE MRC competition'
  • Retrieval Accuracy in our Data
Top-5 Top-50 Top-100
Elastic Search 0.852 0.945 0.962
DPR Bi-Encoder - 0.775 0.85

3. ReRank Strategy with CrossEncoder (In-Batch_Negative Samples)

  • Our purpose is to bring high performance from KLUE MRC competition to End-to-End from Retrieval to Reader. From this, the ReRank strategy using Cross Encoder was used.
  • In addition, when implementing Cross Encoder, the key point is to extract a negative sample within Batch and use it to calculate loss.
  • After extracting the Retrival Passage of the Top-500 using the Bi-Encoder, only a small number of Passages are finally extracted by returning to the Cross Encoder.
  • Retrieval Accuracy in our Data
Top-5 Top-50 Top-100
Elastic Search 0.852 0.945 0.962
DPR without CrossEncoder - 0.775 0.85
DPR with CrossEncoder 0.825 0.95 -

4. Ensemble

  • In this process, the contents of CrossEncoder were mainly written, and the contents of Ensemble were omitted.
  • An experiment was conducted assuming that performance improvement would be achieved from different types of Retrival combinations by conducting Ensemble using Sparse Embedding and Dense Embedding.
  • Top-100 was selected using Elastic Search and Top-100 was selected using DPR and Cross Encoder, and the final output score was calculated by combining them 1 to 1 and normalizing them.
  • When the final Reader model was tested, when Top-5 was input, the performance was the best, so the experiment was conducted after limiting the number of passages to be returned to five.
  • Actually, the performance has improved significantly, and the retrival accuracy is as follows.
Top-5 Top-50 Top-100
Elastic Search 0.852 0.945 0.962
DPR with CrossEncoder 0.825 0.95 -
Ensemble 0.9082 - -

Train CrossEncoder & BiEncoder

  • Learn crossencoder and biencoder and store them.
  • Modify only the data path to match your data. (find "your_dataset_path")
python train.py --encoder 'cross' --output_directory './save_directory/'

or

python train.py --encoder 'bi' --output_directory './save_directory/'

Run ReRank

  • It precedes creating an encoder using crossencoder and biencoder. (Before Run ReRank, you have to run 'train.py' to make)
  • Modify only the data path to match your data. (find "your_dataset_path")
python rerank.py --input_directory './save_directory/'

Run Retriever Demo

  • Top 500 Passages are Retrieved from about 60000 data using Biencoder, and Top 5 is finally retrieved using CrossEncoder.
  • Passage Embedding about wiki data, Cross Encoder and Bi-Encoder can be downloaded and utilized
  • Open In Colab
Drug Discovery App Using Lipinski's Rule-of-Five.

Drug Discovery App A Drug Discovery App Using Lipinski's Rule-of-Five. TAPIWA CHAMBOKO 🚀 About Me I'm a full stack developer experienced in deploying

tapiwa chamboko 3 Nov 08, 2022
Generating rent availability info from Effort rent

Rent-info Generating rent availability info from Effort rent Pre-Installation Latest version of python Pip module json, os, requests, datetime, time i

Laixuan 1 Oct 20, 2021
Simple project to learn more about Bézier curves

Python Quadratic Bézier Simple project to learn more about Bézier curves. On this project i used some api's to graphics and gui pygame thorpy in theor

Kenned Ferreira 2 Mar 06, 2022
Simple python code for compile brainfuck program.

py-brainf*ck Just a basic compiled that compiles your brainf*ck codes and gives you informations about memory, used cells, dumped version, logs etc...

4 Jun 13, 2021
Always fill your package requirements without the user having to do anything! Simple and easy!

WSL Should now work always-fill-reqs-python3 Always fill your package requirements without the user having to do anything! Simple and easy! Supported

Hashm 7 Jan 19, 2022
Discover and load entry points from installed packages

Entry points are a way for Python packages to advertise objects with some common interface. The most common examples are console_scripts entry points,

Thomas Kluyver 69 Jul 05, 2022
A hackers attempt at an MVP anki plugin

my anki plugin if you have found this by accident, you should probably run away this is nothing more than a hackers attempt at an MVP anki plugin I re

Chris Hall 1 Nov 02, 2021
a package that provides a marketstrategy for whitelisting on golem

filterms a package that provides a marketstrategy for whitelisting on golem watching requestor logs distribute 10 tasks asynchronously is fun. but you

KJM 3 Aug 03, 2022
Module for working with the site dnevnik.ru with python

dnevnikru Module for working with the site dnevnik.ru with python Dnevnik object accepts login and password from the dnevnik.ru account Methods: homew

Aleksandr 21 Nov 21, 2022
A Bot Which Can generate Random Account Based On Your Hits.

AccountGenBot This Bot Can Generate Account With Hits You Save (Randomly) Keyfeatures Join To Use Support Limit Account Generation Using Sql Customiza

DevsExpo 30 Oct 21, 2022
Structural basis for solubility in protein expression systems

Structural basis for solubility in protein expression systems Large-scale protein production for biotechnology and biopharmaceutical applications rely

ProteinQure 16 Aug 18, 2022
An application for automation of the mining function in the game Alienworlds.IO

alienautomation A Python script made to automate the tidious job of mining on AlienWorlds This script: Automatically opens the browser Automatically l

anonieXdev 42 Dec 03, 2022
Python package for handling and analyzing PSRFITS files

PyPulse A pure-Python package for handling and analyzing PSRFITS files. Read the documentation here. This is an alternate code base from PSRCHIVE. Req

Michael Lam 15 Nov 30, 2022
Python script for the radio in the Junior float.

hoco radio 2021 Python script for the radio in the Junior float. Populate the ./music directory with 2 or more .wav files and run radio2.py. On the Ra

Kevin Yu 2 Jan 18, 2022
Simulation simplifiée du fonctionnement du protocole RIP

ProjetRIPlay v2 Simulation simplifiée du fonctionnement du protocole RIP par Eric Buonocore le 18/01/2022 Sur la base de l'exercice 5 du sujet zéro du

Eric Buonocore 2 Feb 15, 2022
Create a simple program by applying the use of class

TUGAS PRAKTIKUM 8 💻 Nama : Achmad Mahfud NIM : 312110520 Kelas : TI.21.C5 Perintah : Buat program sederhana dengan mengaplikasikan pengguna

Achmad Mahfud 1 Dec 23, 2021
Tool to audit and fix Python project requirements.

Requirement Auditor Utility to revise and updated python requirement files.

Luis Carlos Berrocal 1 Nov 07, 2021
Would upload anything I do with/related to brainfuck

My Brainfu*k Repo Basically wanted to create something with Brainfu*k but realized that with the smol brain I have, I need to see the cell values real

Rafeed 1 Mar 22, 2022
A basic notes app to store your notes.

Notes Webapp A basic notes webapp to keep your notes.You can add, edit and delete notes after signing up. To add a note type your note in the text box

2 Oct 23, 2021
Extract continuous and discrete relaxation spectra from G(t)

pyReSpect-time Extract continuous and discrete relaxation spectra from stress relaxation modulus G(t). The papers which describe the method and test c

3 Nov 03, 2022