For Tok-k passages that have passed through the Bi-Encoder Retrival, ReRank is performed using CrossEncoder.

Overview

Cross-Encoder-with-Bi-Encoder

For Tok-k passages that have passed through the Bi-Encoder Retrival, ReRank is performed using CrossEncoder.

Data

Data used by "Open-Domain Question Answering Competition" hosted by Aistages, and copyrights can be used under CC-BY-2.0.

+- data
|   +- train_dataset
    |   +- train
        |   +- dataset.arrow
        |   +- dataset_info.json
        |   +- indices.arrow
        |   +- state.json
    |   +- validataion
        |   +- dataset.arrow
        |   +- dataset_info.json
        |   +- indices.arrow
        |   +- state.json
    |   +- dataset_dict.json
|   +- test_dataset
    |   +- validation
        |   +- dataset.arrow
        |   +- dataset_info.json
        |   +- indices.arrow
        |   +- state.json
    |   +- dataset_dict.json
|   +- wikipedia_documents.json
  • Wikipedia data can be uploaded to the folder location above and used.
!git clone https://github.com/jjonhwa/Cross-Encoder-with-Bi-Encoder.git # git clone
% cd ./Cross-Encoder-with-Bi-Encoder/_data                              # change directory (input your own path)

!gdown --id 1O-kxt4DupOibNhkwmg3luTLt07faRgvO # wiki data upload        # download wikipedia data

Setup

Dependencies

  • datasets==1.5.0
  • transformers==4.5.0
  • tqdm==4.41.1
  • pandas==1.1.4
  • CUDA==11.0

Install Requirements

bash install_requirements.sh

Hardware

  • GPU : Tesla V100 (32GB)

Checkpoint

  • You can check the code in the Colab environment using Demo.
  • It does not work in Colab Basic.

What can we do to improve the performance of Retriever?

1. Explore the data set production process.

  • Sparse Embedding may be better in tasks for viewing Passage and creating a question (if there is an annotation bias), such as SQuAD.
  • In most other data, documents can be extracted with higher accuracy if Dense Passage Retreat is used.

2. Sparse Embedding & Dense Embedding

  • Most of the content was knowledge obtained by referring to Paper, and based on this, it led to improvement in Retriever performance.
  • Prior to the application of DPR, in the case of 'KLUE MRC database' in which datasets were configured in the same manner as SQuAD, it would be better to utilize techniques such as Sparse embedding technique BM25 compared to DPR.
  • Actually, until ReRank Strategy was applied, the highest performance was achieved with elastic search based on BM25.
  • When only biencoder was used, Retrieval accuracy was far below elastic search in the 'KLUE MRC competition'
  • Retrieval Accuracy in our Data
Top-5 Top-50 Top-100
Elastic Search 0.852 0.945 0.962
DPR Bi-Encoder - 0.775 0.85

3. ReRank Strategy with CrossEncoder (In-Batch_Negative Samples)

  • Our purpose is to bring high performance from KLUE MRC competition to End-to-End from Retrieval to Reader. From this, the ReRank strategy using Cross Encoder was used.
  • In addition, when implementing Cross Encoder, the key point is to extract a negative sample within Batch and use it to calculate loss.
  • After extracting the Retrival Passage of the Top-500 using the Bi-Encoder, only a small number of Passages are finally extracted by returning to the Cross Encoder.
  • Retrieval Accuracy in our Data
Top-5 Top-50 Top-100
Elastic Search 0.852 0.945 0.962
DPR without CrossEncoder - 0.775 0.85
DPR with CrossEncoder 0.825 0.95 -

4. Ensemble

  • In this process, the contents of CrossEncoder were mainly written, and the contents of Ensemble were omitted.
  • An experiment was conducted assuming that performance improvement would be achieved from different types of Retrival combinations by conducting Ensemble using Sparse Embedding and Dense Embedding.
  • Top-100 was selected using Elastic Search and Top-100 was selected using DPR and Cross Encoder, and the final output score was calculated by combining them 1 to 1 and normalizing them.
  • When the final Reader model was tested, when Top-5 was input, the performance was the best, so the experiment was conducted after limiting the number of passages to be returned to five.
  • Actually, the performance has improved significantly, and the retrival accuracy is as follows.
Top-5 Top-50 Top-100
Elastic Search 0.852 0.945 0.962
DPR with CrossEncoder 0.825 0.95 -
Ensemble 0.9082 - -

Train CrossEncoder & BiEncoder

  • Learn crossencoder and biencoder and store them.
  • Modify only the data path to match your data. (find "your_dataset_path")
python train.py --encoder 'cross' --output_directory './save_directory/'

or

python train.py --encoder 'bi' --output_directory './save_directory/'

Run ReRank

  • It precedes creating an encoder using crossencoder and biencoder. (Before Run ReRank, you have to run 'train.py' to make)
  • Modify only the data path to match your data. (find "your_dataset_path")
python rerank.py --input_directory './save_directory/'

Run Retriever Demo

  • Top 500 Passages are Retrieved from about 60000 data using Biencoder, and Top 5 is finally retrieved using CrossEncoder.
  • Passage Embedding about wiki data, Cross Encoder and Bi-Encoder can be downloaded and utilized
  • Open In Colab
A bunch of codes for procedurally modeling and texturing futuristic cities.

Procedural Futuristic City This is our final project for CPSC 479. We created a procedural futuristic city complete with Worley noise procedural textu

1 Dec 22, 2021
Cairo-math-64x61 - Fixed point 64.61 math library for Cairo / Starknet

Cairo Math 64x61 A fixed point 64.61 math library for Cairo & Starknet Signed 64

Influence 63 Dec 05, 2022
Python interface to IEX and IEX cloud APIs

Python interface to IEX Cloud Referral Please subscribe to IEX Cloud using this referral code. Getting Started Install Install from pip pip install py

IEX Cloud 41 Dec 21, 2022
Catalogue CRUD Application

This Python program creates a relational SQL database hosted on the Snowflake platform, then opens a CRUD GUI to manipulate and view the data. In this application, it is used as a book catalogue. CUR

0 Dec 13, 2022
A simple watcher for the XTZ/kUSD pool on Quipuswap

Kolibri Quipuswap Watcher This repo holds the source code for the QuipuBot bot deployed to the #quipuswap-updates channel in the Kolibri Discord Setup

Hover Labs 1 Nov 18, 2021
Clackety Keyboards Powered by Python

KMK: Clackety Keyboards Powered by Python KMK is a feature-rich and beginner-friendly firmware for computer keyboards written and configured in Circui

KMK Firmware 780 Jan 03, 2023
Some out-of-the-box hooks for pre-commit

pre-commit-hooks Some out-of-the-box hooks for pre-commit. See also: https://github.com/pre-commit/pre-commit Using pre-commit-hooks with pre-commit A

pre-commit 3.6k Dec 29, 2022
An end-to-end Python-based Infrastructure as Code framework for network automation and orchestration.

Nectl An end-to-end Python-based Infrastructure as Code framework for network automation and orchestration. Features Data modelling and validation. Da

Adam Kirchberger 15 Oct 14, 2022
This module extends twarc to allow you to print out tweets as text for easy testing on the command line

twarc-text This module extends twarc to allow you to print out tweets as text for easy testing on the command line. Maybe it's useful for spot checkin

Documenting the Now 2 Oct 12, 2021
Scraping comments from the political section of popular Nigerian blog (Nairaland), and saving in a CSV file.

Scraping_Nairaland This project scraped comments from the political section of popular Nigerian blog www.nairaland.com using the Python BeautifulSoup

Ansel Orhero 1 Nov 14, 2021
EloGGs 🎮 is a 1v1.LOL Trophy Boosting Program (PATCHED)

EloGGs 🎮 is an old patched 1v1.LOL boosting program I developed months ago, My team made around $1000 total off of this, but now it's been patched by the developers.

doop 1 Jul 22, 2022
this is a basic python project that I made using python

this is a basic python project that I made using python. This project is only for practice because my python skills are still newbie.

Elvira Firmansyah 2 Dec 14, 2022
Python library for ODE integration via Taylor's method and LLVM

heyoka.py Modern Taylor's method via just-in-time compilation Explore the docs » Report bug · Request feature · Discuss The heyókȟa [...] is a kind of

Francesco Biscani 45 Dec 21, 2022
Open Source defrag's mod code

Open Source defrag's mod code Goals: Code & License: Respect FOSS philosophy. Open source and community focus. Eliminate all traces of q3a-sdk licensi

sOkam! 1 Dec 10, 2022
Gerenciador de processos e registros pessoais do Departamento de Fiscalização de Produtos Controlados.

CRManager Gerenciador de processos e registros pessoais do Departamento de Fiscalização de Produtos Controlados. Descrição Este projeto tem como objet

Wolfgang Almeida 1 Nov 15, 2021
Running a complete single-node all-in-one cluster instance of TIBCO ActiveMatrix™ BusinessWorks 6.8.0.

TIBCO ActiveMatrix™ BusinessWorks 6.8 Docker Image Image for running a complete single-node all-in-one cluster instance of TIBCO ActiveMatrix™ Busines

Federico Alpi 1 Dec 10, 2021
These are my solutions to Advent of Code problems.

Advent of Code These are my solutions to Advent of Code problems. If you want to join my leaderboard, the code is 540750-9589f56d. When I solve for sp

Sumner Evans 5 Dec 19, 2022
Batch Python Program Verify

Batch Python Program Verify About As a TA(teaching assistant) of Programming Class, it is very annoying to test students' homework assignments one by

Han-Wei Li 7 Dec 20, 2022
OB_Template is a vault template reference for using Obsidian.

Obsidian Template OB_Template is a vault template reference for using Obsidian. If you've tested out Obsidian. and worked through the "Obsidian Help"

323 Dec 27, 2022
This tool allows you to do goole dorking much easier

This tool allows you to do goole dorking much easier

Steven 8 Mar 06, 2022