Protein Language Model

Overview

ProteinLM

We pretrain protein language model based on Megatron-LM framework, and then evaluate the pretrained model results on TAPE (Tasks Assessing Protein Embeddings), which contains a set of five biologically relevant semi-supervised learning tasks. And our pretrained model achieved good performance on these tasks.

Overview

The proposal of pre-training models such as Bert have greatly promoted the development of natural language processing, improving the performance of language models. Inspired by the similarity of amino acid sequence and text sequence, we consider applying the method of pre-training language model to biological data.

Guidance

We provide pretrain and finetune code in two separate folders. If you use the pretrained model we provide, you can simply download the checkpoint and follow the finetune guide. If you want to pretrain your own model yourself, you can refer to the pretrain guide.

Download ProteinLM

ProteinLM (200M)

For the pretrained model with 200 million parameters, you can download model checkpoint via GoogleDrive, or TsinghuaCloud.

ProteinLM (3B)

For the pretrained model with 3 billion parameters, you can download model checkpoint from here.

Project Structure

.
├── pretrain                (protein language model pretrain)
│   ├── megatron            (model folder)
│   ├── pretrain_tools      (multi-node pretrain)
│   ├── protein_tools       (data preprocess shells)
└── tape
    ├── conda_env           (conda env in yaml format)
    ├── converter           (converter script and model config files)
    ├── scripts             (model generator, finetune)
    └── tape                (tape model)

Usage

As the structure above shows, there are two stages as follows.

  • Pretrain
    • Prepare dataset (PFAM)
    • Preprocess data
    • Pretrain
  • Finetune
    • Convert pretrain protein model checkpoint
    • Finetune on downstream tasks

Detailed explanations are given in each folder's readme.

Downstream Tasks Performance

Task Metric TAPE ProteinLM (200M) ProteinLM (3B)
contact prediction [email protected]/5 0.36 0.52 0.75
remote homology Top 1 Accuracy 0.21 0.26 0.30
secondary structure Accuracy (3-class) 0.73 0.75 0.79
fluorescence Spearman's rho 0.68 0.68 0.68
stability Spearman's rho 0.73 0.77 0.79

Contact

If you have any problem using ProteinLM, feel free to contact us.

Reference

Our work is based on the following papers.

Besides, part of the code is based on Megatron-LM and TAPE.

Evaluating Protein Transfer Learning with TAPE

@article{DBLP:journals/corr/abs-1909-08053,
  author    = {Mohammad Shoeybi and
               Mostofa Patwary and
               Raul Puri and
               Patrick LeGresley and
               Jared Casper and
               Bryan Catanzaro},
  title     = {Megatron-LM: Training Multi-Billion Parameter Language Models Using
               Model Parallelism},
  journal   = {CoRR},
  volume    = {abs/1909.08053},
  year      = {2019},
  url       = {http://arxiv.org/abs/1909.08053},
  archivePrefix = {arXiv},
  eprint    = {1909.08053},
  timestamp = {Tue, 24 Sep 2019 11:33:51 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1909-08053.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

@article{DBLP:journals/corr/abs-1906-08230,
  author    = {Roshan Rao and
               Nicholas Bhattacharya and
               Neil Thomas and
               Yan Duan and
               Xi Chen and
               John F. Canny and
               Pieter Abbeel and
               Yun S. Song},
  title     = {Evaluating Protein Transfer Learning with {TAPE}},
  journal   = {CoRR},
  volume    = {abs/1906.08230},
  year      = {2019},
  url       = {http://arxiv.org/abs/1906.08230},
  archivePrefix = {arXiv},
  eprint    = {1906.08230},
  timestamp = {Sat, 23 Jan 2021 01:20:25 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1906-08230.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
Owner
THUDM
Data Mining Research Group at Tsinghua University
THUDM
:mag: Transformers at scale for question answering & neural search. Using NLP via a modular Retriever-Reader-Pipeline. Supporting DPR, Elasticsearch, HuggingFace's Modelhub...

Haystack is an end-to-end framework that enables you to build powerful and production-ready pipelines for different search use cases. Whether you want

deepset 6.4k Jan 09, 2023
Simple and efficient RevNet-Library with DeepSpeed support

RevLib Simple and efficient RevNet-Library with DeepSpeed support Features Half the constant memory usage and faster than RevNet libraries Less memory

Lucas Nestler 112 Dec 05, 2022
Samantha, A covid-19 information bot which will provide basic information about this pandemic in form of conversation.

Covid-19-BOT Samantha, A covid-19 information bot which will provide basic information about this pandemic in form of conversation. This bot uses torc

Neeraj Majhi 2 Nov 05, 2021
PyTorch Implementation of VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis.

VAENAR-TTS - PyTorch Implementation PyTorch Implementation of VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis.

Keon Lee 67 Nov 14, 2022
189 Jan 02, 2023
This repository contains the codes for LipGAN. LipGAN was published as a part of the paper titled "Towards Automatic Face-to-Face Translation".

LipGAN Generate realistic talking faces for any human speech and face identity. [Paper] | [Project Page] | [Demonstration Video] Important Update: A n

Rudrabha Mukhopadhyay 438 Dec 31, 2022
xFormers is a modular and field agnostic library to flexibly generate transformer architectures by interoperable and optimized building blocks.

Description xFormers is a modular and field agnostic library to flexibly generate transformer architectures by interoperable and optimized building bl

Facebook Research 2.3k Jan 08, 2023
GraphNLI: A Graph-based Natural Language Inference Model for Polarity Prediction in Online Debates

GraphNLI: A Graph-based Natural Language Inference Model for Polarity Prediction in Online Debates Vibhor Agarwal, Sagar Joglekar, Anthony P. Young an

Vibhor Agarwal 2 Jun 30, 2022
Chinese segmentation library

What is loso? loso is a Chinese segmentation system written in Python. It was developed by Victor Lin ( Fang-Pen Lin 82 Jun 28, 2022

COVID-19 Chatbot with Rasa 2.0: open source conversational AI

COVID-19 chatbot implementation with Rasa open source 2.0, conversational AI framework.

Aazim Parwaz 1 Dec 23, 2022
HAIS_2GNN: 3D Visual Grounding with Graph and Attention

HAIS_2GNN: 3D Visual Grounding with Graph and Attention This repository is for the HAIS_2GNN research project. Tao Gu, Yue Chen Introduction The motiv

Yue Chen 1 Nov 26, 2022
A library for finding knowledge neurons in pretrained transformer models.

knowledge-neurons An open source repository replicating the 2021 paper Knowledge Neurons in Pretrained Transformers by Dai et al., and extending the t

EleutherAI 96 Dec 21, 2022
Negative sampling for solving the unlabeled entity problem in NER. ICLR-2021 paper: Empirical Analysis of Unlabeled Entity Problem in Named Entity Recognition.

Negative Sampling for NER Unlabeled entity problem is prevalent in many NER scenarios (e.g., weakly supervised NER). Our paper in ICLR-2021 proposes u

Yangming Li 128 Dec 29, 2022
Watson Natural Language Understanding and Knowledge Studio

Material de demonstração dos serviços: Watson Natural Language Understanding e Knowledge Studio Visão Geral: https://www.ibm.com/br-pt/cloud/watson-na

Vanderlei Munhoz 4 Oct 24, 2021
Constituency Tree Labeling Tool

Constituency Tree Labeling Tool The purpose of this package is to solve the constituency tree labeling problem. Look from the dataset labeled by NLTK,

张宇 6 Dec 20, 2022
A framework for training and evaluating AI models on a variety of openly available dialogue datasets.

ParlAI (pronounced “par-lay”) is a python framework for sharing, training and testing dialogue models, from open-domain chitchat, to task-oriented dia

Facebook Research 9.7k Jan 09, 2023
Labelling platform for text using distant supervision

With DataQA, you can label unstructured text documents using rule-based distant supervision.

245 Aug 05, 2022
EdiTTS: Score-based Editing for Controllable Text-to-Speech

Official implementation of EdiTTS: Score-based Editing for Controllable Text-to-Speech

Neosapience 99 Jan 02, 2023
Code for paper "Role-oriented Network Embedding Based on Adversarial Learning between Higher-order and Local Features"

Role-oriented Network Embedding Based on Adversarial Learning between Higher-order and Local Features Train python main.py --dataset brazil-flights C

wang zhang 0 Jun 28, 2022
jiant is an NLP toolkit

🚨 Update 🚨 : As of 2021/10/17, the jiant project is no longer being actively maintained. This means there will be no plans to add new models, tasks,

ML² AT CILVR 1.5k Dec 28, 2022