BMS-Molecular-Translation

Introduction

This is a pipeline for the Bristol-Myers Squibb – Molecular Translation competition by Vadim Timakin and Maksim Zhdanov. We earned bronze medals in this competition. A significant part of the code originated from Y.Nakama's notebook.

The competition was an image-to-text translation task: given an image of a molecular skeletal structure, predict its InChI chemical identifier, for example:

InChI=1S/C16H13Cl2NO3/c1-10-2-4-11(5-3-10)16(21)22-9-15(20)19-14-8-12(17)6-7-13(14)18/h2-8H,9H2,1H3,(H,19,20)

Solution

General Encoder-Decoder concept

Most participants used a CNN encoder to extract image features and a decoder (LSTM/GRU/Transformer) to generate the text sequence. This is the standard approach to the image captioning problem.

Pseudo-labelling with InChI validation using RDKit

RDKit is an open-source toolkit for cheminformatics, and it proved quite useful for this problem. Once our first model scored around 7-8 on the public leaderboard, we decided to pseudo-label the test data. Ordinarily, pseudo-labelling brings a significant number of wrong predictions into the extended training set. With RDKit we validated all of our predicted formulas and selected around 800k samples that passed validation. Keeping wrong labels out of the pseudo-labels improved the score.
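The filtering step can be sketched roughly as follows. This is a minimal illustration, not the code used in the competition: the function names `rdkit_is_valid` and `select_pseudo_labels` and the dictionary layout are assumptions made for the example, and "valid" here only means that RDKit's `Chem.MolFromInchi` can parse the string.

```python
from typing import Callable, Dict

try:
    from rdkit import Chem, RDLogger
    RDLogger.DisableLog("rdApp.*")  # silence parse warnings for malformed strings
except ImportError:  # RDKit is optional for this sketch
    Chem = None


def rdkit_is_valid(inchi: str) -> bool:
    """Return True if RDKit can parse the InChI string into a molecule."""
    if Chem is None:
        raise RuntimeError("RDKit is not installed")
    return Chem.MolFromInchi(inchi) is not None


def select_pseudo_labels(
    predictions: Dict[str, str],
    is_valid: Callable[[str], bool] = rdkit_is_valid,
) -> Dict[str, str]:
    """Keep only predictions whose InChI passes validation.

    `predictions` maps a (hypothetical) image id to the predicted InChI string.
    """
    return {
        image_id: inchi
        for image_id, inchi in predictions.items()
        if is_valid(inchi)
    }
```

The validator is passed in as a parameter so the filter can be tried without RDKit installed. Note that passing validation does not guarantee the label is correct for its image, only that it describes a parseable structure.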

Predictions normalization

This notebook explains InChI normalization.

Blending

Finally, we blended ~20 predictions from 2 models (mostly from different epochs), using RDKit validation to keep only formulas that correspond to a possible InChI structure.
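One way such a validity-aware blend could look is sketched below. The exact selection rule we used is not described here, so the majority vote among valid candidates, the fallback to the first candidate, and the name `blend_predictions` are all assumptions made for illustration. `is_valid` stands for any validator, for instance one that checks whether RDKit's `Chem.MolFromInchi` returns a molecule.

```python
from collections import Counter
from typing import Callable, Sequence


def blend_predictions(
    candidates: Sequence[str],
    is_valid: Callable[[str], bool],
) -> str:
    """Pick one InChI from several predictions for the same image.

    Assumed rule: vote among the candidates that pass validation;
    if none pass, fall back to voting among all candidates.
    `candidates` should be ordered best model first, because ties
    are resolved in favour of the candidate seen first.
    """
    if not candidates:
        raise ValueError("at least one candidate prediction is required")
    valid = [c for c in candidates if is_valid(c)]
    pool = valid if valid else list(candidates)
    # most_common keeps first-seen order among equal counts
    return Counter(pool).most_common(1)[0][0]
```

Ordering the candidates by single-model leaderboard score gives the strongest prediction the tie-break, which seems a reasonable default when most candidates come from neighbouring epochs of the same model.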

Final private LB score 1.79
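Leaderboard scores in this competition were the mean Levenshtein (edit) distance between predicted and ground-truth InChI strings, so lower is better. The sketch below shows how such a score can be computed; the function names are illustrative and this is not the official scoring code.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def mean_levenshtein(predictions: Sequence[str], targets: Sequence[str]) -> float:
    """Average edit distance over paired predictions and targets."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        raise ValueError("at least one pair is required")
    total = sum(levenshtein(p, t) for p, t in zip(predictions, targets))
    return total / len(predictions)
```

Under this metric, a score of 1.79 means the predicted strings differed from the true InChI by fewer than two character edits on average.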
