Structural basis for solubility in protein expression systems

Overview


Large-scale protein production for biotechnology and biopharmaceutical applications relies on high protein solubility in expression systems. Solubility has been measured for a significant fraction of the E. coli and S. cerevisiae proteomes, and these datasets are routinely used to train predictors of protein solubility in different organisms. Thanks to continued advances in experimental structure determination and modelling, many of these solubility measurements can now be paired with accurate structural models.

The challenge is mentored by Christopher Ing and Mark Fingerhuth.

Aim of the challenge

The objective of this project is to use the provided dataset of paired protein structures and solubility values to build a solubility predictor whose accuracy is comparable to that of the sequence-based predictors reported in the literature. The dataset was created by following the curation procedure described in the SOLart paper, and this hackathon project has a similar aim to that manuscript.

The dataset

The process of generating the dataset is described in the SOLart manuscript. At a high level, all experimentally tested E. coli and S. cerevisiae proteins were matched through Uniprot IDs to known crystallographic structures or to homology models with high sequence similarity. After balancing the fold types using CATH, a dataset containing a balanced spread of solubility values was produced. The resulting proteins for training and testing were disclosed in the supplemental material of that paper as a list of (Uniprot,PDB,Chain,Solubility) tuples. The PDB files were not included in that work, so we re-extracted them from SWISS-MODEL. Whenever a crystallographic structure was available, it was used, provided it had high coverage of the Uniprot sequence. In some cases, the PDB templates used in the original SOLart paper had been superseded by improved templates, and we opted to take the models with the highest resolution and highest sequence identity for our updated dataset. We stripped away all irrelevant chains and heteroatoms.
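The chain and heteroatom stripping described above can be illustrated with a minimal sketch. This is not the pipeline that was used to build the dataset; the function name `strip_to_chain` is hypothetical, and the sketch relies only on the fixed-column layout of the PDB format (record name in columns 1-6, chain identifier in column 22).

```python
def strip_to_chain(pdb_text: str, chain_id: str) -> str:
    """Keep only the ATOM records of one chain (hypothetical helper).

    HETATM records (ligands, waters, ions) and atoms of other chains
    are dropped, as are all header and remark lines.
    """
    kept = []
    for line in pdb_text.splitlines():
        # Column 22 of the PDB format (index 21, zero-based) holds the chain ID.
        if line.startswith("ATOM") and len(line) > 21 and line[21] == chain_id:
            kept.append(line)
    kept.append("END")
    return "\n".join(kept) + "\n"


if __name__ == "__main__":
    sample = (
        "ATOM      1  N   MET A   1\n"
        "ATOM      2  N   MET B   1\n"
        "HETATM    3  O   HOH A 101\n"
    )
    print(strip_to_chain(sample, "A"))
```

The files in this repository have already been cleaned, so a step like this is only relevant if you re-derive a structure from a different template.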

If you identify issues with individual structures, please refer to the Uniprot ID and manually investigate the best template. In some cases, we needed to improve structure correctness by modelling missing atoms or residues on a case-by-case basis using MOE, the Chemical Computing Group software.

The dataset can be found in the data/ subdirectory, already divided into training/ and test/ data. The training/ data comes with solubility_values.csv and solublity_values.yaml (same content, different format), which both contain the solubility target values for all the PDB files provided in that directory. Note that each PDB file is named after the Uniprot identifier of the respective protein, and the protein column in solubility_values.csv also contains the Uniprot identifiers.
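As a rough sketch of how the training targets might be paired with their structure files, the snippet below reads solubility_values.csv with the standard library. The function name `load_training_pairs` is hypothetical. Two details are assumptions rather than facts from this README: that the target column is named `solubility`, and that the structure files carry a `.pdb` extension. Adjust both if your checkout differs.

```python
import csv
from pathlib import Path


def load_training_pairs(training_dir, value_column="solubility", extension=".pdb"):
    """Return (uniprot_id, pdb_path, solubility) tuples for a training directory.

    `value_column` and `extension` are assumptions; only the `protein`
    column name is stated in the README.
    """
    training_dir = Path(training_dir)
    pairs = []
    with open(training_dir / "solubility_values.csv", newline="") as handle:
        for row in csv.DictReader(handle):
            pdb_path = training_dir / f"{row['protein']}{extension}"
            # Skip targets whose structure file is missing instead of failing.
            if pdb_path.exists():
                pairs.append((row["protein"], pdb_path, float(row[value_column])))
    return pairs


if __name__ == "__main__":
    for uniprot_id, path, value in load_training_pairs("data/training")[:5]:
        print(uniprot_id, path.name, value)
```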

The test/ dataset consists of three subdirectories (protein structures derived from different organisms and with different approaches), and you should NOT use them for any training. Only the yeast_crystal_structs/ directory contains solubility_values.csv and solublity_values.yaml (same content, different format), which you can use for local testing and validation. To find out your performance on the entire test dataset, you need to use the automated benchmarking system (see below).
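For local validation against the yeast_crystal_structs/ targets, a small scoring helper along the following lines may be useful. The metric used by the benchmarking system is not stated in this README, so Pearson correlation and RMSE are illustrative choices only, and the function names are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient; NaN if either input has no variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def rmse(xs, ys):
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs))


def evaluate(predicted, observed):
    """Score predictions on the proteins present in both dictionaries."""
    shared = sorted(set(predicted) & set(observed))
    if not shared:
        raise ValueError("no overlapping protein identifiers")
    pred = [float(predicted[p]) for p in shared]
    obs = [float(observed[p]) for p in shared]
    return {"n": len(shared), "pearson_r": pearson_r(pred, obs), "rmse": rmse(pred, obs)}


if __name__ == "__main__":
    print(evaluate({"P69829": 80, "P31133": 60}, {"P69829": 83, "P31133": 62}))
```

Both dictionaries map Uniprot identifiers to solubility values, so the same loader can feed either side.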

Example output

Your code should output a file called predictions.csv in the following format:

protein,solubility
P69829,83
P31133,62

where the protein column contains the Uniprot ID (corresponding to the filename of the PDB file) and the solubility column contains the predicted solubility value (int or float).

Note that there are three (!) test subsets, but you are expected to submit all the predictions in one file (not three) for the benchmarking system to work.
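A minimal sketch of producing that single file is shown below. It assumes the structure files sit one level below test/ with a `.pdb` extension. The helper names and the constant placeholder prediction are hypothetical stand-ins for your own model.

```python
import csv
from pathlib import Path


def collect_test_ids(test_dir, extension=".pdb"):
    """Gather Uniprot IDs from every subdirectory of the test directory."""
    return sorted(p.stem for p in Path(test_dir).glob(f"*/*{extension}"))


def write_predictions(predictions, out_path="predictions.csv"):
    """Write {uniprot_id: solubility} to one CSV with the expected header."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        writer.writerow(["protein", "solubility"])
        for protein, value in sorted(predictions.items()):
            writer.writerow([protein, value])


if __name__ == "__main__":
    # Placeholder model: predicts the same value for every protein.
    ids = collect_test_ids("data/test")
    write_predictions({uniprot_id: 50.0 for uniprot_id in ids})
```

Collecting identifiers across all subdirectories before writing keeps the three subsets in one file, as the benchmarking system expects.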

Automated benchmarking system

The continuous integration script in .github/workflows/ci.yml automatically builds the Dockerfile on every commit to the main branch. The resulting Docker image is published as your hackathon submission to https://biolib.com//. For this to work, make sure you set BIOLIB_TOKEN and BIOLIB_PROJECT_URI as repository secrets.

To read more about the benchmarking system click here.
