Write reproducible code for getting and processing ChEMBL

Last update: Dec 25, 2022

chembl_downloader

Don't worry about downloading/extracting ChEMBL or versioning - just use chembl_downloader to write code that knows how to download it and use it automatically.

Installation

$ pip install chembl-downloader

Usage

Download A Specific Version

import chembl_downloader

path = chembl_downloader.download(version='28')

Once a version has been downloaded and extracted, later calls reuse the local copy instead of downloading it again. The data is stored automatically using pystow in the ~/.data/chembl directory.
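
The download-once behavior can be pictured with a small standard-library sketch. This is not the code chembl_downloader or pystow actually runs: ensure_file and fake_fetch are hypothetical names, the file content is a placeholder, and a temporary directory stands in for ~/.data/chembl.

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_file(path: Path, fetch: Callable[[Path], None]) -> Path:
    """Return ``path``, calling ``fetch`` to create it only when it is missing."""
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        fetch(path)
    return path


calls = []  # records every time the pretend download actually runs


def fake_fetch(target: Path) -> None:
    """Hypothetical stand-in for a real download: write a placeholder file."""
    calls.append(target)
    target.write_text("placeholder for chembl_28.db")


with tempfile.TemporaryDirectory() as root:
    target = Path(root) / "chembl" / "28" / "chembl_28.db"
    first = ensure_file(target, fake_fetch)   # file missing, so fetch runs
    second = ensure_file(target, fake_fetch)  # file present, so fetch is skipped
    same_path = first == second

print(len(calls))  # 1
```

Both calls return the same path, but the fetch step runs only once. Calling download() repeatedly in a script relies on the same property.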

We'd like to implement something such that it could load directly into SQLite from the archive, but it appears this is a paid feature.

Download the Latest Version

To get the latest version, omit the version keyword argument from the previous code. Releases of this package before v0.4.0 required bioversions (pip install bioversions) to look up the latest version of ChEMBL; since v0.4.0 that lookup is implemented locally, so no extra install is needed:

import chembl_downloader

path = chembl_downloader.download()

The version keyword argument is available for all functions in this package (including connect(), cursor(), and query()), but is omitted below for brevity.

Automate Connection

Inside the archive is a single SQLite database file. Normally, people manually untar the archive and then do something with the resulting file. Don't do this: it's not reproducible! Instead, the file can be downloaded and a connection opened automatically with:

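from contextlib import closing

import chembl_downloader

with chembl_downloader.connect() as conn:
    # sqlite3 cursors are not context managers, so wrap the cursor in closing()
    with closing(conn.cursor()) as cursor:
        cursor.execute(...)  # run your query string
        rows = cursor.fetchall()  # get your results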

The cursor() function provides a convenient wrapper around this operation:

import chembl_downloader

with chembl_downloader.cursor() as cursor:
    cursor.execute(...)  # run your query string
    rows = cursor.fetchall()  # get your results
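
For readers curious how such a wrapper can be built, here is a minimal sketch using only the standard library. It is not the package's implementation: open_cursor is a hypothetical name, and an in-memory database stands in for the ChEMBL file.

```python
import sqlite3
from contextlib import closing, contextmanager
from typing import Iterator


@contextmanager
def open_cursor(database: str) -> Iterator[sqlite3.Cursor]:
    """Yield a cursor, then close both it and its connection afterwards."""
    with closing(sqlite3.connect(database)) as conn:
        with closing(conn.cursor()) as cursor:
            yield cursor


# ":memory:" is a throwaway in-memory database standing in for the ChEMBL file.
with open_cursor(":memory:") as cursor:
    cursor.execute("SELECT 1 + 1")
    rows = cursor.fetchall()

print(rows)  # [(2,)]
```

Wrapping both resources in closing() means the cursor and the connection are released even if the query raises.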

Run a query and get a pandas DataFrame

The most powerful function is query(), which builds on the previous connect() function and pandas.read_sql to run a query and load the results into a pandas DataFrame for downstream use.

import chembl_downloader

sql = """
SELECT
    molecule_dictionary.chembl_id,
    molecule_dictionary.pref_name
FROM molecule_dictionary
JOIN compound_structures ON molecule_dictionary.molregno = compound_structures.molregno
WHERE molecule_dictionary.pref_name IS NOT NULL
LIMIT 5
"""

df = chembl_downloader.query(sql)
df.to_csv(..., sep='\t', index=False)
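
To see what the join and the filter in this query do without downloading ChEMBL, the sketch below runs the same join, written with lowercase names and a single =, against a toy in-memory database. The two tables contain only the columns the query touches plus a SMILES column, and every row is made up for illustration. It uses sqlite3 directly rather than query() or pandas.

```python
import sqlite3
from contextlib import closing

# Toy stand-ins for the real tables; the rows are invented for illustration.
schema = """
CREATE TABLE molecule_dictionary (molregno INTEGER PRIMARY KEY, chembl_id TEXT, pref_name TEXT);
CREATE TABLE compound_structures (molregno INTEGER PRIMARY KEY, canonical_smiles TEXT);
INSERT INTO molecule_dictionary VALUES (1, 'TOY1', 'EXAMPLE A'), (2, 'TOY2', NULL), (3, 'TOY3', 'EXAMPLE C');
INSERT INTO compound_structures VALUES (1, 'CCO'), (2, 'CCN');
"""

sql = """
SELECT
    molecule_dictionary.chembl_id,
    molecule_dictionary.pref_name
FROM molecule_dictionary
JOIN compound_structures ON molecule_dictionary.molregno = compound_structures.molregno
WHERE molecule_dictionary.pref_name IS NOT NULL
LIMIT 5
"""

with closing(sqlite3.connect(":memory:")) as conn:
    conn.executescript(schema)
    rows = conn.execute(sql).fetchall()

print(rows)  # [('TOY1', 'EXAMPLE A')]
```

TOY2 is dropped because it has no preferred name, and TOY3 is dropped because the inner join finds no structure for it.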

Suggestion 1: use pystow to make a reproducible file path that's portable to other people's machines (e.g., it doesn't have your username in the path).

Suggestion 2: RDKit is now pip-installable with pip install rdkit (the package was previously published on PyPI as rdkit-pypi), which means most users don't have to muck around with complicated conda environments and configurations. One of the powerful but understated tools in RDKit is the rdkit.Chem.PandasTools module.

Store in a Different Place

If you want to store the data elsewhere using pystow (e.g., in pyobo I also keep a copy of this file), you can use the prefix argument.

import chembl_downloader

# It gets downloaded/extracted to 
# ~/.data/pyobo/raw/chembl/29/chembl_29/chembl_29_sqlite/chembl_29.db
path = chembl_downloader.download(prefix=['pyobo', 'raw', 'chembl'])

See the pystow documentation on configuring the storage location further.

The prefix keyword argument is available for all functions in this package (including connect(), cursor(), and query()).
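
As a rough picture of how the prefix relates to the directory shown in the comment above, the sketch below joins a base directory, the prefix parts, and a version. storage_dir is a hypothetical helper, not part of chembl_downloader or pystow, and it assumes the default location follows the same base/prefix/version pattern.

```python
from pathlib import PurePosixPath
from typing import Sequence


def storage_dir(prefix: Sequence[str], version: str, base: str = "~/.data") -> PurePosixPath:
    """Join the base directory, the prefix parts, and the version into one path."""
    return PurePosixPath(base).joinpath(*prefix, version)


default_dir = storage_dir(["chembl"], "29")
custom_dir = storage_dir(["pyobo", "raw", "chembl"], "29")

print(default_dir)  # ~/.data/chembl/29
print(custom_dir)   # ~/.data/pyobo/raw/chembl/29
```

Neither path contains a username, which is what makes the location portable across machines.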

Download via CLI

After installing, run the following CLI command to ensure the database is downloaded and to print its path to stdout:

$ chembl_downloader

Use --test to show two example queries

$ chembl_downloader --test

Contributing

If you'd like to contribute, there's a submodule called chembl_downloader.queries where you can add an SQL query along with a description of what it does for easy importing.

Comments

Repo status

Dear @cthoyt,

I know that you have multiple responsibilities, but I was wondering whether the current repo is in working condition or whether it is a legacy repo that worked with a specific version of ChEMBL. It would be great if you could add a badge to the repo indicating this.

Thank You.

opened by YojanaGadiya 4

Add SQL for getting activities by target

This PR adds some functionality for generating target-based datasets, motivated by https://github.com/PatWalters/yamc/issues/14.

See the notebook here (note that this is pinned with a permalink to the state after merging this PR).

opened by cthoyt 1

Improve ChEBI mapping notebook

This filters out about 10% of the possible ChEMBL - ChEBI curations since ChEBI externally already took care of that

-> move this into biomappings repo

opened by cthoyt 0

Call for additional functionality

What other operations do people commonly want to do with the entire ChEMBL database/SDF file that would be good to wrap (including loading other files released by ChEMBL)?

What other operations like the RDKit supplier exist in other libraries that might be worth wrapping?

@iwatobipen do you have any suggestions?

opened by cthoyt 0

Add functionality for bacting

@egonw are there any bulk SMILES, InChI, or SDF loading operations in bacting that are exposed by pybacting that would be nice to wrap inside this library for full loading of ChEMBL? On the readme, you can see I made a specific function for RDKit's "supplier" that reads an SDF file

opened by cthoyt 3

Releases(v0.4.1)

v0.4.1(Nov 19, 2022)
What's Changed

Add SQL for getting activities by target by @cthoyt in https://github.com/cthoyt/chembl-downloader/pull/8

Improve ChEBI mapping notebook by @cthoyt in https://github.com/cthoyt/chembl-downloader/pull/10

Add UniProt target mapping functions by @cthoyt in https://github.com/cthoyt/chembl-downloader/pull/11

Full Changelog: https://github.com/cthoyt/chembl-downloader/compare/v0.4.0...v0.4.1
v0.4.0(Oct 28, 2022)
This PR does several things:

Removes dependency on bioversions and just implements the code locally

Adds a CLI for generating a statistics table for all versions of ChEMBL

Add proper project skeleton (documentation, unit tests, code quality assurance, CI)

Improve SQLite loading in case you delete the compressed data

Notebooks

Adds notebook about drug indications

Adds notebook about mapping to ChEBI

v0.3.0(Mar 19, 2022)
This release adds two new functions:

chembl_downloader.download_monomer_library which gets this file https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest/chembl_30_monomer_library.xml for whatever version you specify

chembl_downloader.get_monomer_library_root which does the same as the downloader but also parses the XML for you

Thanks to @iwatobipen and his recent blog post for inspiring this.
v0.2.0(Jan 14, 2022)
New Functions

chembl_downloader.download_fps downloads the pre-computed Morgan fingerprint file

chembl_downloader.download_chemreps downloads the chembl-smiles-inchi-inchikey map

chembl_downloader.get_chemreps_df builds on chembl_downloader.download_chemreps and loads them in a pandas dataframe

Misc

Add isort to code quality checking

Enable many functions with return_version to make a tuple with the version, which is useful if you're having it infer the latest version.

v0.1.3(Dec 20, 2021)
This release adds the get_substructure_library() for automating the generation of an RDKit substructure library as described in Greg Landrum's RDKit blog post, Some new features in the SubstructLibrary. The following example shows how it can be used to accomplish some of the first tasks presented in the post:

from rdkit import Chem
import chembl_downloader

library = chembl_downloader.get_substructure_library()
query = Chem.MolFromSmarts('[O,N]=C-c:1:c:c:n:c:c:1')
matches = library.GetMatches(query)

Full Changelog: https://github.com/cthoyt/chembl-downloader/compare/v0.1.2...v0.1.3
v0.1.2(Dec 20, 2021)
Add get_assay_sql() function

Full Changelog: https://github.com/cthoyt/chembl-downloader/compare/v0.1.1...v0.1.2
v0.1.1(Aug 5, 2021)

Add more top-level imports for download_sdf(), download_sqlite(), and latest()
v0.1.0(Aug 4, 2021)
Rename download() to download_extract_sqlite() to make room for other download functions

Add supplier() function for loading the SDF dump through RDKit

v0.0.4(Jul 28, 2021)
Update pandas backend for query() function

Improve CLI

v0.0.3(Jul 27, 2021)

Add query() function for automatically generating pandas DataFrames from a given SQL query
v0.0.2(Jul 27, 2021)
Fix bug when version not given

Fix bug where the differing folder structures of different ChEMBL versions caused problems

v0.0.1(Jul 27, 2021)

Initial release with download(), connect(), and cursor() functions.

Owner

Charles Tapley Hoyt

Bio/cheminformatician, open scientist, maintainer of @pybel and @pykeen, part of @indralab (he/him)
