Write reproducible code for getting and processing ChEMBL

Overview

chembl_downloader

PyPI PyPI - Python Version PyPI - License DOI Code style: black

Don't worry about downloading/extracting ChEMBL or versioning - just use chembl_downloader to write code that knows how to download it and use it automatically.

Installation

$ pip install chembl-downloader

Usage

Download A Specific Version

import chembl_downloader

path = chembl_downloader.download(version='28')

After it's been downloaded and extracted once, it's smart and does not need to download again. It gets stored using pystow automatically in the ~/.data/chembl directory.

We'd like to implement something such that it could load directly into SQLite from the archive, but it appears this is a paid feature.

Download the Latest Version

First, you'll have to install bioversions with pip install bioversions, whose job it is to look up the latest version of many databases. Then, you can modify the previous code slightly by omitting the version keyword argument:

import chembl_downloader

path = chembl_downloader.download()

The version keyword argument is available for all functions in this package (e.g., including connect(), cursor(), and query()), but will be omitted below for brevity.

Automate Connection

Inside the archive is a single SQLite database file. Normally, people manually untar this folder then do something with the resulting file. Don't do this, it's not reproducible! Instead, the file can be downloaded and a connection can be opened automatically with:

import chembl_downloader

with chembl_downloader.connect() as conn:
    with conn.cursor() as cursor:
        cursor.execute(...)  # run your query string
        rows = cursor.fetchall()  # get your results

The cursor() function provides a convenient wrapper around this operation:

import chembl_downloader

with chembl_downloader.cursor() as cursor:
    cursor.execute(...)  # run your query string
    rows = cursor.fetchall()  # get your results

Run a query and get a pandas DataFrame

The most powerful function is query() which builds on the previous connect() function in combination with pandas.read_sql to make a query and load the results into a pandas DataFrame for any downstream use.

import chembl_downloader

sql = """
SELECT
    MOLECULE_DICTIONARY.chembl_id,
    MOLECULE_DICTIONARY.pref_name
FROM MOLECULE_DICTIONARY
JOIN COMPOUND_STRUCTURES ON MOLECULE_DICTIONARY.molregno == COMPOUND_STRUCTURES.molregno
WHERE molecule_dictionary.pref_name IS NOT NULL
LIMIT 5
"""

df = chembl_downloader.query(sql)
df.to_csv(..., sep='\t', index=False)

Suggestion 1: use pystow to make a reproducible file path that's portable to other people's machines (e.g., it doesn't have your username in the path).

Suggestion 2: RDKit is now pip-installable with pip install rdkit-pypi, which means most users don't have to muck around with complicated conda environments and configurations. One of the powerful but understated tools in RDKit is the rdkit.Chem.PandasTools module.

Store in a Different Place

If you want to store the data elsewhere using pystow (e.g., in pyobo I also keep a copy of this file), you can use the prefix argument.

import chembl_downloader

# It gets downloaded/extracted to 
# ~/.data/pyobo/raw/chembl/29/chembl_29/chembl_29_sqlite/chembl_29.db
path = chembl_downloader.download(prefix=['pyobo', 'raw', 'chembl'])

See the pystow documentation on configuring the storage location further.

The prefix keyword argument is available for all functions in this package (e.g., including connect(), cursor(), and query()).

Download via CLI

After installing, run the following CLI command to ensure it and send the path to stdout

$ chembl_downloader

Use --test to show two example queries

$ chembl_downloader --test

Contributing

If you'd like to contribute, there's a submodule called chembl_downloader.queries where you can add an SQL query along with a description of what it does for easy importing.

Comments
  • Repo status

    Repo status

    Dear @cthoyt,

    I know that you have multiple responsibilities, but I was wondering if the current repo is in working condition or if is it a legacy repo which worked with a specific version of ChEMBL? It would be great if you could add a batch on the repo for the same.

    Thank You.

    opened by YojanaGadiya 4
  • Add SQL for getting activities by target

    Add SQL for getting activities by target

    This PR adds some functionality for generating target-based datasets, motivated by https://github.com/PatWalters/yamc/issues/14.

    See the notebook here (note that this is pinned with a permalink to the state after merging this PR).

    opened by cthoyt 1
  • Improve ChEBI mapping notebook

    Improve ChEBI mapping notebook

    This filters out about 10% of the possible ChEMBL - ChEBI curations since ChEBI externally already took care of that

    -> move this into biomappings repo

    opened by cthoyt 0
  • Call for additional functionality

    Call for additional functionality

    • What other operations do people commonly want to do with the entire ChEMBL database/SDF file that would be good to wrap (including loading other files released by ChEMBL)?
    • What other operations like the RDKit supplier exist in other libraries that might be worth wrapping?

    @iwatobipen do you have any suggestions?

    opened by cthoyt 0
  • Add functionality for bacting

    Add functionality for bacting

    @egonw are there any bulk SMILES, InChI, or SDF loading operations in bacting that are exposed by pybacting that would be nice to wrap inside this library for full loading of ChEMBL? On the readme, you can see I made a specific function for RDKit's "supplier" that reads an SDF file

    opened by cthoyt 3
Releases(v0.4.1)
  • v0.4.1(Nov 19, 2022)

    What's Changed

    • Add SQL for getting activities by target by @cthoyt in https://github.com/cthoyt/chembl-downloader/pull/8
    • Improve ChEBI mapping notebook by @cthoyt in https://github.com/cthoyt/chembl-downloader/pull/10
    • Add UniProt target mapping functions by @cthoyt in https://github.com/cthoyt/chembl-downloader/pull/11

    Full Changelog: https://github.com/cthoyt/chembl-downloader/compare/v0.4.0...v0.4.1

    Source code(tar.gz)
    Source code(zip)
  • v0.4.0(Oct 28, 2022)

    This PR does several things:

    1. Removes dependency on bioversions and just implements the code locally
    2. Adds a CLI for generating a statistics table for all versions of ChEMBL
    3. Add proper project skeleton (documentation, unit tests, code quality assurance, CI)
    4. Improve SQLite loading in case you delete the compressed data

    Notebooks

    1. Adds notebook about drug indications
    2. Adds notebook about mapping to ChEBI
    Source code(tar.gz)
    Source code(zip)
  • v0.3.0(Mar 19, 2022)

    This release adds two new functions:

    1. chembl_downloader.download_monomer_library which gets this file https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest/chembl_30_monomer_library.xml for whatever version you specify
    2. chembl_downloader.get_monomer_library_root which does the same as the downloader but also parses the XML for you

    Thanks to @iwatobipen and his recent blog post for inspiring this.

    Source code(tar.gz)
    Source code(zip)
  • v0.2.0(Jan 14, 2022)

    New Functions

    • chembl_downloader.download_fps downloads the pre-computed Morgan fingerprint file
    • chembl_downloader.download_chemreps downloads the chembl-smiles-inchi-inchikey map
    • chembl_downloader.get_chemreps_df builds on chembl_downloader.download_chemreps and loads them in a pandas dataframe

    Misc

    • Add isort to code quality checking
    • Enable many functions with return_version to make a tuple with the version, which is useful if you're having it infer the latest version.
    Source code(tar.gz)
    Source code(zip)
  • v0.1.3(Dec 20, 2021)

    This release adds the get_substructure_library() for automating the generation of an RDKit substructure library as described in Greg Landrum's RDKit blog post, Some new features in the SubstructLibrary. The following example shows how it can be used to accomplish some of the first tasks presented in the post:

    from rdkit import Chem
    
    import chembl_downloader
    
    library = chembl_downloader.get_substructure_library()
    query = Chem.MolFromSmarts('[O,N]=C-c:1:c:c:n:c:c:1')
    matches = library.GetMatches(query)
    

    Full Changelog: https://github.com/cthoyt/chembl-downloader/compare/v0.1.2...v0.1.3

    Source code(tar.gz)
    Source code(zip)
  • v0.1.2(Dec 20, 2021)

  • v0.1.1(Aug 5, 2021)

  • v0.1.0(Aug 4, 2021)

    • rename download() to download_extract_sqlite() to make room for other download functions
    • added supplier() function for loading the SDF dump through RDKit
    Source code(tar.gz)
    Source code(zip)
  • v0.0.4(Jul 28, 2021)

  • v0.0.3(Jul 27, 2021)

  • v0.0.2(Jul 27, 2021)

  • v0.0.1(Jul 27, 2021)

Owner
Charles Tapley Hoyt
Bio/cheminformatician, open scientist, maintainer of @pybel and @pykeen, part of @indralab (he/him)
Charles Tapley Hoyt
Download minecraft head or skin, allows TLauncher accounts

Minecraft-skin-downloader Download minecraft head or skin, allows TLauncher accounts by BoBkiNN_ Contact: https://vk.com/bobkinnvk Requirements: Modul

3 Apr 03, 2022
this is udemy course downloader, before a start you know how to get access token.

udemy_downloader this is udemy course downloader, before a start you know how to get access token. To get the access_token on Google Chrome (once on U

OkUgur 18 Dec 04, 2022
squid-dl is a massively parallel yt-dlp-based YouTube downloader.

squid-dl squid-dl is a massively parallel yt-dlp-based YouTube downloader. Installation Run the setup.py, which will install squid-dl and its two depe

tuxlovesyou 51 Jan 05, 2023
apkizer is a mass downloader for android applications for all available versions.

apkizer apkizer collects all available versions of an Android application from apkpure.com Purpose Sometimes mobile applications can be useful to dig

Kamil Onur Özkaleli 41 Dec 16, 2022
A discord bot for downloading youtube video and audio files

disctube disctube is a discord bot for downloading video and audio files from youtube using python pytube. disclaimer i am not the best python program

razor420 3 Feb 03, 2022
Script for YouTube creators to share dislike count with their viewers.

Stahování disliků z YouTube - milafon Tento skript slouží jako možnost zobrazit divákům počet disliků u YouTube videí. Vyžaduje implementaci ze strany

4 Sep 28, 2022
Python script to download all images/webms of a 4chan thread

Python3 script to continuously download all images/webms of multiple 4chan thread simultaneously - without installation

Micha Fink 208 Jan 04, 2023
抖音去水印视频批量下载,完全使用抖音官方接口

TikTokDownload 抖音去水印视频下载,使用抖音官方接口 使用教程(Win7) Win10环境暂时没测,bug情况应该比Win7少 运行软件前先打开目录下 conf.ini 文件按照要求进行配置 批量下载可直接修改配置文件,单一视频下载请直接打开粘贴视频链接即可

JohnserfSeed 2k Jan 04, 2023
Download candlestick data fast & easy for analysis

crypto-candlesticks 📈 The goal behind this project is to facilitate downloading cryptocurrency candlestick data fast & simple. Currently only the Bit

Pedro Torres 31 Dec 11, 2022
Shit-fetch - Shitpost fetcher (downloader)

shit-fetch Download shitpost (random) from https://random-shitpost.com/ Usage ./shitfetch.py --nsfw (true/false) --output ~/Downloads (default : ./)

Pinokaille 1 Jan 02, 2022
Download India Stocks Historical Data

Kite Helper - Download Stock Market Data 🌎 Website Simple Application to Download any stock market data in .csv format using Kite 🏃‍♂️ Running Serve

Pishang Ujeniya 12 Dec 06, 2022
TikTok - TikTok Bot to download video or audio from TikTok

TikTok - TikTok Bot to download video or audio from TikTok

JMTHON 51 Mar 04, 2022
Youtube video downloader and info extractor for python.

tube_dl Tube_dl is a Simple Youtube video downloader for Python. A Modular approach to bypass and download Youtube Videos and Playlist from Youtube us

Shekhar Chander 16 Jul 09, 2022
Twitter Media Downloader (Telegram Bot)

Twitter Media Downloader (Telegram Bot)

Matin Baloochestani 8 Oct 27, 2022
Pypixiv - A fully-typed, asynchronous api wrapper for pixiv

pypixiv this library is a fully-typed, asynchronous api wrapper for pixiv. featu

DeltaLaboratory 2 Nov 16, 2022
A collection of modules I have created to programmatically search for/download imagery from live cam feeds across the state of California.

A collection of modules that I have created to programmatically search for/download imagery from all publicly available live cam feeds across the state of California. In no way am I affiliated with a

Chad Groom 5 Nov 21, 2022
Download history data from binance and save to dataframe or csv file

Binance history data downloader Download history data from binance and save to dataframe or csv file

10 Dec 02, 2022
Pytube ve tkinter kütüphanesi ile yapmış olduğum basit ve temel bir youtube video indirme programı.

PyTube Pytube ve tkinter kütüphanesi ile yapmış olduğum basit ve temel bir youtube video indirme programı. Videolar 720p çözünürlükte indirilmektedir.

1 Nov 12, 2021
Scripts to download files and folders programmatically from Google Drive

Google Drive Downloader Scripts Every time I need to download a lot of files from Google Drive (e.g. a dataset), it's always incredibly frustrating an

Ivan Evtimov 6 Jul 22, 2021
Tool To download Amazon 4k SDR HDR 1080, CDM IS Not Included

WV-AMZN-4K-RIPPER Tool To download Amazon 4k SDR HDR 1080, CDM IS Not Included For CDM You can Mail :- Denis Trunov 179 Dec 17, 2022