Trainable PyTorch reproduction of AlphaFold 2

OpenFold

A faithful PyTorch reproduction of DeepMind's AlphaFold 2.

Features

OpenFold carefully reproduces (almost) all of the features of the original open-source inference code. The sole exception is model ensembling, which fared poorly in DeepMind's own ablation testing and is being phased out in future DeepMind experiments. It is omitted here to reduce clutter. In cases where the Nature paper differs from the source, we always defer to the latter.

OpenFold is built to support inference with AlphaFold's original JAX weights. Try it out with our Colab notebook.

Unlike DeepMind's public code, OpenFold is also trainable. It can be trained with DeepSpeed and with mixed precision. bfloat16 training is not currently supported, but will be in the future.

Installation (Linux)

Python dependencies available through pip are provided in requirements.txt. OpenFold depends on openmm==7.5.1 and pdbfixer, which are only available via conda. For producing sequence alignments, you'll also need kalign, the HH-suite, and one of {jackhmmer, MMseqs2} installed on your system. Finally, some download scripts require aria2c.
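As a quick sanity check before running anything, the external tools listed above can be looked up on PATH. The following is a minimal sketch, not part of OpenFold: the helper names are hypothetical, and the executable names (in particular mmseqs for MMseqs2, and hhblits/hhsearch for the HH-suite) are assumptions about a typical install.

```python
import shutil

# Executable names are assumptions based on the dependency list above.
REQUIRED_TOOLS = ["kalign", "hhblits", "hhsearch", "aria2c"]
ALIGNERS = ["jackhmmer", "mmseqs"]  # at least one of these is needed


def find_tools(names):
    """Map each executable name to its full path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_dependencies(required=REQUIRED_TOOLS, aligners=ALIGNERS):
    """Return a list of human-readable problems; empty if nothing is missing."""
    problems = [
        f"{name} not found on PATH"
        for name, path in find_tools(required).items()
        if path is None
    ]
    # Only one of the alignment tools has to be present.
    if not any(find_tools(aligners).values()):
        problems.append("need one of: " + ", ".join(aligners))
    return problems


if __name__ == "__main__":
    for problem in missing_dependencies():
        print(problem)
```

Tools installed into the conda environment are only found this way once the environment has been activated.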

For convenience, we provide a script that installs Miniconda locally, creates a conda virtual environment, installs all Python dependencies, and downloads useful resources (including DeepMind's pretrained parameters). Run:

scripts/install_third_party_dependencies.sh

To activate the environment, run:

source scripts/activate_conda_env.sh

To deactivate it, run:

source scripts/deactivate_conda_env.sh

To install the HH-suite to /usr/bin, run the following as root:

# scripts/install_hh_suite.sh

Usage

To download DeepMind's pretrained parameters and common ground truth data, run:

scripts/download_data.sh data/

You have two choices for downloading protein databases, depending on whether you want to use DeepMind's MSA generation pipeline (with HMMER & HHblits) or ColabFold's, which uses the faster MMseqs2 instead. For the former, run:

scripts/download_alphafold_databases.sh data/

For the latter, run:

scripts/download_mmseqs_databases.sh data/    # downloads .tar files
scripts/prep_mmseqs_databases.sh data/        # unpacks and preps the databases

Make sure to run the latter command on the machine that will be used for MSA generation (the script estimates how the precomputed database index used by MMseqs2 should be split according to the memory available on the system).

Alternatively, you can use raw MSAs from ProteinNet. After downloading the database, use scripts/prepare_proteinnet_msas.py to convert the data into a format recognized by the OpenFold parser. The resulting directory becomes the alignment_dir used in subsequent steps. Use scripts/unpack_proteinnet.py to extract .core files from ProteinNet text files.

Inference

To run inference on a sequence or multiple sequences using a set of DeepMind's pretrained parameters, run e.g.:

python3 run_pretrained_openfold.py \
    target.fasta \
    data/uniref90/uniref90.fasta \
    data/mgnify/mgy_clusters_2018_12.fa \
    data/pdb70/pdb70 \
    data/pdb_mmcif/mmcif_files/ \
    data/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
    --output_dir ./ \
    --bfd_database_path data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --model_device cuda:1 \
    --jackhmmer_binary_path lib/conda/envs/openfold_venv/bin/jackhmmer \
    --hhblits_binary_path lib/conda/envs/openfold_venv/bin/hhblits \
    --hhsearch_binary_path lib/conda/envs/openfold_venv/bin/hhsearch \
    --kalign_binary_path lib/conda/envs/openfold_venv/bin/kalign

where data is the same directory as in the previous step. If jackhmmer, hhblits, hhsearch and kalign are available at the default path of /usr/bin, their binary_path command-line arguments can be dropped. If you've already computed alignments for the query, you have the option to circumvent the expensive alignment computation here.
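The rule about dropping the binary_path arguments can be sketched as a small command builder. This is only an illustration: the helper names are hypothetical, and the positional argument order is copied from the example above, so it may not match other versions of the script.

```python
import os
import shlex

# Tool names as they appear in the --<tool>_binary_path flags above.
TOOLS = ["jackhmmer", "hhblits", "hhsearch", "kalign"]


def binary_path_args(bin_dir, default_dir="/usr/bin"):
    """Return --<tool>_binary_path flags, or [] when bin_dir is the default."""
    if os.path.normpath(bin_dir) == os.path.normpath(default_dir):
        return []
    args = []
    for tool in TOOLS:
        args += [f"--{tool}_binary_path", os.path.join(bin_dir, tool)]
    return args


def inference_command(fasta, data_dir, bin_dir, device="cuda:0"):
    """Assemble the argument list for run_pretrained_openfold.py."""

    def db(*parts):
        # Build a path below the data directory.
        return os.path.join(data_dir, *parts)

    return [
        "python3", "run_pretrained_openfold.py",
        fasta,
        db("uniref90", "uniref90.fasta"),
        db("mgnify", "mgy_clusters_2018_12.fa"),
        db("pdb70", "pdb70"),
        db("pdb_mmcif", "mmcif_files") + "/",
        db("uniclust30", "uniclust30_2018_08", "uniclust30_2018_08"),
        "--output_dir", "./",
        "--bfd_database_path",
        db("bfd", "bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt"),
        "--model_device", device,
    ] + binary_path_args(bin_dir)


if __name__ == "__main__":
    cmd = inference_command("target.fasta", "data", "/usr/bin")
    print(shlex.join(cmd))
```

Passing lib/conda/envs/openfold_venv/bin as bin_dir instead appends the four binary_path flags shown in the example.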

Training

After activating the OpenFold environment with source scripts/activate_conda_env.sh, install OpenFold by running

python setup.py install

To train the model, you will first need to precompute protein alignments.

You have two options. You can use the same procedure DeepMind used by running the following:

python3 scripts/precompute_alignments.py mmcif_dir/ alignment_dir/ \
    data/uniref90/uniref90.fasta \
    data/mgnify/mgy_clusters_2018_12.fa \
    data/pdb70/pdb70 \
    data/pdb_mmcif/mmcif_files/ \
    data/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
    --bfd_database_path data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --cpus 16 \
    --jackhmmer_binary_path lib/conda/envs/openfold_venv/bin/jackhmmer \
    --hhblits_binary_path lib/conda/envs/openfold_venv/bin/hhblits \
    --hhsearch_binary_path lib/conda/envs/openfold_venv/bin/hhsearch \
    --kalign_binary_path lib/conda/envs/openfold_venv/bin/kalign

As noted before, you can skip the binary_path arguments if these binaries are at /usr/bin. Expect this step to take a very long time, even for small numbers of proteins.

Alternatively, you can generate MSAs with the ColabFold pipeline (and templates with HHsearch) with:

python3 scripts/precompute_alignments_mmseqs.py input.fasta \
    data/mmseqs_dbs \
    uniref30_2103_db \
    alignment_dir \
    ~/MMseqs2/build/bin/mmseqs \
    /usr/bin/hhsearch \
    --env_db colabfold_envdb_202108_db \
    --pdb70 data/pdb70/pdb70

where input.fasta is a FASTA file containing one or more query sequences. To generate an input FASTA from a directory of mmCIF and/or ProteinNet .core files, we provide scripts/data_dir_to_fasta.py.
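To check that an input file really contains the query sequences you expect, a FASTA file can be parsed in a few lines. This is a minimal sketch with a hypothetical helper name; it is not the parser OpenFold itself uses and it does no validation of the residue alphabet.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data before the first '>' header")
        else:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    example = ">query_1\nMKV\nLLA\n>query_2\nGG\n"
    for name, seq in read_fasta(example):
        print(name, len(seq))
```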

Next, generate a cache of certain datapoints in the mmCIF files:

python3 scripts/generate_mmcif_cache.py \
    mmcif_dir/ \
    mmcif_cache.json \
    --no_workers 16

This cache is used to minimize the number of mmCIF parses performed during training-time data preprocessing. Finally, call the training script:

# In multi-GPU settings, the seed must be specified.
python3 train_openfold.py mmcif_dir/ alignment_dir/ template_mmcif_dir/ \
    2021-10-10 \
    --template_release_dates_cache_path mmcif_cache.json \
    --precision 16 \
    --gpus 8 --replace_sampler_ddp=True \
    --seed 42 \
    --deepspeed_config_path deepspeed_config.json \
    --resume_from_ckpt ckpt_dir/

where --template_release_dates_cache_path is a path to the .json file generated in the previous step. A suitable DeepSpeed configuration file can be generated with scripts/build_deepspeed_config.py. The training script is written with PyTorch Lightning and supports the full range of training options that entails, including multi-node distributed training. For more information, consult PyTorch Lightning documentation and the --help flag of the training script.
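The output of scripts/build_deepspeed_config.py is not reproduced here, so the following is only a sketch of what a small DeepSpeed configuration for 16-bit training might contain. The keys fp16 and zero_optimization are standard DeepSpeed options; the chosen values and the helper names are illustrative assumptions, not the script's actual output.

```python
import json


def make_deepspeed_config(fp16=True, zero_stage=2):
    """Build a minimal DeepSpeed config dict (illustrative values only)."""
    return {
        "fp16": {"enabled": fp16},                  # mixed-precision training
        "zero_optimization": {"stage": zero_stage}, # ZeRO optimizer sharding
    }


def write_config(path, config):
    """Write the config as JSON so it can be passed to the training script."""
    with open(path, "w") as f:
        json.dump(config, f, indent=2)


if __name__ == "__main__":
    print(json.dumps(make_deepspeed_config(), indent=2))
```

The resulting file would be the one handed to --deepspeed_config_path; prefer the provided script for real runs.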

Testing

To run unit tests, use

scripts/run_unit_tests.sh

The script is a thin wrapper around Python's unittest suite, and recognizes unittest commands. E.g., to run a specific test verbosely:

scripts/run_unit_tests.sh -v tests.test_model

Certain tests require that AlphaFold (v. 2.0.1) be installed in the same Python environment. These run components of AlphaFold and OpenFold side by side and ensure that output activations are adequately similar. For most modules, we target a maximum difference of 1e-4.
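The comparison described above amounts to bounding the largest element-wise difference between two sets of activations. A minimal sketch on flat Python sequences follows; the function names are hypothetical and the real tests operate on tensors rather than lists.

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two flat sequences."""
    if len(a) != len(b):
        raise ValueError("length mismatch")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def activations_match(a, b, tol=1e-4):
    """True if no pair of corresponding values differs by more than tol."""
    return max_abs_diff(a, b) <= tol


if __name__ == "__main__":
    reference = [0.1, 0.2, 0.3]
    candidate = [0.10005, 0.2, 0.3]
    print(activations_match(reference, candidate))
```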

Copyright notice

While AlphaFold's and, by extension, OpenFold's source code is licensed under the permissive Apache License, Version 2.0, DeepMind's pretrained parameters remain under the more restrictive CC BY-NC 4.0 license, a copy of which is downloaded to openfold/resources/params by the installation script. They are therefore unavailable for commercial use.

Contributing

If you encounter problems using OpenFold, feel free to create an issue! We also welcome pull requests from the community.

Citing this work

Stay tuned for an OpenFold DOI. Any work that cites OpenFold should also cite AlphaFold.

Comments
  • Cuda/Pytorch/Installation Issues

    Hello! So I have been struggling with a strange issue that I hope you or someone would be able to help me with. Let me start by providing some information:

    • OS: Ubuntu 20.04.4
    • GPU: NVIDIA RTX A6000
    • NVIDIA-SMI/Driver Version: 470.129.06
    • CUDA Version: 11.4

    So I am not sure if this is a problem with how I am attempting to install openfold, or if something else is going on. Essentially, after cloning the repo, the first thing I do is run scripts/install_third_party_dependencies.sh. This creates an environment called openfold_venv; however, this environment does not seem to contain many of the required packages (e.g. torch is absent). Following this with scripts/activate_environment.sh seems to fail. I have alternatively tried conda env create -f environment.yml, which sets up an environment in a different location. Either way, after setting up the environment I end up with one of the following issues, either during python setup.py install or during inference:

    • "The detected CUDA version (10.1) mismatches the version that was used to compile PyTorch (11.2). Please make sure to use the same CUDA versions." (despite torch.version.cuda returning 11.3)
    • "RuntimeError: CUDA error: no kernel image is available for execution on the device"

    I run into these on clean installs with no conda or cudatoolkit installed anywhere else on the machine, so it is rather puzzling. As I said, I am not sure if this is due to performing the install sequence incorrectly, but I have tried several different solutions and they all seem to circle back to one of these errors.

    I apologize as I know this is rather vague, but if you can offer any sort of guidance it would be greatly appreciated!
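When debugging errors like the two above, it usually helps to print the CUDA versions involved side by side. The first message is typically raised while compiling extensions, when the CUDA toolkit found at build time differs from the one PyTorch was built against. The second commonly means the installed build has no kernels for the GPU's compute capability. The following sketch is an assumption-laden diagnostic, not part of OpenFold: the helper name is hypothetical, and it reports nothing for PyTorch if torch is not importable.

```python
import shutil
import subprocess


def collect_cuda_info():
    """Collect the CUDA versions seen by PyTorch and by the local toolkit."""
    info = {
        "torch_cuda": None,         # CUDA version PyTorch was built with
        "arch_list": None,          # GPU architectures the build has kernels for
        "device_capability": None,  # compute capability of GPU 0
        "nvcc": None,               # output of `nvcc --version`, if nvcc is on PATH
    }
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None:
        info["torch_cuda"] = torch.version.cuda
        if torch.cuda.is_available():
            info["arch_list"] = torch.cuda.get_arch_list()
            info["device_capability"] = torch.cuda.get_device_capability(0)
    nvcc = shutil.which("nvcc")
    if nvcc is not None:
        result = subprocess.run(
            [nvcc, "--version"], capture_output=True, text=True
        )
        info["nvcc"] = result.stdout.strip()
    return info


if __name__ == "__main__":
    for key, value in collect_cuda_info().items():
        print(f"{key}: {value}")
```

If the nvcc release differs from torch_cuda, or the device capability is absent from arch_list, that mismatch is a likely place to start looking.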

    opened by Cweb118 40
  • ModuleNotFoundError: No module named 'torch'

    Latest version's installation fails when trying to also install FlashAttention:

    (...)
    Attempting to install FlashAttention
    Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
    Collecting git+https://github.com/HazyResearch/flash-attention.git@5b838a8bef78186196244a4156ec35bbb58c337d
      Cloning https://github.com/HazyResearch/flash-attention.git (to revision 5b838a8bef78186196244a4156ec35bbb58c337d) to /tmp/pip-req-build-2hclpm0v
      Running command git clone -q https://github.com/HazyResearch/flash-attention.git /tmp/pip-req-build-2hclpm0v
      Running command git rev-parse -q --verify 'sha^5b838a8bef78186196244a4156ec35bbb58c337d'
      Running command git fetch -q https://github.com/HazyResearch/flash-attention.git 5b838a8bef78186196244a4156ec35bbb58c337d
      Running command git checkout -q 5b838a8bef78186196244a4156ec35bbb58c337d
      Resolved https://github.com/HazyResearch/flash-attention.git to commit 5b838a8bef78186196244a4156ec35bbb58c337d
      Running command git submodule update --init --recursive -q
        ERROR: Command errored out with exit status 1:
         command: /usr/local/openfold/openfold/lib/conda/bin/python -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-2hclpm0v/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-2hclpm0v/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-_azdauhm
             cwd: /tmp/pip-req-build-2hclpm0v/
        Complete output (5 lines):
        Traceback (most recent call last):
          File "<string>", line 1, in <module>
          File "/tmp/pip-req-build-2hclpm0v/setup.py", line 10, in <module>
            import torch
        ModuleNotFoundError: No module named 'torch'
        ----------------------------------------
    WARNING: Discarding git+https://github.com/HazyResearch/flash-attention.git@5b838a8bef78186196244a4156ec35bbb58c337d. Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
    ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
    (...)
    
    Also, it seems that aws is now needed as well, so awscli should probably be added to environment.yml?
    opened by lucajovine 18
  • OOM with bfloat16, no speed-up

    New issue based on: https://github.com/aqlaboratory/openfold/issues/34

    Turning on bfloat16 in deepspeed doesn't seem to have the desired effect. Model params size remains unchanged. Hitting OOM in validation which works fine in FP16.

    Training with bfloat16 in pytorch-lightning fails:

    File "openfold/openfold/utils/loss.py", line 46, in sigmoid_cross_entropy
      log_p = torch.nn.functional.logsigmoid(logits)
    RuntimeError: "log_sigmoid_forward_cuda" not implemented for 'BFloat16'

    Support still missing in deepspeed? https://github.com/microsoft/DeepSpeed/issues/974

    Tested on A100 with torch 1.10.1+cu113

    opened by lhatsk 14
  • Is there any alignment files to download?

    Hi,

    We're trying to reproduce the training process. However, the alignment step seems to take an extremely long time.

    We used 128 nodes to align 128 mmCIF files (1 file on each node), but it took 13 hours to finish the entire job.

    I'm wondering if there is a tar file of precomputed alignments for all the mmCIF files that we could download, which would help a lot.

    Thanks

    opened by Zhang690683220 13
  • ModuleNotFoundError: No module named 'flash_attn'

    After last update (commit 9225f8725b53d19643d1469a57f7d7baea3c0625):

    > python3 run_pretrained_openfold.py
    Traceback (most recent call last):
      File "run_pretrained_openfold.py", line 49, in <module>
        from openfold.config import model_config, NUM_RES
      File "/usr/local/openfold/openfold/openfold/__init__.py", line 1, in <module>
        from . import model
      File "/usr/local/openfold/openfold/openfold/model/__init__.py", line 11, in <module>
        _modules = [(m, importlib.import_module("." + m, __name__)) for m in __all__]
      File "/usr/local/openfold/openfold/openfold/model/__init__.py", line 11, in <listcomp>
        _modules = [(m, importlib.import_module("." + m, __name__)) for m in __all__]
      File "/usr/local/openfold/openfold/lib/conda/envs/openfold_venv/lib/python3.7/importlib/__init__.py", line 127, in import_module
        return _bootstrap._gcd_import(name[level:], package, level)
      File "/usr/local/openfold/openfold/openfold/model/evoformer.py", line 22, in <module>
        from openfold.model.primitives import Linear, LayerNorm
      File "/usr/local/openfold/openfold/openfold/model/primitives.py", line 21, in <module>
        from flash_attn.bert_padding import unpad_input, pad_input
    ModuleNotFoundError: No module named 'flash_attn'
    
    opened by lucajovine 12
  • Unusual predicted structures from pretrained OpenFold on Pascal GPU

    This is most likely some kind of local configuration error, but I haven't been able to pin down the cause. If anyone has encountered this behavior before or has an idea of what might be wrong based on these output structures, any hints would be greatly appreciated!

    Expected behavior:

    run_pretrained_openfold.py outputs predicted structures comparable to AlphaFold or OpenFold Colab output.

    I expected a structure similar to this unrelaxed prediction from OpenFold Colab model_1 with finetuning_1.pt:
    image

    Actual behavior:

    My run_pretrained_openfold.py predicted structures are not similar to AlphaFold or OpenFold Colab output.

    Predictions from model_1 with finetuning_1.pt (unrelaxed in tan, relaxed in blue):
    image

    Predictions from model_1 with params_model_1.npz:
    image

    Predictions from model_1 with params_model_1.npz using alignments from ColabFold MMseqs2 (ColabFold had predicted a reasonable expected structure):
    image

    Context:

    4 x NVIDIA 1080 Ti GPUs, using CUDA 11.3 (if other system data is relevant I can find it)

    input/short.fasta

    >query
    MAAHKGAEHHHKAAEHHEQAAKHHHAAAEHHEKGEHEQAAHHADTAYAHHKHAEEHAAQAAKHDAEHHAPKPH
    

    Run command:

    python3 run_pretrained_openfold.py \
        input \
        data/pdb_mmcif/mmcif_files/ \
        --output_dir output \
        --cpus 16 \
        --preset reduced_dbs \
        --uniref90_database_path data/uniref90/uniref90.fasta \
        --mgnify_database_path data/mgnify/mgy_clusters_2018_12.fa \
        --pdb70_database_path data/pdb70/pdb70 \
        --uniclust30_database_path data/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
        --bfd_database_path data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
        --model_device "cuda:0" \
        --jackhmmer_binary_path $venv_bin_dir/jackhmmer \
        --hhblits_binary_path $venv_bin_dir/hhblits \
        --hhsearch_binary_path $venv_bin_dir/hhsearch \
        --kalign_binary_path $venv_bin_dir/kalign \
        --config_preset "model_1" \
        --openfold_checkpoint_path openfold/resources/openfold_params/finetuning_1.pt
    

    Other configurations I tried, which produced similarly strange outputs:

    • Removing --openfold_checkpoint_path to just use the AlphaFold weights
    • Using --config_preset "model_1_ptm" with finetuning_ptm_2.pt
    • Using --use_precomputed_alignments with alignment results from a previous OpenFold output
    • Using --use_precomputed_alignments with .a3m results from ColabFold
    • Using full_dbs instead of reduced_dbs
    opened by epenning 11
  • Custom template results in huge difference with alphafold

    Hi there,

    Thanks a lot for your effort to implement trainable AlphaFold in PyTorch.

    I came across an interesting paper claiming that templates built with information from experimental cryo-EM density maps can improve AlphaFold's accuracy.

    The authors provide a Colab notebook here. I tried the notebook, and it worked as intended.

    As an example, take the PDB entry 7KU7. Input FASTA sequence:

    PLREAKDLHTALHIGPRALSKACNISMQQAREVVQTCPHCNSAPALEAGVNPRGLGPLQIWQTDFTLEPRMAPRSWLAVTVDTASSAIVVTQHGRVTSVAVQHHWATAIAVLGRPKAIKTDNGSCFTSKSTREWLARWGIAHTTGIPGNSQGQAMVERANRLLKDKIRVLAEGDGFMKRIPTSKQGELLAKAMYALNHFERGENTKTPIQKHWRPTVLTEGPPVKIRIETGEWEKGWNVLVWGRGYAAVKNRDTDKVIWVPSRKVKPDITQKDEVTKK

    I supplemented a custom template in CIF format: https://drive.google.com/file/d/1DUN793nHr0aRRSp29_FwgTGUREwTHcfp/view?usp=sharing

    By using this template and turning off the MSA (skip_all_msa == True, equivalent to using dummy MSA), the mean plddt score is about 90, which is higher than the case with MSA but no custom template.


    When I tried to replicate the above procedure in OpenFold, however, the template did not seem to help. The mean pLDDT score was less than 40 for model_1 through model_5.

    To quickly reproduce the results,

    1. I make an empty directory as the path for the use_precomputed_alignments, which will lead the data pipeline to use the dummy MSA and an empty template.

    2. Then I load template features generated in the Colab notebook template_feature_7ku7.pkl (https://drive.google.com/file/d/1pnZ8pwQZTgcOsHTikQ6X7PQ1bqQs3tqt/view?usp=sharing)

    import pickle
    with open("template_feature_7ku7.pkl", "rb") as f:
        template_feature = pickle.load(f)
    feature_dict = {**feature_dict, **template_feature}
    

    The rest of the code is left intact. So, could you help me check whether there is anything wrong with my approach, or whether this is due to a bug in the template-related code in OpenFold? Thank you very much.

    opened by empyriumz 11
  • Get low lddt score while running inference.

    Excellent work!

    I'm trying to run the inference process of OpenFold. My input FASTA is:

    HBA_HUMAN MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR

    My shell command is:

    python3 run_pretrained_openfold.py \
    data/fasta_dir \
    data/pdb_mmcif/mmcif_files/ \
    --uniref90_database_path data/uniref90/uniref90.fasta \
    --mgnify_database_path data/mgnify/mgy_clusters_2018_12.fa \
    --pdb70_database_path data/pdb70/pdb70 \
    --uniclust30_database_path data/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
    --output_dir output/ \
    --bfd_database_path data/small_bfd/bfd-first_non_consensus_sequences.fasta \
    --model_device "cuda:0" \
    --jackhmmer_binary_path lib/conda/envs/openfold_venv/bin/jackhmmer \
    --hhblits_binary_path lib/conda/envs/openfold_venv/bin/hhblits \
    --hhsearch_binary_path lib/conda/envs/openfold_venv/bin/hhsearch \
    --kalign_binary_path lib/conda/envs/openfold_venv/bin/kalign \
    --config_preset "finetuning_ptm" \
    --openfold_checkpoint_path openfold/resources/openfold_params/finetuning_ptm_2.pt
    

    And i got the following result: image

    The average lDDT is pretty low, and this happens every time, even when I choose a simple sequence. Moreover, I notice that the parameter 'use_small_bfd' is set to false by default, but inference works when I set 'bfd_database_path' to data/small_bfd/bfd-first_non_consensus_sequences.fasta.

    I'm wondering what happened and hope for your reply.

    opened by WeixuanXiong 10
  • Bug with template mask for batch inference

    Hello, my name is James, and I'm working on training a new AlphaFold variant using OpenFold. Thanks for the great tool!

    I think I may have found a bug in how the code processes templates for batch sizes larger than 1 (either that or I'm doing something wrong, in which case help would also be appreciated!). Here's a code snippet that reproduces the problem:

    import torch
    import torch.nn as nn
    import numpy as np
    
    from openfold.model.model import AlphaFold
    from openfold.config import model_config
    from openfold.utils.tensor_utils import tensor_tree_map
    from openfold.data import data_transforms
    
    model_name = "model_1_ptm"
    
    conf = model_config(model_name, train=True)
    conf.data.common.max_recycling_iters = 0
    conf.data.train.subsample_templates = False
    conf.data.train.max_msa_clusters = 1
    conf.data.train.max_extra_msa = 1
    conf.data.train.max_templates = 1
    
    # copied from openfold/test/data_utils.py
    def random_template_feats(n_templ, n, batch_size=None):
        b = []
        if batch_size is not None:
            b.append(batch_size)
        batch = {
            "template_mask": np.random.randint(0, 2, (*b, n_templ)),
            "template_pseudo_beta_mask": np.random.randint(0, 2, (*b, n_templ, n)),
            "template_pseudo_beta": np.random.rand(*b, n_templ, n, 3),
            "template_aatype": np.random.randint(0, 22, (*b, n_templ, n)),
            "template_all_atom_mask": np.random.randint(
                0, 2, (*b, n_templ, n, 37)
            ),
            "template_all_atom_positions": 
                np.random.rand(*b, n_templ, n, 37, 3) * 10,
            "template_torsion_angles_sin_cos": 
                np.random.rand(*b, n_templ, n, 7, 2),
            "template_alt_torsion_angles_sin_cos": 
                np.random.rand(*b, n_templ, n, 7, 2),
            "template_torsion_angles_mask": 
                np.random.rand(*b, n_templ, n, 7),
        }
        batch = {k: v.astype(np.float32) for k, v in batch.items()}
        batch["template_aatype"] = batch["template_aatype"].astype(np.int64)
        return batch
    
    
    def random_extra_msa_feats(n_extra, n, batch_size=None):
        b = []
        if batch_size is not None:
            b.append(batch_size)
        batch = {
            "extra_msa": np.random.randint(0, 22, (*b, n_extra, n)).astype(
                np.int64
            ),
            "extra_has_deletion": np.random.randint(0, 2, (*b, n_extra, n)).astype(
                np.float32
            ),
            "extra_deletion_value": np.random.rand(*b, n_extra, n).astype(
                np.float32
            ),
            "extra_msa_mask": np.random.randint(0, 2, (*b, n_extra, n)).astype(
                np.float32
            ),
        }
        return batch
    
    n_templ = 1
    n_res = 256
    n_extra_seq = 1
    n_seq = 1
    bsize = 2
    
    model = AlphaFold(conf).cuda()
    
    batch = {}
    
    
    tf = torch.randint(conf.model.input_embedder.tf_dim - 1, size=(bsize, n_res))
    batch["target_feat"] = nn.functional.one_hot(tf, conf.model.input_embedder.tf_dim).float()
    batch["aatype"] = torch.argmax(batch["target_feat"], dim=-1)
    
    batch["target_feat"] = torch.rand((bsize, n_res, conf.model.input_embedder.tf_dim))
    batch["residue_index"] = torch.rand((bsize, n_res))
    batch["msa_feat"] = torch.rand((bsize, n_seq, n_res, conf.model.input_embedder.msa_dim))
    
    
    t_feats = random_template_feats(n_templ, n_res, batch_size=bsize)
    batch.update({k: torch.tensor(v) for k, v in t_feats.items()})
    
    extra_feats = random_extra_msa_feats(n_extra_seq, n_res, batch_size=bsize)
    batch.update({k: torch.tensor(v) for k, v in extra_feats.items()})
    
    batch["msa_mask"] = torch.randint(low=0, high=2, size=(bsize, n_seq, n_res)).float()
    batch["seq_mask"] = torch.randint(low=0, high=2, size=(bsize, n_res)).float()
    batch.update(data_transforms.make_atom14_masks(batch))
    
    batch["no_recycling_iters"] = torch.tensor(0.)
    
    batch = tensor_tree_map(lambda t: t.unsqueeze(-1).cuda(), batch)
    
    out = model(batch)
    

    In this code I'm basically just running inference on the model with a batch size of 2, with templates enabled. For this demo I've created dummy inputs using the code from the /openfold/tests/ directory, although I've also had the same problem with a real data pipeline.

    The code above crashes with the error: RuntimeError: The size of tensor a (128) must match the size of tensor b (2) at non-singleton dimension 3, which occurs on line 189 of /openfold/openfold/model/model.py:

    t = t * (torch.sum(batch["template_mask"], dim=-1) > 0) 
    

    This line is basically just masking out activations from templates that don't exist according to batch["template_mask"]. However, there seems to be a dimension mismatch. If I print out the dimensions, t has shape [2, 256, 256, 128] and batch["template_mask"] has shape [2]. Based on the PyTorch broadcasting rules (https://pytorch.org/docs/stable/notes/broadcasting.html), those shapes aren't compatible to multiply. If I change the code to the following:

    t = t * (torch.sum(batch["template_mask"], dim=-1) > 0).view([-1,1,1,1])
    

    Then everything works fine. Is this a real bug in the code, or have I done something wrong to trigger this error? Thanks! For reference, my environment is the following:

    • Python 3.10.4
    • PyTorch 1.12.1
    • Numpy 1.23.1
    • Cuda 11.1
    • Latest OpenFold commit (6e930a6ca4accb14aa128ae40bd3f27906796589)
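The shape mismatch described above can be reproduced without a GPU by applying the broadcasting rule to the shapes alone. The sketch below is a plain-Python illustration of that rule with a hypothetical helper name, not PyTorch's implementation: shapes are aligned from the trailing dimension, and two sizes are compatible only if they are equal or one of them is 1.

```python
def broadcast_shape(a, b):
    """Return the broadcast result shape, or raise ValueError if incompatible."""
    ndim = max(len(a), len(b))
    result = []
    # Walk both shapes from the trailing dimension; missing dims count as 1.
    for i in range(1, ndim + 1):
        da = a[-i] if i <= len(a) else 1
        db = b[-i] if i <= len(b) else 1
        if da != db and da != 1 and db != 1:
            raise ValueError(f"size {da} vs {db} at dimension {ndim - i}")
        result.append(da if db == 1 else db)
    return tuple(reversed(result))


if __name__ == "__main__":
    t_shape = (2, 256, 256, 128)
    # Reshaped mask, as produced by .view([-1, 1, 1, 1]): compatible.
    print(broadcast_shape(t_shape, (2, 1, 1, 1)))
    # Unreshaped mask of shape (2,): its 2 is aligned against the trailing 128.
    try:
        broadcast_shape(t_shape, (2,))
    except ValueError as err:
        print("incompatible:", err)
```

The mask of shape (2,) is matched against the last dimension (128) rather than the batch dimension, which is why adding three trailing singleton dimensions resolves the error.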
    opened by jproney 9
  • OpenFold on Ampere Nvidia GPUs

    Hi,

    I am trying to install OpenFold on a machine with two RTX A5000s, but running into issues with PyTorch not supporting cards with compute capability SM 86. I saw on a previous post that you had trained OF on A100s, which will have a similar compute capability. Is there a method for installing OpenFold on newer GPU architectures?

    Many thanks!

    opened by WillExeter 9
  • --trace_model performance

    Hi, I have tested using the --trace_model mode on a small batch of sequences of the same length; I get an 80s tracing time followed by 20s inference for each sequence. If I just fold them without --trace_model it takes 18-19s for inference of each. Am I doing something wrong? There doesn't seem to be much documentation about this feature.

    opened by mrhoag5 8
  • Run OpenFold on CPU

    Hello,

    I have issues when running openfold on a CPU.

    When I execute the run_pretrained_openfold.py script with the --model_device cpu argument set, I get the following error:

    Traceback (most recent call last):
      File "/scratch/SCRATCH_SAS/roman/fold_test/openfold/run_pretrained_openfold.py", line 387, in <module>
        main(args)
      File "/scratch/SCRATCH_SAS/roman/fold_test/openfold/run_pretrained_openfold.py", line 254, in main
        out = run_model(model, processed_feature_dict, tag, args.output_dir)
      File "/scratch/SCRATCH_SAS/roman/fold_test/openfold/openfold/utils/script_utils.py", line 159, in run_model
        out = model(batch)
      File "/home/rjo21/anaconda3/envs/fold_serv2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/scratch/SCRATCH_SAS/roman/fold_test/openfold/openfold/model/model.py", line 512, in forward
        outputs, m_1_prev, z_prev, x_prev = self.iteration(
      File "/scratch/SCRATCH_SAS/roman/fold_test/openfold/openfold/model/model.py", line 366, in iteration
        z = self.extra_msa_stack(
      File "/home/rjo21/anaconda3/envs/fold_serv2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/scratch/SCRATCH_SAS/roman/fold_test/openfold/openfold/model/evoformer.py", line 1007, in forward
        m, z = b(m, z)
      File "/home/rjo21/anaconda3/envs/fold_serv2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/scratch/SCRATCH_SAS/roman/fold_test/openfold/openfold/model/evoformer.py", line 518, in forward
        self.msa_att_row(
      File "/home/rjo21/anaconda3/envs/fold_serv2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/scratch/SCRATCH_SAS/roman/fold_test/openfold/openfold/model/msa.py", line 266, in forward
        m = self._chunk(
      File "/scratch/SCRATCH_SAS/roman/fold_test/openfold/openfold/model/msa.py", line 121, in _chunk
        return chunk_layer(
      File "/scratch/SCRATCH_SAS/roman/fold_test/openfold/openfold/utils/chunk_utils.py", line 299, in chunk_layer
        output_chunk = layer(**chunks)
      File "/scratch/SCRATCH_SAS/roman/fold_test/openfold/openfold/model/msa.py", line 101, in fn
        return self.mha(
      File "/home/rjo21/anaconda3/envs/fold_serv2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/scratch/SCRATCH_SAS/roman/fold_test/openfold/openfold/model/primitives.py", line 492, in forward
        o = attention_core(q, k, v, *((biases + [None] * 2)[:2]))
      File "/scratch/SCRATCH_SAS/roman/fold_test/openfold/openfold/utils/kernel/attention_core.py", line 47, in forward
        attn_core_inplace_cuda.forward_(
    RuntimeError: input must be a CUDA tensor
    

    This tells me I need to pass a CUDA tensor to some attention routine, but I am running the code on CPU, so there should be no CUDA involved?!

    This is the environment (env.txt) I'm using on a normal linux 64-bit OS.

    Thanks in advance for any help. Roman

    opened by Old-Shatterhand 0
  • Alignment error

    Hi

    I ran the following script and got an error during alignment.

    python3 scripts/precompute_alignments.py data/pdb_mmcif/mmcif_files/ data/alignment/ \
        --uniref90_database_path data/uniref90/uniref90.fasta \
        --mgnify_database_path data/mgnify/mgy_clusters_2018_12.fa \
        --pdb70_database_path data/pdb70/pdb70 \
        --uniclust30_database_path data/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
        --bfd_database_path data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
        --cpus 16 \
        --jackhmmer_binary_path lib/conda/envs/openfold_venv/bin/jackhmmer \
        --hhblits_binary_path lib/conda/envs/openfold_venv/bin/hhblits \
        --hhsearch_binary_path lib/conda/envs/openfold_venv/bin/hhsearch \
        --kalign_binary_path lib/conda/envs/openfold_venv/bin/kalign
    

    Here is the error log:

    ERROR:root:- 09:17:56.463 ERROR: Error in /opt/conda/conda-bld/hhsuite_1645696999782/work/src/hhalignment.cpp:3539: MergeMasterSlave:
    ERROR:root:- 09:17:56.463 ERROR:        did not find 145 match states in sequence 1 of tr|A0A1D1YLJ1|A0A1D1YLJ1_9ARAE. Sequence:
    ERROR:root:GYKAPELTKMKDAGKESDIYSLGVIFLEMVTRKDTNSDFLPTWDLHLSNSLKNPVFDGKISEMISHGLLRQSREQNCITGEGLLMFLQLAIACRSPSPRLRPDIKQVLGKLEEIELWKLPNQFGGDRLPNRG
    ERROR:root:HHblits stderr end
    WARNING:root:Failed to run alignments for 7avy_A. Skipping...
    Exception in thread Thread-1:
    Traceback (most recent call last):
      File "scripts/precompute_alignments.py", line 40, in run_seq_group_alignments
        fasta_path, alignment_dir
      File "/scratch1/zx22/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/openfold-1.0.0-py3.7-linux-x86_64.egg/openfold/data/data_pipeline.py", line 485, in run
        self.hhblits_bfd_uniclust_runner.query(fasta_path)
      File "/scratch1/zx22/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/openfold-1.0.0-py3.7-linux-x86_64.egg/openfold/data/tools/hhblits.py", line 162, in query
        % (stdout.decode("utf-8"), stderr[:500_000].decode("utf-8"))
    RuntimeError: HHblits failed
    stdout:
    stderr:
    

    Could you help me with it? Thanks!

    opened by Ottovonxu 0
  • Training scripts.

    Hi

    I have downloaded the dataset following the DeepMind layout, and inference works fine. Currently, my data folder contains: bfd, mgnify, pdb70, pdb_mmcif, uniclust30, uniref90.

    May I ask how I should specify mmcif_dir/ here?

    python3 scripts/precompute_alignments.py mmcif_dir/ alignment_dir/ \
        --uniref90_database_path data/uniref90/uniref90.fasta \
        --mgnify_database_path data/mgnify/mgy_clusters_2018_12.fa \
        --pdb70_database_path data/pdb70/pdb70 \
        --uniclust30_database_path data/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
        --bfd_database_path data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
        --cpus 16 \
        --jackhmmer_binary_path lib/conda/envs/openfold_venv/bin/jackhmmer \
        --hhblits_binary_path lib/conda/envs/openfold_venv/bin/hhblits \
        --hhsearch_binary_path lib/conda/envs/openfold_venv/bin/hhsearch \
        --kalign_binary_path lib/conda/envs/openfold_venv/bin/kalign
    

    Thanks!

    opened by Ottovonxu 0
  • openfold/np/protein.py:to_pdb(): chain_tag sometimes not set

    I found what appears to be a rare case (once in millions of proteins) where the loop in to_pdb() sometimes fails to set chain_tag before closing the chain, causing an error:

    Traceback (most recent call last):
      File "/pscratch/sd/f/flowers/esm/scripts/esmfold_inference.py", line 186, in <module>
        pdbs = model.output_to_pdb(output)
      File "/pscratch/sd/f/flowers/miniconda3/lib/python3.9/site-packages/esm/esmfold/v1/esmfold.py", line 303, in output_to_pdb
        return output_to_pdb(output)
      File "/pscratch/sd/f/flowers/miniconda3/lib/python3.9/site-packages/esm/esmfold/v1/misc.py", line 115, in output_to_pdb
        pdbs.append(to_pdb(pred))
      File "/pscratch/sd/f/flowers/miniconda3/lib/python3.9/site-packages/openfold/np/protein.py", line 373, in to_pdb
        f"{chain_tag:>1}{residue_index[i]:>4}"
    UnboundLocalError: local variable 'chain_tag' referenced before assignment
    

    It's possible esmfold was passing bad parameters, but adding a check to set chain_tag to "A" if not set allowed the code to run without errors.
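The failure mode and the workaround described above can be sketched as follows. This is a simplified, hypothetical stand-in and not OpenFold's actual to_pdb() code: the function `write_records`, the `Residue` tuple layout, and the `default_chain_tag` parameter are all made up for illustration. It only shows why a variable assigned inside an inner loop can be unbound at the TER line, and how initialising it first avoids the error.

```python
from typing import List, Tuple

# Hypothetical layout: (res_name, res_index, [(atom_name, chain_tag), ...]).
# This is NOT OpenFold's data structure, only an illustration.
Residue = Tuple[str, int, List[Tuple[str, str]]]


def write_records(residues: List[Residue], default_chain_tag: str = "A") -> List[str]:
    # Assumption: fall back to chain "A" when no atom ever sets the tag.
    chain_tag = default_chain_tag
    lines = []
    for res_name, res_index, atoms in residues:
        for atom_name, tag in atoms:
            # The tag is only assigned here, i.e. when a residue has atoms.
            chain_tag = tag
            lines.append(f"ATOM   {atom_name:<4}{res_name:>3} {chain_tag:>1}{res_index:>4}")
    if residues:
        res_name, res_index, _ = residues[-1]
        # Without the default above, this line would raise UnboundLocalError
        # whenever every residue had an empty atom list.
        lines.append(f"TER    {res_name:>3} {chain_tag:>1}{res_index:>4}")
    return lines
```

For example, `write_records([("UNK", 220, [])])` yields a single TER line on chain A instead of raising.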

    The protein in question was

    MAPVKVFGPAKSRNVARVLVCLEEVGAEYEVVDMDLKALEHKSPEHLARNPFGQTPAFQDGDLLLFESRAISRYVLRKYKTNQVDLLREGNLKEAAMVDVWTEVDAHTYNPAISPVVYECLINPLVLGIPTNQKVVDESLEKLKKALEVYEAHLSKDKYLAGDFMSFADINHFPHTCSFMAAPHAVLFDSYPYVKAWWERLMARPSIKKLSASLAPPKA*

    And the tail of the output pdb (when run with the modified code) was:

    ATOM 1736 CB ALA A 219 -14.556 -18.156 -6.584 1.00 83.46 C
    ATOM 1737 O ALA A 219 -16.753 -18.815 -4.504 1.00 84.66 O
    TER 1738 UNK A 220
    PARENT N/A
    TER 1739 ALA A 1
    END

    opened by flowers9 0
  • Should train_chain_data_cache_path be a required argument?

    Although the current argparse parser allows the user to not pass a value for train_chain_data_cache_path, the current implementation of data_modules.OpenFoldDataset (specifically the inner function, looped_samples) assumes that the cache object is not None. If the user does not supply a cache path, then the training script simply fails with a StopIteration, as it tries to get a cache entry from a None object on line 371:

    https://github.com/aqlaboratory/openfold/blob/59277de16825cfdafe37033012d0530595b9ad6d/openfold/data/data_modules.py#L360-L374

    It seems like OpenFold's datasets have been built to support parsing structure files on the fly as well, so which of the two options would be preferred going forward: 1) make train_chain_data_cache_path required, so the user does not hit an unexpected failure when the data is loaded, or 2) add support in OpenFoldDataset/looped_samples for the case where the cache is None?
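Option 1 could look roughly like the sketch below. It is a minimal illustration and not the actual training script: `build_parser` and `validate_args` are hypothetical helpers, and the flag spelling `--train_chain_data_cache_path` is assumed from the argument name above. The idea is simply to fail early with a clear message instead of hitting a StopIteration deep inside the data loader.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in parser with a single optional argument.
    parser = argparse.ArgumentParser()
    parser.add_argument("--train_chain_data_cache_path", type=str, default=None)
    return parser


def validate_args(args: argparse.Namespace) -> None:
    # Fail fast, before any dataset is constructed.
    if args.train_chain_data_cache_path is None:
        raise ValueError(
            "train_chain_data_cache_path is required for training; "
            "no chain data cache was supplied."
        )


if __name__ == "__main__":
    validate_args(build_parser().parse_args())
```

Passing `required=True` to `add_argument` would achieve the same thing more tersely; the explicit check is used here only because it allows a custom message and leaves room for option 2 later.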

    Happy to help implement something either way!

    opened by jonathanking 0
Releases(v1.0.1)
  • v1.0.1(Nov 23, 2022)

    OpenFold as of the release of our manuscript. Many new features, including FP16 training + more stable training.

    What's Changed

    • use multiple models for inference by @decarboxy in https://github.com/aqlaboratory/openfold/pull/117
    • Update input processing by @brianloyal in https://github.com/aqlaboratory/openfold/pull/116
    • adding a caption to the image in the readme by @decarboxy in https://github.com/aqlaboratory/openfold/pull/133
    • Properly handling file outputs when multiple models are evaluated by @decarboxy in https://github.com/aqlaboratory/openfold/pull/142
    • Fix for issue in download_mgnify.sh by @josemduarte in https://github.com/aqlaboratory/openfold/pull/166
    • Fix tag-sequence mismatch when predicting for multiple fastas by @sdvillal in https://github.com/aqlaboratory/openfold/pull/164
    • Support openmm >= 7.6 by @sdvillal in https://github.com/aqlaboratory/openfold/pull/163
    • Fixing issue in download_uniref90.sh by @josemduarte in https://github.com/aqlaboratory/openfold/pull/171
    • Fix propagation of use_flash for offloaded inference by @epenning in https://github.com/aqlaboratory/openfold/pull/178
    • Update deepspeed version to 0.5.10 by @NZ99 in https://github.com/aqlaboratory/openfold/pull/185
    • Fixes errors when processing .pdb files by @NZ99 in https://github.com/aqlaboratory/openfold/pull/188
    • fix incorrect learning rate warm-up after restarting from ckpt by @Zhang690683220 in https://github.com/aqlaboratory/openfold/pull/182
    • Add opencontainers image-spec to Dockerfile by @SauravMaheshkar in https://github.com/aqlaboratory/openfold/pull/128
    • Write inference and relaxation timings to a file by @brianloyal in https://github.com/aqlaboratory/openfold/pull/201
    • Minor fixes in setup scripts by @timodonnell in https://github.com/aqlaboratory/openfold/pull/202
    • Minor optimizations & fixes to support ESMFold by @nikitos9000 in https://github.com/aqlaboratory/openfold/pull/199
    • Drop chains that are missing (structure) data in training by @timodonnell in https://github.com/aqlaboratory/openfold/pull/210
    • adding a script for threading a sequence onto a structure by @decarboxy in https://github.com/aqlaboratory/openfold/pull/206
    • Set pin_memory to True in default dataloader config. by @NZ99 in https://github.com/aqlaboratory/openfold/pull/212
    • Fix missing subtract_plddt argument in prep_output call by @mhrmsn in https://github.com/aqlaboratory/openfold/pull/217
    • fp16 fixes by @beiwang2003 in https://github.com/aqlaboratory/openfold/pull/222
    • Set clamped vs unclamped FAPE for each sample in batch independently by @ar-nowaczynski in https://github.com/aqlaboratory/openfold/pull/223
    • Fix probabilities type (int -> float) by @atgctg in https://github.com/aqlaboratory/openfold/pull/225
    • Small fix for prep_mmseqs_dbs. by @jonathanking in https://github.com/aqlaboratory/openfold/pull/232

    New Contributors

    • @brianloyal made their first contribution in https://github.com/aqlaboratory/openfold/pull/116
    • @josemduarte made their first contribution in https://github.com/aqlaboratory/openfold/pull/166
    • @sdvillal made their first contribution in https://github.com/aqlaboratory/openfold/pull/164
    • @epenning made their first contribution in https://github.com/aqlaboratory/openfold/pull/178
    • @NZ99 made their first contribution in https://github.com/aqlaboratory/openfold/pull/185
    • @Zhang690683220 made their first contribution in https://github.com/aqlaboratory/openfold/pull/182
    • @SauravMaheshkar made their first contribution in https://github.com/aqlaboratory/openfold/pull/128
    • @timodonnell made their first contribution in https://github.com/aqlaboratory/openfold/pull/202
    • @nikitos9000 made their first contribution in https://github.com/aqlaboratory/openfold/pull/199
    • @mhrmsn made their first contribution in https://github.com/aqlaboratory/openfold/pull/217
    • @beiwang2003 made their first contribution in https://github.com/aqlaboratory/openfold/pull/222
    • @ar-nowaczynski made their first contribution in https://github.com/aqlaboratory/openfold/pull/223
    • @atgctg made their first contribution in https://github.com/aqlaboratory/openfold/pull/225
    • @jonathanking made their first contribution in https://github.com/aqlaboratory/openfold/pull/232

    Full Changelog: https://github.com/aqlaboratory/openfold/compare/v1.0.0...v1.0.1

  • v1.0.0(Jun 22, 2022)

    OpenFold at the time of the release of our original model parameters and training database. Adds countless improvements over the previous beta release, including, but not limited to:

    • Many bugfixes that contribute to more stable, more correct, and more versatile training
    • Options to run OpenFold using our original weights
    • Custom attention kernels and alternative attention implementations that greatly reduce peak memory usage
    • A vastly superior Colab notebook that runs inference many times faster than the original
    • Efficient scripts for computation of alignments, including the option to run MMseqs2's alignment pipeline
    • Vastly improved logging during training & inference
    • Careful optimizations for significantly improved speeds & memory usage during both inference and training
    • Opportunistic optimizations that dynamically speed up inference on short (< ~1500 residues) chains
    • Certain changes borrowed from updates made to the AlphaFold repo, including bugfixes, GPU relaxation, etc.
    • "AlphaFold-Gap" support allows inference on complexes using OpenFold and AlphaFold weights
    • WIP OpenFold-Multimer implementation on the multimer branch
    • Improved testing for the data pipeline
    • Partial CPU offloading extends the upper limit on inference sequence lengths
    • Docker support
    • Missing features from the original release, including learning rate schedulers, distillation set support, etc.

    Full Changelog: https://github.com/aqlaboratory/openfold/compare/v0.1.0...v1.0.0

  • v0.1.0(Nov 18, 2021)

Owner
AQ Laboratory