My take on a practical implementation of Linformer for Pytorch.

Overview

Linformer Pytorch Implementation

Linear Self Attention

A practical implementation of the Linformer paper. This is attention with only linear complexity in n, allowing very long sequences (1 million+ tokens) to be attended to on modern hardware.
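To make the linear-complexity claim concrete, here is a minimal sketch of a single Linformer-style attention head in plain PyTorch. It is not this repo's code: the function name linformer_attention and the tensor layout are assumptions made for illustration, and E and F are random (n, k) tensors standing in for the learned projections.

```python
import math
import torch

def linformer_attention(q, k, v, E, F):
    """q, k, v: (batch, n, d); E, F: (n, k_dim) projection matrices (hypothetical layout)."""
    # Project the sequence axis of the keys and values from n down to k_dim.
    k_proj = torch.einsum("bnd,nk->bkd", k, E)
    v_proj = torch.einsum("bnd,nk->bkd", v, F)
    # The score matrix has shape (batch, n, k_dim) instead of (batch, n, n).
    scores = q @ k_proj.transpose(1, 2) / math.sqrt(q.shape[-1])
    p_bar = torch.softmax(scores, dim=-1)
    return p_bar @ v_proj, p_bar

torch.manual_seed(0)
n, k_dim, d = 16, 4, 8
q = torch.randn(2, n, d)
E = torch.randn(n, k_dim)
F = torch.randn(n, k_dim)
out, p_bar = linformer_attention(q, q, q, E, F)
print(out.shape, p_bar.shape)  # torch.Size([2, 16, 8]) torch.Size([2, 16, 4])
```

Because the softmax runs over k_dim columns rather than n, the memory for the attention matrix grows with n * k_dim instead of n * n.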

This repo is an Attention Is All You Need style transformer, complete with an encoder and decoder module. The novelty here is that now, one can make the attention heads linear. Check out how to use it below.

This is in the process of being validated on wikitext-2. Currently, it performs at the same level as other sparse attention mechanisms, like the Sinkhorn Transformer, but the best hyperparameters still have to be found.

Visualization of the heads is also possible. To see more information, check out the Visualization section below.

I am not the author of the paper.

Install

pip install linformer-pytorch

Alternatively,

git clone https://github.com/tatp22/linformer-pytorch.git
cd linformer-pytorch

Code example

Linformer Language Model

from linformer_pytorch import LinformerLM
import torch

model = LinformerLM(
        num_tokens=10000, # Number of tokens in the LM
        input_size=512, # Dimension 1 of the input
        channels=64, # Dimension 2 of the input
        dim_d=None, # Overwrites the inner dim of the attention heads. If None, sticks with the recommended channels // nhead, as in the "Attention is all you need" paper
        dim_k=128, # The second dimension of the P_bar matrix from the paper
        dim_ff=128, # Dimension in the feed forward network
        dropout_ff=0.15, # Dropout for feed forward network
        nhead=4, # Number of attention heads
        depth=2, # How many times to run the model
        dropout=0.1, # How much dropout to apply to P_bar after softmax
        activation="gelu", # What activation to use. Currently, only gelu and relu supported, and only on ff network.
        use_pos_emb=True, # Whether or not to use positional embeddings
        checkpoint_level="C0", # What checkpoint level to use. For more information, see below.
        parameter_sharing="layerwise", # What level of parameter sharing to use. For more information, see below.
        k_reduce_by_layer=0, # Going down `depth`, how much to reduce `dim_k` by, for the `E` and `F` matrices. Will have a minimum value of 1.
        full_attention=False, # Use full attention instead, for O(n^2) time and space complexity. Included here just for comparison
        include_ff=True, # Whether or not to include the Feed Forward layer
        w_o_intermediate_dim=None, # If not None, have 2 w_o matrices, such that instead of `dim*nhead,channels`, you have `dim*nhead,w_o_int`, and `w_o_int,channels`
        emb_dim=128, # If you want the embedding dimension to be different than the channels for the Linformer
        causal=False, # If you want this to be a causal Linformer, where the upper right of the P_bar matrix is masked out.
        method="learnable", # The method of how to perform the projection. Supported methods are 'convolution', 'learnable', and 'no_params'
        ff_intermediate=None, # See the section below for more information
        ).cuda()
x = torch.randint(1,10000,(1,512)).cuda()
y = model(x)
print(y.shape) # (1, 512, 10000)

Linformer self attention, stacks of MHAttention and FeedForwards

from linformer_pytorch import Linformer
import torch

model = Linformer(
        input_size=262144, # Dimension 1 of the input
        channels=64, # Dimension 2 of the input
        dim_d=None, # Overwrites the inner dim of the attention heads. If None, sticks with the recommended channels // nhead, as in the "Attention is all you need" paper
        dim_k=128, # The second dimension of the P_bar matrix from the paper
        dim_ff=128, # Dimension in the feed forward network
        dropout_ff=0.15, # Dropout for feed forward network
        nhead=4, # Number of attention heads
        depth=2, # How many times to run the model
        dropout=0.1, # How much dropout to apply to P_bar after softmax
        activation="gelu", # What activation to use. Currently, only gelu and relu supported, and only on ff network.
        checkpoint_level="C0", # What checkpoint level to use. For more information, see below.
        parameter_sharing="layerwise", # What level of parameter sharing to use. For more information, see below.
        k_reduce_by_layer=0, # Going down `depth`, how much to reduce `dim_k` by, for the `E` and `F` matrices. Will have a minimum value of 1.
        full_attention=False, # Use full attention instead, for O(n^2) time and space complexity. Included here just for comparison
        include_ff=True, # Whether or not to include the Feed Forward layer
        w_o_intermediate_dim=None, # If not None, have 2 w_o matrices, such that instead of `dim*nhead,channels`, you have `dim*nhead,w_o_int`, and `w_o_int,channels`
        ).cuda()
x = torch.randn(1, 262144, 64).cuda()
y = model(x)
print(y.shape) # (1, 262144, 64)

Linformer Multihead attention

from linformer_pytorch import MHAttention, get_EF
import torch

# Assumption: the projections are built with get_EF(input_size, dim_k) and passed by keyword
E_proj = get_EF(512, 128)
F_proj = get_EF(512, 128)

model = MHAttention(
        input_size=512, # Dimension 1 of the input
        channels=64, # Dimension 2 of the input
        dim=8, # Dim of each attn head
        dim_k=128, # What to sample the input length down to
        nhead=8, # Number of heads
        dropout=0, # Dropout for each of the heads
        activation="gelu", # Activation after attention has been concat'd
        checkpoint_level="C2", # If C2, checkpoint each of the heads
        parameter_sharing="layerwise", # What level of parameter sharing to do
        E_proj=E_proj, F_proj=F_proj, # The E and F projection matrices
        full_attention=False, # Use full attention instead
        w_o_intermediate_dim=None, # If not None, have 2 w_o matrices, such that instead of `dim*nhead,channels`, you have `dim*nhead,w_o_int`, and `w_o_int,channels`
        )
x = torch.randn(1, 512, 64)
y = model(x)
print(y.shape) # (1, 512, 64)

The Linear attention head, the novelty of the paper

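from linformer_pytorch import LinearAttentionHead, get_EF
import torch

# Assumption: the projections are built with get_EF(n, k); k=128 is only an illustrative choice
E_proj = get_EF(512, 128)
F_proj = get_EF(512, 128)

model = LinearAttentionHead(
        dim=64, # Dim 2 of the input
        dropout=0.1, # Dropout of the P matrix
        E_proj=E_proj, F_proj=F_proj, # The E and F layers
        full_attention=False, # Use Full Attention instead
        )
x = torch.randn(1, 512, 64)
y = model(x, x, x)
print(y.shape) # (1, 512, 64)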

An encoder/decoder module.

Note: For causal sequences, one can set the causal=True flag in LinformerLM to mask out the top right of the (n,k) attention matrix.

import torch
from linformer_pytorch import LinformerLM

encoder = LinformerLM(
    num_tokens=10000,
    input_size=512,
    channels=16,
    dim_k=16,
    dim_ff=32,
    nhead=4,
    depth=3,
    activation="relu",
    k_reduce_by_layer=1,
    return_emb=True,
    )
decoder = LinformerLM(
    num_tokens=10000,
    input_size=512,
    channels=16,
    dim_k=16,
    dim_ff=32,
    nhead=4,
    depth=3,
    activation="relu",
    decoder_mode=True,
    )

x = torch.randint(1,10000,(1,512))
y = torch.randint(1,10000,(1,512))

x_mask = torch.ones_like(x).bool()
y_mask = torch.ones_like(y).bool()

enc_output = encoder(x, input_mask=x_mask)
print(enc_output.shape) # (1, 512, 128)
dec_output = decoder(y, embeddings=enc_output, input_mask=y_mask, embeddings_mask=x_mask)
print(dec_output.shape) # (1, 512, 10000)

An easy way to get the E and F matrices is to call the get_EF function. As an example, for an n of 1000 and a k of 100:

from linformer_pytorch import get_EF
import torch

E = get_EF(1000, 100)

Downsampling Methods

With the method flag, one can choose how the Linformer performs downsampling. Currently, three methods are supported:

  • learnable: This downsampling method creates a learnable n,k nn.Linear module.
  • convolution: This downsampling method creates a 1d convolution, with stride length and kernel size n/k.
  • no_params: This creates a fixed n,k matrix with values from N(0,1/k).

In the future, I may include pooling or something else. But for now, these are the options that exist.
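As a rough illustration of the three options, the sketch below builds each kind of n-to-k downsampling with standard PyTorch modules and applies it along the sequence axis. This is an assumption about the general technique, not a copy of what get_EF does internally; in particular, reading N(0,1/k) as "variance 1/k" is my interpretation, and the variable names are made up for the example.

```python
import torch
import torch.nn as nn

n, k, d = 512, 128, 16          # sequence length, projected length, feature dim
x = torch.randn(1, n, d)        # (batch, n, d)
x_t = x.transpose(1, 2)         # (batch, d, n): sequence axis last

# learnable: a trainable n -> k linear map applied along the sequence axis.
learnable = nn.Linear(n, k)
out_learnable = learnable(x_t).transpose(1, 2)

# convolution: kernel size and stride of n // k shrink the length from n to k.
conv = nn.Conv1d(d, d, kernel_size=n // k, stride=n // k)
out_conv = conv(x_t).transpose(1, 2)

# no_params: a fixed random matrix that is never trained.
# Assumption: N(0, 1/k) is read as variance 1/k, so the std is k ** -0.5.
fixed = torch.randn(n, k) * k ** -0.5
out_fixed = (x_t @ fixed).transpose(1, 2)

print(out_learnable.shape, out_conv.shape, out_fixed.shape)
```

All three outputs have shape (1, 128, 16): the sequence axis is reduced from n to k while the feature dimension is left alone.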

Checkpoint levels

As an attempt to introduce further memory savings, the concept of checkpoint levels has been introduced. The current three checkpoint levels are C0, C1, and C2. When going up checkpoint levels, one sacrifices speed for memory savings. That is, checkpoint level C0 is the fastest, but takes up the most space on the GPU, while C2 is the slowest, but takes up the least space on the GPU. The details of each checkpoint level are as follows:

  • C0: No checkpointing. The model runs while keeping all of the attention heads and ff layers in the GPU memory.
  • C1: Checkpoint each MultiHead attention as well as each ff layer. With this, increasing depth should have minimal impact on the memory.
  • C2: Along with the optimizations at the C1 level, checkpoint each head in each MultiHead Attention layer. With this, increasing nhead should have less of an impact on memory. However, concatenating the heads together with torch.cat still takes up a lot of memory, and this will hopefully be optimized out in the future.

Performance details are still unknown, but the option exists for users that want to try.
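For readers unfamiliar with checkpointing, the sketch below shows the underlying PyTorch mechanism on a small feed forward block. It only illustrates the speed-for-memory trade described above; how this repo wires the C1 and C2 levels internally is not shown here, and the use_reentrant keyword assumes a reasonably recent PyTorch release.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)
ff = nn.Sequential(nn.Linear(16, 32), nn.GELU(), nn.Linear(32, 16))
x = torch.randn(4, 16, requires_grad=True)

# No checkpointing: the activations inside `ff` are stored for the backward pass.
y_plain = ff(x)

# Checkpointing: intermediate activations are dropped and recomputed during backward.
y_ckpt = checkpoint(ff, x, use_reentrant=False)

y_ckpt.sum().backward()
print(torch.allclose(y_plain, y_ckpt), x.grad.shape)
```

The outputs match; the checkpointed path simply pays for a second forward pass of the block when gradients are computed.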

Parameter Sharing

Another attempt to introduce memory savings in the paper was to introduce parameter sharing between projections. This is mentioned in section 4 of the paper; in particular, there were 4 different types of parameter sharing that the authors discussed, and all have been implemented in this repo. The first option takes up the most memory, and each further option reduces the necessary memory requirements.

  • none: No parameter sharing. Every head in every layer gets its own E and F matrix.
  • headwise: Each layer has a unique E and F matrix. All heads in the layer share this matrix.
  • kv: Each layer has a unique projection matrix P, and E = F = P for each layer. All heads share this projection matrix P.
  • layerwise: There is one projection matrix P, and every head in every layer uses E = F = P.

As stated in the paper, this means that for a 12 layer, 12 head network, there would be 288, 24, 12 and 1 different projection matrices, respectively.
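The counts quoted above follow directly from the four sharing rules. A small helper (the name num_projection_matrices is made up for this example) reproduces the arithmetic:

```python
def num_projection_matrices(parameter_sharing, layers, heads):
    """Count distinct projection matrices for each sharing mode."""
    counts = {
        "none": 2 * layers * heads,  # a separate E and F for every head in every layer
        "headwise": 2 * layers,      # one E and one F per layer
        "kv": layers,                # one shared P per layer (E = F = P)
        "layerwise": 1,              # a single P for the whole model
    }
    return counts[parameter_sharing]

for mode in ("none", "headwise", "kv", "layerwise"):
    print(mode, num_projection_matrices(mode, layers=12, heads=12))
# none 288
# headwise 24
# kv 12
# layerwise 1
```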

Note that with the k_reduce_by_layer option, the layerwise option will not be effective, since it will use the dimension of k for the first layer. Therefore, if the value of k_reduce_by_layer is greater than 0, one should most likely not use the layerwise sharing option.
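To see why the two options conflict, consider the projected dimension at each layer. The sketch below assumes that k shrinks linearly with the layer index and is floored at 1, which matches the description of k_reduce_by_layer but is not taken from the repo's source; k_per_layer is a hypothetical helper.

```python
def k_per_layer(dim_k, depth, k_reduce_by_layer):
    """Projected dimension at each layer, assuming a linear reduction with a floor of 1."""
    return [max(1, dim_k - i * k_reduce_by_layer) for i in range(depth)]

print(k_per_layer(dim_k=16, depth=3, k_reduce_by_layer=1))  # [16, 15, 14]
print(k_per_layer(dim_k=2, depth=4, k_reduce_by_layer=1))   # [2, 1, 1, 1]
```

Since each layer would need an (n, k) projection with a different k, a single matrix shared across all layers cannot have the right shape for every layer.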

Also, note that according to the authors, in figure 3, this parameter sharing doesn't really affect the end result too much. So it may be best to just stick with layerwise sharing for everything, but the option exists for users to try it out.

Padder

One slight problem with the current implementation of the Linformer is that your sequence length has to match the input_size flag of the model. The Padder pads the input size such that the tensor can be fed into the network. An example:

from linformer_pytorch import Linformer, Padder
import torch

model = Linformer(
        input_size=512,
        channels=16,
        dim_d=32,
        dim_k=16,
        dim_ff=32,
        nhead=6,
        depth=3,
        checkpoint_level="C1",
        )
model = Padder(model)
x = torch.randn(1, 500, 16) # This does not match the input size!
y = model(x)
print(y.shape) # (1, 500, 16)
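The idea behind such a wrapper can be sketched with torch.nn.functional.pad: zero-pad the sequence axis up to input_size, run the model, then crop the output back to the original length. The helper pad_sequence_to below is hypothetical and only shows the padding and cropping steps, not the Padder class itself.

```python
import torch
import torch.nn.functional as F

def pad_sequence_to(x, input_size):
    """Zero-pad x of shape (batch, seq_len, channels) along seq_len."""
    seq_len = x.shape[1]
    if seq_len > input_size:
        raise ValueError("sequence is longer than input_size")
    # F.pad takes (left, right) pairs starting from the last dim: channels, then seq_len.
    return F.pad(x, (0, 0, 0, input_size - seq_len))

x = torch.randn(1, 500, 16)
padded = pad_sequence_to(x, 512)
# A wrapped model would run on `padded`; its output is then cropped back.
cropped = padded[:, : x.shape[1]]
print(padded.shape, cropped.shape)  # torch.Size([1, 512, 16]) torch.Size([1, 500, 16])
```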

Visualization

Attention Head Vis

Starting with version 0.8.0, one can now visualize the attention heads of the linformer! To see this in action, simply import the Visualizer class, and run the plot_all_heads() function to see a picture of all the attention heads at each level, of size (n,k). Make sure that you specify visualize=True in the forward pass, as this saves the P_bar matrix so that the Visualizer class can properly visualize the head.

A working example of the code can be found below, and the same code can be found in ./examples/example_vis.py:

import torch
from linformer_pytorch import Linformer, Visualizer

model = Linformer(
        input_size=512,
        channels=16,
        dim_k=128,
        dim_ff=32,
        nhead=4,
        depth=3,
        activation="relu",
        checkpoint_level="C0",
        parameter_sharing="layerwise",
        k_reduce_by_layer=1,
        )
# One can load the model weights here
x = torch.randn(1, 512, 16) # What input you want to visualize
y = model(x, visualize=True)
vis = Visualizer(model)
vis.plot_all_heads(title="All P_bar matrices", # Change the title if you'd like
                   show=True, # Show the picture
                   save_file="./heads.png", # If not None, save the picture to a file
                   figsize=(8,6), # How big the figure should be
                   n_limit=None # If not None, limit how much from the `n` dimension to show
                   )

A detailed explanation of what these heads mean can be found in https://github.com/tatp22/linformer-pytorch/issues/15.

Encoder Decoder Module

Similar to the Reformer, I will be attempting to make an Encoder/Decoder Module, so that training can be simplified. This works like 2 LinformerLM classes. Params can be adjusted individually for each one, with the encoder having the enc_ prefix for all of the hyperparams, and the decoder having the dec_ prefix in a similar fashion. So far, what is implemented is:

import torch
from linformer_pytorch import LinformerEncDec

encdec = LinformerEncDec(
    enc_num_tokens=10000,
    enc_input_size=512,
    enc_channels=16,
    dec_num_tokens=10000,
    dec_input_size=512,
    dec_channels=16,
)

x = torch.randint(1,10000,(1,512))
y = torch.randint(1,10000,(1,512))

output = encdec(x,y)

I am planning to add a way to generate text sequences with this.

ff_intermediate tuning

Now, the model dimension can be different in the intermediate layers. This change applies to the ff module, and only in the encoder. Now, if the flag ff_intermediate is not None, the layers will look like this:

channels -> ff_dim -> ff_intermediate (For layer 1)
ff_intermediate -> ff_dim -> ff_intermediate (For layers 2 to depth-1)
ff_intermediate -> ff_dim -> channels (For layer depth)

As opposed to

channels -> ff_dim -> channels (For all layers)
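The layer shapes listed above can be written out as a small helper. This is only a sketch of the described pattern: ff_layer_dims is a made-up name, ff_dim in the diagram is taken to be the dim_ff argument, and the depth=1 case (channels in, channels out) is my assumption.

```python
def ff_layer_dims(channels, dim_ff, depth, ff_intermediate=None):
    """Return (in, hidden, out) sizes of the ff block at each layer."""
    if ff_intermediate is None:
        return [(channels, dim_ff, channels)] * depth
    dims = []
    for i in range(depth):
        d_in = channels if i == 0 else ff_intermediate
        d_out = channels if i == depth - 1 else ff_intermediate
        dims.append((d_in, dim_ff, d_out))
    return dims

print(ff_layer_dims(channels=64, dim_ff=128, depth=3, ff_intermediate=32))
# [(64, 128, 32), (32, 128, 32), (32, 128, 64)]
```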

Practical Tips

  • Note that the Linformer has O(nk) time and space complexity. So, while it may be linear in n, make sure that your k is not too large as well. These are editable with input_size and dim_k, respectively.
  • Speaking of k, the authors found empirically that "the performance of Linformer model is mainly determined by the projected dimension k instead of the ratio n/k". Therefore, even when increasing sequence lengths, it may be fine to keep a relatively low, constant k (the authors showed that with k=256, it still performed almost as well as a vanilla transformer).
  • One more tip for k: The authors recommend k = O(d/eps^2) if linear attention is to approximate full attention to within an error of eps.
  • This code, so far, is pretty much only linear layers and matrix multiplications. So libraries like apex should work with it; however, this has not been tested in practice.
  • In practice, I found that the memory and time requirements are more on the order of O(nkd), with n=input_size, k=dim_k, and d=dim_d.
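As a worked example of the first tip, the arithmetic below compares the size of the (n, k) attention matrix with the (n, n) matrix of full attention, using the input_size and dim_k values from the Linformer example earlier in this README. It counts matrix entries per head only and ignores every other cost.

```python
n, k = 262144, 128   # input_size and dim_k

linear_entries = n * k   # entries in the (n, k) P_bar matrix
full_entries = n * n     # entries in the (n, n) matrix of full attention

print(linear_entries)                  # 33554432
print(full_entries)                    # 68719476736
print(full_entries // linear_entries)  # 2048
```

The saving factor is simply n / k, so it shrinks as k grows toward n.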

Future work

  • Run some benchmark tests to see what the performance is (Doing that now)
  • Complete the LinformerEncDec class

Disclaimer

This is the first time that I am reproducing a result from a paper, so some things may be wrong. If you see a problem, please open up an issue, and I will attempt to work on it.

Thanks

Thank you to lucidrains, whose other sparse attention repositories helped me in designing this Linformer Repo.

Citations

@misc{wang2020linformer,
    title={Linformer: Self-Attention with Linear Complexity},
    author={Sinong Wang and Belinda Z. Li and Madian Khabsa and Han Fang and Hao Ma},
    year={2020},
    eprint={2006.04768},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
@inproceedings{vaswani2017attention,
  title={Attention is all you need},
  author={Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, {\L}ukasz and Polosukhin, Illia},
  booktitle={Advances in neural information processing systems},
  pages={5998--6008},
  year={2017}
}

"Listen with attention..."

Comments
  • padding mask and attention mask

    Hi there,

    You've done a great job and thanks for the sharing. I'm wondering how you deal with the masking stuff in Linformer since the attention shape, key and value shape have now changed to (n, k) instead of (n, n). I didn't find these in the code. Thanks for your time!

    opened by ZEKAICHEN 10
  • Composed linear layers?

    Hey @tatp22 great repo!

    I'm having trouble wrapping my head around the w_q, w_k, and w_v linear layers in the LinearAttentionHead module. Are they needed? There's no activation between the previous linear layers, to_q, to_k, to_v in MHAttention, and those weights so they wouldn't add any expressivity to the model since you would just be multiplying two matrices together which is equivalent to one linear layer. The E and F projections also seem like they're being composed with w_k, and w_v without a non-linearity.

    Looking at Eq. 7 from the paper your implementation seems correct though.

    Any thoughts on this?

    opened by apeguero1 5
  • Error when using method="no_params" and GPU, because E and F incorrectly remain on CPU

    When you create a Linformer() with method="no_params" and then load the model on your cuda device, you will get an error when trying to use the model. This is because the E and F matrices in the model accidentally remain on the CPU. When you call forward with an input, you will get an error at some point because the attention heads are trying to multiply the E matrix on CPU with another matrix on the GPU.

    Basically, when you call Linformer().cuda() under this situation, the E and F matrices are not moved to the GPU. (From what I've read so far, in order for them to be put on the GPU with a cuda() call, you also need to assign E to self.E in Linformer. However, this still doesn't fix it because of the lambdas in your Linformer initializer I think. The cuda() call can't track down the E and F inside the MHAttention objects inside the lambda call it seems)

    My temporary fix is changing E_proj = get_EF(input_size, dim_k, method, head_dim) in the __init__ of Linformer to E_proj = get_EF(input_size, dim_k, method, head_dim).cuda(), but I think this would give an error if you do not have a gpu installed.

    opened by RaivoKoot 4
  • causal_mask of the decoder

    Hi, you've done a great job, and thanks for sharing. I don't understand the causal_mask of the decoder: the shape of the attention matrix is (n, k), but only the (k, k) part is masked. Does it work? Are there any test results for a language model? Thanks for your time!

    opened by burcehan 4
  • Loss goes to 0 when using LinformerLM

    Hi, I used the LinformerLM class with causal=True to do some language modelling. However, there seems to be some leakage, as the loss goes to 0 after 1 epoch. Or am I using it wrongly? Thank you.

    These are my settings

    model = LinformerLM(
            num_tokens=ntoken, # Number of tokens in the LM
            input_size=args.seq_len, # Dimension 1 of the input
            channels=args.embsize, # Dimension 2 of the input
            dim_d=None, # Overwrites the inner dim of the attention heads. If None, sticks with the recommended channels // nhead, as in the "Attention is all you need" paper
            dim_k=16, # The second dimension of the P_bar matrix from the paper
            dim_ff=args.nhid, # Dimension in the feed forward network
            dropout_ff=args.dropout, # Dropout for feed forward network
            nhead=8, # Number of attention heads
            depth=12, # How many times to run the model
            dropout=args.dropout, # How much dropout to apply to P_bar after softmax
            activation="relu", # What activation to use. Currently, only gelu and relu supported, and only on ff network.
            checkpoint_level="C0", # What checkpoint level to use. For more information, see below.
            parameter_sharing="none", # What level of parameter sharing to use. For more information, see below.
            k_reduce_by_layer=0, # Going down `depth`, how much to reduce `dim_k` by, for the `E` and `F` matrices. Will have a minimum value of 1.
            full_attention=False, # Use full attention instead, for O(n^2) time and space complexity. Included here just for comparison
            include_ff=True, # Whether or not to include the Feed Forward layer
            w_o_intermediate_dim=None, # If not None, have 2 w_o matrices, such that instead of `dim*nhead,channels`, you have `dim*nhead,w_o_int`, and `w_o_int,channels`
            emb_dim=None, # If you want the embedding dimension to be different than the channels for the Linformer
            causal=True, # If you want this to be a causal Linformer, where the upper right of the P_bar matrix is masked out.
            method="learnable", # The method of how to perform the projection. Supported methods are 'convolution', 'learnable', and 'no_params'
            ff_intermediate=None, # See the section below for more information
            )
    
    opened by terencenwz 2
  • Different number of tokens and Character Level Modeling

    Hi, thank you for the open source code. I have been using Transformers for a while now, and I generally use them for character level modeling, that is, translation between two different languages. I was wondering if you could answer the following questions:

    1- Can I use a different number of tokens for the encoder and the decoder? This is because two different languages will have different tokens.
    2- I can probably use your code for character level modeling. At what point should I split the input stream of string tokens into characters? Is there any particular module you can point me to?

    I hope I am not asking for much :)

    Thank you!

    opened by wajihullahbaig 2
  • Error with DistributedDataParallel and parameter_sharing="layerwise"

    Hi, I am trying to run Linformer training with DistributedDataParallel and parameter_sharing="layerwise", and I get this error:

    Traceback (most recent call last):
      File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
        fn(i, *args)
      File "/home/jovyan/nlpdata/test_ddp_vanila_torch.py", line 95, in demo_basic
        loss_fn(output, labels).backward()
      File "/opt/conda/lib/python3.6/site-packages/torch/tensor.py", line 185, in backward
        torch.autograd.backward(self, gradient, retain_graph, create_graph)
      File "/opt/conda/lib/python3.6/site-packages/torch/autograd/__init__.py", line 127, in backward
        allow_unreachable=True)  # allow_unreachable flag
    RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases yet.3) Incorrect unused parameter detection. The return value of the `forward` function is inspected by the distributed data parallel wrapper to figure out if any of the module's parameters went unused. For unused parameters, DDP would not expect gradients from then. However, if an unused parameter becomes part of the autograd graph at a later point in time (e.g., in a reentrant backward when using `checkpoint`), the gradient will show up unexpectedly. If all parameters in the model participate in the backward pass, you can disable unused parameter detection by passing the keyword argument `find_unused_parameters=False` to `torch.nn.parallel.DistributedDataParallel`.
    Exception raised from mark_variable_ready at ../torch/csrc/distributed/c10d/reducer.cpp:484 (most recent call first):
    frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7f62b61fd99b in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
    frame #1: c10d::Reducer::mark_variable_ready(c10d::Reducer::VariableIndex) + 0xbe7 (0x7f62ef7edac7 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
    frame #2: c10d::Reducer::autograd_hook(c10d::Reducer::VariableIndex) + 0x93 (0x7f62ef7ede23 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
    frame #3: <unknown function> + 0xad2006 (0x7f62ef7ee006 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
    frame #4: <unknown function> + 0xad902a (0x7f62ef7f502a in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
    frame #5: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) + 0x4f9 (0x7f62ea50b889 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
    frame #6: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) + 0x4b4 (0x7f62ea50d3f4 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
    frame #7: torch::autograd::Engine::execute_with_graph_task(std::shared_ptr<torch::autograd::GraphTask> const&, std::shared_ptr<torch::autograd::Node>) + 0x33c (0x7f62ea50aa1c in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
    frame #8: torch::autograd::python::PythonEngine::execute_with_graph_task(std::shared_ptr<torch::autograd::GraphTask> const&, std::shared_ptr<torch::autograd::Node>) + 0x4c (0x7f62ef2495bc in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
    frame #9: torch::autograd::Engine::execute(std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, bool, bool, std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&) + 0x82f (0x7f62ea509d5f in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
    frame #10: torch::autograd::python::PythonEngine::execute(std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, bool, bool, std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&) + 0x74 (0x7f62ef2492f4 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
    frame #11: THPEngine_run_backward(THPEngine*, _object*, _object*) + 0xa10 (0x7f62ef24a070 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
    frame #12: _PyCFunction_FastCallDict + 0x154 (0x5572c4395304 in /opt/conda/bin/python)
    frame #13: _PyCFunction_FastCallKeywords + 0x50 (0x5572c43c1cd0 in /opt/conda/bin/python)
    frame #14: <unknown function> + 0x199b0c (0x5572c441cb0c in /opt/conda/bin/python)
    frame #15: _PyEval_EvalFrameDefault + 0x10c9 (0x5572c44405d9 in /opt/conda/bin/python)
    frame #16: <unknown function> + 0x192f26 (0x5572c4415f26 in /opt/conda/bin/python)
    frame #17: <unknown function> + 0x193f31 (0x5572c4416f31 in /opt/conda/bin/python)
    frame #18: <unknown function> + 0x199be5 (0x5572c441cbe5 in /opt/conda/bin/python)
    frame #19: _PyEval_EvalFrameDefault + 0x30a (0x5572c443f81a in /opt/conda/bin/python)
    frame #20: PyEval_EvalCodeEx + 0x329 (0x5572c4417a49 in /opt/conda/bin/python)
    frame #21: <unknown function> + 0x195864 (0x5572c4418864 in /opt/conda/bin/python)
    frame #22: PyObject_Call + 0x3e (0x5572c439510e in /opt/conda/bin/python)
    frame #23: _PyEval_EvalFrameDefault + 0x1aaf (0x5572c4440fbf in /opt/conda/bin/python)
    frame #24: <unknown function> + 0x192f26 (0x5572c4415f26 in /opt/conda/bin/python)
    frame #25: _PyFunction_FastCallDict + 0x1be (0x5572c441740e in /opt/conda/bin/python)
    frame #26: _PyObject_FastCallDict + 0x26f (0x5572c43956cf in /opt/conda/bin/python)
    frame #27: _PyObject_Call_Prepend + 0x63 (0x5572c439a143 in /opt/conda/bin/python)
    frame #28: PyObject_Call + 0x3e (0x5572c439510e in /opt/conda/bin/python)
    frame #29: torch::autograd::PyNode::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x193 (0x7f62ef2519f3 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
    frame #30: <unknown function> + 0x29d82c5 (0x7f62ea5112c5 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
    frame #31: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) + 0x14a8 (0x7f62ea50c838 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
    frame #32: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) + 0x4b4 (0x7f62ea50d3f4 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
    frame #33: torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x99 (0x7f62ea504ec9 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
    frame #34: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x5a (0x7f62ef24905a in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
    frame #35: <unknown function> + 0xbd6df (0x7f62fb49b6df in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
    frame #36: <unknown function> + 0x76db (0x7f6318d876db in /lib/x86_64-linux-gnu/libpthread.so.0)
    frame #37: clone + 0x3f (0x7f6318ab0a3f in /lib/x86_64-linux-gnu/libc.so.6)
    

    Code for reproducing

    import os
    import tempfile
    import torch
    import torch.distributed as dist
    import torch.nn as nn
    import torch.optim as optim
    import torch.multiprocessing as mp
    from linformer_pytorch import LinformerLM
    from torch.nn.parallel import DistributedDataParallel as DDP
    
    
    def setup(rank, world_size):
        os.environ['MASTER_ADDR'] = 'localhost'
        os.environ['MASTER_PORT'] = '12355'
    
        # initialize the process group
        dist.init_process_group("gloo", rank=rank, world_size=world_size)
    
    def cleanup():
        dist.destroy_process_group()
        
    def demo_basic(rank, world_size):
        print(f"Running basic DDP example on rank {rank}.")
        setup(rank, world_size)
    
        # create model and move it to GPU with id rank
        
        model = LinformerLM(
                num_tokens=30522,  # Number of tokens in the LM
                input_size=5120,  # Dimension 1 of the input
                channels=128,  # Dimension 2 of the input
                dim_d=None,
                # Overwrites the inner dim of the attention heads. If None, sticks with the recommended channels // nhead, as in the "Attention is all you need" paper
                dim_k=128,  # The second dimension of the P_bar matrix from the paper
                dim_ff=128,  # Dimension in the feed forward network
                dropout_ff=0.15,  # Dropout for feed forward network
                nhead=16,  # Number of attention heads
                depth=12,  # How many times to run the model
                dropout=0.1,  # How much dropout to apply to P_bar after softmax
                activation="gelu",
                # What activation to use. Currently, only gelu and relu supported, and only on ff network.
                checkpoint_level="C2",  # What checkpoint level to use. For more information, see below.
                parameter_sharing="layerwise",  # What level of parameter sharing to use. For more information, see below.
                k_reduce_by_layer=0,
                # Going down `depth`, how much to reduce `dim_k` by, for the `E` and `F` matrices. Will have a minimum value of 1.
                full_attention=False,
                # Use full attention instead, for O(n^2) time and space complexity. Included here just for comparison
                include_ff=True,  # Whether or not to include the Feed Forward layer
                w_o_intermediate_dim=None,
                # If not None, have 2 w_o matrices, such that instead of `dim*nhead,channels`, you have `dim*nhead,w_o_int`, and `w_o_int,channels`
                emb_dim=128,  # If you want the embedding dimension to be different than the channels for the Linformer
            ).to(rank)
        ddp_model = DDP(model, device_ids=[rank])
    
        loss_fn = nn.CrossEntropyLoss()
        optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)
    
        optimizer.zero_grad()
        outputs = ddp_model(torch.randint(20000, (3, 5120)))
        labels = torch.randint(20000, (3, 5120)).to(rank)
        loss_mx = labels != -100
        output = outputs[loss_mx].view(-1, 30522)
        labels = labels[loss_mx].view(-1)
        loss_fn(output, labels).backward()
        optimizer.step()
    
        cleanup()
    
    
    def run_demo(demo_fn, world_size):
        mp.spawn(demo_fn,
                 args=(world_size,),
                 nprocs=world_size,
                 join=True)
        
    if __name__ == "__main__":
        run_demo(demo_basic, 2)
    

    Also, this issue reproduces with any parameter sharing setting other than "none".

    opened by blizda 2
  • Error with DistributedDataParallel

    Hi, I am trying to run Linformer training with DistributedDataParallel and get the following error:

    -- Process 1 terminated with the following error:
    Traceback (most recent call last):
      File "/home/dbliznyuk/.conda/envs/test_crash/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
        fn(i, *args)
      File "/home/dbliznyuk/test_crash_skript/linformer-pytorch/test_ddp_vanila_torch.py", line 71, in demo_basic
        outputs = ddp_model(torch.randint(20000, (3, 5120)))
      File "/home/dbliznyuk/.conda/envs/test_crash/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/home/dbliznyuk/.conda/envs/test_crash/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 511, in forward
        output = self.module(*inputs[0], **kwargs[0])
      File "/home/dbliznyuk/.conda/envs/test_crash/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/home/dbliznyuk/test_crash_skript/linformer-pytorch/linformer_pytorch/linformer_pytorch.py", line 364, in forward
        tensor = self.linformer(tensor, **kwargs)
      File "/home/dbliznyuk/.conda/envs/test_crash/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/home/dbliznyuk/test_crash_skript/linformer-pytorch/linformer_pytorch/linformer_pytorch.py", line 321, in forward
        tensor = checkpoint(layer, tensor)
      File "/home/dbliznyuk/.conda/envs/test_crash/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 163, in checkpoint
        return CheckpointFunction.apply(function, preserve, *args)
      File "/home/dbliznyuk/.conda/envs/test_crash/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 74, in forward
        outputs = run_function(*args)
      File "/home/dbliznyuk/.conda/envs/test_crash/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/home/dbliznyuk/test_crash_skript/linformer-pytorch/linformer_pytorch/linformer_pytorch.py", line 61, in forward
        tensor = tensor + self.fn(tensor, **kwargs)
      File "/home/dbliznyuk/.conda/envs/test_crash/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/home/dbliznyuk/test_crash_skript/linformer-pytorch/linformer_pytorch/linformer_pytorch.py", line 235, in forward
        head_outputs.append(checkpoint(head,Q,K,V))
      File "/home/dbliznyuk/.conda/envs/test_crash/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 163, in checkpoint
        return CheckpointFunction.apply(function, preserve, *args)
      File "/home/dbliznyuk/.conda/envs/test_crash/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 74, in forward
        outputs = run_function(*args)
      File "/home/dbliznyuk/.conda/envs/test_crash/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/home/dbliznyuk/test_crash_skript/linformer-pytorch/linformer_pytorch/linformer_pytorch.py", line 162, in forward
        P_bar = Q/torch.sqrt(torch.tensor(self.dim).type(Q.type()))
    RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!
    

    It seems that this error is connected with parameter sharing.
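
    A possible contributing factor, independent of parameter sharing and not verified against this repository: the last traceback frame builds the scale factor with `torch.tensor(self.dim).type(Q.type())`. It appears that casting a CPU tensor to a CUDA type string such as `torch.cuda.FloatTensor` places it on the current CUDA device, which stays `cuda:0` in every spawned process unless `torch.cuda.set_device(rank)` is called, so on rank 1 it would not match `Q` on `cuda:1`. A minimal sketch of a device-agnostic alternative follows; the helper name `scale_queries` is hypothetical and the example runs on CPU only.

```python
import math

import torch


def scale_queries(Q: torch.Tensor, dim: int) -> torch.Tensor:
    """Scale queries by 1/sqrt(dim) without creating a second tensor.

    Hypothetical helper, not part of linformer_pytorch. Dividing by a
    plain Python float keeps the result on Q's device and in Q's dtype,
    so no cross-device operand can appear.
    """
    return Q / math.sqrt(dim)


# CPU-only demonstration; on a GPU the same call would stay on Q's device.
Q = torch.randn(2, 5, 8)
P_bar = scale_queries(Q, dim=8)
```

    Calling `torch.cuda.set_device(rank)` inside `setup` is another commonly used way to keep the current device aligned with the process rank.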

    Code for reproducing

    import os
    import tempfile
    import torch
    import torch.distributed as dist
    import torch.nn as nn
    import torch.optim as optim
    import torch.multiprocessing as mp
    from linformer_pytorch import LinformerLM
    from torch.nn.parallel import DistributedDataParallel as DDP
    
    
    def setup(rank, world_size):
        os.environ['MASTER_ADDR'] = 'localhost'
        os.environ['MASTER_PORT'] = '12355'
    
        # initialize the process group
        dist.init_process_group("gloo", rank=rank, world_size=world_size)
    
    def cleanup():
        dist.destroy_process_group()
        
    def demo_basic(rank, world_size):
        print(f"Running basic DDP example on rank {rank}.")
        setup(rank, world_size)
    
        # create model and move it to GPU with id rank
        
        model = LinformerLM(
                num_tokens=30522,  # Number of tokens in the LM
                input_size=5120,  # Dimension 1 of the input
                channels=128,  # Dimension 2 of the input
                dim_d=None,
                # Overwrites the inner dim of the attention heads. If None, sticks with the recommended channels // nhead, as in the "Attention is all you need" paper
                dim_k=128,  # The second dimension of the P_bar matrix from the paper
                dim_ff=128,  # Dimension in the feed forward network
                dropout_ff=0.15,  # Dropout for feed forward network
                nhead=16,  # Number of attention heads
                depth=12,  # How many times to run the model
                dropout=0.1,  # How much dropout to apply to P_bar after softmax
                activation="gelu",
                # What activation to use. Currently, only gelu and relu supported, and only on ff network.
                checkpoint_level="C2",  # What checkpoint level to use. For more information, see below.
                parameter_sharing="none",  # What level of parameter sharing to use. For more information, see below.
                k_reduce_by_layer=0,
                # Going down `depth`, how much to reduce `dim_k` by, for the `E` and `F` matrices. Will have a minimum value of 1.
                full_attention=False,
                # Use full attention instead, for O(n^2) time and space complexity. Included here just for comparison
                include_ff=True,  # Whether or not to include the Feed Forward layer
                w_o_intermediate_dim=None,
                # If not None, have 2 w_o matrices, such that instead of `dim*nhead,channels`, you have `dim*nhead,w_o_int`, and `w_o_int,channels`
                emb_dim=128,  # If you want the embedding dimension to be different than the channels for the Linformer
            ).to(rank)
        ddp_model = DDP(model, device_ids=[rank])
    
        loss_fn = nn.CrossEntropyLoss()
        optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)
    
        optimizer.zero_grad()
        outputs = ddp_model(torch.randint(20000, (3, 5120)))
        labels = torch.randint(20000, (3, 5120)).to(rank)
        loss_mx = labels != -100
        output = outputs[loss_mx].view(-1, 30522)
        labels = labels[loss_mx].view(-1)
        loss_fn(output, labels).backward()
        optimizer.step()
    
        cleanup()
    
    
    def run_demo(demo_fn, world_size):
        mp.spawn(demo_fn,
                 args=(world_size,),
                 nprocs=world_size,
                 join=True)
        
    if __name__ == "__main__":
        run_demo(demo_basic, 2)
    

    Also, training with DataParallel works normally.

    opened by blizda 2
  • Use -inf as mask value for the causal mask

    The value used for masking is currently set to -1e10. In FP16 or mixed-precision training this leads to numerical issues, because -1e10 is outside the range that FP16 can represent. This can be fixed by using float('-inf') instead, since infinity has its own special representation in IEEE 754.
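
    As a small illustration of the range problem, here is a sketch that uses only Python's standard `struct` module to encode IEEE 754 half-precision values. It is not the library's masking code, and the helper names `fits_in_fp16` and `softmax` are hypothetical. It checks that -1e10 cannot be encoded in FP16 (largest finite magnitude 65504), that negative infinity can, and that a position masked with -inf receives zero weight after softmax.

```python
import math
import struct


def fits_in_fp16(value: float) -> bool:
    """Return True if `value` can be encoded as an IEEE 754 binary16 number."""
    try:
        struct.pack("<e", value)  # "e" is the half-precision format code
        return True
    except OverflowError:
        return False


def softmax(scores):
    """Numerically stable softmax over a list of floats (may contain -inf)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# -1e10 is far beyond the largest finite half-precision magnitude (65504).
mask_value_ok = fits_in_fp16(-1e10)
# Negative infinity has a dedicated bit pattern, so it can be encoded.
neg_inf_ok = fits_in_fp16(float("-inf"))

# Masking the last position with -inf removes it from the distribution.
weights = softmax([1.0, 2.0, float("-inf")])
```

    One caveat with -inf: if every position in a row is masked, the softmax becomes 0/0 and yields NaN, so fully masked rows need separate handling.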

    opened by kklemon 2
  • Enquiry about your implementation

    Thanks for your great work!

    I have a few enquiries about your implementations:

    1. Could you reproduce the paper results (or approximately similar) with your implementation?
    2. An ordinary transformer requires multiple GPUs to train from scratch. With your implementation of Linformer, is it possible to train from scratch on a single GPU (8GB/11GB)?

    Thanks, Alex Lau

    opened by riven314 2
  • Possible bug

    I may have discovered a possible bug. When I run python pretrain_tutorial.py, I get some vertical lines when running the visualizer on the random data (Try it yourself, and run the trained model on a new random array).

    In the extreme case, all of the queries attend to the same key. As a result, for an input of size (batch_size, input_len, ch), every vector of length ch along the input_len axis has the same value. As a concrete example, imagine a 32x32x3 picture (RGB) fed into the model with a batch size of 1. If the input is (1,32*32,3), every channel of the image will have the same value on every pixel. For example, the R channel will have the value 128 on each pixel, the G channel might have 43, and the B channel might have 212. If the input is (1,3,32*32), each of the R, G, B channels will look like an image, but they will all have exactly the same pixel values.
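
    The collapse described above can be reproduced with a toy example. This is a sketch in plain Python, not the library's MHAttention code; the attention matrix is written by hand to put all weight on one key, and `matmul` is a hypothetical helper.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Toy "image": 4 pixels (sequence positions) with 3 channels (R, G, B) each.
values = [
    [128.0, 43.0, 212.0],
    [10.0, 20.0, 30.0],
    [0.0, 255.0, 7.0],
    [99.0, 1.0, 50.0],
]

# Degenerate attention: every query puts all of its weight on key 0.
attn = [[1.0, 0.0, 0.0, 0.0] for _ in values]

# Each output row is a weighted sum of value rows, so every row
# collapses to the single value vector that all queries attend to.
out = matmul(attn, values)
```

    Every row of `out` equals the first row of `values`, which matches the symptom of each channel being constant across all pixels.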

    The problem is being investigated. I believe it is a problem with the MHAttention layer, or possibly the positional embedding, but I cannot say for sure. Will update when a fix is found.

    bug 
    opened by tatp22 2
  • Question: Is Linformer permutation equivariant (set-operation)?

    Hi. Thanks for the wonderful implementation!

    I was wondering if linformer can be used with any unordered set of tensors (or is it just sequence data?). Specifically, is linformer permutation equivariant?

    I'm looking to apply linear attention on points in 3d space (e.g. a point cloud with ~100k points). Would linformer attention be meaningful?

    (I'm concerned about the n -> k projection, which assumes the n points are in some fixed order, if I understand correctly.)

    Thanks!
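
    The concern about the n -> k projection can be checked numerically under simplified assumptions: a single head, Q = K = V = the input, no positional embedding, and a fixed random projection standing in for Linformer's E and F matrices. This is a plain-Python sketch, not the library's code, and all function names are hypothetical. It compares "permute, then attend" with "attend, then permute" for full attention and for projected attention.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention(x, proj=None):
    """Single-head attention with Q = K = V = x (no learned weights).

    If `proj` (shape k x n) is given, keys and values are first projected
    along the sequence axis, in the spirit of Linformer's E and F matrices.
    """
    keys = values = x if proj is None else matmul(proj, x)
    scale = math.sqrt(len(x[0]))
    scores = [[v / scale for v in row] for row in matmul(x, transpose(keys))]
    return matmul(softmax_rows(scores), values)


def max_abs_diff(a, b):
    return max(abs(u - v) for ra, rb in zip(a, b) for u, v in zip(ra, rb))


rng = random.Random(0)
n, d, k = 6, 3, 2
x = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
proj = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(k)]

# A fixed cyclic shift of the n points.
perm = list(range(1, n)) + [0]
x_perm = [x[i] for i in perm]

# Full attention: permuting the input rows permutes the output rows.
full_out = attention(x)
full_gap = max_abs_diff(attention(x_perm), [full_out[i] for i in perm])

# Fixed n -> k projection: the same check generally fails, because the
# projection weights are tied to positions along the sequence axis.
lin_out = attention(x, proj)
lin_gap = max_abs_diff(attention(x_perm, proj), [lin_out[i] for i in perm])
```

    In this sketch `full_gap` is at the level of floating-point rounding while `lin_gap` is not, which suggests that with a fixed projection the result depends on the order in which the points are presented.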

    opened by nmakes 5
Releases(0.19.3)