LSTM and QRNN Language Model Toolkit for PyTorch

Overview

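This repository contains the code used for two Salesforce Research papers:

  • Regularizing and Optimizing LSTM Language Models
  • An Analysis of Neural Language Modeling at Multiple Scales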

The model comes with instructions to train:

  • word level language models over the Penn Treebank (PTB), WikiText-2 (WT2), and WikiText-103 (WT103) datasets

  • character level language models over the Penn Treebank (PTBC) and Hutter Prize dataset (enwik8)

The model can be composed of an LSTM or a Quasi-Recurrent Neural Network (QRNN) which is two or more times faster than the cuDNN LSTM in this setup while achieving equivalent or better accuracy.

  • Install PyTorch 0.4
  • Run getdata.sh to acquire the Penn Treebank and WikiText-2 datasets
  • Train the base model using main.py
  • (Optionally) Finetune the model using finetune.py
  • (Optionally) Apply the continuous cache pointer to the finetuned model using pointer.py

If you use this code or our results in your research, please cite as appropriate:

@article{merityRegOpt,
  title={{Regularizing and Optimizing LSTM Language Models}},
  author={Merity, Stephen and Keskar, Nitish Shirish and Socher, Richard},
  journal={arXiv preprint arXiv:1708.02182},
  year={2017}
}
@article{merityAnalysis,
  title={{An Analysis of Neural Language Modeling at Multiple Scales}},
  author={Merity, Stephen and Keskar, Nitish Shirish and Socher, Richard},
  journal={arXiv preprint arXiv:1803.08240},
  year={2018}
}

Update (June/13/2018)

The codebase is now PyTorch 0.4 compatible for most use cases (a big shoutout to https://github.com/shawntan for a fairly comprehensive PR https://github.com/salesforce/awd-lstm-lm/pull/43). Mild readjustments to hyperparameters may be necessary to obtain quoted performance. If you desire exact reproducibility (or wish to run on PyTorch 0.3 or lower), we suggest using an older commit of this repository. We are still working on pointer, finetune and generate functionalities.

Software Requirements

Python 3 and PyTorch 0.4 are required for the current codebase.

Included below are hyper parameters that give results equivalent to or better than those included in the original paper.

If you need to use an earlier version of the codebase, the original code and hyper parameters are accessible at the PyTorch==0.1.12 release, which requires Python 3 and PyTorch 0.1.12. If you are using Anaconda, PyTorch 0.1.12 can be installed via: conda install pytorch=0.1.12 -c soumith.

Experiments

The codebase was modified during the writing of the paper, which prevents exact reproduction because of minor differences in random seeds or similar. We have also seen the numbers change when the underlying GPU changes. The guide below produces results largely similar to the numbers reported.

For data setup, run ./getdata.sh. This script collects the Mikolov pre-processed Penn Treebank and the WikiText-2 datasets and places them in the data directory.
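Before training, it can help to confirm that the data directory is populated. The sketch below is only an illustration: the helper name missing_corpus_files is hypothetical, and the assumption that each corpus directory holds train.txt, valid.txt and test.txt should be checked against your copy of the data.

```python
import os

# Assumed file names; adjust them to match your copy of the data.
EXPECTED_FILES = ("train.txt", "valid.txt", "test.txt")


def missing_corpus_files(data_dir, expected=EXPECTED_FILES):
    """Return the expected corpus files that are absent from data_dir."""
    return [name for name in expected
            if not os.path.isfile(os.path.join(data_dir, name))]


if __name__ == "__main__":
    for corpus in ("data/penn", "data/wikitext-2"):
        missing = missing_corpus_files(corpus)
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(corpus, status)
```

An empty list means the directory looks complete under these assumptions; anything else suggests re-running ./getdata.sh or fixing the --data path.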

Next, decide whether to use the QRNN or the LSTM as the underlying recurrent neural network model. The QRNN is many times faster than even Nvidia's cuDNN optimized LSTM (and dozens of times faster than a naive LSTM implementation) yet achieves similar or better results than the LSTM for many word level datasets. At the time of writing, the QRNN models use the same number of parameters and are slightly deeper networks but are two to four times faster per epoch and require fewer epochs to converge.

The QRNN model uses a QRNN with convolutional size 2 for the first layer, allowing the model to view adjacent discrete natural language inputs together (e.g. "New York"), while all other layers use a convolutional size of 1.
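To make the window size concrete: with a convolutional size of 2, each timestep sees its own input concatenated with the previous one, and the first timestep is paired with zeros. The following is a minimal pure-Python sketch of that input construction only, not of the QRNN itself; the function name windowed_inputs is hypothetical.

```python
def windowed_inputs(seq, window):
    """Pair each timestep with its predecessors, zero-padding the start.

    seq is a list of equal-length feature vectors (lists of floats).
    window=1 returns copies of the vectors; window=2 prepends x[t-1] to x[t].
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    if not seq:
        return []
    zeros = [0.0] * len(seq[0])
    out = []
    for t in range(len(seq)):
        parts = []
        # Oldest input first, current input last.
        for offset in range(window - 1, -1, -1):
            parts.extend(seq[t - offset] if t - offset >= 0 else zeros)
        out.append(parts)
    return out


new, york = [1.0, 0.0], [0.0, 1.0]
print(windowed_inputs([new, york], window=2))
# [[0.0, 0.0, 1.0, 0.0], [1.0, 0.0, 0.0, 1.0]]
```

In the second output row the features for "New" and "York" sit side by side, which is what lets the first layer treat the pair as one unit.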

Finetuning Note: Fine-tuning overwrites the original saved model.pt file. If you wish to keep the original weights, copy the file before fine-tuning.
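One way to keep the original weights is to copy the checkpoint before running finetune.py. This is a small sketch using only the standard library; the helper name backup_checkpoint and the ".orig" suffix are arbitrary choices for illustration.

```python
import shutil
from pathlib import Path


def backup_checkpoint(path, suffix=".orig"):
    """Copy a saved model file so finetuning cannot clobber the original."""
    src = Path(path)
    dst = src.with_name(src.name + suffix)
    if dst.exists():
        # Refuse to overwrite an earlier backup by accident.
        raise FileExistsError(f"{dst} already exists; refusing to overwrite")
    shutil.copy2(src, dst)
    return dst

# Example: backup_checkpoint("PTB.pt") creates PTB.pt.orig next to PTB.pt.
```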

Pointer note: BPTT just changes the length of the sequence pushed onto the GPU but won't impact the final result.

Character level enwik8 with LSTM

  • python -u main.py --epochs 50 --nlayers 3 --emsize 400 --nhid 1840 --alpha 0 --beta 0 --dropoute 0 --dropouth 0.1 --dropouti 0.1 --dropout 0.4 --wdrop 0.2 --wdecay 1.2e-6 --bptt 200 --batch_size 128 --optimizer adam --lr 1e-3 --data data/enwik8 --save ENWIK8.pt --when 25 35

Character level Penn Treebank (PTB) with LSTM

  • python -u main.py --epochs 500 --nlayers 3 --emsize 200 --nhid 1000 --alpha 0 --beta 0 --dropoute 0 --dropouth 0.25 --dropouti 0.1 --dropout 0.1 --wdrop 0.5 --wdecay 1.2e-6 --bptt 150 --batch_size 128 --optimizer adam --lr 2e-3 --data data/pennchar --save PTBC.pt --when 300 400

Word level WikiText-103 (WT103) with QRNN

  • python -u main.py --epochs 14 --nlayers 4 --emsize 400 --nhid 2500 --alpha 0 --beta 0 --dropoute 0 --dropouth 0.1 --dropouti 0.1 --dropout 0.1 --wdrop 0 --wdecay 0 --bptt 140 --batch_size 60 --optimizer adam --lr 1e-3 --data data/wikitext-103 --save WT103.12hr.QRNN.pt --when 12 --model QRNN

Word level Penn Treebank (PTB) with LSTM

The instruction below trains a PTB model that without finetuning achieves perplexities of approximately 61.2 / 58.8 (validation / testing), with finetuning achieves perplexities of approximately 58.8 / 56.5, and with the continuous cache pointer augmentation achieves perplexities of approximately 53.2 / 52.5.

  • python main.py --batch_size 20 --data data/penn --dropouti 0.4 --dropouth 0.25 --seed 141 --epoch 500 --save PTB.pt
  • python finetune.py --batch_size 20 --data data/penn --dropouti 0.4 --dropouth 0.25 --seed 141 --epoch 500 --save PTB.pt
  • python pointer.py --data data/penn --save PTB.pt --lambdasm 0.1 --theta 1.0 --window 500 --bptt 5000
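For readers comparing these numbers with training logs: perplexity is the exponential of the average per-token cross-entropy loss measured in nats, and the bits-per-character figure reported for character level runs is the same loss divided by ln 2. A minimal sketch follows, with hypothetical helper names.

```python
import math


def perplexity(loss_nats):
    """Perplexity for an average cross-entropy loss given in nats."""
    return math.exp(loss_nats)


def bits_per_character(loss_nats):
    """Convert an average per-character loss from nats to bits."""
    return loss_nats / math.log(2)


# A perplexity of 58.8 corresponds to an average loss of about 4.07 nats.
loss = math.log(58.8)
print(round(loss, 2), round(perplexity(loss), 1))
```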

Word level Penn Treebank (PTB) with QRNN

The instruction below trains a QRNN model that without finetuning achieves perplexities of approximately 60.6 / 58.3 (validation / testing), with finetuning achieves perplexities of approximately 59.1 / 56.7, and with the continuous cache pointer augmentation achieves perplexities of approximately 53.4 / 52.6.

  • python -u main.py --model QRNN --batch_size 20 --clip 0.2 --wdrop 0.1 --nhid 1550 --nlayers 4 --emsize 400 --dropouth 0.3 --seed 9001 --dropouti 0.4 --epochs 550 --save PTB.pt
  • python -u finetune.py --model QRNN --batch_size 20 --clip 0.2 --wdrop 0.1 --nhid 1550 --nlayers 4 --emsize 400 --dropouth 0.3 --seed 404 --dropouti 0.4 --epochs 300 --save PTB.pt
  • python pointer.py --model QRNN --lambdasm 0.1 --theta 1.0 --window 500 --bptt 5000 --save PTB.pt

Word level WikiText-2 (WT2) with LSTM

The instruction below trains an LSTM model that without finetuning achieves perplexities of approximately 68.7 / 65.6 (validation / testing), with finetuning achieves perplexities of approximately 67.4 / 64.7, and with the continuous cache pointer augmentation achieves perplexities of approximately 52.2 / 50.6.

  • python main.py --epochs 750 --data data/wikitext-2 --save WT2.pt --dropouth 0.2 --seed 1882
  • python finetune.py --epochs 750 --data data/wikitext-2 --save WT2.pt --dropouth 0.2 --seed 1882
  • python pointer.py --save WT2.pt --lambdasm 0.1279 --theta 0.662 --window 3785 --bptt 2000 --data data/wikitext-2

Word level WikiText-2 (WT2) with QRNN

The instruction below trains a QRNN model that without finetuning achieves perplexities of approximately 69.3 / 66.8 (validation / testing), with finetuning achieves perplexities of approximately 68.5 / 65.9, and with the continuous cache pointer augmentation achieves perplexities of approximately 53.6 / 52.1. Better numbers are likely achievable, as the hyper parameters have not been extensively searched, but these should serve as a good starting point.

  • python -u main.py --epochs 500 --data data/wikitext-2 --clip 0.25 --dropouti 0.4 --dropouth 0.2 --nhid 1550 --nlayers 4 --seed 4002 --model QRNN --wdrop 0.1 --batch_size 40 --save WT2.pt
  • python finetune.py --epochs 500 --data data/wikitext-2 --clip 0.25 --dropouti 0.4 --dropouth 0.2 --nhid 1550 --nlayers 4 --seed 4002 --model QRNN --wdrop 0.1 --batch_size 40 --save WT2.pt
  • python -u pointer.py --save WT2.pt --model QRNN --lambdasm 0.1279 --theta 0.662 --window 3785 --bptt 2000 --data data/wikitext-2

Speed

For speed regarding character-level PTB and enwik8 or word-level WikiText-103, refer to the relevant paper.

The default speeds for the models during training on an NVIDIA Quadro GP100:

  • Penn Treebank (batch size 20): LSTM takes 65 seconds per epoch, QRNN takes 28 seconds per epoch
  • WikiText-2 (batch size 20): LSTM takes 180 seconds per epoch, QRNN takes 90 seconds per epoch

The default QRNN models can be far faster than the cuDNN LSTM model, with the speed-ups depending on how much of a bottleneck the RNN is. The majority of the model time above is now spent in softmax or optimization overhead (see PyTorch QRNN discussion on speed).
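As a worked example, the per-epoch speed-ups implied by the timings above can be computed directly; the dictionary below simply restates the reported numbers.

```python
# Seconds per epoch reported above for an NVIDIA Quadro GP100.
EPOCH_SECONDS = {
    "Penn Treebank": {"LSTM": 65, "QRNN": 28},
    "WikiText-2": {"LSTM": 180, "QRNN": 90},
}


def speedup(times):
    """Ratio of LSTM epoch time to QRNN epoch time."""
    return times["LSTM"] / times["QRNN"]


for name, times in EPOCH_SECONDS.items():
    print(f"{name}: QRNN is {speedup(times):.2f}x faster per epoch")
# Penn Treebank: QRNN is 2.32x faster per epoch
# WikiText-2: QRNN is 2.00x faster per epoch
```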

Speeds are approximately three times slower on a K80. On a K80 or other cards with less memory you may wish to enable the cap on the maximum sampled sequence length to prevent out-of-memory (OOM) errors, especially for WikiText-2.
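The idea behind such a cap can be sketched as follows. The sampling rule mirrors the two main.py lines quoted in the comments further down (the mean is halved 5% of the time, then a normal sample with standard deviation 5 is drawn); the function name sample_seq_len, the max_extra parameter and the value 10 are assumptions for illustration, not the repository's exact option.

```python
import random


def sample_seq_len(bptt, rng, max_extra=None):
    """Draw a training sequence length around bptt.

    95% of the time the mean is bptt, otherwise bptt / 2. The length is then
    drawn from a normal distribution (std 5) and floored at 5. If max_extra
    is given, the length is capped at bptt + max_extra to bound memory use.
    """
    mean = bptt if rng.random() < 0.95 else bptt / 2.0
    seq_len = max(5, int(rng.gauss(mean, 5)))
    if max_extra is not None:
        seq_len = min(seq_len, bptt + max_extra)
    return seq_len


rng = random.Random(0)
lengths = [sample_seq_len(70, rng, max_extra=10) for _ in range(1000)]
print(min(lengths), max(lengths))  # every value lies within [5, 80]
```

Without the cap, a rare large sample from the normal distribution produces an unusually long batch, which is the case most likely to exhaust GPU memory.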

If speed is a major issue, SGD converges more quickly than our non-monotonically triggered variant of ASGD, though it achieves a worse overall perplexity.
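The non-monotonic trigger can be sketched roughly as follows: training continues with plain SGD until the validation loss fails to beat the best value recorded before the most recent n epochs, at which point the optimizer switches to averaged SGD. This is a minimal sketch under assumptions; the function name is hypothetical, the default n = 5 is taken from the nonmono value visible in the Args dump further down, and the exact condition in main.py may differ.

```python
def should_switch_to_asgd(history, current, nonmono=5):
    """Decide whether to switch from SGD to averaged SGD.

    history holds the validation losses of earlier epochs, oldest first.
    Returns True when `current` is worse than the best loss among the epochs
    that are older than the most recent `nonmono` entries of history.
    """
    if len(history) <= nonmono:
        return False  # not enough history to judge a plateau yet
    return current > min(history[:-nonmono])


plateau = [4.5, 4.3, 4.31, 4.32, 4.33, 4.34, 4.35]
print(should_switch_to_asgd(plateau, 4.36))  # True: no gain over 4.3
print(should_switch_to_asgd(plateau, 4.20))  # False: still improving
```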

Details of the QRNN optimization

For full details, refer to the PyTorch QRNN repository.

Details of the LSTM optimization

All the augmentations to the LSTM, including our variant of DropConnect (Wan et al. 2013) termed weight dropping which adds recurrent dropout, allow for the use of NVIDIA's cuDNN LSTM implementation. PyTorch will automatically use the cuDNN backend if run on CUDA with cuDNN installed. This ensures the model is fast to train even when convergence may take many hundreds of epochs.
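Weight dropping applies dropout to the hidden-to-hidden weight matrix itself rather than to the activations. The sketch below illustrates the idea in pure Python on a nested list. It assumes inverted-dropout scaling (kept weights are divided by 1 - p), the name drop_weights is hypothetical, and it is not the repository's WeightDrop module.

```python
import random


def drop_weights(weights, p, rng, training=True):
    """Zero each weight independently with probability p and rescale the rest.

    A fresh mask is drawn on every call. The input matrix is left untouched,
    so the raw weights remain available for the next forward pass.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return [row[:] for row in weights]
    scale = 1.0 / (1.0 - p)
    return [[w * scale if rng.random() >= p else 0.0 for w in row]
            for row in weights]


raw = [[0.5, -0.2], [0.1, 0.3]]
dropped = drop_weights(raw, p=0.5, rng=random.Random(0))
print(dropped)
```

Because the mask acts on the weights and not inside the recurrent step, the masked matrix can be handed to an unmodified, optimized LSTM kernel.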

Comments
  • KeyError: 'ax' in line `prm.data = optimizer.state[prm]['ax'].clone()`

    Where does the key 'ax' come from in this line: https://github.com/salesforce/awd-lstm-lm/blob/master/main.py#L245?

    if 't0' in optimizer.param_groups[0]:
        tmp = {}
        for prm in model.parameters():
            tmp[prm] = prm.data.clone()
            prm.data = optimizer.state[prm]['ax'].clone()
    

    I have made small changes to the code, and since then I get a KeyError in the very next iteration whenever the code switches to ASGD. Where is this key 'ax' introduced?

    opened by wasiahmad 20
  • `ValueError: result of slicing is an empty tensor` when trying to run generate.py on QRNN

    I've trained a QRNN, but when I try to use generate.py with it, I get the following:

      File "generate.py", line 68, in <module>
        output, hidden = model(input, hidden)
      File "/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/ubuntu/awd-lstm-lm/model.py", line 82, in forward
        raw_output, new_h = rnn(raw_output, hidden[l])
      File "/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
        result = self.forward(*input, **kwargs)
      File "/miniconda3/lib/python3.6/site-packages/torchqrnn/qrnn.py", line 60, in forward
        Xm1 = [self.prevX if self.prevX is not None else X[:1, :, :] * 0, X[:-1, :, :]]
      File "/miniconda3/lib/python3.6/site-packages/torch/autograd/variable.py", line 76, in __getitem__
        return Index.apply(self, key)
      File "/miniconda3/lib/python3.6/site-packages/torch/autograd/_functions/tensor.py", line 16, in forward
        result = i.index(ctx.index)
    ValueError: result of slicing is an empty tensor
    
    opened by mhart 11
  • Issues with SplitCrossEntropyLoss

    1. Parameters in SplitCrossEntropyLoss are not being updated since they are missing from the optimizer.
    2. Parameters in SplitCrossEntropyLoss are being initialized to 0. EDIT: After reflection, I'm not sure this matters.

    Locally, I fixed these issues and ran some very short experiments on 140 batches on wikitext-103 as follows:

    • master.txt are results for code at commit 1e24cc54 (current master)
    • optimizerfix.txt are the results with the issue 1 fixed
    • initializationfix.txt are the results with issue 1 fixed as well as the parameters being initialized the same as the embeddings.
    opened by ccarter-cs 8
  • Confused regarding motivation of randomized BPTT

    Why does this exist?

    bptt = args.bptt if np.random.random() < 0.95 else args.bptt / 2.

    Yall already have...

    seq_len = max(5, int(np.random.normal(bptt, 5)))

    opened by PetrochukM 7
  • Model crashes under pytorch 0.4

    Hi, The folks over at pytorch are working on cutting a new 0.4 release. We'd like to make the transition as smooth as possible (if you were planning on upgrading), so we've been testing a number of community repos.

    I ran a model and it errors out due to a change in pytorch. Minimal repro:

    # Install pytorch-nightly (Currently our pre-release branch)
    conda install pytorch-nightly -c pytorch
    
    # Get data
    ./getdata.sh
    
    # Run model
    python main.py --batch_size 20 --data data/penn --dropouti 0.4 --dropouth 0.25 --seed 141 --epoch 1 && \
    python -u main.py --model QRNN --batch_size 20 --clip 0.2 --wdrop 0.1 --nhid 1550 --nlayers 4 --emsize 400 --dropouth 0.3 --seed 9001 --dropouti 0.4 --epochs 1
    

    Stack trace: https://gist.github.com/zou3519/142d48df1c03db9fe9c11717ad9a59f2

    Pytorch 0.4 adds zero-dimensional tensors that cannot be iterated over, which seems to be what the error is complaining about. Changing https://github.com/salesforce/awd-lstm-lm/blob/f2e88672bdaf93ee709390fda2a24abb6db77989/utils.py#L8 in particular to handle this case should fix it.

    cc @soumith

    opened by zou3519 5
  • Multiple GPU option

    I extended the code to multiple GPU training, but the GPU usage is extremely imbalanced. The root cause is that we collect all outputs back and calculate loss on one GPU. I tried to put loss calculation inside model.forward() as following:

    class RNNModel(nn.Module):
        def __init__(...):
            super(RNNModel, self).__init__()
            from splitcross import SplitCrossEntropyLoss
            splits = [2800, 20000, 76000]
            self.criterion = SplitCrossEntropyLoss(ninp, splits=splits, verbose=False)
            ... ...

        def forward(...):
            ... ...
            result = output
            # calculate loss
            result = result.view(result.size(0)*result.size(1), -1)
            raw_loss = self.criterion(decoder_weight, decoder_bias, result, target)
            loss = raw_loss
            # activation regularization
            if args.alpha: loss = loss + sum(args.alpha * dropped_rnn_h.pow(2).mean() for dropped_rnn_h in outputs[-1:])
            # Temporal Activation Regularization (slowness)
            if args.beta: loss = loss + sum(args.beta * (rnn_h[1:] - rnn_h[:-1]).pow(2).mean() for rnn_h in raw_outputs[-1:])
            # expand loss to two dimensional space so it can be gathered via the second dimension
            loss = loss.unsqueeze(1)
            raw_loss = raw_loss.unsqueeze(1)
            if return_h:
                return raw_loss, loss, hidden, raw_outputs, outputs
            return raw_loss, loss, hidden

    Then, in my main.py, I collect the loss and use loss.mean().backward() to update the parameters. Interestingly, the first call to loss.mean().backward() succeeds, but the second one fails with this error:

    RuntimeError: invalid argument 3: Index tensor must have same dimensions as input tensor at /pytorch/torch/lib/THC/generic/THCTensorScatterGather.cu:199

    Can anyone help? Thanks in advance!

    opened by songyuzhou324 4
  • GPU memory and cap

    Hi, training crashed with an out-of-memory error on a Titan X (12GB) with the char-LSTM on enwik8.

    The trick about reducing the "cap" on sequence length links to a 404 URL. Could you please let me know where I can do that?

    Thanks a lot for the great code !

    opened by cerisara 4
  • Hidden state init of LSTM layers (after the first)

    This code 'breaks' a multi-layer LSTM to l different 1-layer LSTMs. So after the forward pass of the first LSTM, the hidden state of the next LSTM should be initialised with the previous hidden state. To this end, shouldn't the following line of the code be inside the for loop??
    https://github.com/salesforce/awd-lstm-lm/blob/32fcb42562aeb5c7e6c9dec3f2a3baaaf68a5cb5/model.py#L88

    In my understanding, every LSTM is trained with an initial hidden state of zeros, which is ok for the first layer, but isn't it wrong that hidden states do not 'propagate' to the next layers?

    Thanks in advance.

    opened by mourga 3
  • Weight drop code masking the same "raw" weight?

    Hey,

    I was inspecting the weight drop (variant of dropconnect) code and I found it a bit confusing (https://github.com/salesforce/awd-lstm-lm/blob/master/weight_drop.py#L34):

    for name_w in self.weights:
          raw_w = getattr(self.module, name_w + '_raw')
          w = None
          if self.variational:
              mask = torch.autograd.Variable(torch.ones(raw_w.size(0), 1))
              if raw_w.is_cuda: mask = mask.cuda()
              mask = torch.nn.functional.dropout(mask, p=self.dropout, training=True)
              w = mask.expand_as(raw_w) * raw_w
          else:
              w = torch.nn.functional.dropout(raw_w, p=self.dropout, training=self.training)
          setattr(self.module, name_w, w)
    

    In every iteration the raw_w you get from name_w + '_raw' is the same, isn't it? Because you only setattr to name_w (e.g. weight_hh_l0) at the end. So every time the dropout mask operates on the same raw weight matrix...

    Or maybe I just overlooked something. Can someone help me understand this?

    Thanks!

    opened by jerrybai1995 3
  • why is hidden layer initialized only on beginning of an epoch?

    If someone can help me with this I'd be thankful, I couldn't understand this part. Why are we keeping the hidden layer for whole train dataset until epoch finishes? Isn't that bad for testing and predicting later on? Thanks..

    opened by realiti4 2
  • Input batch size doesn't match hidden batch size

    Hi. I am using an LSTMCell. However, I get this error: Input batch size 1 doesn't match hidden[0] batch size 512

    This is my testing code:

    lstm = nn.LSTMCell(256,512)
    x = torch.randn(1,256)
    h = torch.randn(1,512)
    print('Testing WeightDrop with LSTM')
    wdrnn = WeightDrop(lstm, ['weight_hh'], dropout=0.5)
    wdrnn.cuda()
    run1 = [x.sum() for x in wdrnn(x, h)[0].data]
    

    The input and hidden both have a batch_size of 1. Why does the LSTM work but LSTM Cell doesn't work?

    opened by homelifes 2
  • There might be a bug in splitcross.py

    The following test case failed with PyTorch 1.6.

    if __name__ == '__main__':
        np.random.seed(42)
        torch.manual_seed(42)
        if torch.cuda.is_available():
            torch.cuda.manual_seed(42)
    
        V = 300
        H = 400
        N = 500
        E = 10
    
        embed = torch.nn.Embedding(V, H).cuda()
        crit = SplitCrossEntropyLoss(hidden_size=H, splits=[100, 200]).cuda()
        bias = torch.nn.Parameter(torch.ones(V)).cuda()
        optimizer = torch.optim.SGD(list(embed.parameters()) + list(crit.parameters()), lr=1)
    
        for _ in range(E):
            prev = torch.rand(N, 1) * 0.999 * V
            prev = prev.int().long().cuda()
            x = (torch.rand(N, 1) * 0.999 * V).int().long().cuda()
            y = embed(prev).squeeze()
            c = crit(embed.weight, bias, y, x.view(N))
            print('Crit', c.exp().item())
    
            logprobs = crit.logprob(embed.weight, bias, y[:2]).exp()
            print(logprobs)
            print(logprobs.sum(dim=1))
    
            optimizer.zero_grad()
            c.backward()
            optimizer.step()
    

    Here are the error messages:

    ......
    /pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:86: operator(): block: [0,0,0], thread: [42,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
    /pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:86: operator(): block: [0,0,0], thread: [43,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
    /pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:86: operator(): block: [0,0,0], thread: [44,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
    /pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:86: operator(): block: [0,0,0], thread: [46,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
    /pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:86: operator(): block: [0,0,0], thread: [48,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
    /pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:86: operator(): block: [0,0,0], thread: [49,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
    /pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:86: operator(): block: [0,0,0], thread: [51,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
    /pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:86: operator(): block: [0,0,0], thread: [54,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
    /pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:86: operator(): block: [0,0,0], thread: [55,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
    /pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:86: operator(): block: [0,0,0], thread: [58,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
    /pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:86: operator(): block: [0,0,0], thread: [60,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
    /pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:86: operator(): block: [0,0,0], thread: [61,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
    /pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:86: operator(): block: [0,0,0], thread: [63,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
    Traceback (most recent call last):
      File "splitcross.py", line 195, in <module>
        print('Crit', c.exp().item())
    RuntimeError: CUDA error: device-side assert triggered
    
    opened by cswhjiang 0
  • A UserWarning about calling flatten_parameters()

    Hi,

    When running !python -u main.py --epochs 500 --nlayers 3 --emsize 200 --nhid 1000 --alpha 0 --beta 0 --dropoute 0 --dropouth 0.25 --dropouti 0.1 --dropout 0.1 --wdrop 0.5 --wdecay 1.2e-6 --bptt 150 --batch_size 128 --optimizer adam --lr 2e-3 --data data/pennchar --save PTBC.pt --when 300 400 I get the following warnings:

    -----------------------------------------------------------------------------------------
     | end of epoch  28 | time: 317.33s | valid loss  1.01 | valid ppl     2.75 | valid bpc    1.462
     -----------------------------------------------------------------------------------------
    Saving model (new best validation)
    /pytorch/aten/src/ATen/native/cudnn/RNN.cpp:1269: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().
    /pytorch/aten/src/ATen/native/cudnn/RNN.cpp:1269: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().
    .
    .
    .
    -----------------------------------------------------------------------------------------
    | end of epoch  29 | time: 3123.33s | valid loss  1.00 | valid ppl     2.75 | valid bpc    1.462
    -----------------------------------------------------------------------------------------
    Saving model (new best validation)
    /pytorch/aten/src/ATen/native/cudnn/RNN.cpp:1269: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().
    /pytorch/aten/src/ATen/native/cudnn/RNN.cpp:1269: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().
    
    

    It seems to work, but the output is filled with UserWarnings. I am running it in Google Colaboratory with PyTorch 1.5. How can I fix this? Is PyTorch 0.4 required?

    Thank you!

    opened by wxhiff 0
  • Dropconnect layer implementation.

    I am trying to code dropconnect for Conv2D and transposeconv2D layer by following the implementation for dropconnect lstm and dropconnect gru in this repo. Below is my implementation of the code.

    
    def _weight_drop(module, weights, dropout):
        for name_w in weights:
            w = getattr(module, name_w)
            del module._parameters[name_w]
            module.register_parameter(name_w + '_raw', Parameter(w))
        original_module_forward = module.forward
    
        def forward(*args, **kwargs):
            for name_w in weights:
                raw_w = getattr(module, name_w + '_raw')
                w = torch.nn.functional.dropout(raw_w, p=dropout, training=module.training)
                setattr(module, name_w, w)
            return original_module_forward(*args, **kwargs)
        setattr(module, 'forward', forward)
            
    class WeightDropConv2d(torch.nn.Conv2d):
        def __init__(self, *args, weight_dropout=0.0, **kwargs):
            super().__init__(*args, **kwargs)
            weights = ['weight']
            _weight_drop(self, weights, weight_dropout)
            
    class WeightDropConvTranspose2d(torch.nn.ConvTranspose2d):
        def __init__(self, *args, weight_dropout=0.0, **kwargs):
            super().__init__(*args, **kwargs)
            weights = ['weight']
            _weight_drop(self, weights, weight_dropout)
    

    torch.__version__: 1.1.0, torch.version.cuda: 9.0.176

    I get the following error in the 2nd epoch:

    Traceback (most recent call last):
      File "dropconnect.py", line 110, in <module>
        out = model(image)
      File "/home/sbhand2s/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
        result = self.forward(*input, **kwargs)
      File "dropconnect.py", line 73, in forward
        out = self.c1(x)
      File "/home/sbhand2s/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
        result = self.forward(*input, **kwargs)
      File "dropconnect.py", line 34, in forward
        setattr(module, name_w, w)
      File "/home/sbhand2s/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 558, in __setattr__
        .format(torch.typename(value), name))
    TypeError: cannot assign 'torch.cuda.FloatTensor' as parameter 'weight' (torch.nn.Parameter or None expected)
    

    This error occurs in the second epoch when I switch from .eval() to .train(). This error does not occur if I don't call .eval()

    Any suggestions on why this error is occurring or how to implement dropconnect in a better manner?

    opened by swaroop1904 0
  • AssertionError

    python3 -u main.py --epochs 50 --nlayers 3 --emsize 400 --nhid 1840 --alpha 0 --beta 0 --dropoute 0 --dropouth 0.1 --dropouti 0.1 --dropout 0.4 --wdrop 0.2 --wdecay 1.2e-6 --bptt 200 --batch_size 128 --optimizer adam --lr 1e-3 --data data/enwik8 --save ENWIK8.pt --when 25 35
    Producing dataset...
    Traceback (most recent call last):
      File "main.py", line 100, in <module>
        corpus = data.Corpus(args.data)
      File "/home/swee/Desktop/awd-lstm-lm-master/data.py", line 30, in __init__
        self.train = self.tokenize(os.path.join(path, 'train.txt'))
      File "/home/swee/Desktop/awd-lstm-lm-master/data.py", line 36, in tokenize
        assert os.path.exists(path)
    AssertionError

    opened by HCideal 1
  • AttributeError: 'Program' object has no attribute '_program'

    I'm trying to reproduce the QRNN results from this project and launched the command from the README example:

    (.venv3gpu) [email protected]:~/Cloud/spell_corr/awd-lstm-lm$ python -u main.py --epochs 14 --nlayers 4 --emsize 400 --nhid 2500 --alpha 0 --beta 0 --dropoute 0 --dropouth 0.1 --dropouti 0.1 --dropout 0.1 --wdrop 0 --wdecay 0 --bptt 140 --batch_size 60 --optimizer adam --lr 1e-3 --data data/wikitext-103 --save WT103.12hr.QRNN.pt --when 12 --model QRNN
    Loading cached dataset...
    Applying weight drop of 0.0 to weight
    Applying weight drop of 0.0 to weight
    Applying weight drop of 0.0 to weight
    Applying weight drop of 0.0 to weight
    [QRNNLayer(
      (linear): WeightDrop(
        (module): Linear(in_features=800, out_features=7500, bias=True)
      )
    ), QRNNLayer(
      (linear): WeightDrop(
        (module): Linear(in_features=2500, out_features=7500, bias=True)
      )
    ), QRNNLayer(
      (linear): WeightDrop(
        (module): Linear(in_features=2500, out_features=7500, bias=True)
      )
    ), QRNNLayer(
      (linear): WeightDrop(
        (module): Linear(in_features=2500, out_features=1200, bias=True)
      )
    )]
    Using [2800, 20000, 76000]
    Args: Namespace(alpha=0.0, batch_size=60, beta=0.0, bptt=140, clip=0.25, cuda=True, data='data/wikitext-103', dropout=0.1, dropoute=0.0, dropouth=0.1, dropouti=0.1, emsize=400, epochs=14, log_interval=200, lr=0.001, model='QRNN', nhid=2500, nlayers=4, nonmono=5, optimizer='adam', resume='', save='WT103.12hr.QRNN.pt', seed=1111, tied=True, wdecay=0.0, wdrop=0.0, when=[12])
    Model total parameters: 153886638
    /home/alx/Cloud/.venv3gpu/lib/python3.6/site-packages/torch/nn/functional.py:1340: UserWarning: nn.functional.tanh is deprecated. Use torch.tanh instead.
      warnings.warn("nn.functional.tanh is deprecated. Use torch.tanh instead.")
    /home/alx/Cloud/.venv3gpu/lib/python3.6/site-packages/torch/nn/functional.py:1351: UserWarning: nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.
      warnings.warn("nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.")
    /pytorch/torch/csrc/autograd/python_function.cpp:622: UserWarning: Legacy autograd function with non-static forward method is deprecated and will be removed in 1.3. Please use new-style autograd function with static forward method. (Example: https://pytorch.org/docs/stable/autograd.html#torch.autograd.Function)
    Traceback (most recent call last):
      File "main.py", line 240, in <module>
        train()
      File "main.py", line 196, in train
        output, hidden, rnn_hs, dropped_rnn_hs = model(data, hidden, return_h=True)
      File "/home/alx/Cloud/.venv3gpu/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/alx/Cloud/spell_corr/awd-lstm-lm/model.py", line 81, in forward
        raw_output, new_h = rnn(raw_output, hidden[l])
      File "/home/alx/Cloud/.venv3gpu/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/alx/Cloud/.venv3gpu/lib/python3.6/site-packages/torchqrnn/qrnn.py", line 99, in forward
        C = ForgetMult()(F, Z, hidden, use_cuda=self.use_cuda)
      File "/home/alx/Cloud/.venv3gpu/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/alx/Cloud/.venv3gpu/lib/python3.6/site-packages/torchqrnn/forget_mult.py", line 179, in forward
        return GPUForgetMult()(f, x, hidden_init) if use_cuda else CPUForgetMult()(f, x, hidden_init)
      File "/home/alx/Cloud/.venv3gpu/lib/python3.6/site-packages/torchqrnn/forget_mult.py", line 120, in forward
        self.compile()
      File "/home/alx/Cloud/.venv3gpu/lib/python3.6/site-packages/torchqrnn/forget_mult.py", line 102, in compile
        program = Program(kernel.encode(), 'recurrent_forget_mult.cu'.encode())
      File "/home/alx/Cloud/.venv3gpu/lib/python3.6/site-packages/pynvrtc/compiler.py", line 52, in __init__
        include_names)
      File "/home/alx/Cloud/.venv3gpu/lib/python3.6/site-packages/pynvrtc/interface.py", line 200, in nvrtcCreateProgram
        c_char_p(encode_str(src)), c_char_p(encode_str(name)),
      File "/home/alx/Cloud/.venv3gpu/lib/python3.6/site-packages/pynvrtc/interface.py", line 54, in encode_str
        return s.encode("utf-8")
    AttributeError: 'bytes' object has no attribute 'encode'
    Exception ignored in: <bound method Program.__del__ of <pynvrtc.compiler.Program object at 0x7fa9f5a4e518>>
    Traceback (most recent call last):
      File "/home/alx/Cloud/.venv3gpu/lib/python3.6/site-packages/pynvrtc/compiler.py", line 56, in __del__
        self._interface.nvrtcDestroyProgram(self._program)
    AttributeError: 'Program' object has no attribute '_program'
    

    Does anybody know how to fix this issue?

    opened by acriptis
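
Reading the traceback above: `torchqrnn/forget_mult.py` (line 102) calls `.encode()` on the kernel source and file name before handing them to `pynvrtc`'s `Program`, and `pynvrtc/interface.py` (line 54) then calls `.encode("utf-8")` on the same values again. In Python 3 `bytes` has no `encode` method, hence the first `AttributeError`. The second one (`'Program' object has no attribute '_program'`) looks like a side effect: `__init__` failed before `_program` was set, so `__del__` has nothing to destroy. The sketch below reproduces the failure in isolation, with no GPU or `pynvrtc` install needed. `encode_str_strict` mirrors the failing line from the traceback; `encode_str_tolerant` and the `kernel` string are hypothetical names used only for illustration and are not part of either library.

```python
def encode_str_strict(s):
    # Mirrors the failing line in the traceback: assumes `s` is a str.
    return s.encode("utf-8")


def encode_str_tolerant(s):
    # Hypothetical helper (not part of pynvrtc): accepts bytes or str.
    if isinstance(s, bytes):
        return s
    return s.encode("utf-8")


# Placeholder kernel source, standing in for the real CUDA kernel string.
kernel = 'extern "C" __global__ void noop() {}'

# Passing already-encoded bytes reproduces the error from the log.
try:
    encode_str_strict(kernel.encode())
    message = ""
except AttributeError as err:
    message = str(err)

print(message)

# Either passing a plain str, or tolerating bytes, avoids the double encode.
assert encode_str_strict(kernel) == encode_str_tolerant(kernel.encode())
```

If this reading is right, a possible workaround is to pass plain strings to `Program` in the installed `forget_mult.py` (that is, drop the two `.encode()` calls on line 102), or to use a `pynvrtc` release whose `Program` expects bytes. Neither has been verified against this exact environment.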
Releases

  • PyTorch==0.1.12 (Aug 25, 2017)

Owner
Salesforce
A variety of vendor agnostic projects which power Salesforce