Longformer: The Long-Document Transformer

Overview

Longformer

Longformer and LongformerEncoderDecoder (LED) are pretrained transformer models for long documents.

***** New December 1st, 2020: LongformerEncoderDecoder *****

A LongformerEncoderDecoder (LED) model is now available. It supports seq2seq tasks with long input. With gradient checkpointing, fp16, and a 48GB GPU, the input length can be up to 16K tokens. Check the updated paper for model details and evaluation.

  • Pretrained models: 1) led-base-16384, 2) led-large-16384

  • Requirements: Make sure to use the huggingface/transformers fork specified in requirements.txt. It adds support for gradient checkpointing and allows different maximum sequence lengths for the input and output. You can also run pip install git+https://github.com/allenai/longformer.git

  • Check the script scripts/summarization.py for an example of how to use the model.

***** New July 23rd, 2020: Speed degradation *****

A significant speed degradation in huggingface/transformers was recently discovered and fixed (check this PR for details). To avoid this problem, either use the old release v2.11.0, which doesn't support gradient checkpointing, or use the master branch. This problem should be fixed in the next huggingface/transformers release.

***** New June 29th, 2020: Easier to use Gradient checkpointing *****

Gradient checkpointing has been released with huggingface/transformers release v3.0.0. Gradient checkpointing reduces memory by 5x which makes it possible to process longer sequences on smaller GPUs. To use, try something like the following:

from transformers import LongformerModel
model = LongformerModel.from_pretrained('allenai/longformer-base-4096', gradient_checkpointing=True)
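
To make the memory/compute trade-off concrete, here is a minimal pure-Python sketch of the idea behind gradient checkpointing. It is not the huggingface/transformers implementation: the chain of scalar tanh "layers", the function names, and the segment size are all assumptions made for illustration. The checkpointed version stores only the input of each segment during the forward pass and recomputes the activations inside a segment during the backward pass, so it keeps fewer values alive at the cost of extra forward computation.

```python
import math


def forward_layer(x, w):
    """One toy layer: y = tanh(w * x)."""
    return math.tanh(w * x)


def backward_layer(x, w, grad_out):
    """Gradient of tanh(w * x) with respect to x, given the layer input x."""
    y = math.tanh(w * x)
    return grad_out * (1.0 - y * y) * w


def grad_full(x, weights):
    """Standard backprop: keep the input of every layer for the backward pass."""
    inputs = []
    for w in weights:
        inputs.append(x)
        x = forward_layer(x, w)
    grad = 1.0
    for x_in, w in zip(reversed(inputs), reversed(weights)):
        grad = backward_layer(x_in, w, grad)
    return grad, len(inputs)  # gradient, number of stored activations


def grad_checkpointed(x, weights, segment):
    """Keep only one checkpoint per segment; recompute the rest when needed."""
    starts = list(range(0, len(weights), segment))
    checkpoints = []
    for start in starts:
        checkpoints.append(x)  # store the segment input only
        for w in weights[start:start + segment]:
            x = forward_layer(x, w)

    grad = 1.0
    peak_stored = len(checkpoints)
    for start, x_ckpt in zip(reversed(starts), reversed(checkpoints)):
        seg_weights = weights[start:start + segment]
        # Recompute the activations of this segment from its checkpoint.
        inputs = []
        x_cur = x_ckpt
        for w in seg_weights:
            inputs.append(x_cur)
            x_cur = forward_layer(x_cur, w)
        peak_stored = max(peak_stored, len(checkpoints) + len(inputs))
        for x_in, w in zip(reversed(inputs), reversed(seg_weights)):
            grad = backward_layer(x_in, w, grad)
    return grad, peak_stored


weights = [0.9 + 0.01 * i for i in range(16)]  # 16 toy layers
g_full, stored_full = grad_full(0.5, weights)
g_ckpt, stored_ckpt = grad_checkpointed(0.5, weights, segment=4)
print(g_full, g_ckpt, stored_full, stored_ckpt)
```

Both functions return the same gradient; only the number of activations held at once differs. The actual saving for a real model depends on its architecture and on where the checkpoints are placed.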

***** New June 2nd, 2020: Integrating with Huggingface + Train your own long model + Gradient checkpointing *****

  1. Longformer is now integrated in the huggingface/transformers release v2.11.0. Now you can do
from transformers import LongformerModel
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

The release also includes LongformerForQA and other LongformerForTaskName with automatic setting of global attention.

  2. We added a notebook to show how to convert an existing pretrained model into its "long" version.

  3. Gradient checkpointing has been merged into HF master (check PR). Gradient checkpointing can reduce memory usage significantly (5x for longformer-base-4096), allowing longer sequences on smaller GPUs.

***** New April 27th, 2020: A PyTorch implementation of the sliding window attention *****

We added a PyTorch implementation of the sliding window attention that doesn't require the custom CUDA kernel. It is limited in functionality but more convenient to use for finetuning on downstream tasks.

Advantage: supports CPU, TPU and fp16, which aren't supported by the custom CUDA kernel

Limitations: uses 2x more memory (but fp16 offsets that), and doesn’t support dilation and autoregressive attention (not needed for finetuning)

Therefore, it is suitable for finetuning on downstream tasks but not a good choice for language modeling. The code snippet below and the TriviaQA scripts were updated to use this new implementation.
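
As a rough illustration of the attention pattern discussed above, the sketch below lists which positions a single token attends to under sliding-window attention, optionally combined with a few global tokens. It is plain Python for illustration only, not the repository's implementation. The function name `attended_positions` and its parameters are hypothetical, and dilation is deliberately left out.

```python
def attended_positions(i, seq_len, one_sided_window, global_positions=()):
    """Return the sorted positions that token i attends to.

    - A token with local attention sees `one_sided_window` tokens on each side.
    - A token marked as global attends to every position, and every other
      token attends to it.
    """
    if i in global_positions:
        return list(range(seq_len))
    lo = max(0, i - one_sided_window)
    hi = min(seq_len - 1, i + one_sided_window)
    positions = set(range(lo, hi + 1))
    positions.update(global_positions)  # global tokens are visible to everyone
    return sorted(positions)


# Token 5 in a 10-token sequence, window of 2 on each side, no global tokens.
local_only = attended_positions(5, 10, 2)
# Same token when position 0 (e.g. a classification token) is global.
with_global = attended_positions(5, 10, 2, global_positions=(0,))
# The global token itself attends to the whole sequence.
global_token = attended_positions(0, 10, 2, global_positions=(0,))
print(local_only, with_global, global_token)
```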

***** End new information *****

How to use

  1. Download pretrained model
  2. Install environment and code

    conda create --name longformer python=3.7
    conda activate longformer
    conda install cudatoolkit=10.0
    pip install git+https://github.com/allenai/longformer.git
  3. Run the model

    import torch
    from longformer.longformer import Longformer, LongformerConfig
    from longformer.sliding_chunks import pad_to_window_size
    from transformers import RobertaTokenizer
    
    config = LongformerConfig.from_pretrained('longformer-base-4096/') 
    # choose the attention mode 'n2', 'tvm' or 'sliding_chunks'
    # 'n2': for regular n2 attention
    # 'tvm': a custom CUDA kernel implementation of our sliding window attention
    # 'sliding_chunks': a PyTorch implementation of our sliding window attention
    config.attention_mode = 'sliding_chunks'
    
    model = Longformer.from_pretrained('longformer-base-4096/', config=config)
    tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
    tokenizer.model_max_length = model.config.max_position_embeddings
    
    SAMPLE_TEXT = ' '.join(['Hello world! '] * 1000)  # long input document
    
    input_ids = torch.tensor(tokenizer.encode(SAMPLE_TEXT)).unsqueeze(0)  # batch of size 1
    
    # TVM code doesn't work on CPU. Uncomment this if `config.attention_mode = 'tvm'`
    # model = model.cuda(); input_ids = input_ids.cuda()
    
    # Attention mask values -- 0: no attention, 1: local attention, 2: global attention
    attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device) # initialize to local attention
    attention_mask[:, [1, 4, 21,]] =  2  # Set global attention based on the task. For example,
                                         # classification: the <s> token
                                         # QA: question tokens
    
    # padding seqlen to the nearest multiple of 512. Needed for the 'sliding_chunks' attention
    input_ids, attention_mask = pad_to_window_size(
            input_ids, attention_mask, config.attention_window[0], tokenizer.pad_token_id)
    
    output = model(input_ids, attention_mask=attention_mask)[0]
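
The padding step in the snippet above can be pictured with plain Python lists. The sketch below is an assumption about what that step amounts to (rounding the sequence length up to a multiple of 512 and giving the padding no attention), not a copy of `pad_to_window_size`. The helper name `pad_to_multiple` is hypothetical.

```python
def pad_to_multiple(token_ids, attention_mask, multiple, pad_token_id):
    """Pad both lists on the right so their length is a multiple of `multiple`.

    Padding positions get mask value 0 (no attention), matching the mask
    convention used above: 0 = no attention, 1 = local, 2 = global.
    """
    padding_len = (multiple - len(token_ids) % multiple) % multiple
    padded_ids = token_ids + [pad_token_id] * padding_len
    padded_mask = attention_mask + [0] * padding_len
    return padded_ids, padded_mask


ids = list(range(1000))          # stand-in for 1000 token ids
mask = [1] * len(ids)            # local attention everywhere
mask[0] = 2                      # global attention on the first token
padded_ids, padded_mask = pad_to_multiple(ids, mask, multiple=512, pad_token_id=1)
print(len(padded_ids), padded_mask.count(0))
```

A sequence whose length is already a multiple of 512 is returned unchanged, because the padding length evaluates to zero.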

Model pretraining

This notebook demonstrates our procedure for training Longformer starting from the RoBERTa checkpoint. The same procedure can be followed to get a long-version of other existing pretrained models.
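
The notebook itself is not reproduced here. One step any such conversion needs is extending the short model's position-embedding table to the longer sequence length. The Longformer paper describes initializing the longer table by copying the pretrained position embeddings repeatedly. The sketch below shows that copying with plain lists. The function name is hypothetical, and details such as RoBERTa's reserved position indices are ignored, so treat it as an illustration rather than the notebook's code.

```python
def extend_position_embeddings(pretrained, new_len):
    """Tile a table of position embeddings until it covers `new_len` positions.

    `pretrained` is a list of embedding vectors (one list of floats per position).
    """
    if not pretrained:
        raise ValueError("pretrained position embeddings must not be empty")
    extended = []
    while len(extended) < new_len:
        remaining = new_len - len(extended)
        # Copy as many rows as still fit, starting again from position 0.
        extended.extend(list(row) for row in pretrained[:remaining])
    return extended


short_table = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1]]   # 3 toy positions
long_table = extend_position_embeddings(short_table, 7)
print(long_table)
```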

TriviaQA

  • Training scripts: scripts/triviaqa.py
  • Pretrained large model: here (replicates leaderboard results)
  • Instructions: scripts/cheatsheet.txt

CUDA kernel

Our custom CUDA kernel is implemented in TVM. For now, the kernel only works on GPUs and Linux. We tested it on Ubuntu, Python 3.7, CUDA10, PyTorch >= 1.2.0. If it doesn't work for your environment, please create a new issue.

Compiling the kernel: We already include the compiled binaries of the CUDA kernel, so most users won't need to compile it, but if you are interested, check scripts/cheatsheet.txt for instructions.

Known issues

Please check the repo issues for a list of known issues that we are planning to address soon. If your issue is not discussed, please create a new one.

Citing

If you use Longformer in your research, please cite Longformer: The Long-Document Transformer.

@article{Beltagy2020Longformer,
  title={Longformer: The Long-Document Transformer},
  author={Iz Beltagy and Matthew E. Peters and Arman Cohan},
  journal={arXiv:2004.05150},
  year={2020},
}

Longformer is an open-source project developed by the Allen Institute for Artificial Intelligence (AI2). AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.

Comments
  • ImportError: cannot import name 'nvcc'

    from tvm.contrib import nvcc
    ImportError: cannot import name 'nvcc'

    I get this when trying to compile the kernel from scratch. Did I miss something in the cmake config? I can import a lot of TVM modules but not nvcc.

    My cuda version is: Cuda compilation tools, release 10.0, V10.0.130

    opened by safooray 33
  • Text Classifier using longformer

    Can we request to add a short example of longformer for long text/review classification? Current triviaQA is good but more examples will encourage further use of longformer.

    Thanks. Patrick

    opened by pchankh 14
  • RuntimeError: CUDA error: device-side assert triggered - is_global_attn = is_index_global_attn.flatten().any().item()

    I'm trying to train a new model from scratch with a maximum length of 1024 (using the huggingface implementation of Longformer), but I get the following exception at a line that was recently added:

    --> 150         is_global_attn = is_index_global_attn.flatten().any().item()
        151 
        152         hidden_states = hidden_states.transpose(0, 1)
    
    RuntimeError: CUDA error: device-side assert triggered
    

    I tried Reformer and it worked as expected. The Longformer config is as follows:

    LongformerConfig {
      "attention_probs_dropout_prob": 0.1,
      "attention_window": 64,
      "bos_token_id": 0,
      "eos_token_id": 2,
      "gradient_checkpointing": false,
      "hidden_act": "gelu",
      "hidden_dropout_prob": 0.1,
      "hidden_size": 768,
      "initializer_range": 0.02,
      "intermediate_size": 3072,
      "layer_norm_eps": 1e-12,
      "max_position_embeddings": 1026,
      "model_type": "longformer",
      "num_attention_heads": 12,
      "num_hidden_layers": 6,
      "pad_token_id": 257,
      "sep_token_id": 258,
      "type_vocab_size": 2,
      "vocab_size": 261
    }
    

    Any idea what the issue is?

    opened by zarandioon 13
  • segmentation fault illegal instruction

    setup

    ubuntu 16.04, tvm 0.7 dev1, pytorch 1.4.0, transformers 2.11.0; everything else is the same as requirements.txt

    issue

    I uncommented the line DiagonaledMM._get_function('float32', 'cuda') in diagonaled_mm_tvm.py.

    After that, when I run the code, it shows either

    Loading tvm binary from :./longformer/lib/lib_diagonaled_mm_float32_cuda.so ... segmentation fault (core dump)

    or

    Loading tvm binary from :./longformer/lib/lib_diagonaled_mm_float32_cuda.so ... illegal instruction (core dump)

    other

    I tested tvm, tensorflow and pytorch, and they all work fine. I also followed scripts/cheatsheet.txt to regenerate lib_diagonaled_mm_float32_cuda.so, and it was generated successfully.

    Any idea or suggestion?

    The code is below:

    import torch
    from longformer.longformer import Longformer, LongformerConfig
    from longformer.sliding_chunks import pad_to_window_size
    from transformers import RobertaTokenizer
    
    config = LongformerConfig.from_pretrained('longformer-base-4096/') 
    # choose the attention mode 'n2', 'tvm' or 'sliding_chunks'
    # 'n2': for regular n2 attention
    # 'tvm': a custom CUDA kernel implementation of our sliding window attention
    # 'sliding_chunks': a PyTorch implementation of our sliding window attention
    config.attention_mode = 'tvm'
    
    model = Longformer.from_pretrained('longformer-base-4096/', config=config)
    tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
    tokenizer.model_max_length = model.config.max_position_embeddings
    
    SAMPLE_TEXT = ' '.join(['Hello world! '] * 1000)  # long input document
    
    input_ids = torch.tensor(tokenizer.encode(SAMPLE_TEXT)).unsqueeze(0)  # batch of size 1
    
    # TVM code doesn't work on CPU. Uncomment this if `config.attention_mode = 'tvm'`
    model = model.cuda(); input_ids = input_ids.cuda()
    
    # Attention mask values -- 0: no attention, 1: local attention, 2: global attention
    attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device) # initialize to local attention
    attention_mask[:, [1, 4, 21,]] =  2  # Set global attention based on the task. For example,
                                         # classification: the <s> token
                                         # QA: question tokens
    
    # padding seqlen to the nearest multiple of 512. Needed for the 'sliding_chunks' attention
    input_ids, attention_mask = pad_to_window_size(
            input_ids, attention_mask, config.attention_window[0], tokenizer.pad_token_id)
    
    output = model(input_ids, attention_mask=attention_mask)[0]
    
    opened by ProfXGiter 13
  • Using RoBERTa or LongFormer for texts with 16K tokens

    LongFormer does it by pooling all the local attentions (512) together in global attention (512 x 8 = 4096).

    This is not entirely true. There's no "pooling" of the 4096 tokens into 512. We keep all 4096 tokens. The only change is how attention is computed: instead of every token attending to every other token, every token attends to a smaller number of surrounding tokens. This speeds up the self-attention computation (which is the bottleneck) by assuming that the attention score between certain pairs of words is zero. This doesn't change the architecture or introduce any pooling.
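
To put rough numbers on the point above, the sketch below counts how many (query, key) pairs are scored under full attention and under a purely local sliding window. The sequence length of 4096 and the window of 256 tokens on each side are assumptions chosen for illustration. Global attention and dilation are ignored, and the function names are hypothetical.

```python
def full_attention_pairs(seq_len):
    """Every token attends to every token: seq_len ** 2 pairs."""
    return seq_len * seq_len


def windowed_attention_pairs(seq_len, one_sided_window):
    """Each token attends only to neighbours within the window (clipped at the edges)."""
    total = 0
    for i in range(seq_len):
        lo = max(0, i - one_sided_window)
        hi = min(seq_len - 1, i + one_sided_window)
        total += hi - lo + 1
    return total


seq_len, one_sided_window = 4096, 256
full = full_attention_pairs(seq_len)
local = windowed_attention_pairs(seq_len, one_sided_window)
print(full, local, full / local)
```

With a fixed window, the windowed count grows linearly with the sequence length, while the full count grows quadratically. No tokens are dropped in either case.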

    We are working on some code that will make it easy to train your own long model, so you can try longer sequences. We know it is easy to get to 16K or even 32K with the RoBERTa-base architecture (this needs the base model, fp16, and gradient checkpointing). For sequences longer than that, you will need to find ways to save memory depending on your application, for example reducing the window size, reducing the size of the feed-forward layers, implementing reversible transformers, or using sinusoidal position embeddings instead of learned position embeddings.

    Originally posted by @ibeltagy in https://github.com/allenai/longformer/issues/48#issuecomment-634270401

    opened by vr25 10
  • Not able to use the embedding for calculating similarity.

    First of all, let me thank you for contributing this knowledge to us. It makes a lot of difference for beginners like me. :) Now the issue: I was trying to use Longformer to calculate the similarity between a query and a list of paragraphs retrieved from my index search. The idea is to re-rank these paragraphs based on the cosine similarity between the embedding of the question and that of each individual paragraph.

    However, once I have calculated the embeddings of both the query and the paragraph using this code:

    SAMPLE_TEXT = f'{tokenizer.cls_token}{SAMPLE_TEXT}{tokenizer.eos_token}'
    ...................................
    ......................
    output = model(input_ids, attention_mask=attention_mask)[0]

    I get an embedding of dimension torch.Size([1, 512, 768]), and when I try to calculate the cosine similarity on these embeddings I get this error: RuntimeError: Can't call numpy() on Variable that requires grad. Use var.detach().numpy() instead.

    I do see that the error recommends using var.detach().numpy() instead of numpy(). https://stackoverflow.com/questions/55466298/pytorch-cant-call-numpy-on-variable-that-requires-grad-use-var-detach-num

    However, I am unsure where I should add this line of code. I am a beginner, so please pardon me if I have raised an issue unrelated to Longformer.

    Thanks for help :)

    opened by titu1992 10
  • help in understanding task global attention

    Hi,

    I need help understanding the concept below:

    image

    So does this mean that the complexity is quadratic (if all tokens attend to all other tokens) for task tuning but linear otherwise?

    Thanks!

    opened by vr25 9
  • Has anyone reproduced TriviaQA result with pytorch-lightning checkpoint?

    Hi, I'm trying to reproduce the TriviaQA result. I used the following instructions from cheatsheet.txt:

    // To run our pretrained TriviaQA large model (replicates the leaderboard results),
    // first download the pytorch-lightning checkpoint:
    // https://ai2-s2-research.s3-us-west-2.amazonaws.com/longformer/triviaqa-longformer-large.tar.gz
    // then run:
    python -m scripts.triviaqa \
        --train_dataset squad-wikipedia-train-4096.json \ # loaded but not used
        --dev_dataset squad-wikipedia-dev-4096.json \
        --gpus 0 --num_workers 4 \
        --max_seq_len 4096 --doc_stride -1 \
        --save_prefix triviaqa-longformer-large \ # pretrained pytorch-lightning checkpoint
        --model_path path/to/pretrained/longformer-large-4096 \ # loaded but not used
        --test # predictions will be saved into predictions.json

    // then run the official evaluation scripts
    python -m scripts.triviaqa_utils.evaluation_utils \
        --dataset_file path/to/qa/wikipedia-dev.json \
        --prediction_file predictions.json
    // Output should be:
    {'exact_match': 73.07644188665083, 'f1': 77.78523804802242, 'common': 7993, 'denominator': 7993, 'pred_len': 7993, 'gold_len': 7993}

    But I keep getting result {'exact_match': 0.025021894157387713, 'f1': 4.579085300341775, 'common': 7993, 'denominator': 7993, 'pred_len': 7993, 'gold_len': 7993}, which is very weird..

    I downloaded dataset and converted both train and dev dataset into squad format by provided script, and I just replaced data and model path to my server's setting.

    Has anyone reproduced the result f1:77.78 with given pytorch-lightning checkpoint?

    opened by YJYJLee 9
  • How can I train the pre-train model on chinese corpus?

    I want to pretrain a model on a Chinese corpus, but the details are not clear: for example, how to make the minimal changes necessary to support Longformer's attention mechanism, and how to plug the attention pattern into a pretrained transformer model.

    opened by liangxg787 9
  • Fine-tuning Longformer for squad (out of memory)

    I have pretrained an MLM Longformer using roberta-base based on this recipe.

    Then I tried to fine-tune it for SQuAD question answering. Here is the trainer, and the following are the run-time settings (based on here):

    python run_squad.py \
      --model_type roberta \
      --model_name_or_path pathe_to_roberta_base_mlm_trained_4096 \
      --do_train \
      --do_eval \
      --do_lower_case \
      --train_file $SQUAD_DIR/train-v1.1.json \
      --predict_file $SQUAD_DIR/dev-v1.1.json \
      --per_gpu_train_batch_size 1 \
      --learning_rate 3e-5 \
      --num_train_epochs 2.0 \
      --max_seq_length 4096 \
      --doc_stride 128 \
      --output_dir /tmp/debug_squad/

    Although I am using a V100 node (16 GPUs, 32 GB each), it always hits the GPU memory limit, as follows:

    File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
      output = module(*input, **kwargs)
    File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in __call__
      result = self.forward(*input, **kwargs)
    File "/home/aaaa/.local/lib/python3.6/site-packages/transformers/modeling_roberta.py", line 642, in forward
      output_hidden_states=output_hidden_states,
    File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in __call__
      result = self.forward(*input, **kwargs)
    File "/home/aaaa/.local/lib/python3.6/site-packages/transformers/modeling_bert.py", line 762, in forward
      output_hidden_states=output_hidden_states,
    File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in __call__
      result = self.forward(*input, **kwargs)
    File "/home/aaaa/.local/lib/python3.6/site-packages/transformers/modeling_bert.py", line 439, in forward
      output_attentions,
    File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in __call__
      result = self.forward(*input, **kwargs)
    File "/home/aaaa/.local/lib/python3.6/site-packages/transformers/modeling_bert.py", line 371, in forward
      hidden_states, attention_mask, head_mask, output_attentions=output_attentions,
    File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in __call__
      result = self.forward(*input, **kwargs)
    File "/home/aaaa/.local/lib/python3.6/site-packages/transformers/modeling_bert.py", line 315, in forward
      hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, output_attentions,
    File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in __call__
      result = self.forward(*input, **kwargs)
    File "/home/aaaa/.local/lib/python3.6/site-packages/transformers/modeling_bert.py", line 240, in forward
      attention_scores = attention_scores / math.sqrt(self.attention_head_size)
    RuntimeError: CUDA out of memory. Tried to allocate 768.00 MiB (GPU 0; 31.72 GiB total capacity; 30.25 GiB already allocated; 300.38 MiB free; 30.29 GiB reserved in total by PyTorch)

    However, using allenai/longformer-base-4096, it works. Could you please comment on what I may be missing in the above steps?

    opened by arashashari 8
  • CUDA error: device-side assert triggered, while converting BERT to Long

    Hi!

    I have apparently working code for converting a BERT model into a Longformer, but now I am trying to convert BERTeus to Longformer, which I expected to work in the same way (just changing the dataset + model name/path).

    With a small training corpus (50K lines; the same issue occurs with a big one), the training starts well, but it breaks around step 20, after 3-4 epochs.

    
    2020-09-22 15:01:55.336576: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.6
    2020-09-22 15:01:55.338202: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.6
    INFO:__main__:Loading the model from tmp/bert-base-4096
    INFO:transformers.configuration_utils:loading configuration file tmp/bert-base-4096/config.json
    INFO:transformers.configuration_utils:Model config BertConfig {
      "architectures": [
        "BertForMaskedLM"
      ],
      "attention_probs_dropout_prob": 0.1,
      "attention_window": [
        512,
        512,
        512,
        512,
        512,
        512,
        512,
        512,
        512,
        512,
        512,
        512
      ],
      "gradient_checkpointing": true,
      "hidden_act": "gelu",
      "hidden_dropout_prob": 0.1,
      "hidden_size": 768,
      "initializer_range": 0.02,
      "intermediate_size": 3072,
      "layer_norm_eps": 1e-12,
      "max_position_embeddings": 4096,
      "model_type": "bert",
      "num_attention_heads": 12,
      "num_hidden_layers": 12,
      "output_past": true,
      "pad_token_id": 3,
      "type_vocab_size": 2,
      "vocab_size": 50099
    }
    
    INFO:transformers.tokenization_utils_base:Model name 'tmp/bert-base-4096' not found in model shortcut name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased, TurkuNLP/bert-base-finnish-cased-v1, TurkuNLP/bert-base-finnish-uncased-v1, wietsedv/bert-base-dutch-cased). Assuming 'tmp/bert-base-4096' is a path, a model identifier, or url to a directory containing tokenizer files.
    INFO:transformers.tokenization_utils_base:Didn't find file tmp/bert-base-4096/added_tokens.json. We won't load it.
    INFO:transformers.tokenization_utils_base:Didn't find file tmp/bert-base-4096/tokenizer.json. We won't load it.
    INFO:transformers.tokenization_utils_base:loading file tmp/bert-base-4096/vocab.txt
    INFO:transformers.tokenization_utils_base:loading file None
    INFO:transformers.tokenization_utils_base:loading file tmp/bert-base-4096/special_tokens_map.json
    INFO:transformers.tokenization_utils_base:loading file tmp/bert-base-4096/tokenizer_config.json
    INFO:transformers.tokenization_utils_base:loading file None
    /mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_auto.py:798: FutureWarning: The class `AutoModelWithLMHead` is deprecated and will be removed in a future version. Please use `AutoModelForCausalLM` for causal language models, `AutoModelForMaskedLM` for masked language models and `AutoModelForSeq2SeqLM` for encoder-decoder models.
      FutureWarning,
    INFO:transformers.configuration_utils:loading configuration file tmp/bert-base-4096/config.json
    INFO:transformers.configuration_utils:Model config BertConfig {
      "architectures": [
        "BertForMaskedLM"
      ],
      "attention_probs_dropout_prob": 0.1,
      "attention_window": [
        512,
        512,
        512,
        512,
        512,
        512,
        512,
        512,
        512,
        512,
        512,
        512
      ],
      "gradient_checkpointing": true,
      "hidden_act": "gelu",
      "hidden_dropout_prob": 0.1,
      "hidden_size": 768,
      "initializer_range": 0.02,
      "intermediate_size": 3072,
      "layer_norm_eps": 1e-12,
      "max_position_embeddings": 4096,
      "model_type": "bert",
      "num_attention_heads": 12,
      "num_hidden_layers": 12,
      "output_past": true,
      "pad_token_id": 3,
      "type_vocab_size": 2,
      "vocab_size": 50099
    }
    
    INFO:transformers.modeling_utils:loading weights file tmp/bert-base-4096/pytorch_model.bin
    WARNING:transformers.modeling_utils:Some weights of the model checkpoint at tmp/bert-base-4096 were not used when initializing BertForMaskedLM: ['bert.encoder.layer.0.attention.self.query_global.weight', 'bert.encoder.layer.0.attention.self.query_global.bias', 'bert.encoder.layer.0.attention.self.key_global.weight', 'bert.encoder.layer.0.attention.self.key_global.bias', 'bert.encoder.layer.0.attention.self.value_global.weight', 'bert.encoder.layer.0.attention.self.value_global.bias', 'bert.encoder.layer.1.attention.self.query_global.weight', 'bert.encoder.layer.1.attention.self.query_global.bias', 'bert.encoder.layer.1.attention.self.key_global.weight', 'bert.encoder.layer.1.attention.self.key_global.bias', 'bert.encoder.layer.1.attention.self.value_global.weight', 'bert.encoder.layer.1.attention.self.value_global.bias', 'bert.encoder.layer.2.attention.self.query_global.weight', 'bert.encoder.layer.2.attention.self.query_global.bias', 'bert.encoder.layer.2.attention.self.key_global.weight', 'bert.encoder.layer.2.attention.self.key_global.bias', 'bert.encoder.layer.2.attention.self.value_global.weight', 'bert.encoder.layer.2.attention.self.value_global.bias', 'bert.encoder.layer.3.attention.self.query_global.weight', 'bert.encoder.layer.3.attention.self.query_global.bias', 'bert.encoder.layer.3.attention.self.key_global.weight', 'bert.encoder.layer.3.attention.self.key_global.bias', 'bert.encoder.layer.3.attention.self.value_global.weight', 'bert.encoder.layer.3.attention.self.value_global.bias', 'bert.encoder.layer.4.attention.self.query_global.weight', 'bert.encoder.layer.4.attention.self.query_global.bias', 'bert.encoder.layer.4.attention.self.key_global.weight', 'bert.encoder.layer.4.attention.self.key_global.bias', 'bert.encoder.layer.4.attention.self.value_global.weight', 'bert.encoder.layer.4.attention.self.value_global.bias', 'bert.encoder.layer.5.attention.self.query_global.weight', 'bert.encoder.layer.5.attention.self.query_global.bias', 
'bert.encoder.layer.5.attention.self.key_global.weight', 'bert.encoder.layer.5.attention.self.key_global.bias', 'bert.encoder.layer.5.attention.self.value_global.weight', 'bert.encoder.layer.5.attention.self.value_global.bias', 'bert.encoder.layer.6.attention.self.query_global.weight', 'bert.encoder.layer.6.attention.self.query_global.bias', 'bert.encoder.layer.6.attention.self.key_global.weight', 'bert.encoder.layer.6.attention.self.key_global.bias', 'bert.encoder.layer.6.attention.self.value_global.weight', 'bert.encoder.layer.6.attention.self.value_global.bias', 'bert.encoder.layer.7.attention.self.query_global.weight', 'bert.encoder.layer.7.attention.self.query_global.bias', 'bert.encoder.layer.7.attention.self.key_global.weight', 'bert.encoder.layer.7.attention.self.key_global.bias', 'bert.encoder.layer.7.attention.self.value_global.weight', 'bert.encoder.layer.7.attention.self.value_global.bias', 'bert.encoder.layer.8.attention.self.query_global.weight', 'bert.encoder.layer.8.attention.self.query_global.bias', 'bert.encoder.layer.8.attention.self.key_global.weight', 'bert.encoder.layer.8.attention.self.key_global.bias', 'bert.encoder.layer.8.attention.self.value_global.weight', 'bert.encoder.layer.8.attention.self.value_global.bias', 'bert.encoder.layer.9.attention.self.query_global.weight', 'bert.encoder.layer.9.attention.self.query_global.bias', 'bert.encoder.layer.9.attention.self.key_global.weight', 'bert.encoder.layer.9.attention.self.key_global.bias', 'bert.encoder.layer.9.attention.self.value_global.weight', 'bert.encoder.layer.9.attention.self.value_global.bias', 'bert.encoder.layer.10.attention.self.query_global.weight', 'bert.encoder.layer.10.attention.self.query_global.bias', 'bert.encoder.layer.10.attention.self.key_global.weight', 'bert.encoder.layer.10.attention.self.key_global.bias', 'bert.encoder.layer.10.attention.self.value_global.weight', 'bert.encoder.layer.10.attention.self.value_global.bias', 
'bert.encoder.layer.11.attention.self.query_global.weight', 'bert.encoder.layer.11.attention.self.query_global.bias', 'bert.encoder.layer.11.attention.self.key_global.weight', 'bert.encoder.layer.11.attention.self.key_global.bias', 'bert.encoder.layer.11.attention.self.value_global.weight', 'bert.encoder.layer.11.attention.self.value_global.bias']
    - This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
    - This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    INFO:transformers.modeling_utils:All the weights of BertForMaskedLM were initialized from the model checkpoint at tmp/bert-base-4096.
    If your task is similar to the task the model of the ckeckpoint was trained on, you can already use BertForMaskedLM for predictions without further training.
    INFO:__main__:Pretraining bert-base-4096 ... 
    INFO:filelock:Lock 140392820589624 acquired on cached_lm_BertTokenizerFast_4094_valEusLong.txt.lock
    INFO:transformers.data.datasets.language_modeling:Loading features from cached file cached_lm_BertTokenizerFast_4094_valEusLong.txt [took 0.008 s]
    INFO:filelock:Lock 140392820589624 released on cached_lm_BertTokenizerFast_4094_valEusLong.txt.lock
    INFO:__main__:Loading and tokenizing training data is usually slow: trainEusLong1.txt
    INFO:filelock:Lock 140392820589456 acquired on cached_lm_BertTokenizerFast_4094_trainEusLong1.txt.lock
    INFO:transformers.data.datasets.language_modeling:Loading features from cached file cached_lm_BertTokenizerFast_4094_trainEusLong1.txt [took 0.053 s]
    INFO:filelock:Lock 140392820589456 released on cached_lm_BertTokenizerFast_4094_trainEusLong1.txt.lock
    INFO:transformers.training_args:PyTorch: setting up devices
    INFO:transformers.trainer:You are instantiating a Trainer but W&B is not installed. To use wandb logging, run `pip install wandb; wandb login` see https://docs.wandb.com/huggingface.
    INFO:transformers.trainer:***** Running Evaluation *****
    INFO:transformers.trainer:  Num examples = 70
    INFO:transformers.trainer:  Batch size = 1
    Evaluation:   0%|                                                                                                                                                 | 0/70 [00:00<?, ?it/s]/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/utils/checkpoint.py:25: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
      warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
    Evaluation: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 70/70 [00:21<00:00,  3.22it/s]
    INFO:transformers.trainer:{'eval_loss': 12.326190962110246, 'step': 0}
    INFO:__main__:Initial eval bpc: 17.782934574086813
    INFO:transformers.trainer:***** Running training *****
    INFO:transformers.trainer:  Num examples = 388
    INFO:transformers.trainer:  Num Epochs = 501
    INFO:transformers.trainer:  Instantaneous batch size per device = 1
    INFO:transformers.trainer:  Total train batch size (w. parallel, distributed & accumulation) = 64
    INFO:transformers.trainer:  Gradient Accumulation steps = 64
    INFO:transformers.trainer:  Total optimization steps = 3000
    INFO:transformers.trainer:  Starting fine-tuning.
    Epoch:   0%|                                                                                                                                                     | 0/501 [00:00<?, ?it/sINFO:transformers.trainer:{'loss': 12.102866038680077, 'learning_rate': 6.000000000000001e-08, 'epoch': 0.16494845360824742, 'step': 1}                  | 63/388 [01:18<06:51,  1.27s/it]
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-1
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-1/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-1/pytorch_model.bin
    INFO:transformers.trainer:{'loss': 12.099215269088745, 'learning_rate': 1.2000000000000002e-07, 'epoch': 0.32989690721649484, 'step': 2}                                 | 127/388 [02:50<05:35,  1.29s/it]
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-2
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-2/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-2/pytorch_model.bin
    INFO:transformers.trainer:{'loss': 12.078452616930008, 'learning_rate': 1.8e-07, 'epoch': 0.4948453608247423, 'step': 3}                                                 | 191/388 [04:24<04:14,  1.29s/it]
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-3
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-3/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-3/pytorch_model.bin
    INFO:transformers.trainer:{'loss': 12.023080185055733, 'learning_rate': 2.4000000000000003e-07, 'epoch': 0.6597938144329897, 'step': 4}                                  | 255/388 [05:56<02:50,  1.28s/it]
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-4
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-4/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-4/pytorch_model.bin
    INFO:transformers.trainer:{'loss': 12.003526121377945, 'learning_rate': 3.0000000000000004e-07, 'epoch': 0.8247422680412371, 'step': 5}█████████▉                        | 319/388 [07:29<01:28,  1.29s/it]INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-5
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-5/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-5/pytorch_model.bin
    INFO:transformers.trainer:{'loss': 11.993770495057106, 'learning_rate': 3.6e-07, 'epoch': 0.9896907216494846, 'step': 6}███████████████████████████████████████████████▎ | 383/388 [09:01<00:06,  1.29s/it]
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-6
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-6/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-6/pytorch_model.bin
    Iteration: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 388/388 [09:18<00:00,  1.44s/it]
    Epoch:   0%|▎                                                                                                                                        | 1/501 [09:18<77:36:08, 558.74s/it]                 INFO:transformers.trainer:{'loss': 12.672470852732658, 'learning_rate': 4.2e-07, 'epoch': 1.1649484536082475, 'step': 7}                                                   | 63/388 [01:20<06:58,  1.29s/it]
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-7
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-7/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-7/pytorch_model.bin
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-8
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-8/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-8/pytorch_model.bin
    
    Iteration:  36%|███████████████████████████████████████████████████████▏                                                                                                 | 140/388 [03:21<05:27,  1.32s/iItINFO:transformers.trainer:{'loss': 11.813278079032898, 'learning_rate': 5.4e-07, 'epoch': 1.4948453608247423, 'step': 9}                                                  | 191/388 [04:27<04:15,  1.30s/it]
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-9
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-9/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-9/pytorch_model.bin
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-10
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-10/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-10/pytorch_model.bin
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-11
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-11/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-11/pytorch_model.bin
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-12
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-12/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-12/pytorch_model.bin
    Iteration: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 388/388 [09:24<00:00,  1.45s/it]
    Epoch:   0%|▌                                                                                                                                        | 2/501 [18:43<77:40:49, 560.42s/it]<00:00,  2.07s/it]INFO:transformers.trainer:{'loss': 12.117324143648148, 'learning_rate': 7.799999999999999e-07, 'epoch': 2.1649484536082473, 'step': 13}                                     | 63/388 [01:20<06:59,  1.29s/it]
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-13
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-13/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-13/pytorch_model.bin
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-14
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-14/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-14/pytorch_model.bin
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-15
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-15/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-15/pytorch_model.bin
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-16
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-16/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-16/pytorch_model.bin
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-17
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-17/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-17/pytorch_model.bin
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-18
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-18/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-18/pytorch_model.bin
    Iteration: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 388/388 [09:24<00:00,  1.45s/it]
    Epoch:   1%|▊                                                                                                                                        | 3/501 [28:07<77:40:37, 561.52s/it]4<00:00,  2.07s/itINFO:transformers.trainer:{'loss': 11.206573352217674, 'learning_rate': 1.14e-06, 'epoch': 3.1649484536082473, 'step': 19}                                                  | 63/388 [01:20<06:58,  1.29s/it]
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-19
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-19/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-19/pytorch_model.bin
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-20
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-20/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-20/pytorch_model.bin
    
    /pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [467,0,0], thread: [93,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
    /pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [467,0,0], thread: [94,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
    /pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [467,0,0], thread: [95,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
    Iteration:  39%|████████████████████████████████████████████████████████████▋                                                                                             | 153/388 [03:38<05:35,  1.43s/it]
    Epoch:   1%|▊                                                                                                                                        | 3/501 [31:45<87:51:44, 635.15s/it]
    Traceback (most recent call last):
      File "BERTeus2LongB.py", line 305, in <module>
        pretrain_and_evaluate(training_args, model, tokenizer, eval_only=False, model_path=training_args.output_dir)
      File "BERTeus2LongB.py", line 183, in pretrain_and_evaluate
        trainer.train(model_path=model_path)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/trainer.py", line 499, in train
        tr_loss += self._training_step(model, inputs, optimizer)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/trainer.py", line 622, in _training_step
        outputs = model(**inputs)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 1083, in forward
        output_hidden_states=output_hidden_states,
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 753, in forward
        input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 182, in forward
        embeddings = inputs_embeds + position_embeddings + token_type_embeddings
    RuntimeError: CUDA error: device-side assert triggered
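
    The eval_loss and "Initial eval bpc" values printed near the top of the log above appear to be related by a change of logarithm base: the loss is a cross-entropy in nats, and dividing it by ln 2 gives bits. A minimal check of that arithmetic, using only the two numbers from the log (the variable names are illustrative):

```python
import math

# Values copied from the log above.
eval_loss_nats = 12.326190962110246
reported_bpc = 17.782934574086813

# Convert a natural-log cross-entropy to bits by dividing by ln(2).
bpc = eval_loss_nats / math.log(2)

print(f"computed bpc = {bpc:.6f}, reported bpc = {reported_bpc:.6f}")
```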
    

    The same run, with synchronous CUDA kernel launches enabled:

    import os
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before CUDA is initialized

    ...
    Epoch:   1%|▉                                                                                                                                                          | 3/501 [30:52<85:25:53, 617.58s/it]
    Traceback (most recent call last):
      File "BERTeus2LongB.py", line 305, in <module>
        pretrain_and_evaluate(training_args, model, tokenizer, eval_only=False, model_path=training_args.output_dir)
      File "BERTeus2LongB.py", line 183, in pretrain_and_evaluate
        trainer.train(model_path=model_path)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/trainer.py", line 499, in train
        tr_loss += self._training_step(model, inputs, optimizer)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/trainer.py", line 622, in _training_step
        outputs = model(**inputs)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 1083, in forward
        output_hidden_states=output_hidden_states,
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 762, in forward
        output_hidden_states=output_hidden_states,
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 430, in forward
        encoder_attention_mask,
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 155, in checkpoint
        return CheckpointFunction.apply(function, preserve, *args)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 74, in forward
        outputs = run_function(*args)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 420, in custom_forward
        return module(*inputs, output_attentions)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 371, in forward
        hidden_states, attention_mask, head_mask, output_attentions=output_attentions,
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 315, in forward
        hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, output_attentions,
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 243, in forward
        attention_scores = attention_scores + attention_mask
    RuntimeError: CUDA error: device-side assert triggered
    

    Any hint what causes this error?
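
    One common trigger for the assertion srcIndex < srcSelectDimSize is an index that falls outside an embedding table, for example a token id at or above the vocabulary size, or a position id at or above the number of position embeddings. The following is a sketch of a pre-flight check that runs on plain Python lists before a batch reaches the GPU. The name check_embedding_indices and its parameters are hypothetical, not part of transformers, and this is only one possible cause of the error.

```python
def check_embedding_indices(input_ids, position_ids, vocab_size, max_positions):
    """Return a list of problems found; an empty list means every index fits.

    Hypothetical helper: all names here are illustrative assumptions.
    """
    problems = []
    checks = (
        ("input_ids", input_ids, vocab_size),
        ("position_ids", position_ids, max_positions),
    )
    for name, ids, limit in checks:
        # An embedding table of size `limit` accepts indices 0 .. limit - 1 only.
        bad = [i for i in ids if i < 0 or i >= limit]
        if bad:
            problems.append(
                f"{name}: {len(bad)} index(es) outside [0, {limit}), e.g. {bad[0]}"
            )
    return problems


# Example: the last token id equals the vocabulary size, so it is out of range.
example_problems = check_embedding_indices(
    input_ids=[101, 2023, 50000],
    position_ids=[0, 1, 2],
    vocab_size=50000,
    max_positions=4096,
)
for line in example_problems:
    print(line)
```

    Running the same forward pass on CPU is another way to get a readable Python error instead of a device-side assert.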

    By the way, I also sometimes got this error, which I cannot reproduce right now:

     File "BERTeus2LongB.py", line 305, in <module>
        pretrain_and_evaluate(training_args, model, tokenizer, eval_only=False, model_path=training_args.output_dir)
      ...
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/functional.py", line 1372, in linear
        output = input.matmul(weight.t())
    RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
    

    Regards, Gorka

    opened by GorkaUrbizu 7
  • Number of tokens per batch mismatch - longformer vs roberta

    In your conversion notebook you suggest that the number of tokens per batch should be the same as RoBERTa's: 2^18 ≈ 262k.

    The RoBERTa paper, however, reports a sequence length of 512 and a batch size of 8k, which means each batch has 512 × 8k ≈ 4M tokens.

    Am I missing something?
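
    The two figures in the question can be checked with a few lines of arithmetic. This sketch assumes that "8k" means 8192 sequences per batch; the variable names are illustrative.

```python
# Tokens per batch suggested in the conversion notebook.
longformer_tokens_per_batch = 2 ** 18

# Tokens per batch implied by a sequence length of 512 and a batch size of 8k
# (assumed here to be 8192 sequences).
roberta_seq_len = 512
roberta_batch_size = 8192
roberta_tokens_per_batch = roberta_seq_len * roberta_batch_size

# How many times larger the second figure is than the first.
ratio = roberta_tokens_per_batch // longformer_tokens_per_batch

print(longformer_tokens_per_batch, roberta_tokens_per_batch, ratio)
```

    Under that assumption the two numbers differ by a factor of 16, which is the gap the question is asking about.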

    opened by nbroad1881 1
  • Answering performance of Longformer-base on the HotpotQA dev set

    Hi,

    I only found Longformer-base's joint F1 on the HotpotQA dev set in the paper, and I would like to know whether my reproduction results (Ans EM = 61.38, Ans F1 = 75.18) are expected. Could you provide some more specific metrics?

    Thank you!

    opened by zycdev 0
  • CVE-2007-4559 Patch

    Patching CVE-2007-4559

    Hi, we are security researchers from the Advanced Research Center at Trellix. We have begun a campaign to patch a widespread bug named CVE-2007-4559. CVE-2007-4559 is a 15-year-old bug in the Python tarfile package. By using extract() or extractall() on a tarfile object without sanitizing input, a maliciously crafted .tar file could perform a directory path traversal attack. We found at least one unsanitized extractall() in your codebase and are providing a patch for you via pull request. The patch essentially checks whether all tarfile members will be extracted safely and throws an exception otherwise. We encourage you to use this patch or your own solution to secure against CVE-2007-4559. Further technical information about the vulnerability can be found in this blog.

    If you have further questions, you may contact us through this project's lead researcher, Kasimir Schulz.
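
    The kind of check the patch describes can be sketched as follows. This is an illustration, not the actual patch from the pull request: is_within_directory and safe_extract are names chosen here, and the check only rejects member names that resolve outside the target directory (it does not inspect symlink or hard-link targets).

```python
import os
import tarfile


def is_within_directory(directory, target):
    """Return True if `target` resolves to a path inside `directory`."""
    abs_directory = os.path.abspath(directory)
    abs_target = os.path.abspath(target)
    return os.path.commonpath([abs_directory, abs_target]) == abs_directory


def safe_extract(tar, path="."):
    """Extract every member of an open TarFile, refusing path traversal.

    Raises ValueError before extracting anything if a member name would
    land outside `path` (for example "../evil.txt").
    """
    for member in tar.getmembers():
        member_path = os.path.join(path, member.name)
        if not is_within_directory(path, member_path):
            raise ValueError(f"Attempted path traversal in tar file: {member.name}")
    tar.extractall(path)
```

    A call such as safe_extract(tarfile.open("archive.tar"), "output_dir") would then replace a bare extractall(); the file and directory names in that call are placeholders.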

    opened by TrellixVulnTeam 0
  • Updated BART to Longformer-encoder-decoder (LED) converter

    Hi @ibeltagy et al., I'm pre-training BART for Portuguese and converting the pre-trained model to LED, following the instructions you gave in the paper and the code at https://github.com/allenai/longformer/blob/caefee668e39cacdece7dd603a0bebf24df6d8ca/scripts/convert_bart_to_longformerencoderdecoder.py.

    The huggingface library is evolving fast; unfortunately, the code you provided is outdated and I had to implement a new version based on yours.

    I have 2 questions:

    1. Could you tell me if everything is ok or if I missed something? https://gist.github.com/erichans/af745a381b28b1c019f96997ddac4cd7
    2. Is the LEDForConditionalGeneration model uploaded to huggingface just a BART model converted to LED or is there something else?

    Thanks in advance!

    opened by erichans 0
  • Why the TVM implementation is memory efficient

    Thanks for your excellent work!

    I would like to discuss the memory reduction. It seems that the TVM implementation does not store fewer matrices (such as the query, key, and value matrices). The number of Q-K pairs is smaller than in full attention, which explains the faster computation, but why does the memory reduction follow a similar trend to the time reduction? The TVM kernel does not seem to use any technique to save memory, and the padded 0 values are also int32, yet the TVM implementation is memory efficient...

    Looking forward to your reply.
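
    One way to reason about the question, assuming the attention score matrix is what dominates memory: full self-attention materializes one score per query-key pair, while a sliding-window kernel only needs the scores that fall inside the window. The sketch below counts those entries. The function name, the window of 256 tokens on each side, and the sequence length of 4096 are illustrative assumptions, and the count ignores global attention and padding.

```python
def attention_score_entries(seq_len, one_sided_window):
    """Count attention scores stored per head for full vs. banded attention."""
    # Full attention: every query attends to every key.
    full = seq_len * seq_len
    # Sliding window: each query attends to itself plus `one_sided_window`
    # keys on each side (edge positions are counted as if padded).
    banded = seq_len * (2 * one_sided_window + 1)
    return full, banded


full_entries, banded_entries = attention_score_entries(4096, 256)
print(full_entries, banded_entries, round(full_entries / banded_entries, 2))
```

    Under these assumptions the same reduction in Q-K pairs that shortens the computation also shrinks the stored score matrix, which is consistent with memory and time following a similar trend.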

    opened by jlidw 0
  • Pretraining longformer for NER on big pdf text

    Hi, I'm trying to extract entities from documents of 50-60 pages each. Can anybody suggest a better approach, please? I couldn't find any NER implementation for Longformer.

    opened by ajaysurya1221 0
Releases(v0.2)