The entmax mapping and its loss, a family of sparse softmax alternatives.

entmax


This package provides a PyTorch implementation of entmax and entmax losses: a sparse family of probability mappings and corresponding loss functions that generalize softmax and cross-entropy.

Features:

  • Exact partial-sort algorithms for 1.5-entmax and 2-entmax (sparsemax).
  • A bisection-based algorithm for generic alpha-entmax.
  • Gradients w.r.t. alpha for adaptive, learned sparsity!
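To make the sort-based idea concrete, here is a minimal pure-Python sketch of sparsemax (2-entmax) on a single list of scores. It is not the package's implementation: it uses a full sort rather than a partial sort, has no batching or autograd, and the name `sparsemax_sort` is made up for this illustration.

```python
def sparsemax_sort(scores):
    """Project a list of scores onto the probability simplex (sparsemax)."""
    ordered = sorted(scores, reverse=True)
    cumulative = 0.0
    tau = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        # The k-th largest score is in the support while 1 + k * value > cumulative.
        # The condition holds for a prefix of the sorted scores, so the last
        # threshold written here is the one for the full support.
        if 1.0 + k * value > cumulative:
            tau = (cumulative - 1.0) / k
    # Shift by the threshold and clip at zero.
    return [max(v - tau, 0.0) for v in scores]


print(sparsemax_sort([-2.0, 0.0, 0.5]))  # [0.0, 0.25, 0.75]
```

Scores below the threshold receive exactly zero probability, which is where the sparsity comes from.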

Requirements: Python 3, PyTorch >= 1.0 (and pytest for unit tests)

Example

In [1]: import torch

In [2]: from torch.nn.functional import softmax

In [3]: from entmax import sparsemax, entmax15, entmax_bisect

In [4]: x = torch.tensor([-2, 0, 0.5])

In [5]: softmax(x, dim=0)
Out[5]: tensor([0.0486, 0.3592, 0.5922])

In [6]: sparsemax(x, dim=0)
Out[6]: tensor([0.0000, 0.2500, 0.7500])

In [7]: entmax15(x, dim=0)
Out[7]: tensor([0.0000, 0.3260, 0.6740])

Gradients w.r.t. alpha (continued):

In [1]: from torch.autograd import grad

In [2]: x = torch.tensor([[-1, 0, 0.5], [1, 2, 3.5]])

In [3]: alpha = torch.tensor(1.33, requires_grad=True)

In [4]: p = entmax_bisect(x, alpha)

In [5]: p
Out[5]:
tensor([[0.0460, 0.3276, 0.6264],
        [0.0026, 0.1012, 0.8963]], grad_fn=<EntmaxBisectFunctionBackward>)

In [6]: grad(p[0, 0], alpha)
Out[6]: (tensor(-0.2562),)
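The session above can be sanity-checked without PyTorch. The following is a rough sketch, not the library's code: it assumes the usual alpha-entmax form p_i = [(alpha - 1) x_i - tau]_+^(1 / (alpha - 1)), finds tau by bisection for a single list of scores, and estimates the derivative with respect to alpha by central finite differences instead of the analytic gradient. The names `entmax_bisect_py` and `alpha_gradient_fd` are hypothetical.

```python
def entmax_bisect_py(scores, alpha, n_iter=100):
    """alpha-entmax of a list of scores for alpha > 1, with tau found by bisection."""
    if alpha <= 1.0:
        raise ValueError("this sketch only handles alpha > 1")
    exponent = 1.0 / (alpha - 1.0)
    z = [(alpha - 1.0) * s for s in scores]
    z_max = max(z)
    # At tau = z_max - 1 the total mass is >= 1;
    # at tau = z_max - n ** (1 - alpha) it is <= 1.
    lo, hi = z_max - 1.0, z_max - len(z) ** (1.0 - alpha)

    def mass(tau):
        return sum(max(v - tau, 0.0) ** exponent for v in z)

    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if mass(mid) >= 1.0:
            lo = mid  # too much mass: the threshold must move up
        else:
            hi = mid
    tau = 0.5 * (lo + hi)
    p = [max(v - tau, 0.0) ** exponent for v in z]
    total = sum(p)
    return [v / total for v in p]  # renormalize to absorb residual error


def alpha_gradient_fd(scores, alpha, index, step=1e-5):
    """Central finite-difference estimate of d p[index] / d alpha."""
    upper = entmax_bisect_py(scores, alpha + step)[index]
    lower = entmax_bisect_py(scores, alpha - step)[index]
    return (upper - lower) / (2.0 * step)


row = [-1.0, 0.0, 0.5]
print([round(v, 4) for v in entmax_bisect_py(row, 1.33)])
print(round(alpha_gradient_fd(row, 1.33, 0), 4))
```

Both printed results should land close to the first row of `p` and to the gradient shown in the session, up to floating-point precision.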

Installation

pip install entmax

Citations

Sparse Sequence-to-Sequence Models

@inproceedings{entmax,
  author    = {Peters, Ben and Niculae, Vlad and Martins, Andr{\'e} FT},
  title     = {Sparse Sequence-to-Sequence Models},
  booktitle = {Proc. ACL},
  year      = {2019},
  url       = {https://www.aclweb.org/anthology/P19-1146}
}

Adaptively Sparse Transformers

@inproceedings{correia19adaptively,
  author    = {Correia, Gon\c{c}alo M and Niculae, Vlad and Martins, Andr{\'e} FT},
  title     = {Adaptively Sparse Transformers},
  booktitle = {Proc. EMNLP-IJCNLP (to appear)},
  year      = {2019},
}


Comments
  • entmax_bisect leads to loss becoming nan

    entmax_bisect leads to loss becoming nan

    Hi,

    I've tried several different strategies with attention. When I try entmax on a small batch, it works well, but at some point during training on the full dataset my loss becomes NaN. The behavior is irregular: for one epoch I did not get it, but most of the time the loss ends up as NaN. Can you please suggest how this can be fixed? nn.Softmax works fine.

    opened by prajjwal1 16
  • Gradient wrt alpha

    Gradient wrt alpha

    Added gradients wrt alpha in the EntmaxBisectFunction class and the wrapper entmax_bisect. The extra computation is only performed if the alpha argument is a tensor with requires_grad=True.

    Todos

    • [x] reference @goncalomcorreia's paper in readme
    • [ ] link to camera ready (when out)
    • [x] add a general extension to nd tensors and dim argument. I think this can be done inside entmax_bisect using views and unsqueeze. The requirement probably should be, if X.shape = (m_1, m_2, ..., m_k) and dim=d then alpha.shape must be (m_1, ..., m_{d-1}, m_{d+1}, ..., m_k). (Weight sharing along certain dimensions can be achieved using torch.expand in user code)
    • [x] refactor losses
    • [x] add example for getting gradient wrt alpha in readme
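The shape rule proposed in the todo list can be written down as a small helper. This is only an illustration of the rule as stated there (alpha has the shape of X with dimension `dim` removed), and it is not necessarily what the released API requires; `expected_alpha_shape` is a hypothetical name.

```python
def expected_alpha_shape(x_shape, dim):
    """Shape of alpha under the proposed rule: x_shape with axis `dim` removed."""
    ndim = len(x_shape)
    if not -ndim <= dim < ndim:
        raise IndexError("dim out of range for a tensor with %d dimensions" % ndim)
    dim %= ndim  # normalize negative axes
    return tuple(x_shape[:dim]) + tuple(x_shape[dim + 1:])


# Attention scores of shape (batch, heads, queries, keys), normalized over keys.
print(expected_alpha_shape((8, 4, 10, 10), dim=-1))  # (8, 4, 10)
```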
    opened by vene 14
  • Errors when using the loss function

    Errors when using the loss function

    The error shows that, when computing the loss, the size of 'target' is not the same as that of 'p_star'. https://github.com/deep-spin/entmax/blob/master/entmax/losses.py#L156 Should it be switched to index_add_? Any hints?

    Pytorch version: '0.4.1.post2'

    Thanks

    opened by berlino 9
  • Alpha value less than one?

    Alpha value less than one?

    Can alpha value be less than one?

    I basically need it to be sum-normalized sigmoids in that case

    (e.g. rather than softmax that is the case where alpha = 1.0).

    opened by kayuksel 8
  • Unexpected behaviour of sparsemax gradients for 3d tensors

    Unexpected behaviour of sparsemax gradients for 3d tensors

    Hi folks!

    It seems like the gradients of sparsemax are not the same when we have two "equal" tensors: one 2d, and the other with a time dimension.

    Here is the code to reproduce the problem:

    import torch
    import entmax
    
    
    def test_map_fn(activation_fn):
        x = torch.tensor([[-2, 0, 0.5], [0.1, 2, -0.4]], requires_grad=True)
        # >>> x.shape
        # torch.Size([2, 3])
        a_2d = activation_fn(x, dim=-1)
        z_2d = torch.sum(torch.pow(a_2d, 2))
        z_2d.backward()
        grad_2d = x.grad
    
        x = torch.tensor([[[-2, 0, 0.5]], [[0.1, 2, -0.4]]], requires_grad=True)
        # >>> x.shape
        # torch.Size([2, 1, 3])
        a_3d = activation_fn(x, dim=-1)
        z_3d = torch.sum(torch.pow(a_3d, 2))
        z_3d.backward()
        grad_3d = x.grad
    
        print(activation_fn.__name__)
        print('Ok acts:', torch.allclose(a_2d.squeeze(), a_3d.squeeze()))
        print('Ok grads:', torch.allclose(grad_2d.squeeze(), grad_3d.squeeze()))
        print(grad_2d.squeeze())
        print(grad_3d.squeeze())
        print('---\n')
    
    
    if __name__ == '__main__':
        test_map_fn(torch.softmax)
        test_map_fn(entmax.entmax15)
        test_map_fn(entmax.sparsemax)
    

    The output of this code is:

    softmax
    Ok acts: True
    Ok grads: True
    tensor([[-0.0421, -0.0883,  0.1304],
            [-0.1325,  0.2198, -0.0873]])
    tensor([[-0.0421, -0.0883,  0.1304],
            [-0.1325,  0.2198, -0.0873]])
    ---
    
    entmax15
    Ok acts: True
    Ok grads: True
    tensor([[ 0.0000, -0.2344,  0.2344],
            [-0.0926,  0.0926,  0.0000]])
    tensor([[ 0.0000, -0.2344,  0.2344],
            [-0.0926,  0.0926,  0.0000]])
    ---
    
    sparsemax
    Ok acts: True
    Ok grads: False
    tensor([[ 0.0000, -0.5000,  0.5000],
            [ 0.0000,  0.0000,  0.0000]])
    tensor([[ 0., -2.,  0.],
            [ 0.,  1.,  0.]])
    ---
    

    So, using sparsemax, the grads of the two tensors are different. Note: a quick fix that seems to work fine in practice is calling tensor.view(-1, nb_labels) to get a 2d tensor.

    opened by mtreviso 8
  • Release patch 1.0.1 with torch install_requires fix

    Release patch 1.0.1 with torch install_requires fix

    Thanks for creating this useful library. We recently included it as part of our low code toolkit, Ludwig. However, we ran into an issue whereby if the user does not have torch already installed before installing entmax, it raises an exception:

    × python setup.py egg_info did not run successfully.
      │ exit code: 1
      ╰─> [10 lines of output]
          Traceback (most recent call last):
            File "<string>", line 36, in <module>
            File "<pip-setuptools-caller>", line 34, in <module>
            File "/tmp/pip-install-xasbhx2w/entmax_64da4068d2414a04a5c3adc7187695b4/setup.py", line 2, in <module>
              from entmax import __version__
            File "/tmp/pip-install-xasbhx2w/entmax_64da4068d2414a04a5c3adc7187695b4/entmax/__init__.py", line 3, in <module>
              from entmax.activations import sparsemax, entmax15, Sparsemax, Entmax15
            File "/tmp/pip-install-xasbhx2w/entmax_64da4068d2414a04a5c3adc7187695b4/entmax/activations.py", line 13, in <module>
              import torch
          ModuleNotFoundError: No module named 'torch'
          [end of output]
      
      note: This error originates from a subprocess, and is likely not a problem with pip.
    error: metadata-generation-failed
    

    Looks like this was fixed some time back here, but this change was made after the v1.0 release, meaning the current production release has this bug. Can you create a patch release v1.0.1 that includes this fix?

    Thanks.

    opened by tgaddair 5
  • `entmax_bisect` is not stable around `alpha=1`

    `entmax_bisect` is not stable around `alpha=1`

    First of all, thanks for a great library! It's very nicely implemented!

    From the documentation provided in the code, I understood that entmax_bisect should behave like softmax when alpha is set to 1.

    I've done some experiments and the results seem to be different from softmax when the alpha is equal to 1. Yet, when it's close to 1 it approximates the softmax behavior.

    Here is the code snippet for a very small example:

    import torch
    from entmax import entmax_bisect
    
    
    torch.softmax(torch.Tensor([0., 0., 1.]), dim=-1)                        # tensor([0.2119, 0.2119, 0.5761])
    
    entmax_bisect(torch.Tensor([0., 0., 1.]), dim=-1, alpha=0.9)             # tensor([0.2195, 0.2195, 0.5611])
    entmax_bisect(torch.Tensor([0., 0., 1.]), dim=-1, alpha=0.95)            # tensor([0.2157, 0.2157, 0.5687])
    entmax_bisect(torch.Tensor([0., 0., 1.]), dim=-1, alpha=0.99)            # tensor([0.2127, 0.2127, 0.5747])
    entmax_bisect(torch.Tensor([0., 0., 1.]), dim=-1, alpha=0.999999)        # tensor([0.2119, 0.2119, 0.5761])
    entmax_bisect(torch.Tensor([0., 0., 1.]), dim=-1, alpha=1)               # tensor([0.3333, 0.3333, 0.3333]) <--
    entmax_bisect(torch.Tensor([0., 0., 1.]), dim=-1, alpha=1.00001)         # tensor([0.2119, 0.2119, 0.5761])
    entmax_bisect(torch.Tensor([0., 0., 1.]), dim=-1, alpha=1.1)             # tensor([0.1985, 0.1985, 0.6031])
    

    Is this a bug or is it the intended behavior? I think it's not very clear from the documentation.
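One plausible reading of the uniform output, assuming the bisection solver first rescales the scores by (alpha - 1) and then raises to the power 1 / (alpha - 1): at exactly alpha = 1 every rescaled score is zero, so the coordinates become indistinguishable. The sketch below illustrates that and shows a simple guard that falls back to softmax near alpha = 1. The helper names and the `entmax_fn` callback are hypothetical, not part of the package.

```python
import math


def softmax(scores):
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_scores(scores, alpha):
    # Assumed preprocessing step of a bisection solver: multiply by (alpha - 1).
    return [(alpha - 1.0) * s for s in scores]


def entmax_with_softmax_fallback(scores, alpha, entmax_fn, eps=1e-6):
    """Call entmax_fn(scores, alpha), except in a small window around alpha = 1."""
    if abs(alpha - 1.0) < eps:
        return softmax(scores)
    return entmax_fn(scores, alpha)


print(scaled_scores([0.0, 0.0, 1.0], 1.0))  # [0.0, 0.0, 0.0]
print([round(p, 4) for p in softmax([0.0, 0.0, 1.0])])  # [0.2119, 0.2119, 0.5761]
```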

    opened by MartinXPN 3
  • A bug when alpha = 1 for entmax_bisect?

    A bug when alpha = 1 for entmax_bisect?

    For the function "entmax_bisect", the alpha parameter selects the mapping: softmax (alpha = 1), entmax15 (alpha = 1.5), sparsemax (alpha = 2). But when I try alpha = 1, it gives wrong results in which all the numbers are the same. When I set alpha = 0.99999 or 1.00001, it works well, and it also works well for other values of alpha, such as 2 and 1.5. So is this a bug, or am I just using it wrongly? Thanks a lot!

    opened by mysteriouslfz 3
  • Usage of alpha

    Usage of alpha

    Hi,

    May I know whether we need to define a new trainable parameter for the alpha value of each head in each layer? Could anyone be kind enough to show a simple example of how it could be used in a standard transformer?

    Thanks!

    opened by alibabadoufu 3
  • Fix setup.py

    Fix setup.py

    • Avoid importing the package in setup.py to get its version (this will fail if dependencies are not installed). See this.
    • Use setuptools instead of distutils to make dependencies work (distutils doesn't actually support install_requires and python_requires).
    opened by cifkao 2
  • Do not import entmax => torch in setup.py

    Do not import entmax => torch in setup.py

    Currently entmax/__init__.py is imported during install, which imports torch. What this means is that if the user of this package, or anything which depends upon it, doesn't already have torch installed, the install will fail.

    Following the advice from https://packaging.python.org/en/latest/guides/single-sourcing-package-version/#single-sourcing-the-package-version it looks like you can now get the same effect using setup.cfg and it will avoid actually executing the code but instead pull it from the AST.
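As an illustration of reading the version without executing the package, here is a minimal sketch that pulls `__version__` out of source text using the standard-library `ast` module. It is a stand-in for what a build tool might do, not the project's actual setup code; `read_version` and the sample source string are made up for this example.

```python
import ast


def read_version(source):
    """Return the literal assigned to __version__ in Python source text."""
    # ast.parse only builds a syntax tree; imports in the source are never run.
    for node in ast.parse(source).body:
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name) and target.id == "__version__":
                    return ast.literal_eval(node.value)
    raise ValueError("__version__ not found")


example = 'import torch\n__version__ = "1.1"\n'
print(read_version(example))  # 1.1
```

Because the `import torch` line is only parsed, this works even when torch is not installed.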

    opened by frankier 1
  • Index -1 is out of bounds

    Index -1 is out of bounds

    Hi! I am training a language model similar to the one in the Sparse Text Generation project, with a custom input format. When I start training, it cannot calculate the entmax loss. My inputs and labels both have shape (batch_size, seq_len) before they go to the loss; afterwards they have shapes (batch_size*seq_len, vocab_size) and (batch_size*seq_len,) respectively. I use masking via -1 in the labels and, even though I set ignore_index=-1, my log is:

    Traceback (most recent call last):
      File "run_lm_finetuning.py", line 782, in <module>
        main()
      File "run_lm_finetuning.py", line 736, in main
        global_step, tr_loss = train(args, train_dataset, model, tokenizer, gen_func)
      File "run_lm_finetuning.py", line 300, in train
        outputs = model(inputs, labels=labels)
      File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 880, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/app/src/pytorch_transformers/modeling_gpt2.py", line 607, in forward
        loss = self.loss(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
      File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 880, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/entmax/losses.py", line 17, in forward
        loss = self.loss(X, target)
      File "/usr/local/lib/python3.6/dist-packages/entmax/losses.py", line 278, in loss
        return entmax_bisect_loss(X, target, self.alpha, self.n_iter)
      File "/usr/local/lib/python3.6/dist-packages/entmax/losses.py", line 242, in entmax_bisect_loss
        return EntmaxBisectLossFunction.apply(X, target, alpha, n_iter)
      File "/usr/local/lib/python3.6/dist-packages/entmax/losses.py", line 129, in forward
        ctx, X, target, alpha, proj_args=dict(n_iter=n_iter)
      File "/usr/local/lib/python3.6/dist-packages/entmax/losses.py", line 45, in forward
        p_star.scatter_add_(1, target.unsqueeze(1), torch.full_like(p_star, -1))
    RuntimeError: index -1 is out of bounds for dimension 1 with size 50257
    

    How to fix this?

    UPD: I realized that the problem is not connected with ignore_index, but with a shape mismatch between target and p_star in the forward method of the _GenericLossFunction class. I still don't know how to fix this bug, so please help me if somebody knows how :)

    opened by liehtman 0
  • Entmax fails when all inputs are -inf

    Entmax fails when all inputs are -inf

    When all inputs to entmax are -inf, it fails with

    RuntimeError                              Traceback (most recent call last)
    <ipython-input-404-217bd9c1ced2> in <module>
          1 from entmax import entmax15
          2 logits = torch.ones(10) * float('-inf')
    ----> 3 entmax15(logits)
    
    ~/.virtualenvs/sparseref/lib/python3.7/site-packages/entmax/activations.py in entmax15(X, dim, k)
        254     """
        255 
    --> 256     return Entmax15Function.apply(X, dim, k)
        257 
        258 
    
    ~/.virtualenvs/sparseref/lib/python3.7/site-packages/entmax/activations.py in forward(cls, ctx, X, dim, k)
        176         X = X / 2  # divide by 2 to solve actual Entmax
        177 
    --> 178         tau_star, _ = _entmax_threshold_and_support(X, dim=dim, k=k)
        179 
        180         Y = torch.clamp(X - tau_star, min=0) ** 2
    
    ~/.virtualenvs/sparseref/lib/python3.7/site-packages/entmax/activations.py in _entmax_threshold_and_support(X, dim, k)
        129 
        130     support_size = (tau <= Xsrt).sum(dim).unsqueeze(dim)
    --> 131     tau_star = tau.gather(dim, support_size - 1)
        132 
        133     if k is not None and k < X.shape[dim]:
    
    RuntimeError: index -1 is out of bounds for dimension 0 with size 10
    

    A minimal snippet to reproduce this behavior is

    import torch
    from entmax import entmax15
    logits = torch.ones(10) * float('-inf')
    entmax15(logits)
    

    For reference, torch.softmax will return a tensor of nan's. This is certainly a corner case, but sometimes padding may create -inf-only inputs and it's easier to deal with nan's later.

    [This is possibly related to #9 ]

    opened by erickrf 1
  • Problem with sparse activations

    Problem with sparse activations

    I just replaced the softmax function with the sparsemax function or the tsallis15 function in my transformer model. It works well in the training stage, but the following error occurs during the testing phase: RuntimeError: CUDA error: device-side assert triggered

    If I replace it with softmax function again, it works.

    What could be the cause?

    opened by zylm 9
  • entmax implementation for TensorFlow 2

    entmax implementation for TensorFlow 2

    Dear team, thank you for your great work.

    Could you please add support for TensorFlow 2? I need it for many projects. Do you have any plans to do so?

    opened by deepgradient 1
  • Sparse losses return nan when there is -inf in the input

    Sparse losses return nan when there is -inf in the input

    The sparse loss functions (and their equivalent classes) return nans when there is -inf in the input.

    Example:

    import torch
    import numpy as np
    from entmax import entmax15_loss, sparsemax_loss
    x = torch.rand(10, 5)
    y = torch.randint(0, 4, [10])
    x[:, 4] = -np.inf
    entmax15_loss(x, y) 
    # tensor([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan])
    
    sparsemax_loss(x, y)
    # tensor([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan])
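A likely source of these NaNs, assuming the loss contains an inner product between a sparse probability vector and the raw scores: in IEEE floating point, 0 * -inf is NaN, so a class with zero probability and a -inf score poisons the whole sum. The sketch below reproduces that effect in plain Python and shows one possible workaround, summing only over nonzero coefficients. Both function names are hypothetical, and this is not how the library computes its losses.

```python
import math


def naive_dot(p, x):
    """Plain inner product; a 0 * -inf term turns the result into NaN."""
    return sum(pi * xi for pi, xi in zip(p, x))


def support_dot(p, x):
    """Inner product restricted to the coordinates where p is nonzero."""
    return sum(pi * xi for pi, xi in zip(p, x) if pi != 0.0)


weights = [0.25, 0.75, 0.0]
scores = [0.0, 0.5, float("-inf")]
print(math.isnan(naive_dot(weights, scores)))  # True
print(support_dot(weights, scores))  # 0.375
```

This only helps when the -inf scores fall outside the support; a -inf score on the gold class would still give an infinite loss.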
    
    opened by erickrf 2
  • ignore_index in loss functions

    ignore_index in loss functions

    This pull request implements two changes to make the loss function API closer to nn.CrossEntropyLoss.

    • Treating ignore_index. The common use case is to use a pseudo class id such as -1 in the target tensor to indicate padding positions (or any samples that should be ignored in the loss computation). This PR filters out ignore_index from the target tensor before computing the actual loss; the current implementation does not do this.

    • Adding reduction mean as a synonym to elementwise_mean. The latter has been deprecated in nn.CrossEntropyLoss in favor of mean.
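The filtering step described in the first bullet can be sketched in plain Python as follows. This is an illustration of the idea on lists rather than tensors, not the code from the pull request; `drop_ignored` is a hypothetical name, and the default of -100 is chosen to mirror the default of nn.CrossEntropyLoss.

```python
def drop_ignored(rows, targets, ignore_index=-100):
    """Keep only the (scores, target) pairs whose target is not ignore_index."""
    kept = [(r, t) for r, t in zip(rows, targets) if t != ignore_index]
    kept_rows = [r for r, _ in kept]
    kept_targets = [t for _, t in kept]
    return kept_rows, kept_targets


rows = [[0.1, 0.9], [0.5, 0.5], [2.0, -1.0]]
targets = [1, -1, 0]  # the middle position is padding
print(drop_ignored(rows, targets, ignore_index=-1))
# ([[0.1, 0.9], [2.0, -1.0]], [1, 0])
```

Filtering before the loss means a padding id such as -1 never reaches an indexing operation, which would otherwise fail with an out-of-bounds index.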

    opened by erickrf 0
Releases(v1.1)
  • v1.1(Dec 3, 2022)

    Among various small fixes, this version should fix an installation bug that caused installation to fail if torch was not already installed.

Owner
DeepSPIN
Deep Structured Prediction in NLP