The entmax mapping and its loss, a family of sparse softmax alternatives.

entmax


This package provides a PyTorch implementation of entmax and entmax losses: a sparse family of probability mappings and corresponding loss functions, generalizing softmax / cross-entropy.


  • Exact partial-sort algorithms for 1.5-entmax and 2-entmax (sparsemax).
  • A bisection-based algorithm for generic alpha-entmax.
  • Gradients w.r.t. alpha for adaptive, learned sparsity!

Requirements: Python 3, PyTorch >= 1.0 (and pytest for unit tests)


In [1]: import torch

In [2]: from torch.nn.functional import softmax

In [3]: from entmax import sparsemax, entmax15, entmax_bisect

In [4]: x = torch.tensor([-2, 0, 0.5])

In [5]: softmax(x, dim=0)
Out[5]: tensor([0.0486, 0.3592, 0.5922])

In [6]: sparsemax(x, dim=0)
Out[6]: tensor([0.0000, 0.2500, 0.7500])

In [7]: entmax15(x, dim=0)
Out[7]: tensor([0.0000, 0.3260, 0.6740])
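The sparsemax and 1.5-entmax outputs above can be reproduced by hand. What follows is a minimal pure-Python sketch of the sort-based threshold computations for a single 1-D list of scores. It is not the package's implementation, and the helper names `sparsemax_ref` and `entmax15_ref` are hypothetical.

```python
import math


def sparsemax_ref(scores):
    """Illustrative sparsemax for one list of floats (hypothetical helper)."""
    srt = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, value in enumerate(srt, start=1):
        cumsum += value
        candidate = (cumsum - 1.0) / k
        # The support grows while the k-th largest score exceeds the threshold.
        if value > candidate:
            tau = candidate
        else:
            break
    return [max(s - tau, 0.0) for s in scores]


def entmax15_ref(scores):
    """Illustrative 1.5-entmax for one list of floats (hypothetical helper)."""
    half = [s / 2.0 for s in scores]      # 1.5-entmax is solved on x / 2
    top = max(half)
    shifted = [h - top for h in half]     # shift by the max for stability
    srt = sorted(shifted, reverse=True)
    total = 0.0
    total_sq = 0.0
    tau = 0.0
    for k, value in enumerate(srt, start=1):
        total += value
        total_sq += value * value
        mean = total / k
        ss = k * (total_sq / k - mean * mean)
        delta = max((1.0 - ss) / k, 0.0)
        candidate = mean - math.sqrt(delta)
        if candidate <= value:
            tau = candidate
        else:
            break
    return [max(s - tau, 0.0) ** 2 for s in shifted]


x = [-2.0, 0.0, 0.5]
print([round(v, 4) for v in sparsemax_ref(x)])  # approximately [0.0, 0.25, 0.75]
print([round(v, 4) for v in entmax15_ref(x)])   # approximately [0.0, 0.326, 0.674]
```

Both mappings zero out the lowest score, whereas softmax assigns it a small positive probability.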

Gradients w.r.t. alpha (continued):

In [1]: from torch.autograd import grad

In [2]: x = torch.tensor([[-1, 0, 0.5], [1, 2, 3.5]])

In [3]: alpha = torch.tensor(1.33, requires_grad=True)

In [4]: p = entmax_bisect(x, alpha)

In [5]: p
Out[5]:
tensor([[0.0460, 0.3276, 0.6264],
        [0.0026, 0.1012, 0.8963]], grad_fn=<EntmaxBisectFunctionBackward>)

In [6]: grad(p[0, 0], alpha)
Out[6]: (tensor(-0.2562),)
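As a sanity check on the numbers above, here is a minimal pure-Python sketch of alpha-entmax by bisection together with a central finite difference in alpha. It assumes the standard definition p_i = [(alpha - 1) x_i - tau]_+ ^ (1 / (alpha - 1)) with alpha > 1. The helpers `entmax_bisect_ref` and `dp_dalpha` are hypothetical names and are not part of the package, which computes this gradient analytically.

```python
def entmax_bisect_ref(scores, alpha, n_iter=60):
    """Illustrative alpha-entmax for one list of floats; requires alpha > 1."""
    beta = alpha - 1.0
    scaled = [beta * s for s in scores]
    top = max(scaled)
    d = len(scores)
    # At tau_lo the total mass is >= 1; at tau_hi it is <= 1.
    tau_lo = top - 1.0
    tau_hi = top - (1.0 / d) ** beta

    def probs(tau):
        return [max(s - tau, 0.0) ** (1.0 / beta) for s in scaled]

    for _ in range(n_iter):
        tau_mid = 0.5 * (tau_lo + tau_hi)
        if sum(probs(tau_mid)) >= 1.0:
            tau_lo = tau_mid
        else:
            tau_hi = tau_mid
    p = probs(0.5 * (tau_lo + tau_hi))
    total = sum(p)
    return [v / total for v in p]  # renormalise away residual bisection error


def dp_dalpha(scores, alpha, index, h=1e-4):
    """Central finite difference of one output probability w.r.t. alpha."""
    up = entmax_bisect_ref(scores, alpha + h)[index]
    down = entmax_bisect_ref(scores, alpha - h)[index]
    return (up - down) / (2.0 * h)


row = [-1.0, 0.0, 0.5]
p = entmax_bisect_ref(row, 1.33)
g = dp_dalpha(row, 1.33, index=0)
print([round(v, 4) for v in p])  # close to the first row of Out[5]
print(round(g, 3))               # close to the analytic value in Out[6]
```

The negative sign is what one would expect: increasing alpha makes the output sparser, so the smallest probability shrinks.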


pip install entmax


Sparse Sequence-to-Sequence Models

  author    = {Peters, Ben and Niculae, Vlad and Martins, Andr{\'e} FT},
  title     = {Sparse Sequence-to-Sequence Models},
  booktitle = {Proc. ACL},
  year      = {2019},
  url       = {}

Adaptively Sparse Transformers

  author    = {Correia, Gon\c{c}alo M and Niculae, Vlad and Martins, Andr{\'e} FT},
  title     = {Adaptively Sparse Transformers},
  booktitle = {Proc. EMNLP-IJCNLP (to appear)},
  year      = {2019},

Further reading:

  • entmax_bisect leads to loss becoming nan

    I've used several different strategies with attention. I tried entmax on a small batch and it works well, but somewhere during training on the full dataset my loss becomes NaN. The behavior is irregular: in one epoch I did not get it, but most of the time I get NaN as my loss. Can you please suggest some ways to fix this? nn.Softmax works fine.

    opened by prajjwal1 16
  • Gradient wrt alpha

    Added gradients wrt alpha in the EntmaxBisectFunction class and the wrapper entmax_bisect. The extra computation is only performed if the alpha argument is a tensor with requires_grad=True.


    • [x] reference @goncalomcorreia's paper in readme
    • [ ] link to camera ready (when out)
    • [x] add a general extension to nd tensors and dim argument. I think this can be done inside entmax_bisect using views and unsqueeze. The requirement probably should be, if X.shape = (m_1, m_2, ..., m_k) and dim=d then alpha.shape must be (m_1, ..., m_{d-1}, m_{d+1}, ..., m_k). (Weight sharing along certain dimensions can be achieved using torch.expand in user code)
    • [x] refactor losses
    • [x] add example for getting gradient wrt alpha in readme
    opened by vene 14
  • Errors when using the loss function

    The error shows that when computing the loss, the size of 'target' is not the same as that of 'p_star'. Should it be switched to index_add_? Any hints?

    Pytorch version: '0.4.1.post2'


    opened by berlino 9
  • Alpha value less than one?

    Can alpha value be less than one?

    I basically need it to be sum-normalized sigmoids in that case

    (e.g. rather than softmax that is the case where alpha = 1.0).

    opened by kayuksel 8
  • Unexpected behaviour of sparsemax gradients for 3d tensors

    Hi folks!

    It seems like the gradients of sparsemax are not the same when we have two "equal" tensors: one 2d, and the other with a time dimension.

    Here is the code to reproduce the problem:

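    import torch
    import entmax

    def test_map_fn(activation_fn):
        x = torch.tensor([[-2, 0, 0.5], [0.1, 2, -0.4]], requires_grad=True)
        # >>> x.shape
        # torch.Size([2, 3])
        a_2d = activation_fn(x, dim=-1)
        z_2d = torch.sum(torch.pow(a_2d, 2))
        z_2d.backward()  # populate x.grad before reading it
        grad_2d = x.grad
        x = torch.tensor([[[-2, 0, 0.5]], [[0.1, 2, -0.4]]], requires_grad=True)
        # >>> x.shape
        # torch.Size([2, 1, 3])
        a_3d = activation_fn(x, dim=-1)
        z_3d = torch.sum(torch.pow(a_3d, 2))
        z_3d.backward()  # populate x.grad before reading it
        grad_3d = x.grad
        print('Ok acts:', torch.allclose(a_2d.squeeze(), a_3d.squeeze()))
        print('Ok grads:', torch.allclose(grad_2d.squeeze(), grad_3d.squeeze()))
        print(grad_2d.squeeze())
        print(grad_3d.squeeze())

    if __name__ == '__main__':
        # Assumed call order, inferred from the three result groups in the output below.
        test_map_fn(torch.softmax)
        test_map_fn(entmax.entmax15)
        test_map_fn(entmax.sparsemax)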

    The output of this code is:

    Ok acts: True
    Ok grads: True
    tensor([[-0.0421, -0.0883,  0.1304],
            [-0.1325,  0.2198, -0.0873]])
    tensor([[-0.0421, -0.0883,  0.1304],
            [-0.1325,  0.2198, -0.0873]])
    Ok acts: True
    Ok grads: True
    tensor([[ 0.0000, -0.2344,  0.2344],
            [-0.0926,  0.0926,  0.0000]])
    tensor([[ 0.0000, -0.2344,  0.2344],
            [-0.0926,  0.0926,  0.0000]])
    Ok acts: True
    Ok grads: False
    tensor([[ 0.0000, -0.5000,  0.5000],
            [ 0.0000,  0.0000,  0.0000]])
    tensor([[ 0., -2.,  0.],
            [ 0.,  1.,  0.]])

    So, using sparsemax, the grads of the two tensors are different. Note: a quick fix that seems to work in practice is to call tensor.view(-1, nb_labels) to get a 2d tensor.
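To judge which of the two sparsemax results is the expected one, the gradient can be worked out by hand. The sketch below is pure Python with hypothetical helper names and is not the package's code. It uses the fact that the sparsemax Jacobian maps an upstream gradient g to g minus its mean over the support, and to zero outside the support.

```python
def sparsemax_ref(scores):
    """Illustrative sparsemax for one list of floats (hypothetical helper)."""
    srt = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, value in enumerate(srt, start=1):
        cumsum += value
        candidate = (cumsum - 1.0) / k
        if value > candidate:
            tau = candidate
        else:
            break
    return [max(s - tau, 0.0) for s in scores]


def sparsemax_backward_ref(scores, upstream):
    """Gradient w.r.t. the scores, given the gradient w.r.t. the output."""
    p = sparsemax_ref(scores)
    in_support = [v > 0.0 for v in p]
    size = sum(in_support)
    mean = sum(u for u, s in zip(upstream, in_support) if s) / size
    return [(u - mean) if s else 0.0 for u, s in zip(upstream, in_support)]


def grad_sum_of_squares(scores):
    """Gradient of sum(sparsemax(scores) ** 2) w.r.t. the scores."""
    p = sparsemax_ref(scores)
    return sparsemax_backward_ref(scores, [2.0 * v for v in p])


rows = [[-2.0, 0.0, 0.5], [0.1, 2.0, -0.4]]
grads = [grad_sum_of_squares(r) for r in rows]
print(grads)  # approximately [[0.0, -0.5, 0.5], [0.0, 0.0, 0.0]]
```

Under these assumptions the hand-computed values agree with the 2d gradients printed above, which suggests the 3d path was the one producing incorrect values.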

    opened by mtreviso 8
  • Release patch 1.0.1 with torch install_requires fix

    Thanks for creating this useful library. We recently included it as part of our low code toolkit, Ludwig. However, we ran into an issue whereby if the user does not have torch already installed before installing entmax, it raises an exception:

    × python egg_info did not run successfully.
      │ exit code: 1
      ╰─> [10 lines of output]
          Traceback (most recent call last):
            File "<string>", line 36, in <module>
            File "<pip-setuptools-caller>", line 34, in <module>
            File "/tmp/pip-install-xasbhx2w/entmax_64da4068d2414a04a5c3adc7187695b4/", line 2, in <module>
              from entmax import __version__
            File "/tmp/pip-install-xasbhx2w/entmax_64da4068d2414a04a5c3adc7187695b4/entmax/", line 3, in <module>
              from entmax.activations import sparsemax, entmax15, Sparsemax, Entmax15
            File "/tmp/pip-install-xasbhx2w/entmax_64da4068d2414a04a5c3adc7187695b4/entmax/", line 13, in <module>
              import torch
          ModuleNotFoundError: No module named 'torch'
          [end of output]
      note: This error originates from a subprocess, and is likely not a problem with pip.
    error: metadata-generation-failed

    Looks like this was fixed some time back here, but this change was made after the v1.0 release, meaning the current production release has this bug. Can you create a patch release v1.0.1 that includes this fix?


    opened by tgaddair 5
  • `entmax_bisect` is not stable around `alpha=1`

    First of all, thanks for a great library! It's very nicely implemented!

    From the documentation provided in the code, I understood that entmax_bisect should behave like softmax when alpha is set to 1.

    I've done some experiments and the results seem to be different from softmax when the alpha is equal to 1. Yet, when it's close to 1 it approximates the softmax behavior.

    Here is the code snippet for a very small example:

    import torch
    from entmax import entmax_bisect
    torch.softmax(torch.Tensor([0., 0., 1.]), dim=-1)                        # tensor([0.2119, 0.2119, 0.5761])
    entmax_bisect(torch.Tensor([0., 0., 1.]), dim=-1, alpha=0.9)             # tensor([0.2195, 0.2195, 0.5611])
    entmax_bisect(torch.Tensor([0., 0., 1.]), dim=-1, alpha=0.95)            # tensor([0.2157, 0.2157, 0.5687])
    entmax_bisect(torch.Tensor([0., 0., 1.]), dim=-1, alpha=0.99)            # tensor([0.2127, 0.2127, 0.5747])
    entmax_bisect(torch.Tensor([0., 0., 1.]), dim=-1, alpha=0.999999)        # tensor([0.2119, 0.2119, 0.5761])
    entmax_bisect(torch.Tensor([0., 0., 1.]), dim=-1, alpha=1)               # tensor([0.3333, 0.3333, 0.3333]) <--
    entmax_bisect(torch.Tensor([0., 0., 1.]), dim=-1, alpha=1.00001)         # tensor([0.2119, 0.2119, 0.5761])
    entmax_bisect(torch.Tensor([0., 0., 1.]), dim=-1, alpha=1.1)             # tensor([0.1985, 0.1985, 0.6031])

    Is this a bug or is it the intended behavior? I think it's not very clear from the documentation.
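One way to see why alpha = 1 is special: the entmax mapping is defined through the exponent 1 / (alpha - 1), which is undefined at exactly alpha = 1, and softmax is only recovered in the limit. The pure-Python sketch below shows a possible user-side workaround that dispatches to softmax when alpha is within a small tolerance of 1. All helper names and the tolerance are assumptions for illustration, the sketch covers alpha >= 1 only, and it is not how the package handles this case.

```python
import math


def softmax_ref(scores):
    """Plain softmax for one list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entmax_bisect_ref(scores, alpha, n_iter=60):
    """Illustrative alpha-entmax by bisection; requires alpha > 1."""
    beta = alpha - 1.0
    scaled = [beta * s for s in scores]
    top = max(scaled)
    tau_lo = top - 1.0                          # total mass >= 1 here
    tau_hi = top - (1.0 / len(scores)) ** beta  # total mass <= 1 here

    def probs(tau):
        return [max(s - tau, 0.0) ** (1.0 / beta) for s in scaled]

    for _ in range(n_iter):
        tau_mid = 0.5 * (tau_lo + tau_hi)
        if sum(probs(tau_mid)) >= 1.0:
            tau_lo = tau_mid
        else:
            tau_hi = tau_mid
    p = probs(0.5 * (tau_lo + tau_hi))
    total = sum(p)
    return [v / total for v in p]


def entmax_or_softmax(scores, alpha, tol=1e-6):
    """Hypothetical wrapper: use softmax when alpha is numerically 1."""
    if abs(alpha - 1.0) < tol:
        return softmax_ref(scores)
    if alpha < 1.0:
        raise ValueError("this sketch only covers alpha >= 1")
    return entmax_bisect_ref(scores, alpha)


x = [0.0, 0.0, 1.0]
print([round(v, 4) for v in entmax_or_softmax(x, 1.0)])      # softmax branch
print([round(v, 4) for v in entmax_or_softmax(x, 1.00001)])  # bisection branch
```

With this guard the two calls give nearly identical outputs, in line with the values reported above for alpha close to 1.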

    opened by MartinXPN 3
  • A bug when alpha = 1 for entmax_bisect?

    When given the parameter alpha, the function "entmax_bisect" can produce results matching softmax (alpha = 1), entmax15 (alpha = 1.5), and sparsemax (alpha = 2). But when I try alpha = 1, it gives wrong results in which all the numbers are the same. When I set alpha = 0.99999 or 1.00001, it works well, and it also works well for other values of alpha such as 2 and 1.5. So is this a bug, or am I just using it wrongly? Thank you a lot!

    opened by mysteriouslfz 3
  • Usage of alpha

    May I know if we need to define a new trainable parameter for each head per layer for the alpha value? Could anyone be kind enough to show a simple example of how it could be used in a normal transformer?


    opened by alibabadoufu 3
  • Fix


    • Avoid importing the package in to get its version (this will fail if dependencies are not installed). See this.
    • Use setuptools instead of distutils to make dependencies work (distutils doesn't actually support install_requires and python_requires).
    opened by cifkao 2
  • Do not import entmax => torch in

    Currently entmax/ is imported during install, which imports torch. What this means is that if the user of this package, or anything which depends upon it, doesn't already have torch installed, the install will fail.

    Following the advice from it looks like you can now get the same effect using setup.cfg and it will avoid actually executing the code but instead pull it from the AST.

    opened by frankier 1
  • Index -1 is out of bounds

    Hi! I am training a language model similar to the one in the Sparse Text Generation project, with a custom input format. When I start training, it cannot calculate an entmax loss. My inputs and labels both have shape (batch_size, seq_len) before going into the loss; afterwards they are (batch_size*seq_len, vocab_size) and (batch_size*seq_len,) respectively. I use masking via -1 in the labels, and even though I set ignore_index=-1, my log is:

    Traceback (most recent call last):
      File "", line 782, in <module>
        main()
      File "", line 736, in main
        global_step, tr_loss = train(args, train_dataset, model, tokenizer, gen_func)
      File "", line 300, in train
        outputs = model(inputs, labels=labels)
      File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/", line 880, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/app/src/pytorch_transformers/", line 607, in forward
        loss = self.loss(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
      File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/", line 880, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/entmax/", line 17, in forward
        loss = self.loss(X, target)
      File "/usr/local/lib/python3.6/dist-packages/entmax/", line 278, in loss
        return entmax_bisect_loss(X, target, self.alpha, self.n_iter)
      File "/usr/local/lib/python3.6/dist-packages/entmax/", line 242, in entmax_bisect_loss
        return EntmaxBisectLossFunction.apply(X, target, alpha, n_iter)
      File "/usr/local/lib/python3.6/dist-packages/entmax/", line 129, in forward
        ctx, X, target, alpha, proj_args=dict(n_iter=n_iter)
      File "/usr/local/lib/python3.6/dist-packages/entmax/", line 45, in forward
        p_star.scatter_add_(1, target.unsqueeze(1), torch.full_like(p_star, -1))
    RuntimeError: index -1 is out of bounds for dimension 1 with size 50257

    How to fix this?

    UPD: I realized that the problem is not connected with ignore_index, but with a shape mismatch between target and p_star in the forward method of the _GenericLossFunction class. I still don't know how to fix this bug, so please help me if somebody knows how :)

    opened by liehtman 0
  • Entmax fails when all inputs are -inf

    When all inputs to entmax are -inf, it fails with

    RuntimeError                              Traceback (most recent call last)
    <ipython-input-404-217bd9c1ced2> in <module>
          1 from entmax import entmax15
          2 logits = torch.ones(10) * float('-inf')
    ----> 3 entmax15(logits)
    ~/.virtualenvs/sparseref/lib/python3.7/site-packages/entmax/ in entmax15(X, dim, k)
        254     """
    --> 256     return Entmax15Function.apply(X, dim, k)
    ~/.virtualenvs/sparseref/lib/python3.7/site-packages/entmax/ in forward(cls, ctx, X, dim, k)
        176         X = X / 2  # divide by 2 to solve actual Entmax
    --> 178         tau_star, _ = _entmax_threshold_and_support(X, dim=dim, k=k)
        180         Y = torch.clamp(X - tau_star, min=0) ** 2
    ~/.virtualenvs/sparseref/lib/python3.7/site-packages/entmax/ in _entmax_threshold_and_support(X, dim, k)
        130     support_size = (tau <= Xsrt).sum(dim).unsqueeze(dim)
    --> 131     tau_star = tau.gather(dim, support_size - 1)
        133     if k is not None and k < X.shape[dim]:
    RuntimeError: index -1 is out of bounds for dimension 0 with size 10

    A minimal snippet to reproduce this behavior is

    import torch
    from entmax import entmax15
    logits = torch.ones(10) * float('-inf')
    entmax15(logits)  # raises the RuntimeError shown above

    For reference, torch.softmax will return a tensor of nan's. This is certainly a corner case, but sometimes padding may create -inf-only inputs and it's easier to deal with nan's later.

    [This is possibly related to #9 ]

    opened by erickrf 1
  • Problem with sparse activations

    I just replaced the softmax function with the sparsemax function or the tsallis15 function in my transformer model. It works well during training, but the following error occurs during the testing phase: RuntimeError: CUDA error: device-side assert triggered

    If I replace it with softmax function again, it works.

    What could be the cause?

    opened by zylm 9
  • entmax implementation for TensorFlow 2

    Dear team, Thank you for your great work.

    Could you please add support for TensorFlow 2, as I need it for many projects? Do you have any plans to do so?

    opened by deepgradient 1
  • Sparse losses return nan when there is -inf in the input

    The sparse loss functions (and their equivalent classes) return nans when there is -inf in the input.


    import torch
    import numpy as np
    from entmax import entmax15_loss, sparsemax_loss
    x = torch.rand(10, 5)
    y = torch.randint(0, 4, [10])
    x[:, 4] = -np.inf
    entmax15_loss(x, y) 
    # tensor([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan])
    sparsemax_loss(x, y)
    # tensor([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan])
    opened by erickrf 2
  • ignore_index in loss functions

    This pull request implements two changes to make the loss function API closer to nn.CrossEntropyLoss.

    • Treating ignore_index. The common use case is to use a pseudo class id such as -1 in the target tensor to indicate padding positions (or any samples that should be ignored in the loss computation). This PR filters out ignore_index from the target tensor before computing the actual loss; the current implementation does not do this.

    • Adding reduction mean as a synonym to elementwise_mean. The latter has been deprecated in nn.CrossEntropyLoss in favor of mean.
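The filtering idea described above can be sketched in pure Python, using the sparsemax loss for concreteness. The helper names are hypothetical, and the default `ignore_index=-100` follows `nn.CrossEntropyLoss`. Averaging over the kept entries only is an assumption about the intended `mean` reduction, not a statement of what this pull request implements.

```python
def sparsemax_loss_ref(scores, target):
    """Illustrative sparsemax loss for one example (hypothetical helper)."""
    srt = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    support = 0
    for k, value in enumerate(srt, start=1):
        cumsum += value
        candidate = (cumsum - 1.0) / k
        if value > candidate:
            tau, support = candidate, k
        else:
            break
    in_support = srt[:support]
    # -x_y + 1/2 * sum over the support of (x_j^2 - tau^2) + 1/2
    return (-scores[target]
            + 0.5 * sum(v * v - tau * tau for v in in_support)
            + 0.5)


def mean_loss_ignoring(batch_scores, targets, ignore_index=-100):
    """Drop ignored targets first, then average the per-example losses."""
    kept = [(x, y) for x, y in zip(batch_scores, targets) if y != ignore_index]
    if not kept:
        return 0.0  # choice made for this sketch when every target is ignored
    return sum(sparsemax_loss_ref(x, y) for x, y in kept) / len(kept)


batch = [[0.1, 2.0, -0.4], [-2.0, 0.0, 0.5], [1.0, 1.0, 1.0]]
labels = [1, 2, -1]  # the last position is padding
print(mean_loss_ignoring(batch, labels, ignore_index=-1))  # approximately 0.03125
```

Filtering before the loss is computed means the padding id is never used as a class index, which is the failure shown in the "index -1 is out of bounds" report.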

    opened by erickrf 0
  • v1.1 (Dec 3, 2022)

    Among various small fixes, this version should fix an installation bug that caused installation to fail if torch was not already installed.

Deep Structured Prediction in NLP