The entmax mapping and its loss, a family of sparse softmax alternatives.

Overview

entmax


This package provides a PyTorch implementation of entmax and entmax losses: a sparse family of probability mappings and corresponding loss functions, generalizing softmax / cross-entropy.

Features:

  • Exact partial-sort algorithms for 1.5-entmax and 2-entmax (sparsemax).
  • A bisection-based algorithm for generic alpha-entmax.
  • Gradients w.r.t. alpha for adaptive, learned sparsity!

Requirements: Python 3, PyTorch >= 1.0 (and pytest for unit tests)

Example

In [1]: import torch

In [2]: from torch.nn.functional import softmax

In [3]: from entmax import sparsemax, entmax15, entmax_bisect

In [4]: x = torch.tensor([-2, 0, 0.5])

In [5]: softmax(x, dim=0)
Out[5]: tensor([0.0486, 0.3592, 0.5922])

In [6]: sparsemax(x, dim=0)
Out[6]: tensor([0.0000, 0.2500, 0.7500])

In [7]: entmax15(x, dim=0)
Out[7]: tensor([0.0000, 0.3260, 0.6740])

Gradients w.r.t. alpha (continued):

In [1]: from torch.autograd import grad

In [2]: x = torch.tensor([[-1, 0, 0.5], [1, 2, 3.5]])

In [3]: alpha = torch.tensor(1.33, requires_grad=True)

In [4]: p = entmax_bisect(x, alpha)

In [5]: p
Out[5]:
tensor([[0.0460, 0.3276, 0.6264],
        [0.0026, 0.1012, 0.8963]], grad_fn=<EntmaxBisectFunctionBackward>)

In [6]: grad(p[0, 0], alpha)
Out[6]: (tensor(-0.2562),)
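
The package also exports loss functions paired with these mappings, such as sparsemax_loss and entmax15_loss. A minimal sketch, assuming the functional losses take (scores, target) with integer class targets and return one loss value per example:

In [7]: from entmax import entmax15_loss

In [8]: scores = torch.randn(4, 6, requires_grad=True)

In [9]: target = torch.tensor([0, 2, 5, 1])

In [10]: entmax15_loss(scores, target).mean().backward()  # reduce the per-example losses by hand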

Installation

pip install entmax

Citations

Sparse Sequence-to-Sequence Models

@inproceedings{entmax,
  author    = {Peters, Ben and Niculae, Vlad and Martins, Andr{\'e} FT},
  title     = {Sparse Sequence-to-Sequence Models},
  booktitle = {Proc. ACL},
  year      = {2019},
  url       = {https://www.aclweb.org/anthology/P19-1146}
}

Adaptively Sparse Transformers

@inproceedings{correia19adaptively,
  author    = {Correia, Gon\c{c}alo M and Niculae, Vlad and Martins, Andr{\'e} FT},
  title     = {Adaptively Sparse Transformers},
  booktitle = {Proc. EMNLP-IJCNLP},
  year      = {2019},
}

Comments
  • entmax_bisect leads to loss becoming nan

    Hi,

    I've used several different strategies with attention. I tried entmax on a small batch and it works well, but somewhere during training on the full dataset my loss becomes NaN. The behavior is irregular: for some epochs I don't get it, but most of the time my loss ends up as NaN. Can you please suggest some ways this could be fixed? nn.Softmax works fine.

    opened by prajjwal1 16
  • Gradient wrt alpha

    Added gradients wrt alpha in the EntmaxBisectFunction class and the wrapper entmax_bisect. The extra computation is only performed if the alpha argument is a tensor with requires_grad=True.

    Todos

    • [x] reference @goncalomcorreia's paper in readme
    • [ ] link to camera ready (when out)
    • [x] add a general extension to nd tensors and dim argument. I think this can be done inside entmax_bisect using views and unsqueeze. The requirement probably should be, if X.shape = (m_1, m_2, ..., m_k) and dim=d then alpha.shape must be (m_1, ..., m_{d-1}, m_{d+1}, ..., m_k). (Weight sharing along certain dimensions can be achieved using torch.expand in user code)
    • [x] refactor losses
    • [x] add example for getting gradient wrt alpha in readme
    opened by vene 14
  • Errors when using the loss function

    The error shows that, when computing the loss, the size of 'target' is not the same as 'p_star'. https://github.com/deep-spin/entmax/blob/master/entmax/losses.py#L156 Should it be switched to index_add_? Any hint?

    Pytorch version: '0.4.1.post2'

    Thanks

    opened by berlino 9
  • Alpha value less than one?

    Can the alpha value be less than one?

    I basically need it to be sum-normalized sigmoids in that case (as opposed to softmax, which is the case where alpha = 1.0).

    opened by kayuksel 8
  • Unexpected behaviour of sparsemax gradients for 3d tensors

    Hi folks!

    It seems like the gradients of sparsemax are not the same when we have two "equal" tensors: one 2d, and the other with a time dimension.

    Here is the code to reproduce the problem:

    import torch
    import entmax
    
    
    def test_map_fn(activation_fn):
        x = torch.tensor([[-2, 0, 0.5], [0.1, 2, -0.4]], requires_grad=True)
        # >>> x.shape
        # torch.Size([2, 3])
        a_2d = activation_fn(x, dim=-1)
        z_2d = torch.sum(torch.pow(a_2d, 2))
        z_2d.backward()
        grad_2d = x.grad
    
        x = torch.tensor([[[-2, 0, 0.5]], [[0.1, 2, -0.4]]], requires_grad=True)
        # >>> x.shape
        # torch.Size([2, 1, 3])
        a_3d = activation_fn(x, dim=-1)
        z_3d = torch.sum(torch.pow(a_3d, 2))
        z_3d.backward()
        grad_3d = x.grad
    
        print(activation_fn.__name__)
        print('Ok acts:', torch.allclose(a_2d.squeeze(), a_3d.squeeze()))
        print('Ok grads:', torch.allclose(grad_2d.squeeze(), grad_3d.squeeze()))
        print(grad_2d.squeeze())
        print(grad_3d.squeeze())
        print('---\n')
    
    
    if __name__ == '__main__':
        test_map_fn(torch.softmax)
        test_map_fn(entmax.entmax15)
        test_map_fn(entmax.sparsemax)
    

    The output of this code is:

    softmax
    Ok acts: True
    Ok grads: True
    tensor([[-0.0421, -0.0883,  0.1304],
            [-0.1325,  0.2198, -0.0873]])
    tensor([[-0.0421, -0.0883,  0.1304],
            [-0.1325,  0.2198, -0.0873]])
    ---
    
    entmax15
    Ok acts: True
    Ok grads: True
    tensor([[ 0.0000, -0.2344,  0.2344],
            [-0.0926,  0.0926,  0.0000]])
    tensor([[ 0.0000, -0.2344,  0.2344],
            [-0.0926,  0.0926,  0.0000]])
    ---
    
    sparsemax
    Ok acts: True
    Ok grads: False
    tensor([[ 0.0000, -0.5000,  0.5000],
            [ 0.0000,  0.0000,  0.0000]])
    tensor([[ 0., -2.,  0.],
            [ 0.,  1.,  0.]])
    ---
    

    So, using sparsemax, the grads of the two tensors are different. Note: it seems that a quick fix of calling tensor.view(-1, nb_labels) to get a 2d tensor before sparsemax works fine in practice, as sketched below.
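
    A minimal sketch of that workaround, flattening the leading dimensions so sparsemax sees a 2d input and then restoring the shape (the variable names here are illustrative):

    import torch
    import entmax

    x = torch.randn(2, 1, 3, requires_grad=True)  # (batch, time, nb_labels)
    nb_labels = x.shape[-1]
    # flatten to 2d, apply sparsemax over the labels, then reshape back
    p = entmax.sparsemax(x.view(-1, nb_labels), dim=-1).view_as(x)
    torch.sum(torch.pow(p, 2)).backward()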

    opened by mtreviso 8
  • Release patch 1.0.1 with torch install_requires fix

    Thanks for creating this useful library. We recently included it as part of our low-code toolkit, Ludwig. However, we ran into an issue: if the user does not have torch installed before installing entmax, the install raises an exception:

    × python setup.py egg_info did not run successfully.
      │ exit code: 1
      ╰─> [10 lines of output]
          Traceback (most recent call last):
            File "<string>", line 36, in <module>
            File "<pip-setuptools-caller>", line 34, in <module>
            File "/tmp/pip-install-xasbhx2w/entmax_64da4068d2414a04a5c3adc7187695b4/setup.py", line 2, in <module>
              from entmax import __version__
            File "/tmp/pip-install-xasbhx2w/entmax_64da4068d2414a04a5c3adc7187695b4/entmax/__init__.py", line 3, in <module>
              from entmax.activations import sparsemax, entmax15, Sparsemax, Entmax15
            File "/tmp/pip-install-xasbhx2w/entmax_64da4068d2414a04a5c3adc7187695b4/entmax/activations.py", line 13, in <module>
              import torch
          ModuleNotFoundError: No module named 'torch'
          [end of output]
      
      note: This error originates from a subprocess, and is likely not a problem with pip.
    error: metadata-generation-failed
    

    Looks like this was fixed some time back here, but this change was made after the v1.0 release, meaning the current production release has this bug. Can you create a patch release v1.0.1 that includes this fix?

    Thanks.

    opened by tgaddair 5
  • `entmax_bisect` is not stable around `alpha=1`

    First of all, thanks for a great library! It's very nicely implemented!

    From the documentation provided in the code, I understood that entmax_bisect should behave like softmax when alpha is set to 1.

    I've done some experiments, and the results differ from softmax when alpha is exactly 1; yet when alpha is close to 1, the output approximates the softmax behavior.

    Here is the code snippet for a very small example:

    import torch
    from entmax import entmax_bisect
    
    
    torch.softmax(torch.Tensor([0., 0., 1.]), dim=-1)                        # tensor([0.2119, 0.2119, 0.5761])
    
    entmax_bisect(torch.Tensor([0., 0., 1.]), dim=-1, alpha=0.9)             # tensor([0.2195, 0.2195, 0.5611])
    entmax_bisect(torch.Tensor([0., 0., 1.]), dim=-1, alpha=0.95)            # tensor([0.2157, 0.2157, 0.5687])
    entmax_bisect(torch.Tensor([0., 0., 1.]), dim=-1, alpha=0.99)            # tensor([0.2127, 0.2127, 0.5747])
    entmax_bisect(torch.Tensor([0., 0., 1.]), dim=-1, alpha=0.999999)        # tensor([0.2119, 0.2119, 0.5761])
    entmax_bisect(torch.Tensor([0., 0., 1.]), dim=-1, alpha=1)               # tensor([0.3333, 0.3333, 0.3333]) <--
    entmax_bisect(torch.Tensor([0., 0., 1.]), dim=-1, alpha=1.00001)         # tensor([0.2119, 0.2119, 0.5761])
    entmax_bisect(torch.Tensor([0., 0., 1.]), dim=-1, alpha=1.1)             # tensor([0.1985, 0.1985, 0.6031])
    

    Is this a bug or is it the intended behavior? I think it's not very clear from the documentation.
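
    A hedged workaround in the meantime is to special-case alpha == 1 and fall back to torch.softmax, since alpha = 1 is the softmax limit of the family; this is a sketch, not the library's own handling:

    import torch
    from entmax import entmax_bisect

    def entmax_or_softmax(x, alpha, dim=-1):
        # alpha == 1 corresponds to softmax; dispatch to torch.softmax explicitly
        if float(alpha) == 1.0:
            return torch.softmax(x, dim=dim)
        return entmax_bisect(x, dim=dim, alpha=alpha)

    entmax_or_softmax(torch.tensor([0., 0., 1.]), alpha=1)    # matches softmax
    entmax_or_softmax(torch.tensor([0., 0., 1.]), alpha=1.5)  # sparse output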

    opened by MartinXPN 3
  • A bug when alpha = 1 for entmax_bisect?

    For the function entmax_bisect, given the parameter alpha, it can produce results like softmax (alpha = 1), entmax15 (alpha = 1.5), and sparsemax (alpha = 2). But when I try alpha = 1, it gives wrong results where all the numbers are the same. When I set alpha = 0.99999 or 1.00001, it works well, and other alpha values, like 2 and 1.5, also work well. So is this a bug, or am I just using it wrongly? Thank you a lot!

    opened by mysteriouslfz 3
  • Usage of alpha

    Hi,

    May I know if we need to define a new trainable parameter for each head per layer for the alpha value? Could anyone be kind enough to show a simple example of how it could be used in a normal transformer?
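
    For illustration, here is a hedged sketch of one possible setup, with one learnable alpha per head constrained to (1, 2); the module name, shapes, and the way alpha is broadcast are assumptions for this sketch, not the library's documented API:

    import torch
    import torch.nn as nn
    from entmax import entmax_bisect

    class PerHeadEntmax(nn.Module):
        """Hypothetical sketch: one learnable alpha per attention head."""

        def __init__(self, num_heads):
            super().__init__()
            # raw parameter mapped through a sigmoid so alpha stays in (1, 2);
            # sigmoid(0) = 0.5, so training starts at 1.5-entmax
            self.alpha_raw = nn.Parameter(torch.zeros(num_heads))

        def forward(self, scores):
            # scores: (batch, heads, query_len, key_len); normalize over keys
            alpha = 1.0 + torch.sigmoid(self.alpha_raw)
            # assumption: alpha must be expandable to the scores' shape with a
            # singleton along the normalized dim (check the entmax_bisect docstring)
            alpha = alpha.view(1, -1, 1, 1)
            return entmax_bisect(scores, alpha=alpha, dim=-1)

    attn = PerHeadEntmax(num_heads=4)
    weights = attn(torch.randn(2, 4, 5, 5))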

    Thanks!

    opened by alibabadoufu 3
  • Fix setup.py

    • Avoid importing the package in setup.py to get its version (this will fail if dependencies are not installed). See this.
    • Use setuptools instead of distutils to make dependencies work (distutils doesn't actually support install_requires and python_requires).
    opened by cifkao 2
  • Do not import entmax => torch in setup.py

    Currently entmax/__init__.py is imported during install, which imports torch. What this means is that if the user of this package, or anything which depends upon it, doesn't already have torch installed, the install will fail.

    Following the advice from https://packaging.python.org/en/latest/guides/single-sourcing-package-version/#single-sourcing-the-package-version, it looks like you can now get the same effect using setup.cfg, which avoids actually executing the code and instead pulls the version from the AST.
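
    Another common approach from the same packaging guide is to read the version out of the source file with a regex in setup.py, so the package (and hence torch) is never imported at install time. A hedged sketch, which may differ from the fix actually adopted in this repo:

    import re
    from setuptools import find_packages, setup

    # read __version__ from entmax/__init__.py as plain text, without importing it
    with open("entmax/__init__.py") as f:
        version = re.search(r"__version__\s*=\s*['\"]([^'\"]+)['\"]", f.read()).group(1)

    setup(
        name="entmax",
        version=version,
        packages=find_packages(),
        install_requires=["torch"],
    )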

    opened by frankier 1
  • Index -1 is out of bounds

    Hi! I am training a language model similar to the one in the Sparse Text Generation project with a custom input format. When I start training, it cannot calculate the entmax loss. My inputs and labels both have shape (batch_size, seq_len) before they go to the loss; afterwards they are (batch_size*seq_len, vocab_size) and (batch_size*seq_len,), respectively. I use masking via -1 in the labels and, despite setting ignore_index=-1, my log is:

    Traceback (most recent call last):
      File "run_lm_finetuning.py", line 782, in <module>
        main()
      File "run_lm_finetuning.py", line 736, in main
        global_step, tr_loss = train(args, train_dataset, model, tokenizer, gen_func)
      File "run_lm_finetuning.py", line 300, in train
        outputs = model(inputs, labels=labels)
      File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 880, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/app/src/pytorch_transformers/modeling_gpt2.py", line 607, in forward
        loss = self.loss(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
      File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 880, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/entmax/losses.py", line 17, in forward
        loss = self.loss(X, target)
      File "/usr/local/lib/python3.6/dist-packages/entmax/losses.py", line 278, in loss
        return entmax_bisect_loss(X, target, self.alpha, self.n_iter)
      File "/usr/local/lib/python3.6/dist-packages/entmax/losses.py", line 242, in entmax_bisect_loss
        return EntmaxBisectLossFunction.apply(X, target, alpha, n_iter)
      File "/usr/local/lib/python3.6/dist-packages/entmax/losses.py", line 129, in forward
        ctx, X, target, alpha, proj_args=dict(n_iter=n_iter)
      File "/usr/local/lib/python3.6/dist-packages/entmax/losses.py", line 45, in forward
        p_star.scatter_add_(1, target.unsqueeze(1), torch.full_like(p_star, -1))
    RuntimeError: index -1 is out of bounds for dimension 1 with size 50257
    

    How to fix this?

    UPD: I realized that the problem is not connected with ignore_index, but with a shape mismatch between target and p_star in the forward method of the _GenericLossFunction class. I still don't know how to fix this bug, so please help me if somebody knows how :)
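
    Until ignore_index is supported inside the packaged losses (see the pull request further down), a hedged workaround is to drop the padding positions from both tensors before computing the loss, shown here with entmax15_loss (the same applies to the bisection loss):

    import torch
    from entmax import entmax15_loss

    # hypothetical shapes: logits (batch*seq_len, vocab_size), labels (batch*seq_len,)
    logits = torch.randn(8, 50257)
    labels = torch.tensor([3, -1, 7, -1, 0, 1, 2, -1])  # -1 marks padding
    keep = labels != -1
    loss = entmax15_loss(logits[keep], labels[keep]).mean()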

    opened by liehtman 0
  • Entmax fails when all inputs are -inf

    When all inputs to entmax are -inf, it fails with

    RuntimeError                              Traceback (most recent call last)
    <ipython-input-404-217bd9c1ced2> in <module>
          1 from entmax import entmax15
          2 logits = torch.ones(10) * float('-inf')
    ----> 3 entmax15(logits)
    
    ~/.virtualenvs/sparseref/lib/python3.7/site-packages/entmax/activations.py in entmax15(X, dim, k)
        254     """
        255 
    --> 256     return Entmax15Function.apply(X, dim, k)
        257 
        258 
    
    ~/.virtualenvs/sparseref/lib/python3.7/site-packages/entmax/activations.py in forward(cls, ctx, X, dim, k)
        176         X = X / 2  # divide by 2 to solve actual Entmax
        177 
    --> 178         tau_star, _ = _entmax_threshold_and_support(X, dim=dim, k=k)
        179 
        180         Y = torch.clamp(X - tau_star, min=0) ** 2
    
    ~/.virtualenvs/sparseref/lib/python3.7/site-packages/entmax/activations.py in _entmax_threshold_and_support(X, dim, k)
        129 
        130     support_size = (tau <= Xsrt).sum(dim).unsqueeze(dim)
    --> 131     tau_star = tau.gather(dim, support_size - 1)
        132 
        133     if k is not None and k < X.shape[dim]:
    
    RuntimeError: index -1 is out of bounds for dimension 0 with size 10
    

    A minimal snippet to reproduce this behavior is

    from entmax import entmax15
    logits = torch.ones(10) * float('-inf')
    entmax15(logits)
    

    For reference, torch.softmax will return a tensor of nan's. This is certainly a corner case, but sometimes padding may create -inf-only inputs and it's easier to deal with nan's later.

    [This is possibly related to #9 ]

    opened by erickrf 1
  • Problem with sparse activations

    I just replaced the softmax function with the sparsemax function or the tsallis15 function in my transformer model. It works well during the training stage, but the following error occurs during the testing phase: RuntimeError: CUDA error: device-side assert triggered

    If I replace it with softmax function again, it works.

    What could be the cause?

    opened by zylm 9
  • entmax implementation for TensorFlow 2

    Dear team, Thank you for your great work.

    Could you please add support for TensorFlow 2? I need it for many projects. Do you have any plans to do so?

    opened by deepgradient 1
  • Sparse losses return nan when there is -inf in the input

    The sparse loss functions (and their equivalent classes) return nans when there is -inf in the input.

    Example:

    import torch
    import numpy as np
    from entmax import entmax15_loss, sparsemax_loss
    x = torch.rand(10, 5)
    y = torch.randint(0, 4, [10])
    x[:, 4] = -np.inf
    entmax15_loss(x, y) 
    # tensor([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan])
    
    sparsemax_loss(x, y)
    # tensor([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan])
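
    A hedged workaround until this is handled inside the losses: replace -inf with a large negative finite value, so the masked class still gets zero probability while the loss stays finite:

    import torch
    import numpy as np
    from entmax import entmax15_loss, sparsemax_loss

    x = torch.rand(10, 5)
    y = torch.randint(0, 4, [10])
    x[:, 4] = -np.inf
    # swap -inf for a large finite negative score before computing the loss
    x_safe = x.masked_fill(torch.isinf(x), -1e4)
    entmax15_loss(x_safe, y)   # finite per-example losses
    sparsemax_loss(x_safe, y)  # finite per-example losses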
    
    opened by erickrf 2
  • ignore_index in loss functions

    This pull request implements two changes to make the loss function API closer to nn.CrossEntropyLoss.

    • Treating ignore_index. The common use case is to use a pseudo class id such as -1 in the target tensor to indicate padding positions (or any samples that should be ignored in the loss computation). This PR filters out ignore_index from the target tensor before computing the actual loss; the current implementation does not do this.

    • Adding reduction mean as a synonym to elementwise_mean. The latter has been deprecated in nn.CrossEntropyLoss in favor of mean.

    opened by erickrf 0
Releases (v1.1)
  • v1.1 (Dec 3, 2022)

    Among various small fixes, this version should fix a bug that caused installation to fail if torch was not already installed.

Owner
DeepSPIN
Deep Structured Prediction in NLP