Sentence Embeddings with BERT & XLNet

Overview

Sentence Transformers: Multilingual Sentence Embeddings using BERT / RoBERTa / XLM-RoBERTa & Co. with PyTorch

This framework provides an easy method to compute dense vector representations for sentences and paragraphs (also known as sentence embeddings). The models are based on transformer networks like BERT / RoBERTa / XLM-RoBERTa etc. and are tuned specifically to produce meaningful sentence embeddings, such that sentences with similar meanings are close in vector space.

We provide an increasing number of state-of-the-art pretrained models for more than 100 languages, fine-tuned for various use-cases.

Further, this framework allows easy fine-tuning of custom embedding models to achieve maximal performance on your specific task.

For the full documentation, see www.SBERT.net, as well as our publications listed under Citing & Authors below.

Installation

We recommend Python 3.6 or higher, PyTorch 1.6.0 or higher and transformers v3.1.0 or higher. The code does not work with Python 2.7.
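As a quick sanity check, the minimum versions above can be compared against the version strings of what you have installed. The sketch below is only an illustration: `parse_version` and `meets_minimum` are hypothetical helper names (not part of sentence-transformers), and the parsing only handles simple dotted versions with an optional local suffix such as `+cu101`.

```python
import re
import sys


def parse_version(text):
    """Turn '1.7.0+cu101' into (1, 7, 0); anything after '+' is ignored."""
    public = text.split("+", 1)[0]
    parts = []
    for piece in public.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True if the installed version is at least the recommended minimum."""
    have, need = parse_version(installed), parse_version(minimum)
    width = max(len(have), len(need))
    # Pad with zeros so that '1.6' and '1.6.0' compare as equal.
    have = have + (0,) * (width - len(have))
    need = need + (0,) * (width - len(need))
    return have >= need


# Minimums taken from the recommendation above; the installed versions
# passed in here are example strings, not values read from your system.
print("Python OK:", sys.version_info[:2] >= (3, 6))
print("PyTorch OK:", meets_minimum("1.7.0+cu101", "1.6.0"))
print("transformers OK:", meets_minimum("3.0.2", "3.1.0"))
```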

Install with pip

Install sentence-transformers with pip:

pip install -U sentence-transformers

Install from sources

Alternatively, you can also clone the latest version from the repository and install it directly from the source code:

pip install -e .

PyTorch with CUDA

If you want to use a GPU / CUDA, you must install PyTorch with the matching CUDA version. Follow PyTorch - Get Started for further details on how to install PyTorch.
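Before encoding on a GPU, it can help to confirm that the installed PyTorch build actually sees a CUDA device. A minimal sketch, assuming only `torch.cuda.is_available()` from PyTorch; `pick_device` is a hypothetical helper name, and the import is guarded so the snippet also runs where PyTorch is not installed.

```python
def pick_device():
    """Return 'cuda' if PyTorch can see a GPU, otherwise 'cpu'."""
    try:
        import torch
    except ImportError:
        # PyTorch is missing entirely, so nothing can run on a GPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print("Using device:", device)
```

The resulting string can then be passed as the `device` argument when constructing a `SentenceTransformer`.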

Getting Started

See Quickstart in our documentation.

This example shows you how to use an already trained Sentence Transformer model to embed sentences for another task.

First download a pretrained model.

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('paraphrase-distilroberta-base-v1')

Then provide some sentences to the model.

sentences = ['This framework generates embeddings for each input sentence',
    'Sentences are passed as a list of string.', 
    'The quick brown fox jumps over the lazy dog.']
sentence_embeddings = model.encode(sentences)

And that's it already. We now have a list of numpy arrays with the embeddings.

for sentence, embedding in zip(sentences, sentence_embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")
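Since sentences with similar meanings should end up close in vector space, a common next step is to compare two embeddings with cosine similarity. The sketch below uses plain Python lists and made-up three-dimensional vectors in place of real model output, so it runs without downloading a model; `cosine_similarity` is an illustrative helper written for this example, not the library's own function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Similarity is undefined for a zero vector; report 0.0 here.
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors standing in for rows of `sentence_embeddings`.
emb_a = [0.2, 0.9, 0.1]
emb_b = [0.25, 0.8, 0.05]
emb_c = [-0.7, 0.1, 0.6]

print("a vs b:", cosine_similarity(emb_a, emb_b))
print("a vs c:", cosine_similarity(emb_a, emb_c))
```

With real embeddings, `cosine_similarity(sentence_embeddings[0], sentence_embeddings[1])` should work the same way, because iterating over a numpy array yields its elements.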

Pre-Trained Models

We provide a large list of Pretrained Models for more than 100 languages. Some models are general purpose models, while others produce embeddings for specific use cases. Pre-trained models can be loaded by just passing the model name: SentenceTransformer('model_name').

» Full list of pretrained models

Training

This framework allows you to fine-tune your own sentence embedding models, so that you get task-specific sentence embeddings. You have various options to choose from in order to get the best possible sentence embeddings for your specific task.

See Training Overview for an introduction on how to train your own embedding models. We provide various examples of how to train models on various datasets.

Some highlights are:

  • Support of various transformer networks including BERT, RoBERTa, XLM-R, DistilBERT, Electra, BART, ...
  • Multi-Lingual and multi-task learning
  • Evaluation during training to find optimal model
  • 10+ loss functions that allow you to tune models specifically for semantic search, paraphrase mining, semantic similarity comparison, and clustering, including triplet loss and contrastive loss.
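To make one of the loss functions named above more concrete, here is a minimal sketch of the standard triplet objective: the anchor should be closer to the positive than to the negative by at least a margin. It uses Euclidean distance on plain Python lists. The function names, the margin value, and the toy vectors are illustrative assumptions; the library's own loss classes operate on batches of tensors instead.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(d(anchor, positive) - d(anchor, negative) + margin, 0)."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative) + margin
    return max(gap, 0.0)


anchor = [0.0, 0.0]
positive = [0.0, 1.0]   # distance 1 from the anchor
negative = [3.0, 4.0]   # distance 5 from the anchor

# Well separated: 1 - 5 + 1 is negative, so the loss is clipped to zero.
print(triplet_loss(anchor, positive, negative))  # 0.0
# Roles swapped: 5 - 1 + 1 = 5, so the triplet is penalized.
print(triplet_loss(anchor, negative, positive))  # 5.0
```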

Performance

Our models are evaluated extensively and achieve state-of-the-art performance on various tasks. Further, the code is tuned to provide the highest possible speed.

Model                                STS benchmark   SentEval
Avg. GloVe embeddings                58.02           81.52
BERT-as-a-service avg. embeddings    46.35           84.04
BERT-as-a-service CLS-vector         16.50           84.66
InferSent - GloVe                    68.03           85.59
Universal Sentence Encoder           74.92           85.10

Sentence Transformer Models
nli-bert-base                        77.12           86.37
nli-bert-large                       79.19           87.78
stsb-bert-base                       85.14           86.07
stsb-bert-large                      85.29           86.66
stsb-roberta-base                    85.44           -
stsb-roberta-large                   86.39           -
stsb-distilbert-base                 85.16           -
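STS benchmark results are commonly reported as the Spearman rank correlation between the cosine similarity of sentence pairs and human gold scores, multiplied by 100. Assuming that convention applies to the table above (the table itself does not state the metric), the computation can be sketched as follows. The helper names and the toy numbers are illustrative only, and the code assumes neither input list is constant.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented example: model similarities vs. human scores for four pairs.
predicted = [0.9, 0.1, 0.5, 0.7]
gold = [4.8, 0.5, 2.0, 3.9]
print("Score:", round(100 * spearman(predicted, gold), 2))
```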

Application Examples

You can use this framework for semantic search, paraphrase mining, semantic similarity comparison, clustering, and many more use-cases.

For all examples, see examples/applications.
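As one illustration, semantic search (mentioned in the training highlights above) reduces to ranking corpus embeddings by their similarity to a query embedding. The sketch below assumes the embeddings have already been computed; the vectors are invented toy values, and `cosine_similarity` and `semantic_search` are hypothetical helpers written for this example rather than the library's own search utility.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_search(query_embedding, corpus_embeddings, top_k=2):
    """Return (corpus_index, score) pairs, best match first."""
    scored = [
        (index, cosine_similarity(query_embedding, embedding))
        for index, embedding in enumerate(corpus_embeddings)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


corpus = [
    "A man is eating food.",
    "A monkey is playing drums.",
    "Someone is having a meal.",
]
# Toy embeddings; in practice these would come from model.encode(corpus).
corpus_embeddings = [
    [0.9, 0.1, 0.0],
    [0.0, 0.2, 0.9],
    [0.8, 0.3, 0.1],
]
query_embedding = [1.0, 0.2, 0.0]

for index, score in semantic_search(query_embedding, corpus_embeddings):
    print(corpus[index], round(score, 3))
```

Sorting every score is fine for a small corpus; for large collections a batched or approximate nearest-neighbour approach is usually preferable.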

Citing & Authors

If you find this repository helpful, feel free to cite our publication Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks:

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

If you use one of the multilingual models, feel free to cite our publication Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation:

@inproceedings{reimers-2020-multilingual-sentence-bert,
    title = "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2020",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/2004.09813",
}

If you use the code for data augmentation, feel free to cite our publication Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks:

@article{thakur-2020-AugSBERT,
    title = "Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks",
    author = "Thakur, Nandan and Reimers, Nils and Daxenberger, Johannes and  Gurevych, Iryna", 
    journal= "arXiv preprint arXiv:2010.08240",
    month = "10",
    year = "2020",
    url = "https://arxiv.org/abs/2010.08240",
}

The main contributors of this repository are:

Contact person: Nils Reimers, [email protected]

https://www.ukp.tu-darmstadt.de/

Don't hesitate to send us an e-mail or report an issue, if something is broken (and it shouldn't be) or if you have further questions.

This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.

Comments
  • Is it Multilingual?

    Hello,

    This might be a stupid question, but I wanted to know if I can use the clustering on German sentences. Will it work with the pre-trained model or do I need to train it on German data first?

    Thanks.

    opened by SouravDutta91 44
  • Fine-tune multilingual model for domain specific vocab

    Thanks for the repository and for continuous updates.

    Wanted to check if I understood it correctly: Is it possible to continue fine-tuning one of the multilingual models for a specific domain? For example, can I take 'xlm-r-distilroberta-base-paraphrase-v1' and fine-tune it on domain-related parallel data (English-other languages) with MultipleNegativesRankingLoss?

    opened by langineer 30
  • Is it possible to encode by using multi-GPU?

    Thanks for this beautiful package, it saves a lot of work when doing semantic embedding. I am working with a large database and trying to transform the documents into an embedding matrix. When I ran the code, it seemed to use only a single GPU to encode the sentences. Is there any way to do this with multiple GPUs?

    opened by z307287280 30
  • public.ukp.informatik.tu-darmstadt.de Unreachable

    It looks like the server which hosts the pre-trained models (https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/v0.2/) has been unavailable for a few hours now.

    opened by Ganners 20
  • ModuleNotFoundError: No module named 'sentence_transformers.evaluation'

    After pip installing and trying to import SentenceTransformer I get this error: ModuleNotFoundError: No module named 'sentence_transformers.evaluation'

    When I look into the source code, the only folder I have is models. I am missing evaluation, etc. Any idea why?

    opened by DavidBegert 20
  • Fine-tune underlying language model for SBERT

    Hi,

    I'd like to use the SBERT model architecture for document similarity and topic modelling tasks. However, my data corpus is fairly domain-specific, and I suspect that SBERT will underperform as it was trained on generic Wiki/Library corpora. So, I wonder if there are any recommendations for fine-tuning the underlying language model for SBERT.

    I envision that the overall process will be the following:

    1. Take pre-trained BERT model
    2. Fine tune Language Model on domain-specific corpus
    3. Then retrain SBERT model architecture on specific tasks (e.g. SNLI dataset/task)

    Curious to hear thoughts on the approach and problem definition.

    opened by vdabravolski 18
  • ModuleNotFoundError: No module named 'setuptools.command.build'

    I am trying to pip install sentence transformers on my Macbook Pro with M1 chip. I am using:

    pip install -U sentence-transformers
    

    When I run this, I get this error saying:

    ModuleNotFoundError: No module named 'setuptools.command.build'
    

    Full output:

    Defaulting to user installation because normal site-packages is not writeable
    Collecting sentence-transformers
      Using cached sentence-transformers-2.2.2.tar.gz (85 kB)
      Preparing metadata (setup.py) ... done
    Collecting transformers<5.0.0,>=4.6.0
      Using cached transformers-4.21.0-py3-none-any.whl (4.7 MB)
    Collecting tqdm
      Using cached tqdm-4.64.0-py2.py3-none-any.whl (78 kB)
    Requirement already satisfied: torch>=1.6.0 in ./Library/Python/3.8/lib/python/site-packages (from sentence-transformers) (1.12.0)
    Collecting torchvision
      Using cached torchvision-0.13.0-cp38-cp38-macosx_11_0_arm64.whl (1.2 MB)
    Requirement already satisfied: numpy in ./Library/Python/3.8/lib/python/site-packages (from sentence-transformers) (1.23.1)
    Collecting scikit-learn
      Using cached scikit_learn-1.1.1-cp38-cp38-macosx_12_0_arm64.whl (7.6 MB)
    Collecting scipy
      Using cached scipy-1.8.1-cp38-cp38-macosx_12_0_arm64.whl (28.6 MB)
    Collecting nltk
      Using cached nltk-3.7-py3-none-any.whl (1.5 MB)
    Collecting sentencepiece
      Using cached sentencepiece-0.1.96.tar.gz (508 kB)
      Preparing metadata (setup.py) ... done
    Collecting huggingface-hub>=0.4.0
      Using cached huggingface_hub-0.8.1-py3-none-any.whl (101 kB)
    Collecting requests
      Using cached requests-2.28.1-py3-none-any.whl (62 kB)
    Collecting pyyaml>=5.1
      Using cached PyYAML-6.0.tar.gz (124 kB)
      Installing build dependencies ... done
      Getting requirements to build wheel ... done
      Preparing metadata (pyproject.toml) ... done
    Requirement already satisfied: typing-extensions>=3.7.4.3 in ./Library/Python/3.8/lib/python/site-packages (from huggingface-hub>=0.4.0->sentence-transformers) (4.3.0)
    Requirement already satisfied: filelock in ./Library/Python/3.8/lib/python/site-packages (from huggingface-hub>=0.4.0->sentence-transformers) (3.7.1)
    Requirement already satisfied: packaging>=20.9 in ./Library/Python/3.8/lib/python/site-packages (from huggingface-hub>=0.4.0->sentence-transformers) (21.3)
    Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
      Using cached tokenizers-0.12.1.tar.gz (220 kB)
      Installing build dependencies ... done
      Getting requirements to build wheel ... error
      error: subprocess-exited-with-error
      
      × Getting requirements to build wheel did not run successfully.
      │ exit code: 1
      ╰─> [20 lines of output]
          Traceback (most recent call last):
            File "/Users/joeyoneill/Library/Python/3.8/lib/python/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 363, in <module>
              main()
            File "/Users/joeyoneill/Library/Python/3.8/lib/python/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 345, in main
              json_out['return_val'] = hook(**hook_input['kwargs'])
            File "/Users/joeyoneill/Library/Python/3.8/lib/python/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 130, in get_requires_for_build_wheel
              return hook(config_settings)
            File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/site-packages/setuptools/build_meta.py", line 146, in get_requires_for_build_wheel
              return self._get_build_requires(
            File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/site-packages/setuptools/build_meta.py", line 127, in _get_build_requires
              self.run_setup()
            File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/site-packages/setuptools/build_meta.py", line 142, in run_setup
              exec(compile(code, __file__, 'exec'), locals())
            File "setup.py", line 2, in <module>
              from setuptools_rust import Binding, RustExtension
            File "/private/var/folders/bg/ncfh283n4t39vqhvbd5n9ckh0000gn/T/pip-build-env-vjj6eow8/overlay/lib/python3.8/site-packages/setuptools_rust/__init__.py", line 1, in <module>
              from .build import build_rust
            File "/private/var/folders/bg/ncfh283n4t39vqhvbd5n9ckh0000gn/T/pip-build-env-vjj6eow8/overlay/lib/python3.8/site-packages/setuptools_rust/build.py", line 20, in <module>
              from setuptools.command.build import build as CommandBuild  # type: ignore[import]
          ModuleNotFoundError: No module named 'setuptools.command.build'
          [end of output]
      
      note: This error originates from a subprocess, and is likely not a problem with pip.
    error: subprocess-exited-with-error
    
    × Getting requirements to build wheel did not run successfully.
    │ exit code: 1
    ╰─> See above for output.
    
    	note: This error originates from a subprocess, and is likely not a problem with pip.
    

    Can anybody tell me what I should do or what is wrong with what I am currently doing? I factory reset my Mac and re-downloaded everything but I still get this same error. I am stumped.

    opened by joeyoneill 15
  • HTTPError: 403 Client Error:

    I get a request error and I do not know why.

    
    [W 2021-02-02 18:43:15,951] Trial 0 failed because of the following error: HTTPError('403 Client Error: Forbidden for url: https://sbert.net/models/bert-base-german-dbmdz-uncased.zip',)
    Traceback (most recent call last):
      File "/usr/local/lib/python3.6/dist-packages/optuna/_optimize.py", line 211, in _run_trial
        value_or_values = func(trial)
      File "<ipython-input-6-af5cb77f5b44>", line 40, in objective
        model = SentenceTransformer(model_name)  # distiluse-base-multilingual-cased-v2  distilbert-multilingual-nli-stsb-quora-ranking
      File "/usr/local/lib/python3.6/dist-packages/sentence_transformers/SentenceTransformer.py", line 92, in __init__
        raise e
      File "/usr/local/lib/python3.6/dist-packages/sentence_transformers/SentenceTransformer.py", line 75, in __init__
        http_get(model_url, zip_save_path)
      File "/usr/local/lib/python3.6/dist-packages/sentence_transformers/util.py", line 201, in http_get
        req.raise_for_status()
      File "/usr/local/lib/python3.6/dist-packages/requests/models.py", line 941, in raise_for_status
        raise HTTPError(http_error_msg, response=self)
    requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://sbert.net/models/bert-base-german-dbmdz-uncased.zip
    
    HTTPError: 403 Client Error: Forbidden for url: https://sbert.net/models/bert-base-german-dbmdz-uncased.zip
    
    opened by tide90 15
  • Batch cos_sim for community_detection?

    I've been experimenting with the community_detection method but noticed I quickly get OOM errors if I pass in too many embeddings.

    Seeing how it uses cos_sim to compute all the embedding distances, do you think it would make sense to have the option for batching? I believe you will find other bottlenecks when iterating over the entries, but at least it will complete on larger sets of embeddings.

    opened by mmaybeno 13
  • 'torch._C.PyTorchFileReader' object has no attribute 'seek'

    Hello,

    I am using the following model for sentence similarity

    https://huggingface.co/sentence-transformers/stsb-xlm-r-multilingual/tree/main

    word_embedding_model = models.Transformer(bert_model_dir)  # , max_seq_length=512
    pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
    model = SentenceTransformer(modules=[word_embedding_model, pooling_model], device=device_str)
    

    But, I get this error:

    Traceback (most recent call last):
      File "/home/work/anaconda/lib/python3.6/site-packages/torch/serialization.py", line 306, in _check_seekable
        f.seek(f.tell())
    AttributeError: 'torch._C.PyTorchFileReader' object has no attribute 'seek'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/home/work/anaconda/lib/python3.6/site-packages/transformers/modeling_utils.py", line 1205, in from_pretrained
        state_dict = torch.load(resolved_archive_file, map_location="cpu")
      File "/home/work/anaconda/lib/python3.6/site-packages/torch/serialization.py", line 584, in load
        return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
      File "/home/work/anaconda/lib/python3.6/site-packages/moxing/framework/file/file_io_patch.py", line 200, in _load
        _check_seekable(f)
      File "/home/work/anaconda/lib/python3.6/site-packages/torch/serialization.py", line 309, in _check_seekable
        raise_err_msg(["seek", "tell"], e)
      File "/home/work/anaconda/lib/python3.6/site-packages/torch/serialization.py", line 302, in raise_err_msg
        raise type(e)(msg)
    AttributeError: 'torch._C.PyTorchFileReader' object has no attribute 'seek'. You can only torch.load from a file that is seekable. Please pre-load the data into a buffer like io.BytesIO and try to load from it instead.

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "code/similarity.py", line 118, in <module>
        word_embedding_model = models.Transformer(bert_model_dir) #, max_seq_length=512
      File "/home/work/anaconda/lib/python3.6/site-packages/sentence_transformers/models/Transformer.py", line 30, in __init__
        self.auto_model = AutoModel.from_pretrained(model_name_or_path, config=config, cache_dir=cache_dir)
      File "/home/work/anaconda/lib/python3.6/site-packages/transformers/models/auto/auto_factory.py", line 381, in from_pretrained
        return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)
      File "/home/work/anaconda/lib/python3.6/site-packages/transformers/modeling_utils.py", line 1208, in from_pretrained
        f"Unable to load weights from pytorch checkpoint file for '{pretrained_model_name_or_path}' "
    OSError: Unable to load weights from pytorch checkpoint file for '/home/work/user-job-dir/input/pretrained_models/stsb-xlm-r-multilingual/' at '/home/work/user-job-dir/input/pretrained_models/stsb-xlm-r-multilingual/pytorch_model.bin'If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.
    

    I checked on the web but could not find any solution. What could be the problem? Thank you.

    opened by deadsoul44 13
  • Getting SSL Error in downloading "distilroberta-base-paraphrase-v1" model embeddings

    I am using Google Colab with PyTorch version 1.7.0+cu101. I am getting an SSL error when I try to download the "distilroberta-base-paraphrase-v1" model.

    Code

    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer('distilroberta-base-paraphrase-v1')

    Error

    SSLError                                  Traceback (most recent call last)
    /usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
        599                 body=body, headers=headers,
    --> 600                 chunked=chunked)
        601

    24 frames

    SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)

    During handling of the above exception, another exception occurred:

    MaxRetryError                             Traceback (most recent call last)
    MaxRetryError: HTTPSConnectionPool(host='public.ukp.informatik.tu-darmstadt.de', port=443): Max retries exceeded with url: /reimers/sentence-transformers/v0.2/distilroberta-base-paraphrase-v1.zip (Caused by SSLError(SSLError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)'),))

    During handling of the above exception, another exception occurred:

    SSLError                                  Traceback (most recent call last)
    SSLError: HTTPSConnectionPool(host='public.ukp.informatik.tu-darmstadt.de', port=443): Max retries exceeded with url: /reimers/sentence-transformers/v0.2/distilroberta-base-paraphrase-v1.zip (Caused by SSLError(SSLError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)'),))

    During handling of the above exception, another exception occurred:

    FileNotFoundError                         Traceback (most recent call last)
    /usr/lib/python3.6/shutil.py in rmtree(path, ignore_errors, onerror)
        473     # lstat()/open()/fstat() trick.
        474     try:
    --> 475         orig_st = os.lstat(path)
        476     except Exception:
        477         onerror(os.lstat, path, sys.exc_info())

    FileNotFoundError: [Errno 2] No such file or directory: '/root/.cache/torch/sentence_transformers/sbert.net_models_distilroberta-base-paraphrase-v1'

    opened by rahuliitkgp31 13
  • model.fit results in nan

    Hi,

    I want to fine-tune SBERT with pre-trained weights of 'bert-base-uncased'. I follow this tutorial: https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/nli/training_nli_v2.py using MultipleNegativesRankingLoss loss function.

    When I do model.fit, the results are 'nan' everywhere.

    Here is my code:

    root_model = AutoModel.from_pretrained('bert-base-uncased')
    output_dir = "/root/Automated_Assessment_(ETS)/Model/DRAFT/DRAFT_Bert_base_uncased"
    BERT_model = root_model.save_pretrained(output_dir)
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')  # ('onlplab/alephbert-base')
    tokenizer.save_pretrained(output_dir)

    learning_rate, batch_size, epochs = 2e-5, 8, 1

    train_dataloader = datasets.NoDuplicatesDataLoader(train_data, batch_size=batch_size)
    word_embedding_model = models.Transformer(output_dir, max_seq_length=512)
    pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode='mean')
    model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

    train_loss = losses.MultipleNegativesRankingLoss(model)
    val_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(val_data, batch_size=batch_size)

    warmup_steps = math.ceil(len(train_dataloader) * epochs * 0.1)  # 10% of train data for warm-up
    logging.info("Warmup-steps: {}".format(warmup_steps))

    output_file = 'output/sentence_similarity'+MODEL_NAME.replace("/", "-")+'-'+datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    sb_output_path = os.path.join(ref_saved_models_path, output_file)

    model.fit(train_objectives=[(train_dataloader, train_loss)],
              evaluator=val_evaluator,
              epochs=epochs,
              evaluation_steps=int(len(train_dataloader)*0.1),
              warmup_steps=warmup_steps,
              output_path=sb_output_path,
              use_amp=False  # Set to True, if your GPU supports FP16 operations
              )

    Here is a screenshot of the log: Capture

    I don't understand what I am doing wrong. Could you please help me?

    opened by Abigail-gs 0
  • Dtype error when using Pooling + Dense layers with half precision

    The models.Pooling layer seems to always output a 32-bit float as its sentence_embedding. This leads to a dtype error when using a dense layer after the pooling layer when the model is in half precision mode via model.half()

    Here is a minimal example:

    from sentence_transformers import SentenceTransformer,models
    from torch import nn
    
    word_embedding_model = models.Transformer("sentence-transformers/all-MiniLM-L6-v2")
    polling = models.Pooling(word_embedding_model.get_word_embedding_dimension(),"mean")
    dense = models.Dense(word_embedding_model.get_word_embedding_dimension(), out_features=64, activation_function=nn.Tanh())
    
    #This works as expected
    sentence_transformer_without_dense = SentenceTransformer(modules=[word_embedding_model,polling])
    sentence_transformer_without_dense.half()
    
    print(sentence_transformer_without_dense.encode("Hello World"))
    
    #This will throw an error
    sentence_transformer_with_dense = SentenceTransformer(modules=[word_embedding_model,polling,dense])
    sentence_transformer_with_dense.half()
    
    print(sentence_transformer_with_dense.encode("Hello World"))
    

    Is this the expected behaviour or a bug?

    opened by LLukas22 0
  • How can I use models.Dense() layer with DenoisingAutoEncoderLoss()?

    When creating a SentenceTransformer as follows:

    word_embedding_model = Transformer(
      model_name_or_path=model_name_or_path, # "bert-base-uncased"
      max_seq_length=max_seq_length, # 384
      cache_dir=cache_dir,
      tokenizer_args=tokenizer_args, # {"truncation": True, "padding": "max_length", "max_length": 384}
      do_lower_case=do_lower_case, # True
      tokenizer_name_or_path=tokenizer_name_or_path # "bert-base-uncased"
     )
    
    word_embedding_dimension = word_embedding_model.get_word_embedding_dimension()
    pooling_mode = "cls"
    pooling_model = Pooling(
        word_embedding_dimension=word_embedding_dimension,
        pooling_mode=pooling_mode,
    )
    
    in_features = pooling_model.get_sentence_embedding_dimension()
    out_features = config["parameters"]["num_dense_dimensions"] # 256
    dense_model = Dense(
        in_features=in_features,
        out_features=out_features,
        activation_function=nn.Tanh(),
    )
    
    modules = [
        word_embedding_model,
        pooling_model,
        dense_model,
    ]
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    cache_folder = os.path.join(cache_location)
    model = SentenceTransformer(
        modules=modules,
        device=device,
        cache_folder=cache_folder,
    )
    

    And creating the following DenoisingAutoEncoderLoss:

    train_loss = DenoisingAutoEncoderLoss(
        model=model,
        tie_encoder_decoder=tie_encoder_decoder, # True
    )
    

    With this training setting:

    train_objectives = [
        (train_dataloader, train_loss)
    ]
    evaluator = MSEEvaluator(
        source_sentences=source_sentences,
        target_sentences=target_sentences,
        teacher_model=model,
        show_progress_bar=True,
        batch_size=batch_size, # batch_size = 16
        name="job2vec",
        write_csv=True,
    )
    
    def free_memory(score, epoch, steps):
        torch.cuda.empty_cache()
        gc.collect()
    
    epochs = config["hyperparameters"]["num_epochs"]
    warmup_steps = config["hyperparameters"]["warmup_steps"]
    evaluation_steps = batch_size * 32  # batch_size = 16 (no trailing comma, which would create a tuple)
    output_path = os.path.join(cache_location, "job2vec")
    save_best_model = True
    use_amp = True
    callback = free_memory  # pass the function itself, not a one-element tuple
    show_progress_bar = True
    checkpoint_path = os.path.join(cache_location, "job2vec/checkpoints")
    checkpoint_save_steps = len(train_dataloader)
    model.fit(
        train_objectives=train_objectives,
        evaluator=evaluator,
        epochs=epochs,
        warmup_steps=warmup_steps,
        evaluation_steps=evaluation_steps,
        output_path=output_path,
        save_best_model=save_best_model,
        show_progress_bar=show_progress_bar,
        use_amp=use_amp,
        callback=callback,
        checkpoint_path=checkpoint_path,
        checkpoint_save_steps=checkpoint_save_steps,
    )
    

    Then the following error occurs:

    Traceback (most recent call last):
      File "src/denoising_autoencoder.py", line 216, in <module> 
        main()
      File "src/denoising_autoencoder.py", line 213, in main
        train()
      File "src/denoising_autoencoder.py", line 209, in train
        checkpoint_save_steps=checkpoint_save_steps,
      File "venv/lib/python3.7/site-packages/sentence_transformers/SentenceTransformer.py", line 710, in fit
        loss_value = loss_model(features, labels)
      File "venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
        return forward_call(*input, **kwargs)
      File "venv/lib/python3.7/site-packages/sentence_transformers/losses/DenoisingAutoEncoderLoss.py", line 119, in forward
        use_cache=False
      File "venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
        return forward_call(*input, **kwargs)
      File "venv/lib/python3.7/site-packages/transformers/models/bert/modeling_bert.py", line 1250, in forward
        return_dict=return_dict,
      File "venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
        return forward_call(*input, **kwargs)
      File "venv/lib/python3.7/site-packages/transformers/models/bert/modeling_bert.py", line 1031, in forward
        return_dict=return_dict,
      File "venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
        return forward_call(*input, **kwargs)
      File "venv/lib/python3.7/site-packages/transformers/models/bert/modeling_bert.py", line 617, in forward
        output_attentions,
      File "venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
        return forward_call(*input, **kwargs)
      File "venv/lib/python3.7/site-packages/transformers/models/bert/modeling_bert.py", line 529, in forward
        output_attentions,
      File "venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
        return forward_call(*input, **kwargs)
      File "venv/lib/python3.7/site-packages/transformers/models/bert/modeling_bert.py", line 433, in forward
        output_attentions,
      File "venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
        return forward_call(*input, **kwargs)
      File "venv/lib/python3.7/site-packages/transformers/models/bert/modeling_bert.py", line 298, in forward
        key_layer = self.transpose_for_scores(self.key(encoder_hidden_states))
      File "venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
        return forward_call(*input, **kwargs)
      File "venv/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 114, in forward
        return F.linear(input, self.weight, self.bias)
    RuntimeError: mat1 and mat2 shapes cannot be multiplied (16x256 and 768x768)
    

    How can I use the models.Dense() layer with the DenoisingAutoEncoderLoss?

    opened by niquet 0
  • How to distill model with different tokenizer?

    I am trying to train word embedding models to match embeddings from a sentence transformer, and using model_distillation won't cut it, because when running student_model.fit the model uses student's smart_batching_collate so the teacher model gets wrong tokens.

    Has anybody worked on something similar? I don't see any workaround other than rewriting the SentenceTransformer.fit method, but maybe there's an easier way to do this?

    opened by lambdaofgod 0
  • Community detection algorithm can loop forever

    If only one vector is passed, the community detection algorithm will loop forever.

    I suggest adding

    assert embeddings.shape[0] >= 2, "Embeddings should contain at least two vectors"
    assert embeddings.shape[0] >= min_community_size, "Number of vectors is less than specified min_community_size"
    

    checks. (Can open a pull request for this)
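    The proposed checks could be wrapped in a small guard that runs before clustering. This is only a sketch: the helper name check_community_inputs is hypothetical and not part of the library, and it raises ValueError instead of using assert so the check survives when Python runs with optimizations enabled.

```python
def check_community_inputs(embeddings, min_community_size):
    """Validate inputs before running community detection.

    `embeddings` is any sequence of vectors; only its length is inspected.
    Hypothetical helper, not part of sentence-transformers.
    """
    num_vectors = len(embeddings)
    if num_vectors < 2:
        raise ValueError("Embeddings should contain at least two vectors")
    if num_vectors < min_community_size:
        raise ValueError(
            "Number of vectors is less than specified min_community_size"
        )


# Three vectors with min_community_size=2 pass the check silently.
check_community_inputs([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], 2)
```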

    opened by maiiabocharova 0
  • Override tokenizer args of sentencetransformer

    How can we apply a sliding window with the SentenceTransformer tokenizer? I want to be able to override return_overflowing_tokens=True and stride in the default tokenizer to enable the sliding window.

    opened by datashinobi 0
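    As background for the sliding-window question above, the chunking itself can be sketched in plain Python. This is not the tokenizer's implementation: sliding_windows is a hypothetical helper, it ignores special tokens, and it assumes that stride means the number of tokens shared by consecutive windows.

```python
def sliding_windows(token_ids, max_length, stride):
    """Split token_ids into windows of at most max_length tokens.

    Consecutive windows overlap by `stride` tokens (an assumption about
    what `stride` means; special tokens are ignored in this sketch).
    """
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    step = max_length - stride
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # the last window reached the end of the input
        start += step
    return windows


chunks = sliding_windows(list(range(10)), max_length=4, stride=2)
print(chunks)  # [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

    Each window could then be embedded separately and the resulting vectors combined, for example by averaging; how to combine them is a design choice outside this sketch.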
Releases(v2.2.2)
  • v2.2.2(Jun 26, 2022)

    huggingface_hub dropped support for Python 3.6 in version 0.5.0.

    This release fixes the issue so that huggingface_hub with version 0.4.0 and Python 3.6 can still be used.

    Source code(tar.gz)
    Source code(zip)
  • v2.2.1(Jun 23, 2022)

    Version 0.8.1 of huggingface_hub introduces several changes that resulted in errors and warnings. This version of sentence-transformers fixes these issues.

    Further, several improvements have been added / merged:

    • util.community_detection was improved: 1) It works in a batched mode to save memory, 2) Overlapping clusters are no longer dropped; instead, the overlapping items are removed, 3) The parameter init_max_size was removed and replaced by a heuristic to estimate the max size of clusters
    • #1581 the training dataset names can be saved in the model card
    • #1426 fix the text summarization example
    • #1487 Recursive sentence-transformers models are now possible
    • #1522 Private models can now be loaded
    • #1551 DataLoaders can now have workers
    • #1565 Models are only checked on the hub if they don't exist in the cache. This fixes problems caused by connectivity issues
    • #1591 Added an example of how to stream-encode larger datasets
    Source code(tar.gz)
    Source code(zip)
  • v2.2.0(Feb 10, 2022)

    T5

    You can now use the encoder from T5 to learn text embeddings. You can use it like any other transformer model:

    from sentence_transformers import SentenceTransformer, models
    word_embedding_model = models.Transformer('t5-base', max_seq_length=256)
    pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
    model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
    

    See T5-Benchmark results - the T5 encoder is not the best model for learning text embedding models. It requires quite a lot of training data and training steps. Other models perform much better, at least in the given experiment with 560k training triplets.

    New Models

    The models from the papers Sentence-T5: Scalable sentence encoders from pre-trained text-to-text models and Large Dual Encoders Are Generalizable Retrievers have been added:

    For benchmark results, see https://seb.sbert.net

    Private Models

    Thanks to #1406 you can now load private models from the hub:

    model = SentenceTransformer("your-username/your-model", use_auth_token=True)
    
    Source code(tar.gz)
    Source code(zip)
  • v2.1.0(Oct 1, 2021)

    This is a smaller release with some new features

    MarginMSELoss

    MarginMSELoss is a great method to train embedding models with the help of a cross-encoder model. The details are explained here: MSMARCO - MarginMSE Training

    You pass your training data in the format:

    InputExample(texts=[query, positive, negative], label=cross_encoder.predict([query, positive])-cross_encoder.predict([query, negative]))
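
    To make the label above concrete, here is a minimal sketch of the quantity a margin-based MSE objective compares. It assumes the student scores each (query, passage) pair with a single number and that the label is the teacher's margin as in the format above; margin_mse_loss is a hypothetical helper, not the library's class.

```python
def margin_mse_loss(student_pos, student_neg, teacher_margins):
    """Mean squared error between student margins and teacher margins.

    student_pos[i] / student_neg[i]: student scores for (query, positive)
    and (query, negative); teacher_margins[i]: cross-encoder margin label.
    """
    errors = [
        ((s_pos - s_neg) - t_margin) ** 2
        for s_pos, s_neg, t_margin in zip(student_pos, student_neg, teacher_margins)
    ]
    return sum(errors) / len(errors)


# First triplet: student margin 0.5 vs. teacher margin 1.5 -> error 1.0
# Second triplet: student margin 1.0 vs. teacher margin 1.0 -> error 0.0
print(margin_mse_loss([0.5, 2.0], [0.0, 1.0], [1.5, 1.0]))  # 0.5
```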
    

    MultipleNegativesSymmetricRankingLoss

    MultipleNegativesRankingLoss computes the loss just in one way: Find the correct answer for a given question.

    MultipleNegativesSymmetricRankingLoss also computes the loss in the other direction: Find the correct question for a given answer.
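
    The two directions can be illustrated on a small similarity matrix where entry [i][j] scores question i against answer j and the diagonal holds the correct pairs. This is a sketch in plain Python under the assumption that both directions are averaged with equal weight; the function names are hypothetical.

```python
import math


def cross_entropy_rows(scores):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(scores):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(s - row_max) for s in row))
        total += log_sum_exp - row[i]
    return total / len(scores)


def symmetric_ranking_loss(scores):
    """Average of question->answer and answer->question losses."""
    transposed = [list(column) for column in zip(*scores)]
    forward = cross_entropy_rows(scores)       # find the answer for a question
    backward = cross_entropy_rows(transposed)  # find the question for an answer
    return (forward + backward) / 2


scores = [[5.0, 1.0], [2.0, 4.0]]
print(cross_entropy_rows(scores), symmetric_ranking_loss(scores))
```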

    Breaking Change: CLIPModel

    The CLIPModel is now based on the transformers model.

    You can still load it like this:

    model = SentenceTransformer('clip-ViT-B-32')
    

    Older SentenceTransformers versions are no longer able to load and use the 'clip-ViT-B-32' model.

    Added files on the hub are automatically downloaded

    PR #1116 checks if you have all files in your local cache or if there are added files on the hub. If this is the case, it will automatically download them.

    SentenceTransformers.encode() can return all values

    When you set output_value=None for the encode method, all values (token_ids, token_embeddings, sentence_embedding) will be returned.

    Source code(tar.gz)
    Source code(zip)
  • v2.0.0(Jun 24, 2021)

    Models hosted on the hub

    All pre-trained models are now hosted on the Huggingface Models hub.

    Our pre-trained models can be found here: https://huggingface.co/sentence-transformers

    You can also easily share your own sentence-transformer model on the hub and have other people easily access it. Simply upload the folder and have people load it via:

    model = SentenceTransformer('[your_username]/[model_name]')
    

    For more information, see: Sentence Transformers in the Hugging Face Hub

    Breaking changes

    There should be no breaking changes. Old models can still be loaded from disc. However, if you use one of the provided pre-trained models, it will be downloaded again in version 2 of sentence transformers as the cache path has slightly changed.

    Find sentence-transformer models on the Hub

    You can filter the hub for sentence-transformers models: https://huggingface.co/models?filter=sentence-transformers

    Add the sentence-transformers tag to your model card so that others can find your model.

    Widget & Inference API

    A widget was added to sentence-transformers models on the hub that lets you interact with a model directly on its page: https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2

    Further, models can now be used with the Accelerated Inference API: Send your sentences to the API and get back the embeddings from the respective model.

    Save Model to Hub

    A new method was added to the SentenceTransformer class: save_to_hub.

    Provide the model name and the model is saved on the hub.

    The transformers documentation explains how the hub works: Model sharing and uploading

    Automatic Model Card

    When you save a model with save or save_to_hub, a README.md (also known as model card) is automatically generated with basic information about the respective SentenceTransformer model.

    New Models

    Source code(tar.gz)
    Source code(zip)
  • v1.2.1(Jun 24, 2021)

  • v1.2.0(May 12, 2021)

    Unsupervised Sentence Embedding Learning

    New methods were integrated to train sentence embedding models without labeled data. See Unsupervised Learning for an overview of all existing methods.

    New methods:

    Pre-Training Methods

    • MLM: An example script to run Masked-Language-Modeling (MLM). Running MLM on your custom data before supervised training can significantly improve performance. Further, MLM also works well for domain transfer: You first train on your custom data, and then train with e.g. NLI or STS data.

    Training Examples

    New models

    New Functions

    • SentenceTransformer.fit() Checkpoints: The fit() method now allows saving checkpoints during training at a fixed number of steps. More info
    • Pooling-mode as string: You can now pass the pooling-mode to models.Pooling() as string:
      pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode='mean')
      

      Valid values are mean/max/cls.

    • NoDuplicatesDataLoader: When using the MultipleNegativesRankingLoss, one should avoid having duplicate sentences in the same batch. This data loader simplifies this task and ensures that no duplicate entries are in the same batch.
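
    The idea behind the NoDuplicatesDataLoader entry above can be sketched with a greedy batching routine. This is an illustration only, not the library's implementation: no_duplicate_batches is a hypothetical helper, and unlike a real data loader it does not shuffle and it emits a smaller final batch.

```python
def no_duplicate_batches(examples, batch_size):
    """Yield batches in which no text occurs more than once.

    `examples` is a list of tuples of texts, e.g. (question, answer).
    An example is postponed to a later batch if the current batch is
    full or already contains one of its texts.
    """
    remaining = list(examples)
    while remaining:
        batch, seen, postponed = [], set(), []
        for texts in remaining:
            if len(batch) < batch_size and not any(t in seen for t in texts):
                batch.append(texts)
                seen.update(texts)
            else:
                postponed.append(texts)
        yield batch
        remaining = postponed


pairs = [("q1", "a1"), ("q1", "a2"), ("q2", "a3"), ("q3", "a1")]
for batch in no_duplicate_batches(pairs, batch_size=2):
    print(batch)
# [('q1', 'a1'), ('q2', 'a3')]
# [('q1', 'a2'), ('q3', 'a1')]
```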
    Source code(tar.gz)
    Source code(zip)
  • v1.1.0(Apr 21, 2021)

    Unsupervised Sentence Embedding Learning

    This release integrates methods that allow learning sentence embeddings without labeled data:

    • TSDAE: TSDAE uses a denoising auto-encoder to learn sentence embeddings. The method has been presented in our recent paper and achieves state-of-the-art performance for several tasks.
    • GenQ: GenQ uses a pre-trained T5 system to generate queries for a given passage. It was presented in our recent BEIR paper and works well for domain adaptation for semantic search (https://www.sbert.net/examples/applications/semantic-search/README.html)

    New Models - SentenceTransformer

    • MSMARCO Dot-Product Models: We trained models using the dot-product instead of cosine similarity as similarity function. As shown in our recent BEIR paper, models with cosine-similarity prefer the retrieval of short documents, while models with dot-product prefer retrieval of longer documents. Now you can choose what is most suitable for your task.
    • MSMARCO MiniLM Models: We uploaded some models based on MiniLM: It uses just 384 dimensions, is faster than previous models and achieves nearly the same performance

    New Models - CrossEncoder

    New Features

    • You can now pass to the CrossEncoder class a default_activation_function, that is applied on-top of the output logits generated by the class.
    • You can now pre-process images for the CLIP Model. Soon I will release a tutorial how to fine-tune the CLIP Model with your data.
    Source code(tar.gz)
    Source code(zip)
  • v1.0.4(Apr 1, 2021)

    It was not possible to fine-tune and save the CLIPModel. This release fixes it. CLIPModel can now be saved like any other model by calling model.save(path)

    Source code(tar.gz)
    Source code(zip)
  • v1.0.3(Mar 22, 2021)

  • v1.0.2(Mar 19, 2021)

    v1.0.2 - Patch for CLIPModel, new Image Examples

    • Bugfix in CLIPModel: Too long inputs raised a RuntimeError. Now they are truncated.
    • New util function: util.paraphrase_mining_embeddings, to find most similar embeddings in a matrix
    • Image Clustering and Duplicate Image Detection examples added: more info
    Source code(tar.gz)
    Source code(zip)
  • v1.0.0(Mar 18, 2021)

    This release brings many new improvements and new features. Also, the version number scheme is updated. Now we use the format x.y.z with x: for major releases, y: smaller releases with new features, z: bugfixes

    Text-Image-Model CLIP

    You can now encode text and images in the same vector space using the OpenAI CLIP Model. You can use the model like this:

    from sentence_transformers import SentenceTransformer, util
    from PIL import Image
    
    #Load CLIP model
    model = SentenceTransformer('clip-ViT-B-32')
    
    #Encode an image:
    img_emb = model.encode(Image.open('two_dogs_in_snow.jpg'))
    
    #Encode text descriptions
    text_emb = model.encode(['Two dogs in the snow', 'A cat on a table', 'A picture of London at night'])
    
    #Compute cosine similarities 
    cos_scores = util.cos_sim(img_emb, text_emb)
    print(cos_scores)
    

    More Information IPython Demo Colab Demo

    Examples of how to train the CLIP model on your data will be added soon.

    New Models

    New Features

    • The Asym Model can now be used as the first model in a SentenceTransformer modules list.
    • Sorting when encoding changes: Previously, we encoded from short to long sentences. Now we encode from long to short sentences, so out-of-memory errors will happen at the start. The estimate of the duration of the encode process is also more precise
    • Improvement of the util.semantic_search method: It now uses the much faster torch.topk function. Further, you can define which scoring function should be used
    • New util methods: util.dot_score computes the dot product of two embedding matrices. util.normalize_embeddings will normalize embeddings to unit length
    • New parameter for SentenceTransformer.encode method: normalize_embeddings if set to true, it will normalize embeddings to unit length. In that case the faster util.dot_score can be used instead of util.cos_sim to compute cosine similarity scores.
    • If you specify models.Transformer(do_lower_case=True) when creating a new SentenceTransformer, all input will be lowercased.
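
    The normalize_embeddings item above relies on the fact that, for unit-length vectors, the dot product equals the cosine similarity. A minimal check in plain Python, with hypothetical helper names that stand in for the util functions:

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cos_sim(u, v):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


u, v = [3.0, 4.0], [4.0, 3.0]
# After normalization, the cheaper dot product gives the cosine similarity.
print(cos_sim(u, v), dot(normalize(u), normalize(v)))
```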

    New Examples

    Bugfixes

    • Encode method now correctly returns token_embeddings if output_value='token_embeddings' is defined
    • Bugfix of the LabelAccuracyEvaluator
    • Fixed a bug that moved tensors to the CPU if you specified encode(sent, convert_to_tensor=True). They now stay on the GPU

    Breaking changes:

    • SentenceTransformer.encode method: Removed deprecated parameters is_pretokenized and num_workers
    Source code(tar.gz)
    Source code(zip)
  • v0.4.1(Jan 4, 2021)

    Refactored Tokenization

    • Faster tokenization speed: Using batched tokenization for training & inference - Now, all sentences in a batch are tokenized simultaneously.
    • Usage of the SentencesDataset no longer needed for training. You can pass your train examples directly to the DataLoader:
    train_examples = [InputExample(texts=['My first sentence', 'My second sentence'], label=0.8),
        InputExample(texts=['Another pair', 'Unrelated sentence'], label=0.3)]
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
    
    • If you use a custom torch DataSet class: The dataset class must now return InputExample objects instead of tokenized texts
    • Class SentenceLabelDataset has been updated to new tokenization flow: It returns always two or more InputExamples with the same label

    Asymmetric Models

    Added a new models.Asym class that allows different encoding of sentences based on some tag (e.g. query vs paragraph). Minimal example:

    from torch import nn
    from sentence_transformers import SentenceTransformer, InputExample, models

    # base_model must be set beforehand to the name or path of a transformer model
    word_embedding_model = models.Transformer(base_model, max_seq_length=250)
    pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
    # One dense projection per input type, both mapping to 256 dimensions
    d1 = models.Dense(word_embedding_model.get_word_embedding_dimension(), 256, bias=False, activation_function=nn.Identity())
    d2 = models.Dense(word_embedding_model.get_word_embedding_dimension(), 256, bias=False, activation_function=nn.Identity())
    asym_model = models.Asym({'QRY': [d1], 'DOC': [d2]})
    model = SentenceTransformer(modules=[word_embedding_model, pooling_model, asym_model])
    
    # Your input examples have to look like this:
    inp_example = InputExample(texts=[{'QRY': 'your query'}, {'DOC': 'your document text'}], label=1)
    
    # Encoding (Note: Mixed inputs are not allowed)
    model.encode([{'QRY': 'your query1'}, {'QRY': 'your query2'}])
    

    Inputs that have the key 'QRY' will be passed through the d1 dense layer, while inputs with the key 'DOC' will be passed through the d2 dense layer. More documentation on how to design asymmetric models will follow soon.

    New Namespace & Models for Cross-Encoder

    Cross-Encoders are now hosted at https://huggingface.co/cross-encoder. Also, new pre-trained models have been added for: NLI & QNLI.

    Logging

    Log messages now use a custom logger from logging thanks to PR #623. This allows you to choose which log messages you want to see from which components.

    Unit tests

    A lot more unit tests have been added, which test the different components of the framework.

    Source code(tar.gz)
    Source code(zip)
  • v0.4.0(Dec 22, 2020)

    • Updated the dependencies so that it works with Huggingface Transformers version 4. Sentence-Transformers still works with huggingface transformers version 3, but an update to version 4 of transformers is recommended. Future changes might break with transformers version 3.
    • New naming of pre-trained models. Models will be named: {task}-{transformer_model}. So 'bert-base-nli-stsb-mean-tokens' becomes 'stsb-bert-base'. Models will still be available under their old names, but newer models will follow the updated naming scheme.
    • New application example for information retrieval and question answering retrieval. Together with respective pre-trained models
    Source code(tar.gz)
    Source code(zip)
  • v0.3.9(Nov 18, 2020)

    This release only includes some smaller updates:

    • Code was tested with transformers 3.5.1, requirement was updated so that it works with transformers 3.5.1
    • As some parts and models require Pytorch >= 1.6.0, requirement was updated to require at least pytorch 1.6.0. Most of the code and models will work with older pytorch versions.
    • model.encode() stored the embeddings on the GPU, which required quite a lot of GPU memory when encoding millions of sentences. The embeddings are now moved to CPU once they are computed.
    • The CrossEncoder-Class now accepts a max_length parameter to control the truncation of inputs
    • The Cross-Encoder predict method now has an apply_softmax parameter that allows applying softmax on top of a multi-class output.
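
    For the apply_softmax item above, this is what applying softmax to a multi-class output means: the raw logits are turned into probabilities that sum to one. A plain-Python sketch, not the library's code; the example logits are made up.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three classes of one input pair
probabilities = softmax([2.0, 0.5, -1.0])
print(probabilities, sum(probabilities))
```

    The ranking of the classes is unchanged by softmax; only the scale of the scores changes.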
    Source code(tar.gz)
    Source code(zip)
  • v0.3.8(Oct 19, 2020)

    • Added support for training and using CrossEncoder
    • Data Augmentation method AugSBERT added
    • New models trained on large-scale paraphrase data. The models work much better on an internal benchmark than previous models: distilroberta-base-paraphrase-v1 and xlm-r-distilroberta-base-paraphrase-v1
    • New model for Information Retrieval trained on MS Marco: distilroberta-base-msmarco-v1
    • Improved MultipleNegativesRankingLoss loss function: Similarity function can be changed and is now cosine similarity (was dot-product before), further, similarity scores can be multiplied by a scaling factor. This allows the usage of NTXentLoss / InfoNCE loss.
    • New MegaBatchMarginLoss, inspired by the ParaNMT paper.

    Smaller changes:

    • Update InformationRetrievalEvaluator, so that it can work with large corpora (Millions of entries). Removed the query_chunk_size parameter from the evaluator
    • SentenceTransformer.encode method detaches tensors from compute graph
    • SentenceTransformer.fit() method - Parameter output_path_ignore_not_empty deprecated. No longer checks that target folder must be empty
    Source code(tar.gz)
    Source code(zip)
  • v0.3.7(Sep 29, 2020)

    • Upgrade transformers dependency, transformers 3.1.0, 3.2.0 and 3.3.1 are working
    • Added example code for model distillation: Sentence Embeddings models can be drastically reduced to e.g. only 2-4 layers while keeping 98+% of their performance. Code can be found in examples/training/distillation
    • Transformer models can now accept two inputs ['sentence 1', 'context for sent1'], which are encoded as the two inputs for BERT.

    Minor changes:

    • Tokenization in the multi-processes encoding setup now happens in the child processes, not in the parent process.
    • Added models.Normalize() to allow the normalization of embeddings to unit length
    Source code(tar.gz)
    Source code(zip)
  • v0.3.6(Sep 11, 2020)

    Huggingface Transformers version 3.1.0 had a breaking change with previous version 3.0.2

    This release fixes the issue so that Sentence-Transformers is compatible with Huggingface Transformers 3.1.0. Note that this and future versions will not be compatible with transformers < 3.1.0.

    Source code(tar.gz)
    Source code(zip)
  • v0.3.5(Sep 1, 2020)

    • The old FP16 training code in model.fit() was replaced by using Pytorch 1.6.0 automatic mixed precision (AMP). When setting model.fit(use_amp=True), AMP will be used. On suitable GPUs, this leads to a significant speed-up while requiring less memory.
    • Performance improvements in paraphrase mining & semantic search by replacing np.argpartition with torch.topk
    • If a sentence-transformer model is not found, it will fall back to huggingface transformers repository and create it with mean pooling.
    • Fixing huggingface transformers to version 3.0.2. Next release will make it compatible with huggingface transformers 3.1.0
    • Several bugfixes: Downloading of files, multi-GPU-encoding
    Source code(tar.gz)
    Source code(zip)
  • v0.3.4(Aug 24, 2020)

    • The documentation is substantially improved and can be found at: www.SBERT.net - Feedback welcome
    • The dataset to hold training InputExamples (dataset.SentencesDataset) now uses lazy tokenization, i.e., examples are tokenized once they are needed for a batch. If you set num_workers to a positive integer in your DataLoader, tokenization will happen in a background thread. This substantially reduces the start-up time for training.
    • model.encode() uses also a PyTorch DataSet + DataLoader. If you set num_workers to a positive integer, tokenization will happen in the background leading to faster encoding speed for large corpora.
    • Added functions and an example for multi-GPU encoding - This method can be used to encode a corpus with multiple GPUs in parallel. No multi-GPU support for training yet.
    • Removed parallel_tokenization parameters from encode & SentencesDatasets - No longer needed with lazy tokenization and DataLoader worker threads.
    • Smaller bugfixes

    Breaking changes:

    • Renamed evaluation.BinaryEmbeddingSimilarityEvaluator to evaluation.BinaryClassificationEvaluator
    Source code(tar.gz)
    Source code(zip)
  • v0.3.3(Aug 6, 2020)

    New Functions

    • Multi-process tokenization (Linux only) for the model encode function. Significant speed-up when encoding large sets
    • Tokenization of datasets for training can now run in parallel (Linux Only)
    • New example for Quora Duplicate Questions Retrieval: See examples-folder
    • Many small improvements for training better models for Information Retrieval
    • Fixed LabelSampler (can be used to get batches with certain number of matching labels. Used for BatchHardTripletLoss). Moved it to DatasetFolder
    • Added new Evaluators for ParaphraseMining and InformationRetrieval
    • evaluation.BinaryEmbeddingSimilarityEvaluator no longer assumes a 50-50 split of the dataset. It computes the optimal threshold and measures accuracy
    • model.encode - When the convert_to_numpy parameter is set, the method returns a numpy matrix instead of a list of numpy vectors
    • New function: util.paraphrase_mining to perform paraphrase mining in a corpus. For an example see examples/training_quora_duplicate_questions/
    • New function: util.information_retrieval to perform information retrieval / semantic search in a corpus. For an example see examples/training_quora_duplicate_questions/
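
    The optimal-threshold item in the list above can be illustrated with a short search over candidate thresholds. This is a sketch under the assumption that a pair counts as positive when its similarity score is at least the threshold; best_accuracy_threshold is a hypothetical helper, not the evaluator's code.

```python
def best_accuracy_threshold(scores, labels):
    """Return (accuracy, threshold) maximizing accuracy on the given pairs.

    labels are 1 (similar) or 0 (dissimilar). A pair is predicted
    positive if its score is >= threshold.
    """
    pairs = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    total = len(pairs)
    negatives_below = sum(1 for _, label in pairs if label == 0)
    best_acc = negatives_below / total  # baseline: predict everything negative
    best_threshold = float("inf")
    positives_above = 0
    for i, (score, label) in enumerate(pairs):
        if label == 1:
            positives_above += 1
        else:
            negatives_below -= 1
        if i + 1 < total and pairs[i + 1][0] == score:
            continue  # tied scores cannot be separated by a threshold
        accuracy = (positives_above + negatives_below) / total
        if accuracy > best_acc:
            best_acc = accuracy
            # place the threshold midway to the next lower score, if any
            best_threshold = (score + pairs[i + 1][0]) / 2 if i + 1 < total else score
    return best_acc, best_threshold


accuracy, threshold = best_accuracy_threshold([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0])
print(accuracy, threshold)
```

    Because the threshold is chosen from the data, the result does not depend on the share of positive pairs, which is the point of dropping the 50-50 assumption.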

    Breaking Changes

    • The evaluators (like EmbeddingSimilarityEvaluator) no longer accept a DataLoader as argument. Instead, the sentence and scores are directly passed. Old code that uses the previous evaluators needs to be changed. They can use the class method from_input_examples(). See examples/training_transformers/training_nli.py how to use the new evaluators.
    Source code(tar.gz)
    Source code(zip)
  • v0.3.2(Jul 23, 2020)

    This is a minor release. There should be no breaking changes.

    • ParallelSentencesDataset: Datasets are tokenized on-the-fly, saving some start-up time
    • util.pytorch_cos_sim - Method. New method to compute cosine similarity with pytorch. About 100 times faster than scipy cdist. semantic_search.py example has been updated accordingly.
    • SentenceTransformer.encode: New parameter: convert_to_tensor. If set to true, encode returns one large pytorch tensor with your embeddings
    Source code(tar.gz)
    Source code(zip)
  • v0.3.1(Jul 22, 2020)

    This is a minor update that changes some classes for training & evaluating multilingual sentence embedding methods.

    The examples for training multi-lingual sentence embeddings models have been significantly extended. See docs/training/multilingual-models.md for details. An automatic script that downloads suitable data and extends sentence embeddings to multiple languages has been added.

    The following classes/files have been changed:

    • datasets/ParallelSentencesDataset.py: The dataset with parallel sentences is encoded on-the-fly, reducing the start-up time for extending a sentence embedding model to new languages. An embedding cache can be configured to store previously computed sentence embeddings during training.

    New evaluation files:

    • evaluation/MSEEvaluator.py - breaking change. Now, this class expects lists of strings with parallel (translated) sentences. The old class has been renamed to MSEEvaluatorFromDataLoader.py
    • evaluation/EmbeddingSimilarityEvaluatorFromList.py - Semantic Textual Similarity data can be passed as lists of strings & scores
    • evaluation/MSEEvaluatorFromDataFrame.py - MSE Evaluation of teacher and student embeddings based on data in a data frame
    • evaluation/MSEEvaluatorFromDataLoader.py - MSE Evaluation if data is passed as a data loader

    Bugfixes:

    • model.encode() failed to sort sentences by length. This function has been fixed to boost encoding speed by reducing overhead of padding tokens.
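
    Why sorting by length reduces padding overhead can be shown with a small count of padding tokens per batch. This is an illustrative sketch with made-up sentence lengths; padding_tokens is a hypothetical helper that assumes each batch is padded to its longest sentence.

```python
def padding_tokens(lengths, batch_size):
    """Count padding tokens when each batch is padded to its longest item."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        longest = max(batch)
        total += sum(longest - n for n in batch)
    return total


lengths = [3, 50, 4, 48, 5, 52]  # token counts of six sentences
print(padding_tokens(lengths, batch_size=3))                        # 144
print(padding_tokens(sorted(lengths, reverse=True), batch_size=3))  # 9
```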
    Source code(tar.gz)
    Source code(zip)
  • v0.3.0(Jul 9, 2020)

    This release updates HuggingFace transformers to v3.0.2. Transformers made some breaking changes to the tokenization API. This (and future) versions will not be compatible with HuggingFace transformers v2.

    There are no known breaking changes for existing models or existing code. Models trained with version 2 can be loaded without issues.

    New Loss Functions

    Thanks to PR #299 and #176, several new loss functions were added: different triplet loss functions and ContrastiveLoss

    Source code(tar.gz)
    Source code(zip)
  • v0.2.6(Apr 16, 2020)

    This release updates huggingface/transformers to the release v2.8.0.

    New Features

    • models.Transformer: The Transformer-Model can now load any huggingface transformers model, like BERT, RoBERTa, XLNet, XLM-R, ELECTRA... It is based on the AutoModel from HuggingFace. You no longer need the architecture-specific models (like models.BERT, models.RoBERTa). It also works with the community models.
    • Multilingual Training: Code is released for making mono-lingual sentence embeddings models multi-lingual. See training_multilingual.py for an example. More documentation and details will follow soon.
    • WKPooling: Adding a pytorch implementation of SBERT-WK. Note, due to an inefficient implementation in pytorch of QR decomposition, WKPooling can only be run on the CPU, which makes it about 40 slower than mean pooling. For some models WKPooling improves the performance, for others it does not.
    • WeightedLayerPooling: A new pooling layer that uses representations from all transformer layers and learns a weighted sum of them. So far no improvement compared to only averaging the last layer.
    • New pre-trained models released. Every available model is documented in a Google Spreadsheet for an easier overview.
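
    The WeightedLayerPooling item above combines the representations of all transformer layers with per-layer weights. A minimal sketch for a single token vector, assuming the weights are given (in training they would be learned) and that the weighted sum is normalized by the sum of the weights; weighted_layer_average is a hypothetical helper.

```python
def weighted_layer_average(layer_outputs, layer_weights):
    """Combine one token's vectors from several layers into one vector.

    layer_outputs: list with one vector (list of floats) per layer.
    layer_weights: one weight per layer; normalizing by their sum is an
    assumption of this sketch.
    """
    total_weight = sum(layer_weights)
    dim = len(layer_outputs[0])
    return [
        sum(w * layer[d] for w, layer in zip(layer_weights, layer_outputs)) / total_weight
        for d in range(dim)
    ]


layers = [[1.0, 2.0], [3.0, 4.0]]  # two layers, 2-dimensional vectors
print(weighted_layer_average(layers, [1.0, 1.0]))  # [2.0, 3.0]
print(weighted_layer_average(layers, [1.0, 3.0]))  # [2.5, 3.5]
```

    With weight only on the last layer, the result reduces to using the last layer's output alone, which is the baseline the item compares against.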

    Minor changes

    • Clean-up of the examples folder.
    • Model and tokenizer arguments can now be passed to the according transformers models.
    • Previous version had some issues with RoBERTa and XLM-RoBERTa, that the wrong special characters were added. Everything is fixed now and relies on huggingface transformers for the correct addition of special characters to the input sentences.

    Breaking changes

    • STSDataReader: The default parameter values have been changed, so that it expects the sentences in the first two columns and the score in the third column. If you want to load the STS benchmark dataset, you can use the STSBenchmarkDataReader.
    Source code(tar.gz)
    Source code(zip)
  • v0.2.5(Jan 10, 2020)

    huggingface/transformers was updated to version 2.3.0

    Changes:

    • ALBERT works (bug was fixed in transformers). Does not yield improvements compared to BERT / RoBERTA
    • T5 added (does not run on GPU due to a bug in transformers). Does not yield improvements compared to BERT / RoBERTA
    • CamemBERT added
    • XLM-RoBERTa added
    Source code(tar.gz)
    Source code(zip)
  • v0.2.4(Dec 6, 2019)

    This version updates the underlying HuggingFace Transformer package to v2.2.1.

    Changes:

    • DistilBERT and ALBERT modules added
    • Pre-trained models for RoBERTa and DistilBERT uploaded
    • Some smaller bug-fixes
    Source code(tar.gz)
    Source code(zip)
  • v0.2.3(Aug 20, 2019)

    No breaking changes. Just update with pip install -U sentence-transformers

    Bugfixes:

    • SentenceTransformers can now be used with Windows (it previously threw an exception about invalid tensor types)
    • Outputs a warning if seq. length for BERT / RoBERTa is too long

    Improvements:

    • A flag can be set to hide the progress bar when a dataset is converted or an evaluator is executed
    Source code(tar.gz)
    Source code(zip)
  • v0.2.2(Aug 19, 2019)

    Updated pytorch-transformers to v1.1.0. Adding support for RoBERTa model.

    Bugfixes:

    • Critical bugfix for SoftmaxLoss: Classifier weights were not optimized in previous version
    • Minor fix for including the timestamp of the output folders
    Source code(tar.gz)
    Source code(zip)
  • v0.2.1(Aug 16, 2019)

Owner
Ubiquitous Knowledge Processing Lab
[ICCV 2021] Instance-level Image Retrieval using Reranking Transformers

Instance-level Image Retrieval using Reranking Transformers Fuwen Tan, Jiangbo Yuan, Vicente Ordonez, ICCV 2021. Abstract Instance-level image retriev

UVA Computer Vision 86 Dec 28, 2022
LeBenchmark: a reproducible framework for assessing SSL from speech

LeBenchmark: a reproducible framework for assessing SSL from speech

11 Nov 30, 2022
Topic Modelling for Humans

gensim – Topic Modelling in Python Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Targ

RARE Technologies 13.8k Jan 02, 2023
Code for EmBERT, a transformer model for embodied, language-guided visual task completion.

Code for EmBERT, a transformer model for embodied, language-guided visual task completion.

41 Jan 03, 2023
NLP topic mdel LDA - Gathered from New York Times website

NLP topic mdel LDA - Gathered from New York Times website

1 Oct 14, 2021
Code for EMNLP 2021 main conference paper "Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification"

Code for EMNLP 2021 main conference paper "Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification"

LancoPKU 105 Jan 03, 2023
Non-Autoregressive Translation with Layer-Wise Prediction and Deep Supervision

Deeply Supervised, Layer-wise Prediction-aware (DSLP) Transformer for Non-autoregressive Neural Machine Translation

Chenyang Huang 37 Jan 04, 2023
PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers

PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers

Microsoft 105 Jan 08, 2022
a test times augmentation toolkit based on paddle2.0.

Patta Image Test Time Augmentation with Paddle2.0! Input | # input batch of images / / /|\ \ \ # apply

AgentMaker 110 Dec 03, 2022
The Internet Archive Research Assistant - Daily search Internet Archive for new items matching your keywords

Kay Savetz 60 Dec 25, 2022
News-domain question answering system (graduation project, first semester of 2021)

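News-domain question answering system. This project was carried out to provide a question answering service for news articles. It ran for about three months (March 2021 to May 2021), using an encoder based on the Transformer architecture with a Korean question answering dataset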

TaegyeongEo 4 Jul 08, 2022
Package for controllable summarization

summarizers: summarizers is a package for controllable summarization based on CTRLsum. Currently, we only support English. It doesn't work in other languag

Hyunwoong Ko 72 Dec 07, 2022
Transformers Wav2Vec2 + Parlance's CTCDecode

🤗 Transformers Wav2Vec2 + Parlance's CTCDecode Introduction This repo shows how 🤗 Transformers can be used in combination with Parlance's ctcdecode

Patrick von Platen 9 Jul 21, 2022
Making text a first-class citizen in TensorFlow.

TensorFlow Text - Text processing in TensorFlow. IMPORTANT: When installing TF Text with pip install, please note the version of TensorFlow you are run

1k Dec 26, 2022
👄 The most accurate natural language detection library for Python, suitable for long and short text alike

1. What does this library do? Its task is simple: It tells you which language some provided textual data is written in. This is very useful as a prepr

Peter M. Stahl 334 Dec 30, 2022
PhoNLP: A BERT-based multi-task learning toolkit for part-of-speech tagging, named entity recognition and dependency parsing

PhoNLP is a multi-task learning model for joint part-of-speech (POS) tagging, named entity recognition (NER) and dependency parsing. Experiments on Vietnamese benchmark datasets show that PhoNLP prod

VinAI Research 109 Dec 02, 2022
Code for the paper PermuteFormer

PermuteFormer: This repo includes code for the paper PermuteFormer: Efficient Relative Position Encoding for Long Sequences. Directory long_range_aren

Peng Chen 42 Mar 16, 2022
Unofficial Python library for using the Polish Wordnet (plWordNet / Słowosieć)

Polish Wordnet Python library Simple, easy-to-use and reasonably fast library for using the Słowosieć (also known as PlWordNet) - a lexico-semantic da

Max Adamski 12 Dec 23, 2022
Example code for "Real-World Natural Language Processing"

Real-World Natural Language Processing This repository contains example code for the book "Real-World Natural Language Processing." AllenNLP (2.5.0 or

Masato Hagiwara 303 Dec 17, 2022