A fast, efficient universal vector embedding utility package.

Overview

Magnitude: a fast, simple vector embedding utility library

A feature-packed Python package and vector storage file format, developed by Plasticity, for utilizing vector embeddings in machine learning models in a fast, efficient, and simple manner. It is primarily intended as a simpler and faster alternative to Gensim, but it can be used as a generic key-vector store for domains outside NLP. It offers unique features like out-of-vocabulary lookups and streaming of large models over HTTP. Published in our paper at EMNLP 2018 and available on arXiv.

Installation

You can install this package with pip:

pip install pymagnitude # Python 2.7
pip3 install pymagnitude # Python 3

Google Colaboratory has conflicting dependencies that cause issues when installing Magnitude. You can use the following snippet to install Magnitude on Google Colaboratory:

# Install Magnitude on Google Colab
! echo "Installing Magnitude.... (please wait, can take a while)"
! (curl https://raw.githubusercontent.com/plasticityai/magnitude/master/install-colab.sh | /bin/bash 1>/dev/null 2>/dev/null)
! echo "Done installing Magnitude."

Motivation

Vector space embedding models have become increasingly common in machine learning and are traditionally popular for natural language processing applications. A fast, lightweight tool for consuming these large vector space embedding models efficiently has been lacking.

The Magnitude file format (.magnitude) for vector embeddings is intended to be a more efficient universal vector embedding format that allows for lazy-loading for faster cold starts in development, LRU memory caching for performance in production, multiple key queries, direct featurization to the inputs for a neural network, performant similarity calculations, and other nice-to-have features for edge cases like handling out-of-vocabulary keys or misspelled keys and concatenating multiple vector models together. It is also intended to work with large vector models that may not fit in memory.

It uses SQLite, a fast, popular embedded database, as its underlying data store. It uses database indexes for fast key lookups, along with memory mapping, SIMD instructions, and spatial indexing for fast similarity search in the vector space off-disk with good memory performance, even between multiple processes. Moreover, memory maps are cached between runs, so even after closing a process, speed improvements are reaped.

Benchmarks and Features

| Metric | Magnitude Light | Magnitude Medium | Magnitude Heavy | Magnitude Stream |
| --- | --- | --- | --- | --- |
| Initial load time | 0.7210s | ━ ¹ | ━ ¹ | 7.7550s |
| Cold single key query | 0.0001s | ━ ¹ | ━ ¹ | 1.6437s |
| Warm single key query (same key as cold query) | 0.00004s | ━ ¹ | ━ ¹ | 0.0004s |
| Cold multiple key query (n=25) | 0.0442s | ━ ¹ | ━ ¹ | 1.7753s |
| Warm multiple key query (n=25) (same keys as cold query) | 0.00004s | ━ ¹ | ━ ¹ | 0.0001s |
| First most_similar search query (n=10) (worst case) | 247.05s | ━ ¹ | ━ ¹ | - |
| First most_similar search query (n=10) (average case) (w/ disk persistent cache) | 1.8217s | ━ ¹ | ━ ¹ | - |
| Subsequent most_similar search (n=10) (different key than first query) | 0.2434s | ━ ¹ | ━ ¹ | - |
| Warm subsequent most_similar search (n=10) (same key as first query) | 0.00004s | 0.00004s | 0.00004s | - |
| First most_similar_approx search query (n=10, effort=1.0) (worst case) | N/A | N/A | 29.610s | - |
| First most_similar_approx search query (n=10, effort=1.0) (average case) (w/ disk persistent cache) | N/A | N/A | 0.9155s | - |
| Subsequent most_similar_approx search (n=10, effort=1.0) (different key than first query) | N/A | N/A | 0.1873s | - |
| Subsequent most_similar_approx search (n=10, effort=0.1) (different key than first query) | N/A | N/A | 0.0199s | - |
| Warm subsequent most_similar_approx search (n=10, effort=1.0) (same key as first query) | N/A | N/A | 0.00004s | - |
| File size | 4.21GB | 5.29GB | 10.74GB | 0.00GB |
| Process memory (RAM) utilization | 18KB | ━ ¹ | ━ ¹ | 1.71MB |
| Process memory (RAM) utilization after 100 key queries | 168KB | ━ ¹ | ━ ¹ | 1.91MB |
| Process memory (RAM) utilization after 100 key queries + similarity search | 342KB ² | ━ ¹ | ━ ¹ | - |
Features:

- Integrity checks and tests
- Universal format between word2vec (.txt, .bin), GloVe (.txt), fastText (.vec), and ELMo (.hdf5) with converter utility
- Simple, Pythonic interface
- Few dependencies
- Support for larger-than-memory models
- Lazy loading whenever possible for speed and performance
- Optimized for threading and multiprocessing
- Bulk and multiple key lookup with padding, truncation, placeholder, and featurization support
- Concatenating multiple vector models together
- Basic out-of-vocabulary key lookup (character n-gram feature hashing)
- Advanced out-of-vocabulary key lookup with support for misspellings (character n-gram feature hashing to similar in-vocabulary keys)
- Approximate most similar search with an Annoy index
- Built-in training for new models

¹: same value as previous column
²: uses mmap to read from disk, so the OS will still allocate pages of memory when memory is available, but the memory can be shared between processes and is not managed within each process for extremely large files, which is a performance win
*: All benchmarks were performed on the Google News pre-trained word vectors (GoogleNews-vectors-negative300.bin) with a MacBook Pro (Retina, 15-inch, Mid 2014), 2.2GHz quad-core Intel Core i7, 16GB RAM, on SSD, averaged over multiple trials where feasible.

Pre-converted Magnitude Formats of Popular Embeddings Models

Popular embedding models have been pre-converted to the .magnitude format for immediate download and usage:

| Contributor | Data | Light (basic support for out-of-vocabulary keys) | Medium (recommended) (advanced support for out-of-vocabulary keys) | Heavy (advanced support for out-of-vocabulary keys and faster most_similar_approx) |
| --- | --- | --- | --- | --- |
| Google - word2vec | Google News 100B | 300D | 300D | 300D |
| Stanford - GloVe | Wikipedia 2014 + Gigaword 5 6B | 50D, 100D, 200D, 300D | 50D, 100D, 200D, 300D | 50D, 100D, 200D, 300D |
| Stanford - GloVe | Wikipedia 2014 + Gigaword 5 6B (lemmatized by Plasticity) | 50D, 100D, 200D, 300D | 50D, 100D, 200D, 300D | 50D, 100D, 200D, 300D |
| Stanford - GloVe | Common Crawl 840B | 300D | 300D | 300D |
| Stanford - GloVe | Twitter 27B | 25D, 50D, 100D, 200D | 25D, 50D, 100D, 200D | 25D, 50D, 100D, 200D |
| Facebook - fastText | English Wikipedia 2017 16B | 300D | 300D | 300D |
| Facebook - fastText | English Wikipedia 2017 + subword 16B | 300D | 300D | 300D |
| Facebook - fastText | Common Crawl 600B | 300D | 300D | 300D |
| AI2 - AllenNLP ELMo | ELMo Models | ELMo Models | ELMo Models | ELMo Models |
| Google - BERT | Coming Soon... | Coming Soon... | Coming Soon... | Coming Soon... |

There are instructions below for converting any .bin, .txt, .vec, or .hdf5 file to a .magnitude file.

Using the Library

Constructing a Magnitude Object

You can create a Magnitude object like so:

from pymagnitude import *
vectors = Magnitude("/path/to/vectors.magnitude")

If needed, and included for convenience, you can also open a .bin, .txt, .vec, or .hdf5 file directly with Magnitude. This is, however, less efficient and very slow for large models, as it will convert the file to a .magnitude file in a temporary directory on the first run. The temporary directory is not guaranteed to persist and does not persist when your computer reboots. You should typically pre-convert .bin, .txt, .vec, and .hdf5 files with python -m pymagnitude.converter for faster speeds, but this feature is useful for one-off use cases. A warning will be generated when instantiating a Magnitude object directly with a .bin, .txt, .vec, or .hdf5 file. You can suppress warnings by setting the supress_warnings argument in the constructor to True.


  • By default, lazy loading is enabled. You can pass in an optional lazy_loading argument to the constructor with the value -1 to disable lazy-loading and pre-load all vectors into memory (a la Gensim), 0 (default) to enable lazy-loading with an unbounded in-memory LRU cache, or an integer greater than zero X to enable lazy-loading with an LRU cache that holds the X most recently used vectors in memory.
  • If you want the data for the most_similar functions to be pre-loaded eagerly on initialization, set eager to True.
  • Note: even when lazy_loading is set to -1 or eager is set to True, data will be pre-loaded into memory in a background thread to prevent the constructor from blocking for a few minutes for large models. If you really want blocking behavior, you can pass True to the blocking argument.
  • By default, unit-length normalized vectors are returned unless you are loading an ELMo model. Set the optional argument normalized to False if you wish to receive the raw, non-normalized vectors instead.
  • By default, NumPy arrays are returned for queries. Set the optional argument use_numpy to False if you wish to receive Python lists instead.
  • By default, querying for keys is case-sensitive. Set the optional argument case_insensitive to True if you wish to perform case-insensitive searches.
  • Optionally, you can include the pad_to_length argument which will specify the length all examples should be padded to if passing in multiple examples. Any examples that are longer than the pad length will be truncated.
  • Optionally, you can set the truncate_left argument to True if you want the beginning of the list of keys in each example to be truncated instead of the end in case it is longer than pad_to_length when specified.
  • Optionally, you can set the pad_left argument to True if you want the padding to appear at the beginning versus the end (which is the default).
  • Optionally, you can pass in the placeholders argument, which will increase the dimensions of each vector by the placeholders amount, zero-padding those extra dimensions. This is useful if you plan to add other values and information to the vectors and want the space for that pre-allocated in the vectors for efficiency (see the sketch after this list).
  • Optionally, you can pass in the language argument with an ISO 639-1 Language Code, which, if you are using Magnitude for word vectors, will ensure the library respects stemming and other language-specific features for that language. The default is en for English. You can also pass in None if you are not using Magnitude for word vectors.
  • Optionally, you can pass in the dtype argument which will let you control the data type of the NumPy arrays returned by Magnitude.
  • Optionally, you can pass in the devices argument which will let you control the usage of GPUs when the underlying model supports GPU usage. This argument should be a list of integers, where each integer represents the GPU device number (0, 1, etc.).
  • Optionally, you can pass in the temp_dir argument which will let you control the location of the temporary directory Magnitude will use.
  • Optionally, you can pass in the log argument which will have Magnitude log progress to standard error when slow operations are taking place.
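
Putting several of these options together, here is a minimal sketch (the path and all option values are illustrative, not recommendations):

from pymagnitude import *
vectors = Magnitude("/path/to/vectors.magnitude",
                    case_insensitive = True, # case-insensitive lookups
                    lazy_loading = 50000, # LRU cache of the 50,000 most recently used vectors
                    pad_to_length = 30, # pad or truncate every example to 30 keys
                    placeholders = 5) # reserve 5 extra zero-padded dimensions per vector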

Querying

You can query the total number of vectors in the file like so:

len(vectors)

You can query the dimensions of the vectors like so:

vectors.dim

You can check if a key is in the vocabulary like so:

"cat" in vectors

You can iterate through all keys and vectors like so:

for key, vector in vectors:
  ...

You can query for the vector of a key like so:

vectors.query("cat")

You can index for the n-th key and vector like so:

vectors[42]

You can query for the vector of multiple keys like so:

vectors.query(["I", "read", "a", "book"])

A 2D array (keys by vectors) will be returned.


You can query for the vector of multiple examples like so:

vectors.query([["I", "read", "a", "book"], ["I", "read", "a", "magazine"]])

A 3D array (examples by keys by vectors) will be returned. If pad_to_length is not specified, and the size of each example is uneven, they will be padded to the length of the longest example.
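
For example, a sketch of the padding behavior with pad_to_length set in the constructor (the value is illustrative):

vectors = Magnitude("/path/to/vectors.magnitude", pad_to_length = 6)
vectors.query([["I", "read", "a", "book"], ["I", "read"]]) # both examples are padded (or truncated) to 6 keys, returning a 2 x 6 x vectors.dim array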


You can index for the keys and vectors of multiple indices like so:

vectors[:42] # slice notation
vectors[42, 1337, 2001] # tuple notation

You can query the distance of two or multiple keys like so:

vectors.distance("cat", "dog")
vectors.distance("cat", ["dog", "tiger"])

You can query the similarity of two or multiple keys like so:

vectors.similarity("cat", "dog")
vectors.similarity("cat", ["dog", "tiger"])

You can query for the most similar key out of a list of keys to a given key like so:

vectors.most_similar_to_given("cat", ["dog", "television", "laptop"]) # dog

You can query for which key doesn't match a list of keys to a given key like so:

vectors.doesnt_match(["breakfast", "cereal", "dinner", "lunch"]) # cereal

You can query for the most similar (nearest neighbors) keys like so:

vectors.most_similar("cat", topn = 100) # Most similar by key
vectors.most_similar(vectors.query("cat"), topn = 100) # Most similar by vector

Optionally, you can pass a min_similarity argument to most_similar. Values in the range [-1.0, 1.0] are valid.
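
For example (the threshold value is illustrative):

vectors.most_similar("cat", topn = 100, min_similarity = 0.5) # only return keys with a similarity of at least 0.5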


You can also query for the most similar keys giving positive and negative examples (which, incidentally, solves analogies) like so:

vectors.most_similar(positive = ["woman", "king"], negative = ["man"]) # queen

Similar to vectors.most_similar, a vectors.most_similar_cosmul function exists that uses the 3CosMul function from Levy and Goldberg:

vectors.most_similar_cosmul(positive = ["woman", "king"], negative = ["man"]) # queen

You can also query for the most similar keys using an approximate nearest neighbors index which is much faster, but doesn't guarantee the exact answer:

vectors.most_similar_approx("cat")
vectors.most_similar_approx(positive = ["woman", "king"], negative = ["man"])

Optionally, you can pass an effort argument with values in the range [0.0, 1.0] to the most_similar_approx function, which lets you trade accuracy for speed. The default value for effort is 1.0, which will take the longest but give the most accurate results.
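
For example (the effort value is illustrative):

vectors.most_similar_approx("cat", topn = 10, effort = 0.25) # faster, but less accurate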


You can query for all keys closer to a key than another key is like so:

vectors.closer_than("cat", "rabbit") # ["dog", ...]

You can access all of the underlying vectors in the model in a large numpy.memmap array of size (len(vectors) x vectors.emb_dim) like so:

vectors.get_vectors_mmap()

You can clean up all associated resources, open files, and database connections like so:

vectors.close()

Basic Out-of-Vocabulary Keys

For word vector representations, handling out-of-vocabulary keys is important for dealing with new words not in the trained model, handling misspellings and typos, and making models trained on the word vector representations more robust in general.

Out-of-vocabulary keys are handled by assigning them a random vector value. However, the randomness is deterministic, so if the same out-of-vocabulary key is encountered twice, it will be assigned the same random vector value for the sake of being able to train on those out-of-vocabulary keys. Moreover, if two out-of-vocabulary keys share similar character n-grams ("uberx", "uberxl"), they will be placed close to each other even if they are both not in the vocabulary:

vectors = Magnitude("/path/to/GoogleNews-vectors-negative300.magnitude")
"uberx" in vectors # False
"uberxl" in vectors # False
vectors.query("uberx") # array([ 5.07109939e-02, -7.08248823e-02, -2.74812328e-02, ... ])
vectors.query("uberxl") # array([ 0.04734962, -0.08237578, -0.0333479, -0.00229564, ... ])
vectors.similarity("uberx", "uberxl") # 0.955000000200815

Advanced Out-of-Vocabulary Keys

If using a Magnitude file with advanced out-of-vocabulary support (Medium or Heavy), out-of-vocabulary keys will also be embedded close to similar keys (determined by string similarity) that are in the vocabulary:

vectors = Magnitude("/path/to/GoogleNews-vectors-negative300.magnitude")
"uberx" in vectors # False
"uberification" in vectors # False
"uber" in vectors # True
vectors.similarity("uberx", "uber") # 0.7383483267618451
vectors.similarity("uberification", "uber") # 0.745452837882727

Handling Misspellings and Typos

This also makes Magnitude robust to a lot of spelling errors:

vectors = Magnitude("/path/to/GoogleNews-vectors-negative300.magnitude")
"missispi" in vectors # False
vectors.similarity("missispi", "mississippi") # 0.35961736624824003
"discrimnatory" in vectors # False
vectors.similarity("discrimnatory", "discriminatory") # 0.8309152561753461
"hiiiiiiiiii" in vectors # False
vectors.similarity("hiiiiiiiiii", "hi") # 0.7069775034853861

Character n-grams are used to create this effect for out-of-vocabulary keys. The inspiration for this feature was taken from Facebook AI Research's Enriching Word Vectors with Subword Information, but instead of utilizing character n-grams at train time, character n-grams are used at inference time so the effect can be somewhat replicated (but not perfectly replicated) in older models that were not trained with character n-grams, like word2vec and GloVe.
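
To illustrate the idea (a minimal sketch, not Magnitude's exact implementation; the n-gram sizes are assumptions for illustration), out-of-vocabulary handling operates on character n-grams of a key like these:

def char_ngrams(key, n_min = 3, n_max = 6):
    # All character n-grams of the key between n_min and n_max characters long
    return [key[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(key) - n + 1)]

char_ngrams("uberx") # ['ube', 'ber', 'erx', 'uber', 'berx', 'uberx']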

Concatenation of Multiple Models

Optionally, you can combine vectors from multiple models to feed stronger information into a machine learning model like so:

from pymagnitude import *
word2vec = Magnitude("/path/to/GoogleNews-vectors-negative300.magnitude")
glove = Magnitude("/path/to/glove.6B.50d.magnitude")
vectors = Magnitude(word2vec, glove) # concatenate word2vec with glove
vectors.query("cat") # returns 350-dimensional NumPy array ('cat' from word2vec concatenated with 'cat' from glove)
vectors.query(("cat", "cats")) # returns 350-dimensional NumPy array ('cat' from word2vec concatenated with 'cats' from glove)

You can concatenate more than two vector models simply by passing more arguments to the constructor.

Additional Featurization (Parts of Speech, etc.)

You can automatically create vectors from additional features you may have such as parts of speech, syntax dependency information, or any other information using the FeaturizerMagnitude class:

from pymagnitude import *
pos_vectors = FeaturizerMagnitude(100, namespace = "PartsOfSpeech")
pos_vectors.dim # 4 - number of dims automatically determined by Magnitude from 100
pos_vectors.query("NN") # - array([ 0.08040417, -0.71705252,  0.61228951,  0.32322192]) 
pos_vectors.query("JJ") # - array([-0.11681135,  0.10259253,  0.8841201 , -0.44063763])
pos_vectors.query("NN") # - array([ 0.08040417, -0.71705252,  0.61228951,  0.32322192]) (deterministic hashing so the same value is returned every time for the same key)
dependency_vectors = FeaturizerMagnitude(100, namespace = "SyntaxDependencies")
dependency_vectors.dim # 4 - number of dims automatically determined by Magnitude from 100
dependency_vectors.query("nsubj") # - array([-0.81043793,  0.55401352, -0.10838071,  0.15656626])
dependency_vectors.query("prep") # - array([-0.30862918, -0.44487267, -0.0054573 , -0.84071788])

Magnitude will use the feature hashing trick internally to directly use the hash of the feature value to create a unique vector for that feature value.

The first argument to FeaturizerMagnitude should be an approximate upper bound on the number of values for the feature. Since there are < 100 parts of speech tags and < 100 syntax dependencies, we choose 100 for both in the example above. The value chosen will determine how many dimensions Magnitude will automatically assign to the particular FeaturizerMagnitude object to reduce the chance of a hash collision. The namespace argument can be any string that describes your additional feature. It is optional, but highly recommended.
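
Conceptually, the feature hashing trick looks something like the following sketch (a minimal illustration, not Magnitude's exact implementation):

import hashlib
import numpy as np

def hashed_feature_vector(value, dims):
    # Seed a PRNG with a stable hash of the feature value so the same
    # value always maps to the same vector (deterministic hashing)
    seed = int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16) % (2 ** 32)
    rng = np.random.RandomState(seed)
    vector = rng.uniform(-1, 1, size = dims)
    return vector / np.linalg.norm(vector) # unit-length, like Magnitude's vectors

hashed_feature_vector("NN", 4) # deterministic 4-dimensional vector for "NN"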

You can then concatenate these features for use with a standard Magnitude object:

from pymagnitude import *
word2vec = Magnitude("/path/to/GoogleNews-vectors-negative300.magnitude")
pos_vectors = FeaturizerMagnitude(100, namespace = "PartsOfSpeech")
dependency_vectors = FeaturizerMagnitude(100, namespace = "SyntaxDependencies")
vectors = Magnitude(word2vec, pos_vectors, dependency_vectors) # concatenate word2vec with pos and dependencies
vectors.query([
    ("I", "PRP", "nsubj"), 
    ("saw", "VBD", "ROOT"), 
    ("a", "DT", "det"), 
    ("cat", "NN", "dobj"), 
    (".",  ".", "punct")
  ]) # array of size 5 x (300 + 4 + 4) or 5 x 308

# Or get a unique vector for every 'buffalo' in:
# "Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo"
# (https://en.wikipedia.org/wiki/Buffalo_buffalo_Buffalo_buffalo_buffalo_buffalo_Buffalo_buffalo)
vectors.query([
    ("Buffalo", "JJ", "amod"), 
    ("buffalo", "NNS", "nsubj"), 
    ("Buffalo", "JJ", "amod"), 
    ("buffalo", "NNS", "nsubj"), 
    ("buffalo",  "VBP", "rcmod"),
    ("buffalo",  "VB", "ROOT"),
    ("Buffalo",  "JJ", "amod"),
    ("buffalo",  "NNS", "dobj")
  ]) # array of size 8 x (300 + 4 + 4) or 8 x 308

A machine learning model, given this output, now has access to parts of speech information and syntax dependency information instead of just word vector information. In this case, this additional information can give neural networks stronger signal for semantic information and reduce the need for training data.

Using Magnitude with an ML library

Magnitude makes it very easy to quickly build and iterate on models that need to use vector representations by taking care of a lot of pre-processing code to convert a dataset of text (or keys) into vectors. Moreover, it can make these models more robust to out-of-vocabulary words and misspellings.

There is example code available using Magnitude to build an intent classification model for the ATIS (Airline Travel Information Systems) dataset (Train/Test), used for chatbots or conversational interfaces, in a few popular machine learning libraries below.

Keras

You can access a guide for using Magnitude with Keras (which supports TensorFlow, Theano, CNTK) at this Google Colaboratory Python notebook.

PyTorch

The PyTorch guide is coming soon.

TFLearn

The TFLearn guide is coming soon.

Utils

You can use the MagnitudeUtils class for convenient access to functions that may be useful when creating machine learning models.

You can import MagnitudeUtils like so:

  from pymagnitude import MagnitudeUtils

You can download a Magnitude model from a remote source like so:

  vecs = Magnitude(MagnitudeUtils.download_model('word2vec/heavy/GoogleNews-vectors-negative300'))

By default, download_model will download files from http://magnitude.plasticity.ai to a ~/.magnitude folder created automatically. If the file has already been downloaded, it will not be downloaded again. You can change the directory of the local download folder using the optional download_dir argument. You can change the domain from which models will be downloaded with the optional remote_path argument.
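
For example (the argument values are placeholders):

  vecs = Magnitude(MagnitudeUtils.download_model(
      'word2vec/heavy/GoogleNews-vectors-negative300',
      download_dir = '/path/to/models/', # custom local download folder (placeholder)
      remote_path = 'http://example.com/models/' # custom download domain (placeholder)
  ))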

You can create a batch generator for X and y data with batchify, like so:

  X = [.3, .2, .7, .8, .1]
  y = [0, 0, 1, 1, 0]
  batch_gen = MagnitudeUtils.batchify(X, y, 2)
  for X_batch, y_batch in batch_gen:
    print(X_batch, y_batch)
  # Returns:
  # 1st loop: X_batch = [.3, .2], y_batch = [0, 0]
  # 2nd loop: X_batch = [.7, .8], y_batch = [1, 1]
  # 3rd loop: X_batch = [.1], y_batch = [0]
  # next loop: repeats infinitely...

You can encode class labels to integers and back with class_encoding, like so:

  add_class, class_to_int, int_to_class = MagnitudeUtils.class_encoding()
  add_class("cat") # Returns: 0
  add_class("dog") # Returns: 1
  add_class("cat") # Returns: 0
  class_to_int("dog") # Returns: 1
  class_to_int("cat") # Returns: 0
  int_to_class(1) # Returns: "dog"
  int_to_class(0) # Returns: "cat"

You can convert categorical data with class integers to one-hot NumPy arrays with to_categorical, like so:

  y = [1, 5, 2]
  MagnitudeUtils.to_categorical(y, num_classes = 6) # num_classes is optional
  # Returns:
  # array([[0., 1., 0., 0., 0., 0.],
  #        [0., 0., 0., 0., 0., 1.],
  #        [0., 0., 1., 0., 0., 0.]])

You can convert from one-hot NumPy arrays back to a 1D NumPy array of class integers with from_categorical, like so:

  y_c = [[0., 1., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 1.]]
  MagnitudeUtils.from_categorical(y_c)
  # Returns: 
  # array([1., 5.])

Concurrency and Parallelism

The library is thread safe (it uses a different connection to the underlying store per thread), is read-only, and never writes to the file. Because of the light memory usage, you can also run it in multiple processes (or use multiprocessing) with different address spaces without having to duplicate the data in memory like with other libraries, and without having to create a multi-process shared variable, since data is read off-disk and each process keeps its own LRU memory cache. For heavier functions, like most_similar, a shared memory mapped file is created to share memory between processes.
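
A minimal sketch of using Magnitude with multiprocessing, where each worker process opens its own Magnitude object against the same file (the path, pool size, and worker function are placeholders):

from multiprocessing import Pool
from pymagnitude import Magnitude

vectors = None

def init_worker(path):
    # Each process opens its own connection to the same .magnitude file;
    # lazy loading makes this cheap, and since data is read off-disk,
    # it is never duplicated across processes
    global vectors
    vectors = Magnitude(path)

def embed(word):
    return vectors.query(word)

if __name__ == "__main__":
    with Pool(4, initializer = init_worker,
              initargs = ("/path/to/vectors.magnitude",)) as pool:
        results = pool.map(embed, ["cat", "dog", "book"])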

File Format and Converter

The Magnitude package uses the .magnitude file format instead of .bin, .txt, .vec, or .hdf5 as with other vector models like word2vec, GloVe, fastText, and ELMo. There is an included command-line utility for converting word2vec, GloVe, fastText, and ELMo files to Magnitude files.

You can convert them like so:

python -m pymagnitude.converter -i <PATH TO FILE TO BE CONVERTED> -o <OUTPUT PATH FOR MAGNITUDE FILE>

The input format will automatically be determined by the extension / the contents of the input file. You should only need to perform this conversion once for a model. After converting, the Magnitude file is static; it will not be modified or written to, which makes concurrent read access safe.

The flags for pymagnitude.converter are specified below:

  • You can pass in the -h flag for help and to list all flags.
  • You can use the -p <PRECISION> flag to specify the decimal precision to retain (selecting a lower number will create smaller files). The actual underlying values are stored as integers instead of floats so this is essentially quantization for smaller model footprints.
  • You can add an approximate nearest neighbors index to the file (increases size) with the -a flag which will enable the use of the most_similar_approx function. The -t <TREES> flag controls the number of trees in the approximate nearest neighbors index (higher is more accurate) when used in conjunction with the -a flag (if not supplied, the number of trees is automatically determined) (see the example after this list).
  • You can pass the -s flag to disable adding subword information to the file (which will make the file smaller), at the cost of advanced out-of-vocabulary key support.
  • If converting a model that has no vocabulary like ELMo, you can pass the -v flag along with the path to another Magnitude file you would like to take the vocabulary from.
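
For example, to convert a word2vec .bin file into a Magnitude file with an approximate nearest neighbors index (the file names and tree count are illustrative):

python -m pymagnitude.converter -i GoogleNews-vectors-negative300.bin -o GoogleNews-vectors-negative300.magnitude -a -t 50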

Optionally, you can bulk convert many files by passing an input folder and output folder instead of an input file and output file. All .txt, .bin, .vec, and .hdf5 files in the input folder will be converted to .magnitude files in the output folder. The output folder must exist before a bulk conversion operation, as shown below.
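
For example, a bulk conversion (the paths are placeholders):

python -m pymagnitude.converter -i /path/to/input_folder/ -o /path/to/output_folder/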

Remote Loading

You can instruct Magnitude to download and open a model from Magnitude's remote repository instead of a local file path. The file will automatically be downloaded locally on the first run to ~/.magnitude/, and the download will be skipped on subsequent runs if the file already exists locally.

  vecs = Magnitude('http://magnitude.plasticity.ai/word2vec/heavy/GoogleNews-vectors-negative300.magnitude') # full url
  vecs = Magnitude('word2vec/heavy/GoogleNews-vectors-negative300') # or, use the shorthand for the url

For more control over the remote download domain and local download directory, see how to use MagnitudeUtils.download_model.

Remote Streaming over HTTP

Magnitude models are generally large files (multiple GB) that take up a lot of disk space, even though the .magnitude format makes it fast to utilize the vectors. Magnitude has an option to stream these large files over HTTP. This is explicitly different from the remote loading feature, in that the model doesn't even need to be downloaded at all. You can begin querying models immediately with no disk space used at all.

  vecs = Magnitude('http://magnitude.plasticity.ai/word2vec/heavy/GoogleNews-vectors-negative300.magnitude', stream=True) # full url
  vecs = Magnitude('word2vec/heavy/GoogleNews-vectors-negative300', stream=True) # or, use the shorthand for the url

  vecs.query("king") # Returns: the vector for "king" quickly, even with no local model file downloaded

You can play around with a demo of this in a Google Colaboratory Python Notebook.

This feature is extremely useful if your computing environment is resource-constrained (low RAM and low disk space), you want to experiment quickly with vectors without downloading and setting up large model files, or you are training a small model. While there is some added network latency since the data is being streamed, Magnitude will still use an in-memory cache as specified by the lazy_loading constructor parameter. Since languages generally have a Zipfian distribution, the network latency should largely not be an issue once the cache is warmed after a small number of queries.
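
For example, a sketch combining streaming with a bounded in-memory cache (the cache size is illustrative):

  vecs = Magnitude('word2vec/heavy/GoogleNews-vectors-negative300', stream=True, lazy_loading=50000) # keep the 50,000 most recently used vectors in memory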

Models are queried directly off a static HTTP web server using HTTP Range Request headers. All Magnitude methods support streaming; however, most_similar and most_similar_approx may be slow as they are not optimized for streaming yet. You can see how this streaming mode currently performs in the benchmarks; it will get faster as we optimize it in the future!

Other Documentation

Other documentation is not available at this time. See the source file directly (it is well commented) if you need more information about a method's arguments or want to see all supported features.

Other Languages

Currently, we only provide English word vector models on this page pre-converted to the .magnitude format. You can, however, still use Magnitude with word vectors of other languages. Facebook has trained its fastText vectors for many different languages. You can download the .vec file for any language you want and then convert it to .magnitude with the converter.

Other Programming Languages

Currently, reading Magnitude files is only supported in Python, since it has become the de-facto language for machine learning. This is sufficient for most use cases. Extending the file format to other languages shouldn't be difficult as SQLite has a native C implementation and has bindings in most languages. The file format itself and the protocol for reading and searching is also fairly straightforward upon reading the source code of this repository.

Other Domains

Currently, natural language processing is the most popular domain that uses pre-trained vector embedding models for word vector representations. There are, however, other domains like computer vision that have started using pre-trained vector embedding models like Deep1B for image representation. This library intends to stay agnostic to various domains and instead provides a generic key-vector store and interface that is useful for all domains.

Contributing

The main repository for this project can be found on GitLab. The GitHub repository is only a mirror. Pull requests for more tests, better error-checking, bug fixes, performance improvements, documentation, or additional utilities / functionality are welcome on GitLab.

You can contact us at [email protected].

Roadmap

  • Speed optimizations on remote streaming and exposing stream cache configuration options
  • Make most_similar_approx optimized for streaming
  • In addition to the "Light", "Medium", and "Heavy" flavors, add a "Ludicrous" flavor that will be of an even larger file size but removes the constraint of the initially slow most_similar lookups.
  • Add Google BERT support
  • Support fastText .bin format

Other Notable Projects

  • spotify/annoy - Powers the approximate nearest neighbors algorithm behind most_similar_approx in Magnitude using random-projection trees and hierarchical 2-means. Thanks to author Erik Bernhardsson for helping out with some of the integration details between Magnitude and Annoy.

Citing this Repository

If you'd like to cite our paper at EMNLP 2018, you can use the following BibTeX citation:

@inproceedings{patel2018magnitude,
  title={Magnitude: A Fast, Efficient Universal Vector Embedding Utility Package},
  author={Patel, Ajay and Sands, Alexander and Callison-Burch, Chris and Apidianaki, Marianna},
  booktitle={Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations},
  pages={120--126},
  year={2018}
}

or follow the Google Scholar link for other ways to cite the paper.

If you'd like to cite this repository, you can use the following DOI badge: DOI

Clicking on the badge will lead to a page that will help you generate proper BibTeX citations, JSON-LD citations, and other citations.

LICENSE and Attribution

This repository is licensed under the license found here.

“Seismic” icon by JohnnyZi from the Noun Project.

Comments
  • Problems with memory mapped files and vectors.most_similar()

    Hello. I am on Windows using Python 3.6.

    There seems to be a problem with creating the magmmap files for using the vectors.most_similar() method. Error:

    
      File "\pymagnitude\__init__.py", line 1074, in get_vectors_mmap
        os.rename(path_to_mmap_temp, self.path_to_mmap)
    
    PermissionError: [WinError 32] The process cannot access the file because it is in use by another process \\AppDataTemp\\82ce74f40baf842d23c45b0e90688b9f.magmmap.tmp' -> '\AppData\\Local\\Temp\\82ce74f40baf842d23c45b0e90688b9f.magmmap'
    

    I looked into the code and it seems you are creating and filling these files in the background. It works fine for the vectors.most_similar_approx() method. The memmap file for approx seems to be created just fine.

    The problem persists even when I use blocking=True when constructing the Magnitude object. Then I can't even use the approx method. It just hangs while initializing the object.

    Thank you for your time.

    bug windows 
    opened by yassin-taskin 12
  • Long time to install

    I have seen your efforts on installing this package and its dependencies, and I appreciate your great job. But it still takes more than an hour to install this package in China, either with SKIP_ or not. So I hope you could build most things as binary wheels and let them be downloaded from GitHub for faster installation. Again, thank you for your time.

    opened by gladuo 10
  • Error with most_similar "too many SQL variables"

    Trying one of the examples on the main page (version 0.1.5 from pypi)

    Using the fasttext common crawl vectors (medium)

    http://magnitude.plasticity.ai/fasttext+subword/crawl-300d-2M.magnitude

    >>> vectors = Magnitude("crawl-300d-2M.magnitude")
    >>> vectors.most_similar("cat", topn = 100)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/ian/anaconda3/lib/python3.5/site-packages/pymagnitude/third_party/repoze/lru/__init__.py", line 354, in cached_wrapper
        val = func(*args, **kwargs)
      File "/home/ian/anaconda3/lib/python3.5/site-packages/pymagnitude/__init__.py", line 962, in most_similar
        return_similarities=return_similarities, method='distance')
      File "/home/ian/anaconda3/lib/python3.5/site-packages/pymagnitude/__init__.py", line 916, in _db_query_similarity
        return_vector = False)
      File "/home/ian/anaconda3/lib/python3.5/site-packages/pymagnitude/__init__.py", line 760, in index
        return self._keys_for_indices(q, return_vector=return_vector)
      File "/home/ian/anaconda3/lib/python3.5/site-packages/pymagnitude/__init__.py", line 672, in _keys_for_indices
        unseen_indices)
    sqlite3.OperationalError: too many SQL variables
    
    

    Any other info that would be helpful here?

    bug sqlite3 
    opened by IanCal 9
  • Requirements do not properly install

    A clean install in a virtual environment does not install requirements.

    pip 18.0    
    Python 3.6.3    
    
    ➜  virtualenv -p python3 foo; source foo/bin/activate; pip3 install pymagnitude                                         
    In [1]: from pymagnitude import Magnitude                           
    ---------------------------------------------------------------------------                                                             
    ModuleNotFoundError                       Traceback (most recent call last)                                                             
    <ipython-input-1-a4fc6d35defa> in <module>()                        
    ----> 1 from pymagnitude import Magnitude                           
    
    /tmp/foo/lib/python3.6/site-packages/pymagnitude/__init__.py in <module>()                                                              
         11 import hashlib                                              
         12 import heapq                                                
    ---> 13 import lz4.frame                                            
         14 import math                                                 
         15 import operator                                             
    
    ModuleNotFoundError: No module named 'lz4'                          
    
    bug python3 installation virtualenv 
    opened by mikeyshulman 6
  • WTF?

    First of all, thank you for making magnitude available openly, and please don't take this as an attack; it is just food for thought. The title sums up my reaction when I tried to install the latest version of magnitude. (I've been using version 0.1.13 for a while now.)

    I work in a corporate environment with an internal PyPI proxy and some somewhat stringent checks on packages and stuff. All our packages and dependencies are scanned for license issues and known code smells/bugs, as well as against a list of known CVEs. I tried installing the latest version of magnitude and after downloading the wheel it just hung. After a few failed attempts, I tried to download the archive and install it manually, and I started seeing all these calls to AWS and PyPI which of course would fail since I'm cut off from the internet.

    I know that my issue isn't yours, but ... not cool guys, setup.py has its semantics and hijacking it like that ... well, suffice it to say I will probably have a call asking about all these connection attempts to blacklisted URLs.

    Anyway, not sure what can be done about it, as I didn't spend the time to understand everything that was going on but I still wanted to let you know that you are breaking at least this lone developer's project by doing so.

    opened by aborsu 5
  • Database disk image malformed with multiprocessing

    Hi @AjayP13, I was curious if you had any examples of how you've used this with multiprocessing previously. I'm bumping into a pysqlite error when I try to run with multiprocessing:

    coord = tf.train.Coordinator()
    processes = []
    for i in range(num_processes):
        args = (texts_sliced[i], labels_sliced[i], output_files[i], concatenated_embeddings)
        p = Process(target=_convert_shard, args=args)
        p.start()
        processes.append(p)
    coord.join(processes)
    
      File "/home/jacob/test.py", line 454, in _convert_shard 
        text_embedding = embedding.query(text) 
      File "/home/jacob/anaconda3/pymagnitude/third_party/repoze/lru/__init__.py", line 390, in cached_wrapper                                                                 
        val = func(*args, **kwargs) 
    pysqlite2.dbapi2.DatabaseError: database disk image is malformed                                                         
      File "/home/jacob/anaconda3/pymagnitude/__init__.py", line 2088, in query
        for i, m in enumerate(self.magnitudes)]
      File "/home/jacob/anaconda3/pymagnitude/__init__.py", line 2088, in <listcomp>
        for i, m in enumerate(self.magnitudes)] 
      File "/home/jacob/anaconda3/pymagnitude/third_party/repoze/lru/__init__.py", line 390, in cached_wrapper
        val = func(*args, **kwargs)
      File "/home/jacob/anaconda3/pymagnitude/__init__.py", line 1221, in query
        vectors = self._vectors_for_keys_cached(q, normalized)
      File "/home/jacob/anaconda3/pymagnitude/__init__.py", line 1109, in _vectors_for_keys_cached 
        unseen_keys[i], normalized, force=force)
      File "/home/jacob/anaconda3/pymagnitude/third_party/repoze/lru/__init__.py", line 390, in cached_wrapper 
        val = func(*args, **kwargs)
      File "/home/jacob/anaconda3/pymagnitude/__init__.py", line 483, in _out_of_vocab_vector_cached
        return self._out_of_vocab_vector(*args, **kwargs)
      File "/home/jacob/anaconda3/pymagnitude/__init__.py", line 992, in _out_of_vocab_vector, normalized=normalized)
      File "/home/jacob/anaconda3/pymagnitude/__init__.py", line 829, in _db_query_similar_keys_vector, params).fetchall()   
    pysqlite2.dbapi2.DatabaseError: database disk image is malformed
    

    I've tried reloading the .Magnitude files as well as setting blocking=True, but can't seem to get around it. Any ideas?

    Thanks!

    opened by jacobzweig 5
  • Not possible to download the datasets

    I attempted a few times to download the datasets, but each time the download stopped after less than 100k of data was received. I don't know whether this is a temporary server issue. Do you have plans to host the files somewhere else?

    The conversion from GloVe to your format worked flawlessly, but maybe not everyone has the time and resources to perform it.

    bug downloads 
    opened by dinel 5
  • Error while trying to pip install pymagnitude

      Using cached https://files.pythonhosted.org/packages/0a/a3/b9a34d22ed8c0ed59b00ff55092129641cdfa09d82f9abdc5088051a5b0c/pymagnitude-0.1.120.tar.gz
        Complete output from command python setup.py egg_info:
        Traceback (most recent call last):
          File "<string>", line 1, in <module>
          File "/tmp/pip-install-guvdu2_b/pymagnitude/setup.py", line 178, in <module>
            'a+')
        PermissionError: [Errno 13] Permission denied: '/tmp/magnitude.install'
        
        ----------------------------------------
    Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-guvdu2_b/pymagnitude/```
    opened by shaurya27 4
  • A minor change needed in the most_similar function when used by vector

    Regenerate the output to understand the issue:

    from pymagnitude import *
    glove = Magnitude("path/to/glove.6B.300d.magnitude")
    print(glove.most_similar("cat", topn = 2)) # Most similar by key
    print(glove.most_similar(glove.query("cat"), topn = 2)) # Most similar by vector
    

    Output will be as follows:

    [('dog', 0.6816746), ('cats', 0.68158376)] [('cat', 1.0), ('dog', 0.6816746)]

    
    As one can clearly see, the function most_similar works perfectly when called by key, but it returns the same word when passed that word's vector. This should not be the case. A minor modification should be made to the code so it does not include the queried word in the output.
    wontfix 
    opened by ParikhKadam 4
  • most_similar() anomalies

    Hi, I converted bilingual FastText embeddings into a medium magnitude model and I'm getting some questionable results:

    >>> xlvecs = Magnitude("wiki.+de+en.tag.vec.magnitude")
    >>> katze = xlvecs.most_similar("[email protected]@", topn=5)
    >>> print(katze)
    [('rabbit,@en@', 0.3190704584121704), ('dogs,@en@', 0.31559139490127563), ('[email protected]@', 0.3059767484664917), ('[email protected]@', 0.30381107330322266), ('#[email protected]@', 0.29921069741249084)]
    >>> xlvecs.similarity("[email protected]@", "[email protected]@")
    0.4569693
    >>> xlvecs.similarity("[email protected]@", "[email protected]@")
    0.38769498
    >>> xlvecs.similarity("[email protected]@", "[email protected]@")
    0.42773518
    >>> xlvecs.similarity("[email protected]@", "[email protected]@")
    0.40975133

    "[email protected]@", "[email protected]@", "[email protected]@" and even actual "[email protected]@" (no spurious comma) are more similar to "[email protected]@" but instead I'm getting "rabbits,@en@". Am I misunderstanding what most_similar is supposed to do?

    I thought maybe I could try setting max_distance to just a hair above xlvecs.distance("[email protected]@", "[email protected]@) to see what would happen, but I got TypeError: most_similar() got an unexpected keyword argument 'max_distance'

    I'm on version 0.1.48

    bug most_similar documentation 
    opened by mjmartindale 4
  • Query time slower than gensim?

    Hi!

    I really hope this question doesn't come across as critical - I think this project is a great idea and really loving the speed at which it can lazy-load models.

    I had one question - loading the Google News vectors is massively quicker in magnitude than gensim, however I'm finding that querying is significantly slower. Is this to be expected? It is quite possible that this is a trade-off against loading time, but I want to confirm that there's nothing weird going on in my environment.

    Code I'm using for testing:

    import json
    import os
    import timeit
    
    
    ITERATIONS = 500
    
    # Tokens are loaded from disk.
    # tokens = ...
    tokens = json.dumps(tokens)
    
    mag = timeit.timeit(
    '''
    for token in tokens:
        try:
            getVector(token)
        except:
            pass
    ''',
        setup =
    '''
    from pymagnitude import Magnitude
    vec = Magnitude('/home/dom/Code/ner/ner/data/GoogleNews-vectors-negative300.magnitude')
    getVector = vec.query
    tokens = {}
    '''.format(tokens),
        number = ITERATIONS
    )
    
    gensim = timeit.timeit(
    '''
    for token in tokens:
        try:
            getVector(token)
        except:
            pass
    ''',
        setup = 
    '''
    from gensim.models import KeyedVectors
    vec = KeyedVectors.load('/home/dom/Code/ner/ner/data/GoogleNews-vectors-negative300.w2v', mmap='r')
    getVector = vec.__getitem__
    tokens = {}
    '''.format(tokens),
        number = ITERATIONS
    )
    
    print('Gensim is {}x faster'.format(mag / gensim))
    

    For the code above, I get gensim being approximately 5x faster if memory-mapped, and over 13x faster if not.

    question optimization 
    opened by DomHudson 4
  • CVE-2007-4559 Patch

    Patching CVE-2007-4559

    Hi, we are security researchers from the Advanced Research Center at Trellix. We have begun a campaign to patch a widespread bug named CVE-2007-4559. CVE-2007-4559 is a 15-year-old bug in the Python tarfile package. By using extract() or extractall() on a tarfile object without sanitizing input, a maliciously crafted .tar file could perform a directory path traversal attack. We found at least one unsanitized extractall() in your codebase and are providing a patch for you via pull request. The patch essentially checks to see if all tarfile members will be extracted safely and throws an exception otherwise. We encourage you to use this patch or your own solution to secure against CVE-2007-4559. Further technical information about the vulnerability can be found in this blog.

    If you have further questions, you may contact us through this project's lead researcher, Kasimir Schulz.

    opened by TrellixVulnTeam 0
  • MutableMapping import error

    Hey, :)

    Awesome package!

    However, your third-party allennlp code is very old. It imports MutableMapping from collections, while it was moved to collections.abc in Python 3.3 and importing it from collections has since been deprecated - although the reference was kept for backwards compatibility.

    The reference was completely removed in Python 3.10 and onwards, so the magnitude library actually doesn't work on the newest Python version, nor is it future-compatible.

    The fix is simple - change line 12 in pymagnitude/third_party/allennlp/common/params.py from:

    from collections import MutableMapping, OrderedDict
    

    to

    from collections import OrderedDict
    from collections.abc import MutableMapping
    

    Also, there might be some more broken imports lying around. I wouldn't know. :)

    Cheers!

    Reference stack trace of the error on Python 3.10.2:

    ---------------------------------------------------------------------------
    ImportError                               Traceback (most recent call last)
    Input In [74], in <module>
    ----> 1 from pymagnitude import Magnitude
    
    File ~/.pyenv/versions/3.10.2/envs/py3/lib/python3.10/site-packages/pymagnitude/__init__.py:80, in <module>
         78 sys.path.append(os.path.dirname(__file__) + '/third_party/')
         79 sys.path.append(os.path.dirname(__file__) + '/third_party_mock/')
    ---> 80 from pymagnitude.third_party.allennlp.commands.elmo import ElmoEmbedder
         82 # Import SQLite
         83 try:
    
    File ~/.pyenv/versions/3.10.2/envs/py3/lib/python3.10/site-packages/pymagnitude/third_party/allennlp/commands/__init__.py:8, in <module>
          4 import argparse
          5 import logging
    ----> 8 from allennlp.commands.configure import Configure
          9 from allennlp.commands.elmo import Elmo
         10 from allennlp.commands.evaluate import Evaluate
    
    File ~/.pyenv/versions/3.10.2/envs/py3/lib/python3.10/site-packages/pymagnitude/third_party/allennlp/commands/__init__.py:8, in <module>
          4 import argparse
          5 import logging
    ----> 8 from allennlp.commands.configure import Configure
          9 from allennlp.commands.elmo import Elmo
         10 from allennlp.commands.evaluate import Evaluate
    
    File ~/.pyenv/versions/3.10.2/envs/py3/lib/python3.10/site-packages/pymagnitude/third_party/allennlp/commands/configure.py:25, in <module>
         22 import argparse
         24 from allennlp.commands.subcommand import Subcommand
    ---> 25 from allennlp.common.configuration import configure, Config, render_config
         27 class Configure(Subcommand):
         28     def add_subparser(self, name     , parser                            )                           :
         29         # pylint: disable=protected-access
    
    File ~/.pyenv/versions/3.10.2/envs/py3/lib/python3.10/site-packages/pymagnitude/third_party/allennlp/common/__init__.py:3, in <module>
          2 from __future__ import absolute_import
    ----> 3 from allennlp.common.params import Params
          4 from allennlp.common.registrable import Registrable
          5 from allennlp.common.tee_logger import TeeLogger
    
    File ~/.pyenv/versions/3.10.2/envs/py3/lib/python3.10/site-packages/pymagnitude/third_party/allennlp/common/params.py:12, in <module>
         10 from __future__ import absolute_import
         11 #typing
    ---> 12 from collections import MutableMapping, OrderedDict
         13 import copy
         14 import json
    
    ImportError: cannot import name 'MutableMapping' from 'collections' (/Users/myuser/.pyenv/versions/3.10.2/lib/python3.10/collections/__init__.py)
    
    opened by shaypal5 0
  • ELMO_heavy

    Downloaded this ELMO-model: "elmo_2x4096_512_2048cnn_2xhighway_5.5B_weights_GoogleNews_vocab.magnitude"

    But when I try to load it with Magnitude(ELMO) (where ELMO is the path to the downloaded .magnitude file), I get the following ERROR:

    File "/path/to/venv/lib/python3.8/site-packages/pymagnitude/init.py", line 359, in init self.length = self._db().execute( IndexError: list index out of range

    opened by LeonHammerla 0
  • Querying taking a lot of time (18 sec to 3 min) intermittently

    I am using pymagnitude in one of my projects to load and use GoogleNews-vectors-negative300.bin.

    I have converted GoogleNews-vectors-negative300.bin to a .magnitude file and load the .magnitude file using Magnitude(). I use pymagnitude to generate embeddings of words and then train an ANN model on those embeddings.

    On my local machine (with the details below), I face no issue.

    Environments:

    (local): Mac, 32 GB RAM, Docker with CentOS ---- very fast, less than a fraction of a second

    (Testing Environment): CentOS, 16 GB RAM --- intermittent slowness, taking 18 sec to 3 min for querying some words, and the process times out.

    ** I am using a mount to keep my mmap files, and have ensured that it is not getting wiped out.

    Here are the findings for a few words on the Testing Environment versus local:

    Word, time on Testing Environment:
    - li��n: 0.82 min
    - ph���m: 0.4 min
    - al: 1.3 min

    The time on local for the above keys is very low, even less than a second.

    On further investigation and profiling of execution time, we observed that more time is taken when an OOV token is found and the _db_query_similar_keys_vector function is invoked.

    Sample queries which are taking more time:

    SELECT magnitude.* FROM magnitude_subword, magnitude WHERE char_ngrams MATCH "\uf000al" OR "al" OR "l" OR "\uf000" AND magnitude.rowid = magnitude_subword.rowid ORDER BY ( ( LENGTH(offsets(magnitude_subword)) - LENGTH( REPLACE(offsets(magnitude_subword), ' ', '') ) ) + 1 ) DESC, magnitude.key LIKE 'a%' AND LENGTH(magnitude.key) <= 4 DESC, magnitude.key LIKE '%';

    -- Took 3.8 min to execute

    SELECT magnitude.* FROM magnitude_subword, magnitude WHERE char_ngrams MATCH "\uf000ch" OR "ch" OR "h" OR "n" OR "ng" OR "ng\uf000" AND magnitude.rowid = magnitude_subword.rowid ORDER BY ( ( LENGTH(offsets(magnitude_subword)) - LENGTH( REPLACE(offsets(magnitude_subword), ' ', '') ) ) + 1 ) DESC, magnitude.key LIKE 'a%' AND LENGTH(magnitude.key) <= 4 DESC, magnitude.key LIKE '%'; -- Took 2 min to execute

    opened by shubhamjoshi2130 0
  • Can't install pymagnitude because of the wheel

    Using pip3 (or pip) install pymagnitude and getting the error:

     Using cached pymagnitude-0.1.100.tar.gz (5.4 MB)
    Building wheels for collected packages: pymagnitude
      Building wheel for pymagnitude (setup.py) ... error
      ERROR: Command errored out with exit status 1:
                             ...
      ERROR: Failed building wheel for pymagnitude
      Running setup.py clean for pymagnitude
    Failed to build pymagnitude
    Installing collected packages: pymagnitude
        Running setup.py install for pymagnitude ... done
    Successfully installed pymagnitude
    
    

    Didn't find any solutions anywhere, so I hope you can help me...

    opened by necrofan 1
  • Spacy3.x not supported (tag_map no longer exists)

    Hi -- I'm trying to make use of PyMagnitude in a Docker image that has Spacy 3.0.6 installed. When trying to load the downloaded vectors I receive the following error:

    ---------------------------------------------------------------------------
    ImportError                               Traceback (most recent call last)
    <ipython-input-37-54185549ba8b> in <module>
          1 import spacy
    ----> 2 from spacy.lang.en import tag_map
    
    ImportError: cannot import name 'tag_map' from 'spacy.lang.en' (/opt/conda/envs/nlp/lib/python3.8/site-packages/spacy/lang/en/__init__.py)
    

    A bit of Googling turned up:

    The TAG_MAP and MORPH_RULES in the language data have been replaced by the more flexible AttributeRuler
    

    See further info on Backwards Incompatibilities.

    opened by jreades 1