Colibri Core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipgrams (i.e. patterns with one or more gaps, of either fixed or dynamic size) in a quick and memory-efficient way. At its core is the tool ``colibri-patternmodeller``, which allows you to build, view, manipulate and query pattern models.

Overview

Colibri Core

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

by Maarten van Gompel, [email protected], Radboud University Nijmegen

Licensed under GPLv3 (See http://www.gnu.org/licenses/gpl-3.0.html)

Colibri Core is software to quickly and efficiently count and extract patterns from large corpus data, to extract various statistics on those patterns, and to compute relations between them. The employed notion of pattern or construction encompasses the following categories:

  • n-gram -- n consecutive words
  • skipgram -- An abstract pattern of predetermined length with one or more gaps, each of a fixed size.
  • flexgram -- An abstract pattern with one or more gaps of variable size.
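
For illustration, take the sentence "to be or not to be". The following would then be patterns of each category (the {*} gap notation is the one that appears in Colibri Core's own output further down; {**} for variable-size flexgram gaps is an assumption here):

    to be or not        n-gram (n=4)
    to be {*} not       skipgram: one gap of fixed size 1 (matches "to be or not")
    to {**} not         flexgram: one gap of variable size (also matches "to be or not")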

N-gram extraction may seem fairly trivial at first: with a few lines in your favourite scripting language, you can move a simple sliding window of size n over your corpus and store the results in some kind of hashmap (a sketch of this naive approach follows the list below). However, this trivial approach makes an unnecessarily high demand on memory resources, which often becomes prohibitive when unleashed on large corpora. Colibri Core tries to minimise these space requirements in several ways:

  • Compressed binary representation -- Each word type is assigned a numeric class, which is encoded in a compact binary format in which highly frequent classes take less space than less frequent classes. Colibri core always uses this representation rather than a full string representation, both on disk and in memory.
  • Informed iterative counting -- Counting is performed more intelligently by iteratively processing the corpus in several passes and quickly discarding patterns that won't reach the desired occurrence threshold.
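
For contrast, the trivial sliding-window approach dismissed above looks roughly like this (a minimal illustrative sketch in plain Python, not Colibri Core's implementation):

    from collections import Counter

    def count_ngrams(lines, n):
        """Naive n-gram counting: every n-gram is kept as a full string
        key, so memory grows with the number of distinct n-grams times
        their string length -- the cost Colibri Core's compressed binary
        representation avoids."""
        counts = Counter()
        for line in lines:
            tokens = line.split()
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
        return counts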

Skipgram and flexgram extraction are computationally more demanding but have been implemented with similar optimisations. Skipgrams are computed by abstracting over n-grams, and flexgrams in turn are computed either by abstracting over skipgrams, or directly from n-grams on the basis of co-occurrence information (pointwise mutual information).
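
As a reminder, pointwise mutual information compares the joint probability of two events with what it would be under independence (a generic formula, not Colibri Core's exact implementation):

    import math

    def pmi(p_xy, p_x, p_y):
        """Pointwise mutual information: log( p(x,y) / (p(x) * p(y)) )."""
        return math.log(p_xy / (p_x * p_y))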

At the heart of the software is the notion of pattern models. The core tool, to be used from the command line, is ``colibri-patternmodeller``, which enables you to build pattern models, generate statistical reports, query for specific patterns and relations, and manipulate models.

A pattern model is simply a collection of extracted patterns (any of the three categories) and their counts from a specific corpus. Pattern models come in two varieties:

  • Unindexed Pattern Model -- The simplest form, which simply stores the patterns and their counts.
  • Indexed Pattern Model -- The more informed form, which retains all indices into the original corpus, at the cost of more memory/disk space.

The Indexed Pattern Model is much more powerful, and allows more statistics and relations to be inferred.
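
For example, using the command-line flags that also appear in the transcripts further down (the colibri-classencode companion tool for class encoding is an assumption here; check the documentation for the exact name):

    # class-encode a plain-text corpus, producing corpus.colibri.dat
    # and the class file corpus.colibri.cls
    $ colibri-classencode corpus.txt

    # indexed model: occurrence threshold 2, max pattern length 8
    $ colibri-patternmodeller -c corpus.colibri.cls -f corpus.colibri.dat -t 2 -l 8 -o corpus.colibri.indexedpatternmodel

    # unindexed model (-u), printing all patterns and counts (-P)
    $ colibri-patternmodeller -c corpus.colibri.cls -f corpus.colibri.dat -t 2 -l 8 -u -P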

The generation of pattern models is optionally parametrised by a minimum occurrence threshold, a maximum pattern length, and a lower bound on the number of distinct types that may instantiate a skipgram (i.e. possible fillings of its gaps).
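
In the Python binding these parameters map onto PatternModelOptions keywords; a minimal sketch (mintokens, maxlength and doskipgrams appear in the examples quoted below; minskiptypes as the name of the skipgram lower bound is an assumption):

    import colibricore

    options = colibricore.PatternModelOptions(
        mintokens=2,       # minimum occurrence threshold
        maxlength=8,       # maximum pattern length in tokens
        doskipgrams=True,  # also extract skipgrams
        minskiptypes=2,    # assumed keyword: lower bound on the distinct
    )                      # types that may fill a skipgram's gaps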

Technical Details

Colibri Core is available as a collection of standalone command-line tools, as a C++ library, and as a Python library.
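
A minimal end-to-end sketch of the Python binding, mirroring the calls used in the tutorial excerpt quoted in the issues below (filenames are illustrative; occurrencecount as the lookup method is an assumption):

    import colibricore

    plaintext = "corpus.txt"              # one sentence per line
    classfile = "corpus.colibri.cls"
    corpusfile = "corpus.colibri.dat"

    # build a class encoding from the plain-text corpus and encode it
    classencoder = colibricore.ClassEncoder()
    classencoder.build(plaintext)
    classencoder.save(classfile)
    classencoder.encodefile(plaintext, corpusfile)

    # train an indexed pattern model on the encoded corpus
    options = colibricore.PatternModelOptions(mintokens=2, maxlength=8)
    corpus = colibricore.IndexedCorpus(corpusfile)
    model = colibricore.IndexedPatternModel(reverseindex=corpus)
    model.train(corpusfile, options)

    # look up the count of a specific n-gram
    classdecoder = colibricore.ClassDecoder(classfile)
    ngram = classencoder.buildpattern("to be")
    print(ngram.tostring(classdecoder), model.occurrencecount(ngram))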

Please consult the full documentation at https://proycon.github.io/colibri-core

Installation instructions are here: https://proycon.github.io/colibri-core/doc/#installation

Publication

This software is extensively described in the following peer-reviewed publication:

van Gompel, M and van den Bosch, A (2016) Efficient n-gram, Skipgram and Flexgram Modelling with Colibri Core. Journal of Open Research Software 4: e30, DOI: http://dx.doi.org/10.5334/jors.105

Click the link to access the publication and please cite it if you make use of Colibri Core in your work.

Comments
  • Unable to load large corpora into memory because PatternPointer length can't exceed 2^32 bytes (32 bit size descriptor)

    Whilst fine in most situations, this doesn't work for IndexedCorpus, which loads an entire corpus into one PatternPointer. This prevents loading very large corpora (continuation of #41):

    Loading corpus data...
    Loaded 307725534 sentences; corpussize (bytes) = 9157735203
    ERROR: Pattern too long for pattern pointer [9157735203 bytes,explicit]
    terminate called after throwing an instance of 'InternalError'
      what():  Colibri internal error
    

    Simply setting the size descriptor to a 64-bit integer would waste too much memory in most other situations, so that isn't an option either. I think we need a more flexible solution through templating.

    enhancement in progress 
    opened by proycon 5
  • Missing data in indexed model on large data set; yields much lower counts than unindexed model on the same data with the same parameters!

    As reported by Pavel Vondřička, something fishy is going on in the computation of an indexed model on a large dataset (8.5GB compressed):

    Indexed:

    $ colibri-patternmodeller -l 1 -t 1 -f gigacorpus.colibri.dat                                                        
    Loading corpus data...
    Training model on  gigacorpus.colibri.dat
    Training patternmodel, occurrence threshold: 1
    Counting *all* n-grams (occurrence threshold=1)
     Found 2562104 ngrams... computing total word types prior to pruning...2562104...pruned 0...total kept: 2562104
    Sorting all indices...
    

    Unindexed (these counts are correct):

    $ colibri-patternmodeller -u -l 1 -t 1 -f gigacorpus.colibri.dat
    Training unindexed model on  gigacorpus.colibri.dat
    Training patternmodel, occurrence threshold: 1
    Counting *all* n-grams (occurrence threshold=1)
     Found 11459477 ngrams... computing total word types prior to pruning...11459477...pruned 0...total kept: 11459477
    

    The encoded corpus file has been verified to be fine (i.e. it decodes properly):

    yes, I tried decoding the corpus back and it had a different size, but there was the whole contents - it seems that just some (white)spaces got lost, which is understandable. Anyway, the corpus wasn’t clipped.

    I did some tests and the problem does NOT reproduce on a small text (counts are equal there as expected), which also explains why it isn't caught by our automated tests. So the cause is not yet clear and further debugging is needed.

    bug PRIORITY investigate 
    opened by proycon 4
  • pip failed building wheel for colibricore Mac OSX 10.11.2

    I brew-installed the dependencies, but got the "colibricore_wrapper.cpp:258:10: fatal error: 'unordered_map' file not found" error below after trying to pip install colibricore.

    coco)~/colibri-core - [master●] » pip install colibricore
    Collecting colibricore
      Using cached colibricore-2.1.2.tar.gz
    Requirement already satisfied (use --upgrade to upgrade): Cython>=0.23 in /Users/me/anaconda/envs/coco/lib/python3.4/site-packages (from colibricore)
    Building wheels for collected packages: colibricore
      Running setup.py bdist_wheel for colibricore
      Complete output from command /Users/me/anaconda/envs/coco/bin/python3 -c "import setuptools;__file__='/private/var/folders/6n/__f45xnx36q9r_fy3jg68tz8tn99rh/T/pip-build-hat3uagr/colibricore/setup.py';exec(compile(open(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" bdist_wheel -d /var/folders/6n/__f45xnx36q9r_fy3jg68tz8tn99rh/T/tmps16svvf_pip-wheel-:
      running bdist_wheel
      running build
      running build_ext
      cythoning colibricore_wrapper.pyx to colibricore_wrapper.cpp
      building 'colibricore' extension
      creating build
      creating build/temp.macosx-10.5-x86_64-3.4
      gcc -fno-strict-aliasing -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/me/anaconda/envs/coco/include -arch x86_64 -I/usr/local/include/colibri-core -I/usr/include/colibri-core -I/usr/include/libxml2 -I/Users/me/anaconda/envs/coco/include/python3.4m -c colibricore_wrapper.cpp -o build/temp.macosx-10.5-x86_64-3.4/colibricore_wrapper.o --std=c++0x
      colibricore_wrapper.cpp:258:10: fatal error: 'unordered_map' file not found
      #include <unordered_map>
               ^
      1 error generated.
      (Writing /private/var/folders/6n/__f45xnx36q9r_fy3jg68tz8tn99rh/T/pip-build-hat3uagr/colibricore/colibricore_wrapper.pyx)
      /Users/me/anaconda/envs/coco/lib/python3.4/distutils/extension.py:132: UserWarning: Unknown Extension options: 'pyrex_gdb'
        warnings.warn(msg)
      warning: colibricore_wrapper.pyx:1003:12: Unreachable code
      warning: colibricore_wrapper.pyx:1247:8: Unreachable code
      warning: colibricore_wrapper.pyx:2050:8: Unreachable code
      warning: colibricore_wrapper.pyx:2951:8: Unreachable code
      warning: colibricore_wrapper.pyx:3425:8: Unreachable code
      error: command 'gcc' failed with exit status 1
    
      ----------------------------------------
    Failed building wheel for colibricore
    Failed to build colibricore
    Installing collected packages: colibricore
    
    (coco)~/colibri-core - [master●] » brew --config
    HOMEBREW_VERSION: 0.9.5
    ORIGIN: https://github.com/Homebrew/homebrew
    HEAD: 2ae9b385ff174db4e1ac713f47a88c0e7034c516
    Last commit: 15 minutes ago
    HOMEBREW_PREFIX: /usr/local
    HOMEBREW_REPOSITORY: /usr/local
    HOMEBREW_CELLAR: /usr/local/Cellar
    HOMEBREW_BOTTLE_DOMAIN: https://homebrew.bintray.com
    CPU: 8-core 64-bit haswell
    OS X: 10.11.2-x86_64
    Xcode: 7.2
    CLT: 7.2.0.0.1.1447826929
    Clang: 7.0 build 700
    X11: N/A
    System Ruby: 2.0.0-p645
    Perl: /usr/bin/perl
    Python: /Users/me/anaconda/envs/coco/bin/python => /Users/me/anaconda/envs/coco/bin/python3.4
    Ruby: /usr/bin/ruby => /System/Library/Frameworks/Ruby.framework/Versions/2.0/usr/bin/ruby
    Java: 1.8.0_66
    
    help wanted 
    opened by mkrump 4
  • Wrong threshold in model.filter

    Hello! In this command:

        options = colibricore.PatternModelOptions(mintokens=50, maxlength=6, doskipgrams=True)

    I set mintokens=50. But then I tried to extract skipgrams with:

        self.model.filter(0, colibricore.Category.SKIPGRAM)

    The results look as if the threshold were 100 (I don't see any skipgram with an occurrence count below 100). Is this a bug or am I doing something wrong?

    question expired 
    opened by svetlana21 3
  • Discrepancy between totaloccurrencesingroup and patterns in getreverseindex

    I'm training a 4-gram skipgram model with

    MINTOKENS = MINTOKENS_SKIPGRAMS = 2
    MINTOKENS_UNIGRAMS = 3
    MINLENGTH = 3
    MAXLENGTH = 4
    DOREVERSEINDEX = true
    DOSKIPGRAMS_EXHAUSTIVE = true
    

    with these numbers reported for the pattern model:

                                     PATTERNS         TOKENS       COVERAGE          TYPES
    Total:                                  -     1537297768              -        2425337
    Uncovered:                              -              0         0.0000        1718067
    Covered:                        273998512     1537297768         1.0000         707270
    
           CATEGORY      N (SIZE)        PATTERNS          TYPES    OCCURRENCES
                all            all      273998512         707270     3593418773
                all              2       16652489         707269      712369750
                all              3       75300479         582876     1205518415
                all              4      182045544         495923     1675530608
             n-gram            all      136902720         707269     1658562277
             n-gram              2       16652489         707269      712369750
             n-gram              3       52408087         580582      571966995
             n-gram              4       67842144         495586      374225532
           skipgram            all      137095792         553853     1934856496
           skipgram              3       22892392         553853      633551420
           skipgram              4      114203400         495923     1301305076
    

    trainPatternModel.totaloccurrencesingroup(0,4) reports 1675530608 patterns of length 4, whereas I get between 1904680000 and 1904700000 patterns (the exact number is not reported by my code) with

    for(IndexedCorpus::iterator iter = indexedCorpus->begin(); iter != indexedCorpus->end(); ++iter)
            {
                for(PatternPointer patternp : trainPatternModel.getreverseindex(iter.index(), 0, 0, 4))
                { ...
    

    This is a difference of 13.7%.

    So what is the right way to get the number of patterns, after pruning and thresholding, regardless of the pattern type?

    investigate 
    opened by naiaden 3
  • Non-functioning constraints in .getrightneighbours(), .getcooc() etc.

    I wanted to get only n-grams of a specific size following some other n-gram. However, I found that the output did not adhere to the given constraints, at least not as I expected. I've consulted the documentation to figure out whether I simply misunderstood something; if so, please enlighten me. :-)

    Here is a working example (most of it taken from the tutorial notebook) which shows it:

    import colibricore
    from urllib.request import urlopen
    
    
    TMPDIR = '/tmp/'
    corpusfile_plato_plaintext = TMPDIR + "republic.txt"
    classfile_plato = TMPDIR + "republic.colibri.cls"
    corpusfile_plato = TMPDIR + "republic.colibri.dat"
    
    f = urlopen('http://lst.science.ru.nl/~proycon/republic.txt')
    with open(corpusfile_plato_plaintext,'wb') as of:
        of.write(f.read())
    print("Downloaded to " + corpusfile_plato_plaintext)
    
    # make encoder, encode corpus and make decoder
    classencoder = colibricore.ClassEncoder(classfile_plato)
    classencoder.build(corpusfile_plato_plaintext)
    classencoder.save(classfile_plato)
    classencoder.encodefile(corpusfile_plato_plaintext, corpusfile_plato)
    classdecoder = colibricore.ClassDecoder(classfile_plato)
    
    # set options and train model
    options = colibricore.PatternModelOptions(mintokens=2, maxlength=8,
                                              doskipgrams=True)
    corpus_plato = colibricore.IndexedCorpus(corpusfile_plato)
    model = colibricore.IndexedPatternModel(reverseindex=corpus_plato)
    model.train(corpusfile_plato, options)
    
    # make ngram and get its neighbours under different constraints
    ngram = classencoder.buildpattern("the law")
    no_constraint = {(pattern, count)
                     for pattern, count in model.getrightneighbours(ngram, 1)}
    
    only_bigrams = {(pattern, count)
                     for pattern, count in model.getrightneighbours(ngram, 1, size=2)}
    
    # we'd expect nothing besides bigrams, but ...
    for pattern, count in only_bigrams:
        if not pattern.isskipgram() and len(pattern) != 2:
            print('Found a non-bigram where I should not!: ',
                  pattern.tostring(classdecoder))
            break
    
    only_ngrams = {(pattern, count)
                   for pattern, count in model.getrightneighbours(
            ngram, 1, category=colibricore.Category.NGRAM
        )}
    # we'd expect no skipgrams, but ...
    for pattern, count in only_ngrams:
        if pattern.isskipgram():
            print('Found a skipgram where I should not!',
                  pattern.tostring(classdecoder))
            break
    
    

    Output:

    Found a non-bigram where I should not!:  ; at the same time
    Found a skipgram where I should not! ; {*} their
    

    Similar things happen for cooc methods and left neighbours.

    bug investigate 
    opened by KasperFyhn 2
  • Error with Tibetan Unicode

    I'm working on a Tibetan language corpus and I get the following error message with the patternmodeller:

    Loading pattern model legya.colibri.dat as model...
    File is not a colibri model file (or a very old one)
    terminate called after throwing an instance of 'InternalError'
      what():  Colibri internal error

    The command was:

    colibri-patternmodeller -i legya.colibri.dat -t 10 -l 20 -T 3 -o legya.colibri.indexedpatternmodel
    

    classdecode spat out the Unicode without complaining; the same goes for the colibri-ngrams script...

    Here's the file: legya.txt

    opened by ngawangtrinley 2
  • Can't compile on CentOS 6.6

    I'm getting the following errors (gcc 4.4.7):

    # pip3 install colibricore
    [...]
        Bootstrapping colibri-core
        Autoconf archive found in /usr/share/aclocal/, good
        configure.ac:36: warning: AC_LANG_CONFTEST: no AC_LANG_SOURCE call detected in body
        ../../lib/autoconf/lang.m4:193: AC_LANG_CONFTEST is expanded from...
        ../../lib/autoconf/general.m4:2661: _AC_LINK_IFELSE is expanded from...
        ../../lib/autoconf/general.m4:2678: AC_LINK_IFELSE is expanded from...
        /usr/share/aclocal/libtool.m4:1022: _LT_SYS_MODULE_PATH_AIX is expanded from...
        /usr/share/aclocal/libtool.m4:4161: _LT_LINKER_SHLIBS is expanded from...
        /usr/share/aclocal/libtool.m4:5236: _LT_LANG_C_CONFIG is expanded from...
        /usr/share/aclocal/libtool.m4:138: _LT_SETUP is expanded from...
        /usr/share/aclocal/libtool.m4:67: LT_INIT is expanded from...
        configure.ac:36: the top level
    [...]
        libtool: compile:  g++ -DHAVE_CONFIG_H -I. -I.. -I../include -Wall -O3 -g -O2 -std=gnu++0x -MT pattern.lo -MD -MP -MF .deps/pattern.Tpo -c pattern.cpp  -fPIC -DPIC -o .libs/pattern.o
        In file included from ../include/patternstore.h:19,
                         from pattern.cpp:2:
        ../include/datatypes.h: In member function `std::string IndexReference::tostring() const':
        ../include/datatypes.h:73: error: call of overloaded `to_string(uint32_t)' is ambiguous
        /usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2604: note: candidates are: std::string std::to_string(long long int)
        /usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2610: note:                 std::string std::to_string(long long unsigned int)
        /usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2616: note:                 std::string std::to_string(long double)
        ../include/datatypes.h:73: error: call of overloaded `to_string(unsigned int)' is ambiguous
        /usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2604: note: candidates are: std::string std::to_string(long long int)
        /usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2610: note:                 std::string std::to_string(long long unsigned int)
        /usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2616: note:                 std::string std::to_string(long double)
        ../include/datatypes.h: In member function `void IndexedData::shrink_to_fit()':
        ../include/datatypes.h:151: error: `class std::vector<IndexReference, std::allocator<IndexReference> >' has no member named `shrink_to_fit'
        In file included from pattern.cpp:2:
        ../include/patternstore.h: In member function `void PatternSet<ReadWriteSizeType>::reserve(size_t)':
        ../include/patternstore.h:704: error: `class t_patternset' has no member named `reserve'
        pattern.cpp: In member function `const bool PatternPointer::unknown() const':
        pattern.cpp:408: warning: comparison between signed and unsigned integer expressions
        pattern.cpp: In constructor `Pattern::Pattern(std::istream*, bool, unsigned char, const unsigned char*, bool)':
        pattern.cpp:528: warning: comparison between signed and unsigned integer expressions
        make[2]: *** [pattern.lo] Error 1
        make[2]: Leaving directory `/home/avcrane1/src/colibri-core/tmp/pip-build-1al7wmwo/colibricore/src'
        make[1]: *** [all-recursive] Error 1
        make[1]: Leaving directory `/home/avcrane1/src/colibri-core/tmp/pip-build-1al7wmwo/colibricore'
        make: *** [all] Error 2
        Make of colibri-core failed
    
    wontfix waiting expired 
    opened by andreasvc 2
  • skipgram training adds strange ngrams that do not exist

    $ colibri-patternmodeller -c input.colibri.cls -f input.colibri.dat -o input.colibri.patternmodel -t 1 -l 4 -m 4 -u -P | cut -f1 > ngrams
    $ colibri-patternmodeller -c input.colibri.cls -f input.colibri.dat -o input.colibri.patternmodel -t 1 -l 4 -m 4 -u -s  -P | cut -f1 > ngramsskipgrams 
    $ cat ngramsskipgrams | grep -v "{*}" > ngramsskipgrams.filtered
    $ wc -l ngrams*
        3001 ngrams
        11339 ngramsskipgrams
        5070 ngramsskipgrams.filtered
    

    Example upon inspection of data:

    existing good ngram: 10 December 2007 imposing
    additional bad ngram: 10 December Other imposing

    bug 
    opened by proycon 2
  • Problems compiling with anaconda

    I had two minor issues while building from source:

    1. First the installation aborted with the following error:
    libtool: Version mismatch error.  This is libtool 2.4.6, but the
    libtool: definition of this LT_INIT comes from libtool 2.4.6.42-b88ce.
    libtool: You should recreate aclocal.m4 with macros from libtool 2.4.6
    libtool: and run autoconf again.
    make[2]: *** [Makefile:798: SpookyV2.lo] Error 63
    make[2]: Leaving directory '/home/marco/PycharmProjects/colibri-core/src'
    make[1]: *** [Makefile:466: all-recursive] Error 1
    make[1]: Leaving directory '/home/marco/PycharmProjects/colibri-core'
    make: *** [Makefile:375: all] Error 2
    Make of colibri-core failed
    

    I solved this error, as suggested, by recreating aclocal.m4 using autoreconf --force --install

    2. Afterwards, compilation aborted again with the following error:
    /home/marco/anaconda3/envs/MedInf/compiler_compat/ld: build/temp.linux-x86_64-3.7/colibricore_wrapper.o: unable to initialize decompress status for section .debug_info
    build/temp.linux-x86_64-3.7/colibricore_wrapper.o: file not recognized: file format not recognized
    collect2: error: ld returned 1 exit status
    

    I solved this problem with a somewhat odd workaround: giving conda's ld another name so that the system-wide ld was used.

    I'm not sure whether you are in a position to solve these problems, but I'll leave this here; maybe it saves others some time.

    opened by redadmiral 1
  • added & to be sure. But refactoring would be better (no loop needed)

    I suggest using, on line 49:

        IN->read((char*) buffer, length);

    instead of the loop:

        for (int i = 0; i < length; i++) {
            IN->read((char*) &buffer[i], sizeof(unsigned char));
        }

    opened by kosloot 1
  • [Queries] Ability to create a model and cls from multiple input files

    Hi,

    To begin with, thank you for the amazing work you've done so far. I have a few questions regarding my usage of colibri-core in my project.

    What I am trying to build is a model that learns recurring patterns from a set of input text files. These are log files of a collection of software components.

    Each line in my log file is converted to a unique hash representing that line, and the input to the training is a single line whose words are the hashes, word count is equal to the line count of the actual log file. This is done to generate patterns across lines and not words.

    The model is then used to analyse whether patterns in a given test file match the training data, to detect any anomalies or unknown patterns. I am using your library for its ability to create variable-length n-grams, skipgrams and flexgrams. The questions that I have are as follows:

    1. How do I create a unified model and class file that contains patterns learnt from multiple input files?
    2. Do I save the class file and model after every instance of a model trained from an input file, or can I train from multiple input files and then finally call .save/.write?
    3. Is there a way to perform this training on multiple cores, while saving the information to a single model? Multithreading?
    4. Alternatively, is it possible to create multiple temporary models through a batch operation and then somehow merge them into a single model file and .cls file?
    5. Also, I sometimes see random crashes while parsing a file. Re-running the training on the same file sometimes results in a crash at the same point and sometimes doesn't, which is weird. I'll try to get backtraces for those crashes whenever I reproduce the issue again.

    I am willing to contribute any changes made in regard to the above requirements if you could guide me. I have also attached the relevant code that shows my usage of the library.

    train_program.py.zip

    opened by manrock007 0
  • Class encoding fails if input only contains one line without new line?

    Discovered by @fkunneman; the output file was only 2 bytes (the initial null byte and version marker).

    Input text was just: prachtig apparaat en droogt goed kreukelvrij fijn de verlichting binnenin voelt heel robuust en ziet er ook erg leuk uit

    Also verify that this doesn't imply we lose the last sentence on larger encodings (I can't imagine it does, as the tests probably cover this, but better to check).

    bug investigate 
    opened by proycon 0
  • Investigate improved scalability using out-of-memory data structures

    The following library could be plugged into our current framework:

    STXXL implements containers and algorithms that can process huge volumes of data that only fit on disk: http://stxxl.sourceforge.net/

    enhancement low priority investigate 
    opened by proycon 1
  • Load corpora with mmap

    Would it be possible to load corpora with mmap? This would make it possible to work with corpora larger than the available RAM, and it is much more efficient if only a small part of a file is going to be used anyway.

    enhancement question low priority 
    opened by andreasvc 1
Releases (v2.5.6)
  • v2.5.6(Jul 22, 2022)

    [Maarten van Gompel]

    • codemeta.json: updating according to (proposed) CLARIAH requirements (CLARIAH/clariah-plus#38)
    • Dockerfile: added

    [Ko van der Sloot]

    • Code cleanup
      • added some exceptions for unwanted cases detected by scan-build
      • commented out the DOFLEXFROMCOOC and cached_DOFLEXFROMCOOC variables; they seem unused
      • removed unused assignments
  • v2.5.5(Apr 16, 2020)

  • v2.5.4(Apr 10, 2020)

    Implemented the ability to prune subsumed n-grams (retaining only the longer, non-subsumed versions). This introduces a new PRUNESUBSUMED variable for PatternModelOptions. Note: this is an aggressive form of pruning that should also work for unordered models; matching is based on types rather than individual tokens (all subsumed types are pruned).

  • v2.5.3(Apr 9, 2020)

  • v2.5.2(Feb 20, 2020)

    Bugfix release: Pattern size and category constraints were not working for several methods (getcooc/getleftcooc/getrightcooc/getleftneighbours/getrightneighbours) #44

  • v2.5.1(Sep 9, 2019)

  • v2.5.0(Dec 7, 2018)

    Better handling of large patterns: the PatternPointer size descriptor is now 64 bits (fixes #42), at the cost of a small increase in memory consumption in various computations.

    (The experimental and relatively unused PatternPointerModels are not backwards compatible, contact me if this is a problem)

  • v2.4.10(Dec 5, 2018)

    Important bugfix release:

    • Fixes data-clipping bug on loading large corpora in memory (used by indexed patternmodels) #41

    (All users are urged to upgrade!)

  • v2.4.9(May 23, 2018)

  • v2.4.8(Mar 1, 2018)

  • v2.4.6(Sep 7, 2017)

  • v2.4.5(Feb 21, 2017)

  • v2.4.4(Dec 2, 2016)

    • Bugfix: fixes covered token count per category/n (issue #26)
    • New feature: colibri-patternmodeller has a --simplereport (-r) option that generates a report without coverage information (more limited but a lot faster)
  • v2.4.3(Aug 19, 2016)

  • v2.4.2(Aug 19, 2016)

  • v2.4.1(Jun 15, 2016)

  • v2.4.0(Jun 2, 2016)

    Various fixes:

    • Speed up in ngrams() computation (issue #21)
    • Performance fix for processing long lines
    • Pattern.instanceof() should be faster and is now available from Python too
    • Attempt to fix compilation issue on certain platforms (issue #22), unconfirmed

    New features:

    • Implemented a new filtering mechanism that supports actively checking whether patterns are instances of a limited set of specified skipgrams, or a superset of specified n-grams.
    • Implemented an ignorenewlines option in class encoding. Useful if your source text is split into units such as sentences (one per line), but you want a model that crosses sentence boundaries.
    • Implemented vocabulary import for the class encoding stage (issue #2)
  • v2.3.0(Feb 11, 2016)

  • v2.2.0(Feb 10, 2016)

  • v2.1.2(Dec 14, 2015)

  • v2.1.1(Dec 7, 2015)

  • v2.1.0(Dec 4, 2015)

    • Implemented more efficient algorithms for the search and extraction of pre-specified skipgrams and flexgrams (issue #9)
    • Added colibri-findpatterns script (issue #9)
    • Documentation and Python tutorial updated with a section on finding pre-specified patterns
    • Better flexgram support
    • The patternmodeller tool now has long options for everything, to avoid confusion
    • Fixed getskipcontent (issue #10)
    • Minor fixes and improvements for Cython/Python
  • v2.0.4(Nov 27, 2015)

  • v2.0.3(Nov 25, 2015)

  • v2.0.0(Nov 8, 2015)

    Version 2.0 release of Colibri Core.

    Main changes:

    • better class encoding (stronger compression, less memory)
    • internal use of pattern pointers during training (quicker, less memory)
    • pattern pointer models
    • fixes in skipgram computation
    • more extensive test suite

    Data format changed from v1, but old formats can still be read by v2.

  • v1.0.1(Sep 18, 2015)
