Fast topic modeling platform

Overview

BigARTM Logo

The state-of-the-art platform for topic modeling.

Build Status Windows Build Status GitHub license DOI

What is BigARTM?

BigARTM is a powerful tool for topic modeling based on a novel technique called Additive Regularization of Topic Models. This technique effectively builds multi-objective models by adding the weighted sums of regularizers to the optimization criterion. BigARTM is known to combine well very different objectives, including sparsing, smoothing, topics decorrelation and many others. Such combination of regularizers significantly improves several quality measures at once almost without any loss of the perplexity.

References

Related Software Packages

  • TopicNet is a high-level interface for BigARTM which is helpful for rapid solution prototyping and for exploring the topics of finished ARTM models.
  • David Blei's List of Open Source topic modeling software
  • MALLET: Java-based toolkit for language processing with topic modeling package
  • Gensim: Python topic modeling library
  • Vowpal Wabbit has an implementation of Online-LDA algorithm

Installation

Installing with pip (Linux only)

We have a PyPi release for Linux:

$ pip install bigartm

or

$ pip install bigartm10

Installing on Windows

We suggest using pre-build binaries.

It is also possible to compile C++ code on Windows you want the latest development version.

Installing on Linux / MacOS

Download binary release or build from source using cmake:

$ mkdir build && cd build
$ cmake ..
$ make install

See here for detailed instructions.

How to Use

Command-line interface

Check out documentation for bigartm.

Examples:

  • Basic model (20 topics, outputed to CSV-file, inferred in 10 passes)
bigartm.exe -d docword.kos.txt -v vocab.kos.txt --write-model-readable model.txt
--passes 10 --batch-size 50 --topics 20
  • Basic model with less tokens (filtered extreme values based on token's frequency)
bigartm.exe -d docword.kos.txt -v vocab.kos.txt --dictionary-max-df 50% --dictionary-min-df 2
--passes 10 --batch-size 50 --topics 20 --write-model-readable model.txt
  • Simple regularized model (increase sparsity up to 60-70%)
bigartm.exe -d docword.kos.txt -v vocab.kos.txt --dictionary-max-df 50% --dictionary-min-df 2
--passes 10 --batch-size 50 --topics 20  --write-model-readable model.txt 
--regularizer "0.05 SparsePhi" "0.05 SparseTheta"
  • More advanced regularize model, with 10 sparse objective topics, and 2 smooth background topics
bigartm.exe -d docword.kos.txt -v vocab.kos.txt --dictionary-max-df 50% --dictionary-min-df 2
--passes 10 --batch-size 50 --topics obj:10;background:2 --write-model-readable model.txt
--regularizer "0.05 SparsePhi #obj"
--regularizer "0.05 SparseTheta #obj"
--regularizer "0.25 SmoothPhi #background"
--regularizer "0.25 SmoothTheta #background" 

Interactive Python interface

BigARTM supports full-featured and clear Python API (see Installation to configure Python API for your OS).

Example:

import artm

# Prepare data
# Case 1: data in CountVectorizer format
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups
from numpy import array

cv = CountVectorizer(max_features=1000, stop_words='english')
n_wd = array(cv.fit_transform(fetch_20newsgroups().data).todense()).T
vocabulary = cv.get_feature_names()

bv = artm.BatchVectorizer(data_format='bow_n_wd',
                          n_wd=n_wd,
                          vocabulary=vocabulary)

# Case 2: data in UCI format (https://archive.ics.uci.edu/ml/datasets/Bag+of+Words)
bv = artm.BatchVectorizer(data_format='bow_uci',
                          collection_name='kos',
                          target_folder='kos_batches')

# Learn simple LDA model (or you can use advanced artm.ARTM)
model = artm.LDA(num_topics=15, dictionary=bv.dictionary)
model.fit_offline(bv, num_collection_passes=20)

# Print results
model.get_top_tokens()

Refer to tutorials for details on how to start using BigARTM from Python, user's guide can provide information about more advanced features and cases.

Low-level API

Contributing

Refer to the Developer's Guide and follows Code Style.

To report a bug use issue tracker. To ask a question use our mailing list. Feel free to make pull request.

License

BigARTM is released under New BSD License that allowes unlimited redistribution for any purpose (even for commercial use) as long as its copyright notices and the licenseโ€™s disclaimers of warranty are maintained.

Comments
  • Sphinx python docs

    Sphinx python docs

    This pull request introduces basic functionality of generating Python API documentation automatically, using docstrings. Feel free to comment and suggest new ideas before merge :)

    Unfortunately I didn't manage to create pull request against stable branch, because I created my branch from master and couldn't merge commits on master_component.py :-(

    opened by JeanPaulShapo 20
  • Build fails on MacOS 10.12

    Build fails on MacOS 10.12

    Jeroens-MacBook-Pro:build jeroen$ brew install boost
    Updating Homebrew...
    ==> Auto-updated Homebrew!
    Updated 1 tap (homebrew/core).
    ==> Updated Formulae
    aws-sdk-cpp                  conan                        knot                         mercurial                    servus                       vim
    awscli                       docker-machine               knot-resolver                phoronix-test-suite          svgcleaner                   wpcli-completion
    bazel                        docker-machine-completion    libgphoto2                   sdl_mixer                    termius                      yara
    citus                        gphoto2                      makepkg                      sdl_sound                    vapoursynth
    
    ==> Downloading https://homebrew.bintray.com/bottles/boost-1.64.0_1.sierra.bottle.tar.gz
    ######################################################################## 100.0%
    ==> Pouring boost-1.64.0_1.sierra.bottle.tar.gz
    ==> Using the sandbox
    ๐Ÿบ  /usr/local/Cellar/boost/1.64.0_1: 12,628 files, 395.7MB
    Jeroens-MacBook-Pro:build jeroen$ unset  BOOST_INCLUDEDIR
    Jeroens-MacBook-Pro:build jeroen$ cmake ..
    -- Build type: Release
    -- Boost version: 1.64.0
    -- Boost version: 1.64.0
    -- Found the following Boost libraries:
    --   thread
    --   program_options
    --   date_time
    --   filesystem
    --   iostreams
    --   system
    --   chrono
    --   timer
    --   atomic
    --   regex
    -- Looking for C++ include stdint.h
    -- Looking for C++ include stdint.h - found
    -- Looking for C++ include inttypes.h
    -- Looking for C++ include inttypes.h - found
    -- Looking for C++ include sys/types.h
    -- Looking for C++ include sys/types.h - found
    -- Looking for C++ include sys/stat.h
    -- Looking for C++ include sys/stat.h - found
    -- Looking for C++ include fnmatch.h
    -- Looking for C++ include fnmatch.h - found
    -- Looking for strtoll
    -- Looking for strtoll - found
    -- Looking for C++ include stddef.h
    -- Looking for C++ include stddef.h - found
    -- Check size of pthread_rwlock_t
    -- Check size of pthread_rwlock_t - done
    -- running mz compiler detection tools
    -- compiler is clang
    -- GCC compatible compiler found
    -- compiler version Apple LLVM version 8.1.0 (clang-802.0.42)
    Target: x86_64-apple-darwin16.6.0
    Thread model: posix
    InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
    -- C++11 support detected
    -- 64bit platform
    -- Today is: Tue, 06 Jun 2017 21:41:10 +0200
    
    -- forcing C++11 support on this platform
    -- configuring for build type: Release
    -- Looking for include file dlfcn.h
    -- Looking for include file dlfcn.h - found
    -- Looking for include file execinfo.h
    -- Looking for include file execinfo.h - found
    -- Looking for include file glob.h
    -- Looking for include file glob.h - found
    -- Looking for include file pthread.h
    -- Looking for include file pthread.h - found
    -- Looking for include file libunwind.h
    -- Looking for include file libunwind.h - found
    -- Looking for include file libunwind.h
    -- Looking for include file libunwind.h - found
    -- Looking for include file gflags/gflags.h
    -- Looking for include file gflags/gflags.h - not found
    -- Looking for include file memory.h
    -- Looking for include file memory.h - found
    -- Looking for include file pthread.h
    -- Looking for include file pthread.h - found
    -- Looking for include file pwd.h
    -- Looking for include file pwd.h - found
    -- Looking for include file stdlib.h
    -- Looking for include file stdlib.h - found
    -- Looking for include file strings.h
    -- Looking for include file strings.h - found
    -- Looking for include file syscall.h
    -- Looking for include file syscall.h - not found
    -- Looking for include file syslog.h
    -- Looking for include file syslog.h - found
    -- Looking for include file sys/call.h
    -- Looking for include file sys/call.h - not found
    -- Looking for include file sys/time.h
    -- Looking for include file sys/time.h - found
    -- Looking for include file sys/syscall.h
    -- Looking for include file sys/syscall.h - found
    -- Looking for include file sys/ucontext.h
    -- Looking for include file sys/ucontext.h - found
    -- Looking for include file sys/utsname.h
    -- Looking for include file sys/utsname.h - found
    -- Looking for include file ucontext.h
    -- Looking for include file ucontext.h - not found
    -- Looking for include file unwind.h
    -- Looking for include file unwind.h - found
    -- Looking for fcntl
    -- Looking for fcntl - found
    -- Looking for sigaltstack
    -- Looking for sigaltstack - found
    -- Looking for __builtin_expect
    -- Looking for __builtin_expect - not found
    -- Looking for __sync_val_compare_and_swap
    -- Looking for __sync_val_compare_and_swap - not found
    CMake Warning (dev) at 3rdparty/protobuf-3.0.0/cmake/install.cmake:41 (message):
      The file
      "/Users/jeroen/Downloads/bigartm/3rdparty/protobuf-3.0.0/src/google/protobuf/repeated_field_reflection.h"
      is listed in
      "/Users/jeroen/Downloads/bigartm/3rdparty/protobuf-3.0.0/cmake/cmake/extract_includes.bat.in"
      but there not exists.  The file will not be installed.
    Call Stack (most recent call first):
      3rdparty/protobuf-3.0.0/cmake/CMakeLists.txt:159 (include)
    This warning is for project developers.  Use -Wno-dev to suppress it.
    
    -- Performing Test COMPILER_SUPPORTS_CXX11
    -- Performing Test COMPILER_SUPPORTS_CXX11 - Success
    -- Performing Test COMPILER_SUPPORTS_CXX0X
    -- Performing Test COMPILER_SUPPORTS_CXX0X - Success
    -- Found GLOG: /Users/jeroen/Downloads/bigartm/3rdparty/glog/src
    -- Boost version: 1.64.0
    -- Found the following Boost libraries:
    --   thread
    --   program_options
    --   date_time
    --   filesystem
    --   iostreams
    --   system
    --   chrono
    --   timer
    --   atomic
    --   regex
    -- Configuring done
    -- Generating done
    -- Build files have been written to: /Users/jeroen/Downloads/bigartm/build
    Jeroens-MacBook-Pro:build jeroen$ make
    Scanning dependencies of target gflags-static
    [  0%] Building CXX object 3rdparty/gflags/CMakeFiles/gflags-static.dir/src/gflags.cc.o
    [  0%] Building CXX object 3rdparty/gflags/CMakeFiles/gflags-static.dir/src/gflags_reporting.cc.o
    [  1%] Building CXX object 3rdparty/gflags/CMakeFiles/gflags-static.dir/src/gflags_completions.cc.o
    [  1%] Linking CXX static library ../../lib/libgflags.a
    [  1%] Built target gflags-static
    Scanning dependencies of target google-glog
    [  1%] Building CXX object 3rdparty/glog/CMakeFiles/google-glog.dir/src/logging.cc.o
    [  2%] Building CXX object 3rdparty/glog/CMakeFiles/google-glog.dir/src/raw_logging.cc.o
    [  2%] Building CXX object 3rdparty/glog/CMakeFiles/google-glog.dir/src/vlog_is_on.cc.o
    [  2%] Building CXX object 3rdparty/glog/CMakeFiles/google-glog.dir/src/utilities.cc.o
    [  3%] Building CXX object 3rdparty/glog/CMakeFiles/google-glog.dir/src/demangle.cc.o
    [  3%] Building CXX object 3rdparty/glog/CMakeFiles/google-glog.dir/src/symbolize.cc.o
    [  3%] Building CXX object 3rdparty/glog/CMakeFiles/google-glog.dir/src/signalhandler.cc.o
    [  4%] Linking CXX static library ../../lib/libgoogle-glog.a
    [  4%] Built target google-glog
    Scanning dependencies of target libprotobuf
    [  4%] Building CXX object 3rdparty/protobuf-3.0.0/cmake/CMakeFiles/libprotobuf.dir/__/src/google/protobuf/arena.cc.o
    [  5%] Building CXX object 3rdparty/protobuf-3.0.0/cmake/CMakeFiles/libprotobuf.dir/__/src/google/protobuf/arenastring.cc.o
    /Users/jeroen/Downloads/bigartm/3rdparty/protobuf-3.0.0/src/google/protobuf/arenastring.cc:41:22: error: redefinition of 'AssignWithDefault'
    void ArenaStringPtr::AssignWithDefault(const ::std::string* default_value,
                         ^
    /usr/local/include/google/protobuf/arenastring.h:316:29: note: previous definition is here
    inline void ArenaStringPtr::AssignWithDefault(const ::std::string* default_value,
                                ^
    /Users/jeroen/Downloads/bigartm/3rdparty/protobuf-3.0.0/src/google/protobuf/arenastring.cc:47:48: error: too many arguments to function call, expected 0, have 1
        SetNoArena(default_value, value.GetNoArena(default_value));
                                  ~~~~~~~~~~~~~~~~ ^~~~~~~~~~~~~
    /usr/local/include/google/protobuf/arenastring.h:225:3: note: 'GetNoArena' declared here
      inline const ::std::string& GetNoArena() const { return *ptr_; }
      ^
    2 errors generated.
    make[2]: *** [3rdparty/protobuf-3.0.0/cmake/CMakeFiles/libprotobuf.dir/__/src/google/protobuf/arenastring.cc.o] Error 1
    make[1]: *** [3rdparty/protobuf-3.0.0/cmake/CMakeFiles/libprotobuf.dir/all] Error 2
    make: *** [all] Error 2
    
    opened by jeroen 19
  • CentOS linker issue

    CentOS linker issue

    On CentOS 7 I got following:

    [ 99%] Building CXX object src/bigartm/CMakeFiles/bigartm.dir//artm/cpp_interface.cc.o [100%] Building CXX object src/bigartm/CMakeFiles/bigartm.dir//artm/messages.pb.cc.o Linking CXX executable bigartm /usr/bin/ld: cannot find -lboost_thread-mt /usr/bin/ld: cannot find -lboost_program_options-mt /usr/bin/ld: cannot find -lboost_date_time-mt /usr/bin/ld: cannot find -lboost_filesystem-mt /usr/bin/ld: cannot find -lboost_iostreams-mt /usr/bin/ld: cannot find -lboost_system-mt /usr/bin/ld: cannot find -lboost_chrono-mt /usr/bin/ld: cannot find -lboost_timer-mt /usr/bin/ld: cannot find -lpthread /usr/bin/ld: cannot find -lstdc++ /usr/bin/ld: cannot find -lm /usr/bin/ld: cannot find -lpthread /usr/bin/ld: cannot find -lc collect2: error: ld returned 1 exit status make[2]: *** [src/bigartm/bigartm] Error 1 make[1]: *** [src/bigartm/CMakeFiles/bigartm.dir/all] Error 2 make: *** [all] Error 2

    Actually I have all this files, but they start with lib instead of l. I'm not make/cmake expert, so can't figure out how to fix this.

    documentation build 
    opened by dselivanov 19
  • Build standalone Python wheels

    Build standalone Python wheels

    I call "a standalone wheel" a package which does not require the libartm installation - aka "tensorflow style". Those wheels can be safely pushed on PyPi and will just work.

    I had to change:

    • python/CMakeLists.txt: add another custom target to execute setup.py bdist_wheel
    • python/setup.py: hack setuptools to include the needed shared library
    • python/artm/wrapper/api.py: refactor _load_cdll() to load the library from several places - the common practice. The env variable is the most important, then the default relative path and finally the absolute path in the package root.
    • docs - as far as I could

    Besides, there seems to be a bug in TravisCI configuration regarding python command detection. There is no explicit -DPYTHON in the build script, and by default cmake chooses python2 which prevents from testing py3.x:

    if (MSVC OR APPLE)
    set(PYTHON python CACHE INTERNAL "Python command")
    else (MSVC OR APPLE)
    set(PYTHON python2 CACHE INTERNAL "Python command")
    endif (MSVC OR APPLE)
    

    bdist_wheel command helped to find this bug: there is no wheel package installed in py2 env in Travis. I dared to remove python2 branch completely.

    opened by vmarkovtsev 18
  • Auto-generate C++-headers and sources from Proto-files

    Auto-generate C++-headers and sources from Proto-files

    Overview of changes are contained in commits' names. It includes usage of ${PROTOBUF_PROTOC_EXECUTABLE} in CMakeLists, so I strongly recommend someone with Windows to check these changes (maybe @sashafrey or @MelLain?) On my laptop wth Fedora 22 and no system protobuf distibution it works perfectly.

    opened by JeanPaulShapo 14
  • Get ASCII encoding problem if run .fit_offline

    Get ASCII encoding problem if run .fit_offline

    1. Try to run "ะ”ะตะผะพัั‚ั€ะฐั†ะธั BigARTM (ะฒะตั€ัะธั 0.8.0).ipynb" from https://www.coursera.org/learn/unsupervised-learning/supplement/suSWG/noutbuk-iz-diemonstratsii-ispol-zovaniia-bigartm
    2. Use artm.version() 0.8.1
    3. Run model_artm.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=40)
    4. Get C:\Coursera\Anaconda2\lib\site-packages\protobuf-2.5.1rc0-py2.7.egg\google\protobuf\internal\type_checkers.pyc in CheckValue(self, proposed_value)

    126 'encoding. Non-ASCII strings must be converted to ' 127 'unicode objects before being added.' % --> 128 (proposed_value))

    ValueError: 'c:\Coursera\week4\school_batches\aaaaaa.batch' has type str, but isn't in 7-bit ASCII encoding. Non-ASCII strings must be converted to unicode objects before being added.

    What could be the problem?

    opened by ValeraSarapas 13
  • master build fails outside of bigartm/build

    master build fails outside of bigartm/build

    While trying to build BigARTM outside of bigartm/build (i.e. somewhere like bigartm/../build) I got the following error:

    Generating ./artm/wrapper/messages_pb2.py...
    Traceback (most recent call last):
      File "/home/omtcyf0/Documents/dev/src/bigartm/python/setup.py", line 83, in <module>
        cmdclass = {'build': build},
      File "/usr/lib/python2.7/distutils/core.py", line 151, in setup
        dist.run_commands()
      File "/usr/lib/python2.7/distutils/dist.py", line 953, in run_commands
        self.run_command(cmd)
      File "/usr/lib/python2.7/distutils/dist.py", line 972, in run_command
        cmd_obj.run()
      File "/home/omtcyf0/Documents/dev/src/bigartm/python/setup.py", line 69, in run
        "./artm/wrapper/messages_pb2.py")
      File "/home/omtcyf0/Documents/dev/src/bigartm/python/setup.py", line 51, in generate_proto_files
        if subprocess.call(protoc_command):
      File "/usr/lib/python2.7/subprocess.py", line 522, in call
        return Popen(*popenargs, **kwargs).wait()
      File "/usr/lib/python2.7/subprocess.py", line 710, in __init__
        errread, errwrite)
      File "/usr/lib/python2.7/subprocess.py", line 1335, in _execute_child
        raise child_exception
    AttributeError: 'NoneType' object has no attribute 'rfind'
    

    Ubuntu 15.10, Python 2.7.10

    opened by kirillbobyrev 12
  • make install fails on ubuntu

    make install fails on ubuntu

    error: Setup script exited with error: Command "x86_64-linux-gnu-gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -Inumpy/core/include -Ibuild/src.linux-x86_64-2.7/numpy/core/include/numpy -I/usr/include/python2.7 -Ibuild/src.linux-x86_64-2.7/numpy/core/src/private -Ibuild/src.linux-x86_64-2.7/numpy/core/src/private -Ibuild/src.linux-x86_64-2.7/numpy/core/src/private -c build/src.linux-x86_64-2.7/numpy/core/src/npymath/ieee754.c -o build/temp.linux-x86_64-2.7/build/src.linux-x86_64-2.7/numpy/core/src/npymath/ieee754.o" failed with exit status 1numpy/core/src/npymath/ieee754.c.src:7:29: fatal error: npy_math_common.h: No such file or directory

    And I don't even need numpy, I already installed it by pip install AND by anaconda :(

    build 
    opened by vadamoto 11
  • Enable running bigartm-cli as standalone executable.

    Enable running bigartm-cli as standalone executable.

    It works on Linux (Fedora 21) without Intel MKL. Before merging it's strongly recommended to test on Windows/Linux with/without Intel MKL (i don't have Windows and Intel MKL binaries :( )

    [PR: review OK] 
    opened by JeanPaulShapo 11
  • Error during MAKE on Ubuntu 16

    Error during MAKE on Ubuntu 16

    Try to make, MAKE command and get

    [ 93%] Building CXX object src/artm_tests/CMakeFiles/artm_tests.dir/master_model_test.cc.o In file included from /home/vshebuniayeu/bigARTM/bigartm/src/artm_tests/master_model_test.cc:7:0: /home/vshebuniayeu/bigARTM/bigartm/3rdparty/gtest/fused-src/gtest/gtest.h: In instantiation of โ€˜testing::AssertionResult testing::internal::CmpHelperEQ(const char*, const char*, const T1&, const T2&) [with T1 = long unsigned int; T2 = int]โ€™: /home/vshebuniayeu/bigARTM/bigartm/3rdparty/gtest/fused-src/gtest/gtest.h:18898:30: required from โ€˜static testing::AssertionResult testing::internal::EqHelper<lhs_is_null_literal>::Compare(const char*, const char*, const T1&, const T2&) [with T1 = long unsigned int; T2 = int; bool lhs_is_null_literal = false]โ€™ /home/vshebuniayeu/bigARTM/bigartm/src/artm_tests/master_model_test.cc:110:5: required from here /home/vshebuniayeu/bigARTM/bigartm/3rdparty/gtest/fused-src/gtest/gtest.h:18861:16: warning: comparison between signed and unsigned integer expressions [-Wsign-compare] if (expected == actual) { ^ [ 94%] Building CXX object src/artm_tests/CMakeFiles/artm_tests.dir/multiple_classes_test.cc.o [ 94%] Building CXX object src/artm_tests/CMakeFiles/artm_tests.dir/regularizers_test.cc.o [ 94%] Building CXX object src/artm_tests/CMakeFiles/artm_tests.dir/scores_test.cc.o /home/vshebuniayeu/bigARTM/bigartm/src/artm_tests/scores_test.cc: In member function โ€˜virtual void Scores_ScoreTrackerExportImport_Test::TestBody()โ€™: /home/vshebuniayeu/bigARTM/bigartm/src/artm_tests/scores_test.cc:165:24: warning: comparison between signed and unsigned integer expressions [-Wsign-compare] for (size_t i = 0; i < nPasses; ++i) { ^ /home/vshebuniayeu/bigARTM/bigartm/src/artm_tests/scores_test.cc:172:24: warning: comparison between signed and unsigned integer expressions [-Wsign-compare] for (size_t i = 0; i < nPasses; ++i) { ^ [ 95%] Building CXX object src/artm_tests/CMakeFiles/artm_tests.dir/repeatable_result_test.cc.o In file included from /home/vshebuniayeu/bigARTM/bigartm/src/artm_tests/repeatable_result_test.cc:4:0: /home/vshebuniayeu/bigARTM/bigartm/3rdparty/gtest/fused-src/gtest/gtest.h: In instantiation of โ€˜testing::AssertionResult testing::internal::CmpHelperEQ(const char*, const char*, const T1&, const T2&) [with T1 = long unsigned int; T2 = int]โ€™: /home/vshebuniayeu/bigARTM/bigartm/3rdparty/gtest/fused-src/gtest/gtest.h:18898:30: required from โ€˜static testing::AssertionResult testing::internal::EqHelper<lhs_is_null_literal>::Compare(const char*, const char*, const T1&, const T2&) [with T1 = long unsigned int; T2 = int; bool lhs_is_null_literal = false]โ€™ /home/vshebuniayeu/bigARTM/bigartm/src/artm_tests/repeatable_result_test.cc:72:3: required from here /home/vshebuniayeu/bigARTM/bigartm/3rdparty/gtest/fused-src/gtest/gtest.h:18861:16: warning: comparison between signed and unsigned integer expressions [-Wsign-compare] if (expected == actual) { ^ [ 95%] Building CXX object src/artm_tests/CMakeFiles/artm_tests.dir/supcry_test.cc.o [ 95%] Building CXX object src/artm_tests/CMakeFiles/artm_tests.dir/template_manager_test.cc.o [ 96%] Building CXX object src/artm_tests/CMakeFiles/artm_tests.dir/test_mother.cc.o [ 96%] Building CXX object src/artm_tests/CMakeFiles/artm_tests.dir/thread_safe_holder_test.cc.o [ 97%] Building CXX object src/artm_tests/CMakeFiles/artm_tests.dir/topic_seg_test.cc.o [ 97%] Building CXX object src/artm_tests/CMakeFiles/artm_tests.dir///3rdparty/gtest/fused-src/gtest/gtest_main.cc.o [ 97%] Building CXX object src/artm_tests/CMakeFiles/artm_tests.dir///3rdparty/gtest/fused-src/gtest/gtest-all.cc.o [ 98%] Building CXX object src/artm_tests/CMakeFiles/artm_tests.dir/__/artm/cpp_interface.cc.o [ 98%] Linking CXX executable ../../bin/artm_tests CMakeFiles/artm_tests.dir/collection_parser_test.cc.o: In function CollectionParser_MatrixMarket_Test::TestBody()': collection_parser_test.cc:(.text+0x760): undefined reference toboost::filesystem::path_traits::dispatch(boost::filesystem::directory_entry const&, std::string&)' CMakeFiles/artm_tests.dir/collection_parser_test.cc.o: In function CollectionParser_VowpalWabbit_Test::TestBody()': collection_parser_test.cc:(.text+0xdea): undefined reference toboost::filesystem::path_traits::dispatch(boost::filesystem::directory_entry const&, std::string&)' CMakeFiles/artm_tests.dir/collection_parser_test.cc.o: In function CollectionParser_UciBagOfWords_Test::TestBody()': collection_parser_test.cc:(.text+0x29c0): undefined reference toboost::filesystem::path_traits::dispatch(boost::filesystem::directory_entry const&, std::string&)' CMakeFiles/artm_tests.dir/collection_parser_test.cc.o: In function CollectionParser_Multiclass_Test::TestBody()': collection_parser_test.cc:(.text+0x33d2): undefined reference toboost::filesystem::path_traits::dispatch(boost::filesystem::directory_entry const&, std::string&)' ../../lib/libartm-static.a(helpers.cc.o): In function artm::core::Helpers::ListAllBatches(boost::filesystem::path const&)': helpers.cc:(.text+0x1124): undefined reference toboost::filesystem::path_traits::dispatch(boost::filesystem::directory_entry const&, std::string&)' collect2: error: ld returned 1 exit status src/artm_tests/CMakeFiles/artm_tests.dir/build.make:606: recipe for target 'bin/artm_tests' failed make[2]: *** [bin/artm_tests] Error 1 CMakeFiles/Makefile2:661: recipe for target 'src/artm_tests/CMakeFiles/artm_tests.dir/all' failed make[1]: *** [src/artm_tests/CMakeFiles/artm_tests.dir/all] Error 2 Makefile:138: recipe for target 'all' failed make: *** [all] Error 2

    question 
    opened by vladimircape 10
  • Fix problem with newline in vocab files

    Fix problem with newline in vocab files

    @ofrei

    • this PR indents to fix #519;
    • new vocab file must be without the last newline;
    • is it good idea to have identical code in src/artm/core/dicitonary.cc and src/artm/core/collection_parser.cc?
    [PR: review OK] 
    opened by JeanPaulShapo 10
  • Bump protobuf from 3.0.0 to 3.18.3 in /docs

    Bump protobuf from 3.0.0 to 3.18.3 in /docs

    Bumps protobuf from 3.0.0 to 3.18.3.

    Release notes

    Sourced from protobuf's releases.

    Protocol Buffers v3.18.3

    C++

    Protocol Buffers v3.16.1

    Java

    • Improve performance characteristics of UnknownFieldSet parsing (#9371)

    Protocol Buffers v3.18.2

    Java

    • Improve performance characteristics of UnknownFieldSet parsing (#9371)

    Protocol Buffers v3.18.1

    Python

    • Update setup.py to reflect that we now require at least Python 3.5 (#8989)
    • Performance fix for DynamicMessage: force GetRaw() to be inlined (#9023)

    Ruby

    • Update ruby_generator.cc to allow proto2 imports in proto3 (#9003)

    Protocol Buffers v3.18.0

    C++

    • Fix warnings raised by clang 11 (#8664)
    • Make StringPiece constructible from std::string_view (#8707)
    • Add missing capability attributes for LLVM 12 (#8714)
    • Stop using std::iterator (deprecated in C++17). (#8741)
    • Move field_access_listener from libprotobuf-lite to libprotobuf (#8775)
    • Fix #7047 Safely handle setlocale (#8735)
    • Remove deprecated version of SetTotalBytesLimit() (#8794)
    • Support arena allocation of google::protobuf::AnyMetadata (#8758)
    • Fix undefined symbol error around SharedCtor() (#8827)
    • Fix default value of enum(int) in json_util with proto2 (#8835)
    • Better Smaller ByteSizeLong
    • Introduce event filters for inject_field_listener_events
    • Reduce memory usage of DescriptorPool
    • For lazy fields copy serialized form when allowed.
    • Re-introduce the InlinedStringField class
    • v2 access listener
    • Reduce padding in the proto's ExtensionRegistry map.
    • GetExtension performance optimizations
    • Make tracker a static variable rather than call static functions
    • Support extensions in field access listener
    • Annotate MergeFrom for field access listener
    • Fix incomplete types for field access listener
    • Add map_entry/new_map_entry to SpecificField in MessageDifferencer. They record the map items which are different in MessageDifferencer's reporter.
    • Reduce binary size due to fieldless proto messages
    • TextFormat: ParseInfoTree supports getting field end location in addition to start.

    ... (truncated)

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 0
  • Fix README:

    Fix README:

    Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.

    opened by fkrasnov 0
  • Fix the warning of scipy

    Fix the warning of scipy

    /usr/local/lib/python3.8/dist-packages/artm/batches_utils.py:227: DeprecationWarning: Please use spmatrix from the scipy.sparse namespace, the scipy.sparse.base namespace is deprecated. from scipy.sparse.base import spmatrix

    opened by fkrasnov 0
  • nan in Perplexity for big file with train data

    nan in Perplexity for big file with train data

    Hi!

    I use bigartm=0.9.2 to train my topic model. And I run into some unexpectable behaviour of Perplexity score. I have a very big file with train data. And I got nan in perplexity score after train my model. I find out that if train file is bigger than 1000 documents, I have nan, and when it <= I got numerical value of Perplexity score. I tried to use bigartm==0.10.2, it doesn't help.

    I also tried to use batch_size=100, it doesnt't help too. But when I set batch_size==100000 or batch_size==10000, I have numerical value of Perplexity score.

    I happy that I solved my problem, but this behaviour is unexpectable. And now it turns out that I should use very big batch_size. My train data is getting biggerover time, so I want to know in advance what batch_size I shold use to not have nan in Perplexity score in future. Which batch_size do you recommend? Should I use batch_size that is the same size as number of documents in my train collection?

    opened by Guince 1
  • vocab.kos.txt does not exist ('interactive python interface' Example from ReadMe)

    vocab.kos.txt does not exist ('interactive python interface' Example from ReadMe)

    File "git_script.py", line 18, in bv = artm.BatchVectorizer(data_format='bow_uci', File "/home/roman/.local/lib/python3.8/site-packages/artm/batches_utils.py", line 113, in init self._parse_uci_or_vw(data_weight=data_weight, File "/home/roman/.local/lib/python3.8/site-packages/artm/batches_utils.py", line 191, in _parse_uci_or_vw lib.ArtmParseCollection(parser_config) File "/home/roman/.local/lib/python3.8/site-packages/artm/wrapper/api.py", line 161, in artm_api_call self._check_error(result) File "/home/roman/.local/lib/python3.8/site-packages/artm/wrapper/api.py", line 97, in _check_error raise exception_class(error_message) artm.wrapper.exceptions.DiskReadException: File vocab.kos.txt does not exist.

    opened by RomanAvdeev 0
Releases(v0.10.1)
Owner
BigARTM
State-of-the-art Topic Modeling
BigARTM
This is a simple item2vec implementation using gensim for recbole

recbole-item2vec-model This is a simple item2vec implementation using gensim for recbole( https://recbole.io ) Usage When you want to run experiment f

Yusuke Fukasawa 2 Oct 06, 2022
Lyrics generation with GPT2-based Transformer

HuggingArtists - Train a model to generate lyrics Create AI-Artist in just 5 minutes! ๐Ÿš€ Run the demo notebook to train ๐Ÿš€ Run the GUI demo to test Di

Aleksey Korshuk 65 Dec 19, 2022
Exploring dimension-reduced embeddings

sleepwalk Exploring dimension-reduced embeddings This is the code repository. See here for the Sleepwalk web page. License and disclaimer This program

S. Anders's research group at ZMBH 91 Nov 29, 2022
GPT-Code-Clippy (GPT-CC) is an open source version of GitHub Copilot, a language model

GPT-Code-Clippy (GPT-CC) is an open source version of GitHub Copilot, a language model -- based on GPT-3, called GPT-Codex -- that is fine-tuned on publicly available code from GitHub.

Nathan Cooper 2.3k Jan 01, 2023
A fast and lightweight python-based CTC beam search decoder for speech recognition.

pyctcdecode A fast and feature-rich CTC beam search decoder for speech recognition written in Python, providing n-gram (kenlm) language model support

Kensho 315 Dec 21, 2022
A desktop GUI providing an audio interface for GPT3.

Jabberwocky neil_degrasse_tyson_with_audio.mp4 Project Description This GUI provides an audio interface to GPT-3. My main goal was to provide a conven

16 Nov 27, 2022
Unet-TTS: Improving Unseen Speaker and Style Transfer in One-shot Voice Cloning

Unet-TTS: Improving Unseen Speaker and Style Transfer in One-shot Voice Cloning English | ไธญๆ–‡ โ— Now we provide inferencing code and pre-training models

164 Jan 02, 2023
Official code for "Parser-Free Virtual Try-on via Distilling Appearance Flows", CVPR 2021

Parser-Free Virtual Try-on via Distilling Appearance Flows, CVPR 2021 Official code for CVPR 2021 paper 'Parser-Free Virtual Try-on via Distilling App

395 Jan 03, 2023
A simple implementation of N-gram language model.

About A simple implementation of N-gram language model. Requirements numpy Data preparation Corpus Training data for the N-gram model, a text file lik

4 Nov 24, 2021
Simple bots or Simbots is a library designed to create simple bots using the power of python. This library utilises Intent, Entity, Relation and Context model to create bots .

Simple bots or Simbots is a library designed to create simple chat bots using the power of python. This library utilises Intent, Entity, Relation and

14 Dec 15, 2021
pyupbit ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ํ™œ์šฉํ•˜์—ฌ upbit์—์„œ ๋น„ํŠธ์ฝ”์ธ์„ ์ž๋™๋งค๋งคํ•˜๋Š” ์ฝ”๋“œ์ž…๋‹ˆ๋‹ค. ์กฐ์ฝ”๋”ฉ ์œ ํŠœ๋ธŒ ์ฑ„๋„์—์„œ ์ž์„ธํ•œ ๊ฐ•์˜ ์˜์ƒ์„ ๋ณด์‹ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

ํŒŒ์ด์ฌ ๋น„ํŠธ์ฝ”์ธ ํˆฌ์ž ์ž๋™ํ™” ๊ฐ•์˜ ์ฝ”๋“œ by ์œ ํŠœ๋ธŒ ์กฐ์ฝ”๋”ฉ ์ฑ„๋„ pyupbit ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ํ™œ์šฉํ•˜์—ฌ upbit ๊ฑฐ๋ž˜์†Œ์—์„œ ๋น„ํŠธ์ฝ”์ธ ์ž๋™๋งค๋งค๋ฅผ ํ•˜๋Š” ์ฝ”๋“œ์ž…๋‹ˆ๋‹ค. ํŒŒ์ผ ๊ตฌ์„ฑ test.py : ์ž”๊ณ  ์กฐํšŒ (1๊ฐ•) backtest.py : ๋ฐฑํ…Œ์ŠคํŒ… ์ฝ”๋“œ (2๊ฐ•) bestK.p

์กฐ์ฝ”๋”ฉ JoCoding 186 Dec 29, 2022
Trex is a tool to match semantically similar functions based on transfer learning.

Trex is a tool to match semantically similar functions based on transfer learning.

62 Dec 28, 2022
An evaluation toolkit for voice conversion models.

Voice-conversion-evaluation An evaluation toolkit for voice conversion models. Sample test pair Generate the metadata for evaluating models. The direc

30 Aug 29, 2022
BERT-based Financial Question Answering System

BERT-based Financial Question Answering System In this example, we use Jina, PyTorch, and Hugging Face transformers to build a production-ready BERT-b

Bithiah Yuan 61 Sep 18, 2022
Scikit-learn style model finetuning for NLP

Scikit-learn style model finetuning for NLP Finetune is a library that allows users to leverage state-of-the-art pretrained NLP models for a wide vari

indico 665 Dec 17, 2022
Use Google's BERT for named entity recognition ๏ผˆCoNLL-2003 as the dataset๏ผ‰.

For better performance, you can try NLPGNN, see NLPGNN for more details. BERT-NER Version 2 Use Google's BERT for named entity recognition ๏ผˆCoNLL-2003

Kaiyinzhou 1.2k Dec 26, 2022
Official implementation of Meta-StyleSpeech and StyleSpeech

Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation Dongchan Min, Dong Bok Lee, Eunho Yang, and Sung Ju Hwang This is an official code

min95 169 Jan 05, 2023
Toy example of an applied ML pipeline for me to experiment with MLOps tools.

Toy Machine Learning Pipeline Table of Contents About Getting Started ML task description and evaluation procedure Dataset description Repository stru

Shreya Shankar 190 Dec 21, 2022
๐ŸŒ Translation microservice powered by AI

Dot Translate ๐ŸŒ A microservice for quick and local translation using A.I. This service starts a local webserver used for neural machine translation.

Dot HQ 48 Nov 22, 2022
This library is testing the ethics of language models by using natural adversarial texts.

prompt2slip This library is testing the ethics of language models by using natural adversarial texts. This tool allows for short and simple code and v

9 Dec 28, 2021