👄 The most accurate natural language detection library for Java and the JVM, suitable for long and short text alike

Overview

lingua




Quick Info

  • this library tries to solve language detection of very short words and phrases, even shorter than tweets
  • makes use of both statistical and rule-based approaches
  • outperforms Apache Tika, Apache OpenNLP and Optimaize Language Detector for more than 70 languages
  • works within every Java 6+ application and on Android
  • no additional training of language models necessary
  • API for adding your own language models
  • offline usage without having to connect to an external service or API
  • can be used in a REPL for a quick try-out

Table of Contents

  1. What does this library do?
  2. Why does this library exist?
  3. Which languages are supported?
  4. How good is it?
  5. Why is it better than other libraries?
  6. Test report and plot generation
  7. How to add it to your project?
    7.1 Using Gradle
    7.2 Using Maven
  8. How to build?
  9. How to use?
    9.1 Programmatic use
    9.2 Standalone mode
  10. What's next for version 1.1.0?

1. What does this library do? Top ▲

Its task is simple: It tells you which language some provided textual data is written in. This is very useful as a preprocessing step for linguistic data in natural language processing applications such as text classification and spell checking. Other use cases, for instance, might include routing e-mails to the right geographically located customer service department, based on the e-mails' languages.

2. Why does this library exist? Top ▲

Language detection is often done as part of large machine learning frameworks or natural language processing applications. In cases where you don't need the full-fledged functionality of those systems or don't want to learn the ropes of those, a small flexible library comes in handy.

So far, three other comprehensive open source libraries are available on the JVM for this task: Apache Tika, Apache OpenNLP and Optimaize Language Detector. Unfortunately, the last of these in particular has three major drawbacks:

  1. Detection only works with quite lengthy text fragments. For very short text snippets such as Twitter messages, it doesn't provide adequate results.
  2. The more languages take part in the decision process, the less accurate the detection results are.
  3. Configuration of the library is quite cumbersome and requires some knowledge about the statistical methods that are used internally.

Lingua aims to eliminate these problems. It needs almost no configuration and yields fairly accurate results on both long and short text, even on single words and phrases. It draws on both rule-based and statistical methods but does not use any dictionaries of words. It does not need a connection to any external API or service either. Once the library has been downloaded, it can be used completely offline.

3. Which languages are supported? Top ▲

Compared to other language detection libraries, Lingua's focus is on quality over quantity, that is, getting detection right for a small set of languages first before adding new ones. Currently, the following 74 languages are supported:

  • A
    • Afrikaans
    • Albanian
    • Arabic
    • Armenian
    • Azerbaijani
  • B
    • Basque
    • Belarusian
    • Bengali
    • Norwegian Bokmal
    • Bosnian
    • Bulgarian
  • C
    • Catalan
    • Chinese
    • Croatian
    • Czech
  • D
    • Danish
    • Dutch
  • E
    • English
    • Esperanto
    • Estonian
  • F
    • Finnish
    • French
  • G
    • Ganda
    • Georgian
    • German
    • Greek
    • Gujarati
  • H
    • Hebrew
    • Hindi
    • Hungarian
  • I
    • Icelandic
    • Indonesian
    • Irish
    • Italian
  • J
    • Japanese
  • K
    • Kazakh
    • Korean
  • L
    • Latin
    • Latvian
    • Lithuanian
  • M
    • Macedonian
    • Malay
    • Marathi
    • Mongolian
  • N
    • Norwegian Nynorsk
  • P
    • Persian
    • Polish
    • Portuguese
    • Punjabi
  • R
    • Romanian
    • Russian
  • S
    • Serbian
    • Shona
    • Slovak
    • Slovene
    • Somali
    • Sotho
    • Spanish
    • Swahili
    • Swedish
  • T
    • Tagalog
    • Tamil
    • Telugu
    • Thai
    • Tsonga
    • Tswana
    • Turkish
  • U
    • Ukrainian
    • Urdu
  • V
    • Vietnamese
  • W
    • Welsh
  • X
    • Xhosa
  • Y
    • Yoruba
  • Z
    • Zulu

4. How good is it? Top ▲

Lingua is able to report accuracy statistics for some bundled test data available for each supported language. The test data for each language is split into three parts:

  1. a list of single words with a minimum length of 5 characters
  2. a list of word pairs with a minimum length of 10 characters
  3. a list of complete grammatical sentences of various lengths

Both the language models and the test data have been created from separate documents of the Wortschatz corpora offered by Leipzig University, Germany. Data crawled from various news websites have been used for training, each corpus comprising one million sentences. For testing, corpora made of arbitrarily chosen websites have been used, each comprising ten thousand sentences. From each test corpus, a random unsorted subset of 1000 single words, 1000 word pairs and 1000 sentences has been extracted, respectively.

Given the generated test data, I have compared the detection results of Lingua, Apache Tika, Apache OpenNLP and Optimaize Language Detector using parameterized JUnit tests running over the data of Lingua's supported 74 languages. Languages that are not supported by the other libraries are simply ignored for those during the detection process.

The box plot below shows the distribution of the averaged accuracy values for all three performed tasks: Single word detection, word pair detection and sentence detection. Lingua clearly outperforms its contenders. Bar plots for each language and further box plots for the separate detection tasks can be found in the file ACCURACY_PLOTS.md. Detailed statistics including mean, median and standard deviation values for each language and classifier are available in the file ACCURACY_TABLE.md.

boxplot-average

5. Why is it better than other libraries? Top ▲

Every language detector uses a probabilistic n-gram model trained on the character distribution in some training corpus. Most libraries only use n-grams of size 3 (trigrams), which is satisfactory for detecting the language of longer text fragments consisting of multiple sentences. For short phrases or single words, however, trigrams are not enough. The shorter the input text is, the fewer n-grams are available, and probabilities estimated from so few n-grams are not reliable. This is why Lingua makes use of n-grams of sizes 1 up to 5, which results in much more accurate prediction of the correct language.
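To make the point about short input concrete, here is a minimal sketch in plain Java that extracts character n-grams of a given size from a word. The class and method names are hypothetical and this is not Lingua's internal implementation; it only illustrates that a five-letter word yields 3 trigrams but 15 n-grams once sizes 1 to 5 are all used.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only, not part of Lingua's API.
final class NgramSketch {

    // Returns all character n-grams of exactly the given size, in order of occurrence.
    static List<String> ngrams(String text, int size) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i + size <= text.length(); i++) {
            result.add(text.substring(i, i + size));
        }
        return result;
    }

    public static void main(String[] args) {
        String word = "hello";
        int total = 0;
        for (int size = 1; size <= 5; size++) {
            List<String> grams = ngrams(word, size);
            total += grams.size();
            System.out.println(size + "-grams: " + grams);
        }
        // Trigrams alone give 3 observations; sizes 1..5 together give 15.
        System.out.println("total n-grams of size 1..5: " + total);
    }
}
```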

A second important difference is that Lingua does not only use such a statistical model, but also a rule-based engine. This engine first determines the alphabet of the input text and searches for characters which are unique to one or more languages. If exactly one language can be reliably chosen this way, the statistical model is not necessary anymore. In any case, the rule-based engine filters out languages that do not satisfy the conditions of the input text. Only then, in a second step, is the probabilistic n-gram model taken into consideration. This makes sense because loading fewer language models means less memory consumption and better runtime performance.
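The filtering step can be pictured roughly as follows. This is a simplified sketch under stated assumptions: the language-to-script table is hypothetical and tiny, languages are plain strings, and only the script check is shown (not the search for unique characters). It relies solely on the JDK's `Character.UnicodeScript`.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of script-based pre-filtering, not Lingua's engine.
final class ScriptFilterSketch {

    // Assumed table: the script each candidate language is written in.
    static final Map<String, UnicodeScript> SCRIPT_OF = new LinkedHashMap<>();
    static {
        SCRIPT_OF.put("ENGLISH", UnicodeScript.LATIN);
        SCRIPT_OF.put("GERMAN", UnicodeScript.LATIN);
        SCRIPT_OF.put("RUSSIAN", UnicodeScript.CYRILLIC);
        SCRIPT_OF.put("GREEK", UnicodeScript.GREEK);
    }

    // Collects the scripts of all letters in the text; digits and punctuation are ignored.
    static Set<UnicodeScript> scriptsOf(String text) {
        Set<UnicodeScript> scripts = EnumSet.noneOf(UnicodeScript.class);
        text.codePoints()
            .filter(Character::isLetter)
            .forEach(cp -> scripts.add(UnicodeScript.of(cp)));
        return scripts;
    }

    // Keeps only the languages whose script actually occurs in the text.
    static List<String> candidates(String text) {
        Set<UnicodeScript> scripts = scriptsOf(text);
        List<String> remaining = new ArrayList<>();
        for (Map.Entry<String, UnicodeScript> entry : SCRIPT_OF.entrySet()) {
            if (scripts.contains(entry.getValue())) {
                remaining.add(entry.getKey());
            }
        }
        return remaining;
    }

    public static void main(String[] args) {
        // Latin input: two candidates remain, so a statistical model would still be needed.
        System.out.println(candidates("Sprachen"));
        // Cyrillic input (escaped to keep the source ASCII-only): one candidate remains.
        System.out.println(candidates("\u041f\u0440\u0438\u0432\u0435\u0442"));
    }
}
```

With a single remaining candidate, no n-gram model has to be loaded at all, which is where the memory and runtime savings come from.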

In general, it is always a good idea to restrict the set of languages to be considered in the classification process using the respective API methods. If you know beforehand that certain languages will never occur in an input text, do not let them take part in the classification process. The filtering mechanism of the rule-based engine is quite good; however, filtering based on your own knowledge of the input text is always preferable.

6. Test report and plot generation Top ▲

If you want to reproduce the accuracy results above, you can generate the test reports yourself for all four classifiers and all languages by doing:

./gradlew accuracyReport

You can also restrict the classifiers and languages to generate reports for by passing arguments to the Gradle task. The following task generates reports for Lingua and the languages English and German only:

./gradlew accuracyReport -Pdetectors=Lingua -Planguages=English,German

By default, only a single CPU core is used for report generation. If you have a multi-core CPU in your machine, you can fork as many processes as you have CPU cores. This speeds up report generation significantly. However, be aware that forking more than one process can consume a lot of RAM. You do it like this:

./gradlew accuracyReport -PcpuCores=2

For each detector and language, a test report file is then written into /accuracy-reports, to be found next to the src directory. As an example, here is the current output of the Lingua German report:

com.github.pemistahl.lingua.report.lingua.GermanDetectionAccuracyReport

##### GERMAN #####

>>> Accuracy on average: 89.10%

>> Detection of 1000 single words (average length: 9 chars)
Accuracy: 73.60%
Erroneously classified as DUTCH: 2.30%, ENGLISH: 2.10%, DANISH: 2.10%, LATIN: 2.00%, BOKMAL: 1.60%, ITALIAN: 1.20%, BASQUE: 1.20%, FRENCH: 1.20%, ESPERANTO: 1.10%, SWEDISH: 1.00%, AFRIKAANS: 0.80%, TSONGA: 0.70%, PORTUGUESE: 0.60%, NYNORSK: 0.60%, FINNISH: 0.50%, YORUBA: 0.50%, ESTONIAN: 0.50%, WELSH: 0.50%, SOTHO: 0.50%, SPANISH: 0.40%, SWAHILI: 0.40%, IRISH: 0.40%, ICELANDIC: 0.40%, POLISH: 0.40%, TSWANA: 0.40%, TAGALOG: 0.30%, CATALAN: 0.30%, BOSNIAN: 0.30%, LITHUANIAN: 0.20%, INDONESIAN: 0.20%, ALBANIAN: 0.20%, SLOVAK: 0.20%, ZULU: 0.20%, CROATIAN: 0.20%, ROMANIAN: 0.20%, XHOSA: 0.20%, TURKISH: 0.10%, LATVIAN: 0.10%, MALAY: 0.10%, SLOVENE: 0.10%, SOMALI: 0.10%

>> Detection of 1000 word pairs (average length: 18 chars)
Accuracy: 94.00%
Erroneously classified as DUTCH: 0.90%, LATIN: 0.80%, ENGLISH: 0.70%, SWEDISH: 0.60%, DANISH: 0.50%, FRENCH: 0.40%, BOKMAL: 0.30%, TAGALOG: 0.20%, IRISH: 0.20%, SWAHILI: 0.20%, TURKISH: 0.10%, ZULU: 0.10%, ESPERANTO: 0.10%, ESTONIAN: 0.10%, FINNISH: 0.10%, ITALIAN: 0.10%, NYNORSK: 0.10%, ICELANDIC: 0.10%, AFRIKAANS: 0.10%, SOMALI: 0.10%, TSONGA: 0.10%, WELSH: 0.10%

>> Detection of 1000 sentences (average length: 111 chars)
Accuracy: 99.70%
Erroneously classified as DUTCH: 0.20%, LATIN: 0.10%

The plots have been created with Python and the libraries Pandas, Matplotlib and Seaborn. If you have a global Python 3 installation and the python3 command available on your command line, you can redraw the plots after modifying the test reports by executing the following Gradle task:

./gradlew drawAccuracyPlots

The detailed table in the file ACCURACY_TABLE.md containing all accuracy values can be written with:

./gradlew writeAccuracyTable

7. How to add it to your project? Top ▲

Lingua is hosted on Jcenter and Maven Central.

7.1 Using Gradle

// Groovy syntax
implementation 'com.github.pemistahl:lingua:1.0.3'

// Kotlin syntax
implementation("com.github.pemistahl:lingua:1.0.3")

7.2 Using Maven

<dependency>
    <groupId>com.github.pemistahl</groupId>
    <artifactId>lingua</artifactId>
    <version>1.0.3</version>
</dependency>

8. How to build? Top ▲

Lingua uses Gradle to build and requires Java >= 1.8 for that.

git clone https://github.com/pemistahl/lingua.git
cd lingua
./gradlew build

Several jar archives can be created from the project.

  1. ./gradlew jar assembles lingua-1.0.3.jar containing the compiled sources only.
  2. ./gradlew sourcesJar assembles lingua-1.0.3-sources.jar containing the plain source code.
  3. ./gradlew jarWithDependencies assembles lingua-1.0.3-with-dependencies.jar containing the compiled sources and all external dependencies needed at runtime. This jar file can be included in projects without dependency management systems. You should be able to use it in your Android project as well by putting it in your project's lib folder. This jar file can also be used to run Lingua in standalone mode (see below).

9. How to use? Top ▲

Lingua can be used programmatically in your own code or in standalone mode.

9.1 Programmatic use Top ▲

The API is pretty straightforward and can be used in both Kotlin and Java code.

/* Kotlin */

import com.github.pemistahl.lingua.api.*
import com.github.pemistahl.lingua.api.Language.*

val detector: LanguageDetector = LanguageDetectorBuilder.fromLanguages(ENGLISH, FRENCH, GERMAN, SPANISH).build()
val detectedLanguage: Language = detector.detectLanguageOf(text = "languages are awesome")

By default, Lingua returns the most likely language for a given input text. However, there are certain words that are spelled the same in more than one language. The word prologue, for instance, is both a valid English and French word. Lingua would output either English or French which might be wrong in the given context. For cases like that, it is possible to specify a minimum relative distance that the logarithmized and summed up probabilities for each possible language have to satisfy. It can be stated in the following way:

val detector = LanguageDetectorBuilder
    .fromAllLanguages()
    .withMinimumRelativeDistance(0.25) // minimum: 0.00 maximum: 0.99 default: 0.00
    .build()

Be aware that the distance between the language probabilities is dependent on the length of the input text. The longer the input text, the larger the distance between the languages. So if you want to classify very short text phrases, do not set the minimum relative distance too high. Otherwise you will get most results returned as Language.UNKNOWN which is the return value for cases where language detection is not reliably possible.
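The effect of the threshold can be sketched in plain Java. The exact formula Lingua applies is not spelled out here, so this sketch assumes that the distance is simply the difference between the two highest relative scores, and that a tie is treated as undecidable. All names are hypothetical and languages are modelled as strings.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of a minimum-distance rule, not Lingua's implementation.
final class RelativeDistanceSketch {
    static final String UNKNOWN = "UNKNOWN";

    // Returns the best-scoring language, or UNKNOWN if the runner-up is too close.
    static String decide(Map<String, Double> scores, double minimumRelativeDistance) {
        String best = null;
        double bestValue = Double.NEGATIVE_INFINITY;
        double secondValue = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> entry : scores.entrySet()) {
            double value = entry.getValue();
            if (value > bestValue) {
                secondValue = bestValue;
                bestValue = value;
                best = entry.getKey();
            } else if (value > secondValue) {
                secondValue = value;
            }
        }
        if (best == null) return UNKNOWN;        // no candidates at all
        if (scores.size() == 1) return best;     // nothing to compare against
        double distance = bestValue - secondValue;
        // Assumption: ties and distances below the threshold are both rejected.
        return (distance == 0.0 || distance < minimumRelativeDistance) ? UNKNOWN : best;
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("ENGLISH", 1.0);   // illustrative values, not real detector output
        scores.put("FRENCH", 0.87);
        System.out.println(decide(scores, 0.0));
        System.out.println(decide(scores, 0.25));
    }
}
```

Under these assumptions a gap of 0.13 passes a threshold of 0.0 but fails one of 0.25, which mirrors the advice above to keep the threshold low for very short text.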

The public API of Lingua never returns null anywhere, so it is safe to use from within Java code as well.

/* Java */

import com.github.pemistahl.lingua.api.*;
import static com.github.pemistahl.lingua.api.Language.*;

final LanguageDetector detector = LanguageDetectorBuilder.fromLanguages(ENGLISH, FRENCH, GERMAN, SPANISH).build();
final Language detectedLanguage = detector.detectLanguageOf("languages are awesome");

There might be classification tasks where you know beforehand that your language data is definitely not written in Latin, for instance (what a surprise :-). The detection accuracy can become better in such cases if you exclude certain languages from the decision process or just explicitly include relevant languages:

// include all languages available in the library
// WARNING: in the worst case this produces high memory 
//          consumption of approximately 3.5GB 
//          and slow runtime performance
LanguageDetectorBuilder.fromAllLanguages()

// include only languages that are not yet extinct (= currently excludes Latin)
LanguageDetectorBuilder.fromAllSpokenLanguages()

// include only languages written with Cyrillic script
LanguageDetectorBuilder.fromAllLanguagesWithCyrillicScript()

// exclude only the Spanish language from the decision algorithm
LanguageDetectorBuilder.fromAllLanguagesWithout(Language.SPANISH)

// only decide between English and German
LanguageDetectorBuilder.fromLanguages(Language.ENGLISH, Language.GERMAN)

// select languages by ISO 639-1 code
LanguageDetectorBuilder.fromIsoCodes639_1(IsoCode639_1.EN, IsoCode639_1.DE)

// select languages by ISO 639-3 code
LanguageDetectorBuilder.fromIsoCodes639_3(IsoCode639_3.ENG, IsoCode639_3.DEU)

Knowing the most likely language is nice, but how reliable is the computed likelihood? And how much less likely are the other examined languages in comparison to the most likely one? These questions can be answered as well:

val detector = LanguageDetectorBuilder.fromLanguages(GERMAN, ENGLISH, FRENCH, SPANISH).build()
val confidenceValues = detector.computeLanguageConfidenceValues(text = "Coding is fun.")

// {
//   ENGLISH=1.0, 
//   GERMAN=0.8665738136456169, 
//   FRENCH=0.8249537317466078, 
//   SPANISH=0.7792362923625288
// }

In the example above, a map of all possible languages is returned, sorted by their confidence value in descending order. The values that the detector computes are part of a relative confidence metric, not of an absolute one. Each value is a number between 0.0 and 1.0. The most likely language is always returned with value 1.0. All other languages are assigned values lower than 1.0, denoting how much less likely those languages are in comparison to the most likely language.

The map returned by this method does not necessarily contain all languages which the calling instance of LanguageDetector was built from. If the rule-based engine decides that a specific language is truly impossible, then it will not be part of the returned map. Likewise, if no ngram probabilities can be found within the detector's languages for the given input text, the returned map will be empty. The confidence value for each language not being part of the returned map is assumed to be 0.0.
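One plausible way to obtain such a relative metric is sketched below. It assumes that the detector holds one summed log-probability per language (strictly negative numbers) and divides the highest sum by each language's sum; the text above does not state the actual formula, so treat this purely as an illustration. Names are hypothetical and languages are modelled as strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a relative confidence metric, not Lingua's implementation.
final class RelativeConfidenceSketch {

    // Input: summed log-probabilities, assumed to be strictly negative.
    // Output: values in (0, 1], sorted in descending order, best language = 1.0.
    static Map<String, Double> relativeConfidence(Map<String, Double> summedLogProbabilities) {
        Map<String, Double> result = new LinkedHashMap<>();
        if (summedLogProbabilities.isEmpty()) {
            return result; // no n-gram evidence at all: empty map
        }
        double highest = Double.NEGATIVE_INFINITY;
        for (double value : summedLogProbabilities.values()) {
            highest = Math.max(highest, value);
        }
        List<Map.Entry<String, Double>> entries = new ArrayList<>(summedLogProbabilities.entrySet());
        entries.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        for (Map.Entry<String, Double> entry : entries) {
            // Both numbers are negative, so the ratio is positive and at most 1.0.
            result.put(entry.getKey(), highest / entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Double> sums = new LinkedHashMap<>();
        sums.put("GERMAN", -12.5);   // illustrative values, not real detector output
        sums.put("ENGLISH", -10.0);
        Map<String, Double> confidence = relativeConfidence(sums);
        System.out.println(confidence);
        // A language missing from the map is read as 0.0, as described above.
        System.out.println(confidence.getOrDefault("FRENCH", 0.0));
    }
}
```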

9.2 Standalone mode Top ▲

If you want to try out Lingua before you decide whether to use it or not, you can run it in a REPL and immediately see its detection results.

  1. With Gradle: ./gradlew runLinguaOnConsole --console=plain
  2. Without Gradle: java -jar lingua-1.0.3-with-dependencies.jar

Then just play around:

This is Lingua.
Select the language models to load.

1: enter language iso codes manually
2: all supported languages

Type a number and press <Enter>.
Type :quit to exit.

> 1
List some language iso 639-1 codes separated by spaces and press <Enter>.
Type :quit to exit.

> en fr de es
Loading language models...
Done. 4 language models loaded lazily.

Type some text and press <Enter> to detect its language.
Type :quit to exit.

> languages
ENGLISH
> Sprachen
GERMAN
> langues
FRENCH
> :quit
Bye! Ciao! Tschüss! Salut!

10. What's next for version 1.1.0? Top ▲

Take a look at the planned issues.

Comments
  • Improve performance and reduce memory consumption

    As pointed out in #39 and #57, Lingua's great accuracy comes at the cost of high memory usage. This poses a problem for some projects trying to use Lingua. In this issue I will try to highlight some main areas where performance can be improved; some of this is already covered by #98. Note that some of the proposed changes might decrease execution speed or require some larger refactoring.

    Model files

    • Instead of storing the model data in JSON format, a binary format could be used matching the in-memory format (see "In-memory models" section). This would have the advantage that:

      • Lookup maps such as Char2DoubleOpenHashMap could be created with the expected size avoiding rehashing of the maps during deserialization.
      • Model file loading is faster.
      • Model file sizes will be slightly smaller when encoding the frequency only once, followed by the number of ngrams which share this frequency, followed by the ngram values.

      Note that even though the fastutil maps are Serializable, using JDK serialization might introduce unnecessary overhead and would make this library dependent on the internal serialization format of the fastutil maps. Instead the data could be written manually to a DataOutputStream.
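The grouped layout proposed above could look roughly like this. It is a sketch under stated assumptions: the field order (total entry count, group count, then per group the frequency, the number of ngrams and the ngram strings) is invented for illustration, a plain `HashMap` stands in for the fastutil maps, and ngrams are written with `writeUTF` for brevity.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical binary layout for illustration only, not an existing Lingua format.
final class BinaryModelSketch {

    static byte[] write(Map<String, Double> frequencies) {
        // Group ngrams sharing a frequency so that each frequency is stored only once.
        Map<Double, List<String>> byFrequency = new TreeMap<>();
        for (Map.Entry<String, Double> entry : frequencies.entrySet()) {
            byFrequency.computeIfAbsent(entry.getValue(), k -> new ArrayList<>()).add(entry.getKey());
        }
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(frequencies.size());   // lets the reader presize its map
            out.writeInt(byFrequency.size());   // number of frequency groups
            for (Map.Entry<Double, List<String>> group : byFrequency.entrySet()) {
                out.writeDouble(group.getKey());
                out.writeInt(group.getValue().size());
                for (String ngram : group.getValue()) {
                    out.writeUTF(ngram);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static Map<String, Double> read(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int total = in.readInt();
            // Presized for the expected entry count, so no rehashing happens while loading.
            Map<String, Double> frequencies = new HashMap<>((int) (total / 0.75f) + 1);
            int groups = in.readInt();
            for (int g = 0; g < groups; g++) {
                double frequency = in.readDouble();
                int count = in.readInt();
                for (int i = 0; i < count; i++) {
                    frequencies.put(in.readUTF(), frequency);
                }
            }
            return frequencies;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, Double> model = new HashMap<>();
        model.put("a", 0.5);
        model.put("b", 0.5);
        model.put("ab", 0.25);
        byte[] data = write(model);
        System.out.println(data.length + " bytes, round trip ok: " + read(data).equals(model));
    }
}
```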

    Model file loading

    • Use a streaming JSON library. The currently used kotlinx-serialization-json does not seem to support streaming yet. Therefore the complete model files are currently loaded as String before being parsed. This is (likely) slow and requires large amounts of memory. Instead, streaming JSON libraries such as https://github.com/square/moshi should be used. Note that this point becomes obsolete if a binary format (as described in the "Model files" section above) is used.

    In-memory models

    • The Object2DoubleOpenHashMap load factor can be increased from the default 0.75 to a higher value. This reduces memory usage but might slow down execution.

    • Ngrams can be encoded using primitives. Since this project uses only up to fivegrams (5 chars), most of the ngrams (and for some languages even ngrams of all lengths) can be encoded as JVM primitives using bitwise operations, e.g.:

      • Unigrams as Byte or Char
      • Bigrams as Short or Int
      • Trigrams as Int or Long
      • Quadrigrams as Int or Long
      • Fivegrams as Long or in the worst case as String object. Note that at least for fivegrams the binary encoding should probably be offset based, so one char is the base code point and the remaining bits of the Long encode the offsets of the other chars to the base char. This allows encoding alphabets such as Georgian where each char is > Long.SIZE_BITS / 5.

      This might even increase execution speed since it avoids hashCode() and equals(...) calls when looking up frequencies (speed-up, if any, has to be tested though).

    • Reduce frequency accuracy for in-memory models and model files from 64-bit Double to 32-bit. This can have a big impact on memory usage, saving more than 100MB with all models preloaded. However, instead of using a 32-bit Float to store the frequency, a custom 32-bit encoding can (and maybe should) be used since Float 'wastes' some bits for the sign (frequency will never be negative) and the exponent (frequency will never be >= 1.0), though this might decrease language detection speed due to the decoding overhead.

    • Remove Korean fivegrams (and quadrigrams?). The Korean language models are quite large; additionally, due to the large range of Korean code points, a great majority (> 1.000.000 fivegrams (?)) cannot be encoded with the primitive encoding approach outlined above. Chinese and Japanese don't seem to have quadrigram and fivegram models either; not sure if this is due to how the languages work, but maybe it would be acceptable to drop them for Korean as well, also because detection of Korean seems to be rather unambiguous.
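As a rough illustration of the primitive encoding idea from the list above, the following sketch packs an ngram of up to four 16-bit chars into a single `long` using shifts and masks. It is a simplification with hypothetical names: it does not implement the offset-based scheme suggested for fivegrams (five full chars would need 80 bits), and it assumes the NUL char never occurs in an ngram.

```java
// Hypothetical illustration of packing ngrams into primitives, not Lingua code.
final class NgramPackingSketch {

    // Packs 1 to 4 chars into a long, 16 bits per char, first char in the highest used bits.
    static long pack(String ngram) {
        if (ngram.isEmpty() || ngram.length() > 4) {
            throw new IllegalArgumentException("only ngrams of length 1..4 fit into 64 bits");
        }
        long packed = 0L;
        for (int i = 0; i < ngram.length(); i++) {
            packed = (packed << 16) | ngram.charAt(i);
        }
        return packed;
    }

    // Reverses pack(); relies on the assumption that no char has the value 0.
    static String unpack(long packed) {
        StringBuilder chars = new StringBuilder();
        while (packed != 0L) {
            chars.insert(0, (char) (packed & 0xFFFF));
            packed >>>= 16;
        }
        return chars.toString();
    }

    public static void main(String[] args) {
        long key = pack("ab");
        System.out.println(Long.toHexString(key)); // prints 610062
        System.out.println(unpack(key));           // prints ab
    }
}
```

A `long` key like this can be compared with `==` and stored in a primitive-keyed map, which is what would avoid the hashCode() and equals(...) calls mentioned above.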

    Runtime performance

    • Remove Alphabet. The Alphabet class can probably be removed; Character.UnicodeScript seems to be an exact substitute and might allow avoiding some indirection, e.g. only look up the UnicodeScript for a Char once and then compare it with the expected ones instead of having each Alphabet look up UnicodeScript.
    • Avoid creation of Ngram objects. Similar to the primitive encoding described in "In-memory models" above, Ngram objects created as part of splitting up the text can be avoided as well (with a different encoding). A Kotlin inline class can be used to still get type safety and have some convenience functions. Primitive encoding can only support trigrams reliably without too much overhead / too complicated encoding, but that is probably fine because, since d0f7a7c211abb03885cc89febae9d77fbf640342, at most trigrams will be used for longer texts.
    • Instead of accessing lazy frequency lookup in every iteration, it might be faster to access it once at the beginning and then directly use it instead (though this could also be premature optimization).

    Conclusion

    With some / all of these suggestions applied memory usage can be reduced and execution speed can be increased without affecting accuracy. However, some of the suggestions might be premature optimization, and they only work for 16-bit Char but not for supplementary code points (> 16-bit) (but the current implementation, mainly Ngram creation, seems to have that limitation as well).

    I have implemented some of these optimizations and some other minor improvements in https://github.com/Marcono1234/lingua/tree/experimental/performance. However, these changes are pretty experimental: The Git history is not very nice to look at; in some commits I fixed bugs I introduced before or reverted changes again. Additionally the unit tests and model file writing are broken. Some of the changes might also be premature optimization. Though maybe it is interesting nonetheless, it appears the memory usage with all languages being preloaded went down to about ~~640MB~~ (Edit: 920MB, made a mistake in the binary encoding) on AdoptOpenJDK 11.

    opened by Marcono1234 14
  • Compact memory data (#101)

    I changed the runtime memory model: the original JSON is translated to a dense map. This reduces memory requirements at the cost of speed (frequency lookup should be slower). Frequencies are stored as Float instead of Double, which introduces a 0.001% error in the calculation, and the tests are updated accordingly.
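For readers who want a feel for the precision trade-off, here is a small sketch (hypothetical names, not taken from the pull request) that measures the relative error of storing one frequency as a 32-bit float. For values in the normal float range a single conversion is off by at most about 6e-8 relative; the figure quoted above concerns the overall calculation, where many such values are combined.

```java
// Hypothetical illustration of float vs. double storage, not code from the pull request.
final class FloatFrequencySketch {

    // Relative error introduced by narrowing one frequency from double to float.
    static double relativeError(double frequency) {
        float stored = (float) frequency;
        return Math.abs((double) stored - frequency) / frequency;
    }

    public static void main(String[] args) {
        double frequency = 0.000123456789; // illustrative value, not from a real model
        System.out.println("bytes per value: " + Double.BYTES + " (double) vs " + Float.BYTES + " (float)");
        System.out.println("relative error: " + relativeError(frequency));
    }
}
```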

    fastutil dependency has been removed.

    All changes are performed in internal classes, so this request is compatible with the 1.1 version and I hope that the merge will be considered soon.

    opened by fvasco 12
  • Lingua's use of Kotlin coroutines causes leaks in web applications

    I'm using lingua 1.1.0 in a Java web application for language detection. The application is set up to load all models on the first request:

    LanguageDetectorBuilder.fromAllLanguages().withPreloadedLanguageModels().build();

    When I undeploy the web application from the application server, the models stay in memory. I took a heap dump using Eclipse Memory Analyzer. The dump shows that there are still instances of the classes related to coroutines (e.g. kotlinx.coroutines.scheduling.CoroutineScheduler$WorkerState, kotlinx.coroutines.scheduling.WorkQueue, kotlinx.coroutines.scheduling.CoroutineScheduler) after undeploying the application. The coroutines still seem to reference the models.

    I've built a reproducer using only Servlet API that seems to show similar behaviour on Tomcat. Tomcat shows warnings like this:

    WARNUNG: The web application [lingua-reproducer] appears to have started a thread named [DefaultDispatcher-worker-9] but has failed to stop it. This is very likely to create a memory leak. Stack trace of thread:
     sun.misc.Unsafe.park(Native Method)
     java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:338)
     kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.park(CoroutineScheduler.kt:795)
     kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.tryPark(CoroutineScheduler.kt:740)
     kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:711)
     kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:665)

    Is there a way to ensure that the threads created by Lingua terminate when the application is undeployed?

    bug 
    opened by dnb-erik-brangs 10
  • v0.6.1 seems better than v1.0.0

    I compared version 0.6.1 with version 1.0.0 on two private test sets.

    In the table below I report results for both versions and their difference for all languages of my benchmark. Scores are the ratio of correctly classified segments with respect to a gold reference. The sets contain real-world text from the web and from technical domains, and they are not manually checked, so it is possible that they contain some wrongly classified sentences.

    Nevertheless, v1.0.0 seems worse than v0.6.1 for many languages.

    One of the big differences relates to text that includes strings in different languages. For instance, Chinese segments containing both Chinese and English, like the following ones, are detected as English and Portuguese instead of Chinese:

    Snapchat 并不 是唯一 一家 触 及 这些 文化 底 线 的公 司 
    Gomes : 我們 的目 標 是 提 升 搜 尋 服 務 的 品 質
    

    or Greek segments with a few Western strings, like the following ones, which are detected as Danish and Italian instead of Greek:

    Rasmussen μετά τη συνάντησή τους στο ΥΠΕΘΑ
    γεγονός που, πέραν της σημασίας των σχετικών πολιτικών επαφών, εμπεριέχει και ιδιαίτερη συμβολική αξία, καθώς η επίσκεψη πραγματοποιήθηκε δύο μόλις έτη μετά την επίσκεψη του πρώην ιταλού Προέδρου, κ. Azeglio Ciampi, στην Αθήνα, στις 15-17.
    

    or Arabic segments with a few English strings, like the following ones, which are detected as English and Tagalog instead of Arabic:

    أداة Google Scholar وضعت أبحاثًا متاحةً للجميع البحث عنها سهل والوصول إليها أسهل.
    يارد "YARID" - اللاجئون الأفارقة الشباب للتنمية المتكاملة- بدأت كمحادثة داخل المُجتمع الكونغو
    

    Several other examples can be found, even between languages with more similar alphabets.

    It seems that v1.0.0 relies too much on Western alphabets to identify the language, without considering the amount of such Western characters.
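What "considering the amount of such Western characters" could mean in practice is sketched below: count the letters per Unicode script and let the dominant script decide. This is only an illustration of the idea with hypothetical names, using the JDK's `Character.UnicodeScript`; it is not how either Lingua version works internally.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical illustration of majority-script detection, not Lingua code.
final class DominantScriptSketch {

    // Counts letters per script; whitespace, digits and punctuation are ignored.
    static Map<UnicodeScript, Integer> letterCounts(String text) {
        Map<UnicodeScript, Integer> counts = new EnumMap<>(UnicodeScript.class);
        text.codePoints()
            .filter(Character::isLetter)
            .forEach(cp -> counts.merge(UnicodeScript.of(cp), 1, Integer::sum));
        return counts;
    }

    // Returns the script with the most letters, or UNKNOWN if the text has no letters.
    static UnicodeScript dominantScript(String text) {
        UnicodeScript best = UnicodeScript.UNKNOWN;
        int bestCount = 0;
        for (Map.Entry<UnicodeScript, Integer> entry : letterCounts(text).entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // "abc" followed by four Greek letters (escaped to keep the source ASCII-only).
        String mixed = "abc \u03b1\u03b2\u03b3\u03b4";
        System.out.println(letterCounts(mixed));
        System.out.println(dominantScript(mixed));
    }
}
```

Applied to the first Greek example above, such a count would see 9 Latin letters ("Rasmussen") against 27 Greek ones, so a single Western name would not outweigh the rest of the segment.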

    set  lng Lingua Lingua100 diffs_vs_Lingua
    setA ar  0.930   0.902  diff: -0.028
    setA az  0.807   0.784  diff: -0.023
    setA be  0.861   0.816  diff: -0.045
    setA bg  0.801   0.734  diff: -0.067
    setA bs  0.412   0.408  diff: -0.004
    setA ca  0.760   0.762  diff: 0.002
    setA cs  0.792   0.785  diff: -0.007
    setA da  0.760   0.752  diff: -0.008
    setA de  0.848   0.848  diff: 0
    setA el  0.947   0.932  diff: -0.015
    setA es  0.804   0.853  diff: 0.049
    setA et  0.856   0.853  diff: -0.003
    setA fi  0.865   0.864  diff: -0.001
    setA fr  0.868   0.882  diff: 0.014
    setA he  0.972   0.961  diff: -0.011
    setA hi  0.790   0.733  diff: -0.057
    setA hr  0.628   0.623  diff: -0.005
    setA hu  0.858   0.848  diff: -0.01
    setA hy  0.827   0.801  diff: -0.026
    setA id  0.665   0.665  diff: 0
    setA is  0.863   0.831  diff: -0.032
    setA it  0.866   0.865  diff: -0.001
    setA ja  0.758   0.752  diff: -0.006
    setA ka  0.802   0.787  diff: -0.015
    setA ko  0.887   0.827  diff: -0.06
    setA lt  0.839   0.828  diff: -0.011
    setA lv  0.882   0.869  diff: -0.013
    setA mk  0.786   0.723  diff: -0.063
    setA ms  0.801   0.809  diff: 0.008
    setA nb  0.735   0.733  diff: -0.002
    setA nl  0.799   0.835  diff: 0.036
    setA nn  0.768   0.768  diff: 0
    setA pl  0.879   0.881  diff: 0.002
    setA pt  0.862   0.858  diff: -0.004
    setA ro  0.765   0.751  diff: -0.014
    setA ru  0.820   0.773  diff: -0.047
    setA sk  0.783   0.766  diff: -0.017
    setA sl  0.714   0.708  diff: -0.006
    setA sq  0.829   0.826  diff: -0.003
    setA sr  0.417   0.302  diff: -0.115
    setA sv  0.833   0.830  diff: -0.003
    setA th  0.940   0.927  diff: -0.013
    setA tl  0.747   0.748  diff: 0.001
    setA tr  0.901   0.895  diff: -0.006
    setA uk  0.877   0.848  diff: -0.029
    setA vi  0.920   0.877  diff: -0.043
    setA zh  0.941   0.858  diff: -0.083
    
    setB ar   0.996   0.988  diff: -0.008
    setB bg   0.957   0.947  diff: -0.01
    setB bs   0.495   0.494  diff: -0.001
    setB ca   0.946   0.953  diff: 0.007
    setB cs   0.993   0.992  diff: -0.001
    setB da   0.947   0.946  diff: -0.001
    setB de   0.996   0.996  diff: 0
    setB el   0.996   0.992  diff: -0.004
    setB en   0.964   0.966  diff: 0.002
    setB es   0.897   0.920  diff: 0.023
    setB et   0.978   0.974  diff: -0.004
    setB fi   0.998   0.998  diff: 0
    setB fr   0.962   0.971  diff: 0.009
    setB he   1.000   0.999  diff: -0.001
    setB hr   0.858   0.868  diff: 0.01
    setB hu   0.988   0.988  diff: 0
    setB id   0.765   0.765  diff: 0
    setB is   0.979   0.971  diff: -0.008
    setB it   0.939   0.937  diff: -0.002
    setB ja   0.986   0.986  diff: 0
    setB ko   0.998   0.998  diff: 0
    setB lt   0.992   0.990  diff: -0.002
    setB lv   0.990   0.983  diff: -0.007
    setB mk   0.927   0.930  diff: 0.003
    setB ms   0.927   0.927  diff: 0
    setB nb   0.927   0.928  diff: 0.001
    setB nl   0.921   0.949  diff: 0.028
    setB nn   0.942   0.946  diff: 0.004
    setB pl   0.993   0.992  diff: -0.001
    setB pt   0.952   0.948  diff: -0.004
    setB ro   0.964   0.958  diff: -0.006
    setB ru   0.997   0.911  diff: -0.086
    setB sk   0.977   0.975  diff: -0.002
    setB sl   0.943   0.942  diff: -0.001
    setB sq   0.983   0.983  diff: 0
    setB sv   0.973   0.971  diff: -0.002
    setB th   0.996   0.996  diff: 0
    setB tr   0.993   0.990  diff: -0.003
    setB uk   0.943   0.964  diff: 0.021
    setB vi   0.994   0.954  diff: -0.04
    setB zh   0.992   0.955  diff: -0.037
    
    
    
    bug 
    opened by nicolabertoldi 10
  • Add function to avoid ambiguous results

    Hi,

    while testing the library with some texts I encountered some ambiguous detection results. As far as I understand, the detectLanguageOf method always returns a language as long as that language has at least some probability. However, there are texts where this behaviour is probably not desired.

    Imagine a text which yields similar probabilities for two languages, with the first one just a little bit more likely. It would be nice to be able to detect such cases, or at least to require a certain distance between the probabilities of the most likely and the second most likely language (otherwise the method may return UNKNOWN). In our use case we would prefer more detections reported as unknown rather than (a lot of) false positives.

    The following code snippet illustrates my idea:

    @JvmOverloads
    fun detectLanguageOf(text: String, requiredRelativeDistance: Double = 0.95): Language {
        
        [...]
    
        return getMostLikelyLanguage(allProbabilities, unigramCountsOfInputText, requiredRelativeDistance)
    }
        
    internal fun getMostLikelyLanguage(
        probabilities: List<Map<Language, Double>>,
        unigramCountsOfInputText: Map<Language, Int>,
        requiredRelativeDistance: Double = 0.95
    ): Language {
        
        [...]
    
        return when {
            filteredProbabilities.none() -> UNKNOWN
            filteredProbabilities.singleOrNull() != null -> filteredProbabilities.first().key
            else -> {
                val candidate = filteredProbabilities.maxBy { it.value }!!
                val second = filteredProbabilities.filter { it.key != candidate.key }.maxBy { it.value }!!
                if (second.value * requiredRelativeDistance < candidate.value) {
                    candidate.key
                } else {
                    UNKNOWN
                }
            }
        }
    }
    

    Feel free to copy the code if you want. I don't know whether this is a good approach for the problem or if there are better ways to do that. However, it would be really nice to have a solution for this in some way.

    Thanks in advance!

    enhancement 
    opened by bgeisberger 9
  • LanguageDetector and multithreading

    I planned to use 'lingua' in a multi-threaded Java environment but, if I understand correctly, a 'LanguageDetector' instance is not thread-safe, i.e. if several threads use it simultaneously, they may corrupt each other's work. Am I right? Creating a new 'LanguageDetector' instance seems to be very expensive.

    question 
    opened by werder06 9
  • [ Performance and Memory Analysis for Large Dataset ] very slow for large numbers of Hits

    I am trying to run language detection using this script:

    final LanguageDetector detector = LanguageDetectorBuilder.fromLanguages(
        ENGLISH, FRENCH, GERMAN, SPANISH, JAPANESE, CHINESE, ITALIAN, PORTUGUESE,
        ARABIC, RUSSIAN, DUTCH, KOREAN, SWEDISH, HINDI, POLISH).build();

    long start = System.currentTimeMillis();
    final Language detectedLanguage = detector.detectLanguageOf("Zum Vergleich kann es auch nützlich sein, diese Rankings neben einigen etwas älteren Forschungsergebnissen zu sehen. Im Jahr 2013, Common Sense Advisory zur Verfügung gestellt , eine empirische Studie basiert auf einer Wallet World Online (WOW) - definiert als ‚die gesamte wirtschaftliche Chance, sowohl online als auch offline, berechnet durch einen Anteil eines Landes BIP zu allen wichtigen Blöcken dieser Gesellschaft assoziieren. ' Hier ist, was uns ihre Studie gezeigt hat.");
    // System.out.println(detectedLanguage.toString());
    long end = System.currentTimeMillis();
    System.out.println("Time: " + (end - start));
    

    It takes 700 milliseconds, which is very slow and cannot be used for 10000+ files. Is there any approach to get results within 1-10 milliseconds?

    Or is there any function like isEnglish() that returns true only for English?

    opened by the-black-knight-01 8
  • How to reduce the size of the jar file by excluding language profiles?

    I need to run this lib in a memory-constrained environment: less than 200 MB for the unzipped package. How can I exclude rare language profiles from the library?

    An alternative: can the memory size be significantly decreased by minifying the JSON files used for each language?

    Note: I am using the Maven build of Lingua.

    question 
    opened by seinecle 8
  • Memory leak when using Lingua in web applications

    As discussed in #110, there seems to be a memory leak when using Lingua in a Java web application. I've uploaded an example application at https://github.com/deutsche-nationalbibliothek/lingua-reproducer-memory-leak. The README contains some information about the problem. Please let me know if you need more information.

    bug 
    opened by dnb-erik-brangs 7
  • java.lang.NoClassDefFoundError: kotlin/KotlinNothingValueException

    Hi,

    I am trying to use Lingua inside a plain Java Maven project with the following Maven dependency:

    <dependency>
        <groupId>com.github.pemistahl</groupId>
        <artifactId>lingua</artifactId>
        <version>1.0.3</version>
    </dependency>
    

    Code Sample:

    import com.github.pemistahl.lingua.api.Language;
    import com.github.pemistahl.lingua.api.LanguageDetector;
    import com.github.pemistahl.lingua.api.LanguageDetectorBuilder;
    
    import static com.github.pemistahl.lingua.api.Language.*;
    
    public class Test {
        public static void detect(){
            LanguageDetector detector = LanguageDetectorBuilder.fromLanguages(ENGLISH, FRENCH, GERMAN, SPANISH).build();
            Language language = detector.detectLanguageOf("languages are awesome");
            System.out.println(language);
        }
    
        public static void main(String[] args) {
            detect();
        }
    }
    

    When running it, I get the following exception:

    Exception in thread "main" java.lang.NoClassDefFoundError: kotlin/KotlinNothingValueException
    	at kotlinx.serialization.SerializersKt.serializer(Unknown Source)
    	at com.github.pemistahl.lingua.internal.TrainingDataLanguageModel$Companion.fromJson(TrainingDataLanguageModel.kt:150)
    	at com.github.pemistahl.lingua.api.LanguageDetector.loadLanguageModel$lingua(LanguageDetector.kt:401)
    	at com.github.pemistahl.lingua.api.LanguageDetector$loadLanguageModels$1.invoke(LanguageDetector.kt:407)
    	at com.github.pemistahl.lingua.api.LanguageDetector$loadLanguageModels$1.invoke(LanguageDetector.kt:79)
    	at kotlin.SynchronizedLazyImpl.getValue(LazyJVM.kt:74)
    	at com.github.pemistahl.lingua.api.LanguageDetector.lookUpNgramProbability$lingua(LanguageDetector.kt:390)
    	at com.github.pemistahl.lingua.api.LanguageDetector.computeSumOfNgramProbabilities$lingua(LanguageDetector.kt:366)
    	at com.github.pemistahl.lingua.api.LanguageDetector.computeLanguageProbabilities$lingua(LanguageDetector.kt:353)
    	at com.github.pemistahl.lingua.api.LanguageDetector.computeLanguageConfidenceValues(LanguageDetector.kt:162)
    	at com.github.pemistahl.lingua.api.LanguageDetector.detectLanguageOf(LanguageDetector.kt:102)
    	at com.tomtom.ssv.apt.ingestion.service.LanguageDetector.detect(LanguageDetector.java:67)
    	at com.tomtom.ssv.apt.ingestion.service.LanguageDetector.main(LanguageDetector.java:72)
    Caused by: java.lang.ClassNotFoundException: kotlin.KotlinNothingValueException
    	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
    	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    	... 13 more
    
    bug 
    opened by piyusht007 7
  • Compare against CLD3 and CLD2

    Google's Compact Language Detectors (CLD) are good libraries that are used in the Chrome browser and in many other projects. Although written in C++, they have wrappers for Java (cld2, cld3) and Python (cld2, cld3). While the 2nd version is n-gram based, the 3rd version uses neural networks.

    Please compare their performance on your test set, in terms of both accuracy and speed.

    opened by igrinis 7
  • Language recognition error

    LanguageDetector detector = LanguageDetectorBuilder.fromLanguages(ENGLISH, CHINESE, THAI, VIETNAMESE).build();
    SortedMap<Language, Double> languageDoubleSortedMap = detector.computeLanguageConfidenceValues("ี่มีประสิทธิภาพหลอดไฟพลังงานแสงอาทิตย์กลางแจ้งเซ็นเซอร์ตรวจจับการเคลื่อนไหวสวนกันน้ำ LED พลังงานแสงอาทิตย์โคมไฟสปอร์ตไลท์สำหรับ Garden เส้นทางถนนแบ็คดรอปเป่าลม Led Light");
    System.out.println(languageDoubleSortedMap);
    

    The following is printed: {ENGLISH=1.0, VIETNAMESE=0.5658177137374878}. I think the text is Thai, but English and even Vietnamese are recognized while Thai is not. The version is 1.2.2.

    opened by xujiaw 0
  • Option: Other

    Great tool - thank you! Suggestion: the possibility to add OTHER as a language. Let's say I want to find English and French in a multi-language set. I want to add English and French to LanguageDetectorBuilder.fromLanguages, but if the probability is low, I don't want everything to be marked as English or French, but as something else -> Other.

    enhancement 
    opened by thsm-kb 1
  • Reduce resources to load language models

    Currently, the language models are parsed from json files and loaded into simple maps at runtime. Even though accessing the maps is pretty fast, they consume a significant amount of memory. The goal is to investigate whether there are more suitable data structures available that require less storage space in memory, something like NumPy for Python. Perhaps it is even possible to store those data structures in some kind of binary format on disk which can be loaded faster than the current json files.

    Promising candidates could be:

    enhancement 
    opened by pemistahl 0
  • Support specifying custom `Executor`

    Related to #119

    Currently Lingua uses ForkJoinPool.commonPool() for model loading and language detection. However, maybe it would be useful to allow users to specify their own Executor, for example with LanguageDetectorBuilder.withExecutor(Executor) (the default could still be commonPool()). This would have the following advantages:

    • could customize worker thread count, or even run single-threaded, e.g. executor = r -> r.run()
    • could customize worker threads:
      • custom name to make performance monitoring easier
      • custom priority

    It would not be possible anymore to use invokeAll then, but a helper function such as the following one might add the missing functionality:

    private fun <E> executeTasks(tasks: List<Callable<E>>): List<E> {
        val futures = tasks.map { FutureTask(it) }
        futures.forEach(executor::execute)
        return futures.map(Future<E>::get)
    }
    

    (Note that I have not extensively checked how well this performs compared to invokeAll, and whether exception collection from the Futures could be improved. This implementation is probably flawed because the caller would wait on the get() call without participating in the work.) Alternatively, CompletableFuture could be used; but then care must be taken not to use ForkJoinPool.commonPool() when its parallelism is 1, otherwise performance might be pretty bad due to JDK-8213115.
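
    For readers on the Java side, the helper above might look roughly as follows. This is a sketch only: ExecutorTasksSketch is a hypothetical class that is not part of Lingua, the Executor is passed in as a parameter instead of being a field, and checked exceptions are wrapped in IllegalStateException for brevity. It shares the limitation noted above that the caller blocks on get() without doing any of the work itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executor;
import java.util.concurrent.FutureTask;

// Hypothetical helper class for illustration; not part of Lingua's API.
final class ExecutorTasksSketch {

    // Hands every task to the given executor, then collects the results in task order.
    static <E> List<E> executeTasks(Executor executor, List<Callable<E>> tasks) {
        List<FutureTask<E>> futures = new ArrayList<>();
        for (Callable<E> task : tasks) {
            FutureTask<E> future = new FutureTask<>(task);
            futures.add(future);
            executor.execute(future); // FutureTask implements Runnable
        }
        List<E> results = new ArrayList<>();
        for (FutureTask<E> future : futures) {
            try {
                results.add(future.get()); // blocks until this task has finished
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                throw new IllegalStateException("interrupted while waiting for a task", e);
            } catch (ExecutionException e) {
                throw new IllegalStateException("task failed", e.getCause());
            }
        }
        return results;
    }
}
```

    Passing Runnable::run as the executor runs every task on the calling thread, which corresponds to the single-threaded option listed above.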

    This would require some changes to the documentation which currently explicitly refers to ForkJoinPool.commonPool().

    What do you think?

    new feature 
    opened by Marcono1234 1
  • Add more classification metrics in library comparisons

    Hello!

    So I've been trying out the lingua library and it's awesome. Was wondering if it's possible to add other classification metrics such as Precision, Recall, Specificity and F1 in the comparisons between tika, Optimaize and the other java language detection libraries for more transparency?

    Thanks!

    enhancement 
    opened by willyspinner 2
Releases(v1.2.2)
  • v1.2.2(Aug 2, 2022)

  • v1.2.1(Jun 9, 2022)

  • v1.2.0(Jun 7, 2022)

    Features

    • The library can now be used as a Java 9 module. Thanks to @Marcono1234 for helping with the implementation. (#120, #138)
    • The new method LanguageDetectorBuilder.withLowAccuracyMode() has been introduced. By activating it, detection accuracy for short text is reduced in favor of a smaller memory footprint and faster detection performance. (#136)

    Improvements

    • The memory footprint has been reduced significantly by applying several internal optimizations. Thanks to @Marcono1234, @fvasco and @sigpwned for their help. (#101, #127)
    • Several language model files have become obsolete and could be deleted without decreasing detection accuracy. This results in a smaller memory footprint and a 36% smaller jar file.

    Bug Fixes

    • A bug in the rule engine has been fixed that caused incorrect language detection for certain texts. Thanks to @bdecarne who has found it.

    Other changes

    • Due to a refactoring of how the internal thread pool works, the method LanguageDetector.destroy() has been deprecated in favor of the newly introduced method LanguageDetector.unloadLanguageModels().
    Source code(tar.gz)
    Source code(zip)
    lingua-1.2.0-javadoc.jar(340.75 KB)
    lingua-1.2.0-sources.jar(33.67 KB)
    lingua-1.2.0-with-dependencies.jar(104.01 MB)
    lingua-1.2.0.jar(76.68 MB)
  • v1.1.1(Dec 12, 2021)

    Improvements

    • The new method LanguageDetector.destroy() has been introduced that frees internal resources to prevent memory leaks within application server deployments. (#110, #116)
    • Language model loading performance has been improved by creating a manually optimized internal thread pool. This replaces the coroutines used in the previous release. (#116)

    Bug Fixes

    • The character â was erroneously not treated as a possible indicator for the French language. (#115)
    • Language detection was non-deterministic when multiple alphabets had the same occurrence count. (#105)
    Source code(tar.gz)
    Source code(zip)
    lingua-1.1.1-javadoc.jar(339.68 KB)
    lingua-1.1.1-sources.jar(33.29 KB)
    lingua-1.1.1-with-dependencies.jar(149.73 MB)
    lingua-1.1.1.jar(125.10 MB)
  • v1.1.0(May 2, 2021)

    Languages

    • There is now support for the Maori language which was contributed to the Rust implementation of Lingua. (#93)

    Features

    • Language models are now loaded asynchronously and in parallel using Kotlin coroutines, making this step more performant. (#84)
    • Language Models can now be loaded either lazily (default) or eagerly. (#79)
    • Instead of loading multiple copies of the language models into memory for each separate instance of LanguageDetector, multiple instances now share the same language models and access them asynchronously. (#91)

    Improvements

    • Language detection for sentences with more than 120 characters now performs more quickly by iterating through trigrams only which is enough to achieve high detection accuracy.
    • Textual input that includes logograms from Chinese, Japanese or Korean is now split at each logogram and not only at whitespace. This provides for more reliable language detection for sentences that include multi-language content. (#85)

    Bug Fixes

    • For an odd number of words as input, the method LanguageDetector.computeLanguageConfidenceValues computed wrong values under certain circumstances. (#87)
    • When Lingua was used in projects with an explicitly set Kotlin version which differed from Lingua's implicitly set version in the Gradle script, several errors occurred during runtime. By explicitly setting Lingua's Kotlin version, these errors are now hopefully gone. (#88, #89)
    • Errors in the rule engine for the Latvian language have been resolved. (#92)
    Source code(tar.gz)
    Source code(zip)
    lingua-1.1.0-javadoc.jar(337.21 KB)
    lingua-1.1.0-sources.jar(32.54 KB)
    lingua-1.1.0-with-dependencies.jar(150.91 MB)
    lingua-1.1.0.jar(125.11 MB)
  • v1.0.3(Oct 15, 2020)

    Bug Fixes

    • When two languages had exactly the same confidence values, one of them was erroneously removed from the result map. Thanks to @mmedek for reporting this bug. (#72)
    • There was still a problem with the classification of texts consisting of certain alphabets. Thanks to @nicolabertoldi for reporting this bug. (#76)
    • The language detection for Spanish did not take the rarely used accented characters á, é, í, ó, ú and ü into account. Thanks to @joeporter for reporting this bug. (#73)
    • A bug in the rule engine led to weak detection accuracy for Macedonian and Serbian. This has been fixed.

    Other Changes

    • The Kotlin compiler and runtime have been updated to version 1.4. This includes the current stable release 1.0.0 of the kotlinx-serialization framework.
    • The accuracy report files have been moved to their own Gradle source set. This allows for separate compilation of unit tests and accuracy report tests, leading to more flexible and slightly faster compilation.
    Source code(tar.gz)
    Source code(zip)
    lingua-1.0.3-javadoc.jar(49.20 KB)
    lingua-1.0.3-sources.jar(31.12 KB)
    lingua-1.0.3-with-dependencies.jar(145.08 MB)
    lingua-1.0.3.jar(124.79 MB)
  • v1.0.2(Aug 9, 2020)

    Bug Fixes

    • The language mapping for character ë was incorrect which has been fixed. Thanks to @sandernugterenedia for reporting this bug. (#66)
    • The implementation of LanguageDetector made use of functionality that was introduced in Java 8 which made the library unusable for Java 6 and 7. Thanks to @levant916 for reporting this bug. (#69)
    • The Gradle shadow plugin has been added so that ./gradlew jarWithDependencies produces a jar file whose dependencies do not conflict anymore with the same dependencies of different versions in the same project. (#67)
    Source code(tar.gz)
    Source code(zip)
    lingua-1.0.2-javadoc.jar(49.22 KB)
    lingua-1.0.2-sources.jar(31.10 KB)
    lingua-1.0.2-with-dependencies.jar(144.95 MB)
    lingua-1.0.2.jar(124.79 MB)
  • v1.0.1(Jul 4, 2020)

  • v1.0.0(Jun 24, 2020)

    Languages

    • added 9 new languages, this time with a focus on Africa: Ganda, Shona, Sotho, Swahili, Tsonga, Tswana, Xhosa, Yoruba, Zulu
    • removed language Norwegian in favor of Bokmal and Nynorsk (#59)

    Features

    • LanguageDetector can now provide confidence scores for each evaluated language. (#11)
    • The public API for creating language model (LanguageModelFilesWriter) and test data files (TestDataFilesWriter) has been stabilized. (#37)
    • New convenience methods have been added to LanguageDetectorBuilder in order to build LanguageDetector from languages written in a certain script. (#61)

    Improvements

    • The rule-based detection algorithm has been made less sensitive so that single words in a different language cannot mislead the algorithm so easily.
    • The fastutil library has been added again to reduce memory consumption. (#58)
    • The language model-based algorithm has been optimized so that language detection performs approximately 25% faster now. (#58)
    • Support for the Kotlin linter ktlint has been added to help with a consistent coding style. (#47)
    • Third-party dependencies have been updated to their latest versions. (#36)

    Bug Fixes

    • Incorrect regex character classes caused the library to not work properly on Android. (#32)

    Test Coverage

    • Test coverage has been extended from 59% to 72%.

    Documentation

    • The README contains a new section describing how users can add their own languages to Lingua.

    Other changes

    There is a breaking change in this release:

    • Methods with the prefix fromAllBuiltIn... have been renamed to fromAll... to make them more succinct and clear. (#61)
    Source code(tar.gz)
    Source code(zip)
    lingua-1.0.0-javadoc.jar(49.14 KB)
    lingua-1.0.0-sources.jar(30.25 KB)
    lingua-1.0.0-with-dependencies.jar(144.74 MB)
    lingua-1.0.0.jar(124.80 MB)
  • v0.6.1(Feb 6, 2020)

  • v0.6.0(Jan 5, 2020)

    Languages

    • added 11 new languages: Armenian, Bosnian, Azerbaijani, Esperanto, Georgian, Kazakh, Macedonian, Marathi, Mongolian, Serbian, Ukrainian

    Features

    There are some breaking changes in this release:

    • The support for MapDB has been removed. It did not provide enough advantages over Kotlin's lazy loading of language models. It used a lot of disc space and language detection became slow. With the long-term goal of creating a multiplatform library, only those features will be implemented in the future that support JavaScript as well.
    • The dependency on the fastutil library has been removed. It did not provide enough advantages over Kotlin's lazy loading of language models.
    • The method LanguageDetector.detectLanguagesOf(text: Iterable<String>) has been removed because the sorting order of the returned languages was undefined for input collections such as a HashSet. From now on, the method LanguageDetector.detectLanguageOf(text: String) will be the only one to be used.
    • The LanguageDetector can now be built with the following additional methods:
      • LanguageDetectorBuilder.fromIsoCodes639_1(vararg isoCodes: IsoCode639_1)
      • LanguageDetectorBuilder.fromIsoCodes639_3(vararg isoCodes: IsoCode639_3)
      • the following method has been removed: LanguageDetectorBuilder.fromIsoCodes(isoCode: String, vararg isoCodes: String)
    • The Gson library has been replaced with kotlinx-serialization for the loading of the json language models. This results in a significant reduction of code and makes reflection obsolete, so the dependency on kotlin-reflect could be removed.

    Improvements

    • The overall detection algorithm has been improved again several times to fix several detection bugs.
    Source code(tar.gz)
    Source code(zip)
    lingua-0.6.0-javadoc.jar(36.52 KB)
    lingua-0.6.0-sources.jar(25.57 KB)
    lingua-0.6.0-with-dependencies.jar(125.82 MB)
    lingua-0.6.0.jar(123.90 MB)
  • v0.5.0(Aug 12, 2019)

    Languages

    • added 12 new languages: Bengali, Chinese (not differentiated between traditional and simplified, as of now), Gujarati, Hebrew, Hindi, Japanese, Korean, Punjabi, Tamil, Telugu, Thai, Urdu

    Features

    • The LanguageDetectorBuilder now supports the additional method withMinimumRelativeDistance(), which specifies the minimum distance between the logarithmized and summed-up probabilities of the possible languages. If two or more languages yield nearly the same probability for a given input text, the wrong language is likely to be returned. With a higher value for the minimum relative distance, Language.UNKNOWN is returned instead of risking false positives.

    • Test report generation can now use multiple CPU cores, so that as many reports can run in parallel as CPU cores are available. This has been implemented as an additional attribute for the respective Gradle task: ./gradlew writeAccuracyReports -PcpuCores=...

    • The REPL now lets you freely specify the languages you want to try out by entering the desired ISO 639-1 codes. Previously, it was only possible to choose between certain language combinations.
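
    The minimum relative distance described in the first feature above can be pictured with a small self-contained sketch. It is not Lingua's implementation: the class RelativeDistanceSketch, the method pick and the plain score map are hypothetical names chosen for illustration, and a simple difference between the two best scores is assumed as the distance measure.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; Lingua's actual computation may differ.
final class RelativeDistanceSketch {

    // Returns the best-scoring language, or "UNKNOWN" if the runner-up is too close.
    static String pick(Map<String, Double> scores, double minimumDistance) {
        if (scores.isEmpty()) {
            return "UNKNOWN";
        }
        List<Map.Entry<String, Double>> sorted = new ArrayList<>(scores.entrySet());
        // Highest score first.
        sorted.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        if (sorted.size() == 1) {
            return sorted.get(0).getKey();
        }
        // Assumption: the distance is the plain difference of the two best scores.
        double gap = sorted.get(0).getValue() - sorted.get(1).getValue();
        return gap >= minimumDistance ? sorted.get(0).getKey() : "UNKNOWN";
    }
}
```

    With scores of 0.52 for ENGLISH and 0.48 for DUTCH, a minimum distance of 0.25 yields UNKNOWN, whereas a minimum distance of 0.01 yields ENGLISH.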

    Improvements

    • The overall detection algorithm has been improved, yielding slightly more accurate results for those languages that are based on the Latin alphabet.

    Bug Fixes

    Thanks to the great work of contributor Bernhard Geisberger, two bugs could be fixed.

    1. The fix in pull request #8 solves the problem of not being able to recreate the MapDB cache files automatically in case the data has been corrupted.

    2. The fix in pull request #9 makes the class LanguageDetector completely thread-safe. Previously, in some rare cases it was possible that two threads mutated one of the internal variables at the same time, yielding inaccurate language detection results.

    Thank you, Bernhard.

    Source code(tar.gz)
    Source code(zip)
    lingua-0.5.0-sources.jar(24.49 KB)
    lingua-0.5.0-with-dependencies.jar(147.69 MB)
    lingua-0.5.0.jar(111.25 MB)
  • v0.4.0(May 7, 2019)

    This release took some time, but here it is.

    Languages

    • added 18 new languages: Afrikaans, Albanian, Basque, Bokmal, Catalan, Greek, Icelandic, Indonesian, Irish, Malay, Norwegian, Nynorsk, Slovak, Slovene, Somali, Tagalog, Vietnamese, Welsh

    Features

    • Language models are now lazy-loaded into memory upon first access and not already when an instance of LanguageDetector is created. This way, if the rule-based engine can filter out some unlikely languages, their language models are not loaded into memory as they are not necessary at that point. So the overall memory consumption is further reduced.

    • The fastutil library is used to compress the probability values of the language models in memory. They are now stored as primitive data types (double) instead of objects (Double) which reduces memory consumption by approximately 500 MB if all language models are selected.
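
    Both ideas above, loading a model only on first access and keeping its values as primitives, can be sketched in a few lines. This is an illustration under assumptions, not the library's code: LazyModelCacheSketch, modelFor and the loader function are hypothetical names, and a plain double[] stands in for the fastutil collections mentioned above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical illustration of lazy loading; not Lingua's internal classes.
final class LazyModelCacheSketch {

    private final Map<String, double[]> loaded = new ConcurrentHashMap<>();
    private final Function<String, double[]> loader;

    // Counts how often the loader actually ran, to make the laziness observable.
    final AtomicInteger loadCount = new AtomicInteger();

    LazyModelCacheSketch(Function<String, double[]> loader) {
        this.loader = loader;
    }

    // Loads the model for a language on first access and reuses it afterwards.
    double[] modelFor(String language) {
        return loaded.computeIfAbsent(language, key -> {
            loadCount.incrementAndGet();
            return loader.apply(key); // primitive doubles, no boxed Double objects
        });
    }
}
```

    Nothing is loaded when the cache is created; a language that is never requested, for instance because a rule-based filter has already excluded it, never costs any memory.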

    Improvements

    • The overall code quality has been improved significantly. This allows for easier unit testing, configuration and extensibility.

    Bug Fixes

    • Reported bug #3 has been fixed which prevented certain character classes from being used on Android.

    Build system

    • Starting from this version, Gradle is used as this library's build system instead of Maven. This allows for more customizations, such as in test report generation, and is a first step towards multiplatform support. Please take a look at this project's README to read about the available Gradle tasks.

    Test Coverage

    • Test coverage has been extended from 24% to 55%.
    Source code(tar.gz)
    Source code(zip)
    lingua-0.4.0-sources.jar(22.81 KB)
    lingua-0.4.0-with-dependencies.jar(99.93 MB)
    lingua-0.4.0.jar(63.67 MB)
  • v0.3.2(Feb 8, 2019)

    This minor update fixes a critical bug reported in issue #1.

    Bug Fixes

    • The attempt to detect the language of a string solely containing characters that do not occur in any of the supported languages threw a kotlin.KotlinNullPointerException. This has been fixed in this release. Instead, Language.UNKNOWN is now returned as expected.

    Dependency Updates

    • The Kotlin compiler, standard library and runtime have been updated from version 1.3.20 to 1.3.21
    Source code(tar.gz)
    Source code(zip)
    lingua-0.3.2-sources.jar(23.15 KB)
    lingua-0.3.2-with-dependencies.jar(61.17 MB)
    lingua-0.3.2.jar(42.70 MB)
  • v0.3.1(Jan 24, 2019)

    This minor update contains some significant detection accuracy improvements.

    Accuracy Improvements

    • added new detection rules to improve accuracy especially for single words and word pairs
    • accuracy for single words has been increased from 78% to 82% on average
    • accuracy for word pairs has been increased from 92% to 94% on average
    • accuracy for sentences has been increased from 98% to 99% on average
    • overall accuracy has been increased from 90% to 91% on average
    • overall standard deviation has been reduced from 6.01 to 5.35

    API changes

    • LanguageDetectorBuilder.fromIsoCodes() now accepts vararg arguments instead of a List in order to have a consistent API with the other methods of LanguageDetectorBuilder
    • If an ISO 639-1 language code which does not exist is passed to LanguageDetectorBuilder.fromIsoCodes(), then an IllegalArgumentException is thrown. Previously, Language.UNKNOWN was returned. However, this could lead to bugs as a LanguageDetector with Language.UNKNOWN was built. This is now prevented.

    Dependency Updates

    • The Kotlin compiler, standard library and runtime have been updated from version 1.3.11 to 1.3.20
    Source code(tar.gz)
    Source code(zip)
    lingua-0.3.1-sources.jar(23.14 KB)
    lingua-0.3.1-with-dependencies.jar(61.17 MB)
    lingua-0.3.1.jar(42.70 MB)
  • v0.3.0(Jan 16, 2019)

    This major release offers a lot of new features, including new languages. Finally! :-)

    Languages

    • added 18 languages: Arabic, Belarusian, Bulgarian, Croatian, Czech, Danish, Dutch, Estonian, Finnish, Hungarian, Latvian, Lithuanian, Polish, Persian, Romanian, Russian, Swedish, Turkish

    Features

    • Language models can now be cached by MapDB to reduce memory usage and speed up loading times.

    Improvements

    • In the standalone app, you can now choose which language models to load in order to compare detection accuracy between strongly related languages.
    • For test report generation using Maven, you can now select a specific language using the attribute language and do not need to run the reports for all languages anymore: mvn test -P accuracy-reports -D detector=lingua -D language=German.

    API changes

    • Lingua's package structure has been simplified. The public API intended for end users now lives in com.github.pemistahl.lingua.api. Breaking changes to it will be kept to a minimum in 0.*.* versions and will no longer be made starting from version 1.0.0. All other code is stored in com.github.pemistahl.lingua.internal and is subject to change without any further notice.
    • added new class com.github.pemistahl.lingua.api.LanguageDetectorBuilder which is now responsible for building and configuring instances of com.github.pemistahl.lingua.api.LanguageDetector

    Test Coverage

    • Test coverage of the public API has been extended from 6% to 23%.

    Documentation

    • In addition to the test reports, graphical plots have been created in order to compare the detection results between the different classifiers even more easily. The code for the plots has been written in Python and is stored in an IPython notebook under /accuracy-reports/accuracy-reports-analysis-notebook.ipynb.
    Source code(tar.gz)
    Source code(zip)
    lingua-0.3.0-sources.jar(23.50 KB)
    lingua-0.3.0-with-dependencies.jar(61.12 MB)
    lingua-0.3.0.jar(42.68 MB)
  • v0.2.2(Dec 28, 2018)

    This minor version update provides the following:

    Improvements

    • The included language model JSON files now use a more efficient formatting, saving approximately 25% disk space in uncompressed format compared to version 0.2.1.

    Bug Fixes

    • The version of the Jacoco test coverage Maven plugin was incorrectly specified, leading to download errors. Now the most current snapshot version of Jacoco is used which provides enhancements for Kotlin test coverage measurement.
    Source code(tar.gz)
    Source code(zip)
    lingua-0.2.2-sources.jar(17.46 KB)
    lingua-0.2.2-with-dependencies.jar(13.28 MB)
    lingua-0.2.2.jar(9.20 MB)
  • v0.2.1(Dec 20, 2018)

    This minor version update provides the following:

    Performance Improvements

    • Lingua's language detection has been sped up. It is now approximately 25% faster for large data sets.

    Comparison with Apache Tika

    • Accuracy report test classes have been written for Apache Tika to compare its language detection performance with Lingua's. For short paragraphs of text, Lingua outperforms Tika by up to 15% in accuracy. A detailed comparison table can be found in the README.
    lingua-0.2.1-sources.jar(17.42 KB)
    lingua-0.2.1-with-dependencies.jar(14.92 MB)
    lingua-0.2.1.jar(10.84 MB)
  • v0.2.0 (Dec 17, 2018)

    This release provides both new features and bug fixes. It is the first release that has been published to JCenter. Publication on Maven Central will follow soon.

    Languages

    • added detection support for Portuguese

    Features

    • extended the language models for already supported languages to provide more accurate detection results
    • the larger language models are now lazy-loaded to reduce waiting times during start-up, especially when starting the lingua REPL
    • added some unit tests for the LanguageDetector class that cover the most basic functionality (will be extended in upcoming versions)
    • added accuracy reports and test data for each supported language, in order to measure language detection accuracy (can be generated with mvn test -P accuracy-reports)
    • added accuracy statistics summary of the current implementation to README
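
    The lazy loading mentioned above can be illustrated with a minimal, self-contained sketch. It is not Lingua's actual implementation: the class LazyModelSketch, the methods loadModel and modelFor, and the placeholder model contents are all hypothetical, and the sketch assumes Java 8 or later for computeIfAbsent.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of lazy model loading; not Lingua's real code.
public class LazyModelSketch {

    // Counts how often the (simulated) expensive load is performed.
    public static final AtomicInteger loadCount = new AtomicInteger();

    // Models are cached per language code after their first use.
    private static final Map<String, Map<String, Double>> cache =
            new ConcurrentHashMap<>();

    // Stands in for reading and parsing a large model file from disk.
    public static Map<String, Double> loadModel(String languageCode) {
        loadCount.incrementAndGet();
        Map<String, Double> model = new HashMap<>();
        model.put("placeholder-ngram", 1.0); // made-up content for illustration
        return model;
    }

    // Loads the model on first access only; later calls hit the cache.
    public static Map<String, Double> modelFor(String languageCode) {
        return cache.computeIfAbsent(languageCode, LazyModelSketch::loadModel);
    }

    public static void main(String[] args) {
        System.out.println("loads before first use: " + loadCount.get());
        modelFor("en");
        modelFor("en"); // second call reuses the cached model
        System.out.println("loads after two lookups: " + loadCount.get());
    }
}
```

    Nothing is loaded at start-up in this sketch; the cost is paid the first time a given language's model is requested.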

    API changes

    • renamed method LanguageDetector.detectLanguageFrom() to LanguageDetector.detectLanguageOf() to use the grammatically correct English preposition
    • in version 0.1.0, the method now called LanguageDetector.detectLanguageOf() returned null for strings whose language could not be detected reliably. It now returns Language.UNKNOWN in those cases to prevent NullPointerExceptions, especially in Java code.
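
    The following sketch shows why a sentinel value such as UNKNOWN is safer for Java callers than null. It is self-contained and deliberately does not use Lingua's classes: the enum Language and the toy method detectLanguageOf below are hypothetical stand-ins that only mimic the names from the notes above, and the keyword matching is a made-up placeholder for real detection.

```java
import java.util.Locale;

// Hypothetical stand-in types; not Lingua's real API.
public class UnknownSentinelSketch {

    public enum Language { ENGLISH, GERMAN, UNKNOWN }

    // Toy detector: returns UNKNOWN instead of null when no rule matches.
    public static Language detectLanguageOf(String text) {
        for (String word : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (word.equals("the")) {
                return Language.ENGLISH;
            }
            if (word.equals("und")) {
                return Language.GERMAN;
            }
        }
        return Language.UNKNOWN;
    }

    public static void main(String[] args) {
        Language result = detectLanguageOf("12345");
        // Calling a method on the result is safe without a null check,
        // because undetectable input yields UNKNOWN rather than null.
        System.out.println(result.name());
    }
}
```

    With a null return value, the call to result.name() would throw a NullPointerException for undetectable input; with the sentinel, callers can compare against UNKNOWN where they need to handle that case.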

    Bug Fixes

    • fixed a bug in lingua's REPL that caused non-ASCII characters to get broken in consoles which do not use UTF-8 encoding by default, especially on Windows systems
    lingua-0.2.0-sources.jar(16.84 KB)
    lingua-0.2.0-with-dependencies.jar(14.92 MB)
    lingua-0.2.0.jar(10.84 MB)
  • v0.1.0 (Nov 16, 2018)

    This is the very first release of Lingua. It aims at accurate language detection results for both long and especially short text. Detection on short text fragments such as Twitter messages is a weak spot of many similar libraries.

    Supported languages so far:

    • English
    • French
    • German
    • Italian
    • Latin
    • Spanish
Owner
Peter M. Stahl
Computational linguist, Rust enthusiast, green IT advocate