The Classical Language Toolkit

Overview

Notice: This Git branch (dev) contains the CLTK's upcoming major release (v. 1.0.0). See https://github.com/cltk/cltk/tree/master and https://docs.cltk.org/ for the legacy code and docs.


The Classical Language Toolkit (CLTK) is a Python library offering natural language processing (NLP) for the languages of pre-modern Eurasia.

Installation

For the CLTK's latest pre-release version:

$ pip install --pre cltk
Requirements:

Documentation

Documentation at https://dev.cltk.org.

Citation

@Misc{johnsonetal2014,
 author = {Johnson, Kyle P. and Patrick Burns and John Stewart and Todd Cook},
 title = {CLTK: The Classical Language Toolkit},
 url = {https://github.com/cltk/cltk},
 year = {2014--2020},
}

License

Copyright (c) 2014-2021 Kyle P. Johnson under the MIT License.

Comments
  • Add Sanskrit stopwords

    Add Sanskrit stopwords

    For @Akhilesh28. (Please assign this to yourself.)

    In Sanskrit, a stopword list would include, at a minimum: pronouns and determiners (source), as well as upasargas (verbal prefixes / "preverbs" / "prepositions") and nipātas (particles) (which I read about here). Also add anything like conjunctions, particles, and interjections.

    Just putting together this list shouldn't take more than one week. Let us know if you're having problems. You can post your stopwords first as a "gist" here: https://gist.github.com/
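
    If it helps to see the target, here is a minimal sketch of the module shape, following the layout of the other cltk/stop/*/stops.py files (the variable name is assumed from those modules, and the entries are only illustrative placeholders, not a vetted list):

    # cltk/stop/sanskrit/stops.py -- sketch of the expected layout
    """Sanskrit stopwords (placeholder entries only)."""

    STOPS_LIST = [
        "सः",      # pronoun: "he"
        "स्वयम्",   # pronoun: "himself"
        # ... pronouns, determiners, upasargas, nipātas, conjunctions, particles, interjections
    ]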

    opened by kylepjohnson 55
  • Add IPA Phonetic Transcription for Greek

    Add IPA Phonetic Transcription for Greek

    This ticket is for Jack Duff, with @jtauber generously assisting.

    The basic idea is to make a map of Greek letters and their IPA equivalents, something like:

    {'α': 'a',
     'αι': 'ai',
     'ζ': 'zd',
     'θ': 'tʰ'}
    

    Obviously, it won't all be so easy, due to proximal characters changing pronunciation (for example, "γ" being IPA "ɡ" but before ["κ", "χ", "γ", "μ"] becoming "ŋ").
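
    To make that concrete, here is a minimal sketch of a map plus a little context logic (the IPA values and the greedy digraph-first strategy are only illustrative, not a finished Attic scheme):

    GRC_TO_IPA = {'αι': 'ai', 'α': 'a', 'ζ': 'zd', 'θ': 'tʰ', 'γ': 'ɡ', 'κ': 'k', 'χ': 'kʰ', 'μ': 'm'}
    NASALIZING = {'κ', 'χ', 'γ', 'μ'}  # γ becomes ŋ before these

    def transcribe(text):
        """Greedy left-to-right transcription: try digraphs first, then single letters."""
        out, i = [], 0
        while i < len(text):
            digraph = text[i:i + 2]
            if len(digraph) == 2 and digraph in GRC_TO_IPA:
                out.append(GRC_TO_IPA[digraph])
                i += 2
                continue
            char = text[i]
            if char == 'γ' and i + 1 < len(text) and text[i + 1] in NASALIZING:
                out.append('ŋ')  # context rule: γ before κ, χ, γ, μ
            else:
                out.append(GRC_TO_IPA.get(char, char))  # unknown characters pass through
            i += 1
        return ''.join(out)

    # transcribe('αι') -> 'ai'; transcribe('γκ') -> 'ŋk'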

    If you can get this down for Attic, then consider moving on to other dialects, like Ionic or Koine.

    Within the CLTK's architecture, the transliteration maps and logic should go into something like cltk/phonetics/greek/transcription.py. Or consider making a general transcription entry point at cltk/phonetics/transcription.py and then declaring which language and dialect to use. I'll leave the implementation details to you two, though.

    enhancement 
    opened by kylepjohnson 51
  • Words to be added in Sanskrit's Stop Word Collection

    Words to be added in Sanskrit's Stop Word Collection

    • ~सः (he)~
    • ~स्वयम् (himself)~
    • तदीय (theirs), आसम् (be)
    • ज्ञा (have), परि (with), शक्नोति (can, verb), यद् (if), कतम (which)

    Add all the words in all their different cases, genders, and all three numbers (singular, dual, plural). If you are doing it right, there should be exactly 72 forms for each entry (including a few repetitions). Be careful with verb word forms; they have entirely different structures.

    File at: https://github.com/cltk/cltk/blob/master/cltk/stop/sanskrit/stops.py

    opened by nikheelpandey 42
  • Scraping srimad-bhagavadgita and valmiki ramayana.

    Scraping srimad-bhagavadgita and valmiki ramayana.

    • I am scraping Sanskrit-English data from
      • Srimad-bhagavadgita : http://www.gitasupersite.iitk.ac.in/srimad
      • Valmiki Ramayana : http://www.valmiki.iitk.ac.in/

    Ping @kylepjohnson

    new corpus 
    opened by ghost 36
  • Add corpus for classical telugu

    Add corpus for classical telugu

    https://te.wikisource.org/wiki contains the classical Telugu itihasas, puranas, vedas, stotras, etc., so I would like to scrape them and add them as a new corpus.

    Thank you.

    new corpus 
    opened by ghost 31
  • Make stopwords list for Old English

    Make stopwords list for Old English

    To generalize, I observe that there are different approaches to making stopword lists, based either on statistics (most common words, variously calculated) or grammar (definite and indefinite articles, pronouns, etc.) (or some combination).

    In doing this ticket, I would like you to do a little research on whether there exist any good lists for OE. If there is one, let's just take it. If not, we can do a little more research about what's right.
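
    A minimal sketch of the statistical side mentioned above (the corpus and the cutoff are placeholders; the grammatical word classes would still need hand-checking):

    from collections import Counter

    def stopword_candidates(tokens, top_n=100):
        """Return the top_n most frequent lowercased tokens as stopword candidates."""
        counts = Counter(token.lower() for token in tokens)
        return [word for word, _ in counts.most_common(top_n)]

    # e.g. stopword_candidates(tokens_from_any_OE_corpus, top_n=150)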

    enhancement easy 
    opened by kylepjohnson 29
  • Scraping Raw Classical Hindi Data

    Scraping Raw Classical Hindi Data

    I am scraping Raw Classical Hindi Data from http://ltrc.iiit.ac.in/showfile.php?filename=downloads/Classical_Hindi_Literature/SHUSHA/index.html @kylepjohnson

    new corpus 
    opened by Akirato 29
  • Add declining tool based on Collatinus and Eulexis ?

    Add declining tool based on Collatinus and Eulexis ?

    Hi there, I have been thinking about this for months and I do not think the CLTK contains anything like it. Collatinus and Eulexis are two open-source lemmatizers and decliners (their data is either open or easy to reconstruct, and they are a nice bunch of people).

    • Collatinus is in C
      • https://github.com/biblissima/collatinus is the most up to date source code for the flexer / lemmatizer
      • https://github.com/ycollatin/Collatinus-data is the repo for their data (I first guessed it was not up to date, but it seems this is actually the more current one).
    • Eulexis is in php
      • https://github.com/biblissima/eulexis/blob/master/traitement.php For the whole code

    I'd be happy to convert the Collatinus flexer for the CLTK in the long run (give or take a few months), but I think Eulexis and the lemmatizer part are out of my scope right now.

    What's your opinion on this? It would help search APIs a lot for texts which are not lemmatized.

    opened by PonteIneptique 28
  • Normalize Unicode throughout CLTK

    Normalize Unicode throughout CLTK

    I've been reading about normalize() and hope it will prevent normalization problems in the future. This function solves the problem of accented characters built with combining diacritics not comparing equal to their precomposed equivalents. Examples of this appear in the test suite, where I have struggled to make two strings of accented Greek equal one another.

    Example of normalize() from Fluent Python by Luciano Ramalho (117-118):

    >>> from unicodedata import normalize
    >>> s1 = 'café' # composed "e" with acute accent
    >>> s2 = 'cafe\u0301' # decomposed "e" and acute accent 
    >>> len(s1), len(s2)
    (4, 5)
    >>> len(normalize('NFC', s1)), len(normalize('NFC', s2)) 
    (4, 4)
    >>> len(normalize('NFD', s1)), len(normalize('NFD', s2)) 
    (5, 5)
    >>> normalize('NFC', s1) == normalize('NFC', s2)
    True
    >>> normalize('NFD', s1) == normalize('NFD', s2) 
    True
    

    Solutions

    1. In core, use normalize with the argument 'NFC', as Fluent Python recommends. Not all Greek combining forms may reduce to precomposed characters … this will need to be tested.

    2. In tests, especially for assertEqual(), check that more complicated strings equal one another. Use normalize('NFC', <text>) on the comparison strings, too, if necessary.

    3. Use this to strip out accented characters coming from the PHI, which I don't do very gracefully here: https://github.com/kylepjohnson/cltk/blob/master/cltk/corpus/utils/formatter.py#L94

    Docs: https://docs.python.org/3.4/library/unicodedata.html#unicodedata.normalize
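
    As a concrete illustration of solutions 1 and 2, a minimal sketch (the helper name and its location are assumptions for illustration, not a reference to existing CLTK code):

    from unicodedata import normalize

    def to_nfc(text):
        """Bring text into a single canonical Unicode form before any comparison."""
        return normalize('NFC', text)

    # in tests: compare normalized strings so combining and precomposed accents match
    assert to_nfc('cafe\u0301') == to_nfc('café')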

    enhancement 
    opened by kylepjohnson 25
  • add Latin WordNet API

    add Latin WordNet API

    The Latin WordNet API mimics the NLTK Princeton WordNet API in all major respects; however, because the data is sourced from latinwordnet.exeter.ac.uk (rather than locally), a number of under-the-hood changes were made. Many access methods now return generators rather than lists, and in general the API is now 'lazy' where multiple HTTP requests would cause a bottleneck. The Resnik, Jiang-Conrath, and Lin similarity scoring functions work, but require availability of a corpus-based information content file (forthcoming).

    opened by wmshort 24
  • Write syllabifiers for Indian languages

    Write syllabifiers for Indian languages

    This ticket is for @soumyag213

    As discussed by email, you'll port this and related modules to the CLTK from the Indic NLP Library.

    For a first step, I'd like to see this working in your own repo, which you have started at: https://github.com/soumyag213/cltk-beginning-indo. In the README for this, I would like to see an example of its API. For example, I imagine you showing something like this in the Python shell (BTW I like IPython):

    In [1]: from indic_syllabifier import orthographic_syllabify
    In [2]: orthographic_syllabify('supercalifragilisticexpialidocious', 'tamil')
    Out[2]: 'su-per-cal-i-fra-gil-ist-ic-ex-pi-al-i-doc-ious'
    
    enhancement 
    opened by kylepjohnson 24
  • Processing text with square brackets using the Latin NLP pipeline

    Processing text with square brackets using the Latin NLP pipeline

    I noticed an anomaly processing Latin text with the default pipeline. The tokenizer fails to separate square brackets from the words they enclose.

    text = 'Benedictus XVI [Iosephus Aloisius Ratzinger] fuit papa et episcopus Romanus.'
    
    from cltk import NLP
    
    cltk_nlp = NLP('lat')
    cltk_nlp.analyze(text).tokens
    

    Result:

    ['Benedictus', 'XVI', '[Iosephus', 'Aloisius', 'Ratzinger]', 'fuit', 'papa', 'et', 'episcopus', 'Romanus', '.']
    

    The problem does not occur when the LatinWordTokenizer is used.

    from cltk.tokenizers.lat.lat import LatinWordTokenizer
    
    tokenizer = LatinWordTokenizer()
    tokenizer.tokenize(text)
    

    Result:

    ['Benedictus', 'XVI', '[', 'Iosephus', 'Aloisius', 'Ratzinger', ']', 'fuit', 'papa', 'et', 'episcopus', 'Romanus', '.']
    

    Environment: Windows 10 + python 3.9.13 + cltk 1.1.6.
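
    Until the default pipeline handles this, one possible caller-side workaround is to pad the brackets with whitespace before analysis (a sketch, not an official fix):

    import re

    def pad_brackets(text):
        """Put spaces around square brackets so whitespace-aware tokenizers split them off."""
        return re.sub(r"\s*([\[\]])\s*", r" \1 ", text).strip()

    padded = pad_brackets(text)
    # cltk_nlp.analyze(padded).tokens should then yield '[' and ']' as separate tokens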

    bug 
    opened by DavideMassidda 0
  • SpaCy process

    SpaCy process

    I added the spaCy process with a custom wrapper to translate spaCy Token objects into CLTK Word objects. The aim is to be able to use trained models provided by spaCy with the CLTK.
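
    For readers following along, a rough sketch of what such a wrapper has to do, shown here as a plain attribute mapping (the CLTK-side field names are assumptions for illustration, not the PR's actual code):

    def spacy_token_to_word_fields(token):
        """Collect the spaCy Token attributes a CLTK Word would carry (keys are illustrative)."""
        return {
            "string": token.text,
            "lemma": token.lemma_,
            "upos": token.pos_,
            "index_token": token.i,
        }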

    opened by clemsciences 0
  • A way to tell what tokens `LatinBackOffLemmatizer()` has failed to lemmatize

    A way to tell what tokens `LatinBackOffLemmatizer()` has failed to lemmatize

    In LatinBackOffLemmatizer() and the lemmatizers in its chain, I can't seem to find an option to return an empty value (such as OldEnglishDictionaryLemmatizer()'s best_guess=False option) instead of the input value when the lemmatizer fails to assign a lemma.

    Without such an option, it doesn't seem possible to tell successful from unsuccessful lemmatization attempts programmatically, severely limiting the range of the lemmatizer's applications.
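
    One workaround in the meantime (a heuristic sketch, not a library feature): since the backoff chain ultimately falls back to returning the surface form, flag tokens whose lemma equals the token, accepting false positives where the lemma legitimately matches the form. The import path and class name below follow recent CLTK releases and may differ by version; the Latin model data must be downloaded first.

    from cltk.lemmatize.lat import LatinBackoffLemmatizer

    lemmatizer = LatinBackoffLemmatizer()
    tokens = ["arma", "virumque", "cano"]
    pairs = lemmatizer.lemmatize(tokens)  # list of (token, lemma) tuples

    maybe_unlemmatized = [tok for tok, lemma in pairs if lemma == tok]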

    question acknowledged feature-request 
    opened by langeslag 6
  • Bump certifi from 2022.5.18.1 to 2022.12.7

    Bump certifi from 2022.5.18.1 to 2022.12.7

    Bumps certifi from 2022.5.18.1 to 2022.12.7.


    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 0
  • Unicode issue with Greek accented vowels in prosody

    Unicode issue with Greek accented vowels in prosody

    Unicode has two code points for acute accented vowels, one in the Greek and Coptic block and one in the Greek Extended block (for omicron they are U+03CC and U+1F79). The list of accented vowels only takes into account the acute accents in the Greek and Coptic block, resulting in some vowels not being properly scanned.

    >>> from cltk.prosody.grc import Scansion
    >>> text_string = "πότνια, θῦμον"
    >>> Scansion()._make_syllables(text_string)
    [[['πότνι', 'α'], ['θῦ', 'μον']]]
    

    Expected behavior

    >>> from cltk.prosody.grc import Scansion
    >>> text_string = "πότνια, θῦμον"
    >>> Scansion()._make_syllables(text_string)
    [[['πο', 'τνι' , 'α'], ['θῦ', 'μον']]]
    

    Desktop

    • MacOS 13.0
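
    A possible caller-side workaround in the meantime (a sketch, assuming the offending characters are the Greek Extended oxia code points, whose NFC normalization is the corresponding tonos character in the Greek and Coptic block):

    from unicodedata import normalize
    from cltk.prosody.grc import Scansion

    text_string = "πότνια, θῦμον"
    nfc_text = normalize("NFC", text_string)  # e.g. U+1F79 (oxia) becomes U+03CC (tonos)
    Scansion()._make_syllables(nfc_text)

    Whether this restores the expected syllabification depends on which code points the source text actually uses; the underlying fix is still to include both code points in the vowel list.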
    bug 
    opened by JoshuaCCampbell 1
  • Latin enclitic tokenizer broken?

    Latin enclitic tokenizer broken?

    The Latin tokenizer does not separate the enclitics -que, -ne, -ve. In line 147 of tokenizers/lat/lat.py I suggest:

    specific_tokens += [token[: -len(enclitic)]] + ["-" + enclitic]

    This fixed it for me.
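
    For what it's worth, a quick check in plain Python of what the suggested line produces for a token carrying -que (independent of the tokenizer internals):

    token, enclitic = "virumque", "que"
    specific_tokens = []
    specific_tokens += [token[: -len(enclitic)]] + ["-" + enclitic]
    print(specific_tokens)  # ['virum', '-que']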

    Mac OS 15.7 Python 3.9

    bug 
    opened by polycrates 3