Overview

ChirpText is a collection of text processing tools for Python 3.

It is not meant to be a powerful tank like the popular NLTK, but a small package that you can pip-install anywhere and use to process textual data with just a few lines of code.

Main features

  • Simple file data manipulation using an enhanced open() function (txt, gz, binary, etc.); a short sketch follows this list
  • CSV helper functions
  • Parse Japanese text with the mecab library (does not require the mecab-python3 package, even on Windows; only a binary release, i.e. mecab.exe, is required)
  • Built-in "lite" text annotation formats (texttaglib TTL/CSV and TTL/JSON)
  • Helper functions and useful data for processing English, Japanese, Chinese, and Vietnamese
  • Application configuration file management that can make educated guesses about where config files are located
  • Quick text-based report generation
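
As a quick taste of the file helpers named in the first bullet, here is a minimal sketch (assuming a plain .gz path is handled the same way as the .tar.gz example in the IO section below):

# a hedged sketch of the gzip-transparent read/write helpers
from chirptext import chio

chio.write_file('data/note.gz', 'gz files are compressed and decompressed transparently')
print(chio.read_file('data/note.gz'))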

Installation

chirptext is available on PyPI and can be installed using pip

pip install chirptext

Parsing Japanese text

chirptext supports parsing Japanese text using different parsers (mecab, Janome, and igo-python)

>>> from chirptext import deko
>>> sent = deko.parse('猫が好きです。')
>>> sent.tokens
['`猫`<0:1>', '`が`<1:2>', '`好き`<2:4>', '`です`<4:6>', '`。`<6:7>']
>>> sent.tokens.values()
['猫', 'が', '好き', 'です', '。']
>>> sent[0]
`猫`<0:1>
>>> sent[0].pos
'名詞'
>>> sent[1].lemma
'が'
>>> sent[2].reading
'スキ'

# tokenize
>>> deko.tokenize('猫が好きです。')
['猫', 'が', '好き', 'です', '。']

# split sentences
>>> deko.tokenize_sent("猫が好きです。\n犬も好きです。")
['猫が好きです。', '犬も好きです。']

# parse a document (i.e. multiple sentences)
>>> doc = deko.parse_doc("猫が好きです。\n犬も好きです。")
>>> for sent in doc:
...     print(sent, sent.tokens.values())
... 
猫が好きです。 ['猫', 'が', '好き', 'です', '。']
犬も好きです。 ['犬', 'も', '好き', 'です', '。']
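
The pieces above can be combined to walk a whole document token by token. The sketch below assumes a sentence can be iterated over its tokens, as the indexing examples above suggest:

# iterate a parsed document and inspect each token
# (a sketch; assumes sentences are iterable over their tokens, as sent[0] above suggests)
from chirptext import deko

doc = deko.parse_doc("猫が好きです。\n犬も好きです。")
for sent in doc:
    for token in sent:
        print(token, token.pos, token.lemma, token.reading)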

Note: at least one of the following tools must be installed to use chirptext's Japanese parsing (a quick availability check follows this list):

  1. mecab: http://taku910.github.io/mecab/#download
  2. Janome: available on PyPI, install with pip install Janome
  3. igo-python: available on PyPI, install with pip install igo-python
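
If you are not sure which backend is present on a machine, a quick standard-library check (independent of chirptext) can tell you before parsing:

# check which optional Japanese parsing backends are installed
# (plain standard-library sketch; the backend names match the list above)
import importlib.util
import shutil

backends = {
    'mecab binary': shutil.which('mecab') is not None,
    'Janome': importlib.util.find_spec('janome') is not None,
    'igo-python': importlib.util.find_spec('igo') is not None,
}
for name, found in backends.items():
    print(name, 'found' if found else 'not found')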

Convenient IO APIs

>>> from chirptext import chio
>>> chio.write_tsv('data/test.tsv', [['a', 'b'], ['c', 'd']])
>>> chio.read_tsv('data/test.tsv')
[['a', 'b'], ['c', 'd']]

>>> chio.write_file('data/content.tar.gz', 'Support writing to .tar.gz file')
>>> chio.read_file('data/content.tar.gz')
'Support writing to .tar.gz file'

>>> for row in chio.read_tsv_iter('data/test.tsv'):
...     print(row)
... 
['a', 'b']
['c', 'd']
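
The CSV helpers work the same way as the TSV helpers above. The sketch below assumes chio.read_csv mirrors chio.read_tsv, and uses the encoding keyword argument mentioned in the 0.2a2 release notes further down:

# CSV variant of the TSV example above (a hedged sketch)
from chirptext import chio

chio.write_csv('data/test.csv', [['a', 'b'], ['c', 'd']], encoding='utf-8')
print(chio.read_csv('data/test.csv'))  # expected: [['a', 'b'], ['c', 'd']]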

Sample TextReport

# a string report
from chirptext import TextReport

rp = TextReport()  # by default, TextReport writes to standard output, i.e. the terminal
rp = TextReport(TextReport.STDOUT)  # same as above
rp = TextReport('~/tmp/my-report.txt')  # output to a file
rp = TextReport.null()  # output to /dev/null, i.e. nowhere
rp = TextReport.string()  # output to a string; call rp.content() to get the string
rp = TextReport(TextReport.STRINGIO)  # same as above

# TextReport closes the output stream automatically when used in a with statement
# (LOREM_IPSUM is a sample text constant and ct a letter-frequency counter, both set up earlier and not shown here)
with TextReport.string() as rp:
    rp.header("Lorem Ipsum Analysis", level="h0")
    rp.header("Raw", level="h1")
    rp.print(LOREM_IPSUM)
    rp.header("Top 5 most common letters")
    ct.summarise(report=rp, limit=5)
    print(rp.content())

Output

+---------------------------------------------------------------------------------- 
| Lorem Ipsum Analysis 
+---------------------------------------------------------------------------------- 
 
Raw 
------------------------------------------------------------ 
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. 
 
Top 5 most common letters
------------------------------------------------------------ 
i: 42 
e: 37 
t: 32 
o: 29 
a: 29 
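
The same report can be written to a file instead of a string; here is a minimal sketch reusing the constructors shown above (the path is only an example):

# file-backed report; the with statement closes the stream automatically
from chirptext import TextReport

with TextReport('~/tmp/my-report.txt') as rp:
    rp.header("Lorem Ipsum Analysis", level="h0")
    rp.print("This report goes to a file instead of a string.")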

Useful links

  • Source code: https://github.com/letuananh/chirptext
  • chirptext on PyPI: https://pypi.org/project/chirptext/

Comments
  • Asking for a new release on PyPi

    Asking for a new release on PyPi

    Hi,

    Version 0.1a18 is a bit outdated; could you publish a newer version to PyPI?

    I only need this commit, but since a long time has passed I think most of the changes on master are stable.

    opened by matteofumagalli1275 1
  • Add CodeQL workflow for GitHub code scanning

    Add CodeQL workflow for GitHub code scanning

    Hi letuananh/chirptext!

    This is a one-off automatically generated pull request from LGTM.com :robot:. You might have heard that we’ve integrated LGTM’s underlying CodeQL analysis engine natively into GitHub. The result is GitHub code scanning!

    With LGTM fully integrated into code scanning, we are focused on improving CodeQL within the native GitHub code scanning experience. In order to take advantage of current and future improvements to our analysis capabilities, we suggest you enable code scanning on your repository. Please take a look at our blog post for more information.

    This pull request enables code scanning by adding an auto-generated codeql.yml workflow file for GitHub Actions to your repository — take a look! We tested it before opening this pull request, so all should be working :heavy_check_mark:. In fact, you might already have seen some alerts appear on this pull request!

    Where needed and if possible, we’ve adjusted the configuration to the needs of your particular repository. But of course, you should feel free to tweak it further! Check this page for detailed documentation.

    Questions? Check out the FAQ below!

    FAQ

    Click here to expand the FAQ section

    How often will the code scanning analysis run?

    By default, code scanning will trigger a scan with the CodeQL engine on the following events:

    • On every pull request — to flag up potential security problems for you to investigate before merging a PR.
    • On every push to your default branch and other protected branches — this keeps the analysis results on your repository’s Security tab up to date.
    • Once a week at a fixed time — to make sure you benefit from the latest updated security analysis even when no code was committed or PRs were opened.

    What will this cost?

    Nothing! The CodeQL engine will run inside GitHub Actions, making use of your unlimited free compute minutes for public repositories.

    What types of problems does CodeQL find?

    The CodeQL engine that powers GitHub code scanning is the exact same engine that powers LGTM.com. The exact set of rules has been tweaked slightly, but you should see almost exactly the same types of alerts as you were used to on LGTM.com: we’ve enabled the security-and-quality query suite for you.

    How do I upgrade my CodeQL engine?

    No need! New versions of the CodeQL analysis are constantly deployed on GitHub.com; your repository will automatically benefit from the most recently released version.

    The analysis doesn’t seem to be working

    If you get an error in GitHub Actions that indicates that CodeQL wasn’t able to analyze your code, please follow the instructions here to debug the analysis.

    How do I disable LGTM.com?

    If you have LGTM’s automatic pull request analysis enabled, then you can follow these steps to disable the LGTM pull request analysis. You don’t actually need to remove your repository from LGTM.com; it will automatically be removed in the next few months as part of the deprecation of LGTM.com (more info here).

    Which source code hosting platforms does code scanning support?

    GitHub code scanning is deeply integrated within GitHub itself. If you’d like to scan source code that is hosted elsewhere, we suggest that you create a mirror of that code on GitHub.

    How do I know this PR is legitimate?

    This PR is filed by the official LGTM.com GitHub App, in line with the deprecation timeline that was announced on the official GitHub Blog. The proposed GitHub Action workflow uses the official open source GitHub CodeQL Action. If you have any other questions or concerns, please join the discussion here in the official GitHub community!

    I have another question / how do I get in touch?

    Please join the discussion here to ask further questions and send us suggestions!

    opened by lgtm-com[bot] 0
  • Revamp TTL APIs for more complex use cases

    Revamp TTL APIs for more complex use cases

    • Simplify multi-tag handling (e.g. sense candidates, chunk languages, annotators, etc.)
    • Built-in support for CoNLL
    • Use the first tag slot for scalar tags (e.g. POS, lemma, surface, languages)
    • Re-design TTL JSON
    opened by letuananh 3
  • Add support for Leipzig and Penn Treebank tagset

    Add support for Leipzig and Penn Treebank tagset

    • Leipzig
      • Reference: https://www.eva.mpg.de/lingua/resources/glossing-rules.php
    • Penn Treebank tag set
      • Version 1: https://www.sketchengine.eu/penn-treebank-tagset/
      • Version 2: https://www.sketchengine.eu/english-treetagger-pipeline-2
    enhancement 
    opened by letuananh 0
Releases(0.2a2)
  • 0.2a2(May 20, 2021)

    Changes

    • Added missing keyword arguments newline and encoding to chio.write_csv and chio.write_tsv
    • Updated test cases

    PyPI link: https://pypi.org/project/chirptext/0.2.a2/

    Source code(tar.gz)
    Source code(zip)
  • chirptext-0.1.2(May 20, 2021)

    chirptext 0.1.2 is a stable maintenance release supporting the legacy texttaglib APIs.

    Changes:

    • [v0.1.2] Added missing keyword arguments newline and encoding to chio.write_csv and chio.write_tsv
    • [v0.1.2] Updated test cases

    PyPI link: https://pypi.org/project/chirptext/0.1.2/

    To use Japanese parsing with chirptext, see the chirptext 0.1.1 stable release.

    Source code(tar.gz)
    Source code(zip)
  • 0.2a1(May 17, 2021)

  • chirptext-0.1.1(May 17, 2021)

  • chirptext-0.1(May 13, 2021)

  • 0.1rc1(May 2, 2021)

  • 0.1a21(Apr 23, 2021)

  • 0.1a19(Jun 1, 2020)

    • Improved texttaglib (lite) module
      • Better TTL-JSON support
      • Standardized TTL access methods (renamed find() and find_all() to get_tag() and get_tags())
    • Improved chirptext.sino module (Kangxi radical information)
    • Rename TextReport.file to TextReport.stream (more intuitive)
    • Show fewer mecab related warnings
    • Use Markdown for PyPI project README file
    Source code(tar.gz)
    Source code(zip)
  • 0.1a18(Jul 18, 2018)

  • 0.1a14(Apr 11, 2018)

    Deko can be used without mecab-python3 with this release.

    from chirptext import deko
    deko.set_mecab_bin("C:\\mecab\\bin\\mecab.exe")
    # Now we can use deko as usual
    sent = deko.txt2mecab("雨が降る。")
    print(sent.words)
    print(sent[0].pos)
    
    Source code(tar.gz)
    Source code(zip)
  • 0.1a11(Apr 2, 2018)

    • Added TxtWriter and TxtReader to texttaglib module for faster reading
    • Added DataObject to anhxa
    • Deko documents and sentences can be exported to TTL format
    • etc.
    Source code(tar.gz)
    Source code(zip)
  • 0.1a4(Feb 5, 2018)

    • Made WebHelper accept a string as the path to its cache DB
    • Added WebHelper.fetch_json() method
    • Fixed some bugs
    • Added a README file with some code samples
    Source code(tar.gz)
    Source code(zip)
  • 0.1a2(Jan 24, 2018)

Owner
Le Tuan Anh
computational linguist, semanticist, deeply interested in well-being, languages, and free software