Japanese NLP Library

Overview

Japanese NLP Library


Back to Home

1   Requirements

1.1   Links

  • All code at jProcessing Repo GitHub
  • PyPi Python Package
clone [email protected]:kevincobain2000/jProcessing.git

1.2   Install

In Terminal

bash$ python setup.py install

1.3   History

  • 0.2

    • Sentiment Analysis of Japanese Text
  • 0.1
    • Morphologically Tokenize Japanese Sentence
    • Kanji / Hiragana / Katakana to Romaji Converter
    • Edict Dictionary Search - borrowed
    • Edict Examples Search - incomplete
    • Sentence Similarity between two JP Sentences
    • Run Cabocha(ISO--8859-1 configured) in Python.
    • Longest Common String between Sentences
    • Kanji to Katakana Pronunciation
    • Hiragana, Katakana Chart Parser

2   Libraries and Modules

2.1   Tokenize jTokenize.py

In Python

>>> from jNlp.jTokenize import jTokenize
>>> input_sentence = u'私は彼を5日前、つまりこの前の金曜日に駅で見かけた'
>>> list_of_tokens = jTokenize(input_sentence)
>>> print list_of_tokens
>>> print '--'.join(list_of_tokens).encode('utf-8')

Returns:

... [u'\u79c1', u'\u306f', u'\u5f7c', u'\u3092', u'\uff15'...]
... 私--は--彼--を--5--日--前--、--つまり--この--前--の--金曜日--に--駅--で--見かけ--た

Katakana Pronunciation:

>>> print '--'.join(jReads(input_sentence)).encode('utf-8')
... ワタシ--ハ--カレ--ヲ--ゴ--ニチ--マエ--、--ツマリ--コノ--マエ--ノ--キンヨウビ--ニ--エキ--デ--ミカケ--タ

2.2   Cabocha jCabocha.py

Run Cabocha with original EUCJP or IS0-8859-1 configured encoding, with utf8 python

>>> from jNlp.jCabocha import cabocha
>>> print cabocha(input_sentence).encode('utf-8')

Output:

">
<sentence>
 <chunk id="0" link="8" rel="D" score="0.971639" head="0" func="1">
  <tok id="0" read="ワタシ" base="" pos="名詞-代名詞-一般" ctype="" cform="" ne="O">私tok>
  <tok id="1" read="" base="" pos="助詞-係助詞" ctype="" cform="" ne="O">はtok>
 chunk>
 <chunk id="1" link="2" rel="D" score="0.488672" head="2" func="3">
  <tok id="2" read="カレ" base="" pos="名詞-代名詞-一般" ctype="" cform="" ne="O">彼tok>
  <tok id="3" read="" base="" pos="助詞-格助詞-一般" ctype="" cform="" ne="O">をtok>
 chunk>
 <chunk id="2" link="8" rel="D" score="2.25834" head="6" func="6">
  <tok id="4" read="" base="" pos="名詞-数" ctype="" cform="" ne="B-DATE">5tok>
  <tok id="5" read="ニチ" base="" pos="名詞-接尾-助数詞" ctype="" cform="" ne="I-DATE">日tok>
  <tok id="6" read="マエ" base="" pos="名詞-副詞可能" ctype="" cform="" ne="I-DATE">前tok>
  <tok id="7" read="" base="" pos="記号-読点" ctype="" cform="" ne="O">、tok>
 chunk>

2.3   Kanji / Katakana /Hiragana to Tokenized Romaji jConvert.py

Uses data/katakanaChart.txt and parses the chart. See katakanaChart.

>>> from jNlp.jConvert import *
>>> input_sentence = u'気象庁が21日午前4時48分、発表した天気概況によると、'
>>> print ' '.join(tokenizedRomaji(input_sentence))
>>> print tokenizedRomaji(input_sentence)
...kisyoutyou ga ni ichi nichi gozen yon ji yon hachi hun  hapyou si ta tenki gaikyou ni yoru to
...[u'kisyoutyou', u'ga', u'ni', u'ichi', u'nichi', u'gozen',...]

katakanaChart.txt

2.4   Longest Common String Japanese jProcessing.py

On English Strings

>>> from jNlp.jProcessing import long_substr
>>> a = 'Once upon a time in Italy'
>>> b = 'Thre was a time in America'
>>> print long_substr(a, b)

Output

...a time in

On Japanese Strings

>>> a = u'これでアナタも冷え知らず'
>>> b = u'これでア冷え知らずナタも'
>>> print long_substr(a, b).encode('utf-8')

Output

...冷え知らず

2.5   Similarity between two sentences jProcessing.py

Uses MinHash by checking the overlap http://en.wikipedia.org/wiki/MinHash

English Strings:
>>> from jNlp.jProcessing import Similarities
>>> s = Similarities()
>>> a = 'There was'
>>> b = 'There is'
>>> print s.minhash(a,b)
...0.444444444444
Japanese Strings:
>>> from jNlp.jProcessing import *
>>> a = u'これは何ですか?'
>>> b = u'これはわからないです'
>>> print s.minhash(' '.join(jTokenize(a)), ' '.join(jTokenize(b)))
...0.210526315789

3   Edict Japanese Dictionary Search with Example sentences

3.1   Sample Ouput Demo

3.2   Edict dictionary and example sentences parser.

This package uses the EDICT and KANJIDIC dictionary files. These files are the property of the Electronic Dictionary Research and Development Group , and are used in conformance with the Group's licence .

Edict Parser By Paul Goins, see edict_search.py Edict Example sentences Parse by query, Pulkit Kathuria, see edict_examples.py Edict examples pickle files are provided but latest example files can be downloaded from the links provided.

3.3   Charset

Two files

  • utf8 Charset example file if not using src/jNlp/data/edict_examples

    To convert EUCJP/ISO-8859-1 to utf8

    iconv -f EUCJP -t UTF-8 path/to/edict_examples > path/to/save_with_utf-8
    
  • ISO-8859-1 edict_dictionary file

Outputs example sentences for a query in Japanese only for ambiguous words.

3.4   Links

Latest Dictionary files can be downloaded here

3.5   edict_search.py

author: Paul Goins License included linkToOriginal:

For all entries of sense definitions

>>> from jNlp.edict_search import *
>>> query = u'認める'
>>> edict_path = 'src/jNlp/data/edict-yy-mm-dd'
>>> kp = Parser(edict_path)
>>> for i, entry in enumerate(kp.search(query)):
...     print entry.to_string().encode('utf-8')

3.6   edict_examples.py

Note: Only outputs the examples sentences for ambiguous words (if word has one or more senses)
author: Pulkit Kathuria
>>> from jNlp.edict_examples import *
>>> query = u'認める'
>>> edict_path = 'src/jNlp/data/edict-yy-mm-dd'
>>> edict_examples_path = 'src/jNlp/data/edict_examples'
>>> search_with_example(edict_path, edict_examples_path, query)

Output

認める

Sense (1) to recognize;
  EX:01 我々は彼の才能を*認*めている。We appreciate his talent.

Sense (2) to observe;
  EX:01 x線写真で異状が*認*められます。We have detected an abnormality on your x-ray.

Sense (3) to admit;
  EX:01 母は私の計画をよいと*認*めた。Mother approved my plan.
  EX:02 母は決して私の結婚を*認*めないだろう。Mother will never approve of my marriage.
  EX:03 父は決して私の結婚を*認*めないだろう。Father will never approve of my marriage.
  EX:04 彼は女性の喫煙をいいものだと*認*めない。He doesn't approve of women smoking.
  ...

4   Sentiment Analysis Japanese Text

This section covers (1) Sentiment Analysis on Japanese text using Word Sense Disambiguation, Wordnet-jp (Japanese Word Net file name wnjpn-all.tab), SentiWordnet (English SentiWordNet file name SentiWordNet_3.*.txt).

4.1   Wordnet files download links

  1. http://nlpwww.nict.go.jp/wn-ja/eng/downloads.html
  2. http://sentiwordnet.isti.cnr.it/

4.2   How to Use

The following classifier is baseline, which works as simple mapping of Eng to Japanese using Wordnet and classify on polarity score using SentiWordnet.

  • (Adnouns, nouns, verbs, .. all included)
  • No WSD module on Japanese Sentence
  • Uses word as its common sense for polarity score
>>> from jNlp.jSentiments import *
>>> jp_wn = '../../../../data/wnjpn-all.tab'
>>> en_swn = '../../../../data/SentiWordNet_3.0.0_20100908.txt'
>>> classifier = Sentiment()
>>> classifier.train(en_swn, jp_wn)
>>> text = u'監督、俳優、ストーリー、演出、全部最高!'
>>> print classifier.baseline(text)
...Pos Score = 0.625 Neg Score = 0.125
...Text is Positive

4.3   Japanese Word Polarity Score

>>> from jNlp.jSentiments import *
>>> jp_wn = '_dicts/wnjpn-all.tab' #path to Japanese Word Net
>>> en_swn = '_dicts/SentiWordNet_3.0.0_20100908.txt' #Path to SentiWordNet
>>> classifier = Sentiment()
>>> sentiwordnet, jpwordnet  = classifier.train(en_swn, jp_wn)
>>> positive_score = sentiwordnet[jpwordnet[u'全部']][0]
>>> negative_score = sentiwordnet[jpwordnet[u'全部']][1]
>>> print 'pos score = {0}, neg score = {1}'.format(positive_score, negative_score)
...pos score = 0.625, neg score = 0.0

5   Contacts

Author: pulkit[at]jaist.ac.jp [change at with @]
This is a GUI program that will generate a word search puzzle image

Word Search Puzzle Generator Table of Contents About The Project Built With Getting Started Prerequisites Installation Usage Roadmap Contributing Cont

11 Feb 22, 2022
Code for Emergent Translation in Multi-Agent Communication

Emergent Translation in Multi-Agent Communication PyTorch implementation of the models described in the paper Emergent Translation in Multi-Agent Comm

Facebook Research 75 Jul 15, 2022
This repository contains (not all) code from my project on Named Entity Recognition in philosophical text

NERphilosophy 👋 Welcome to the github repository of my BsC thesis. This repository contains (not all) code from my project on Named Entity Recognitio

Ruben 1 Jan 27, 2022
Fast, DB Backed pretrained word embeddings for natural language processing.

Embeddings Embeddings is a python package that provides pretrained word embeddings for natural language processing and machine learning. Instead of lo

Victor Zhong 212 Nov 21, 2022
Need: Image Search With Python

Need: Image Search The problem is that a user needs to search for a specific ima

Surya Komandooru 1 Dec 30, 2021
Asr abc - Automatic speech recognition(ASR),中文语音识别

语音识别的简单示例,主要在课堂演示使用 创建python虚拟环境 在linux 和macos 上验证通过 # 如果已经有pyhon3.6 环境,跳过该步骤,使用

LIyong.Guo 8 Nov 11, 2022
An assignment on creating a minimalist neural network toolkit for CS11-747

minnn by Graham Neubig, Zhisong Zhang, and Divyansh Kaushik This is an exercise in developing a minimalist neural network toolkit for NLP, part of Car

Graham Neubig 63 Dec 29, 2022
Convolutional Neural Networks for Sentence Classification

Convolutional Neural Networks for Sentence Classification Code for the paper Convolutional Neural Networks for Sentence Classification (EMNLP 2014). R

Yoon Kim 2k Jan 02, 2023
Sequence-to-sequence framework with a focus on Neural Machine Translation based on Apache MXNet

Sequence-to-sequence framework with a focus on Neural Machine Translation based on Apache MXNet

Amazon Web Services - Labs 1.1k Dec 27, 2022
Non-Autoregressive Translation with Layer-Wise Prediction and Deep Supervision

Deeply Supervised, Layer-wise Prediction-aware (DSLP) Transformer for Non-autoregressive Neural Machine Translation

Chenyang Huang 37 Jan 04, 2023
Super easy library for BERT based NLP models

Fast-Bert New - Learning Rate Finder for Text Classification Training (borrowed with thanks from https://github.com/davidtvs/pytorch-lr-finder) Suppor

Utterworks 1.8k Dec 27, 2022
A repo for materials relating to the tutorial of CS-332 NLP

CS-332-NLP A repo for materials relating to the tutorial of CS-332 NLP Contents Tutorial 1: Introduction Corpus Regular expression Tokenization Tutori

Alok singh 9 Feb 15, 2022
The following links explain a bit the idea of semantic search and how search mechanisms work by doing retrieve and rerank

Main Idea The following links explain a bit the idea of semantic search and how search mechanisms work by doing retrieve and rerank Semantic Search Re

Sergio Arnaud Gomez 2 Jan 28, 2022
Pytorch-Named-Entity-Recognition-with-BERT

BERT NER Use google BERT to do CoNLL-2003 NER ! Train model using Python and Inference using C++ ALBERT-TF2.0 BERT-NER-TENSORFLOW-2.0 BERT-SQuAD Requi

Kamal Raj 1.1k Dec 25, 2022
Saptak Bhoumik 14 May 24, 2022
Unsupervised intent recognition

INTENT author: steeve LAQUITAINE description: deployment pattern: currently batch only Setup & run git clone https://github.com/slq0/intent.git bash

sl 1 Apr 08, 2022
Fuzzy String Matching in Python

FuzzyWuzzy Fuzzy string matching like a boss. It uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package.

SeatGeek 8.8k Jan 01, 2023
Open source annotation tool for machine learning practitioners.

doccano doccano is an open source text annotation tool for humans. It provides annotation features for text classification, sequence labeling and sequ

7.1k Jan 01, 2023
Easy to use, state-of-the-art Neural Machine Translation for 100+ languages

EasyNMT - Easy to use, state-of-the-art Neural Machine Translation This package provides easy to use, state-of-the-art machine translation for more th

Ubiquitous Knowledge Processing Lab 748 Jan 06, 2023
A simple implementation of N-gram language model.

About A simple implementation of N-gram language model. Requirements numpy Data preparation Corpus Training data for the N-gram model, a text file lik

4 Nov 24, 2021