
Overview




An Arabic text processing library intended for use in NLP applications


Maha is a text processing library specially developed to deal with Arabic text. The beta version can be used to clean and parse text, files, and folders with or without streaming capability.

If you need help or want to discuss topics related to Maha, feel free to reach out on our Discord server. If you would like to submit a bug report or feature request, please open an issue.

Installation

Simply run the following to install Maha:

pip install mahad # pronounced maha d

For source installation, check the documentation.

Overview

Check out the overview section in the documentation to get started with Maha.
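
As a quick, illustrative sketch, the snippet below uses only calls that appear in the examples later on this page (remove with custom_expressions from maha.cleaners.functions and parse_dimension with time=True from maha.parsers.functions); anything beyond these calls is an assumption, so consult the documentation for the full API.

>>> from maha.cleaners.functions import remove
>>> from maha.parsers.functions import parse_dimension
>>> # remove a custom expression from Arabic text
>>> remove("تعلم اللغة العربية", custom_expressions=r"اللغة")
تعلم العربية
>>> # extract time expressions from text
>>> parse_dimension("الساعة الآن 12:00", time=True)
[Dimension(body=الساعة الآن 12:00, value=TimeValue(...), start=0, end=17, dimension_type=DimensionType.TIME)]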

Documentation

Documentation is hosted on ReadTheDocs.

Contributing

Maha welcomes and encourages everyone to contribute. Contributions are always appreciated. Feel free to take a look at our contribution guidelines in the documentation.

License

Maha is BSD-licensed.

Comments
  • Time: Add the ability to parse Hijri dates

    What does this pull request change?

    Closes #27.

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [ ] updated the documentation
    • [x] tox passes
    new feature highlight 
    opened by TRoboto 6
  • Added distance to dimension parsing

    What does this pull request change?

    Resolves #15.

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [x] updated the documentation
    • [x] tox passes
    parsing highlight 
    opened by TRoboto 5
  • Introduce :mod:`~.datasets` module and the first dataset, `names`, with over 40,000 unique names

    What does this pull request change?

    This PR introduces a new datasets module that offers an interface for all upcoming datasets. A new dataset, names, is released along with the module. It comprises 44,161 unique names, with a description and name origin included for most of them.

    Link to updated docs: https://maha--40.org.readthedocs.build/en/40/overview.html#datasets

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [x] updated the documentation
    • [x] tox passes
    new feature highlight 
    opened by TRoboto 4
  • Add pyupgrade to pre-commit and upgrade to future-style type annotations

    What does this pull request change?

    Upgrades the codebase to future-style type annotations.

    Status (please check what you already did):

    • [ ] added some tests for the functionality
    • [ ] updated the documentation
    • [x] tox passes
    maintenance 
    opened by TRoboto 3
  • Deprecate and remove `datasets` module and host datasets on Hugging Face instead

    What does this pull request change?

    • Removes datasets module.
    • Datasets are now hosted on Hugging Face instead

    Status (please check what you already did):

    • [ ] added some tests for the functionality
    • [ ] updated the documentation
    • [x] tox passes
    breaking changes deprecation 
    opened by TRoboto 3
  • Add the ability to parse names from text

    What does this pull request change?

    Adds #24. Depends on #40

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [x] updated the documentation
    • [x] tox passes
    new feature highlight 
    opened by TRoboto 3
  • Add a deprecation system

    What does this pull request change?

    • Closes #23
    • Adds three deprecation decorators: one for functions, one for parameters, and one for default parameter values (a generic sketch follows).
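
    The decorator name below is hypothetical and is not Maha's actual API; this is only a generic sketch of how a function-level deprecation decorator can be built with warnings and functools.

    import functools
    import warnings

    def deprecated_fn(reason: str):
        """Hypothetical sketch: emit a DeprecationWarning whenever the wrapped function is called."""

        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                # point the warning at the caller's line, not this wrapper
                warnings.warn(
                    f"{func.__name__} is deprecated: {reason}",
                    DeprecationWarning,
                    stacklevel=2,
                )
                return func(*args, **kwargs)

            return wrapper

        return decorator

    @deprecated_fn("use new_clean() instead")
    def old_clean(text: str) -> str:
        return text.strip()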

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [ ] updated the documentation
    • [x] tox passes
    development 
    opened by saedx1 3
  • Prepare for the next release of Maha (v0.3.0)

    This is an auto-generated PR to prepare for the next release of Maha. The following changes were automatically made:

    • Generated changelogs for release v0.3.0.
    • Bumped pypi version to v0.3.0.
    • Updated the citation information.
    opened by github-actions[bot] 2
  • Ordinal: Add support to `بعد` in ordinal parsing

    What does this pull request change?

    Closes #48.

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [ ] updated the documentation
    • [x] tox passes
    new feature 
    opened by TRoboto 2
  • Numeral: Add support for hierarchical parsing

    What does this pull request change?

    Closes #25

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [ ] updated the documentation
    • [x] tox passes
    new feature 
    opened by TRoboto 2
  • Prepare for the next release of Maha (v0.2.0)

    This is an auto-generated PR to prepare for the next release of Maha. The following changes were automatically made:

    • Generated changelogs for release v0.2.0.
    • Bumped pypi version to v0.2.0.
    • Updated the citation information.
    opened by github-actions[bot] 2
  • Update ci.yml

    Check the support for Python 3.10.

    What does this pull request change? It checks whether the library supports Python 3.10.

    • ...

    Status (please check what you already did):

    • [ ] added some tests for the functionality
    • [ ] updated the documentation
    • [ ] tox passes
    opened by PAIN-BARHAM 1
  • Add the option to ignore Harakat when removing or replacing

    What problem are you trying to solve?

    Currently, the cleaner functions do not consider two strings similar if they have different Harakat/diacritics, which is the correct behavior. However, it would be great if the user had the option to ignore Harakat when comparing strings.

    Examples (if relevant)

    Current:

    >>> from maha.cleaners.functions import remove
    >>> output = remove("يُدَرِّسُ اللُّغَةَ العَرَبِيَّةَ الفُصْحَى", custom_expressions=r"اللغة")
    >>> output
    يُدَرِّسُ اللُّغَةَ العَرَبِيَّةَ الفُصْحَى
    

    Suggested:

    >>> from maha.cleaners.functions import remove
    >>> output = remove("يُدَرِّسُ اللُّغَةَ العَرَبِيَّةَ الفُصْحَى", custom_expressions=r"اللغة", ignore_harakat=True)
    >>> output
    يُدَرِّسُ العَرَبِيَّةَ الفُصْحَى
    
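    As a possible workaround until such an option exists, harakat can be stripped before matching and the matched words dropped from the original string. The helper below is an illustrative sketch, not part of Maha's API, and the harakat range U+064B–U+0652 is an assumption about which marks to ignore.

    import re

    # Arabic harakat/tanween marks (U+064B–U+0652); assumed set of marks to ignore
    HARAKAT = re.compile(r"[\u064B-\u0652]")

    def remove_ignoring_harakat(text: str, expression: str) -> str:
        """Drop words whose harakat-free form fully matches the expression."""
        stripped = HARAKAT.sub("", text)
        kept = [
            original
            for original, bare in zip(text.split(), stripped.split())
            if not re.fullmatch(expression, bare)
        ]
        return " ".join(kept)

    Applied to the example above:

    >>> remove_ignoring_harakat("يُدَرِّسُ اللُّغَةَ العَرَبِيَّةَ الفُصْحَى", r"اللغة")
    يُدَرِّسُ العَرَبِيَّةَ الفُصْحَى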

    Definition of Done

    • It must adhere to the coding style used in the defined cleaner functions.
    • The implementation should cover most use cases.
    • Adding tests
    feature request 
    opened by xaleel 1
  • Wrong parsed name using name dimension

    What happened?

    The name parser extracts wrong names such as بي and شكرا.

    Example: for the text أريد البحث في سجل الإنفاق الخاص بي, the parser returns [Dimension(body=بي, value=بي, start=32, end=34, dimension_type=DimensionType.NAME)].

    I expect it to extract only names that appear in the names dataset.

    Python version

    3.8

    What operating system are you using?

    Linux

    Code to reproduce the issue

    >>> from maha.parsers.functions import parse_dimension
    >>> text = "أريد البحث في سجل الإنفاق الخاص بي"
    >>> extracted = parse_dimension(text, names=True)
    >>> extracted
    [Dimension(body=بي, value=بي, start=32, end=34, dimension_type=DimensionType.NAME)]
    

    Relevant log output

    No response

    bug parsing 
    opened by PAIN-BARHAM 0
  • Add feature to parse duration period

    What problem are you trying to solve?

    Parsing a duration from text, where the duration is the difference between two dates.

    Examples (if relevant)

    >>> from maha.parsers.functions import parse_dimension
    >>> output = parse_dimension('عن ربع نمو سكان العالم القديم والتحضر بين 1700 و 1900 ميلادي', duration=True)[0].value
    >>> output
    DurationValue(values=[ValueUnit(value=200, unit=<DurationUnit.YEARS: 7>)], normalized_unit=<DurationUnit.SECONDS: 1>)
    
    

    Definition of Done

    • It must adhere to the coding style used in the defined dimensions (e.g., the duration dimension).
    • The implementation should cover most use cases.
    • Adding tests
    feature request 
    opened by PAIN-BARHAM 1
  • Adding the parser functionality to Processors

    What problem are you trying to solve?

    Add the parsing functionality to the Processors so they can parse different dimensions.

    Examples (if relevant)

    >>> from pathlib import Path
    >>> import maha
    >>> resource_path = Path(maha.__file__).parents[1] / "sample_data/tweets.txt"
    >>> data = resource_path.read_text()
    >>> print(data)
    
    الساعة الآن 12:00 في اسبانيا 🇪🇸, انتهى بشكل رسمي عقد الأسطورة ليو ميسي مع برشلونة . .
    طبعا بكونو حاطين المكيف ع٣ مئوية وخود تقلبات وبرد وحر وCNS وزعيق المراقب وألف نيلة وقر فتحت اشوف درجة الحرارة هتبقي كام يو الامتحان لقيتها ٤٢ والامتحان الساعه ١ فعايز انورماليز اننا ننزل بالفالنه الحمالات Hot fac
    يسعدلي مساكم ❤🌹 شرح كلمة zwa هالمنشور رح تلاقو (zwar) سهل و لذيذ (aber) ناقصو شوية ملح وكزبر #منقو
    مـعلش استحملوني ب الاصفر هالفتره 💛 #ريشـه هههههههه
    لما حد يسالني بتختفي كتير لية =..
    زيِّنوا ليلة الجمع بالصلاة على النَّبِيِّ ﷺ" ❤
    #Windows11 is on the horizon. What feature are you looking forward to
    Get vaccinate #savethesaviour
    Today I am beginning project on 10 days duratio #30daysofcod #DEVCommunit
    
    >>> from maha.processors import FileProcessor
    >>> proc = FileProcessor(resource_path)
    >>> parsed = proc.parse_dimension(time=True)
    [Dimension(body=الساعة الآن 12:00, value=TimeValue(years=0, months=0, days=0, hours=0, minutes=0, seconds=0, hour=12, minute=0, second=0, microsecond=0), start=0, end=17, dimension_type=DimensionType.TIME),
     Dimension(body=الساعه ١, value=TimeValue(hour=1, minute=0, second=0, microsecond=0), start=238, end=246, dimension_type=DimensionType.TIME),
     Dimension(body=ليلة, value=TimeValue(am_pm='PM'), start=491, end=495, dimension_type=DimensionType.TIME)]
    
    

    Definition of Done

    • It must adhere to the coding style.
    • The implementation should cover most use cases.
    • Adding tests.
    good first issue feature request parsing 
    opened by PAIN-BARHAM 0
Releases: v0.3.0

Owner: Mohammad Al-Fetyani, Machine Learning Engineer