Maha is a text processing library specially developed to deal with Arabic text.

Overview




An Arabic text processing library intended for use in NLP applications


Maha is a text processing library specially developed to deal with Arabic text. The beta version can be used to clean and parse text, files, and folders with or without streaming capability.
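To make the difference between the in-memory and streaming modes concrete, here is a minimal sketch using only the standard library. It is not Maha's API: the function names `strip_tatweel`, `clean_in_memory`, and `clean_streaming` are hypothetical, and Tatweel removal merely stands in for whatever cleaning step is applied.

```python
from typing import Callable, Iterable, Iterator

# Tatweel (kashida) is the Arabic elongation character U+0640.
TATWEEL = "\u0640"


def strip_tatweel(line: str) -> str:
    """Hypothetical cleaner: drop every Tatweel character from a line."""
    return line.replace(TATWEEL, "")


def clean_in_memory(text: str, cleaner: Callable[[str], str]) -> str:
    """Non-streaming mode: the whole text is held in memory at once."""
    return "\n".join(cleaner(line) for line in text.splitlines())


def clean_streaming(
    lines: Iterable[str], cleaner: Callable[[str], str]
) -> Iterator[str]:
    """Streaming mode: yield one cleaned line at a time.

    `lines` can be an open file object, so a large file is never
    loaded into memory in full.
    """
    for line in lines:
        yield cleaner(line.rstrip("\n"))
```

With an open file, `for cleaned in clean_streaming(handle, strip_tatweel): ...` processes one line at a time, which is the property that matters for large files and folders.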

If you need help or want to discuss topics related to Maha, join our Discord server. To submit a bug report or feature request, please open an issue.

Installation

Simply run the following to install Maha:

pip install mahad # pronounced maha d

For source installation, check the documentation.

Overview

Check out the overview section in the documentation to get started with Maha.

Documentation

Documentation is hosted on Read the Docs.

Contributing

Maha welcomes and encourages contributions from everyone. See the contribution guidelines in the documentation to get started.

License

Maha is BSD-licensed.

Comments
  • Time: Add the ability to parse Hijri dates


    What does this pull request change?

    Closes #27.

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [ ] updated the documentation
    • [x] tox passes
    new feature highlight 
    opened by TRoboto 6
  • Added distance to dimension parsing


    What does this pull request change?

    Resolves #15.

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [x] updated the documentation
    • [x] tox passes
    parsing highlight 
    opened by TRoboto 5
  • Introduce :mod:`~.datasets` module and the first dataset, `names`, with over 40,000 unique names


    What does this pull request change?

    This PR introduces a new datasets module that offers an interface for all upcoming datasets. A new dataset, names, is released along with the module. It comprises 44,161 unique names with descriptions and name origin included for most names.

    Link to updated docs: https://maha--40.org.readthedocs.build/en/40/overview.html#datasets

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [x] updated the documentation
    • [x] tox passes
    new feature highlight 
    opened by TRoboto 4
  • Add pyupgrade to pre-commit and upgrade to future-style type annotations


    What does this pull request change?

    Upgrades to new type annotations style.

    Status (please check what you already did):

    • [ ] added some tests for the functionality
    • [ ] updated the documentation
    • [x] tox passes
    maintenance 
    opened by TRoboto 3
  • Deprecate and remove `datasets` module and host datasets on Hugging Face instead


    What does this pull request change?

    • Removes datasets module.
    • Datasets are now hosted here

    Status (please check what you already did):

    • [ ] added some tests for the functionality
    • [ ] updated the documentation
    • [x] tox passes
    breaking changes deprecation 
    opened by TRoboto 3
  • Add the ability to parse names from text


    What does this pull request change?

    Adds #24. Depends on #40

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [x] updated the documentation
    • [x] tox passes
    new feature highlight 
    opened by TRoboto 3
  • Add a deprecation system


    What does this pull request change?

    • Closes #23
    • Adds three deprecation decorators: one for functions, one for parameters, and one for default parameter values.
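
A minimal sketch of how two such decorators might look, using only `functools` and `warnings` from the standard library. The names `deprecated_fn` and `deprecated_param` and their arguments are assumptions made for illustration; they are not taken from the pull request.

```python
import functools
import warnings


def deprecated_fn(since: str, message: str = ""):
    """Hypothetical decorator: warn whenever the wrapped function is called."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__name__} is deprecated since {since}. {message}".strip(),
                DeprecationWarning,
                stacklevel=2,  # point the warning at the caller, not the wrapper
            )
            return func(*args, **kwargs)

        return wrapper

    return decorator


def deprecated_param(name: str, since: str, message: str = ""):
    """Hypothetical decorator: warn only when `name` is passed as a keyword."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if name in kwargs:
                warnings.warn(
                    f"Parameter {name!r} of {func.__name__} is deprecated "
                    f"since {since}. {message}".strip(),
                    DeprecationWarning,
                    stacklevel=2,
                )
            return func(*args, **kwargs)

        return wrapper

    return decorator
```

This sketch only detects deprecated parameters passed by keyword; catching positional use would require inspecting the function signature.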

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [ ] updated the documentation
    • [x] tox passes
    development 
    opened by saedx1 3
  • Prepare for the next release of Maha (v0.3.0)


    This is an auto-generated PR to prepare for the next release of Maha. The following changes were automatically made:

    • Generated changelogs for release v0.3.0.
    • Bumped pypi version to v0.3.0.
    • Updated the citation information.
    opened by github-actions[bot] 2
  • Ordinal: Add support to `بعد` in ordinal parsing


    What does this pull request change?

    Closes #48.

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [ ] updated the documentation
    • [x] tox passes
    new feature 
    opened by TRoboto 2
  • Numeral: Add support for hierarchical parsing


    What does this pull request change?

    Closes #25

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [ ] updated the documentation
    • [x] tox passes
    new feature 
    opened by TRoboto 2
  • Prepare for the next release of Maha (v0.2.0)


    This is an auto-generated PR to prepare for the next release of Maha. The following changes were automatically made:

    • Generated changelogs for release v0.2.0.
    • Bumped pypi version to v0.2.0.
    • Updated the citation information.
    opened by github-actions[bot] 2
  • Update ci.yml


    Check support for Python 3.10.

    What does this pull request change? It checks whether the library supports Python 3.10.

    • ...

    Status (please check what you already did):

    • [ ] added some tests for the functionality
    • [ ] updated the documentation
    • [ ] tox passes
    opened by PAIN-BARHAM 1
  • Add the option to ignore Harakat when removing or replacing


    What problem are you trying to solve?

    Currently, the cleaner functions do not consider two strings similar if they have different Harakat/diacritics, which is the correct behavior. However, it would be great if the user had the option to ignore Harakat when comparing strings.

    Examples (if relevant)

    Current:

    >>> from maha.cleaners.functions import remove
    >>> output = remove("يُدَرِّسُ اللُّغَةَ العَرَبِيَّةَ الفُصْحَى", custom_expressions=r"اللغة")
    >>> output
    'يُدَرِّسُ اللُّغَةَ العَرَبِيَّةَ الفُصْحَى'
    

    Suggested:

    >>> from maha.cleaners.functions import remove
    >>> output = remove("يُدَرِّسُ اللُّغَةَ العَرَبِيَّةَ الفُصْحَى", custom_expressions=r"اللغة", ignore_harakat=True)
    >>> output
    'يُدَرِّسُ العَرَبِيَّةَ الفُصْحَى'
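
One possible way to get the suggested behavior, sketched with the standard `re` module only. The function name `remove_ignoring_harakat` and the exact set of code points treated as Harakat are assumptions for illustration; this is not how Maha's `remove` is implemented.

```python
import re

# Assumed Harakat set: Fathatan through Sukun (U+064B-U+0652)
# plus the superscript (dagger) Alef, U+0670.
HARAKAT = "\u064B-\u0652\u0670"


def remove_ignoring_harakat(text: str, expression: str) -> str:
    """Remove a literal `expression` from `text`, ignoring Harakat on both sides."""
    # Drop any Harakat the caller included in the expression itself.
    bare = re.sub(f"[{HARAKAT}]", "", expression)
    # Allow any run of Harakat after every letter of the expression.
    pattern = "".join(re.escape(ch) + f"[{HARAKAT}]*" for ch in bare)
    cleaned = re.sub(pattern, "", text)
    # Collapse the double space left behind by the removed word.
    return re.sub(r" {2,}", " ", cleaned).strip()
```

The expression is treated as a literal string here; supporting arbitrary regular expressions, as `custom_expressions` does, would need a more careful rewrite of the pattern.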
    

    Definition of Done

    • It must adhere to the coding style used in the defined cleaner functions.
    • The implementation should cover most use cases.
    • Adding tests
    feature request 
    opened by xaleel 1
  • Wrong parsed name using name dimension


    What happened?

    The name parser extracts incorrect names such as بي and شكرا.

    Example: text: أريد البحث في سجل الإنفاق الخاص بي [Dimension(body=بي, value=بي, start=32, end=34, dimension_type=DimensionType.NAME)]

    I expected the parser to extract only names that appear in the names dataset.

    Python version

    3.8

    What operating system are you using?

    Linux

    Code to reproduce the issue

    >>> from maha.parsers.functions import parse_dimension
    >>> text = "أريد البحث في سجل الإنفاق الخاص بي"
    >>> extracted = parse_dimension(text, names=True)
    >>> extracted
    [Dimension(body=بي, value=بي, start=32, end=34, dimension_type=DimensionType.NAME)]
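
The expected behavior, keeping only tokens that appear in a list of known names, can be sketched as a plain lookup. `NameMatch`, `find_known_names`, and the `known_names` set are hypothetical and stand in for the names dataset; this does not reflect how the name dimension is implemented.

```python
import re
from typing import List, NamedTuple, Set


class NameMatch(NamedTuple):
    """A matched name with its character offsets in the source text."""

    body: str
    start: int
    end: int


def find_known_names(text: str, known_names: Set[str]) -> List[NameMatch]:
    """Return whitespace-delimited tokens that are in `known_names`."""
    return [
        NameMatch(m.group(), m.start(), m.end())
        for m in re.finditer(r"\S+", text)
        if m.group() in known_names
    ]
```

With this approach a token such as بي is returned only if the dataset itself lists it as a name, so false positives reduce to a question of dataset quality.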
    

    Relevant log output

    No response

    bug parsing 
    opened by PAIN-BARHAM 0
  • Add feature to parse duration period


    What problem are you trying to solve?

    Parse a duration from text that expresses it as the span between two dates.

    Examples (if relevant)

    >>> from maha.parsers.functions import parse_dimension
    >>> output = parse_dimension('عن ربع نمو سكان العالم القديم والتحضر بين 1700 و 1900 ميلادي', duration=True)[0].value
    >>> output
    DurationValue(values=[ValueUnit(value=200, unit=<DurationUnit.YEARS: 7>)], normalized_unit=<DurationUnit.SECONDS: 1>)
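
The arithmetic behind the requested output can be sketched with a regular expression that finds a "بين <year> و <year>" range and returns the difference in years. The function name `year_span` and the restriction to three- or four-digit years are assumptions for illustration; a real implementation would return a `DurationValue` and cover more date forms.

```python
import re
from typing import Optional

BETWEEN = "\u0628\u064A\u0646"  # بين ("between")
AND = "\u0648"  # و ("and")

# Matches "بين 1700 و 1900". In Python 3, \d also matches Arabic-Indic digits.
YEAR_RANGE = re.compile(BETWEEN + r"\s+(\d{3,4})\s+" + AND + r"\s*(\d{3,4})")


def year_span(text: str) -> Optional[int]:
    """Return the number of years between the two years of a range, if any."""
    match = YEAR_RANGE.search(text)
    if match is None:
        return None
    start, end = int(match.group(1)), int(match.group(2))
    return abs(end - start)
```

For the example sentence above, the two captured years are 1700 and 1900, which gives the 200 years shown in the expected `DurationValue`.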
    
    

    Definition of Done

    • It must adhere to the coding style used in the existing dimensions, in particular the duration dimension.
    • The implementation should cover most use cases.
    • Adding tests
    feature request 
    opened by PAIN-BARHAM 1
  • Adding the parser functionality to Processors


    What problem are you trying to solve?

    Adding the parser functionality to Processors to parse different dimensions.

    Examples (if relevant)

    >>> from pathlib import Path
    >>> import maha
    >>> resource_path = Path(maha.__file__).parents[1] / "sample_data/tweets.txt"
    >>> data = resource_path.read_text()
    >>> print(data)
    
    الساعة الآن 12:00 في اسبانيا 🇪🇸, انتهى بشكل رسمي عقد الأسطورة ليو ميسي مع برشلونة . .
    طبعا بكونو حاطين المكيف ع٣ مئوية وخود تقلبات وبرد وحر وCNS وزعيق المراقب وألف نيلة وقر فتحت اشوف درجة الحرارة هتبقي كام يو الامتحان لقيتها ٤٢ والامتحان الساعه ١ فعايز انورماليز اننا ننزل بالفالنه الحمالات Hot fac
    يسعدلي مساكم ❤🌹 شرح كلمة zwa هالمنشور رح تلاقو (zwar) سهل و لذيذ (aber) ناقصو شوية ملح وكزبر #منقو
    مـعلش استحملوني ب الاصفر هالفتره 💛 #ريشـه هههههههه
    لما حد يسالني بتختفي كتير لية =..
    زيِّنوا ليلة الجمع بالصلاة على النَّبِيِّ ﷺ" ❤
    #Windows11 is on the horizon. What feature are you looking forward to
    Get vaccinate #savethesaviour
    Today I am beginning project on 10 days duratio #30daysofcod #DEVCommunit
    
    >>> from maha.processors import FileProcessor
    >>> proc = FileProcessor(resource_path)
    >>> parsed = proc.parse_dimension(time=True)
    >>> parsed
    [Dimension(body=الساعة الآن 12:00, value=TimeValue(years=0, months=0, days=0, hours=0, minutes=0, seconds=0, hour=12, minute=0, second=0, microsecond=0), start=0, end=17, dimension_type=DimensionType.TIME),
     Dimension(body=الساعه ١, value=TimeValue(hour=1, minute=0, second=0, microsecond=0), start=238, end=246, dimension_type=DimensionType.TIME),
     Dimension(body=ليلة, value=TimeValue(am_pm='PM'), start=491, end=495, dimension_type=DimensionType.TIME)]
    
    

    Definition of Done

    • It must adhere to the coding style.
    • The implementation should cover most use cases.
    • Adding tests.
    good first issue feature request parsing 
    opened by PAIN-BARHAM 0
Releases(v0.3.0)
Owner
Mohammad Al-Fetyani
Machine Learning Engineer