Maha is a text processing library specially developed to deal with Arabic text.

Overview



CI Documentation Status codecov Discord Downloads License PyPI version Code style: black Checked with mypy PyPI - Python Version

An Arabic text processing library intended for use in NLP applications


Maha is a text processing library specially developed to deal with Arabic text. The beta version can be used to clean and parse text, files, and folders with or without streaming capability.

If you need help or want to discuss topics related to Maha, feel free to reach out to our Discord server. If you would like to submit a bug report or feature request, please open an issue.

Installation

Simply run the following to install Maha:

pip install mahad # pronounced maha d

For source installation, check the documentation.

Overview

Check out the overview section in the documentation to get started with Maha.

Documentation

Documentation are hosted at ReadTheDocs.

Contributing

Maha welcomes and encourages everyone to contribute. Contributions are always appreciated. Feel free to take a look at our contribution guidelines in the documentation.

License

Maha is BSD-licensed.

Comments
  • Time: Add the ability to parse Hijri dates

    Time: Add the ability to parse Hijri dates

    What does this pull request change?

    Closes #27.

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [ ] updated the documentation
    • [x] tox passes
    new feature highlight 
    opened by TRoboto 6
  • Added distance to dimension parsing

    Added distance to dimension parsing

    What does this pull request change?

    Resolves #15.

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [x] updated the documentation
    • [x] tox passes
    parsing highlight 
    opened by TRoboto 5
  • Introduce :mod:`~.datasets` module and the first dataset, `names`, with over 40,000 unique names

    Introduce :mod:`~.datasets` module and the first dataset, `names`, with over 40,000 unique names

    What does this pull request change?

    This PR introduces a new datasets module that offers an interface for all upcoming datasets. A new dataset, names, is released along with the module. It comprises 44,161 unique names with descriptions and name origin included for most names.

    Link to updated docs: https://maha--40.org.readthedocs.build/en/40/overview.html#datasets

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [x] updated the documentation
    • [x] tox passes
    new feature highlight 
    opened by TRoboto 4
  • Add pyupgrade to pre-commit and upgrade to future-style type annotations

    Add pyupgrade to pre-commit and upgrade to future-style type annotations

    What does this pull request change?

    Upgrades to new type annotations style.

    Status (please check what you already did):

    • [ ] added some tests for the functionality
    • [ ] updated the documentation
    • [x] tox passes
    maintenance 
    opened by TRoboto 3
  • Deprecate and remove `datasets` module and host datasets on Hugging Face instead

    Deprecate and remove `datasets` module and host datasets on Hugging Face instead

    What does this pull request change?

    • Removes datasets module.
    • Datasets are now hosted here

    Status (please check what you already did):

    • [ ] added some tests for the functionality
    • [ ] updated the documentation
    • [x] tox passes
    breaking changes deprecation 
    opened by TRoboto 3
  • Add the ability to parse names from text

    Add the ability to parse names from text

    What does this pull request change?

    Adds #24. Depends on #40

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [x] updated the documentation
    • [x] tox passes
    new feature highlight 
    opened by TRoboto 3
  • Add a deprecation system

    Add a deprecation system

    What does this pull request change?

    • Closes #23
    • Adds 3 deprecation decorators; for functions, for parameters, for default parameters.

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [ ] updated the documentation
    • [x] tox passes
    development 
    opened by saedx1 3
  • Prepare for the next release of Maha (v0.3.0)

    Prepare for the next release of Maha (v0.3.0)

    This is an auto-generated PR to prepare for the next release of Maha. The following changes were automatically made:

    • Generated changelogs for release v0.3.0.
    • Bumped pypi version to v0.3.0.
    • Updated the citation information.
    opened by github-actions[bot] 2
  • Ordinal: Add support to `بعد` in ordinal parsing

    Ordinal: Add support to `بعد` in ordinal parsing

    What does this pull request change?

    Closes #48.

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [ ] updated the documentation
    • [x] tox passes
    new feature 
    opened by TRoboto 2
  • Numeral: Add support for hierarchical parsing

    Numeral: Add support for hierarchical parsing

    What does this pull request change?

    Closes #25

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [ ] updated the documentation
    • [x] tox passes
    new feature 
    opened by TRoboto 2
  • Prepare for the next release of Maha (v0.2.0)

    Prepare for the next release of Maha (v0.2.0)

    This is an auto-generated PR to prepare for the next release of Maha. The following changes were automatically made:

    • Generated changelogs for release v0.2.0.
    • Bumped pypi version to v0.2.0.
    • Updated the citation information.
    opened by github-actions[bot] 2
  • Update ci.yml

    Update ci.yml

    Check the support for python 3,10

    What does this pull request change? It checks if the library is supporting python 3.10.

    • ...

    Status (please check what you already did):

    • [ ] added some tests for the functionality
    • [ ] updated the documentation
    • [ ] tox passes
    opened by PAIN-BARHAM 1
  • Add the option to ignore Harakat when removing or replacing

    Add the option to ignore Harakat when removing or replacing

    What problem are you trying to solve?

    Currently, the cleaner functions do not consider two strings similar if they have different Harakat/diacritics, which is the correct behavior. However, it would be great if the user had the option to ignore Harakat when comparing strings.

    Examples (if relevant)

    Current:

    >> from maha.cleaners.functions import remove
    >> output = remove("يُدَرِّسُ اللُّغَةَ العَرَبِيَّةَ الفُصْحَى", custom_expressions=r"اللغة")
    >> output
    يُدَرِّسُ اللُّغَةَ العَرَبِيَّةَ الفُصْحَى
    

    Suggested:

    >> from maha.cleaners.functions import remove
    >> remove("يُدَرِّسُ اللُّغَةَ العَرَبِيَّةَ الفُصْحَى", custom_expressions=r"اللغة", ignore_harakat=True)
    >> output
    يُدَرِّسُ العَرَبِيَّةَ الفُصْحَى
    

    Definition of Done

    • It must adhere to the coding style used in the defined cleaner functions.
    • The implementation should cover most use cases.
    • Adding tests
    feature request 
    opened by xaleel 1
  • Wrong parsed name using name dimension

    Wrong parsed name using name dimension

    What happened?

    The name parser extracted wrong name likes : بي, شكرا.

    Example: text: أريد البحث في سجل الإنفاق الخاص بي [Dimension(body=بي, value=بي, start=32, end=34, dimension_type=DimensionType.NAME)]

    I expect to extract the names on the name dataset only.

    Python version

    3.8

    What operating system are you using?

    Linux

    Code to reproduce the issue

    >>> from maha.parsers.functions import parse_dimension
    >>> text = `أريد البحث في سجل الإنفاق الخاص بي`
    >>> extracted = parse_dimension(text, names=True)
    [Dimension(body=بي, value=بي, start=32, end=34, dimension_type=DimensionType.NAME)]
    

    Relevant log output

    No response

    bug parsing 
    opened by PAIN-BARHAM 0
  • Add feature to parse duration period

    Add feature to parse duration period

    What problem are you trying to solve?

    Parsing the duration from the text that has the difference between the two dates.

    Examples (if relevant)

    >>> from maha.parsers.functions import parse_dimension
    >>> output = parse_dimension('عن ربع نمو سكان العالم القديم والتحضر بين 1700 و 1900 ميلادي', duration=True)[0].value
    >>> output
    DurationValue(values=[ValueUnit(value=200, unit=<DurationUnit.YEARS: 7>)], normalized_unit=<DurationUnit.SECONDS: 1>)
    
    

    Definition of Done

    • It must adhere to the coding style used in the defined dimensions, duration dimension.
    • The implementation should cover most use cases.
    • Adding tests
    feature request 
    opened by PAIN-BARHAM 1
  • Adding the parser functionality to Processors

    Adding the parser functionality to Processors

    What problem are you trying to solve?

    Adding the parser functionality to Processors to parse different dimensions.

    Examples (if relevant)

    >>> from pathlib import Path
    >>> import maha
    >>> resource_path = Path(maha.__file__).parents[1] / "sample_data/tweets.txt"
    >>> data = resource_path.read_text()
    >>> print(data)
    
    الساعة الآن 12:00 في اسبانيا 🇪🇸, انتهى بشكل رسمي عقد الأسطورة ليو ميسي مع برشلونة . .
    طبعا بكونو حاطين المكيف ع٣ مئوية وخود تقلبات وبرد وحر وCNS وزعيق المراقب وألف نيلة وقر فتحت اشوف درجة الحرارة هتبقي كام يو الامتحان لقيتها ٤٢ والامتحان الساعه ١ فعايز انورماليز اننا ننزل بالفالنه الحمالات Hot fac
    يسعدلي مساكم ❤🌹 شرح كلمة zwa هالمنشور رح تلاقو (zwar) سهل و لذيذ (aber) ناقصو شوية ملح وكزبر #منقو
    مـعلش استحملوني ب الاصفر هالفتره 💛 #ريشـه هههههههه
    لما حد يسالني بتختفي كتير لية =..
    زيِّنوا ليلة الجمع بالصلاة على النَّبِيِّ ﷺ" ❤
    #Windows11 is on the horizon. What feature are you looking forward to
    Get vaccinate #savethesaviour
    Today I am beginning project on 10 days duratio #30daysofcod #DEVCommunit
    
    >>> from maha.processors import FileProcessor
    >>> proc = FileProcessor(resource_path)
    >>> parsed = proc.parse_dimension(time=True)
    [Dimension(body=الساعة الآن 12:00, value=TimeValue(years=0, months=0, days=0, hours=0, minutes=0, seconds=0, hour=12, minute=0, second=0, microsecond=0), start=0, end=17, dimension_type=DimensionType.TIME),
     Dimension(body=الساعه ١, value=TimeValue(hour=1, minute=0, second=0, microsecond=0), start=238, end=246, dimension_type=DimensionType.TIME),
     Dimension(body=ليلة, value=TimeValue(am_pm='PM'), start=491, end=495, dimension_type=DimensionType.TIME)]
    
    

    Definition of Done

    • It must adhere to the coding style.
    • The implementation should cover most use cases.
    • Adding tests.
    good first issue feature request parsing 
    opened by PAIN-BARHAM 0
Releases(v0.3.0)
Owner
Mohammad Al-Fetyani
Machine Learning Engineer
Mohammad Al-Fetyani
Code for paper "Role-oriented Network Embedding Based on Adversarial Learning between Higher-order and Local Features"

Role-oriented Network Embedding Based on Adversarial Learning between Higher-order and Local Features Train python main.py --dataset brazil-flights C

wang zhang 0 Jun 28, 2022
Command Line Text-To-Speech using Google TTS

cli-tts Thanks to gTTS by @pndurette! This is an interactive command line text-to-speech tool using Google TTS. Just type text and the voice will be p

ReekyStive 3 Nov 11, 2022
Modeling cumulative cases of Covid-19 in the US during the Covid 19 Delta wave using Bayesian methods.

Introduction The goal of this analysis is to find a model that fits the observed cumulative cases of COVID-19 in the US, starting in Mid-July 2021 and

Alexander Keeney 1 Jan 05, 2022
Samantha, A covid-19 information bot which will provide basic information about this pandemic in form of conversation.

Covid-19-BOT Samantha, A covid-19 information bot which will provide basic information about this pandemic in form of conversation. This bot uses torc

Neeraj Majhi 2 Nov 05, 2021
In this project, we compared Spanish BERT and Multilingual BERT in the Sentiment Analysis task.

Applying BERT Fine Tuning to Sentiment Classification on Amazon Reviews Abstract Sentiment analysis has made great progress in recent years, due to th

Alexander Leonardo Lique Lamas 5 Jan 03, 2022
Yet Another Sequence Encoder - Encode sequences to vector of vector in python !

Yase Yet Another Sequence Encoder - encode sequences to vector of vectors in python ! Why Yase ? Yase enable you to encode any sequence which can be r

Pierre PACI 12 Aug 19, 2021
PortaSpeech - PyTorch Implementation

PortaSpeech - PyTorch Implementation PyTorch Implementation of PortaSpeech: Portable and High-Quality Generative Text-to-Speech. Model Size Module Nor

Keon Lee 276 Dec 26, 2022
Library for Russian imprecise rhymes generation

TOM RHYMER Library for Russian imprecise rhymes generation. Quick Start Generate rhymes by any given rhyme scheme (aabb, abab, aaccbb, etc ...): from

Alexey Karnachev 6 Oct 18, 2022
PyTorch source code of NAACL 2019 paper "An Embarrassingly Simple Approach for Transfer Learning from Pretrained Language Models"

This repository contains source code for NAACL 2019 paper "An Embarrassingly Simple Approach for Transfer Learning from Pretrained Language Models" (P

Alexandra Chronopoulou 89 Aug 12, 2022
PyTorch implementation of Microsoft's text-to-speech system FastSpeech 2: Fast and High-Quality End-to-End Text to Speech.

An implementation of Microsoft's "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech"

Chung-Ming Chien 1k Dec 30, 2022
Türkçe küfürlü içerikleri bulan bir yapay zeka kütüphanesi / An ML library for profanity detection in Turkish sentences

"Kötü söz sahibine aittir." -Anonim Nedir? sinkaf uygunsuz yorumların bulunmasını sağlayan bir python kütüphanesidir. Farkı nedir? Diğer algoritmalard

KaraGoz 4 Feb 18, 2022
Guide: Finetune GPT2-XL (1.5 Billion Parameters) and GPT-NEO (2.7 B) on a single 16 GB VRAM V100 Google Cloud instance with Huggingface Transformers using DeepSpeed

Guide: Finetune GPT2-XL (1.5 Billion Parameters) and GPT-NEO (2.7 Billion Parameters) on a single 16 GB VRAM V100 Google Cloud instance with Huggingfa

289 Jan 06, 2023
Code for the ACL 2021 paper "Structural Guidance for Transformer Language Models"

Structural Guidance for Transformer Language Models This repository accompanies the paper, Structural Guidance for Transformer Language Models, publis

International Business Machines 10 Dec 14, 2022
DziriBERT: a Pre-trained Language Model for the Algerian Dialect

DziriBERT is the first Transformer-based Language Model that has been pre-trained specifically for the Algerian Dialect.

117 Jan 07, 2023
PRAnCER is a web platform that enables the rapid annotation of medical terms within clinical notes.

PRAnCER (Platform enabling Rapid Annotation for Clinical Entity Recognition) is a web platform that enables the rapid annotation of medical terms within clinical notes. A user can highlight spans of

Sontag Lab 39 Nov 14, 2022
내부 작업용 django + vue(vuetify) boilerplate. 짠 하면 돌아감.

Pocket Galaxy 아주 간단한 개인용, 혹은 내부용 툴을 만들어야하는데 이왕이면 웹이 편하죠? 그럴때를 위해 만들어둔 django와 vue(vuetify)로 이뤄진 boilerplate 입니다. 각 폴더에 있는 설명서대로 실행을 시키면 일단 당장 뭔가가 돌아갑니

Jamie J. Seol 16 Dec 03, 2021
🏖 Easy training and deployment of seq2seq models.

Headliner Headliner is a sequence modeling library that eases the training and in particular, the deployment of custom sequence models for both resear

Axel Springer Ideas Engineering GmbH 231 Nov 18, 2022
An implementation of the Pay Attention when Required transformer

Pay Attention when Required (PAR) Transformer-XL An implementation of the Pay Attention when Required transformer from the paper: https://arxiv.org/pd

7 Aug 11, 2022
Grading tools for Advanced NLP (11-711)Grading tools for Advanced NLP (11-711)

Grading tools for Advanced NLP (11-711) Installation You'll need docker and unzip to use this repo. For docker, visit the official guide to get starte

Hao Zhu 2 Sep 27, 2022
Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.

Pattern Pattern is a web mining module for Python. It has tools for: Data Mining: web services (Google, Twitter, Wikipedia), web crawler, HTML DOM par

Computational Linguistics Research Group 8.4k Dec 30, 2022