Python Stop Words

Overview

Get a list of common stop words in various languages in Python.

Available languages

  • Arabic
  • Bulgarian
  • Catalan
  • Czech
  • Danish
  • Dutch
  • English
  • Finnish
  • French
  • German
  • Hungarian
  • Indonesian
  • Italian
  • Norwegian
  • Polish
  • Portuguese
  • Romanian
  • Russian
  • Spanish
  • Swedish
  • Turkish
  • Ukrainian

Installation

stop-words is available on PyPI:

http://pypi.python.org/pypi/stop-words

Install it with pip:

$ pip install stop-words

Alternatively, clone the stop-words git repository:

$ git clone --recursive git://github.com/Alir3z4/python-stop-words.git

Then install it by running:

$ python setup.py install

Basic usage

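from stop_words import get_stop_words

# Either the language code or the full language name can be used.
stop_words = get_stop_words('en')
stop_words = get_stop_words('english')

from stop_words import safe_get_stop_words

# Returns an empty list for an unsupported language instead of raising.
stop_words = safe_get_stop_words('unsupported language')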
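
Filtering a text

The calls above only return a list of words. The following is a minimal sketch of one way to use such a list to strip stop words from a text and to add custom words of your own. The names remove_stop_words, extra_stop_words and sample_stop_words are hypothetical helpers written for this example and are not part of the stop-words package. The tokenizer is a deliberately naive regular expression, and a short hand-written list stands in for get_stop_words('en') so the sketch runs without the package installed.

```python
import re


def remove_stop_words(text, stop_words, extra_stop_words=()):
    """Return the words of `text` that are not stop words (case-insensitive).

    Hypothetical helper for illustration; not part of the stop-words package.
    """
    # A set gives fast membership tests; lower-casing makes matching case-insensitive.
    blocked = {word.lower() for word in stop_words}
    blocked.update(word.lower() for word in extra_stop_words)
    # Naive tokenizer (an assumption of this sketch): runs of word characters
    # and apostrophes. Real text may need a proper tokenizer.
    tokens = re.findall(r"[\w']+", text)
    return [token for token in tokens if token.lower() not in blocked]


# Stand-in for get_stop_words('en'), so the sketch runs without the package.
sample_stop_words = ["the", "of", "is", "in", "it", "a", "one"]

text = "The University of Waterloo Stratford Campus is located in Stratford"
filtered = remove_stop_words(text, sample_stop_words, extra_stop_words=["campus"])
print(" ".join(filtered))  # University Waterloo Stratford located Stratford
```

With the package installed, the stand-in list would presumably be replaced by the result of get_stop_words('en'), and any domain-specific words can be passed through extra_stop_words.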

Python compatibility

Python Stop Words is compatible with:

  • Python 2.7
  • Python 3.4
  • Python 3.5
  • Python 3.6
  • Python 3.7

Comments
  • Enforces packaging of eggs into folders.

    We had an error in our CI pipeline where a package build would fail since the .egg of stop-words is downloaded as a zip.

    This leads to the following error where the initializer tries to open a directory when it is actually a zip archive.

    Not a directory: '/opt/project/.eggs/stop_words-2015.2.23.1-py3.6.egg/stop_words/stop-words/languages.json'

    opened by hfjn 10
  • add indonesian stop word list

    Add stop word list for indonesian language, added mapping to JSON file. Source: https://www.illc.uva.nl/Research/Publications/Reports/MoL-2003-02.text.pdf

    opened by frankdevans 4
  • can you handle a text?

    Hello, there is no description of how to use this. I have a text:

    The University of Waterloo Stratford Campus is located in Stratford Ontario Canada. It is one of the three satellite campuses of the University of Waterloo a member of the U15 Group of Canadian Research Universities. Established in June 2009 the University of Waterloo Stratford Campus is part of the Faculty of Arts at the University of Waterloo.

    How can I use python-stop-words to filter out the stop words and get a text without them?

    thank you very much!!

    question 
    opened by PapaMadeleine2022 2
  • Python 3 support

    List of improvements:

    • Tests
    • Python 3 support
    • Dev installation via zc.buildout
    • Continuous integration via Travis

    Can you make a new release once the branch is merged?

    Regards

    enhancement 
    opened by Fantomas42 2
  • languages.json is missing, if you don't git clone with `--recursive`

    languages.json is still missing, if you don't clone with --recursive

    $ git clone git://github.com/Alir3z4/python-stop-words.git
    $ cd python-stop-words
    $ python3 setup.py install
    Traceback (most recent call last):
      File "setup.py", line 5, in <module>
        version=__import__("stop_words").get_version(),
      File "./stop_words/__init__.py", line 9, in <module>
        with open(os.path.join(STOP_WORDS_DIR, 'languages.json'), 'rb') as map_file:
    FileNotFoundError: [Errno 2] No such file or directory: './stop_words/stop-words/languages.json'

    opened by marcindulak 1
  • Update submodule to the latest

    Include the stops for newly added languages

    https://github.com/Alir3z4/stop-words/pull/4 https://github.com/Alir3z4/stop-words/pull/5 https://github.com/Alir3z4/stop-words/pull/6 https://github.com/Alir3z4/stop-words/pull/7

    enhancement 
    opened by norkans7 1
  • Decode error AND Add catalan language to LANGUAGE_MAPPING

    1. Add catalan language to LANGUAGE_MAPPING. I previously added the stop words file to the "stop-words" project.

    2. Decode error

    stop_words = [line.strip().decode('utf-8')
                 for line in language_file.readlines()]
    

    strip() returns a copy of the string with leading and trailing whitespace characters removed. But if the string contains non-ASCII characters, calling strip() before decoding causes a UnicodeDecodeError (e.g. UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 34: unexpected end of data).

    The workaround is to reorder the call:

    stop_words = [line.decode('utf-8').strip()
                 for line in language_file.readlines()]
    
    opened by dmiro 1
  • Defining custom stop words in NLTK

    Hi, I want to know how to define our own custom stop words. I'm currently developing a sentiment analysis project in my local language, using a Naive Bayes classifier to classify the text. I'm quite new to this type of NLP project, so sorry if there's a method that I missed.

    Hope you can help me thanks.

    opened by AllikDaniel 0
  • Example not work on python 3.7.0

    It returns an empty list: []

    from stop_words import get_stop_words
    
    stop_words = get_stop_words('en')
    stop_words = get_stop_words('english')
    
    from stop_words import safe_get_stop_words
    
    stop_words = safe_get_stop_words('unsupported language')
    print(stop_words)
    
    opened by nadavvin 2

Releases (2018.7.23)
  • 2018.7.23(Jul 23, 2018)

    • Fixed #14: languages.json is missing, if you don't git clone with --recursive.
    • Feature: Support latest version of Python (3.7+).
    • Feature #22: Enforces packaging of eggs into folders.
    • Update the stop-words repository to get the latest languages.
    • Fixed Travis failing and tests due to bootstrap.

    PyPI: https://pypi.org/project/stop-words/2018.7.23/

    To install:

    $ pip install stop-words==2018.7.23
    
  • 2015.2.23.1(Feb 23, 2015)

  • 2015.2.23(Feb 23, 2015)

    • Feature: Using the cache is optional
    • Feature: Filtering stopwords

    Special thanks to Taras Labiak @kissarat

    PyPi: https://pypi.python.org/pypi/stop-words/2015.2.21

  • 2015.2.21(Feb 21, 2015)

    • Feature: LANGUAGE_MAPPING is loaded from stop-words/languages.json
    • Fix: Made paths OS-independent

    PyPi: https://pypi.python.org/pypi/stop-words/2015.2.21

    Special thanks to Taras Labiak @kissarat

  • 2015.1.31(Feb 1, 2015)

  • 2015.1.22(Jan 22, 2015)

    • Feature: Tests
    • Feature: Python 3 support
    • Feature: Dev installation via zc.buildout
    • Feature: Continuous integration via Travis

    pypi: https://pypi.python.org/pypi/stop-words/2015.1.22

  • 2015.1.19(Jan 19, 2015)

Owner
Alireza Savand
I am Alireza Savand, a Software Architect.