Snowball compiler and stemming algorithms

Last update: Jan 07, 2023

Overview

Snowball is a small string processing language for creating stemming algorithms for use in Information Retrieval, plus a collection of stemming algorithms implemented using it.

Snowball was originally designed and built by Martin Porter. Martin retired from development in 2014 and Snowball is now maintained as a community project. Martin originally chose the name Snowball as a tribute to SNOBOL, the excellent string handling language from the 1960s. It now also serves as a metaphor for how the project grows by gathering contributions over time.

The Snowball compiler translates a Snowball program into source code in another language - currently ISO C, C#, Go, Java, JavaScript, Object Pascal, Python and Rust are supported.

This repository contains the source code for the Snowball compiler and the stemming algorithms. The Snowball compiler is written in ISO C - you'll need a C compiler which supports C99 to build it (but the C code it generates should work with any ISO C compiler).

See https://snowballstem.org/ for more information about Snowball.

What is Stemming?

Stemming maps different forms of the same word to a common "stem" - for example, the English stemmer maps connection, connections, connective, connected, and connecting to connect. So searching for connected would also find documents which only have the other forms.
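
As a rough illustration of why this helps search, here is a minimal Python sketch. It does not use Snowball at all: toy_stem, build_index and search are hypothetical helpers, and the suffix list is a deliberately crude stand-in chosen only so that the connect example above works. The real English stemmer is far more careful.

```python
# Toy suffix stripper for illustration only; this is NOT the Snowball
# English stemmer, and the suffix list below is an assumption made up
# for this example.
SUFFIXES = ("ions", "ion", "ive", "ing", "ed", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Map each stem to the set of document ids containing a word with that stem."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.split():
            index.setdefault(toy_stem(token), set()).add(doc_id)
    return index


def search(index, query):
    """Stem the query the same way as the documents, then look it up."""
    return sorted(index.get(toy_stem(query), set()))


docs = {
    1: "a stable connection",
    2: "connecting the cables",
    3: "connective tissue",
    4: "unrelated text",
}
index = build_index(docs)
print(search(index, "connected"))  # documents 1, 2 and 3 match
```

The point is that documents and queries are stemmed with the same function, so a query for connected matches documents that contain only connection, connecting or connective.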

The stem is often a word itself, but not always: text search systems, which are the intended field of use, don't require it to be. We also aim to conflate words with the same meaning, rather than all words with a common linguistic root (so awe and awful don't have the same stem). Over-stemming is more problematic than under-stemming, so we tend not to stem in cases that are hard to resolve. If you want to always reduce words to a root form and/or get a root form which is itself a word then Snowball's stemming algorithms likely aren't the right answer.
