Python Multilingual Ucrel Semantic Analysis System

Python Multilingual Ucrel Semantic Analysis System (PyMUSAS) is currently a rule based, token level semantic tagger that can be added to any spaCy pipeline. The tagger is flexible enough to support any semantic tagset; however, the tagset we have concentrated on, and give examples for throughout the documentation, is the Ucrel Semantic Analysis System (USAS).

  • 📚 Usage Guides - What the package is, tutorials, how to guides, and explanations.
  • 🔎 API Reference - The docstrings of the library, with minimum working examples.

Install PyMUSAS

PyMUSAS can be installed on all operating systems and supports Python version >= 3.7. To install, run:

pip install pymusas

Quick example

Here is a quick example of what PyMUSAS can do using the USASRuleBasedTagger, from now on called the USAS tagger. For a full tutorial that explains all of the steps in this example, see the Using PyMUSAS tutorial in the documentation.

This example will semantically tag, at the token level, some Portuguese text. We first need to download a spaCy Portuguese model (any version will do, but we choose the small version):

python -m spacy download pt_core_news_sm

Then we load the Portuguese spaCy model, add the USAS tagger to its pipeline, and apply the pipeline to the Portuguese text:

import spacy

from pymusas.file_utils import download_url_file
from pymusas.lexicon_collection import LexiconCollection
from pymusas.spacy_api.taggers import rule_based
from pymusas.pos_mapper import UPOS_TO_USAS_CORE

# We exclude ['parser', 'ner'] as these components are typically not needed
# for the USAS tagger
nlp = spacy.load('pt_core_news_sm', exclude=['parser', 'ner'])
# Adds the tagger to the pipeline and returns the tagger 
usas_tagger = nlp.add_pipe('usas_tagger')

# Rule based tagger requires a lexicon
portuguese_usas_lexicon_url = ''
portuguese_usas_lexicon_file = download_url_file(portuguese_usas_lexicon_url)
# Includes the POS information
portuguese_lexicon_lookup = LexiconCollection.from_tsv(portuguese_usas_lexicon_file)
# excludes the POS information
portuguese_lemma_lexicon_lookup = LexiconCollection.from_tsv(portuguese_usas_lexicon_file, include_pos=False)
# Add the lexicon information to the USAS tagger within the pipeline
usas_tagger.lexicon_lookup = portuguese_lexicon_lookup
usas_tagger.lemma_lexicon_lookup = portuguese_lemma_lexicon_lookup
# Maps from the POS model tagset to the lexicon POS tagset
usas_tagger.pos_mapper = UPOS_TO_USAS_CORE

text = "O Parque Nacional da Peneda-Gerês é uma área protegida de Portugal, com autonomia administrativa, financeira e capacidade jurídica, criada no ano de 1971, no meio ambiente da Peneda-Gerês."

output_doc = nlp(text)

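print('Text\tLemma\tPOS\tUSAS Tags')
# Assumes the tagger writes its tags to the `usas_tags` token extension
for token in output_doc:
    print(f'{token.text}\t{token.lemma_}\t{token.pos_}\t{token._.usas_tags}')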

This will output the following, where the USAS tags are a list of the most likely semantic tags, ordered so that the first tag in the list is the most likely. For more information on the USAS tagset see the USAS website.

Text    Lemma   POS     USAS Tags
O       O       DET     ['Z5']
Parque  Parque  PROPN   ['M2']
Nacional        Nacional        PROPN   ['M7/S2mf']
da      da      ADP     ['Z5']
Peneda-Gerês    Peneda-Gerês    PROPN   ['Z99']
é       ser     AUX     ['A3+', 'Z5']
uma     umar    DET     ['Z99']
área    área    NOUN    ['H2/S5+c', 'X2.2', 'M7', 'A4.1', 'N3.6']
protegida       protegido       ADJ     ['O4.5/A2.1', 'S1.2.5+']
de      de      ADP     ['Z5']
Portugal        Portugal        PROPN   ['Z2', 'Z3c']
,       ,       PUNCT   ['PUNCT']
com     com     ADP     ['Z5']
autonomia       autonomia       NOUN    ['A1.7-', 'G1.1/S7.1+', 'X6+/S5-', 'S5-']
administrativa  administrativo  ADJ     ['S7.1+']
,       ,       PUNCT   ['PUNCT']
financeira      financeiro      ADJ     ['I1', 'I1/G1.1']
e       e       CCONJ   ['Z5']
capacidade      capacidade      NOUN    ['N3.2', 'N3.4', 'N5.1+', 'X9.1+', 'I3.1', 'X9.1']
jurídica        jurídico        ADJ     ['G2.1']
,       ,       PUNCT   ['PUNCT']
criada  criar   VERB    ['I3.1/B4/S2.1f', 'S2.1f%', 'S7.1-/S2mf']
no      o       ADP     ['Z5']
ano     ano     NOUN    ['T1.3', 'P1c']
de      de      ADP     ['Z5']
1971    1971    NUM     ['N1']
,       ,       PUNCT   ['PUNCT']
no      o       ADP     ['Z5']
meio    mear    ADJ     ['M6', 'N5', 'N4', 'T1.2', 'N2', 'X4.2', 'I1.1', 'M3/H3', 'N3.3', 'A4.1', 'A1.1.1', 'T1.3']
ambiente        ambientar       NOUN    ['W5', 'W3', 'E1', 'Y2', 'O4.1']
da      da      ADP     ['Z5']
Peneda-Gerês    Peneda-Gerês    PROPN   ['Z99']
.       .       PUNCT   ['PUNCT']
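
Because the first tag in each list is the most likely one, a single tag per token can be picked by taking the first element. The following is a minimal sketch that works on plain Python lists copied from the output above rather than on a spaCy Doc; the helper name `most_likely_tag` is hypothetical and is not part of PyMUSAS.

```python
from typing import List, Optional


def most_likely_tag(usas_tags: List[str]) -> Optional[str]:
    """Return the first (most likely) tag, or None for an empty list."""
    return usas_tags[0] if usas_tags else None


# Tag lists copied from the example output for "área" and "Portugal".
area_tags = ['H2/S5+c', 'X2.2', 'M7', 'A4.1', 'N3.6']
portugal_tags = ['Z2', 'Z3c']

print(most_likely_tag(area_tags))      # H2/S5+c
print(most_likely_tag(portugal_tags))  # Z2
```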


When developing on the project you will want to install the Python package locally in editable mode with all the extra requirements. This can be done like so:

pip install -e .[tests]

For a zsh shell, which is the default shell on newer Macs, you will need to escape the brackets with \:

pip install -e .\[tests\]

Running linters and tests

This code base uses isort, flake8 and mypy to ensure that the code is consistently formatted and contains type hints. The flake8 settings can be found in ./setup.cfg and the mypy settings in ./pyproject.toml. To run these linters:

isort pymusas tests scripts
flake8
mypy

To run the tests with code coverage (NOTE: this is the code coverage that Continuous Integration (CI) reports at the top of this README; the doc tests are not part of this report):

coverage run # Runs the tests (uses pytest)
coverage report # Produces a report on the test coverage

To run the doc tests, which ensure that the examples within the documentation run as expected:

coverage run -m pytest --doctest-modules pymusas/ # Runs the doc tests
coverage report # Produces a report on the doc tests coverage

Creating a build and checking it before release

If you would like to build this project and check it with twine before release, there is a make command that can do this. The command will install build, twine, and the latest version of pip:

make check-twine


PyMUSAS is an open-source project that has been created and funded by the University Centre for Computer Corpus Research on Language (UCREL) at Lancaster University. For more information on who has contributed to this code base see the contributions page.

  • Wildcard single word lexicon rule matches (Auto Tag)

    This issue proposes supporting wildcard (*) syntax in single word lexicon files. This would also be useful for rules that cover all punctuation tokens, which should be labelled with the semantic category PUNCT, for punctuation.

    The wildcard symbol in this syntax would mean that zero or more characters may appear at the position of the wildcard in the word token and/or Part Of Speech (POS) tag. This syntax will therefore hold the same meaning in single word and Multi Word Expression files.


    Assuming the single word lexicon file:

    lemma    pos     semantic_tags
    *kg      num     N3.5
    *        punc    PUNCT

    In the first case it would allow anything that ends with kg, e.g. 15kg, to be tagged as a measurement with the N3.5 semantic tag. In the second case it would label all punctuation with the punctuation semantic tag, PUNCT.
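
The proposed lookup can be sketched with the standard library's `fnmatch` module. This is only an illustration of the idea, not how PyMUSAS implements it: the names `WILDCARD_RULES` and `wildcard_lookup` are hypothetical, and `fnmatch` also gives special meaning to `?` and `[...]`, which the proposed syntax does not mention.

```python
from fnmatch import fnmatchcase
from typing import List, Optional, Tuple

# (lemma pattern, POS pattern, semantic tags), mirroring the lexicon above.
WILDCARD_RULES: List[Tuple[str, str, List[str]]] = [
    ('*kg', 'num', ['N3.5']),
    ('*', 'punc', ['PUNCT']),
]


def wildcard_lookup(lemma: str, pos: str) -> Optional[List[str]]:
    """Return the tags of the first rule whose lemma and POS patterns match."""
    for lemma_pattern, pos_pattern, tags in WILDCARD_RULES:
        if fnmatchcase(lemma, lemma_pattern) and fnmatchcase(pos, pos_pattern):
            return tags
    return None


print(wildcard_lookup('15kg', 'num'))  # ['N3.5']
print(wildcard_lookup('.', 'punc'))    # ['PUNCT']
print(wildcard_lookup('dog', 'noun'))  # None
```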

    opened by apmoore1 0
  • Auxiliary verb rule for single word semantic lexicon lookup

    To incorporate auxiliary verb rules into the USAS Rule Based Tagger.

    Definition of auxiliary verb rules

    All POS tags used here are from the CLAWS C7 tagset.

    In English (at least in the C version of the semantic tagger) we use auxiliary verb rules for POS tags VB* (be), VD* (do), VH* (have), to determine the main and auxiliary verbs and therefore alter the semantic tag.

    An auxiliary verb would normally be given the USAS semantic tag Z5 grammatical bin, whereas the main verb would be given a non Z5 tag. For example in the sentence (format is token_USAS semantic tag) below the auxiliary verb is have and the main verb is finished:

    I_Z8mf have_Z5 finished_T2- my_Z8 lunch_F1 ._PUNC 

    We have approximately 35 rules in place for amending the semantic tags on be, do, and have after the initial set of potential semantic tags has been applied. An example rule for have is as follows:

    VH*[Z5] (RR*n) (RT*n) (XX) (RR*n) (RT*n) V*N

    If the sequence of POS tags matches a given context, VH* (POS tag for have) followed by V*N (POS tag for the word finished) with optional intervening adverbs (R* POS tags) or negation (XX POS tag), then the rule instructs the tagger to change the semantic tag on the auxiliary verb have to be Z5.

    For semantic taggers in other languages (the Java versions), we do not have auxiliary/main verb rules in place.
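
The example rule for have can be sketched as a scan over a POS tagged sentence. This is a simplified reading of the rule, not the logic of the C tagger: it treats any run of R* and XX tags as the optional intervening material and does not interpret the `n` suffix in the rule syntax. The function name, the C7 POS tags assigned to the example sentence, and the initial A9+ tag on have are assumptions made for illustration.

```python
from fnmatch import fnmatchcase
from typing import List, Tuple

TaggedToken = Tuple[str, str, str]  # (token, C7 POS tag, semantic tag)


def apply_have_auxiliary_rule(tagged: List[TaggedToken]) -> List[TaggedToken]:
    """Set the semantic tag of an auxiliary "have" to Z5."""
    result = list(tagged)
    for i, (token, pos, _) in enumerate(tagged):
        if not fnmatchcase(pos, 'VH*'):
            continue
        j = i + 1
        # Skip optional intervening adverbs (R*) and negation (XX).
        while j < len(tagged) and (
            fnmatchcase(tagged[j][1], 'R*') or tagged[j][1] == 'XX'
        ):
            j += 1
        # The rule fires only if a past participle (V*N) follows.
        if j < len(tagged) and fnmatchcase(tagged[j][1], 'V*N'):
            result[i] = (token, pos, 'Z5')
    return result


sentence = [
    ('I', 'PPIS1', 'Z8mf'),
    ('have', 'VH0', 'A9+'),  # illustrative initial tag before the rule runs
    ('finished', 'VVN', 'T2-'),
    ('my', 'APPGE', 'Z8'),
    ('lunch', 'NN1', 'F1'),
    ('.', '.', 'PUNC'),
]
print(apply_have_auxiliary_rule(sentence)[1])  # ('have', 'VH0', 'Z5')
```

When have is the main verb, as in I have lunch, no V*N tag follows and the semantic tag is left unchanged.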

    How this rule maps to spaCy pipeline through UPOS tagset

    With the UPOS tagset, and therefore with spaCy POS models, we can use the AUX POS tag instead of VB* (be), VD* (do), VH* (have). Below is the code for, and the output of, running the small English spaCy model on the sentence I have finished my lunch.:

    import spacy
    nlp = spacy.load('en_core_web_sm')
    doc = nlp('I have finished my lunch.')
    print('Token\tPOS')
    for token in doc:
        print(f'{token.text}\t{token.pos_}')

    Token	POS
    I	PRON
    have	AUX
    finished	VERB
    my	PRON
    lunch	NOUN
    .	PUNCT
    low priority Potential Future Enhancement 
    opened by apmoore1 1
  • df tag in MWE template.

    To incorporate Df tags from MWE templates to enhance the USAS Rule Based Tagger.

    Definition of Df tags

    A small number (currently 93) of English MWE templates have the tag Df, which stands for default. The Df tag refers to the semantic tag, taken from the single word semantic lexicon, of the first token in the template that starts with a wildcard (*); in the tagger output the Df tag is replaced with that semantic tag. Note that, in the C version of the semantic tagger, only the first part of any slash tag is copied across, and any gender markers (lower case letters) on the single word semantic tag are also removed during the replacement step.

    Example 1

    In the MWE template below, the semantic tag Df would be replaced with the semantic tags of the adjective (JJ) token by looking up that token's semantic tags in the single word semantic lexicon:

    mwe_template    semantic_tags
    *_JJ style_NN1    Df

    To make this more concrete, given the text:

    The acting style.

    And the following single word lexicon

    lemma    pos    semantic_tags
    acting    JJ    A1/Z3

    As well as the above MWE template, then the tokens acting style will be tagged as an MWE with the A1 semantic tag.

    Example 2

    Some MWE templates can include membership to more than one semantic category using the slash (/) notation, here is an example of how the Df tag is processed in these cases. Given the MWE template:

    mwe_template    semantic_tags
    *_JJ style_NN1    C1/Df

    given the text:

    The acting style.

    And the following single word lexicon

    lemma    pos    semantic_tags
    acting    JJ    A1/Z3

    Then the tokens acting style will be tagged as an MWE with the C1/A1 semantic tags.

    Problems with the definition

    The definition states that the Df tag refers to the first token starting with a wildcard (*). This means that an MWE template with a Df tag must contain a word token element starting with a wildcard. If no such token exists in the template, then a warning should be issued.
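
The replacement step described in this issue can be sketched as follows. This is a minimal illustration under several assumptions, not the tagger's implementation: the function name `resolve_df` is hypothetical, the MWE is assumed to be contiguous so that template elements and matched tokens line up by index, and each lexicon entry is assumed to hold a single semantic tag.

```python
import re
import warnings
from typing import Dict, List, Tuple


def resolve_df(mwe_template: str,
               template_tag: str,
               tokens: List[Tuple[str, str]],
               single_word_lexicon: Dict[Tuple[str, str], str]) -> str:
    """Replace Df in template_tag using the first wildcard token's tag."""
    elements = mwe_template.split()
    wildcard_index = next(
        (i for i, element in enumerate(elements) if element.startswith('*')),
        None,
    )
    if wildcard_index is None:
        # Problem case from the definition: no wildcard token in the template.
        warnings.warn(f'No wildcard token in template {mwe_template!r}')
        return template_tag
    single_word_tag = single_word_lexicon[tokens[wildcard_index]]
    # Only the first part of a slash tag is copied across.
    first_part = single_word_tag.split('/')[0]
    # Gender markers (lower case letters) are removed.
    replacement = re.sub(r'[a-z]', '', first_part)
    parts = template_tag.split('/')
    return '/'.join(replacement if part == 'Df' else part for part in parts)


lexicon = {('acting', 'JJ'): 'A1/Z3'}
tokens = [('acting', 'JJ'), ('style', 'NN1')]
print(resolve_df('*_JJ style_NN1', 'Df', tokens, lexicon))     # A1
print(resolve_df('*_JJ style_NN1', 'C1/Df', tokens, lexicon))  # C1/A1
```

The two calls reproduce Example 1 and Example 2. A token that is missing from the lexicon raises a KeyError in this sketch; a real tagger would need to decide how to handle that case.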

    low priority Potential Future Enhancement 
    opened by apmoore1 1
  • Multi Word Expressions

    To incorporate Multi Word Expressions (MWE) into the rule based tagger. The MWEs will come from MWE lexicons, examples of which can be found in the Multilingual USAS GitHub repository, e.g. the Spanish MWE lexicon.


    • [x] Define the MWE template. An updated definition of the MWE syntax can be found within the documentation notes.
    • [x] Incorporate MWE templates that do not have a special syntax, e.g. all templates that do not use the wildcard (*) or curly brace syntax ({}).
    • [x] Incorporate MWE templates that use the wildcard (*) special syntax.
    • [ ] Incorporate MWE templates that use the curly brace ({}) special syntax, which covers discontinuous MWEs.
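
To illustrate the second and third items above, the sketch below matches a contiguous MWE template, with or without wildcards, against a POS tagged sentence. It is not the PyMUSAS implementation: the function name `find_mwe_matches` is hypothetical, the template format (space separated word_POS elements) is taken from the templates shown elsewhere in these issues, and the curly brace syntax for discontinuous MWEs is not handled.

```python
from fnmatch import fnmatchcase
from typing import List, Tuple


def find_mwe_matches(mwe_template: str,
                     tagged_tokens: List[Tuple[str, str]]) -> List[Tuple[int, int]]:
    """Return (start, end) token spans, end exclusive, matching the template."""
    elements = mwe_template.split()
    items = [f'{token}_{pos}' for token, pos in tagged_tokens]
    matches = []
    # Slide a window of the template's length over the sentence.
    for start in range(len(items) - len(elements) + 1):
        window = items[start:start + len(elements)]
        if all(fnmatchcase(item, element)
               for item, element in zip(window, elements)):
            matches.append((start, start + len(elements)))
    return matches


text = [('The', 'AT'), ('acting', 'JJ'), ('style', 'NN1'), ('.', '.')]
print(find_mwe_matches('*_JJ style_NN1', text))       # [(1, 3)]
print(find_mwe_matches('acting_JJ style_NN1', text))  # [(1, 3)]
```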
    documentation enhancement 
    opened by apmoore1 1

