Unofficial Python library for using the Polish Wordnet (plWordNet / Słowosieć)

Overview

Polish Wordnet Python library

Simple, easy-to-use and reasonably fast library for using the Słowosieć (also known as PlWordNet) - a lexico-semantic database of the Polish language. PlWordNet can also be browsed here.

I created this library, because since version 2.9, PlWordNet cannot be easily loaded into Python (for example with nltk), as it is only provided in a custom plwnxml format.

Usage

Load wordnet from an XML file (this will take about 20 seconds), and print basic statistics.

import plwordnet
wn = plwordnet.load('plwordnet_4_2.xml')
print(wn)

Expected output:

PlWordnet
  lexical units: 513410
  synsets: 353586
  relation types: 306
  synset relations: 1477849
  lexical relations: 393137

Find lexical units with name leśny and print all relations, where where that unit is in the subject/parent position.

for lu in wn.lemmas('leśny'):
    for s, p, o in wn.lexical_relations_where(subject=lu):
        print(p.format(s, o))

Expected output:

leśny.2 tworzy kolokację z polana.1
leśny.2 jest synonimem mpar. do las.1
leśny.3 przypomina las.1
leśny.4 jest derywatem od las.1
leśny.5 jest derywatem od las.1
leśny.6 przypomina las.1

Print all relation types and their ids:

for id, rel in wn.relation_types.items():
    print(id, rel.name)

Expected output:

10 hiponimia
11 hiperonimia
12 antonimia
13 konwersja
...

Installation

Note: plwordnet requires at Python 3.7 or newer.

pip install plwordnet

Version support

This library should be able to read future versions of PlWordNet without modification, even if more relation types are added. Still, if you use this library with a version of PlWordNet that is not listed below, please consider contributing information if it is supported.

  • PlWordNet 4.2
  • PlWordNet 4.0
  • PlWordNet 3.2
  • PlWordNet 3.0
  • PlWordNet 2.3
  • PlWordNet 2.2
  • PlWordNet 2.1

Documentation

See plwordnet/wordnet.py for RelationType, Synset and LexicalUnit class definitions.

Package functions

  • load(source): Reads PlWordNet, where src is a path to the wordnet XML file, or a path to the pickled wordnet object. Passed paths can point to files compressed with gzip or lzma.

Wordnet instance properties

  • lexical_relations: List of (subject, predicate, object) triples
  • synset_relations: List of (subject, predicate, object) triples
  • relation_types: Mapping from relation type id to object
  • lexical_units: Mapping from lexical unit id to unit object
  • synsets: Mapping from synset id to object
  • (lexical|synset)_relations_(s|o|p): Mapping from id of subject/object/predicate to a set of matching lexical unit/synset relation ids
  • lexical_units_by_name: Mapping from lexical unit name to a set of matching lexical unit ids

Wordnet methods

  • lemmas(value): Returns a list of LexicalUnit, where the name is equal to value
  • lexical_relations_where(subject, predicate, object): Returns lexical relation triples, with matching subject or/and predicate or/and object. Subject, predicate and object arguments can be integer ids or LexicalUnit and RelationType objects.
  • synset_relations_where(subject, predicate, object): Returns synset relation triples, with matching subject or/and predicate or/and object. Subject, predicate and object arguments can be integer ids or Synset and RelationType objects.
  • dump(dst): Pickles the Wordnet object to opened file dst or to a new file with path dst.

RelationType methods

  • format(x, y, short=False): Substitutes x and y into the RelationType display format display. If short, x and y are separated by the short relation name shortcut.
Comments
  • Fix for abstract attribute bug, MAJOR speedup of synset_relations_where

    Fix for abstract attribute bug, MAJOR speedup of synset_relations_where

    Hi Max.

    I've fixed the bug related to abstract attribute of the synset (it was always True, because bool("non-empty-string") is always True)

    I've also speeded up synset_relations_where by order of 3-4 magnitudes.

    opened by dchaplinsky 7
  • Exposing relations in Wordnet class

    Exposing relations in Wordnet class

    This might be a bit an overkill, but it has two advantages.

    First is: image

    Another is that you can rewrite code like this:

    def path_to_top(synset):
        spo = []
        for rel in [11, 107, 171, 172, 199, 212, 213]:
    

    with meaningful names, not numbers

    opened by dchaplinsky 3
  • Domains dict

    Domains dict

    I've used wikipedia (https://en.wikipedia.org/wiki/PlWordNet) to decipher 45 of 54 domains listed on Słowosieć.

    There might be more: image for example, zwz

    Can you try to decipher the rest? My Polish isn't too good (yet ))

    opened by dchaplinsky 3
  • WIP: hypernyms/hyponyms/hypernym_paths routines for WordNet class

    WIP: hypernyms/hyponyms/hypernym_paths routines for WordNet class

    So, here is my attempt. I've used standard python stack for now, will let you know if it caused any problems

    I've tested it on Africa/Afryka with different combinations, all looked sane to me:

    for lu in wn.find("Afryka"):
        for i, pth in enumerate(wn.hypernym_paths(lu.synset, full_searh=True, interlingual=True)):
            print(f"{i + 1}: " + "->".join(str(s) for s in pth))
    

    gave me

    1: {kontynent.2}->{ląd.1 ziemia.4}->{obszar.1 rejon.3 obręb.1}->{przestrzeń.1}
    2: {kontynent.2}->{ląd.1 ziemia.4}->{obszar.1 rejon.3 obręb.1}->{location.1}->{object.1 physical object.1}->{physical entity.1}->{entity.1}
    3: {kontynent.2}->{ląd.1 ziemia.4}->{land.4 dry land.1 earth.3 ground.1 solid ground.1 terra firma.1}->{object.1 physical object.1}->{physical entity.1}->{entity.1}
    

    Sorry, I accidentally blacked your file, so now it has more changes than expected. The important one, though is that:

    +        # For cases like Instance_Hypernym/Instance_Hyponym
    +        for rel in self.relation_types.values():
    +            if rel.inverse is not None and rel.inverse.inverse is None:
    +                rel.inverse.inverse = rel
    
    opened by dchaplinsky 1
  • Question: hypernym/hyponym tree traversal and export

    Question: hypernym/hyponym tree traversal and export

    Hello.

    The next logical step for me is to implement tree traversal and data export. For tree traversal I'd try to stick to the following algorithm:

    • Find the true top-level hypernyms for the english and polish (no interlingual hypernymy)
    • Calculate number of leaves under each top level hypernym (and/or number of LUs under it)
    • For each node calculate the distance from top-level hypernym

    To export I'd like to use the information above and pass some callables for filtering to only export particular nodes/rels. For example, I only need first 3-4 levels of the trees for nouns, that has more than X leaves. This way I'll have a way to export and visualize only parts of the trees I need.

    Speaking of export, I'm looking into graphviz (to basically lay top level ontology on paper) and ttl, but in the format, that is similar to PWN original TTL export.

    I'd like to have your opinion on two things:

    • General approach
    • How to incorporate that into code. It might be a part of Wordnet class, a separate file (maybe under contrib section), an usage example or a separate script which I/we do or don't publish at all
    opened by dchaplinsky 1
  • Separate file and classes for domains, support for bz2 in load helper

    Separate file and classes for domains, support for bz2 in load helper

    Hi Max. I've slightly cleaned up your spreadsheet on domains (replaced TODO and dashes with nones and made POSes compatible to UD POS tagset) and wrapped everything into classes. I've also made two rows out of cwytw / cwyt and moved pl description of adj/adv into english one. I made en fields default ones for str method

    It's up to you to replace str domains in LexicalUnit with instances of Domain class as it's still ok to compare Domain to str

    I've also added support for bz2 in loader helper.

    opened by dchaplinsky 1
  • Include sentiment annotations

    Include sentiment annotations

    PlWordNet 4.2 comes with a supplementary file (słownik_anotacji_emocjonalnej.csv) containing sentiment annotations for lexical units. Users should be able to load and access sentiment data.

    enhancement 
    opened by maxadamski 1
  • Parse the description format

    Parse the description format

    Currently, nothing is done with the description field in Synset and LexicalUnit. Information about the description format comes in PlWordNets readme.

    Parsing should be done lazily to avoid slowing down the initial loading of PlWordNet into memory.

    Example description:

    ##K: og. ##D: owoc (wielopestkowiec) jabłoni. [##P: Jabłka są kształtem zbliżone do kuli, z zagłębieniem na szczycie, z którego wystaje ogonek.] {##L: http://pl.wikipedia.org/wiki/Jab%C5%82ko}
    

    Desired behavior:

    A new (memoized) method rich_description returns the following dict:

    dict(
      qualifier='og.',
      definition='owoc (wielopestkowiec) jabłoni.',
      examples=['Jabłka są kształtem zbliżone do kuli, z zagłębieniem na szczycie, z którego wystaje ogonek'],
      sources=['http://pl.wikipedia.org/wiki/Jab%C5%82ko'])
    
    enhancement 
    opened by maxadamski 1
Releases(0.1.5)
Owner
Max Adamski
Student of AI @ PUT
Max Adamski
Malware-Related Sentence Classification

Malware-Related Sentence Classification This repo contains the code for the ICTAI 2021 paper "Enrichment of Features for Malware-Related Sentence Clas

Chau Nguyen 1 Mar 26, 2022
Trains an OpenNMT PyTorch model and SentencePiece tokenizer.

Trains an OpenNMT PyTorch model and SentencePiece tokenizer. Designed for use with Argos Translate and LibreTranslate.

Argos Open Tech 61 Dec 13, 2022
Text Classification Using LSTM

Text classification is the task of assigning a set of predefined categories to free text. Text classifiers can be used to organize, structure, and categorize pretty much anything. For example, new ar

KrishArul26 3 Jan 03, 2023
A natural language modeling framework based on PyTorch

Overview PyText is a deep-learning based NLP modeling framework built on PyTorch. PyText addresses the often-conflicting requirements of enabling rapi

Meta Research 6.4k Jan 08, 2023
💛 Code and Dataset for our EMNLP 2021 paper: "Perspective-taking and Pragmatics for Generating Empathetic Responses Focused on Emotion Causes"

Perspective-taking and Pragmatics for Generating Empathetic Responses Focused on Emotion Causes Official PyTorch implementation and EmoCause evaluatio

Hyunwoo Kim 50 Dec 21, 2022
A Persian Image Captioning model based on Vision Encoder Decoder Models of the transformers🤗.

Persian-Image-Captioning We fine-tuning the Vision Encoder Decoder Model for the task of image captioning on the coco-flickr-farsi dataset. The implem

Hamtech-ai 15 Aug 25, 2022
Open Source Neural Machine Translation in PyTorch

OpenNMT-py: Open-Source Neural Machine Translation OpenNMT-py is the PyTorch version of the OpenNMT project, an open-source (MIT) neural machine trans

OpenNMT 5.8k Jan 04, 2023
【原神】自动演奏风物之诗琴的程序

疯物之诗琴 读取midi并自动演奏原神风物之诗琴。 可以自定义配置文件自动调整音符来适配风物之诗琴。 (原神1.4直播那天就开始做了!到现在才能放出来。。) 如何使用 在Release页面中下载打包好的程序和midi压缩包并解压。 双击运行“疯物之诗琴.exe”。 在原神中打开风物之诗琴,软件内输入

435 Jan 04, 2023
Implementation of Memorizing Transformers (ICLR 2022), attention net augmented with indexing and retrieval of memories using approximate nearest neighbors, in Pytorch

Memorizing Transformers - Pytorch Implementation of Memorizing Transformers (ICLR 2022), attention net augmented with indexing and retrieval of memori

Phil Wang 364 Jan 06, 2023
CoNLL-English NER Task (NER in English)

CoNLL-English NER Task en | ch Motivation Course Project review the pytorch framework and sequence-labeling task practice using the transformers of Hu

Kevin 2 Jan 14, 2022
Tutorial to pretrain & fine-tune a 🤗 Flax T5 model on a TPUv3-8 with GCP

Pretrain and Fine-tune a T5 model with Flax on GCP This tutorial details how pretrain and fine-tune a FlaxT5 model from HuggingFace using a TPU VM ava

Gabriele Sarti 41 Nov 18, 2022
DaCy: The State of the Art Danish NLP pipeline using SpaCy

DaCy: A SpaCy NLP Pipeline for Danish DaCy is a Danish preprocessing pipeline trained in SpaCy. At the time of writing it has achieved State-of-the-Ar

Kenneth Enevoldsen 71 Jan 06, 2023
Russian words synonyms and antonyms

ru_synonyms Russian words synonyms and antonyms. Install pip install git+https://github.com/ahmados/rusynonyms.git Usage from ru_synonyms import Anto

sumekenov 7 Dec 14, 2022
Source code of paper "BP-Transformer: Modelling Long-Range Context via Binary Partitioning"

BP-Transformer This repo contains the code for our paper BP-Transformer: Modeling Long-Range Context via Binary Partition Zihao Ye, Qipeng Guo, Quan G

Zihao Ye 119 Nov 14, 2022
Topic Inference with Zeroshot models

zeroshot_topics Table of Contents Installation Usage License Installation zeroshot_topics is distributed on PyPI as a universal wheel and is available

Rita Anjana 55 Nov 28, 2022
Python Implementation of ``Modeling the Influence of Verb Aspect on the Activation of Typical Event Locations with BERT'' (Findings of ACL: ACL 2021)

BERT-for-Surprisal Python Implementation of ``Modeling the Influence of Verb Aspect on the Activation of Typical Event Locations with BERT'' (Findings

7 Dec 05, 2022
Linking data between GBIF, Biodiverse, and Open Tree of Life

GBIF-biodiverse-OpenTree Linking data between GBIF, Biodiverse, and Open Tree of Life The python scripts will rely on opentree and Dendropy. To set up

2 Oct 03, 2022
Community and sentiment analysis based on tweets

The project has set itself the goal of analyzing the thoughts and interaction of Italian users through the social posts expressed through the Twitter platform on the day of the entry into force of th

3 Nov 17, 2022
Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.

Pattern Pattern is a web mining module for Python. It has tools for: Data Mining: web services (Google, Twitter, Wikipedia), web crawler, HTML DOM par

Computational Linguistics Research Group 8.4k Dec 30, 2022
DziriBERT: a Pre-trained Language Model for the Algerian Dialect

DziriBERT is the first Transformer-based Language Model that has been pre-trained specifically for the Algerian Dialect.

117 Jan 07, 2023