Syllabic Quantity Patterns as Rhythmic Features for Latin Authorship Attribution

Overview

Syllabic Quantity Patterns as Rhythmic Features for Latin Authorship Attribution

Abstract

Within the Latin (and ancient Greek) production, it is well known that peculiar metric schemes were followed not only in poetic compositions, but also in many prose works. Such metric patterns were based on syllabic quantity, i.e., on on the length of the involved syllables (which can be long or short), and there is much evidence suggesting that certain authors held a preference for certain rhythmic schemes over others.
In this project, we investigate the possibility to employ syllabic quantity as a base to derive rhythmic features for the task of computational Authorship Attribution of Latin prose texts. We test the impact of these features on the attribution task when combined with other topic-agnostic features, employing three datasets and two different learning algorithms.

Syllabic Quantity for Authorship Attribution

Authorship Attribution (AA) is a subtask of the field of Authorship Analysis, which aims to infer various characteristics of the writer of a document, its identity included. In particular, given a set of candidate authors A1... Am and a document d, the goal of AA is to find the most probable author for the document d among the set of candidates; AA is thus a single-label multi-class classification problem, where the classes are the authors in the set.
In this project, we investigate the possibility to employ features extracted from the quantity of the syllables in a document as discriminative features for AA on Latin prose texts. Syllables are sound units a single word can be divided into; in particular, a syllable can be thought as an oscillation of sound in the word pronunciation, and is characterized by its quantity (long or short), which is the amount of time required to pronounce it. It is well known that classical Latin (and Greek) poetry followed metric patterns based on sequences of short and long syllables. In particular, syllables were combined in what is called a "foot", and in turn a series of "feet" composed the metre of a verse. Yet, similar metric schemes were followed also in many prose compositions, in order to give a certain cadence to the discourse and focus the attention on specific parts. In particular, the end of sentences and periods was deemed as especially important in this sense, and known as clausola. During the Middle Ages, Latin prosody underwent a gradual but profound change. The concept of syllabic quantity lost its relevance as a language discriminator, in favour of the accent, or stress. However, Latin accentuation rules are largely dependent on syllabic quantity, and medieval writers retained the classical importance of the clausola, which became based on stresses and known as cursus. An author's practice of certain rhythmic patterns might thus play an important role in the identification of that author's style.

Datasets

In this project, we employ 3 different datasets:

  • LatinitasAntiqua. The texts can be automatically downloaded with the script in the corresponding code file (src/dataset_prep/LatinitasAntiqua_prep.py). They come from the Corpus Corporum repository, developed by the University of Zurich, and in particular its sub-section called Latinitas antiqua, which contains various Latin works from the Perseus Digital library; in total, the corpus is composed of 25 Latin authors and 90 prose texts, spanning through the Classical, Imperial and Early-Medieval periods, and a variety of genres (mostly epistolary, historical and rhetoric).
  • KabalaCorpusA. The texts can be downloaded from the followig [link](https://www.jakubkabala.com/gallus-monk/). In particular, we use Corpus A, which consists of 39 texts by 22 authors from the 11-12th century.
  • MedLatin. The texts can be downloaded from the following link: . Originally, the texts were divided into two datasets, but we combine them together. Note that we exclude the texts from the collection of Petrus de Boateriis, since it is a miscellanea of authors. We delete the quotations from other authors and the insertions in languages other than Latin, marked in the texts.
The documents are automatically pre-processed in order to clean them from external information and noise. In particular, headings, editors' notes and other meta-information are deleted where present. Symbols (such as asterisks or parentheses) and Arabic numbers are deleted as well. Punctuation marks are normalized: every occurrence of question and exclamation points, semicolons, colons and suspension points are exchanged with a single point, while commas are deleted. The text is finally lower-cased and normalized: the character v is exchanged with the character u and the character j with the character i, and every stressed vowels is exchanged with the corresponding non-stressed vowel. As a final step, each text is divided into sentences, where a sentence is made of at least 5 distinct words (shorter sentences are attached to the next sentence in the sequence, or the previous one in case it is the last one in the document). This allows to create the fragments that ultimately form the training, validation and and test sets for the algorithms. In particular, each fragment is made of 10 consecutive, non-overlapping sentences.

Experiments

In order to transform the Latin texts into the corresponding syllabic quantity (SQ) encoding, we employ the prosody library available on the [Classical Language ToolKit](http://cltk.org/).
We also experiment with the four Distortion Views presented by [Stamatatos](https://asistdl.onlinelibrary.wiley.com/doi/full/10.1002/asi.23968?casa_token=oK9_O2SOpa8AAAAA%3ArLsIRzk4IhphR7czaG6BZwLmhh9mk4okCj--kXOJolp1T70XzOXwOw-4vAOP8aLKh-iOTar1mq8nN3B7), which, given a list Fw of function words, are:

  • Distorted View – Multiple Asterisks (DVMA): every word not included in Fw is masked by replacing each of its characters with an asterisk.
  • Distorted View – Single Asterisk (DVSA): every word not included in Fw is masked by replacing it with a single asterisk.
  • Distorted View – Exterior Characters (DVEX): every word not included in Fw is masked by replacing each of its characters with an asterisk, except the first and last one.
  • Distorted View – Last 2 (DVL2): every word not included in Fw is masked by replacing each of its characters with an asterisk, except the last two characters.
BaseFeatures (BFs) and it's made of: function words, word lengths, sentence lengths.
We experiment with two different learning methods: Support Vector Machine and Neural Network. All the experiments are conducted on the same train-validation-test split.
For the former, we compute the TfIdf of the character n-grams in various ranges, extracted from the various encodings of the text, which we concatenate to BaseFeatures, and feed the resulting features matrix to a LinearSVC implemented in the [scikit-learn package](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html).
For the latter, we compute various parallel, identical branches, each one processing a single encoding or the Bfs matrix, finally combining the different outputs into a single decision layer. The network is implimented with the [PyTorch package](https://pytorch.org/). Each branch outputs a matrix of probabilities, which are stacked together, and an average-pooling operation is applied in order to obtain the average value of the decisions of the different branches. The final decision is obtained through a final dense layer applying a softmax (for training) or argmax (for testing) operation over the classes probabilities. The training of the network is conducted with the traditional backpropagation method; we employ cross-entropy as the loss function and the Adam optimizer.
We employ the macro-F1 and micro-F1 as measures in order to assess the performance of the methods. For each method employing SQ-based features, we compute the statistical significance against its baseline (the same method without SQ-based features); to this aim, we employ the McNemar's paired non-parametric statistical hypothesis test, taking $0.05$ as significance value.

NN architecture

Code

The code is organized as follows int the src directory:

  • NN_models: the directory contains the code to build the Neural Networks tried in the project, one file for each architecture. The one finally used in the project is in the file NN_cnn_deep_ensemble.py.
  • dataset_prep: the directory contains the code to preprocess the various dataset employed in the project. The file NN_dataloader.py prepares the data to be processed for the Neural Network.
  • general: the directory contains: a) helpers.py, with various functions useful for the current project; b) significance.py, with the code for the significance test; c) utils.py, with more comme useful functions; d) visualization.py, with functions for drawing graphics and similar.
  • NN_classification.py: it performs the Neural Networks experiments.
  • SVM_classification.py: it performs the Support Vector Machine experiments.
  • feature_extractor.py: it extract the features for the SVM experiments.
  • main.py

Uncertainty Estimation via Response Scaling for Pseudo-mask Noise Mitigation in Weakly-supervised Semantic Segmentation

Uncertainty Estimation via Response Scaling for Pseudo-mask Noise Mitigation in Weakly-supervised Semantic Segmentation Introduction This is a PyTorch

XMed-Lab 30 Sep 23, 2022
classify fashion-mnist dataset with pytorch

Fashion-Mnist Classifier with PyTorch Inference 1- clone this repository: git clone https://github.com/Jhamed7/Fashion-Mnist-Classifier.git 2- Instal

1 Jan 14, 2022
[CVPR 2022] Pytorch implementation of "Templates for 3D Object Pose Estimation Revisited: Generalization to New objects and Robustness to Occlusions" paper

template-pose Pytorch implementation of "Templates for 3D Object Pose Estimation Revisited: Generalization to New objects and Robustness to Occlusions

Van Nguyen Nguyen 92 Dec 28, 2022
Learning recognition/segmentation models without end-to-end training. 40%-60% less GPU memory footprint. Same training time. Better performance.

InfoPro-Pytorch The Information Propagation algorithm for training deep networks with local supervision. (ICLR 2021) Revisiting Locally Supervised Lea

78 Dec 27, 2022
Data Augmentation with Variational Autoencoders

Documentation Pyraug This library provides a way to perform Data Augmentation using Variational Autoencoders in a reliable way even in challenging con

112 Nov 30, 2022
Code for the paper "Offline Reinforcement Learning as One Big Sequence Modeling Problem"

Trajectory Transformer Code release for Offline Reinforcement Learning as One Big Sequence Modeling Problem. Installation All python dependencies are

Michael Janner 266 Dec 27, 2022
A new benchmark for Icon Question Answering (IconQA) and a large-scale icon dataset Icon645.

IconQA About IconQA is a new diverse abstract visual question answering dataset that highlights the importance of abstract diagram understanding and c

Pan Lu 24 Dec 30, 2022
This project uses ViT to perform image classification tasks on DATA set CIFAR10.

Vision-Transformer-Multiprocess-DistributedDataParallel-Apex Introduction This project uses ViT to perform image classification tasks on DATA set CIFA

Kaicheng Yang 3 Jun 03, 2022
clustimage is a python package for unsupervised clustering of images.

clustimage The aim of clustimage is to detect natural groups or clusters of images. Image recognition is a computer vision task for identifying and ve

Erdogan Taskesen 52 Jan 02, 2023
General purpose GPU compute framework for cross vendor graphics cards (AMD, Qualcomm, NVIDIA & friends)

General purpose GPU compute framework for cross vendor graphics cards (AMD, Qualcomm, NVIDIA & friends). Blazing fast, mobile-enabled, asynchronous and optimized for advanced GPU data processing usec

The Kompute Project 1k Jan 06, 2023
Pytorch implementation of Depth-conditioned Dynamic Message Propagation forMonocular 3D Object Detection

DDMP-3D Pytorch implementation of Depth-conditioned Dynamic Message Propagation forMonocular 3D Object Detection, a paper on CVPR2021. Instroduction T

Li Wang 32 Nov 09, 2022
Pytorch reimplement of the paper "A Novel Cascade Binary Tagging Framework for Relational Triple Extraction" ACL2020. The original code is written in keras.

CasRel-pytorch-reimplement Pytorch reimplement of the paper "A Novel Cascade Binary Tagging Framework for Relational Triple Extraction" ACL2020. The o

longlongman 170 Dec 01, 2022
Learning to Adapt Structured Output Space for Semantic Segmentation, CVPR 2018 (spotlight)

Learning to Adapt Structured Output Space for Semantic Segmentation Pytorch implementation of our method for adapting semantic segmentation from the s

Yi-Hsuan Tsai 782 Dec 30, 2022
Collection of generative models in Pytorch version.

pytorch-generative-model-collections Original : [Tensorflow version] Pytorch implementation of various GANs. This repository was re-implemented with r

Hyeonwoo Kang 2.4k Dec 31, 2022
AI that generate music

PianoGPT ai that generate music try it here https://share.streamlit.io/annasajkh/pianogpt/main/main.py or here https://huggingface.co/spaces/Annas/Pia

Annas 28 Nov 27, 2022
A repository built on the Flow software package to explore cyber-security attacks on intelligent transportation systems.

A repository built on the Flow software package to explore cyber-security attacks on intelligent transportation systems.

George Gunter 4 Nov 14, 2022
Permeability Prediction Via Multi Scale 3D CNN

Permeability-Prediction-Via-Multi-Scale-3D-CNN Data: The raw CT rock cores are obtained from the Imperial Colloge portal. The CT rock cores are sub-sa

Mohamed Elmorsy 2 Jul 06, 2022
StudioGAN is a Pytorch library providing implementations of representative Generative Adversarial Networks (GANs) for conditional/unconditional image generation.

StudioGAN is a Pytorch library providing implementations of representative Generative Adversarial Networks (GANs) for conditional/unconditional image generation.

3k Jan 08, 2023
This application explain how we can easily integrate Deepface framework with Python Django application

deepface_suite This application explain how we can easily integrate Deepface framework with Python Django application install redis cache install requ

Mohamed Naji Aboo 3 Apr 18, 2022
CLIPort: What and Where Pathways for Robotic Manipulation

CLIPort CLIPort: What and Where Pathways for Robotic Manipulation Mohit Shridhar, Lucas Manuelli, Dieter Fox CoRL 2021 CLIPort is an end-to-end imitat

246 Dec 11, 2022