Finds snippets in iambic pentameter in English-language text and tries to combine them to a rhyming sonnet.

Overview

Sonnet finder

Finds snippets in iambic pentameter in English-language text and tries to combine them to a rhyming sonnet.

Usage

This is a Python script that should run without a GPU or any other special hardware requirements.

  1. Install the required packages, e.g. via: pip install -r requirements.txt

  2. Prepare a plain text file, say input.txt, with text you want to make a sonnet out of (sonnet-ize? sonnet-ify?). It can have multiple sentences on the same line, but a sentence should not be split across multiple lines.

    For example, I used pandoc --to=plain --wrap=none to generate a text file from my LaTeX papers. You could also start by grabbing some text files from Project Gutenberg.

  3. Run sonnet finder: python sonnet_finder.py input.txt -o output.tsv

    Using -o will save a list of all extracted candidate phrases, sorted by rhyme pattern, so you can generate new sonnets more quickly (see below) or browse and cherry-pick from the candidates to make your own sonnet out of these lines.

    Either way, the script will output a full example sonnet to STDOUT (provided enough rhyming pairs in iambic pentameter were found).

  4. If you've saved an output.tsv file before, you can quickly generate new sonnets via python sonnet_remix.py output.tsv. Since the stress and pronunciation prediction can be slow on larger files, this is much better than re-running sonnet_finder.py if you want more automatically generated suggestions.

Examples

This is a sonnet (with cherry-picked lines) made out of my PhD thesis:

the application of existing tools
describe a mapping to a modern form
applying similar replacement rules
the base ensembles slightly outperform

hungarian, icelandic, portuguese
perform a similar evaluation
contemporary lexemes or morphemes
a single dataset in isolation

historical and modern language stages
the weighted combination of encoder
the german dative ending -e in phrases
predictions fed into the next decoder

in this example from the innsbruck letter
machine translation still remains the better

These stanzas are compiled from a couple of automatically-generated suggestions based on the abstracts of all papers published in 2021 in the ACL Anthology:

effective algorithm that enables
improvements on a wide variety
and training with adjudicated labels
anxiety and test anxiety

obtain remarkable improvements on
decoder architecture, which equips
associated with the lexicon
surprising personal relationships

the impact of the anaphoric one
complexity prediction competition
developed for a laboratory run
existing parsers typically condition

examples, while in practice, most unseen
evaluate translation tasks between

Here's the same using Moby Dick:

among the marble senate of the dead
offensive matters consequent upon
a crawling reptile of the land, instead
fifteen, eighteen, and twenty hours on

the lakeman now patrolled the barricade
egyptian tablets, whose antiquity
the waters seemed a golden finger laid
maintains a permanent obliquity

the pequod with the little negro pippin
and with a frightful roll and vomit, he
increased, besides perhaps improving it in
transparent air into the summer sea

the traces of a simple honest heart
the fishery, and not the thousandth part

(The emjambment in the third stanza here is a lucky coincidence; the script currently doesn't do any kind of syntactic analysis or attempt coherence between lines.)

How it works

This script relies on the grapheme-to-phoneme library g2p_en by Park & Kim to convert the English input text to phoneme sequences (i.e., how the text would be pronounced). I chose this because it's a pip-installable Python library that fulfills two important criteria:

  1. it's not restricted to looking up pronunciations in a dictionary, but can handle arbitrary words through the use of a neural model (although, obviously, this will not always be accurate);

  2. it provides stress information for each vowel (i.e., whether any given vowel should be stressed or unstressed, which is important for determining the poetic meter).

The script then scans the g2p output for occurrences of iambic pentameter, i.e. a 0101010101(0) pattern, additionally checking if they coincide with word boundaries.

For finding snippets that rhyme, I rely mostly on Ghazvininejad et al. (2016), particularly §3 (relaxing the iambic pentameter a bit by allowing words that end in 100) and §5.2 (giving an operational definition of "slant rhyme" that I mostly try to follow).

QNA (Questions Nobody Asked)

  • Why does the script sometimes output lines that don't rhyme or don't fit the iambic meter? This script can only be as good as the grapheme-to-phoneme algorithm that's used. It frequently fails on words it doesn't know (for example, it tries to rhyme datasets with Portuguese?!) and also usually fails on abbreviations. Maybe there's a better g2p library that could be used, or the existing g2p_en could be modified to accept a custom dictionary, so you could manually define pronunciations for commonly used words.

  • Could this script also generate other types of poems? Sure. You could start by changing the regex iambic_pentameter to something else; maybe a sequence of dactyls? There are some further hardcoded assumptions in the code about iambic pentameter in the function get_stress_and_boundaries() that might have to be modified.

  • Could this script generate poems in languages other than English? This would require a suitable replacement for g2p_en that predicts pronunciations and stress patterns for the desired language, as well as re-writing the code that determines whether two phrases can rhyme; see the comments in the script for details. In particular, the code for English uses ARPABET notation for the pronunciation, which won't be suitable for other languages.

  • Can this script generate completely novel phrases in the style of an input text? This script does not "hallucinate" any text or generate anything that wasn't already there in the input; if you want to do that, take a look at Deep-speare maybe.

etc.

Written by Marcel Bollmann, inspired by a tweet, licensed under the MIT License.

I'm not the first one to write a script like this, but it was a fun exercise!

Owner
Marcel Bollmann
Computational linguist, postdoc, programming enthusiast.
Marcel Bollmann
NLP topic mdel LDA - Gathered from New York Times website

NLP topic mdel LDA - Gathered from New York Times website

1 Oct 14, 2021
WikiPron - a command-line tool and Python API for mining multilingual pronunciation data from Wiktionary

WikiPron WikiPron is a command-line tool and Python API for mining multilingual pronunciation data from Wiktionary, as well as a database of pronuncia

213 Jan 01, 2023
Pretty-doc - Composable text objects with python

pretty-doc from __future__ import annotations from dataclasses import dataclass

Taine Zhao 2 Jan 17, 2022
Exploration of BERT-based models on twitter sentiment classifications

twitter-sentiment-analysis Explore the relationship between twitter sentiment of Tesla and its stock price/return. Explore the effect of different BER

Sammy Cui 2 Oct 02, 2022
SimpleChinese2 集成了许多基本的中文NLP功能,使基于 Python 的中文文字处理和信息提取变得简单方便。

SimpleChinese2 SimpleChinese2 集成了许多基本的中文NLP功能,使基于 Python 的中文文字处理和信息提取变得简单方便。 声明 本项目是为方便个人工作所创建的,仅有部分代码原创。

Ming 30 Dec 02, 2022
Sequence Modeling with Structured State Spaces

Structured State Spaces for Sequence Modeling This repository provides implementations and experiments for the following papers. S4 Efficiently Modeli

HazyResearch 902 Jan 06, 2023
Word Bot for JKLM Bomb Party

Word Bot for JKLM Bomb Party A bot for Bomb Party on https://www.jklm.fun (Only English) Requirements pynput pyperclip pyautogui Usage: Step 1: Run th

Nicolas 7 Oct 30, 2022
AutoGluon: AutoML for Text, Image, and Tabular Data

AutoML for Text, Image, and Tabular Data AutoGluon automates machine learning tasks enabling you to easily achieve strong predictive performance in yo

Amazon Web Services - Labs 5.2k Dec 29, 2022
nlp基础任务

NLP算法 说明 此算法仓库包括文本分类、序列标注、关系抽取、文本匹配、文本相似度匹配这五个主流NLP任务,涉及到22个相关的模型算法。 框架结构 文件结构 all_models ├── Base_line │   ├── __init__.py │   ├── base_data_process.

zuxinqi 23 Sep 22, 2022
使用pytorch+transformers复现了SimCSE论文中的有监督训练和无监督训练方法

SimCSE复现 项目描述 SimCSE是一种简单但是很巧妙的NLP对比学习方法,创新性地引入Dropout的方式,对样本添加噪声,从而达到对正样本增强的目的。 该框架的训练目的为:对于batch中的每个样本,拉近其与正样本之间的距离,拉远其与负样本之间的距离,使得模型能够在大规模无监督语料(也可以

58 Dec 20, 2022
Anuvada: Interpretable Models for NLP using PyTorch

Anuvada: Interpretable Models for NLP using PyTorch So, you want to know why your classifier arrived at a particular decision or why your flashy new d

EDGE 102 Oct 01, 2022
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities

Hiring We are hiring at all levels (including FTE researchers and interns)! If you are interested in working with us on NLP and large-scale pre-traine

Microsoft 7.8k Jan 09, 2023
The following links explain a bit the idea of semantic search and how search mechanisms work by doing retrieve and rerank

Main Idea The following links explain a bit the idea of semantic search and how search mechanisms work by doing retrieve and rerank Semantic Search Re

Sergio Arnaud Gomez 2 Jan 28, 2022
Towards Nonlinear Disentanglement in Natural Data with Temporal Sparse Coding

Towards Nonlinear Disentanglement in Natural Data with Temporal Sparse Coding

Bethge Lab 61 Dec 21, 2022
A Domain Specific Language (DSL) for building language patterns. These can be later compiled into spaCy patterns, pure regex, or any other format

RITA DSL This is a language, loosely based on language Apache UIMA RUTA, focused on writing manual language rules, which compiles into either spaCy co

Šarūnas Navickas 60 Sep 26, 2022
Code Implementation of "Learning Span-Level Interactions for Aspect Sentiment Triplet Extraction".

Span-ASTE: Learning Span-Level Interactions for Aspect Sentiment Triplet Extraction ***** New March 31th, 2022: Scikit-Style API for Easy Usage *****

Chia Yew Ken 111 Dec 23, 2022
Create a semantic search engine with a neural network (i.e. BERT) whose knowledge base can be updated

Create a semantic search engine with a neural network (i.e. BERT) whose knowledge base can be updated. This engine can later be used for downstream tasks in NLP such as Q&A, summarization, generation

Diego 1 Mar 20, 2022
Simple Text-To-Speech Bot For Discord

Simple Text-To-Speech Bot For Discord This is a very simple TTS bot for discord made with python. For this bot you need FFMPEG, see installation to se

1 Sep 26, 2022
A repo for open resources & information for people to succeed in PhD in CS & career in AI / NLP

A repo for open resources & information for people to succeed in PhD in CS & career in AI / NLP

420 Dec 28, 2022
🦆 Contextually-keyed word vectors

sense2vec: Contextually-keyed word vectors sense2vec (Trask et. al, 2015) is a nice twist on word2vec that lets you learn more interesting and detaile

Explosion 1.5k Dec 25, 2022