Fake Shakespearean Text Generator

Overview

This project contains an implementation of a stateful Char-RNN model that generates fake Shakespearean text.

Files and folders of the project.

models folder

This folder contains two zip files, one for the stateful model and one for the stateless model. These are fully saved models (architecture included), not just weights.

weights.zip

As its name implies, this zip file contains the model's weights in checkpoint format (see TensorFlow model save formats).

tokenizer.save

This file is a saved instance of the TensorFlow Tokenizer, already trained on the dataset (used at inference time).

shakespeare.txt

This file is the dataset. It consists of plain text (see the excerpt below).

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

train.py

Contains the training code.

inference.py

Contains the inference code.

How to Train the Model

A more in-depth look at the train.py file


First, the script downloads the dataset from the specified URL (line 11). It then reads the dataset and trains the tokenizer object mentioned above (line 18). After the tokenizer is trained, the dataset is encoded (line 24).

Since this is a stateful model, every sequence in a batch must start where the sequence at the same index in the previous batch left off. Say a batch consists of 32 sequences. The 33rd sequence (i.e. the first sequence in the second batch) must start exactly where the 1st sequence (i.e. the first sequence in the first batch) ended. The second sequence in the second batch must start where the second sequence in the first batch ended, and so on. The code between line 28 and line 48 does this and produces the dataset.

The code between line 53 and line 57 creates the stateful model. Note that to be able to adjust the recurrent_dropout hyperparameter, you have to train the model on a GPU. After the model is created, a callback is created that resets the states at the beginning of each epoch. Training then starts with a call to the fit method. Finally, the model (see TensorFlow's entire-model save format), the model's weights and the tokenizer are saved.
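The batch layout described above can be sketched in plain Python. This is a minimal illustration, not the code in train.py: the function name `stateful_batches` and the parameters `batch_size` and `window` are hypothetical, and the real script may well use TensorFlow's dataset utilities instead. The idea is to give each row of the batch its own contiguous slice of the encoded text, so that row i of one batch continues row i of the previous batch.

```python
def stateful_batches(encoded, batch_size, window):
    """Arrange token ids so that row i of batch k+1 continues row i of batch k.

    Hypothetical helper for illustration only (not taken from train.py).
    Returns a list of (inputs, targets) pairs; targets are the inputs
    shifted one character ahead.
    """
    # Number of batches; one extra token is reserved for the final target.
    steps = (len(encoded) - 1) // (batch_size * window)
    # Each row reads from its own contiguous stream of this many tokens.
    stream_len = steps * window

    batches = []
    for step in range(steps):
        inputs, targets = [], []
        for row in range(batch_size):
            start = row * stream_len + step * window
            inputs.append(encoded[start:start + window])
            targets.append(encoded[start + 1:start + window + 1])
        batches.append((inputs, targets))
    return batches


# Toy usage: 25 token ids, 2 sequences per batch, 3 characters per sequence.
demo = stateful_batches(list(range(25)), batch_size=2, window=3)
for inputs, _ in demo[:2]:
    print(inputs)
```

In the toy run, the first row of the second batch begins at the token that follows the last input token of the first row of the first batch, which is the property a stateful RNN relies on when it carries its hidden state from one batch to the next.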

Usage of the Model

Where the magic happens (inference.py file)


To be able to use the model, it first has to be converted to a stateless model, because a stateful model expects a batch of inputs rather than a single input. To do this, a stateless model with the same architecture as the stateful model is created. The code between line 44 and line 49 does this. The model has to be built before weights can be loaded; once it is built, the weights are loaded into the stateless model.

This model uses the character predicted at time step t as an input at time t + 1 to predict the character at t + 2, and this continues until the last character is predicted (100 characters in this case, but you can change it to whatever you want; note that longer sequences tend to give less accurate results).

To predict the next characters, the provided initial character first has to be tokenized. The preprocess function does this. To keep the generated text from getting stuck repeating the same characters, the next character is selected randomly from the candidate characters. The next_char function does this. The randomness can be controlled with the temperature parameter (to learn how to use it, check the comment at line 30). The complete_text function takes a character as an argument, predicts the next character via the next_char function and appends the predicted character to the text. It repeats this process until n_chars is reached. Finally, the stateless model is saved as well.
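The sampling loop described above can be sketched without TensorFlow. This is a rough illustration under stated assumptions, not the code in inference.py: `next_char_id`, `complete_text_sketch`, `predict_probs` and `id_to_char` are hypothetical names, and `predict_probs` stands in for the stateless model by returning a probability for each character id. Dividing log-probabilities by the temperature is one common way to control randomness; the project's own next_char function may differ in detail.

```python
import math
import random


def next_char_id(probs, temperature=1.0, rng=random):
    """Sample a character id from `probs` after temperature scaling.

    Hypothetical helper: temperature < 1 makes likely characters even more
    likely, temperature > 1 flattens the distribution.
    """
    # Work in log space; the small floor avoids log(0).
    logits = [math.log(max(p, 1e-12)) / temperature for p in probs]
    peak = max(logits)
    # Subtract the peak before exponentiating for numerical stability.
    weights = [math.exp(value - peak) for value in logits]
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]


def complete_text_sketch(text, predict_probs, id_to_char,
                         n_chars=100, temperature=1.0):
    """Append n_chars sampled characters to `text`, one at a time."""
    for _ in range(n_chars):
        # Stand-in for the model call: probabilities for the next character.
        probs = predict_probs(text)
        text += id_to_char[next_char_id(probs, temperature)]
    return text


# Toy usage with a fake "model" that slightly prefers the letter 'a'.
vocab = {0: "a", 1: "b", 2: " "}
print(complete_text_sketch("k", lambda t: [0.5, 0.3, 0.2], vocab, n_chars=20))
```

Each predicted character is fed back in as part of the input for the next prediction, which is why errors can accumulate as the generated sequence gets longer.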

Results

Effects of the magic


print(complete_text("a"))

arpet:
like revenge borning and vinged him not.

lady good:
then to know to creat it; his best,--lord


print(complete_text("k"))

ken countents.
we are for free!

first man:
his honour'd in the days ere in any since
and all this ma


print(complete_text("f"))

ford:
hold! we must percy and he was were good.

gabes:
by fair lord, my courters,
sir.

nurse:
well


print(complete_text("h"))

holdred?
what she pass myself in some a queen
and fair little heartom in this trumpet our hands?
the

Owner
Recep YILDIRIM