Random dataframe and database table generator

Overview

Random database/dataframe generator

Authored and maintained by Dr. Tirthajyoti Sarkar, Fremont, USA

Introduction

Often, beginners in SQL or data science struggle with the matter of easy access to a large sample database file (.DB or .sqlite) for practicing SQL commands. Would it not be great to have a simple tool or library to generate a large database with multiple tables, filled with data of one's own choice?

After all, databases break every now and then and it is safest to practice with a randomly generated one :-)

https://imgs.xkcd.com/comics/exploits_of_a_mom.png

While it is easy to generate random numbers or simple words for Pandas or dataframe operation learning, it is often non-trivial to generate full data tables with meaningful yet random entries of most commonly encountered fields in the world of database, such as

  • name,
  • age,
  • birthday,
  • credit card number,
  • SSN,
  • email id,
  • physical address,
  • company name,
  • job title,

This Python package generates a random database TABLE (or a Pandas dataframe, or an Excel file) based on user's choice of data types (database fields). User can specify the number of samples needed. One can also designate a "PRIMARY KEY" for the database table. Finally, the TABLE is inserted into a new or existing database file of user's choice.

https://raw.githubusercontent.com/tirthajyoti/pydbgen/master/images/Top_image_1.png

Dependency and Acknowledgement

At its core, pydbgen uses Faker as the default random data generating engine for most of the data types. Original function is written for few data types such as realistic email and license plate. Also the default phone number generated by Faker is free-format and does not correspond to US 10 digit format. Therefore, a simple phone number data type is introduced in pydbgen. The original contribution of pydbgen is to take the single data-generating function from Faker and use it cleverly to generate Pandas data series or dataframe or SQLite database tables as per the specification of the user. Here is the link if you want to look up more about Faker package,

Faker Documentation Home

Installation

(On Linux and Windows) You can use pip to install pydbgen:

pip install pydbgen

(On Mac OS), first install pip,

curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python get-pip.py

Then proceed as above.

Usage

Current version (1.0.0) of pydbgen comes with the following primary methods,

  • gen_data_series()
  • gen_dataframe()
  • gen_table()
  • gen_excel()

The gen_table() method allows you to build a database with as many tables as you want, filled with random data and fields of your choice. But first, you have to create an object of pydb class:

myDB = pydbgen.pydb()

gen_data_series()

Returns a Pandas series object with the desired number of entries and data type. Data types available:

  • Name, country, city, real (US) cities, US state, zipcode, latitude, longitude
  • Month, weekday, year, time, date
  • Personal email, official email, SSN
  • Company, Job title, phone number, license plate

Phone number can be of two types:

  • phone_number_simple generates 10 digit US number in xxx-xxx-xxxx format
  • phone_number_full may generate an international number with different format

Code example:

se=myDB.gen_data_series(data_type='date')
print(se)

0    1995-08-09
1    2001-08-01
2    1980-06-26
3    2018-02-18
4    1972-10-12
5    1983-11-12
6    1975-09-04
7    1970-11-01
8    1978-03-23
9    1976-06-03
dtype: object

gen_dataframe()

Generates a Pandas dataframe filled with random entries. User can specify the number of rows and data type of the fields/columns.

  • Name, country, city, real (US) cities, US state, zipcode, latitude, longitude
  • Month, weekday, year, time, date
  • Personal email, official email, SSN
  • Company, Job title, phone number, license plate

Customization choices are following:

  • real_email: If True and if a person's name is also included in the fields, a realistic email will be generated corresponding to the name of the person. For example, Tirtha Sarkar name with this choice enabled, will generate emails like [email protected] or [email protected].
  • real_city: If True, a real US city's name will be picked up from a list (included as a text data file with the installation package). Otherwise, a fictitious city name will be generated.
  • phone_simple: If True, a 10 digit US number in the format xxx-xxx-xxxx will be generated. Otherwise, an international number with different format may be returned.

Code example:

testdf=myDB.gen_dataframe(
25,fields=['name','city','phone',
'license_plate','email'],
real_email=True,phone_simple=True
)

gen_table()

Attempts to create a table in a database (.db) file using Python's built-in SQLite engine. User can specify various data types to be included as database table fields.

All data types (fields) in the SQLite table will be of VARCHAR type. Data types available:

  • Name, country, city, real (US) cities, US state, zipcode, latitude, longitude
  • Month, weekday, year, time, date
  • Personal email, official email, SSN
  • Company, Job title, phone number, license plate

Customization choices are following:

  • real_email: If True and if a person's name is also included in the fields, a realistic email will be generated corresponding to the name of the person. For example, Tirtha Sarkar name with this choice enabled, will generate emails like [email protected] or [email protected].
  • real_city: If True, a real US city's name will be picked up from a list (included as a text data file with the installation package). Otherwise, a fictitious city name will be generated.
  • phone_simple: If True, a 10 digit US number in the format xxx-xxx-xxxx will be generated. Otherwise, an international number with different format may be returned.
  • db_file: Name of the database where the TABLE will be created or updated. Default database name will be chosen if not specified by user.
  • table_name: Name of the table, to be chosen by user. Default table name will be chosen if not specified by user.
  • primarykey: User can choose a PRIMARY KEY from among the various fields. If nothing specified, the first data field will be made PRIMARY KEY. If user chooses a field, which is not in the specified list, an error will be thrown and no table will be generated.

Code example:

myDB.gen_table(
20,fields=['name','city','job_title','phone','company','email'],
db_file='TestDB.db',table_name='People',
primarykey='name',real_city=False
)

gen_excel()

Attempts to create an Excel file using Pandas excel_writer function. User can specify various data types to be included. All data types (fields) in the Excel file will be of text type. Data types available:

  • Name, country, city, real (US) cities, US state, zipcode, latitude, longitude
  • Month, weekday, year, time, date
  • Personal email, official email, SSN
  • Company, Job title, phone number, license plate

Customization choices are following:

  • real_email: If True and if a person's name is also included in the fields, a realistic email will be generated corresponding to the name of the person. For example, Tirtha Sarkar name with this choice enabled, will generate emails like [email protected] or [email protected].
  • real_city: If True, a real US city's name will be picked up from a list (included as a text data file with the installation package). Otherwise, a fictitious city name will be generated.
  • phone_simple: If True, a 10 digit US number in the format xxx-xxx-xxxx will be generated. Otherwise, an international number with different format may be returned.
  • filename: Name of the Excel file to be created or updated. Default file name will be chosen if not specified by user.

Code example:

myDB.gen_excel(15,fields=['name','year','email','license_plate'],
        filename='TestExcel.xlsx',real_email=True)

Other auxiliary methods available

Few other auxiliary functions available in this package.

Owner
Tirthajyoti Sarkar
Data Sc/Engineering manager , Industry 4.0, edge-computing, semiconductor technologist, Author, Python pkgs - pydbgen, MLR, and doepy,
Tirthajyoti Sarkar
Wafer Fault Detection - Wafer circleci with python

Wafer Fault Detection Problem Statement: Wafer (In electronics), also called a slice or substrate, is a thin slice of semiconductor, such as a crystal

Avnish Yadav 14 Nov 21, 2022
ASOUL直播间弹幕抓取&&数据分析

ASOUL直播间弹幕抓取&&数据分析(更新中) 这些文件用于爬取ASOUL直播间的弹幕(其他直播间也可以)和其他信息,以及简单的数据分析生成。

159 Dec 10, 2022
A data analysis using python and pandas to showcase trends in school performance.

A data analysis using python and pandas to showcase trends in school performance. A data analysis to showcase trends in school performance using Panda

Jimmy Faccioli 0 Sep 07, 2021
Leverage Twitter API v2 to analyze tweet metrics such as impressions and profile clicks over time.

Tweetmetric Tweetmetric allows you to track various metrics on your most recent tweets, such as impressions, retweets and clicks on your profile. The

Mathis HAMMEL 29 Oct 18, 2022
A computer algebra system written in pure Python

SymPy See the AUTHORS file for the list of authors. And many more people helped on the SymPy mailing list, reported bugs, helped organize SymPy's part

SymPy 9.9k Dec 31, 2022
Using Python to derive insights on particular Pokemon, Types, Generations, and Stats

Pokémon Analysis Andreas Nikolaidis February 2022 Introduction Exploratory Analysis Correlations & Descriptive Statistics Principal Component Analysis

Andreas 1 Feb 18, 2022
A 2-dimensional physics engine written in Cairo

A 2-dimensional physics engine written in Cairo

Topology 38 Nov 16, 2022
Active Learning demo using two small datasets

ActiveLearningDemo How to run step one put the dataset folder and use command below to split the dataset to the required structure run utils.py For ea

3 Nov 10, 2021
Functional tensors for probabilistic programming

Funsor Funsor is a tensor-like library for functions and distributions. See Functional tensors for probabilistic programming for a system description.

208 Dec 29, 2022
Demonstrate a Dataflow pipeline that saves data from an API into BigQuery table

Overview dataflow-mvp provides a basic example pipeline that pulls data from an API and writes it to a BigQuery table using GCP's Dataflow (i.e., Apac

Chris Carbonell 1 Dec 03, 2021
[CVPR2022] This repository contains code for the paper "Nested Collaborative Learning for Long-Tailed Visual Recognition", published at CVPR 2022

Nested Collaborative Learning for Long-Tailed Visual Recognition This repository is the official PyTorch implementation of the paper in CVPR 2022: Nes

Jun Li 65 Dec 09, 2022
Analytical view of olist e-commerce in Brazil

Analysis of E-Commerce Public Dataset by Olist The objective of this project is to propose an analytical view of olist e-commerce in Brazil. For this

Gurpreet Singh 1 Jan 11, 2022
Churn prediction with PySpark

It is expected to develop a machine learning model that can predict customers who will leave the company.

3 Aug 13, 2021
bigdata_analyse 大数据分析项目

bigdata_analyse 大数据分析项目 wish 采用不同的技术栈,通过对不同行业的数据集进行分析,期望达到以下目标: 了解不同领域的业务分析指标 深化数据处理、数据分析、数据可视化能力 增加大数据批处理、流处理的实践经验 增加数据挖掘的实践经验

Way 2.4k Dec 30, 2022
A Python adaption of Augur to prioritize cell types in perturbation analysis.

A Python adaption of Augur to prioritize cell types in perturbation analysis.

Theis Lab 2 Mar 29, 2022
WAL enables programmable waveform analysis.

This repro introcudes the Waveform Analysis Language (WAL). The initial paper on WAL will appear at ASPDAC'22 and can be downloaded here: https://www.

Institute for Complex Systems (ICS), Johannes Kepler University Linz 40 Dec 13, 2022
MotorcycleParts DataAnalysis python

We work with the accounting department of a company that sells motorcycle parts. The company operates three warehouses in a large metropolitan area.

NASEEM A P 1 Jan 12, 2022
Karate Club: An API Oriented Open-source Python Framework for Unsupervised Learning on Graphs (CIKM 2020)

Karate Club is an unsupervised machine learning extension library for NetworkX. Please look at the Documentation, relevant Paper, Promo Video, and Ext

Benedek Rozemberczki 1.8k Jan 09, 2023
COVID-19 deaths statistics around the world

COVID-19-Deaths-Dataset COVID-19 deaths statistics around the world This is a daily updated dataset of COVID-19 deaths around the world. The dataset c

Nisa Efendioğlu 4 Jul 10, 2022
PyTorch implementation for NCL (Neighborhood-enrighed Contrastive Learning)

NCL (Neighborhood-enrighed Contrastive Learning) This is the official PyTorch implementation for the paper: Zihan Lin*, Changxin Tian*, Yupeng Hou* Wa

RUCAIBox 73 Jan 03, 2023