MongoDB utility to inflate the contents of small collection to a new larger collection

Overview

MongoDB Data Inflater ("data-inflater")

The data-inflater tool is a MongoDB utility to automate the creation of a new large database collection using data sourced from an existing smaller database collection.

By default, the utility will use the Atlas 'sample data set' database collection sample_mflix.movies as the source collection. However, most users will provide parameters to the utility to specify the use of their own database and source collection. If you do want to use the Atlas sample data set, see the sample data manual page for more information.

The data-inflater utility issues multiple concurrent aggregation processes, each copying batches of records in parallel for increased performance. The resulting collection will contain documents with duplicated data but with new unique _id field values. The variance ratio of data in the new collection will approximately reflect the variance ratio of the source collection. Therefore, you should ensure you have supplied at least a few different documents (if not a few hundred or thousand) in the source collection.

If you are running a sharded cluster, the utility will ensure the target collection is sharded with a shard key, and where it can, it will pre-split the chunks to avoid subsequent needless balancer overhead. For example, if you specify the --shardkey parameter for this utility to reference a field (e.g. product_name) as the range based shard key, before creating the target collection, the utility will introspect the spread of values for the shard key field (e.g. product_name). The utility will then create pre-split chunks in the new empty target collection before any data is copied to it, to maximise performance.

How To Run

In a running MongoDB cluster (self-managed or running in Atlas), ensure you have created and populated a source collection with at least one sample record in it (ideally more with varying values for the fields across the different documents to reflect the shape and variance you desire).

Ensure Python3 (version 3.8 or greater) and the MongoDB Python Driver (PyMongo) are already installed on your workstation. Example to install PyMongo:

pip3 install --user pymongo

Ensure the .py script is executable and then execute the following to view the utility's help instructions and the full list of parameters that you can provide:

./data-inflater.py -h

Execute the following to connect to a locally running single server database (default port) to copy and expand the data from an existing source collection, mydb.mySrcColl, to an a new collection, mydb.myDestColl, which will contain 1 million records:

./data-inflater.py --url 'mongodb://localhost:27017' -d 'mydb' -c 'mySrcColl' -t 'myDestColl' -s 1000000

Execute the following to connect to an Atlas cluster (ensure you've already loaded the Atlas sample data set), to inflate the data from the source movies collection to the new movies_big collection, which will contain 100 million records (note, first change the URL username, password and hostname shown, to match the URL of your Atlas cluster):

./data-inflater.py --url 'mongodb+srv://usr:[email protected]/'
Owner
Paul Done
Paul Done
produces PCA on genotypes from fasta files (popPhyl's ID format)

popPhyl_PCA Performs PCA of genotypes. Works in two steps. 1. Input file A single fasta file containing different loci, in different populations/speci

camille roux 2 Oct 08, 2021
A (very dirty) experiment to remove layers from a Docker image.

Surgically remove layers from a Docker image (with a chainsaw)

Jérôme Petazzoni 9 Jun 08, 2022
API Rate Limit Decorator

ratelimit APIs are a very common way to interact with web services. As the need to consume data grows, so does the number of API calls necessary to re

Tomas Basham 575 Jan 05, 2023
Parse URLs for DOIs, PubMed identifiers, PMC identifiers, arXiv identifiers, etc.

citation-url Parse URLs for DOIs, PubMed identifiers, PMC identifiers, arXiv identifiers, etc. This module has a single parse() function that takes in

Charles Tapley Hoyt 2 Feb 12, 2022
A simple and easy to use Spam Bot made in Python!

This is a simple spam bot made in python. You can use to to spam anyone with anything on any platform.

7 Sep 08, 2022
Similar looking domain detection using python fuzzywuzzy

Major cause of phishing and BEC incident is similar looking domain, if you detect it early, you can prevent incidents early, python fuzzywuzzy module let you do that

2 Nov 07, 2021
Python USD rate in RUB parser

Python EUR and USD rate parser. Python USD and EUR rate in RUB parser. Parsing i

Andrew 2 Feb 17, 2022
Homebase Name Changer for Fortnite: Save the World.

Homebase Name Changer This program allows you to change the Homebase name in Fortnite: Save the World. How to use it? After starting the HomebaseNameC

PRO100KatYT 7 May 21, 2022
JavaScript to Python Translator & JavaScript interpreter written in 100% pure Python🚀

Pure Python JavaScript Translator/Interpreter Everything is done in 100% pure Python so it's extremely easy to install and use. Supports Python 2 & 3.

Piotr Dabkowski 2.1k Dec 30, 2022
🚧Useful shortcuts for simple task on windows

Windows Manager A tool containg useful utilities for performing simple shortcut tasks on Windows 10 OS. Features Lit Up - Turns up screen brightness t

Olawale Oyeyipo 0 Mar 24, 2022
A python app which aggregates and splits costs from multiple public cloud providers into a csv

Cloud Billing This project aggregates the costs public cloud resources by accounts, services and tags by importing the invoices from public cloud prov

1 Oct 04, 2022
This is Cool Utility tools that you can use in python.

This is Cool Utility tools that you can use in python. There are a few tools that you might find very useful, you can use this on pretty much any project and some utils might help you a lot and save

Senarc Studios 6 Apr 18, 2022
Protect your eyes from eye strain using this simple and beautiful, yet extensible break reminder

Protect your eyes from eye strain using this simple and beautiful, yet extensible break reminder

Gobinath 1.2k Jan 01, 2023
VerSign: Easy Signature Verification in Python

VerSign: Easy Signature Verification in Python versign is a small Python package which can be used to perform verification of offline signatures. It a

Muhammad Saif Ullah Khan 3 Dec 01, 2022
A simple dork generator written in python that outputs dorks with the domain extensions you enter

Dork Gen A simple dork generator written in python that outputs dorks with the domain extensions you enter in a ".txt file". Usage The code is pretty

Z3NToX 4 Oct 30, 2022
Simple script to export contacts from telegram into vCard file

Telegram Contacts Exporter Simple script to export contacts from telegram into vCard file Getting Started Prerequisites You must to put your Telegram

Pere Antoni 1 Oct 17, 2021
Simple RGB to HEX game made in python

Simple RGB to HEX game made in python

5 Aug 26, 2022
A utility tool to create .env files

A utility tool to create .env files dump-env takes an .env.template file and some optional environmental variables to create a new .env file from thes

wemake.services 89 Dec 08, 2022
Retrying is an Apache 2.0 licensed general-purpose retrying library, written in Python, to simplify the task of adding retry behavior to just about anything.

Retrying Retrying is an Apache 2.0 licensed general-purpose retrying library, written in Python, to simplify the task of adding retry behavior to just

Ray Holder 1.9k Dec 29, 2022
Extends the pyranges module with operations on joined genomic intervals

tiedpyranges Extends the pyranges module with operations on joined genomic intervals (e.g. exons of same transcript) Install with: pip install tiedpyr

Marco Mariotti 4 Aug 05, 2022