CaskDB is a disk-based, embedded, persistent, key-value store based on the Riak's bitcask paper, written in Python.

Overview

CaskDB - Disk based Log Structured Hash Table Store

made-with-python build codecov MIT license

architecture

CaskDB is a disk-based, embedded, persistent, key-value store based on the Riak's bitcask paper, written in Python. It is more focused on the educational capabilities than using it in production. The file format is platform, machine, and programming language independent. Say, the database file created from Python on macOS should be compatible with Rust on Windows.

This project aims to help anyone, even a beginner in databases, build a persistent database in a few hours. There are no external dependencies; only the Python standard library is enough.

If you are interested in writing the database yourself, head to the workshop section.

Features

  • Low latency for reads and writes
  • High throughput
  • Easy to back up / restore
  • Simple and easy to understand
  • Store data much larger than the RAM

Limitations

Most of the following limitations are of CaskDB. However, there are some due to design constraints by the Bitcask paper.

  • Single file stores all data, and deleted keys still take up the space
  • CaskDB does not offer range scans
  • CaskDB requires keeping all the keys in the internal memory. With a lot of keys, RAM usage will be high
  • Slow startup time since it needs to load all the keys in memory

Dependencies

CaskDB does not require any external libraries to run. For local development, install the packages from requirements_dev.txt:

pip install -r requirements_dev.txt

Installation

PyPi is not used for CaskDB yet (issue #5), and you'd have to install it directly from the repository by cloning.

Usage

disk: DiskStorage = DiskStore(file_name="books.db")
disk.set(key="othello", value="shakespeare")
author: str = disk.get("othello")
# it also supports dictionary style API too:
disk["hamlet"] = "shakespeare"

Prerequisites

The workshop is for intermediate-advanced programmers. Knowing Python is not a requirement, and you can build the database in any language you wish.

Not sure where you stand? You are ready if you have done the following in any language:

  • If you have used a dictionary or hash table data structure
  • Converting an object (class, struct, or dict) to JSON and converting JSON back to the things
  • Open a file to write or read anything. A common task is dumping a dictionary contents to disk and reading back

Workshop

NOTE: I don't have any workshops scheduled shortly. Follow me on Twitter for updates. Drop me an email if you wish to arrange a workshop for your team/company.

CaskDB comes with a full test suite and a wide range of tools to help you write a database quickly. A Github action is present with an automated tests runner, code formatter, linter, type checker and static analyser. Fork the repo, push the code, and pass the tests!

Throughout the workshop, you will implement the following:

  • Serialiser methods take a bunch of objects and serialise them into bytes. Also, the procedures take a bunch of bytes and deserialise them back to the things.
  • Come up with a data format with a header and data to store the bytes on the disk. The header would contain metadata like timestamp, key size, and value.
  • Store and retrieve data from the disk
  • Read an existing CaskDB file to load all keys

Tasks

  1. Read the paper. Fork this repo and checkout the start-here branch
  2. Implement the fixed-sized header, which can encode timestamp (uint, 4 bytes), key size (uint, 4 bytes), value size (uint, 4 bytes) together
  3. Implement the key, value serialisers, and pass the tests from test_format.py
  4. Figure out how to store the data on disk and the row pointer in the memory. Implement the get/set operations. Tests for the same are in test_disk_store.py
  5. Code from the task #2 and #3 should be enough to read an existing CaskDB file and load the keys into memory

Use make lint to run mypy, black, and pytype static analyser. Run make test to run the tests locally. Push the code to Github, and tests will run on different OS: ubuntu, mac, and windows.

Not sure how to proceed? Then check the hints file which contains more details on the tasks and hints.

Hints

  • Check out the documentation of struck.pack for serialisation methods in Python
  • Not sure how to come up with a file format? Read the comment in the format module

What next?

I often get questions about what is next after the basic implementation. Here are some challenges (with different levels of difficulties)

Level 1:

  • Crash safety: the bitcask paper stores CRC in the row, and while fetching the row back, it verifies the data
  • Key deletion: CaskDB does not have a delete API. Read the paper and implement it
  • Instead of using a hash table, use a data structure like the red-black tree to support range scans
  • CaskDB accepts only strings as keys and values. Make it generic and take other data structures like int or bytes.

Level 2:

  • Hint file to improve the startup time. The paper has more details on it
  • Implement an internal cache which stores some of the key-value pairs. You may explore and experiment with different cache eviction strategies like LRU, LFU, FIFO etc.
  • Split the data into multiple files when the files hit a specific capacity

Level 3:

  • Support for multiple processes
  • Garbage collector: keys which got updated and deleted remain in the file and take up space. Write a garbage collector to remove such stale data
  • Add SQL query engine layer
  • Store JSON in values and explore making CaskDB as a document database like Mongo
  • Make CaskDB distributed by exploring algorithms like raft, paxos, or consistent hashing

Name

This project was named cdb earlier and now renamed to CaskDB.

Line Count

$ tokei -f format.py disk_store.py
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 Python                  2          391          261          103           27
-------------------------------------------------------------------------------
 disk_store.py                      204          120           70           14
 format.py                          187          141           33           13
===============================================================================
 Total                   2          391          261          103           27
===============================================================================

License

The MIT license. Please check LICENSE for more details.

Owner
I git stuff done
Convert .1pux to .csv

1PasswordConverter Convert .1pux to .csv 1Password uses this new export format .1pux, I assume stands for 1 Password User eXport. As of right now, 1Pa

Shayne Hartford 7 Dec 16, 2022
ticguide: quick + painless TESS observing information

ticguide: quick + painless TESS observing information Complementary to the TESS observing tool tvguide (see also WTV), which tells you if your target

Ashley Chontos 5 Nov 05, 2022
Yet another Python Implementation of the Elo rating system.

Python Implementation - Elo Rating System Yet another Python Implementation of the Elo rating system (how innovative am I right?). Only supports 1vs1

Kraktoos 5 Dec 22, 2022
Developing a python based app prototype with KivyMD framework for a competition :))

Developing a python based app prototype with KivyMD framework for a competition :))

Jay Desale 1 Jan 10, 2022
This repository contains a lot of short scripting programs implemented both in Python (Flask) and TypeScript (NodeJS).

fast-scripts This repository contains a lot of short scripting programs implemented both in Python (Flask) and TypeScript (NodeJS). In python These wi

Nahum Maurice 3 Dec 10, 2022
A tool for generating skill map/tree like diagram

skillmap A tool for generating skill map/tree like diagram. What is a skill map/tree? Skill tree is a term used in video games, and it can be used for

Yue 98 Jan 07, 2023
Script Repository for the ICGM-CNRS FRANCE

Here you will find my Python Work repesitory for the ICGM institute - Montpellier - France.

CABOS Matthieu 1 Apr 13, 2022
Python library for converting Python calculations into rendered latex.

Covert art by Joshua Hoiberg handcalcs: Python calculations in Jupyter, as though you wrote them by hand. handcalcs is a library to render Python calc

Connor Ferster 5.1k Jan 07, 2023
TB Set color display - Add-on for Blender to set multiple objects and material Display Color at once.

TB_Set_color_display Add-on for Blender with operations to transfer name between object, data, materials and action names Set groups of object's or ma

1 Jun 01, 2022
An example module hooking system, will be used in PySAMP.

An example module hooking system, will be used in PySAMP.

2 May 01, 2022
Brython (Browser Python) is an implementation of Python 3 running in the browser

brython Brython (Browser Python) is an implementation of Python 3 running in the browser, with an interface to the DOM elements and events. Here is a

5.9k Jan 02, 2023
A tool to improve Boolean satisfiability (SAT) solver user's life

SatHelper This is a tool to improve the Boolean satisfiability (SAT) and MaxSAT solver user's life. It helps you model various problems as SAT and Max

Tomas Balyo 1 Nov 16, 2021
HiSim - House Infrastructure Simulator

HiSim is a Python package for simulation and analysis of household scenarios using modern components as alternative to fossil fuel based ones.

FZJ-IEK3 17 Dec 17, 2022
K2HASH Python library - NoSQL Key Value Store(KVS) library

k2hash_python Overview k2hash_python is an official python driver for k2hash. Install Firstly you must install the k2hash shared library: curl -o- htt

Yahoo! JAPAN 3 Oct 19, 2022
This is a Docker-based pipeline for preparing sextractor-ready multiwavelength images

Pipeline for creating NB422-detected (ODI) catalog The repository contains a Docker-based pipeline for preprocessing observational data. The pipeline

1 Sep 01, 2022
This is the accompanying repository for the Bloomberg Global Coal Countdown website.

This is the accompanying repository for the Bloomberg Global Coal Countdown (BGCC) website. Data Sources Dashboard Data Schema and Validation License

7 Jun 01, 2022
Python data loader for Solar Orbiter's (SolO) Energetic Particle Detector (EPD).

Data loader (and downloader) for Solar Orbiter/EPD energetic charged particle sensors EPT, HET, and STEP. Supports level 2 and low latency data provided by ESA's Solar Orbiter Archive.

Jan Gieseler 9 Dec 16, 2022
Antchain-MPC is a library of MPC (Multi-Parties Computation)

Antchain-MPC Antchain-MPC is a library of MPC (Multi-Parties Computation). It include Morse-STF: A tool for machine learning using MPC. Others: Commin

Alipay 37 Nov 22, 2022
Pengenalan para anggota KOMPETEGRAM

Pengenalan Anggota KOMPETEGRAM Apa isi repositori ini ? 💬 Repositori ini berisi pengenalan nama anggota KOMPETEGRAM dari seluruh angkatan atau Batch.

Repositori KOMPETEGRAM 7 Sep 17, 2022
Minutaria is a basic educational Python timer used to learn python and software testing libraries.

minutaria minutaria is a basic educational Python timer. The project is educational, it aims to teach myself programming, python programming, python's

1 Jul 16, 2021