BREP : Binary Search in plaintext and gzip files

Overview

BREP : Binary Search in plaintext and gzip files

Search large files in O(log n) time using binary search.
We support plaintext and Gzipped files.

Benchmark : 8x faster than grep on a 2GB dataset !

brep is usually faster than grep for >1GB datasets.

Check tests/benchmark.py to reproduce the results.

grep ^777 test.txt : 1.594 s (15 runs)
brep 777 test.txt : 206.8 ms (15 runs)

Installation

pip install brep or pip install . from this repo

Index your file

In order to conduct binary search, your file needs to be sorted.
We recommend GNU sort, as it's multithreaded and supports large files.
LC_ALL=C sort -u -o output_file input_file

BREP supports compressed file in the GZIP format.
We recommend pigz for quick multicore compression : pigz file

Usage

Provide 1 prefix search term and 1 filepath
brep 77777 test/large.gz

You can also search from our Python class

from brep import Search

for result in Search("77777", "test/large.gz"):
    print(result)

Contribute

PRs are welcome!

Install dev dependencies: pip install -e .[dev]
Test and lint before submitting: pytest && flake8

Todo

  • Reimplement in Rust
  • Faster gz size estimation
  • Search multiple strings at once
You might also like...
dotsend is a web application which helps you to upload your large files and share file via link

dotsend is a web application which helps you to upload your large files and share file via link

A wrapper for DVD file structure and ISO files.

vs-parsedvd DVDs were an error. A wrapper for DVD file structure and ISO files. You can find me in the IEW Discord server

Python Fstab Generator is a small Python script to write and generate /etc/fstab files based on yaml file on Unix-like systems.

PyFstab Generator PyFstab Generator is a small Python script to write and generate /etc/fstab files based on yaml file on Unix-like systems. NOTE : Th

A tool for batch processing large fasta files and accompanying metadata table to upload to repositories via API

Fasta Uploader A tool for batch processing large fasta files and accompanying metadata table to repositories via API The python fasta_uploader.py scri

Python interface for reading and appending tar files

Python interface for reading and appending tar files, while keeping a fast index for finding and reading files in the archive. This interface has been

A simple file module for creating, editing and saving files.

A simple file module for creating, editing and saving files.

This program can help you to move and rename many files at once
This program can help you to move and rename many files at once

This program can help you to rename and save many files in a folder in seconds, but don't give the same name to files, it can delete both files.

Fast Python reader and editor for ASAM MDF / MF4 (Measurement Data Format) files
Fast Python reader and editor for ASAM MDF / MF4 (Measurement Data Format) files

asammdf is a fast parser and editor for ASAM (Association for Standardization of Automation and Measuring Systems) MDF (Measurement Data Format) files

Here is some Python code that allows you to read in SVG files and approximate their paths using a Fourier series.
Here is some Python code that allows you to read in SVG files and approximate their paths using a Fourier series.

Here is some Python code that allows you to read in SVG files and approximate their paths using a Fourier series. The Fourier series can be animated and visualized, the function can be output as a two dimensional vector for Desmos and there is a method to output the coefficients as LaTeX code.

Releases(v1.0.1)
Owner
Arnaud de Saint Meloir
Software Engineer
Arnaud de Saint Meloir
useful files for the Freenove Big Hexapod

FreenoveBigHexapod useful files for the Freenove Big Hexapod HexaDogPos is a utility for converting the Freenove xyz co-ordinate system to servo angle

Alex 2 May 28, 2022
gitfs is a FUSE file system that fully integrates with git - Version controlled file system

gitfs is a FUSE file system that fully integrates with git. You can mount a remote repository's branch locally, and any subsequent changes made to the files will be automatically committed to the rem

Presslabs 2.3k Jan 08, 2023
MHS2 Save file editing tools. Transfers save files between players, switch and pc version, encrypts and decrypts.

SaveTools MHS2 Save file editing tools. Transfers save files between players, switch and pc version, encrypts and decrypts. Credits Written by Asteris

31 Nov 17, 2022
fast change directory with python and ruby

fcdir fast change directory with python and ruby run run python script , chose drirectoy and change your directory need you need python and ruby deskt

XCO 2 Jun 20, 2022
This is just a GUI that detects your file's real extension using the filetype module.

Real-file.extnsn This is just a GUI that detects your file's real extension using the filetype module. Requirements Python 3.4 and above filetype modu

1 Aug 08, 2021
Search for files under the specified directory. Extract the file name and file path and import them as data.

Search for files under the specified directory. Extract the file name and file path and import them as data. Based on that, search for the file, select it and open it.

G-jon FujiYama 2 Jan 10, 2022
A python script to convert an ucompressed Gnucash XML file to a text file for Ledger and hledger.

README 1 gnucash2ledger gnucash2ledger is a Python script based on the Github Gist by nonducor (nonducor/gcash2ledger.py). This Python script will tak

Thomas Freeman 0 Jan 28, 2022
QSynthesis is a Python3 API to perform I/O based program synthesis of bitvector expressions.

QSynthesis is a Python3 API to perform I/O based program synthesis of bitvector expressions. It aims at facilitating code deobfuscation. The algorithm is greybox approach combining both a blackbox I/

Quarkslab 103 Dec 30, 2022
PaddingZip - a tool that you can craft a zip file that contains the padding characters between the file content.

PaddingZip - a tool that you can craft a zip file that contains the padding characters between the file content.

phithon 53 Nov 07, 2022
CleverCSV is a Python package for handling messy CSV files.

CleverCSV is a Python package for handling messy CSV files. It provides a drop-in replacement for the builtin CSV module with improved dialect detection, and comes with a handy command line applicati

The Alan Turing Institute 1k Dec 19, 2022
ZipFly is a zip archive generator based on zipfile.py

ZipFly is a zip archive generator based on zipfile.py. It was created by Buzon.io to generate very large ZIP archives for immediate sending out to clients, or for writing large ZIP archives without m

Buzon 506 Jan 04, 2023
Remove [x]_ from StudIP zip Archives and archive_filelist.csv completely

This tool removes the "[x]_" at the beginning of StudIP zip Archives. It also deletes the "archive_filelist.csv" file

Kelke vl 1 Jan 19, 2022
Publicly Open Amazon AWS S3 Bucket Viewer

S3Viewer Publicly open storage viewer (Amazon S3 Bucket, Azure Blob, FTP server, HTTP Index Of/) s3viewer is a free tool for security researchers that

Sharon Brizinov 377 Dec 02, 2022
Pure Python tools for reading and writing all TIFF IFDs, sub-IFDs, and tags.

Tiff Tools Pure Python tools for reading and writing all TIFF IFDs, sub-IFDs, and tags. Developed by Kitware, Inc. with funding from The National Canc

Digital Slide Archive 32 Dec 14, 2022
A tiny Configuration File Parser for Python Projects

A tiny Configuration File Parser for Python Projects. Currently working on JSON Config Files only.

Tanmoy Sen Gupta 1 Feb 12, 2022
A small Python module for determining appropriate platform-specific dirs, e.g. a "user data dir".

the problem What directory should your app use for storing user data? If running on macOS, you should use: ~/Library/Application Support/AppName If

ActiveState Software 948 Dec 31, 2022
Python library for reading and writing tabular data via streams.

tabulator-py A library for reading and writing tabular data (csv/xls/json/etc). [Important Notice] We have released Frictionless Framework. This frame

Frictionless Data 231 Dec 09, 2022
BOOTH宛先印刷用CSVから色々な便利なリストを作成してCSVで出力するプログラムです。

BOOTH注文リスト作成スクリプト このPythonスクリプトは、BOOTHの「宛名印刷用CSV」から、 未発送の注文 今月の注文 特定期間の注文 を抽出した上で、各注文を商品毎に一覧化したCSVとして出力するスクリプトです。 簡単な使い方 ダウンロード 通常は、Relaseから、booth_ord

hinananoha 1 Nov 28, 2021
dotsend is a web application which helps you to upload your large files and share file via link

dotsend is a web application which helps you to upload your large files and share file via link

Devocoe 0 Dec 03, 2022
LightCSV - This CSV reader is implemented in just pure Python.

LightCSV Simple light CSV reader This CSV reader is implemented in just pure Python. It allows to specify a separator, a quote char and column titles

Jose Rodriguez 6 Mar 05, 2022