Weird Sort-and-Compress Thing

A weird integer sorting + compression algorithm inspired by a conversation with Luthingx (it probably already exists by some name I don't know yet). There's a lot still to improve about this algorithm, so be careful where you use it.

How it works

Here's an example for the following list:

l = [1, 2, 2, 2, 3]

The algorithm starts with a counting-sort step: it builds a dictionary with each unique number as the key and the number of occurrences of that number in the list as the value:

d = {1: 1, 2: 3, 3: 1}
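A minimal sketch of this counting step, assuming non-negative integers as input. The function name count_sorted is hypothetical, and using collections.Counter plus sorted() is my own shortcut here, not necessarily how the original implementation does it:

```python
from collections import Counter

def count_sorted(values):
    """Return {value: occurrences} with the keys in ascending order."""
    counts = Counter(values)
    # Dicts keep insertion order, so inserting keys in sorted order
    # gives us the "sorted" part of the algorithm for free.
    return {key: counts[key] for key in sorted(counts)}

l = [1, 2, 2, 2, 3]
d = count_sorted(l)
print(d)  # {1: 1, 2: 3, 3: 1}
```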

To decrease the space needed to store the keys in memory, we'll only store the first key as it is and, for each of the next keys, the difference between it and the previous one. The counts stay unchanged:

d2 = [(1, 1), (1, 3), (1, 1)]
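The difference step could look something like this. It's a sketch only: delta_encode is a hypothetical name, and it assumes the first key is stored as its difference from 0 (which is the same as storing it as-is):

```python
def delta_encode(d):
    """Turn {key: count} into [(key_delta, count), ...] in ascending key order."""
    pairs = []
    previous = 0  # assumption: the first key is stored as its distance from 0
    for key in sorted(d):
        pairs.append((key - previous, d[key]))
        previous = key
    return pairs

d = {1: 1, 2: 3, 3: 1}
d2 = delta_encode(d)
print(d2)  # [(1, 1), (1, 3), (1, 1)]
```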

Now, the minimum amount of memory we need to store any key in d2 is 1 bit, since 1 is the largest key entry (both the first key and every difference between consecutive keys are 1). The same applies to the values, except that to store any value here we need 2 bits, since the maximum value is 3 (11 in binary). So we know that we can store this list as a sequence of 3-bit elements, like this:

d2_bin = ["101", "111", "101"]
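Here's a rough sketch of how the bit widths and the fixed-width elements can be computed. The name to_bit_strings is hypothetical, and giving every field at least 1 bit is an assumption of mine so that a key difference of 0 still takes up a position:

```python
def to_bit_strings(pairs):
    """Return (chunks, key_bits, value_bits) for a list of (key_delta, count) pairs."""
    # Width of the widest key entry and the widest count (at least 1 bit each).
    key_bits = max(max((k for k, _ in pairs), default=0).bit_length(), 1)
    value_bits = max(max((v for _, v in pairs), default=0).bit_length(), 1)
    # Zero-pad every field to its width and glue key and count together.
    chunks = [
        format(k, f"0{key_bits}b") + format(v, f"0{value_bits}b")
        for k, v in pairs
    ]
    return chunks, key_bits, value_bits

d2 = [(1, 1), (1, 3), (1, 1)]
d2_bin, key_bits, value_bits = to_bit_strings(d2)
print(d2_bin)                 # ['101', '111', '101']
print(key_bits, value_bits)   # 1 2
```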

We can now return the list as a single number (101111101 in binary for this example), along with a pair of integers containing the number of bits in each key and the number of bits in each value, which is what allows the number to be decompressed back into the sorted list.
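To make the round trip concrete, here's a small sketch of packing and unpacking. The names pack and decompress are hypothetical, and the decoder relies on an assumption: every count is at least 1, so every element is non-zero and we can read elements from the low end until the number runs out, even when the first key is 0.

```python
def pack(pairs, key_bits, value_bits):
    """Concatenate all (key_delta, count) pairs into one integer."""
    number = 0
    for key_delta, count in pairs:
        number = (number << key_bits) | key_delta
        number = (number << value_bits) | count
    return number

def decompress(number, key_bits, value_bits):
    """Rebuild the sorted list from the packed integer and the two bit widths."""
    chunk_bits = key_bits + value_bits
    pairs = []
    while number > 0:
        chunk = number & ((1 << chunk_bits) - 1)   # lowest element
        pairs.append((chunk >> value_bits, chunk & ((1 << value_bits) - 1)))
        number >>= chunk_bits
    pairs.reverse()  # elements were read last-to-first
    result, key = [], 0
    for key_delta, count in pairs:
        key += key_delta               # undo the difference step
        result.extend([key] * count)   # undo the counting step
    return result

packed = pack([(1, 1), (1, 3), (1, 1)], 1, 2)
print(bin(packed))                # 0b101111101
print(decompress(packed, 1, 2))   # [1, 2, 2, 2, 3]
```

Note that the list comes back sorted, so the original order of the elements is not recoverable from the compressed number.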

Memory efficiency

Here's a comparison between the sum of the number of bits of all numbers in a list and the number of bits in the resulting compressed integer (taking as a premise that all numbers in the list, including duplicates, are actually stored in contiguous memory). Each line reads "uncompressed bits => compressed bits". First, for a list with 100 elements, generated with random values in the range 0 to 50 and generated 20 times:

And for 1000 numbers from 0 to 50, also generated 20 times:

4724 => 358
4827 => 309
4818 => 308
4801 => 309
4763 => 309
4763 => 309
4801 => 359
4757 => 359
4766 => 309
4794 => 309
4769 => 309
4789 => 359
4887 => 359
4787 => 309
4761 => 309
4749 => 309
4844 => 308
4798 => 359
4799 => 308
4763 => 359
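A comparison like the one above could be produced with something along these lines. This is only a sketch under my own assumptions, not the script that generated the numbers above: raw_bits and compressed_bits are hypothetical names, 0 is counted as 1 bit, the range 0 to 50 is taken as inclusive, and the exact bit counts may differ slightly from the original measurements depending on how the compressed integer is measured.

```python
import random
from collections import Counter

def raw_bits(values):
    """Sum of the bit lengths of all values (assumption: 0 counts as 1 bit)."""
    return sum(max(v.bit_length(), 1) for v in values)

def compressed_bits(values):
    """Bits in the packed form: one (key_delta, count) element per unique value."""
    counts = Counter(values)
    keys = sorted(counts)
    # First key as-is, then the difference to the previous key.
    deltas = [keys[0]] + [b - a for a, b in zip(keys, keys[1:])]
    key_bits = max(max(deltas).bit_length(), 1)
    value_bits = max(counts.values()).bit_length()
    return len(keys) * (key_bits + value_bits)

random.seed(0)  # fixed seed only so that repeated runs print the same numbers
for _ in range(20):
    sample = [random.randint(0, 50) for _ in range(1000)]
    print(raw_bits(sample), "=>", compressed_bits(sample))
```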

Owner

Douglas
