当前位置:网站首页>Hash table, bloom filter, distributed consistency hash
Hash table, bloom filter, distributed consistency hash
2022-07-19 02:13:00 【It's a snail knight】
List of articles
1. hash Application scenarios of
It is mainly used to find out whether a string exists in massive data , such as :
Use word When the document ,word How to tell if a word is spelled correctly ?
Web crawler , How to keep it from climbing the same url page ?
How to design Spam Filtering Algorithm ?
When the police handle a case , How to judge whether a suspect is on the Internet escape list ?
How to solve the cache penetration problem ?
2. Hash table
2.1 How to choose hash function
hash The function needs to have :1. The calculation efficiency is fast 2. Strong randomness ( Equal probability 、 Evenly distributed throughout the address space ).
murmurhash1,murmurhash2,murmurhash3,siphash(redis6.0 Make ⽤,rust And most languages hash Algorithm to achieve hashmap),cityhash All have strong random distribution ; The test address is as follows :https://github.com/aappleby/smhasher
siphash It mainly solves the strong random distribution of string proximity ;
2.2 Composition of hash table ,hash How to resolve the conflict
Hash table consists of hash function and array , adopt hash Function calculation key Mapping relationship in memory :index = hash(key) / size;,hash The function may put two or more different key Map to the same address , This situation is called conflict ( perhaps hash Collision ). solve hash There are several ways to conflict :
1. Zipper method : Map to the same location key Connected through a linked list , However, there may be an extreme situation where too many conflicting elements lead to the length of the linked list being too long , At this time, red and black trees can be used to store these conflict nodes , The time complexity of the search is determined by O(n) Reduced to O(log2N), When the number of nodes in the linked list exceeds 256 It is recommended to use red and black trees to store conflict nodes , This is an empirical value .
2. Open addressing : Store all elements in the array of hash table , No additional data structures ; Generally, the idea of linear exploration is used to solve .
- When inserting a new element , Use hash function to locate element position in hash table ;
- Check whether there is an element in the slot index in the array . If the slot is empty , Then insert ⼊, otherwise 3;
- stay 2 Add a certain step to the detected slot index, and then check 2; Add ⼀ The fixed step size is divided into the following :
1. i+1,i+2,i+3,i+4, … ,i+n
2. i- ,i+ ,i- ,1+, …
Will lead to hash Gather , It will affect the performance of search , Using double hashing can avoid hash Gather .
2.3 hash Dynamic management of array capacity
Load factor : The number of elements in the hash table / The capacity of the array , Describe the intensity of the conflict .
Capacity expansion : When the load factor is greater than 1 when , Expand the capacity of the array to the original 2 times , Proceed again rehash Recalculate key Storage mapping location of .
Shrinkage capacity : When the load factor is greater than 1 when , Reduce the capacity of the array to the original 1/2, Proceed again rehash Recalculate key Storage mapping location of .
2.4 The hash table is in STL Application of containers

3. The bloon filter
3.1 Features of bloon filter
Bloom filter is a kind of Probability type data structure , It is characterized by efficient insertion and query , Can determine a string It must not exist perhaps Possible .
The bloon filter No specific data is stored , So it takes up little space , There is an error in the query result , But the error is controllable , meanwhile Delete operation not supported .
3.2 The realization principle of Bloom filter
Bitmap (BIT Array )+ n individual hash function :
1. Insert : When an element is inserted into a bitmap , adopt k individual hash Function calculation key Mapped to a bitmap k A little bit , And set their values to 1.
2. lookup : Re pass k individual hash Function operation to detect the of bitmap k Whether all points are 1; If there is any reason not to 1 The point of , So think it's time to key non-existent ; If it's all 1, There may be .
3. The bloom filter does not support deletion : There are only two states for each slot in the bitmap (0 perhaps 1), A slot is set to 1 state , But I'm not sure how many times it's set ; I don't know how many key From hash mapping and by which hash Function mapping .
3.3 Application scenarios of Bloom filter
Bloom filter is usually used to judge A string must not exist , And it is used to allow certain errors in the existence of strings .
1. Used to determine whether a character exists in the file , Generate a bloom filter for a file , There is no need to load the file into memory , Through the bloom filter, we can determine whether the characters exist in the file , File update requires regenerating the bloom filter .
2. Solution of cache penetration , In order to reduce the database (mysql) Access pressure for , stay server Client and database (mysql) Add cache between to store hotspot data .
4. Distributed consistency hash
4.1 Distributed consistency hash Introduction to
Distributed consistency hash The algorithm organizes the hash space into a virtual ring , The size of the ring is 232; Algorithm for :hash(ip) %232, In the end, you'll get one [ 0, 232 - 1 ] An unsigned integer between , This integer represents the number of the server ; Multiple servers are running in this way hash Map a point on the ring to identify the location of the server ; When a user operates a key, Generate a value by the same algorithm , Position a server clockwise along the ring , Then the key In this server .
4.2 Cache invalidation
Adding or deleting a node will result in Cache invalidation , Data migration is needed .
When adding nodes : For example, the original node hash value list is [1,100,500,1000], New node 800 after , stay 501~799 The data in the range was originally assigned a hash value of 1000 The node of , Now we need to migrate this part of data to the node 800;
Delete node : For example, the original node hash value list is [1,100,500,800,1000], Delete node 500 after , The original scope is 101~500 The data of should be migrated to the node 800.
How to conduct data migration reference :https://github.com/metang326/consistent_hashing_cpp
4.3 Hash offset
To solve the problem of hash offset , Added the concept of virtual node ; Theoretically , The more nodes on the hash ring , The more evenly distributed the data ; Calculate multiple hash nodes for each service node ( Virtual node ); The usual way is ,hash(“IP:PORT:seqno”) %232
Distributed cache ; Distribute data evenly among different servers , Used to share the pressure of cache server ; The change in the number of cache servers should not affect cache invalidation ;
5. Use 2G In memory 20 Find out the integer with the most occurrences among 100 million integers
Thinking questions : How to use 2G In memory 20 Find out the integer with the most occurrences among 100 million integers ?
Want to count the number of occurrences of each integer , We can think of integers as key, Count each integer key Number of occurrences value. Consider an extreme situation :20 100 million integers are the same number , in other words value The maximum value of is 2*109, Use uint32 You can meet value The maximum range of . One key–value Objects need to occupy 8bit Space ,2 100 million integers need to be occupied 1.6G Of memory , obviously , We want to make statistics at one time 20 It is impossible for an integer to appear the most times in 100 million numbers . At this time, we can put this 20 Divide 100 million integers into 10 Share , Each is close to 2 Million number , When splitting, we need to ensure that the same number is divided together , Let's go to each integer hash value , And then hash It's worth it 10 Remainder , utilize hash Strong randomness of function , This not only ensures that the same number is divided together , And can make 20 Evenly split 100 million integers . Data statistics : Use a hash table to count the number of occurrences of each integer ,value The integer with the largest value is the integer with the most occurrences .
Recommend a free open course of zero sound College , Personally, I think the teacher spoke well , Share with you :Linux,Nginx,ZeroMQ,MySQL,Redis,fastdfs,MongoDB,ZK, Streaming media ,CDN,P2P,K8S,Docker,TCP/IP, coroutines ,DPDK Etc , Click learn now :
边栏推荐
- gdb+vscode进行调试2——gdb断点相关
- Suivi du mode de méthode de l'usine
- Opengauss Developer Day 2022 dongfangtong sincerely invites you to visit the "dongfangtong ecological tools sub forum"
- Characteristics and application points of electrolytic capacitor
- 如何理解Volatile以及如何使用它
- 04 design of indoor wireless positioning system based on ZigBee
- Leetcode 1: Two Sum
- Envi IDL: lire la teneur en colonne de NO2 de tous les produits OMI et calculer la moyenne mensuelle, la moyenne trimestrielle, la moyenne annuelle + résolution
- Aurix development studio installation
- Compilation and link of C language program
猜你喜欢

Labelme正常启动,但无法打开

How to understand volatile and how to use it

VS Code 问题:launch:program‘...\.vscode\launch.exe‘ dose not exist

树和堆知识点总结

SAE j1708/j1587 protocol details

windows安装mysql和jdbc

DQN理论基础及其代码实现【Pytorch + CartPole-v0】

06 design of smart electronic medicine box based on stm32

Compilation and link of C language program

Can protocol communication
随机推荐
Memory pooling of pooled components
Oozie 集成 Shell
成信大ENVI_IDL第三周课堂内容1:读取OMI数据(HDF5文件)以及输出+解析
RuntimeError_ Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor)
Line process of pool components
04基于ZigBee的室内无线定位系统设计
leetcode力扣经典问题——42.接雨水
Engineering compilation: makefile and cmake (I)
Opengauss Developer Day 2022 dongfangtong sincerely invites you to visit the "dongfangtong ecological tools sub forum"
Leetcode 1: Two Sum
ENVI_ Idl: reading of text files (mainly txt and CSV files)
03基于ZigBee的城市道路除尘降温系统设计
Differences between saber PSPICE Simulink power supply simulation software
Switch details
散列表、布隆过滤器、分布式一致性hash
04 design of indoor wireless positioning system based on ZigBee
ENVI_IDL:讀取所有OMI產品的NO2柱含量並計算月均值、季均值、年均值+解析
DoubleDQN的理论基础及其代码实现【Pytorch + Pendulum-v0】
Labelme正常启动,但无法打开
Hue oozie editor scheduling shell