当前位置：网站首页>Hash table, bloom filter, distributed consistency hash

Hash table, bloom filter, distributed consistency hash

2022-07-19 02:13:00 【It's a snail knight】

List of articles

1. hash Application scenarios of
2. Hash table
3. The bloon filter
4. Distributed consistency hash
5. Use 2G In memory 20 Find out the integer with the most occurrences among 100 million integers

1. hash Application scenarios of

It is mainly used to find out whether a string exists in massive data , such as ：

Use word When the document ,word How to tell if a word is spelled correctly ？
Web crawler , How to keep it from climbing the same url page ？
How to design Spam Filtering Algorithm ？
When the police handle a case , How to judge whether a suspect is on the Internet escape list ？
How to solve the cache penetration problem ？

2. Hash table

2.1 How to choose hash function

hash The function needs to have ：1. The calculation efficiency is fast 2. Strong randomness （ Equal probability 、 Evenly distributed throughout the address space ）.

murmurhash1,murmurhash2,murmurhash3,siphash（redis6.0 Make ⽤,rust And most languages hash Algorithm to achieve hashmap）,cityhash All have strong random distribution ; The test address is as follows ：https://github.com/aappleby/smhasher
siphash It mainly solves the strong random distribution of string proximity ;

2.2 Composition of hash table ,hash How to resolve the conflict

Hash table consists of hash function and array , adopt hash Function calculation key Mapping relationship in memory ：index = hash(key) / size;,hash The function may put two or more different key Map to the same address , This situation is called conflict （ perhaps hash Collision ）. solve hash There are several ways to conflict ：
1. Zipper method ： Map to the same location key Connected through a linked list , However, there may be an extreme situation where too many conflicting elements lead to the length of the linked list being too long , At this time, red and black trees can be used to store these conflict nodes , The time complexity of the search is determined by O(n) Reduced to O(log₂N), When the number of nodes in the linked list exceeds 256 It is recommended to use red and black trees to store conflict nodes , This is an empirical value .
2. Open addressing ： Store all elements in the array of hash table , No additional data structures ; Generally, the idea of linear exploration is used to solve .

When inserting a new element , Use hash function to locate element position in hash table ;
Check whether there is an element in the slot index in the array . If the slot is empty , Then insert ⼊, otherwise 3;
stay 2 Add a certain step to the detected slot index, and then check 2; Add ⼀ The fixed step size is divided into the following ：
1. i+1,i+2,i+3,i+4, … ,i+n
2. i- ,i+ ,i- ,1+, …
Will lead to hash Gather , It will affect the performance of search , Using double hashing can avoid hash Gather .

2.3 hash Dynamic management of array capacity

Load factor ： The number of elements in the hash table / The capacity of the array , Describe the intensity of the conflict .
Capacity expansion ： When the load factor is greater than 1 when , Expand the capacity of the array to the original 2 times , Proceed again rehash Recalculate key Storage mapping location of .
Shrinkage capacity : When the load factor is greater than 1 when , Reduce the capacity of the array to the original 1/2, Proceed again rehash Recalculate key Storage mapping location of .

2.4 The hash table is in STL Application of containers

The hash table is in STL Application in

3. The bloon filter

3.1 Features of bloon filter

Bloom filter is a kind of Probability type data structure , It is characterized by efficient insertion and query , Can determine a string It must not exist perhaps Possible .
The bloon filter No specific data is stored , So it takes up little space , There is an error in the query result , But the error is controllable , meanwhile Delete operation not supported .

3.2 The realization principle of Bloom filter

Bitmap （BIT Array ）+ n individual hash function ：
The realization principle of Bloom filter
1. Insert ： When an element is inserted into a bitmap , adopt k individual hash Function calculation key Mapped to a bitmap k A little bit , And set their values to 1.
2. lookup ： Re pass k individual hash Function operation to detect the of bitmap k Whether all points are 1; If there is any reason not to 1 The point of , So think it's time to key non-existent ; If it's all 1, There may be .
3. The bloom filter does not support deletion ： There are only two states for each slot in the bitmap （0 perhaps 1）, A slot is set to 1 state , But I'm not sure how many times it's set ; I don't know how many key From hash mapping and by which hash Function mapping .

3.3 Application scenarios of Bloom filter

Bloom filter is usually used to judge A string must not exist , And it is used to allow certain errors in the existence of strings .
1. Used to determine whether a character exists in the file , Generate a bloom filter for a file , There is no need to load the file into memory , Through the bloom filter, we can determine whether the characters exist in the file , File update requires regenerating the bloom filter .
2. Solution of cache penetration , In order to reduce the database （mysql） Access pressure for , stay server Client and database （mysql） Add cache between to store hotspot data .

4. Distributed consistency hash

4.1 Distributed consistency hash Introduction to

Distributed consistency hash The algorithm organizes the hash space into a virtual ring , The size of the ring is 2³²; Algorithm for ：hash(ip) %2³², In the end, you'll get one [ 0, 2³² - 1 ] An unsigned integer between , This integer represents the number of the server ; Multiple servers are running in this way hash Map a point on the ring to identify the location of the server ; When a user operates a key, Generate a value by the same algorithm , Position a server clockwise along the ring , Then the key In this server .
Insert picture description here

4.2 Cache invalidation

Adding or deleting a node will result in Cache invalidation , Data migration is needed .
When adding nodes ： For example, the original node hash value list is [1,100,500,1000], New node 800 after , stay 501~799 The data in the range was originally assigned a hash value of 1000 The node of , Now we need to migrate this part of data to the node 800;
Delete node ： For example, the original node hash value list is [1,100,500,800,1000], Delete node 500 after , The original scope is 101~500 The data of should be migrated to the node 800.
How to conduct data migration reference ：https://github.com/metang326/consistent_hashing_cpp

4.3 Hash offset

To solve the problem of hash offset , Added the concept of virtual node ; Theoretically , The more nodes on the hash ring , The more evenly distributed the data ; Calculate multiple hash nodes for each service node （ Virtual node ）; The usual way is ,hash(“IP:PORT:seqno”) %2³²
Distributed cache ; Distribute data evenly among different servers , Used to share the pressure of cache server ; The change in the number of cache servers should not affect cache invalidation ;

5. Use 2G In memory 20 Find out the integer with the most occurrences among 100 million integers

Thinking questions ： How to use 2G In memory 20 Find out the integer with the most occurrences among 100 million integers ？
Want to count the number of occurrences of each integer , We can think of integers as key, Count each integer key Number of occurrences value. Consider an extreme situation ：20 100 million integers are the same number , in other words value The maximum value of is 2*10⁹, Use uint32 You can meet value The maximum range of . One key–value Objects need to occupy 8bit Space ,2 100 million integers need to be occupied 1.6G Of memory , obviously , We want to make statistics at one time 20 It is impossible for an integer to appear the most times in 100 million numbers . At this time, we can put this 20 Divide 100 million integers into 10 Share , Each is close to 2 Million number , When splitting, we need to ensure that the same number is divided together , Let's go to each integer hash value , And then hash It's worth it 10 Remainder , utilize hash Strong randomness of function , This not only ensures that the same number is divided together , And can make 20 Evenly split 100 million integers . Data statistics ： Use a hash table to count the number of occurrences of each integer ,value The integer with the largest value is the integer with the most occurrences .
Recommend a free open course of zero sound College , Personally, I think the teacher spoke well , Share with you ：Linux,Nginx,ZeroMQ,MySQL,Redis,fastdfs,MongoDB,ZK, Streaming media ,CDN,P2P,K8S,Docker,TCP/IP, coroutines ,DPDK Etc , Click learn now :

原网站

版权声明
本文为[It's a snail knight]所创，转载请带上原文链接，感谢
https://yzsam.com/2022/200/202207170007213640.html