Useful Data Structures

Reading time ~3 minutes

Hashing

Basic Concepts

Definition 1: [Simple Uniform Hashing] A randomized algorithm for constructing hash functions is called as simple uniform hashing, if given any element is is equally likely to hash into any of the slots.

Definition 1: [Universal Hashing] A randomized algorithm for constructing hash functions is called universal, if for any in , we have,

We also say that a set H is a universal hash function family if randomly choosing produces a universal hashing.

Performances

1. Bloom Filter

Bloom filter@Bloom1970 is a space-efficient probabilistic data structure. It is used for membership query, i.e., answering the question whether an element belongs to a given set . However, bloom filter has a so-call false positive issue, where it may answer “YES” when .

1.1 Algorithmic Description

In standard bloom filter, the baseline data structure is a bit array of bits with independent hash functions. In this following description, we will use to denote the bit array, and to represent the hash functions.

At the very beginning, all of its bits are set to . Standard bloom filter supports the following two operations:

  1. INSERT: the following pseudo-code inserts all elements of set into the bloom filter.

    ~~~python for x in S: for h in hash_functions.values(): B[h(x)] = 1 ~~~

  2. QUERY: the following pseudo-code queries whether .

    ~~~python answer = True for h in hash_functions.values(): answer &= B[h(y)] == 1 ~~~

1.2 False Positive Analysis

From Section 1.1, you might notice that, for an element , the corresponding bits (i.e., ) might be all 1’s, therefore might be falsely claimed to be in the set . The phenomenon is so-called false positive. In this subsection, we want to analyze the probability of this phenomenon, which is usually called as false positive probability.

Assume that the bloom filter has bits and independent hash functions, and .

First of all, for any given bit, the probability that after inserting all the elements of , its value is still is,

As , when are large enough, the probability in Eqn. can be approximated (well) by using . Therefore, the false positive probability can be expressed as,

Case 1: given and , finding optimal .

Let , and

Hence, we have,

Let , then,

It is easy to figure out, when (i.e., ), . To check whether is the minimal point and the only extreme point, we can draw the curve of (which is shown in the following figure).

From the above curve, we can verify that is the only extreme point and it is the minimal point, therefore, it is the minimum point. Therefore, the optimal is,

References

Useful Linux Software

Ubuntu Software Installation Continue reading

Useful user-defined LaTeX commands

Published on November 27, 2016