Hashing Tutorial
Section 6 - Improved Collision Resolution Methods
Section 6.1 - Linear Probing by Steps
How can we avoid primary clustering? One possible improvement might be to use linear probing, but to skip slots by some constant c other than 1. This would make the probe function p(K, i) = ci, and so the ith slot in the probe sequence will be h(K) + ic) mod M. In this way, records with adjacent home positions will not follow the same probe sequence. For example, if we were to skip by twos, then our offsets from the home slot would be 2, then 4, then 6, and so on.
One quality of a good probe sequence is that it will cycle through all slots in the hash table before returning to the home position. Clearly linear probing (which "skips" slots by one each time) does this. Unfortunately, not all values for c will make this happen. For example, if c = 2 and the table contains an even number of slots, then any key whose home position is in an even slot will have a probe sequence that cycles through only the even slots. Likewise, the probe sequence for a key whose home position is in an odd slot will cycle through the odd slots. Thus, this combination of table size and linear probing constant effectively divides the records into two sets stored in two disjoint sections of the hash table. So long as both sections of the table contain the same number of records, this is not really important. However, just from chance it is likely that one section will become fuller than the other, leading to more collisions and poorer performance for those records. The other section would have fewer records, and thus better performance. But the overall system performance will be degraded, as the additional cost to the side that is more full outweighs the improved performance of the less-full side.
Constant c must be relatively prime to M to generate a linear probing sequence that visits all slots in the table (that is, c and M must share no factors). For a hash table of size M = 10, if c is any one of 1, 3, 7, or 9, then the probe sequence will visit all slots for any key. When M = 11, any value for c between 1 and 10 generates a probe sequence that visits all slots for every key.
This applet will let you try different step sizes for collision resolution. As you try this, be sure to pick keys to insert that will generate lots of collisions so that you can try this out. A good way to do this is to pick keys that are mostly multiples of the table size, which will hash to slot 0 in the table. Try some values for c that will visit every slot in the table if there are enough collisions. Try some values for c that will not visit every slot in the table. What happens when we pick a bad value for c and the subset of slots reachable are all full?
This applet shows one performance effect of using a collision resolution method that generates disjoint tables. The graph shows the probability that an insert operation will fail as the hash table fills up. When the probe function visits every slot of the table, insertion can never fail unless the hash table is full. But if the probe function does not visit every slot (such as when the step size for linear probing is 2 and the table size is even), then it is possible for one of the sub-tables to become full before the entire table is full. In this graph, the step size is always 2.
Consider the situation where c = 2 and we wish to insert a record with key k1 such that h(k1) = 3. The probe sequence for k1 is 3, 5, 7, 9, and so on. If another key k2 has home position at slot 5, then its probe sequence will be 5, 7, 9, and so on. The probe sequences of k1 and k2 are linked together in a manner that contributes to clustering. In other words, linear probing with a value of c > 1 does not solve the problem of primary clustering. We would like to find a probe function that does not link keys together in this way. We would prefer that the probe sequence for k1 after the first step on the sequence should not be identical to the probe sequence of k2. Instead, their probe sequences should diverge.