Hashing Tutorial

Section 5 - Collision Resolution

We now turn to the most commonly used form of hashing: closed hashing with no bucketing, and a collision resolution policy that can potentially use any slot in the hash table.

During insertion, the goal of collision resolution is to find a free slot in the hash table when the home position for the record is already occupied. We can view any collision resolution method as generating a sequence of hash table slots that can potentially hold the record. The first slot in the sequence will be the home position for the key. If the home position is occupied, then the collision resolution policy goes to the next slot in the sequence. If this is occupied as well, then another slot must be found, and so on. This sequence of slots is known as the probe sequence, and it is generated by some probe function that we will call p. Insertion works as follows:

// Insert e into hash table HT
void hashInsert(const Key& k, const Elem& e) {
  int home;                     // Home position for e
  int pos = home = h(k);        // Init probe sequence
  for (int i=1; EMPTYKEY != (HT[pos]).key(); i++) {
    pos = (home + p(k, i)) % M; // probe
    if (k == HT[pos].key()) {
      cout << "Duplicates not allowed\n";
      return;
    }
  }
  HT[pos] = e;
}

Method "hashInsert" first checks to see if the home slot for the key is empty. If the home slot is occupied, then we use the probe function p(k, i) to locate a free slot in the table. Function p has two parameters, the key k and a count i of where in the probe sequence we wish to be. That is, to get the first position in the probe sequence after the home slot for key K, we call p(K, 1). For the next slot in the probe sequence, call p(K, 2). Note that the probe function returns an offset from the original home position, rather than a slot in the hash table. Thus, the for loop in "hashInsert" is computing positions in the table at each iteration by adding the value returned from the probe function to the home position. The ith call to p returns the ith offset to be used.

Searching in a hash table follows the same probe sequence that was followed when inserting records. In this way, a record not in its home position can be recovered. A C++ implementation for the search procedure is as follows.

// Search for the record with Key K
bool hashSearch(const Key& K, Elem& e) const {
  int home;              // Home position for K
  int pos = home = h(K); // Initial position is the home slot
  for (int i = 1;
       (K != (HT[pos]).key()) && (EMPTYKEY != (HT[pos]).key());
       i++)
    pos = (home + p(K, i)) % M; // Next on probe sequence
  if (K == (HT[pos]).key()) {   // Found it
    e = HT[pos];
    return true;
  }
  else return false;            // K not in hash table
}

Both the insert and the search routines assume that at least one slot on the probe sequence of every key will be empty. Otherwise they will continue in an infinite loop on unsuccessful searches. Thus, the hash system should keep a count of the number of records stored, and refuse to insert into a table that has only one free slot.

The discussion on bucket hashing presented a simple method of collision resolution. If the home position for the record is occupied, then move down the bucket until a free slot is found. This is an example of a technique for collision resolution known as linear probing. The probe function for simple linear probing is p(K, i) = i. That is, the ith offset on the probe sequence is just i, meaning that the ith step is simply to move down i slots in the table. Once the bottom of the table is reached, the probe sequence wraps around to the beginning of the table. Linear probing has the virtue that all slots in the table will be candidates for inserting a new record before the probe sequence returns to the home position.

Use this applet to try out linear probing for yourself. Can you see any reason why this might not be the best approach to collision resolution?

While linear probing is probably the first idea that comes to mind when considering collision resolution policies, it is not the only one possible. Probe function p allows us many options for how to do collision resolution. In fact, linear probing is one of the worst collision resolution methods. The main problem is illustrated by the figure below. Here, we see a hash table of ten slots used to store four-digit numbers. The hash function used is h(K) = K mod 10. The four values 1001, 9050, 9877, and 2037 are inserted into the table. Then the value 1059 is added to the hash table.

linear probing figure

The ideal behavior for a collision resolution mechanism is that each empty slot in the table will have equal probability of receiving the next record inserted (assuming that every slot in the table has equal probability of being hashed to initially). In this example, assume that the hash function gives each slot (roughly) equal probability of being the home position for the next key. However, consider what happens to the next record if its key has its home position at slot 0. Linear probing will send the record to slot 2. The same will happen to records whose home position is at slot 1. A record with home position at slot~2 will remain in slot 2. Thus, the probability is 3/10 that the next record inserted will end up in slot 2. In a similar manner, records hashing to slots 7 or 8 will end up in slot 9. However, only records hashing to slot 3 will be stored in slot 3, yielding one chance in ten of this happening. Likewise, there is only one chance in ten that the next record will be stored in slot 4, one chance in ten for slot 5, and one chance in ten for slot 6. Thus, the resulting probabilities are not equal.

To make matters worse, if the next record ends up in slot 9 (which already has a higher than normal chance of happening), then the following record will end up in slot 2 with probability 6/10. This is illustrated in the right side of the figure. This tendency of linear probing to cluster items together is known as primary clustering. Small clusters tend to merge into big clusters, making the problem worse. The objection to primary clustering is that it leads to long probe sequences.

Now go back up to the applet, and enter the numbers shown in the figure. Be sure to click the "Show Count" button before you do this, so that you can see how many home slots will probe to a given free slot.