Hashing Tutorial

Section 2.2 - Binning

Say we are given keys in the range 0 to 999, and have a hash table of size 10. In this case, a possible hash function might simply divide the key value by 100. Thus, all keys in the range 0 to 99 would hash to slot 0, keys 100 to 199 would hash to slot 1, and so on. In other words, this hash function "bins" the first 100 keys to the first slot, the next 100 keys to the second slot, and so on.

"Binning" in this way has the problem that it will cluster together keys if the distribution does not divide evenly on the high-order bits. In the above example, if more records have keys in the range 900-999 (first digit 9) than have keys in the range 100-199 (first digit 1), more records will hash to slot 9 than to slot 1. (A similar, analogous problem arises if we were instead hashing strings based on the first letter in the string.)

Try out the binning hash function. Set the table size, then insert different values into the table. The key range is 0-999 The key value home slot is computed as KEY/(1000/TABLESIZE). So if the table size is 10, keys 0-99 hash to slot 0, 100-199 hash to slot 1, and so on. Collisions are resolved using simple linear probing. For each value, before you insert it, try to predict where it will be stored in the table (you can turn on "test mode" to check your predictions). A good table size would be 10 so that you can easily calculate where the value should go.

Binning is like the inverse of the mod function: the mod function, for a power of two, looks at the low-order bits, while binning looks at the high-order bits. As another example, consider hashing a collection of keys whose values follow a normal distribution. Keys near the mean of the normal distribution are far more likely to occur than keys near the tails of the distribution. For a given slot, think of where the keys come from within the distribution. Binning would be taking thick slices out of the distribution and assign those slice to bins. If we use a hash table of size 16, we would divide the key range into 16 equal-width slices and assign each slice to a slot in the table. Since a normal distribution is more likely to generate keys from the middle slice, the middle slot of the table is most likely to be used. In contrast, if we use the mod function, then we are assigning to any given slot in the table a series of thin slices in steps of 16. In the normal distribution, some of these slices associated with any given slot are near the tails, and some are near the center. Thus, each table slot is equally likely (roughly) to get a key value.

The following applet will show you performance information about this hash function. Try with different table sizes, different levels of loading (the number of keys inserted), and different input distributions. In particular, see what happens when a normal distribution is used where the standard deviation is an order of magnitude less than the mean (ex, μ = 35000, σ = 1000).