Hashing Tutorial

Section 2.2 - Binning

Say we are given keys in the range 0 to 99, and have a hash table of size 10. In this case, a possible hash function might simply divide the key value by 10. Thus, all keys in the range 0 to 9 would hash to slot 0, keys 10 to 19 would hash to slot 1, and so on. In other words, this hash function "bins" the first 10 keys to the first slot, the next 10 keys to the second slot, and so on.

"Binning" in this way has the problem that it will cluster together keys if the distribution does not divide evenly on the high-order bits. In the above example, if more records have keys in the range 90-99 (first digit 9) than have keys in the range 10-19 (first digit 1), more records will hash to slot 9 than to slot 1. (A similar, analogous problem arises if we were instead hashing strings based on the first letter in the string.)

Try out the binning hash function. Set the table size, then insert different values into the table. For each value, before you insert it, try to predict where it will be stored in the table.

Binning is like the inverse of the mod function: the mod function, for a power of two, looks at the low-order bits, while binning looks at the high-order bits. As another example, consider hashing a collection of keys whose values follow a normal distribution. Keys near the mean of the normal distribution are far more likely to occur than keys near the tails of the distribution. For a given slot, think of where the keys come from within the distribution. Binning would be taking thick slices out of the distribution and assign those slice to bins. If we use a hash table of size 16, we would divide the key range into 16 equal-width slices and assign each slice to a slot in the table. Since a normal distribution is more likely to generate keys from the middle slice, the middle slot of the table is most likely to be used. In contrast, if we use the mod function, then we are assigning to any given slot in the table a series of thin slices in steps of 16. In the normal distribution, some of these slices associated with any given slot are near the tails, and some are near the center. Thus, each table slot is equally likely (roughly) to get a key value.

The following applet will show you performance information about this hash function. Try with different table sizes, different levels of loading (the number of keys inserted), and different input distributions. In particular, see what happens when a normal distribution is used where the standard deviation is an order of magnitude less than the mean (ex, μ = 35000, σ = 1000).