H A S H   T A B L E S

Introduction

Binary search trees provide an excellent implementation of the list data structure. They make it possible to perform all of the standard list operations efficiently; for example, if the tree contains 10,000 items, the operations retrieve, insert, and delete each require approximately log2(10,000) ≈ 13 steps. As impressive as this efficiency may be, situations do occur for which the search tree implementations are not adequate.

Access time can be critical to an application. For example, when a call comes into the 911 emergency system, the system detects the caller's telephone number and searches a database for the caller's address. Similarly, an air traffic control system searches a database of flight information, given a flight number. Clearly both of these searches must be rapid.

A radically different strategy is necessary to locate (and insert and delete) an item virtually instantaneously. Imagine an array of N items -- with each array slot capable of holding a single item -- and a seemingly magical box called an "address calculator." Whenever you have a new item that you want to insert into the array, the address calculator will tell you where you should place it in the array.

You can therefore perform an insertion into the array as follows:

insert ( in newItem : ItemType ) {
  index = addressCalculator(newItem.getKey())
  array[index] = newItem
  }

You would also use the addressCalculator() for the retrieve and delete operations.

retrieve (  in searchKey : KeyType;
           out item : ItemType ) {
  index = addressCalculator(searchKey)
  if (array[index].getKey() == searchKey)
    item = array[index]
  else
    error!
  }

delete (  in searchKey : KeyType;
         out item : ItemType ) {
  index = addressCalculator(searchKey)
  if (array[index].getKey() == searchKey)
    Delete the item from the array
  else
    error!
  }

If you were to implement such a scheme, you would, of course, need to construct an addressCalculator() function that can, with very little work (so that it is fast!), tell you where a given item should be. Address calculators are actually not as mysterious as they seem; in fact, many exist that approximate the idealized behavior just described. Such an address calculator is usually referred to as a hash function. The scheme just described is an idealized description of a method known as hashing, and the array is called the hash table.

Consider the 911 emergency call example. If, for each person, the system had a record whose search key was the person's telephone number, it could store these records in a search tree. Although searching a tree would be fast, even faster access to a particular record would be possible by storing the records directly in an array.

You could store the record for a person whose telephone number is t in array[t]. Retrieval of the record, given the search key, is almost instantaneous. For example, you can store the record for the telephone number 123-4567 in array[1234567], if you can spare ten million memory locations for the array.

Since 911 systems are regional, you could consider only a single telephone exchange, and you could therefore store the record for number 123-4567 in array[4567] and get by with an array of only 10,000 locations.

The transformation of 1234567 into an array index of 4567 is a simple example of a hash function. A hash function h must take an arbitrary integer x and map it into an integer that you can use as an array index. Although this example of a hash table is not typical (since the table would be completely full), it serves to illustrate the idea of a hash function.

What if many fewer records were in the array? Consider, for example, the flight numbers in an air traffic control system. You could store a record for flight 4567 in array[4567], but you would still need an array of 10,000 locations, even if only 50 flights were active at any one time.

A different hash function would save memory. If you allow space for a maximum of 101 flights, for example, so that the array has indexes 0 through 100, the necessary hash function h should map any four-digit flight number into an integer in the range of 0 through 100.

Such a hash function would appear to be instantaneous. If hashing is so good, why did we spend all that time discussing other methods of fast data retrieval? Why is hashing not quite as simple as it sounds?

Obviously, the hash table must be large enough to store all the items we want to store. Using a fixed-size array, the implementation has the same pitfalls as the other array-based data structures we have seen. Yet even if the number of items to be stored never exceeds the size of the hash table, the implementation still has a major flaw.

Ideally, you want the hash function to map each x into a unique integer i. A hash function that achieves this ideal is called a perfect hash function; the 911 emergency system is an example.

In practice, however, a hash function can map two or more search keys x and y into the same integer; that is, the hash function tells you to store two or more items in the same array location array[i]. This occurrence is called a collision.

Even if the number of items that can be in the array at any one time is small, the only way to avoid collisions completely is for the hash table to be large enough that each possible search key value can have its own location. Such a table would typically require an enormous amount of memory. Collision-resolution schemes are therefore necessary to make hashing feasible. These schemes usually require that the hash function distribute items evenly throughout the hash table.

The hash function, therefore, must be easy and fast to compute, and it should distribute items evenly throughout the hash table.

Hash Functions

Should we consider only hash functions that take an arbitrary integer as the search key? What if the search key isn't an integer; say it's a string, like someone's name? It turns out to be easy to convert a string into a number (we'll see how shortly).

There are many ways to convert an arbitrary integer into an integer within a certain range, such as 0 through 100. Therefore, there are many ways to construct a hash function.

Selecting Digits

If your search key is the nine-digit employee ID number 001364825, you could select the fourth digit and the last digit (for example) to obtain 35 as the index into the hash table. That is:

    h(001364825) = 35

Therefore you could store the item whose search key is 001364825 in array[35]. You do need to be careful about which digits you choose. Understanding the nature of the data itself may guide your digit selection.
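
For illustration, here is a minimal sketch of such a digit-selection hash function, assuming the nine-digit key is stored in a long (with leading zeros implied); the function name is hypothetical:

// digit-selection hash (a sketch): combines the fourth digit and
// the last digit of a nine-digit key, so h(001364825) = 35
int digitSelectHash(long key) {
  int fourth = (key / 100000) % 10;  // fourth digit from the left
  int last   = key % 10;             // last digit
  return fourth * 10 + last;
}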

Digit-selection hash functions are simple and fast, but generally they do not evenly distribute the items in the hash table. The hash function really should utilize the entire search key.

Folding

One way to improve on the previous method is to add the digits. For example, you could add all the digits in 001364825 to obtain:

    0 + 0 + 1 + 3 + 6 + 4 + 8 + 2 + 5 = 29

This search key would therefore go into array[29]. Notice that if you add all the digits from a nine-digit search key:

    0 <= h(searchKey) <= 81

This would use only array[0] through array[81] of the hash table. To spread the items over more of the table, or to accommodate a larger hash table, you can group the digits in the search key and add the groups.

Let's say you divide the number into three groups of three digits:

    001 + 364 + 825 = 1,190

For this hash function,

    0 <= h(searchKey) <= 3 * 999 = 2,997

If 2,997 is larger than the size of the hash table that you want, you can alter the groups that you choose. Perhaps not quite as obvious is that you can apply more than one hash function to a search key. For example, you could select some of the digits from the search key before adding them, or you could either select digits from the previous result 1,190 or apply folding to it once again by adding 11 and 90.
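
As a minimal sketch, this grouped folding could be implemented as follows, again assuming the key is stored in a long (the function name is hypothetical):

// folding hash (a sketch): adds the three-digit groups of the key,
// so h(001364825) = 001 + 364 + 825 = 1,190
int foldHash(long key) {
  int sum = 0;
  while (key > 0) {
    sum += key % 1000;  // add the low-order three-digit group
    key /= 1000;        // move on to the next group
  }
  return sum;
}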

Modulo Arithmetic

Modulo arithmetic provides a simple and effective hash function. No matter what table size you would like for your hash table, modulo arithmetic will always work. Consider the function:

    h(x) = x mod tableSize

For example, if tableSize is 101, h(x) = x mod 101 maps any integer x into the range 0 to 100. For instance, h maps 001364825 into 12.

For this hash function, many x's map into array[0], many x's map into array[1], and so on. In other words, collisions will occur. However, you can distribute the items across the table evenly — thus reducing collisions — by choosing a prime number as tableSize.
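
In C++, this function is as short as you would expect; a minimal sketch, using a prime table size of 101 as suggested above:

const int tableSize = 101;  // a prime number

// modulo hash: maps any non-negative integer into 0 through 100;
// for example, hash(1364825) returns 12
int hash(int x) {
  return x % tableSize;
}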

Converting a character string into an integer

If your search key is a character string — such as a name — you could convert it into an integer before applying the hash function. To do so, you could first assign each character in the string an integer value.

For example, for the word "NOTE", you could assign the ASCII values 78, 79, 84, and 69, to the letters. Or, you could assign the values 1 through 26 to the letters A through Z, so you would assign 14 to N, 15 to O, 20 to T, and 5 to E.

strToNum.cpp:
#include <iostream>
#include <string>

using namespace std;

int main () {
  string name;
  int num = 0;

  cout << "Enter a name: ";
  cin >> name;

  // sum the ASCII value of each character in the name
  for (int i=0; i < (int)name.length(); i++)
    num += name[i];

  cout << "The result is " << num << endl;

  return 0;
}

If you simply add these numbers, you will get an integer, but it will not be unique to the character string. For example, the string "TONE" will give you the same result.

Another approach would be to write the numeric value of each character in binary and concatenate the results. If you assign the value 1 through 26 to the letters A through Z, you obtain the following for "NOTE":

N is 14, or 01110 in binary
O is 15, or 01111 in binary
T is 20, or 10100 in binary
E is  5, or 00101 in binary

Concatenating the binary values gives you the binary integer:

    01110011111010000101

which is 474,757 in decimal. You can apply the hash function x mod tableSize for x = 474,757.
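For instance, with tableSize = 101, 474,757 mod 101 = 57, so the item would be stored in array[57].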

Now consider a more efficient way to compute 474,757. Rather than converting the previous binary number to decimal, you can evaluate the expression:

    14 * 32^3 + 15 * 32^2 + 20 * 32^1 + 5 * 32^0

This computation is possible because we have represented each character as a 5-bit binary number, and 2^5 is 32.

By factoring this expression, you can minimize the number of arithmetic operations. This technique is called Horner's Rule and results in:

    ((14 * 32 + 15) * 32 + 20) * 32 + 5

Although both of these expressions have the same value, the result in either case could very well be larger than a typical computer can represent directly (for example, a 16-bit integer has a maximum unsigned value of 65,535).

However, because we plan to use the modulo hash function, you can prevent an overflow condition by applying the modulo operator after computing each parenthesized expression.

strToNum2.cpp:
// Convert String to Num Example using Horner's Rule
#include <iostream>
#include <string>

const int tableSize = 101;

using namespace std;

int main () {
  string name;
  int i, digit, num;

  cout << "Enter a name: ";
  cin >> name;

  // show the numeric value used for each letter in name
  // (assumes the name is entered in uppercase letters A-Z)
  for (i=0; i < (int)name.length(); i++) {
    // convert each character into a value from 1-26
    digit = name[i] - 'A' + 1;
    cout << name[i] << " => " << digit << endl;
    }

  // Horner's Rule: apply the modulo operator after each
  // multiply-and-add step to prevent overflow
  num = name[0] - 'A' + 1;
  for (i=1; i < (int)name.length(); i++) {
    digit = name[i] - 'A' + 1;
    num = (num * 32 + digit) % tableSize;
    }

  cout << "The result is " << num << endl;

  return 0;
}


Resolving Collisions

Suppose you want to insert an item whose search key is 4567. The modulo hash function indicates the item should be placed in array[22], because 4567 mod 101 is 22. Suppose, however, that array[22] already has an item in it. Where do you place the new item?

Two general approaches to collision resolution are common. One approach allocates another location within the hash table to the new item. A second approach changes the structure of the hash table so that each location array[i] can accommodate more than one item. The collision-resolution schemes described next exemplify these two approaches.

Approach 1: Open Addressing

During an attempt to insert a new item into a table, if the hash function indicates a location in the hash table is already occupied, you probe for some other empty, or open, location in which to place the item. The sequence of locations that you examine is called the probe sequence.

Such schemes are said to use open addressing. The concern is that you must be able to find a table item efficiently after you have inserted it. That is, the delete and retrieve operations must be able to reproduce the probe sequence that insert used and must do it efficiently.

Linear Probing

In this simple scheme to resolve a collision, you search the hash table sequentially, starting from the original hash location. Typically, you wrap around from the last table location to the first location if necessary.

In the absence of deletions, implementing retrieve is straightforward. You merely follow the same probe sequence that insert used until you either find the item you are searching for, reach an empty location (which indicates the item is not present), or visit every table location.

Deletions, however, add a slight complication. The delete operation itself is no problem. You merely find the desired item, as in retrieve, and delete it, making the location empty. What happens to retrieve after deletions? The new empty locations that delete created along a probe sequence could cause retrieve to stop prematurely, incorrectly indicating a failure.

You can resolve this problem by allowing a location to be in one of three states: occupied (currently in use), empty (has never been used), or deleted (once was occupied but is now available). You could therefore modify the retrieve operation to continue probing when it encounters a location in the deleted state. Similarly, you modify insert to insert into either empty or deleted locations.
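
To make these ideas concrete, here is a minimal sketch of linear probing with the three location states just described, assuming integer search keys and a modulo hash function; the class and member names are illustrative:

#include <vector>

enum State { EMPTY, OCCUPIED, DELETED };

struct Slot {
  int key;
  State state;
};

class LinearProbingTable {
public:
  LinearProbingTable(int size) : table(size) {
    for (int i = 0; i < size; i++)
      table[i].state = EMPTY;            // every location starts empty
  }

  bool insert(int key) {
    int i = hash(key);
    for (int probe = 0; probe < (int)table.size(); probe++) {
      if (table[i].state != OCCUPIED) {  // EMPTY or DELETED is available
        table[i].key = key;
        table[i].state = OCCUPIED;
        return true;
      }
      i = (i + 1) % (int)table.size();   // wrap around if necessary
    }
    return false;                        // the table is full
  }

  bool retrieve(int key) const {
    int i = hash(key);
    for (int probe = 0; probe < (int)table.size(); probe++) {
      if (table[i].state == EMPTY)
        return false;                    // an EMPTY location ends the search
      if (table[i].state == OCCUPIED && table[i].key == key)
        return true;
      // a DELETED location: keep probing
      i = (i + 1) % (int)table.size();
    }
    return false;
  }

  bool remove(int key) {
    int i = hash(key);
    for (int probe = 0; probe < (int)table.size(); probe++) {
      if (table[i].state == EMPTY)
        return false;
      if (table[i].state == OCCUPIED && table[i].key == key) {
        table[i].state = DELETED;        // mark the location; don't empty it
        return true;
      }
      i = (i + 1) % (int)table.size();
    }
    return false;
  }

private:
  std::vector<Slot> table;
  int hash(int key) const { return key % (int)table.size(); }
};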

One of the problems with the linear-probing scheme is that table items tend to cluster together in the hash table. The table contains groups of consecutively occupied locations; this is referred to as primary clustering. Large clusters tend to get even bigger. Primary clustering causes long probe searches and therefore decreases the overall efficiency of hashing.

Quadratic Probing

You can virtually eliminate primary clusters by adjusting the linear probing scheme: instead of probing consecutive locations from the original hash location, you check locations array[h(searchKey)+1^2], array[h(searchKey)+2^2], array[h(searchKey)+3^2], and so on (wrapping around the table as necessary) until you find an available location.

This scheme is called quadratic probing. Unfortunately, when two items hash into the same location, quadratic probing uses the same probe sequence for each item. This phenomenon — called secondary clustering — delays the resolution of the collision.
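
As a sketch, the i-th location in a quadratic probe sequence can be computed as follows, wrapping around the table with the modulo operator (the function name is illustrative):

// i-th location in a quadratic probe sequence;
// hashValue is h(searchKey) and i = 1, 2, 3, ...
int quadraticProbe(int hashValue, int i, int tableSize) {
  return (hashValue + i * i) % tableSize;
}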

Double Hashing

Double hashing, which is yet another open-addressing scheme, drastically reduces clustering. The probe sequences that both linear probing and quadratic probing use are key independent. For example, linear probing inspects the table locations sequentially, no matter what the hash key is. In contrast, double hashing defines key-dependent probe sequences. In this scheme the probe sequence still searches the table in a linear order, starting at h1(key), but a second hash function h2 (sometimes known as the rehash) determines the size of the steps taken.

Although you choose h1 as usual, you must follow these guidelines for h2:

    h2(key) != 0     h2 != h1

For example, define the primary and secondary (rehash) functions as:

    h1(key) = key mod 11
    h2(key) = 7 - (key mod 7)

where the hash table has only 11 locations, so we can readily see the effect of these functions on the table. If key = 58, h1 hashes to location 3 (58 mod 11), and h2 indicates that the probe sequence should take steps of size 5 (7 - 58 mod 7); the probe sequence will be 3, 8, 2, 7, 1, 6, 0, 5, 10, 4, 9, wrapping around the table as necessary.

If key = 14, h1 hashes the key to location 3 (14 mod 11), and h2 indicates that the probe sequence should take steps of size 7 (7 - 14 mod 7), so the probe sequence would be 3, 10, 6, 2, 9, 5, 1, 8, 4, 0.
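
The following sketch prints such a probe sequence, using the h1 and h2 just defined for an 11-location table:

#include <iostream>

const int tableSize = 11;

int h1(int key) { return key % tableSize; }
int h2(int key) { return 7 - (key % 7); }   // step size; never 0

int main () {
  int key = 58;
  int loc  = h1(key);   // starting location: 3
  int step = h2(key);   // step size: 5

  // visit tableSize locations, stepping by h2(key) each time
  for (int i = 0; i < tableSize; i++) {
    std::cout << loc << " ";
    loc = (loc + step) % tableSize;   // wrap around as necessary
  }
  std::cout << std::endl;             // prints: 3 8 2 7 1 6 0 5 10 4 9

  return 0;
}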

Each of these probe sequences visits all of the table locations. This phenomenon always occurs if the size of the table and the size of the probe step are relatively prime; that is, if their greatest common divisor is 1. Because the size of a hash table is commonly a prime number, it will be relatively prime to all step sizes.

If one rehash is good, how about more? What if the probe size changed for each rehash? While this might be desirable, such schemes are difficult to implement.

Increasing the Size of the Hash Table

With any of the open-addressing schemes, as the hash table fills, the probability of a collision increases. At some point, a larger hash table becomes desirable. If you use a dynamically allocated array for the hash table, you can increase its size whenever the table becomes too full.

You cannot simply double the size of the array, for example, because the size of the hash table needs to remain prime. Second, you cannot simply copy the items from the original hash table into the new one. If the hash function is x mod tableSize, an item's hash value changes as tableSize changes. Thus, you need to apply the new hash function to every item in the old hash table before placing it into the new hash table. Clearly, once again, there is no free lunch when resizing a hash table.
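
Here is a minimal sketch of this rehashing step, assuming a table of non-negative integer keys in which -1 marks an empty location, linear probing for collision resolution, and a simple helper to find the next prime table size (isPrime and nextPrime are illustrative helpers, not library functions):

#include <vector>

bool isPrime(int n) {
  if (n < 2) return false;
  for (int d = 2; d * d <= n; d++)
    if (n % d == 0) return false;
  return true;
}

int nextPrime(int n) {
  while (!isPrime(n)) n++;
  return n;
}

std::vector<int> rehash(const std::vector<int>& oldTable) {
  // roughly double the size, but keep it prime
  int newSize = nextPrime(2 * (int)oldTable.size());
  std::vector<int> newTable(newSize, -1);

  for (int key : oldTable) {
    if (key == -1) continue;       // skip empty locations
    int i = key % newSize;         // apply the NEW hash function
    while (newTable[i] != -1)      // resolve collisions by linear probing
      i = (i + 1) % newSize;
    newTable[i] = key;
  }
  return newTable;
}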

Approach 2: Restructuring the Hash Table

Another way to resolve collisions is to change the structure of the hash table so that it can accommodate more than one item in the same location.

Buckets

If you define the hash table so that each location array[i] is itself an array called a bucket, you then can store the items that hash to array[i] in this array.

The problem with this approach, of course, is choosing the size B of each bucket. If B is too small, you will have only postponed the problem of collisions until B + 1 items map into some array location. If you attempt to make B large enough so that each array location can accommodate the largest number of items that might map into it, you are likely to waste a good deal of storage.

Separate Chaining

A better approach is to design the hash table as an array of linked lists. In this collision-resolution method, known as separate chaining, each entry array[i] is a pointer to a linked list — the chain — of items that the hash function has mapped into location i.
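
A minimal sketch of separate chaining for integer search keys might look like this (the class name is illustrative):

#include <list>
#include <vector>

class ChainedHashTable {
public:
  ChainedHashTable(int size) : table(size) {}

  void insert(int key) {
    table[hash(key)].push_back(key);   // append to the chain at location i
  }

  bool retrieve(int key) const {
    for (int k : table[hash(key)])     // search only the one chain
      if (k == key) return true;
    return false;
  }

  void remove(int key) {
    table[hash(key)].remove(key);      // remove the key from its chain
  }

private:
  std::vector<std::list<int>> table;
  int hash(int key) const { return key % (int)table.size(); }
};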


Hash Table Simulation

To help you understand the steps involved in using a hash table (inserting, retrieving, and deleting items), a Hash Table Simulation page has been designed to show these steps. It provides a variety of hashing and collision-resolution techniques, as well as different bucket sizes, so you can see the process in action. You can view the simulation tool here.