Binary search trees provide excellent implementations of the list
data structure. They make it possible to perform all of the standard
list operations efficiently; for example, if the tree contains 10,000
items, the operations *retrieve, insert,* and *delete* each
require approximately log_{2}10,000 ≈ 13 steps. As impressive
as this efficiency may be, situations do occur for which the search tree
implementations are not adequate.

Access time can be critical to an application. For example, when a call comes into the 911 emergency system, the system detects the caller's telephone number and searches a database for the caller's address. Similarly, an air traffic control system searches a database of flight information, given a flight number. Clearly both of these searches must be rapid.

A radically different strategy is necessary to locate (and insert and
delete) an item virtually instantaneously. Imagine an array of *N*
items -- with each array slot capable of holding a single item -- and a
seemingly magical box called an "address calculator." Whenever you have a
new item that you want to insert into the array, the address calculator will
tell you where you should place it in the array.

You can therefore perform an insertion into the array as follows:

insert ( in newItem : ItemType )
   index = addressCalculator(newItem.getKey())
   array[index] = newItem

You would also use the addressCalculator() for the *retrieve* and
*delete* operations.

retrieve ( in searchKey : KeyType; out item : ItemType )
   index = addressCalculator(searchKey)
   if (array[index].getKey() == searchKey)
      item = array[index]
   else
      error!

delete ( in searchKey : KeyType; out item : ItemType )
   index = addressCalculator(searchKey)
   if (array[index].getKey() == searchKey)
      Delete the item from the array
   else
      error!
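The three operations just described can be sketched in ordinary code. The following is a minimal illustration, assuming the "address calculator" is a simple function of the key (here, the key modulo the table size; real hash functions are discussed shortly) and that items are (key, value) pairs:

```python
TABLE_SIZE = 101
table = [None] * TABLE_SIZE

def address_calculator(key):
    # Placeholder "magical box": maps a key to an array index.
    return key % TABLE_SIZE

def insert(item):
    # item is a (key, value) pair; place it where the calculator says.
    table[address_calculator(item[0])] = item

def retrieve(search_key):
    item = table[address_calculator(search_key)]
    if item is not None and item[0] == search_key:
        return item
    raise KeyError(search_key)          # the "error!" case

def delete(search_key):
    index = address_calculator(search_key)
    item = table[index]
    if item is not None and item[0] == search_key:
        table[index] = None             # delete the item from the array
        return item
    raise KeyError(search_key)          # the "error!" case
```

Note that this sketch ignores collisions entirely, just as the idealized scheme does; collision handling is the subject of the rest of the discussion.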

If you were to implement such a scheme, you would, of course, need to
construct an addressCalculator() function that can, with very little work
(so that it is fast), tell you where a given item should be. Address
calculators are actually not as mysterious as they seem; in fact, many exist
that can approximate the idealized behavior just described. Such an address
calculator is usually referred to as a **hash function**. The scheme just
described is an idealized description of a method known as **hashing**,
and the array is called the **hash table**.

Consider the 911 emergency call example. If, for each person, the system had a record whose search key was the person's telephone number, it could store these records in a search tree. Although searching a tree would be fast, faster access to a particular record would be possible by storing the records in an array.

You could store the record for a person whose
telephone number is *t* in *array[t]*. Retrieval of the record,
given the search key, is almost instantaneous. For example, you can store
the record for the telephone number 123-4567 in *array[1234567]*, if
you can spare ten million memory locations for *array*.

Since 911 systems are regional, you could consider only the one telephone
exchange, and you could therefore store the record for number 123-4567 in
*array[4567]* and get by with an array of only 10,000 locations.

The transformation of 1234567 into an array index of 4567 is a simple
example of a hash function. A hash function *h* must take an arbitrary
integer *x* and map it into an integer that you can use as an array
index. Although this example of a hash array is not typical (since it would
be full), it serves to illustrate the idea of a hash function.
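The truncation of 1234567 to 4567 can be written directly: keeping the last four digits of a number is simply the remainder after dividing by 10,000. A sketch:

```python
def h(x):
    # Keep only the last four digits of the phone number:
    # 1234567 -> 4567
    return x % 10000
```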

What if many fewer records were in the array? Consider, for example, an air
traffic control system that searches by flight number. You could store a record for
Flight 4567 in *array[4567]*, but you would still need an array of
10,000 locations, even if only 50 flights were current.

A different hash function would save memory. If you allow space for a
maximum of 101 flights, for example, so that the array has indexes 0 through
100, the necessary hash function *h* should map any four-digit flight
number into an integer in the range of 0 through 100.

Such a hash function would appear to be instantaneous. If hashing is so good, why did we spend all that time discussing other methods of quick data retrieval? Why is hashing not quite as simple as it sounds?

Obviously, the hash table must be large enough to store all the items we want to store. With a fixed-size array, the implementation has the same pitfalls as the other array-based data structure solutions we have looked at. Even if the number of items to be stored will never exceed the size of the hash table, the implementation still has a major flaw.

Ideally, you want the hash function to map each *x* into a unique
integer *i*. The hash function in the ideal situation is called a
**perfect hash function**; an example is the 911 emergency system.

In practice, however, a hash function can map two or more search keys
*x* and *y* into the *same* integer; that is, the hash function
tells you to store two or more items in the same array location *array[i]*.
This occurrence is called a **collision**.

Even if the number of items that can be in your array at any one time is small, the only way to avoid collisions completely is for the hash table to be large enough that each possible search-key value has its own location. Such a table would usually require an impractically large amount of memory. Because of this, collision-resolution schemes are necessary to make hashing feasible. Such resolution schemes usually require that the hash function place items evenly throughout the hash table.

A hash function, therefore, must:

- Be easy and fast to compute.
- Place items evenly throughout the hash table.

Should we consider only hash functions that take an arbitrary integer as the search key? What if the search key isn't an integer; say it's a string, like someone's name? It is possible to easily convert a string into a number (we'll see how shortly).

There are many ways to convert an arbitrary integer into an integer within a certain range, such as 0 through 100. Therefore, there are many ways to construct a hash function.

If your search key is the nine-digit employee ID number 001364825, you could select the fourth digit and the last digit (for example), to obtain 35 as the index to the hash table. That is:


*h*(001__3__6482__5__) = 35

Therefore, you could store the item whose search key is 001364825 in
*array[35]*. You do need to be careful about which digits you choose.
Understanding the nature of the data itself may guide your digit selection.
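A digit-selection hash like the one above can be sketched as follows; the choice of the fourth and last digits is the example's, and other positions work the same way:

```python
def digit_select(key):
    # Treat the nine-digit ID as a string so leading zeros survive.
    s = str(key).zfill(9)      # 1364825 -> "001364825"
    # Pick the 4th digit ("3") and the last digit ("5") -> 35.
    return int(s[3] + s[8])
```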

Digit-selection hash functions are simple and fast, but generally they do not evenly distribute the items in the hash table. The hash function really should utilize the entire search key.

One way to improve on the previous method is to add the digits. For example, you could add all the digits in 001364825 to obtain:

` 0 + 0 + 1 + 3 + 6 + 4 + 8 + 2 + 5 = 29`

` 0 <= h(searchKey) <= 81`

since each of the nine digits is at most 9. This would use only *array[0]*
through *array[81]* of the hash table. To change the situation or to
increase the size of the hash table, you can group the digits in the search
key and add the groups.

Let's say you divide the number into three groups of three digits:

` 001 + 364 + 825 = 1,190`

For this hash function,

` 0 <= h(searchKey) <= 3 * 999 = 2,997`

If 2,997 is larger than the size of the hash table that you want, you can alter the groups that you choose. Perhaps not quite as obvious is that you can apply more than one hash function to a search key. For example, you could select some of the digits from the search key before adding them, or you could either select digits from the previous result 1,190 or apply folding to it once again by adding 11 and 90.
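Both folding variants are easy to write down. The following sketch assumes nine-digit keys (with leading zeros restored) and the three-digit grouping used above:

```python
def fold_digits(key):
    # Add the individual digits: 0+0+1+3+6+4+8+2+5 = 29
    return sum(int(d) for d in str(key).zfill(9))

def fold_groups(key):
    # Add three groups of three digits: 001 + 364 + 825 = 1190
    s = str(key).zfill(9)
    return int(s[0:3]) + int(s[3:6]) + int(s[6:9])
```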

Modulo arithmetic provides a simple and effective hash function. No matter what table size you would like to use for your hash table, modulo arithmetic will work. Consider the function:


*h*(x) = x mod *tableSize*

For example, if *tableSize* is 101, then *h*(x) = x mod 101 maps any
integer *x* into the range 0 through 100; for instance, *h* maps
001364825 into 12.

For this hash function, many *x*'s map into *array[0]*, many *x*'s
map into *array[1]*, and so on. In other words, collisions will occur.
However, you can distribute the items across the table evenly — thus
reducing collisions — by choosing a prime number as *tableSize*.
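The modulo hash function is a one-liner; the sketch below uses the table size 101 from the example:

```python
def h(x, table_size=101):
    # 101 is prime; a prime table size helps spread the items evenly.
    return x % table_size
```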

If your search key is a character string — such as a name — you could convert it into an integer before applying the hash function. To do so, you could first assign each character in the string an integer value.

For example, for the word "NOTE", you could assign the ASCII values 78, 79, 84, and 69, to the letters. Or, you could assign the values 1 through 26 to the letters A through Z, so you would assign 14 to N, 15 to O, 20 to T, and 5 to E.

If you simply add these numbers, you will get an integer, but it will not be unique to the character string. For example, the string "TONE" will give you the same result.
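The collision between "NOTE" and "TONE" is easy to demonstrate with the A=1 through Z=26 assignment; any anagram produces the same sum:

```python
def letter_sum(word):
    # Assign A=1, B=2, ..., Z=26 and add the values.
    return sum(ord(c) - ord('A') + 1 for c in word)
```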

Another approach would be to write the numeric value of each character in binary and concatenate the results. If you assign the value 1 through 26 to the letters A through Z, you obtain the following for "NOTE":

N is 14, or 01110 in binary
O is 15, or 01111 in binary
T is 20, or 10100 in binary
E is  5, or 00101 in binary

Concatenating the binary values gives you the binary integer:

` 01110011111010000101`

which is 474,757 in decimal. You can apply the hash function *x*
mod *tableSize* for x = 474,757.

Now consider a more efficient way to compute 474,757. Rather than converting the previous binary number to decimal, you can evaluate the expression:

` 14 * 32`^{3}` + 15 * 32`^{2}` + 20 * 32`^{1}` + 5 * 32`^{0}

This computation is possible because we have represented each character as a
5-bit binary number, and 2^{5} is 32.

By factoring this expression, you can minimize the number of arithmetic operations. This technique is called Horner's Rule and results in:

` ((14 * 32 + 15) * 32 + 20) * 32 + 5`

Although both of these expressions have the same value, the result in either case could very well be larger than a typical computer could represent (at least using standard-sized integers with a maximum value of 65,535).

However, because we plan to use the modulo hash function, you can prevent an overflow condition by applying the modulo operator after computing each parenthesized expression.
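Horner's rule with a modulo at each step can be sketched as follows; the table size of 101 is an assumption for illustration, and the A=1 through Z=26 encoding is the one used above:

```python
def hash_string(word, table_size=101):
    # Horner's rule over base 32 (each letter value fits in 5 bits).
    # Taking the remainder after every step keeps the intermediate
    # value smaller than table_size * 32 + 26, so it never overflows.
    result = 0
    for c in word:
        result = (result * 32 + (ord(c) - ord('A') + 1)) % table_size
    return result
```

Because (a mod m) * b + c and a * b + c have the same remainder mod m, this produces the same hash value as computing the full 474,757 first and then taking the remainder.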

Suppose you want to insert an item whose search key is 4567. The modulo
hash function indicates the item should be placed in *array[22]*, because
4567 mod 101 is 22. Suppose, however, that *array[22]* already has an
item in it. Where do you place the new item?

Two general approaches to collision resolution are common. One approach
allocates another location *within* the hash table to the new item.
A second approach changes the structure of the hash table so that each location
*array[i]* can accommodate more than one item. The collision-resolution
schemes described next exemplify these two approaches.

During an attempt to insert a new item into a table, if the hash function
indicates a location in the hash table is already occupied, you probe for some
other empty, or open, location in which to place the item. The sequence of
locations that you examine is called the **probe sequence**.

Such schemes are said to use **open addressing**. The concern is that
you must be able to find a table item efficiently after you have inserted it.
That is, the *delete* and *retrieve* operations must be able to
reproduce the probe sequence that *insert* used and must do it
efficiently.

In this simple scheme to resolve a collision, you search the hash table
sequentially, starting from the original hash location. Typically, you
*wrap around* from the last table location to the first location if
necessary.

In the absence of deletions, implementing *retrieve* is straightforward.
You merely follow the same probe sequence that *insert* used until you
either find the item you are searching for, reach an empty location (which
indicates the item is not present), or visit every table location.

Deletions, however, add a slight complication. The *delete* operation
itself is no problem. You merely find the desired item, as in *retrieve*,
and delete it, making the location empty. What happens to *retrieve*
after deletions? The new empty locations that *delete* created along a
probe sequence could cause *retrieve* to stop prematurely, incorrectly
indicating a failure.

You can resolve this problem by allowing a location to be in one of three
states: __occupied__ (currently in use), __empty__ (has never been used),
or __deleted__ (once was occupied but is now available). You could therefore
modify the *retrieve* operation to continue probing when it encounters a
location in the deleted state. Similarly, you modify *insert* to insert
into either empty or deleted locations.
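The three-state scheme can be sketched as a small open-addressing table with linear probing; the table size of 101 (assumed prime) and the modulo hash function are assumptions carried over from the earlier examples:

```python
EMPTY, OCCUPIED, DELETED = 0, 1, 2

class LinearProbingTable:
    def __init__(self, size=101):
        self.size = size
        self.state = [EMPTY] * size    # every location starts "never used"
        self.items = [None] * size

    def _probe(self, key):
        # Probe sequence: h(key), h(key)+1, ..., wrapping around.
        start = key % self.size
        for i in range(self.size):
            yield (start + i) % self.size

    def insert(self, key, value):
        for i in self._probe(key):
            if self.state[i] != OCCUPIED:   # empty OR deleted is usable
                self.state[i] = OCCUPIED
                self.items[i] = (key, value)
                return
        raise OverflowError("hash table is full")

    def retrieve(self, key):
        for i in self._probe(key):
            if self.state[i] == EMPTY:      # never used: key is absent
                break
            if self.state[i] == OCCUPIED and self.items[i][0] == key:
                return self.items[i][1]
            # DELETED: continue probing past it
        raise KeyError(key)

    def delete(self, key):
        for i in self._probe(key):
            if self.state[i] == EMPTY:
                break
            if self.state[i] == OCCUPIED and self.items[i][0] == key:
                self.state[i] = DELETED     # available, but probes continue
                self.items[i] = None
                return
        raise KeyError(key)
```

Marking a slot *deleted* rather than *empty* is exactly what keeps *retrieve* from stopping prematurely on a probe sequence that once passed through that slot.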

One of the problems with the linear-probing scheme is that table items
tend to **cluster** together in the hash table. The table contains groups
of consecutively occupied locations; this is referred to as *primary
clustering*. Large clusters tend to get even bigger. Primary clustering
causes long probe searches and therefore decreases the overall efficiency of
hashing.

You can virtually eliminate the primary clusters by adjusting the linear
probing scheme; instead of probing consecutive locations from the original
hash location, you check locations *array[h(searchKey)+1 ^{2}],
array[h(searchKey)+2^{2}], array[h(searchKey)+3^{2}]*, and
so on until you find an available location.

This scheme is called **quadratic probing**. Unfortunately, when two
items hash into the same location, quadratic probing uses the same probe
sequence for each item. This phenomenon — called *secondary
clustering* — delays the resolution of the collision.
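The quadratic probe sequence, and the secondary clustering it produces, can be seen in a few lines; the table size of 101 is an assumption carried over from the earlier examples:

```python
def quadratic_probe_sequence(key, table_size=101, limit=5):
    # First few locations examined: h(key), h(key)+1^2, h(key)+2^2, ...
    start = key % table_size
    return [(start + i * i) % table_size for i in range(limit)]
```

Two keys that hash to the same location produce identical sequences, which is precisely the secondary-clustering problem.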

Double hashing uses a second hash function, *h _{2}*, to determine the
probe step whenever a collision occurs. Although you choose *h _{1}* as
usual, you must follow these guidelines for *h _{2}*:

*h _{2}(key)* != 0

*h _{2}* != *h _{1}*

For example, define the primary and secondary (rehash) functions as:

*h _{1}(key) = key* mod 11

*h _{2}(key)* = 7 - (*key* mod 7)

where the hash table has only 11 locations, so we can readily see the
effect of these functions on the table. If the key = 58, *h _{1}*
hashes the key to location 3 (58 mod 11), and *h _{2}* indicates a probe
step of 5 (7 - 58 mod 7), giving the probe sequence 3, 8, 2, 7, 1, 6, 0, 5,
10, 4, 9.

If key = 14, *h _{1}* hashes the key to location 3 (14 mod 11), and
*h _{2}* indicates a probe step of 7 (7 - 14 mod 7), giving the probe
sequence 3, 10, 6, 2, 9, 5, 1, 8, 4, 0, 7.

Each of these probe sequences visits *all* of the table locations.
This phenomenon always occurs if the size of the table and the size of the
probe step are relatively prime; that is, if their greatest common divisor is 1.
Because the size of a hash table is commonly a prime number, it will be
relatively prime to all step sizes.

With any of the open-addressing schemes, as the hash table fills, the probability of a collision increases. At some point, a larger hash table becomes desirable. If you use a dynamically allocated array for the hash table, you can increase its size whenever the table becomes too full.

You cannot simply double the size of the array, for example, because the
size of the hash table needs to remain prime. Second, you cannot simply copy
the items from the original hash table to the new hash table: if the hash
function is *x* mod *tableSize*, the hash value changes as
*tableSize* changes. Thus, you need to apply the new hash function to
every item in the old hash table before placing it into the new hash table.
Clearly, once again, there is no free lunch when dealing with resizing a hash
table.

Another way to resolve collisions is to change the structure of the hash table so that it can accommodate more than one item in the same location.

If you define the hash table so that each location *array[i]* is
itself an array called a **bucket**, you then can store the items that
hash to *array[i]* in this array.

The problem with this approach, of course, is choosing the size **B**
of each bucket. If **B** is too small, you will have only postponed the
problem of collisions until **B** + 1 items map into some array location.
If you attempt to make **B** large enough so that each array location can
accommodate the largest number of items that might map into it, you are likely
to waste a good deal of storage.

A better approach is to design the hash table as an array of linked lists.
In this collision-resolution method, known as **separate chaining**,
each entry *array[i]* is a pointer to a linked list — the
**chain** — of items that the hash function has mapped into
location *i*.
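A sketch of separate chaining follows; Python lists stand in for the linked lists (chains), and the table size of 101 and the modulo hash function are assumptions carried over from the earlier examples:

```python
class ChainedHashTable:
    def __init__(self, size=101):
        self.size = size
        # Each entry is a chain of the (key, value) pairs that hash there.
        self.chains = [[] for _ in range(size)]

    def _chain(self, key):
        return self.chains[key % self.size]

    def insert(self, key, value):
        self._chain(key).append((key, value))

    def retrieve(self, key):
        for k, v in self._chain(key):   # search only this chain
            if k == key:
                return v
        raise KeyError(key)

    def delete(self, key):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                return
        raise KeyError(key)
```

Because a chain can grow as needed, separate chaining avoids choosing a bucket size **B** in advance, at the cost of the storage for the chain links.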

To help you understand the steps involved with a hash table (inserting, retrieving, and deleting items), a Hash Table Simulation page has been designed to show these steps, providing a variety of both hashing and collision-resolution techniques, as well as different bucket sizes, so you can see the process in action.