File Structure: HASHING

Introduction

Key driven file access should be O(1) - that is, the time to access a record should be a constant which does not vary with the size of the dataset.
Indexing can be regarded as a table driven function which translates a key to a numeric location.
Hashing can be regarded as a computation driven function which translates a key to a numeric location.
hashing

The transformation of a search key into a number by means of mathematical calculations.
randomize

To transform in an apparently random way.
Hashing uses a repeatable pseudorandom function.
The hashing function should produce a uniform distribution of hash values.
uniform distribution

A randomization in which each value in a range has an equal probability.
For each key, the result of the hashing function is used as the home address of the record.
home address

The address produced by the hashing of a record key.
Under ideal conditions, hashing provides O(1) key driven file access.

Hashing Algorithms

Modulus

Modulus - the key is divided by the size of the table, and the remainder is used as the hash function.
Example:
Key = 123-45-6789123456789 % 11 = 5 h(123-45-6789) = 5
Modulus functions work better when the divisor is a prime number, or at least not a composite of small numbers.

Fold and Add

fold and add

A hashing techique which separates a long key field into smaller parts which are added together.
Example:
Key = 123-45-6789 123 456 +789 1368
h(123-45-6789) = 1368
Fold and add is used for long keys.

Mid-square

mid-square

hashing method which squares the key and the uses the middle digits of the result.
Example:
Key = 123-45-6789123456789² = 15241578750190521 h(123-45-6789) = 8750

Combined methods

Practical hashing functions often combine techniques.
Example:
Key = 123-45-6789 123 456 +789 1368 1368 % 11 = 4
h(123-45-6789) = 4
For non-numeric keys, the key is simply treated as though it were a number, using its internal binary representation.
Example:
Key = "Kemp"
"Kemp" = 4B656D70₁₆ = 1264938352₁₀

Hashing Functions and Record Distributions

The size of the key space is typically much larger than the space of hashed values.
This means that more than one key will map to the same hash value.

Collisions

synonyms

Keys which hash to the same value.
collision

An attempt to store a record at an address which does not have sufficient room
packing density

The ratio of used space to allocated space.
For simple hashing, the probability of a synonym is the same as the packing density.

How Much Extra Memory Should be Used?

Increasing memory (i.e., increasing the size of the hash table) will decrease collisions.

Collision Resolution by Progressive Overflow

progressive overflow

A collision resolution technique which places overflow records at the first empty address after the home address
With progressive overflow, a sequential search is performed beginning at the home address.
The search is continued until the desired key or a blank record is found.
Progressive overflow is also referred to as linear probing.

Storing more than One Record per Address: Buckets

bucket

An area of a hash table with a single hash address which has room for more than one record.
When using buckets, an entire bucket is read or written as a unit. (Records are not read individually.)
The use of buckets will reduce the average number of probes required to find a record.

Making Deletions

tombstone

A marker placed on a deleted record.
Using a tombstone avoids terminating a probe prematurely.
Locations containing a tombstone can be be used for records being added.

Other Collision Resolution Techniques

Double Hashing

Double hashing is similar to progressive overflow.
The second hash value is used as a stepping distance for the probing.
The second hash value should never be one. (Add one.)
The second hash value should be relatively prime to the size of the table. (This happens if the table size is prime.)
double hashing

A collision resolution scheme which applies a second hash function to keys which collide, to determine a probing distance.
The use of double hashing will reduce the average number of probes required to find a record.
Linear probing and double hashing are both referred to as open addressing collision handling methods.

Open Chaining

Open chaining forms a linked list, or chain, of synonyms.
The overflow records can be kept in the same file as the hash table itself:
The overflow records can be kept in a separate file:

Scatter Tables

If all records are moved into a separate "overflow" area, with only links being left in the hash table, the result is a scatter table.
A scatter table scatter table is smaller than an index for the same data.

Patterns of Record Access

Loading items in decreasing order of retrieval frequncy will improve the average access time.
The idea is that records accessed more frequently can be accessed more quickly.
This will lower the average retrieval time.

File Structure

Thursday, 29 November 2012

HASHING

Introduction

Hashing Algorithms

Modulus

Fold and Add

Mid-square

Combined methods

Hashing Functions and Record Distributions

Collisions

How Much Extra Memory Should be Used?

Collision Resolution by Progressive Overflow

Storing more than One Record per Address: Buckets

Making Deletions

Other Collision Resolution Techniques

Double Hashing

Open Chaining

Scatter Tables

Patterns of Record Access

1 comment: