Thursday, 29 November 2012

HASHING


Introduction

  • Key driven file access should be O(1) - that is, the time to access a record should be a constant which does not vary with the size of the dataset.
  • Indexing can be regarded as a table driven function which translates a key to a numeric location.
  • Hashing can be regarded as a computation driven function which translates a key to a numeric location.
  • hashing
    The transformation of a search key into a number by means of mathematical calculations.
  • randomize
    To transform in an apparently random way.
  • Hashing uses a repeatable pseudorandom function.
  • The hashing function should produce a uniform distribution of hash values.
  • uniform distribution
    A randomization in which each value in a range has an equal probability.
  • For each key, the result of the hashing function is used as the home address of the record.
  • home address
    The address produced by the hashing of a record key.
  • Under ideal conditions, hashing provides O(1) key driven file access.

Hashing Algorithms

Modulus

  • Modulus - the key is divided by the size of the table, and the remainder is used as the hash function.
  • Example:
    Key = 123-45-6789
    123456789 % 11 = 5
    h(123-45-6789) = 5
  • Modulus functions work better when the divisor is a prime number, or at least not a composite of small numbers.

Fold and Add

  • fold and add
    A hashing techique which separates a long key field into smaller parts which are added together.
  • Example:
    Key = 123-45-6789
     123
     456
    +789
    1368

    h(123-45-6789) = 1368
  • Fold and add is used for long keys.

Mid-square

  • mid-square
    hashing method which squares the key and the uses the middle digits of the result.
  • Example:
    Key = 123-45-6789
    1234567892 = 15241578750190521
    h(123-45-6789) = 8750

Combined methods

  • Practical hashing functions often combine techniques.
  • Example:
    Key = 123-45-6789
     123
     456
    +789
    1368

    1368 % 11 = 4

    h(123-45-6789) = 4
  • For non-numeric keys, the key is simply treated as though it were a number, using its internal binary representation.
  • Example:
    Key = "Kemp"
    "Kemp" = 4B656D7016 = 126493835210

Hashing Functions and Record Distributions

  • The size of the key space is typically much larger than the space of hashed values.
  • This means that more than one key will map to the same hash value.

Collisions

  • synonyms
    Keys which hash to the same value.
  • collision
    An attempt to store a record at an address which does not have sufficient room
  • packing density
    The ratio of used space to allocated space.
  • For simple hashing, the probability of a synonym is the same as the packing density.

How Much Extra Memory Should be Used?

  • Increasing memory (i.e., increasing the size of the hash table) will decrease collisions.

Collision Resolution by Progressive Overflow

  • progressive overflow
    A collision resolution technique which places overflow records at the first empty address after the home address
  • With progressive overflow, a sequential search is performed beginning at the home address.
  • The search is continued until the desired key or a blank record is found.
  • Progressive overflow is also referred to as linear probing.

Storing more than One Record per Address: Buckets

  • bucket
    An area of a hash table with a single hash address which has room for more than one record.
  • When using buckets, an entire bucket is read or written as a unit.  (Records are not read individually.)
  • The use of buckets will reduce the average number of probes required to find a record.

Making Deletions

  • tombstone
    A marker placed on a deleted record.
  • Using a tombstone avoids terminating a probe prematurely.
  • Locations containing a tombstone can be be used for records being added.

Other Collision Resolution Techniques

Double Hashing

  • Double hashing is similar to progressive overflow.
  • The second hash value is used as a stepping distance for the probing.
  • The second hash value should never be one.  (Add one.)
  • The second hash value should be relatively prime to the size of the table.  (This happens if the table size is prime.)
  • double hashing
    A collision resolution scheme which applies a second hash function to keys which collide, to determine a probing distance.
  • The use of double hashing will reduce the average number of probes required to find a record.
  • Linear probing and double hashing are both referred to as open addressing collision handling methods.

Open Chaining

  • Open chaining forms a linked list, or chain, of synonyms.
  • The overflow records can be kept in the same file as the hash table itself:
  • The overflow records can be kept in a separate file:

Scatter Tables

  • If all records are moved into a separate "overflow" area, with only links being left in the hash table, the result is a scatter table.
  • A scatter table scatter table is smaller than an index for the same data.

Patterns of Record Access

  • Loading items in decreasing order of retrieval frequncy will improve the average access time.
  • The idea is that records accessed more frequently can be accessed more quickly.
  • This will lower the average retrieval time.

1 comment: