HASHING
Introduction
- Key driven file access should be O(1) - that is, the
time to access a record should be a constant which does not vary
with the size of the dataset.
- Indexing can be regarded as a table driven function which
translates a key to a numeric location.
- Hashing can be regarded as a computation driven function which
translates a key to a numeric location.
-
- hashing
- The transformation of a search key into a number by means of
mathematical calculations.
-
- randomize
- To transform in an apparently random way.
- Hashing uses a repeatable pseudorandom function.
- The hashing function should produce a uniform distribution of
hash values.
-
- uniform distribution
- A randomization in which each value in a range has an equal
probability.
- For each key, the result of the hashing function is used as the
home address of the record.
-
- home address
- The address produced by the hashing of a record key.
- Under ideal conditions, hashing provides O(1) key driven
file access.
Hashing Algorithms
Modulus
- Modulus - the key is divided by the size of the table, and the
remainder is used as the hash function.
- Example:
Key = 123-45-6789
123456789 % 11 = 5
h(123-45-6789) = 5
- Modulus functions work better when the divisor is a prime
number, or at least not a composite of small numbers.
Fold and
Add
-
- fold and add
- A hashing techique which separates a long key field into
smaller parts which are added together.
- Example:
Key = 123-45-6789
123
456
+789
1368
h(123-45-6789) = 1368
- Fold and add is used for long keys.
Mid-square
-
- mid-square
- hashing method which squares the key and the uses the middle
digits of the result.
- Example:
Key = 123-45-6789
1234567892 = 15241578750190521
h(123-45-6789) = 8750
Combined
methods
- Practical hashing functions often combine techniques.
- Example:
Key = 123-45-6789
123
456
+789
1368
1368 % 11 = 4
h(123-45-6789) = 4
- For non-numeric keys, the key is simply treated as though it
were a number, using its internal binary representation.
- Example:
Key = "Kemp"
"Kemp" = 4B656D7016 = 126493835210
Hashing Functions and Record Distributions
- The size of the key space is typically much larger than the
space of hashed values.
- This means that more than one key will map to the same hash
value.
Collisions
-
- synonyms
- Keys which hash to the same value.
-
- collision
- An attempt to store a record at an address which does not have
sufficient room
-
- packing density
- The ratio of used space to allocated space.
- For simple hashing, the probability of a synonym is the same as
the packing density.
How Much Extra Memory Should be Used?
- Increasing memory (i.e., increasing the size of the hash table)
will decrease collisions.
Collision Resolution by Progressive Overflow
-
- progressive overflow
- A collision resolution technique which places overflow records
at the first empty address after the home address
- With progressive overflow, a sequential search is performed
beginning at the home address.
- The search is continued until the desired key or a blank record
is found.
- Progressive overflow is also referred to as linear
probing.
Storing more than One Record per Address:
Buckets
-
- bucket
- An area of a hash table with a single hash address which has
room for more than one record.
- When using buckets, an entire bucket is read or written as a
unit. (Records are not read individually.)
- The use of buckets will reduce the average number of probes
required to find a record.
Making Deletions
-
- tombstone
- A marker placed on a deleted record.
- Using a tombstone avoids terminating a probe prematurely.
- Locations containing a tombstone can be be used for records
being added.
Other Collision Resolution Techniques
Double
Hashing
- Double hashing is similar to progressive overflow.
- The second hash value is used as a stepping distance for the
probing.
- The second hash value should never be one. (Add
one.)
- The second hash value should be relatively prime to the size of
the table. (This happens if the table size is prime.)
-
- double hashing
- A collision resolution scheme which applies a second hash
function to keys which collide, to determine a probing
distance.
- The use of double hashing will reduce the average number of
probes required to find a record.
- Linear probing and double hashing are both referred to as
open addressing collision handling methods.
Open
Chaining
- Open chaining forms a linked list, or chain, of synonyms.
- The overflow records can be kept in the same file as the hash
table itself:
- The overflow records can be kept in a separate file:
Scatter
Tables
- If all records are moved into a separate "overflow" area, with
only links being left in the hash table, the result is a scatter
table.
- A scatter table scatter table is smaller than an index for the
same data.
Patterns of Record Access
- Loading items in decreasing order of retrieval frequncy will
improve the average access time.
- The idea is that records accessed more frequently can be
accessed more quickly.
- This will lower the average retrieval time.
The images have a problem
ReplyDelete