File Structure: 2012

Thursday, 29 November 2012

EXTENDIBLE HASHING

How Extendible Hashing Works

Tries

trie

A search tree in which the child of each node is determined by subsequent charaters of the key.
An alphabetic (radix 26) trie potentially has one child node for each letter of the alphabet.
A decimal (radix 10) trie has up to 10 children for each note.
The trie can be shortened by the use of buckets.
The bucket distribution can be balanced by the use of hashing.
Key Hash

5554321 100111001

5550123 10111010

5541234 100111100

5551234 1011110

3217654 100111101

1237654 10011011

5557654 101110011

1234567 1101001

Key	Hash
5554321	100111001
5550123	10111010
5541234	100111100
5551234	1011110
3217654	100111101
1237654	10011011
5557654	101110011
1234567	1101001

Turning the Tries into a Directory

extendible hashing

An application of hashing that works well with files that over time undergo substantial changes in size.

Splitting to Handle Overflow

splitting

Creation of a new node when a node overflows, with the partial distribution of the contents of the overflowing node to the new node.

Linear Hashing

Term

linear hashing

An application of hashing in which the address space is extended by one bucket each time an overflow occurs.

HASHING

Introduction

Key driven file access should be O(1) - that is, the time to access a record should be a constant which does not vary with the size of the dataset.
Indexing can be regarded as a table driven function which translates a key to a numeric location.
Hashing can be regarded as a computation driven function which translates a key to a numeric location.
hashing

The transformation of a search key into a number by means of mathematical calculations.
randomize

To transform in an apparently random way.
Hashing uses a repeatable pseudorandom function.
The hashing function should produce a uniform distribution of hash values.
uniform distribution

A randomization in which each value in a range has an equal probability.
For each key, the result of the hashing function is used as the home address of the record.
home address

The address produced by the hashing of a record key.
Under ideal conditions, hashing provides O(1) key driven file access.

Hashing Algorithms

Modulus

Modulus - the key is divided by the size of the table, and the remainder is used as the hash function.
Example:
Key = 123-45-6789123456789 % 11 = 5 h(123-45-6789) = 5
Modulus functions work better when the divisor is a prime number, or at least not a composite of small numbers.

Fold and Add

fold and add

A hashing techique which separates a long key field into smaller parts which are added together.
Example:
Key = 123-45-6789 123 456 +789 1368
h(123-45-6789) = 1368
Fold and add is used for long keys.

Mid-square

mid-square

hashing method which squares the key and the uses the middle digits of the result.
Example:
Key = 123-45-6789123456789² = 15241578750190521 h(123-45-6789) = 8750

Combined methods

Practical hashing functions often combine techniques.
Example:
Key = 123-45-6789 123 456 +789 1368 1368 % 11 = 4
h(123-45-6789) = 4
For non-numeric keys, the key is simply treated as though it were a number, using its internal binary representation.
Example:
Key = "Kemp"
"Kemp" = 4B656D70₁₆ = 1264938352₁₀

Hashing Functions and Record Distributions

The size of the key space is typically much larger than the space of hashed values.
This means that more than one key will map to the same hash value.

Collisions

synonyms

Keys which hash to the same value.
collision

An attempt to store a record at an address which does not have sufficient room
packing density

The ratio of used space to allocated space.
For simple hashing, the probability of a synonym is the same as the packing density.

How Much Extra Memory Should be Used?

Increasing memory (i.e., increasing the size of the hash table) will decrease collisions.

Collision Resolution by Progressive Overflow

progressive overflow

A collision resolution technique which places overflow records at the first empty address after the home address
With progressive overflow, a sequential search is performed beginning at the home address.
The search is continued until the desired key or a blank record is found.
Progressive overflow is also referred to as linear probing.

Storing more than One Record per Address: Buckets

bucket

An area of a hash table with a single hash address which has room for more than one record.
When using buckets, an entire bucket is read or written as a unit. (Records are not read individually.)
The use of buckets will reduce the average number of probes required to find a record.

Making Deletions

tombstone

A marker placed on a deleted record.
Using a tombstone avoids terminating a probe prematurely.
Locations containing a tombstone can be be used for records being added.

Other Collision Resolution Techniques

Double Hashing

Double hashing is similar to progressive overflow.
The second hash value is used as a stepping distance for the probing.
The second hash value should never be one. (Add one.)
The second hash value should be relatively prime to the size of the table. (This happens if the table size is prime.)
double hashing

A collision resolution scheme which applies a second hash function to keys which collide, to determine a probing distance.
The use of double hashing will reduce the average number of probes required to find a record.
Linear probing and double hashing are both referred to as open addressing collision handling methods.

Open Chaining

Open chaining forms a linked list, or chain, of synonyms.
The overflow records can be kept in the same file as the hash table itself:
The overflow records can be kept in a separate file:

Scatter Tables

If all records are moved into a separate "overflow" area, with only links being left in the hash table, the result is a scatter table.
A scatter table scatter table is smaller than an index for the same data.

Patterns of Record Access

Loading items in decreasing order of retrieval frequncy will improve the average access time.
The idea is that records accessed more frequently can be accessed more quickly.
This will lower the average retrieval time.

MULTILEVEL INDEXING AND B-TREES

Indexing with Binary Search Trees

T E R M S

binary tree: A tree in which each node has at most two children.
leaf: A node at the lowest level of a tree.
height-balanced tree: A tree in which the difference between the heights of subtrees in limited.

Binary trees grow from the top down: new nodes are added as new leaves.
Binary trees become unbalanced as new nodes are added.
The imbalance of a binary tree depends on the order in which nodes are added and deleted.
Worst case search with a balanced binary tree is log₂ (N + 1) compares.
Worst case search with an unbalanced binary tree is >N compares.

-Tree Methods: Search, Insert, and Others

Searching

Insertion

Create, Open, and Close

Example of Creating a B-Tree

COSEQUENCIAL PROCESSING AND THE SORTING OF LARGE FILES

An Object Oriented Model for Implementing Cosequential Processes

cosequential operations

Operations which involve accessing two or more input files sequentially and in parallel, resulting in one or more output files produced by the combination of the input data.

Considerations for Cosequential Algorithms

Initialization - What has to be set up for the main loop to work correctly.
Getting the next item on each list - This should be simple and easy, from the main algorithm.
Synchronization - Progress of access in the lists should be coordinated.
Handling End-Of-File conditions - For a match, processing can stop when the end of any list is reached.
Recognizing Errors - Items out of sequence can "break" the synchronization.

¶ 8.1.2 Matching Names in Two Lists

T E R M S

match: The process of forming a list containing all items common to two or more lists.

Cosequential Match Algorithm

Initialize (open the input and output files.)
Get the first item from each list.
While there is more to do:
- Compare the current items from each list.
- If the items are equal,
  - Process the item.
  - Get the next item from each list.
  - Set more to true iff none of this lists is at end of file.
- If the item from list A is less than the item from list B,
  - Get the next item from list A.
  - Set more to true iff list A is not at end-of-file.
- If the item from list A is more than the item from list B,
  - Get the next item from list B.
  - Set more to true iff list B is not at end-of-file.
Finalize (close the files.)

Cosequential Match Code

void Match (char * InputName1, 
            char * InputName2,
            char * OutputName) {
  /* Local Declarations                      */
  OrderedFile Input1;
  OrderedFile Input2;
  OrderedFile Output;
  int Item1;
  int Item2;
  int more;

  /* Initialization                          */
  cout << "Data Matched:" << endl;
  Input1.open (InputName1, ios::in);
  Input2.open (InputName2, ios::in);
  Output.open (OutputName, ios::out);
  
  /* Algorithm                               */
  Input1 >> Item1;
  Input2 >> Item2;
  more = Input1.good() && Input2.good();
  while (more) {
    cout << Item1 << ':' << Item2 << " => " << flush;  /* DEMO only */
    if (Item1 < Item2) {
      Input1 >> Item1;
      cout << '\n';                           /* DEMO only */
    } else if (Item1 > Item2) {
      Input2 >> Item2;
      cout << '\n';                           /* DEMO only */
    } else {
      Output << Item1 << endl;
      cout << Item1 << endl;                  /* DEMO only */
      Input1 >> Item1;
      Input2 >> Item2;
    }
    more = Input1.good() && Input2.good();
  }

  /* Finalization                            */
  Input1.close ();
  Input2.close ();
  Output.close ();
}

Heaps

Logical Structue of a Heap

A heap is a version of binary tree.
A heap is ordered along each branch, from the root to any of the leaves.
A heap is not ordered in the left-right method of a binary search tree.
A heap is not ordered horizontally, within the tree levels.
A heap is ordered vertically, from the leaves to the root.
A heap is a complete binary tree.
All levels of a heap are filled, except for the lowest.
The lowest level is filled from left to right.
The root of a heap is always the lowest (or highest) item in the heap.

Physical Structue of a Heap

The nodes of a heap are stored physically in an array.
The root of the tree is in the first element of the array.
The children of the node at location n are at locations 2n and 2n + 1.
The parent of the node at location n is at location n/2.

Building a Heap

A heap is built by inserting elements at the end of the heap, as a new leaf, and then "sifting up," sorting the branch from the new leaf to the root.
Example:
Insert R: R
Insert L: R L
R

L
Sift L: L R
L

R
Insert C: L R C
L

R
C
Sift C: C R L
C

R
L
Insert A: C R L A
C

R
L

A
Sift A: A C L R
A

C

L

R
Insert H: A C L R H
A

C

L

R
H
Sift H: A C L R H
A

C

L

R
H
Insert V: A C L R H V
A

C

L

R
H
V
Sift V: A C L R H V
A

C

L

R
H
V
Insert E: A C L R H V E
A

C

L

R
H
V
E
Sift E: A C E R H V L
A

C

E

R
H
V
L

INDEXING

What is an Index?

index

A structure containing a set of entries, each consisting of a key field and a reference field, which is used to locate records in a data file.

key field

The part of an index which contains keys.

reference field

The part of an index which contains information to locate records.

An index imposes order on a file without rearranging the file.
Indexing works by indirection.

A Simple Index for Entry-Sequenced Files

simple index

An index in which the entries are a key ordered linear list.

Simple indexing can be useful when the entire index can be held in memory.
Changes (additions and deletions) require both the index and the data file to be changed.
Updates affect the index if the key field is changed, or if the record is moved.
An update which moves a record can be handled as a deletion followed by an addition.

Object Oriented Support for Indexed, Entry Sequenced Files

T E R M S

entry-sequenced file: A file in which the record order is determined by the order in which they are entered.

The physical order of records in the file may not be the same as order of entry, because of record deletions and space reuse.
The index should be read into memory when the data file is opened.

Indexes That are too Large to Hold in Memory

Searching of a simple index on disk takes too much time.
Maintaining a simple index on disk in sorted order takes too much time.
Tree structured indexes such as B-trees are a scalable alternative to simple indexes.
Hashed organization is an alternative to indexing when only a primary index is needed.

Indexing to Provide Access by Multiple Keys

secondary key

A search key other than the primary key.

secondary index

An index built on a secondary key.

Secondary indexes can be built on any field of the data file, or on combinations of fields.
Secondary indexes will typically have multiple locations for a single key.
Changes to the data may now affect multiple indexes.
The reference field of a secondary index can be a direct reference to the location of the entry in the data file.
The reference field of a secondary index can also be an indirect reference to the location of the entry in the data file, through the primary key.
Indirect secondary key references simplify updating of the file set.
Indirect secondary key references increase access time.

Retrieval Using Combinations of Secondary Keys

The search for records by multiple keys can be done on multiple index, with the combination of index entries defining the records matching the key combination.
If two keys are to be combined, a list of entries from each key index is retrieved.
For an "or" combination of keys, the lists are merged.
I.e., any entry found in either list matches the search.
For an "and" combination of keys, the lists are matched.
I.e., only entries found in both lists match the search.

Improving the Secondary Index Structure: Inverted Lists

inverted list

An index in which the reference field is the head pointer of a linked list of reference items.

Selective Indexes

selective index

An index which contains keys for only part of the records in a data file.

Binding

binding

The association of a symbol with a value.

locality

A condition in which items accessed temporally close are also physically close.

Wednesday, 28 November 2012

ORGANIZING FILES FOR THE PERFORMANCE

Introduction

Compression can reduce the size of a file, improving performance.
File maintenance can produce fragmentation inside of the file. There are ways to reuse this space.
There are better ways than sequential search to find a particular record in a file.
Keysorting is a way to sort medium size files.
We have already considered how important it is for the file system designer to consider how a file is to be accessed when deciding how to create fields, records, and other file structures. In this chapter, we continue to focus on file organization, but the motivation is different. We look at ways to organize or reorganize files in order to improve performance.
In the first section, we look at how to organize files to make them smaller. Compression techniques make file smaller by encoding them to remove redundant or unnecessary information.

Data Compression

data compression

The encoding of data in such a way as to reduce its size.

redundancy reduction

Any form of compression which removes only redundant information.

In this section, we look at ways to make files smaller, using data compression. As with many programming techniques, there are advantages and disadvantages to data compression. In general, the compression must be reversed before the information is used. For this tradeoff,
- Smaller files use less storage space.
- The transfer time of disk access is reduced.
- The transmission time to transfer files over a network is reduced.
But,
- Program complexity and size are increased.
- Computation time is increased.
- Data portability may be reduced.
- With some compression methods, information is unrecoverably lost.
- Direct access may become prohibitably expensive.
- Data compression is possible because most data contains redundant (repeated) or unnecessary information.
Data compression is possible because most data contains redundant (repeated) or unnecessary information. Reversible compression removes only redundant information, making it possible to restore the data to its original form. Irreversible compression goes further, removing information which is not actually necessary, making it impossible to recover the original form of the data.
Next we look at ways to reclaim unused space in files to improve performance. Compaction is a batch process that we can use to purge holes of unused space from a file that has undergone many deletions and updates. Then we investigate dynamic ways to maintain performance by reclaiming space made available by deletions and updates of records during the life of the file.

Compact Notation

The replacement of field values with an ordinal number which index an enumeration of possible field values.

Compact notation can be used for fields which have an effectively fixed range of values.
Compact notation can be used for fields which have an effectively fixed range of values. The State field of the Person record, as used earler, is an example of such a field. There are 676 (26 x 26) possible two letter abbreviations, but there are only 50 states. By assigning an ordinal number to each state, and storing the code as a one byte binary number, the field size is reduced by 50 percent.
No information has been lost in the process. The compression can be completely reversed, replacing the numeric code with the two letter abbreviation when the file is read. Compact notation is an example of redundancy reduction.
On the other hand, programs which access the compressed data will need additional code to compress and expand the data. An array can used as a translation table to convert between the numeric codes and the letter abbreviations. The translation table can be coded within the program, using literal constants, or stored in a file which is read into the array by the program.
Since a file using compact notation contains binary data, it cannot be viewed with a text editor, or typed to the screen. The use of delimited records is prohibitively expensive, since the delimiter will occur in the compacted field.

Run Length Encoding

run-length encoding

An encoding scheme which replaces runs of a single symbol with the symbol and a repetition factor.

Run-length encoding is useful only when the text contains long runs of a single value.
Run-length encoding is useful for images which contain solid color areas.
Run-length encoding may be useful for text which contains strings of blanks.

Example:

uncompressed text (hexadecimal format):   
   40 40 40 40 40 40 43 43 41 41 41 41 41 42 
compressed text (hexadecimal format):   
   FE 06 40 43 43 FE 05 41 42

where FE is the compression escape code, followed by a length byte, and the byte to be repeated.

Variable Length Codes

variable length encoding

An encoding scheme in which the codes for differenct symbols may be of different length.

huffman code

A variable length code, in which each code is determined by the occurence frequency of the corresponding symbol.

prefix code

A variable length code in which the length of a code element can be determined from the first bits of the code element.

The optimal Huffman code can be different for each source text.
Huffman encoding takes two passes through the source text: one to build the code, and a second to encode the text by applying the code.
The code must be stored with the compressed text.
Huffman codes are based on the frequency of occurrence of characters in the text being encoded.
The characters with the highest occurence frequency are assigned the shortest codes, minimizing the average encoded length.
Huffman code is a prefix code.

Example:

Uncompressed Text: 
  abdeacfaag   (80 bits)
Frequencies:   a  4           e   1
               b  1           f   1
               c  1           g   1
               d  1
code:   a  1           e   0001
        b  010         f   0010
        c  011         g   0011
        d  0000
Compressed Text (binary):
  10100000000110110010110011    (26 bits)
Compressed Text (hexadecimal):

A0 1B 96 60

Irreversible Compression Techniques

irreversible compression

Any form of compression which reduces information.

reversible compression

Compression with no alteration of original information upon reconstruction.

Irreversible compression goes beyond redundancy reduction, removing information which is not actually necessary, making it impossible to recover the original form of the data.
Irreversible compression is useful for reducing the size of graphic images.
Irreversible compression is used to reduce the bandwidth of audio for digital recording and telecommunications.
JPEG image files use an irreversible compression based on cosine transforms.
The amount of information removed by JPEG compression is controllable.
The more information removed, the smaller the file.
For photographic images, a significant amount of information can be removed without noticably affecting the image.
For line graphic images, the JPEG compression may introduce aliasing noise.
GIF images files irreversibly compress images which contain more than 256 colors.
The GIF format only allows 256 colors.
The compression of GIF formatting is reversible for images which have fewer than 256 colors, and lossy for images which have more than 256 colors.
Recommendation:
- Use JPEG for photographic images.
- Use GIF for line drawings.

Reclaiming Space in Files

Record Deletion and Storage Compaction

external fragmentation

Fragmentation in which the unused space is outside of the allocated areas.

compaction

The removal of fragmentation from a file by moving records so that they are all physically adjacent.

As files are maintained, records are added, updated, and deleted.
The problem: as records are deleted from a file, they are replaced by unused spaces within the file.
The updating of variable length records can also produce fragmentation.
Compaction is a relatively slow process, especially for large files, and is not routinely done when individual records are deleted.

Deleting Fixed-Length Records for Reclaiming Space Dynamically

linked list

A container consisting of a series of nodes, each containing data and a reference to the location of the logically next node.

avail list

A list of the unused spaces in a file.

stack

A last-in first-out container, which is accessed only at one end.

Record 1 Record 2 Record 3 Record 4 Record 5
Deleted records must be marked so that the spaces will not be read as data.
One way of doing this is to put a special character, such as an asterisk, in the first byte of the deleted record space.
Record 1 Record 2 * Record 4 Record 5
If the space left by deleted records could be reused when records are added, fragmentation would be reduced.
To reuse the empty space, there must be a mechanism for finding it quickly.
One way of managing the empty space within a file is to organize as a linked list, known as the avail list.
The location of the first space on the avail list, the head pointer of the linked list, is placed in the header record of the file.
Each empty space contains the location of the next space on the avail list, except for the last space on the list.
The last space contains a number which is not valid as a file location, such as -1.
Header Slot 1 Slot 2 Slot 3 Slot 4 Slot 5 Slot 6

3 * -1 Record 2 * 1 Record 4 Record 5 Record 6
If the file uses fixed length records, the spaces are interchangable; any unused space can be used for any new record.
The simplest way of managing the avail list is as a stack.
As each record is deleted, the old list head pointer is moved from the header record to the deleted record space, and the location of the deleted record space is placed in the header record as the new avail list head pointer, pushing the new space onto the stack.
Header Slot 1 Slot 2 Slot 3 Slot 4 Slot 5 Slot 6

5 * -1 Record 2 * 1 Record 4 * 3 Record 6
When a record is added, it is placed in the space which is at the head of the avail list.
The push process is reversed; the empty space is popped from the stack by moving the pointer in the first space to the header record as the new avail list head pointer.
Header Slot 1 Slot 2 Slot 3 Slot 4 Slot 5 Slot 6

3 * -1 Record 2 * 1 Record 4 Record 5 Record 6
With fixed length records, the relative record numbers (RRNs) can be used as location pointers in the avail list.

Header	Slot 1	Slot 2	Slot 3	Slot 4	Slot 5	Slot 6
3	* -1	Record 2	* 1	Record 4	Record 5	Record 6

Header	Slot 1	Slot 2	Slot 3	Slot 4	Slot 5	Slot 6
5	* -1	Record 2	* 1	Record 4	* 3	Record 6

Header	Slot 1	Slot 2	Slot 3	Slot 4	Slot 5	Slot 6
3	* -1	Record 2	* 1	Record 4	Record 5	Record 6

Deleting Variable-Length Records

If the file uses variable length records, the spaces not are interchangable; a new record will not fit just any unused space.
With variable length records, the byte offset of each record can be used as location pointers in the avail list.
The size of each deleted record space should also be placed in

Storage Fragmentation

coalescence

The combination of two (or more) physically adjacent unused spaces into a single unused space.

internal fragmentation

Fragmentation in which the unused space is within the allocated areas.

Placement Strategies

placement strategy: A policy for determining the location of a new record in a file.
first fit: A placement strategy which selects the first space on the free list which is large enough.
best fit: A placement strategy which selects the smallest space from the free list which is large enough.
worst fit: A placement strategy which selects the largest space from the free list (if it is large enough.)

First Fit
Header Slot @50 Slot @120 Slot @200 Slot @300 Slot @370 Slot @430

370 * -1 70 Record * 50 100 Record * 200 60 Record
The simplest placement strategy is first fit.
With first fit, the spaces on the avail list are scanned in their logical order on the avail list.
The first space on the list which is big enough for a new record to be added is the one used.
The used space is delinked from the avail list, or, if the new record leaves unused space, the new (smaller) space replaces the olf space.
Adding a 70 byte record, only the first two entries on the list are checked:

Header Slot @50 Slot @120 Slot @200 Slot @230 Slot @300 Slot @370 Slot @430

370 * -1 70 Record * 50 30 Record Record * 200 60 Record
As records are deleted, the space can be added to the head of the list, as when the list is managed as a stack.
Best Fit
The best fit strategy leaves the smallest space left over when the new record is added.
There are two possible algorithms:
1. Manage deletions by adding the new record space to the head of the list, and scan the entire list for record additions.
2. Manage the avail list as a sorted list; the first fit on the list will then be the best fit.
Best Fit, Sorted List:

Header Slot @50 Slot @120 Slot @200 Slot @300 Slot @370 Slot @430

370 * 200 70 Record * -1 100 Record * 50 60 Record

Adding a 65 byte record, only the first two entries on the list are checked:

Header Slot @50 Slot @120 Slot @200 Slot @300 Slot @370 Slot @430

370 Record Record * -1 100 Record * 200 60 Record
Best Fit, Unsorted List:

Header Slot @50 Slot @120 Slot @200 Slot @300 Slot @370 Slot @430

200 * 370 70 Record * 50 100 Record * -1 60 Record

Adding a 65 byte record, all three entries on the list are checked:

Header Slot @50 Slot @120 Slot @200 Slot @300 Slot @370 Slot @430

200 Record Record * 370 100 Record * -1 60 Record
The 65 byte record has been stored in a 70 byte space; rather than create a 5 byte external fragment, which would be useless, the 5 byte excess has become internal fragmentation within tbe record.
Worst Fit
The worst fit strategy leaves the largest space left over when the new record is added.
The rational is that the leftover space is most likely to be usable for another new record addition.
There are two possible algorithms:
1. Manage deletions by adding the new record space to the head of the list, and scan the entire list for record additions.
2. Manage the avail list as a reverse sorted list; the first fit on the list, which will be the first entry, will then be the worst fit.
Worst Fit, Sorted List:

Header Slot @50 Slot @120 Slot @200 Slot @300 Slot @370 Slot @430

200 * 370 70 Record * 50 100 Record * -1 60 Record

Adding a 65 byte record, only the first entry on the list are checked:

Header Slot @50 Slot @120 Slot @200 Slot @235 Slot @300 Slot @370 Slot @430

50 * 370 70 Record * -1 35 Record Record * 200 60 Record
Worst Fit, Unsorted List:

Header Slot @50 Slot @120 Slot @200 Slot @300 Slot @370 Slot @430

200 * -1 70 Record * 370 100 Record * 50 60 Record

Adding a 65 byte record, all three entries on the list are checked:

Header Slot @50 Slot @120 Slot @200 Slot @235 Slot @300 Slot @370 Slot @430

200 * -1 70 Record * 370 35 Record Record * 50 60 Record
Commonalities
Regardless of placement strategy, when the record does not match the slot size (as is usually the case), there are two possible actions:
1. Create a new empty slot with the extra space, creating external fragmentation.
2. Embed the extra space in the record, creating internal fragmentation.
Finding Things Quickly: An Introduction to Internal Sorting and Binary Searching

Search by Guessing: Binary Search

binary search

A search which can be applied to an ordered linear list to progressively divide the possible scope of a search in half until the search object is found.
Example: Search for 'M' in the list:
A B C D E F G H I J K L M N O P Q R S
Compare 'M' to the middle li in the list:
A B C D E F G H I J K L M N O P Q R S
'M' > 'J': Narrow the search to the last half. (Eliminate the first half.)
A B C D E F G H I J K L M N O P Q R S
Compare 'M' to the middle li in the remainder of the list:
A B C D E F G H I J K L M N O P Q R S
'M' < 'O': Narrow the search to the first half of the remainder. (Eliminate the last half.)
A B C D E F G H I J K L M N O P Q R S
Compare 'M' to the middle li in the remainder of the list:
A B C D E F G H I J K L M N O P Q R S
'M' > 'L': Narrow the search to the last half. (Eliminate the first half.)
A B C D E F G H I J K L M N O P Q R S
Compare 'M' to the middle li in the remainder of the list:
A B C D E F G H I J K L M N O P Q R S
'M' == 'M': The search is over.

Header	Slot @50	Slot @120	Slot @200	Slot @300	Slot @370	Slot @430
370	* -1 70	Record	* 50 100	Record	* 200 60	Record

Header	Slot @50	Slot @120	Slot @200	Slot @230	Slot @300	Slot @370	Slot @430
370	* -1 70	Record	* 50 30	Record	Record	* 200 60	Record

Header	Slot @50	Slot @120	Slot @200	Slot @300	Slot @370	Slot @430
370	* 200 70	Record	* -1 100	Record	* 50 60	Record

Header	Slot @50	Slot @120	Slot @200	Slot @300	Slot @370	Slot @430
370	Record	Record	* -1 100	Record	* 200 60	Record

Header	Slot @50	Slot @120	Slot @200	Slot @300	Slot @370	Slot @430
200	* 370 70	Record	* 50 100	Record	* -1 60	Record

Header	Slot @50	Slot @120	Slot @200	Slot @300	Slot @370	Slot @430
200	Record	Record	* 370 100	Record	* -1 60	Record

Header	Slot @50	Slot @120	Slot @200	Slot @300	Slot @370	Slot @430
200	* 370 70	Record	* 50 100	Record	* -1 60	Record

Header	Slot @50	Slot @120	Slot @200	Slot @235	Slot @300	Slot @370	Slot @430
50	* 370 70	Record	* -1 35	Record	Record	* 200 60	Record

Header	Slot @50	Slot @120	Slot @200	Slot @300	Slot @370	Slot @430
200	* -1 70	Record	* 370 100	Record	* 50 60	Record

Header	Slot @50	Slot @120	Slot @200	Slot @235	Slot @300	Slot @370	Slot @430
200	* -1 70	Record	* 370 35	Record	Record	* 50 60	Record

Binary Search versus Sequential Search

A binary search of n items requires log₂ n + 1 comparisons at most.
A binary search of n items requires log₂ n + 1/2 comparisons on average.
Binary searching is O(log₂ n).
Sequential search of n items requires n comparisons at most.
A successful sequential search of n items requires n / 2 comparisons on average.
A unsuccessful sequential search of n items requires n comparisons.
Sequential searching is O(n).
Binary searching is only possible on ordered lists.

Sorting a Disk File in Memory

internal sort

A sort performed entirely in main memory.

The algorithms used for internal assume fast random access, and are not suitable for sorting files directly.
A small file which can be entirely read into memory can be sorted with an internal sort, and then written back to a file.

The limitations of Binary Searching and Internal Sorting

Binary searching requires more than one or two accesses.
More than one or two accesses is too many.
Keeping a file sorted is very expensive.
An internal sort works only on small files.

Keysorting

keysort

A sort performed by first sorting keys, and then moving records.

Description of the Method

Read each record sequentially into memory, one by one

George    Washington 1789 1797 None
John      Adams      1797 1801 Fed
Thomas    Jefferson  1801 1809 DR
James     Madison    1809 1817 DR
James     Monroe     1817 1825 DR
John    Q Adams      1825 1829 DR
Andrew    Jackson    1829 1837 Dem
Martin    Van Buren  1837 1841 Dem
William   Harrison   1841 1841 Whig

Save the key of the record, and the location of the record, in an array (KEYNODES).

Washington George    1
Adams      John      2
Jefferson  Thomas    3
Madison    James     4
Monroe     James     5
Adams      John    Q 6
Jackson    Andrew    7
Van Buren  Martin    8
Harrison   William   9

After all records have been read, internally sort the KEYNODES array of record keys and locations.

Adams      John      2
Adams      John    Q 6
Harrison   William   9
Jackson    Andrew    7
Jefferson  Thomas    3
Madison    James     4
Monroe     James     5
Van Buren  Martin    8
Washington George    1

Using the KEYNODES array, read each record back into memory a second time using direct access.
Write each record sequentially into a sorted file.

John      Adams      1797 1801 Fed
John    Q Adams      1825 1829 DR
William   Harrison   1841 1841 Whig
Andrew    Jackson    1829 1837 Dem
Thomas    Jefferson  1801 1809 DR
James     Madison    1809 1817 DR
James     Monroe     1817 1825 DR
Martin    Van Buren  1837 1841 Dem
George    Washington 1789 1797 None

Limitations of the Keysort Method

Keysort is only possible when the KEYNODES array is small enough to be held in memory.
Each record must be read twice: once sequentially and once directly.
The direct reads each require a seek.
If the original file and the output file are on the same physical drive, there will also be a seek for each write.
Keysorting is a way to sort medium size files.

FILE ACCESS

File organization is static.
File access is dynamic.

Sequential Access

Terms

sequential access

Access of data in order.

Accessing data from a file whose records are organized on the basis of their successive physical positions.

Remarks

Sequential access processes a file from its beginning.
All operating systems support sequential access of files.
Sequential access is the fastest way to read or write all of the records in a file.
Sequential access is slow when reading a singlt random record, since all the preceeding records must be read.

Direct Access

Terms

direct access

: Access of data in arbitrary order, with variable access time.; Accessing data from a file by record position with the file, without accessing intervening records.

relative record number

: An ordinal number indicating the position of a record within a file.

Remarks

Direct access processes single records at a time by position in the file.
Mainfreme and midrange operating systems support direct access of files.
The Windows, DOS, UNIX, and Linux operating systems do not natively support direct access of files.
When using Windows, DOS, UNIX, and Linux operating systems applications must be programmed to use direct access.
Direct access is slower than sequential when reading or writing all of the records in a file.
Direct access is fast when reading a singlt random record, since the preceeding records are ignored.
Direct access allows individual records to be read from different locations in the file without reading intervening records.
When files are organized with fixed length records, the location of a record in a file can be calculated from its relative record number, and the file can be accessed using the seek functions.

        
   ByteOffset = (RRN - 1) × RecLen

When files have variable length records supported by an index, the records can be accessed directly through the index, with the use of the seek function.
For direct access to be useful, the relative record number of the record of interest must be known.
Direct access is often used to support keyed access.

Keyed Access

Terms

keyed access

: Accessing data from a file by an alphanumeric key associated with each record.

key

: A value which is contained within or associated with a record and which can be used to identify the record.

Remarks

Keyed access processes single records at a time by record key.
Mainfreme and midrange operating systems support keyed access of files.
The Windows, DOS, UNIX, and Linux operating systems do not natively support keyed access of files.
When using Windows, DOS, UNIX, and Linux operating systems applications must be programmed to use keyed access.
Keyed access will be covered in more detail in later chapters.

FUNDAMENTAL FILE STRUCTURE CONCEPTS

General

persistent

Retained after execution of the program which created it.

When we build file structures, we are making it possible to make data persistent. That is, one program can store data from memory to a file, and terminate. Later, another program can retrieve the data from the file, and process it in memory.
In this chapter, we look at file structures which can be used to organize the data within the file, and at the algorithms which can be used to store and retrieve the data sequentially.

Field and Record Organization

Data Representation in Memory

record

A subdivision of a file, containing data related to a single entity.

field

A subdivision of a record containing a single attribute of the entity which the record describes.

stream of bytes

A file which is regarded as being without structure beyond separation

into a sequential set of bytes.

Within a program, data is temporarily stored in variables.
Individual values can be aggregated into structures, which can be treated as a single variable with parts.
In C++, classes are typically used as as an aggregate structure.

C++ Person class (version 0.1):

class Person {
  public:
    char FirstName [11];
    char LastName[11];
    char Address [21];
    char City [21];
    char State [3];
    char ZIP [5];
};

With this class declaration, variables can be declared to be of type Person. The individual fields within a Person can be referred to as the name of the variable and the name of the field, separated by a period (.).

C++ Program:

#include 

class Person {
  public:
    char FirstName [11];
    char LastName[11];
    char Address [31];
    char City [21];
    char State [3];
    char ZIP [5];
};

void Display (Person);

int main () {
  Person Clerk;
  Person Customer;
  
  strcpy (Clerk.FirstName, "Fred");
  strcpy (Clerk.LastName, "Flintstone");
  strcpy (Clerk.Address, "4444 Granite Place");
  strcpy (Clerk.City, "Rockville");
  strcpy (Clerk.State, "MD");
  strcpy (Clerk.ZIP, "00001");
  
  strcpy (Customer.FirstName, "Lily");
  strcpy (Customer.LastName, "Munster");
  strcpy (Customer.Address, "1313 Mockingbird Lane");
  strcpy (Customer.City, "Hollywood");
  strcpy (Customer.State, "CA");
  strcpy (Customer.ZIP, "90210");
  
  Display (Clerk);
  Display (Customer);
}

void Display (Person Someone) {
  cout << Someone.FirstName << Someone.LastName
       << Someone.Address << Someone.City
       << Someone.State << Someone.ZIP;
}

In memory, each Person will appear as an aggregate, with the individual values being parts of the aggregate:

Person

Clerk

FirstName LastName Address City State ZIP

Fred Flintstone 4444 Granite Place Rockville MD 0001
The output of this program will be:
FredFlintstone4444 Granite PlaceRockvilleMD00001LilyMunster1313 Mockingbird LaneHollywoodCA90210
Obviously, this output could be improved. It is marginally readable by people, and it would be difficult to program a computer to read and correctly interpret this output.

Person
Clerk
FirstName	LastName	Address	City	State	ZIP
Fred	Flintstone	4444 Granite Place	Rockville	MD	0001

Delineation of Records in a File

Fixed Length Records

A record which is predetermined to be the same length as the other records in the file.

Record 1 Record 2 Record 3 Record 4 Record 5
The file is divided into records of equal size.
All records within a file have the same size.
Different files can have different length records.
Programs which access the file must know the record length.
Offset, or position, of the nth record of a file can be calculated.
There is no external overhead for record separation.
There may be internal fragmentation (unused space within records.)
There will be no external fragmentation (unused space outside of records) except for deleted records.
Individual records can always be updated in place.
Algorithms for Accessing Fixed Length Records
Code for Accessing Fixed Length Records
Code for Accessing String Records

Example (80 byte records):

   0  66 69 72 73 74 20 6C 69 6E 65  0  0  1  0  0  0  first line......
  10   0  0  0  0  0  0  0  0 FF FF FF FF  0  0  0  0  ................
  20  68 FB 12  0 DC E0 40  0 3C BA 42  0 78 FB 12  0  h.....@.<.B.x...
  30  CD E3 40  0 3C BA 42  0  8 BB 42  0 E4 FB 12  0  ..@.<.B...B.....
  40  3C 18 41  0 C4 FB 12  0  2  0  0  0 FC 3A 7C  0  <.A..........:|.
  50  73 65 63 6F 6E 64 20 6C 69 6E 65  0  1  0  0  0  second line.....
  60   0  0  0  0  0  0  0  0 FF FF FF FF  0  0  0  0  ................
  70  68 FB 12  0 DC E0 40  0 3C BA 42  0 78 FB 12  0  h.....@.<.B.x...
  80  CD E3 40  0 3C BA 42  0  8 BB 42  0 E4 FB 12  0  ..@.<.B...B.....
  90  3C 18 41  0 C4 FB 12  0  2  0  0  0 FC 3A 7C  0  <.A..........:|.

Advantage: the offset of each record can be calculated from its record number. This makes direct access possible.
Advantage: there is no space overhead.
Disadvantage: there will probably be internal fragmentation (unusable space within records.)

Delimited Variable Length Records

variable length record

A record which can differ in length from the other records of the file.

delimited record

A variable length record which is terminated by a special character or sequence of characters.

delimiter

A special character or group of characters stored after a field or record, which indicates the end of the preceding unit.

Record 1 # Record 2 # Record 3 # Record 4 # Record 5 #
The records within a file are followed by a delimiting byte or series of bytes.
The delimiter cannot occur within the records.
Records within a file can have different sizes.
Different files can have different length records.
Programs which access the file must know the delimiter.
Offset, or position, of the nth record of a file cannot be calculated.
There is external overhead for record separation equal to the size of the delimiter per record.
There should be no internal fragmentation (unused space within records.)
There may be no external fragmentation (unused space outside of records) after file updating.
Individual records cannot always be updated in place.
Algorithms for Accessing Delimited Variable Length Records
Code for Accessing Delimited Variable Length Records
Code for Accessing Variable Length Line Records

Example (Delimiter = ASCII 30 (IE) = RS character:

   0  66 69 72 73 74 20 6C 69 6E 65 1E 73 65 63 6F 6E  first line.secon
  10  64 20 6C 69 6E 65 1E                             d line.

Example (Delimiter = '\n'):

   0  46 69 72 73 74 20 28 31 73 74 29 20 4C 69 6E 65  First (1st) Line
  10   D  A 53 65 63 6F 6E 64 20 28 32 6E 64 29 20 6C  ..Second (2nd) l
  20  69 6E 65  D  A                                   ine..

Disadvantage: the offset of each record cannot be calculated from its record number. This makes direct access impossible.
Advantage: there is space overhead for the length prefix.
Advantage: there will probably be no internal fragmentation (unusable space within records.)

Delineation of Fields in a Record

Fixed Length Fields

Field 1 Field 2 Field 3 Field 4 Field 5
Each record is divided into fields of correspondingly equal size.
Different fields within a record have different sizes.
Different records can have different length fields.
Programs which access the record must know the field lengths.
There is no external overhead for field separation.
There may be internal fragmentation (unused space within fields.)

Delimited Variable Length Fields

Field 1 ! Field 2 ! Field 3 ! Field 4 ! Field 5 !
The fields within a record are followed by a delimiting byte or series of bytes.
Fields within a record can have different sizes.
Different records can have different length fields.
Programs which access the record must know the delimiter.
The delimiter cannot occur within the data.
If used with delimited records, the field delimiter must be different from the record delimiter.
There is external overhead for field separation equal to the size of the delimiter per field.
There should be no internal fragmentation (unused

Length Prefixed Variable Length Fields

12 Field 1 4 Field 2 10 Field 3 8 Field 4 7 Field 5
The fields within a record are prefixed by a length byte or bytes.
Fields within a record can have different sizes.
Different records can have different length fields.
Programs which access the record must know the size and format of the length prefix.
There is external overhead for field separation equal to the size of the length prefix per field.
There should be no internal fragmentation (unused space within fields.)

MAGNETIC DISKS

magnetic disk

: A disk read and written by electromagnetic means

hard disk

: A magnetic disk with a rigid substrate

floppy disk

: A magnetic disk with a flexible substrate

fixed disk

: A disk drive with non-removable media.

cylinder

: The set of tracks of a disk drive which can be accessed without changing the position of the access arm.

track

: The (circular) area on a disk platter which can be accessed without moving the access arm of the drive.

sector

: A fixed size physical data block on a disk drive.

seek

: To move to a specified location in a file.

block

: A physical data record, separated on the medium from other blocks by inter-block gaps.

interblock gap

: An area between data blocks which contains no data and which separates the blocks.

access time

: The total time required to store or retrieve data.

seek time

: The time required for the head of a disk drive to be positioned to a designated cylinder.

rotational delay

: The time required for a designated sector to rotate to the head of a disk drive.

transfer time

The time required to transfer the data from a sector, once the transfer has begun.

Magnetic Technology

Data is stored on a magnetic disk by controlling the direction in which small areas of the disk surface are magnetized.
Data is stored on a disk serially, that is, one bit at a time.
Data is recorded on magnetic disks as a series of magnitized areas on the surface of the disk:
The data is read by a head with a coil which is sensitive to changes in magnetism on the data track as the disk rotates.
The heads are attached to a common shuttle, which moves them in and out, to different redial positions, together:
The data is blocked into physical sectors, which are arranged in many concentric circles called tracks. For hard drives, the sectors are all the same physical size, and the number of sectors varies with the circumference of the track
For floppy disks, there are the same number of sectors in each track, and the physical size of a sectors varies with the circumference of the track:
Track and cylinder locations are determined by the physical geometry of the drive.
Track and cylinder numbers begin with 0.
Tracks are often referred to as heads.
The sectors on each track are numbered from 1 up:
The block size of magnetic disks is almost always 512 bytes.:
The blocks, or sectors, are separated by interblock gaps.
Fixed disks are always hard disks.
Removable disks are usually floppy disks.
Accessing a sector on the drive requires three stems:
1. Seeking: Moving the head to the right cylinder.
2. Rotation: WAaiting for the right sector to reach the head.
3. Transfer: WAaiting for the sector to pass under the head, reading or writing the data.
Seek time is affected by the size of the drive, the number of cylinders in the drive, and the mechanical responsiveness of the access arm.
Average seek time is approximately the time to move across 1/3 of the cylinders.
Rotational delay is also referred to as latency.
Rotational delay is inversely proportional to the rotational speed of the drive.
Average rotational delay is the time for the disk to rotate 180°.
Actual transfer time may be limited by the disk interface.
Transfer is inversely proportional to the rotational speed of the drive.
Transfer time is inversely proportional to the physical length of a sector.
Transfer time is roughly inversely proportional to the number of sectors per track.
Data is always read or written in complete blocks.

SECONDARY STORAGE AND SYSTEM SOFTWARE

Disks

Data is stored on a magnetic disk by controlling the direction in which small areas of the disk surface are magnetized.
Data is stored on a disk serially, that is, one bit at a time.

Disk Drive Physical Structure
Data is recorded on each surface of the disk in sectors, which are located in concentric circles.
The sectors are separated by gaps which contain no data.
On PC disk drives, each sector contains 512 bytes of data.

Track and cylinder locations are determined by the physical geometry of the drive.
Track and cylinder numbers begin with 0.
Tracks are often referred to as heads.

Organizing Tracks by Sector

Sector locations are determined by the electronics of the drive, and are identified by recorded address marks.
Sector numbers begin with 1.
Today, logically adjacent sectors are typically physically adjacent.

Estimating Capacities and Space Needs

bytes/track = sectors/track * bytes/sector
bytes/cylinder = tracks/cylinder * bytes/track
bytes/drive = cylinders/drive * bytes/cylinder
bytes/drive = cylinders/drive * tracks/cylinder * sectors/track * bytes/sector

File allocation

cluster

A group of sectors handled as a unit of file allocation.

extent

A physical section of a file occupying adjacent clusters.

fragmentation

unused space with in a file.



The Cost of a Disk Access

Access time of a disk is related to physical movement of the disk parts.
Disk access time has three components: seek time, rotational delay, and transfer time.
Seek time is affected by the size of the drive, the number of cylinders in the drive, and the mechanical responsiveness of the access arm.
Average seek time is approximately the time to move across 1/3 of the cylinders.
Rotational delay is also referred to as latency.
Rotational delay is inversely proportional to the rotational speed of the drive.
Average rotational delay is the time for the disk to rotate 180°.
Transfer is inversely proportional to the rotational speed of the drive.
Transfer time is inversely proportional to the physical length of a sector.
Transfer time is roughly inversely proportional to the number of sectors per track.
Actual transfer time may be limited by the disk interface.

Effect of Block Size

Fragmentation waste increases as cluster size increases.
Average access time decreases as cluster size increases.

Several techniques have been developed to improve disk access time.
Striping allows disk transfers to be made in parallel.
There are 6 versions, or levels, of RAID technology.
RAID-0 uses striping.
RAID-0 improves access time, but does not provide redundancy.
RAID-1 uses mirroring, in which two drives are written with the same data.
RAID-1 provides complete redundancy. If one drive fails, the other provides data backup.
RAID-1 improves read access time, but slows write access time.
RAM disks appear to programs as fast disk drives.
RAM disks are volatile.
Solid state disks appear to computer systems as fast disk drives.
Solid state disks are used on high performance data base systems.
Caching improves average access time.
Disk caching can occur at three levels: in the computer main memory, in the disk controller, and in the disk drive.
Windows operating systems use main memory caching.
Disk controller caching requires special hardware.
Most disk drives now contain caching memory.
With caching, writes are typically reported as complete when the data is in the cache.
The physical write is delayed until later.
With caching, reads typically read more data than is requested, storing the unrequested data in the cache.
If a read can be satisfied from data already in the cache, no additional physical read is needed.
Read caching works on average because of program locality.

File System Organization

File Allocation Table

A table on a disk volume containing chained lists of the physical locations of all files on the volume.

index node

A data structure, associated with a file, which describes the file.

Storage as a Hierarchy

Journey of a Byte

Introduction to CD-ROM

A single disc can hold more than 600 megabytes of data (~ 400 books of the textbook’s size)

CD-ROM is read only. i.e., it is a publishing medium rather than a data storage and retrieval like magnetic disks.

CD-ROM Strengths: High storage capacity, inexpensive price, durability.

CD-ROM Weaknesses: extremely slow seek performance (between 1/2 a second to a second) ==> Intelligent File Structures are critical.

Physical Organization of CD-ROM
- CD-ROM is a descendent of CD Audios. i.e., listening to music is sequential and does not require fast random access to data.
- Reading Pits and Lands: CD-ROMs are stamped from a glass master disk which has a coating that is changed by the laser beam. When the coating is developed, the areas hit by the laser beam turn into pits along the track followed by the beam. The smooth unchanged areas between the pits are called
CD-ROM Strengths & Weaknesses
- Seek Performance: very bad
Data Transfer Rate: Not Terrible/Not Great
Storage Capacity: Great
- Benefit: enables us to build indexes and other support structures that can help overcome some of the limitations associated with CD-ROM’s poor performance.
Read-Only Access: There can’t be any changes ==> File organization can be optimized.
No need for interaction with the user (which requires a quick response)
etc..

Thursday, 29 November 2012

How Extendible Hashing Works

Tries

Turning the Tries into a Directory

Splitting to Handle Overflow

Linear Hashing

Term

Introduction

Hashing Algorithms

Modulus

Fold and Add

Mid-square

Combined methods

Hashing Functions and Record Distributions

Collisions

How Much Extra Memory Should be Used?

Collision Resolution by Progressive Overflow

Storing more than One Record per Address: Buckets

Making Deletions

Other Collision Resolution Techniques

Double Hashing

Open Chaining

Scatter Tables

Patterns of Record Access

Indexing with Binary Search Trees

-Tree Methods: Search, Insert, and Others

Example of Creating a B-Tree

An Object Oriented Model for Implementing Cosequential Processes

¶ 8.1.2 Matching Names in Two Lists

Cosequential Match Algorithm

Cosequential Match Code

Heaps

Logical Structue of a Heap

Physical Structue of a Heap

Building a Heap

What is an Index?

A Simple Index for Entry-Sequenced Files

Object Oriented Support for Indexed, Entry Sequenced Files

Indexes That are too Large to Hold in Memory

Indexing to Provide Access by Multiple Keys

Retrieval Using Combinations of Secondary Keys

Improving the Secondary Index Structure: Inverted Lists

Selective Indexes

Binding

Wednesday, 28 November 2012

Introduction

Data Compression

Compact Notation

Run Length Encoding

Variable Length Codes

variable length encoding

Irreversible Compression Techniques

Reclaiming Space in Files

Record Deletion and Storage Compaction

Deleting Fixed-Length Records for Reclaiming Space Dynamically

Deleting Variable-Length Records

Storage Fragmentation

Placement Strategies

First Fit

Best Fit

Worst Fit

Commonalities

Finding Things Quickly: An Introduction to Internal Sorting and Binary Searching

Search by Guessing: Binary Search

Binary Search versus Sequential Search

Sorting a Disk File in Memory

The limitations of Binary Searching and Internal Sorting

Keysorting

Description of the Method

Limitations of the Keysort Method

FILE ACCESS

Sequential Access

Terms

sequential access

Remarks

Direct Access

Terms

direct access

relative record number

Remarks