Thursday, 29 November 2012

EXTENDIBLE HASHING

How Extendible Hashing Works

Tries

  • trie
    A search tree in which the child of each node is determined by subsequent characters of the key.
  • An alphabetic (radix 26) trie potentially has one child node for each letter of the alphabet.
  • A decimal (radix 10) trie has up to 10 children for each node.
  • The trie can be shortened by the use of buckets.
  • The bucket distribution can be balanced by the use of hashing.
  • Key      Hash
    5554321  100111001
    5550123  10111010
    5541234  100111100
    5551234  1011110
    3217654  100111101
    1237654  10011011
    5557654  101110011
    1234567  1101001

Turning the Tries into a Directory

  • extendible hashing
    An application of hashing that works well with files that over time undergo substantial changes in size.
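  • Below is a minimal C++ sketch of the directory idea: the high-order bits of each key's hash index a directory whose cells point to buckets, and splitting may double the directory. The types and names here are illustrative, not from the textbook's code.

    #include <cstdint>
    #include <vector>

    struct Bucket {                       // holds a handful of records
      int count;                          // (record storage omitted)
    };

    struct Directory {
      int depth;                          // number of hash bits in use
      std::vector<Bucket*> cells;         // 2^depth entries

      Bucket * Find (uint32_t hash) const {
        // The high-order 'depth' bits of the hash index the directory;
        // several directory cells may share one bucket.
        int index = (depth == 0) ? 0 : (int)(hash >> (32 - depth));
        return cells[index];
      }

      void Double () {                    // grow the directory for a split
        std::vector<Bucket*> bigger (cells.size() * 2);
        for (size_t i = 0; i < bigger.size(); i++)
          bigger[i] = cells[i / 2];       // each old cell covers two new ones
        cells.swap (bigger);
        depth++;
      }
    };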

Splitting to Handle Overflow

  • splitting
    Creation of a new node when a node overflows, with the partial distribution of the contents of the overflowing node to the new node.

Linear Hashing

Term

  • linear hashing
    An application of hashing in which the address space is extended by one bucket each time an overflow occurs.

HASHING


Introduction

  • Key driven file access should be O(1) - that is, the time to access a record should be a constant which does not vary with the size of the dataset.
  • Indexing can be regarded as a table driven function which translates a key to a numeric location.
  • Hashing can be regarded as a computation driven function which translates a key to a numeric location.
  • hashing
    The transformation of a search key into a number by means of mathematical calculations.
  • randomize
    To transform in an apparently random way.
  • Hashing uses a repeatable pseudorandom function.
  • The hashing function should produce a uniform distribution of hash values.
  • uniform distribution
    A randomization in which each value in a range has an equal probability.
  • For each key, the result of the hashing function is used as the home address of the record.
  • home address
    The address produced by the hashing of a record key.
  • Under ideal conditions, hashing provides O(1) key driven file access.

Hashing Algorithms

Modulus

  • Modulus - the key is divided by the size of the table, and the remainder is used as the hash value.
  • Example:
    Key = 123-45-6789
    123456789 % 11 = 5
    h(123-45-6789) = 5
  • Modulus functions work better when the divisor is a prime number, or at least not a composite of small numbers.

Fold and Add

  • fold and add
    A hashing technique which separates a long key field into smaller parts which are added together.
  • Example:
    Key = 123-45-6789
     123
     456
    +789
    1368

    h(123-45-6789) = 1368
  • Fold and add is used for long keys.

Mid-square

  • mid-square
    A hashing method which squares the key and then uses the middle digits of the result.
  • Example:
    Key = 123-45-6789
    123456789² = 15241578750190521
    h(123-45-6789) = 8750

Combined methods

  • Practical hashing functions often combine techniques.
  • Example:
    Key = 123-45-6789
     123
     456
    +789
    1368

    1368 % 11 = 4

    h(123-45-6789) = 4
  • For non-numeric keys, the key is simply treated as though it were a number, using its internal binary representation.
  • Example:
    Key = "Kemp"
    "Kemp" = 4B656D7016 = 126493835210

Hashing Functions and Record Distributions

  • The size of the key space is typically much larger than the space of hashed values.
  • This means that more than one key will map to the same hash value.

Collisions

  • synonyms
    Keys which hash to the same value.
  • collision
    An attempt to store a record at an address which does not have sufficient room.
  • packing density
    The ratio of used space to allocated space.
  • For simple hashing, the probability of a synonym is the same as the packing density.

How Much Extra Memory Should be Used?

  • Increasing memory (i.e., increasing the size of the hash table) will decrease collisions.

Collision Resolution by Progressive Overflow

  • progressive overflow
    A collision resolution technique which places overflow records at the first empty address after the home address.
  • With progressive overflow, a sequential search is performed beginning at the home address.
  • The search is continued until the desired key or a blank record is found.
  • Progressive overflow is also referred to as linear probing.
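  • A minimal C++ sketch of the search loop, assuming a table of string slots in which an empty string marks a blank record.

    #include <string>
    #include <vector>

    // Returns the slot holding key, or -1 if a blank record ends the search.
    int Find (const std::vector<std::string> & table,
              const std::string & key, int homeAddress) {
      int slot = homeAddress;
      do {
        if (table[slot] == key) return slot;      // found the desired key
        if (table[slot].empty()) return -1;       // blank record: not present
        slot = (slot + 1) % (int)table.size();    // wrap around past the end
      } while (slot != homeAddress);              // stop if the table is full
      return -1;
    }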

Storing more than One Record per Address: Buckets

  • bucket
    An area of a hash table with a single hash address which has room for more than one record.
  • When using buckets, an entire bucket is read or written as a unit.  (Records are not read individually.)
  • The use of buckets will reduce the average number of probes required to find a record.

Making Deletions

  • tombstone
    A marker placed on a deleted record.
  • Using a tombstone avoids terminating a probe prematurely.
  • Locations containing a tombstone can be used for records being added.

Other Collision Resolution Techniques

Double Hashing

  • Double hashing is similar to progressive overflow.
  • The second hash value is used as a stepping distance for the probing.
  • The second hash value should never be zero.  (Add one.)
  • The second hash value should be relatively prime to the size of the table.  (This happens if the table size is prime.)
  • double hashing
    A collision resolution scheme which applies a second hash function to keys which collide, to determine a probing distance.
  • The use of double hashing will reduce the average number of probes required to find a record.
  • Linear probing and double hashing are both referred to as open addressing collision handling methods.
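  • A sketch of one common way to form the probing distance; the formula is illustrative. Adding one keeps the step from being zero, and a prime table size keeps the step relatively prime to the table size.

    int ProbeStep (long key, int tableSize) {
      return (int)(key % (tableSize - 1)) + 1;    // step is never zero
    }
    // Successive probes: home, home + step, home + 2*step, ... (mod tableSize)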

Open Chaining

  • Open chaining forms a linked list, or chain, of synonyms.
  • The overflow records can be kept in the same file as the hash table itself.
  • The overflow records can be kept in a separate file.

Scatter Tables

  • If all records are moved into a separate "overflow" area, with only links being left in the hash table, the result is a scatter table.
  • A scatter table is smaller than an index for the same data.

Patterns of Record Access

  • Loading items in decreasing order of retrieval frequency will improve the average access time.
  • The idea is that records accessed more frequently can be accessed more quickly.
  • This will lower the average retrieval time.

MULTILEVEL INDEXING AND B-TREES

 

Indexing with Binary Search Trees

T E R M S
binary tree
A tree in which each node has at most two children.
leaf
A node at the lowest level of a tree.
height-balanced tree
A tree in which the difference between the heights of subtrees is limited.


  • Binary trees grow from the top down: new nodes are added as new leaves.
  • Binary trees become unbalanced as new nodes are added.
  • The imbalance of a binary tree depends on the order in which nodes are added and deleted.
  • Worst case search with a balanced binary tree is log₂(N + 1) compares.
  • Worst case search with an unbalanced binary tree is N compares.

B-Tree Methods: Search, Insert, and Others

  • Searching
  • Insertion
  • Create, Open, and Close
 
 

Example of Creating a B-Tree

 

COSEQUENTIAL PROCESSING AND THE SORTING OF LARGE FILES

An Object Oriented Model for Implementing Cosequential Processes

cosequential operations

Operations which involve accessing two or more input files sequentially and in parallel, resulting in one or more output files produced by the combination of the input data.

 
 Considerations for Cosequential Algorithms
 
  • Initialization - What has to be set up for the main loop to work correctly.
  • Getting the next item on each list - This should be simple to do from within the main algorithm.
  • Synchronization - Progress of access in the lists should be coordinated.
  • Handling End-Of-File conditions - For a match, processing can stop when the end of any list is reached.
  • Recognizing Errors - Items out of sequence can "break" the synchronization.

Matching Names in Two Lists

 
T E R M S
match
The process of forming a list containing all items common to two or more lists.

   Cosequential Match Algorithm

 
  • Initialize (open the input and output files.)
  • Get the first item from each list.
  • While there is more to do:
    • Compare the current items from each list.
    • If the items are equal,
      • Process the item.
      • Get the next item from each list.
      • Set more to true iff none of the lists is at end-of-file.
    • If the item from list A is less than the item from list B,
      • Get the next item from list A.
      • Set more to true iff list A is not at end-of-file.
    • If the item from list A is more than the item from list B,
      • Get the next item from list B.
      • Set more to true iff list B is not at end-of-file.
  • Finalize (close the files.)


Cosequential Match Code

 
  • #include <fstream>
    #include <iostream>
    using namespace std;

    void Match (const char * InputName1,
                const char * InputName2,
                const char * OutputName) {
      /* Local Declarations                      */
      ifstream Input1;
      ifstream Input2;
      ofstream Output;
      int Item1;
      int Item2;
      int more;

      /* Initialization                          */
      cout << "Data Matched:" << endl;
      Input1.open (InputName1, ios::in);
      Input2.open (InputName2, ios::in);
      Output.open (OutputName, ios::out);

      /* Algorithm                               */
      Input1 >> Item1;
      Input2 >> Item2;
      more = Input1.good() && Input2.good();
      while (more) {
        cout << Item1 << ':' << Item2 << " => " << flush;  /* DEMO only */
        if (Item1 < Item2) {
          Input1 >> Item1;
          cout << '\n';                           /* DEMO only */
        } else if (Item1 > Item2) {
          Input2 >> Item2;
          cout << '\n';                           /* DEMO only */
        } else {
          Output << Item1 << endl;
          cout << Item1 << endl;                  /* DEMO only */
          Input1 >> Item1;
          Input2 >> Item2;
        }
        more = Input1.good() && Input2.good();
      }

      /* Finalization                            */
      Input1.close ();
      Input2.close ();
      Output.close ();
    }
     
     
     

    Heaps

     

    Logical Structure of a Heap

     
  • A heap is a form of binary tree.
  • A heap is ordered along each branch, from the root to any of the leaves.
  • A heap is not ordered in the left-right method of a binary search tree.
  • A heap is not ordered horizontally, within the tree levels.
  • A heap is ordered vertically, from the leaves to the root.
  • A heap is a complete binary tree.
  • All levels of a heap are filled, except for the lowest.
  • The lowest level is filled from left to right.
  • The root of a heap is always the lowest (or highest) item in the heap.

 

Physical Structure of a Heap

 
  • The nodes of a heap are stored physically in an array.
  • The root of the tree is in the first element of the array.
  • The children of the node at location n are at locations 2n and 2n + 1.
  • The parent of the node at location n is at location n/2.
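  • A minimal C++ sketch of this layout, using index 1 as the root so that the children of node n fall at 2n and 2n + 1; the sift-up step matches the building procedure below.

    #include <utility>
    #include <vector>

    struct MinHeap {
      std::vector<char> a;                 // a[0] unused; root lives at a[1]

      MinHeap () : a (1) {}

      void Insert (char item) {
        a.push_back (item);                // new item becomes the last leaf
        size_t n = a.size() - 1;
        while (n > 1 && a[n] < a[n / 2]) { // sift up: parent of n is n / 2
          std::swap (a[n], a[n / 2]);
          n /= 2;
        }
      }
    };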


Building a Heap

 
  • A heap is built by inserting elements at the end of the heap, as a new leaf, and then "sifting up," sorting the branch from the new leaf to the root.
  • Example (inserting R, L, C, A, H, V, E into a min-heap):
  • Insert R: R
  • Insert L: R L
  • Sift L: L R
  • Insert C: L R C
  • Sift C: C R L
  • Insert A: C R L A
  • Sift A: A C L R
  • Insert H: A C L R H
  • Sift H: A C L R H (no change; H belongs below its parent C)
  • Insert V: A C L R H V
  • Sift V: A C L R H V (no change; V belongs below its parent L)
  • Insert E: A C L R H V E
  • Sift E: A C E R H V L

INDEXING

What is an Index?

index

A structure containing a set of entries, each consisting of a key field and a reference field, which is used to locate records in a data file.

key field

The part of an index which contains keys.

reference field

The part of an index which contains information to locate records.

  • An index imposes order on a file without rearranging the file.
  • Indexing works by indirection.

A Simple Index for Entry-Sequenced Files

simple index

An index in which the entries are a key ordered linear list.

  • Simple indexing can be useful when the entire index can be held in memory.
  • Changes (additions and deletions) require both the index and the data file to be changed.
  • Updates affect the index if the key field is changed, or if the record is moved.
  • An update which moves a record can be handled as a deletion followed by an addition.

Object Oriented Support for Indexed, Entry Sequenced Files

T E R M S
entry-sequenced file
A file in which the order of the records is determined by the order in which they are entered.

  • The physical order of records in the file may not be the same as order of entry, because of record deletions and space reuse.
  • The index should be read into memory when the data file is opened.

  Indexes That are too Large to Hold in Memory

  • Searching of a simple index on disk takes too much time.
  • Maintaining a simple index on disk in sorted order takes too much time.
  • Tree structured indexes such as B-trees are a scalable alternative to simple indexes.
  • Hashed organization is an alternative to indexing when only a primary index is needed.

 Indexing to Provide Access by Multiple Keys

secondary key

A search key other than the primary key.

secondary index

An index built on a secondary key.

  • Secondary indexes can be built on any field of the data file, or on combinations of fields.
  • Secondary indexes will typically have multiple locations for a single key.
  • Changes to the data may now affect multiple indexes.
  • The reference field of a secondary index can be a direct reference to the location of the entry in the data file.
  • The reference field of a secondary index can also be an indirect reference to the location of the entry in the data file, through the primary key.
  • Indirect secondary key references simplify updating of the file set.
  • Indirect secondary key references increase access time.

   Retrieval Using Combinations of Secondary Keys

  • The search for records by multiple keys can be done on multiple indexes, with the combination of index entries defining the records matching the key combination.
  • If two keys are to be combined, a list of entries from each key index is retrieved.
  • For an "or" combination of keys, the lists are merged.
  • I.e., any entry found in either list matches the search.
  • For an "and" combination of keys, the lists are matched.
  • I.e., only entries found in both lists match the search.
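  • A C++ sketch of both combinations over two sorted lists of record references, using the standard library's set operations; the function names are illustrative.

    #include <algorithm>
    #include <iterator>
    #include <vector>

    // "or": any entry found in either list matches the search.
    std::vector<long> MergeLists (const std::vector<long> & a,
                                  const std::vector<long> & b) {
      std::vector<long> out;
      std::set_union (a.begin(), a.end(), b.begin(), b.end(),
                      std::back_inserter (out));
      return out;
    }

    // "and": only entries found in both lists match the search.
    std::vector<long> MatchLists (const std::vector<long> & a,
                                  const std::vector<long> & b) {
      std::vector<long> out;
      std::set_intersection (a.begin(), a.end(), b.begin(), b.end(),
                             std::back_inserter (out));
      return out;
    }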

   Improving the Secondary Index Structure: Inverted Lists

inverted list

An index in which the reference field is the head pointer of a linked list of reference items.

Selective Indexes

selective index

An index which contains keys for only part of the records in a data file.

Binding

binding

The association of a symbol with a value.

locality

A condition in which items accessed temporally close are also physically close.

Wednesday, 28 November 2012

ORGANIZING FILES FOR PERFORMANCE

   

Introduction

 

  • Compression can reduce the size of a file, improving performance.
  • File maintenance can produce fragmentation inside of the file.  There are ways to reuse this space.
  • There are better ways than sequential search to find a particular record in a file.
  • Keysorting is a way to sort medium size files.
  • We have already considered how important it is for the file system designer to consider how a file is to be accessed when deciding how to create fields, records, and other file structures. In this chapter, we continue to focus on file organization, but the motivation is different. We look at ways to organize or reorganize files in order to improve performance.
  • In the first section, we look at how to organize files to make them smaller. Compression techniques make files smaller by encoding them to remove redundant or unnecessary information. 

Data Compression

data compression

The encoding of data in such a way as to reduce its size.

redundancy reduction

Any form of compression which removes only redundant information.

  • In this section, we look at ways to make files smaller, using data compression. As with many programming techniques, there are advantages and disadvantages to data compression. In general, the compression must be reversed before the information is used. For this tradeoff,
    • Smaller files use less storage space.
    • The transfer time of disk access is reduced.
    • The transmission time to transfer files over a network is reduced.
    But,
    • Program complexity and size are increased.
    • Computation time is increased.
    • Data portability may be reduced.
    • With some compression methods, information is unrecoverably lost.
    • Direct access may become prohibitively expensive.
  • Data compression is possible because most data contains redundant (repeated) or unnecessary information. Reversible compression removes only redundant information, making it possible to restore the data to its original form. Irreversible compression goes further, removing information which is not actually necessary, making it impossible to recover the original form of the data.
  • Next we look at ways to reclaim unused space in files to improve performance.  Compaction is a batch process that we can use to purge holes of unused space from a file that has undergone many deletions and updates.  Then we investigate dynamic ways to maintain performance by reclaiming space made available by deletions and updates of records during the life of the file. 
  

Compact Notation

compact notation

The replacement of field values with an ordinal number which indexes an enumeration of possible field values.

  • Compact notation can be used for fields which have an effectively fixed range of values. The State field of the Person record, as used earlier, is an example of such a field. There are 676 (26 x 26) possible two letter abbreviations, but there are only 50 states. By assigning an ordinal number to each state, and storing the code as a one byte binary number, the field size is reduced by 50 percent.
  • No information has been lost in the process. The compression can be completely reversed, replacing the numeric code with the two letter abbreviation when the file is read. Compact notation is an example of redundancy reduction.
  • On the other hand, programs which access the compressed data will need additional code to compress and expand the data. An array can be used as a translation table to convert between the numeric codes and the letter abbreviations, as in the sketch below. The translation table can be coded within the program, using literal constants, or stored in a file which is read into the array by the program.
  • Since a file using compact notation contains binary data, it cannot be viewed with a text editor, or typed to the screen. The use of delimited records is prohibitively expensive, since the delimiter may occur in the compacted field. 
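  • A C++ sketch of the translation table approach; only a few states are listed here, where a real table would hold all fifty.

    #include <cstring>

    const char * States[] = { "AL", "AK", "AZ", "AR", "CA" /* ... */ };
    const int NumStates = sizeof (States) / sizeof (States[0]);

    unsigned char Compress (const char * abbrev) {
      for (int i = 0; i < NumStates; i++)
        if (std::strcmp (States[i], abbrev) == 0)
          return (unsigned char) i;       // one byte instead of two letters
      return 255;                         // illustrative "unknown" code
    }

    const char * Expand (unsigned char code) {
      return States[code];                // fully reverses the compression
    }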


Run Length Encoding

 

run-length encoding

An encoding scheme which replaces runs of a single symbol with the symbol and a repetition factor.

  • Run-length encoding is useful only when the text contains long runs of a single value.
  • Run-length encoding is useful for images which contain solid color areas.
  • Run-length encoding may be useful for text which contains strings of blanks.
  • Example:
    uncompressed text (hexadecimal format):   
       40 40 40 40 40 40 43 43 41 41 41 41 41 42 
    compressed text (hexadecimal format):   
       FE 06 40 43 43 FE 05 41 42 
    
    where FE is the compression escape code, followed by a length byte, and the byte to be repeated. 
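  • A C++ sketch of the encoding shown above: runs of three or more bytes become escape, count, value, and shorter runs are copied through. Handling of a literal FE byte in the data is omitted here for brevity.

    #include <vector>

    const unsigned char Escape = 0xFE;

    std::vector<unsigned char> Encode (const std::vector<unsigned char> & in) {
      std::vector<unsigned char> out;
      for (size_t i = 0; i < in.size(); ) {
        size_t run = 1;                   // measure the run starting at i
        while (i + run < in.size() && in[i + run] == in[i] && run < 255)
          run++;
        if (run >= 3) {                   // long runs pay for the escape
          out.push_back (Escape);
          out.push_back ((unsigned char) run);
          out.push_back (in[i]);
        } else {
          for (size_t k = 0; k < run; k++)
            out.push_back (in[i]);
        }
        i += run;
      }
      return out;
    }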


Variable Length Codes

  variable length encoding

An encoding scheme in which the codes for different symbols may be of different lengths.

Huffman code

A variable length code, in which each code is determined by the occurrence frequency of the corresponding symbol.

prefix code

A variable length code in which the length of a code element can be determined from the first bits of the code element.

  • The optimal Huffman code can be different for each source text.
  • Huffman encoding takes two passes through the source text: one to build the code, and a second to encode the text by applying the code.
  • The code must be stored with the compressed text.
  • Huffman codes are based on the frequency of occurrence of characters in the text being encoded.
  • The characters with the highest occurrence frequency are assigned the shortest codes, minimizing the average encoded length.
  • Huffman code is a prefix code.
  • Example:
    Uncompressed Text: 
      abdeacfaag   (80 bits)
    Frequencies:   a  4           e   1
                   b  1           f   1
                   c  1           g   1
                   d  1
    code:   a  1           e   0001
            b  010         f   0010
            c  011         g   0011
            d  0000
    Compressed Text (binary):
      10100000000110110010110011    (26 bits)
    Compressed Text (hexadecimal):
    A0 1B 2C C0
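
  • A C++ sketch of building the code with a priority queue, merging the two least frequent subtrees until one tree remains; tie-breaking may produce codes that differ from the example while having the same total length. The frequencies follow the example above.

    #include <iostream>
    #include <queue>
    #include <string>
    #include <vector>

    struct Node {
      long freq;
      char symbol;                        // meaningful only for leaves
      Node *left, *right;
    };

    struct ByFreq {                       // lowest frequency comes out first
      bool operator() (const Node * a, const Node * b) const {
        return a->freq > b->freq;
      }
    };

    void PrintCodes (const Node * n, std::string prefix) {
      if (n->left == 0) { std::cout << n->symbol << ' ' << prefix << '\n'; return; }
      PrintCodes (n->left,  prefix + "0");
      PrintCodes (n->right, prefix + "1");
    }

    int main () {
      const char syms[]  = { 'a', 'b', 'c', 'd', 'e', 'f', 'g' };
      const long freqs[] = {  4,   1,   1,   1,   1,   1,   1  };
      std::priority_queue<Node*, std::vector<Node*>, ByFreq> q;
      for (int i = 0; i < 7; i++)
        q.push (new Node { freqs[i], syms[i], 0, 0 });
      while (q.size() > 1) {              // merge the two rarest subtrees
        Node * a = q.top(); q.pop();
        Node * b = q.top(); q.pop();
        q.push (new Node { a->freq + b->freq, 0, a, b });
      }
      PrintCodes (q.top(), "");           // 'a' receives the shortest code
    }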
     
     

    Irreversible Compression Techniques

    irreversible compression

    Any form of compression which reduces information.

    reversible compression

    Compression with no alteration of original information upon reconstruction.

  • Irreversible compression goes beyond redundancy reduction, removing information which is not actually necessary, making it impossible to recover the original form of the data.
  • Irreversible compression is useful for reducing the size of graphic images.
  • Irreversible compression is used to reduce the bandwidth of audio for digital recording and telecommunications.
  • JPEG image files use an irreversible compression based on cosine transforms.
  • The amount of information removed by JPEG compression is controllable.
  • The more information removed, the smaller the file.
  • For photographic images, a significant amount of information can be removed without noticeably affecting the image.
  • For line graphic images, the JPEG compression may introduce aliasing noise.
  • GIF image files irreversibly compress images which contain more than 256 colors.
  • The GIF format only allows 256 colors.
  • The compression of GIF formatting is reversible for images which have fewer than 256 colors, and lossy for images which have more than 256 colors.
  • Recommendation:
    • Use JPEG for photographic images.
    • Use GIF for line drawings.
 
 

  Reclaiming Space in Files

Record Deletion and Storage Compaction

external fragmentation

Fragmentation in which the unused space is outside of the allocated areas.

compaction

The removal of fragmentation from a file by moving records so that they are all physically adjacent.

  • As files are maintained, records are added, updated, and deleted.
  • The problem: as records are deleted from a file, they are replaced by unused spaces within the file.
  • The updating of variable length records can also produce fragmentation.
  • Compaction is a relatively slow process, especially for large files, and is not routinely done when individual records are deleted.

Deleting Fixed-Length Records for Reclaiming Space Dynamically

linked list

A container consisting of a series of nodes, each containing data and a reference to the location of the logically next node.

avail list

A list of the unused spaces in a file.

stack

A last-in first-out container, which is accessed only at one end.

 

  • Record 1 Record 2 Record 3 Record 4 Record 5
  • Deleted records must be marked so that the spaces will not be read as data.
  • One way of doing this is to put a special character, such as an asterisk, in the first byte of the deleted record space.
  • Record 1 Record 2 * Record 4 Record 5
  • If the space left by deleted records could be reused when records are added, fragmentation would be reduced.
  • To reuse the empty space, there must be a mechanism for finding it quickly.
  • One way of managing the empty space within a file is to organize it as a linked list, known as the avail list.
  • The location of the first space on the avail list, the head pointer of the linked list, is placed in the header record of the file.
  • Each empty space contains the location of the next space on the avail list, except for the last space on the list.
  • The last space contains a number which is not valid as a file location, such as -1.
  • Header  Slot 1  Slot 2    Slot 3  Slot 4    Slot 5    Slot 6
    3       * -1    Record 2  * 1     Record 4  Record 5  Record 6
  • If the file uses fixed length records, the spaces are interchangeable; any unused space can be used for any new record.
  • The simplest way of managing the avail list is as a stack.
  • As each record is deleted, the old list head pointer is moved from the header record to the deleted record space, and the location of the deleted record space is placed in the header record as the new avail list head pointer, pushing the new space onto the stack.
  • Header  Slot 1  Slot 2    Slot 3  Slot 4    Slot 5  Slot 6
    5       * -1    Record 2  * 1     Record 4  * 3     Record 6
  • When a record is added, it is placed in the space which is at the head of the avail list.
  • The push process is reversed; the empty space is popped from the stack by moving the pointer in the first space to the header record as the new avail list head pointer.
  • Header  Slot 1  Slot 2    Slot 3  Slot 4    Slot 5    Slot 6
    3       * -1    Record 2  * 1     Record 4  Record 5  Record 6
  • With fixed length records, the relative record numbers (RRNs) can be used as location pointers in the avail list.
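  • A C++ sketch of the push and pop just described, assuming fixed length records addressed by RRN and two hypothetical helpers that do the physical slot I/O.

    const int EndOfList = -1;

    void WriteAvailSlot (int rrn, int next);   // assumed: writes '*' and link
    int  ReadAvailNext  (int rrn);             // assumed: reads the link field

    // Deleting a record pushes its slot onto the avail list.
    void DeleteRecord (int rrn, int & headAvail) {
      WriteAvailSlot (rrn, headAvail);         // old head moves into the slot
      headAvail = rrn;                         // freed slot becomes the head
    }

    // Adding a record pops a slot, if one is available.
    int AllocateSlot (int & headAvail) {
      if (headAvail == EndOfList)
        return EndOfList;                      // no space: extend the file
      int rrn = headAvail;
      headAvail = ReadAvailNext (rrn);         // pop: link becomes the new head
      return rrn;
    }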

Deleting Variable-Length Records

 
  • If the file uses variable length records, the spaces are not interchangeable; a new record will not fit just any unused space.
  • With variable length records, the byte offset of each record can be used as the location pointer in the avail list.
  • The size of each deleted record space should also be placed in the space, as part of the avail list entry, so that a record addition can tell whether it will fit.


Storage Fragmentation

coalescence

The combination of two (or more) physically adjacent unused spaces into a single unused space.

internal fragmentation

Fragmentation in which the unused space is within the allocated areas.
 
 

Placement Strategies


placement strategy
A policy for determining the location of a new record in a file.
first fit
A placement strategy which selects the first space on the free list which is large enough.
best fit
A placement strategy which selects the smallest space from the free list which is large enough.
worst fit
A placement strategy which selects the largest space from the free list (if it is large enough.)

  • First Fit

  • Header  Slot @50  Slot @120  Slot @200  Slot @300  Slot @370  Slot @430
    370     * -1 70   Record     * 50 100   Record     * 200 60   Record
  • The simplest placement strategy is first fit.
  • With first fit, the spaces on the avail list are scanned in their logical order.
  • The first space on the list which is big enough for a new record to be added is the one used.
  • The used space is delinked from the avail list, or, if the new record leaves unused space, the new (smaller) space replaces the old space.
  • Adding a 70 byte record, only the first two entries on the list are checked:
    Header  Slot @50  Slot @120  Slot @200  Slot @230  Slot @300  Slot @370  Slot @430
    370     * -1 70   Record     * 50 30    Record     Record     * 200 60   Record
  • As records are deleted, the space can be added to the head of the list, as when the list is managed as a stack.
  • Best Fit

  • The best fit strategy leaves the smallest space left over when the new record is added.
  • There are two possible algorithms:
    1. Manage deletions by adding the new record space to the head of the list, and scan the entire list for record additions.
    2. Manage the avail list as a sorted list; the first fit on the list will then be the best fit.
  • Best Fit, Sorted List:
    Header  Slot @50  Slot @120  Slot @200  Slot @300  Slot @370  Slot @430
    370     * 200 70  Record     * -1 100   Record     * 50 60    Record
    Adding a 65 byte record, only the first two entries on the list are checked:
    Header  Slot @50  Slot @120  Slot @200  Slot @300  Slot @370  Slot @430
    370     Record    Record     * -1 100   Record     * 200 60   Record
  • Best Fit, Unsorted List:
    Header  Slot @50  Slot @120  Slot @200  Slot @300  Slot @370  Slot @430
    200     * 370 70  Record     * 50 100   Record     * -1 60    Record
    Adding a 65 byte record, all three entries on the list are checked:
    Header  Slot @50  Slot @120  Slot @200  Slot @300  Slot @370  Slot @430
    200     Record    Record     * 370 100  Record     * -1 60    Record
  • The 65 byte record has been stored in a 70 byte space; rather than create a 5 byte external fragment, which would be useless, the 5 byte excess has become internal fragmentation within the record.
  • Worst Fit

  • The worst fit strategy leaves the largest space left over when the new record is added.
  • The rationale is that the leftover space is most likely to be usable for another new record addition.
  • There are two possible algorithms:
    1. Manage deletions by adding the new record space to the head of the list, and scan the entire list for record additions.
    2. Manage the avail list as a reverse sorted list; the first fit on the list, which will be the first entry, will then be the worst fit.
  • Worst Fit, Sorted List:
    Header  Slot @50  Slot @120  Slot @200  Slot @300  Slot @370  Slot @430
    200     * 370 70  Record     * 50 100   Record     * -1 60    Record
    Adding a 65 byte record, only the first entry on the list is checked:
    Header  Slot @50  Slot @120  Slot @200  Slot @235  Slot @300  Slot @370  Slot @430
    50      * 370 70  Record     * -1 35    Record     Record     * 200 60   Record
  • Worst Fit, Unsorted List:
    Header  Slot @50  Slot @120  Slot @200  Slot @300  Slot @370  Slot @430
    200     * -1 70   Record     * 370 100  Record     * 50 60    Record
    Adding a 65 byte record, all three entries on the list are checked:
    Header  Slot @50  Slot @120  Slot @200  Slot @235  Slot @300  Slot @370  Slot @430
    200     * -1 70   Record     * 370 35   Record     Record     * 50 60    Record
  • Commonalities

  • Regardless of placement strategy, when the record does not match the slot size (as is usually the case), there are two possible actions:
    1. Create a new empty slot with the extra space, creating external fragmentation.
    2. Embed the extra space in the record, creating internal fragmentation.


    Finding Things Quickly: An Introduction to Internal Sorting and Binary Searching

     

    Search by Guessing: Binary Search

    binary search

    A search which can be applied to an ordered linear list to progressively divide the possible scope of a search in half until the search object is found.

  • Example: Search for 'M' in the list:
    A B C D E F G H I J K L M N O P Q R S
  • Compare 'M' to the middle item in the list, 'J'.
  • 'M' > 'J': Narrow the search to the last half. (Eliminate the first half.)
    K L M N O P Q R S
  • Compare 'M' to the middle item in the remainder of the list, 'O'.
  • 'M' < 'O': Narrow the search to the first half of the remainder. (Eliminate the last half.)
    K L M N
  • Compare 'M' to the middle item in the remainder of the list, 'L'.
  • 'M' > 'L': Narrow the search to the last half. (Eliminate the first half.)
    M N
  • Compare 'M' to the middle item in the remainder of the list, 'M'.
  • 'M' == 'M': The search is over.
 

Binary Search versus Sequential Search

 
  • A binary search of n items requires log₂ n + 1 comparisons at most.
  • A binary search of n items requires log₂ n + 1/2 comparisons on average.
  • Binary searching is O(log₂ n).
  • Sequential search of n items requires n comparisons at most.
  • A successful sequential search of n items requires n / 2 comparisons on average.
  • An unsuccessful sequential search of n items requires n comparisons.
  • Sequential searching is O(n).
  • Binary searching is only possible on ordered lists.
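  • A C++ implementation of the search described above, over a sorted array; it returns the position of the key, or -1.

    int BinarySearch (const char list[], int n, char key) {
      int low = 0, high = n - 1;
      while (low <= high) {
        int mid = (low + high) / 2;            // the middle item
        if (key == list[mid]) return mid;      // the search is over
        if (key >  list[mid]) low  = mid + 1;  // eliminate the first half
        else                  high = mid - 1;  // eliminate the last half
      }
      return -1;                               // key is not in the list
    }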


Sorting a Disk File in Memory

 

internal sort

A sort performed entirely in main memory.

  • The algorithms used for internal sorting assume fast random access, and are not suitable for sorting files directly.
  • A small file which can be entirely read into memory can be sorted with an internal sort, and then written back to a file.


The Limitations of Binary Searching and Internal Sorting

 
  • Binary searching requires more than one or two accesses.
  • More than one or two accesses is too many.
  • Keeping a file sorted is very expensive.
  • An internal sort works only on small files.


Keysorting

keysort

A sort performed by first sorting keys, and then moving records.

 Description of the Method  

  • Read each record sequentially into memory, one by one
  • George    Washington 1789 1797 None
    John      Adams      1797 1801 Fed
    Thomas    Jefferson  1801 1809 DR
    James     Madison    1809 1817 DR
    James     Monroe     1817 1825 DR
    John    Q Adams      1825 1829 DR
    Andrew    Jackson    1829 1837 Dem
    Martin    Van Buren  1837 1841 Dem
    William   Harrison   1841 1841 Whig
    
  • Save the key of the record, and the location of the record, in an array (KEYNODES).
  • Washington George    1
    Adams      John      2
    Jefferson  Thomas    3
    Madison    James     4
    Monroe     James     5
    Adams      John    Q 6
    Jackson    Andrew    7
    Van Buren  Martin    8
    Harrison   William   9
    
  • After all records have been read, internally sort the KEYNODES array of record keys and locations.
  • Adams      John      2
    Adams      John    Q 6
    Harrison   William   9
    Jackson    Andrew    7
    Jefferson  Thomas    3
    Madison    James     4
    Monroe     James     5
    Van Buren  Martin    8
    Washington George    1
    
  • Using the KEYNODES array, read each record back into memory a second time using direct access.
  • Write each record sequentially into a sorted file.
  • John      Adams      1797 1801 Fed
    John    Q Adams      1825 1829 DR
    William   Harrison   1841 1841 Whig
    Andrew    Jackson    1829 1837 Dem
    Thomas    Jefferson  1801 1809 DR
    James     Madison    1809 1817 DR
    James     Monroe     1817 1825 DR
    Martin    Van Buren  1837 1841 Dem
    George    Washington 1789 1797 None
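  • A C++ sketch of these steps, assuming fixed length records so that a record's byte offset can be computed from its position; the record length and the key's offset and width are illustrative.

    #include <algorithm>
    #include <fstream>
    #include <string>
    #include <vector>

    const int RecLen = 40;                     // illustrative record length

    struct KeyNode { std::string key; long rrn; };

    void KeySort (const char * inName, const char * outName) {
      std::ifstream in (inName, std::ios::binary);
      std::ofstream out (outName, std::ios::binary);
      std::vector<KeyNode> keynodes;
      char rec[RecLen];

      // Pass 1: read sequentially, saving each key and record location.
      for (long rrn = 0; in.read (rec, RecLen); rrn++)
        keynodes.push_back (KeyNode { std::string (rec + 10, 10), rrn });

      // Internally sort the keynodes, not the records themselves.
      std::sort (keynodes.begin(), keynodes.end(),
                 [] (const KeyNode & a, const KeyNode & b)
                    { return a.key < b.key; });

      // Pass 2: read each record back directly, writing sequentially.
      in.clear();                              // clear the end-of-file state
      for (size_t i = 0; i < keynodes.size(); i++) {
        in.seekg (keynodes[i].rrn * RecLen);   // one seek per record
        in.read (rec, RecLen);
        out.write (rec, RecLen);
      }
    }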
     
     
     

    Limitations of the Keysort Method

  • Keysort is only possible when the KEYNODES array is small enough to be held in memory.
  • Each record must be read twice: once sequentially and once directly.
  • The direct reads each require a seek.
  • If the original file and the output file are on the same physical drive, there will also be a seek for each write.
  • Keysorting is a way to sort medium size files.

FILE ACCESS

  • File organization is static.
  • File access is dynamic.

Sequential Access

Terms


sequential access


  • Access of data in order. 
  • Accessing data from a file whose records are organized on the basis of their successive physical positions.

Remarks

  • Sequential access processes a file from its beginning.
  • All operating systems support sequential access of files.
  • Sequential access is the fastest way to read or write all of the records in a file.
  • Sequential access is slow when reading a single random record, since all the preceding records must be read.

Direct Access

Terms


direct access


Access of data in arbitrary order, with variable access time.
Accessing data from a file by record position within the file, without accessing intervening records.

relative record number


An ordinal number indicating the position of a record within a file. 

Remarks

  • Direct access processes single records at a time by position in the file.
  • Mainframe and midrange operating systems support direct access of files.
  • The Windows, DOS, UNIX, and Linux operating systems do not natively support direct access of files.
  • When using Windows, DOS, UNIX, and Linux operating systems, applications must be programmed to use direct access.
  • Direct access is slower than sequential when reading or writing all of the records in a file.
  • Direct access is fast when reading a single random record, since the preceding records are ignored.
  • Direct access allows individual records to be read from different locations in the file without reading intervening records.
  • When files are organized with fixed length records, the location of a record in a file can be calculated from its relative record number, and the file can be accessed using the seek functions:

        ByteOffset = (RRN - 1) × RecLen

  • When files have variable length records supported by an index, the records can be accessed directly through the index, with the use of the seek function.
  • For direct access to be useful, the relative record number of the record of interest must be known.
  • Direct access is often used to support keyed access.
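  • A C++ sketch of a read by relative record number, using the byte-offset formula above; RecLen is an illustrative fixed record length.

    #include <fstream>

    const int RecLen = 80;

    bool ReadByRRN (std::ifstream & file, long rrn, char * buffer) {
      long byteOffset = (rrn - 1) * RecLen;    // ByteOffset = (RRN - 1) x RecLen
      file.seekg (byteOffset);                 // seek past the intervening
      file.read (buffer, RecLen);              // records without reading them
      return file.good();
    }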

Keyed Access

Terms


keyed access


Accessing data from a file by an alphanumeric key associated with each record.

key


A value which is contained within or associated with a record and which can be used to identify the record. 

Remarks

  • Keyed access processes single records at a time by record key.
  • Mainframe and midrange operating systems support keyed access of files.
  • The Windows, DOS, UNIX, and Linux operating systems do not natively support keyed access of files.
  • When using Windows, DOS, UNIX, and Linux operating systems, applications must be programmed to use keyed access.
  • Keyed access will be covered in more detail in later chapters.

FUNDAMENTAL FILE STRUCTURE CONCEPTS


General

persistent

Retained after execution of the program which created it.


  • When we build file structures, we are making it possible to make data persistent. That is, one program can store data from memory to a file, and terminate. Later, another program can retrieve the data from the file, and process it in memory.
  • In this chapter, we look at file structures which can be used to organize the data within the file, and at the algorithms which can be used to store and retrieve the data sequentially.                                


Field and Record Organization

 

Data Representation in Memory

record

A subdivision of a file, containing data related to a single entity.

field

A subdivision of a record containing a single attribute of the entity which the record describes.

stream of bytes

A file which is regarded as being without structure beyond separation 
into a sequential set of bytes. 
 
 
 
  • Within a program, data is temporarily stored in variables.
  • Individual values can be aggregated into structures, which can be treated as a single variable with parts.
  • In C++, classes are typically used as an aggregate structure.
  • C++ Person class (version 0.1):
    class Person {
      public:
        char FirstName [11];
        char LastName [11];
        char Address [21];
        char City [21];
        char State [3];
        char ZIP [6];
    };
    
  • With this class declaration, variables can be declared to be of type Person.  The individual fields within a Person can be referred to as the name of the variable and the name of the field, separated by a period (.).
  • C++ Program:
    #include <iostream>
    #include <cstring>
    using namespace std;

    class Person {
      public:
        char FirstName [11];
        char LastName [11];
        char Address [31];
        char City [21];
        char State [3];
        char ZIP [6];
    };
    
    void Display (Person);
    
    int main () {
      Person Clerk;
      Person Customer;
      
      strcpy (Clerk.FirstName, "Fred");
      strcpy (Clerk.LastName, "Flintstone");
      strcpy (Clerk.Address, "4444 Granite Place");
      strcpy (Clerk.City, "Rockville");
      strcpy (Clerk.State, "MD");
      strcpy (Clerk.ZIP, "00001");
      
      strcpy (Customer.FirstName, "Lily");
      strcpy (Customer.LastName, "Munster");
      strcpy (Customer.Address, "1313 Mockingbird Lane");
      strcpy (Customer.City, "Hollywood");
      strcpy (Customer.State, "CA");
      strcpy (Customer.ZIP, "90210");
      
      Display (Clerk);
      Display (Customer);
    }
    
    void Display (Person Someone) {
      cout << Someone.FirstName << Someone.LastName
           << Someone.Address << Someone.City
           << Someone.State << Someone.ZIP;
    }
    
  • In memory, each Person will appear as an aggregate, with the individual values being parts of the aggregate:

    Person Clerk:
    FirstName  LastName    Address             City       State  ZIP
    Fred       Flintstone  4444 Granite Place  Rockville  MD     00001
  • The output of this program will be:
    FredFlintstone4444 Granite PlaceRockvilleMD00001LilyMunster1313 Mockingbird LaneHollywoodCA90210
  • Obviously, this output could be improved.  It is marginally readable by people, and it would be difficult to program a computer to read and correctly interpret this output. 


Delineation of Records in a File

 

Fixed Length Records

fixed length record

A record which is predetermined to be the same length as the other records in the file.

 
  • Record 1 Record 2 Record 3 Record 4 Record 5
  • The file is divided into records of equal size.
  • All records within a file have the same size.
  • Different files can have different length records.
  • Programs which access the file must know the record length.
  • Offset, or position, of the nth record of a file can be calculated.
  • There is no external overhead for record separation.
  • There may be internal fragmentation (unused space within records.)
  • There will be no external fragmentation (unused space outside of records) except for deleted records.
  • Individual records can always be updated in place.
  • Algorithms for Accessing Fixed Length Records
  • Code for Accessing Fixed Length Records
  • Code for Accessing String Records
  • Example (80 byte records):
       0  66 69 72 73 74 20 6C 69 6E 65  0  0  1  0  0  0  first line......
      10   0  0  0  0  0  0  0  0 FF FF FF FF  0  0  0  0  ................
      20  68 FB 12  0 DC E0 40  0 3C BA 42  0 78 FB 12  0  h.....@.<.B.x...
      30  CD E3 40  0 3C BA 42  0  8 BB 42  0 E4 FB 12  0  ..@.<.B...B.....
      40  3C 18 41  0 C4 FB 12  0  2  0  0  0 FC 3A 7C  0  <.A..........:|.
      50  73 65 63 6F 6E 64 20 6C 69 6E 65  0  1  0  0  0  second line.....
      60   0  0  0  0  0  0  0  0 FF FF FF FF  0  0  0  0  ................
      70  68 FB 12  0 DC E0 40  0 3C BA 42  0 78 FB 12  0  h.....@.<.B.x...
      80  CD E3 40  0 3C BA 42  0  8 BB 42  0 E4 FB 12  0  ..@.<.B...B.....
      90  3C 18 41  0 C4 FB 12  0  2  0  0  0 FC 3A 7C  0  <.A..........:|.
    
  • Advantage: the offset of each record can be calculated from its record number.  This makes direct access possible.
  • Advantage: there is no space overhead.
  • Disadvantage: there will probably be internal fragmentation (unusable space within records.)
 
 

Delimited Variable Length Records

 

variable length record

A record which can differ in length from the other records of the file.

delimited record

A variable length record which is terminated by a special character or sequence of characters.

delimiter

A special character or group of characters stored after a field or record, which indicates the end of the preceding unit.

     

  • Record 1 # Record 2 # Record 3 # Record 4 # Record 5 #
  • The records within a file are followed by a delimiting byte or series of bytes.
  • The delimiter cannot occur within the records.
  • Records within a file can have different sizes.
  • Different files can have different length records.
  • Programs which access the file must know the delimiter.
  • Offset, or position, of the nth record of a file cannot be calculated.
  • There is external overhead for record separation equal to the size of the delimiter per record.
  • There should be no internal fragmentation (unused space within records.)
  • There may be external fragmentation (unused space outside of records) after file updating.
  • Individual records cannot always be updated in place.
  • Algorithms for Accessing Delimited Variable Length Records
  • Code for Accessing Delimited Variable Length Records
  • Code for Accessing Variable Length Line Records
  • Example (Delimiter = ASCII 30 (1E hex) = RS character):
       0  66 69 72 73 74 20 6C 69 6E 65 1E 73 65 63 6F 6E  first line.secon
      10  64 20 6C 69 6E 65 1E                             d line.         
    
  • Example (Delimiter = '\n'):
       0  46 69 72 73 74 20 28 31 73 74 29 20 4C 69 6E 65  First (1st) Line
      10   D  A 53 65 63 6F 6E 64 20 28 32 6E 64 29 20 6C  ..Second (2nd) l
      20  69 6E 65  D  A                                   ine..           
    
  • Disadvantage: the offset of each record cannot be calculated from its record number.  This makes direct access impossible.
  • Disadvantage: there is space overhead for the delimiter.
  • Advantage: there will probably be no internal fragmentation (unusable space within records.) 

   

    Delineation of Fields in a Record

 

  Fixed Length Fields

  • Field 1 Field 2 Field 3 Field 4 Field 5
  • Each record is divided into fields of predetermined sizes; corresponding fields are the same size in every record.
  • Different fields within a record can have different sizes.
  • Different files can have different length fields.
  • Programs which access the record must know the field lengths.
  • There is no external overhead for field separation.
  • There may be internal fragmentation (unused space within fields.)
 
   

  Delimited Variable Length Fields

  

  • Field 1 ! Field 2 ! Field 3 ! Field 4 ! Field 5 !
  • The fields within a record are followed by a delimiting byte or series of bytes.
  • Fields within a record can have different sizes.
  • Different records can have different length fields.
  • Programs which access the record must know the delimiter.
  • The delimiter cannot occur within the data.
  • If used with delimited records, the field delimiter must be different from the record delimiter.
  • There is external overhead for field separation equal to the size of the delimiter per field.
  • There should be no internal fragmentation (unused space within fields.)
 
  

   Length Prefixed Variable Length Fields

  

  • 12 Field 1 4 Field 2 10 Field 3 8 Field 4 7 Field 5
  • The fields within a record are prefixed by a length byte or bytes.
  • Fields within a record can have different sizes.
  • Different records can have different length fields.
  • Programs which access the record must know the size and format of the length prefix.
  • There is external overhead for field separation equal to the size of the length prefix per field.
  • There should be no internal fragmentation (unused space within fields.)
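  • A C++ sketch of reading one field, assuming a one-byte binary length prefix as pictured above.

    #include <istream>
    #include <string>

    std::string ReadField (std::istream & in) {
      unsigned char len = 0;
      in.read ((char *) &len, 1);              // the prefix gives the size
      std::string field (len, '\0');
      if (len > 0)
        in.read (&field[0], len);              // read exactly that many bytes
      return field;
    }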
  

    

MAGNETIC DISKS

 

magnetic disk


A disk read and written by electromagnetic means

hard disk


A magnetic disk with a rigid substrate

floppy disk


A magnetic disk with a flexible substrate

fixed disk


A disk drive with non-removable media. 

cylinder


The set of tracks of a disk drive which can be accessed without changing the position of the access arm. 

track


The (circular) area on a disk platter which can be accessed without moving the access arm of the drive. 

sector


A fixed size physical data block on a disk drive. 

seek


To move to a specified location in a file.

block


A physical data record, separated on the medium from other blocks by inter-block gaps.

interblock gap


An area between data blocks which contains no data and which separates the blocks. 

access time


The total time required to store or retrieve data.

seek time


The time required for the head of a disk drive to be positioned to a designated cylinder. 

rotational delay


The time required for a designated sector to rotate to the head of a disk drive. 

transfer time


The time required to transfer the data from a sector, once the transfer has begun.  
 
 
 
 

Magnetic Technology

 

  • Data is stored on a magnetic disk by controlling the direction in which small areas of the disk surface are magnetized.
  • Data is stored on a disk serially, that is, one bit at a time. 
  • Data is recorded on magnetic disks as a series of magnetized areas on the surface of the disk.
  • The data is read by a head with a coil which is sensitive to changes in magnetism on the data track as the disk rotates. 
  • The heads are attached to a common shuttle, which moves them in and out, to different radial positions, together.
  • The data is blocked into physical sectors, which are arranged in many concentric circles called tracks.  For hard drives, the sectors are all the same physical size, and the number of sectors varies with the circumference of the track.
  • For floppy disks, there are the same number of sectors in each track, and the physical size of a sector varies with the circumference of the track.
  • Track and cylinder locations are determined by the physical geometry of the drive. 
  • Track and cylinder numbers begin with 0. 
  • Tracks are often referred to as heads. 
  • The sectors on each track are numbered from 1 up.
  • The block size of magnetic disks is almost always 512 bytes.
  • The blocks, or sectors, are separated by interblock gaps. 
  • Fixed disks are always hard disks. 
  • Removable disks are usually floppy disks. 
  • Accessing a sector on the drive requires three steps:
    1. Seeking: Moving the head to the right cylinder.
    2. Rotation: Waiting for the right sector to reach the head.
    3. Transfer: Waiting for the sector to pass under the head, reading or writing the data.
  • Seek time is affected by the size of the drive, the number of cylinders in the drive, and the mechanical responsiveness of the access arm.
  • Average seek time is approximately the time to move across 1/3 of the cylinders.
  • Rotational delay is also referred to as latency.
  • Rotational delay is inversely proportional to the rotational speed of the drive.
  • Average rotational delay is the time for the disk to rotate 180°.
  • Actual transfer time may be limited by the disk interface.
  • Transfer time is inversely proportional to the rotational speed of the drive.
  • Transfer time is inversely proportional to the physical length of a sector.
  • Transfer time is roughly inversely proportional to the number of sectors per track.
  • Data is always read or written in complete blocks. 
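  • Example (illustrative figures): a 7,200 RPM drive makes one rotation in 60,000 / 7,200 ≈ 8.3 ms, so the average rotational delay (180°) is about 4.2 ms.  With an assumed 8 ms average seek, and about 8.3 / 400 ≈ 0.02 ms to transfer one of 400 sectors on a track, a random single-sector access costs roughly 8 + 4.2 + 0.02 ≈ 12.2 ms.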

 

 
 
 
 
 
 

SECONDARY STORAGE AND SYSTEM SOFTWARE



Disks 

 

  • Data is stored on a magnetic disk by controlling the direction in which small areas of the disk surface are magnetized.
  • Data is stored on a disk serially, that is, one bit at a time. 

 Organization of Disks

 

 

fixed disk

A disk drive with non-removable media.

block

A physical data record, separated on the medium from other blocks by inter-block gaps.

sector

A fixed size physical data block on a disk drive.

interblock gap

An area between data blocks which contains no data and which separates the blocks.

cylinder

The set of tracks of a disk drive which can be accessed without changing the position of the access arm.

track

The (circular) area on a disk platter which can be accessed without moving the access arm of the drive. 
  • Data is recorded on each surface of the disk in sectors, which are located in concentric circles.
  • The sectors are separated by gaps which contain no data.
  • On PC disk drives, each sector contains 512 bytes of data.
  • Track and cylinder locations are determined by the physical geometry of the drive.
  • Track and cylinder numbers begin with 0.
  • Tracks are often referred to as heads.
 

 Organizing Tracks by Sector

  • Sector locations are determined by the electronics of the drive, and are identified by recorded address marks.
  • Sector numbers begin with 1.
  • Today, logically adjacent sectors are typically physically adjacent.

 Estimating Capacities and Space Needs

  • bytes/track = sectors/track * bytes/sector
  • bytes/cylinder = tracks/cylinder * bytes/track
  • bytes/drive = cylinders/drive * bytes/cylinder
  • bytes/drive = cylinders/drive * tracks/cylinder * sectors/track * bytes/sector
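  • Example (illustrative drive geometry): with 8,000 cylinders/drive, 16 tracks/cylinder, 400 sectors/track, and 512 bytes/sector: bytes/track = 400 × 512 = 204,800; bytes/cylinder = 16 × 204,800 = 3,276,800; bytes/drive = 8,000 × 3,276,800 = 26,214,400,000 (about 26 GB).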

 File allocation

  cluster

A group of sectors handled as a unit of file allocation.

   extent

A physical section of a file occupying adjacent clusters.

fragmentation

Unused space within a file.


  

The Cost of a Disk Access
 
 

direct access device

A data storage device which supports direct access.

direct access

Accessing data from a file by record position within the file, without accessing intervening records.

access time

The total time required to store or retrieve data.

transfer time

The time required to transfer the data from a sector, once the transfer has begun.

seek time

The time required for the head of a disk drive to be positioned to a designated cylinder.

rotational delay

The time required for a designated sector to rotate to the head of a disk drive.

 

  • Access time of a disk is related to physical movement of the disk parts.
  • Disk access time has three components: seek time, rotational delay, and transfer time.
  • Seek time is affected by the size of the drive, the number of cylinders in the drive, and the mechanical responsiveness of the access arm.
  • Average seek time is approximately the time to move across 1/3 of the cylinders.
  • Rotational delay is also referred to as latency.
  • Rotational delay is inversely proportional to the rotational speed of the drive.
  • Average rotational delay is the time for the disk to rotate 180°.
  • Transfer time is inversely proportional to the rotational speed of the drive.
  • Transfer time is inversely proportional to the physical length of a sector.
  • Transfer time is roughly inversely proportional to the number of sectors per track.
  • Actual transfer time may be limited by the disk interface.

 

Effect of Block Size

  • Fragmentation waste increases as cluster size increases.
  • Average access time decreases as cluster size increases.

Disk as a bottleneck

 

striping

The distribution of single files to two or more physical disk drives.

Redundant Array of Inexpensive Disks

An array of multiple disk drives which appears as a single drive to the system.

RAM disk

A virtual disk drive which actually exists in main memory.

solid state disk

A solid state memory array with an interface which responds as a disk drive.

cache

Solid state memory used to buffer and store data temporarily.

  • Several techniques have been developed to improve disk access time.
  • Striping allows disk transfers to be made in parallel.
  • There are 6 versions, or levels, of RAID technology.
  • RAID-0 uses striping.
  • RAID-0 improves access time, but does not provide redundancy.
  • RAID-1 uses mirroring, in which two drives are written with the same data.
  • RAID-1 provides complete redundancy.  If one drive fails, the other provides data backup.
  • RAID-1 improves read access time, but slows write access time.
  • RAM disks appear to programs as fast disk drives.
  • RAM disks are volatile.
  • Solid state disks appear to computer systems as fast disk drives.
  • Solid state disks are used on high performance database systems.
  • Caching improves average access time.
  • Disk caching can occur at three levels: in the computer main memory, in the disk controller, and in the disk drive.
  • Windows operating systems use main memory caching.
  • Disk controller caching requires special hardware.
  • Most disk drives now contain caching memory.
  • With caching, writes are typically reported as complete when the data is in the cache. 
  • The physical write is delayed until later.
  • With caching, reads typically read more data than is requested, storing the unrequested data in the cache. 
  • If a read can be satisfied from data already in the cache, no additional physical read is needed.
  • Read caching works on average because of program locality. 

File System Organization

 

File Allocation Table

A table on a disk volume containing chained lists of the physical locations of all files on the volume.

index node

A data structure, associated with a file, which describes the file.

Storage as a Hierarchy

 

 Journey of a Byte

 

      

    Introduction to CD-ROM

  • A single disc can hold more than 600 megabytes of data (about 400 books of the textbook's size).
  • CD-ROM is read only; i.e., it is a publishing medium rather than a data storage and retrieval medium like magnetic disks.
  • CD-ROM Strengths: High storage capacity, inexpensive price, durability.
  • CD-ROM Weaknesses: extremely slow seek performance (between half a second and a second) ==> Intelligent File Structures are critical.
     

    Physical Organization of CD-ROM

    • CD-ROM is a descendant of CD Audio; i.e., listening to music is sequential and does not require fast random access to data. 

    • Reading Pits and Lands: CD-ROMs are stamped from a glass master disk which has a coating that is changed by the laser beam. When the coating is developed, the areas hit by the laser beam turn into pits along the track followed by the beam. The smooth unchanged areas between the pits are called lands.

    CD-ROM Strengths & Weaknesses

  • Seek Performance: very bad.
  • Data Transfer Rate: not terrible, not great.
  • Storage Capacity: great.
    • Benefit: enables us to build indexes and other support structures that can help overcome some of the limitations associated with CD-ROM’s poor performance.
  • Read-Only Access: there can’t be any changes ==> file organization can be optimized.
  • No need for interaction with the user (which requires a quick response).
  • Etc.