File Structure: FUNDAMENTAL FILE STRUCTURE CONCEPTS

General

persistent

Retained after execution of the program which created it.

When we build file structures, we are making it possible to make data persistent. That is, one program can store data from memory to a file, and terminate. Later, another program can retrieve the data from the file, and process it in memory.
In this chapter, we look at file structures which can be used to organize the data within the file, and at the algorithms which can be used to store and retrieve the data sequentially.

Field and Record Organization

Data Representation in Memory

record

A subdivision of a file, containing data related to a single entity.

field

A subdivision of a record containing a single attribute of the entity which the record describes.

stream of bytes

A file which is regarded as being without structure beyond separation

into a sequential set of bytes.

Within a program, data is temporarily stored in variables.
Individual values can be aggregated into structures, which can be treated as a single variable with parts.
In C++, classes are typically used as as an aggregate structure.

C++ Person class (version 0.1):

class Person {
  public:
    char FirstName [11];
    char LastName[11];
    char Address [21];
    char City [21];
    char State [3];
    char ZIP [5];
};

With this class declaration, variables can be declared to be of type Person. The individual fields within a Person can be referred to as the name of the variable and the name of the field, separated by a period (.).

C++ Program:

#include 

class Person {
  public:
    char FirstName [11];
    char LastName[11];
    char Address [31];
    char City [21];
    char State [3];
    char ZIP [5];
};

void Display (Person);

int main () {
  Person Clerk;
  Person Customer;
  
  strcpy (Clerk.FirstName, "Fred");
  strcpy (Clerk.LastName, "Flintstone");
  strcpy (Clerk.Address, "4444 Granite Place");
  strcpy (Clerk.City, "Rockville");
  strcpy (Clerk.State, "MD");
  strcpy (Clerk.ZIP, "00001");
  
  strcpy (Customer.FirstName, "Lily");
  strcpy (Customer.LastName, "Munster");
  strcpy (Customer.Address, "1313 Mockingbird Lane");
  strcpy (Customer.City, "Hollywood");
  strcpy (Customer.State, "CA");
  strcpy (Customer.ZIP, "90210");
  
  Display (Clerk);
  Display (Customer);
}

void Display (Person Someone) {
  cout << Someone.FirstName << Someone.LastName
       << Someone.Address << Someone.City
       << Someone.State << Someone.ZIP;
}

In memory, each Person will appear as an aggregate, with the individual values being parts of the aggregate:

Person

Clerk

FirstName LastName Address City State ZIP

Fred Flintstone 4444 Granite Place Rockville MD 0001
The output of this program will be:
FredFlintstone4444 Granite PlaceRockvilleMD00001LilyMunster1313 Mockingbird LaneHollywoodCA90210
Obviously, this output could be improved. It is marginally readable by people, and it would be difficult to program a computer to read and correctly interpret this output.

Person
Clerk
FirstName	LastName	Address	City	State	ZIP
Fred	Flintstone	4444 Granite Place	Rockville	MD	0001

Delineation of Records in a File

Fixed Length Records

A record which is predetermined to be the same length as the other records in the file.

Record 1 Record 2 Record 3 Record 4 Record 5
The file is divided into records of equal size.
All records within a file have the same size.
Different files can have different length records.
Programs which access the file must know the record length.
Offset, or position, of the nth record of a file can be calculated.
There is no external overhead for record separation.
There may be internal fragmentation (unused space within records.)
There will be no external fragmentation (unused space outside of records) except for deleted records.
Individual records can always be updated in place.
Algorithms for Accessing Fixed Length Records
Code for Accessing Fixed Length Records
Code for Accessing String Records

Example (80 byte records):

   0  66 69 72 73 74 20 6C 69 6E 65  0  0  1  0  0  0  first line......
  10   0  0  0  0  0  0  0  0 FF FF FF FF  0  0  0  0  ................
  20  68 FB 12  0 DC E0 40  0 3C BA 42  0 78 FB 12  0  h.....@.<.B.x...
  30  CD E3 40  0 3C BA 42  0  8 BB 42  0 E4 FB 12  0  ..@.<.B...B.....
  40  3C 18 41  0 C4 FB 12  0  2  0  0  0 FC 3A 7C  0  <.A..........:|.
  50  73 65 63 6F 6E 64 20 6C 69 6E 65  0  1  0  0  0  second line.....
  60   0  0  0  0  0  0  0  0 FF FF FF FF  0  0  0  0  ................
  70  68 FB 12  0 DC E0 40  0 3C BA 42  0 78 FB 12  0  h.....@.<.B.x...
  80  CD E3 40  0 3C BA 42  0  8 BB 42  0 E4 FB 12  0  ..@.<.B...B.....
  90  3C 18 41  0 C4 FB 12  0  2  0  0  0 FC 3A 7C  0  <.A..........:|.

Advantage: the offset of each record can be calculated from its record number. This makes direct access possible.
Advantage: there is no space overhead.
Disadvantage: there will probably be internal fragmentation (unusable space within records.)

Delimited Variable Length Records

variable length record

A record which can differ in length from the other records of the file.

delimited record

A variable length record which is terminated by a special character or sequence of characters.

delimiter

A special character or group of characters stored after a field or record, which indicates the end of the preceding unit.

Record 1 # Record 2 # Record 3 # Record 4 # Record 5 #
The records within a file are followed by a delimiting byte or series of bytes.
The delimiter cannot occur within the records.
Records within a file can have different sizes.
Different files can have different length records.
Programs which access the file must know the delimiter.
Offset, or position, of the nth record of a file cannot be calculated.
There is external overhead for record separation equal to the size of the delimiter per record.
There should be no internal fragmentation (unused space within records.)
There may be no external fragmentation (unused space outside of records) after file updating.
Individual records cannot always be updated in place.
Algorithms for Accessing Delimited Variable Length Records
Code for Accessing Delimited Variable Length Records
Code for Accessing Variable Length Line Records

Example (Delimiter = ASCII 30 (IE) = RS character:

   0  66 69 72 73 74 20 6C 69 6E 65 1E 73 65 63 6F 6E  first line.secon
  10  64 20 6C 69 6E 65 1E                             d line.

Example (Delimiter = '\n'):

   0  46 69 72 73 74 20 28 31 73 74 29 20 4C 69 6E 65  First (1st) Line
  10   D  A 53 65 63 6F 6E 64 20 28 32 6E 64 29 20 6C  ..Second (2nd) l
  20  69 6E 65  D  A                                   ine..

Disadvantage: the offset of each record cannot be calculated from its record number. This makes direct access impossible.
Advantage: there is space overhead for the length prefix.
Advantage: there will probably be no internal fragmentation (unusable space within records.)

Delineation of Fields in a Record

Fixed Length Fields

Field 1 Field 2 Field 3 Field 4 Field 5
Each record is divided into fields of correspondingly equal size.
Different fields within a record have different sizes.
Different records can have different length fields.
Programs which access the record must know the field lengths.
There is no external overhead for field separation.
There may be internal fragmentation (unused space within fields.)

Delimited Variable Length Fields

Field 1 ! Field 2 ! Field 3 ! Field 4 ! Field 5 !
The fields within a record are followed by a delimiting byte or series of bytes.
Fields within a record can have different sizes.
Different records can have different length fields.
Programs which access the record must know the delimiter.
The delimiter cannot occur within the data.
If used with delimited records, the field delimiter must be different from the record delimiter.
There is external overhead for field separation equal to the size of the delimiter per field.
There should be no internal fragmentation (unused

Length Prefixed Variable Length Fields

12 Field 1 4 Field 2 10 Field 3 8 Field 4 7 Field 5
The fields within a record are prefixed by a length byte or bytes.
Fields within a record can have different sizes.
Different records can have different length fields.
Programs which access the record must know the size and format of the length prefix.
There is external overhead for field separation equal to the size of the length prefix per field.
There should be no internal fragmentation (unused space within fields.)

File Structure

Wednesday, 28 November 2012

FUNDAMENTAL FILE STRUCTURE CONCEPTS