Random access problem with sequential files

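A sequential file of variable-length records is simple to read from start to finish, but jumping to an arbitrary position is harder, because nothing in the byte stream says where a record begins.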

One basic way to address the problem is to partition the data by date stamp so that there is one file per day. This solves some problems and creates others.

  • It makes it simple to find the beginning or end of a given day.

  • It makes purging simple, provided the purge policy is just to delete whole days.

  • It makes iteration through the data a little more complicated, because the reader has to handle file boundaries.
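A minimal sketch of the one-file-per-day layout, to show where the file-boundary handling ends up. The `events-YYYY-MM-DD.log` naming scheme, the line-per-record format, and the function names are all assumptions made for illustration; they are not taken from any particular system.

```python
import datetime as dt
from pathlib import Path
from typing import Iterator


def day_file(root: Path, day: dt.date) -> Path:
    # Hypothetical naming scheme: one file per day, e.g. events-2024-03-01.log
    return root / f"events-{day.isoformat()}.log"


def iter_records(root: Path, start: dt.date, end: dt.date) -> Iterator[str]:
    """Yield records (assumed here to be text lines) from start to end inclusive."""
    day = start
    while day <= end:
        path = day_file(root, day)
        # A day with no data simply has no file; skip it.
        if path.exists():
            with path.open("r", encoding="utf-8") as f:
                for line in f:
                    yield line.rstrip("\n")
        day += dt.timedelta(days=1)


def purge_before(root: Path, cutoff: dt.date) -> int:
    """Delete every day file strictly older than cutoff; return how many were removed."""
    removed = 0
    for path in root.glob("events-*.log"):
        day = dt.date.fromisoformat(path.stem[len("events-"):])
        if day < cutoff:
            path.unlink()
            removed += 1
    return removed
```

Finding the start of a day is just opening the right file, and purging is a file delete. The cost is that every reader needs a loop like `iter_records` to step across day boundaries and cope with missing days.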

But what if you have a very large file and want to be able to jump to an arbitrary position in it? How do you find the start of a valid record from there?

There are several approaches to this problem.

  • One can index the file, so that a smaller file or data structure holds the precalculated locations of major points in the file.

    • An index can be a simple in-memory structure: when the file is first accessed, we pass through the whole thing once and record the offsets of a number of major points.

    • An index can also be stored as a file of its own, using any of a range of data structures.

  • One can use an algorithm that relies on the structure of the file: starting from a candidate position, it checks whether the data that follows conforms to the expected record structure. This is more reliable if each record carries a checksum, such as an MD5 hash, that confirms the record is intact.

  • One can use 'sentinel' magic byte patterns that mark the start of each record, which gives a quicker way to find a promising starting point. The catch is that record data may itself contain the magic bytes, so this needs to be combined with other techniques such as the structure check above.

  • A variation on magic bytes at the start of each record is to put an entire magic message into the flow at intervals. This has pros and cons as well: one has to decide how often to insert it, and it takes up extra space.
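The index, structure-check and sentinel ideas above can be sketched together. This is an illustration under assumptions, not a definitive format: the record layout (sentinel, payload length, checksum, payload), the sentinel value, the size limit and all function names are invented here, and CRC32 from `zlib` stands in for the MD5 hash mentioned above because it is shorter.

```python
import struct
import zlib

MAGIC = b"\xfa\xce\xb0\x0c"      # hypothetical sentinel bytes
HEADER = struct.Struct(">4sII")  # sentinel, payload length, CRC32 of payload
MAX_PAYLOAD = 1 << 20            # assumed sanity limit on record size


def write_record(f, payload: bytes) -> None:
    f.write(HEADER.pack(MAGIC, len(payload), zlib.crc32(payload)))
    f.write(payload)


def read_record_at(f, offset: int):
    """Return the payload if a structurally valid record starts at offset, else None."""
    f.seek(offset)
    header = f.read(HEADER.size)
    if len(header) < HEADER.size:
        return None
    magic, length, crc = HEADER.unpack(header)
    if magic != MAGIC or length > MAX_PAYLOAD:
        return None
    payload = f.read(length)
    # The checksum rejects sentinels that merely occur inside record data.
    if len(payload) != length or zlib.crc32(payload) != crc:
        return None
    return payload


def find_record_start(f, offset: int, chunk_size: int = 4096):
    """Scan forward from offset for the first sentinel that begins a valid record."""
    pos = offset
    while True:
        f.seek(pos)
        chunk = f.read(chunk_size)
        if len(chunk) < len(MAGIC):
            return None  # reached end of file without a valid record
        i = chunk.find(MAGIC)
        while i != -1:
            if read_record_at(f, pos + i) is not None:
                return pos + i
            i = chunk.find(MAGIC, i + 1)
        # Overlap chunks so a sentinel straddling a chunk boundary is not missed.
        pos += len(chunk) - (len(MAGIC) - 1)


def build_sparse_index(f, every: int = 2):
    """One sequential pass from offset 0, keeping the offset of every `every`-th record."""
    index, offset, n = [], 0, 0
    while (payload := read_record_at(f, offset)) is not None:
        if n % every == 0:
            index.append(offset)
        offset += HEADER.size + len(payload)
        n += 1
    return index
```

A reader that lands at an arbitrary offset calls `find_record_start` and then reads sequentially from the returned position. A checksum match is strong evidence of a real record boundary rather than proof, so a stricter reader might also require the following record to validate. The sparse index trades a full initial pass for cheaper seeks afterwards.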

 

 
