RFC: Tie::File::AnyData Lightweight Databases in Perl

Esteemed Monks,

Some months ago, Dominus posted a Meditation here about some material he had released for free on "Lightweight Database Techniques" in Perl (i.e., using flat text files as simple databases).

As he puts it, these techniques are useful in the following situation: "When you don't have enough data to be bothered using a high-performance database, or when your data is simple enough that you don't want to bother with a relational database, you stick it in a flat file and hack up some file code to read it."

One of these techniques involves the use of his Tie::File module (i.e., how to access and manipulate a text file as if it were a Perl array). One limitation of Tie::File is that each element of the resulting array corresponds to one line in the tied file. You can always use the "recsep" parameter to define what a record is, but this just changes $/ locally, so a record is still whatever ends at one fixed separator string.
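To make the "recsep" behaviour concrete, here is a minimal sketch that ties the same file twice, first with the default separator and then with a custom one. The sample data, the "//" separator line and the variable names are my own assumptions for illustration; only the Tie::File call and its "recsep" option come from the module itself.

```perl
use strict;
use warnings;
use Tie::File;
use File::Temp qw(tempfile);

# Write a small sample file whose records end with a "//" line
# (the format is made up for this sketch).
my ($out, $path) = tempfile(UNLINK => 1);
print {$out} "Peter 3\nPeter 15\n//\nJohn 1\n//\n";
close $out or die "Cannot close $path: $!";

# Default behaviour: every line of the file is one array element.
tie my @lines, 'Tie::File', $path or die "Cannot tie $path: $!";
my $line_count = @lines;
untie @lines;

# With recsep: every chunk ending in the separator string is one element.
tie my @records, 'Tie::File', $path, recsep => "\n//\n"
    or die "Cannot tie $path: $!";
my $record_count = @records;
my $first        = $records[0];   # separator is stripped by autochomp
untie @records;

print "lines: $line_count, records: $record_count\n";
print "first record:\n$first\n";
```

Note that this only helps when the file already contains an explicit separator between records; it cannot group lines by their content.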

While working with this module I thought that it would be nice to be able to define "records" in a more flexible way than the fixed one line <=> one record mapping.

For example, consider the following simple piece of data:

Peter   3
Peter   15
Peter   5
John    1
John    7
Mike    4

If you access a text file containing this data with Tie::File, you will find that the first record is "Peter 3", the second is "Peter 15", and so on. Maybe that is what you want, but in many cases it would be more useful to get all "Peter" entries in the first record, all "John" entries in the second, and so on.

With this in mind I wrote a small and simple module: Tie::File::AnyData, which adds this functionality to Tie::File. This module accepts the optional extra parameter "code" to its constructor. This must be a code reference (an anonymous subroutine) that must be able to read one record per call from the tied file.
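As an illustration of what such a subroutine might look like, here is a minimal sketch of a reader that returns one group of consecutive lines sharing the same first field per call. It assumes the subroutine is handed the open filehandle as its first argument (please check the module documentation for the actual calling convention); the name read_group is hypothetical, and the demo runs on an in-memory filehandle rather than through the module.

```perl
use strict;
use warnings;

# Hypothetical reader: returns the next run of consecutive lines that share
# the same first whitespace-separated field, or undef at end of file.
sub read_group {
    my ($fh) = @_;
    my $first = <$fh>;
    return undef unless defined $first;

    my ($key) = split ' ', $first;
    $key //= '';                      # blank line: treat the key as empty
    my $record = $first;

    while (1) {
        my $pos  = tell $fh;          # remember where the next line starts
        my $line = <$fh>;
        last unless defined $line;
        my ($next_key) = split ' ', $line;
        $next_key //= '';
        if ($next_key ne $key) {
            seek $fh, $pos, 0;        # not ours: leave it for the next call
            last;
        }
        $record .= $line;
    }
    return $record;
}

# Demo on the sample data, read through an in-memory filehandle.
my $data = "Peter\t3\nPeter\t15\nPeter\t5\nJohn\t1\nJohn\t7\nMike\t4\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @groups;
while (defined(my $group = read_group($fh))) {
    push @groups, $group;
}
close $fh;

printf "%d records\n", scalar @groups;
print "First record:\n$groups[0]";
```

The reader uses tell/seek to "push back" the first line of the next group, so that the file position after each call sits exactly at a record boundary.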

The source code and the documentation of this module can be obtained from http://lotka.uv.es/scriptome/data/attic/wiki/Tie-File-AnyData-0.01.tar.gz.

One example of use could be:

use Tie::File::AnyData;

my $coderef = sub {
    ## Code to retrieve the records from a file one by one (one record per call)
};
tie my @data, 'Tie::File::AnyData', $file, code => $coderef;
       ## Use the tied array

untie @data;

The module works by redefining the function "_read_record" in Tie::File (the function that reads the records from the file). The rest of the functionality of Tie::File remains intact. This means that if you don't provide the "code" parameter, you get the same results as with Tie::File:

use Tie::File::AnyData;
tie my @data, 'Tie::File::AnyData', $file;
       ## Use the tied array as with Tie::File

Because it may be hard and tedious to write a new anonymous subroutine to parse the records of a file each time you use the module, you can subclass it with predefined formats. For example, Tie::File::AnyData::CSV (which can be obtained from http://lotka.uv.es/scriptome/data/attic/wiki/Tie-File-AnyData-CSV-0.01.tar.gz) can correctly parse the kind of data given in the above example:

Peter   3
Peter   15
Peter   5
John    1
John    7
Mike    4

use Tie::File::AnyData::CSV;

tie my @arr, 'Tie::File::AnyData::CSV', $file or die;
print "$arr[0]\n";

Prints:

Peter   3
Peter   15
Peter   5

Another example is given in Tie::File::AnyData::Bio::Fasta (which can be obtained from http://lotka.uv.es/scriptome/data/attic/wiki/Tie-File-AnyData-Bio-Fasta-0.01.tar.gz). This module subclasses Tie::File::AnyData and is able to read a FASTA file as a Perl array in which each element corresponds to one FASTA sequence. One example of use could be:

use Tie::File::AnyData::Bio::Fasta;

tie my @fastaArray, 'Tie::File::AnyData::Bio::Fasta', $file or die $!;

# Substitute the 10th sequence:
$fastaArray[9] = $newsequence;

# Get 10 random sequences:
use List::Util qw/shuffle/;
my @out = (shuffle @fastaArray)[0..9];
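For readers unfamiliar with the format, here is a rough sketch of what "one FASTA sequence per record" means in terms of reading from a filehandle: a record starts at a line beginning with ">" and runs up to (but not including) the next such line. This is not the code of Tie::File::AnyData::Bio::Fasta; the name read_fasta_record and the sample sequences are assumptions made up for the illustration.

```perl
use strict;
use warnings;

# Hypothetical reader: returns the next FASTA record (header line plus its
# sequence lines) from the filehandle, or undef at end of file.
sub read_fasta_record {
    my ($fh) = @_;
    my $record;
    while (1) {
        my $pos  = tell $fh;          # start of the line we are about to read
        my $line = <$fh>;
        last unless defined $line;
        if (defined $record && $line =~ /^>/) {
            seek $fh, $pos, 0;        # next header: leave it for the next call
            last;
        }
        $record .= $line;
    }
    return $record;
}

# Demo on a tiny made-up FASTA file held in memory.
my $fasta = ">seq1 first example\nACGT\nTTGA\n>seq2\nGGCC\n";
open my $in, '<', \$fasta or die "Cannot open in-memory file: $!";

my @seqs;
while (defined(my $rec = read_fasta_record($in))) {
    push @seqs, $rec;
}
close $in;

printf "%d sequences\n", scalar @seqs;
print $seqs[0];
```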

citromatik
