RFC: Tie::File::AnyData Lightweight Databases in Perl

Esteemed Monks,

Some months ago, Dominus posted a Meditation here about some material he wrote and released freely on "Lightweight Database Techniques" in Perl (i.e. using flat text files as simple databases).

As he puts it, these techniques are useful because "When you don't have enough data to be bothered using a high-performance database, or when your data is simple enough that you don't want to bother with a relational database, you stick it in a flat file and hack up some file code to read it."

One of these techniques involves his Tie::File module, which lets you access and manipulate a text file as if it were a Perl array. One limitation of Tie::File is that each element of the resulting array corresponds to one line in the tied file. You can always use the "recsep" parameter to define what a record is, but this just changes $/ locally, so a record is still whatever lies between two occurrences of a fixed separator string.
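To make the "recsep" point concrete, here is a minimal sketch using plain Tie::File. The "%%" separator line, the temporary file and the helper name records_with_recsep are made up for this illustration. It shows that multi-line records are possible only if the file already contains an explicit separator string after every record.

```perl
use strict;
use warnings;
use Tie::File;
use File::Temp qw(tempfile);

# Hypothetical helper: tie $path with the given record separator and
# return a plain copy of all records (copied out before untying).
sub records_with_recsep {
    my ($path, $recsep) = @_;
    tie my @records, 'Tie::File', $path, recsep => $recsep
        or die "Cannot tie $path: $!";
    my @copy = @records;
    untie @records;
    return @copy;
}

# Every record in this sample file ends with the literal string "%%\n".
my ($out, $path) = tempfile(UNLINK => 1);
print {$out} "Peter\t3\nPeter\t15\n%%\n", "John\t1\nJohn\t7\n%%\n";
close $out or die "close: $!";

my @records = records_with_recsep($path, "%%\n");
print scalar(@records), " records\n";
print "First record:\n$records[0]";
```

Data like the Peter/John/Mike table further down has no such separator, which is the case this module is meant to cover.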

While working with this module I thought that it would be nice to be able to define "records" in a more complex way than just doing the 1 line <=> 1 record assignment.

For example, consider the following simple piece of data:

Peter   3
Peter   15
Peter   5
John    1
John    7
Mike    4

If you are accessing a text file that contains this data with Tie::File you will find that the first record is "Peter 3", the second "Peter 15" and so on. Maybe that is what you want, but in many cases it would be more useful to get all "Peter" entries in the first record, all "John" entries in the second, etc...

With this in mind I wrote a small and simple module: Tie::File::AnyData, which adds this functionality to Tie::File. This module accepts the optional extra parameter "code" to its constructor. This must be a code reference (an anonymous subroutine) that must be able to read one record per call from the tied file.
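As a sketch of what such a code reference could look like for the data above, the following reader returns, on each call, all consecutive lines that share the same first field. It assumes the subroutine is handed the open filehandle as its only argument (check the module's documentation for the actual calling convention). The name read_group is made up, and the reader is exercised here against an in-memory filehandle rather than through the module.

```perl
use strict;
use warnings;

# Hypothetical reader: returns all consecutive lines that share the same
# first whitespace-separated field, or undef at end of file.
# Assumption: it is called with the open filehandle as its only argument.
sub read_group {
    my ($fh) = @_;
    my $first = <$fh>;
    return undef unless defined $first;
    my $key    = (split ' ', $first)[0] // '';
    my $record = $first;
    while (1) {
        my $pos  = tell $fh;          # remember where the next line starts
        my $line = <$fh>;
        last unless defined $line;
        my $next_key = (split ' ', $line)[0] // '';
        if ($next_key ne $key) {
            seek $fh, $pos, 0;        # not ours: rewind so the next call sees it
            last;
        }
        $record .= $line;
    }
    return $record;
}

my $data = "Peter\t3\nPeter\t15\nPeter\t5\nJohn\t1\nJohn\t7\nMike\t4\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
while (defined(my $record = read_group($fh))) {
    print "--- record ---\n$record";
}
close $fh;
```

The reader has to peek one line ahead to know where a record ends, so it rewinds with seek whenever the peeked line belongs to the next record.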

The source code and documentation of this module are in this tarball: http://lotka.uv.es/scriptome/data/attic/wiki/Tie-File-AnyData-0.01.tar.gz

One example of use could be:

use Tie::File::AnyData;

my $coderef = sub {
    ## Code to retrieve the records from a file one by one (one record per call)
};
tie my @data, 'Tie::File::AnyData', $file, code => $coderef;
## Use the tied array

untie @data;

The module works by hacking (redefining) the function "_read_record" in Tie::File (the function that reads the records from the file). The rest of Tie::File's functionality remains intact. This means that if you don't provide the "code" parameter, you get the same results as with Tie::File:

use Tie::File::AnyData;
tie my @data, 'Tie::File::AnyData', $file;
       ## Use the tied array as with Tie::File
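The approach described above (replace the record reader, leave the rest of Tie::File alone) can be sketched as a small subclass. This is not the source of Tie::File::AnyData: the package name My::CodeTie and the helper two_line_reader are hypothetical. The sketch also assumes that Tie::File reads every record through its private _read_record method and keeps its filehandle in $self->{fh}, internals that are not part of its documented interface and may change.

```perl
use strict;
use warnings;

package My::CodeTie;    # hypothetical name, not the real module
use parent 'Tie::File';

sub TIEARRAY {
    my ($class, $file, %opts) = @_;
    # Tie::File rejects options it does not know, so take "code" out first.
    my $code = delete $opts{code};
    my $self = $class->SUPER::TIEARRAY($file, %opts) or return;
    $self->{code} = $code if defined $code;
    return $self;
}

# Assumption: Tie::File fetches every record through this private method
# and stores its filehandle in $self->{fh}.
sub _read_record {
    my ($self) = @_;
    return $self->SUPER::_read_record() unless $self->{code};
    return $self->{code}->($self->{fh});
}

package main;
use File::Temp qw(tempfile);

# Hypothetical reader: two physical lines per record. The returned string
# must still end with the record separator ("\n" by default).
sub two_line_reader {
    my ($fh) = @_;
    my $first = <$fh>;
    return undef unless defined $first;
    my $second = <$fh>;
    return defined $second ? $first . $second : $first;
}

my ($out, $path) = tempfile(UNLINK => 1);
print {$out} "Peter\t3\nPeter\t15\nPeter\t5\nJohn\t1\nJohn\t7\nMike\t4\n";
close $out or die "close: $!";

tie my @pairs, 'My::CodeTie', $path, code => \&two_line_reader
    or die "Cannot tie $path: $!";
print scalar(@pairs), " records\n";
print "$pairs[0]\n";
untie @pairs;
```

Without the "code" option the subclass falls through to the inherited reader, which matches the "same results as with Tie::File" behaviour described above.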

Because it can be tedious to write a new anonymous subroutine to parse the records of a file each time you use the module, you can subclass it with predefined formats. For example, Tie::File::AnyData::CSV (available from http://lotka.uv.es/scriptome/data/attic/wiki/Tie-File-AnyData-CSV-0.01.tar.gz) can correctly parse the kind of data given in the example above:

Peter   3
Peter   15
Peter   5
John    1
John    7
Mike    4

use Tie::File::AnyData::CSV;

tie my @arr, 'Tie::File::AnyData::CSV', $file or die;
print "$arr[0]\n";

Prints:

Peter   3
Peter   15
Peter   5

Another example is Tie::File::AnyData::Bio::Fasta (available from http://lotka.uv.es/scriptome/data/attic/wiki/Tie-File-AnyData-Bio-Fasta-0.01.tar.gz). This module subclasses Tie::File::AnyData and can read a FASTA file as a Perl array in which each element corresponds to one FASTA sequence. One example of use could be:

use Tie::File::AnyData::Bio::Fasta;

tie my @fastaArray, 'Tie::File::AnyData::Bio::Fasta', $file or die $!;

# Substitute the 10th sequence:
$fastaArray[9] = $newsequence;

# Get 10 random sequences:
use List::Util qw/shuffle/;
my @out = (shuffle @fastaArray)[0..9];

citromatik



Re: RFC: Tie::File::AnyData Lightweight Databases in Perl
by Jenda (Abbot) on Dec 11, 2007 at 01:00 UTC

Sounds neat except ... what has got the example to do with CSV? It looks like either fixed position records or (though trying to select the spaces between the names and numbers disproves that) tab separated. With a fairly strange twist of using the first column as a grouping key.

From Tie::File::AnyData::CSV I would expect to get a tied array of arrays or array of hashes that'd let me access the individual items in the rows. And hopefully the ability to include the record separator within quoted data. Please find a better name for that format.

Jenda
Support Denmark!
Defend the free world!


Re^2: RFC: Tie::File::AnyData Lightweight Databases in Perl

by citromatik (Curate) on Dec 11, 2007 at 10:19 UTC

... what has got the example to do with CSV?

Well, in fact, a more correct name would be multilineCSV. This module takes several consecutive lines of fields (CSV, tabular, or whatever) that have the same value in a given field (the "key" field) and joins them into one logical record.

Well, I think that it has to do with CSV at least because internally it uses Parse::CSV to parse the lines.

Let's consider the following example:

Mike,5,6
Mike,5,3
John,5,1
John,3,0
Frank,6,1

Using this code:

use Tie::File::AnyData::CSV;

tie my @arr, 'Tie::File::AnyData::CSV', $file, key => 0, field_sep => ",";
print "$arr[0]\n";
## Prints:
# Mike,5,6
# Mike,5,3

untie @arr;

tie my @arr2, 'Tie::File::AnyData::CSV', $file, key => 1, field_sep => ",";
print "$arr2[0]\n";
## Prints:
# Mike,5,6
# Mike,5,3
# John,5,1

In the first case, all consecutive lines that have the same value in the first field ("Mike" in the key field) are considered one record. In the second case, all consecutive lines that have the same value in the second field ("5") are considered one record.
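The grouping rule just described can be sketched in a few lines. This is only an illustration of the rule, not the module's code: group_by_key is a made-up helper, it works on a list of lines already in memory, and it uses a plain split instead of Parse::CSV, so quoted fields containing the separator are not handled.

```perl
use strict;
use warnings;

# Hypothetical helper: join consecutive lines that share the same value in
# column $key (0-based) into one multi-line record.
sub group_by_key {
    my ($key, $sep, @lines) = @_;
    my (@groups, $last);
    for my $line (@lines) {
        my $value = (split /\Q$sep\E/, $line)[$key] // '';
        if (@groups && $value eq $last) {
            $groups[-1] .= "\n$line";     # same key as previous line: extend
        } else {
            push @groups, $line;          # key changed: start a new record
        }
        $last = $value;
    }
    return @groups;
}

my @lines = ("Mike,5,6", "Mike,5,3", "John,5,1", "John,3,0", "Frank,6,1");

my @by_name = group_by_key(0, ',', @lines);
print "key => 0, first record:\n$by_name[0]\n\n";

my @by_second = group_by_key(1, ',', @lines);
print "key => 1, first record:\n$by_second[0]\n";
```

Note that only consecutive lines are merged: a later line with the same key value starts a new record.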

Hmmm, totally agree, I will try to change that name.

Thanks for your comments!

citromatik


Re: RFC: Tie::File::AnyData Lightweight Databases in Perl
by tcf03 (Deacon) on Dec 10, 2007 at 20:15 UTC

Sounds kinda like DBM::Deep or Storable

Ted
--
"That which we persist in doing becomes easier, not that the task itself has become easier, but that our ability to perform it has improved."
--Ralph Waldo Emerson


Re^2: RFC: Tie::File::AnyData Lightweight Databases in Perl

by citromatik (Curate) on Dec 11, 2007 at 09:57 UTC

Sounds kinda like DBM::Deep or Storable

Like Tie::File does?

This module is nothing more than a hack on Tie::File, so it maintains all the goodies that module offers (deferred writing, indexing so that the whole file isn't loaded into memory, a read cache, etc.).

citromatik


Re: RFC: Tie::File::AnyData Lightweight Databases in Perl
by jZed (Prior) on Dec 11, 2007 at 18:55 UTC

AnyData


Re: RFC: Tie::File::AnyData Lightweight Databases in Perl
by metaperl (Curate) on Dec 11, 2007 at 14:50 UTC

JDB is a package of commands for manipulating flat-ASCII databases from shell/Perl scripts
JDB was inspired by RDB
and RDB is similar to NOSQL

I have beheld the tarball of 22.1 on ftp.gnu.org with my own eyes. How can you say that there is no God in the Church of Emacs? -- David Kastrup

[tag://rdbms,etl,data]

