PerlMonks  

RFC: Tie::File::AnyData Lightweight Databases in Perl

by citromatik (Curate)
on Dec 10, 2007 at 15:31 UTC ( [id://656142]=perlmeditation )

Esteemed Monks,

Some months ago, Dominus posted a Meditation here about some material he had written and released for free on "Lightweight Database Techniques" in Perl (i.e. using flat text files as simple databases).

As he puts it, these techniques are useful "When you don't have enough data to be bothered using a high-performance database, or when your data is simple enough that you don't want to bother with a relational database, you stick it in a flat file and hack up some file code to read it".

One of these techniques involves the use of his Tie::File module (i.e. how to access and manipulate a text file as if it were a Perl array). One limitation of Tie::File is that each element of the resulting array corresponds to one line in the tied file. You can always use the "recsep" parameter to define what a record is, but this just changes $/ locally.

While working with this module I thought that it would be nice to be able to define "records" in a more complex way than just doing the 1 line <=> 1 record assignment.

For example, consider the following simple piece of data:

Peter 3
Peter 15
Peter 5
John 1
John 7
Mike 4

If you are accessing a text file that contains this data with Tie::File you will find that the first record is "Peter 3", the second "Peter 15" and so on. Maybe that is what you want, but in many cases it would be more useful to get all "Peter" entries in the first record, all "John" entries in the second, etc...
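
To make the default behaviour concrete, here is a minimal sketch that uses only the core Tie::File module. The temporary file and the variable names are illustrative assumptions, set up just so the snippet runs on its own.

```perl
use strict;
use warnings;
use Tie::File;
use File::Temp qw(tempfile);

# Write the sample data to a temporary file (illustrative setup only).
my ($tmp, $file) = tempfile(UNLINK => 1);
print {$tmp} "Peter 3\nPeter 15\nPeter 5\nJohn 1\nJohn 7\nMike 4\n";
close $tmp or die "Cannot close $file: $!";

# With plain Tie::File, each array element maps to exactly one line.
tie my @lines, 'Tie::File', $file or die "Cannot tie $file: $!";
my $count  = scalar @lines;
my $first  = $lines[0];
my $second = $lines[1];
untie @lines;

print "$count records; first: '$first'; second: '$second'\n";
```

The three "Peter" lines end up in three separate elements, which is the limitation discussed here.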

With this in mind I wrote a small and simple module: Tie::File::AnyData, which adds this functionality to Tie::File. This module accepts the optional extra parameter "code" to its constructor. This must be a code reference (an anonymous subroutine) that must be able to read one record per call from the tied file.

The source code and the documentation of this module can be obtained from http://lotka.uv.es/scriptome/data/attic/wiki/Tie-File-AnyData-0.01.tar.gz.

One example of use could be:

use Tie::File::AnyData;
my $coderef = sub {
    ## Code to retrieve the records from a file one by one (one record per call)
};
tie my @data, 'Tie::File::AnyData', $file, code => $coderef;
## Use the tied array
untie @data;
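
As a rough idea of what such a record reader might contain, the sketch below groups consecutive lines that share the same first field. It is plain Perl and does not use the module at all; the function name read_grouped_record is hypothetical, and the idea that the reader receives the filehandle as its first argument is an assumption, not something stated above.

```perl
use strict;
use warnings;

# Hypothetical reader: returns the next run of consecutive lines that
# share the same first field, or nothing at end of file.
sub read_grouped_record {
    my ($fh) = @_;    # assumption: the reader is handed the filehandle
    my $record = <$fh>;
    return unless defined $record;
    my ($key) = split ' ', $record;
    $key = '' unless defined $key;

    while (1) {
        my $pos  = tell $fh;    # remember where the next line starts
        my $line = <$fh>;
        last unless defined $line;
        my ($next_key) = split ' ', $line;
        if (!defined $next_key || $next_key ne $key) {
            seek $fh, $pos, 0;  # not ours: push the line back
            last;
        }
        $record .= $line;
    }
    return $record;
}

# Demonstration on an in-memory file holding the sample data.
my $data = "Peter 3\nPeter 15\nPeter 5\nJohn 1\nJohn 7\nMike 4\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my @records;
while (defined(my $rec = read_grouped_record($fh))) {
    push @records, $rec;
}
close $fh;

print scalar(@records), " records\n";
print $records[0];
```

Seeking back to the start of the first non-matching line leaves the file position at the next record boundary, which matters for any code that records offsets between calls.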

The module works by hacking (redefining) the function "_read_record" in Tie::File (the function that reads the records from the file). The rest of the functionality of Tie::File remains intact. This means that if you don't provide the "code" parameter, you obtain the same results as with Tie::File:

use Tie::File::AnyData;
tie my @data, 'Tie::File::AnyData', $file;
## Use the tied array as with Tie::File
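
The redefinition described above could look roughly like the following sketch. It is not the module's actual source: the package name PluggableFile is hypothetical, and the code relies on Tie::File internals (the private method _read_record and the object field {fh}), which are assumptions that may not hold across versions. The two-lines-per-record reader is deliberately trivial.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

package PluggableFile;    # hypothetical name, not the author's module
use Tie::File;
our @ISA = ('Tie::File');

sub TIEARRAY {
    my ($class, $file, %opts) = @_;
    # Remove our extra option first: Tie::File rejects unknown options.
    my $code = delete $opts{code};
    my $self = $class->SUPER::TIEARRAY($file, %opts) or return;
    $self->{code} = $code;
    return $self;
}

# Assumption: Tie::File reads every record through this private method
# and keeps its filehandle in $self->{fh}.
sub _read_record {
    my $self = shift;
    return $self->SUPER::_read_record() unless $self->{code};
    return $self->{code}->($self->{fh});
}

package main;

# Deliberately trivial reader: every two lines form one record.
my $two_lines = sub {
    my ($fh) = @_;
    my $first = <$fh>;
    return unless defined $first;
    my $second = <$fh>;
    $second = "\n" unless defined $second;
    return $first . $second;
};

my ($tmp, $file) = tempfile(UNLINK => 1);
print {$tmp} "Peter\n3\nJohn\n1\n";
close $tmp or die "Cannot close $file: $!";

tie my @pairs, 'PluggableFile', $file, code => $two_lines
    or die "Cannot tie $file: $!";
my $count = scalar @pairs;
my $first = $pairs[0];
untie @pairs;

print "$count records\n";
```

Without the code option the subclass falls through to the inherited reader, which mirrors the fallback behaviour described above.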

Because it may be hard and tedious to define a new anonymous subroutine that can parse the records of a file each time you use the module, you can subclass it with predefined formats. For example, Tie::File::AnyData::CSV (which can be obtained from http://lotka.uv.es/scriptome/data/attic/wiki/Tie-File-AnyData-CSV-0.01.tar.gz) can correctly parse the kind of data given in the above example:

Peter 3
Peter 15
Peter 5
John 1
John 7
Mike 4

use Tie::File::AnyData::CSV;
tie my @arr, 'Tie::File::AnyData::CSV', $file or die;
print "$arr[0]\n";

Prints:

Peter 3
Peter 15
Peter 5

Another example is given in Tie::File::AnyData::Bio::Fasta (which can be obtained from http://lotka.uv.es/scriptome/data/attic/wiki/Tie-File-AnyData-Bio-Fasta-0.01.tar.gz). This module subclasses Tie::File::AnyData and is able to read a FASTA file as a Perl array where each element in the array corresponds to one FASTA sequence. One example of use could be:

use Tie::File::AnyData::Bio::Fasta;
tie my @fastaArray, 'Tie::File::AnyData::Bio::Fasta', $file or die $!;
# Substitute the 10th sequence:
$fastaArray[9] = $newsequence;
# Get 10 random sequences:
use List::Util qw/shuffle/;
my @out = (shuffle @fastaArray)[0..9];
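
For readers unfamiliar with the format, a FASTA record is a header line starting with ">" followed by one or more sequence lines. The sketch below splits such data into records with plain Perl by setting $/ to "\n>"; it does not use the module, reads everything into memory, and the sample sequences are made up for illustration.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Made-up FASTA data, for illustration only.
my $fasta = ">seq1 first\nACGT\n>seq2\nGGCC\nTTAA\n>seq3\nAAAA\n";
open my $fh, '<', \$fasta or die "Cannot open in-memory file: $!";

my @records;
{
    local $/ = "\n>";    # a new record starts at a line beginning with '>'
    while (defined(my $rec = <$fh>)) {
        chomp $rec;            # drops the trailing "\n>" separator, if any
        $rec =~ s/\A>//;       # only the first record still carries its '>'
        $rec =~ s/\n+\z//;     # the last record ends in a newline instead
        push @records, ">$rec";
    }
}
close $fh;

# Pick two records at random, as in the example above.
my @sample = (shuffle @records)[0 .. 1];

print scalar(@records), " records\n";
print "$records[1]\n";
```

Unlike a tied array, this approach cannot write changes back to the file, so it only illustrates how the records are delimited.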

citromatik

Replies are listed 'Best First'.
Re: RFC: Tie::File::AnyData Lightweight Databases in Perl
by Jenda (Abbot) on Dec 11, 2007 at 01:00 UTC

    Sounds neat except ... what has the example got to do with CSV? It looks like either fixed position records or (though trying to select the spaces between the names and numbers disproves that) tab separated. With a fairly strange twist of using the first column as a grouping key.

    From Tie::File::AnyData::CSV I would expect to get a tied array of arrays or array of hashes that'd let me access the individual items in the rows. And hopefully the ability to include the record separator within quoted data. Please find a better name for that format.

      ... what has the example got to do with CSV?

      Well, in fact, a more correct name would be multilineCSV. This module takes several consecutive lines of fields (CSV, tabular, or whatever) that have the same value in a given field (the "key" field) and joins them in one logical record.

      Well, I think that it has to do with CSV at least because internally it uses Parse::CSV to parse the lines.

      Let's consider the following example:

      Mike,5,6
      Mike,5,3
      John,5,1
      John,3,0
      Frank,6,1

      Using this code:

      use Tie::File::AnyData::CSV;
      tie my @arr, 'Tie::File::AnyData::CSV', $file, key => 0, field_sep => ",";
      print "$arr[0]\n";
      ## Prints
      # Mike,5,6
      # Mike,5,3
      untie @arr;
      tie my @arr2, 'Tie::File::AnyData::CSV', $file, key => 1, field_sep => ',';
      print "$arr2[0]\n";
      ## Prints
      # Mike,5,6
      # Mike,5,3
      # John,5,1

      In the first case, all consecutive lines that have the same value in the first field ("Mike" in the key field) are considered a record. In the second case, all consecutive lines that have the same value in the second field ("5") are considered a record.
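
      The grouping rule can be sketched in plain Perl as follows. This is only an illustration of the rule, not the module's code: the helper name group_by_key is hypothetical, it works on lines already in memory, and it uses a naive split that does not handle quoted fields the way a real CSV parser would.

      ```perl
      use strict;
      use warnings;

      # Hypothetical helper: join consecutive lines whose key field matches.
      sub group_by_key {
          my ($lines, $key, $sep) = @_;
          my (@records, $last);
          for my $line (@$lines) {
              my $value = (split /\Q$sep\E/, $line)[$key];
              $value = '' unless defined $value;
              if (defined $last && $value eq $last) {
                  $records[-1] .= "\n$line";    # same key: extend the record
              }
              else {
                  push @records, $line;         # key changed: start a new one
              }
              $last = $value;
          }
          return @records;
      }

      my @lines = ('Mike,5,6', 'Mike,5,3', 'John,5,1', 'John,3,0', 'Frank,6,1');
      my @by_first  = group_by_key(\@lines, 0, ',');
      my @by_second = group_by_key(\@lines, 1, ',');

      print "$by_first[0]\n";
      print "$by_second[0]\n";
      ```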

      From Tie::File::AnyData::CSV I would expect to get a tied array of arrays or array of hashes that'd let me access the individual items in the rows. And hopefully the ability to include the record separator within quoted data. Please find a better name for that format.

      Hmmm, totally agree, I will try to change that name.

      Thanks for your comments!

      citromatik
Re: RFC: Tie::File::AnyData Lightweight Databases in Perl
by tcf03 (Deacon) on Dec 10, 2007 at 20:15 UTC
    Sounds kinda like DBM::Deep or Storable

    Ted
    --
    "That which we persist in doing becomes easier, not that the task itself has become easier, but that our ability to perform it has improved."
      --Ralph Waldo Emerson
      Sounds kinda like DBM::Deep or Storable

      Like Tie::File does?

      This module is nothing more than a hack to Tie::File, so it maintains all the goodies that that module offers (deferred writing, indexing - it doesn't load the whole file in memory, read cache, etc...)

      citromatik
Re: RFC: Tie::File::AnyData Lightweight Databases in Perl
by jZed (Prior) on Dec 11, 2007 at 18:55 UTC
    What relation does this have to my AnyData modules other than using the same idea (they have provided a tied-hash interface to XML, CSV, Fixed Width, DBI and many other formats for seven years)? If there's no relation other than the idea, you might want to change the name.
Re: RFC: Tie::File::AnyData Lightweight Databases in Perl
by metaperl (Curate) on Dec 11, 2007 at 14:50 UTC
    1. JDB is a package of commands for manipulating flat-ASCII databases from shell/Perl scripts
    2. JDB was inspired by RDB
    3. and RDB is similar to NOSQL
    But for me personally, SQLite can import any of the data you give as examples and I don't have to write my own buggy accessors and modifiers - I just use SQL from there on out.
    I have beheld the tarball of 22.1 on ftp.gnu.org with my own eyes. How can you say that there is no God in the Church of Emacs? -- David Kastrup
