PerlMonks  

Modules dealing with data files

by grinder (Bishop)
on Nov 10, 2006 at 18:30 UTC (#583378, perlquestion)
grinder has asked for the wisdom of the Perl Monks concerning the following question:

Monks,

(Cross-posted from module-authors, because perl.org seems to be having difficulties today).

I have written a module that deals with France's INSEE codes, which allows one to look up postcodes and stuff like that. I've been toying with Geography::FR::Postcode as a name. (any other ideas?)

The thing is, it relies on a text file that is 750KiB zipped, updated periodically. So I'm looking at a reader package that knows how to pick apart a certain format (or formats) of the data file and answer questions (for instance, what towns have the postcode 66100). Reading the unzipped file on each run and producing hashes takes about a second, which is good enough for a first version.
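To make the "what towns have this postcode" lookup concrete, here is a minimal sketch of the kind of index such a reader might build. The tab-separated layout (INSEE code, postcode, town), the sub name `build_index`, and the sample rows are all assumptions made up for illustration; the real INSEE file format is not shown in this post.

```perl
use strict;
use warnings;

# Build a postcode => [towns] index from lines of the (hypothetical) form
# "INSEE code <TAB> postcode <TAB> town name".
sub build_index {
    my (@lines) = @_;
    my %towns_for;
    for my $line (@lines) {
        chomp $line;
        next if $line =~ /^\s*(?:#|$)/;    # skip blank lines and comments
        my ($insee, $postcode, $town) = split /\t/, $line;
        push @{ $towns_for{$postcode} }, $town;
    }
    return \%towns_for;
}

# Made-up sample rows, not real INSEE data.
my $index = build_index(
    "00001\t99990\tExampleville\n",
    "00002\t99990\tSampleton\n",
    "00003\t99991\tTestbourg\n",
);

# All towns sharing one postcode.
print join(', ', @{ $index->{'99990'} || [] }), "\n";
```

Keeping the index as a hash of array references makes the one-to-many relationship (several towns per postcode) explicit.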

One problem is that the INSEE web site doesn't make it easy to predict what the new filename will be, so I can't simply fetch the data from a fixed INSEE address during the installation process. I would also like to avoid wrapping the data file itself up as a CPAN module. So I would create another package containing a single package variable that holds the URI of the data file on INSEE's web site, and just update that when new versions come out.

Something like this:

Geography::FR::Postcode

depends on
Geography::FR::Postcode::Data

Installing Geography::FR::Postcode forces the dependency on Geography::FR::Postcode::Data to be resolved first. So Data is downloaded and, as part of its installation process, the data file is fetched and installed somewhere on the local system.

I suppose it will default to the site_perl directory if run in batch mode, but interactive installations allow the directory to be specified. OS distribution maintainers may wish to override the default (how? an environment variable à la PERL_G_F_P_PATH=/usr/local/share/doc/insee?)
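The environment-variable override floated above could be sketched roughly as follows. The variable name PERL_G_F_P_PATH is the one suggested in the paragraph; the sub names `data_dir` and `data_file`, the directory names, and the file name `insee.txt` are placeholders assumed for the example.

```perl
use strict;
use warnings;
use File::Spec;

# Decide where the data file lives: an explicit override from the
# environment wins, otherwise fall back to the supplied default directory.
sub data_dir {
    my ($default) = @_;
    my $override = $ENV{PERL_G_F_P_PATH};
    return $override if defined $override && length $override;
    return $default;
}

# Full path of a named data file inside the chosen directory.
sub data_file {
    my ($default_dir, $name) = @_;
    return File::Spec->catfile(data_dir($default_dir), $name);
}

{
    # Simulate a packager setting the override (hypothetical path).
    local $ENV{PERL_G_F_P_PATH} = File::Spec->catdir('opt', 'insee');
    print data_file('site_perl', 'insee.txt'), "\n";
}
```

An empty value is treated the same as an unset variable, so an accidental `PERL_G_F_P_PATH=` does not point the module at the current directory.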

After Geography::FR::Postcode::Data is installed, the installation of Geography::FR::Postcode goes forward (waving hands: knowing where Data put the damned file).

Next year, a new version of the INSEE file comes out. I test, and see that the current reader code can deal with it. I release a new version of Geography::FR::Postcode::Data. The client sees that there is an update for this, and installs it. New data file, everyone happy. (Assuming the installation causes the new file to overwrite the old one, otherwise Postcode will continue to run with the old file).

The following year, a new version comes out, and surprise! they've added a new column in the file. So I release a new version of Geography::FR::Postcode as well, that knows how to read both formats, and a new version of Geography::FR::Postcode::Data.

Does that sound sane? Does anyone have some pointers on how to deal with the placement of datafiles on the local system with one module, and having the other module know where to find them?

Or am I making this unnecessarily complicated? (I could just bundle the data file with the distribution, but the size of the data file, and the fact that its format is unlikely to change often, invite the above approach).

• another intruder with the mooring in the heart of the Perl

Re: Modules dealing with data files
by andyford (Curate) on Nov 10, 2006 at 19:43 UTC
    If you can make the auto update part work then super, more power to ya!
    However, you should include an option at compile time for the installer of the module to decline auto updates.
    I'm sure there will be people who don't want it or can't use it.

    On the file name format front: Have you asked them if they could standardize their naming? Perhaps they just never considered that someone would want to automate the process? It's a longshot, but might be amusing to hear what they say.

    andyford
    or non-Perl: Andy Ford

Re: Modules dealing with data files
by brian_d_foy (Abbot) on Nov 10, 2006 at 19:59 UTC

    I created Business::ISBN::Data so I could update the data without forcing people to upgrade Business::ISBN. If you do something similar, you could post the data file on CPAN so at least people can find it easily if the post office changes their website, etc. A clever Makefile.PL (or Build.PL or whatever) can shove the data file into the same directory as the module since %INC knows where that is. If the data module is a prerequisite, CPAN.pm or CPANPLUS will install it first.
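The %INC trick mentioned above might look something like this at run time. This is only a sketch: the sub name `data_file_beside` and the file name `postcodes.txt` are assumptions, and the core module File::Spec stands in for the data module so that the example runs anywhere.

```perl
use strict;
use warnings;
use File::Basename qw(dirname);
use File::Spec;

# Given the name of an already-loaded module, return the path of a data
# file assumed to sit in the same directory as that module's .pm file.
sub data_file_beside {
    my ($module, $file) = @_;
    (my $key = "$module.pm") =~ s{::}{/}g;    # %INC keys use '/' separators
    my $pm_path = $INC{$key}
        or die "$module is not loaded\n";
    return File::Spec->catfile(dirname($pm_path), $file);
}

# File::Spec is used here purely as a stand-in for the data module.
print data_file_beside('File::Spec', 'postcodes.txt'), "\n";
```

The reader module would call this with the data module's name after loading it, which answers the "knowing where Data put the file" question without any extra configuration.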

    --
    brian d foy <brian@stonehenge.com>
    Subscribe to The Perl Review
      I created Business::ISBN::Data so I could update the data without forcing people to upgrade

      Aha! That was the thread I was trying to dig up in the module-authors archive, which was what I had in the back of my mind when I was figuring this out. Thanks for mentioning it, I can now go and reread it.

      • another intruder with the mooring in the heart of the Perl

Re: Modules dealing with data files
by jgamble (Pilgrim) on Nov 10, 2006 at 20:46 UTC

    I have written a module that deals with France's INSEE codes, which allows one to look up postcodes and stuff like that. I've been toying with Geography::FR::Postcode as a name. (any other ideas?)

    May I recommend the Geo::Postcodes base class for your module? Then your module would presumably be Geo::Postcodes::FR.

      I looked at that briefly, but dismissed it as being too americanocentric. The notions of borough_of and county_of either have no counterpart in France, or if they do, in any case, I don't know which one is subordinate to the other, so I wouldn't know what to map to what. And thus I doubt that a French person does either.

      Conversely, I doubt an American knows how communes, régions and départements relate to each other.

      If this module were truly generic, it would have defined an abstract hierarchy of nested geographic concepts, to which country-specific labels could be attached.

      Finally, I find the Geo namespace somewhat ambiguous, since it is not clear whether we are talking about geography or geology.

      • another intruder with the mooring in the heart of the Perl

        but dismissed it as being too americanocentric

        Doubtful. I won't deny that there's an awful lot of, as you say, americanocentric, stuff on the Internet, but I don't buy that this counts, especially given that the author is (as best I can tell) in Norway. Also, in the US, "borough" is not a widespread concept. Only a few states use that term, and they use it for different things: in NY it's a subdivision of a city, in CT it's an incorporated community within a larger Town, and in AK it's basically equivalent to a county.

Re: Modules dealing with data files
by pileofrogs (Priest) on Nov 10, 2006 at 23:53 UTC

    If I were doing something like this, I'd try and make my module try a number of tricks rather than only one.

    E.G.

    1. Look for recent data file from CPAN à la brian_d_foy's response above
    2. Failing that, try previous known-good file names
    3. Failing that, try guessing at file names based on any patterns you've recognised
    4. Failing that, ask the user
    5. ...etc...

    This way, you have the known-good option first, but you also have a few fallbacks so users have some recourse if they're cut off from CPAN or you die in a car crash and can't update the file, or whatever.
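The fallback chain listed above can be sketched as an ordered list of strategies, each tried in turn until one yields something. The strategy bodies here are stubs with invented return values; real ones would hit CPAN, probe URLs, or prompt the user.

```perl
use strict;
use warnings;

# Try each strategy in order and return the name and result of the first
# one that produces a defined value. A strategy that dies is skipped.
sub first_success {
    my (@strategies) = @_;
    for my $s (@strategies) {
        my ($name, $code) = @$s;
        my $result = eval { $code->() };
        return ($name, $result) if defined $result;
    }
    return;
}

# Stub strategies for illustration only.
my @strategies = (
    [ 'cpan data module' => sub { die "offline\n" } ],
    [ 'known-good names' => sub { undef } ],
    [ 'guessed names'    => sub { 'data-guess.txt' } ],
    [ 'ask the user'     => sub { 'from-user.txt' } ],
);

my ($how, $file) = first_success(@strategies);
print "found $file via $how\n";
```

Wrapping each attempt in `eval` means a network failure in one strategy just moves on to the next instead of aborting the whole lookup.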

    --Pileofrogs

Re: Modules dealing with data files
by Cabrion (Friar) on Nov 11, 2006 at 13:23 UTC
    Consider converting the data to Berkeley DB or SQLite format, as databases lend themselves to "asking questions of the data." When your module launches it could test the date-stamp on the source text file and create or reload the database as updates are obtained. Written generically enough, an import utility could cope with the addition or removal of columns. Text::CSV and Text::CSV::DetectSeparator would be a good starting point for making a flexible reader/loader even if you didn't convert to a database.
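The date-stamp test suggested above could be sketched as follows. To keep the example to core modules it caches with Storable rather than Berkeley DB or SQLite, which is a substitution for illustration, not what the reply proposes; the sub names, file names and sample row are likewise made up.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);
use Storable qw(store retrieve);

# True when the cache is missing or older than the source text file.
sub cache_is_stale {
    my ($source, $cache) = @_;
    return 1 unless -e $cache;
    my $source_mtime = (stat $source)[9];
    my $cache_mtime  = (stat $cache)[9];
    return $source_mtime > $cache_mtime;
}

# Return the parsed data, re-parsing the source only when the cache is stale.
sub load_data {
    my ($source, $cache, $parse) = @_;
    store($parse->($source), $cache) if cache_is_stale($source, $cache);
    return retrieve($cache);
}

# Demo with a throwaway source file containing one made-up row.
my $dir    = tempdir(CLEANUP => 1);
my $source = File::Spec->catfile($dir, 'source.txt');
my $cache  = File::Spec->catfile($dir, 'source.cache');
open my $out, '>', $source or die "cannot write $source: $!";
print {$out} "99990\tExampleville\n";
close $out or die "cannot close $source: $!";

my $parses = 0;
my $parse  = sub {
    my ($path) = @_;
    $parses++;
    open my $in, '<', $path or die "cannot read $path: $!";
    my %town_for = map { chomp; split /\t/, $_, 2 } <$in>;
    return \%town_for;
};

my $data = load_data($source, $cache, $parse);
$data = load_data($source, $cache, $parse);    # second call reuses the cache
print "parsed $parses time(s)\n";
```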

    Another idea would be to create a loader for each year's data and presumably call the right loader based on some header information in the source files. You could extend this to load a specific year's data on demand provided the end user had copies of previous and current year's data.
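One way to picture the loader-per-format idea is a dispatch table keyed on the header line. The two layouts, the column names and the sample row below are invented for the sketch; the real files may not carry a header at all, in which case something else (column count, file name) would have to select the loader.

```perl
use strict;
use warnings;

# One loader per known layout, keyed by the (hypothetical) header line.
my %loader_for = (
    "postcode\ttown" => sub {
        my ($postcode, $town) = @_;
        return { postcode => $postcode, town => $town };
    },
    "postcode\ttown\tregion" => sub {
        my ($postcode, $town, $region) = @_;
        return { postcode => $postcode, town => $town, region => $region };
    },
);

# Pick the loader from the header, then apply it to every data row.
sub load_records {
    my ($header, @rows) = @_;
    chomp($header, @rows);
    my $loader = $loader_for{$header}
        or die "unrecognised file layout: $header\n";
    return [ map { $loader->(split /\t/) } @rows ];
}

# Made-up file contents in the newer three-column layout.
my $records = load_records(
    "postcode\ttown\tregion\n",
    "99990\tExampleville\tExample Region\n",
);
print $records->[0]{region}, "\n";
```

Supporting a new year's format then means adding one entry to the table, and an unknown layout fails loudly rather than being misread.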

    Just food for thought.

Re: Modules dealing with data files
by DrHyde (Prior) on Nov 13, 2006 at 10:51 UTC
    How often does the data change? I have a somewhat similar problem with my module Number::Phone::UK. That relies on a data set that changes occasionally, and while it has only changed locations on me once, the format of the files changes more often.

    I decided not to make users download and parse the data themselves. Instead, I do that every so often and distribute it myself. It's a lot less work for me, the process is easier to test, and it's more reliable for my users.

    As for the format in which I distribute the data: none of the standard tools for packaging Perl modules handle non-Perl files at all well. Consequently, the data is buried in another module, Number::Phone::UK::Data, as a DBM::Deep database in a __DATA__ segment.
