http://www.perlmonks.org?node_id=932757

Ansi has asked for the wisdom of the Perl Monks concerning the following question:

I have very little expereince of Perl, so in the first instance please excuse my ignorance. All I'm looking for to start is some quick pointers on how to write a CSV handling application using perl without reinventing the wheel.The app in question I've already written albeit in MsAccess about ten years ago; I believe it's some subset of VB but I'm not too sure. I write it solely by using the help files that MsAceess provides but beyond that I don't know much about the language used but I imagine the methodology wouldbe the same nomatter what the language

I have done some reading on Perl but in the first instance I really want to avoid learning much more than I need to produce the app thatI need. To that end I'mjust after some general pointers of what modules etc I should be using

In brief, the program handles and manipulates text files, here's a rough flow chart of what happens


CSVaddress file (one address record per line) is imported

Look for postcode in each address/line (which is a predicatable alphanumeric format)

Postcode is then looked up in a two column table for a corresponding value

Corresponding value is added as an additonal field to the address line

If no postcode is found, then each field is checked against a different two column table for a matching post town andcorresponding value which is then appended tothe address line/record

Thereafter there are some sql sorting routines which involve selecting and sorting the "corresponding value" fieldi, and adding sequential counts etc and producing some reports

This is all a little simplified; there is a bit more to it than this, but the process above is the raison d'etre of the app.

What woud be helpful in the first instance is some general guidance on the best way of handling the above. I'm mindful ofthe fact that if handled incorrectly memory could be an issue. The address file can have (in theory) over a million records though in practice 100,000 is very large and mostly it's around 10,000 records. The 2 x two column tables are both relatively small, no more than 1500 rows

So, what modules/methodology would be best to sort through the address records, then cycle through the fields to find the value which would then be used to cycle throogh one or both two column table to find the value toadd back to the address record. I'll leave the sorting and everything else for later. In the first instance if I canget it to output the postcode that's a good start. When I wrote this initially i did it all using babysteps anyway

Replies are listed 'Best First'.
Re: Parsing/manipulating CSV files
by CountZero (Bishop) on Oct 20, 2011 at 22:10 UTC
    For handling csv-files, you should check out the Text::CSV module.

    Unless you need all the records in memory at once it seems to me that you could deal with the large files one record at a time.

    CountZero

    A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: Parsing/manipulating CSV files
by tospo (Hermit) on Oct 20, 2011 at 22:39 UTC
    If you are handling the CSV files in Perl, you might also want to do the database interaction with your Perl scripts to keep it all together. It is possible to query/manipulate an MSAccess database from Perl using the DBI module. See DBD::ODBC

      In this particular instance, I'm looking to get rid of any reliance on MsAccess. It's purely accidental that I use Access now; it just happened that I could write the app within Access as a module as it was available at work so I could use their time to write it (and leave more time for beer at home).

      Database connectivity isn't a requirement for this, I want it to stand alone though I can see that it could be useful for future projects.

        If your data collection is large (for any value of large), DBD::CSV might grow (very) slow. For relatively small dtasets, DBD::CSV gives you instant relief: combining CSV files and selecting from one based on contents of others gets dead easy. When the dataset grows, and you still do not need a full relational dtabase like PostgreSQL or MariaDB, DBD::SQLite might be the interface of choice.


        Enjoy, Have FUN! H.Merijn
Re: Parsing/manipulating CSV files
by Anonymous Monk on Oct 20, 2011 at 23:11 UTC

    Hi,

    There is also DBD::CSV which handles CSV files using SQL.

    J.C.

Re: Parsing/manipulating CSV files
by mrstlee (Beadle) on Oct 21, 2011 at 15:22 UTC
    There's plenty of options. Maybe too many!

    Regular expressions are one. For example - assuming your line has address, state & postcode:
    my ($addr,$state,$pc) = $line =~ /([^,]+),([^,]+),([^,]+)/;
    Stick that in a loop as you iterate over your file:
    open DATA , "<", "path-to-file"; my @pcodes; while (my $line = <DATA>){ my ($addr,$state,$pc) = $line =~ /([^,]+),([^,]+),([^,]+)/; ## Not a valid data line? Then move on. defined $pc or next; push @pcodes, $pc; } ## Do stuff with @pcodes close DATA;
    Now you can do what you like with the post codes. Note that this doesn't read the whole file in one go - but 1 line at time.
    If memory is a problem you could pause every 10k records, or so, and print out to file.

    Is that any use to you?

      Very much so;thank you.

      Of course you're the other side of the pond so any regexs aren't likely to work for me, but it points me in the right direction. As you point out there are likely many ways of doing the task and I don't want to waste too much time exploring them. I would rather know a way that it will work and learn what I need to of a relatively small area of the language

      I am going to have to iterate over the fields in each record to find the postcode (it's not in a regular place)using a comparison to a known format which is what I did last time

      I'll get there!

        I'm not clear what you mean about the regex's.
        They tend to be portable to different locale's (well, within reason ...). For example:
        \d - matches 0 .. 9 (\d+) - means match at least 1 digit, could be more. [^ ] - means don't match space \S - Another way to match something that isn't a space
        And so on are universal. It is a vast area and easy to get lost in. But absolutely invaluable for data parsing. Baby steps and you will indeed get there! OK - so the postcode isn't in a predictable place. I assume it is in a predictable format that won't match anything else in a record?
        e.g 2 upper case chars, followed by 3 digits and 2 further upper case chars. Then your regex:
        $line =~ /,?([A-Z]{3}\d{2}[A-Z]),/;
        The ',?' bit means maybe match a comma ahead of the postcode. This is to catch the case that the pc is at the start of the record.
        $1 will contain your postcode.
        If you pass me the format I'd be happy to provide a suitable regex.