Beefy Boxes and Bandwidth Generously Provided by pair Networks Frank
The stupid question is the question not asked
 
PerlMonks  

DBD::CSV and really bad legacy flat file

by harleypig (Scribe)
on Jul 19, 2005 at 01:24 UTC ( #475943=perlquestion: print w/ replies, xml ) Need Help??
harleypig has asked for the wisdom of the Perl Monks concerning the following question:

I have a really nasty legacy flat file I'm trying to parse. I can convert the file, but right now my employer doesn't want to do this.

It appears that DBD::CSV or Text::CSV_XS (I'm thinking the latter actually) will only support a single character field separator and escape character. The current field separator is ' _ ' (that's space underscore space) with _UNSC being the escape string (in the original code it's really just replacing _ with _UNSC_ but it's basically the same thing).

Any pointers on making this work? Do I need to replace Text::CSV_XS with my own module (please, no) or can I make this work somehow?

Upon doing a quick search it appears that Text::xSV is a pure perl solution, but it's got a hard coded limitation that restricts the separator to 1 character as well. I'll see if I can figure out why, but my time is limited--my boss wants this soonest. If I can't make this work then I'll have to go back to the old code *shudder*

Update: It appears I wasn't as clear as I thought I was.

Sample data:
12345 _ value1 _ value2 _ value3 _ ...
12346 _ value1 _ value2 _ value3 _ ...
12347 _ valuewith_UNSC_ _ value2 _ value3 _ ...

I can convert this data easily enough to "standard" csv:

if ( open my $FH, "<dbfile" ) { my @newrows; for my $row ( <$FH> ) { chomp $row; push @newrows, ( join ',', map { s/"/""/g ; $_ = "\"$_\"" } split / _ /, $row ); } # print @newrows to file }

My ultimate goal is to convince my boss to move to a RDBMS of some kind. He *loves* being able to just open the data file and modify something quick and easy. And he sees no reason to learn SQL or hide his data in a binary format. So I need to get DBD::CSV working with the original file so that he can switch between my new code and the live code and see the same data. I can't convince him to change the format, he's used to working with it the way he is. Once I show him how much simpler the code is with DBD::CSV and SQL (albeit simplified via SQL::Statement) I can convince him to move to a RDBMS.

Harley J Pig

Comment on DBD::CSV and really bad legacy flat file
Select or Download Code
Re: DBD::CSV and really bad legacy flat file
by jhourcle (Prior) on Jul 19, 2005 at 01:59 UTC

    This is actually fairly easy to parse, although it takes a couple of steps. Either of the two will work:

    1. Select a character or sequence that isn't in use anywhere in the file.
    2. Replace all occurances of '_UNSC_' with that character.
    3. Split the line on '_'
    4. Replace any occurances of that character in any of the fields with '_'

    or

    1. Select a character or sequence that isn't in use anywhere in the file.
    2. Replace all occurances of '_' with that sequence.
    3. Replace all occurances of (sequence)UNSC(sequence) with '_'.
    4. Split on (sequence)

    Sometimes, your problems don't fit into a module, and you just have to roll your own:

      I can easily convert the file ... my boss doesn't want to convert it. Until the new code is up and working the old code has to be able to work. I don't feel like setting up a script to copy and convert the db everytime it's updated, I've got too much else on my plate.

      Harley J Pig

        To be honest, I've done these conversions a few times. Including converting from a human-typed table (which was autoconverted into HTML via Lotus Domino) to an RDBMS. I did my development, against direct orders, while the table was still being updated. All I did was write the tool to convert, and then develop everything around that "sample" data, and then, once the switch was made, the "original" was considered frozen, I redid the conversion (took 10 or 15 minutes), and then put my database live.

        So, the question is, will this legacy flat file continue to live, or is it eventually going to be replaced with this something new?

        If it is going to live, and you're going to need to continue to read directly from it, you may be able to subclass DBD::File somehow to fake this - it may not be as fast as working on converted data, but it may still be faster than converting all the data, only to work with a subset of it. Or, at the least, it means you'll only have a single source for data, rather than working from an "unofficial" data source.

        If you're not converting it, then what are you doing with it?

        I just went over parsing it (which you said you wanted) -- it's what you do inside the loop that determines what you do with it after that.

        If you need to write the format back out, that's easy too.

        So... if you could please explain a little more about what it is that you're trying to do, I could probably give an answer that you might find more useful. (and not the short view -- I know you're trying to use Text::CSV_XS or DBD::CSV, but why are you trying to use them -- what's the main objective?) I know Text::CSV_XS is for manipulation of CSV, but what sort of manipulations are you trying to do?

        When the boss tells me to do something, and that I have to do it a certain way, I always come back to one quote:

        "We're the technical experts. We were hired so that management could ignore our recommendations and tell us how to do our jobs." -- Mike Andrews in alt.sysadmin.recovery 10 October 2000 <eUJE5.880$ln6.119642@news.flash.net>
Re: DBD::CSV and really bad legacy flat file
by greenFox (Vicar) on Jul 19, 2005 at 05:25 UTC

    I'm probably missing something but any-way why won't something along the lines of this work?

    my @line = split /_ /, $line;

    Posting some sample data would help!

    --
    Murray Barton
    Do not seek to follow in the footsteps of the wise. Seek what they sought. -Basho

      That fails because parsing CSV-style data is more difficult than it appears. The first hurdle is how to handle when the separator appears in a quoted field ("joe _blow" _ 2 _ "hourly"). The second hurdle is more fiendish: newlines embedded in quoted fields. It's less likely to occur, but if it does, iterating over records becomes painful.

      Cheers,
      Ovid

      New address of my CGI Course.

        I hadn't considered newlines but the separator sounded like it would be unique enough in this instance, the original poster described it as "space underscore space with _UNSC being the escape string"... which I admit isn't entirely clear to me and why I suggested the OP post some sample data, I just took the liberty of simplifying it to "_ " for my example, my bad! :-)

        --
        Murray Barton
        Do not seek to follow in the footsteps of the wise. Seek what they sought. -Basho

Re: DBD::CSV and really bad legacy flat file
by radiantmatrix (Parson) on Jul 19, 2005 at 16:56 UTC

    use Text::CSV_XS; use IO::File; my $csv = new Text::CSV_XS({sep_char=>"\x1E", escape_char=>"\x10"}); my $io = new IO::File('< flat_file.txt'); until ($io->eof) { my $line = $io->getline(); $line =~ s/_UNSC/\x10/g; $line =~ s/(?<!\x10) _ /\x1E/g; my $column = $csv->parse($line); ## .. do something with the ArrayRef $column .. ## }

    This code reads each line (I assume each line is a record), converts escape characters to \x10 (ASCII 'Data link escape') and your record-separator string to \x1E (ASCII 'Record separator') in memory, then passes that much-cleaner line to the Text::CSV_XS object for parsing.

    This way, your manager won't see it as file conversion, because you're merely caching a simplified version in memory from which to work. You can reverse the substitution process if you need to save a new file in the same format.

    Of course, once you've got code like this working, it would be trivial to write the file back out in a newer format whenever your boss changes his/her mind.

    <-radiant.matrix->
    Larry Wall is Yoda: there is no try{} (ok, except in Perl6; way to ruin a joke, Larry! ;P)
    The Code that can be seen is not the true Code
      Hmmm ... so with this I could subclass Text::CSV_XS and pass that subclass to DBD::CSV instead ... I think I can do that ... thanks.
      Harley J Pig

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://475943]
Approved by Tanktalus
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others perusing the Monastery: (11)
As of 2014-04-17 05:24 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    April first is:







    Results (439 votes), past polls