Re: problems parsing CSV

by ww (Archbishop)
on Oct 09, 2010 at 14:26 UTC [id://864386]


in reply to problems parsing CSV

May we see your version of a CSV parser?

Writing one is a fairly complex challenge; since you wrote one that apparently worked (within limits), comparing your work to Text::CSV might be enlightening.
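
For illustration, here is a minimal, untested sketch of the corner cases a hand-rolled split-on-comma parser usually misses (quoted commas, doubled quotes, empty fields); Text::CSV handles these out of the box:

    #!/usr/bin/perl
    # Untested sketch: quoted commas, doubled quotes, and empty fields are
    # the usual stumbling blocks for home-grown CSV parsers.
    use strict;
    use warnings;
    use Text::CSV;
    
    my $csv = Text::CSV->new({ binary => 1 })
        or die "" . Text::CSV->error_diag();
    
    for my $line (
        'plain,simple,fields',
        '"a, quoted comma","doubled ""quotes""",',
    ) {
        if ($csv->parse($line)) {
            print join(' | ', $csv->fields()), "\n";
        }
        else {
            warn "Failed to parse: " . $csv->error_input() . "\n";
        }
    }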

Replies are listed 'Best First'.
Re^2: problems parsing CSV
by helenwoodson (Acolyte) on Oct 09, 2010 at 23:54 UTC

    Thanks for all the suggestions. Some of it is over my head, so I will need to meditate on it.

    I'm afraid I wasn't clear. The example I gave was not a real example, but an example of how the data looks that's not getting processed correctly.

    The input file is over a million records, and new data will be added from time to time. The problem input is not all the same but is formatted like my previous example. It is similar enough that I was able to get rid of it with a substitution, as I'm not even using that portion of the data. But it is ugly and slow.

    I've simplified the code to show the parsing attempt.

    #!/usr/bin/perl -w
    use strict;
    use Text::Undiacritic qw(undiacritic);
    use Text::CSV;
    
    my ( $tri, $chem, $year, $lbs, $gms, $rls, $csv, $err, @cols );
    
    # open the release report file for input
    #$rls = "rls.tst";
    $rls = "../ecodata/releases.txt";
    open( RLS, $rls ) || die "bad open $rls";
    
    # RLS: TRI, Release#, ChemName, RegNum, Year, Pounds, Grams
    while( <RLS> ) {
        $_ = undiacritic($_);
        s/\(\d{4} and after \"acid aerosols\" only\)//g;
        $csv = Text::CSV->new();
        next if ($. == 1);
        if ($csv->parse($_)) {
            @cols = $csv->fields();
            $tri  = $cols[0];
            $chem = $cols[2];
            $year = $cols[4];
            $lbs  = $cols[5];
            $gms  = $cols[6];
        }
        else {
            $err = $csv->error_input;
            print "Failed to parse line: $err";
        }
    }
    close(RLS);

    Here is a tiny bit of the output before I put in the substitution - I'm sure there is a better way to do this:

    Failed to parse line: 00617BRSTLSTATE,"1394080382029","Sulfuric acid (1994 and after "acid aerosols" only)",7664-93-9,1994,500.0,""
    Failed to parse line: 00617BRSTLSTATE,"1394080382031","Hydrochloric acid (1995 and after "acid aerosols" only)",7647-01-0,1994,2842.0,""

    I hope this is more clear. Thanks so much for your help; now I must go meditate over what you've suggested.

      Removing diacritics from characters alters the data you're parsing. Are you supposed to do that? Probably not. You certainly shouldn't have to modify the data for any reason.

      What character encoding is the text in? ISO 8859-1 (Latin 1)? Windows-1252? UTF-8? If you don't know, I encourage you to find out. You really ought to know.
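
      If you want a quick hint, here is a rough, untested sketch that guesses at a sample of the file with Encode::Guess (the path is the one from your script); Latin-1 and Windows-1252 overlap, so treat any guess as a hint only:

      #!/usr/bin/perl
      # Rough sketch: guess the encoding of a sample of the input file.
      # Encode::Guess is heuristic, so treat its answer as a hint, not proof.
      use strict;
      use warnings;
      use Encode::Guess;
      
      my $file = '../ecodata/releases.txt';
      open my $fh, '<:raw', $file or die "Can't open $file: $!";
      read $fh, my $sample, 64 * 1024;    # the first 64 KB is usually enough
      close $fh;
      
      my $decoder = guess_encoding($sample, 'iso-8859-1');
      if (ref $decoder) {
          print "Looks like ", $decoder->name, "\n";
      }
      else {
          print "Could not decide: $decoder\n";
      }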

      I suspect your few problems, malformed CSV records and text that isn't plain ASCII, can be solved by using Text::CSV::Encoded and also by enabling allow_loose_quotes as others have recommended.

      UPDATE: Using Text::CSV::Encoded may be overkill. Jenda's recommendation to set the binary attribute to true (1) may be all you really need. But I still believe you ought to know what character encoding the text is in.
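
      In other words, all you may need is something like this minimal, untested sketch:

      # Minimal sketch: binary => 1 lets Text::CSV accept bytes outside ASCII
      # (such as accented characters) instead of refusing the record.
      my $csv = Text::CSV->new({ binary => 1 })
          or die "" . Text::CSV->error_diag();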

        Good point. It would be better to avoid removing the diacritics if possible. I did find that setting the binary attribute to true for Text::CSV did prevent the script from choking on the diacritics.

        I do not know the character encoding and don't know how to identify it, so I asked Mr. Google (Mr. Google knows all!) and am looking through what he dredged up. I looked at the documentation on CPAN for Text::CSV::Encoded. It appears that, in order to use this, you need to know the encoding for the input and what you want to use for the output.
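
        If I'm reading the docs correctly, it would look something like this (untested, and the encodings here are just guesses):

        # Untested sketch based on the Text::CSV::Encoded docs; the encodings
        # are guesses and would need to match the real data.
        use Text::CSV::Encoded;
        
        my $csv = Text::CSV::Encoded->new({
            encoding_in  => 'iso-8859-1',    # what the file is in (guess)
            encoding_out => 'utf8',          # what you want back in Perl
            binary       => 1,
        }) or die "" . Text::CSV::Encoded->error_diag();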

        I seem to have it working fairly well except for some of the cases where the weight is "" or 0.0. I haven't yet figured out why it works correctly for some records and not for others. I will look at the records where it fails and see if I can identify that.

        Thanks very much.

      I refactored your script. Most of the changes I made are ones recommended in Perl Best Practices by Damian Conway. The revised script is untested and intended merely to suggest coding improvements.

      #!/usr/bin/perl
      
      use strict;
      use warnings;
      use English qw( -no_match_vars );
      use Text::CSV;
      
      my $release_file = '../ecodata/releases.txt';
      
      open my $release_fh, '<', $release_file
          or die "Can't open release file $release_file: $OS_ERROR\n";
      
      my $csv = Text::CSV->new({
          allow_loose_quotes => 1,
          binary             => 1,
      });
      
      CSV_RECORD:
      while (my $csv_record = <$release_fh>) {
          # Skip header...
          next CSV_RECORD if $INPUT_LINE_NUMBER == 1;
      
          # SPECIAL CASE: Remove unwanted text...
          $csv_record =~ s/ \(\d{4} and after "acid aerosols" only\)//g;
      
          if (not $csv->parse($csv_record)) {
              my $csv_error_input = $csv->error_input();
              warn "Can't parse CSV record at line $INPUT_LINE_NUMBER"
                  . " of release file $release_file:\n$csv_error_input\n";
              next CSV_RECORD;
          }
      
          # TRI, Release#, ChemName, RegNum, Year, Pounds, Grams
          my ($tri, undef, $chem_name, undef, $year, $pounds, $grams)
              = $csv->fields();
      
          # [Do what you need to do with the data here]
      }
      
      close $release_fh;
      
      exit 0;

      I moved the Text::CSV->new() constructor outside the while loop. I moved the lexical variables inside the while loop. I added a space to the beginning of the regular expression pattern matching the string being globally removed from the CSV records.

      I won't inflict the entire script on you, just the pertinent code snippet. The snippet I posted before now looks like this, with the old code commented out:

      while( <RLS> ) {
          # $_ = undiacritic($_);
          # s/\(\d{4} and after \"acid aerosols\" only\)//g;
          # $csv = Text::CSV->new();
          $csv = Text::CSV->new( {
              allow_loose_quotes => 1,
              escape_char        => "\\",
              binary             => 1,
          } ) or die "" . Text::CSV->error_diag ();
          next if ($. == 1);
          if ($csv->parse($_)) {
              @cols = $csv->fields();
              $tri  = $cols[0];
              $chem = $cols[2];
              $year = $cols[4];
              $lbs  = $cols[5];
              $gms  = $cols[6];
          }
          else {
              $err = $csv->error_input;
              print "Failed to parse line: $err";
          }
      }
      close(RLS);

      allow_loose_quotes did the trick on the embedded quotes - I found something that said to change the escape_char so it's not the same as the quote_char, so I did that as well. binary=>1 eliminated the need for undiacritic().

      Now, instead of processing each record three times, I'm processing it once. With 1.7 million records, that is very nice.

      Well, I thought I was done. It turns out that some of the fields (for lbs and gms) are "", some are 0.0, and some are something like 123.4 in the input file. I changed the assignment to this:

      if    (!$cols[5])         { $lbs = 0 }
      elsif ($cols[5] eq "0.0") { $lbs = 0 }
      else                      { $lbs = $cols[5] }
      
      if    (!$cols[6])         { $gms = 0 }
      elsif ($cols[6] eq "0.0") { $gms = 0 }
      else                      { $gms = $cols[6] }

      and then this test

        if( !$lbs && !$gms )

      gives valid results. Is there a better way to do this? It seems rather clunky. I'd have thought that 0.0 would be interpreted the same as 0, but apparently not.

      Thanks so much for your help.

        Code like

        while (<RLS>) {
            $csv = Text::CSV->new ( {
                allow_loose_quotes => 1,
                escape_char        => "\\",
                binary             => 1,
            } ) or die "" . Text::CSV->error_diag ();
            next if ($. == 1);
            if ($csv->parse ($_)) {

        will very soon start failing when the fields have embedded newlines, because <RLS> reads line by line and knows nothing about the quoting and escaping. When parsing CSV with Text::CSV or Text::CSV_XS, please use getline (). Also note that in your example you create a new CSV parsing object for every line read. You should bring the object outside the loop. If possible, choose the encoding at the lowest possible level: at open () time.

        my $csv = Text::CSV_XS->new ({
            binary             => 1,
            escape_char        => "\\",
            allow_loose_quotes => 1,
            auto_diag          => 1,
            });
        open my $fh, "<:encoding(utf-8)", "text.csv" or die "text.csv: $!";
        $csv->getline ($fh);    # skip first line
        while (my $row = $csv->getline ($fh)) {
            my ($tri, $chem, $year, $lbs, $gms) =
                map { undiacritic ($_) } @{$row}[0,2,4,5,6];
            :
            :
            }
        close $fh;

        Enjoy, Have FUN! H.Merijn

        Once again, I strongly recommend you move the object constructor outside the while loop. Tux, a.k.a. H. Merijn Brand, the author of Text::CSV_XS, which is the base module you're indirectly using, amplified this suggestion in his reply to your post. He explained that "you create a new csv parsing object for every line read." You really don't want to do this needlessly for 1.7 million records. And Tux's advice to use getline instead of <>/parse/fields is a wise one.

        I think you're trying too hard to handle the floating-point weight values. Just be explicit with your tests:

        if ($pounds == 0.0 and $grams == 0.0) { ... }

        The right coercions happen automagically. Here's the proof:

        D:\>perl -E "say '' == 0.0 ? 'Equals' : 'Does not equal'"
        Equals
        
        D:\>perl -E "say '0' == 0.0 ? 'Equals' : 'Does not equal'"
        Equals
        
        D:\>perl -E "say '0.0' == 0.0 ? 'Equals' : 'Does not equal'"
        Equals
        
        D:\>perl -E "say '123' == 0.0 ? 'Equals' : 'Does not equal'"
        Does not equal
        
        D:\>perl -E "say '123.4' == 0.0 ? 'Equals' : 'Does not equal'"
        Does not equal
        
        D:\>

        So you didn't like any of the other improvements I suggested in my refactored version of your script?

        Jim

        I've incorporated Tux's suggestion to use getline (getline_hr, actually) instead of <>/parse/fields. It really tightens up the whole script.

        #!/usr/bin/perl
        
        use strict;
        use warnings;
        use English qw( -no_match_vars );
        use Text::CSV;
        
        $OUTPUT_FIELD_SEPARATOR  = "\n";
        $OUTPUT_RECORD_SEPARATOR = "\n";
        
        my $release_file = '../ecodata/releases.txt';
        
        # Text is in the ISO 8859-1 (Latin 1) encoding
        open my $release_fh, '<:encoding(iso-8859-1)', $release_file
            or die "Can't open release file $release_file: $OS_ERROR\n";
        
        my $csv = Text::CSV->new({
            auto_diag          => 1,
            binary             => 1,
            allow_loose_quotes => 1,
            escape_char        => '\\',
        });
        
        # Header is 'TRI,Release#,ChemName,RegNum,Year,Pounds,Grams'
        $csv->column_names($csv->getline($release_fh));
        
        while (my $value = $csv->getline_hr($release_fh)) {
            {
                no warnings qw( numeric );
        
                if ($value->{'Pounds'} == 0.0 and $value->{'Grams'} == 0.0) {
                    warn "Release $value->{'Release#'} is weightless\n";
                }
            }
        
            print $value->{'TRI'},
                  $value->{'Release#'},
                  $value->{'ChemName'},
                  $value->{'RegNum'},
                  $value->{'Year'},
                  $value->{'Pounds'},
                  $value->{'Grams'};
        }
        
        close $release_fh;
        
        exit 0;
Re^2: problems parsing CSV
by shawnhcorey (Friar) on Oct 09, 2010 at 19:11 UTC

    How's this? And no, I haven't tested it exhaustively nor compared it to Text::CSV.

    #!/usr/bin/perl
    
    use strict;
    use warnings;
    
    # Slurp the whole DATA section in one read
    local $/;
    my $data = <DATA>;
    
    # Each match grabs either a double-quoted field (doubled "" allowed inside)
    # or an unquoted field, plus the comma or newline that ends it
    while( $data =~ m{ (?: \" ( .*? (?: \"\" .*? )*? ) \" | ( [^,]*? ) ) ( \, | \n ) }msxg ){
      my $item = $1;
      if( not defined $item ){
        $item = $2;
      }
      my $nl = $3;
    
      print "\t«$item»\n";
      if( $nl eq "\n" ){
        print '-' x 40, "\n";
      }
    }
    
    __DATA__
    stuff,"more","foo (1994 "bar" only)",1234,1988,3.0,""
    a,b,c,
    x,y
    "a","foo""bar",test
    a,"foo"bar",test
Re^2: problems parsing CSV
by helenwoodson (Acolyte) on Oct 10, 2010 at 11:08 UTC

    My ignorance here is exceeded only by my humility. I'm afraid I don't understand the question.

    If you mean you want to see the code I wrote using Text::CSV and the attributes I set for it, that's in a code snippet I posted a little while ago.

    If that's not what you mean, please explain it to me and I'll post whatever it is, if I can figure it out.
