Re: problems parsing CSV

by ww (Archbishop)
on Oct 09, 2010 at 14:26 UTC [id://864386]


in reply to problems parsing CSV

May we see your version of a CSV parser?

Writing one is a fairly complex challenge; since you wrote one that apparently worked (within limits), comparing your work to Text::CSV might be enlightening.
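
For illustration, here is a minimal, untested sketch of the corner cases a hand-rolled split-on-comma parser usually misses (quoted commas, doubled quotes, empty fields); Text::CSV handles these out of the box:

    #!/usr/bin/perl
    # Untested sketch: quoted commas, doubled quotes, and empty fields are
    # the usual stumbling blocks for home-grown CSV parsers.
    use strict;
    use warnings;
    use Text::CSV;
    
    my $csv = Text::CSV->new({ binary => 1 })
        or die "" . Text::CSV->error_diag();
    
    for my $line (
        'plain,simple,fields',
        '"a, quoted comma","doubled ""quotes""",',
    ) {
        if ($csv->parse($line)) {
            print join(' | ', $csv->fields()), "\n";
        }
        else {
            warn "Failed to parse: " . $csv->error_input() . "\n";
        }
    }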

Replies are listed 'Best First'.
Re^2: problems parsing CSV
by helenwoodson (Acolyte) on Oct 09, 2010 at 23:54 UTC

    Thanks for all the suggestions. Some of it is over my head, so I will need to meditate on it.

    I'm afraid I wasn't clear. The example I gave was not a real example, but an example of how the data looks that's not getting processed correctly.

    The input file is over a million records, and new data will be added from time to time. The problem input is not all the same but is formatted like my previous example. It is similar enough that I was able to get rid of it with a substitution, as I'm not even using that portion of the data. But it is ugly and slow.

    I've simplified the code to show the parsing attempt.

    #!/usr/bin/perl -w
    use strict;
    use Text::Undiacritic qw(undiacritic);
    use Text::CSV;
    
    my ( $tri, $chem, $year, $lbs, $gms, $rls, $csv, $err, @cols );
    
    # open the release report file for input
    #$rls = "rls.tst";
    $rls = "../ecodata/releases.txt";
    open( RLS, $rls ) || die "bad open $rls";
    
    # RLS: TRI, Release#, ChemName, RegNum, Year, Pounds, Grams
    while( <RLS> ) {
        $_ = undiacritic($_);
        s/\(\d{4} and after \"acid aerosols\" only\)//g;
        $csv = Text::CSV->new();
        next if ($. == 1);
        if ($csv->parse($_)) {
            @cols = $csv->fields();
            $tri  = $cols[0];
            $chem = $cols[2];
            $year = $cols[4];
            $lbs  = $cols[5];
            $gms  = $cols[6];
        }
        else {
            $err = $csv->error_input;
            print "Failed to parse line: $err";
        }
    }
    close(RLS);

    Here is a tiny bit of the output before I put in the substitution - I'm sure there is a better way to do this:

    Failed to parse line: 00617BRSTLSTATE,"1394080382029","Sulfuric acid (1994 and after "acid aerosols" only)",7664-93-9,1994,500.0,""
    Failed to parse line: 00617BRSTLSTATE,"1394080382031","Hydrochloric acid (1995 and after "acid aerosols" only)",7647-01-0,1994,2842.0,""

    I hope this is more clear. Thanks so much for your help; now I must go meditate over what you've suggested.

      Removing diacritics from characters alters the data you're parsing. Are you supposed to do that? Probably not. You certainly shouldn't have to modify the data for any reason.

      What character encoding is the text in? ISO 8859-1 (Latin 1)? Windows-1252? UTF-8? If you don't know, I encourage you to find out. You really ought to know.
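
      If you want a quick hint, here is a rough, untested sketch that guesses at a sample of the file with Encode::Guess (the path is the one from your script); Latin-1 and Windows-1252 overlap, so treat any guess as a hint only:

      #!/usr/bin/perl
      # Rough sketch: guess the encoding of a sample of the input file.
      # Encode::Guess is heuristic, so treat its answer as a hint, not proof.
      use strict;
      use warnings;
      use Encode::Guess;
      
      my $file = '../ecodata/releases.txt';
      open my $fh, '<:raw', $file or die "Can't open $file: $!";
      read $fh, my $sample, 64 * 1024;    # the first 64 KB is usually enough
      close $fh;
      
      my $decoder = guess_encoding($sample, 'iso-8859-1');
      if (ref $decoder) {
          print "Looks like ", $decoder->name, "\n";
      }
      else {
          print "Could not decide: $decoder\n";
      }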

      I suspect your few problems, malformed CSV records and text that isn't plain ASCII, can be solved by using Text::CSV::Encoded and also by enabling allow_loose_quotes as others have recommended.

      UPDATE: Using Text::CSV::Encoded may be overkill. Jenda's recommendation to set the binary attribute to true (1) may be all you really need. But I still believe you ought to know what character encoding the text is in.
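
      In other words, all you may need is something like this minimal, untested sketch:

      # Minimal sketch: binary => 1 lets Text::CSV accept bytes outside ASCII
      # (such as accented characters) instead of refusing the record.
      my $csv = Text::CSV->new({ binary => 1 })
          or die "" . Text::CSV->error_diag();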

        Good point. It would be better to avoid removing the diacritics if possible. I did find that setting the binary attribute to true for Text::CSV did prevent the script from choking on the diacritics.

        I do not know the character encoding and don't know how to identify it, so I asked Mr. Google (Mr. Google knows all!) and am looking through what he dredged up. I looked at the documentation on CPAN for Text::CSV::Encoded. It appears that, in order to use this, you need to know the encoding for the input and what you want to use for the output.
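
        If I'm reading the docs correctly, it would look something like this (untested, and the encodings here are just guesses):

        # Untested sketch based on the Text::CSV::Encoded docs; the encodings
        # are guesses and would need to match the real data.
        use Text::CSV::Encoded;
        
        my $csv = Text::CSV::Encoded->new({
            encoding_in  => 'iso-8859-1',    # what the file is in (guess)
            encoding_out => 'utf8',          # what you want back in Perl
            binary       => 1,
        }) or die "" . Text::CSV::Encoded->error_diag();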

        I seem to have it working fairly well except for some of the cases where the weight is "" or 0.0. I haven't yet figured out why it works correctly for some records and not for others. I will look at the records where it fails and see if I can identify that.

        Thanks very much.

      I refactored your script. Most of the changes I made are ones recommended in Perl Best Practices by Damian Conway. The revised script is untested and intended merely to suggest coding improvements.

      #!/usr/bin/perl
      
      use strict;
      use warnings;
      use English qw( -no_match_vars );
      use Text::CSV;
      
      my $release_file = '../ecodata/releases.txt';
      
      open my $release_fh, '<', $release_file
          or die "Can't open release file $release_file: $OS_ERROR\n";
      
      my $csv = Text::CSV->new({
          allow_loose_quotes => 1,
          binary             => 1,
      });
      
      CSV_RECORD:
      while (my $csv_record = <$release_fh>) {
          # Skip header...
          next CSV_RECORD if $INPUT_LINE_NUMBER == 1;
      
          # SPECIAL CASE: Remove unwanted text...
          $csv_record =~ s/ \(\d{4} and after "acid aerosols" only\)//g;
      
          if (not $csv->parse($csv_record)) {
              my $csv_error_input = $csv->error_input();
              warn "Can't parse CSV record at line $INPUT_LINE_NUMBER"
                  . " of release file $release_file:\n$csv_error_input\n";
              next CSV_RECORD;
          }
      
          # TRI, Release#, ChemName, RegNum, Year, Pounds, Grams
          my ($tri, undef, $chem_name, undef, $year, $pounds, $grams)
              = $csv->fields();
      
          # [Do what you need to do with the data here]
      }
      
      close $release_fh;
      
      exit 0;

      I moved the Text::CSV->new() constructor outside the while loop. I moved the lexical variables inside the while loop. I added a space to the beginning of the regular expression pattern matching the string being globally removed from the CSV records.

      I won't inflict the entire script on you, just the pertinent code snippet. The snippet I posted before now looks like this, with the old code commented out:

      while( <RLS> ) {
          # $_ = undiacritic($_);
          # s/\(\d{4} and after \"acid aerosols\" only\)//g;
          # $csv = Text::CSV->new();
          $csv = Text::CSV->new( {
              allow_loose_quotes => 1,
              escape_char        => "\\",
              binary             => 1,
          } ) or die "" . Text::CSV->error_diag ();
          next if ($. == 1);
          if ($csv->parse($_)) {
              @cols = $csv->fields();
              $tri  = $cols[0];
              $chem = $cols[2];
              $year = $cols[4];
              $lbs  = $cols[5];
              $gms  = $cols[6];
          }
          else {
              $err = $csv->error_input;
              print "Failed to parse line: $err";
          }
      }
      close(RLS);

      allow_loose_quotes did the trick on the embedded quotes - I found something that said to change the escape_char so it's not the same as the quote_char, so I did that as well. binary=>1 eliminated the need for undiacritic().

      Now, instead of processing each record three times, I'm processing it once. With 1.7 million records, that is very nice.

      Well, I thought I was done. It turns out that some of the fields (for lbs and gms) are "", some are 0.0, and some are something like 123.4 in the input file. I changed the assignment to this:

      if    (!$cols[5])         { $lbs = 0 }
      elsif ($cols[5] eq "0.0") { $lbs = 0 }
      else                      { $lbs = $cols[5] }
      
      if    (!$cols[6])         { $gms = 0 }
      elsif ($cols[6] eq "0.0") { $gms = 0 }
      else                      { $gms = $cols[6] }

      and then this test

        if( !$lbs && !$gms )

      gives valid results. Is there a better way to do this? It seems rather clunky. I'd have thought that 0.0 would be interpreted the same as 0, but apparently not.

      Thanks so much for your help.

        Code like

        while (<RLS>) {
            $csv = Text::CSV->new ( {
                allow_loose_quotes => 1,
                escape_char        => "\\",
                binary             => 1,
            } ) or die "" . Text::CSV->error_diag ();
            next if ($. == 1);
            if ($csv->parse ($_)) {

        will very soon start failing when the fields have embedded newlines, because <RLS> reads line by line and knows nothing about the quoting and escaping. When parsing CSV with Text::CSV or Text::CSV_XS, please use getline (). Also note that in your example you create a new CSV parsing object for every line read. You should bring the object outside the loop. If possible, choose the encoding at the lowest possible level: at open () time.

        my $csv = Text::CSV_XS->new ({
            binary             => 1,
            escape_char        => "\\",
            allow_loose_quotes => 1,
            auto_diag          => 1,
            });
        open my $fh, "<:encoding(utf-8)", "text.csv" or die "text.csv: $!";
        $csv->getline ($fh);    # skip first line
        while (my $row = $csv->getline ($fh)) {
            my ($tri, $chem, $year, $lbs, $gms) =
                map { undiacritic ($_) } @{$row}[0,2,4,5,6];
            :
            :
            }
        close $fh;

        Enjoy, Have FUN! H.Merijn

        Once again, I strongly recommend you move the object constructor outside the while loop. Tux, a.k.a. H. Merijn Brand, the author of Text::CSV_XS, which is the base module you're indirectly using, amplified this suggestion in his reply to your post. He explained that "you create a new csv parsing object for every line read." You really don't want to do this needlessly for 1.7 million records. And Tux's advice to use getline instead of <>/parse/fields is a wise one.

        I think you're trying too hard to handle the floating-point weight values. Just be explicit with your tests:

        if ($pounds == 0.0 and $grams == 0.0) { ... }

        The right coercions happen automagically. Here's the proof:

        D:\>perl -E "say '' == 0.0 ? 'Equals' : 'Does not equal'"
        Equals
        
        D:\>perl -E "say '0' == 0.0 ? 'Equals' : 'Does not equal'"
        Equals
        
        D:\>perl -E "say '0.0' == 0.0 ? 'Equals' : 'Does not equal'"
        Equals
        
        D:\>perl -E "say '123' == 0.0 ? 'Equals' : 'Does not equal'"
        Does not equal
        
        D:\>perl -E "say '123.4' == 0.0 ? 'Equals' : 'Does not equal'"
        Does not equal
        
        D:\>

        So you didn't like any of the other improvements I suggested in my refactored version of your script?

        Jim

        I've incorporated Tux's suggestion to use getline (getline_hr, actually) instead of <>/parse/fields. It really tightens up the whole script.

        #!/usr/bin/perl
        
        use strict;
        use warnings;
        use English qw( -no_match_vars );
        use Text::CSV;
        
        $OUTPUT_FIELD_SEPARATOR  = "\n";
        $OUTPUT_RECORD_SEPARATOR = "\n";
        
        my $release_file = '../ecodata/releases.txt';
        
        # Text is in the ISO 8859-1 (Latin 1) encoding
        open my $release_fh, '<:encoding(iso-8859-1)', $release_file
            or die "Can't open release file $release_file: $OS_ERROR\n";
        
        my $csv = Text::CSV->new({
            auto_diag          => 1,
            binary             => 1,
            allow_loose_quotes => 1,
            escape_char        => '\\',
        });
        
        # Header is 'TRI,Release#,ChemName,RegNum,Year,Pounds,Grams'
        $csv->column_names($csv->getline($release_fh));
        
        while (my $value = $csv->getline_hr($release_fh)) {
            {
                no warnings qw( numeric );
        
                if ($value->{'Pounds'} == 0.0 and $value->{'Grams'} == 0.0) {
                    warn "Release $value->{'Release#'} is weightless\n";
                }
            }
        
            print $value->{'TRI'},
                  $value->{'Release#'},
                  $value->{'ChemName'},
                  $value->{'RegNum'},
                  $value->{'Year'},
                  $value->{'Pounds'},
                  $value->{'Grams'};
        }
        
        close $release_fh;
        
        exit 0;
Re^2: problems parsing CSV
by shawnhcorey (Friar) on Oct 09, 2010 at 19:11 UTC

    How's this? And no, I haven't tested it exhaustively nor compared it to Text::CSV.

    #!/usr/bin/perl
    
    use strict;
    use warnings;
    
    # Slurp the whole DATA section in one read
    local $/;
    my $data = <DATA>;
    
    # Each match grabs either a double-quoted field (doubled "" allowed inside)
    # or an unquoted field, plus the comma or newline that ends it
    while( $data =~ m{ (?: \" ( .*? (?: \"\" .*? )*? ) \" | ( [^,]*? ) ) ( \, | \n ) }msxg ){
      my $item = $1;
      if( not defined $item ){
        $item = $2;
      }
      my $nl = $3;
    
      print "\t«$item»\n";
      if( $nl eq "\n" ){
        print '-' x 40, "\n";
      }
    }
    
    __DATA__
    stuff,"more","foo (1994 "bar" only)",1234,1988,3.0,""
    a,b,c,
    x,y
    "a","foo""bar",test
    a,"foo"bar",test
Re^2: problems parsing CSV
by helenwoodson (Acolyte) on Oct 10, 2010 at 11:08 UTC

    My ignorance here is exceeded only by my humility. I'm afraid I don't understand the question.

    If you mean you want to see the code I wrote using Text::CSV and the attributes I set for it, that's in a code snippet I posted a little while ago.

    If that's not what you mean, please explain it to me and I'll post whatever it is, if I can figure it out.
