Re: Text CSV_XS memory crash
by marto (Cardinal) on Feb 02, 2011 at 15:14 UTC
In addition to the other comment, are you sure you've run this code? You should get an error along the lines of
Global symbol "$file" requires explicit package name at ....
Marto, you're right, I was reducing the actual code to a test case and missed that.
Re: Text CSV_XS memory crash
by Anonymous Monk on Feb 02, 2011 at 15:05 UTC
People report success reading files over 100GB, but I crash out around 100MB.
What is the nature of the file? Hundreds or thousands of commas per line? Tens of thousands of lines? Lots of quotes/escapes?
Perhaps you can write a simple file generator that creates a file resembling the one which crashes, and confirm that it crashes with the generated file as well :)
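For instance, a throwaway generator along these lines (the column count, field contents, and default row count are made-up placeholders, not the real file's layout):
#!/usr/bin/perl
use strict;
use warnings;
# hypothetical sketch: emit N rows of quoted, comma-separated dummy fields
my $rows = shift // 1_000_000;
for my $i (1 .. $rows) {
    print join(",", map { qq{"col${_}_row$i"} } 1 .. 8), "\r\n";
}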
$ pmvers Text::CSV Text::CSV_XS Text::CSV_PP
Text::CSV: 1.21
Text::CSV_XS: 0.80
Text::CSV_PP: 1.29
Re: Text CSV_XS memory crash
by Anonyrnous Monk (Hermit) on Feb 02, 2011 at 15:07 UTC
{push @array, $row;}
I'd suppose you're simply running out of memory, because you're collecting all the data in @array. 2 GB of memory usage for representing 100 MB of file contents as an array (of arrays) structure isn't that unusual.
Do you actually need to hold the entire data in memory, or might there perhaps be a way to process things sequentially?
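For example, a minimal streaming sketch (the file name and the per-row work are placeholders):
use strict;
use warnings;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });
open my $fh, "<", "data.csv" or die "data.csv: $!";
while (my $row = $csv->getline($fh)) {
    # act on $row (an array ref of this row's fields) right here;
    # memory stays flat because no rows are accumulated
}
close $fh;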
Since I'm acting on the column values, I think I have to read every line into the array, in order to pull out the first value of every row to get a column. Perhaps I'm missing a CSV_XS method for selecting the column instead of reading in every row?
...in order to pull out the first value of every row to get a column.
Not sure I'm understanding you correctly, but if you only need the first column of every row, why not store only the first column (that would at least reduce memory usage somewhat).
The getline() method returns a reference to an array holding the columns. In other words, $row->[0] would be the first column.
If, OTOH, you actually do need access to all columns of all rows simultaneously, I'm afraid there's not much you can do except to upgrade memory (or write out the data into another (DB) file format that allows direct random access to individual fields).
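A minimal sketch of the first option (assuming $csv and $fh are set up as in your existing code):
my @first_col;
while (my $row = $csv->getline($fh)) {
    push @first_col, $row->[0];   # keep only column 0; the rest of the row is discarded
}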
Re: Text CSV_XS memory crash
by glepore70 (Novice) on Feb 02, 2011 at 15:14 UTC
Nearly the exact same data is available at:
http://views.cira.colostate.edu/documents/Data/SourceFiles/EPA%20Clean%20Air%20Status%20and%20Trends%20Network%20%28CASTNet%29/CASTNet%20Ozone/ozone_2007.csv
Comma separated, quote enclosed.
That's a 50MB document, just cat it to itself to get a 100MB sample file that crashes for me.
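For example (the output name is arbitrary; note that this also duplicates any header line mid-file):
$ cat ozone_2007.csv ozone_2007.csv > ozone_100mb.csv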
pmvers Text::CSV Text::CSV_XS Text::CSV_PP
Text::CSV: 1.21
Text::CSV_XS: 0.80
Text::CSV_PP: 1.29
Thanks.
A 50MB file is still a 50MB file ;) A 10-20 line generator script is a ton smaller, and fits on PerlMonks without compression :)
$ csv-check ozone_2007.csv
Checked ozone_2007.csv with csv-check 1.5 using Text::CSV_XS 0.80
OK: rows: 756689, columns: 8
sep = <,>, quo = <">, bin = <0>, eol = <"\r\n">
$ perl ozone.pl ozone_2007.csv
Reading ozone_2007.csv with Text::CSV_XS-0.80 ...
Data size = 4194400, total size = 409046824
$ cat ozone.pl
#!/pro/bin/perl
use strict;
use warnings;
use autodie;
use Text::CSV_XS;
use Devel::Size qw( size total_size );
my $file = shift;
print "Reading $file with Text::CSV_XS-$Text::CSV_XS::VERSION ...\n";
my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });
open my $fh, "<:encoding(utf-8)", $file;
my $dta = $csv->getline_all ($fh);
printf "Data size = %d, total size = %d\n", size ($dta), total_size ($
+dta);
$
or in a one-liner
$ perl -MDevel::Size -MText::CSV_XS -wle'print Devel::Size::total_size(Text::CSV_XS->new()->getline_all(*ARGV))' ozone_2007.csv
409046824
$
Enjoy, Have FUN! H.Merijn
Re: Text CSV_XS memory crash
by spazm (Monk) on Feb 02, 2011 at 18:13 UTC
I have modified your code to process the incoming stream looking for min, max, and unique values. Configure @min_columns, @max_columns, and @unique_columns to control which fields are processed.
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV_XS;
use Data::Dumper;

my $file = "./test.csv";

my @min_columns    = qw( ozone ozone_f qa_code ozone_8hr );
my @max_columns    = qw( ozone ozone_f qa_code rank );
my @unique_columns = qw( site_id );

my $data;
my $counter;

my $csv = Text::CSV_XS->new({
    binary    => 1,
    auto_diag => 1,
}) or die "Cannot use CSV: " . Text::CSV_XS->error_diag();

open my $FH2, "<:encoding(utf8)", $file or die "$file: $!";

# read the header row and strip leading/trailing whitespace from the names
my $columns = $csv->getline($FH2);
for (@$columns) { s/^\s+//; s/\s+$// }
$csv->column_names(@$columns);

# process one row at a time; only the running min/max/unique
# summaries are kept in memory
while (my $row = $csv->getline_hr($FH2)) {
    process_row($row);
}
print "done with loop after $counter rows\n";
$csv->eof or $csv->error_diag();
close $FH2;

print Dumper $data;

sub process_row
{
    my $row = shift;
    $counter++;
    print "processing: $counter\n" if $counter % 100 == 0;
    for my $column (@min_columns) {
        my $val    = $row->{$column};
        my $d      = $data->{$column}{min};
        my $update = { value => $val, row => $row };
        $data->{$column}{min} =
            !exists $d->{value} ? $update
          : $d->{value} > $val  ? $update
          :                       $d;
    }
    for my $column (@max_columns) {
        my $val    = $row->{$column};
        my $d      = $data->{$column}{max};
        my $update = { value => $val, row => $row };
        $data->{$column}{max} =
            !exists $d->{value} ? $update
          : $d->{value} < $val  ? $update    # "<" here, so larger values win
          :                       $d;
    }
    for my $column (@unique_columns) {
        my $val = $row->{$column};
        $data->{$column}{unique}{$val}++;
        # could just be =1, but now we can get unique and count distinct
    }
}
I switched to Text::CSV_XS from Text::CSV_PP, as the example data had an issue around line 296 (PP parse error 2025, loose escape char).
This would be an excellent spot to use DataCube, but that module is gone from CPAN now. I would like David to fix it and put it back on CPAN; in the meantime it can still be found on BackPAN.
If you're going to be doing ad-hoc analysis of CSV files, the R language may prove useful.
Edit: added updates from Tux's comment.
Tux, thanks for replying to my post and reviewing my code!
- Yes, good point, no need to explicitly include use Text::CSV
- Because I don't want bound vars, I want a hashref. I use the hash for readability of the data and code for this test example.
- True, comma is the default. Leftover from modifying the original code. I'll clean that up.
- auto_diag sounds like a useful change for both scripts (mine and original poster's).
Changes applied.
This code will complete in O(n ln n) time using O(m) space, where n is the number of rows and m is the number of distinct values tracked for uniqueness.
Re: Text CSV_XS memory crash
by glepore70 (Novice) on Feb 02, 2011 at 17:36 UTC
OK, thanks to all for the help; I was confusing the row and column values. I've successfully rewritten my code to pull out one column at a time and process each column separately. I still have a few things to work out, but I'm over the hump.
$row->[0] was the key.