PerlMonks
Text CSV_XS memory crash

by glepore70 (Novice)
on Feb 02, 2011 at 14:50 UTC

glepore70 has asked for the wisdom of the Perl Monks concerning the following question:

I am getting a crash, probably related to memory usage, while parsing a large (100MB) comma-separated text file using Text::CSV_XS. The crash occurs on both Kubuntu (Plasma segfault) and Windows (the process is killed), and with either Text::CSV_XS or Text::CSV_PP. I am reading the CSV file in order to perform some basic functions (min/max, distinct) on each column. Here is the test case; you will need to scare up a test.file which is CSV and very large:
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV_PP;
use Text::CSV_XS;
use Text::CSV;

my @array;
my $csv = Text::CSV->new ({ binary => 1, sep_char => ',' })
    or die "Cannot use CSV: " . Text::CSV->error_diag ();
open my $FH2, "<:encoding(utf8)", "test.file" or die "$file: $!";
while (my $row = $csv->getline ($FH2)) {
    {push @array, $row;}
}
$csv->eof or $csv->error_diag ();
close $FH2;
When running the code, I see memory usage climb to near 100% (2GB installed) and then the crash occurs.

So my questions are: why is the crash occurring and how can I avoid it? I've read whatever I can find on the subject, and I can't find anyone else having the same problem. People report success reading files over 100GB, but I crash out around 100MB. Thanks in advance for any help, and apologies in advance for missing any obvious resources for help. I did search...

Replies are listed 'Best First'.
Re: Text CSV_XS memory crash
by marto (Cardinal) on Feb 02, 2011 at 15:14 UTC

    In addition to the other comment, are you sure you've run this code? You should get an error along the lines of

    Global symbol "$file" requires explicit package name at ....
      Marto, you're right, I was reducing the actual code to a test case and missed that.
Re: Text CSV_XS memory crash
by Anonymous Monk on Feb 02, 2011 at 15:05 UTC
    People report success reading files over 100GB, but I crash out around 100MB.

    What is the nature of the file? Hundreds or thousands of commas per line? Tens of thousands of lines? Lots of quotes/escapes?

    Perhaps you can write a simple file generator that will create a file resembling the one which crashes, and confirm it crashes with this one also :)
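
    For instance, something along these lines (a throwaway sketch; the field layout is invented here, just comma separated and quote enclosed like the file described):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical generator: prints quoted CSV rows on stdout.
    # Redirect to a file and raise the row count until it reaches ~100MB.
    my $rows = shift || 1_000_000;
    for my $i (1 .. $rows) {
        my @fields = ("SITE$i", "2007-01-01 00:00", $i % 100,
                      sprintf ("%.3f", rand 100));
        print join (",", map { qq{"$_"} } @fields), "\n";
    }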

    $ pmvers Text::CSV Text::CSV_XS Text::CSV_PP
    Text::CSV:    1.21
    Text::CSV_XS: 0.80
    Text::CSV_PP: 1.29
Re: Text CSV_XS memory crash
by Anonyrnous Monk (Hermit) on Feb 02, 2011 at 15:07 UTC
    {push @array, $row;}

    I'd suppose you're simply running out of memory, because you're collecting all the data in the @array.  2 GB memory usage for representing 100 MB file contents in an array (of arrays) structure isn't that unusual.

    Do you actually need to hold the entire data in memory, or might there perhaps be a way to process things sequentially?
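
    For example, a single streaming pass that keeps only running aggregates (a minimal sketch, assuming purely numeric columns; the file name is from your post):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Text::CSV_XS;

    my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });
    open my $fh, "<:encoding(utf8)", "test.file" or die "test.file: $!";

    my (@min, @max);    # one running min/max per column; memory stays O(columns)
    while (my $row = $csv->getline ($fh)) {
        for my $i (0 .. $#$row) {
            $min[$i] = $row->[$i] if !defined $min[$i] or $row->[$i] < $min[$i];
            $max[$i] = $row->[$i] if !defined $max[$i] or $row->[$i] > $max[$i];
        }
    }
    close $fh;
    print "col $_: min=$min[$_] max=$max[$_]\n" for 0 .. $#min;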

      Since I'm acting on the column values, I think I have to read every line into the array, in order to pull out the first value of every row to get a column. Perhaps I'm missing a CSV_XS method for selecting the column instead of reading in every row?
        ...in order to pull out the first value of every row to get a column.

        Not sure I'm understanding you correctly, but if you only need the first column of every row, why not store only the first column (that would at least reduce memory usage somewhat).

        The getline() method returns a reference to an array holding the columns. In other words, $row->[0] would be the first column.

        If, OTOH, you actually do need access to all columns of all rows simultaneously, I'm afraid there's not much you can do except to upgrade memory (or write out the data into another (DB) file format that allows direct random access to individual fields).
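
        A minimal sketch of the first-column idea (reusing the test.file name from the original post), keeping just $row->[0] per row:

        use strict;
        use warnings;
        use Text::CSV_XS;

        my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });
        open my $fh, "<:encoding(utf8)", "test.file" or die "test.file: $!";

        my @first_col;
        while (my $row = $csv->getline ($fh)) {
            push @first_col, $row->[0];    # store only the first field, drop the rest
        }
        close $fh;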

Re: Text CSV_XS memory crash
by glepore70 (Novice) on Feb 02, 2011 at 15:14 UTC
    Nearly the exact same data is available at:

    http://views.cira.colostate.edu/documents/Data/SourceFiles/EPA%20Clean%20Air%20Status%20and%20Trends%20Network%20%28CASTNet%29/CASTNet%20Ozone/

    ozone_2007.csv

    Comma separated, quote enclosed.

    That's a 50MB document; just cat two copies together (e.g. cat ozone_2007.csv ozone_2007.csv > test.file) to get a 100MB sample file that crashes for me.

    pmvers Text::CSV Text::CSV_XS Text::CSV_PP
    Text::CSV:    1.21
    Text::CSV_XS: 0.80
    Text::CSV_PP: 1.29

    Thanks.
      A 50MB file is still a 50MB file ;) A 10-20 line generator is a ton smaller, and fits on perlmonks without compression :)
      $ csv-check ozone_2007.csv
      Checked ozone_2007.csv with csv-check 1.5 using Text::CSV_XS 0.80
      OK: rows: 756689, columns: 8
          sep = <,>, quo = <">, bin = <0>, eol = <"\r\n">
      $ perl ozone.pl ozone_2007.csv
      Reading ozone_2007.csv with Text::CSV_XS-0.80 ...
      Data size = 4194400, total size = 409046824
      $ cat ozone.pl
      #!/pro/bin/perl

      use strict;
      use warnings;
      use autodie;

      use Text::CSV_XS;
      use Devel::Size qw( size total_size );

      my $file = shift;
      print "Reading $file with Text::CSV_XS-$Text::CSV_XS::VERSION ...\n";
      my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });
      open my $fh, "<:encoding(utf-8)", $file;
      my $dta = $csv->getline_all ($fh);
      printf "Data size = %d, total size = %d\n", size ($dta), total_size ($dta);
      $

      or in a one-liner

      $ perl -MDevel::Size -MText::CSV_XS -wle'print Devel::Size::total_size(Text::CSV_XS->new()->getline_all(*ARGV))' ozone_2007.csv
      409046824
      $

      Enjoy, Have FUN! H.Merijn
Re: Text CSV_XS memory crash
by spazm (Monk) on Feb 02, 2011 at 18:13 UTC
    I have modified your code to process the incoming stream looking for min, max and unique values. Configure @min_columns, @max_columns and @unique_columns to control which fields are processed.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Text::CSV_XS;
    use Data::Dumper;

    my $file           = "./test.csv";
    my @min_columns    = qw( ozone ozone_f qa_code ozone_8hr );
    my @max_columns    = qw( ozone ozone_f qa_code rank );
    my @unique_columns = qw( site_id );

    my $data;
    my $counter;

    my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 })
        or die "Cannot use CSV: " . Text::CSV_XS->error_diag();
    open my $FH2, "<:encoding(utf8)", $file or die "$file: $!";

    my $columns = $csv->getline($FH2);
    for (@$columns) { s/^\s+//; s/\s+$// }
    $csv->column_names(@$columns);

    while (my $row = $csv->getline_hr($FH2)) {
        process_row($row);
    }
    print "done with loop after $counter rows\n";
    $csv->eof or $csv->error_diag();
    close $FH2;

    print Dumper $data;

    sub process_row {
        my $row = shift;
        $counter++;
        print "processing: $counter\n" if $counter % 100 == 0;

        for my $column (@min_columns) {
            my $val    = $row->{$column};
            my $d      = $data->{$column}{min};
            my $update = { value => $val, row => $row };
            $data->{$column}{min} =
                  !exists $d->{value} ? $update
                : $d->{value} > $val  ? $update
                :                       $d;
        }
        for my $column (@max_columns) {
            my $val    = $row->{$column};
            my $d      = $data->{$column}{max};
            my $update = { value => $val, row => $row };
            $data->{$column}{max} =
                  !exists $d->{value} ? $update
                : $d->{value} < $val  ? $update    # keep the larger value
                :                       $d;
        }
        for my $column (@unique_columns) {
            my $val = $row->{$column};
            $data->{$column}{unique}{$val}++;
            # could just be =1, but counting gives both unique and count distinct
        }
    }

    I switched to Text::CSV_XS from Text::CSV_PP, as the example data had an issue around line 296 (Text::CSV_PP parse error 2025, loose escape character).

    This would be an excellent spot to use DataCube, but that module is gone from CPAN now. I would like David to fix it and put it back on CPAN. It is still available on BackPAN: DataCube.

    If you're going to be doing ad-hoc analysis of CSV files, the R language may also prove useful.

    Edit: added updates from Tux's comment

      • If you already use Text::CSV_XS, there is no need to also use Text::CSV.
      • If you are using getline_hr () with column_names (), why not take the extra step to use bind_columns () and regain all the speed/performance you just lost by using hash references? (See the sketch after this list.)
      • The comma is the default sep_char value. You should never need to pass that to the constructor.
      • You don't need any error_diag () call if you pass auto_diag => 1 to the constructor.
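
      A minimal bind_columns () sketch (csv-check reported 8 columns for the ozone file; the column names below are invented for illustration). getline () then fills the bound scalars directly, so no per-row hashref or arrayref is built:

      use strict;
      use warnings;
      use Text::CSV_XS;

      my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });
      open my $fh, "<:encoding(utf8)", "ozone_2007.csv" or die "ozone_2007.csv: $!";

      my $header = $csv->getline ($fh);    # consume the header row
      my ($site_id, $date, $time, $ozone, $ozone_f, $qa_code, $ozone_8hr, $rank);
      $csv->bind_columns (\($site_id, $date, $time, $ozone, $ozone_f, $qa_code, $ozone_8hr, $rank));
      while ($csv->getline ($fh)) {
          # each bound scalar now holds the current row's field
      }
      close $fh;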

      Enjoy, Have FUN! H.Merijn
        Tux, thanks for replying to my post and reviewing my code!
        1. Yes, good point, no need to explicitly include use Text::CSV
        2. Because I don't want bound vars, I want a hashref. I use the hash for readability of the data and code for this test example.
        3. True, comma is the default. Leftover from modifying the original code. I'll clean that up.
        4. auto_diag sounds like a useful change for both scripts (mine and original poster's).
        changes applied.
      This code will complete in O(n) time using O(m) space, where n is the number of rows and m is the number of distinct values tracked for uniqueness.
Re: Text CSV_XS memory crash
by glepore70 (Novice) on Feb 02, 2011 at 17:36 UTC
    OK, thanks to all for the help. I was confusing the row and column values. I've successfully rewritten my code to pull out one column at a time and process each column separately. I still have a few things to work out, but I'm over the hump.

    $row->[0] was the key.
