Re: Text CSV_XS memory crash

by spazm (Monk)
on Feb 02, 2011 at 18:13 UTC (#885810)

in reply to Text CSV_XS memory crash

I have modified your code to process the incoming stream looking for min, max and unique values. Configure @min_, @max_ and @unique_ columns to control which fields are processed.

#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV_XS;
use Data::Dumper;

my $file = "./test.csv";

# Columns to track; edit these lists to control which fields are processed.
my @min_columns    = qw( ozone ozone_f qa_code ozone_8hr );
my @max_columns    = qw( ozone ozone_f qa_code rank );
my @unique_columns = qw( site_id );

my $data;
my $counter = 0;

my $csv = Text::CSV_XS->new({
    binary    => 1,
    auto_diag => 1,
}) or die "Cannot use CSV: " . Text::CSV_XS->error_diag();

open my $FH2, "<:encoding(utf8)", $file or die "$file: $!";

# The first line holds the column names; trim surrounding whitespace.
my $columns = $csv->getline($FH2);
for (@$columns) { s/^\s+//; s/\s+$// }
$csv->column_names(@$columns);

# Stream the file one row at a time instead of loading it all into memory.
while (my $row = $csv->getline_hr($FH2)) {
    process_row($row);
}
print "done with loop after $counter rows\n";
$csv->eof or $csv->error_diag();
close $FH2;

print Dumper $data;

sub process_row {
    my $row = shift;
    $counter++;
    print "processing: $counter\n" if ($counter % 100 == 0);

    # Keep the row holding the smallest value seen so far for each column.
    for my $column (@min_columns) {
        my $val    = $row->{$column};
        my $d      = $data->{$column}{min};
        my $update = { value => $val, row => $row };
        $data->{$column}{min} =
              !exists $d->{value} ? $update
            : $d->{value} > $val  ? $update
            :                       $d;
    }

    # Keep the row holding the largest value seen so far for each column.
    for my $column (@max_columns) {
        my $val    = $row->{$column};
        my $d      = $data->{$column}{max};
        my $update = { value => $val, row => $row };
        $data->{$column}{max} =
              !exists $d->{value} ? $update
            : $d->{value} < $val  ? $update
            :                       $d;
    }

    # Count occurrences of each distinct value.
    for my $column (@unique_columns) {
        my $val = $row->{$column};
        $data->{$column}{unique}{$val}++;   # could just be =1, but now we can get unique and count distinct
    }
}

I switched from Text::CSV_PP to Text::CSV_XS, as the example data caused a problem around line 296 under PP (parse error 2025, loose escape char).

This is an excellent spot to use DataCube, but that module is gone from CPAN now. I would like David to fix it and put it back. Old releases are still on BackPAN: DataCube

If you're going to be doing ad-hoc analysis of CSV files, the R language may prove useful.

Edit: added updates from Tux's comment

Re^2: Text CSV_XS memory crash
by Tux (Abbot) on Feb 02, 2011 at 18:25 UTC
    • If you already use Text::CSV_XS, there is no need to also use Text::CSV.
    • If you are using getline_hr () with column_names (), why not take the extra step to use bind_columns () and regain all the speed/performance you just lost by using hash references?
    • The comma is the default sep_char value. You should never need to pass that to the constructor.
    • You don't need any error_diag () call if you pass auto_diag => 1 to the constructor.

    Enjoy, Have FUN! H.Merijn
      Tux, thanks for replying to my post and reviewing my code!
      1. Yes, good point, no need to explicitly include use Text::CSV
      2. Because I don't want bound vars, I want a hashref. I use the hash for readability of the data and code for this test example.
      3. True, comma is the default. Leftover from modifying the original code. I'll clean that up.
      4. auto_diag sounds like a useful change for both scripts (mine and original poster's).
      changes applied.

        2. You can still have a hashref:

        my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });
        open my $fh2, "<:encoding(utf8)", $file or die "$file: $!";
        my $columns = $csv->getline ($fh2);
        for (@$columns) { s/^\s+//; s/\s+$//; }
        my %row;
        $csv->bind_columns (\@row{@$columns});
        while ($csv->getline ($fh2)) {
            process_row (\%row);
        }

        Enjoy, Have FUN! H.Merijn
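
For readers unfamiliar with the \@row{...} idiom in the bind_columns () call: a backslash applied to a hash slice yields one scalar reference per element, so writing through those references updates the hash in place. The following is a minimal core-Perl sketch of that mechanism only. It does not use Text::CSV_XS; the column names, the sample values and the fill_bound helper are hypothetical stand-ins for what the parser does with bound columns.

```perl
use strict;
use warnings;

my @columns = qw(site_id ozone rank);   # hypothetical header row
my %row;

# A backslash on a hash slice returns a list of references,
# one per hash element (the elements are created as needed).
my @refs = \@row{@columns};

# Hypothetical stand-in for a parser with bound columns:
# store each parsed field through the matching reference.
sub fill_bound {
    my @fields = @_;
    ${ $refs[$_] } = $fields[$_] for 0 .. $#fields;
}

fill_bound("A1", "0.031", "4");
print "$row{site_id} $row{ozone} $row{rank}\n";

fill_bound("B2", "0.044", "1");          # same %row, new contents
print "$row{site_id} $row{ozone} $row{rank}\n";
```

One consequence worth keeping in mind: the same %row is overwritten on every line, so code that stores the reference it was handed (for example row => $row) would end up with every stored entry pointing at the latest line. Storing a shallow copy such as { %$row } avoids that.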
Re^2: Text CSV_XS memory crash
by spazm (Monk) on Feb 02, 2011 at 18:21 UTC
    This code will complete in O(n ln n) time using O( m ) space, where n is the number of rows and m is the number of distinct values in the columns tracked for uniqueness.
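
    A rough way to see the space bound: in a single pass, the only state that grows is the hash of distinct values. The sketch below is an illustration under assumptions, not the script above. The rows are generated on the fly as a hypothetical stand-in for a parsed CSV stream, and only one min, one max and one uniqueness column are tracked.

```perl
use strict;
use warnings;

my (%seen, $min, $max);

# Hypothetical stream of 1000 rows; each row is discarded after use.
for my $i (1 .. 1000) {
    my $row = { site_id => "S" . ($i % 3), ozone => ($i * 7) % 11 };

    my $v = $row->{ozone};
    $min = $v if !defined $min || $v < $min;   # constant space
    $max = $v if !defined $max || $v > $max;   # constant space
    $seen{ $row->{site_id} }++;                # one entry per distinct value
}

printf "min=%s max=%s distinct=%d\n", $min, $max, scalar keys %seen;
```

    However many rows go through the loop, %seen holds one key per distinct site_id, which is the m in the estimate above.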
