Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl-Sensitive Sunglasses
 
PerlMonks  

Count in intervals

by rkk (Novice)
on Oct 01, 2014 at 00:01 UTC ( #1102490=perlquestion: print w/replies, xml ) Need Help??

rkk has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

My input files have following. I need to compute counts that fall in each interval mention in FILE2 using third column in FILE1.

I am looking for ways other then using SQLITE. My FILE1 has more than 10 million lines so linear search is not a good option.

File1:

so 10 0.05

so 11 0.03

so 25 0.15

so 35 0.3

so 36 0.25

so 37 1

ge 14 0.12

ge 20 0.4

File2:

so 10 20

so 30 40

ge 10 30

sample output:

so 0.08

so 1.55

ge 0.52

Thanks in advance,

rkk

Replies are listed 'Best First'.
Re: Count in intervals
by Athanasius (Archbishop) on Oct 01, 2014 at 02:45 UTC

    Hello rkk,

    Assuming FILE2 is not too large, the best strategy is to read FILE2 into a suitable data structure, then read FILE1 line-by-line and add that line’s third-column datum to the appropriate interval count(s). This avoids the need to perform any kind of search on FILE1:

    #! perl use strict; use warnings; use Data::Dump; my %intervals = ( so => { "10,20" => 0, "30,40" => 0, }, ge => { "10,30" => 0 }, ); while (my $line = <DATA>) { my ($id, $num, $datum) = split /\s+/, $line; for my $interval (keys %{ $intervals{$id} }) { my ($min, $max) = split /,/, $interval; if ($num >= $min && $num <= $max) { $intervals{$id}->{$interval} += $datum; } } } dd \%intervals; __DATA__ so 10 0.05 so 11 0.03 so 25 0.15 so 35 0.3 so 36 0.25 so 37 1 ge 14 0.12 ge 20 0.4

    Output:

    12:41 >perl 1037_SoPW.pl { ge => { "10,30" => 0.52 }, so => { "10,20" => 0.08, "30,40" => 1.55 }, } 12:41 >

    (Populating %intervals from FILE2 is left as the proverbial exercise for the reader!)

    P.S. In future, please post non-site-related questions to Seekers of Perl Wisdom. Perl Monks Discussion is reserved for issues affecting the PerlMonks site itself, not Perl in general.

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Re: Count in intervals
by wrog (Friar) on Oct 01, 2014 at 02:55 UTC

    Actually, if FILE1 were truly huge, say a billion lines, a linear pass through is not only the right thing but probably the only way you could really do it.

    What matters is how much you have to keep in memory at any given time, i.e., how many different ranges do you have to compute counts for, i.e., how big is FILE2?

    Then again 100M of memory is not that big a deal these days so even in the worst case where every line of FILE1 is ending up in a different interval, it's still doable to have all of those counts in a single hash.

    Here's my first stab (note: not tested or anything). Assumes the intervals don't overlap, but if they do you'll find out:

    use warnings; use strict; our %intervals = (); our %counts = (); open F2,"<FILE2" or die; while (<F2>) { chomp; my ($pre,$lower,$upper) = split /\s-+/; my $intvs = $intervals{$pre}; my $i = scalar( grep { $_->[0]<=$lower } @$intvs); die "overlap found: $pre $intvs->[$i-1]->[0]..$intvs->[$i-1]->[1] vs +. $lower..$upper" if( $i>0 && !$intvs->[$i-1]->[1]<=$lower ); die "overlap found: $pre $lower..$upper vs. $intvs->[$i]->[0]..$intv +s->[$i]->[1]" if( $i<@$intvs && !$upper<=$intvs->[$i]->[0] ); splice @{$intervals{$pre}},$i,0,[$lower,$upper]; } close F2; open F1,"<FILE1" or die; while (<F1>) { chomp; my ($pre,$n,$r) = split /\s-+/; my ($intv) = grep {$_->[0] <= $n && $n < $_->[1]} $intervals{$pre}; $counts{"$pre $intv->[0]"} += $r; } close F1; for (keys %counts) { print "$_ $counts{$_}\n"; }
      And if both files are really huge, you use a modified Merge sort-type algorithm. (The last pass does the interval calculation instead of just sorting items.)
Re: Count in intervals
by Discipulus (Canon) on Oct 01, 2014 at 07:20 UTC
    Seems a good task for Iterator. This module come out from an intersting chapter in 'High Order Perl' book

    HtH
    L*
    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
      Seems a good task for Iterator.

      Why? That just transforms the OP from "How do I do this?" to "How do I do this with an iterator?", and doesn't offer a solution.

      -QM
      --
      Quantum Mechanics: The dreams stuff is made of

        Discipulus' answer seems pretty fair to me.

        OP posted no code; gave no evidence that the SOPW showed any effort on the author's part; nor even followed the instructions in either PMD (to which the question was originally posted) or SOPW.

        Providing a complete answer to such a question may display the responder's skills... but it does not comport with the Monastery's mission: that we'll help people learn, but not do a job for them.


        Questions containing the words "doesn't work" (or their moral equivalent) will usually get a downvote from me unless accompanied by:
        1. code
        2. verbatim error and/or warning messages
        3. a coherent explanation of what "doesn't work actually means.
Re: Count in intervals
by CountZero (Bishop) on Oct 01, 2014 at 14:37 UTC
    In Perl's spirit of TIMTOWTDI:
    use Modern::Perl '2014'; use Number::Interval; use List::Util qw/sum/; my %raw_intervals = ( so => [ [ 10, 20 ], [ 30, 40 ] ], ge => [ [ 10, 30 ], ], ); my @interval_objects; for my $id ( keys %raw_intervals ) { for my $interval ( @{ $raw_intervals{$id} } ) { push @interval_objects, [ $id, Number::Interval->new( IncMax => 1, IncMin => 1, Min => $interval->[0], Max => $interval->[1] ) ]; } } while (<DATA>) { my ( $id, $value, $count ) = split; for my $interval (@interval_objects) { if ( $id eq $interval->[0] and $interval->[1]->contains($value +) ) { push @$interval, $count; last; } } } for my $interval (@interval_objects) { my ( $id, $interval, @values ) = @$interval; say "$id - ", $interval->stringify, ' ', sum(@values); } __DATA__ so 10 0.05 so 11 0.03 so 25 0.15 so 35 0.3 so 36 0.25 so 37 1 ge 14 0.12 ge 20 0.4
    Output:
    ge - [10,30] 0.52 so - [10,20] 0.08 so - [30,40] 1.55
    Note that in the array @interval_objects are stored all the values of each line of FILE1 for this particular interval and thereby it becomes trivially easy to calculate minimum value, maximum value, average value, count, standard deviation, ... just by applying another function from List::Utils or any other function which accepts an array as its input.

    CountZero

    A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics
Re: Count in intervals
by davido (Cardinal) on Oct 01, 2014 at 17:40 UTC

    use strict; use warnings; use constant { LOW => 0, HIGH => 1, VALUE => 2 }; my $FILE1 = <<'EOF'; so 10 0.05 so 11 0.03 so 25 0.15 so 35 0.3 so 36 0.25 so 37 1 ge 14 0.12 ge 20 0.4 EOF my $FILE2 = <<'EOF'; so 10 20 so 30 40 ge 10 30 EOF my %ranges; # Read FILE2 and create a useful datastructure. open my $ifh2, '<', \$FILE2 or die $!; while( <$ifh2> ) { chomp; next unless length; my( $cat, $low, $high ) = split /\s+/, $_; push @{$ranges{$cat}}, [ $low, $high ]; } close $ifh2; # Read FILE1 and increment counters within our datastructure. open my $ifh1, '<', \$FILE1 or die $!; while( <$ifh1> ) { chomp; next unless length; my( $cat, $target, $value ) = split /\s+/, $_; foreach my $range ( @{$ranges{$cat}} ) { $range->[VALUE] += $value if $target >= $range->[LOW] && $target <= $range->[HIGH]; } } # Dump the information we need from our datastructure. while( my( $cat, $range ) = each %ranges ) { print "$cat $_->[VALUE]\n" foreach @$range; }

    This constructs a data structure that looks like this:

    %ranges = ( 'ge' => [ [ 10, 30, '0.52' ] ], 'so' => [ [ 10, 20, '0.08' ], [ 30, 40, '1.55' ] ] );

    Once the ranges are created, FILE1 can be read line by line, and given its general category, we will iterate over the various ranges within that category, incrementing the counter any time we determine that this line of FILE1 fits within a range.

    Here's the sample output:

    so 0.08 so 1.55 ge 0.52

    Dave

Re: Count in intervals
by FloydATC (Deacon) on Oct 02, 2014 at 12:51 UTC

    Are both files 1 and 2 huge, or is it possible to keep one of them in memory while doing a single scan of the other?

    If both files are huge, is it feasible to first sort the file contents by time so the files can be scanned in paralell? In pseudocode:

    sort file1 sort file2 while (not end-of-file1 and not end-of-file2) { get next interval from file 2 get all matching records from file 1 }
    -- FloydATC

    Time flies when you don't know what you're doing

Re: Count in intervals
by codiac (Beadle) on Oct 03, 2014 at 10:53 UTC
    10M rows in sqlite isn't that large. Have you declared indexes?

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1102490]
Approved by GrandFather
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others examining the Monastery: (3)
As of 2023-12-04 15:55 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    What's your preferred 'use VERSION' for new CPAN modules in 2023?











    Results (25 votes). Check out past polls.

    Notices?