rkk has asked for the wisdom of the Perl Monks concerning the following question:
Hello,
My input files are as follows. I need to compute the counts that fall in each interval mentioned in FILE2, using the third column in FILE1.
I am looking for ways other than using SQLite. My FILE1 has more than 10 million lines, so a linear search is not a good option.
File1:
so 10 0.05
so 11 0.03
so 25 0.15
so 35 0.3
so 36 0.25
so 37 1
ge 14 0.12
ge 20 0.4
File2:
so 10 20
so 30 40
ge 10 30
sample output:
so 0.08
so 1.55
ge 0.52
Thanks in advance,
rkk
Re: Count in intervals
by Athanasius (Archbishop) on Oct 01, 2014 at 02:45 UTC
Hello rkk,
Assuming FILE2 is not too large, the best strategy is to read FILE2 into a suitable data structure, then read FILE1 line-by-line and add that line’s third-column datum to the appropriate interval count(s). This avoids the need to perform any kind of search on FILE1:
#! perl
use strict;
use warnings;
use Data::Dump;

my %intervals =
(
    so => { "10,20" => 0, "30,40" => 0, },
    ge => { "10,30" => 0 },
);

while (my $line = <DATA>)
{
    my ($id, $num, $datum) = split /\s+/, $line;
    for my $interval (keys %{ $intervals{$id} })
    {
        my ($min, $max) = split /,/, $interval;
        if ($num >= $min && $num <= $max)
        {
            $intervals{$id}->{$interval} += $datum;
        }
    }
}

dd \%intervals;
__DATA__
so 10 0.05
so 11 0.03
so 25 0.15
so 35 0.3
so 36 0.25
so 37 1
ge 14 0.12
ge 20 0.4
Output:
12:41 >perl 1037_SoPW.pl
{
ge => { "10,30" => 0.52 },
so => { "10,20" => 0.08, "30,40" => 1.55 },
}
12:41 >
(Populating %intervals from FILE2 is left as the proverbial exercise for the reader!)
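For what it's worth, a minimal sketch of that exercise might look like this (assuming FILE2 uses the three-column "id min max" format shown above; the data is inlined via a string filehandle to keep the snippet self-contained):

```perl
use strict;
use warnings;

# FILE2-style data inlined for a self-contained demo;
# in practice you would open 'FILE2' instead.
my $file2 = "so 10 20\nso 30 40\nge 10 30\n";
open my $fh, '<', \$file2 or die $!;

my %intervals;
while (my $line = <$fh>) {
    my ($id, $min, $max) = split /\s+/, $line;
    $intervals{$id}{"$min,$max"} = 0;    # count starts at zero
}
close $fh;

# %intervals is now:
# ( so => { "10,20" => 0, "30,40" => 0 }, ge => { "10,30" => 0 } )
```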
P.S. In future, please post non-site-related questions to Seekers of Perl Wisdom. Perl Monks Discussion is reserved for issues affecting the PerlMonks site itself, not Perl in general.
Hope that helps,
Re: Count in intervals
by wrog (Friar) on Oct 01, 2014 at 02:55 UTC
Actually, if FILE1 were truly huge, say a billion lines, a linear pass through is not only the right thing but probably the only way you could really do it.
What matters is how much you have to keep in memory at any given time, i.e., how many different ranges do you have to compute counts for, i.e., how big is FILE2?
Then again, 100M of memory is not that big a deal these days, so even in the worst case, where every line of FILE1 ends up in a different interval, it's still doable to keep all of those counts in a single hash.
Here's my first stab (note: not tested or anything). Assumes the intervals don't overlap, but if they do you'll find out:
use warnings;
use strict;

our %intervals = ();
our %counts    = ();

open my $f2, '<', 'FILE2' or die "FILE2: $!";
while (<$f2>) {
    chomp;
    my ($pre, $lower, $upper) = split /\s+/;
    my $intvs = $intervals{$pre} //= [];
    # insertion point: number of existing intervals starting at or before $lower
    my $i = scalar( grep { $_->[0] <= $lower } @$intvs );
    die "overlap found: $pre $intvs->[$i-1][0]..$intvs->[$i-1][1] vs. $lower..$upper"
        if $i > 0 && $intvs->[$i-1][1] > $lower;
    die "overlap found: $pre $lower..$upper vs. $intvs->[$i][0]..$intvs->[$i][1]"
        if $i < @$intvs && $upper > $intvs->[$i][0];
    splice @$intvs, $i, 0, [ $lower, $upper ];
}
close $f2;

open my $f1, '<', 'FILE1' or die "FILE1: $!";
while (<$f1>) {
    chomp;
    my ($pre, $n, $r) = split /\s+/;
    # half-open intervals here: lower bound inclusive, upper bound exclusive
    my ($intv) = grep { $_->[0] <= $n && $n < $_->[1] } @{ $intervals{$pre} // [] };
    $counts{"$pre $intv->[0]"} += $r if $intv;
}
close $f1;

print "$_ $counts{$_}\n" for keys %counts;
And if both files are really huge, you can use a modified merge-sort-style algorithm. (The last pass does the interval calculation instead of just sorting items.)
Re: Count in intervals
by Discipulus (Canon) on Oct 01, 2014 at 07:20 UTC
Discipulus' answer seems pretty fair to me.
The OP posted no code, gave no evidence of any effort in the SOPW, and did not even follow the instructions in either PMD (to which the question was originally posted) or SOPW.
Providing a complete answer to such a question may display the responder's skills... but it does not comport with the Monastery's mission: that we'll help people learn, but not do a job for them.
Questions containing the words "doesn't work" (or their moral equivalent) will usually get a downvote from me unless accompanied by:
- code
- verbatim error and/or warning messages
- a coherent explanation of what "doesn't work" actually means.
Re: Count in intervals
by CountZero (Bishop) on Oct 01, 2014 at 14:37 UTC
In Perl's spirit of TIMTOWTDI:
use Modern::Perl '2014';
use Number::Interval;
use List::Util qw/sum/;

my %raw_intervals = (
    so => [ [ 10, 20 ], [ 30, 40 ] ],
    ge => [ [ 10, 30 ], ],
);
my @interval_objects;
for my $id ( keys %raw_intervals ) {
    for my $interval ( @{ $raw_intervals{$id} } ) {
        push @interval_objects,
            [
                $id,
                Number::Interval->new(
                    IncMax => 1,
                    IncMin => 1,
                    Min    => $interval->[0],
                    Max    => $interval->[1]
                )
            ];
    }
}
while (<DATA>) {
    my ( $id, $value, $count ) = split;
    for my $interval (@interval_objects) {
        if ( $id eq $interval->[0] and $interval->[1]->contains($value) ) {
            push @$interval, $count;
            last;
        }
    }
}
for my $interval (@interval_objects) {
    my ( $id, $interval, @values ) = @$interval;
    say "$id - ", $interval->stringify, ' ', sum(@values);
}
__DATA__
so 10 0.05
so 11 0.03
so 25 0.15
so 35 0.3
so 36 0.25
so 37 1
ge 14 0.12
ge 20 0.4
Output:
ge - [10,30] 0.52
so - [10,20] 0.08
so - [30,40] 1.55
Note that the array @interval_objects stores all the values of each matching line of FILE1 for its particular interval, so it becomes trivially easy to calculate the minimum value, maximum value, average value, count, standard deviation, ... just by applying another function from List::Util, or any other function that accepts an array as its input.
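For instance, a minimal sketch of such a calculation (the @values here are hard-coded to the two sample values that land in the "so" 10..20 interval, rather than collected from FILE1):

```perl
use strict;
use warnings;
use List::Util qw/sum min max/;

# Values accumulated for one interval, e.g. "so [10,20]" from the sample data.
my @values = (0.05, 0.03);

my $count = scalar @values;
my $total = sum @values;
printf "count=%d total=%s min=%s max=%s mean=%s\n",
    $count, $total, min(@values), max(@values), $total / $count;
# prints: count=2 total=0.08 min=0.03 max=0.05 mean=0.04
```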
CountZero
"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
My blog: Imperial Deltronics
Re: Count in intervals
by davido (Cardinal) on Oct 01, 2014 at 17:40 UTC
use strict;
use warnings;

use constant { LOW => 0, HIGH => 1, VALUE => 2 };

my $FILE1 = <<'EOF';
so 10 0.05
so 11 0.03
so 25 0.15
so 35 0.3
so 36 0.25
so 37 1
ge 14 0.12
ge 20 0.4
EOF

my $FILE2 = <<'EOF';
so 10 20
so 30 40
ge 10 30
EOF

my %ranges;

# Read FILE2 and create a useful datastructure.
open my $ifh2, '<', \$FILE2 or die $!;
while( <$ifh2> ) {
    chomp;
    next unless length;
    my( $cat, $low, $high ) = split /\s+/, $_;
    push @{$ranges{$cat}}, [ $low, $high, 0 ];    # VALUE slot starts at zero
}
close $ifh2;

# Read FILE1 and increment counters within our datastructure.
open my $ifh1, '<', \$FILE1 or die $!;
while( <$ifh1> ) {
    chomp;
    next unless length;
    my( $cat, $target, $value ) = split /\s+/, $_;
    foreach my $range ( @{$ranges{$cat}} ) {
        $range->[VALUE] += $value
            if $target >= $range->[LOW] && $target <= $range->[HIGH];
    }
}

# Dump the information we need from our datastructure.
while( my( $cat, $range ) = each %ranges ) {
    print "$cat $_->[VALUE]\n" foreach @$range;
}
This constructs a data structure that looks like this:
%ranges = (
    'ge' => [
        [ 10, 30, '0.52' ]
    ],
    'so' => [
        [ 10, 20, '0.08' ],
        [ 30, 40, '1.55' ]
    ]
);
Once the ranges are created, FILE1 can be read line by line, and given its general category, we will iterate over the various ranges within that category, incrementing the counter any time we determine that this line of FILE1 fits within a range.
Here's the sample output:
so 0.08
so 1.55
ge 0.52
Re: Count in intervals
by FloydATC (Deacon) on Oct 02, 2014 at 12:51 UTC
Are both files 1 and 2 huge, or is it possible to keep one of them in memory while doing a single scan of the other?
If both files are huge, is it feasible to first sort the file contents by key so the files can be scanned in parallel? In pseudocode:
sort file1
sort file2
while (not end-of-file1 and not end-of-file2) {
    get next interval from file 2
    get all matching records from file 1
}
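A rough Perl sketch of that pseudocode (assuming both files are already sorted by id and then by position, that intervals within an id don't overlap, and using the OP's record layout; the files are inlined here so the snippet is self-contained):

```perl
use strict;
use warnings;

# Pre-sorted inputs (sorted by id, then by position), inlined for a
# self-contained demo; real code would sort FILE1/FILE2 first and open them.
my $file1 = <<'EOF';
ge 14 0.12
ge 20 0.4
so 10 0.05
so 11 0.03
so 25 0.15
so 35 0.3
so 36 0.25
so 37 1
EOF
my $file2 = <<'EOF';
ge 10 30
so 10 20
so 30 40
EOF

open my $f1, '<', \$file1 or die $!;
open my $f2, '<', \$file2 or die $!;

my @out;
my $rec = <$f1>;    # one-record lookahead into FILE1
while (my $intv = <$f2>) {
    my ($id, $lo, $hi) = split ' ', $intv;
    my $sum = 0;
    # Advance through FILE1 until the current record passes this interval.
    while (defined $rec) {
        my ($rid, $pos, $val) = split ' ', $rec;
        last if $rid gt $id or ($rid eq $id and $pos > $hi);
        $sum += $val if $rid eq $id and $pos >= $lo;
        $rec = <$f1>;
    }
    push @out, "$id $sum";
}
print "$_\n" for @out;
# prints:
# ge 0.52
# so 0.08
# so 1.55
```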
-- FloydATC
Time flies when you don't know what you're doing
Re: Count in intervals
by codiac (Beadle) on Oct 03, 2014 at 10:53 UTC
10M rows in SQLite isn't that large. Have you declared indexes?
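For reference, a sketch of what that could look like (the table and column names are hypothetical, since the OP didn't show a schema):

```sql
-- Hypothetical schema for FILE1's three columns.
CREATE TABLE file1 (id TEXT, position INTEGER, datum REAL);

-- A composite index on (id, position) lets SQLite answer range
-- queries with an index seek instead of a full table scan.
CREATE INDEX idx_file1_id_pos ON file1 (id, position);

-- Each interval from FILE2 then becomes one indexed range query:
SELECT SUM(datum) FROM file1 WHERE id = 'so' AND position BETWEEN 10 AND 20;
```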