working with files

smanicka has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: working with files by Marshall (Canon) on Apr 12, 2010 at 19:08 UTC
If your input file is sorted in the right way, then each line can just be processed as it is read. Since your test data set was sorted in that way, I took advantage of that to simplify the code. If that's not a good assumption, then of course things get more complicated. #!/usr/bin/perl use strict; use warnings; $\|=1; my $line =<DATA>; my ($a,$b,$c,$d,$cnt,$str,$end) = ($line =~ m/:(\d+)/g); while (my $line2 = <DATA>) { my ($A,$B,$C,$D,$Cnt,$Str,$End) = ($line2 =~ m/:(\d+)/g); if ($a == $A and $d == $D) { $b ="multiple" if ($b ne $B); $c ="multiple" if ($c ne $C); $cnt+=$Cnt; $end=$End; next; } print "A:$a B:$b C:$c D:$d CNT:$cnt STR:$str END:$end\n"; ($a,$b,$c,$d,$cnt,$str,$end)=($A,$B,$C,$D,$Cnt,$Str,$End); } print "A:$a B:$b C:$c D:$d CNT:$cnt STR:$str END:$end\n"; =prints with your test dataset: A:1 B:multiple C:3 D:4 CNT:3 STR:1 END:4 A:2 B:2 C:3 D:5 CNT:6 STR:4 END:10 A:3 B:2 C:3 D:4 CNT:1 STR:11 END:12 with another dataset of: A:1 B:2 C:3 D:4 CNT:1 STR:1 END:2 A:1 B:7 C:3 D:4 CNT:1 STR:2 END:3 A:1 B:2 C:3 D:8 CNT:1 STR:3 END:4 A:2 B:2 C:3 D:5 CNT:1 STR:4 END:5 A:2 B:2 C:3 D:5 CNT:5 STR:5 END:10 A:3 B:2 C:3 D:4 CNT:1 STR:11 END:12 that prints: A:1 B:multiple C:3 D:4 CNT:2 STR:1 END:3 A:1 B:2 C:3 D:8 CNT:1 STR:3 END:4 A:2 B:2 C:3 D:5 CNT:6 STR:4 END:10 A:3 B:2 C:3 D:4 CNT:1 STR:11 END:12 =cut __DATA__ A:1 B:2 C:3 D:4 CNT:1 STR:1 END:2 A:1 B:7 C:3 D:4 CNT:1 STR:2 END:3 A:1 B:2 C:3 D:4 CNT:1 STR:3 END:4 A:2 B:2 C:3 D:5 CNT:1 STR:4 END:5 A:2 B:2 C:3 D:5 CNT:5 STR:5 END:10 A:3 B:2 C:3 D:4 CNT:1 STR:11 END:12 [download]	[reply] [d/l]
Re: working with files by choroba (Cardinal) on Apr 12, 2010 at 16:00 UTC
Just a little clarification needed: `A` is the key you group on, right? What `B` do you take, the first or the least? What `C`, `D`, and `STR` do you take? `CNT` seems to be summed and `END` is the last one, ok?	[reply] [d/l] [select]
Re^2: working with files by smanicka (Scribe) on Apr 12, 2010 at 16:08 UTC
sorry - i wasn't clear - I need to group on A and D and if the other things B,C etc are the same - then i would say the actual value or else say B:MULTIPLE C MULTIPLE etc.STR is the start sequence and END is the end Sequence number and the other is just a count - that sums up how many of each exist. Thanks	[reply]
Re^3: working with files by almut (Canon) on Apr 12, 2010 at 17:26 UTC
From your sample data/output it's not quite clear to me what you mean by "group on A and D"... but here's something that groups by A alone: #!/usr/bin/perl my %groups; while (<DATA>) { my ($grp, @other) = split ' '; for my $field (@other) { my ($name, $val) = split /:/, $field; if ($name eq "STR") { $groups{$grp}{$name} = $val unless defined $groups{$grp}{$ +name}; # first } elsif ($name eq "END") { $groups{$grp}{$name} = $val; # last } elsif ($name eq "CNT") { $groups{$grp}{$name} += $val; # sum } else { $groups{$grp}{$name}{$val}++; # unique values } } } #use Data::Dumper; # debug #print Dumper \%groups; for my $grp (sort keys %groups) { my $fields = $groups{$grp}; print $grp; for my $name (qw(B C D)) { my @elems = keys %{ $fields->{$name} }; my $val = @elems > 1 ? "MULTIPLE" : $elems[0]; print " $name:$val"; } for my $name (qw(CNT STR END)) { print " $name:$fields->{$name}"; } print "\n"; } __DATA__ A:1 B:2 C:3 D:4 CNT:1 STR:1 END:2 A:1 B:7 C:3 D:4 CNT:1 STR:2 END:3 A:1 B:2 C:3 D:4 CNT:1 STR:3 END:4 A:2 B:2 C:3 D:5 CNT:1 STR:4 END:5 A:2 B:2 C:3 D:5 CNT:5 STR:5 END:10 A:3 B:2 C:3 D:4 CNT:1 STR:11 END:12 [download] Output: `A:1 B:MULTIPLE C:3 D:4 CNT:3 STR:1 END:4 A:2 B:2 C:3 D:5 CNT:6 STR:4 END:10 A:3 B:2 C:3 D:4 CNT:1 STR:11 END:12` [download] (STR and END are assumed to be sorted on input — in case they aren't, you'd need to compute the minimum and maximum value instead of taking the first and the last...)	[reply] [d/l] [select]
Re^3: working with files by choroba (Cardinal) on Apr 12, 2010 at 17:09 UTC
Something like this? #!/usr/bin/perl use warnings; use strict; my %hash; while(defined(my $line = <>)){ if($line =~ /A:([0-9]+) +B:([0-9]+) +C:([0-9]+) +D:([0-9]+) +CNT:([0 +-9]+) +STR:([0-9]+) +END:([0-9]+)/){ my($a,$b,$c,$d,$cnt,$str,$end) = ($1,$2,$3,$4,$5,$6,$7); my $key = "a:$a,d:$d"; if(exists $hash{$key}){ # B and C will be reported if multiple my $current = $hash{$key}; if($current->{b} ne $b){ $current->{b} = 'MULTIPLE'; } if($current->{c} ne $c){ $current->{c} = 'MULTIPLE'; } # STR is start, we need the least if($str < $current->{str}){ $current->{str} = $str; } # END is end, we need the greatest if($end > $current->{end}){ $current->{end} = $end; } # add to count $current->{cnt} += $cnt; }else{ # new key $hash{$key}{b} = $b; $hash{$key}{c} = $c; $hash{$key}{str} = $str; $hash{$key}{end} = $end; $hash{$key}{cnt} = $cnt; } } } # print the results foreach my $key (keys %hash){ $key =~ /a:(.),d:(.)/; my $current = $hash{$key}; my ($a,$d) = ($1,$2); print "A:$a "; print "B:$current->{b} "; print "C:$current->{c} "; print "D:$d "; print "CNT:$current->{cnt} "; print "STR:$current->{str} "; print "END:$current->{end}\n"; } [download]	[reply] [d/l]


No such thing as a small change
	PerlMonks