Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change
 
PerlMonks  

working with files

by smanicka (Scribe)
on Apr 12, 2010 at 15:52 UTC ( [id://834299]=perlquestion: print w/replies, xml ) Need Help??

smanicka has asked for the wisdom of the Perl Monks concerning the following question:

hello

I have a file that looks like the example below - and i need a snippet of code that condenses the file as shown in the example - i can read the file line by line - but can anyone tell me how to compare with the rest of the file?

A:1 B:2 C:3 D:4 CNT:1 STR:1 END:2 A:1 B:7 C:3 D:4 CNT:1 STR:2 END:3 A:1 B:2 C:3 D:4 CNT:1 STR:3 END:4 A:2 B:2 C:3 D:5 CNT:1 STR:4 END:5 A:2 B:2 C:3 D:5 CNT:5 STR:5 END:10 A:3 B:2 C:3 D:4 CNT:1 STR:11 END:12
should become
A:1 B:MULTIPLE C:3 D:4 CNT:3 STR:1 END:4 A:2 B:2 C:3 D:5 CNT:6 STR:4 END:10 A:3 B:2 C:3 D:4 CNT:1 STR:11 END:12
Any help would be appreciated.

Thanks

Replies are listed 'Best First'.
Re: working with files
by Marshall (Canon) on Apr 12, 2010 at 19:08 UTC
    If your input file is sorted in the right way, then each line can just be processed as it is read. Since your test data set was sorted in that way, I took advantage of that to simplify the code. If that's not a good assumption, then of course things get more complicated.
    #!/usr/bin/perl use strict; use warnings; $|=1; my $line =<DATA>; my ($a,$b,$c,$d,$cnt,$str,$end) = ($line =~ m/:(\d+)/g); while (my $line2 = <DATA>) { my ($A,$B,$C,$D,$Cnt,$Str,$End) = ($line2 =~ m/:(\d+)/g); if ($a == $A and $d == $D) { $b ="multiple" if ($b ne $B); $c ="multiple" if ($c ne $C); $cnt+=$Cnt; $end=$End; next; } print "A:$a B:$b C:$c D:$d CNT:$cnt STR:$str END:$end\n"; ($a,$b,$c,$d,$cnt,$str,$end)=($A,$B,$C,$D,$Cnt,$Str,$End); } print "A:$a B:$b C:$c D:$d CNT:$cnt STR:$str END:$end\n"; =prints with your test dataset: A:1 B:multiple C:3 D:4 CNT:3 STR:1 END:4 A:2 B:2 C:3 D:5 CNT:6 STR:4 END:10 A:3 B:2 C:3 D:4 CNT:1 STR:11 END:12 with another dataset of: A:1 B:2 C:3 D:4 CNT:1 STR:1 END:2 A:1 B:7 C:3 D:4 CNT:1 STR:2 END:3 A:1 B:2 C:3 D:8 CNT:1 STR:3 END:4 A:2 B:2 C:3 D:5 CNT:1 STR:4 END:5 A:2 B:2 C:3 D:5 CNT:5 STR:5 END:10 A:3 B:2 C:3 D:4 CNT:1 STR:11 END:12 that prints: A:1 B:multiple C:3 D:4 CNT:2 STR:1 END:3 A:1 B:2 C:3 D:8 CNT:1 STR:3 END:4 A:2 B:2 C:3 D:5 CNT:6 STR:4 END:10 A:3 B:2 C:3 D:4 CNT:1 STR:11 END:12 =cut __DATA__ A:1 B:2 C:3 D:4 CNT:1 STR:1 END:2 A:1 B:7 C:3 D:4 CNT:1 STR:2 END:3 A:1 B:2 C:3 D:4 CNT:1 STR:3 END:4 A:2 B:2 C:3 D:5 CNT:1 STR:4 END:5 A:2 B:2 C:3 D:5 CNT:5 STR:5 END:10 A:3 B:2 C:3 D:4 CNT:1 STR:11 END:12
Re: working with files
by choroba (Cardinal) on Apr 12, 2010 at 16:00 UTC
    Just a little clarification needed: A is the key you group on, right? What B do you take, the first or the least? What C, D, and STR do you take? CNT seems to be summed and END is the last one, ok?

      sorry - i wasn't clear - I need to group on A and D and if the other things B,C etc are the same - then i would say the actual value or else say B:MULTIPLE C MULTIPLE etc.STR is the start sequence and END is the end Sequence number and the other is just a count - that sums up how many of each exist.

      Thanks

        From your sample data/output it's not quite clear to me what you mean by "group on A and D"... but here's something that groups by A alone:

        #!/usr/bin/perl my %groups; while (<DATA>) { my ($grp, @other) = split ' '; for my $field (@other) { my ($name, $val) = split /:/, $field; if ($name eq "STR") { $groups{$grp}{$name} = $val unless defined $groups{$grp}{$ +name}; # first } elsif ($name eq "END") { $groups{$grp}{$name} = $val; # last } elsif ($name eq "CNT") { $groups{$grp}{$name} += $val; # sum } else { $groups{$grp}{$name}{$val}++; # unique values } } } #use Data::Dumper; # debug #print Dumper \%groups; for my $grp (sort keys %groups) { my $fields = $groups{$grp}; print $grp; for my $name (qw(B C D)) { my @elems = keys %{ $fields->{$name} }; my $val = @elems > 1 ? "MULTIPLE" : $elems[0]; print " $name:$val"; } for my $name (qw(CNT STR END)) { print " $name:$fields->{$name}"; } print "\n"; } __DATA__ A:1 B:2 C:3 D:4 CNT:1 STR:1 END:2 A:1 B:7 C:3 D:4 CNT:1 STR:2 END:3 A:1 B:2 C:3 D:4 CNT:1 STR:3 END:4 A:2 B:2 C:3 D:5 CNT:1 STR:4 END:5 A:2 B:2 C:3 D:5 CNT:5 STR:5 END:10 A:3 B:2 C:3 D:4 CNT:1 STR:11 END:12

        Output:

        A:1 B:MULTIPLE C:3 D:4 CNT:3 STR:1 END:4 A:2 B:2 C:3 D:5 CNT:6 STR:4 END:10 A:3 B:2 C:3 D:4 CNT:1 STR:11 END:12

        (STR and END are assumed to be sorted on input — in case they aren't, you'd need to compute the minimum and maximum value instead of taking the first and the last...)

        Something like this?
        #!/usr/bin/perl use warnings; use strict; my %hash; while(defined(my $line = <>)){ if($line =~ /A:([0-9]+) +B:([0-9]+) +C:([0-9]+) +D:([0-9]+) +CNT:([0 +-9]+) +STR:([0-9]+) +END:([0-9]+)/){ my($a,$b,$c,$d,$cnt,$str,$end) = ($1,$2,$3,$4,$5,$6,$7); my $key = "a:$a,d:$d"; if(exists $hash{$key}){ # B and C will be reported if multiple my $current = $hash{$key}; if($current->{b} ne $b){ $current->{b} = 'MULTIPLE'; } if($current->{c} ne $c){ $current->{c} = 'MULTIPLE'; } # STR is start, we need the least if($str < $current->{str}){ $current->{str} = $str; } # END is end, we need the greatest if($end > $current->{end}){ $current->{end} = $end; } # add to count $current->{cnt} += $cnt; }else{ # new key $hash{$key}{b} = $b; $hash{$key}{c} = $c; $hash{$key}{str} = $str; $hash{$key}{end} = $end; $hash{$key}{cnt} = $cnt; } } } # print the results foreach my $key (keys %hash){ $key =~ /a:(.*),d:(.*)/; my $current = $hash{$key}; my ($a,$d) = ($1,$2); print "A:$a "; print "B:$current->{b} "; print "C:$current->{c} "; print "D:$d "; print "CNT:$current->{cnt} "; print "STR:$current->{str} "; print "END:$current->{end}\n"; }

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://834299]
Approved by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others having an uproarious good time at the Monastery: (6)
As of 2024-04-25 15:02 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found