Re: find common data in multiple files
by tybalt89 (Monsignor) on Dec 28, 2017 at 14:34 UTC
#!/usr/bin/perl
use strict;
use warnings;
my %all = map{$_, $_} do{local @ARGV = shift; <>};
%all = map{$_, $_} grep defined, @all{do{local @ARGV = $_; <>}} for @ARGV;
print sort keys %all;
Outputs:
ID121 ABC14
ID122 EFG87
ID157 TSR11
which looks right.
Hash slices can be your friend :)
Re: find common data in multiple files
by BrowserUk (Patriarch) on Dec 28, 2017 at 10:56 UTC
my %hash;
++$hash{ $_ } while <>;
It will add and increment an entry in the hash for each line of every file named on the command line.
Then all you need to do is run through the hash and print out any key with a value equal to the number of files: $hash{ $_ } == @ARGV and print for keys %hash;
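Put together, the counting approach described above might look like the sketch below. This is my own illustration, not BrowserUk's exact script; note that `<>` shifts file names off @ARGV as it reads, so the file count has to be saved before the loop for the final comparison to work.

```perl
#!/usr/bin/perl
# Sketch of the counting approach: count every line across all files,
# then print the lines whose count equals the number of files.
# Assumes no duplicate lines within any single file.
use strict;
use warnings;

my $files = @ARGV;              # save the count: <> empties @ARGV as it reads
my %hash;
chomp, ++$hash{ $_ } while <>;  # chomp so keys don't carry newlines

$hash{ $_ } == $files and print "$_\n" for sort keys %hash;
```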
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
In the absence of evidence, opinion is indistinguishable from prejudice.
Suck that fhit
Yes, I was thinking about a similar solution. Short, easy, and efficient. Please note, however, that this will work properly only provided there are no duplicate entries in the individual input files.
Re: find common data in multiple files
by thanos1983 (Parson) on Dec 28, 2017 at 10:18 UTC
Hello mao9856,
Since you are not telling us what the problem is (e.g. whether the script is not running, or is not producing the desired output), we cannot assist you with just a quick look.
A similar question parse multiple text files keep unique lines only was asked in the past and maybe you can find a possible solution to your problem that many Monks have tackled elegantly.
Update: I just tried to execute your sample of code, and it does not run. It looks like you found the code somewhere, pasted it here, and asked for someone to solve it for you. Can you show a minimum amount of effort and first make the script executable?
Update 2: I had some time to kill, so I put together this script that more or less does what you want. It reads all files from @ARGV, processes every line, and then keeps only the lines that are common to all files. It assumes that matching lines are always identical as a whole and that there are no combinations; by combinations I mean cases where you want to detect lines that are only partially duplicated.
Sample of code:
Update 2 continued: In case you want to detect lines where only the $key or only the $value appears as a duplicate, you can easily do it like this.
Sample of code:
Update 2 continued: I used the module List::MoreUtils, and more specifically the function List::MoreUtils/duplicates, which "returns a new list by stripping values in LIST occurring less than twice". The DATA that I used are from the sample DATA files that you provided us.
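The inline code samples referred to above did not survive extraction. As a rough illustration of the duplicates function mentioned here (a hypothetical sketch, not the original code), it could be used like this:

```perl
#!/usr/bin/perl
# Hypothetical sketch, not the original sample: collect all lines from the
# files on the command line, then keep those occurring at least twice.
# With duplicate-free files and exactly two input files, the duplicates are
# precisely the common lines; for more files, the frequency-based script
# below is needed to require presence in *all* files.
use strict;
use warnings;

use List::MoreUtils 'duplicates';

my @lines;
while (<>) {
    chomp;
    push @lines, $_;
}

print "$_\n" for duplicates @lines;
```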
Hope this helps, BR.
Seeking for Perl wisdom...on the process of learning...not there...yet!
Thank you for the help.
I tried to write this code based on my understanding. Please excuse me, I am a complete beginner in Perl.
My data contain unique IDs (ID157) and names (TSR11) separated by a tab. I want to check whether both the ID and the name (ID157 TSR11) are present in all 25 files; if ID157 TSR11 is present in all 25 files, it should be printed in the output.
That is, I want to print only those IDs and names that are present in all 25 files, with the ID and name printed together, separated by a tab, as: ID157 TSR11.
I am less familiar with using Perl modules, but I am trying my best.
Hello again mao9856,
Not knowing is not a problem; nobody started coding already knowing everything. This forum is open and free for people to learn and contribute. You provided us a bit of non-working code: you either got it from someone, or you modified it to the point that it stopped working. It is good practice, for you and for all of us, to provide a working sample of code that shows what you have tried and where you got stuck. Providing you a solution would be really easy, but it would not mean your problem is resolved, since you would not learn anything from it.
Having said that, here is a sample of code that does what you want.
#!/usr/bin/perl
use strict;
use warnings;

use Data::Dumper;
use List::MoreUtils 'frequency';

my @lines;
my $numberOfFiles = scalar @ARGV;

while (<>) {
    next if /^\s*$/;   # skip empty lines (remove if not needed)
    chomp;
    push @lines, $_;
} continue {
    close ARGV if eof; # Not eof()!
}

my @frequencyLines = frequency @lines;
my %frequencyHash  = @frequencyLines;

my @unwanted;
foreach my $key (keys %frequencyHash) {
    if ($frequencyHash{$key} != $numberOfFiles) {
        push @unwanted, $key;
    }
}

delete @frequencyHash{@unwanted};
my @matches = keys %frequencyHash;
print Dumper \@matches;
__END__
$ perl test.pl File1.txt File2.txt File3.txt
$VAR1 = [
'ID122 EFG87',
'ID121 ABC14',
'ID157 TSR11'
];
I used as input the first 3 files (File1 to File3) from the DATA that you provided us.
Hope this helps, BR.
Seeking for Perl wisdom...on the process of learning...not there...yet!
Re: find common data in multiple files
by Discipulus (Canon) on Dec 28, 2017 at 10:20 UTC
Hello mao9856,
Have you run the code you posted? @counts and, even more so, open my A, "<", "@counts "could not open file1 $!"; make no sense at all.
Also, foreach (@ARGV) makes no sense: in fact you are never using $_, the implicit variable filled for you by foreach when you do not specify a named variable.
So the beginning of your program would be better written as:
use strict;
use warnings;

my %result;
# my @counts = ();    # no need for this
foreach (@ARGV) {
    # my $column = 0; # no need for this either
    open my $fh, "<", $_ or die "could not open file [$_] $!";
    while (<$fh>) {
        chomp;
        ...
Use print to debug your program: print "DEBUG: working on file [$_]\n" as the first line of the loop will confirm that you are reading all files.
Then you do not need to split your lines, since you want to check for the presence of the whole ID121 ABC14 line in every file. So put the whole line in as the key of the results hash and ++ it, as you are doing.
A tip: since you need only the strings that are in all files, add every entry of the first file, and then for the following files just ++ the keys that are already present in the hash.
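That tip might be sketched as follows. This is my own illustration of the idea, assuming (like the other counting solutions in this thread) that no file contains duplicate lines.

```perl
#!/usr/bin/perl
# Sketch of the tip above: seed the hash from the first file, then for the
# remaining files only increment keys that are already present. Lines seen
# in every file end up with a count equal to the number of files.
# Assumes no duplicate lines within any single file.
use strict;
use warnings;

my %result;
my $first = 1;
foreach my $file (@ARGV) {
    open my $fh, "<", $file or die "could not open file [$file] $!";
    while (<$fh>) {
        chomp;
        if ($first) {
            $result{$_} = 1;                      # seed from the first file
        } else {
            $result{$_}++ if exists $result{$_};  # count only known keys
        }
    }
    $first = 0;
}

my $count = @ARGV;
print "$_\n" for grep { $result{$_} == $count } sort keys %result;
```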
L*
There are no rules, there are no thumbs..
Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
Hello Discipulus, I haven't tried the code, but I wrote it based on my understanding. Please excuse me, I am a complete beginner in Perl. my @counts=() was written to hold the number of files, since when running this program I thought I would use: perl prog.pl *.txt. And @ARGV was used because the names of the files vary. As per the suggestions, I am trying to run the new code. Thank you for the suggestions :)
Re: find common data in multiple files
by kcott (Archbishop) on Dec 28, 2017 at 21:15 UTC
G'day mao9856,
I'd read through one file and store all of its data in a hash;
then read through the remaining files, removing hash data that wasn't common.
Given these files (in the spoiler) using data from your OP:
$ cat pm_1206312_in1
ID121 ABC14
ID122 EFG87
ID145 XYZ43
ID157 TSR11
ID181 ABC31
ID962 YTS27
ID567 POH70
ID921 BAMD80
$ cat pm_1206312_in2
ID111 RET61
ID157 TSR11
ID181 ABC31
ID962 YTS27
ID452 FYU098
ID121 ABC14
ID122 EFG87
$ cat pm_1206312_in3
ID121 ABC14
ID612 FLOW12
ID122 EFG87
ID745 KIDP36
ID145 XYZ43
ID157 TSR11
$ cat pm_1206312_in25
ID122 EFG87
ID809 EYE24
ID157 TSR11
ID921 BAMD80
ID389 TOP30
ID121 ABC14
This code:
#!/usr/bin/env perl

use strict;
use warnings;
use autodie;

my @files = glob 'pm_1206312_in*';
my %uniq;

{
    open my $fh, '<', shift @files;
    while (<$fh>) {
        my ($k, $v) = split;
        $uniq{$k} = $v;
    }
}

for my $file (@files) {
    my %data;
    open my $fh, '<', $file;
    while (<$fh>) {
        my ($k, $v) = split;
        $data{$k} = $v;
    }
    for (keys %uniq) {
        delete $uniq{$_} unless exists $data{$_} and $uniq{$_} eq $data{$_};
    }
}

printf "%s %s\n", $_, $uniq{$_} for sort keys %uniq;
Produces this output:
ID121 ABC14
ID122 EFG87
ID157 TSR11
Declare hash
SINGLE BLOCK (reading one file):
    Populate hash
LOOP BLOCK (reading all other files):
    Remove data that isn't common from hash
Print hash data
If you move the Print operation into LOOP BLOCK, you'll get multiple (24) groups of output. That's not what you want, and it would have been plainly obvious if you'd done that, so you've probably done something different to what you've described.
You've said "I am very beginner of perl" in a couple of places.
I suspect you haven't understood the anonymous block I used in SINGLE BLOCK
and ended up with logic more like this:
Declare hash
start SINGLE BLOCK
    Populate hash
    LOOP BLOCK
    Print hash data
end SINGLE BLOCK
An anonymous block is just code wrapped in braces:
{
# code here
}
I've used it to provide a limited lexical scope.
The variables ($fh, $k and $v) that I've declared in that block, only exist in that block;
they are quite different to, and cannot interfere in any way with,
the similarly named variables elsewhere in the code.
There's also an additional benefit: when $fh goes out of scope, Perl performs an implicit
close.
Anyway, while that's probably useful information you can add to your "beginner of perl" knowledgebase,
it's very much guesswork on my part with respect to whatever modifications you made to my original code.
If you post your changes, I can provide more concrete feedback.
Re: find common data in multiple files
by poj (Abbot) on Dec 28, 2017 at 14:24 UTC
#!/usr/bin/perl
use strict;
use warnings;

my $filecount = 25;
my %count = ();
my @files = map { "File$_.txt" } (1..$filecount);

for my $i (0..$#files){
    open IN, '<', $files[$i] or die "Could not open $files[$i] : $!";
    while (<IN>){
        chomp;
        $count{$_}[$i] += 1;
    }
    close IN;
}

my @result = ();
print join "\t", 'ID Name', @files, "\n";
for my $key (sort keys %count){
    my @all = ();
    for my $i (0..$#files){
        if (defined $count{$key}[$i]){
            $all[$i] = $count{$key}[$i];
        } else {
            $all[$i] = 0;
        }
    }
    print join "\t", $key, @all, "\n"; # debug
    # skip if any are zero
    next if grep( $_ == 0, @all );
    push @result, $key;
}
print "\nCommon to all\n";
print "$_\n" for @result;
poj
Re: find common data in multiple files
by BillKSmith (Monsignor) on Dec 28, 2017 at 18:28 UTC
I searched MetaCPAN for "intersection set". The second entry (App::setop) appears to be exactly what you need.
Hello BillKSmith,
I had no idea, great module thanks for pointing it also to us.
Just to include a complete answer for future reference, I will also add a sample of code.
#!/usr/bin/perl
use strict;
use warnings;

use Capture::Tiny 'capture';

my $cmd = 'setop --intersect ' . join ' ', @ARGV;

my ($stdout, $stderr, $exitCode) = capture {
    system( $cmd );
};

print $stdout if $exitCode == 0;
print 'Error: ' . $stderr unless $exitCode == 0;
__END__
$ perl test.pl File1.txt File2.txt File3.txt
ID121 ABC14
ID122 EFG87
ID157 TSR11
BR / Thanos
Seeking for Perl wisdom...on the process of learning...not there...yet!
Re: find common data in multiple files
by Dallaylaen (Chaplain) on Dec 28, 2017 at 16:01 UTC
#!perl
use strict;
use warnings;

# The final hash
my %all;

# process files one by one, gather lines from each
foreach (@ARGV) {
    my $hash = unique_in_file($_);
    $all{$_}++ for keys %$hash;
};

# print result
print "$_\n" for grep {
    $all{$_} == @ARGV; # don't hardcode the needed count, the number of files *will* change
} sort keys %all;

sub unique_in_file {
    my $fname = shift;

    # don't bother opening
    local @ARGV = ($fname);

    # do the same as uniq(1) but for the current file
    my %uniq;
    while (<>) {
        chomp;
        # preprocess line here
        $uniq{$_}++;
    };

    # could've returned an array as well
    return \%uniq;
};
Re: find common data in multiple files
by Dallaylaen (Chaplain) on Dec 28, 2017 at 15:49 UTC
|
$result{$key} = 1 and then $result{$key}++ will leave you with $result{$key} == 2 no matter how many times it is called. Plus, this doesn't account for duplicates in files.
Re: find common data in multiple files
by Anonymous Monk on Dec 28, 2017 at 13:19 UTC
|
$result{$key} = 1;
$result{$key}++;
Instead, omit the first line. If a hash-key does not yet exist when you "increment" it, it will automatically be created with the value zero, then incremented to the value 1. (In the code as written, the value can never be anything but 1.)
The remainder of the solution should be straightforward: after processing all 25 files, look for keys with the value 25. (Assuming that you know that there are no duplicates in each file.) If there could be duplicates, you will need to use a second hash, cleared at the start of each file, and use exists to see whether a given key is already in it (i.e. is a duplicate within this file).
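The two-hash idea in the last sentence might be sketched like this. It is a hypothetical illustration of the described approach, not code from the thread.

```perl
#!/usr/bin/perl
# Sketch of the two-hash approach: %seen is cleared at the end of each file
# so that duplicates within one file are counted only once; %count then
# holds, per line, the number of files the line appeared in.
use strict;
use warnings;

my $files = @ARGV;   # <> empties @ARGV as it reads, so save the count first
my (%count, %seen);
while (<>) {
    chomp;
    $count{$_}++ unless exists $seen{$_};
    $seen{$_} = 1;
} continue {
    %seen = () if eof;   # reset per-file duplicate tracking at end of file
}

print "$_\n" for grep { $count{$_} == $files } sort keys %count;
```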
Don't do this:
$result{$key} = 1;
$result{$key}++;
... (In the code as written, the value can never be anything but 1.)
Sorry, that's wrong.
$ perl -E '$key = "wrong"; $result{$key} = 1; $result{$key}++; say values %result'
2
The way forward always starts with a minimal test.
"So, sue me ..." ... two!
Nevertheless, that is not the increment behavior that the programmer intended, which was the essential point. The value will always be two.
Re: find common data in multiple files
by mao9856 (Sexton) on Jan 05, 2018 at 06:31 UTC
INPUT
File1
ID121 ABC14
ID122 EFG87
ID145 XYZ43
ID157 TSR11
ID181 ABC31
ID962 YTS27
ID567 POH70
ID921 BAMD80
File2
ID111 RET61
ID157 TSR11
ID181 ABC31
ID962 YTS27
ID452 FYU098
ID122 EFG87
File3
ID121 ABC14
ID612 FLOW12
ID122 EFG87
ID745 KIDP36
ID145 XYZ43
..................
File25
ID122 EFG87
ID809 EYE24
ID921 BAMD80
ID389 TOP30
ID121 ABC14
I tried the following new code:
#!/usr/bin/env perl
use strict;
use warnings;

my %data;
while (<>) {
    my ( $key, $value ) = split;
    push( @{ $data{$key} }, $value );
}

foreach my $key ( sort keys %data ) {
    if ( @{ $data{$key} } >= @ARGV ) {
        print join( "\t", $key, @{ $data{$key} } ), "\n";
    }
}
$ code.pl *.txt
It gives the following output, as per my understanding. Please correct me if I am wrong.
OUTPUT
File1 File2 File3 ...........File25
ID121 ABC14 space ABC14 ...........ABC14
ID122 EFG87 EFG87 EFG87 ...........EFG87
ID157 TSR11 TSR11 space .......... space
Thank you in advance:)