find common data in multiple files

by mao9856 (Sexton)
on Dec 28, 2017 at 09:55 UTC

mao9856 has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have 25 files, each containing two columns. I want to find and print the data (ID and name, for example ID121 ABC14) that is present in each of the 25 files.

File 1
ID121 ABC14
ID122 EFG87
ID145 XYZ43
ID157 TSR11
ID181 ABC31
ID962 YTS27
ID567 POH70
ID921 BAMD80

File 2
ID111 RET61
ID157 TSR11
ID181 ABC31
ID962 YTS27
ID452 FYU098
ID121 ABC14
ID122 EFG87

File 3
ID121 ABC14
ID612 FLOW12
ID122 EFG87
ID745 KIDP36
ID145 XYZ43
ID157 TSR11

.........

File 25
ID122 EFG87
ID809 EYE24
ID157 TSR11
ID921 BAMD80
ID389 TOP30
ID121 ABC14

Output:
ID121 ABC14
ID122 EFG87
ID157 TSR11

All these files are .txt files, so I want to compare all 25 files and print the data that exists in all of them. Please help.

#!/usr/bin/perl
use strict;
use warnings;

my %result;
my @counts=();

foreach (@ARGV) {
    my $column=0;
    open my A, "<", "@counts "could not open file1 $!";
    while (<A>) {
        chomp;
        my $key = (split /\t/, $_)[0];
        $result{$key} = 1;
        $result{$key}++;
        if ($result{$key} == 20 {
            print "Line with $key is present in all twenty \n";
        }
    }
    close (A);
};

foreach (@counts) {
    s/^\s//g;
    print $_,"\n";
};

Replies are listed 'Best First'.
Re: find common data in multiple files
by tybalt89 (Monsignor) on Dec 28, 2017 at 14:34 UTC
    #!/usr/bin/perl
    use strict;
    use warnings;

    my %all = map {$_, $_} do {local @ARGV = shift; <>};
    %all = map {$_, $_} grep defined, @all{ do {local @ARGV = $_; <>} } for @ARGV;
    print sort keys %all;

    Outputs:

    ID121 ABC14
    ID122 EFG87
    ID157 TSR11

    which looks right.

    Hash slices can be your friend :)

Re: find common data in multiple files
by BrowserUk (Patriarch) on Dec 28, 2017 at 10:56 UTC

    If you do:

    my %hash;
    ++$hash{ $_ } while <>;

    It will add and increment an entry in the hash for each line of every file named on the command line.

    Then all you need to do is run through the hash and print out any key with a value equal to the number of files:

    $hash{ $_ } == @ARGV and print for keys %hash;
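
    Putting those two pieces together, here is a minimal runnable sketch (a sketch only, not BrowserUk's exact code: note that the magic <> operator shifts file names off @ARGV as it opens them, so the file count has to be captured before the loop; it also assumes no duplicate lines within a single file):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $nfiles = @ARGV;        # capture the count before <> consumes @ARGV

    my %hash;
    ++$hash{ $_ } while <>;    # one count per line, across all files

    # print every line whose count equals the number of files
    $hash{ $_ } == $nfiles and print for keys %hash;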

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Yes, I was thinking about a similar solution. Short, easy, and efficient. Please note, however, that this will work properly only provided there are no duplicate entries in the individual input files.
Re: find common data in multiple files
by thanos1983 (Parson) on Dec 28, 2017 at 10:18 UTC

    Hello mao9856,

    Since you are not telling us what the problem is (e.g. the script is not running, or it is not producing the desired output), we cannot assist you at a quick glance.

    A similar question, parse multiple text files keep unique lines only, was asked in the past; many Monks tackled it elegantly, and maybe you can find a possible solution to your problem there.

    Update: I just tried to execute your sample of code, and it is not running. It looks like you found the code somewhere, pasted it here, and asked for someone to solve it for you. Can you show the minimum amount of effort that you made to resolve it yourself and make the script executable?

    Update 2: I had some time to kill, so I put together this script that more or less does what you want. It reads all files from @ARGV and processes every line. Then it keeps only the lines that are common to all of them, assuming that matching lines are always identical and there are no combinations. By combinations I mean that you only want to detect duplicated whole lines.

    Sample of code:
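
    A sketch of the approach just described (a sketch only, assuming List::MoreUtils's uniq and duplicates, unique lines within each file, and a hypothetical read_lines helper):

    #!/usr/bin/perl
    use strict;
    use warnings;

    use List::MoreUtils qw(uniq duplicates);

    # Start from the first file's lines, then intersect with each
    # following file: concatenating two duplicate-free lists and
    # keeping the duplicates yields exactly the lines present in both.
    my @common = read_lines(shift @ARGV);
    for my $file (@ARGV) {
        @common = uniq duplicates( @common, read_lines($file) );
    }
    print "$_\n" for sort @common;

    sub read_lines {
        my ($file) = @_;
        open my $fh, '<', $file or die "Could not open $file: $!";
        chomp( my @lines = <$fh> );
        return uniq grep { /\S/ } @lines;   # drop empty lines, dedupe
    }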

    Update 2 continue: In case you want to detect only those lines where just the $key or just the $value is duplicated, you can easily do it like this.

    Sample of code:
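
    Again only a sketch of that idea (duplicates from List::MoreUtils; the @dupKeys/@dupValues names are illustrative):

    #!/usr/bin/perl
    use strict;
    use warnings;

    use List::MoreUtils qw(uniq duplicates);

    my (@keys, @values);

    while (<>) {
        next if /^\s*$/;    # skip empty lines
        chomp;
        my ($key, $value) = split;
        push @keys,   $key;
        push @values, $value;
    }

    # IDs (or names) that occur on more than one line
    my @dupKeys   = uniq duplicates @keys;
    my @dupValues = uniq duplicates @values;

    print "duplicated IDs:   @dupKeys\n";
    print "duplicated names: @dupValues\n";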

    Update 2 continue: I used the module List::MoreUtils, more specifically its function duplicates, which "Returns a new list by stripping values in LIST occurring less than twice.". The data I used are from the sample files that you provided us.

    Hope this helps, BR.

    Seeking for Perl wisdom...on the process of learning...not there...yet!

      Thank you for the help. I tried to write this code based on my understanding. Please excuse me, I am very beginner of perl. My data contain unique IDs (ID157) and names (TSR11) separated by a tab. I want to look for both the ID and the name together (ID157 TSR11) and check whether they are present in all 25 files. If ID157 TSR11 is present in all 25 files, it should be printed in the output. That is, I want to print only those IDs and names that are present in all 25 files, with the ID and name printed together separated by a tab, as: ID157 TSR11. I am less familiar with using Perl modules, but I am trying my best.

        Hello again mao9856,

        Not knowing is not a problem; nobody started coding knowing everything. This forum is open and free for people to learn and contribute. You provided us a bit of non-working code: you either got it from someone, or you modified it to the point where it no longer works. It is good for you, and for all of us, to practice and try to provide a working sample of code that shows what you have tried and where you got stuck. It would be really easy for us to just hand you a solution, but that would not really resolve your problem, since you would not learn anything from it.

        Having said that, here is a sample of code that does what you want.

        #!/usr/bin/perl
        use strict;
        use warnings;

        use Data::Dumper;
        use List::MoreUtils 'frequency';

        my (@lines);
        my $numberOfFiles = scalar @ARGV;

        while (<>) {
            next if /^\s*$/;   # skip empty lines (remove if not needed)
            chomp;
            push @lines, $_;
        }
        continue {
            close ARGV if eof; # Not eof()!
        }

        my @frequencyLines = frequency @lines;
        my %frequencyHash  = @frequencyLines;

        my @unwanted;
        foreach my $key (keys %frequencyHash) {
            if ($frequencyHash{$key} != $numberOfFiles) {
                push @unwanted, $key;  # push any related keys onto @unwanted
            }
        }

        delete @frequencyHash{@unwanted};
        my @matches = keys %frequencyHash;

        print Dumper \@matches;

        __END__

        $ perl test.pl File1.txt File2.txt File3.txt
        $VAR1 = [
                  'ID122 EFG87',
                  'ID121 ABC14',
                  'ID157 TSR11'
                ];

        I used as input the first 3 files (File 1 - 3) of the sample data that you provided us.

        Hope this helps, BR.

        Seeking for Perl wisdom...on the process of learning...not there...yet!
Re: find common data in multiple files
by Discipulus (Canon) on Dec 28, 2017 at 10:20 UTC
    Hello mao9856,

    Have you run the code you posted? @counts and, even worse, open my A, "<", "@counts "could not open file1 $!"; make no sense at all. Also foreach (@ARGV) makes no sense: in fact you never use $_, the implicit variable filled for you by foreach when you do not specify a named variable.

    So the beginning of your program would be better written as:

    use strict;
    use warnings;

    my %result;
    # my @counts=();     # no need of this

    foreach (@ARGV) {
        # my $column=0;  # no need of this too
        open my $fh, "<", $_ or die "could not open file [$_] $!";
        while (<$fh>) {
            chomp;
            ...

    Use print to debug your program: print "DEBUG: working on file [$_]\n" as the first line of the loop will confirm that you are reading all the files.

    Then you do not need to split your lines, as you want to check for the presence of the whole ID121 ABC14 line in every file. So put the whole line as the key of the results hash and ++ it, as you are doing.

    A tip: as you need only strings that are in all files, add every entry of the first file, and then for the following files just ++ the keys that are already present in the hash.
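
    A sketch of that tip (assuming no duplicate lines within a single file):

    use strict;
    use warnings;

    my %count;

    # seed the hash with every line of the first file
    my $first = shift @ARGV;
    open my $fh, '<', $first or die "could not open file [$first] $!";
    while (<$fh>) { chomp; $count{$_} = 1 }

    # for the following files, only ++ keys that are already present
    for my $file (@ARGV) {
        open my $in, '<', $file or die "could not open file [$file] $!";
        while (<$in>) {
            chomp;
            $count{$_}++ if exists $count{$_};
        }
    }

    # lines common to all files were counted once per file
    my $nfiles = 1 + @ARGV;
    print "$_\n" for grep { $count{$_} == $nfiles } sort keys %count;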

    L*

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.

      Hello Discipulus, I haven't tried the code, but I wrote it based on my understanding. Please excuse me, I am very beginner in perl. my @counts=() was written to define the number of files, because while running this program I thought I would use: perl prog.pl *.txt. And @ARGV was used because the names of the files vary. As per your suggestions, I am trying to run the new code. Thank you for the suggestions :)

Re: find common data in multiple files
by kcott (Archbishop) on Dec 28, 2017 at 21:15 UTC

    G'day mao9856,

    I'd read through one file and store all of its data in a hash; then read through the remaining files, removing hash data that wasn't common. Given these files (in the spoiler) using data from your OP:

    $ cat pm_1206312_in1
    ID121 ABC14
    ID122 EFG87
    ID145 XYZ43
    ID157 TSR11
    ID181 ABC31
    ID962 YTS27
    ID567 POH70
    ID921 BAMD80

    $ cat pm_1206312_in2
    ID111 RET61
    ID157 TSR11
    ID181 ABC31
    ID962 YTS27
    ID452 FYU098
    ID121 ABC14
    ID122 EFG87

    $ cat pm_1206312_in3
    ID121 ABC14
    ID612 FLOW12
    ID122 EFG87
    ID745 KIDP36
    ID145 XYZ43
    ID157 TSR11

    $ cat pm_1206312_in25
    ID122 EFG87
    ID809 EYE24
    ID157 TSR11
    ID921 BAMD80
    ID389 TOP30
    ID121 ABC14

    This code:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use autodie;

    my @files = glob 'pm_1206312_in*';
    my %uniq;

    {
        open my $fh, '<', shift @files;
        while (<$fh>) {
            my ($k, $v) = split;
            $uniq{$k} = $v;
        }
    }

    for my $file (@files) {
        my %data;
        open my $fh, '<', $file;
        while (<$fh>) {
            my ($k, $v) = split;
            $data{$k} = $v;
        }
        for (keys %uniq) {
            delete $uniq{$_} unless exists $data{$_} and $uniq{$_} eq $data{$_};
        }
    }

    printf "%s %s\n", $_, $uniq{$_} for sort keys %uniq;

    Produces this output:

    ID121 ABC14
    ID122 EFG87
    ID157 TSR11

    — Ken

      Hi Ken, this code worked for me after I put the last line: printf "%s %s\n", $_, $uniq{$_} for sort keys %uniq; before the closing parenthesis. Thanks a million :)

        "This code worked for me after I put last line ... before closing parenthesis. Thanks a million"

        Whilst I appreciate the thanks, it sounds like you've introduced a (possibly subtle) bug. The basic logic for my code is:

        Declare hash
        SINGLE BLOCK (reading one file):
            Populate hash
        LOOP BLOCK (reading all other files):
            Remove data that isn't common from hash
        Print hash data

        If you move the Print operation to LOOP BLOCK, you'll get multiple (24) groups of output. That's not what you want, and it would have been plainly obvious if you'd done that, so you've probably done something different to what you've described.

        You've said "I am very beginner of perl" in a couple of places. I suspect you haven't understood the anonymous block I used in SINGLE BLOCK and ended up with logic more like this:

        Declare hash
        start SINGLE BLOCK
            Populate hash
            LOOP BLOCK
            Print hash data
        end SINGLE BLOCK

        An anonymous block is just code wrapped in braces:

        {
            # code here
        }

        I've used it to provide a limited lexical scope. The variables ($fh, $k and $v) that I've declared in that block, only exist in that block; they are quite different to, and cannot interfere in any way with, the similarly named variables elsewhere in the code. There's also an additional benefit: when $fh goes out of scope, Perl performs an implicit close.
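
        For example, a small illustration of that scoping (using one of the input files above; the file name is just for demonstration):

        my $fh = 'outer value';
        {
            open my $fh, '<', 'pm_1206312_in1' or die $!;  # a new, unrelated $fh
            print scalar <$fh>;                            # prints the file's first line
        }   # the inner $fh goes out of scope here: Perl closes the file implicitly
        print "$fh\n";                                     # the outer 'outer value' is untouched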

        Anyway, while that's probably useful information you can add to your "beginner of perl" knowledgebase, it's very much guesswork on my part with respect to whatever modifications you made to my original code. If you post your changes, I can provide more concrete feedback.

        — Ken

Re: find common data in multiple files
by poj (Abbot) on Dec 28, 2017 at 14:24 UTC

    Build a Hash of Arrays (HoA) (see perldsc) and check each element with grep

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $filecount = 25;
    my %count = ();
    my @files = map { "File $_.txt" } (1..$filecount);

    for my $i (0..$#files){
        open IN, '<', $files[$i] or die "Could not open $files[$i] : $!";
        while (<IN>){
            chomp;
            $count{$_}[$i] += 1;
        }
        close IN;
    }

    my @result = ();
    print join "\t", 'ID Name', @files, "\n";

    for my $key (sort keys %count){
        my @all = ();
        for my $i (0..$#files){
            if (defined $count{$key}[$i]){
                $all[$i] = $count{$key}[$i];
            } else {
                $all[$i] = 0;
            }
        }
        print join "\t", $key, @all, "\n"; # debug
        # skip if any are zero
        next if grep( $_==0, @all );
        push @result, $key;
    }

    print "\nCommon to all\n";
    print "$_\n" for @result;
    poj
Re: find common data in multiple files
by BillKSmith (Monsignor) on Dec 28, 2017 at 18:28 UTC
    I searched MetaCPAN for "intersection set". The second entry (App::setop) appears to be exactly what you need.
    Bill

      Hello BillKSmith,

      I had no idea; great module, thanks for pointing it out to us as well.

      Just to include a complete answer for future reference, I will also add a sample of code.

      #!/usr/bin/perl
      use strict;
      use warnings;

      use Capture::Tiny 'capture';

      my $cmd = 'setop --intersect ' . join ' ', @ARGV;

      my ($stdout, $stderr, $exitCode) = capture {
          system( $cmd );
      };

      print $stdout if $exitCode == 0;
      print 'Error: ' . $stderr unless $exitCode == 0;

      __END__

      $ perl test.pl File1.txt File2.txt File3.txt
      ID121 ABC14
      ID122 EFG87
      ID157 TSR11

      BR / Thanos

      Seeking for Perl wisdom...on the process of learning...not there...yet!
Re: find common data in multiple files
by Dallaylaen (Chaplain) on Dec 28, 2017 at 16:01 UTC
    #!perl
    use strict;
    use warnings;

    # The final hash
    my %all;

    # process files one by one, gather lines from each
    foreach (@ARGV) {
        my $hash = unique_in_file($_);
        $all{$_}++ for keys %$hash;
    };

    # print result
    print "$_\n" for grep {
        $all{$_} == @ARGV; # don't hardcode the needed count, the number of files *will* change
    } sort keys %all;

    sub unique_in_file {
        my $fname = shift;

        # don't bother opening
        local @ARGV = ($fname);

        # do the same uniq(1) but for the current file
        my %uniq;
        while (<>) {
            chomp;
            # preprocess line here
            $uniq{$_}++;
        };

        # could've returned array as well
        return \%uniq;
    };
Re: find common data in multiple files
by Dallaylaen (Chaplain) on Dec 28, 2017 at 15:49 UTC
    $result{$key} = 1 and then $result{$key}++ will leave you with $result{$key} == 2 no matter how many times it is called. Plus, this doesn't account for duplicates in files.
Re: find common data in multiple files
by Anonymous Monk on Dec 28, 2017 at 13:19 UTC
    Don't do this:
    $result{$key} = 1;
    $result{$key}++;
    Instead, omit the first line. If a hash-key does not yet exist when you "increment" it, it will automatically be created with the value zero, then incremented to the value 1. (In the code as written, the value can never be anything but 1.)
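
    A quick one-liner demonstration of that auto-creation ("autovivification"):

    $ perl -E 'my %h; $h{key}++; say $h{key}'
    1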

    The remainder of the solution should be straightforward: after processing all 25 files, look for keys with the value 25. (Assuming that you know that there are no duplicates in each file.) If there could be duplicates, you will need to use a second hash, cleared at the start of each file, and use the exists verb to see if a specified key is already in it (i.e. is a duplicate within this file).
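
    A minimal sketch of that duplicate-safe approach (the variable names are illustrative; the file count is captured before <> empties @ARGV):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $nfiles = @ARGV;    # capture before <> consumes @ARGV
    my (%count, %seen);    # %seen is the per-file duplicate guard

    while (<>) {
        chomp;
        $count{$_}++ unless exists $seen{$_};   # count each line once per file
        $seen{$_} = 1;
    }
    continue {
        %seen = () if eof;                      # reset the guard at each end-of-file
    }

    # keys counted once per file are common to all files
    print "$_\n" for grep { $count{$_} == $nfiles } sort keys %count;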

      Don't do this:
      $result{$key} = 1;
      $result{$key}++;
      ... (In the code as written, the value can never be anything but 1.)

      Sorry, that's wrong.

      $ perl -E '$key = "wrong"; $result{$key} = 1; $result{$key}++; say values %result'
      2


      The way forward always starts with a minimal test.
        "So, sue me ..." ... two!
        Nevertheless, not the increment behavior that the programmer intended, which was the essential point. The value will always be two.
Re: find common data in multiple files
by mao9856 (Sexton) on Jan 05, 2018 at 06:31 UTC

    Greetings to all

    For printing the common data present among all 25 .txt files as input:

    INPUT

    File1
    ID121 ABC14
    ID122 EFG87
    ID145 XYZ43
    ID157 TSR11
    ID181 ABC31
    ID962 YTS27
    ID567 POH70
    ID921 BAMD80

    File2
    ID111 RET61
    ID157 TSR11
    ID181 ABC31
    ID962 YTS27
    ID452 FYU098
    ID122 EFG87

    File3
    ID121 ABC14
    ID612 FLOW12
    ID122 EFG87
    ID745 KIDP36
    ID145 XYZ43

    ..................

    File25
    ID122 EFG87
    ID809 EYE24
    ID921 BAMD80
    ID389 TOP30
    ID121 ABC14

    I tried the following new code:

    #!/usr/bin/env perl
    use strict;
    use warnings;

    my %data;

    while (<>) {
        my ( $key, $value ) = split;
        push( @{ $data{$key} }, $value );
    }

    foreach my $key ( sort keys %data ) {
        if ( @{ $data{$key} } >= @ARGV ) {
            print join( "\t", $key, @{ $data{$key} } ), "\n";
        }
    }
     $ code.pl *.txt

    It gives the following output, as per my understanding. Please correct me if I am wrong.

    OUTPUT
          File1  File2  File3  ........  File25
    ID121 ABC14  space  ABC14  ........  ABC14
    ID122 EFG87  EFG87  EFG87  ........  EFG87
    ID157 TSR11  TSR11  space  ........  space

    Thank you in advance:)
