find common data in multiple files

by mao9856 (Sexton)
on Dec 28, 2017 at 09:55 UTC

mao9856 has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have 25 files, each containing two columns. I want to find and print the data (ID and name, for example ID121 ABC14) that is present in each of the 25 files.

File 1
ID121 ABC14
ID122 EFG87
ID145 XYZ43
ID157 TSR11
ID181 ABC31
ID962 YTS27
ID567 POH70
ID921 BAMD80

File 2
ID111 RET61
ID157 TSR11
ID181 ABC31
ID962 YTS27
ID452 FYU098
ID121 ABC14
ID122 EFG87

File 3
ID121 ABC14
ID612 FLOW12
ID122 EFG87
ID745 KIDP36
ID145 XYZ43
ID157 TSR11

.........

File 25
ID122 EFG87
ID809 EYE24
ID157 TSR11
ID921 BAMD80
ID389 TOP30
ID121 ABC14

Output:
ID121 ABC14
ID122 EFG87
ID157 TSR11

All these files are .txt files, so I want to compare all 25 files and print the data that exists in all of them. Please help.

#!/usr/bin/perl
use strict;
use warnings;

my %result;
my @counts=();

foreach (@ARGV) {
    my $column=0;
    open my A, "<", "@counts "could not open file1 $!";
    while (<A>) {
        chomp;
        my $key = (split /\t/, $_)[0];
        $result{$key} = 1;
        $result{$key}++;
        if ($result{$key} == 20 {
            print "Line with $key is present in all twenty \n";
        }
    }
    close (A);
};

foreach (@counts) {
    s/^\s//g;
    print $_,"\n";
};

Replies are listed 'Best First'.
Re: find common data in multiple files
by tybalt89 (Monsignor) on Dec 28, 2017 at 14:34 UTC
    #!/usr/bin/perl
    use strict;
    use warnings;

    my %all = map {$_, $_} do {local @ARGV = shift; <>};
    %all = map {$_, $_} grep defined, @all{ do {local @ARGV = $_; <>} } for @ARGV;
    print sort keys %all;

    Outputs:

    ID121 ABC14
    ID122 EFG87
    ID157 TSR11

    which looks right.

    Hash slices can be your friend :)

Re: find common data in multiple files
by BrowserUk (Patriarch) on Dec 28, 2017 at 10:56 UTC

    If you do:

    my %hash;
    ++$hash{ $_ } while <>;

    It will add and increment an entry in the hash for each line of every file named on the command line.

    Then all you need to do is run through the hash and print out any key with a value equal to the number of files:

    $hash{ $_ } == @ARGV and print for keys %hash;
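
    Putting those two pieces together, here is a minimal runnable sketch (a sketch only, not BrowserUk's exact code: note that the magic <> operator shifts file names off @ARGV as it opens them, so the file count has to be captured before the loop; it also assumes no duplicate lines within a single file):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $nfiles = @ARGV;        # capture the count before <> consumes @ARGV

    my %hash;
    ++$hash{ $_ } while <>;    # one count per line, across all files

    # print every line whose count equals the number of files
    $hash{ $_ } == $nfiles and print for keys %hash;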

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Yes, I was thinking about a similar solution. Short, easy, and efficient. Please note, however, that this will work properly only provided there are no duplicate entries in the individual input files.
Re: find common data in multiple files
by thanos1983 (Parson) on Dec 28, 2017 at 10:18 UTC

    Hello mao9856,

    Since you are not telling us what the problem is (e.g. the script is not running, or it is not producing the desired output), we cannot assist you at a quick glance.

    A similar question, parse multiple text files keep unique lines only, was asked in the past; many Monks tackled it elegantly, and maybe you can find a possible solution to your problem there.

    Update: I just tried to execute your sample of code, and it is not running. It looks like you found the code somewhere, pasted it here, and asked for someone to solve it for you. Can you show the minimum amount of effort that you made to resolve it yourself and make the script executable?

    Update 2: I had some time to kill, so I put together this script that more or less does what you want. It reads all files from @ARGV and processes every line. Then it keeps only the lines that are common to all of them, assuming that matching lines are always identical and there are no combinations. By combinations I mean that you only want to detect duplicated whole lines.

    Sample of code:
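
    A sketch of the approach just described (a sketch only, assuming List::MoreUtils's uniq and duplicates, unique lines within each file, and a hypothetical read_lines helper):

    #!/usr/bin/perl
    use strict;
    use warnings;

    use List::MoreUtils qw(uniq duplicates);

    # Start from the first file's lines, then intersect with each
    # following file: concatenating two duplicate-free lists and
    # keeping the duplicates yields exactly the lines present in both.
    my @common = read_lines(shift @ARGV);
    for my $file (@ARGV) {
        @common = uniq duplicates( @common, read_lines($file) );
    }
    print "$_\n" for sort @common;

    sub read_lines {
        my ($file) = @_;
        open my $fh, '<', $file or die "Could not open $file: $!";
        chomp( my @lines = <$fh> );
        return uniq grep { /\S/ } @lines;   # drop empty lines, dedupe
    }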

    Update 2 continue: In case you want to detect only those lines where just the $key or just the $value is duplicated, you can easily do it like this.

    Sample of code:
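
    Again only a sketch of that idea (duplicates from List::MoreUtils; the @dupKeys/@dupValues names are illustrative):

    #!/usr/bin/perl
    use strict;
    use warnings;

    use List::MoreUtils qw(uniq duplicates);

    my (@keys, @values);

    while (<>) {
        next if /^\s*$/;    # skip empty lines
        chomp;
        my ($key, $value) = split;
        push @keys,   $key;
        push @values, $value;
    }

    # IDs (or names) that occur on more than one line
    my @dupKeys   = uniq duplicates @keys;
    my @dupValues = uniq duplicates @values;

    print "duplicated IDs:   @dupKeys\n";
    print "duplicated names: @dupValues\n";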

    Update 2 continue: I used the module List::MoreUtils, more specifically its function duplicates, which "Returns a new list by stripping values in LIST occurring less than twice.". The data I used are from the sample files that you provided us.

    Hope this helps, BR.

    Seeking for Perl wisdom...on the process of learning...not there...yet!

      Thank you for the help. I tried to write this code based on my understanding. Please excuse me, I am very beginner of perl. My data contain unique IDs (ID157) and names (TSR11) separated by a tab. I want to look for both the ID and the name together (ID157 TSR11) and check whether they are present in all 25 files. If ID157 TSR11 is present in all 25 files, it should be printed in the output. That is, I want to print only those IDs and names that are present in all 25 files, with the ID and name printed together separated by a tab, as: ID157 TSR11. I am less familiar with using Perl modules, but I am trying my best.

        Hello again mao9856,

        Not knowing is not a problem; nobody started coding knowing everything. This forum is open and free for people to learn and contribute. You provided us a bit of non-working code: you either got it from someone, or you modified it to the point where it no longer works. It is good for you, and for all of us, to practice and try to provide a working sample of code that shows what you have tried and where you got stuck. It would be really easy for us to just hand you a solution, but that would not really resolve your problem, since you would not learn anything from it.

        Having said that, here is a sample of code that does what you want.

        #!/usr/bin/perl
        use strict;
        use warnings;

        use Data::Dumper;
        use List::MoreUtils 'frequency';

        my (@lines);
        my $numberOfFiles = scalar @ARGV;

        while (<>) {
            next if /^\s*$/;   # skip empty lines (remove if not needed)
            chomp;
            push @lines, $_;
        }
        continue {
            close ARGV if eof; # Not eof()!
        }

        my @frequencyLines = frequency @lines;
        my %frequencyHash  = @frequencyLines;

        my @unwanted;
        foreach my $key (keys %frequencyHash) {
            if ($frequencyHash{$key} != $numberOfFiles) {
                push @unwanted, $key;  # push any related keys onto @unwanted
            }
        }

        delete @frequencyHash{@unwanted};
        my @matches = keys %frequencyHash;

        print Dumper \@matches;

        __END__

        $ perl test.pl File1.txt File2.txt File3.txt
        $VAR1 = [
                  'ID122 EFG87',
                  'ID121 ABC14',
                  'ID157 TSR11'
                ];

        I used as input the first 3 files (File 1 - 3) of the sample data that you provided us.

        Hope this helps, BR.

        Seeking for Perl wisdom...on the process of learning...not there...yet!
Re: find common data in multiple files
by Discipulus (Canon) on Dec 28, 2017 at 10:20 UTC
    Hello mao9856,

    Have you run the code you posted? @counts and, even worse, open my A, "<", "@counts "could not open file1 $!"; make no sense at all. Also foreach (@ARGV) makes no sense: in fact you never use $_, the implicit variable filled for you by foreach when you do not specify a named variable.

    So the beginning of your program would be better written as:

    use strict;
    use warnings;

    my %result;
    # my @counts=();     # no need of this

    foreach (@ARGV) {
        # my $column=0;  # no need of this too
        open my $fh, "<", $_ or die "could not open file [$_] $!";
        while (<$fh>) {
            chomp;
            ...

    Use print to debug your program: print "DEBUG: working on file [$_]\n" as the first line of the loop will confirm that you are reading all the files.

    Then you do not need to split your lines, as you want to check for the presence of the whole ID121 ABC14 line in every file. So put the whole line as the key of the results hash and ++ it, as you are doing.

    A tip: as you need only strings that are in all files, add every entry of the first file, and then for the following files just ++ the keys that are already present in the hash.
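
    A sketch of that tip (assuming no duplicate lines within a single file):

    use strict;
    use warnings;

    my %count;

    # seed the hash with every line of the first file
    my $first = shift @ARGV;
    open my $fh, '<', $first or die "could not open file [$first] $!";
    while (<$fh>) { chomp; $count{$_} = 1 }

    # for the following files, only ++ keys that are already present
    for my $file (@ARGV) {
        open my $in, '<', $file or die "could not open file [$file] $!";
        while (<$in>) {
            chomp;
            $count{$_}++ if exists $count{$_};
        }
    }

    # lines common to all files were counted once per file
    my $nfiles = 1 + @ARGV;
    print "$_\n" for grep { $count{$_} == $nfiles } sort keys %count;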

    L*

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.

      Hello Discipulus, I haven't tried the code, but I wrote it based on my understanding. Please excuse me, I am very beginner in perl. my @counts=() was written to define the number of files, because while running this program I thought I would use: perl prog.pl *.txt. And @ARGV was used because the names of the files vary. As per your suggestions, I am trying to run the new code. Thank you for the suggestions :)

Re: find common data in multiple files
by kcott (Archbishop) on Dec 28, 2017 at 21:15 UTC

    G'day mao9856,

    I'd read through one file and store all of its data in a hash; then read through the remaining files, removing hash data that wasn't common. Given these files (in the spoiler) using data from your OP:

    $ cat pm_1206312_in1
    ID121 ABC14
    ID122 EFG87
    ID145 XYZ43
    ID157 TSR11
    ID181 ABC31
    ID962 YTS27
    ID567 POH70
    ID921 BAMD80

    $ cat pm_1206312_in2
    ID111 RET61
    ID157 TSR11
    ID181 ABC31
    ID962 YTS27
    ID452 FYU098
    ID121 ABC14
    ID122 EFG87

    $ cat pm_1206312_in3
    ID121 ABC14
    ID612 FLOW12
    ID122 EFG87
    ID745 KIDP36
    ID145 XYZ43
    ID157 TSR11

    $ cat pm_1206312_in25
    ID122 EFG87
    ID809 EYE24
    ID157 TSR11
    ID921 BAMD80
    ID389 TOP30
    ID121 ABC14

    This code:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use autodie;

    my @files = glob 'pm_1206312_in*';
    my %uniq;

    {
        open my $fh, '<', shift @files;
        while (<$fh>) {
            my ($k, $v) = split;
            $uniq{$k} = $v;
        }
    }

    for my $file (@files) {
        my %data;
        open my $fh, '<', $file;
        while (<$fh>) {
            my ($k, $v) = split;
            $data{$k} = $v;
        }
        for (keys %uniq) {
            delete $uniq{$_} unless exists $data{$_} and $uniq{$_} eq $data{$_};
        }
    }

    printf "%s %s\n", $_, $uniq{$_} for sort keys %uniq;

    Produces this output:

    ID121 ABC14
    ID122 EFG87
    ID157 TSR11

    — Ken

      Hi Ken, this code worked for me after I put the last line: printf "%s %s\n", $_, $uniq{$_} for sort keys %uniq; before the closing parenthesis. Thanks a million :)

        "This code worked for me after I put last line ... before closing parenthesis. Thanks a million"

        Whilst I appreciate the thanks, it sounds like you've introduced a (possibly subtle) bug. The basic logic for my code is:

        Declare hash
        SINGLE BLOCK (reading one file):
            Populate hash
        LOOP BLOCK (reading all other files):
            Remove data that isn't common from hash
        Print hash data

        If you move the Print operation to LOOP BLOCK, you'll get multiple (24) groups of output. That's not what you want, and it would have been plainly obvious if you'd done that, so you've probably done something different to what you've described.

        You've said "I am very beginner of perl" in a couple of places. I suspect you haven't understood the anonymous block I used in SINGLE BLOCK and ended up with logic more like this:

        Declare hash
        start SINGLE BLOCK
            Populate hash
            LOOP BLOCK
            Print hash data
        end SINGLE BLOCK

        An anonymous block is just code wrapped in braces:

        {
            # code here
        }

        I've used it to provide a limited lexical scope. The variables ($fh, $k and $v) that I've declared in that block, only exist in that block; they are quite different to, and cannot interfere in any way with, the similarly named variables elsewhere in the code. There's also an additional benefit: when $fh goes out of scope, Perl performs an implicit close.
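
        For example, a small illustration of that scoping (using one of the input files above; the file name is just for demonstration):

        my $fh = 'outer value';
        {
            open my $fh, '<', 'pm_1206312_in1' or die $!;  # a new, unrelated $fh
            print scalar <$fh>;                            # prints the file's first line
        }   # the inner $fh goes out of scope here: Perl closes the file implicitly
        print "$fh\n";                                     # the outer 'outer value' is untouched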

        Anyway, while that's probably useful information you can add to your "beginner of perl" knowledgebase, it's very much guesswork on my part with respect to whatever modifications you made to my original code. If you post your changes, I can provide more concrete feedback.

        — Ken

Re: find common data in multiple files
by poj (Abbot) on Dec 28, 2017 at 14:24 UTC

    Build a Hash of Arrays (HoA) (see perldsc) and check each element with grep

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $filecount = 25;
    my %count = ();
    my @files = map { "File $_.txt" } (1..$filecount);

    for my $i (0..$#files){
        open IN, '<', $files[$i] or die "Could not open $files[$i] : $!";
        while (<IN>){
            chomp;
            $count{$_}[$i] += 1;
        }
        close IN;
    }

    my @result = ();
    print join "\t", 'ID Name', @files, "\n";

    for my $key (sort keys %count){
        my @all = ();
        for my $i (0..$#files){
            if (defined $count{$key}[$i]){
                $all[$i] = $count{$key}[$i];
            } else {
                $all[$i] = 0;
            }
        }
        print join "\t", $key, @all, "\n"; # debug
        # skip if any are zero
        next if grep( $_==0, @all );
        push @result, $key;
    }

    print "\nCommon to all\n";
    print "$_\n" for @result;
    poj
Re: find common data in multiple files
by BillKSmith (Monsignor) on Dec 28, 2017 at 18:28 UTC
    I searched MetaCPAN for "intersection set". The second entry (App::setop) appears to be exactly what you need.
    Bill

      Hello BillKSmith,

      I had no idea; great module, thanks for pointing it out to us as well.

      Just to include a complete answer for future reference, I will also add a sample of code.

      #!/usr/bin/perl
      use strict;
      use warnings;

      use Capture::Tiny 'capture';

      my $cmd = 'setop --intersect ' . join ' ', @ARGV;

      my ($stdout, $stderr, $exitCode) = capture {
          system( $cmd );
      };

      print $stdout if $exitCode == 0;
      print 'Error: ' . $stderr unless $exitCode == 0;

      __END__

      $ perl test.pl File1.txt File2.txt File3.txt
      ID121 ABC14
      ID122 EFG87
      ID157 TSR11

      BR / Thanos

      Seeking for Perl wisdom...on the process of learning...not there...yet!
Re: find common data in multiple files
by Dallaylaen (Chaplain) on Dec 28, 2017 at 16:01 UTC
    #!perl
    use strict;
    use warnings;

    # The final hash
    my %all;

    # process files one by one, gather lines from each
    foreach (@ARGV) {
        my $hash = unique_in_file($_);
        $all{$_}++ for keys %$hash;
    };

    # print result
    print "$_\n" for grep {
        $all{$_} == @ARGV; # don't hardcode the needed count, the number of files *will* change
    } sort keys %all;

    sub unique_in_file {
        my $fname = shift;

        # don't bother opening
        local @ARGV = ($fname);

        # do the same uniq(1) but for the current file
        my %uniq;
        while (<>) {
            chomp;
            # preprocess line here
            $uniq{$_}++;
        };

        # could've returned array as well
        return \%uniq;
    };
Re: find common data in multiple files
by Dallaylaen (Chaplain) on Dec 28, 2017 at 15:49 UTC
    $result{$key} = 1 and then $result{$key}++ will leave you with $result{$key} == 2 no matter how many times it is called. Plus, this doesn't account for duplicates in files.
Re: find common data in multiple files
by Anonymous Monk on Dec 28, 2017 at 13:19 UTC
    Don't do this:
    $result{$key} = 1;
    $result{$key}++;
    Instead, omit the first line. If a hash-key does not yet exist when you "increment" it, it will automatically be created with the value zero, then incremented to the value 1. (In the code as written, the value can never be anything but 1.)
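
    A quick one-liner demonstration of that auto-creation ("autovivification"):

    $ perl -E 'my %h; $h{key}++; say $h{key}'
    1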

    The remainder of the solution should be straightforward: after processing all 25 files, look for keys with the value 25. (Assuming that you know that there are no duplicates in each file.) If there could be duplicates, you will need to use a second hash, cleared at the start of each file, and use the exists verb to see if a specified key is already in it (i.e. is a duplicate within this file).
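
    A minimal sketch of that duplicate-safe approach (the variable names are illustrative; the file count is captured before <> empties @ARGV):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $nfiles = @ARGV;    # capture before <> consumes @ARGV
    my (%count, %seen);    # %seen is the per-file duplicate guard

    while (<>) {
        chomp;
        $count{$_}++ unless exists $seen{$_};   # count each line once per file
        $seen{$_} = 1;
    }
    continue {
        %seen = () if eof;                      # reset the guard at each end-of-file
    }

    # keys counted once per file are common to all files
    print "$_\n" for grep { $count{$_} == $nfiles } sort keys %count;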

      Don't do this:
      $result{$key} = 1;
      $result{$key}++;
      ... (In the code as written, the value can never be anything but 1.)

      Sorry, that's wrong.

      $ perl -E '$key = "wrong"; $result{$key} = 1; $result{$key}++; say values %result'
      2


      The way forward always starts with a minimal test.
        "So, sue me ..." ... two!
        Nevertheless, not the increment behavior that the programmer intended, which was the essential point. The value will always be two.
Re: find common data in multiple files
by mao9856 (Sexton) on Jan 05, 2018 at 06:31 UTC

    Greetings to all

    For printing the common data present among all 25 .txt files as input:

    INPUT

    File1
    ID121 ABC14
    ID122 EFG87
    ID145 XYZ43
    ID157 TSR11
    ID181 ABC31
    ID962 YTS27
    ID567 POH70
    ID921 BAMD80

    File2
    ID111 RET61
    ID157 TSR11
    ID181 ABC31
    ID962 YTS27
    ID452 FYU098
    ID122 EFG87

    File3
    ID121 ABC14
    ID612 FLOW12
    ID122 EFG87
    ID745 KIDP36
    ID145 XYZ43

    ..................

    File25
    ID122 EFG87
    ID809 EYE24
    ID921 BAMD80
    ID389 TOP30
    ID121 ABC14

    I tried the following new code:

    #!/usr/bin/env perl
    use strict;
    use warnings;

    my %data;

    while (<>) {
        my ( $key, $value ) = split;
        push( @{ $data{$key} }, $value );
    }

    foreach my $key ( sort keys %data ) {
        if ( @{ $data{$key} } >= @ARGV ) {
            print join( "\t", $key, @{ $data{$key} } ), "\n";
        }
    }
     $ code.pl *.txt

    It gives the following output, as per my understanding. Please correct me if I am wrong.

    OUTPUT
          File1  File2  File3  ........  File25
    ID121 ABC14  space  ABC14  ........  ABC14
    ID122 EFG87  EFG87  EFG87  ........  EFG87
    ID157 TSR11  TSR11  space  ........  space

    Thank you in advance:)
