PerlMonks  

Examine two files to delete duplicates

by Coop197823 (Initiate)
on Sep 03, 2015 at 16:33 UTC ( [id://1140904]=perlquestion )

Coop197823 has asked for the wisdom of the Perl Monks concerning the following question:

I found this bit of code that parses the first column of two text files: it reads the first column of the first file, then examines the second file to find and remove duplicates. As I am brand new to Perl, my question is this: how do I update the code to specify which is the first file to be examined and which is the second file to have duplicates removed? Thank you in advance for the help!
#!/usr/bin/perl

# create names lookup table from first file
my %names;
while (<>) {
    (my $col1)= split / /, $_;
    $names{$col1} = 1;
    last if eof;
}

# scan second file
while (<>) {
    print if /^(\S+).*/ && not $names{$1};
}

Replies are listed 'Best First'.
Re: Examine two files to delete duplicates
by stevieb (Canon) on Sep 03, 2015 at 16:57 UTC

    Welcome to the Monastery, Coop197823!

    Always use strict; and use warnings;. You should also include a sample of your input for testing (put that in <code></code> tags, as you did with your code), along with a snippet of your expected output. Does the following do what you expect?

    #!/usr/bin/perl
    use warnings;
    use strict;

    if ($ARGV[0] && ! $ARGV[1]){
        die "Usage: ./script.pl file1.txt file2.txt\n";
    }

    my $file1 = $ARGV[0] || 'in.txt';
    my $file2 = $ARGV[1] || 'in2.txt';

    my %names;

    # file 1
    open my $fh1, '<', $file1 or die "can't open file $file1: $!";
    while (<$fh1>) {
        next if /^\s*$/;
        $names{(split)[0]} = 1;
    }
    close $fh1 or die $!;

    # file 2
    open my $fh2, '<', $file2 or die "can't open file $file2: $!";
    while (<$fh2>) {
        print if /^(\S+)/ && not $names{$1};
    }
    close $fh2 or die $!;

    You can call it like ./script.pl file1.txt file2.txt. If you leave off both arguments, it'll use the defaults provided. If you supply one argument but not the other, the script dies with the usage message.

    -stevieb

    edit: added check for blank and whitespace-only lines, added @ARGV.
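
A minimal sketch of the lookup-table approach used in stevieb's script, reworked as a sub that operates on in-memory lines so it can be tried without any input files. The name remove_seen and the sample data are made up for illustration and do not come from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns the lines of @$lines2
# whose first whitespace-delimited field does not appear as a first field
# anywhere in @$lines1.
sub remove_seen {
    my ($lines1, $lines2) = @_;

    # Build the lookup table from the first column of the first list.
    my %names;
    for my $line (@$lines1) {
        next if $line =~ /^\s*$/;              # skip blank/whitespace-only lines
        $names{ (split ' ', $line)[0] } = 1;
    }

    # Keep only lines whose first column was not seen in the first list.
    return grep { /^(\S+)/ && !$names{$1} } @$lines2;
}

# Invented sample data, for illustration only.
my @first  = ("alice 1\n", "\n", "bob 2\n");
my @second = ("bob 9\n", "carol 3\n", "alice 7\n", "dave 4\n");

print remove_seen(\@first, \@second);
```

With this sample data only the carol and dave lines are printed, because alice and bob already occur in the first list. The order of the two arguments decides which list is the reference and which one gets filtered.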

Re: Examine two files to delete duplicates
by ww (Archbishop) on Sep 03, 2015 at 22:15 UTC

    The hash you show would be relevant were you trying to match col2 in one file to col2 in another, but that's NOT what your narrative specifies. In fact, your textual problem description seems to suggest one kind of data and your code seems to try to deal with another (for which use of a hash could be appropriate; see comments after the script). So, for the sake of argument, and using only my often-fallible crystal ball, let's assume data files like these:

    file one, aka 1140904.txt:
    test.txt foo bar baz
    texty_as_all_beat_hell.tst foo bar baz
    ffoo.txt test bar baz
    1140904.txt bar baz bat

    and...

    file two aka 1140904a.txt:
    1140904a.txt foo bar baz
    texty_as_all_beat_hell.tst foo bar baz
    ffoo.txt test bar baz
    baz.test bar blivitz

    The code below seems to illustrate (very verbosely, with excessive detail; NO, this is NOT the way it should look for production use) one answer to your explicit question (how to distinguish the first from the second file, which stevieb's approach in the first reply also handles satisfactorily) and a tactic for distinguishing the matches from the non-matches:

    #!/usr/bin/perl
    use 5.018;
    use strict;       # LET PERL HELP YOU (id typos, etc)
    use warnings;     # LET PERL HELP YOU (id typos, etc): strict and warnings, always!
    use Data::Dumper;

    # print "Enter name for first file: ";
    my $file1 = 'C:\_ww\1140904.txt';
    # print "Enter name for second file: ";
    my $file2 = '1140904a.txt';

    my (@fileONE, $fileONE);
    open(my $FH, "<", $file1) || die "Can't open $file1: $!\n";
    for my $line(<$FH>) {
        chomp $line;
        say "DEBUG Ln 18: \$line is: $line";
        my ($col1) = split(/ /, $line, 2);
        push @fileONE, $col1;
        next;
    }
    say Dumper @fileONE;
    say "\n\t --------";

    my (@fileTWO, $FH2);
    open($FH2, "<", $file2) || die "Can't open $file2: $!\n";
    for my $line2(<$FH2>) {
        chomp $line2;
        say "DEBUG Ln 34: \$line2 is: $line2";
        my ($col1_2) = split(/ /, $line2);
        push @fileTWO, $col1_2;
        next;
    }
    say "DEBUG Ln 37 - reached Ln 37";
    say Dumper @fileTWO;
    say "\n **************";

    my ($fileTWO, $i);
    for ($i = 0; $i < @fileONE; ++$i) {
        my $BASEname1 = 'fileONE[$';
        my $Fname1 = '$' . $BASEname1 . "$i" . ']';
        my $BASENAME2 = 'fileTWO[$';
        my $Fname2 = '$' . $BASENAME2 . $i . ']';
        my $content1 = $fileONE[$i];
        my $content2 = $fileTWO[$i];
        # say "DEBUG Ln50: content of \$Fname1:\t $fileONE[$i] \n\t and content of \$Fname2: $fileTWO[$i] \n";
        if ($content1 eq $content2 ) {
            say "Exclude because it's a match: |--> $content2 <--| \n";
        } else {
            say "\t Content of $Fname1 and $Fname2 does not match.\n";
        }
    }

    If this way of tackling the problem of sorting the matches from the non-matches is irrelevant to your real problem, ignore the above
        .... but please note that you did NOT include sample data, something you should do in cases such as this, where there is a possible discrepancy between narrative and code.

    But really, what I think you were truly asking was for someone to learn Perl for you ("found this bit of code"), which is not conduct that is widely approved of among Monks. Please see On asking for help, How do I post a question effectively? and I know what I mean. Why don't you?. In other words, we're here to help you learn or to solve specific coding problems, NOT to be a script-writing service.

    OUTPUT (less the DEBUG output, which is left as a learning aid):

         Content of $fileONE[$0] and $fileTWO[$0] does not match.
    Exclude because it's a match: |--> texty_as_all_beat_hell.tst <--|
    Exclude because it's a match: |--> ffoo.txt <--|
         Content of $fileONE[$3] and $fileTWO[$3] does not match.

    And why did I leave the DEBUG lines in the code? Because a major goal here is "helping people learn."


    ++$anecdote ne $data
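
ww's script compares the two files index by index, so a name counts as a match only when it sits on the same line number in both files. For contrast, here is a rough sketch of an order-independent hash lookup run on the same sample data. The sub name partition_by_first_column is hypothetical, and the data is held in arrays rather than read from files purely to keep the example self-contained.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample data copied from ww's reply (one array element per line).
my @file_one = (
    "test.txt foo bar baz",
    "texty_as_all_beat_hell.tst foo bar baz",
    "ffoo.txt test bar baz",
    "1140904.txt bar baz bat",
);
my @file_two = (
    "1140904a.txt foo bar baz",
    "texty_as_all_beat_hell.tst foo bar baz",
    "ffoo.txt test bar baz",
    "baz.test bar blivitz",
);

# Hypothetical helper: split the lines of @$two into those whose first
# column also occurs in @$one (matches) and those whose first column
# does not (non-matches), regardless of line position.
sub partition_by_first_column {
    my ($one, $two) = @_;

    my %seen;
    for my $line (@$one) {
        my ($col1) = split ' ', $line;
        $seen{$col1} = 1 if defined $col1;     # blank lines yield no column
    }

    my (@match, @no_match);
    for my $line (@$two) {
        next unless $line =~ /^(\S+)/;         # ignore lines without a first column
        if ($seen{$1}) { push @match, $line }
        else           { push @no_match, $line }
    }
    return (\@match, \@no_match);
}

my ($match, $no_match) = partition_by_first_column(\@file_one, \@file_two);
print "Exclude: $_\n" for @$match;
print "Keep:    $_\n" for @$no_match;
```

On this data the texty_as_all_beat_hell.tst and ffoo.txt lines are excluded and the 1140904a.txt and baz.test lines are kept, the same split that the index-by-index version reports. The difference would only show up if the matching names were on different line numbers in the two files.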

Re: Examine two files to delete duplicates
by trippledubs (Deacon) on Sep 03, 2015 at 22:24 UTC

    You don't have to change the script itself; you just need to change the parameters that you pass it.

    ./script.pl file1 file2

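    Just replace file1 and file2 with your own file names: the first is the file that is examined, and the second is the one that has its duplicates removed.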
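
The reason argument order is enough: the diamond operator <> reads the files named in @ARGV one after another, and eof without parentheses is true at the end of the current file, so the first loop of the original script stops after the first file and the second loop carries on with the next one. The following is only a sketch of that behaviour. The sub name filter_second_by_first, the temporary files, and their contents are invented for illustration, and the split and defined check differ slightly from the original.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical wrapper around the original two-loop logic: the file named
# first fills the lookup table, the file named second is filtered.
sub filter_second_by_first {
    my ($lookup_file, $filter_file) = @_;
    local @ARGV = ($lookup_file, $filter_file);   # stands in for the command line

    my %names;
    while (<>) {
        my ($col1) = split ' ', $_;
        $names{$col1} = 1 if defined $col1;
        last if eof;      # eof without parens: end of the current file only
    }

    my @kept;
    while (<>) {          # continues with the next file in @ARGV
        push @kept, $_ if /^(\S+)/ && !$names{$1};
    }
    return @kept;
}

# Two throwaway files with invented contents.
my ($fh1, $name1) = tempfile(UNLINK => 1);
print {$fh1} "alice 1\nbob 2\n";
close $fh1 or die $!;

my ($fh2, $name2) = tempfile(UNLINK => 1);
print {$fh2} "bob 9\ncarol 3\n";
close $fh2 or die $!;

print filter_second_by_first($name1, $name2);   # second file minus names in first
print filter_second_by_first($name2, $name1);   # same files, roles swapped
```

The first call prints the carol line and the second prints the alice line, which is the effect of swapping the two file names on the command line.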
