Examine two files to delete duplicates

Coop197823 has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Examine two files to delete duplicates by stevieb (Canon) on Sep 03, 2015 at 16:57 UTC
Welcome to the Monastery, Coop197823! Always `use strict;` and `use warnings;`, and you should also include a sampling of your input for testing (put that in <code></code> tags like you've done with your code), along with a snip of your expected output. Does the following do what you expect?

```perl
#!/usr/bin/perl
use warnings;
use strict;

if ($ARGV[0] && ! $ARGV[1]){
    die "Usage: ./script.pl file1.txt file2.txt\n";
}

my $file1 = $ARGV[0] || 'in.txt';
my $file2 = $ARGV[1] || 'in2.txt';

my %names;

# file 1

open my $fh1, '<', $file1 or die "can't open file $file1: $!";

while (<$fh1>) {
    next if /^\s*$/;
    $names{(split)[0]} = 1;
}
close $fh1 or die $!;

# file 2

open my $fh2, '<', $file2 or die "can't open file $file2: $!";

while (<$fh2>) {
    print if /^(\S+)/ && not $names{$1};
}
close $fh2 or die $!;
```

You can call it like `./script.pl file1.txt file2.txt`. If you leave off the arguments, it'll use the defaults provided. If you have one argument but not both, the script will fail.

-stevieb

edit: added check for blank and whitespace-only lines, added `@ARGV`.
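For anyone who wants to try the hash-lookup idea from the reply above without creating files first, here is a minimal sketch that works on in-memory lines. The name `filter_unseen` and the sample lines are made up for illustration; they are not from the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the reply above): return the lines of
# @$second whose first whitespace-separated field does not occur as a
# first field anywhere in @$first.
sub filter_unseen {
    my ($first, $second) = @_;

    my %seen;
    for my $line (@$first) {
        next if $line =~ /^\s*$/;        # skip blank and whitespace-only lines
        my ($key) = split ' ', $line;    # first field; leading whitespace is ignored
        $seen{$key} = 1;
    }

    return grep {
        my ($key) = split ' ', $_;
        defined $key && !$seen{$key};    # drop blank lines and already-seen keys
    } @$second;
}

# Made-up sample data standing in for the two input files.
my @file1 = ("a.txt foo bar\n", "b.txt foo bar\n");
my @file2 = ("b.txt baz\n", "c.txt baz\n", "\n");

my @kept = filter_unseen(\@file1, \@file2);
print @kept;    # prints only the "c.txt baz" line
```

Unlike the regex `/^(\S+)/`, `split ' '` ignores leading whitespace, so an indented line in the second file is compared by its first real field. Whether that is desirable depends on the data.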
Re: Examine two files to delete duplicates by ww (Archbishop) on Sep 03, 2015 at 22:15 UTC
The hash you show would be relevant were you trying to match col2 in one file to col2 in another, but that's NOT what your narrative specifies. In fact, your textual problem description seems to suggest one kind of data and your code seems to try to deal with another (for which use of a hash could be appropriate; see comments after the script).

So, for the sake of argument, and using only my often-fallible crystal ball, let's assume data files like these:

file one, aka 1140904.txt:

```
test.txt foo bar baz
texty_as_all_beat_hell.tst foo bar baz
ffoo.txt test bar baz
1140904.txt bar baz bat
```

and... file two, aka 1140904a.txt:

```
1140904a.txt foo bar baz
texty_as_all_beat_hell.tst foo bar baz
ffoo.txt test bar baz
baz.test bar blivitz
```

The code below seems to illustrate (very verbosely, with excessive detail; NO, this is NOT the way it should look for production use) one answer to your explicit question (how to distinguish the first from the second file, which stevieb's approach in the first reply also handles satisfactorily) and a tactic for distinguishing the matches from the non-matches:

```perl
#!/usr/bin/perl
use 5.018;
use strict;      # LET PERL HELP YOU (id typos, etc)
use warnings;    # LET PERL HELP YOU (id typos, etc): strict and warnings, always!
use Data::Dumper;

# print "Enter name for first file: ";
my $file1 = 'C:\_ww\1140904.txt';

# print "Enter name for second file: ";
my $file2 = '1140904a.txt';

my (@fileONE, $fileONE);
open(my $FH, "<", $file1) || die "Can't open $file1: $!\n";
for my $line (<$FH>) {
    chomp $line;
    say "DEBUG Ln 18: \$line is: $line";
    my ($col1) = split(/ /, $line, 2);
    push @fileONE, $col1;
    next;
}

say Dumper @fileONE;
say "\n\t --------";

my (@fileTWO, $FH2);
open($FH2, "<", $file2) || die "Can't open $file2: $!\n";
for my $line2 (<$FH2>) {
    chomp $line2;
    say "DEBUG Ln 34: \$line2 is: $line2";
    my ($col1_2) = split(/ /, $line2);
    push @fileTWO, $col1_2;
    next;
}
say "DEBUG Ln 37 - reached Ln 37";
say Dumper @fileTWO;
say "\n *************";

my ($fileTWO, $i);
for ($i = 0; $i < @fileONE; ++$i) {
    my $BASEname1 = 'fileONE[$';
    my $Fname1 = '$' . $BASEname1 . "$i" . ']';
    my $BASENAME2 = 'fileTWO[$';
    my $Fname2 = '$' . $BASENAME2 . $i . ']';

    my $content1 = $fileONE[$i];
    my $content2 = $fileTWO[$i];
    # say "DEBUG Ln50: content of \$Fname1:\t $fileONE[$i] \n\t and content of \$Fname2: $fileTWO[$i] \n";

    if ($content1 eq $content2 ) {
        say "Exclude because it's a match: |--> $content2 <--| \n";
    } else {
        say "\t Content of $Fname1 and $Fname2 does not match.\n";
    }
}
```

If this way of tackling the problem of sorting the matches from the non-matches is irrelevant to your real problem, ignore the above .... but please note that you did NOT include sample data... something you should do in cases such as this, where the narrative and the code may disagree.

But really, I think what you were truly asking was for someone to learn Perl for you ("found this bit of code"), which is not conduct the Monks widely approve of. Please see "On asking for help", "How do I post a question effectively?" and "I know what I mean. Why don't you?". IOW, we're here to help you learn or to solve specific coding problems; NOT to be a script-writing service.
OUTPUT (less the DEBUG output, which is left as a learning aid):

```
Content of $fileONE[$0] and $fileTWO[$0] does not match.
Exclude because it's a match: |--> texty_as_all_beat_hell.tst <--|
Exclude because it's a match: |--> ffoo.txt <--|
Content of $fileONE[$3] and $fileTWO[$3] does not match.
```

And why did I leave the DEBUG lines in the code? Because a major goal here is "helping people learn."

`++$anecdote ne $data`*
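One thing worth noting about the loop in this reply: it compares the two files position by position (entry `$i` of one against entry `$i` of the other), so a name that appears in both files but on different lines would be reported as a non-match. If the real intent is "does this name occur anywhere in the other file", a hash lookup may be closer. The following is only a sketch under that assumption; `partition_names` is a hypothetical name, and the arrays hold the first columns of the sample files shown earlier in this reply.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split the names in @$two into those that also occur
# somewhere in @$one and those that do not, regardless of position.
sub partition_names {
    my ($one, $two) = @_;

    my %in_one = map { $_ => 1 } @$one;    # set of names from the first list

    my (@match, @no_match);
    for my $name (@$two) {
        if ($in_one{$name}) { push @match,    $name }
        else                { push @no_match, $name }
    }
    return (\@match, \@no_match);
}

# First columns of the two sample files from the reply.
my @fileONE = qw(test.txt texty_as_all_beat_hell.tst ffoo.txt 1140904.txt);
my @fileTWO = qw(1140904a.txt texty_as_all_beat_hell.tst ffoo.txt baz.test);

my ($match, $no_match) = partition_names(\@fileONE, \@fileTWO);
print "Match:    $_\n" for @$match;
print "No match: $_\n" for @$no_match;
```

With this sample data the result agrees with the positional loop, because the shared names happen to sit on the same lines in both files. The two approaches diverge once the files are ordered differently or have different lengths.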
Re: Examine two files to delete duplicates by trippledubs (Deacon) on Sep 03, 2015 at 22:24 UTC
You don't have to change the script itself; you just need to change the parameters that you pass to it:

`./script.pl file1 file2`

Just change file1 and file2.