Beefy Boxes and Bandwidth Generously Provided by pair Networks
Just another Perl shrine
 
PerlMonks  

comparing file contents

by kumaar1986 (Initiate)
on Nov 15, 2011 at 21:17 UTC ( #938252=perlquestion: print w/ replies, xml ) Need Help??
kumaar1986 has asked for the wisdom of the Perl Monks concerning the following question:

I have a two text files . I need to compare the contents of two files word by word. if the word is present in both the text files then then the particular word should be removed from the first text file and updated. How could i do this. Help me with this. I am beginner in perl and i find this difficult to tackle

Comment on comparing file contents
Re: comparing file contents
by GrandFather (Cardinal) on Nov 15, 2011 at 21:49 UTC

    Show us what you have tried. If you haven't tried anything tell what you have read. If you haven't read anything then do so! Read anything that looks relevant in Tutorials for a start, then come back to us with evidence that you have tried something and we will help with specific problems you may have.

    True laziness is hard work

      I have read that i can use hash to store the text files and can use compare function to check the file contents. use::compare checks files on the whole and does not perform word by word comparison

        So show us some code you have tried and tell us what isn't working for you. A very small amount of sample data, actual output and expected output will help a lot too.

        True laziness is hard work
Re: comparing file contents
by Util (Priest) on Nov 22, 2011 at 23:09 UTC
    The problems are numerous. A line-by-line walkthrough is needed to cover them. Since you have persevered, and desire to improve your Perl skills, here is that walkthrough:
    • #!/usr/bin/perl
      This line needs to be followed with these two lines:
          use strict;
          use warnings;
      The "strict" and "warnings" lines turn on error checking for common mistakes, and force you to declare your variables.
    • print "reading 'ansvarig_vardenhet_utf8.txt'\n";
      Google says that "Ansvarig vardenhet" is Swedish for "Responsible care unit". If that is true, then this print statement (and its twin further down) is lying. They are not reading 'ansvarig_vardenhet_utf8.txt'; they are reading 'icd-10-codes.txt' and 'female1.txt'. If you just need reassurance about what point your program has reached, then use something like print "P1\n"; and print "P2\n";. Otherwise, include the actual file names. Do not lie; it will just confuse the next person who works on the program.
    • open(WIN, "icd-10-codes.txt");
      As written, if the ICD file is missing, then your program will seem to run correctly, but will silently fail to do any replacements. Silent failure is not acceptable. die() is used to halt a program on a critical error; it is often coupled with or so that the die happens when the first part of a statement fails. The $! variable contains the reason for the failure.
           open(WIN, "icd-10-codes.txt") or die "Cannot open file: $!"; .
      Some other changes that would be improvements, but are not mandatory:
      • You are not telling Perl whether to read or write the file, so it defaults to read. It is better to be explicit.
            open( WIN, '<', "icd-10-codes.txt" )
      • WIN is a "bareword" filehandle. We now encourage the use of a real variable for the filehandle.
            open( my $win_fh, '<', "icd-10-codes.txt" )
      • If you use a variable to hold the filename, then you can locate the actual filename at the top of the program, where such configurable info is usually located. It will also allow you to put the filename in the "die" error message, and in the "reading ansvarig_vardenhet_utf8.txt" line already discussed, without repeating yourself.
            my $icd_file = 'icd-10-codes.txt';
            #print "Reading '$icd_file'\n";
            open( my $win_fh, '<', $icd_file ) or die "Cannot open '$icd_file': $!";
      By the way, the missing "strict" and "warnings" lines, and the missing '<' in "open", tell me that you are learning Perl from outdated material. I highly recommend Learning Perl or Modern Perl (or its free PDF).
    • while($line = <WIN>) {
      Since we added "use strict", we now have to declare all variables, such as $line. Also, $line needs to have its newline removed, otherwise the ICD description will have the newline embedded in it, which will mis-format your replacements. The chomp() function removes a trailing newline if one is present.
          while ( my $line = <WIN> ) {
              chomp $line;
    • @columns = split(/\t/, $line);
      Since each line has exactly two fields, you should name them. You can split() any extra fields into a generic array, to help find any future problems with the file formatting.
          my ($lookup_code, $icd_code_and_text, @rest) = split /\t/, $line;
          warn "Extra tab in file '$icd_file at line $.\n" if @rest;
    • #print "$columns[1]";
      That line is clearly for debugging, but if you use it, the output will be indistinct from its neighbors. You need a terminating newline.
          #print "$columns[1]\n";
    • foreach my $value (@columns) { #print "$value\n" }
      When you need to inspect more than a single simple variable, use Data::Dumper. In fact, use it to inspect even single simple variables, too.
          # This "use" line goes at the top, just after "use warnings".
          use Data::Dumper; $Data::Dumper::Useqq=1;

          print Dumper \@columns;
      Also, your closing brace is not lined up with the 'f' in 'foreach'. The Perl compiler does not care, but alignment will help any human trying to read your code.
    • close(WIN);
      This would need "or die" if it was anything more complex than *reading* a file on a *local* filesystem. It is fine as-is.

    Note that you have not preserved *any* data from the ICD file, so no ICD lookups can be done below this point! We will discuss how to fix this later, by copying the needed info to a hash or array.

    (OK, before I added the my before @columns, you did technically preserve *some* ICD data, but since it was only the data from the last line in the file, it hardly counts.)

    • print "reading 'ansvarig_vardenhet_utf8.txt'\n"; open(IN, "female1.txt"); while($line = <IN>) {
      Fix as described above, including chomp().
    • %col = split(/ /, $line);
      No. This is nonsense. It would only make sense if you had a file with a single line, which contained all the keys and values separated by spaces. Instead of a hash, we need an array (declared, of course).
          my @cols = split / /, $line;
    • #print "$columns[1]\n";
      Use Data::Dumper.
    • foreach my $val (%col) {
      "for" is easier to type than "foreach", and either of them work. We have already changed from hash to array.
          for my $val (@cols) {
    • if($val == $columns[0]){ $val =~ s/$val/$columns[1]/g;
      The == operator compares two vars *numerically*. The eq operator compares two vars as strings, so it is the one you want. In other words, "3" == "3.0", but "3" ne "3.0". Also, if you have just determined that $val is an exact match, then s/// is overkill; just replace $var by assigning to it.
          if ( $val eq $columns[0] ) {
              $val = $columns[1];
      Note that, even with those changes, this code will not work, because @columns does not contain the whole table of ICD translations. Even if it did contain them, you would need *two* loops, nested, for this method to work. That would be needlessly complex and inefficient. Later, we will use a simple hash instead of @columns and two loops.
    • print IN ">>$val\n";
      This looks confused. print IN $anything_at_all will write to the IN file. Are you trying to overwrite the data in 'female1.txt'? Luckily, Perl will not let you, because you opened the file as read-only. Your original description says "...should be replaced with ICD-code files corresponding...", but you are not outputting the entire line. Is the '>>' a debugging marker, or does it indicate that you are trying to append to a file? Going by your original description, I think you need:
           print join( ' ', @columns ), "\n";
    • close(IN);
      Again, fine when reading locally.
    • I am a beginner in PERL.
      You are a beginner in Perl. (See PERL as shibboleth and the Perl community.)
      Welcome to the Perl Community!

    The problem with your *approach* is just the failure to preserve the ICD data from the first half of the program so that the second half can use it. This can be done by declaring %lookup_icd above the first while() loop, and then within the while() loop, use:
        $lookup_icd{$icd_code} = $icd_code_and_text;

    With a few more additions, it would look like this:

    #!/usr/bin/perl use strict; use warnings; use Data::Dumper; $Data::Dumper::Useqq = 1; my $icd_path = 'icd-10-codes.txt'; my $in_path = 'female1.txt'; my $out_path = 'female1_corrected.txt'; open my $icd_fh, '<', $icd_path or die "Cannot open '$icd_path': $!"; #print "Reading '$icd_path'\n"; my %lookup_icd; while ( my $line = <$icd_fh> ) { chomp $line; my ( $lookup_code, $icd_code_and_text ) = split /\t/, $line; # print Dumper $lookup_code, $icd_code_and_text; if ( exists $lookup_icd{$lookup_code} ) { warn "Replacing $lookup_code;" . " was '$lookup_icd{$lookup_code}'," . " now '$icd_code_and_text'\n"; } $lookup_icd{$lookup_code} = $icd_code_and_text; } close $icd_fh; #print Dumper \%lookup_icd; #print "Reading '$in_path'\n"; #print "Writing '$out_path'\n"; open my $in_fh, '<', $in_path or die "Cannot open '$in_path': $!"; open my $out_fh, '>', $out_path or die "Cannot open '$out_path': $!"; while ( my $line = <$in_fh> ) { chomp $line; my @cols = split / /, $line; for my $possible_icd (@cols) { my $replacement_icd = $lookup_icd{$possible_icd}; if ($replacement_icd) { $possible_icd = $replacement_icd; } } print {$out_fh} join( ' ', @cols ), "\n"; } close $in_fh; close $out_fh or warn "Cannot close '$out_path': $!\nSome data may not have been + written.";

    I strongly recommend spending some time with either of the books I mentioned above. You have ventured a little past "baby Perl", and so now will be better served by tutorial than by blind exploration.

      Thanks for your step by step explanation of the code and how to approach to make it right. the explanation is so informative and it helped me to learn insight of PERL. Thanks a lot.

      This looks confused. print IN $anything_at_all will write to the IN file. Are you trying to overwrite the data in 'female1.txt'? Luckily, Perl will not let you, because you opened the file as read-only. Your original description says "...should be replaced with ICD-code files corresponding...", but you are not outputting the entire line. Is the '>>' a debugging marker, or does it indicate that you are trying to append to a file? Going by your original description, I think you need:

      I need to overwrite the result in 'female1.txt' instead of writing in a new output file

      .

      can that be achieved by changing the 'female1.txt' in to read and write mode

      print {$out_fh} join( ' ', @cols ), "\n"

      to

      print {$in_path} join( ' ', @cols ), "\n"

      Let me know if there is any other way to do this. thanks for your help in advance. your explanations are really good. I appreciate your time and effort for helping

        Yes, it could be achieved by opening the file in read+write mode, or by re-opening the file in write mode after its read-only filehandle was closed. However, either of these methods would require other code changes, to hold all the lines of output until after all the input had occurred. The simplest way is to append this code to the bottom of my previous code:
        unlink $in_path or die "Could not delete '$in_path': $!"; use File::Copy; move( $out_path, $in_path ) or die "Could not move file '$out_path' to '$in_path': $!";

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://938252]
Approved by GrandFather
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others lurking in the Monastery: (13)
As of 2014-12-18 17:29 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    Is guessing a good strategy for surviving in the IT business?





    Results (59 votes), past polls