comparing file contents

kumaar1986 has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: comparing file contents by GrandFather (Saint) on Nov 15, 2011 at 21:49 UTC
Show us what you have tried. If you haven't tried anything tell what you have read. If you haven't read anything then do so! Read anything that looks relevant in Tutorials for a start, then come back to us with evidence that you have tried something and we will help with specific problems you may have. True laziness is hard work	[reply]
Re^2: comparing file contents by kumaar1986 (Initiate) on Nov 15, 2011 at 21:57 UTC
I have read that i can use hash to store the text files and can use compare function to check the file contents. use::compare checks files on the whole and does not perform word by word comparison	[reply]
Re^3: comparing file contents by GrandFather (Saint) on Nov 15, 2011 at 22:01 UTC
So show us some code you have tried and tell us what isn't working for you. A very small amount of sample data, actual output and expected output will help a lot too. True laziness is hard work	[reply]
Re^4: comparing file contents by Anonymous Monk on Nov 22, 2011 at 16:23 UTC
Re^5: comparing file contents by Util (Priest) on Nov 22, 2011 at 17:40 UTC
Some notes below your chosen depth have not been shown here
Re: comparing file contents by Util (Priest) on Nov 22, 2011 at 23:09 UTC
The problems are numerous. A line-by-line walkthrough is needed to cover them. Since you have persevered, and desire to improve your Perl skills, here is that walkthrough: `#!/usr/bin/perl` [download] This line needs to be followed with these two lines: `use strict;` `use warnings;` The "strict" and "warnings" lines turn on error checking for common mistakes, and force you to declare your variables. `print "reading 'ansvarig_vardenhet_utf8.txt'\n";` [download] Google says that "Ansvarig vardenhet" is Swedish for "Responsible care unit". If that is true, then this print statement (and its twin further down) is lying. They are not reading `'ansvarig_vardenhet_utf8.txt'`; they are reading `'icd-10-codes.txt'` and `'female1.txt'`. If you just need reassurance about what point your program has reached, then use something like `print "P1\n";` and `print "P2\n";`. Otherwise, include the actual file names. Do not lie; it will just confuse the next person who works on the program. `open(WIN, "icd-10-codes.txt");` [download] As written, if the ICD file is missing, then your program will seem to run correctly, but will silently fail to do any replacements. Silent failure is not acceptable. `die()` is used to halt a program on a critical error; it is often coupled with `or` so that the `die` happens when the first part of a statement fails. The `$!` variable contains the reason for the failure. `open(WIN, "icd-10-codes.txt") or die "Cannot open file: $!";` . Some other changes that would be improvements, but are not mandatory: You are not telling Perl whether to read or write the file, so it defaults to read. It is better to be explicit. `open( WIN, '<', "icd-10-codes.txt" )` `WIN` is a "bareword" filehandle. We now encourage the use of a real variable for the filehandle. `open( my $win_fh, '<', "icd-10-codes.txt" )` If you use a variable to hold the filename, then you can locate the actual filename at the top of the program, where such configurable info is usually located. It will also allow you to put the filename in the "die" error message, and in the "reading ansvarig_vardenhet_utf8.txt" line already discussed, without repeating yourself. `my $icd_file = 'icd-10-codes.txt';` `#print "Reading '$icd_file'\n";` `open( my $win_fh, '<', $icd_file ) or die "Cannot open '$icd_file': $!";` By the way, the missing "strict" and "warnings" lines, and the missing `'<'` in "open", tell me that you are learning Perl from outdated material. I highly recommend Learning Perl or Modern Perl (or its free PDF). `while($line = <WIN>) {` [download] Since we added "use strict", we now have to declare all variables, such as `$line`. Also, `$line` needs to have its newline removed, otherwise the ICD description will have the newline embedded in it, which will mis-format your replacements. The `chomp()` function removes a trailing newline if one is present. `while ( my $line = <WIN> ) {` `chomp $line;` `@columns = split(/\t/, $line);` [download] Since each line has exactly two fields, you should name them. You can `split()` any extra fields into a generic array, to help find any future problems with the file formatting. `my ($lookup_code, $icd_code_and_text, @rest) = split /\t/, $line;` `warn "Extra tab in file '$icd_file at line $.\n" if @rest;` `#print "$columns[1]";` [download] That line is clearly for debugging, but if you use it, the output will be indistinct from its neighbors. You need a terminating newline. `#print "$columns[1]\n";` `foreach my $value (@columns) { #print "$value\n" }` [download] When you need to inspect more than a single simple variable, use Data::Dumper. In fact, use it to inspect even single simple variables, too. `# This "use" line goes at the top, just after "use warnings".` `use Data::Dumper; $Data::Dumper::Useqq=1;` `print Dumper \@columns;` Also, your closing brace is not lined up with the 'f' in 'foreach'. The Perl compiler does not care, but alignment will help any human trying to read your code. `close(WIN);` [download] This would need "or die" if it was anything more complex than reading a file on a local filesystem. It is fine as-is. Note that you have not preserved any data from the ICD file, so no ICD lookups can be done below this point! We will discuss how to fix this later, by copying the needed info to a hash or array. (OK, before I added the `my` before `@columns`, you did technically preserve some ICD data, but since it was only the data from the last line in the file, it hardly counts.) `print "reading 'ansvarig_vardenhet_utf8.txt'\n"; open(IN, "female1.txt"); while($line = <IN>) {` [download] Fix as described above, including `chomp()`. `%col = split(/ /, $line);` [download] No. This is nonsense. It would only make sense if you had a file with a single line, which contained all the keys and values separated by spaces. Instead of a hash, we need an array (declared, of course). `my @cols = split / /, $line;` `#print "$columns[1]\n";` [download] Use Data::Dumper. `foreach my $val (%col) {` [download] "for" is easier to type than "foreach", and either of them work. We have already changed from hash to array. `for my $val (@cols) {` `if($val == $columns[0]){ $val =~ s/$val/$columns[1]/g;` [download] The `==` operator compares two vars numerically. The `eq` operator compares two vars as strings, so it is the one you want. In other words, "3" == "3.0", but "3" ne "3.0". Also, if you have just determined that $val is an exact match, then s/// is overkill; just replace $var by assigning to it. `if ( $val eq $columns[0] ) {` `$val = $columns[1];` Note that, even with those changes, this code will not work, because @columns does not contain the whole table of ICD translations. Even if it did contain them, you would need two loops, nested, for this method to work. That would be needlessly complex and inefficient. Later, we will use a simple hash instead of @columns and two loops. `print IN ">>$val\n";` [download] This looks confused. `print IN $anything_at_all` will write to the IN file. Are you trying to overwrite the data in 'female1.txt'? Luckily, Perl will not let you, because you opened the file as read-only. Your original description says "...should be replaced with ICD-code files corresponding...", but you are not outputting the entire line. Is the '>>' a debugging marker, or does it indicate that you are trying to append to a file? Going by your original description, I think you need: `print join( ' ', @columns ), "\n";` `close(IN);` [download] Again, fine when reading locally. `I am a beginner in PERL.` [download] You are a beginner in Perl. (See PERL as shibboleth and the Perl community.) Welcome to the Perl Community! The problem with your approach is just the failure to preserve the ICD data from the first half of the program so that the second half can use it. This can be done by declaring `%lookup_icd` above the first `while()` loop, and then within the `while()` loop, use: `$lookup_icd{$icd_code} = $icd_code_and_text;` With a few more additions, it would look like this: #!/usr/bin/perl use strict; use warnings; use Data::Dumper; $Data::Dumper::Useqq = 1; my $icd_path = 'icd-10-codes.txt'; my $in_path = 'female1.txt'; my $out_path = 'female1_corrected.txt'; open my $icd_fh, '<', $icd_path or die "Cannot open '$icd_path': $!"; #print "Reading '$icd_path'\n"; my %lookup_icd; while ( my $line = <$icd_fh> ) { chomp $line; my ( $lookup_code, $icd_code_and_text ) = split /\t/, $line; # print Dumper $lookup_code, $icd_code_and_text; if ( exists $lookup_icd{$lookup_code} ) { warn "Replacing $lookup_code;" . " was '$lookup_icd{$lookup_code}'," . " now '$icd_code_and_text'\n"; } $lookup_icd{$lookup_code} = $icd_code_and_text; } close $icd_fh; #print Dumper \%lookup_icd; #print "Reading '$in_path'\n"; #print "Writing '$out_path'\n"; open my $in_fh, '<', $in_path or die "Cannot open '$in_path': $!"; open my $out_fh, '>', $out_path or die "Cannot open '$out_path': $!"; while ( my $line = <$in_fh> ) { chomp $line; my @cols = split / /, $line; for my $possible_icd (@cols) { my $replacement_icd = $lookup_icd{$possible_icd}; if ($replacement_icd) { $possible_icd = $replacement_icd; } } print {$out_fh} join( ' ', @cols ), "\n"; } close $in_fh; close $out_fh or warn "Cannot close '$out_path': $!\nSome data may not have been + written."; [download] I strongly recommend spending some time with either of the books I mentioned above. You have ventured a little past "baby Perl", and so now will be better served by tutorial than by blind exploration.	[reply] [d/l] [select]
Re^2: comparing file contents by kumaar1986 (Initiate) on Nov 23, 2011 at 00:33 UTC
Thanks for your step by step explanation of the code and how to approach to make it right. the explanation is so informative and it helped me to learn insight of PERL. Thanks a lot.	[reply]
Re^2: comparing file contents by kumaar1986 (Initiate) on Nov 23, 2011 at 01:27 UTC
This looks confused. print IN $anything_at_all will write to the IN file. Are you trying to overwrite the data in 'female1.txt'? Luckily, Perl will not let you, because you opened the file as read-only. Your original description says "...should be replaced with ICD-code files corresponding...", but you are not outputting the entire line. Is the '>>' a debugging marker, or does it indicate that you are trying to append to a file? Going by your original description, I think you need: I need to overwrite the result in 'female1.txt' instead of writing in a new output file . can that be achieved by changing the 'female1.txt' in to read and write mode `print {$out_fh} join( ' ', @cols ), "\n"` to `print {$in_path} join( ' ', @cols ), "\n"` Let me know if there is any other way to do this. thanks for your help in advance. your explanations are really good. I appreciate your time and effort for helping	[reply] [d/l] [select]
Re^3: comparing file contents by Util (Priest) on Nov 23, 2011 at 01:59 UTC
Yes, it could be achieved by opening the file in read+write mode, or by re-opening the file in write mode after its read-only filehandle was closed. However, either of these methods would require other code changes, to hold all the lines of output until after all the input had occurred. The simplest way is to append this code to the bottom of my previous code: `unlink $in_path or die "Could not delete '$in_path': $!"; use File::Copy; move( $out_path, $in_path ) or die "Could not move file '$out_path' to '$in_path': $!";` [download]	[reply] [d/l]
Re^4: comparing file contents by kumaar1986 (Initiate) on Nov 26, 2011 at 16:01 UTC
Re^5: comparing file contents by Anonymous Monk on Nov 26, 2011 at 16:46 UTC
Re^5: comparing file contents by kumaar1986 (Initiate) on Nov 28, 2011 at 00:56 UTC


No such thing as a small change
	PerlMonks