http://www.perlmonks.org?node_id=1006148


in reply to My "if" is failing!

I'm wondering if maybe you're going about this the wrong way. You gave us this description of the task:
    "I am parsing an HTML file for a word in a foreign language. If the word is found, it will add it to a file."
Your code expects a file called "test.txt" that contains one or more (Urdu?) words. BTW, is the text in that file UTF-8 encoded? Have you made sure that your script is reading it correctly?
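
One way to take the guesswork out of the encoding question is to open the word list with an explicit encoding layer. This is only a sketch, assuming the file really is UTF-8; the helper name read_words is made up for illustration and is not part of the OP code:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the OP code): read a word list as UTF-8,
# one word per line, and return the non-empty lines as character strings.
sub read_words {
    my ($path) = @_;

    # ':encoding(UTF-8)' decodes the bytes into characters as they are read
    open( my $fh, '<:encoding(UTF-8)', $path ) or die "$path: $!\n";
    my @words = <$fh>;
    close $fh;

    chomp @words;
    return grep { length $_ } @words;    # skip blank lines
}
```

If the words come back with the length you expect (an Urdu word of three letters should have length 3, not 6), the file is being decoded correctly.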

Then your code expects a file called "platts_wkg.html", which is assumed to contain one or more matches to the words in test.txt. Is the html file encoded the same way as test.txt? Have you checked whether some of the words that "ought" to match might be using numeric character entities (e.g. &#1576; or &#x628; for the Urdu letter "b")?
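
If the html does turn out to use numeric character entities, the lines would need to be decoded before they can match the words from test.txt. Here is a minimal sketch using only core Perl; the sub name decode_numeric_entities is hypothetical, and it handles only the numeric forms, not named entities like &amp;amp; (the CPAN module HTML::Entities has a decode_entities function that covers those as well):

```perl
use strict;
use warnings;

# Hypothetical helper (not from the OP code): replace numeric character
# references with the characters they stand for.
sub decode_numeric_entities {
    my ($text) = @_;
    $text =~ s/&#[xX]([0-9a-fA-F]+);/chr(hex($1))/ge;    # hex form, e.g. &#x628;
    $text =~ s/&#([0-9]+);/chr($1)/ge;                   # decimal form, e.g. &#1576;
    return $text;
}
```

You would call this on each line read from the html file, before testing it against the target words.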

But there are some problems with the OP logic:

  1. When the "else" clause is not commented out, you will be prompted to "Type in an alternate spelling" for every line in the HTML file that does not match /^.*?<p>.*?( $sw )/ (where $sw is a single word just read from test.txt), and there are likely to be quite a few such lines in that file.
  2. You are re-opening and re-reading the HTML file for each word in the "test.txt" list, which is really inefficient (and really tedious, if you have to respond manually to all those prompts on every line of HTML input).

In other words, your "if" statement is not failing - it's doing what the logic says it should do. The problem is that the logic is wrong.

I think you should start by reading all the contents of "test.txt" before you open the html file. Combine all the target words into a single regex, and then do just one pass over the html data - like this:

    my $targets_file = '/Users/me/test.txt';  # (I'd rather get this from @ARGV)
    open( my $urdu_words, '<', $targets_file )  # (2nd arg might need ':utf8' too)
        or die "$targets_file: $!\n";
    my @target_strings = <$urdu_words>;
    close $urdu_words;
    chomp @target_strings;
    my $target_regex = join( '|', @target_strings );

    # Now open and read from the html file
    # Use a hash to keep track of matches, so you can sort them later:

    my $html_file = '/Users/me/platts_wkg.html';  # could get that from @ARGV too
    open( my $platts, '<', $html_file ) or die "$html_file: $!\n";
    my %matches;
    while (<$platts>) {
        if ( /^.*?<p>.*? ($target_regex) / ) {  # note: spaces are now OUTSIDE parens
            $matches{$1} .= " $_";
        }
    }

    # At this point, it would be easy to dump all the matches to a file, and
    # then edit that file manually, if you want:

    my $matched_file = '/Users/me/matches_found.txt';
    open( my $output, '>', $matched_file ) or die "$matched_file: $!\n";
    for my $match ( sort keys %matches ) {
        print $output "Matches found for target: $match\n $matches{$match}\n";
    }

If there's something you want to do with lines that don't match any of the target words, you can put an "else" clause in the while loop that reads from the html file. But in that case, I would again recommend that you avoid anything that requires manual input to the script for each html line. Instead, collect those lines in an array or hash, print them to a separate file, and deal with them afterwards in a way that's likely to be easier and less error-prone.
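
To make that last suggestion concrete, here is a rough sketch of the "else" idea. It assumes the lines have already been read into a list; the sub name sort_lines and its return values are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the OP code): split html lines into those
# that match a target word (grouped by word) and those that match nothing.
sub sort_lines {
    my ( $target_regex, @lines ) = @_;
    my ( %matches, @unmatched );
    for my $line (@lines) {
        if ( $line =~ /^.*?<p>.*? ($target_regex) / ) {
            $matches{$1} .= " $line";
        }
        else {
            push @unmatched, $line;    # no prompt here; review these later in bulk
        }
    }
    return ( \%matches, \@unmatched );
}
```

The caller can then print the unmatched lines to their own file in one go, and you can go through that file in an editor instead of answering a prompt for every line.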