http://www.perlmonks.org?node_id=1006148


in reply to My "if" is failing!

I'm wondering if maybe you're going about this the wrong way. You gave us this description of the task:
I am parsing an HTML file for a word in a foreign language. If the word is found, it will add it to a file.
Your code expects you to have a file called "test.txt" that contains one or more (Urdu?) words. BTW, is the text in that file UTF-8 encoded? Have you made sure that your script is reading it correctly?
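One way to check the "reading it correctly" part is to open the file with an explicit decoding layer and print the code points of what came in. This is only a rough sketch: the `read_words` and `code_points` helpers are names I made up, and the in-memory string stands in for test.txt so the snippet runs on its own.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Read a word list from an already-opened handle, dropping empty lines.
sub read_words {
    my ($fh) = @_;
    my @words = <$fh>;
    chomp @words;
    return grep { length } @words;
}

# Render a string as code points so its decoding can be checked by eye.
sub code_points {
    my ($string) = @_;
    return join ' ', map { sprintf( 'U+%04X', ord($_) ) } split //, $string;
}

# Demo: an in-memory stand-in for test.txt, holding the UTF-8 bytes of a
# single four-letter word (hypothetical sample data).
my $bytes = encode( 'UTF-8', "\x{6A9}\x{62A}\x{627}\x{628}\n" );
open( my $fh, '<:encoding(UTF-8)', \$bytes ) or die "in-memory file: $!\n";
my @words = read_words($fh);
close $fh;

print code_points($_), "\n" for @words;
```

If the decoding layer is doing its job, a four-letter word shows up as four code points in the U+06xx range. If you see eight values below U+0100 instead, the script is treating the UTF-8 bytes as separate characters.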

Then your code expects you to have a file called "platts_wkg.html", which is assumed to contain one or more matches to the words in test.txt. Is the html file encoded the same way as test.txt? Have you checked whether some of the words that "ought" to match might be using numeric character entities (e.g. &#1576; or &#x628; for the Urdu letter "b", ب)?
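If it turns out the HTML does use numeric character entities, one option is to turn them back into real characters before matching. Here is a minimal sketch, assuming only decimal and hexadecimal numeric references need handling; `decode_numeric_entities` is a hypothetical helper and the sample line is invented.

```perl
use strict;
use warnings;

# Replace decimal (&#1576;) and hexadecimal (&#x628;) numeric character
# references with the characters they stand for.
sub decode_numeric_entities {
    my ($text) = @_;
    $text =~ s/&#[xX]([0-9a-fA-F]+);/chr(hex($1))/ge;
    $text =~ s/&#([0-9]+);/chr($1)/ge;
    return $text;
}

# Invented sample line that mixes decimal and hex references.
my $line    = '<p>&#1705;&#1578;&#1575;&#x628; -- book</p>';
my $decoded = decode_numeric_entities($line);

# The same word written as real characters, as it would be after
# reading a properly decoded word list.
my $target = "\x{6A9}\x{62A}\x{627}\x{628}";

print "matched after decoding\n" if $decoded =~ /\Q$target\E/;
```

The HTML::Entities module on CPAN provides `decode_entities`, which also handles named entities; the regex version above just avoids a non-core dependency.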

But there are some problems with the OP logic:

  1. When the "else" clause is not commented out, you will be prompted to "Type in an alternate spelling" for every line in the HTML file that does not match /^.*?<p>.*?( $sw )/ (where $sw is a single word just read from test.txt), and it's likely that there are quite a few such lines in that file.
  2. You are re-opening and re-reading the HTML file for each word in the "test.txt" list, which is really inefficient (and really tedious, if you have to respond manually to all those prompts on every line of HTML input).

In other words, your "if" statement is not failing - it's doing what the logic says it should do. The problem is that the logic is wrong.
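To see how quickly those prompts add up, here is a small sketch that mimics the nested layout (one pass over the HTML lines per word) and just counts how often the "else" branch would fire. The words and lines are made-up ASCII stand-ins, the regex is modeled on the one quoted above, and `count_else_hits` is a hypothetical name.

```perl
use strict;
use warnings;

# Count how often the "else" branch fires when every word is tested
# against every line, as in the nested-loop layout.
sub count_else_hits {
    my ( $words, $lines ) = @_;
    my $else_hits = 0;
    for my $sw (@$words) {
        for my $line (@$lines) {
            if ( $line =~ /^.*?<p>.*?( \Q$sw\E )/ ) {
                # a match: this is where the line would be written out
            }
            else {
                $else_hits++;    # this is where the prompt would appear
            }
        }
    }
    return $else_hits;
}

my @words = ( 'alpha', 'beta' );
my @lines = (
    '<html>',
    '<p>1. alpha -- first entry</p>',
    '<p>2. beta -- second entry</p>',
    '</html>',
);

print count_else_hits( \@words, \@lines ), " prompts\n";    # prints "6 prompts"
```

Two words against four lines already give six prompts, and only two lines ever matched anything. With a real word list and a real dictionary file, the count is roughly (number of words) times (number of lines).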

I think you should start by reading all the contents of "test.txt" before you open the html file. Combine all the target words into a single regex, and then do just one pass over the html data - like this:

    my $targets_file = '/Users/me/test.txt';    # (I'd rather get this from @ARGV)
    open( my $urdu_words, '<', $targets_file )  # (2nd arg might need ':utf8' too)
        or die "$targets_file: $!\n";
    my @target_strings = <$urdu_words>;
    close $urdu_words;
    chomp @target_strings;
    my $target_regex = join( '|', @target_strings );

    # Now open and read from the html file
    # Use a hash to keep track of matches, so you can sort them later:

    my $html_file = '/Users/me/platts_wkg.html';  # could get that from @ARGV too
    open( my $platts, '<', $html_file ) or die "$html_file: $!\n";
    my %matches;
    while (<$platts>) {
        if ( /^.*?<p>.*? ($target_regex) / ) {  # note: spaces are now OUTSIDE parens
            $matches{$1} .= " $_";
        }
    }

    # At this point, it would be easy to dump all the matches to a file, and
    # then edit that file manually, if you want:

    my $matched_file = '/Users/me/matches_found.txt';
    open( my $output, '>', $matched_file ) or die "$matched_file: $!\n";
    for my $match ( sort keys %matches ) {
        print $output "Matches found for target: $match\n $matches{$match}\n";
    }

If there's something you want to do with lines that don't match for any of the target words, you can put an "else" clause in the while loop that reads from the html file. But in that case, I would again recommend that you avoid doing anything that involves manual input to the script for each html line - put stuff into an array or hash, print it to a separate file, and deal with it in some way that's likely to be easier and less error-prone.
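As a rough illustration of that "else" idea, the sketch below sorts lines into a hash of matches and an array of leftovers in one pass, so the leftovers can be printed to their own file afterwards. The sample words and lines are invented ASCII stand-ins, `sort_lines` is a hypothetical helper, and the `quotemeta` call is my own addition to keep any regex metacharacters in the word list from being interpreted.

```perl
use strict;
use warnings;

# Split lines into matches (keyed by the target word found) and a list
# of lines that matched none of the targets.
sub sort_lines {
    my ( $target_regex, @lines ) = @_;
    my ( %matches, @unmatched );
    for my $line (@lines) {
        if ( $line =~ /^.*?<p>.*? ($target_regex) / ) {
            $matches{$1} .= " $line";
        }
        else {
            push @unmatched, $line;    # no prompt here; deal with these later
        }
    }
    return ( \%matches, \@unmatched );
}

# Invented stand-ins for the word list and the HTML lines.
my $target_regex = join '|', map { quotemeta($_) } qw(alpha beta);
my @lines = (
    "<p>1. alpha -- first entry</p>\n",
    "<p>2. gamma -- third entry</p>\n",
    "<hr>\n",
);

my ( $matches, $unmatched ) = sort_lines( $target_regex, @lines );

print "Matched targets: ", join( ', ', sort keys %$matches ), "\n";
print "Unmatched lines: ", scalar @$unmatched, "\n";
```

In a real run, @$unmatched would go to a second output file opened the same way as the matches file, and you could review it in an editor at your own pace.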

Re^2: My "if" is failing!
by Sumtingwong (Novice) on Nov 29, 2012 at 06:39 UTC

    Yes, that is exactly what it expects: an Urdu word. Hindi is also a possibility as Platts has both and this script will be used for that as well. I did solve the encoding issue and that works without problems--I can find a several-word combination with the present regex and current files. The problem with Platts is that it has older spellings and inconsistent entry of some of the letters. I did a mass search and replace for some of the inconsistencies, but there are differences in word spellings in Platts' time (late 1800's) from what is the norm now that are not necessarily incorrect. This is why an alternate spelling is requested.

    I will try your approach to the problem. The reason for asking for the correct definition is to cut down on numerous multiple definitions for the same word--an easy fix with BBEdit once the file is generated.

    As you can see, I am still very much a noob with the code writing but am able to, most of the time, get the results that I need in a much shorter time period than by manually doing the work. The rest of the time I am learning something for the next iteration!

    Thanks for your help!

Re^2: My "if" is failing!
by Sumtingwong (Novice) on Nov 30, 2012 at 06:32 UTC

    Yeah, to say my logic was wrong was an understatement! I played around with the original for a few minutes today and inserted a few print statements--my poor man's debugger. I didn't realize that the search string was going to be tested against every line in the html file--I don't quite know what I was thinking, since that is exactly what I expected the input file to do. It makes sense now, and the elegance of your solution is clear to me.