|Problems? Is your data what you think it is?|
Re: My "if" is failing!by graff (Chancellor)
|on Nov 29, 2012 at 02:56 UTC||Need Help??|
I'm wondering if maybe you're going about this the wrong way. You gave us this description of the task:
I am parsing an HTML file for a word in a foreign language. If the word is found, it will add it to a file.Your code expects you have a file called "test.txt" that contains one or more (Urdu?) words. BTW, is the text in that file UTF-8 encoded? Have you made sure that your script is reading it correctly?
Then your code expects you have a file called "platts_wkg.html", which is assumed to contain one or more matches to the words in test.txt. Is the html file encoded the same way as test.txt? Have you checked whether some of the words that "ought" to match might be using numeric character entities (e.g. ب or ء for the Urdu letter "b")?
But there are some problems with the OP logic:
In other words, your "if" statement is not failing - it's doing what the logic says it should do. The problem is that the logic is wrong.
I think you should start by reading all the contents of "test.txt" before you open the html file. Combine all the target words into a single regex, and then do just one pass over the html data - like this:
If there's something you want to do with lines that don't match for any of the target words, you can put an "else" clause in the while loop that reads from the html file. But in that case, I would again recommend that you avoid doing anything that involves manual input to the script for each html line - put stuff into an array or hash, print it to a separate file, and deal with it in some way that's likely to be easier and less error-prone.