http://www.perlmonks.org?node_id=1006112

Sumtingwong has asked for the wisdom of the Perl Monks concerning the following question:

Pretty new here and not much experience...please bear with me. I am parsing an HTML file for a word in a foreign language. If the word is found, it will add it to a file. However, I put an "else" in there and the regex will no longer work in the opening "if" and defaults to the first "print" statement under the "else". When the "else" block is removed, the regex works. Any help would be greatly appreciated!

    open (URDUWORDS, "</Users/me/test.txt") ||
        die "Can't open files: $!";

    while (<URDUWORDS>) {
        chomp;
        my $sw = $_;
        open (PLATTS, "</Users/me/platts_wkg.html") ||
            die "Can't open files: $!";

        while (<PLATTS>) {
            if (/^.*?<p>.*?( $sw )/) {
                print "$sw \t $_ \n";
                print "Add this definition? ";
                chomp (my $dec = <STDIN>);
                if ($dec eq "y") {
                    print "\nadded \n\n";
                    open (DEFS, ">>/Users/me/defs.txt") ||
                        die "Can't open files: $!";
                    print DEFS "$sw \t $_ \n";
                    close DEFS;
                }
            #} else {
            #    print "Type in alternate spelling: ";
            #    chomp (my $sw = <STDIN>);
            #    if (/^.*?<p>.*?( $sw )/i) {
            #        print "$sw \t $_ \n";
            #        print "Add this definition? ";
            #        chomp (my $dec = <STDIN>);
            #        if ($dec eq "y") {
            #            print "\nadded \n\n";
            #            open (DEFS, ">>/Users/me/defs.txt") ||
            #                die "Can't open files: $!";
            #            print DEFS "$sw \t $_ \n";
            #            close DEFS;
            #        }
            #    }
            }
        }
    }

Re: My "if" is failing!
by 2teez (Vicar) on Nov 28, 2012 at 22:57 UTC

    Hi Sumtingwong,

    However, I put an "else" in there and the regex will no longer work in the opening "if" and defaults to the first "print" statement under the "else". When the "else" block is removed, the regex works

    You are having this issue because the else {...} attaches to the first if, i.e. if (/^.*?<p>.*?( $sw )/) {...}, and not to if ($dec eq "y") {...} as intended.

    How, you ask? Look at this part of your code:

        ...
        while (<PLATTS>) {
            if (/^.*?<p>.*?( $sw )/) {    # first if
                ...
                if ( $dec eq "y" ) {      # second if
                    ...
                }    # second if CLOSED and DONE WITH
            } else { ## this was intended to work with second if statement, but now works with the first if
                ...
            }
        ...
    Meanwhile, what was intended was this:
        ...
        while (<PLATTS>) {
            if (/^.*?<p>.*?( $sw )/) {    # first if
                ...
                if ( $dec eq "y" ) {      # second if
                    ...
                } else { ## NOW else is attached to the second if, and now works as intended
                    ...
                }
            }    ## END of first if statement.
        ...
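To see the difference between the two placements without touching any files, here is a minimal sketch. The two subs and their arguments are hypothetical stand-ins for "did the line match?" and "what did the user answer?"; they only record which branch runs.

```perl
use strict;
use warnings;

# Hypothetical stand-in: else attached to the OUTER (regex) if.
sub else_on_outer {
    my ($line_matches, $answer) = @_;
    my @trace;
    if ($line_matches) {
        if ($answer eq 'y') {
            push @trace, 'added';
        }
    } else {
        push @trace, 'else';    # runs for every line that does not match
    }
    return "@trace";
}

# Hypothetical stand-in: else attached to the INNER ($dec eq "y") if.
sub else_on_inner {
    my ($line_matches, $answer) = @_;
    my @trace;
    if ($line_matches) {
        if ($answer eq 'y') {
            push @trace, 'added';
        } else {
            push @trace, 'else';    # runs only when a matching line is declined
        }
    }
    return "@trace";
}

for my $case ( [1, 'y'], [1, 'n'], [0, 'n'] ) {
    my ($matches, $answer) = @$case;
    printf "match=%d answer=%s outer=[%s] inner=[%s]\n",
        $matches, $answer,
        else_on_outer($matches, $answer),
        else_on_inner($matches, $answer);
}
```

With the else on the outer if, every non-matching line lands in the else branch, which is why the prompt appears so often when reading a large HTML file.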

    That being said, a few other things could also improve your code:
    • Avoid barewords as file handles; use a lexical variable in your open call.
      Also use the 3-argument form of open, like so: open my $fh, '<', $filename or die "can't open file: $!";
    • All the open calls "sandwiched" inside your while loop and if statement can be stated once, up front, and then referred to through the lexical variables used as file handles, like so:
          open my $fh_output,      '>>', $output      or die "can't open file: $!";
          open my $fh_input_file1, '<',  $input_file1 or die "can't open file: $!";
          open my $fh_input_file2, '<',  $input_file2 or die "can't open file: $!";
          ...
          if ( $dec eq "y" ) {
              print "\nadded \n\n";
              ...
              print $fh_output "$sw \t $_ \n";
          }
          ...
    • Update:
      Please DO NOT parse an HTML file using regexes. Use a CPAN module such as HTML::Parser, HTML::TokeParser, HTML::TreeBuilder, or any other one.
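As a runnable illustration of the file handle advice in the list above, here is a small sketch that opens the output once, before the loop, using lexical handles and 3-argument open. The temporary directory and the file names words.txt and defs.txt are assumptions made so the example does not touch real files.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical stand-ins for the real input and output files.
my $dir    = tempdir( CLEANUP => 1 );
my $input  = "$dir/words.txt";
my $output = "$dir/defs.txt";

# Create a tiny sample input so the example is self-contained.
open my $fh_make, '>', $input or die "can't open $input: $!";
print {$fh_make} "alpha\nbeta\n";
close $fh_make or die "can't close $input: $!";

# Open both handles once, outside the loop.
open my $fh_in,  '<',  $input  or die "can't open $input: $!";
open my $fh_out, '>>', $output or die "can't open $output: $!";

while ( my $word = <$fh_in> ) {
    chomp $word;
    print {$fh_out} "$word\n";    # append each word to the output file
}

close $fh_in;
close $fh_out or die "can't close $output: $!";
```

Writing the mode as a separate argument also means a file name that happens to begin with "<" or ">" cannot change how the file is opened.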

    If you tell me, I'll forget.
    If you show me, I'll remember.
    if you involve me, I'll understand.
    --- Author unknown to me

      Ok, I got it now. This helps a lot, thanks!

Re: My "if" is failing!
by graff (Chancellor) on Nov 29, 2012 at 02:56 UTC
    I'm wondering if maybe you're going about this the wrong way. You gave us this description of the task:
    I am parsing an HTML file for a word in a foreign language. If the word is found, it will add it to a file.
    Your code expects you have a file called "test.txt" that contains one or more (Urdu?) words. BTW, is the text in that file UTF-8 encoded? Have you made sure that your script is reading it correctly?

    Then your code expects you have a file called "platts_wkg.html", which is assumed to contain one or more matches to the words in test.txt. Is the html file encoded the same way as test.txt? Have you checked whether some of the words that "ought" to match might be using numeric character entities (e.g. &#x0628; or &#1576; for the Urdu letter "b")?
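If numeric character entities do turn out to be present, one option is to decode them before matching. The sketch below is a rough illustration only: decode_numeric_entities is a hypothetical helper, it handles numeric entities but not named ones such as &amp;, and the sample line is made up. U+0628 can appear as &#x0628; (hex) or &#1576; (decimal).

```perl
use strict;
use warnings;

# Hypothetical helper: turn &#xHHHH; and &#DDDD; into the characters they name.
sub decode_numeric_entities {
    my ($text) = @_;
    $text =~ s/&#[xX]([0-9A-Fa-f]+);/chr(hex($1))/ge;    # hexadecimal form
    $text =~ s/&#([0-9]+);/chr($1)/ge;                   # decimal form
    return $text;
}

# Made-up sample line mixing both entity forms.
my $line    = '<p>&#x0628;&#1575;&#1576;</p>';
my $decoded = decode_numeric_entities($line);

binmode STDOUT, ':encoding(UTF-8)';    # avoid "Wide character" warnings
print "$decoded\n";
```

For real files, opening them with a layer such as '<:encoding(UTF-8)' makes Perl treat the contents as characters, so the decoded text and the word list can be compared like for like.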

    But there are some problems with the OP logic:

    1. When the "else" clause is not commented out, you will be prompted to "Type in alternate spelling" for every line in the HTML file that does not match /^.*?<p>.*?( $sw )/ -- (where $sw is a single word just read from test.txt) and it's likely that there are quite a few such lines in that file.
    2. You are re-opening and re-reading the HTML file for each word in the "test.txt" list, which is really inefficient (and really tedious, if you have to respond manually to all those prompts on every line of HTML input).

    In other words, your "if" statement is not failing - it's doing what the logic says it should do. The problem is that the logic is wrong.

    I think you should start by reading all the contents of "test.txt" before you open the html file. Combine all the target words into a single regex, and then do just one pass over the html data - like this:

        my $targets_file = '/Users/me/test.txt';    # (I'd rather get this from @ARGV)
        open( my $urdu_words, '<', $targets_file )  # (2nd arg might need ':utf8' too)
            or die "$targets_file: $!\n";
        my @target_strings = <$urdu_words>;
        close $urdu_words;
        chomp @target_strings;
        my $target_regex = join( '|', @target_strings );

        # Now open and read from the html file
        # Use a hash to keep track of matches, so you can sort them later:

        my $html_file = '/Users/me/platts_wkg.html';  # could get that from @ARGV too
        open( my $platts, '<', $html_file ) or die "$html_file: $!\n";
        my %matches;
        while (<$platts>) {
            if ( /^.*?<p>.*? ($target_regex) / ) {  # note: spaces are now OUTSIDE parens
                $matches{$1} .= " $_";
            }
        }

        # At this point, it would be easy to dump all the matches to a file, and
        # then edit that file manually, if you want:

        my $matched_file = '/Users/me/matches_found.txt';
        open( my $output, '>', $matched_file ) or die "$matched_file: $!\n";
        for my $match ( sort keys %matches ) {
            print $output "Matches found for target: $match\n $matches{$match}\n";
        }
    If there's something you want to do with lines that don't match for any of the target words, you can put an "else" clause in the while loop that reads from the html file. But in that case, I would again recommend that you avoid doing anything that involves manual input to the script for each html line - put stuff into an array or hash, print it to a separate file, and deal with it in some way that's likely to be easier and less error-prone.
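A minimal sketch of that idea, using in-memory data instead of the real files: matching lines go into a hash and everything else is pushed onto an array for later review, with no prompting inside the loop. The word list and HTML lines here are made-up placeholders, and quotemeta is an addition (an assumption about what is wanted) so that any regex metacharacters in a target word are matched literally.

```perl
use strict;
use warnings;

# Made-up stand-ins for the word list and the HTML lines.
my @targets    = ( 'kitab', 'qalam' );
my @html_lines = (
    "<p>entry one kitab book</p>\n",
    "<p>entry two dawat inkpot</p>\n",
    "<p>entry three qalam pen</p>\n",
);

# Build one alternation from all targets, escaping metacharacters.
my $alternation  = join '|', map { quotemeta($_) } @targets;
my $target_regex = qr/^.*?<p>.*? ($alternation) /;

my ( %matches, @unmatched );
for my $line (@html_lines) {
    if ( $line =~ $target_regex ) {
        $matches{$1} .= $line;      # group matching lines by target word
    } else {
        push @unmatched, $line;     # keep the rest for later inspection
    }
}

print "matched targets: ", join( ', ', sort keys %matches ), "\n";
print "unmatched lines: ", scalar(@unmatched), "\n";
```

The @unmatched array can then be written to its own file and checked by hand, for example to look for older spellings.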

      Yes, that is exactly what it expects: an Urdu word. Hindi is also a possibility as Platts has both and this script will be used for that as well. I did solve the encoding issue and that works without problems--I can find a several-word combination with the present regex and current files. The problem with Platts is that it has older spellings and inconsistent entry of some of the letters. I did a mass search and replace for some of the inconsistencies, but there are differences in word spellings in Platts' time (late 1800's) from what is the norm now that are not necessarily incorrect. This is why an alternate spelling is requested.

      I will try your approach to the problem. The reason for asking for the correct definition is to cut down on numerous multiple definitions for the same word--an easy fix with BBEdit once the file is generated.

      As you can see, I am still very much a noob with the code writing but am able to, most of the time, get the results that I need in a much shorter time period than by manually doing the work. The rest of the time I am learning something for the next iteration!

      Thanks for your help!

      Yeah, to say my logic was wrong was an understatement! I played around with the original for a few minutes today and inserted a few print statements--my poor man's debugger. I didn't realize that the search string was going to go through every line in the html file--don't quite know what my thinking was when that is what I expected the input file to do. Makes sense now and the elegance of your solution is clear to me now.

Re: My "if" is failing!
by choroba (Cardinal) on Nov 28, 2012 at 22:06 UTC
    Before the if and after branching, add
    warn "$_, $sw.\n";
    to see what's really going on if you are too lazy to run the debugger. The regex must work the same regardless of whether the else part is commented out or not.
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

      OK, will try the debugger. I am still figuring out how to do a lot of this, thanks!

Re: My "if" is failing!
by space_monk (Chaplain) on Nov 28, 2012 at 22:51 UTC
    Not an answer to your question so much as an observation.

    The code in the else clause is rather similar to the code in the 'if' clause, so the code ought to be simplified. If you write code with duplicated sections, it makes it much harder to find out what went wrong and why.
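One way the duplicated section might be folded into a single sub is sketched below. The name offer_definition and its parameters are hypothetical; the input and output handles are passed in so the same code can serve both the original spelling and an alternate one, and so it can be exercised with in-memory handles instead of STDIN and a real file.

```perl
use strict;
use warnings;

# Hypothetical helper: if $line contains $word, ask whether to keep it and,
# on "y", write it to $out. Returns 1 if the definition was added, else 0.
sub offer_definition {
    my ( $word, $line, $ignore_case, $in, $out ) = @_;
    my $re = $ignore_case
        ? qr/^.*?<p>.*?( \Q$word\E )/i
        : qr/^.*?<p>.*?( \Q$word\E )/;
    return 0 unless $line =~ $re;

    print "$word \t $line \n";
    print "Add this definition? ";
    my $dec = <$in>;
    $dec = '' unless defined $dec;    # treat end of input as "no"
    chomp $dec;
    return 0 unless $dec eq 'y';

    print {$out} "$word \t $line \n";
    return 1;
}

# Demonstration with in-memory handles (made-up data).
my $answers = "y\nn\n";
my $saved   = '';
open my $in,  '<', \$answers or die "can't open in-memory input: $!";
open my $out, '>', \$saved   or die "can't open in-memory output: $!";

my $line         = "<p>entry kitab book</p>";
my $added_first  = offer_definition( 'kitab', $line, 0, $in, $out );    # answered "y"
my $added_second = offer_definition( 'Kitab', $line, 1, $in, $out );    # answered "n"
close $out;

print "\nsaved: $saved";
```

In the real script, $in would be \*STDIN and $out a handle opened once on defs.txt.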

    A Monk aims to give answers to those who have none, and to learn from those who know more.
      Ok, thanks!