I'm wondering if maybe you're going about this the wrong way. You gave us this description of the task:
I am parsing an HTML file for a word in a foreign language. If the word is found, it will add it to a file.
Your code expects you to have a file called "test.txt" that contains one or more (Urdu?) words. BTW, is the text in that file UTF-8 encoded? Have you made sure that your script is reading it correctly?
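
One way to check how the word list is being read is to open it with an explicit encoding layer and print the code points of each word. This is only a sketch: the temporary file stands in for test.txt, and read_words is a made-up helper name, not something from the original script.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical stand-in for test.txt: one UTF-8 encoded word per line.
my ( $tmp_fh, $tmp_name ) = tempfile( UNLINK => 1 );
binmode( $tmp_fh, ':raw' );
print {$tmp_fh} "\xD8\xA8\xD8\xA7\xD8\xA8\n";    # UTF-8 bytes for U+0628 U+0627 U+0628
close $tmp_fh;

# Read the file through a decoding layer so each word is a string of
# characters rather than a string of raw bytes.
sub read_words {
    my ($path) = @_;
    open( my $fh, '<:encoding(UTF-8)', $path ) or die "$path: $!\n";
    chomp( my @words = <$fh> );
    close $fh;
    return @words;
}

my @words = read_words($tmp_name);
for my $word (@words) {
    printf "%d characters: %s\n", length($word),
        join( ' ', map { sprintf 'U+%04X', ord } split //, $word );
}
# prints "3 characters: U+0628 U+0627 U+0628"
```

If the length comes out as 6 instead of 3, the file is being read as bytes, and a regex built from those strings will only match HTML that is also read as bytes.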

Then your code expects you to have a file called "platts_wkg.html", which is assumed to contain one or more matches to the words in test.txt. Is the html file encoded the same way as test.txt? Have you checked whether some of the words that "ought" to match might be using numeric character entities (e.g. &#1576; or &#x628; for the Urdu letter "b")?
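
If the HTML does turn out to use numeric character entities, they have to be turned into real characters before a plain-text word can match them. The sketch below does that with two substitutions using only core Perl. The helper name decode_numeric_entities and the sample line are invented for illustration, and named entities such as &amp;amp; are deliberately not handled.

```perl
use strict;
use warnings;

# Hypothetical helper: replace numeric character references, decimal
# (&#1576;) or hexadecimal (&#x628;), with the characters they stand for.
sub decode_numeric_entities {
    my ($text) = @_;
    $text =~ s/&#[xX]([0-9a-fA-F]+);/chr( hex($1) )/ge;
    $text =~ s/&#([0-9]+);/chr($1)/ge;
    return $text;
}

# Made-up sample line mixing both notations for the same letter.
my $line    = '<p>&#1576;&#x627;&#x628; -- sample</p>';
my $decoded = decode_numeric_entities($line);

binmode( STDOUT, ':encoding(UTF-8)' );
print "$decoded\n";
```

Running each HTML line through a step like this before the regex test means both spellings of the letter end up as the same character, U+0628.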

But there are some problems with the OP logic:

  1. When the "else" clause is not commented out, you will be prompted to "Type in an alternate spelling" for every line in the HTML file that does not match /^.*?<p>.*?( $sw )/ (where $sw is a single word just read from test.txt), and it's likely that there are quite a few such lines in that file.
  2. You are re-opening and re-reading the HTML file for each word in the "test.txt" list, which is really inefficient (and really tedious, if you have to respond manually to all those prompts on every line of HTML input).
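
To see how quickly those prompts add up, here is a toy sketch of the nested-loop structure described in the two points above. The sample lines and words are made up, the prompt is replaced by a counter, and \Q...\E is added around $sw as a precaution that was not necessarily in the OP code.

```perl
use strict;
use warnings;

# Made-up sample data standing in for the HTML lines and the word list.
my @html_lines = (
    '<html>',
    '<body>',
    '<p>entry one alpha -- first',
    '<p>entry two beta -- second',
    '</body>',
);
my @words = ( 'alpha', 'beta' );

my $prompts = 0;
for my $sw (@words) {                # one full pass over the HTML per word
    for my $line (@html_lines) {
        if ( $line =~ /^.*?<p>.*?( \Q$sw\E )/ ) {
            next;                    # the "if" branch: a match was found
        }
        $prompts++;                  # the "else" branch: one prompt per non-matching line
    }
}
print "Prompts issued: $prompts\n";  # prints "Prompts issued: 8"
```

With 2 words and 5 lines there are 10 tests and only 2 matches, so the "else" branch fires 8 times. On a real HTML file the count grows with the number of words times the number of lines.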

In other words, your "if" statement is not failing - it's doing what the logic says it should do. The problem is that the logic is wrong.

I think you should start by reading all the contents of "test.txt" before you open the html file. Combine all the target words into a single regex, and then do just one pass over the html data - like this:

    my $targets_file = '/Users/me/test.txt';    # (I'd rather get this from @ARGV)
    open( my $urdu_words, '<', $targets_file )  # (2nd arg might need ':utf8' too)
        or die "$targets_file: $!\n";
    my @target_strings = <$urdu_words>;
    close $urdu_words;
    chomp @target_strings;
    my $target_regex = join( '|', @target_strings );

    # Now open and read from the html file
    # Use a hash to keep track of matches, so you can sort them later:

    my $html_file = '/Users/me/platts_wkg.html';  # could get that from @ARGV too
    open( my $platts, '<', $html_file ) or die "$html_file: $!\n";
    my %matches;
    while (<$platts>) {
        if ( /^.*?<p>.*? ($target_regex) / ) {  # note: spaces are now OUTSIDE parens
            $matches{$1} .= " $_";
        }
    }

    # At this point, it would be easy to dump all the matches to a file, and
    # then edit that file manually, if you want:

    my $matched_file = '/Users/me/matches_found.txt';
    open( my $output, '>', $matched_file ) or die "$matched_file: $!\n";
    for my $match ( sort keys %matches ) {
        print $output "Matches found for target: $match\n $matches{$match}\n";
    }

If there's something you want to do with lines that don't match for any of the target words, you can put an "else" clause in the while loop that reads from the html file. But in that case, I would again recommend that you avoid doing anything that involves manual input to the script for each html line - put stuff into an array or hash, print it to a separate file, and deal with it in some way that's likely to be easier and less error-prone.
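
As a rough sketch of that "else" idea: collect the non-matching lines in an array during the single pass, then deal with them afterwards. The sample data, the in-memory filehandle and the @unmatched array are all invented for illustration, and quotemeta is an extra precaution (not in the code above) in case a target word contains regex metacharacters.

```perl
use strict;
use warnings;

# Made-up HTML content, read through an in-memory filehandle so the
# sketch runs without any files on disk.
my $html = join "\n",
    '<p>one alpha -- first entry',
    '<div>not an entry</div>',
    '<p>two beta -- second entry',
    '<p>three gamma -- no target here',
    '';
my $target_regex = join '|', map { quotemeta } ( 'alpha', 'beta' );

open( my $in, '<', \$html ) or die "in-memory handle: $!\n";
my ( %matches, @unmatched );
while ( my $line = <$in> ) {
    if ( $line =~ /^.*?<p>.*? ($target_regex) / ) {
        $matches{$1} .= " $line";
    }
    else {
        push @unmatched, $line;    # no prompt here, just remember the line
    }
}
close $in;

print "Targets matched: ", join( ', ', sort keys %matches ), "\n";
print "Unmatched lines kept for later review: ", scalar(@unmatched), "\n";
```

The contents of @unmatched could then be printed to a separate file and reviewed in an editor, instead of answering a prompt for every line while the script runs.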

In reply to Re: My "if" is failing! by graff
in thread My "if" is failing! by Sumtingwong
