PerlMonks  

Loop behavior with HTML::TokeParser::Simple

by WhiteBird (Hermit)
on Dec 20, 2003 at 02:09 UTC ( #315971=perlquestion )

WhiteBird has asked for the wisdom of the Perl Monks concerning the following question:

Esteemed Monks:

I'm having difficulty with a snippet of code. The larger context requires that I replace one image link embedded deep inside each of over 100 HTML recipe files. The current link in each document is a relative link, and each image has a different, unknown ID number, like this:
<img src='../../images/dbimage.asp?ID=758'>

The replacement string is built from the recipe's title, munged into a proper HTML format. I've built code that gets the title and gets to the image tag. I just can't seem to get the regex replacement to work correctly. The relevant snippet is this:

    while ( $token=$p->get_token() ) {
        if ($token->is_start_tag('img') ) {
            my $src = $token->return_attr->{src};
            print "SRC IS $src\n"; #For Debugging
            $src =~ s/$src/$newsrc/;
            print "SRC NOW: $src\n"; #For Debugging
            $token->set_attr('src', $src);
        }

There are two other image references after the one I'm interested in and the looping picks through all of them. (At this point, that's not a problem.) When I run the script on a file, my printed debugging output is this:

    got filetitle: ApplePecanBreadStuffing
    SRC IS ../../images/dbimage.asp?ID=751
    SRC NOW: ../../images/dbimage.asp?ID=751
    SRC IS ../../images/spacer.gif
    SRC NOW: 'ApplePecanBreadStuffing.jpg'
    SRC IS ../../images/spacer.gif
    SRC NOW: 'ApplePecanBreadStuffing.jpg'

Why is the first tag missed in the replacement step and the following two work correctly? I keep looking at it and I suspect it's something obvious in the structure of the code, but I am out of ideas. Help?

Replies are listed 'Best First'.
Re: Loop behavior with HTML::TokeParser::Simple
by jsprat (Curate) on Dec 20, 2003 at 02:20 UTC
    Probably the regex is failing because of the metacharacters in $src. Try s/\Q$src\E/$newsrc/ (untested) to make the regex work.

    However, the better solution is to chuck the regex. It replaces $src in its entirety, so a simple $src = $newsrc (or replace both lines with $token->set_attr('src', $newsrc);) would be both easier to maintain and more effective.
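The metacharacter explanation above can be checked with core Perl alone. The following is a minimal sketch, not the poster's code: `replace_src` and its `$quote` flag are hypothetical names introduced for illustration. Used as a pattern, `asp?ID` means "as, an optional p, then ID", so it never matches the literal `?` in the first src, while the dots in `spacer.gif` happen to match themselves.

```perl
use strict;
use warnings;

# Hypothetical helper: replace $src with $new, using $src itself as the
# pattern, either raw or quoted with \Q...\E.
sub replace_src {
    my ($src, $new, $quote) = @_;
    my $result = $src;
    if ($quote) {
        $result =~ s/\Q$src\E/$new/;   # metacharacters are matched literally
    }
    else {
        $result =~ s/$src/$new/;       # '.' and '?' keep their regex meaning
    }
    return $result;
}

my $new = 'ApplePecanBreadStuffing.jpg';
for my $src ('../../images/dbimage.asp?ID=751', '../../images/spacer.gif') {
    print "SRC IS   $src\n";
    print "raw:     ", replace_src($src, $new, 0), "\n";
    print "quoted:  ", replace_src($src, $new, 1), "\n";
}
```

Since the whole string is being replaced anyway, plain assignment remains the simpler route, as noted above.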

    HTH

      Thank you for the prompt reply. Using the $src = $newsrc works nicely to make the replacement, and the combined $token->set_attr('src', $newsrc); doesn't complain so I'm assuming it works as expected.

      I get an odd bit in the next step of my code, though.

      print FILE $token->as_is;
      close FILE;

      I'm trying to save it all back to a file and all I get is a 1K file that merely has an end bold HTML tag in it. That's an improvement over before, when I was getting an empty file, but I'm still missing something important. Am I getting something wrong in my set_attr call, or am I misunderstanding and misusing the $token->as_is part? I'm looking at the documentation for the module, but I'm clueless. Any suggestions are appreciated.
        I haven't seen your entire code, but I imagine it looks something like this:

        while ( $token=$p->get_token() ) {
            ....
        }
        ...
        print FILE $token->as_is;
        close FILE;

        If you want to print all the tokens, you should put the print statement inside your loop; otherwise the $token variable is replaced with the next token every time you call the $p->get_token() function.
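As a rough sketch of that advice, assuming HTML::TokeParser::Simple is installed: the loop below emits every token from inside the loop, rewritten or not, so the whole document survives. The names `rewrite_first_img`, `$out`, and `$done` are hypothetical, and stopping after the first img tag is an assumption based on the original question, which only cared about the first image.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical helper: point the first <img> tag at $newsrc and return
# the complete document as a string.
sub rewrite_first_img {
    my ($html, $newsrc) = @_;
    my $p    = HTML::TokeParser::Simple->new(\$html);
    my $out  = '';
    my $done = 0;
    while ( my $token = $p->get_token ) {
        if ( !$done && $token->is_start_tag('img') ) {
            $token->set_attr('src', $newsrc);
            $done = 1;
        }
        $out .= $token->as_is;    # runs for every token, inside the loop
    }
    return $out;
}

my $html = q{<p><b>Stuffing</b> <img src='../../images/dbimage.asp?ID=751'> }
         . q{<img src='../../images/spacer.gif'></p>};
print rewrite_first_img($html, 'ApplePecanBreadStuffing.jpg'), "\n";
```

When writing straight to a file instead of building a string, the same idea applies: keep print FILE $token->as_is; inside the while loop and call close FILE; once after it.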
