
Re: Harvesting and Parsing HTML from other sites

by marius (Hermit)
on Mar 28, 2001 at 09:31 UTC (#67753)

in reply to Harvesting and Parsing HTML from other sites

First, change your @pages array to a hash. Then you can step through it with:
foreach my $page (keys %pages) { }
rather than the cumbersome and obfuscated for(){} loop above.
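
A minimal sketch of that loop, assuming the hash maps a page label to its URL. The labels and URLs here are placeholders of mine, not taken from your code:

```perl
use strict;
use warnings;

# Hypothetical data: label => URL (placeholders, not from the original post).
my %pages = (
    news    => 'http://example.com/news.html',
    weather => 'http://example.com/weather.html',
);

# keys returns the hash keys in no guaranteed order, so sort them
# if the order of processing matters.
foreach my $page (sort keys %pages) {
    print "$page => $pages{$page}\n";
}
```

Declaring the loop variable with my keeps the loop working once strict is switched on.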

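Second, a lot of your regexes don't need the /s modifier. It only changes "." so that it also matches a newline, so it does nothing for a pattern that contains no ".". See perldoc perlre for details.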
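
To illustrate what /s does and doesn't do, here is a small sketch. The sample string is made up for the demonstration:

```perl
use strict;
use warnings;

# Made-up input with a newline inside the element.
my $html = "<b>bold\ntext</b>";

# Without /s, "." will not match the newline, so the match fails.
my $without = $html =~ /<b>(.*)<\/b>/  ? 1 : 0;

# With /s, "." matches the newline too, so the match succeeds.
my $with    = $html =~ /<b>(.*)<\/b>/s ? 1 : 0;

# A pattern with no "." behaves the same either way.
my $plain   = $html =~ /<b>/  ? 1 : 0;
my $plain_s = $html =~ /<b>/s ? 1 : 0;

print "without /s: $without, with /s: $with\n";   # without /s: 0, with /s: 1
print "no dot: $plain, no dot with /s: $plain_s\n"; # no dot: 1, no dot with /s: 1
```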

Third, use strict.

And now for the actual code errors: I don't see where you set $keeperlength before using it in your nested for(){} loop. Incidentally, your conversion of <tag> to {{{tag}}} doesn't account for things like <br />. That's a minor nitpick, though. Other than that, I can't see why it would "revert" back to the original $html variable.

Want to fix the things I've pointed out (or point out the flaws in my thinking, as the case may be =]) and try again? If it still doesn't work, point us to some pages that work and some that don't, and we'll continue hammering.
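
On the <br /> nitpick, here is one way the conversion could tolerate a trailing slash. This is only a sketch under my own assumptions: I can't tell how your substitution is written, the sample input is invented, and this version deliberately throws away attributes and the trailing slash.

```perl
use strict;
use warnings;

# Invented sample input, including self-closing tags with and without a space.
my $html = 'line one<br />line two<hr/><b>bold</b>';

# $1 is an optional leading slash (closing tags), $2 is the tag name.
# [^>]* swallows anything else up to ">", including attributes and " /".
(my $converted = $html) =~ s!<(/?)(\w+)[^>]*>!'{{{' . $1 . $2 . '}}}'!ge;

print "$converted\n";
# line one{{{br}}}line two{{{hr}}}{{{b}}}bold{{{/b}}}
```

If you need to keep attributes, capture the [^>]* part as well and decide what to do with it in the replacement.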

Good luck!

