Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl Monk, Perl Meditation
 
PerlMonks  

Re: Harvesting and Parsing HTML from other sites

by marius (Hermit)
on Mar 28, 2001 at 09:31 UTC ( #67753=note: print w/ replies, xml ) Need Help??


in reply to Harvesting and Parsing HTML from other sites

First, change your @pages array to a hash. Then you can step through this with a:

foreach $page (keys %pages) { }
rather than the cumbersome and obfuscated for(){} loop above.

Second, a lot of your regexes don't need the /s modifier. See perldoc perlre for info about that.

Third, use strict.

And now for code error issues: I don't see where you set $keeperlength before using it in your nested for(){} loop. Incidentally, your changing of <tag> to {{{tag}}} doesn't account for things like <br />. That's a minor nitpick, though. Other than that, I can't see why it would "revert" back to the original $html variable. Wanna fix these things I've pointed out (or point out my flaws in thinking as the case may be =]) and try it, and if it still doesn't work point us to some pages that do and pages that don't work and we'll continue hammering.

Good luck!

-marius


Comment on Re: Harvesting and Parsing HTML from other sites
Download Code

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://67753]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others drinking their drinks and smoking their pipes about the Monastery: (9)
As of 2014-07-24 21:17 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite superfluous repetitious redundant duplicative phrase is:









    Results (167 votes), past polls