PerlMonks  

Re^3: Extracting HTML content between the h tags

by vagabonding electron (Hermit)
on Aug 05, 2012 at 14:09 UTC


in reply to Re^2: Extracting HTML content between the h tags
in thread Extracting HTML content between the h tags

Thank you very much!
I just tried both approaches, and they work even if the last h2 tag is missing (this happens in about 10 of more than 400 pages, for which I had used the following workaround):

my @solution_2 = $content->findvalues( './h2[4]/preceding-sibling::*' );
unless ( @solution_2 ) {
    @solution_2 = $content->findvalues( '//hr/preceding-sibling::*' );
}
... with substr as before ...
Fortunately they have only one hr-tag in the page :-)
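
For anyone who wants to try the fallback in isolation, here is a minimal sketch. It assumes `$content` is an element from HTML::TreeBuilder::XPath (my guess from the `findvalues` call; the module is not named above), and the sample HTML, the `div id="content"` wrapper, and the sub name `values_before_marker` are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample page: only three h2 tags, so './h2[4]' matches nothing
# and the hr-based fallback has to kick in.
my $html = <<'HTML';
<html><body><div id="content">
<h2>One</h2><p>a</p>
<h2>Two</h2><p>b</p>
<h2>Three</h2><p>c</p>
<hr>
<p>footer</p>
</div></body></html>
HTML

# Return the text of every element sibling before the fourth h2,
# or before the hr if there is no fourth h2.
sub values_before_marker {
    my ($content) = @_;
    my @values = $content->findvalues('./h2[4]/preceding-sibling::*');
    unless (@values) {
        # '//hr' searches the whole document, so this relies on the page
        # having exactly one hr tag.
        @values = $content->findvalues('//hr/preceding-sibling::*');
    }
    return @values;
}

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);
my ($content) = $tree->findnodes('//div[@id="content"]');

print "$_\n" for values_before_marker($content);
```

Note that the fallback only stands in for the missing h2 because the hr happens to sit where the fourth h2 would be. On pages with a different layout it would pick up different siblings.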
With your approach the workaround is no longer necessary.
BTW, the content after the fourth h2 tag is not important.
Thanks again!

