
Re: HTML Parsing (ick)

by sundialsvc4 (Abbot)
on Aug 20, 2014 at 12:27 UTC (#1098108)


in reply to HTML Parsing (ick)

What you would have liked to do at the time was to wrap all of the significant content in <span>s carrying an identifying class=, even if that class defined nothing further about the content. The class-name would have served as a semantic tag to conclusively identify, within the data-stream itself, what the relevant bits of content were, so that an XPath expression (like the one shown in a previous comment) could have been used consistently to extract it. Otherwise, “parsing the HTML is the easy part, and reliably picking out the data within that HTML is the hard part.” Without such tags, the extraction depends on finding totally reliable place markers within the templates, and on making 100% sure that they capture all the right data in every case.

That being said ... what are your chances, now, of being able to make changes to the templates which (I hope ...) drive the production of those web pages? Or do you work for someone else now? ;-) If you could add span-tags with dummy class-names, that would certainly make this job far easier and more reliable. (With such tags, the whole job could be done using XSLT stylesheets.)
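
To make the class-name idea concrete, here is a minimal sketch in Python (standard library only) rather than a definitive implementation. The page fragment, the class names "title" and "price", and the helper extract() are all made up for illustration; none of them come from the original site. It also assumes the markup is well-formed, which real-world HTML often is not, so a more forgiving parser may be needed in practice.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment. The class names ("title", "price") are
# illustrative only; they carry no styling and exist purely as markers.
PAGE = """
<html><body>
  <h1><span class="title">Blue Widget</span></h1>
  <p>Only <span class="price">9.95</span> while supplies last.</p>
</body></html>
"""

def extract(page, class_names):
    """Return {class_name: text} for the first <span> carrying each class."""
    root = ET.fromstring(page)  # requires well-formed markup
    found = {}
    for name in class_names:
        # ElementTree supports a limited XPath subset, including
        # attribute-equality predicates such as [@class='...'].
        node = root.find(f".//span[@class='{name}']")
        if node is not None and node.text:
            found[name] = node.text.strip()
        else:
            found[name] = None  # marker absent: report it, do not guess
    return found

record = extract(PAGE, ["title", "price"])
print(record)
```

Because the same expression works no matter where a template happens to place the span, the extraction does not depend on layout details, and a missing marker shows up as None instead of as silently wrong data.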


Re^2: HTML Parsing (ick)
by dbarron (Novice) on Aug 20, 2014 at 13:03 UTC
    My plan is to suck all the data in the web pages into a database and make it a database-driven system with dynamic web pages. I ran into problems with the complexity of the website (much of it style sheet and editing program issues, and possibly user error) and decided the best thing was to avoid those complexities and make it all form-based. Yes, in hindsight, if I had known I was going that way, I could have tagged even more....
