
Re: HTML Parsing (ick)

by sundialsvc4 (Abbot)
on Aug 20, 2014 at 12:27 UTC (#1098108)

in reply to HTML Parsing (ick)

What you would ideally have done (at the time) was to wrap each significant piece of content in a <span> with an identifying class= attribute, even if no style rule was ever defined for that class. The class name would then have served as a semantic tag that conclusively identifies, within the data stream itself, what the relevant bits of content are, so that an XPath expression (like the one shown in a previous comment) could be used consistently to extract them. Without such tags, “parsing the HTML is the easy part, and reliably picking out the data within that HTML is the hard part.” The extraction will depend on finding completely reliable place markers within the templates, and on verifying that it captures all of the right data in every case.

That being said ... what are your chances, now, of being able to change the templates which (I hope ...) drive the production of those web pages? Or do you work for someone else now? ;-) If you could add span tags with dummy class names, that would certainly make this job far easier and more reliable. (With such tags, the whole job could be done using XSLT stylesheets.)
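
To make the span-tagging idea concrete, here is a minimal sketch in Python (standard library only). Everything specific in it is an assumption for illustration: the class names rec-title and rec-price, the sample markup, and the extract_fields helper are all made up, and the fragment is assumed to be well-formed XML. Real-world pages are often not well-formed, in which case a more forgiving HTML parser would be needed in front of the same XPath lookup.

```python
import xml.etree.ElementTree as ET

# Hypothetical template output. The class names are illustrative only;
# they carry no styling and exist purely to mark the data.
PAGE = """
<div>
  <h1><span class="rec-title">Blue Widget</span></h1>
  <p>Price: <span class="rec-price">19.95</span> USD</p>
  <p>Some <span>unrelated</span> decoration.</p>
</div>
"""

FIELDS = ("rec-title", "rec-price")


def extract_fields(xml_text, class_names=FIELDS):
    """Return {class_name: text} for each tagged span, or raise if one is
    missing or duplicated, so a bad page cannot slip through silently."""
    root = ET.fromstring(xml_text)
    record = {}
    for name in class_names:
        # ElementTree supports a limited XPath subset, including
        # attribute predicates of the form [@attr='value'].
        nodes = root.findall(f".//span[@class='{name}']")
        if len(nodes) != 1:
            raise ValueError(
                f"expected exactly one span with class {name!r}, "
                f"found {len(nodes)}"
            )
        record[name] = "".join(nodes[0].itertext()).strip()
    return record


record = extract_fields(PAGE)
print(record)
```

The strict "exactly one match" check is a deliberate choice here: it is one way of making sure the right data is captured in every case, since a page that lacks a marker fails loudly instead of producing a half-empty record.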

Re^2: HTML Parsing (ick)
by dbarron (Novice) on Aug 20, 2014 at 13:03 UTC
    My plan is to suck all the data in the web pages into a database and make it a database-driven system with dynamic web pages. I ran into problems with the complexity of the website (much of it style sheet and editing-program issues, and possibly user error) and decided the best thing was to avoid those complexities and make it all form-based. Yes, in hindsight, if I had known I was going that way, I could have tagged even more....
