PerlMonks  

Re: Regex to match first html tag previous to text

by aquarium (Curate)
on Nov 29, 2007 at 03:01 UTC


in reply to Regex to match first html tag previous to text

Incidentally, does anybody know whether LWP or a similar module implements a DOM? I have a hunch that DOM parsing is cleaner than X(HT)ML parsing, since the markup is sometimes not well formed, whereas a DOM will always give you access to the A tags.
the hardest line to type correctly is: stty erase ^H

Re^2: Regex to match first html tag previous to text
by erroneousBollock (Curate) on Nov 29, 2007 at 03:16 UTC
    does anybody know if any LWP or similar implement DOM
    LWP::UserAgent does not provide DOM-level access.

    WWW::Mechanize doesn't either, but it does parse the HTML for you in order to provide methods like links(), which, incidentally, does what you want.

    I have a hunch that DOM parsing is cleaner than X(HT)ML parsing
    "DOM" is not a manner of parsing, but a manner of access. For methods from the DOM to be able to access data from a tree of nodes, some "parser" code still has to build that tree.

    It's certainly cleaner to access data using DOM (or DOM-like) methods, or selector interfaces like XPath or XQuery.

    HTML::TreeBuilder::XPath builds an HTML::Tree internally and then provides XPath-like access to that tree.
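
    A minimal sketch of that split between parsing and access, assuming HTML::TreeBuilder::XPath is installed. The helper name hrefs_for_text and the sample markup are made up for illustration; the markup deliberately leaves its tags unclosed to show that the tree still gets built.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical helper (not part of any module): return the href of
# every <a> element whose text contains $text.
sub hrefs_for_text {
    my ($html, $text) = @_;

    # Parsing step: build the tree, tolerating sloppy markup.
    my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

    # Access step: select the anchors with XPath, then filter in Perl.
    my @hrefs = map  { $_->attr('href') }
                grep { index($_->as_text, $text) >= 0 }
                $tree->findnodes('//a[@href]');

    $tree->delete;    # release the tree explicitly
    return @hrefs;
}

# Sample input: no closing </p>, no <html> or <body> wrapper.
my $html = '<p>See <a href="/one">the manual</a>'
         . '<p>or <a href="/two">the FAQ page</a>';

print "$_\n" for hrefs_for_text($html, 'FAQ');
```

    Filtering on as_text in Perl, rather than interpolating the search text into the XPath expression, avoids having to worry about quoting inside the expression.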

    (HTML parsing) sometimes not being well formed etc
    If you're talking about the robustness of parsing HTML, there are many libraries that parse HTML properly even when given invalid input. It's quite orthogonal to how you access the data once you've parsed the document.

    -David

      Parsing versus access methods does get blurry. Anyway, in the end we're interested in getting to point B and not that interested in the trip itself, whether that means parsing or using an access method.
      To that point, I think (possibly) the best fit for this problem would be the find_link() method of WWW::Mechanize. It's not as tedious as XPath or HTML::Tree etc.
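
      A rough sketch of the find_link() approach, assuming WWW::Mechanize is installed. The helper name link_url_for_text and the sample page are hypothetical; the page is written to a temporary file and fetched through a file: URI so the example runs without network access.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use URI::file;
use WWW::Mechanize;

# Hypothetical helper: fetch $uri and return the href of the first
# link whose text matches $pattern, or undef if there is none.
sub link_url_for_text {
    my ($uri, $pattern) = @_;
    my $mech = WWW::Mechanize->new( autocheck => 1 );
    $mech->get($uri);
    my $link = $mech->find_link( text_regex => $pattern );
    return $link ? $link->url : undef;
}

# Made-up sample page; the .html suffix lets LWP treat it as HTML.
my ($fh, $path) = tempfile( SUFFIX => '.html', UNLINK => 1 );
print {$fh} '<html><body><a href="/a">First</a> '
          . '<a href="/b">Wanted text</a></body></html>';
close $fh;

my $url = link_url_for_text( URI::file->new_abs($path), qr/wanted/i );
print defined $url ? "$url\n" : "no matching link\n";
```

      find_link() also accepts criteria such as text, url_regex and n if matching on the link text alone is not selective enough.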
      the hardest line to type correctly is: stty erase ^H
