Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

Comment on

( #3333=superdoc: print w/ replies, xml ) Need Help??
totally agree that scraping html is quite bad and unstable. my rough guide for scraping, from most to least desireable
  • don't scrape...if you can find out if there's a RESTful way to get the results instead, via some API or alternate (e.g. XML format) url
  • if the html is well formed (i.e. xhtml) then it will be almost guaranteed to be well structured, will proper closing tags etc...so use one of the XML based parsers. you can easily get to specific elements via a well defined hierarchy once parsed
  • this stuff gets ugly when you start slurping whole html into a scalar and progressively find markers where to suck bits into desired variables for inspection. even here you can code a bit defensively by picking sane markers, e.g. "id" or "class" elements. never anchor to text containing inline html styles or other bits that are likely to change fequently, like inline javascript or such.
finally, if you end up producing html in the CGI, don't mix actual output with styling. write a stylesheet instead. producing a well formed xhtml document in the CGI without inline styles, provides later opportunity to use the output of the CGI via another CGI or whatever. it's also much easier to change the output of the CGI via a stylesheet, rather than digging in perl code.
there are frameworks for doing even fancier scraping, where you end up running a browser engine server side, to pretend that your program is a browser. this is necessary when a website dynamically produces most of it's output with javascript. and naturally because javascript is browser/client side code, you won't see the results of that unless you run it. this is pretty horrid stuff. although you can do automatic login and traverse a website and results...it typically breaks as soon as absolutely anything changes on the website. A good/helpful website, even if dynamically fancy rendered with javascript, should provide a RESTful api to get data out. But some companies still insist on not being very helpful.
the hardest line to type correctly is: stty erase ^H

In reply to Re^2: Parsing HTML files by aquarium
in thread Parsing HTML files by ajju

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • Outside of code tags, you may need to use entities for some characters:
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others contemplating the Monastery: (7)
    As of 2014-10-21 03:23 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?

      For retirement, I am banking on:










      Results (95 votes), past polls