PerlMonks  

appaling (sic), you say?

Well, the nested tables are awkward and the use of various outdated or deprecated tags is unfortunate; the lack of quotes and the like can certainly be labeled "mistakes." But "appalling" is a pretty strong word. Perhaps "dated" or similar would be better.

...so bad as to be practically of no use.

Even harsher (and, IMO, excessive), particularly since nothing we know about the HTML supports an inference that the OP bears any responsibility for it.

There is, however, a valuable nugget that saves your post from a quick downvote -- the notion that future changes could break a regex solution. OTOH, any solution we can readily offer today would also be broken were the html converted to 100% compliant xml.


In reply to Re^3: how to quickly parse 50000 html documents? (Updated: 50,000 pages in 3 minutes!) by ww
in thread how to quickly parse 50000 html documents? by brengo
