PerlMonks  

Untemplating

by tlhf (Scribe)
on Jul 16, 2002 at 03:20 UTC (#181986=perlquestion)
tlhf has asked for the wisdom of the Perl Monks concerning the following question:

When searching the sites listed by Google, it seemed that the world and his dog wanted to explain templating to me - how to get data from a dataset into some sort of styled HTML. Well, I need to do the opposite: use a template, some HTML pages, and some magic to get a nice, clean dataset.

Ok. So at first I was gonna attack the problem with some regexps. But, considering the number of separate sets of data to untemplate, this is an extremely unattractive prospect.

I have a number of HTML pages for each day. Each day has one or more contributions. Also, the contributions have an optional title.

Eg, a contribution like this may appear a few times in a page:

<tr><td><b>A Title</b> - <b>12/3/2002 23:11</b> <br> Some Contribution <p> </td> </tr>
Unfortunately, the page HTML isn't always hunky-dory; it tends toward the non-standard kind, which I think sidelines most of the HTML modules. Luckily though, the contributions are all written in the same manner.

Can anyone help? Is there a module already written for something like this? If not, where would I start? Just quotemeta() the template and do some sort of match? But there's more than one match per page. I'm finding myself out of my league here...

tlhf
xxx

Re: Untemplating
by Chmrr (Vicar) on Jul 16, 2002 at 03:39 UTC

    By far the most common solution to this is to use one of the HTML modules. Yes, you say that the HTML is "non-standard" -- but, truth be told, most HTML out there is, and the HTML-parsing modules know that and are perfectly able to cope. If they were only able to deal with syntactically perfect HTML, they'd be called XML-parsing, not HTML-parsing. :)

    My personal favorite tool for extracting data from web pages is HTML::TreeBuilder -- in your case, it would be a simple matter of asking for all <td> elements, and grabbing the various answers out of them. You may find the dump method particularly useful in examining what the parser makes of your HTML.
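A minimal sketch of that approach, built around the sample row from the question. Several things here are assumptions rather than anything stated in the thread: the surrounding `<table>`, the second (untitled) row, the rule that the last `<b>` in a cell is the timestamp and an earlier one is the title, and the hash keys `title`, `date` and `body`.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample input. The first row is the one from the question; the second
# is a guess at what an untitled contribution might look like.
my $html = <<'HTML';
<table>
<tr><td><b>A Title</b> - <b>12/3/2002 23:11</b> <br> Some Contribution <p> </td> </tr>
<tr><td><b>13/3/2002 08:05</b> <br> An untitled contribution <p> </td> </tr>
</table>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

my @records;
for my $td ( $tree->look_down( _tag => 'td' ) ) {
    my @bold = $td->look_down( _tag => 'b' );
    next unless @bold;    # skip cells that are not contributions

    # Assumption: the last <b> holds the timestamp, and a title (if
    # present) is the first <b> before it.
    my $date  = $bold[-1]->as_text;
    my $title = @bold > 1 ? $bold[0]->as_text : undef;

    # Everything after the <br> is treated as the contribution text.
    my ( $seen_br, $body ) = ( 0, '' );
    for my $node ( $td->content_list ) {
        if ( ref $node && $node->tag eq 'br' ) {
            $seen_br = 1;
            next;
        }
        next unless $seen_br;
        $body .= ref $node ? $node->as_text : $node;
    }
    $body =~ s/^\s+|\s+$//g;

    push @records, { title => $title, date => $date, body => $body };
}

$tree->delete;    # release the parse tree

for my $r (@records) {
    printf "%s | %s | %s\n",
        defined $r->{title} ? $r->{title} : '(untitled)',
        $r->{date}, $r->{body};
}
```

If the real pages nest things differently, calling `$tree->dump` before the loop shows the structure the parser actually built, and the `look_down` criteria can be adjusted to match.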

    perl -pe '"I lo*`+$^X$\"$]!$/"=~m%(.*)%s;$_=$1;y^`+*^e v^#$&V"+@( NO CARRIER'

Re: Untemplating
by grantm (Parson) on Jul 16, 2002 at 08:25 UTC

    Another name for this type of activity is 'screen scraping'. One approach that matts advocates for screen scraping HTML is to use XPath.

    The first issue you'll need to address is that your HTML is probably not well-formed XML. Two approaches that spring to mind are:

    • pipe the HTML through HTML Tidy to convert it to XHTML
    • process it using XML::LibXML which can read HTML directly

    Then you can 'zero in' on a part of the page using an XPath expression like this:

    /html/body/table/tr/td[./b]

    which would match all td 'nodes' which contain a 'b' tag and occur in a 'tr' in a 'table' in the 'body' of the 'html' document. Once you have selected nodes in this way, you can use XPath to dissect them further, or dump them back out to an XML string (including all child nodes) and do regex matches against that.
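A rough sketch of the XML::LibXML route, using the XPath expression above. The sample document, the `recover => 2` setting (to parse quietly past sloppy markup), the relative XPath expressions, and the field names are assumptions made for illustration. It also assumes, as in the question's sample, that each cell carries both a title and a timestamp in `<b>` tags.

```perl
use strict;
use warnings;
use XML::LibXML;

# Sample page wrapped around the row from the question; the second row
# is made up so that there is more than one match.
my $html = <<'HTML';
<html><body><table>
<tr><td><b>A Title</b> - <b>12/3/2002 23:11</b> <br> Some Contribution <p> </td> </tr>
<tr><td><b>Another Title</b> - <b>13/3/2002 08:05</b> <br> Another Contribution <p> </td> </tr>
</table></body></html>
HTML

# load_html uses libxml2's HTML parser, so the input need not be
# well-formed XML. recover => 2 asks it to carry on without complaint.
my $doc = XML::LibXML->load_html( string => $html, recover => 2 );

my @rows;
for my $td ( $doc->findnodes('/html/body/table/tr/td[./b]') ) {
    # Dissect each matched cell further with relative XPath.
    my @bold = map { $_->textContent } $td->findnodes('./b');
    my $body = $td->findvalue(
        'normalize-space(./br/following-sibling::text()[1])');

    push @rows, { title => $bold[0], date => $bold[-1], body => $body };
}

print join( ' | ', @$_{qw(title date body)} ), "\n" for @rows;
```

One loop over `findnodes` handles any number of contributions per page, so there is no need to craft a single regex that matches the whole template.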

    See also the XPath tutorial at zvon.org.
