PerlMonks  

Re: extracting data from HTML

by zwon (Monsignor)
on Jun 03, 2012 at 11:52 UTC


in reply to extracting data from HTML

You can try WWW::Mechanize; it seems to be what you're looking for. HTML::Parser and HTML::DOM may also be of interest.


Re^2: extracting data from HTML
by Jurassic Monk (Acolyte) on Jun 03, 2012 at 12:21 UTC

    # sighs
    I have looked into so many modules already... and not one of them gave me a workable solution for something so obvious.

    Can't it be simple, like:

    my $BlahBlahParser = XML::BlahBlah->new();
    my $XMLobj = $BlahBlahParser->load_html("http://www.perlmonks.org/");

    and then use any ordinary XPath to query my document or extract some paragraphs of text?
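For comparison, XML::LibXML (not mentioned in this thread) comes close to the interface wished for above: it has a load_html constructor and answers XPath queries. The following is a minimal sketch, assuming the module is installed; the inline HTML string and the two XPath expressions are made up for illustration. According to the module's documentation, load_html also accepts `location => $file_or_url` in place of `string`.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample document, used here instead of fetching a live page.
my $html = '<html><body><h1>News</h1><p>First.</p><p>Second.</p></body></html>';

# recover => 2 asks libxml2 to tolerate malformed HTML without reporting errors.
my $doc = XML::LibXML->load_html(string => $html, recover => 2);

# Ordinary XPath: the heading text, then the text of every paragraph.
my $heading    = $doc->findvalue('//h1');
my @paragraphs = map { $_->textContent } $doc->findnodes('//p');

print "$heading\n";
print "  $_\n" for @paragraphs;
```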

      Yes, there is one: Mojo::DOM. It doesn't use XPath; it uses CSS selectors instead. I found it fairly easy to use the one time I needed to parse/transform HTML.
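To give an idea of what selecting with CSS selectors looks like, here is a minimal sketch, assuming Mojolicious is installed. The sample HTML, the class names and the selectors are invented for illustration; only `new`, `at`, `find`, `text` and the Mojo::Collection methods `map` and `each` come from the module.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical sample document.
my $html = <<'HTML';
<html><body>
  <div class="article">
    <h1>Title</h1>
    <p class="lead">First paragraph.</p>
    <p>Second paragraph.</p>
  </div>
</body></html>
HTML

my $dom = Mojo::DOM->new($html);

# at() returns the first element matching the selector.
my $title = $dom->at('div.article h1')->text;

# find() returns a Mojo::Collection of all matching elements;
# map() with a callback sees each element as $_.
my @paragraphs = $dom->find('div.article p')->map(sub { $_->text })->each;

print "$title\n";
print "  $_\n" for @paragraphs;
```

The selector passed to `find` plays the role an XPath expression such as `//div[@class="article"]//p` would play elsewhere.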

        Actually, here's what I ended up with that one time. It transforms certain HTML documents to LaTeX. (I just wanted to print the documents, but web browsers have no real page-break algorithm.) I thought you might want to see a sample of the module in action.

        # filter() is not defined in this excerpt.
        # type() returned the tag name when this was written; newer
        # Mojolicious releases name that method tag().
        my $dom  = Mojo::DOM->new($html);
        my $body = $dom->at('.article-bodycopy');

        $body->find('p, table')->each(sub {
            my $node = shift;

            # Guard against elements that have no class attribute.
            if (($node->{class} // '') eq 'SubHead') {
                print '\subsection{' . $node->text . "}";
                return;
            }
            elsif ($node->type eq "table") {
                my $img = $node->find('img')->[0]->{src};
                my $cap = filter($node->find('.Figure1')->[0]);
                $img =~ s/\.gif/\.png/;
                print join("\n",
                    '\begin{Figure}',
                    '\centering',
                    '\includegraphics[width=0.65\linewidth,'
                        . 'height=0.85\textheight,keepaspectratio]{' . $img . '}',
                    '\captionof{figure}{' . $cap . '}',
                    '\end{Figure}');
                return;
            }

            if ($node->children->size == 0) {
                print filter($node);
            }
            else {
                # node has sub-tags
                $node->children->each(sub {
                    my $n   = shift;
                    my $tag = $n->type;
                    if ($tag eq 'b') {
                        $n->replace('{\bf ' . $n->text . '}');
                    }
                    else {
                        print STDERR "UNHANDLED MARKUP TYPE: " . $n->type . "\n";
                    }
                });
                print filter($node);
            }
        });
