PerlMonks  

Re: extracting data from HTML

by zwon (Monsignor)
on Jun 03, 2012 at 11:52 UTC (#974113)


in reply to extracting data from HTML

You can try WWW::Mechanize; it seems to be what you're looking for. HTML::Parser and HTML::DOM may also be of interest.


Re^2: extracting data from HTML
by Jurassic Monk (Acolyte) on Jun 03, 2012 at 12:21 UTC

    # sighs
    I have looked into so many modules already... and not one of the modules gave me a workable solution for something so obvious

    can't it be simple, like:

    my $BlahBlahParser = XML::BlahBlah->new();
    my $XMLobj = $BlahBlahParser->load_html("http://www.perlmonks.org/");

    and then use any ordinary XPath to query my document or extract some paragraphs of text?

      Yes, there would be one. Mojo::DOM. It's not XPath, but more like what CSS does. I found it fairly easy to use the one time I needed to parse/transform HTML.

        Actually, here's what I ended up with the one time. It transforms certain HTML documents to \latex{}. (I just wanted to print the documents out, but the page break algorithms in web browsers are nonexistent.) Thought you might want to see a sample of the module in action.

        my $dom  = Mojo::DOM->new($html);
        my $body = $dom->at('.article-bodycopy');
        $body->find('p, table')->each(sub {
            my $node = shift;
            if ($node->{class} eq 'SubHead') {
                print '\subsection{' . $node->text . "}";
                return;
            }
            elsif ($node->type eq "table") {
                my $img = $node->find('img')->[0]->{src};
                my $cap = filter($node->find('.Figure1')->[0]);
                $img =~ s/\.gif/\.png/;
                print join("\n",
                    '\begin{Figure}',
                    '\centering',
                    '\includegraphics[width=0.65\linewidth,'
                        . 'height=0.85\textheight,keepaspectratio]{' . $img . '}',
                    '\captionof{figure}{' . $cap . '}',
                    '\end{Figure}');
                return;
            }
            if ($node->children->size == 0) {
                print filter($node);
            }
            else {    # node has sub-tags
                $node->children->each(sub {
                    my $n   = shift;
                    my $tag = $n->type;
                    if ($tag eq 'b') {
                        $n->replace('{\bf ' . $n->text . '}');
                    }
                    else {
                        print STDERR "UNHANDLED MARKUP TYPE: " . $n->type . "\n";
                    }
                });
                print filter($node);
            }
        });
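
        For the simpler case asked about upthread (load some HTML, then pull out a heading, a few paragraphs, or links), a much smaller sketch may be enough. The markup, class names, and variable names below are made up for illustration and are not from the original script; the selectors are CSS, not XPath.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical sample markup, standing in for a fetched page.
my $html = <<'HTML';
<div class="article">
  <h1>Title</h1>
  <p class="intro">First paragraph.</p>
  <p>Second paragraph with a <a href="/node/1">link</a>.</p>
</div>
HTML

my $dom = Mojo::DOM->new($html);

# at() returns the first matching element, find() returns a collection.
my $title      = $dom->at('div.article h1')->text;
my @paragraphs = $dom->find('div.article p')->map('all_text')->each;
my @links      = $dom->find('a[href]')->map(attr => 'href')->each;

print "$title\n";
print "$_\n" for @paragraphs;
print "$_\n" for @links;
```

        For a live page, Mojo::UserAgent can do the fetching and hand back a DOM object directly, along the lines of `Mojo::UserAgent->new->get($url)->res->dom`.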
