PerlMonks  

Perl HTML:: Parser

by Anonymous Monk
on Apr 25, 2013 at 21:17 UTC ( [id://1030735]=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hey, can anyone tell me how I can parse HTML content to text? I am using the following code, but it converts the whole webpage to text. I want just the results as text. How can I do that?
my $html6 = $browser->content;
my $Format2 = HTML::FormatText->new(leftmargin => 3, rightmargin => 50);
my $TreeBuilder2 = HTML::TreeBuilder->new();
# Use the same variable names that were declared above
$TreeBuilder2->parse($html6);
$TreeBuilder2->eof;    # signal end of input to the parser
my $parsed3 = $Format2->format($TreeBuilder2);

Replies are listed 'Best First'.
Re: Perl HTML:: Parser
by graff (Chancellor) on Apr 26, 2013 at 03:01 UTC
    I don't understand what difference there is between "the whole webpage to text" and "the results as text". Can you provide some data samples to show what sort of difference you're talking about?

    Also, how about showing us a runnable code snippet, that actually uses some sample data and produces some output. Then explain how that output is different from the output you actually want. That will make it easier to help you.

Re: Perl HTML:: Parser
by 2teez (Vicar) on Apr 26, 2013 at 04:41 UTC

    Hi Anonymous Monk,
    I am assuming that, since you used the module HTML::FormatText, you want your output as plain text, not the whole HTML page with all its tags.
    If this is what you want, then you can do it like so:

    use warnings;
    use strict;
    use HTML::TreeBuilder 5 -weak;
    use HTML::FormatText;

    my $tree = HTML::TreeBuilder->new_from_url("http://www.google.com");
    my $format = HTML::FormatText->new(leftmargin => 3, rightmargin => 50);
    print $format->format($tree);
    Output:
    Search Images Maps Play YouTube News Gmail Drive More »Web History | Settings | Sign in Nigeria Advanced searchLanguage tools Google.com.ng offered in: Hausa Igbo Yorùbá Pidgin Advertising ProgramsBusiness SolutionsAbout GoogleGoogle.com © 2013 - Privacy & Terms
    NOTE:
    1. Of course, you might need the module LWP::UserAgent to fetch the HTML if you don't already have it stored in a file.
    2. Please note how the module HTML::TreeBuilder is used here; if you already have your HTML in a file, you might use a different method.
    3. However, if you are using a Linux OS, you can also use lynx like so: lynx -dump http://www.google.com
    I hope this helps. graff was right to ask for clarity about what you wanted done.
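    If "just the results" means one part of the page rather than the whole document, one possible approach is to locate that element in the tree first and hand only that subtree to the formatter. The following is a minimal sketch, not a definitive answer: the sample HTML and the id "results" are made up for illustration, and your real page will need its own search criteria. It uses new_from_content to parse HTML that is already in a string, and look_down to find the element.

```perl
use strict;
use warnings;
use HTML::TreeBuilder 5 -weak;
use HTML::FormatText;

# Hypothetical sample page; in the original question the HTML would
# come from $browser->content instead.
my $html = <<'HTML';
<html><body>
  <div id="nav">Home | About | Contact</div>
  <div id="results"><p>First result</p><p>Second result</p></div>
</body></html>
HTML

# Parse HTML held in a string rather than fetching it from a URL.
my $tree = HTML::TreeBuilder->new_from_content($html);

# Assumption: the part of interest is the <div> with id="results".
my $results = $tree->look_down(_tag => 'div', id => 'results')
    or die "No <div id=\"results\"> found\n";

# Format only that subtree, not the whole page.
my $formatter = HTML::FormatText->new(leftmargin => 3, rightmargin => 50);
my $text = $formatter->format($results);
print $text;
```

    With this sample, the navigation text should be left out and only the two result paragraphs printed. The criteria passed to look_down (tag name, id, class and so on) depend entirely on how the page you are scraping is structured.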

    If you tell me, I'll forget.
    If you show me, I'll remember.
    If you involve me, I'll understand.
    --- Author unknown to me
