PerlMonks  

Any pure-perl html to text? (Or: missing a perl equivalent to 'lynx -dump')

by bronto (Priest)
on Oct 15, 2006 at 15:55 UTC (#578396, perlquestion)
bronto has asked for the wisdom of the Perl Monks concerning the following question:

Dearest Monks

I am writing a couple of web-page-scraping tools to help me in my job search. I already have something working, but what I am missing is a nice pure-Perl solution that would format a web page into nice plain text, so that if an announcement is removed for any reason, I still have a chance of getting to its contents.

And hence the question: is there anything like lynx -dump in Perl? I dug into CPAN for about half an hour and tried html2text, but it didn't really do a good job...

For the few of you that don't know what lynx is and what it does:

NAME
    lynx - a general purpose distributed information browser for the World Wide Web
...
DESCRIPTION
    Lynx is a fully-featured World Wide Web (WWW) client for users running
    cursor-addressable, character-cell display devices (e.g., vt100 terminals,
    vt100 emulators running on Windows 95/NT or Macintoshes, or any other
    "curses-oriented" display).
...
OPTIONS
...
    -dump  dumps the formatted output of the default document or one specified
           on the command line to standard output. This can be used in the
           following way:

           lynx -dump http://www.subir.com/lynx.html

Thanks a lot in advance for your help

Ciao!
--bronto


In theory, there is no difference between theory and practice. In practice, there is.

Re: Any pure-perl html to text? (Or: missing a perl equivalent to 'lynx -dump')
by davidrw (Prior) on Oct 15, 2006 at 16:22 UTC
    WWW::Mechanize has a method for that (it requires that HTML::TreeBuilder is installed as well):

        use WWW::Mechanize;

        my $mech = WWW::Mechanize->new();
        $mech->get('http://example.com');
        print $mech->content(format => 'text');

    If you're not already using WWW::Mechanize for your scraping, I highly recommend it (note that it uses LWP underneath).
    Update: added 'print' so that snippet has output

      Gosh! You didn't even take a look at what lynx -dump produces, did you? First, the code you wrote outputs nothing (you missed a print); second, the output is awful (to be kind...)

      perl -MWWW::Mechanize -e 'my $mech = WWW::Mechanize->new(); $mech->get(q(http://cercalavoro.monster.it/getjob.asp?JobID=49072224&AVSDM=2006%2D10%2D13+11%3A00%3A00&Logo=1&q=unix+system+administrator+perl&sort=dt&tm=7d&qt=any&brd=1,5937&fn=660,6,561&lid=175&jt=4,1)); print $mech->content(format => q(text));'
      outputs this:

      while lynx outputs this:

      Do you still think it's the same thing???

      Ciao!
      --bronto



      20061015 Janitored by Corion: Added changed PRE tags to code tags, as per Writeup Formatting Tips

        Gosh! You didn't even take a look at what lynx -dump produces, did you?

        He didn't claim it would produce the same output, nor a comparable one. He just pointed out that it has a method for outputting plain text, which it does. Indeed, I think it more or less amounts to calling as_text() on the whole parse tree of the wanted page. Lynx and its variations are full-fledged browsers, so it is natural that they go beyond the capabilities of a simple parser and aim at presentation-friendly output. But that's quite a lot of work.

        You may hack or roll your own by inserting suitable horizontal and vertical whitespace around individual elements before printing them with as_text. Doing this properly is necessarily going to be quite a lot of work, but maybe just inserting a newline after every single element would already make everything clearer. At the very least, take care of paragraphs and breaks. If you also want line wrapping, that's a whole other story (a call for Text::Wrap, most probably).
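
The "newline after each element, then wrap" idea can be sketched roughly as follows. This is only an illustration, not what WWW::Mechanize does internally: the sample HTML, the list of tags treated as block-level, and the 60-column width are all assumptions made for the example. It relies on HTML::TreeBuilder (look_down, postinsert, as_text) and the core Text::Wrap module.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use Text::Wrap qw(wrap);

# Hypothetical sample page; in practice this would be the fetched HTML.
my $html = <<'HTML';
<html><body>
<h1>System administrator</h1>
<p>We are looking for a Unix system administrator<br>with solid Perl skills.</p>
<ul><li>Full time</li><li>Milan</li></ul>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Insert a newline after each element we choose to treat as block-level
# (an arbitrary selection), so that as_text() does not run them together.
for my $el ($tree->look_down(_tag => qr/^(?:p|div|h[1-6]|li|tr|br)$/)) {
    $el->postinsert("\n");
}

my $raw = $tree->as_text;
$tree->delete;    # free the parse tree

# Wrap every non-empty line at an assumed width of 60 columns.
$Text::Wrap::columns = 60;
my $text = join "\n", map { wrap('', '', $_) } grep { /\S/ } split /\n/, $raw;
print "$text\n";
```

This handles only vertical whitespace; indentation of lists, tables and link footnotes would each need further work, which is where the effort mentioned above goes.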

        OTOH did you look at the outcome of your post (as is recommended)?!? It screwed up the whole view for this thread. Use <code> tags around the stuff you pasted, although it's not strictly code. At least that has smart line wrap...

        Update: the post has been fixed, hence the above comment does not apply any more.

        Ciao

Re: Any pure-perl html to text? (Or: missing a perl equivalent to 'lynx -dump')
by grep (Monsignor) on Oct 15, 2006 at 17:23 UTC
    A quick search of cpan gave me HTML::FormatText::WithLinks.

    From the POD:
    DESCRIPTION

    HTML::FormatText::WithLinks takes HTML and turns it into plain text but prints all the links in the HTML as footnotes. By default, it attempts to mimic the format of the lynx text based web browser's --dump option.
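
A minimal usage sketch of the module described above, assuming it is installed from CPAN. The sample HTML and the footnote formatting options chosen here are illustrative only; see the module's POD for the defaults and the full option list.

```perl
use strict;
use warnings;
use HTML::FormatText::WithLinks;

# Hypothetical input; a scraper would pass the fetched page content instead.
my $html = '<html><body><p>See the '
         . '<a href="http://example.com/jobs">job list</a>'
         . ' for details.</p></body></html>';

my $formatter = HTML::FormatText::WithLinks->new(
    before_link => '',          # nothing in front of the link text
    after_link  => ' [%n]',     # footnote number after the link text
    footnote    => '[%n] %l',   # "[number] URL" in the footnote list
);

# parse() returns the formatted text, or undef on failure.
my $text = $formatter->parse($html);
die "formatting failed: " . $formatter->error unless defined $text;

print $text;
```

The link text stays inline and the URL is moved to a numbered footnote at the end, which is the lynx-like behaviour the POD describes.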

    Also please use '<code>' not '<pre>' tags when posting, then preview your post before creating.



    grep
    One dead unjugged rabbit fish later
      A quick search of cpan gave me HTML::FormatText::WithLinks

      ...and THIS was the answer! Thanks grep!

      Ciao!
      --bronto


Re: Any pure-perl html to text? (Or: missing a perl equivalent to 'lynx -dump')
by davidrw (Prior) on Oct 15, 2006 at 17:39 UTC
    This may or may not help with your specific case, but in general HTML::TableExtract can be extremely useful as well.
Re: Any pure-perl html to text? (Or: missing a perl equivalent to 'lynx -dump')
by monarch (Priest) on Oct 16, 2006 at 08:52 UTC
    I tend to do these things by hand, even though I know I really shouldn't.
        my $string = "..htmlstuff..";

        # strip out newlines
        $string =~ s/[\r\n]+/ /sg;

        # replace <p> with custom paragraph marker
        my $marker_paragraph = "**PARAGRAPHHERE**";
        $string =~ s/<p(\s[^>]*)?>/$marker_paragraph/isg;

        # remove all HTML tags
        $string =~ s/<[^>]*>//sg;

        # replace custom paragraph marker with blank line
        $string =~ s/\Q$marker_paragraph\E/\n\n/sg;

    You can add other transforms, such as wrapping at a particular column etc.
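
As one example of such an extra transform, here is a sketch that wraps each paragraph with the core Text::Wrap module. The sample input and the 40-column width are assumptions for illustration; the tag-stripping steps are the same regexes as above, except that the paragraph marker is used to split the text instead of being replaced directly.

```perl
use strict;
use warnings;
use Text::Wrap qw(wrap);

# Hypothetical input standing in for "..htmlstuff..".
my $string = '<h1>Job offer</h1>'
           . '<p class="x">Unix system administrator with Perl experience '
           . "wanted for a\nlong-term position.</p><p>Apply by e-mail.</p>";

# Same steps as above: flatten newlines, mark paragraphs, strip tags.
$string =~ s/[\r\n]+/ /g;
my $marker_paragraph = "**PARAGRAPHHERE**";
$string =~ s/<p(\s[^>]*)?>/$marker_paragraph/ig;
$string =~ s/<[^>]*>//g;

# Wrap each paragraph separately at an assumed width of 40 columns.
$Text::Wrap::columns = 40;
my @wrapped;
for my $para (split /\Q$marker_paragraph\E/, $string) {
    $para =~ s/\s+/ /g;        # collapse runs of whitespace
    $para =~ s/^ | $//g;       # trim leading and trailing space
    next unless length $para;
    push @wrapped, wrap('', '', $para);
}

# Paragraphs are separated by a blank line, as in the original approach.
my $text = join "\n\n", @wrapped;
print "$text\n";
```

Like the original, this does not decode entities such as &amp;amp; and will be confused by a literal ">" inside an attribute value, so it is only suitable for reasonably tame HTML.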

Re: Any pure-perl html to text? (Or: missing a perl equivalent to 'lynx -dump')
by spatterson (Monk) on Oct 16, 2006 at 10:16 UTC
    I've had reasonable success with SGML::StripParser, though that just rips the tags out, leaving the rest formatted as-is.

    just another cpan module author
