PerlMonks  

Extract text from HTML

by Anonymous Monk
on Dec 28, 2002 at 14:40 UTC (#222730)
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Perl Gods, I need some help. Suppose I want to search a variable containing the contents of an HTML page; what would be the best way? For example, the HTML page has "NOTE: Something something", but "NOTE:" is surrounded by a bunch of HTML gook. All I want is the stuff after "NOTE:". How would I do that?
<font face='Arial,Helvetica,Sans Serif' color='#000055' size='2'/><B>Note:</B> &nbsp;</font><font face='Arial,Helvetica,Sans Serif' color='#000055' size='2'/>By arrangement.</font>
I want to extract "By arrangement." from the above. How would I do that? Thanks, Helpless (and Stupid)

Re: Extract text from HTML
by Juerd (Abbot) on Dec 28, 2002 at 14:47 UTC
Re: Extract text from HTML
by jdporter (Canon) on Dec 28, 2002 at 15:29 UTC
    Here's a nice little function that does it.

    (FYI: the FAQ noted by Juerd is obsolete. Not only does it not give an actual solution, but it also recommends HTML::Parse, which is now deprecated.)
    use HTML::Parser;

    sub extract_html_text {
        my $html = shift;
        my $text = '';
        HTML::Parser->new(
            api_version => 3,
            text_h      => [ sub { $text .= "@_"; }, "dtext" ],
        )->parse( $html )->eof;
        $text
    }

    UPDATE: Here's another (imho, nicer) little function that does it:
    use HTML::TreeBuilder;

    sub extract_html_text {
        HTML::TreeBuilder->new_from_content($_[0])->as_text
    }

    jdporter
    ...porque es dificil estar guapo y blanco.

      jdporter, thanks a ton for your help; that did it. I have a follow-up question, though. Building on your answer, I am using:
      my $we = extract_html_text($browser->{res}->content);
      my @note = $we =~ m/Note:\s*([^<]+)/gi;
      to first strip the HTML gook and then search the remainder for "Note". The problem is that @note contains everything after "Note:", i.e. all the remaining text of the (formerly) HTML file. Is there any way I can get it to search for "Note:", collect the information after it, and stop when it reaches "Pre", "Attrib" or "Link"? I would greatly appreciate your feedback. Thanks.
        How about:
        my ($note) = $we =~ m/Note:\s*(.+?)(?:Pre|Attrib|Link)/sgi;
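
To see how the non-greedy match behaves, here is a minimal sketch run against a made-up string. The sample text in $we is an assumption for illustration, not the poster's real page, and the extra \s* before the keywords is an addition that trims trailing whitespace from the capture.

```perl
use strict;
use warnings;

# Hypothetical sample: roughly what the page text might look like
# once the tags have been stripped.
my $we = "Title Note: By arrangement. Pre: none Attrib: X Link: home";

# (.+?) is non-greedy, so it stops at the first keyword that follows;
# /s lets "." match newlines, /i makes the keywords case-insensitive.
my ($note) = $we =~ m/Note:\s*(.+?)\s*(?:Pre|Attrib|Link)/si;

print defined $note ? "$note\n" : "no note found\n";
```

One caveat with this pattern: because of /i and the lack of word boundaries, ordinary words such as "present" or "linked" inside the note would also end the match. If the real keywords are always followed by a colon, matching "Pre:" and so on may be more robust.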
Re: Extract text from HTML
by vek (Prior) on Dec 29, 2002 at 14:09 UTC
Re: Extract text from HTML
by osama (Scribe) on Dec 29, 2002 at 20:56 UTC
    I used to like reinventing the wheel every time... I used to do something like this:
    # THIS IS BAD
    s/(\s|\&nbsp;)+/ /g;
    s/<(BR|P)>/\n/ig;
    s/<.+?>//g;
    Now I just:
    use HTML::TokeParser;
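
The post stops at the use line, so here is a minimal sketch of how HTML::TokeParser might be applied to the original question. The helper name note_text and the shortened sample markup are assumptions for illustration; only new, get_token and decode_entities come from the modules themselves.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use HTML::Entities qw(decode_entities);

# Hypothetical helper: return the first non-blank text chunk
# that follows a "Note:" label, or nothing if there is none.
sub note_text {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html) or return;
    my $seen_label = 0;
    while (my $token = $p->get_token) {
        next unless $token->[0] eq 'T';    # only text tokens
        my $text = $token->[1];
        decode_entities($text);            # e.g. &nbsp; becomes \xA0
        $text =~ tr/\xA0/ /;               # treat non-breaking space as a space
        $text =~ s/^\s+|\s+$//g;           # trim
        next unless length $text;          # skip blank chunks
        return $text if $seen_label;
        $seen_label = 1 if $text =~ /^Note:$/i;
    }
    return;
}

# Shortened version of the markup from the question.
my $html = q{<font size='2'/><B>Note:</B> &nbsp;</font>}
         . q{<font size='2'/>By arrangement.</font>};
print note_text($html), "\n";
```

This walks the token stream instead of flattening the whole page first, so the value is picked up by its position after the label rather than by guessing where it ends. On larger documents a run of text may arrive as more than one token, so real code may need to join adjacent text chunks.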
