
Extract text from HTML

by Anonymous Monk
on Dec 28, 2002 at 14:40 UTC (#222730, perlquestion)
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Perl Gods, I need some help. Suppose I want to search a variable containing the contents of an HTML page; what would be the best way? For example, the HTML page has "NOTE: Something something", but "NOTE:" is surrounded by a bunch of HTML gook. All I want is the stuff after "NOTE:". How would I do that?
<font face='Arial,Helvetica,Sans Serif' color='#000055' size='2'/><B>Note:</B> &nbsp;</font><font face='Arial,Helvetica,Sans Serif' color='#000055' size='2'/>By arrangement.</font>
I want to extract "By arrangement." from the above. How would I do that? Thanks, Helpless (and Stupid)

Replies are listed 'Best First'.
Re: Extract text from HTML
by jdporter (Canon) on Dec 28, 2002 at 15:29 UTC
    Here's a nice little function that does it.

    (FYI - the faq noted by Juerd is obsolete. Not only does it not give an actual solution, but it recommends HTML::Parse, which is now deprecated.)
    use HTML::Parser;

    sub extract_html_text {
        my $html = shift;
        my $text = '';
        HTML::Parser->new(
            api_version => 3,
            text_h      => [ sub { $text .= "@_"; }, "dtext" ],
        )->parse( $html )->eof;
        $text
    }

    UPDATE: Here's another (imho, nicer) little function that does it:
    use HTML::TreeBuilder;

    sub extract_html_text {
        HTML::TreeBuilder->new_from_content($_[0])->as_text
    }

    ...porque es dificil estar guapo y blanco.
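As a rough illustration, the first function above can be run against the fragment from the question. This is only a sketch: the fragment is the one quoted in the question with its line-wrap markers removed, and the `\xA0` normalization and the final regex are assumptions added here, not part of the original answer. The `dtext` argspec decodes `&nbsp;` to the character U+00A0, so the sketch turns it into a plain space before matching so that `\s*` reliably skips it.

```perl
use strict;
use warnings;
use HTML::Parser;

# The HTML::Parser-based function from the reply above, reflowed.
sub extract_html_text {
    my $html = shift;
    my $text = '';
    HTML::Parser->new(
        api_version => 3,
        text_h      => [ sub { $text .= "@_" }, 'dtext' ],
    )->parse($html)->eof;
    return $text;
}

# The fragment quoted in the question.
my $html = q{<font face='Arial,Helvetica,Sans Serif' color='#000055' size='2'/>}
         . q{<B>Note:</B> &nbsp;</font>}
         . q{<font face='Arial,Helvetica,Sans Serif' color='#000055' size='2'/>}
         . q{By arrangement.</font>};

my $text = extract_html_text($html);

# Assumption: treat the decoded non-breaking space as ordinary whitespace.
$text =~ s/\xA0/ /g;

# Capture everything after "Note:" up to the last non-space character.
my ($note) = $text =~ /Note:\s*(.*\S)/s;
print defined $note ? "$note\n" : "no note found\n";
```

On a full page this regex would capture everything to the end of the text, which is the problem raised in the follow-up below.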

      jdporter, thanks a ton for your help, that did it. I have a follow-up question, though. Based on your answer, I am using:
      my $we = extract_html_text($browser->{res}->content);
      my @note = $we =~ m/Note:\s*([^<]+)/gi;
      to first strip the HTML gook and then search the remainder for "Note". The thing is that @note ends up holding everything after "Note:", i.e. all the remaining text of the (formerly) HTML page. Is there any way to make it search for "Note:", collect the information after it, and stop when it reaches "Pre", "Attrib", or "Link"? I would greatly appreciate your feedback. Thanks.
        How about:
        my ($note) = $we =~ m/Note:\s*(.+?)(?:Pre|Attrib|Link)/sgi;
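The original pattern `[^<]+` ran to the end of the text because, once the tags are stripped, there is no `<` left to stop it. The sketch below shows the non-greedy alternative on a made-up string. The sample text, the `extract_note` helper name, the leading `\b`, and the trailing `\s*` are all assumptions added for illustration; the real page's labels and layout may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text between "Note:" and the first
# stop word, or undef if the pattern does not match.
sub extract_note {
    my ($text) = @_;
    # .+?  is non-greedy, so it stops at the FIRST stop word.
    # /s   lets "." cross newlines; /i ignores case.
    # \b   (an addition) avoids stopping on "pre" inside a word like "express".
    # \s*  (an addition) keeps trailing whitespace out of the capture.
    my ($note) = $text =~ m/Note:\s*(.+?)\s*\b(?:Pre|Attrib|Link)/si;
    return $note;
}

# Made-up stripped page text, standing in for $we.
my $we = "Title: Some Course Note: By arrangement.\n"
       . "Second line. Pre: none Attrib: W Link: n/a";

my $note = extract_note($we);
print defined $note ? "[$note]\n" : "no match\n";
```

If none of the stop words appears after "Note:", the match fails and the helper returns undef, so callers should check for that case.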
Re: Extract text from HTML
by Juerd (Abbot) on Dec 28, 2002 at 14:47 UTC
Re: Extract text from HTML
by vek (Prior) on Dec 29, 2002 at 14:09 UTC
Re: Extract text from HTML
by osama (Scribe) on Dec 29, 2002 at 20:56 UTC
    I used to like reinventing the wheel every time... I used to do something like this:
    # THIS IS BAD
    s/(\s|\&nbsp;)+/ /g;
    s/<(BR|P)>/\n/ig;
    s/<.+?>//g;
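To see why the substitutions above earn the "THIS IS BAD" label, here is a small sketch of two inputs that trip the tag-stripping regex. Both inputs and the `naive_strip` name are invented for this illustration.

```perl
use strict;
use warnings;

# The tag-stripping substitution from the post, wrapped in a function.
sub naive_strip {
    my ($html) = @_;
    $html =~ s/<.+?>//g;
    return $html;
}

# Case 1: a ">" inside an attribute value ends the "tag" too early,
# so the rest of the tag leaks into the output.
my $attr = naive_strip(q{<img alt="a > b" src="x.png">caption});

# Case 2: without /s, "." does not match a newline, so a tag that
# spans two lines is never matched and survives intact.
my $multi = naive_strip("<a\nhref='x'>link</a>");

print "attr:  [$attr]\n";
print "multi: [$multi]\n";
```

A real parser handles both cases, which is the reason the replies in this thread recommend one.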
    Now I just:
    use HTML::TokeParser;
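The post stops at the `use` line, so here is one possible way HTML::TokeParser could be applied to the fragment from the question. It is a sketch, not osama's code: the `note_after_label` name is hypothetical, and it assumes the value always sits in the next `<font>` element after a bold "Note:" label, as it does in the quoted fragment.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: return the text of the <font> element that follows
# a bold "Note:" label, or undef if no such label is found.
sub note_after_label {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html)
        or die "Cannot create parser";

    # Tag names are reported in lower case, so 'b' also finds <B>.
    while ( $p->get_tag('b') ) {
        my $label = $p->get_trimmed_text('/b');
        next unless $label =~ /^Note:$/i;

        # Assumption: the value is in the next <font> start tag's element.
        $p->get_tag('font') or return;
        return $p->get_trimmed_text('/font');
    }
    return;
}

# The fragment quoted in the question.
my $html = q{<font face='Arial,Helvetica,Sans Serif' color='#000055' size='2'/>}
         . q{<B>Note:</B> &nbsp;</font>}
         . q{<font face='Arial,Helvetica,Sans Serif' color='#000055' size='2'/>}
         . q{By arrangement.</font>};

my $note = note_after_label($html);
print defined $note ? "$note\n" : "no note found\n";
```

Compared with stripping all tags first and then running a regex, this walks the token stream and stops at the element boundary, so no stop words are needed.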
