Re: Stripping HTML tags efficiently


by TedPride (Priest)

on Dec 10, 2004 at 09:41 UTC

in reply to Stripping HTML tags efficiently

It looks like you're just trying to extract the tags from the document. The following should work:

use strict; use warnings;

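# Read up to 1024 bytes from the DATA section into $_.
read(DATA, $_, 1024);
# In list context m//g returns every match; print them one per line.
print join "\n", m/<.*?>/g;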

__DATA__
Once <a href="foo.html">upon</a> a time there was a
<font color="#FF0000">CODE <b>RED</b></font> situation.

EDIT: As per Crian's comment, the print line above should be `print join "\n", m/<.*?>/sg;` instead, so that `.` also matches newlines and tags spanning more than one line are found.
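To see what the /s modifier changes, here is a small sketch. The sub name extract_tags and the sample string are made up for illustration. Without /s, `.` does not match a newline, so a tag that is split across two lines is skipped.

```perl
use strict;
use warnings;

# Hypothetical helper: return every <...> match in a string.
# With $dot_all true the /s modifier is used, so "." also matches "\n".
sub extract_tags {
    my ($html, $dot_all) = @_;
    return $dot_all ? $html =~ m/<.*?>/sg
                    : $html =~ m/<.*?>/g;
}

# Made-up sample: the <font> tag is split across two lines.
my $html = qq{Once <a href="foo.html">upon</a> a time <font\ncolor="#FF0000">CODE</font>};

my @without = extract_tags($html, 0);   # skips the split <font ...> tag
my @with    = extract_tags($html, 1);   # finds it

print scalar(@without), " tags without /s\n";
print scalar(@with),    " tags with /s\n";
```

Note that /s only helps when the whole tag is in the string being matched, so it does nothing for the line by line version.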

Or a line by line version, if you're working with large files:

use strict; use warnings;

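# Read one line at a time and print every tag found on that line.
# Capturing into $1 avoids the global slowdown that $& causes on older perls.
while (<DATA>) {
    print "$1\n" while m/(<.*?>)/g;
}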

__DATA__
Once <a href="foo.html">upon</a> a time there was a
<font color="#FF0000">CODE <b>RED</b></font> situation.

This is not really a robust method, however, and you're probably better off using a library unless your needs are simple and you're sure the tags are formatted properly.
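As one example of the library route, here is a minimal sketch using HTML::TokeParser, the module named in the replies to this node. It comes with the HTML-Parser distribution on CPAN and is not a core module, so it has to be installed first. The sub name list_tags is only for illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: return the original text of every start and end tag.
sub list_tags {
    my ($html) = @_;
    # A reference to a scalar is parsed as the literal document.
    my $parser = HTML::TokeParser->new(\$html)
        or die "Could not create parser";
    my @tags;
    while (my $token = $parser->get_token) {
        my $type = $token->[0];
        # Start tokens: ["S", $tag, $attr, $attrseq, $text]
        # End tokens:   ["E", $tag, $text]
        push @tags, $token->[4] if $type eq 'S';
        push @tags, $token->[2] if $type eq 'E';
    }
    return @tags;
}

my $html = <<'HTML';
Once <a href="foo.html">upon</a> a time there was a
<font color="#FF0000">CODE <b>RED</b></font> situation.
HTML

print "$_\n" for list_tags($html);
```

The parser works on the whole document, so a tag that spans several lines comes back as one token.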


Replies are listed 'Best First'.
Re^2: Stripping HTML tags efficiently by Your Mother (Archbishop) on Dec 10, 2004 at 18:48 UTC
Both approaches are pretty flawed. Breaking the text into chunks will often break tags in half, e.g. `<p style="bor1024_markder:1px solid black">` (where 1024_mark shows the chunk boundary falling inside the tag), and reading line by line will split tags that cross lines:

`<img src="/some/path/somewhere.png" alt="A long title" style="display:block" class="article" />`

Parsing HTML correctly is non-trivial. With one of the HTML parser modules, like HTML::TokeParser et al, you'll be sure it's right.
Re^2: Stripping HTML tags efficiently by Crian (Curate) on Dec 10, 2004 at 11:13 UTC
And what if a tag is split across two or more lines? You will miss those by doing it this way.
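One way to keep reading line by line without missing such tags is to carry an unfinished tag over to the next line. This is only a sketch: it uses the same kind of simple <...> pattern, so it still trips over a > inside a quoted attribute, and a stray < that is never closed keeps the buffer growing. The sub name tags_from_lines is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: take a list of lines and return every <...> tag,
# including tags that start on one line and end on a later one.
sub tags_from_lines {
    my @lines = @_;
    my ($buffer, @tags) = ('');
    for my $line (@lines) {
        $buffer .= $line;
        # Collect every complete tag currently in the buffer.
        push @tags, $1 while $buffer =~ m/(<[^>]*>)/g;
        # Keep only an unfinished tag (a "<" with no ">" after it), if any.
        $buffer = $buffer =~ m/(<[^>]*)\z/ ? $1 : '';
    }
    return @tags;
}

my @lines = (
    qq{<img src="/some/path/somewhere.png"\n},
    qq{     alt="A long title" />\n},
    qq{<b>RED</b>\n},
);
print "$_\n---\n" for tags_from_lines(@lines);
```

In a real script the lines would come from a filehandle instead of a list, with the same buffer logic inside the while (<FH>) loop.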
