Stripping HTML tags efficiently

agynr has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Stripping HTML tags efficiently by davido (Cardinal) on Dec 10, 2004 at 07:10 UTC
I haven't benchmarked it myself, but I have used HTML::Strip to strip HTML from a document, and have found it to be effective and simple. The POD for the module claims that it is about five times faster than using regular expressions to strip HTML. Here's how you do it: `use strict; use warnings; use LWP::Simple; use HTML::Strip; my $raw_html = get( 'http://www.somewebsite.com' ); my $hs = HTML::Strip->new(); my $clean_text = $hs->parse( $raw_html ); $hs->eof; print $clean_text, "\n";` Dave
Re: Stripping HTML tags efficiently by gaal (Parson) on Dec 10, 2004 at 06:43 UTC
(Please surround your code with CODE tags to keep it readable.) If you just want to de-HTMLify a document, the fastest way I know of doing it would be to run it through `lynx -dump`. This even gives you a bit of formatting. If you really need to overwrite tags with spaces, and in the proper amount, then your approach of making a pattern first and then using it is not bad, but you're making two mistakes. First, you're only making a string, not a compiled regexp. You can very easily fix that by changing your first statement to: `my $pattern = qr/ ...whatever was here before... /;` Secondly, you are doing the work twice: first you just match for tags, then you substitute. Don't do that. `1 while $target_data =~ s/$pattern/' ' x length $1/ge;` (This is not tested! At all!) Finally, don't use regexps to parse HTML. Use an HTML::Parser.
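For reference, here is a minimal sketch of the compiled-pattern, single-pass idea described above. The pattern itself is an assumption, since the original one is not shown in the thread: it simply captures anything between `<` and `>` in `$1`, and the sample string is made up.

```perl
use strict;
use warnings;

# Assumed stand-in pattern: one capture group around a whole tag,
# so $1 holds the tag text. qr// compiles it once.
my $pattern = qr/(<[^>]+>)/;

# Hypothetical sample data.
my $target_data = 'Once <b>upon</b> a time';

# Single pass: /g visits every tag, /e evaluates the replacement as code,
# so each tag is overwritten with the same number of spaces.
(my $blanked = $target_data) =~ s/$pattern/' ' x length($1)/ge;

print "$blanked\n";
```

Because every tag is replaced by a run of spaces of the same length, the remaining text keeps its original offsets.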
Re^2: Stripping HTML tags efficiently by agynr (Acolyte) on Dec 10, 2004 at 11:52 UTC
Thanks for your useful advice. Until you told me, I was unaware of that particular module. I have used HTML::Parser, but in a different way: I put my data in a file (try.txt) and then parsed it like this: `my $p = HTML::Parser->new( text_h => [ \&text, 'dtext' ] ); $p->parse_file('try.txt') or die $!; open FILE, ">output.txt" or die "Can't: $!\n"; sub text { my $text = shift; $output .= $text; }` Anyhow, thanks once again.
Re^2: Stripping HTML tags efficiently by agynr (Acolyte) on Dec 11, 2004 at 08:33 UTC
Sir, I am having one problem again. The code completely eliminates the HTML tags, but what I want is to convert it into tags, which it is not doing. Can you please tell me how it can be done?
Re^3: Stripping HTML tags efficiently by gaal (Parson) on Dec 11, 2004 at 08:59 UTC
If I understand what you're trying to do: You want to strip out all the tags from the original data, but gather them all in a separate place? Okay, instead of doing nothing ("1"), gather the data. `my @extracted_tags; push @extracted_tags, $1 while s/$pattern/" " x length $1/e;` (Not tested, either!) This puts the separate tags in separate elements of @extracted_tags. Note that there is no /g here: each pass of the loop replaces one tag, so $1 holds that tag when it is pushed. If you want them all together in a single string, try this. `my $extracted_tags; $extracted_tags .= $1 while s/$pattern/" " x length $1/e;` The better you manage to specify what you want to do, the easier it will be for you to do it.
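Another way to gather the tags, sketched here under the assumption that the pattern captures the whole tag in `$1`, is to do the push inside the replacement itself. The string is then scanned only once. The pattern and sample data below are made up for illustration.

```perl
use strict;
use warnings;

# Assumed tag pattern with a single capture group.
my $pattern = qr/(<[^>]+>)/;

# Hypothetical sample data.
my $data = 'Once <a href="foo.html">upon</a> a time';

my @extracted_tags;

# With /e the replacement is code that runs once per match:
# record the tag, then return the spaces that overwrite it.
$data =~ s{$pattern}{ push @extracted_tags, $1; ' ' x length($1) }ge;

print "$_\n" for @extracted_tags;
print "$data\n";
```

The last expression in the replacement block is what gets substituted, so the push has to come first.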
Re: Stripping HTML tags efficiently by Crian (Curate) on Dec 10, 2004 at 11:18 UTC
One problem is that you have a variable in your regular expression, which is not necessary in this case. This always slows things down. What about using `qr//` to compile the expression, or just putting the pattern into the RE directly? `while ($target_data=~m/(<[^>]{1,300}>)/gi)` (You don't have to escape `<` and `>`, btw.)
Re^2: Stripping HTML tags efficiently by agynr (Acolyte) on Dec 10, 2004 at 12:08 UTC
You are also right. Thanks for that.
Re: Stripping HTML tags efficiently by Animator (Hermit) on Dec 10, 2004 at 12:41 UTC
Why limit the size of the tag from 1 to 300 (instead of using * or +)? I'm not 100% sure, but this might slow it down...
Re: Stripping HTML tags efficiently by TedPride (Priest) on Dec 10, 2004 at 09:41 UTC
It looks like you're just trying to extract the tags from the document. The following should work: `use strict; use warnings; read(DATA, $_, 1024); print join "\n", m/<.*?>/g; __DATA__ Once <a href="foo.html">upon</a> a time there was a <font color="#FF0000">CODE <b>RED</b></font> situation.` EDIT: As per Crian's comment, the above should be `print join "\n", m/<.*?>/sg;` instead. Or a line by line version, if you're working with large files: `use strict; use warnings; while (<DATA>) { print $&."\n" while m/<.*?>/g; } __DATA__ Once <a href="foo.html">upon</a> a time there was a <font color="#FF0000">CODE <b>RED</b></font> situation.` This is not really a robust method, however, and you're probably better off using a library unless your needs are simple and you're sure the tags are formatted properly.
Re^2: Stripping HTML tags efficiently by Crian (Curate) on Dec 10, 2004 at 11:13 UTC
And what if a tag is split across two or more lines? You will miss those by doing it this way.
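To make the point concrete, here is a small sketch comparing line-by-line matching with matching over the whole string using /s. The sample HTML is hypothetical; its `<img>` tag is deliberately split across two lines.

```perl
use strict;
use warnings;

# Hypothetical sample: the <img> tag spans two lines.
my $html = qq{<p>Intro</p>\n<img src="a.png"\n     alt="A long title" />\n};

# Line-by-line matching: no single line contains the whole <img> tag.
my @by_line;
for my $line (split /^/, $html) {
    push @by_line, $line =~ /<.*?>/g;
}

# Whole-string matching with /s, so "." also matches newlines.
my @whole = $html =~ /<.*?>/sg;

printf "line by line: %d tags, whole string: %d tags\n",
    scalar @by_line, scalar @whole;
```

Reading the whole document into one string avoids this particular problem, but it still relies on a regex, so it does nothing for things like a `>` inside an attribute value.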
Re^2: Stripping HTML tags efficiently by Your Mother (Archbishop) on Dec 10, 2004 at 18:48 UTC
Both approaches are pretty flawed. Breaking text into chunks is often going to break tags in half, e.g. <p style="bor1024_markder:1px solid black">, and reading line by line is going to split tags that cross lines: `<img src="/some/path/somewhere.png" alt="A long title" style="display:block" class="article" />` Parsing HTML correctly is non-trivial. With one of the HTML parser modules, like HTML::TokeParser et al, you'll be sure it's right.
