Re: Stripping HTML tags efficiently
by davido (Cardinal) on Dec 10, 2004 at 07:10 UTC
|
I haven't benchmarked it myself, but I have used HTML::Strip to strip HTML from a document, and have found it to be effective and simple. The POD for the module claims that it is about five times faster than using regular expressions to strip HTML.
Here's how you do it:
use strict;
use warnings;
use LWP::Simple;
use HTML::Strip;
my $raw_html = get( 'http://www.somewebsite.com' )
    or die "Could not fetch the page\n";   # get() returns undef on failure
my $hs = HTML::Strip->new();
my $clean_text = $hs->parse( $raw_html );
$hs->eof;
print $clean_text, "\n";
Re: Stripping HTML tags efficiently
by gaal (Parson) on Dec 10, 2004 at 06:43 UTC
|
(Please surround your code with CODE tags to keep it readable.)
If you just want to de-HTMLify a document, the fastest way I know of doing it would be to run it through lynx -dump. This even gives you a bit of formatting.
If you really need to overwrite tags with spaces, and in the proper amount, then your approach of making a pattern first and then using it is not bad, but you're making two mistakes. First, you're only making a string, not a compiled regexp. You can very easily fix that by changing your first statement to:
my $pattern = qr/ ...whatever was here before... /;
Secondly, you are doing the work twice: first you just match for tags, then you substitute. Don't do that.
1 while $target_data =~ s/$pattern/' ' x length $1/ge;
(This is not tested! At all!)
Finally, don't use regexps to parse HTML. Use HTML::Parser.
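The two fixes above (compile the pattern once with qr//, then substitute in a single pass) can be sketched as follows. The pattern here is a hypothetical stand-in, since the original poster's pattern is not shown in this thread, and blank_tags is a name made up for the example.

```perl
use strict;
use warnings;

# Hypothetical pattern: the pattern from the original question is not shown.
# The parentheses matter, because the replacement uses $1.
my $pattern = qr/(<[^>]*>)/;

sub blank_tags {
    my ($text) = @_;
    # One pass: every matched tag is replaced by the same number of spaces,
    # so the offsets of the remaining text do not change.
    $text =~ s/$pattern/' ' x length $1/ge;
    return $text;
}

my $html  = 'Once <a href="foo.html">upon</a> a time';
my $plain = blank_tags($html);
print "$plain\n";
print "length preserved\n" if length($html) == length($plain);
```

Like any regexp approach, this sketch assumes that no ">" appears inside an attribute value.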
|
Thanks for your useful advice. Until you told me, I was unaware of that module. I have used HTML::Parser, but in a different way: I put my data in a file and then parsed it as shown below.
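use strict;
use warnings;
use HTML::Parser;
my $output = '';
my $p = HTML::Parser->new(
    text_h => [ \&text, 'dtext' ],   # handler and argspec go in one array ref
);
# my data is in this file
$p->parse_file('try.txt') or die $!;
open my $fh, '>', 'output.txt' or die "Can't: $!\n";
print {$fh} $output;
close $fh;
sub text {
    my $text = shift;    # decoded text between the tags
    $output .= $text;
}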
Anyhow, thanks once again.
|
Sir,
I have one problem again: the code completely eliminates the HTML tags, but what I want is to convert it into tags, which it is not doing. Can you please tell me how this can be done?
|
If I understand what you're trying to do:
You want to strip out all the tags from the original data, but gather them all in a separate place? Okay, instead of doing nothing ("1"), gather the data.
my @extracted_tags;
# no /g here: replace one tag per iteration so that every $1 gets pushed
push @extracted_tags, $1 while s/$pattern/" " x length $1/e;
(Not tested, either!)
This puts the separate tags in separate elements of @extracted_tags. If you want them all together in a single string, try this.
my $extracted_tags;
$extracted_tags .= $1 while s/$pattern/" " x length $1/e;
The better you manage to specify what you want to do, the easier it will be for you to do it.
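Another way to gather the tags while blanking them is to do both inside a single s///ge, since the replacement is ordinary Perl code under /e. This is only a sketch: the pattern is a hypothetical stand-in for the one in the original question, and blank_and_collect is a name invented for the example.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the pattern from the original question.
my $pattern = qr/(<[^>]*>)/;

sub blank_and_collect {
    my ($text) = @_;
    my @tags;
    # For each match: remember the tag, then return the spaces that replace it.
    $text =~ s{$pattern}{ push @tags, $1; ' ' x length $1 }ge;
    return ($text, \@tags);
}

my ($blanked, $tags) = blank_and_collect('CODE <b>RED</b> situation');
print "$blanked\n";
print "$_\n" for @$tags;
```

Returning the tags as an array reference keeps them as separate elements; join '', @$tags gives the single-string form.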
Re: Stripping HTML tags efficiently
by Crian (Curate) on Dec 10, 2004 at 11:18 UTC
|
The (or one) problem is that you have a variable in your regular expression, which is not necessary in this case. This always slows things down.
What about using qr// to compile the expression, or just putting the pattern into the RE directly?
while ($target_data=~m/(<[^>]{1,300}>)/gi)
(You don't have to escape < and > btw.)
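Whether the variable actually costs anything measurable is easy to check with the core Benchmark module. The sketch below compares an interpolated string, a qr// object, and a literal pattern; the sample data, the iteration count, and the sub names are made up for illustration, and the numbers will depend on your Perl version and your data.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up sample data: 200 tags in total.
my $data = 'Once <a href="foo.html">upon</a> a time <b>there</b> was ' x 50;

my $string   = '(<[^>]{1,300}>)';     # plain string, interpolated into the match
my $compiled = qr/(<[^>]{1,300}>)/;   # compiled once with qr//

# Each sub counts the tags in $data using one of the three variants.
sub count_string   { my $n = () = $data =~ /$string/g;         return $n }
sub count_compiled { my $n = () = $data =~ /$compiled/g;       return $n }
sub count_literal  { my $n = () = $data =~ /(<[^>]{1,300}>)/g; return $n }

# Run each variant 2000 times and print a comparison table.
cmpthese(2000, {
    string   => \&count_string,
    compiled => \&count_compiled,
    literal  => \&count_literal,
});
```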
|
You are also right. Thanks for that.
Re: Stripping HTML tags efficiently
by Animator (Hermit) on Dec 10, 2004 at 12:41 UTC
|
Why limit the size of the tag from 1 to 300 (instead of using * or +)? I'm not 100% sure, but this might slow it down...
Re: Stripping HTML tags efficiently
by TedPride (Priest) on Dec 10, 2004 at 09:41 UTC
|
It looks like you're just trying to extract the tags from the document. The following should work:
use strict; use warnings;
read(DATA, $_, 1024);
print join "\n", m/<.*?>/g;
__DATA__
Once <a href="foo.html">upon</a> a time there was a
<font color="#FF0000">CODE <b>RED</b></font> situation.
EDIT: As per Crian's comment, the above should be print join "\n", m/<.*?>/sg; instead.
Or a line by line version, if you're working with large files:
use strict; use warnings;
while (<DATA>) {
print $&."\n" while m/<.*?>/g;
}
__DATA__
Once <a href="foo.html">upon</a> a time there was a
<font color="#FF0000">CODE <b>RED</b></font> situation.
This is not really a robust method, however, and you're probably better off using a library unless your needs are simple and you're sure the tags are formatted properly.
|
And what if a tag is split across two or more lines? You will miss those by doing it this way.
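If the input fits in memory, one way around the line-by-line problem is to read it all at once and match with /s, so that "." also matches newlines. A minimal sketch, with made-up sample HTML and a helper name (tags_in) invented for the example:

```perl
use strict;
use warnings;

sub tags_in {
    my ($html) = @_;
    # /s lets "." match newlines, so a tag may span several lines.
    # In list context the match returns every tag found.
    return $html =~ m/<.*?>/sg;
}

# Made-up sample with a tag that is split across two lines.
my $html = <<'END_HTML';
Once <a
  href="foo.html">upon</a> a time.
END_HTML

print "[$_]\n" for tags_in($html);
```

When reading from a file handle, my $html = do { local $/; <$fh> }; slurps the whole file in the same spirit. This is still a regexp, so it remains fragile for comments, scripts, and ">" inside attribute values.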
|
<p style="bor1024_markder:1px solid black">
and reading line by line is going to split tags in half that cross lines:
<img src="/some/path/somewhere.png"
alt="A long title"
style="display:block"
class="article" />
Parsing HTML correctly is non-trivial. With one of the HTML parser modules, such as HTML::TokeParser, you can be sure it's right.
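As a rough sketch of the parser route, HTML::TokeParser (from the HTML-Parser distribution on CPAN, not a core module) can list the tags of a document no matter how they are split across lines. The sample HTML and the helper name extract_tags are assumptions made for this example.

```perl
use strict;
use warnings;
use HTML::TokeParser;

sub extract_tags {
    my ($doc) = @_;
    my $parser = HTML::TokeParser->new(\$doc)
        or die "Cannot create parser: $!";
    my @tags;
    while ( my $token = $parser->get_token ) {
        my $type = $token->[0];
        # For start ("S") and end ("E") tokens, the last element
        # is the original source text of the tag.
        push @tags, $token->[-1] if $type eq 'S' or $type eq 'E';
    }
    return @tags;
}

# Made-up sample with a tag spread over several lines.
my $html = <<'END_HTML';
<p>An image:
<img src="/some/path/somewhere.png"
     alt="A long title"
     class="article" />
</p>
END_HTML

print "$_\n---\n" for extract_tags($html);
```

The same loop can collect "T" tokens instead if you want the text rather than the tags.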