PerlMonks  

Stripping HTML tags efficiently

by agynr (Acolyte)
on Dec 10, 2004 at 06:27 UTC ( [id://413770]=perlquestion )

agynr has asked for the wisdom of the Perl Monks concerning the following question:

Hello Everyone,
I am running the loop given below to find any HTML tags (<...> or </...>) in an HTML document, but the problem is that it consumes a lot of execution time. Can you please suggest another way to solve this problem?

P.S. $target_data is the data I am extracting from a file, and its size can be very large.

$pattern='(\<[^\>]{1,300}\>)';
while ($target_data=~m/$pattern/gi) {
    $m=$1;
    $space='';
    $space=' ' x length($m);
    $target_data=~s/$pattern/$space/i;
    print $m."\n";
}

Janitored by davido: Added formatting and code tags to reflect the OP's input layout.
Retitled by davido per consideration.

Replies are listed 'Best First'.
Re: Stripping HTML tags efficiently
by davido (Cardinal) on Dec 10, 2004 at 07:10 UTC

    I haven't benchmarked it myself, but I have used HTML::Strip to strip HTML from a document, and have found it to be effective and simple. The POD for the module claims that it is about five times faster than using regular expressions to strip HTML.

    Here's how you do it:

    use strict;
    use warnings;
    use LWP::Simple;
    use HTML::Strip;

    my $raw_html = get( 'http://www.somewebsite.com' );
    my $hs = HTML::Strip->new();
    my $clean_text = $hs->parse( $raw_html );
    $hs->eof;
    print $clean_text, "\n";

    Dave

Re: Stripping HTML tags efficiently
by gaal (Parson) on Dec 10, 2004 at 06:43 UTC
    (Please surround your code with CODE tags to keep it readable.)

    If you just want to de-HTMLify a document, the fastest way I know of doing it would be to run it through lynx -dump. This even gives you a bit of formatting.

    If you really need to overwrite tags with spaces, and in the proper amount, then your approach of making a pattern first and then using it is not bad, but you're making two mistakes. First, you're only making a string, not a compiled regexp. You can very easily fix that by changing your first statement to:

    my $pattern = qr/  ...whatever was here before...  /;

    Secondly, you are doing the work twice: first you just match for tags, then you substitute. Don't do that.

    1 while $target_data =~ s/$pattern/' ' x length $1/ge;

    (This is not tested! At all!)
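A minimal, self-contained sketch of the two suggestions above (compile the pattern with qr//, then blank the tags in a single substitution). The sample string is made up for illustration; the pattern, including its {1,300} bound, is taken from the original question.

```perl
use strict;
use warnings;

# Made-up sample input; in the question this is $target_data read from a file.
my $target_data = qq{Once <a href="foo.html">upon</a> a <b>time</b>.};
my $original_length = length $target_data;

# Compile the pattern once instead of keeping it as a plain string.
my $pattern = qr/(<[^>]{1,300}>)/;

# One pass: /g visits every tag, /e evaluates the replacement as Perl code,
# so each tag is overwritten with the same number of spaces.
$target_data =~ s/$pattern/' ' x length($1)/ge;

print "$target_data\n";
```

Because every tag is replaced by a run of spaces of the same length, the offsets of the remaining text do not change.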

    Finally, don't use regexps to parse HTML. Use an HTML::Parser.
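As a rough sketch of the parser route, assuming the HTML::Parser module from CPAN is installed: its version 3 API can pass each tag's raw text together with its offset and length to a handler, which is enough to collect the tags and blank them out in a copy of the input. The handler name on_tag and the sample string are made up for illustration.

```perl
use strict;
use warnings;
use HTML::Parser;

# Made-up sample input, with one tag that spans two lines.
my $html    = qq{Once <a href="foo.html">upon</a> a\n<b\n  class="x">time</b>.};
my $blanked = $html;    # copy in which tags are overwritten with spaces
my @tags;               # raw source of every start and end tag

# Called for start and end tags. 'text' is the tag as it appears in the
# input; 'offset' and 'length' locate it in the document.
sub on_tag {
    my ($text, $offset, $length) = @_;
    push @tags, $text;
    substr($blanked, $offset, $length) = ' ' x $length;
}

my $p = HTML::Parser->new(
    api_version => 3,
    start_h     => [ \&on_tag, 'text, offset, length' ],
    end_h       => [ \&on_tag, 'text, offset, length' ],
);
$p->parse($html);
$p->eof;

print "$_\n" for @tags;
```

This sketch only handles start and end tags; comments and declarations would need their own handlers (comment_h, declaration_h) if they should be blanked as well.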

      Thanks for your useful advice. Until you told me, I was unaware of that module. I have used HTML::Parser, but in a different way: I put my data in a file and then parsed it as shown below.

          my $p = HTML::Parser->new( text_h => [ \&text, 'dtext' ] );
          #### my data is in this file
          $p->parse_file('try.txt') or die $!;
          open FILE, ">output.txt" or die "Can't: $!\n";
          sub text { my $text = shift; $output .= $text; }

      Anyhow, thanks once again.
      Sir, I have one more problem. The code completely eliminates the HTML tags, but what I want is to convert it into tags, which it is not doing. Can you please tell me how that can be done?
        If I understand what you're trying to do:

        You want to strip out all the tags from the original data, but gather them all in a separate place? Okay, instead of doing nothing ("1"), gather the data.

        my @extracted_tags;
        # No /g here: each pass replaces one tag, so $1 holds that tag when it is pushed.
        push @extracted_tags, $1 while s/$pattern/" " x length $1/e;

        (Not tested, either!)

        This puts the separate tags in separate elements of @extracted_tags. If you want them all together in a single string, try this.

        my $extracted_tags;
        # Again without /g, so that every tag is appended, not just the last one.
        $extracted_tags .= $1 while s/$pattern/" " x length $1/e;

        The better you manage to specify what you want to do, the easier it will be for you to do it.
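For large inputs, a single-pass variant may be worth trying: the tag can be collected inside the replacement code itself, so the string is scanned only once. This is a sketch with a made-up sample string, and it applies the substitution to $target_data explicitly rather than to $_.

```perl
use strict;
use warnings;

# Made-up sample input.
my $target_data = qq{Once <a href="foo.html">upon</a> a <b>time</b>.};
my $pattern     = qr/(<[^>]{1,300}>)/;

my @extracted_tags;

# With /e the replacement is Perl code: remember the tag, then return
# the spaces that overwrite it. /g repeats this for every tag.
$target_data =~ s{$pattern}{
    push @extracted_tags, $1;
    ' ' x length($1);
}ge;

print "$_\n" for @extracted_tags;
print "$target_data\n";
```

Joining @extracted_tags with an empty string gives the single-string form.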

Re: Stripping HTML tags efficiently
by Crian (Curate) on Dec 10, 2004 at 11:18 UTC

    The (or one) problem is that you have a variable in your regular expression, which is not necessary in this case. This can slow things down, since the pattern has to be interpolated on every match.

    What about using qr// to compile the expression, or just putting the pattern into the RE directly?

    while ($target_data=~m/(<[^>]{1,300}>)/gi)

    (You don't have to escape < and > btw.)

      You are also right. Thanks for that.
Re: Stripping HTML tags efficiently
by Animator (Hermit) on Dec 10, 2004 at 12:41 UTC
    Why limit the size of the tag from 1 to 300 (instead of using * or +)? I'm not 100% sure, but this might slow it down...
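Leaving speed aside, the bound also changes what gets matched: a tag with more than 300 characters between < and > is silently skipped. A small sketch with a made-up, overly long tag:

```perl
use strict;
use warnings;

# A made-up tag whose attribute pushes it well past 300 characters.
my $long_tag = '<img alt="' . ('x' x 400) . '">';
my $html     = "<p>$long_tag</p>";

my @bounded   = $html =~ /(<[^>]{1,300}>)/g;   # pattern from the question
my @unbounded = $html =~ /(<[^>]+>)/g;         # same pattern with + instead

print scalar(@bounded),   " tags found with {1,300}\n";
print scalar(@unbounded), " tags found with +\n";
```

The bounded pattern finds only <p> and </p>, while the unbounded one also finds the long <img> tag.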
Re: Stripping HTML tags efficiently
by TedPride (Priest) on Dec 10, 2004 at 09:41 UTC
    It looks like you're just trying to extract the tags from the document. The following should work:
    use strict;
    use warnings;
    read(DATA, $_, 1024);
    print join "\n", m/<.*?>/g;
    __DATA__
    Once <a href="foo.html">upon</a> a time there was a <font color="#FF0000">CODE <b>RED</b></font> situation.
    EDIT: As per Crian's comment, the above should be print join "\n", m/<.*?>/sg; instead.

    Or a line by line version, if you're working with large files:

    use strict;
    use warnings;
    while (<DATA>) {
        print $&."\n" while m/<.*?>/g;
    }
    __DATA__
    Once <a href="foo.html">upon</a> a time there was a <font color="#FF0000">CODE <b>RED</b></font> situation.
    This is not really a robust method, however, and you're probably better off using a library unless your needs are simple and you're sure the tags are formatted properly.
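To see why the /s modifier matters here: without it, . does not match a newline, so a tag that is broken across lines is not found. A short sketch with a made-up string:

```perl
use strict;
use warnings;

# Made-up input in which the opening <a ...> tag spans two lines.
my $html = qq{Once <a\n  href="foo.html">upon</a> a time.};

my @without_s = $html =~ /<.*?>/g;    # . stops at the newline
my @with_s    = $html =~ /<.*?>/sg;   # . may cross the newline

print "without /s: ", scalar(@without_s), " tag(s)\n";
print "with /s:    ", scalar(@with_s),    " tag(s)\n";
```

Without /s only </a> is found; with /s the two-line <a ...> tag is found as well.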
      And what if a tag is split across two or more lines? You will miss those by doing it this way.

      Both approaches are pretty flawed. Breaking text into chunks is going to break tags in half often, eg

      <p style="bor1024_markder:1px solid black">
      
      and reading line by line is going to split tags in half that cross lines:
      <img src="/some/path/somewhere.png"
           alt="A long title"
           style="display:block"
           class="article" />

      Parsing HTML correctly is non-trivial. With one of the html parser modules, like HTML::TokeParser et al, you'll be sure it's right.
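The chunking problem is easy to reproduce. The sketch below reads a made-up string through an in-memory filehandle in deliberately tiny chunks (16 bytes instead of 1024, so that tags are sure to straddle a boundary) and compares the result with matching against the whole string.

```perl
use strict;
use warnings;

# Made-up sample input.
my $html = qq{<p style="border:1px solid black">Once <b>upon</b> a time</p>};

# Match chunk by chunk; tags cut in half at a chunk boundary are missed.
open my $fh, '<', \$html or die "Can't open in-memory handle: $!";
my @chunked;
while (read($fh, my $chunk, 16)) {
    push @chunked, $chunk =~ /<[^>]+>/g;
}
close $fh;

# Match against the complete string for comparison.
my @whole = $html =~ /<[^>]+>/g;

print "chunked: @chunked\n";
print "whole:   @whole\n";
```

The chunked run misses the opening <p ...> tag and the </b> tag, both of which cross a chunk boundary. A regex on the whole string does better here, but it is still not a real parser.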
