PerlMonks  

Re: Stripping HTML tags efficiently

by TedPride (Priest)
on Dec 10, 2004 at 09:41 UTC ( [id://413790]=note )


in reply to Stripping HTML tags efficiently

It looks like you're just trying to extract the tags from the document. The following should work:
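
    use strict;
    use warnings;

    # Read up to 1024 bytes of the sample data into $_
    read(DATA, $_, 1024);
    # Print each non-greedy <...> match on its own line
    print join "\n", m/<.*?>/g;

    __DATA__
    Once <a href="foo.html">upon</a> a time there was a <font color="#FF0000">CODE <b>RED</b></font> situation.
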
EDIT: As per Crian's comment, the print line above should be print join "\n", m/<.*?>/sg; instead, so that "." also matches a newline inside a tag.
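
To illustrate the effect of the /s modifier from the EDIT, here is a minimal sketch. The sub name extract_tags and the heredoc sample are assumptions made for illustration, not part of the original post.

```perl
use strict;
use warnings;

# Return every substring that looks like a tag. With /s, "." also matches
# a newline, so a tag broken across lines is returned in one piece.
# (extract_tags is a hypothetical helper name.)
sub extract_tags {
    my ($html) = @_;
    return $html =~ m/<.*?>/sg;
}

# Sample input in which the first tag spans two lines.
my $sample = <<'HTML';
Once <a
href="foo.html">upon</a> a time there was a
<font color="#FF0000">CODE <b>RED</b></font> situation.
HTML

print "$_\n" for extract_tags($sample);
```

Without /s the two-line <a ...> tag would not match at all, because "." stops at the newline.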

Or a line by line version, if you're working with large files:

    use strict;
    use warnings;

    # Process one line at a time; $& holds the text of the last match
    while (<DATA>) {
        print $&."\n" while m/<.*?>/g;
    }

    __DATA__
    Once <a href="foo.html">upon</a> a time there was a <font color="#FF0000">CODE <b>RED</b></font> situation.

This is not really a robust method, however, and you're probably better off using a library unless your needs are simple and you're sure the tags are formatted properly.
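
As a rough illustration of why the pattern is not robust, the sketch below feeds it two inputs that are not "formatted properly" in the sense it needs. The sub name naive_tags and both sample strings are made up for this example.

```perl
use strict;
use warnings;

# The same non-greedy pattern, wrapped in a hypothetical helper.
sub naive_tags {
    my ($html) = @_;
    return $html =~ m/<.*?>/sg;
}

# A ">" inside a quoted attribute value ends the match too early,
# so the first "tag" returned is only a fragment of the real one.
my @cut = naive_tags('<a title="a > b" href="x.html">link</a>');
print "[$_]\n" for @cut;

# A bare "<" in ordinary text is taken as the start of a tag.
my @bogus = naive_tags('if 1 < 2 then <b>bold</b>');
print "[$_]\n" for @bogus;
```

Both cases are legal enough to turn up in real pages, which is the usual argument for a parser module.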

Replies are listed 'Best First'.
Re^2: Stripping HTML tags efficiently
by Your Mother (Archbishop) on Dec 10, 2004 at 18:48 UTC

    Both approaches are pretty flawed. Breaking text into chunks will often break tags in half, e.g.

    <p style="bor1024_markder:1px solid black">
    
    and reading line by line is going to split tags in half that cross lines:
    <img src="/some/path/somewhere.png"
         alt="A long title"
         style="display:block"
         class="article" />

    Parsing HTML correctly is non-trivial. With one of the HTML parser modules, like HTML::TokeParser et al, you'll be sure it's right.
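
A minimal sketch of the parser-based approach, assuming the HTML::TokeParser module from CPAN is installed. The sub name parsed_tags and the sample markup are illustrative only; the token layout used here (start tags carry their source text in element 4, end tags in element 2) follows the module's documented get_token format.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Collect the original text of every start and end tag.
# (parsed_tags is a hypothetical helper name.)
sub parsed_tags {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Could not create parser";
    my @tags;
    while (my $token = $parser->get_token) {
        my $type = $token->[0];
        if    ($type eq 'S') { push @tags, $token->[4] }   # start tag text
        elsif ($type eq 'E') { push @tags, $token->[2] }   # end tag text
    }
    return @tags;
}

# A tag that crosses lines, similar to the example above.
my $sample = <<'HTML';
<p>See <img src="/some/path/somewhere.png"
     alt="A long title" /> here.</p>
HTML

print "$_\n" for parsed_tags($sample);
```

Because the parser works on the whole document rather than on fixed chunks or single lines, where the input happens to break does not matter.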

Re^2: Stripping HTML tags efficiently
by Crian (Curate) on Dec 10, 2004 at 11:13 UTC
    And what if a tag is split across two or more lines? You will miss those by doing it this way.