in reply to HTML Parsing

This is what I've ended up with:

    if ($field eq "comments") {
        # Remove any links (because they break URL to link conversion)
        $$field =~ s/<A.*?HRef.*?>//isg;
        $$field =~ s/<\/A>//isg;

        # Extract any image links and add them to an array for safe-keeping,
        # replace them with placeholders
        $image_database = 0;
        while ($$field =~ /<Img(.*?)>/) {
            $$field =~ s/(<Img(.*?)>)/\[My_Image=$image_database\]/iso;
            $images[$image_database] = $1;
            $image_database ++;
        }

        # If HTML is not allowed, strip any remaining HTML
        if ($allow_html != 1) {
            $$field =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
        }

        # Convert URLs and e-mail addresses to links (with regex)
        $$field =~ s/(((ht|f)tp):(\/\/)[a-z0-9%&_\-\+=:@~#\/.\?]+(\/|[a-z]))/<A HRef="$1" Target="_blank">$1<\/A>/isg;
        $$field =~ s/(^\W|\s)([a-z0-9_\-.]+\@[a-z0-9_\-]+\.[a-z]+)(.*?$)/$1<A HRef="mailto:$2">$2<\/A>$3/mig;

        # Replace the image placeholders with their corresponding images
        $image_database = 0;
        while ($$field =~ /\[My_Image=(\d*)\]/) {
            $img_src = $images[$1];
            $$field =~ s/\[My_Image=(\d*)\]/$img_src/iso;
            $image_database ++;
        }
    }
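For comparison, the protect-and-restore step for images could also be written as a pair of small subs using s///e. This is only a sketch: the names stash_images and restore_images are made up for illustration, it works on a plain string rather than on $$field, and it matches the tag case-insensitively, which is an assumption about the intended behaviour.

```perl
use strict;
use warnings;

# Hypothetical helper: swap each <img ...> tag for a numbered placeholder
# and return the modified text plus a reference to the saved tags.
sub stash_images {
    my ($text) = @_;
    my @images;
    $text =~ s{(<img\b[^>]*>)}{
        push @images, $1;                  # remember the whole tag
        '[My_Image=' . $#images . ']';     # placeholder carries its index
    }gie;
    return ($text, \@images);
}

# Hypothetical helper: put saved tags back in place of their placeholders.
# A placeholder with no saved tag behind it is left exactly as it was.
sub restore_images {
    my ($text, $images) = @_;
    $text =~ s{(\[My_Image=(\d+)\])}{ defined $images->[$2] ? $images->[$2] : $1 }ge;
    return $text;
}

my ($stashed, $saved) = stash_images('Hi <IMG SRC="a.gif"> and <img src="b.gif">');
print "$stashed\n";
print restore_images($stashed, $saved), "\n";
```

Any HTML stripping and URL conversion would run between the two calls, as in the code above.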

(Yes, I know I'm not using "strict" - this is a prototype only).

Anyone see any problems with this code?

 

In theory, there is no difference between theory and practise.  But in practise, there is.
 
Jonathan M. Hollin
Digital-Word.com

Re: Re: HTML Parsing
by DarkBlue (Sexton) on Feb 12, 2001 at 05:51 UTC
    Just realised that
    $$field =~ s/<A.*?HRef.*?>//isg; $$field =~ s/<\/A>//isg;
    is going to screw up any <A Name...> tags... damn...
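One possible way around this, as a rough sketch only: match each opening tag that carries an HRef attribute together with its closing tag, and keep just the text in between. The sub name strip_href_anchors is invented for illustration, and the pattern assumes anchors are not nested and that attribute values contain no ">" character.

```perl
use strict;
use warnings;

# Hypothetical helper: unwrap <A HRef=...>text</A> pairs, leaving anchors
# without an HRef attribute (such as <A Name="top">) untouched.
sub strip_href_anchors {
    my ($text) = @_;
    # [^>]* keeps the HRef test inside a single opening tag;
    # (.*?) captures the link text up to the nearest closing </A>.
    $text =~ s{<a\b[^>]*\bhref\b[^>]*>(.*?)</a\s*>}{$1}gis;
    return $text;
}

print strip_href_anchors(
    '<A Name="top"></A>See <A HRef="http://example.com/">this</A>.'
), "\n";
```

An anchor that has both Name and HRef would still be unwrapped, so this narrows the problem rather than removing it.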

     
