PerlMonks  

Safe HTML output?

by gav^ (Curate)
on Jan 19, 2002 at 22:17 UTC

gav^ has asked for the wisdom of the Perl Monks concerning the following question:

I'm working on a web-based app where I want to let my users type in HTML. This is my first pass at parsing out anything that would be considered unsafe, and I would appreciate some comments. I did have the handlers nested inside safe_html, but that gives warnings about %tags and $html not staying shared. I'm not 100% sure what the best way around this is; I want to stick to a single subroutine so it's easy for others to use.

Thanks.

gav^

use strict;
use warnings;

use HTML::Parser;

# tags with allowable attributes
my %tags = (
    img => { map { $_ => 1 } qw(width height src border) },
    a   => { map { $_ => 1 } qw(href target name) },
);

# tags with no attributes
$tags{$_} = {} foreach qw(b i u br p code pre);

my $html = "";

sub safe_html {
    my $p = new HTML::Parser(
        api_version => 3,
        start_h     => [ \&_start, 'tagname, attr' ],
        end_h       => [ \&_end,   'tagname' ],
        text_h      => [ \&_text,  'text' ],
    );
    $p->parse(shift);
    $html =~ s/\s+/ /g;
    return $html;
}

sub _start {
    my ($tag, $attrs) = @_;
    return unless $tags{$tag};
    $html .= '<' . $tag;
    while (my ($attr, $value) = each %$attrs) {
        if ($tags{$tag}->{$attr}) {
            $html .= sprintf(q{ %s="%s"}, $attr, $value);
        }
    }
    $html .= '>';
}

sub _end {
    my $tag = shift;
    $html .= '</' . $tag . '>' if $tags{$tag};
}

sub _text {
    $html .= shift;
}


Replies are listed 'Best First'.
Re: Safe HTML output?
by Hero Zzyzzx (Curate) on Jan 20, 2002 at 00:36 UTC

    HTML::TagFilter does this admirably. It's a subclass of HTML::Parser and allows you to specify what tags/attributes to allow/deny similarly to what you're doing. You'd probably need to tweak this a little to fit into your code the way you want, but it should do the trick.

    use HTML::TagFilter;

    my $tf = HTML::TagFilter->new(
        allow => {
            p    => {'any'},
            i    => {'any'},
            b    => {'any'},
            code => {'any'},
            br   => {'any'},
            u    => {'any'},
            pre  => {'any'},
            img  => {
                width  => ['any'],
                height => ['any'],
                border => ['any'],
                src    => ['any'],
            },
            a    => {
                href   => ['any'],
                target => ['any'],
                name   => ['any'],
            },
        },
        deny           => {},
        log_rejects    => 1,
        strip_comments => 1,
    );

    sub filter_html {
        $tf->filter(shift);
    }

    Update: This module will freak out if you try to install/use it on anything earlier than perl 5.6, I believe because it uses the warnings pragma. As another monk pointed out (forgot who, it was a while ago), you can just comment that line out (or install warnings, I suppose) and it'll work fine.

    -Any sufficiently advanced technology is
    indistinguishable from doubletalk.

      HTML::TagFilter looks great, except that it doesn't allow me to use my own handler for text sections, which I need.

      Thanks for the tip though, it looks like something that may come in handy.

      gav^

Re: Safe HTML output?
by Juerd (Abbot) on Jan 19, 2002 at 22:45 UTC
    The warning about staying shared is an issue with named subs only. (See diagnostics)
    The solution is easy: don't use named subs as closures.

    # First solution: Put coderefs in scalars and use those
    my $start = sub { ... };
    my $p = HTML::Parser->new(
        ...,
        start_h => [ $start, ... ],
        ...
    );

    # Second solution: Use coderefs, but don't use scalars to hold them
    my $p = HTML::Parser->new(
        ...,
        start_h => [
            sub {
                ... # code here
            },
            ...
        ],
        end_h => [
            sub {
                ... # code here
            },
            ...
        ],
        ...
    );
    HTH.
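
For readers who want to see the second approach filled in, here is a minimal sketch that folds the handlers from the original post into safe_html as anonymous subs. It assumes the same %tags whitelist and the HTML::Parser version 3 API used in the question. Moving $html inside the sub, sorting the attribute names, and calling eof are choices made for this sketch, not part of either post.

```perl
use strict;
use warnings;
use HTML::Parser;

# Whitelist copied from the original post: tag => { allowed attribute => 1 }
my %tags = (
    img => { map { $_ => 1 } qw(width height src border) },
    a   => { map { $_ => 1 } qw(href target name) },
);
$tags{$_} = {} foreach qw(b i u br p code pre);

sub safe_html {
    my ($input) = @_;
    my $html = '';    # fresh buffer on every call; the handlers close over it

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h     => [
            sub {
                my ($tag, $attrs) = @_;
                return unless $tags{$tag};
                $html .= '<' . $tag;
                # sorted so the output order is stable (an assumption of this sketch)
                foreach my $attr (sort keys %$attrs) {
                    next unless $tags{$tag}{$attr};
                    $html .= sprintf(q{ %s="%s"}, $attr, $attrs->{$attr});
                }
                $html .= '>';
            },
            'tagname, attr',
        ],
        end_h => [
            sub {
                my ($tag) = @_;
                $html .= '</' . $tag . '>' if $tags{$tag};
            },
            'tagname',
        ],
        text_h => [ sub { $html .= shift }, 'text' ],
    );

    $p->parse($input);
    $p->eof;    # flush any text the parser is still holding

    $html =~ s/\s+/ /g;
    return $html;
}
```

Because the anonymous subs are created anew on each call, they capture that call's $html, so no "will not stay shared" warning should appear and repeated calls do not accumulate output. Like the original, this sketch writes attribute values back without escaping them or checking URL schemes, so it is not a complete sanitizer.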

    P.S. Maybe HTML::TokeParser is better for this job.
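
As a rough illustration of the HTML::TokeParser suggestion, the sketch below pulls tokens in a loop instead of registering callbacks, so there are no closures at all. The sub name safe_html_tokens and the %allowed hash are hypothetical names chosen for this example; the whitelist has the same shape as %tags in the question.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical whitelist, same shape as %tags in the original post.
my %allowed = (
    img => { map { $_ => 1 } qw(width height src border) },
    a   => { map { $_ => 1 } qw(href target name) },
);
$allowed{$_} = {} foreach qw(b i u br p code pre);

sub safe_html_tokens {
    my ($input) = @_;
    my $p   = HTML::TokeParser->new(\$input) or die "cannot set up parser";
    my $out = '';

    while (my $token = $p->get_token) {
        my $type = $token->[0];
        if ($type eq 'S') {
            # start tag token: ["S", $tag, $attr, $attrseq, $text]
            my (undef, $tag, $attr, $attrseq) = @$token;
            next unless $allowed{$tag};
            $out .= '<' . $tag;
            foreach my $name (@$attrseq) {    # $attrseq keeps source order
                next unless $allowed{$tag}{$name};
                $out .= sprintf(q{ %s="%s"}, $name, $attr->{$name});
            }
            $out .= '>';
        }
        elsif ($type eq 'E') {
            # end tag token: ["E", $tag, $text]
            $out .= '</' . $token->[1] . '>' if $allowed{ $token->[1] };
        }
        elsif ($type eq 'T') {
            # text token: ["T", $text, $is_data]; custom text handling would go here
            $out .= $token->[1];
        }
        # comments, declarations and processing instructions are dropped
    }
    return $out;
}
```

The text branch is an ordinary statement inside the loop, which may be a convenient place for custom text processing. As with the original code, attribute values are emitted unescaped.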

    2;0 juerd@ouranos:~$ perl -e'undef christmas'
    Segmentation fault
    2;139 juerd@ouranos:~$

Re: Safe HTML output?
by Masem (Monsignor) on Jan 19, 2002 at 22:36 UTC
    Not a solution to your immediate problem, but I've found the safe HTML solution that Everything's engine offers (with both tags and attributes settable) to be quite good for clean HTML parsing. It amounts to basically the same as what you're trying to do above, but I believe Everything's requires no extra modules.

    -----------------------------------------------------
    Dr. Michael K. Neylon - mneylon-pm@masemware.com || "You've left the lens cap of your mind on again, Pinky" - The Brain
    "I can see my house from here!"
    It's not what you know, but knowing how to find it if you don't know that's important
