by thraxil (Prior)
on Nov 20, 2003 at 04:21 UTC ( [id://308488]=sourcecode )
Category: Web Stuff
Author/Contact Info Anders Pearson,

Tidy is a useful command-line utility that cleans up messy and invalid HTML, producing nice, pristine XHTML. I wrote this module as a wrapper around tidy to let you clean up HTML on the fly as it is served by Apache. It also uses the Apache::Filter framework, so you can use it to clean up the output of any other Filter-compliant Perl handler (or registry script using Apache::RegistryFilter). This could be very useful if you are trying to get your site to validate but are stuck with an old CMS that produces messy, invalid markup.

You can also download the module from:

Any and all suggestions are welcome. If no one finds any big problems, I may try to upload it to CPAN.
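
As a rough illustration of the core step the handler below performs (write the page to a temp file, run an external command over it, and fall back to the original content on failure), here is a minimal standalone sketch. It is not the module's code: `tidy_or_passthrough` is a hypothetical helper, it assumes a Unix-like system, it uses File::Temp instead of PID-based filenames, and it uses a list-form pipe open instead of a shell command line. The command to run is supplied by the caller.

```perl
use strict;
use warnings;
use File::Temp ();

# Hypothetical helper (not part of Apache::Tidy): run an external command
# over $html via a temp file and return the command's output, or return
# $html unchanged if anything goes wrong.
sub tidy_or_passthrough {
    my ($html, @cmd) = @_;
    my $clean = eval {
        # File::Temp picks a unique name and removes the file when $tmp
        # goes out of scope.
        my $tmp = File::Temp->new(
            TEMPLATE => 'tidy_XXXXXX',
            TMPDIR   => 1,
            SUFFIX   => '.html',
        );
        print {$tmp} $html or die "couldn't write to tempfile: $!";
        close $tmp or die "couldn't close tempfile: $!";

        # List-form pipe open: the arguments are passed to the command
        # directly, with no shell in between.
        open my $out, '-|', @cmd, $tmp->filename
            or die "couldn't run $cmd[0]: $!";
        my $result = do { local $/; <$out> };
        close $out;

        # tidy exits with 0 when all is well, 1 for warnings and 2 for
        # errors, so only statuses above 1 are treated as failures here.
        die "command exited with status " . ($? >> 8) . "\n"
            if ($? >> 8) > 1;
        die "command produced no output\n"
            unless defined $result && length $result;
        $result;
    };
    # On any failure, pass the content on unchanged.
    return defined $clean ? $clean : $html;
}
```

With tidy installed, a call might look like `tidy_or_passthrough($page, '/usr/bin/tidy', '-q', '-asxhtml')`. Passing the command as a list is a design choice for this sketch only; the module itself builds a single shell command string.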

package Apache::Tidy;

use strict;
use warnings;

use vars qw($VERSION);
$VERSION = "0.1";
use Apache::Constants qw(OK DECLINED NOT_FOUND);
use Apache::File;

sub handler {
    my $r = shift;
    # we only care about html
    return DECLINED unless $r->content_type eq 'text/html';
    my $fh = undef;

    if (lc($r->dir_config('Filter') || '') eq 'on') {
        # register as a filter
        $r = $r->filter_register;
        # get input from any previous filters
        ($fh,my $status) = $r->filter_input;
        return $status unless $status == OK;
    } else {
        $fh = Apache::File->new($r->filename);
        return DECLINED unless $fh;
    }

    my $dirty = do {local $/; <$fh>};

    my $tidy_path = $r->dir_config('TidyPath')    || "/usr/bin/tidy";
    my $temp_dir  = $r->dir_config('TidyTempDir') || "/tmp";
    my $options   = join ' ', $r->dir_config->get('TidyOptions');
    $options = $options || "-q -asxhtml";


    # clean up the path so we can run in taint mode
    delete $ENV{PATH};

    eval {
        # write a tempfile
        open TMP, ">$temp_dir/tidy_$$.html"
            or die "couldn't write to tempfile: $!";
        print TMP $dirty;
        close TMP;

        # run tidy over it
        system("$tidy_path $options $temp_dir/tidy_$$.html > $temp_dir/tidy_out_$$.html");

        # read in results
        open OUT, "<$temp_dir/tidy_out_$$.html"
            or die "couldn't read tempfile: $!";
        my @results = <OUT>;
        close OUT;

        # clean up
        unlink "$temp_dir/tidy_$$.html";
        unlink "$temp_dir/tidy_out_$$.html";

        print @results;
    };
    if ($@) {
        # if something generated an error,
        # we default to just passing the content on unchanged.
        print $dirty;
    }
    return OK;
}

1;


=head1 NAME

Apache::Tidy - htmltidy as an apache filter

=head1 SYNOPSIS


  PerlModule Apache::Filter
  PerlModule Apache::Tidy
  <Location /filtered/*.html>
     SetHandler perl-script
     PerlHandler Apache::Tidy
  </Location>


=head1 ABSTRACT

  Cleans up and fixes invalid HTML on the fly.


=head1 DESCRIPTION

Wrapper for the htmltidy program (L<>) using
the Apache::Filter framework. Fixes HTML/XHTML validation issues on
the fly.

Dave Raggett's HTML Tidy is a free command-line utility for cleaning
up messy and invalid HTML or XHTML code. It will correct missing or
mismatched end tags, clean up Microsoft Word generated HTML, convert
pages to XHTML, and format markup for easier reading.

Apache::Tidy uses the Apache::Filter framework to allow you to
automatically run tidy over web content as it is being served. This
can be very useful if you have editors or CMSes that produce invalid
markup.

To filter static content add the following to your httpd.conf:

  PerlModule Apache::Tidy
  <Location /directory/to/filter/>
     SetHandler perl-script
     PerlHandler Apache::Tidy
  </Location>

Apache::Tidy can also work as part of an Apache::Filter chain:

  PerlModule Apache::Filter
  PerlModule Apache::RegistryFilter
  PerlModule Apache::Tidy
  <Location /perl/*.pl>
     PerlSetVar Filter On
     SetHandler perl-script
     PerlHandler Apache::RegistryFilter Apache::Tidy
  </Location>

Apache::Tidy supports all of htmltidy's command-line options by
setting TidyOptions:

  <Location /filtered/>
    SetHandler perl-script
    PerlHandler Apache::Tidy
    PerlSetVar TidyOptions '-wrap 60'
    PerlAddVar TidyOptions -clean
    PerlAddVar TidyOptions -asxhtml
  </Location>

It defaults to '-q -asxhtml' if no options are explicitly set.

You can also specify a different path to the tidy executable
(necessary if you've installed it anywhere but in /usr/bin/) and a
different temp directory (defaults to /tmp):

   <Location /filtered/>
    SetHandler perl-script
    PerlHandler Apache::Tidy
    PerlSetVar TidyPath /opt/local/bin/tidy
    PerlSetVar TidyTempDir /some/other/temp/dir
   </Location>

=head1 NOTES

You must have htmltidy installed on your system. If it is installed
anywhere other than in /usr/bin/, you'll have to specify the full path:

  PerlSetVar TidyPath /path/to/tidy

I've only tested Apache::Tidy on unix systems. It may run on other
platforms, but you will probably have to change the path, temp
directory, and options.

Since Apache::Tidy just jumps out to the shell to call the external
tidy program, it probably isn't very efficient. I'd like to
reimplement this someday with an XS or SWIG wrapped tidylib.

=head1 SEE ALSO

L<Apache::Filter>, L<>, L<Apache::Registry

=head1 AUTHOR

Anders Pearson, E<lt>anders@columbia.eduE<gt>


=head1 COPYRIGHT AND LICENSE

Copyright 2003 by Anders Pearson

This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself. 
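
The TidyOptions handling described in the POD above (join every configured value, fall back to '-q -asxhtml' when none is set) can be sketched outside Apache as plain Perl. `tidy_args` is a hypothetical helper, not something the module provides. It takes the configured values as ordinary arguments instead of reading them from `$r->dir_config`, and it splits the result into a list, which the module itself does not do.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Apache::Tidy): turn the configured
# TidyOptions values into an argument list, falling back to the
# documented default of '-q -asxhtml' when nothing is configured.
sub tidy_args {
    my @configured = grep { defined($_) && length($_) } @_;
    my $options = join(' ', @configured) || '-q -asxhtml';
    # Split on whitespace so a value such as '-wrap 60' becomes two
    # separate arguments.
    return split ' ', $options;
}

print join('|', tidy_args('-wrap 60', '-clean', '-asxhtml')), "\n";
print join('|', tidy_args()), "\n";
```

A list like this could be handed to a list-form `system` or pipe open, which would avoid interpolating the options into a shell command line.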

