
Re^2: Dynamically cleaning up HTML fragments

by SilasTheMonk (Chaplain)
on Sep 24, 2010 at 11:41 UTC

in reply to Re: Dynamically cleaning up HTML fragments
in thread Dynamically cleaning up HTML fragments

Actually, HTML::Tidy seems to have a bit of a bad history at Debian. My original claim that it is not in Debian was wrong, but it's definitely in an odd state. I am investigating.

Re^3: Dynamically cleaning up HTML fragments
by wfsp (Abbot) on Sep 25, 2010 at 10:32 UTC
    Ubuntu 8.04, perl 5.10.1

    HTML::Tidy has been released three times this year (the last on 17 September) so some of the criticisms may have been addressed.

    It requires tidyp (version 1.04 was released recently), which is a fork of tidy.

    I was able to install tidyp in the usual way and H::T installed without fuss using cpanp.

    #!/usr/bin/perl
    use strict;
    use warnings;

    use HTML::Tidy;

    my $tidy = HTML::Tidy->new(
      {
        output_xhtml        => 1,
        tidy_mark           => 0,
        markup              => 1,
        q{show-body-only}   => 1,
      }
    );

    printf qq{tidyp: %s\n},      $tidy->tidyp_version;
    printf qq{libtidyp: %s\n},   $tidy->libtidyp_version;
    printf qq{HTML::Tidy: %s\n}, $HTML::Tidy::VERSION;

    my $html = do {local $/;<DATA>};

    $tidy->parse(q{test.html}, $html)
      or die q{parse failed};

    for my $message ($tidy->messages){
      print $message->as_string, qq{\n};
    }

    my $xhtml = $tidy->clean($html);
    print $xhtml;

    __DATA__
    <div>
      <p>tidy</p>
      <img src="pic.jpg">
    </div>

    Output:

    tidyp: 1.04
    libtidyp: 1.04
    HTML::Tidy: 1.54
    test.html (1:1) Warning: missing <!DOCTYPE> declaration
    test.html (1:1) Warning: inserting implicit <body>
    test.html (1:1) Warning: inserting missing 'title' element
    test.html (3:3) Warning: <img> lacks "alt" attribute
    <div>
    <p>tidy</p>
    <img src="pic.jpg" /></div>

    See the tidy quick reference for all the configuration options.
      Thanks. It actually installs fine on Debian using the packaging system, and I was able to use and configure it. The issues are:
      1. The version in Debian is old.
      2. An update does not appear to be happening, I think because of the fork of tidy. That makes things very messy, and it won't happen until someone really screams. I am in the relevant group and I won't volunteer.
      3. I could not configure it to change "<span>blah</span>" to "blah". It is reasonable to say that tidy is not intended to do that, but I want it to. JavaScript rich text editors generate markup that one does not necessarily want or need.
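
For what it's worth, point 3 could be handled outside tidy with a small post-processing step. What follows is only a rough sketch, not something from HTML::Tidy itself: strip_spans is a hypothetical helper name, and it assumes the fragments are simple enough that no span attribute value contains a literal ">". A real HTML parser would be the safer choice for anything more complicated.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Tidy): remove every opening
# and closing span tag while leaving the enclosed content in place.
# Assumption: no attribute value inside a span tag contains '>'.
sub strip_spans {
    my ($html) = @_;
    $html =~ s{</?span\b[^>]*>}{}gi;
    return $html;
}

print strip_spans(q{<p><span class="x">blah</span> and <SPAN>more</SPAN></p>}), "\n";
# prints: <p>blah and more</p>
```

This could be run on the string returned by $tidy->clean, so tidy still does the structural repair and the substitution only drops the unwanted wrappers.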
