PerlMonks  

Cleaning HTML - New/better module for that - test please! ;-P

by gmpassos (Priest)
on Apr 22, 2003 at 06:00 UTC ( [id://252200]=perlmeditation )

I was testing the HTML::Clean module in order to add a filter option to the mod_perl output of HPL (another HTML/Perl embedding system). But when I read the source to see how the code is cleaned, I saw that the filter can make mistakes with complex HTML. So I decided to write my own filter, one that doesn't change the final result in the browser. I ran some tests with HTML::Clean and with my new module, and found that my filter cleans more without changing the rendered result. (I used pages from www.cnn.com.br and www.perl.com, which have styles, JavaScript, etc.)

My point is not to say which is better. The HTML::Clean idea of a filter based on direct regex changes is good, since it uses less memory, but it can't know exactly what it is doing inside the HTML tree. On the other hand, we can't build a filter based entirely on a parsed HTML tree, since that would be slow, which is not good for a server. My module sits somewhere between the two approaches: it looks only at the basic things that can be cleaned, nothing very complex, to keep it fast.

I have contacted the author (so far I have just sent an e-mail and am waiting for a reply) about updating HTML::Clean with the code I wrote. But the code is only 2 days old and needs testing. I would like the monks to test it against some web sites and check that the output still looks the same in the browser. Any ideas for improving the filter, or other comments, are gladly accepted!

To test, get: http://www.inf.ufsc.br/~gmpassos/htmlclean.zip
It is very small: the test script has only 2 files, and you don't need to install any modules in your Perl.

Graciliano M. P.
"The creativity is the expression of the liberty".

Replies are listed 'Best First'.
Re: Cleaning HTML - New/better module for that - test please! ;-P
by thpfft (Chaplain) on Apr 22, 2003 at 12:32 UTC

    Regex-based html processing is generally not regarded as a good idea: it's unreliable, labour-intensive, demanding to maintain and very very difficult to get right. The vast majority of respectable solutions are based on HTML::Parser, either directly or by way of one of the modules that put a simpler interface on it. Ovid's HTML::TokeParser::Simple is probably the one I'd recommend. My own HTML::TagFilter is simpler, but not as good (and not at all diligently maintained :).

    If your goal is just to clean, rather than to digest and process, then you would also do well to try HTML::Tidy, a perl interface to the venerable but very effective htmltidy library.

    I'm afraid you will almost certainly find that this wheel has already been made for you and that only a half-dozen lines of code are required...

      I'd like to point out that japhy wrote one (regex based parser).

      YAPE::HTML - Yet Another Parser/Extractor for HTML


      MJD says you can't just make shit up and expect the computer to know what you mean, retardo!
      I run a Win32 PPM repository for perl 5.6x+5.8x. I take requests.
      ** The Third rule of perl club is a statement of fact: pod is sexy.

      If you parse the HTML tag by tag, you can do good work with regexes, and that is what I did; it is not a regex filter applied directly to the full HTML source. It is like a pure Perl parser that uses regexes to make it faster.
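
      A minimal sketch of what a tag-by-tag approach could look like, using only core Perl. This is not the code from htmlclean.zip and not the HTML::Clean API; the function name clean_html and the specific rules (drop comments, collapse whitespace runs in text, copy pre/script/style/textarea content verbatim) are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration only; not part of HTML::Clean
# or of the module discussed in this thread.
sub clean_html {
    my ($html) = @_;
    my $out = '';
    my $raw = '';    # element whose content must be copied verbatim
    pos($html) = 0;

    while (pos($html) < length $html) {
        if ($raw) {
            # Inside <pre>, <script>, <style> or <textarea>:
            # copy everything up to and including the closing tag.
            if ($html =~ /\G(.*?)(<\/\Q$raw\E\s*>)/gcis) {
                $out .= $1 . $2;
                $raw = '';
            }
            else {
                # No closing tag found: copy the rest unchanged.
                $html =~ /\G(.*)/gcs;
                $out .= $1;
            }
        }
        elsif ($html =~ /\G<!--.*?-->/gcs) {
            # Comment: dropped. Caveat: this also drops IE
            # conditional comments, which can change rendering.
        }
        elsif ($html =~ /\G(<\/?([a-zA-Z][a-zA-Z0-9]*)(?:"[^"]*"|'[^']*'|[^>"'])*>)/gc) {
            # One complete tag; quoted attribute values may contain '>'.
            my ($tag, $name) = ($1, lc $2);
            $out .= $tag;
            $raw = $name
                if $tag !~ m{^</}
                && $name =~ /^(?:pre|script|style|textarea)$/;
        }
        elsif ($html =~ /\G([^<]+)/gc) {
            # Text between tags: collapse each whitespace run to one
            # space (not to nothing, since inter-word space matters).
            (my $text = $1) =~ s/\s+/ /g;
            $out .= $text;
        }
        else {
            # A stray '<' that starts no tag or comment: copy it as is.
            $html =~ /\G(.)/gcs;
            $out .= $1;
        }
    }
    return $out;
}

print clean_html("<p>Hello   \n  world</p><!-- note -->"), "\n";
# prints <p>Hello world</p>
```

      The \G anchor with the /gc flags lets each regex continue from where the previous one stopped, so the input is scanned once, token by token, without building a tree.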

      Since all I want is to clean HTML quickly, I can't parse the HTML into a full tree. Note that the idea is to filter the output of mod_perl, or of any CGI, to make the HTML smaller; this can't be slow or use much memory/CPU, or it will hurt the server and bring no advantage.

      I tested htmltidy (http://tidy.sourceforge.net/) and saw that it's good for fixing bugs in the HTML and for applying a style to it, not for cleaning the code!

      Graciliano M. P.
      "The creativity is the expression of the liberty".

        tidy intro - When editing HTML it's easy to make mistakes. Wouldn't it be nice if there was a simple way to fix these mistakes automatically and tidy up sloppy editing into nicely layed out markup?

        Could you tell me what your definition of "clean the code" is? Could you provide an example?

      Suppose you want to write HTML::Parser in pure Perl. (Or is it already?) What would you use for the job? - You guessed it. The opposite of parsing HTML is treating it as an unstructured stream of characters - whether you use pattern matching is orthogonal to the approach taken.

      Makeshifts last the longest.

        It is true, of course, that it would be very difficult to recreate HTML::Parser in pure perl without using any regexes, though it does not follow from there that it is a good idea to recreate HTML::Parser in pure perl.

        It is also true that the factors you describe are orthogonal, but only if you restrict the phrase 'use pattern matching' to its most drily correct application. In more informal usage it is common to talk of 'using regexes' as one way of parsing html and 'using the parser' as another, better way. I speak from chastening experience here.

        So, to clarify, you are advising the OP to write his own parser in perl using plenty of regexes, and to restrict himself to only the most exact usage of words and operators? Which doesn't seem very perly, but I'm only a lowly bishop and easily muddled :)
