PerlMonks  

What is the fastest way to parse HTML?

by sri (Vicar)
on Jul 22, 2003 at 22:37 UTC ( #276944=perlquestion )
sri has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

When I say parse, I mean extracting the text along with formatting information such as font size, bold, whether the text is inside an anchor, etc.

The main requirements are speed and fault tolerance.

There are a few possible solutions that came to my mind:

- HTML::Parser would be the easiest, but is it the fastest?
- Parse::RecDescent
- Regex
- XML::Parser, maybe not fault tolerant enough
- Build something with flex or bison and write an XS binding

What do you think is the fastest way to parse realworld(tm) HTML?

Replies are listed 'Best First'.
Re: What is the fastest way to parse HTML?
by valdez (Monsignor) on Jul 22, 2003 at 23:04 UTC

    Why are you so worried about performance? Have you tried HTML::Parser or HTML::TokeParser::Simple? I have used both in production under moderate load without problems. Try to describe your problem in more detail and you will certainly receive a better answer :)

    Ciao, Valerio

      Well, I have to regularly index a few million documents for a small intranet search engine.

      Currently I am using HTML::Parser. It works fine, but the number of new/updated documents is increasing fast, so I'm now rethinking the indexer part.

      Hope this helps you give better answers. ;)
        Well, I have to regularly index a few million documents for a small intranet search engine.

        Then you asked the wrong question. The right one is: "What is the fastest way to index a few million documents for a small intranet search engine?"

        The answer, as I recently learned from tachyon, is Swish-e. Of course, you'll also want to grab the Perl interface, SWISH, from CPAN.

        -sauoq
        "My two cents aren't worth a dime.";
        
        Are you trying to actually parse, or just strip the text for indexing? How are you doing the indexing? You may want to run some benchmarks to see where your code is spending the most time. Look at Devel::Profile and Benchmark to help find where the actual slowdowns are happening. In my experience, stripping to text is very fast, and indexing and updating the db is the slow part.

        -Waswas
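
The benchmarking suggestion above could be sketched roughly as follows. This is only an illustration, not anyone's actual indexer: the sample document and the two helper subs (`strip_with_parser`, `strip_with_regex`) are made up for the comparison, and the regex version is there purely as a baseline, since it is not fault tolerant on real-world HTML.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use HTML::Parser ();

# Hypothetical sample input; a real test should use documents from the crawl.
my $html = ( '<html><body><h1>Title</h1><p>Some <b>bold</b> text and a '
           . '<a href="/x">link</a>.</p></body></html>' ) x 200;

# Candidate 1: HTML::Parser (version 3 API), collecting decoded text chunks.
sub strip_with_parser {
    my ($doc) = @_;
    my $text = '';
    my $p = HTML::Parser->new(
        api_version => 3,
        text_h      => [ sub { $text .= shift }, 'dtext' ],
    );
    $p->parse($doc);
    $p->eof;    # flush any text still buffered by the parser
    return $text;
}

# Candidate 2: naive regex strip. Baseline only; it breaks on comments,
# scripts, and '>' inside attribute values.
sub strip_with_regex {
    my ($doc) = @_;
    ( my $text = $doc ) =~ s/<[^>]*>//g;
    return $text;
}

# Run each candidate for at least 1 CPU second and print a comparison table.
cmpthese( -1, {
    parser => sub { strip_with_parser($html) },
    regex  => sub { strip_with_regex($html) },
} );
```

To see whether parsing is the bottleneck at all, the same pattern can be applied to the indexing and database update steps.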
Re: What is the fastest way to parse HTML?
by Abigail-II (Bishop) on Jul 23, 2003 at 07:40 UTC
    It will be something written in C.

    Abigail

Re: What is the fastest way to parse HTML?
by PodMaster (Abbot) on Jul 24, 2003 at 09:59 UTC
    You can toss out Parse::RecDescent, Regex, and "Build something with flex or bison and make XS binding".

    Look into YAPE::HTML (regex), HTML::SimpleParse (also regex, but probably not as robust) and HTML::TagReader (XS).

    MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
    I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
    ** The third rule of perl club is a statement of fact: pod is sexy.

Re: What is the fastest way to parse HTML?
by Tricky (Sexton) on Jul 24, 2003 at 09:26 UTC
    Hi folks, this sounds similar to a problem I'm having. I'm new to the programming world, so I ask for your collective patience! I'm using regexes to scan HTML source files and remove certain tags, such as image tags, to improve readability for Web users with poor vision (my MSc project). Can I use HTML::Parser to help me with this task? Cheers, Richard
      What was your impression after reading the manual page of HTML::Parser? Can it help you with your task?

      Abigail
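
For the image-tag case asked about above, a minimal sketch with HTML::Parser might look like this. It assumes the goal is simply to drop every `<img>` start tag and pass everything else through unchanged; the sub name `strip_images` is hypothetical, not part of the module.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Return a copy of $html with all <img ...> tags removed.
sub strip_images {
    my ($html) = @_;
    my $out = '';
    my $p = HTML::Parser->new(
        api_version => 3,
        # Start tags: copy the original source text unless the tag is <img>.
        # 'tagname' is lowercased by the parser, so <IMG> matches as well.
        start_h => [
            sub {
                my ( $tagname, $text ) = @_;
                $out .= $text unless $tagname eq 'img';
            },
            'tagname, text',
        ],
        # Every other event (text, end tags, comments, ...) is copied verbatim.
        default_h => [ sub { $out .= shift }, 'text' ],
    );
    $p->parse($html);
    $p->eof;    # flush anything still buffered
    return $out;
}

print strip_images('<p>Hi <img src="a.png" alt="x"> there <IMG SRC="b.gif"></p>'), "\n";
```

Other tags can be handled the same way by widening the test on `$tagname`.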

Node Type: perlquestion [id://276944]
Approved by valdez
Front-paged by broquaint