PerlMonks  

Re: Re: What is the fastest way to parse HTML?

by sri (Vicar)
on Jul 22, 2003 at 23:16 UTC (#276952)


in reply to Re: What is the fastest way to parse HTML?
in thread What is the fastest way to parse HTML?

Well, I have to regularly index a few million documents for a small intranet search engine.

Currently I am using HTML::Parser. It works fine, but the number of new and updated documents is increasing fast, so I'm now rethinking the indexer part.

I hope this helps you give better answers. ;)
Replies are listed 'Best First'.
Re: Re: Re: What is the fastest way to parse HTML?
by sauoq (Abbot) on Jul 22, 2003 at 23:58 UTC
    Well, I have to regularly index a few million documents for a small intranet search engine.

    Then you asked the wrong question. The right one is: "What is the fastest way to index a few million documents for a small intranet search engine?"

    The answer, as I recently learned from tachyon, is Swish-e. Of course, you'll also want to grab the Perl interface, SWISH, from CPAN.

    -sauoq
    "My two cents aren't worth a dime.";
    
      I, too, concur with Swish. Granted, I used it 8 years ago, but it was an excellent tool.

      ------
      We are the carpenters and bricklayers of the Information Age.

      Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

      Please remember that I'm crufty and crotchety. All opinions are purely mine and all code is untested, unless otherwise specified.

Re: Re: Re: What is the fastest way to parse HTML?
by waswas-fng (Curate) on Jul 22, 2003 at 23:57 UTC
    Are you trying to actually parse, or just strip the text for indexing? How are you doing the indexing? You may want to run some benchmarks to see where your code is spending the most time. Look at Devel::Profile and Benchmark to help find where the actual slowdowns are happening. In my experience, stripping to text is very fast, and indexing and updating the db is the slow part.

    -Waswas
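
A minimal sketch of the kind of measurement suggested above, using the core Benchmark module. The subs strip_text and index_words are hypothetical stand-ins (a crude regex and an in-memory hash), not the poster's actual parser or database code; swap in the real stages to get meaningful numbers.

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

# Hypothetical stand-in for the "strip to text" step: a crude regex,
# not a real HTML parser.
sub strip_text {
    my ($html) = @_;
    (my $text = $html) =~ s/<[^>]*>/ /g;   # replace tags with spaces
    $text =~ s/\s+/ /g;                    # collapse whitespace
    $text =~ s/^\s+|\s+$//g;               # trim both ends
    return $text;
}

# Hypothetical stand-in for the indexing step: word counts in a hash.
sub index_words {
    my ($index, $text) = @_;
    $index->{ lc $_ }++ for split ' ', $text;
    return $index;
}

my $html = '<p>Some <b>sample</b> text for the index</p>' x 200;
my $text = strip_text($html);

# Run each stage for at least one CPU second and print the timings.
timethese(-1, {
    strip => sub { strip_text($html) },
    index => sub { index_words({}, $text) },
});
```

Timing the stages separately like this shows which one dominates before any effort goes into a faster parser.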
      You made me think about another possible improvement. As I said, I only use the text and some layout tags, so I could use the report_tags() method of HTML::Parser to suppress all the unneeded junk.
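
A rough sketch of that idea, assuming HTML::Parser's version 3 API. The extract sub and the choice of h1, h2 and p as the "layout tags" are illustrative assumptions, not the poster's code. Note that report_tags() only filters start and end tag events; text is still reported, so ignore_elements() is used here to drop script and style content as well.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: returns the reported start tags and the
# collected text of one document.
sub extract {
    my ($html, @keep) = @_;
    my @tags;
    my $text = '';

    my $p = HTML::Parser->new(
        api_version => 3,
        # Only tags named in report_tags() reach this handler.
        start_h => [ sub { push @tags, $_[0]; $text .= ' ' }, 'tagname' ],
        # Text is reported regardless of the tag filter.
        text_h  => [ sub { $text .= $_[0] }, 'dtext' ],
    );
    $p->unbroken_text(1);                   # avoid text split into chunks
    $p->report_tags(@keep);                 # suppress all other tags
    $p->ignore_elements(qw(script style));  # skip their content entirely

    $p->parse($html);
    $p->eof;

    $text =~ s/\s+/ /g;                     # collapse whitespace
    $text =~ s/^\s+|\s+$//g;                # trim both ends
    return (\@tags, $text);
}

my ($tags, $text) = extract(
    '<h1>Title</h1><div><p>Body <b>bold</b></p></div>',
    qw(h1 h2 p),
);
print "tags: @$tags\n";
print "text: $text\n";
```

Whether this saves noticeable time depends on how much work the tag handlers do; benchmarking it against the current handlers is the only way to tell.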
