
Re: Searching module

by eduardo (Curate)
on Jan 30, 2001 at 00:38 UTC

in reply to Searching module

ZydecoSue said:

I noticed that CPAN contains one called Search-InvertedIndex, but that seems really complicated for what I thought should be a simple task.

And eduardo cringed... I have written search engines pretty much my entire professional programming life. All I did at every single employer I can think of was write indexers and search engines for different types of data: relational data, flat data, ISAM data, geographic data, archaic data, encrypted data... Please, do yourself a favor and realize that searching is one of the most time-honored and well-studied fields in computer science. If you look at Sorting and Searching by the great Knuth, you will realize that if it took him 1/2 of a 780 page book, maybe there is more complexity to this entire "searching" thing than at first seems to be on the surface.

The first and most important thing that you need to do is understand the data that you are searching through. Is it flat files, is it DBMs, are you looking at RDBMS tables, OORDBMS? What is the "nature" of the data, what is its "thingness"? What does it contain, what does it show you, how does it index?

Most data that you will find can be described in one of two categories:

  • That which has a key
  • That which does not have a key

If you realize that your data can be keyed, then your problems become much easier. There are hundreds if not thousands of mechanisms for searching through keyed information. You have choices such as:

  • Create a database with primary keys
  • Create DBMs which you tie
  • Create keyed index files
  • Use some pre-built system (it's amazing what's out there)
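
To make the keyed case concrete, here is a minimal sketch of the "DBM which you tie" idea. It uses Python's standard dbm.dumb module as a stand-in for a tied DBM in Perl; the function names and the sample records are made up for illustration and are not from any particular module.

```python
import dbm.dumb
import os
import tempfile


def build_index(path, records):
    """Write each key => value pair into an on-disk DBM-style file."""
    with dbm.dumb.open(path, "c") as db:
        for key, value in records.items():
            db[key] = value  # str keys and values are stored as UTF-8 bytes


def lookup(path, key):
    """Return the value stored under key, or None if the key is absent."""
    with dbm.dumb.open(path, "r") as db:
        raw = db.get(key)
        return raw.decode("utf-8") if raw is not None else None


if __name__ == "__main__":
    # Hypothetical sample data, purely for demonstration.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "nodes")
        build_index(path, {"55059": "Re: Searching module"})
        print(lookup(path, "55059"))
        print(lookup(path, "no-such-node"))
```

Once the data has a key, a search is just a lookup by that key. Anything beyond that (partial matches, several fields, ranking) is where the other options in the list start to earn their keep.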

If, however, you are doing free-form searching on data that cannot be related as simply as key => value, then the problem is a bit more complicated. You are asking for something more "full-text" and open-form. This is very difficult to implement right, which is why search engines differ so much in quality. A search engine (like Google) does just this: it attempts to find a way to intelligently parse the free-form data that exists on the internet. There is *never* a good reason to reinvent the wheel (well, I lie, sometimes for didactic purposes)... if this is the type of data you have, then I suggest you find an indexing / full-text search system:

  • Glimpse is an amazing product for full text searching
  • ht://dig is also pretty good
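
For the didactic case only, here is a toy sketch of the inverted index idea that full-text systems are generally built around. It is not how Glimpse or ht://dig work internally; the tokenizer, the function names, and the AND-only query semantics are all assumptions chosen to keep the example short.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


if __name__ == "__main__":
    # Hypothetical documents, purely for demonstration.
    docs = {
        1: "Perl search module",
        2: "Search engines index text",
        3: "Perl is fun",
    }
    index = build_inverted_index(docs)
    print(search(index, "perl search"))
```

Even this toy has to decide what a "word" is. It does nothing about stemming, phrases, ranking, or keeping the index current as the data changes, and that is where the real complexity lives.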

However, all that I can suggest is: do yourself a favor and accept that this is a more complex thing than just indexing and using grep. Understand your data, understand your structure, understand what it is that you are trying to accomplish. And remember, you can do what merlyn says in his WebTechniques column: use WWW::Search and rely on AltaVista to do your searching for you :)
