
Google-like query of ASCII text with Text::Query or other modules

by jbullock35 (Hermit)
on Aug 24, 2005 at 05:38 UTC ( #486105=perlquestion )
jbullock35 has asked for the wisdom of the Perl Monks concerning the following question:

I want end users to search some texts I have via a CGI tool, and I want them to be able to query the texts as they would query the web with Google. In other words, I want to take the string a user passes, a string that obeys Google's query rules (or at least the most basic of them), and use it to query a set of texts. When that's done, I'll present the user with all the texts that matched the query.

The catch is that my texts don't exist in discrete files. So I can't simply have Google index them and provide my end users with the Google search results.

As far as I can tell, I have two options. I can roll my own code. Or I can use Text::Query. Are there any other modules I might use?
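For the roll-your-own option, a minimal sketch of Google's most basic rules (every term must appear; quoted phrases are matched literally) might look like this. All function and variable names here are illustrative, not from Text::Query or any other module:

```perl
#!/usr/bin/perl
# Sketch: implicit-AND matching with quoted phrases over in-memory passages.
use strict;
use warnings;

# Split a query into quoted phrases and bare words.
sub parse_query {
    my ($query) = @_;
    my @terms;
    while ($query =~ /"([^"]+)"|(\S+)/g) {
        push @terms, defined $1 ? $1 : $2;
    }
    return @terms;
}

# A passage matches if every term appears, case-insensitively.
sub matches {
    my ($text, @terms) = @_;
    for my $term (@terms) {
        return 0 unless $text =~ /\Q$term\E/i;   # \Q quotes regex metachars
    }
    return 1;
}

my @passages = (
    'Introductory Perl programming for scientists',
    'Advanced topics in "machine learning"',
);
my @terms = parse_query('perl programming');
my @hits  = grep { matches($_, @terms) } @passages;
print "$_\n" for @hits;
```

This ignores OR, exclusion (`-term`), and ranking, but it covers the implicit-AND behavior most users expect from a Google-style search box.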



Re: Google-like query of ASCII text with Text::Query or other modules
by saintmike (Vicar) on Aug 24, 2005 at 06:04 UTC

      This looks like a module I'd love to use. But it indexes files, and my problem is that the text passages against which I want to run queries don't exist in files. They're created on the fly—thousands of short passages. In my script, they exist as scalars. Is there any way to run a Google-like (or SWISH-like) query against them, given that they don't exist as files?


        Swish supports indexing arbitrary items, not just files. If you dissect the innards of SWISH::API::Common, you'll see that it puts a 'streamer' into swish's config file.

        The 'streamer' is a program that prints out the text data to be indexed, plus some metadata. Check out the file_stream method in SWISH::API::Common:

        print "Path-Name: $file\n",
              "Document-Type: TXT*\n",
              "Content-Length: $size\n\n";
        print $data;

        So, unless you want to put your text snippets into files, there'll be some additional work involved, but it should be easy.
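For in-memory snippets, a streamer along those lines might look like the sketch below. The header format follows the file_stream output quoted above; the snippet data and the virtual path names are made-up examples:

```perl
#!/usr/bin/perl
# Sketch of a swish-e 'streamer': emit each in-memory snippet with the
# Path-Name / Document-Type / Content-Length headers shown above.
use strict;
use warnings;

# Made-up snippets keyed by a virtual path name.
my %snippets = (
    'course/101' => 'Introductory Perl programming.',
    'course/202' => 'Advanced regular expressions.',
);

my $stream = '';
for my $name (sort keys %snippets) {
    my $data = $snippets{$name};
    $stream .= "Path-Name: $name\n"
             . "Document-Type: TXT*\n"
             . "Content-Length: " . length($data) . "\n\n"
             . $data;
}
print $stream;
```

Swish would run such a program via its config file and index each chunk as a separate virtual document under its Path-Name, so no real files are needed.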

Re: Google-like query of ASCII text with Text::Query or other modules
by Tanktalus (Canon) on Aug 24, 2005 at 14:36 UTC

    This may start out sounding like I'm completely off-topic, but I'm not; please bear with me.

    When writing CGI applications, the overwhelming consensus is to put your web pages into templates of some sort (whether that be with Text::Template, HTML::Template, or any of a number of others) to separate out the data from the logic. (Then there's Text::ScriptTemplate, but let's not go there ;-})

    In your application, you say you have the text in scalars. Why not separate it out into a real data store of some sort? That could be files or a database. Both have fairly universal access methods: you can get at them, and so can the index/query engine. (In fact, the database could be that very index/query engine.) I realise this would likely be a fair bit of up-front work, but it's likely to pay off in the long run. The text could be updated by anyone, not just people who know their way around Perl. You could create another CGI app to update it; it's much easier to programmatically create new files or rows in a table than it is to programmatically update Perl code!

    Once you've changed your code to work "inside the box", then you're following an expected (and easy-to-deal-with) paradigm, and lots of other benefits will accrue. Sometimes, conformity is a good thing ;-)

Re: Google-like query of ASCII text with Text::Query or other modules
by danmcb (Monk) on Aug 24, 2005 at 09:35 UTC
    if your texts are not in files, what are they in?

      Alas, they're in one large text file. I work at a university. Basically, I've got a 5MB text file containing thousands of paragraphs, each of which is a description of a different course. I want students to be able to search this catalog, and I want their results to be the full text of all matching paragraphs.

      Right now, I handle this by using a regex to search each paragraph, returning to the user every paragraph that matches the query. The file is searched each time, so this is quite inefficient. As Tanktalus suggests below, I'd be better off if these descriptions were already in a database. But there's no question that this particular project isn't worth the time I would need to do that.

      Of course, using a database would also permit the students to run more powerful queries, which is what I'd really like to do. My main concern at this point is not the efficiency of the search (which isn't terribly slow as it is), but improving the query capabilities.
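Short of a database, reading the catalog in Perl's paragraph mode (`$/ = ""`) keeps the per-paragraph loop simple and avoids slurping the whole file. A sketch, assuming paragraphs are separated by blank lines; the catalog content here is an in-memory stand-in for the real 5MB file:

```perl
#!/usr/bin/perl
# Sketch: stream blank-line-separated paragraphs and return every
# paragraph that contains all query terms.
use strict;
use warnings;

sub search_catalog {
    my ($fh, @terms) = @_;
    local $/ = '';       # paragraph mode: read one blank-line-separated chunk at a time
    my @hits;
  PARA: while (my $para = <$fh>) {
        for my $term (@terms) {
            next PARA unless $para =~ /\Q$term\E/i;
        }
        push @hits, $para;
    }
    return @hits;
}

# Stand-in catalog; in practice this would be open my $fh, '<', $catalog_file.
my $catalog = "CS 101: Intro to Perl.\n\nHIST 210: Medieval Europe.\n\nCS 330: Perl and databases.\n";
open my $fh, '<', \$catalog or die $!;
my @hits = search_catalog($fh, 'perl');
print scalar(@hits), " matching paragraphs\n";   # prints "2 matching paragraphs"
```

Because each paragraph must contain every term, this already gives implicit-AND queries; quoted-phrase or OR support would go into the term-parsing step.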


Node Type: perlquestion [id://486105]
Approved by jbrugger