PerlMonks  

Re^9: Running SuperSearch off a fast full-text index.

by dpavlin (Friar)
on Jun 10, 2007 at 20:38 UTC


in reply to Re^8: Running SuperSearch off a fast full-text index.
in thread Running SuperSearch off a fast full-text index.

I would also prefer discussion first, and then a summary on the wiki.

On the other hand, storing nodes locally in SQLite seems like overkill. With a good filesystem there is no reason to complicate the crawler with DBI code; just dump the files on disk.
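
For illustration only, here is a minimal sketch of what "just dump files on disk" could look like, written in Python rather than Perl to keep it short. The function names (`store_node`, `changed_since`), the one-file-per-node layout, and the sharding into subdirectories are all assumptions made for this example, not something the crawler is known to do. Changed nodes are found by comparing each file's modification time against the time of the last run.

```python
import os
from pathlib import Path


def store_node(root, node_id, body):
    """Write one node to its own file (hypothetical layout: root/NNN/<id>.txt)."""
    # Shard by node id so that no single directory grows too large.
    path = Path(root) / f"{node_id % 1000:03d}" / f"{node_id}.txt"
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name, then rename, so a reader never sees a partial file.
    tmp = path.with_suffix(".tmp")
    tmp.write_text(body, encoding="utf-8")
    os.replace(tmp, path)
    return path


def changed_since(root, last_run):
    """Yield ids of nodes whose file was modified after last_run (epoch seconds)."""
    for path in Path(root).rglob("*.txt"):
        if path.stat().st_mtime > last_run:
            yield int(path.stem)
```

Note that `changed_since` has to walk the whole tree on every run; whether that cost matters depends on the filesystem and the number of nodes.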


2share!2flame...

Re^10: Running SuperSearch off a fast full-text index.
by dmitri (Priest) on Jun 10, 2007 at 21:19 UTC
    The reason I think SQLite would be useful is that, if we want to separate the spider from the indexer, finding the articles to update in the index is as simple as
    SELECT * FROM ARTICLES WHERE LAST_UPDATED > $LAST_TIME_I_RAN
    instead of searching the filesystem. If the documents are stored on the filesystem, we will need code to
    • search,
    • store, and
    • update
    the documents. SQLite provides all of that for free. Want to move to a different machine? The database is a single file. Plus, who knows what other useful things SQLite's flexibility will allow us to do?
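
A rough sketch of the spider/indexer split described above, using Python's standard `sqlite3` module in place of Perl DBI. The table layout (`node_id`, `body`, `last_updated` as epoch seconds) and the function names are assumptions made for the example; only the `LAST_UPDATED > last run` query comes from the post itself.

```python
import sqlite3
import time


def open_store(path=":memory:"):
    """Open (or create) the article store; the schema here is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS articles (
               node_id      INTEGER PRIMARY KEY,
               body         TEXT    NOT NULL,
               last_updated INTEGER NOT NULL
           )"""
    )
    # Index the timestamp so the indexer's query does not scan every row.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS articles_updated ON articles (last_updated)"
    )
    return conn


def save_article(conn, node_id, body, now=None):
    """Spider side: insert a new article or overwrite an existing one."""
    now = int(time.time()) if now is None else now
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO articles (node_id, body, last_updated) "
            "VALUES (?, ?, ?)",
            (node_id, body, now),
        )


def updated_since(conn, last_run):
    """Indexer side: fetch every article touched after the previous run."""
    return conn.execute(
        "SELECT node_id, body FROM articles "
        "WHERE last_updated > ? ORDER BY node_id",
        (last_run,),
    ).fetchall()
```

The indexer would only need to remember the timestamp of its previous run and pass it to `updated_since`; storing, updating, and searching are all handled by the two SQL statements.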
      I've worked on spiders that have used the file system, and spiders that have used databases. It's certainly cheaper to use the file system. But thinking about the size of the dataset, we can easily afford to put these records into a database. There are only c. 600,000 records, and they're small -- not even full web pages. I like the idea of using SQLite.
      --
      Marvin Humphrey
      Rectangular Research ― http://www.rectangular.com
