PerlMonks  

Re: Best way to aggregate many silos of data into a search form

by Your Mother (Canon)
on Nov 30, 2011 at 20:24 UTC


in reply to Best way to aggregate many silos of data into a search form

Though Anonymous Monk has tried to save you from anyone giving you answers, I feel brave today. Prepare to be assaulted and have things forced down your throat. Idiomatically speaking, of course!

You want a search engine: look at Lucy (née KinoSearch) and the many related/collateral packages, especially those by KARMAN.
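Getting a feel for Lucy is cheap; its Lucy::Simple convenience layer covers both indexing and querying in a few lines. A minimal sketch — the field names (source, uri, content) and sample documents are my own illustrative assumptions, not anything Lucy mandates:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use Lucy::Simple;

# Build a throwaway index in a temp dir; production would use a fixed path.
my $lucy = Lucy::Simple->new(
    path     => tempdir( CLEANUP => 1 ),
    language => 'en',
);

# Field names here are illustrative; Lucy indexes whatever keys you hand it.
$lucy->add_doc({
    source  => 'crm',
    uri     => '/records/1',
    content => 'Quarterly sales figures for the northeast region',
});
$lucy->add_doc({
    source  => 'wiki',
    uri     => '/wiki/Sales',
    content => 'How to file your sales reports',
});

# search() returns the total hit count; iterate hits with next_hit().
my $total = $lucy->search( query => 'sales', num_wanted => 10 );
while ( my $hit = $lucy->next_hit ) {
    print "$hit->{source}\t$hit->{uri}\n";
}
```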

This is obviously not the only way, and if you eventually decide you want more relational searching instead of raw, Google-like document sifting with keys, you'll be sorry; but from your description, this is what I'd do.

Approach

  • Write a dump/textify routine for each data source.
  • Write a small OpenSearch-y app with Mojolicious or whatever you like. Run it with some Plack server or other if the Mojo built-in isn't "right."
  • The app will have two RESTy endpoints: a PUT where each data source can submit its documents for indexing, and a GET for searching from any client.
  • Index the raw text along with its source, URI, date, catalog ID, authorization level for the record, etc.
  • Put a small tool on each source "box" to cron, daemonize, or otherwise trigger the PUTs that update the index.
  • Now any simple Ajax-y HTML page can use the GET service to deliver searches in tens of milliseconds over millions of documents, and if you write it to a standard like OpenSearch, you may even get others writing client code for you.
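The PUT/GET pair above can be sketched with Mojolicious::Lite. Everything here is an assumption for illustration — the route paths, the JSON shape, and the in-memory array standing in for a real Lucy index:

```perl
use Mojolicious::Lite;

# Stand-in for a real Lucy index: just an array of document hashrefs.
my @docs;

# PUT /documents -- each data source pushes its textified records here.
put '/documents' => sub {
    my $c   = shift;
    my $doc = $c->req->json;    # expects e.g. {source, uri, date, content}
    return $c->render( json => { error => 'no document' }, status => 400 )
        unless $doc && $doc->{content};
    push @docs, $doc;
    $c->render( json => { indexed => scalar @docs } );
};

# GET /search?q=... -- any Ajax-y page (or OpenSearch client) hits this.
get '/search' => sub {
    my $c = shift;
    my $q = $c->param('q') // '';
    my @hits = grep { index( lc $_->{content}, lc $q ) >= 0 } @docs;
    $c->render( json => { total => scalar @hits, hits => \@hits } );
};
```

Append `app->start;` and launch the file with `morbo` for development, or run it under a PSGI/Plack server via Mojo::Server::PSGI if the Mojo built-in isn't "right."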

Concerns

  • You don't "own" any data. This is both good and bad: you can't ruin real records with mistakes, but you may also need to be able to reindex everything from all the sources.
  • This is easier than some approaches, but it's also a road less traveled. You'll have less help and fewer docs than if you do a Pg/SQLite version or whatever canned PHP/Java might be floating around.
  • Persistent services need health monitoring. You'll probably be writing that stuff on your own.
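That monitoring doesn't have to start big: a cron-driven probe against the search endpoint covers the basics. HTTP::Tiny ships with core Perl; the URL below is an assumed local address, not anything from the post:

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Probe the search endpoint with a known-cheap query (assumed address).
my $url = 'http://localhost:3000/search?q=ping';
my $res = HTTP::Tiny->new( timeout => 5 )->get($url);

my $ok = $res->{success} ? 1 : 0;
if ($ok) {
    print "OK: $url answered with status $res->{status}\n";
}
else {
    warn "FAIL: $url -> $res->{status} $res->{reason}\n";
}
# Under cron, exit non-zero on failure so MAILTO picks it up:
# exit( $ok ? 0 : 1 );
```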

Good luck and if you go this road, I hope you have fun!

