in reply to Best way to aggregate many silos of data into a search form
Though Anonymous Monk has tried to save you from anyone giving you answers, I feel brave today. Prepare to be assaulted and have things forced down your throat. Idiomatically speaking, of course!
You want a search engine: look at Lucy (née KinoSearch) and the many related/collateral packages, especially those of KARMAN.
This is obviously not the only way, and if you eventually decide you want relational searching instead of raw, Google-like document sifting by keyword, you'll be sorry; but from your description, this is what I'd do.
Approach
- Write a dump/textify routine for each data source.
- Write a small OpenSearch-y app with Mojolicious or whatever you like. Run it with some Plack server or other if the Mojo built-in isn't "right."
- The app will have two RESTy endpoints: a PUT where each data source can submit its documents for indexing, and a GET for searching from any client.
- Index the raw text along with its source, URI, date, catalog ID, authorization level for the record, etc.
- Put a small tool on each source "box" to cron, daemonize, or trigger the PUTs to update the index.
- Now any simple Ajaxy HTML page can use the GET service to deliver searches in tens of milliseconds over millions of documents, and if you build it to a standard like OpenSearch, you may even get others writing client code for you.
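To make the shape of the thing concrete, here's a minimal sketch of that two-endpoint app with Mojolicious::Lite and Lucy::Simple. The index path and the field names (`content`, `source`, `uri`) are made up for illustration — use whatever your textify routines actually emit — and a real deployment would want auth and a single writer rather than reopening the index per request:

```perl
#!/usr/bin/env perl
# Sketch only; index path and field names are assumptions, not gospel.
use Mojolicious::Lite;
use Lucy::Simple;

my $index_path = '/tmp/silo-index';    # pick a real home for this

helper lucy => sub {
    Lucy::Simple->new( path => $index_path, language => 'en' );
};

# Each silo PUTs its textified documents here for indexing.
put '/doc' => sub {
    my $c   = shift;
    my $doc = $c->req->json;    # e.g. { content => ..., source => ..., uri => ... }
    $c->lucy->add_doc($doc);
    $c->render( json => { indexed => 1 } );
};

# Any client GETs searches here.
get '/search' => sub {
    my $c    = shift;
    my $lucy = $c->lucy;
    my $total = $lucy->search( query => $c->param('q'), num_wanted => 10 );
    my @results;
    while ( my $hit = $lucy->next_hit ) {
        push @results, { uri => $hit->{uri}, source => $hit->{source} };
    }
    $c->render( json => { total => $total, results => \@results } );
};

app->start;
```

The cron/daemon tool on each source box then only needs to speak HTTP — even curl will do for a first pass, something like `curl -X PUT -H 'Content-Type: application/json' -d @doc.json http://indexer:3000/doc`.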
Concerns
- You don't "own" any data. This is both good and bad. You can't ruin real records with mistakes but you also might need to be able to reindex everything from all the sources.
- This is easier than some approaches, but it's also a road less traveled. You'll have less help and fewer docs than with a Pg/SQLite version or whatever canned PHP/Java might be floating around.
- Persistent services need health monitoring. You'll probably be writing that stuff on your own.
Good luck and if you go this road, I hope you have fun!
In Section
Seekers of Perl Wisdom