Best way to aggregate many silos of data into a search form

by MyMonkName (Acolyte)
on Nov 30, 2011 at 19:38 UTC
MyMonkName has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I have something of a chicken and egg problem and a vision of what could possibly transcend it. I'm looking for opinions on how to do it.

At my job, we have many different databases that are curated by experts in their respective fields. To date, these databases have been served up in distinct web forms with lots of knobs and doodads. They are used mainly by librarian types who sift and sort to get the information they want via the aforementioned doodads. These are all served from an array of mutually unintelligible technologies, of course.

The problems I see are that (1) ostensibly only experts are using these resources, not the public at large, and (2) the databases don't talk to each other, necessitating a bunch of repetitive searching through different web forms to find the right information. I think there should be a unified, Google-ish search page that returns all the information we have on a given term in one shot.

At the same time, I'm not interested in frog-marching anybody into the One True Solution, mainly because I'm not a masochist, but also because the experts who maintain the silos actually do know what they are talking about content-wise, and maybe they have good reasons for doing what they are doing. If it turns out that they are wrong though, then I'll have the data to show them why (e.g., more people are using the information and the world did not end.)

So my notion is to build a lightweight front-end to all these databases. A couple of criteria I've thought about already are:

- No live communication; replication (in the broad sense of the term) instead. The data doesn't grow that fast, and I don't want my app tending a herd of unruly XMLHttpRequests if I can help it.

- As simple as possible. I'm not very interested in administering a heavy-duty PostgreSQL system or something like that.

Whatever the shape of the backend system's data, the front end would likely consist of not much more than a snippet of text and a link to the record on the backend webserver/db.

What kind of solutions spring to mind? What should I watch out for? Obviously, I'd be using Perl to get it done. Something trendy, like a NoSQL database? XML or JSON? SQLite?

Re: Best way to aggregate many silos of data into a search form
by Anonymous Monk on Nov 30, 2011 at 19:46 UTC
    Too vague to give any reasonable advice, so beware of people who want to push their favourite technology du jour onto you.

    Come up with a detailed design, IOW write a functional spec. <http://www.joelonsoftware.com/articles/fog0000000036.html> Then one may talk about implementation details.

Re: Best way to aggregate many silos of data into a search form
by MyMonkName (Acolyte) on Nov 30, 2011 at 20:08 UTC
    I guess I am asking for an informal survey of favourite technologies du jour :P Reasons for would be a nice bonus...
Re: Best way to aggregate many silos of data into a search form
by moritz (Cardinal) on Nov 30, 2011 at 20:19 UTC

    The only way to gain simplicity is by glossing over lots of details. One way to do that is to request some HTML pages, and then scrape those for text, regardless of structure.

    Of course that implies that you have a way to get all the (interesting) data out of each database in HTML form.
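
    A rough sketch of that scraping approach, assuming each silo exposes a plain HTTP search page; the URL below is an invented example, and a real version would want per-source cleanup rules:

        use strict;
        use warnings;
        use LWP::UserAgent;
        use HTML::TreeBuilder;

        my $ua = LWP::UserAgent->new( timeout => 10 );

        # Fetch a page and flatten it to plain text, ignoring its structure.
        sub scrape_text {
            my ($url) = @_;
            my $res = $ua->get($url);
            return unless $res->is_success;
            my $tree = HTML::TreeBuilder->new_from_content( $res->decoded_content );
            my $text = $tree->as_text;
            $tree->delete;    # free the parse tree
            return $text;
        }

        # e.g. my $text = scrape_text('http://silo1.example.org/search?q=foo');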

      Screen scraping was my first approach, and frankly the one I am most comfortable with. But between the network latency, the time it took each database to process the request, and the time for the script itself to execute, it was just too laggy. It would be fine for one or two sources, I suppose, but we are talking about 6 or 7 databases, just for starters (!)
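
      For what it's worth, the stacked-up lag of fetching the sources one after another can be overlapped by issuing the requests concurrently (it hides the network latency, though not each source's own processing time). A sketch using Mojo::UserAgent's non-blocking interface, with invented silo URLs:

          use Mojo::UserAgent;
          use Mojo::IOLoop;

          my $ua   = Mojo::UserAgent->new;
          my @urls = map {"http://silo$_.example.org/search?q=foo"} 1 .. 7;

          my $pending = @urls;
          for my $url (@urls) {
              # Fire all requests at once; the callback runs as each completes.
              $ua->get( $url => sub {
                  my ( $ua, $tx ) = @_;
                  print "$url: ", length( $tx->res->body ), " bytes\n";
                  Mojo::IOLoop->stop if --$pending == 0;
              });
          }
          Mojo::IOLoop->start;    # wait until every fetch has come back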
Re: Best way to aggregate many silos of data into a search form
by Your Mother (Canon) on Nov 30, 2011 at 20:24 UTC

    Though Anonymous Monk has tried to save you from anyone giving you answers, I feel brave today. Prepare to be assaulted and have things forced down your throat. Idiomatically speaking, of course!

    You want a search engine, look at Lucy (née KinoSearch) and the many related/collateral packages, especially those of KARMAN.

    This is obviously not the only way, and if you eventually decide you want more relational searching instead of raw Google-like document sifting with keys, you'll be sorry; but from your description, this is what I'd do.

    Approach

    • Write a dump/textify routine for each data source.
    • Write a small OpenSearch-y app with Mojolicious or whatever you like. Run it with some Plack server or other if the Mojo built-in isn't "right."
    • The app will have two RESTy endpoints: a PUT where each data source can submit its documents for indexing, and a GET for searching from any request (both sketched after this list).
    • Index the raw text along with its source, URI, date, catalog ID, authorization level for the record, etc...
    • Put a small tool on each source "box" to cron, daemonize, or trigger the PUTs to update the index.
    • Now any simple Ajaxy HTML page can use the GET service to deliver searches in tens of milliseconds over millions of documents, and if you write it to a standard like OpenSearch, you may even get others writing client code for you.
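
    A minimal sketch of that service, using Mojolicious::Lite with Lucy::Simple. The index path, field names, and JSON shapes are invented for illustration; a real version would need authorization, error handling, and batched indexing:

        #!/usr/bin/env perl
        use Mojolicious::Lite;
        use Lucy::Simple;

        my $INDEX_PATH = '/tmp/silo-index';    # invented location (must exist)

        # One Lucy::Simple handle per request; it commits its changes
        # when it goes out of scope.
        helper lucy => sub {
            Lucy::Simple->new( path => $INDEX_PATH, language => 'en' );
        };

        # PUT /doc -- each silo's dump/textify tool submits documents here.
        put '/doc' => sub {
            my $c   = shift;
            my $doc = $c->req->json
                or return $c->render( json => { error => 'bad JSON' },
                                      status => 400 );
            $c->lucy->add_doc({
                source => $doc->{source} // '',
                uri    => $doc->{uri}    // '',
                title  => $doc->{title}  // '',
                body   => $doc->{body}   // '',
            });
            $c->render( json => { ok => 1 } );
        };

        # GET /search?q=term -- any Ajaxy front end can call this.
        get '/search' => sub {
            my $c     = shift;
            my $q     = $c->param('q') // '';
            my $lucy  = $c->lucy;
            my $total = $lucy->search( query => $q, num_wanted => 25 );
            my @hits;
            while ( my $hit = $lucy->next ) {
                push @hits, { title  => $hit->{title},
                              uri    => $hit->{uri},
                              source => $hit->{source} };
            }
            $c->render( json => { total => $total, hits => \@hits } );
        };

        app->start;    # run with: morbo app.pl

    The per-silo cron tool from the list above then reduces to a small script that PUTs JSON documents at /doc.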

    Concerns

    • You don't "own" any data. This is both good and bad: you can't ruin real records with mistakes, but you might also need to be able to reindex everything from all the sources.
    • This is easier than some approaches, but it's also something of a road less traveled. You'll have less help and fewer docs than if you do a Pg|SQLite version or whatever canned PHP/Java might be floating around.
    • Persistent services need health monitoring. You'll probably be writing that stuff on your own.

    Good luck and if you go this road, I hope you have fun!

Re: Best way to aggregate many silos of data into a search form
by InfiniteSilence (Deacon) on Nov 30, 2011 at 20:29 UTC

    My first question is always the 'who' -- who are we building the application for? From what you said:

    These databases are used mainly by librarian types who sift and sort to get the information they want via the aforementioned doodads...

    It appears that the consumers of the disparate databases are the 'librarian types.' They "sift and sort to get the information they want" -- so my guess is that these people are your target.

    Since the data is already in databases my recommendation is that you look into data warehousing. I do not recommend that you build a single application to connect to multiple back-end databases. Instead, I would obtain the advice of a knowledgeable data warehouse person and pay that individual to draw up a plan to build a single warehouse from the existing databases. I would then look into using whatever reporting tools my staff was fluent in to extract the data in meaningful ways for my target audience.

    If and when I found that the reporting utilities of the data warehouse overlapped with those of a former application, I would deprecate that application's functionality and promote the warehouse's use. If you find that a large portion of an application's functionality can be supplanted by the warehouse, then I would schedule that application for end of life.

    What you might notice after going through a few cycles of this is that almost all of the hard analytical work will be done by the data warehouse and/or the tools associated with it. Then the front-end applications will mainly be used for dashboards/data collection points. At that point it would be feasible to consolidate one or more of them into a single front-end application. You could use any of the numerous web application toolkits available in Perl to build such a thing.
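
    For a feel of the mechanics, a bare-bones extract-and-load pass over one silo might look like the following. The DSNs, table, and columns are invented examples; the actual warehouse design (schemas, incremental loads, and so on) is exactly what the hired expert would supply:

        use strict;
        use warnings;
        use DBI;

        # Source silo and warehouse handles (invented DSNs).
        my $src = DBI->connect( 'dbi:Pg:dbname=silo1', 'user', 'pass',
                                { RaiseError => 1 } );
        my $dwh = DBI->connect( 'dbi:SQLite:dbname=warehouse.db', '', '',
                                { RaiseError => 1 } );

        $dwh->do(q{
            CREATE TABLE IF NOT EXISTS records (
                source    TEXT,
                record_id TEXT,
                title     TEXT,
                body      TEXT
            )
        });

        # Pull every record from the silo and load it, tagged with its origin.
        my $sth = $src->prepare('SELECT id, title, body FROM records');
        $sth->execute;
        my $ins = $dwh->prepare(
            'INSERT INTO records (source, record_id, title, body)
             VALUES (?, ?, ?, ?)' );

        $dwh->begin_work;
        while ( my ( $id, $title, $body ) = $sth->fetchrow_array ) {
            $ins->execute( 'silo1', $id, $title, $body );
        }
        $dwh->commit;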

    Celebrate Intellectual Diversity

Re: Best way to aggregate many silos of data into a search form
by TJPride (Pilgrim) on Nov 30, 2011 at 21:10 UTC
    Well, there are two ways to do this. The first, as suggested above, is to compile the data into a single database. The problem with this is that the single database is essentially a mirror of all the existing data, and it will have to be kept constantly updated if you want real-time searching. The second is to write a query for each database and then just output all the results on one page. Which method you use largely depends on the number of searches vs. the number of updates, and on how time-sensitive the data is. If you have a large number of global searches, the single database will be more efficient. If there are fewer searches and a lot of updates (and the number of databases is under a few dozen), the multiple-query method may work better. It's hard to give advice without knowing a lot more about your data.
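
    As a rough illustration of the multiple-query method, assuming each silo is reachable over DBI and has some searchable title column; the DSNs and schema here are invented:

        use strict;
        use warnings;
        use DBI;

        # One DSN per silo (invented examples).
        my %silos = (
            chemistry => 'dbi:Pg:dbname=chem',
            history   => 'dbi:mysql:database=hist',
        );

        # Run the same search against every silo and merge the results,
        # tagging each row with the silo it came from.
        sub search_all {
            my ($term) = @_;
            my @results;
            for my $name ( sort keys %silos ) {
                my $dbh = DBI->connect( $silos{$name}, '', '',
                                        { RaiseError => 1 } );
                my $rows = $dbh->selectall_arrayref(
                    'SELECT title, uri FROM records WHERE title LIKE ?',
                    { Slice => {} },
                    "%$term%",
                );
                push @results, map { { %$_, source => $name } } @$rows;
            }
            return \@results;
        }

        # e.g. my $hits = search_all('chlorophyll');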
