PerlMonks  

Indexing flat data files stored on a local server

by whakka (Hermit)
on Sep 05, 2008 at 16:12 UTC ( [id://709328]=perlquestion )

whakka has asked for the wisdom of the Perl Monks concerning the following question:

The environment:

  • A small group of academics and support staff, each producing mostly individual research on a specific topic in economics, with the work dispersed across individual machines.
  • Overlap, real and potential, in common resources - mostly data (flat files - Stata binary and text) and Stata (proprietary statistical package) code; but also documents, teaching materials, bookmarks, utilities, etc.
  • Local *nix server (only our group has access since we own it) available to store files and run programs in each user's home folder, with individual permission controls.
  • The major issue: Collaboration is a touchy thing when faculty are in competition with each other for tenure - certainly no "proprietary" resources (either paid for or generated over a long period of time) will be shared.

The goal: Make an easy way for faculty and support to share non-proprietary data/other resources in home folders on the local server which can be indexed and searched, retrieving relevant metadata (for data files this would hopefully include the source, a short description, variable names, geographic and time coverage...) and file location on the server (a link, say).

I'm looking at rolling my own Perl solution for the indexing and search, but I obviously need a way for non-programmer contributors to provide the relevant metadata. I'm also more than open to open source solutions, but from what I've found so far, most of what's out there is for database-driven commerce and nothing like what I'm after. Also, since the data is mostly in Stata binary, it would be difficult to inspect it with Perl, especially since the only Stata modules on CPAN can read Stata 8 and 10 files, whereas we've been using Stata 9 for the last couple of years. In other words, auto-generating the metadata would be downright difficult (though nothing's impossible, of course).
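To make the roll-your-own idea concrete, here is a minimal sketch of one way it could work. It assumes each contributor writes a plain-text sidecar file of `key: value` lines next to every shared data file; that convention, the field names, the example path, and all function names are hypothetical, not something the tools mentioned in this thread prescribe. It is sketched in Python for brevity; the same structure maps directly onto Perl hashes.

```python
import re
from collections import defaultdict


def parse_sidecar(text):
    """Parse 'key: value' lines into a dict; blank lines and '#' comments are skipped."""
    meta = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        meta[key.strip().lower()] = value.strip()
    return meta


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(records):
    """Build an inverted index (word -> set of file paths) from path -> metadata dicts."""
    index = defaultdict(set)
    for path, meta in records.items():
        for value in meta.values():
            for word in tokenize(value):
                index[word].add(path)
    return index


def search(index, query):
    """Return the sorted paths whose metadata contains every word in the query."""
    words = tokenize(query)
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)


# Hypothetical sidecar content and file location, for illustration only.
SIDECAR = """\
source: Current Population Survey
description: hourly wages by state
variables: wage, state, year
coverage: US, 1990-2005
"""

records = {"/home/alice/shared/cps_wages.dta": parse_sidecar(SIDECAR)}
index = build_index(records)
print(search(index, "wages state"))
```

Because contributors only edit a small text file, nothing here needs to read the Stata binaries themselves. The trade-off is that the index is only as good as the metadata people bother to write.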

You may have noticed that I didn't say "web front-end" for the search, since I have no idea if this is the best solution; a local application may be better. You may also think that I'm in way over my head, but I'm eager to use this project as a learning exercise. Any help or pointers are greatly appreciated :)

Update: Thanks to shmem for pointing out Swish-e, which looks like the ideal solution, especially with Perl customization for reading weird data formats like Stata. Thanks also to moritz for pointing out beagle and KinoSearch, but Swish-e looks like the most straightforward option.


Replies are listed 'Best First'.
Re: Indexing flat data files stored on a local server
by shmem (Chancellor) on Sep 05, 2008 at 16:30 UTC
    I'm looking at rolling my own Perl solution for the indexing

    Rather than rolling my own solution, I would use Swish-e for indexing and search, and see if the CPAN Stata modules can be adapted for version 9.

      Shmem, Thanks for the link!
Re: Indexing flat data files stored on a local server
by moritz (Cardinal) on Sep 05, 2008 at 16:29 UTC
    The indexing and searching of text-based documents could be done with beagle (which I've heard of but have no experience with). I guess you can write import filters for beagle's indexer.

    Probably not perfect, but worth investigating.

    A Perl module for building search engines is KinoSearch, which I successfully used for small projects (but which requires you to write a lot more code than an out-of-the-box solution like beagle).

Node Type: perlquestion [id://709328]
Approved by Arunbear