
Re: Proof of concept: File::Index

by mattr (Curate)
on May 18, 2006 at 04:31 UTC (#550152)


in reply to Proof of concept: File::Index

Nice, looks interesting and eminently useful. Of course the "any time the user suspects" part is a bit weak.

"it's impossible to find the Nth occurrence of some phrase or word in a book without opening the book and counting your way through it."

I would like to note that, with practice, I became able to consistently open a thick Japanese character dictionary (Nelson's) to the correct page back in school. I think this works like a lookup table that maps the thickness of the pages under one's thumb to a list of more than 100 chapters. At least it always worked for the most important chapter.

So if you know where a change has been made in a file, you could in fact jump to a prestudied location before that point, one known to have X occurrences of a pattern before it, and then count the remaining N-X occurrences starting from there. Metadata describing the various prestudied points (or the results of prerun pattern matches) could be saved in a memo at the head of the index file.
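To make the "prestudied location" idea concrete, here is a rough sketch in Python rather than Perl. It has nothing to do with how File::Index actually works; the function names and the `every` parameter are made up for illustration, and it assumes a simple pattern without anchors or lookbehinds, since matching resumes in the middle of the text.

```python
import re

def build_checkpoints(text, pattern, every=2):
    """Prestudy the text: record (offset, matches_before_offset) pairs.

    A checkpoint is stored after every `every`-th match. The (0, 0)
    entry means "start of file, nothing counted yet".
    """
    checkpoints = [(0, 0)]
    for count, m in enumerate(re.finditer(pattern, text), start=1):
        if count % every == 0:
            checkpoints.append((m.end(), count))
    return checkpoints

def find_nth(text, pattern, n, checkpoints):
    """Return the start offset of the n-th (1-based) match, or -1."""
    # Jump to the latest checkpoint that has fewer than n matches before it.
    offset, seen = max((c for c in checkpoints if c[1] < n),
                       key=lambda c: c[1])
    # Count only the remaining N-X matches from that point on.
    for m in re.compile(pattern).finditer(text, offset):
        seen += 1
        if seen == n:
            return m.start()
    return -1

text = "a cat, a cat, a cat, a cat, a cat"
cps = build_checkpoints(text, "cat", every=2)
print(find_nth(text, "cat", 5, cps))
```

The checkpoint list is the sort of thing that could live in the memo at the head of the index file. After an edit, only checkpoints at or beyond the changed offset would need to be rebuilt.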

You could also save a series of checksums per chapter (if not per line) and this could help determine where a change was made, though maybe Diff could do something similar. This would let you enjoy the benefits of a flat file, i.e. do regex pattern matching or tie the file to some module's object model like Config, while also enjoying some of the structure given by a record-based object store.
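A minimal sketch of the checksum idea, again in Python and purely illustrative. Here a "chapter" is simply assumed to be a fixed number of lines, and SHA-256 is an arbitrary choice of digest; neither comes from the module under discussion.

```python
import hashlib

def chunk_checksums(lines, chunk_size=100):
    """Return one hex digest per block of `chunk_size` lines."""
    sums = []
    for i in range(0, len(lines), chunk_size):
        block = "".join(lines[i:i + chunk_size]).encode("utf-8")
        sums.append(hashlib.sha256(block).hexdigest())
    return sums

def first_changed_chunk(old_sums, new_sums):
    """Index of the first chunk whose checksum differs, or None."""
    for i, (old, new) in enumerate(zip(old_sums, new_sums)):
        if old != new:
            return i
    # All shared chunks match; a length difference means chunks were
    # added or removed at the end.
    if len(old_sums) != len(new_sums):
        return min(len(old_sums), len(new_sums))
    return None

lines = ["line %d\n" % i for i in range(10)]
saved = chunk_checksums(lines, chunk_size=3)
lines[7] = "edited\n"
print(first_changed_chunk(saved, chunk_checksums(lines, chunk_size=3)))
```

Everything before the first changed chunk can keep its old index entries. Note that an insertion shifts every later line, so with fixed-size chunks all chunks after the edit will look changed too; chunking on real chapter boundaries would avoid that.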

Personally I would probably rather have an index that operates on keywords or patterns than one that uses a recno. If the text file holds a list of paragraphs, I could save a few words describing each paragraph in the index and later jump to the Nth article matching a given keyword or scoring above a certain threshold. Or perhaps I have a list of events in a calendar, each with an event type or event owner associated with it. In that case I might want multiple lines per record; in other words, the delimiter would not be "\n". Maybe I'd like a (not necessarily unique) date-based key, or a serial number in a certain format. These are just ideas.
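Something like the following is what I have in mind, sketched in Python under a few assumptions: records are separated by a blank line, keys are just the lowercased words of each record unless a hypothetical `key_func` is supplied, and offsets are character positions in an in-memory string rather than byte positions in a file.

```python
import re
from collections import defaultdict

def build_keyword_index(text, delimiter="\n\n", key_func=None):
    """Map each key to the start offsets of the records containing it."""
    if key_func is None:
        # Default key extraction: every distinct lowercase word.
        key_func = lambda record: set(re.findall(r"[a-z]+", record.lower()))
    index = defaultdict(list)
    pos = 0
    for record in text.split(delimiter):
        if record.strip():
            for key in key_func(record):
                index[key].append(pos)
        pos += len(record) + len(delimiter)
    return index

def nth_record(text, index, key, n, delimiter="\n\n"):
    """Return the n-th (1-based) record filed under `key`, or None."""
    offsets = index.get(key, [])
    if not 1 <= n <= len(offsets):
        return None
    start = offsets[n - 1]
    end = text.find(delimiter, start)
    return text[start:] if end == -1 else text[start:end]

calendar = (
    "date: 2006-05-18\nowner: alice\ntype: meeting\n\n"
    "date: 2006-05-19\nowner: bob\ntype: lunch\n\n"
    "date: 2006-05-20\nowner: alice\ntype: lunch"
)
idx = build_keyword_index(calendar)
print(nth_record(calendar, idx, "lunch", 2))
```

Keys do not have to be unique, so a date or an owner can point at several records. A `key_func` that returns only the date line would give the date-based key mentioned above.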

I am trying to think of when I would want to use your new module, and I keep coming back to extracting descriptive words from text, as in NLP (natural language processing), and saving them with each paragraph or sentence. That would be useful whether or not the data is a single flat file, and so would a tool for navigating the precompiled index, with its pointers into the data. Perhaps a callback or plugin mechanism for index creation would help.

At the moment I am thinking of indexing books, which make nice flat files. I wrote a little program that lets me read books from my server on my cell phone when I'm on the train (it turns out that's not cheap, but anyway). I read 10KB per page, which is the most that fits in RAM and enough to reach the next station. It would be nice to have an index built so that each page ends at the end of a sentence, within the 10K limit. Doing that by hand is such a pain that currently I even split words across pages. A recno could be used as a bookmark, if the recno is created based on a page length and a "try not to break sentences across pages" heuristic. So to make a long story short, it would be interesting if your module supported creating indices based on pages whose length is decided somewhat intelligently. Would that be possible with your module? Keep up the good work!
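For what it's worth, the page heuristic I am picturing looks roughly like this Python sketch. The function name and the sentence test are my own simplifications: a sentence end is assumed to be ".", "!" or "?" followed by a space, and a forced cut in the middle of a long run without spaces could still split a multi-byte character.

```python
def page_offsets(data, max_bytes=10 * 1024):
    """Return the byte offset at which each page starts.

    Each page is at most `max_bytes` long. A page ends at the last
    sentence end inside the limit if there is one, otherwise at the
    last space, otherwise it is cut hard at the limit.
    """
    offsets = [0]
    start = 0
    while len(data) - start > max_bytes:
        window = data[start:start + max_bytes]
        cut = max(window.rfind(b". "), window.rfind(b"! "), window.rfind(b"? "))
        if cut != -1:
            cut += 2  # keep the punctuation and the space on this page
        else:
            space = window.rfind(b" ")
            cut = space + 1 if space != -1 else max_bytes
        start += cut
        offsets.append(start)
    return offsets

book = b"One two three. Four five six. Seven eight nine."
starts = page_offsets(book, max_bytes=20)
print(starts)
```

The position in the returned list is the page number, so it can serve as the recno bookmark: page k runs from `starts[k]` up to `starts[k + 1]`, or to the end of the file for the last page.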

