PerlMonks  

Simple Text Indexing

by cyocum (Curate)
on Nov 29, 2003 at 16:26 UTC ( [id://310891]=perlquestion )

cyocum has asked for the wisdom of the Perl Monks concerning the following question:

Hello Fellow Perlmonks

I have a medium-sized text file that I would like to index, along with line numbers, so that I can find where a word occurs in the file. I have already written some code (appended below), but I was wondering whether there are any good plain-text indexers in Perl that also store where in a file a particular word is. I have looked at Apache Lucene. The only problem I have with it is that it does not store where in a file a particular word is, only that the word is in the file. I could use the find feature of whatever text editor I use, but I would like to be able to search for a term over several files.

I have also looked at two nodes here and here.

Any ideas?

UPDATE: I have updated the code a bit since I was not stripping the unwanted characters before I did some other things to the text.

use strict;
use warnings;
use utf8;
use IO::File;

#the file to index
my $inFile = "c:\\temp\\texts\\T100001A.txt";
#the file to store the index information
my $indexFile = "c:\\temp\\index\\t100001a.index";

my $inFh = new IO::File $inFile, "r";
my $outFh = new IO::File "$indexFile", "w";

my $lineNum = 0;
my %index;

while(my $line = <$inFh>) {
    $lineNum++;
    chomp $line;

    my @words = split /\s/, $line;

    foreach my $word (@words) {
        $word =~ s/,$|\.$|\[|\]|\(|\)|;|:|!//g;
        $word = lc $word;
    }

    @words = grep {!&inStopList($_);} @words;
    @words = grep {&removeNullEntries($_);} @words;

    foreach my $word (@words) {
        if(exists $index{$word}) {
            push @{$index{$word}}, $lineNum;
        } else {
            my @lineNums;
            push @lineNums, $lineNum;
            $index{$word} = \@lineNums;
        }
    }
}

print "done indexing\n";

foreach my $key (keys %index) {
    print $outFh $key;
    print $outFh "=";
    print $outFh join(',', @{$index{$key}});
    print $outFh "\n";
}

sub inStopList {
    my $word = shift;
    my @stopList = ("the", "a", "an", "of", "and", "on", "in", "by",
                    "with", "at", "he", "after", "into", "their", "is",
                    "that", "they", "for", "to", "it", "them", "which");

    foreach my $stopWord (@stopList) {
        if($word eq $stopWord) {
            return $word;
        } elsif($word =~ /p\.(\d)+/) {
            return $word;
        } elsif($word =~ /\-{5,}?/) {
            return $word;
        } else {
            next;
        }
    }
}

sub removeNullEntries {
    my $word = shift;
    if($word) {
        return $word;
    } else {
        return undef;
    }
}

Replies are listed 'Best First'.
Re: Simple Text Indexing
by jeffa (Bishop) on Nov 29, 2003 at 17:40 UTC

    My first recommendation was MMM::Text::Search, but it too only tells you which file(s) the search words were found in, not where.

    What are you building this for? Wouldn't a simple brute force loop be enough if you are only dealing with one file?

    my $match = qr/thunderbird/i;

    open FILE, '<', 'foo.txt' or die $!;
    while (<FILE>) {
        if ($_ =~ $match) {
            my @word = split /\s+/, $_;
            $_ =~ s/[A-Za-z0-9_ ]//g;
            for my $i (0..$#word) {
                if ($word[$i] =~ $match) {
                    print "match found on line $. word ", $i+1,"\n";
                }
            }
        }
    }

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    

      Good point. I am a graduate student at the University of Edinburgh's Celtic Studies department. Most of what I am working with is from the CELT site (here). Their search engine does not cover the annals (look for the cronicon scottorum or annals of ulster in the published section), is horribly slow, and has many other bugs, so I need to be able to quickly search and find things by keyword, not only in the cronicon scottorum but in all the annals. I am using the one file as a testbed before branching out into multi-file indexing and searching (at some point).

      I hope this helps answer some of your questions.

      Thanks for your input!

        Why not create a dictionary of words with lists of offsets? Store the byte location of the start of the line, so that when you want to retrieve it you can seek immediately to the right location and print that line. Another idea is to store the offset of the n-th previous line so you can print some context.

        I've written a function for you that accepts a filename and an optional number of lines of context (the default being 1). You'll probably want to store the index somewhere using Storable so it's convenient to reuse it later.

        my @files = glob "*.txt";
        my %file_idx = map {; $_ => index_file( $_, 5 ) } @files;

        # %file_idx then looks like:
        # {
        #     'foobar.txt' => { word => [ 1, 3, 5, 6 ], another => [ 5, 7, 2 ] },
        #     'barfoo.txt' => { ....... },
        # }

        sub index_file {
            my $filename = shift;
            my $lines_of_context = ( defined $_[0] && $_[0] > 0 ) ? shift() : 1;

            open my $fh, "<", $filename
                or die "Couldn't open $filename: $!";

            my @offsets;    # start offsets of the last $lines_of_context lines
            my %index;
            my $line_start = tell $fh;    # byte offset where the next line begins

            while ( my $line = <$fh> ) {
                push @offsets, $line_start;
                shift @offsets if @offsets > $lines_of_context;

                # Oldest remembered line start: seek here to print the context.
                my $offset = $offsets[0];

                for my $word ( split ' ', $line ) {
                    push @{ $index{$word} }, $offset;
                }
                $line_start = tell $fh;
            }

            close $fh
                or warn "Couldn't close $filename: $!";

            return \%index;
        }
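
        The offsets stored by index_file can be used roughly as follows. This is a minimal sketch, not part of the original reply: lines_at, save_index and load_index are hypothetical helper names, and the only library calls assumed are Storable's store and retrieve plus the built-in seek.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);

# Hypothetical helper: return up to $count lines of $filename,
# starting at byte offset $offset (as recorded in the index).
sub lines_at {
    my ( $filename, $offset, $count ) = @_;
    open my $fh, '<', $filename or die "Couldn't open $filename: $!";
    seek( $fh, $offset, 0 ) or die "Couldn't seek in $filename: $!";
    my @lines;
    while ( @lines < $count ) {
        my $line = <$fh>;
        last unless defined $line;
        push @lines, $line;
    }
    close $fh;
    return @lines;
}

# Save an index (word => [ byte offsets ]) to disk and load it back later.
sub save_index { my ( $index, $path ) = @_; store( $index, $path ); return }
sub load_index { my ($path) = @_; return retrieve($path) }
```

        With an index loaded, something like `print lines_at( $file, $_, 5 ) for @{ $index->{$word} || [] };` would print five lines starting at each recorded offset for $word.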
Re: Simple Text Indexing
by Cody Pendant (Prior) on Nov 29, 2003 at 22:01 UTC

    Couple of tiny comments on your code:

    my @words = split /\s/, $line;

    Might have problems if there are multiple spaces, spaces and tabs, so

    my @words = split /\s+/, $line;

    Would be better, surely? Is that what your removeNullEntries is about?

    I ended up coming up with a complex regex to get what I thought were "words" out of text, something like

    /\w[\w'-]*\w|\w+/
    rather than just grabbing strings separated by whitespace and trying to figure out if they're really valid words later.

    And

    my @stopList = ("the", "a", "an", "of", "and", "on", "in", "by", "with", "at", "he", "after", "into", "their", "is", "that", "they", "for", "to", "it", "them", "which");
    Seems like it would be better off as a hash so you can just go if(defined($stoplist{$word})).
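
    Put together, the two suggestions above might look like the sketch below. It is only an illustration: words_in_line and %stop_list are made-up names, and the stop words are the ones from the original post.

```perl
use strict;
use warnings;

# Stop words as hash keys, so checking a word is a single hash lookup.
my %stop_list = map { $_ => 1 } qw(
    the a an of and on in by with at he after
    into their is that they for to it them which
);

# Return the lower-cased words of a line that are not stop words.
# A "word" starts and ends with \w and may contain apostrophes
# and hyphens in between, as in the pattern suggested above.
sub words_in_line {
    my ($line) = @_;
    my @words = map { lc $_ } $line =~ /(\w[\w'-]*\w|\w+)/g;
    return grep { !exists $stop_list{$_} } @words;
}
```

    Because the regex only ever captures non-empty matches, there are no empty entries to filter out afterwards.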


    ($_='kkvvttuubbooppuuiiffssqqffssmmiibbddllffss') =~y~b-v~a-z~s; print
Re: Simple Text Indexing
by broquaint (Abbot) on Nov 29, 2003 at 23:23 UTC
    If you've looked at Lucene, then you may be interested in a perl version that's being developed by Simon Cozens called Plucene.
    HTH

    _________
    broquaint

Re: Simple Text Indexing
by cyocum (Curate) on Nov 30, 2003 at 01:33 UTC

    I would like to thank you all for your interesting comments and code. It is very late here so I will go to bed and look at all of this tomorrow. Again, thank you!

Re: Simple Text Indexing
by bl0rf (Pilgrim) on Dec 01, 2003 at 01:00 UTC
    cyocum, It would really help if you explained more
    in depth what exactly you need to do. Do you really
    need to find the offset of a word or can the users
    be content with knowing which doc the word is in?

    Perhaps if you don't have a lot of data you could perform
    a regular grep on all the files each time ( perhaps even
    outsource it to a shell command, be careful though...)

    My site

      What I need it to do is not only tell me if a word is in a file but where in that file the word appears. I am a historian. Knowing that a word is in a file is completely useless to me unless I know where in that file it is, so that I can read the information myself. With small files this is not a problem, since I can quickly find it, but with medium and large files it is useless. Like I said, I could just use the find feature of my favorite text editor; however, I would like to be able to search over multiple files since, especially with the annals, I need to know if something shows up in other files. Also, I am on WinXP so shell commands are not exactly an option.
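
      Where shell tools are not available, the brute-force search over several files can be done in Perl itself and still report where each match is. A minimal sketch under that assumption; grep_files is a hypothetical name, not an existing module function.

```perl
use strict;
use warnings;

# Search every named file for $pattern and return one
# "file:line: text" string per matching line.
sub grep_files {
    my ( $pattern, @files ) = @_;
    my @hits;
    for my $file (@files) {
        open my $fh, '<', $file or do { warn "Skipping $file: $!"; next };
        while ( my $line = <$fh> ) {
            next unless $line =~ $pattern;
            chomp $line;
            # $. is the current line number of the handle just read
            push @hits, sprintf '%s:%d: %s', $file, $., $line;
        }
        close $fh;
    }
    return @hits;
}
```

      For example, `print "$_\n" for grep_files( qr/armagh/i, glob "*.txt" );` would list every line mentioning the term in the text files of the current directory. This rereads every file on each search, so it only suits a modest amount of text.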

Re: Simple Text Indexing
by cyocum (Curate) on Jan 05, 2004 at 13:26 UTC

    I just wanted to give a bit of an update on this node. I was perusing this node when someone mentioned the module Text::ParseWords. I have not given it a try yet, but I think it may give me what I need to parse up the words correctly and then create an index.

Re: Simple Text Indexing
by cyocum (Curate) on Mar 10, 2004 at 12:51 UTC

    I just wanted to update this discussion with a link to an article I found on perl.com about Plucene that I thought might be relevant.

CLucene module for perl
by dpavlin (Friar) on Dec 01, 2003 at 19:43 UTC
    Ever since I found CLucene - a C++ search engine, I have been dreaming of a perl module for it. Since XS is a mystery to me, I started examining Inline::CPP. However, my C++ skills are not quite up to that task yet.

    I'm aware that there is the GNU mifluz engine, which can also do the job. However, the perl module for it, Search::Mifluz, is again XS and isn't working with the current version of mifluz.

    Any help with any of these issues from the perl community would be greatly appreciated.

    2share!2flame...

      Thanks for the information! The only issue is that mifluz still has the same problem as before: it does not store where in the file a word is, only that it is in the file. Take a look at the Introduction.

      I am beginning to believe that there needs to be a fundamental change in the way people think about text indexing. All the text indexing projects that I have seen only store that a word is in a file. They need to start behaving more like the index found in the back of a good academic book.
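
      To make the back-of-the-book idea concrete, here is a rough sketch of an index that records, for each word, every file and line number where it occurs. The names build_index and format_entry are invented for the illustration, and words are simply runs of \w characters.

```perl
use strict;
use warnings;

# Build word => { file => [ line numbers ] } for the given files.
sub build_index {
    my (@files) = @_;
    my %index;
    for my $file (@files) {
        open my $fh, '<', $file or die "Couldn't open $file: $!";
        while ( my $line = <$fh> ) {
            my %seen;    # list each line only once per word
            for my $word ( map { lc $_ } $line =~ /(\w+)/g ) {
                next if $seen{$word}++;
                push @{ $index{$word}{$file} }, $.;
            }
        }
        close $fh;
    }
    return \%index;
}

# Render one entry like a printed index: "word  file: 3, 17; other: 4".
sub format_entry {
    my ( $index, $word ) = @_;
    my $places = $index->{ lc $word } or return "$word: not found";
    my @parts;
    for my $file ( sort keys %$places ) {
        push @parts, $file . ': ' . join( ', ', @{ $places->{$file} } );
    }
    return $word . '  ' . join( '; ', @parts );
}
```

      Stop-word filtering and a better definition of a word would slot into the inner loop.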

        Several points from some similar work I have been dabbling in, on and off, for some time now...

        If you have a file which allows comments or anchors (HTML etc.) I've found it really is easiest for indexing to set up a two-pass process... the first to set up appropriate markers, at reasonable intervals, the second to pull your wordlist out, ideally with a hash of words pointing to lists of markers or tags etc. Alternatively, a process which uses paragraph numbers, line numbers, or simply file offsets useable by a seek may suffice.

        You will need a more extensive stop-list for large bodies of text -- in fact, for really large ones you need to develop your own, suited to the text concerned. Some frequency analysis may assist here. Also see perlindex, which uses the __DATA__ area as a store for a longer list.
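
        The frequency analysis mentioned above could start from something like this sketch. The function names are hypothetical, and "most frequent words" is only a first approximation of a stop list; the candidates still need a human to review them.

```perl
use strict;
use warnings;

# Count how often each lower-cased word occurs in the given lines.
sub word_frequencies {
    my (@lines) = @_;
    my %count;
    for my $line (@lines) {
        $count{ lc $_ }++ for $line =~ /(\w+)/g;
    }
    return \%count;
}

# Return the $n most frequent words (ties broken alphabetically);
# these are candidates for a corpus-specific stop list.
sub stop_candidates {
    my ( $count, $n ) = @_;
    my @ranked =
        sort { $count->{$b} <=> $count->{$a} || $a cmp $b } keys %$count;
    $n = @ranked if $n > @ranked;
    return @ranked[ 0 .. $n - 1 ];
}
```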

        My preferred technique with a corpus of plain text is actually to convert it (using perl, naturally) into HTML, inserting copious anchors for indexed points. This means I can view segments in a browser for context checking.

        (I assume you can always convert back, recording say, para numbers, if you need to have text back.)
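
        As an illustration of the conversion step, the sketch below wraps each paragraph in a numbered element that an index entry can link to. The "pN" id scheme and the function names are assumptions made for the example, not the poster's actual tooling.

```perl
use strict;
use warnings;

# Escape the characters that are special in HTML text.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Turn plain text into HTML, one <p id="pN"> per blank-line-separated
# paragraph, so an index entry can point at "file.html#pN".
sub text_to_html {
    my ($text) = @_;
    my @html;
    my $n = 0;
    for my $para ( split /\n\s*\n/, $text ) {
        $para =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
        next if $para eq '';
        $n++;
        push @html, sprintf '<p id="p%d">%s</p>', $n, escape_html($para);
    }
    return join "\n", @html;
}
```

        Since the paragraph number is in the id, converting back to plain text with paragraph numbers only requires reading the same markers.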

        Frankly, the above for me is the easy bit. The hard bit is the establishment of context for an index marker, and the correct addition of synonyms to the index for extra terms not otherwise included in the text. That's why I find the HTML conversion and viewing really works best for me. There's still no substitute for human judgement on the context indexing question...

        WordNet modules may be the answer to the synonym problem here. That's the bit I'm looking at now.
