http://www.perlmonks.org?node_id=773848

AI Cowboy has asked for the wisdom of the Perl Monks concerning the following question:

I ask for the wisdom of those who use Perl more than I do, and for their opinion on whether the following will work:

while (<>) {
    $foundit = 0;
    $correct = 0;    # reset for each question
    chomp;
    $input = $_;
    if ($input =~ /what is/i or $input =~ /who is/i or $input =~ /tell me about/i) {
        $input =~ s/what is//ig;
        $input =~ s/who is//ig;
        $input =~ s/tell me about//ig;
        $input =~ s/\?//g;
        $input =~ s/a //i;
        ($part1, $part2, $part3) = split " ", $input;

        open(FILEHANDLE, "<", "enwikisource-20090621-pages-articles.xml")
            or die "Can't open dump: $!";
        while (<FILEHANDLE>) {
            if ($correct == 0) {
                # Look for a <title> line matching the question words.
                $correct = 1
                    if /<title>$input/i
                    or /<title>$part1 $part2/i
                    or /<title>$part2 $part3/i
                    or /<title>$part1/i
                    or /<title>$part2/i
                    or /<title>$part3/i;
                next;
            }
            # After the title has matched, grab the first paragraph line.
            if (/<p>/i and $foundit == 0) {
                $foundit = 1;
                $test = $_;
                last;
            }
        }
        close FILEHANDLE;

        # Strip the markup, keeping only the text between tags.
        ($crap, @goodstuff) = split ">", $test;
        foreach $item (@goodstuff) {
            ($finalgoodstuff, $crap) = split "<", $item;
            $beststuff .= $finalgoodstuff;
        }
        print "\n$beststuff\n";
        $beststuff = "";
        $finalgoodstuff = "";
    }
}

Will this work? I am working on a 2.4 GB file (all of Wikipedia) and my RAM is only 2.0 GB, so I need your help to know whether this will work or crash my machine. Best case (working like I think it does), it should stop reading when the file gets to the right point; worst case, I run out of RAM and something bad happens...

If anyone could help me, I would be eternally grateful. Thanks!!


Re: HUGE file poses risk to testing out code... need professional look-see
by graff (Chancellor) on Jun 23, 2009 at 05:36 UTC
    I gather that the file named "enwikisource-20090621-pages-articles.xml" is the huge 2.4 GB thing. There are two basic problems with the OP code:

    1. You should be using an XML parsing module to read that file; I would recommend the fundamental XML::Parser, which offers plenty of simplicity and flexibility, including a "Stream" style of processing, so that the whole file doesn't need to be memory-resident all at one time.

    2. You should do one pass over the big file to build an index based on the contents of the "title" elements, so that each index entry stores the location (byte offset) and size (byte count) of the content associated with each title; then as you read your set of query inputs, search the index for matches to the question text; if there's a match, just seek to the corresponding byte offset, read the specified number of bytes from that offset, and process that content for presentation as the "answer".

    Figure out what tag it is that contains both a "title" and the sequence of "paragraphs" associated with a title. For each end-tag of that type, output an index entry that says what the title is, and what the byte range is for the whole container element.
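A minimal sketch of that indexing pass, using only core Perl. It assumes the container element is `<page>` and that each `<title>` sits on its own line (typical of MediaWiki dumps, but verify against your file); the tab-separated index layout and the filenames are illustrative, not prescribed:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One pass over the big dump: for each <page>...</page> container,
# write an index line of "title<TAB>byte offset<TAB>byte count".
sub build_index {
    my ($xml_file, $index_file) = @_;
    open my $in,  '<', $xml_file   or die "Can't read $xml_file: $!";
    open my $out, '>', $index_file or die "Can't write $index_file: $!";

    my ($start, $title);
    my $offset = 0;    # byte offset of the current line
    while (my $line = <$in>) {
        if ($line =~ /<page>/) {
            $start = $offset;                       # container begins here
        }
        elsif ($line =~ m{<title>([^<]*)</title>}) {
            $title = $1;
        }
        elsif ($line =~ m{</page>} && defined $start) {
            my $end = $offset + length $line;       # container ends here
            print $out "$title\t$start\t", $end - $start, "\n";
            ($start, $title) = (undef, undef);
        }
        $offset += length $line;
    }
    close $in;
    close $out or die "Can't finish $index_file: $!";
}
```

Run it once (it takes a while over 2.4 GB, but only once), e.g. `build_index('enwikisource-20090621-pages-articles.xml', 'titles.idx');` — afterwards every query touches only the small index file.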

    Searching the index for hits on a given query (and determining whether there are no hits at all) will be a lot quicker and more efficient than scanning the whole big 2.4 GB source file; you can probably load the entire index into memory at one time if you want (since it's only titles and byte offsets).

    Then, seeking into the big file to a designated start point and reading a given number of bytes will also be very quick, and if your indexing step was done properly, this portion of the file will, by itself, be a well-formed xml string which you can parse in order to present some subset (e.g. just the first paragraph).

    Use Super Search to look for nodes that show code using XML::Parser, and read its manual.

    (update: other monks with broader experience in search-engine development can probably recommend modules that do a lot of the work for building and querying an index; e.g. you might want to look at KinoSearch. The OP approach to manipulating the query string seems rather coarse, and you could use some help with that part as well. And... what if the query doesn't match any title words, but does match some relevant terms in the paragraph data? Wouldn't a generic Lucene-style index and relevance search be more useful?)