FAST way to pull multiple lines around a keyword

by rizzy (Sexton)
on Oct 19, 2010 at 17:57 UTC (#866136=perlquestion)
rizzy has asked for the wisdom of the Perl Monks concerning the following question:

I need to search a very large number of HTML files for several keywords and save the paragraph containing the word(s) (the line before and after will do). I can do it with the following, but it takes a great deal of time. Given that I need to process hundreds of thousands of files, the minute or so that each one takes is unacceptable (it will take months at this rate):

    #!/usr/bin/perl -w
    use strict;
    use LWP::Simple;

    open ("output","> outputfile.txt") || die ("Could not open output file $!");
    my $html = get("http://www.htmladdress.com/file.html") or die "Couldn't fetch the site.";
    while($html=~ m{(.+\n.+(key\sword1|key\sword2|key\sword3).+\n.+)}gim){
        my $text =$1;
        $text =~ tr[\n][ ];
        print output "$text\n";
    }
    close ("output");

I've made it simpler than it is in practice. Basically, I get the HTML file and search for the following sequence: a line, a line break, a line with one of my keywords, a line break, and another line. I think it is taking a long time because I've included such a long sequence of characters to search for. If I don't tell it to look for the surrounding lines (i.e., .+\n.+), it is much quicker (seconds versus minutes).

Ideally, I'd like to identify only my keyword and then save the previous line, the current line(s), and the subsequent line. Does anybody know a way to do this that would speed things up? Also, I want to be able to match a phrase across line breaks, so this might complicate things.

Any help would be greatly appreciated!

Re: FAST way to pull multiple lines around a keyword
by CountZero (Bishop) on Oct 19, 2010 at 18:22 UTC
    If speed is of the essence and only relatively few of your files contain the keywords you are looking for, then I would go for a quick-and-dirty check for these keywords without bothering about the surrounding lines. Just save the names of the matching files to a list and have another script extract the keywords and surrounding lines from the files listed.

    Also note that alternations are rather slow in a regexp, and by using Regexp::Assemble you may be able to construct a regexp that runs faster.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
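The two-pass idea in the reply above could be sketched roughly as follows. This is only an illustration under stated assumptions: the keyword patterns, the sub names `mentions_keyword` and `files_to_revisit`, and the file names are placeholders, and plain core Perl is used instead of Regexp::Assemble so the sketch runs without CPAN modules.

```perl
use strict;
use warnings;

# Hypothetical keyword patterns; \s+ also lets a phrase span a line break.
my @keywords    = ('key\s+word1', 'key\s+word2', 'key\s+word3');
my $any_keyword = join '|', @keywords;
my $prefilter   = qr/(?:$any_keyword)/i;

# Pass 1: cheap yes/no test, no surrounding lines captured.
sub mentions_keyword {
    my ($html) = @_;
    return $html =~ $prefilter ? 1 : 0;
}

# Given { name => html }, return the names that deserve the slower
# second pass (the one that extracts the surrounding lines).
sub files_to_revisit {
    my ($html_by_name) = @_;
    my @hits = grep { mentions_keyword($html_by_name->{$_}) }
               sort keys %$html_by_name;
    return @hits;
}
```

In a real run the returned names would be written to a file, and a second script would do the context extraction only for those documents.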

      Thanks. I think I may do what you suggest.

      One thing that DID speed things up a great deal was adding a line break (\n) at the beginning and end of the match. That way there is only one possible match for each keyword. Before, it was trying every possible combination of characters from the previous line and the subsequent line (i.e., .+) and then picking the longest one. Including the line break explicitly gives it only one choice.
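A minimal sketch of what that change might look like, assuming the same placeholder keywords as in the question. The sub name `context_blocks` is made up for illustration. One caveat of this form: because the pattern consumes a newline on each side, a keyword on the second line of the file, or two keyword lines whose windows share a newline, can be missed.

```perl
use strict;
use warnings;

# Placeholder keywords, as in the question.
my $keyword = qr/key\sword1|key\sword2|key\sword3/i;

# The leading and trailing \n pin the match to whole lines, so most
# starting positions are rejected immediately.
sub context_blocks {
    my ($html) = @_;
    my @blocks;
    while ($html =~ m{\n(.+\n.+(?:$keyword).+\n.+)\n}g) {
        (my $text = $1) =~ tr/\n/ /;    # join the three lines with spaces
        push @blocks, $text;
    }
    return @blocks;
}
```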

Re: FAST way to pull multiple lines around a keyword
by moritz (Cardinal) on Oct 19, 2010 at 18:23 UTC
    "I think it is taking a long time"

    My alarm bells are ringing - optimizing based on unverified assumptions is a very bad idea. Test your assumption with a profiler like Devel::NYTProf before trying to improve anything.
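Devel::NYTProf is usually run as `perl -d:NYTProf yourscript.pl` followed by `nytprofhtml` to produce a report. As a cruder first check, the download and the search can be timed separately with the core Time::HiRes module. The sketch below is an assumption-laden illustration: `timed`, `fetch_page` and `search_page` are made-up names, and the fetch is a stand-in that builds a string instead of calling LWP::Simple's `get`.

```perl
use strict;
use warnings;
use Time::HiRes ();

# Run a code ref and return (elapsed seconds, its return value).
sub timed {
    my ($code) = @_;
    my $start  = Time::HiRes::time();
    my $result = $code->();
    return (Time::HiRes::time() - $start, $result);
}

# Stand-ins for the real steps: the real script would download the page
# in fetch_page and run the keyword regex in search_page.
sub fetch_page {
    return join '', map { "line $_ of filler text\n" } 1 .. 1000;
}

sub search_page {
    my ($text) = @_;
    my $count = () = $text =~ /line \d+0 /g;    # count the matches
    return $count;
}

my ($fetch_secs,  $html)  = timed(\&fetch_page);
my ($search_secs, $count) = timed(sub { search_page($html) });
printf "fetch: %.4fs, search: %.4fs, matches: %d\n",
    $fetch_secs, $search_secs, $count;
```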

    I personally would expect the download to be much slower than the searching; if that turns out not to be true, my next attempt would be to split on newlines, search each line, and if you have a match, also look at the lines before and after.

    Perl 6 - links to (nearly) everything that is Perl 6.
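The split-and-scan approach suggested in the reply above might look like the following sketch. The keywords are the placeholders from the question and `lines_with_context` is a made-up name. Note that this version tests one line at a time, so it will not find a phrase that is broken across two lines.

```perl
use strict;
use warnings;

# Placeholder keywords, as in the question.
my $keyword = qr/key\sword1|key\sword2|key\sword3/i;

# Return one string per matching line: the previous line, the matching
# line and the next line joined by spaces. Missing neighbours at the
# start or end of the text are simply left out.
sub lines_with_context {
    my ($html) = @_;
    my @lines = split /\n/, $html;
    my @blocks;
    for my $i (0 .. $#lines) {
        next unless $lines[$i] =~ $keyword;
        my $from = $i > 0       ? $i - 1 : $i;
        my $to   = $i < $#lines ? $i + 1 : $i;
        push @blocks, join ' ', @lines[$from .. $to];
    }
    return @blocks;
}
```

Each line is inspected once, so the cost grows linearly with the size of the page rather than depending on regex backtracking.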
Re: FAST way to pull multiple lines around a keyword
by ig (Vicar) on Oct 19, 2010 at 19:31 UTC

    In addition to the other good advice, you might consider anchoring your RE and using possessive quantifiers.

              s/iter    old    new
    old         1.60     --   -99%
    new     2.00e-02  7895%     --

    I haven't tested carefully, but in simple cases the modified expression appears to give the same results, performance aside.
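The code that produced the timings in this reply is not preserved in the text above, so the following is only a guess at the kind of comparison meant. It assumes Perl 5.10 or later for possessive quantifiers; the `$new` pattern, the test data and the helper `matches` are illustrative, not the reply's actual expression.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical test data: long filler lines, then one keyword line.
my @filler = ('x' x 200) x 100;
my $html   = join "\n", @filler, 'before', 'has key word2 here', 'after', '';

# Pattern from the question.
my $old = qr{(.+\n.+(key\sword1|key\sword2|key\sword3).+\n.+)}im;

# Assumed variant: ^ (with /m) restricts attempts to line starts, and the
# possessive .++ never gives back characters it has already matched.
my $new = qr{(^.++\n.+?(key\sword1|key\sword2|key\sword3).++\n.++)}im;

# Collect every three-line block the pattern finds in the text.
sub matches {
    my ($re, $text) = @_;
    my @found;
    while ($text =~ /$re/g) { push @found, $1 }
    return @found;
}

# Run each variant for at least one CPU second and print a comparison.
cmpthese(-1, {
    old => sub { matches($old, $html) },
    new => sub { matches($new, $html) },
});
```

The lazy `.+?` is used on the keyword line because a possessive quantifier there would swallow the keyword and prevent any match.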

Re: FAST way to pull multiple lines around a keyword
by aquarium (Curate) on Oct 19, 2010 at 22:16 UTC
    If performance is key, perhaps you could do this with grep, the command-line utility. Its -B and -A options print the required number of lines before and after each match. If you have more code than just this matching business, you could call grep via system or backticks from within a Perl script, which can then do whatever it wants with the matching and contextual lines.
    the hardest line to type correctly is: stty erase ^H
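Calling grep from Perl, as the reply above suggests, could be sketched like this. It assumes the pages are already saved to disk and that a grep supporting -A and -B (GNU and BSD grep both do) is on the PATH. The sub name `grep_with_context` is made up. Since grep works line by line, it will not match a phrase that is split across a line break.

```perl
use strict;
use warnings;

# Run grep with one line of context before (-B 1) and after (-A 1) each
# match. -i ignores case and -E enables alternation with |.
sub grep_with_context {
    my ($pattern, @files) = @_;
    # List-form pipe open: the pattern is not interpreted by a shell.
    open(my $grep, '-|', 'grep', '-i', '-E', '-B', '1', '-A', '1',
         '-e', $pattern, '--', @files)
        or die "Could not run grep: $!";
    my @lines = <$grep>;
    close $grep;    # a non-zero exit status may only mean "no match"
    chomp @lines;
    return @lines;
}
```

With several files, grep prefixes each output line with the file name and separates groups of context lines with `--`, so the caller would need to allow for that.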
