
Grep Speeds

by ImpalaSS (Monk)
on Feb 06, 2001 at 19:55 UTC ( #56676=perlquestion )

ImpalaSS has asked for the wisdom of the Perl Monks concerning the following question:

Hello All,
I currently have 2 large files, about 65,000 lines each. The files contain lines of data, with fields separated by "pipes". The first 3 fields are what differentiates each line: there is one line for every cell site in the Nextel system, per half hour, for the last 24 hours. So, for example, the first line would look like:
20035pa|02/06/2001|8:30| from here on is about 60 more fields of data

In any case, I have a program which searches this file, but it runs extremely slowly. I believe this happens because once it finds the search string, it still continues to look through the entire file for it. However, the files only contain one line for each site/date/half hour combination. Here is the relevant code:
foreach $item (@timearray) {
    $searchstring = "$NETID\|$month\/$date\/$year\|$item\|";
    chomp( $ECL = `grep "^$searchstring" /PHL/data1/PHL/tmp/ECL_STAT.txt` );
    if ( $ECL == "" ) {
        push @ECL, "||$item||";
    } else {
        push @ECL, $ECL;
    }
}
The variables are:
$NETID = the network id, which would correspond to 20035pa in the above example.
$month, $date, $year = would correspond to 02/06/2001 in the above example.
$item = the time, which would be 8:30 above.
I am using Unix grep in the program. My question is: would using Perl's grep speed up the process? Any tips on speeding it up?
Here are some solutions I came up with; I'm not sure how effective they are, or whether they are even possible:

1: Start the grep where it left off with the next search string, because the files are sorted by half hour. (Is that possible to do?)
2: Terminate the grep once the string is found, then pick up with the next string (in which the $item would have changed).

(A sketch of both ideas follows below.)

I don't prefer Unix grep over Perl's grep; I'm just looking for ways to speed up the program.
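
Something like this is what I have in mind for ideas 1 and 2 (an untested sketch; it assumes the file is sorted by half hour in the same order as @timearray, and uses the same variables as above):

open FILE, "/PHL/data1/PHL/tmp/ECL_STAT.txt" or die "Can't open: $!";
my $pos = 0;
foreach $item (@timearray) {
    # index() needs no backslashes: the search string is taken literally
    $searchstring = "$NETID|$month/$date/$year|$item|";
    seek FILE, $pos, 0;           # idea 1: resume where the last search ended
    my $hit;
    while (<FILE>) {
        if ( index( $_, $searchstring ) == 0 ) {
            chomp( $hit = $_ );
            $pos = tell FILE;     # the next string can only appear after this
            last;                 # idea 2: stop as soon as the string is found
        }
    }
    push @ECL, defined $hit ? $hit : "||$item||";
}
close FILE;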

Thanks In Advance


Dipul

Replies are listed 'Best First'.
Re: Grep Speeds
by merlyn (Sage) on Feb 06, 2001 at 20:10 UTC
    If your solution works, but is too slow, it's time to change the algorithm, or restructure the data to cache more of the results.

    One possible suggestion would be to stop searching. Use an index of some kind. If your data can be flattened to "key maps to value" semantics, then a simple DBM will do. Otherwise, you should look at something like MySQL.
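
    For instance, a minimal sketch of that with DB_File (the file name 'ecl_stat.db' is hypothetical; this assumes the first three pipe-separated fields form the lookup key, as in the question):

    use DB_File;

    # one-time indexing pass: key = first three fields, value = whole line
    tie my %by_key, 'DB_File', 'ecl_stat.db' or die "Can't tie: $!";
    open IN, "</PHL/data1/PHL/tmp/ECL_STAT.txt" or die "Can't open: $!";
    while (<IN>) {
        chomp;
        my ($id, $date, $time) = split /\|/;
        $by_key{"$id|$date|$time|"} = $_;
    }
    close IN;
    untie %by_key;

    # afterwards, each query is a single hash fetch instead of a file scan
    tie my %lookup, 'DB_File', 'ecl_stat.db' or die "Can't tie: $!";
    print $lookup{"20035pa|02/06/2001|8:30|"}, "\n";
    untie %lookup;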

    -- Randal L. Schwartz, Perl hacker

      <assumptions>
        2 files x 65_000 lines x 60 fields x 8 chars = ~60MB data
        |-delimited ASCII
        search query is field 1
        data contains no 'escaped' |'s (e.g. \| or "xx|xx")
      </assumptions>

      Given the relatively 'small' (meaning under 128MB) database size, the questions would be 'how often do you need to search it?' and 'how often does the data change?'

      As merlyn pointed out, you could index this file based upon whatever field(s) you are searching by; in this example, the ECL.

      N.B. also that it would be simplest if (as one might assume) the data is only appended to (and never 'changed'), so that you can scan only the new data into the DB index. This example assumes that you get 'new files', e.g. by rotating files out before indexing. If that is not the case (if the files are appended to in place), keeping track of the length of the file at the time it was indexed, and then using seek() to begin indexing after that point, would be more effective. (To 'reset' this index, just remove the db file.)
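
      A rough sketch of that append-in-place case (the "$filename.offset" state file is hypothetical; this assumes the data file is strictly appended to and never rewritten):

      use DB_File;

      # read back how far we had indexed last time (0 on the first run)
      my $offset = 0;
      if ( open OFF, "<$filename.offset" ) {
          chomp( $offset = <OFF> );
          close OFF;
      }

      tie my %ecl, 'DB_File', "$filename.db" or die "Can't tie: $!";
      open ASCII, "<$filename" or die "Can't open $filename: $!";
      seek ASCII, $offset, 0;      # skip the already-indexed data
      while (<ASCII>) {
          chomp;
          next unless m{^ ([^\|]* \| [^\|]*) \|}x;
          $ecl{$1} = '' unless defined $ecl{$1};
          $ecl{$1} .= $_ . "\n";
      }
      $offset = tell ASCII;
      close ASCII;
      untie %ecl;

      # remember the new length for next time
      open OFF, ">$filename.offset" or die "Can't write offset: $!";
      print OFF "$offset\n";
      close OFF;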

      [mk_ecl_index]

      #!/usr/bin/perl
      use DB_File;

      for my $filename (@ARGV) {
          my %ecl;
          tie %ecl, 'DB_File', "$filename.db"
              or die "Can't tie $filename.db: $!";
          open ASCII, "<$filename"
              or die "Can't open $filename: $!";
          while (<ASCII>) {
              chomp;
              next unless m{^ ([^\|]* \| [^\|]*) \|}x; # first two fields
              # can't store refs in basic DB_File,
              # but data guaranteed not to contain \n, so... :-/
              $ecl{$1} = '' unless defined $ecl{$1};
              $ecl{$1} .= $_ . "\n";
          }
          close ASCII;
          untie %ecl;
      }

      [grep_ecl]

      #!/usr/bin/perl
      # n.b. args opposite of Unix grep: filename, query, q2...
      use DB_File;

      my $filename = shift;
      my %index;
      tie %index, 'DB_File', "$filename.db"
          or die "Can't tie to $filename.db: $!";
      for my $query (@ARGV) {
          if ( exists $index{$query} ) {
              print $index{$query}; # newlines already provided
          } else {
              print STDERR "$0: $filename: $query not found\n";
          }
      }
      untie %index;

      File locking is left as an exercise for the reader; if you index in a cron job or logrotate script, you'll likely need it.
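
      If you do need it, a minimal sketch with flock (assuming a separate lock file shared by the indexing and lookup scripts):

      use Fcntl qw(:flock);

      # hypothetical lock file guarding the .db while it is rebuilt
      open LOCK, ">$filename.lock" or die "Can't open lock file: $!";
      flock LOCK, LOCK_EX or die "Can't flock: $!";
      # ... index or query the DB here ...
      flock LOCK, LOCK_UN;
      close LOCK;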

      Hope that helps ;-)

Re: Grep Speeds
by jeroenes (Priest) on Feb 06, 2001 at 20:35 UTC
    This is a nice one. I would take another approach. 65,000 lines is not really a problem with decent memory, so I would just read it all in one swoop:
    use SuperSplit;

    my $AoA = supersplit_open( '|', '\n', $filename );
    my $ecl_hash;
    for my $line (@$AoA) {
        $ecl_hash->{ $line->[0] }{ $line->[1] }{ $line->[2] }
            = [ @{$line}[ 3 .. $#$line ] ];
    }
    my $sub_ecl = $ecl_hash->{$NETID}{"$month/$date/$year"};
    some_function( $sub_ecl->{$_} ) for keys %$sub_ecl;
    This prevents you from grepping for every item of your timearray, which indeed takes quite some time. The above returns something easy to process, but if you want more speed, the following is better and doesn't use SuperSplit:
    open DATA, $filename or die "Can't open $filename: $!";
    my %ecl_hash;
    my $str = "$NETID|$month/$date/$year";
    while (<DATA>) {
        next if index( $_, $str ) < 0;
        chomp;
        my $item = ( split /\|/ )[2];
        $ecl_hash{$item} = [] unless defined $ecl_hash{$item};
        push @{ $ecl_hash{$item} }, $_;
    }
    I use index here because it's faster than a regex match.
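
    You can verify that with the standard Benchmark module; a rough sketch (the sample line is made up, and the exact ratio will vary with line length and machine):

    use Benchmark qw(cmpthese);

    my $line = '20035pa|02/06/2001|8:30|' . join '|', (0) x 60;
    my $str  = '20035pa|02/06/2001';
    cmpthese( -2, {
        index_fn => sub { index( $line, $str ) >= 0 },
        regex    => sub { $line =~ /\Q$str\E/ },
    } );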

    This returns something similar, only the fields have not been split into separate arrays. But maybe you don't even want that.

    SuperSplit can be found on CPAN.

    Cheers,

    Jeroen
    "We are not alone"(FZ)

    Update: I ignored the fact that you want to search on the date/ID as well. Fixed in the last code block. Will be much faster, of course. What do you *really* want? You want to add an extra line to a report, with some summary? You wanna make some graphs? Depending on the question, you can decide whether you want the whole thing read in, or just a little piece, or whether you'd better go and put everything in a database, as merlyn suggested...

    Update2: after quite some CB, I think ImpalaSS'd better use the supersplit method and use the modified hash to access his data.

      Hey,
      Well, to answer your question: what the program does is, for each site/date/half hour, it grabs data from 2 different files. It then takes this data and performs a lot of calculations to print numbers such as dropped calls, total traffic, etc. As of yet, no graphs or charts; all the data is just dumped into the arrays, and then a subroutine takes the data, performs the calculations, and prints the results.

      Dipul
        As long as you don't want to do recalculations (as in deviations from the mean, or percentages from the max) you can stick with a database-less solution.

        If you already know that you want only the last lines of your file, why don't you just use open DATA, "tail -n $number $filename|";? That would speed things up, as Perl doesn't have to work through the whole file.

        Jeroen
        "We are not alone"(FZ)

Re: Grep Speeds
by arhuman (Vicar) on Feb 06, 2001 at 20:10 UTC
    What about this one-liner?

    Here are the main ideas:

    You read the file only once!
    For each line, you try to match each of your patterns...
    You don't use backquotes, as they spawn a shell, which is very CPU/memory consuming...

    perl -ne 'foreach $item (@timearray) {
        $searchstring = "$NETID\|$month\/$date\/$year\|$item\|";
        if (/^$searchstring/) { print }
    }' /PHL/data1/PHL/tmp/ECL_STAT.txt

    Of course it could be optimized; for example, I try to match all the remaining patterns even if I already got a match, which seems unnecessary... (see the sketch below).
    Anyway, IMHO your main mistake is opening/reading the file several times!
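
    A sketch of that early exit (still assuming, as above, that the question's variables are somehow visible to the one-liner):

    perl -ne 'foreach $item (@timearray) {
        $searchstring = "$NETID\|$month\/$date\/$year\|$item\|";
        if (/^$searchstring/) { print; last }   # skip the remaining patterns
    }' /PHL/data1/PHL/tmp/ECL_STAT.txt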

Re (tilly) 1: Grep Speeds
by tilly (Archbishop) on Feb 07, 2001 at 00:20 UTC
    Everyone has already pointed out the advantages of DBMs and databases. But if you just need to process the files once, it is probably most efficient to hash your search strings instead. Like this (untested):
    # Build a hash lookup for things I want, defaulting to not there
    my %found;
    my @search;
    foreach my $item (@timearray) {
        my $id = "$NETID\|$month\/$date\/$year\|$item\|";
        push @search, $id;
        $found{$id} = "||$item||";
    }

    # Scan the file for them
    my $file = "/PHL/data1/PHL/tmp/ECL_STAT";
    open(ECL_STAT, "< $file") or die "Cannot read $file: $!";
    while (<ECL_STAT>) {
        chomp;
        if (/(([^\|]+\|){3})/ and exists $found{$1}) {
            $found{$1} = $_;
        }
    }

    # Build output array
    push @ECL, map $found{$_}, @search;
    Now you only need to scan once, and only build in memory data structures for the actual data you are looking for. (Which is coming in the order you are looking for it in. This code would be easier if I could assume that order didn't matter.)
Re: Grep Speeds
by spaz (Pilgrim) on Feb 06, 2001 at 22:38 UTC
    My suggestion is to internalize your grep command, as such (I hope I'm using seek() properly here):
    open( FILE, "/PHL/data1/PHL/tmp/ECL_STAT.txt" )
        or die "Couldn't open ECL_STAT.txt: $!";
    foreach $item (@timearray) {
        seek( FILE, 0, 0 );
        my $found = 0;    # reset for each search string
        $searchstring = "$NETID\|$month\/$date\/$year\|$item\|";
        while (<FILE>) {
            if (/^$searchstring/) {
                push @ECL, $_;
                $found = 1;
                last;
            }
        }
        push @ECL, "||$item||" unless $found;
    }
    close( FILE ) or die "Couldn't close ECL_STAT.txt: $!";
    This is probably a bad suggestion, but it's what I think is right. -- Dave
Re: Grep Speeds
by runrig (Abbot) on Feb 06, 2001 at 23:11 UTC
    It's up to you to decide whether or not it's worthwhile putting the data in a database; it may depend on how frequently you run the program, how many searches you do during each run, etc. But it doesn't really take long to read each file into memory every time, so a quick 'n' dirty solution might be to read the file into a hash of hashes of hashes:
    my %log_data;
    open FH, $file or die "Can't open $file: $!";
    while (<FH>) {
        my ($net_id, $mdy, $item) = split /\|/;
        $log_data{$net_id}{$mdy}{$item} = $_;
    }
    # Then cycle through the data you're searching for
    # and save/process it if it's in the %log_data hash
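
    Lookups then become plain hash accesses; for example, with the variables from the original question:

    foreach my $item (@timearray) {
        my $line = $log_data{$NETID}{"$month/$date/$year"}{$item};
        push @ECL, defined $line ? $line : "||$item||";
    }
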
Re: Grep Speeds
by Albannach (Monsignor) on Feb 06, 2001 at 20:53 UTC
    I'm wondering why you are testing $ECL (which should be a string, unless you're not showing the grep options you are using) for numerical equality with ""? Wouldn't that always be true, or am I missing something here?

    Update: Ok, I understand what you want, but you'd better put an eq in there instead of ==, as ANY string will compare equal to "" numerically (i.e. "a456" == "" is true). The only case where that won't be true is where $ECL contains a number, or a string starting with a nonzero number, which is then interpreted as a plain number in numeric context. But all you are really trying to do is see if $ECL is empty, so use eq. If your data file is ever changed so that lines start with a non-number, your code will no longer work.

    Actually, if you ran your code with -w you'd get: Argument "" isn't numeric in numeric eq (==) at line xxx.
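
    A quick demonstration of the pitfall:

    #!/usr/bin/perl -w
    # "a456" and "" both numify to 0, so == says they are equal
    print "surprise\n" if "a456" == "";   # warns, then prints
    # eq compares as strings, which is what was intended
    print "empty\n"    if "a456" eq "";   # never prints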

    --
    I'd like to be able to assign to an luser

      Hey,
      Basically, with that line, it's saying that if a specific instance isn't found (for example, if a half hour for a certain date, i.e. 20042pa|02/06/2001|8:30, was missing), it would insert $item as a placeholder.

      Dipul
