Optimise the script

by Anonymous Monk
on Mar 30, 2011 at 11:32 UTC
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

This takes too long to process. Please tell me how this can be optimised.
Sample lines are below.
99.60.97.205 - - [26/Mar/2011:06:00:00 +0000] GET /2009-03-29/world/impact.row.atlantic_1_rower-paul-ridley-cancer-research?_s=PM:WORLD HTTP/1.1 200 9386 www.abc.com Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_6; en-US) AppleWebKit/534.13 (KHTML, like Gecko) Chrome/9.0.597.107 Safari/534.13 TCP_MISS Apache=- - 1068000 - - - deflate=- rmt=-
72.234.67.132 - - [26/Mar/2011:09:00:00 +0000] GET /ad-abc.php?f=medium_rectangle HTTP/1.1 200 869 www.abc.com Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.1.16) Gecko/20101130 Firefox/3.5.16 TCP_HIT Apache=- - 1000 - - - deflate=- rmt=-
68.12.178.167 - - [26/Mar/2011:09:00:00 +0000] GET /ad-feedback.js.php?e3e999d9b79cf36c165f5b379a0e9f269be82344 HTTP/1.1 200 600 www.abc.com Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB6.6; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729) TCP_HIT Apache=- - 1000 - - - deflate=- rmt=-
The file size is 16955714940 bytes (about 16 GB).
#!/usr/bin/perl
use strict;
use warnings;
use Date::Manip;

my $n = "4";
my $date_converted = UnixDate(ParseDate("$n days ago"), "%e/%h/%Y");

open FILE, "> file.txt";
open DATA, "input.txt";
while (<DATA>) {
    my @tab_delimited_array = split(/\t/, $_);
    $tab_delimited_array[3] =~ s/^\[//;
    $tab_delimited_array[3] =~ s/^\-//;
    my $converted_date = Date_ConvTZ(UnixDate($tab_delimited_array[3], "%Y%m%d%H:%M:%S"), 'GMT', 'PST');
    my $pst_converted_date = UnixDate($converted_date, "%e/%h/%Y:%H:%M:%S");
    $pst_converted_date =~ s/^\s//g;
    my $extracted_YMD = UnixDate($converted_date, "%e/%h/%Y");
    if ($extracted_YMD =~ m/$date_converted/) {
        print FILE $_;
    }
}
close FILE;
close DATA;

Re: Optimise the script
by Anonymous Monk on Mar 30, 2011 at 11:43 UTC
    Instead of converting every single timestamp from your logfile, generate a single timestamp ($converted_date) that matches your log format, and use it to do a string comparison instead of a date calculation.
      To elaborate, here is a command-line way of doing this:

      $ date -d "2 days ago" "+%e/%b/%Y"
      28/Mar/2011
      $ grep "28/Mar/2011" input.txt
      or combine it (might not work on every shell)
      $ grep `date -d "2 days ago" "+%e/%b/%Y"` input.txt
Re: Optimise the script
by JavaFan (Canon) on Mar 30, 2011 at 11:49 UTC
    Either remove some of your 16 GB of data, or write it in C. You may also consider caching the calls to UnixDate and Date_ConvTZ, but you then risk a performance loss due to the additional memory (it depends on the number of distinct values).

    I do notice two assignments to $pst_converted_date, but it's never used. And $extracted_YMD is UnixDate(Date_ConvTZ(UnixDate(...))); that should be possible with a single UnixDate call. And the final match could be an eq.
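    A minimal sketch of the caching idea, keyed on the date-plus-hour prefix of the timestamp (a GMT-to-PST shift can only move the date by whole hours). This is hedged: to_pst_ymd() is a Time::Local/POSIX stand-in for the UnixDate/Date_ConvTZ pipeline, it assumes a fixed -8 hour offset with no DST, and the file names and $target day are made up for illustration.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Time::Local qw(timegm);
    use POSIX qw(strftime);

    my %mon = (Jan => 0, Feb => 1, Mar => 2, Apr => 3, May => 4,  Jun => 5,
               Jul => 6, Aug => 7, Sep => 8, Oct => 9, Nov => 10, Dec => 11);

    # Stand-in for the UnixDate/Date_ConvTZ pipeline: parse the bracketed
    # timestamp, shift GMT to PST (fixed -8 hours, ignoring DST), format.
    sub to_pst_ymd {
        my ($d, $m, $y, $h) = $_[0] =~ m{(\d+)/(\w+)/(\d+):(\d+)};
        return strftime "%d/%b/%Y",
            gmtime(timegm(0, 0, $h, $d, $mon{$m}, $y) - 8 * 3600);
    }

    my $target = '26/Mar/2011';    # precomputed PST day to keep (assumed)
    my %cache;                     # hour prefix => converted PST date
    open my $in,  '<', 'input.txt' or die $!;
    open my $out, '>', 'file.txt'  or die $!;
    while (my $line = <$in>) {
        my $ts  = (split /\t/, $line, 5)[3];
        my $key = substr $ts, 1, 14;   # drop '[', keep e.g. "26/Mar/2011:06"
        my $ymd = $cache{$key} //= to_pst_ymd($ts);
        print $out $line if $ymd eq $target;
    }

    The cache holds at most 24 entries per day of log, so the memory risk mentioned above stays bounded with this key choice.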

      I'm with JavaFan. 16 GB is huge; that takes time. Maybe you could use multiple cores/threads and break the file into pieces, processing the pieces in parallel, or, if you had a "cloud" architecture, farm the pieces out to a hundred different machines. But I'm inclined to say the problem is better rearchitected before one goes into optimizing Perl.
Re: Optimise the script
by bart (Canon) on Mar 30, 2011 at 11:58 UTC
    It looks to me like you want to filter on the date (not the time) of the fourth column. Why not forgo converting this date before trying to match it, and directly test the contents of that field for a matching date? I assume that they're all the exact same format.

    Update: Well, since you are converting between time zones, the date might not be an exact match. But it won't differ by more than one day, so you could prefilter the lines, for example for the 26th matching /^\[(26|25|27)\b/ against the date field, and do a second filtering in your old way as a second step.
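    A short sketch of that two-pass filtering, reusing the DATA/FILE handles from the original script. This is hedged: exact_date_match() is a hypothetical stand-in for the original UnixDate/Date_ConvTZ test, and the day numbers are hard-coded for the 26th.

    # Stand-in for the slow, exact timezone-aware test; always true here.
    # Replace with the real conversion-based check.
    sub exact_date_match { 1 }

    while (<DATA>) {
        my $ts = (split /\t/, $_, 5)[3];
        next unless $ts =~ /^\[(26|25|27)\b/;    # cheap prefilter
        print FILE $_ if exact_date_match($ts);  # exact second pass
    }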

Re: Optimise the script
by happy.barney (Pilgrim) on Mar 30, 2011 at 12:01 UTC
    If I understood it correctly, you want lines matching a specific time interval:
    • get rid of Date::Manip, just calc your start time directly ($start = time - $n * 86_400)
    • use Time::Local::timegm to convert timestamp to epoch
    • compare epoch times (numbers)
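    Putting those three points together, a minimal sketch, assuming the bracketed timestamp format from the sample lines and a one-day window rounded down to UTC midnight:

    #!/usr/bin/perl
    # Usage (assumed): perl filter.pl input.txt > file.txt
    use strict;
    use warnings;
    use Time::Local qw(timegm);

    my %mon = (Jan => 0, Feb => 1, Mar => 2, Apr => 3, May => 4,  Jun => 5,
               Jul => 6, Aug => 7, Sep => 8, Oct => 9, Nov => 10, Dec => 11);

    my $n     = 4;
    my $start = time - $n * 86_400;   # $n days ago
    $start   -= $start % 86_400;      # round down to UTC midnight
    my $end   = $start + 86_400;      # one calendar day wide

    while (<>) {
        # Timestamp looks like [26/Mar/2011:06:00:00 +0000]
        next unless m{\[(\d+)/(\w+)/(\d+):(\d+):(\d+):(\d+)};
        my $epoch = timegm($6, $5, $4, $1, $mon{$2}, $3);
        print if $epoch >= $start && $epoch < $end;
    }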
Re: Optimise the script
by wind (Priest) on Mar 30, 2011 at 16:04 UTC

    As others have already said, if all you're trying to do is match a single day, I would create a regex that matches those preconverted days. Most likely you'd need two regexes, one to match the start of the day and one to match the end. If this test passed, you could then do the more costly date conversion to determine whether the line actually matches the proper date.

    However, the best thing you can do to cut your runtime is to rely on the fact that this data is probably ordered by date/time. Once you have found your given date and passed beyond it, you no longer need to process the rest of the file. Even better would be to create an index file of the positions where each PST date starts. You could update the index regularly by continuing from where it last left off, and then this script would be able to run almost immediately.
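    The "stop once you've left the date" part alone is only a few lines. A sketch, assuming a strictly ordered file and a $target string precomputed in the file's own date format:

    my $target = '26/Mar/2011';                  # assumed precomputed
    open my $in,  '<', 'input.txt' or die $!;
    open my $out, '>', 'file.txt'  or die $!;
    my $seen = 0;
    while (<$in>) {
        if (index($_, $target) >= 0) {           # cheap substring test
            $seen = 1;
            print $out $_;
        }
        elsif ($seen) {
            last;                                # past the matching block: stop
        }
    }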

    - Miller

    PS: Also note that your UnixDate conversion of the dates in the loop doesn't seem to be working right. The format you specified isn't actually the format in the file.
Re: Optimise the script
by davido (Archbishop) on Mar 30, 2011 at 22:58 UTC

    Unless your data is sorted and the file's lines (or records) are fixed in length, your solution will never be faster than O(n). However, there's a lot of room for improvement in runtimes even if magnitudes of work don't change.

    One aspect to consider is how often you expect to see a match in the 16 GB input file. If matching records are sparse, you can gain a lot by rejecting non-matches and short-circuiting the loop's iteration as early as possible. Instead of splitting the line, massaging $tab_delimited_array[3], and then running it through UnixDate and Date_ConvTZ before finally testing whether $date_converted is the same as $extracted_YMD, couldn't you massage your $date_converted into something that more closely approximates the raw format of the date in the 16 GB file? That would allow faster rejection of unneeded lines.

    Second, if it turns out that there are frequent matches in the file, you might be wasting time by printing often. You could push $_ onto a cache array, and then print the array every 1000 iterations, for example, with a final flush after the loop terminates. That would keep chunks small enough to avoid memory problems while reducing time spent in IO calls.
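    A sketch of that batching; line_matches() is a hypothetical stand-in for whatever test selects a line, the file names are assumptions, and 1000 is an arbitrary flush size:

    open my $in,  '<', 'input.txt' or die $!;
    open my $out, '>', 'file.txt'  or die $!;
    sub line_matches { $_[0] =~ m{26/Mar/2011} }  # placeholder test (assumed)

    my @buffer;
    while (my $line = <$in>) {
        next unless line_matches($line);
        push @buffer, $line;
        if (@buffer >= 1000) {           # flush in 1000-line chunks
            print {$out} @buffer;
            @buffer = ();
        }
    }
    print {$out} @buffer if @buffer;     # final flush after the loop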


    Dave

Re: Optimise the script
by Cristoforo (Deacon) on Apr 01, 2011 at 01:13 UTC
    Using different modules, Time::Local and POSIX, gives a direct way to compute the target date in UTC for comparison with the dates in the file, which are also UTC. That avoids all the function calls in the loop you built, which really slow the program down.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Time::Local qw/ timegm_nocheck /;
    use POSIX qw/ strftime /;

    my $n = 4;
    my ($d, $m, $y) = (gmtime)[3..5];
    my $gmtime = strftime "%d/%b/%Y", gmtime timegm_nocheck 0, 0, 0, $d - $n, $m, $y;
    $gmtime =~ s/^(?:0| )//;

    while (<DATA>) {
        print if /$gmtime\b/;
    }

    But to implement some of the time-saving 'tricks' suggested by others in this thread would involve a while loop with more code.

Re: Optimise the script
by Marshall (Prior) on Apr 02, 2011 at 21:54 UTC
    The code has a lot of futzing around with splits, substitutions and slow time conversion routines.

    I would get the time ASAP. split can be a time-consuming critter; cut it short by using the limit parameter:

    my @tab_delimited_array = split(/\t/, $_, 5);

    I agree with happy.barney. The routines with %Y, %H etc. are really slow compared with the low-level functions; strftime() is famous for being slow. I would send $tab_delimited_array[3] directly into code like the epoch routine below.

    timegm() is implemented very efficiently and it caches months that it has seen before - a few math operations and, bingo, you have the epoch time. Even if it happens to do some multiplies, that's no big deal, as on a modern Intel processor they are about the same speed as integer adds! There is a "no error checking" version of timegm that you can import, although I don't think you will need it.

    Calculate your search range in advance, convert it to epoch integers, and then a couple of integer compares get you to a yes/no decision quickly.

    To make things really fast, you will have to do some benchmarking. Run it without doing anything except reading the file; add in the split and see what that does; add in the time conversion and see what that does. Consider trying a regex to extract the date - sometimes that is faster than using split, but testing is required. Doing something like a binary search to get near the start of your search range has the potential to really speed things up, but at a huge increase in complexity (assuming this is an ordered file).

    #!/usr/bin/perl -w
    use strict;
    use Time::Local;
    #use Time::Local 'timegm_nocheck'; #faster non error checked version

    my %montext2num = ('Jan' => 0, 'Feb' => 1, 'Mar' => 2, 'Apr' => 3,
                       'May' => 4, 'Jun' => 5, 'Jul' => 6, 'Aug' => 7,
                       'Sep' => 8, 'Oct' => 9, 'Nov' => 10, 'Dec' => 11);

    my $x = epoch('[26/Mar/2011:06:00:00 ]');
    print "epoch=$x\n";

    sub epoch {
        my $log_time = shift;   # like [26/Mar/2011:06:00:00.....blah]
        my ($day, $mon, $year, $hour, $min, $sec) =
            $log_time =~ m|(\d+)/(\w+)/(\d+):(\d\d):(\d\d):(\d\d)|;
        my $month = $montext2num{$mon};
        return timegm($sec, $min, $hour, $day, $month, $year);
    }
    Update: I don't know who controls the time format - often we don't get a choice - but if you do, then something like YYYY-MM-DD HH:MM:SS, where the leading zeroes are important, is a good idea. "2011-03-26 14:01:35" can be compared directly against a similar string with lt, gt, eq (or ASCII-sorted) and the order will "work out" without any conversions. This format also translates very directly into many database time formats. Keep time in UTC (GMT) for all logging functions and translate into local time as needed for presentation.
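    For example, with that format a plain string comparison orders correctly (values made up for illustration):

    my $start = '2011-03-26 00:00:00';
    my $end   = '2011-03-27 00:00:00';
    my $ts    = '2011-03-26 14:01:35';
    print "in range\n" if $ts ge $start && $ts lt $end;  # lexicographic == chronological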
Re: Optimize apache access_log parsing with index file
by wind (Priest) on Apr 19, 2011 at 01:40 UTC

    Given that you're Anonymous, I doubt this result will ever be found. However, in some of my spare time I put together a script that indexes the Apache log file as I suggested in my earlier post.

    I also converted the script to use DateTime instead of Date::Manip as the former is faster.

    Note, the script took about 41 minutes per gig of data on my development machine, but then takes an infinitesimal amount of time for any subsequent run given the data is ordered and indexed.

    #!/usr/bin/perl

    use DateTime;
    use Fcntl qw(:seek);
    use strict;
    use warnings;

    my $infile    = 'access_log';
    my $indexfile = $infile . '.pst';
    my $outfile   = 'file.txt';

    my %mon = do {
        my $i = 1;
        map {$_ => $i++} qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)
    };

    my $search_date = DateTime->now()->subtract(days => 10)->strftime("%Y%m%d");
    print "Search date is $search_date\n";

    # Get Last Index Location
    my @index_last = ('', 0);
    my @index_start;
    my @index_stop;
    if (-e $indexfile) {
        open my $fh, $indexfile or die "$indexfile: $!";
        while (<$fh>) {
            chomp;
            @index_last  = split "\t";
            @index_stop  = @index_last if @index_start && !@index_stop;
            @index_start = @index_last if $index_last[0] eq $search_date;
        }
    }

    open my $oh, '>', $outfile or die "$outfile: $!";
    open my $ih, $infile or die "$infile: $!";

    my ($lastday, $index) = @index_start ? @index_start : @index_last;
    seek $ih, $index, SEEK_SET;

    while (<$ih>) {
        my $day;

        # If w/i indexes, no need to reparse day
        if (@index_stop) {
            # End reached
            last if $index >= $index_stop[1];
            $day = $search_date;

        # Parse Date
        } elsif (m{\[(\d+)/(\w+)/(\d+):(\d+):(\d+):(\d+)\s+([+-]\d+)}) {
            my $dt = DateTime->new(
                year      => $3,
                month     => $mon{$2},
                day       => $1,
                hour      => $4,
                minute    => $5,
                second    => $6,
                time_zone => $7,
            );
            $dt->set_time_zone('America/Los_Angeles');
            $day = $dt->strftime("%Y%m%d");

        } else {
            warn "Invalid date on: $_";
            next;
        }

        # New Date
        if ($day ne $lastday) {
            # Add to index if necessary
            if ($day > $index_last[0]) {
                @index_last = ($day, $index);
                open my $oh, '>>', $indexfile or die "$indexfile: $!";
                print $oh join("\t", @index_last), "\n";
                close $oh;
            }

            # End if past search date
            last if $day > $search_date;

            print "Processing $day, $index on " . scalar(localtime) . "\n";
            $lastday = $day;
        }

        # Matches search date
        if ($day eq $search_date) {
            # Do whatever
            print $oh $_;
        }

        $index = tell $ih;
    }

    close $ih;

    __END__
