PerlMonks

Optimize my foreach loop / code

by rmocster (Novice)
on Aug 20, 2016 at 02:00 UTC [id://1170107]

rmocster has asked for the wisdom of the Perl Monks concerning the following question:

Hello everyone, happy Friday.

Please optimize my code or give suggestions on how to do it. On a big array (35k elements), it takes over 1.5 minutes to compute. Any suggestion is greatly appreciated. I have included a section of my code and an example of the array @all_listing (first line). Thank you.

@all_listing = ('folder/file.000001.jpg', 'folder/file.000002.jpg', ... , 'folder/file.039000.jpg', 'folder/file.040000.jpg');

foreach $kk (@all_listing) {
    next if $kk eq '';                              # skip empty lines
    next if ($kk =~ m|(.+)\/\..+| and $hidden);     # omit hidden (dot) files

    if ($kk =~ m|(.+)\/(\d+)\.(.+$)|) {             # match image sequence without prefix
        $path = $1; $num = $2; $ext = $3;
        print "$path: $num.$ext\n" if ($debug);
        $pathpre = "$path/$ext";    # path and ext as the key; ext is included so that 2 sequences with the same prefix but different extensions get separate keys
        print ">> $pathpre\n" if ($debug);
        if (exists $seq{$pathpre}) {
            push (@range_tmp, $num);          # add element to array
        } else {
            $seq{$pathpre} = $num;            # create hash entry
            @range_tmp = $num;                # reset array and add new element to it
        }
        $new_seq{$pathpre} = [@range_tmp];    # create new seq hash
        $ext{$pathpre} = $ext;                # create hash of extensions
    } elsif ($kk =~ m|(.+)\/(.+?)([\._]+)(\d+)\.(.+$)|) {    # match image sequence with prefix and divider
        $path = $1; $pre = $2; $div = $3; $num = $4; $ext = $5;
        print "$path: $pre$div$num.$ext\n" if ($debug);
        $pathpre = "$path/$pre$div$ext";    # path, prefix, divider and ext as the key
        print ">> $pathpre\n" if ($debug);
        if (exists $seq{$pathpre}) {
            push (@range_tmp, $num);
        } else {
            $seq{$pathpre} = $num;
            @range_tmp = $num;
        }
        $new_seq{$pathpre} = [@range_tmp];
        $ext{$pathpre} = $ext;
    } elsif ($kk =~ m|(.+)\/(.+?)(\d+)\.(.+$)|) {    # match most image sequences, except e.g. 7R01.0118762.dpx (number before dot); the regex above takes care of that case
        $path = $1; $pre = $2; $num = $3; $ext = $4;
        print "$path: $pre$num.$ext\n" if ($debug);
        $pathpre = "$path/$pre$ext";    # path, prefix and ext as the key
        print ">> $pathpre\n" if ($debug);
        if (exists $seq{$pathpre}) {
            push (@range_tmp, $num);
        } else {
            $seq{$pathpre} = $num;
            @range_tmp = $num;
        }
        $new_seq{$pathpre} = [@range_tmp];
        $ext{$pathpre} = $ext;
    } else {
        push (@no_match, $kk);
    }
}

Replies are listed 'Best First'.
Re: Optimize my foreach loop / code
by davido (Cardinal) on Aug 20, 2016 at 04:09 UTC

    You have a single loop, so the amount of work you are doing increases in direct relationship to the number of filenames you are running through. Your lookups use hash keys which means that as your data set grows the lookup times won't grow significantly. You are pushing onto a couple of arrays, which doesn't cost you any significant growth pains for a dataset of 36000 file names, assuming you have a computer made this decade.

    So about all that leaves within the code you demonstrated is how much time it takes to run the regexes on the filenames. In the case of your first regex, there's no need to capture, and no need for quantifiers: m|.+/.+| matches exactly the same strings as m|./.|. The reason is that if "one or more" characters on either side of a slash match, then a single character on either side of the slash would also match, and vice versa. So that regex has a small amount of room for optimization.
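    A minimal sketch of that simplification, assuming the check is only ever used as a true/false test (as it is in the posted loop):

        # As posted: the capture and the '+' quantifiers add work to what is a yes/no test,
        # and the regex runs even when $hidden is false.
        next if ($kk =~ m|(.+)\/\..+| and $hidden);

        # Matches the same set of strings with no capture, and skips the regex
        # entirely whenever $hidden is false:
        next if ($hidden and $kk =~ m|./\..|);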

    Are you convinced through actual testing that this is the segment of code that takes all the time? Profiling would tell you, but even the minimal change of adding a time call to either side of the loop would tell you. It's possible that we're looking at the wrong code here.
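    As a rough sketch of that minimal change, bracketing the loop with Time::HiRes (a core module) would be enough to confirm where the time goes:

        use Time::HiRes qw(gettimeofday tv_interval);

        my $t0 = [gettimeofday];
        foreach $kk (@all_listing) {
            # ... existing loop body, unchanged ...
        }
        printf "loop took %.3f seconds for %d names\n",
            tv_interval($t0), scalar @all_listing;

    Devel::NYTProf (run as perl -d:NYTProf yourscript.pl, then nytprofhtml) would give a line-by-line breakdown if the coarse timing isn't conclusive.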

    If it turns out that this really is your bottleneck, see if you can further reduce the number of files you have to iterate over. Maybe your listing doesn't need to be quite as inclusive.


    Dave

      Thanks for your reply.

      It is this loop that takes most of the time. Unfortunately, the input image array can grow to as many as 400k elements (files). I am hoping there is a better way to improve the hash assignments and/or regexes.

      Best!

        As I tried to illustrate, you will not find optimizations for this existing work-flow that attain an order-of-magnitude improvement. It is improbable that you could even cut the time in half.

        What if you build an index from each file as it comes in, rather than doing a huge chunk of files all at once? Gather whatever meta-data you need on each file as it arrives, and shove that data into a database that you can query as needed. This will spread the computational workload over a longer period of time, and make tallying of results very fast.
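        A rough sketch of that shape, assuming DBD::SQLite is available (the table layout, database file name, and index_file helper are illustrative only):

            use DBI;

            # A small on-disk index, built up incrementally as files arrive.
            my $dbh = DBI->connect("dbi:SQLite:dbname=file_index.db", "", "",
                                   { RaiseError => 1, AutoCommit => 0 });

            $dbh->do(q{
                CREATE TABLE IF NOT EXISTS frames (
                    pathpre TEXT,   -- path + prefix + extension key
                    num     TEXT    -- zero-padded frame number
                )
            });

            my $ins = $dbh->prepare("INSERT INTO frames (pathpre, num) VALUES (?, ?)");

            # Call this once per file when it shows up, instead of reparsing
            # hundreds of thousands of names in one batch later.
            sub index_file {
                my ($name) = @_;
                if ($name =~ m|(.+)/(.+?)([._]+)(\d+)\.(.+)$|) {
                    $ins->execute("$1/$2$3$5", $4);
                }
            }

            index_file($_) for @incoming_files;   # whatever feeds you new names
            $dbh->commit;

            # Tallying the sequences later is then a quick aggregate query:
            my $ranges = $dbh->selectall_arrayref(
                "SELECT pathpre, MIN(num), MAX(num), COUNT(*) FROM frames GROUP BY pathpre"
            );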


        Dave

Re: Optimize my foreach loop / code
by Anonymous Monk on Aug 20, 2016 at 03:56 UTC

    Hi, how fast does this run for you?

    sub ff { print scalar gmtime, "\n"; }
    ff;
    @g = (1 .. 35_000);
    for (@g) {
        /(\d+)/  and $ff{$1} = $1;
        /(\d+?)/ and $ff{$1} = $1;
        /(\d+)/  and $ff{$1} = $1;
        /(\d+?)/ and $ff{$1} = $1;
    }
    ff;
    __END__

      Less than 1 second.

      Thu Aug 25 20:29:05 2016
      Thu Aug 25 20:29:06 2016
