PerlMonks

Optimize my foreach loop / code

by rmocster (Novice)
on Aug 20, 2016 at 02:00 UTC [id://1170107]

rmocster has asked for the wisdom of the Perl Monks concerning the following question:

Hello everyone, happy Friday.

Please optimize my code or give suggestions on how to do it. On a big array (35k elements), it takes over 1.5 minutes to compute. Any suggestion is greatly appreciated. I have included a section of my code and an example of the array @all_listing (first line). Thank you.

@all_listing = ('folder/file.000001.jpg', 'folder/file.000002.jpg', ... , 'folder/file.039000.jpg', 'folder/file.040000.jpg');

foreach $kk (@all_listing) {
    next if $kk eq '';                              # skip empty lines
    next if ($kk =~ m|(.+)\/\..+| and $hidden);     # omit hidden (dot) files

    if ($kk =~ m|(.+)\/(\d+)\.(.+$)|) {             # match image sequence without prefix
        $path = $1; $num = $2; $ext = $3;
        print "$path: $num.$ext\n" if ($debug);
        $pathpre = "$path/$ext";    # path and ext as the key; ext is included so that 2 sequences with the same prefix but different extensions get separate keys
        print ">> $pathpre\n" if ($debug);
        if (exists $seq{$pathpre}) {
            push (@range_tmp, $num);          # add element to array
        } else {
            $seq{$pathpre} = $num;            # create hash entry
            @range_tmp = $num;                # reset array and add new element to it
        }
        $new_seq{$pathpre} = [@range_tmp];    # create new seq hash
        $ext{$pathpre} = $ext;                # create hash of extensions
    } elsif ($kk =~ m|(.+)\/(.+?)([\._]+)(\d+)\.(.+$)|) {    # match image sequence with prefix and divider
        $path = $1; $pre = $2; $div = $3; $num = $4; $ext = $5;
        print "$path: $pre$div$num.$ext\n" if ($debug);
        $pathpre = "$path/$pre$div$ext";    # path, prefix, divider and ext as the key
        print ">> $pathpre\n" if ($debug);
        if (exists $seq{$pathpre}) {
            push (@range_tmp, $num);
        } else {
            $seq{$pathpre} = $num;
            @range_tmp = $num;
        }
        $new_seq{$pathpre} = [@range_tmp];
        $ext{$pathpre} = $ext;
    } elsif ($kk =~ m|(.+)\/(.+?)(\d+)\.(.+$)|) {    # match most image sequences, except e.g. 7R01.0118762.dpx (number before dot); the regex above takes care of that case
        $path = $1; $pre = $2; $num = $3; $ext = $4;
        print "$path: $pre$num.$ext\n" if ($debug);
        $pathpre = "$path/$pre$ext";    # path, prefix and ext as the key
        print ">> $pathpre\n" if ($debug);
        if (exists $seq{$pathpre}) {
            push (@range_tmp, $num);
        } else {
            $seq{$pathpre} = $num;
            @range_tmp = $num;
        }
        $new_seq{$pathpre} = [@range_tmp];
        $ext{$pathpre} = $ext;
    } else {
        push (@no_match, $kk);
    }
}

Replies are listed 'Best First'.
Re: Optimize my foreach loop / code
by davido (Cardinal) on Aug 20, 2016 at 04:09 UTC

    You have a single loop, so the amount of work you are doing increases in direct relationship to the number of filenames you are running through. Your lookups use hash keys which means that as your data set grows the lookup times won't grow significantly. You are pushing onto a couple of arrays, which doesn't cost you any significant growth pains for a dataset of 36000 file names, assuming you have a computer made this decade.

    So about all that leaves within the code you demonstrated is how much time it takes to run the regexes on the filenames. In the case of your first regex, there's no need to capture, and no need for quantifiers: m|.+/.+| matches exactly the same strings as m|./.|. The reason is that if "one or more" characters on either side of a slash match, then a single character on either side of the slash would also match, and vice versa. So that regex has a small amount of room for optimization.
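    A minimal sketch of that simplification, assuming the check is only ever used as a true/false test (as it is in the posted loop):

        # As posted: the capture and the '+' quantifiers add work to what is a yes/no test,
        # and the regex runs even when $hidden is false.
        next if ($kk =~ m|(.+)\/\..+| and $hidden);

        # Matches the same set of strings with no capture, and skips the regex
        # entirely whenever $hidden is false:
        next if ($hidden and $kk =~ m|./\..|);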

    Are you convinced through actual testing that this is the segment of code that takes all the time? Profiling would tell you, but even the minimal change of adding a time call to either side of the loop would tell you. It's possible that we're looking at the wrong code here.
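    As a rough sketch of that minimal change, bracketing the loop with Time::HiRes (a core module) would be enough to confirm where the time goes:

        use Time::HiRes qw(gettimeofday tv_interval);

        my $t0 = [gettimeofday];
        foreach $kk (@all_listing) {
            # ... existing loop body, unchanged ...
        }
        printf "loop took %.3f seconds for %d names\n",
            tv_interval($t0), scalar @all_listing;

    Devel::NYTProf (run as perl -d:NYTProf yourscript.pl, then nytprofhtml) would give a line-by-line breakdown if the coarse timing isn't conclusive.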

    If it turns out that this really is your bottleneck, see if you can further reduce the number of files you have to iterate over. Maybe your listing doesn't need to be quite as inclusive.


    Dave

      Thanks for your reply.

      It is this loop that takes most of the time. Unfortunately, the input image array can grow to as many as 400k elements (files). I am hoping there is a better way to improve the hash assignments and/or regexes.

      Best!

        As I tried to illustrate, you will not find optimizations for this existing work-flow that attain an order-of-magnitude improvement. It is improbable that you could even cut the time in half.

        What if you build an index from each file as it comes in, rather than doing a huge chunk of files all at once? Gather whatever meta-data you need on each file as it arrives, and shove that data into a database that you can query as needed. This will spread the computational workload over a longer period of time, and make tallying of results very fast.
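        A rough sketch of that shape, assuming DBD::SQLite is available (the table layout, database file name, and index_file helper are illustrative only):

            use DBI;

            # A small on-disk index, built up incrementally as files arrive.
            my $dbh = DBI->connect("dbi:SQLite:dbname=file_index.db", "", "",
                                   { RaiseError => 1, AutoCommit => 0 });

            $dbh->do(q{
                CREATE TABLE IF NOT EXISTS frames (
                    pathpre TEXT,   -- path + prefix + extension key
                    num     TEXT    -- zero-padded frame number
                )
            });

            my $ins = $dbh->prepare("INSERT INTO frames (pathpre, num) VALUES (?, ?)");

            # Call this once per file when it shows up, instead of reparsing
            # hundreds of thousands of names in one batch later.
            sub index_file {
                my ($name) = @_;
                if ($name =~ m|(.+)/(.+?)([._]+)(\d+)\.(.+)$|) {
                    $ins->execute("$1/$2$3$5", $4);
                }
            }

            index_file($_) for @incoming_files;   # whatever feeds you new names
            $dbh->commit;

            # Tallying the sequences later is then a quick aggregate query:
            my $ranges = $dbh->selectall_arrayref(
                "SELECT pathpre, MIN(num), MAX(num), COUNT(*) FROM frames GROUP BY pathpre"
            );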


        Dave

Re: Optimize my foreach loop / code
by Anonymous Monk on Aug 20, 2016 at 03:56 UTC

    Hi, how fast does this run for you?

    sub ff { print scalar gmtime, "\n"; }
    ff;
    @g = (1 .. 35_000);
    for (@g) {
        /(\d+)/  and $ff{$1} = $1;
        /(\d+?)/ and $ff{$1} = $1;
        /(\d+)/  and $ff{$1} = $1;
        /(\d+?)/ and $ff{$1} = $1;
    }
    ff;
    __END__

      Less than 1 second.

      Thu Aug 25 20:29:05 2016
      Thu Aug 25 20:29:06 2016
