fredho has asked for the wisdom of the Perl Monks concerning the following question:
Hello Monks,
I have several files whose names share the same root but have different suffixes (e.g. file.001, file.002, file.003). For each unique root (file), I need to identify the file whose extension has the highest value (003).
Do I need to push the matching files into an array and then sort them by extension?
Or is there an easier way to proceed?
Thanks for the help
Re: Find the last item in a series of files
by tybalt89 (Monsignor) on Jun 16, 2017 at 15:27 UTC
|
#!/usr/bin/perl
# http://perlmonks.org/?node_id=1192943
use strict;
use warnings;
my %names;
/(.*)\.(.*)/ and $names{$1}[$2] = $_ while <DATA>;
print $names{$_}[-1] for sort keys %names;
__DATA__
file.001
file.003
file.002
one.004
two.001
two.003
one.002
one.001
two.002
| [reply] [d/l] |
|
use strict;
use warnings;
my %names;
/^([^.]+)\.(\d{3})$/ and $names{$1}[$2] = $_ while <DATA>;
print $names{$_}[-1] for sort keys %names;
__DATA__
file.001
file.003
not.good.001
file.002
file.10
one.004
two.001
two.003
two.five
one.002
one.001
one.0039
two.002
.005
CountZero
"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
My blog: Imperial Deltronics
| [reply] [d/l] |
|
Here is my piece of code. I'm not sure this is the best way to do it.
chdir ($folder);
while ($file = <*>){
my ($ext) = $file =~ /(\.[^.]+)$/; #Check file extension
if (($ext =~ m/00./) and ($ext ne ".001")){
next;
}
elsif ($ext eq ".001") { # first file is required
my $filenameroot = $file;
$filenameroot =~ s/(.+)\.[^.]+$/$1/; # File name root
my @list = glob("$folder$filenameroot*");
print "Last element : $list[(scalar @list-1)]\n"; # Last file of the series
}
}
| [reply] [d/l] |
|
I see your thought process; this is very close. Since you were nice enough to comply with the "hey, show us what you got" request, I'll make a few comments which I hope will be helpful for you in writing future code...
- if (($ext =~ m/00./) and ($ext ne ".001")){ The first conditional in the "and" is not needed. ($ext ne ".001") says it all.
- elsif ($ext eq ".001") { This "if" is not needed either, once the first test is reduced to ($ext ne ".001"). $ext has to be equal to ".001" if you get to this point, because the previous lines have already rejected any value that wasn't equal to ".001".
- A comment about:
my $filenameroot = $file;
$filenameroot =~ s/(.+)\.[^.]+$/$1/; # File name root
This is fine: you make a copy of $file by assigning it to a new variable, $filenameroot, and then use a substitute operation to modify $filenameroot. This works. However, consider:
my ($filenameroot) = $file =~ m/(.+)\.[^.]+$/;
In general a substitute operation is more "expensive" than a simple "return a value" operation, because the input string must be modified instead of selected parts just being copied. If you put the LHS (left-hand side) into list context, the match returns its captures ($1, and even $2, $3, ...) and you can assign them directly. Here $1 is assigned to $filenameroot, with no substitution required. This also avoids having $filenameroot temporarily hold a value that is "not quite correct" yet.
- my @list = glob("$folder$filenameroot*"); I am not sure whether glob() returns a sorted list. Even if it does, the list would be sorted as character strings, not numerically. This can make a big difference, since "13" sorts lower than "3". I don't know for sure whether this is a problem for your file names, but always include some double-digit numbers in your test cases.
- The big issue with the glob() is that you are re-reading the directory multiple times. File system operations are "expensive" in terms of CPU. Get in the habit of trying to do a directory read only once, and store the result in your own data structure if you have to. Of course in your application I don't expect any performance issue, but this is something to be aware of in the future.
- print "Last element : $list[(scalar @list-1)]\n"; That does indeed get the last element of @list. However, that last element might not be the file with the largest extension number, because of the potential sorting issue mentioned above. Also note that this is better written as $list[-1]. In Perl the -1 index is the last item, -2 is the next to last, etc. A very handy concept. Your code is correct; I am just mentioning that there is a better syntax for this.
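To make a few of the points above concrete, here is a small sketch. The file names are made up for illustration and are deliberately not zero-padded, so that the difference between string and numeric sorting shows up; suffix() is a hypothetical helper, not something from the original code.

```perl
use strict;
use warnings;

# Hypothetical file names, not taken from the original post.
my @files = ('data.3', 'data.13', 'data.2');

# A match in list context returns its captures, so no s/// is needed.
my ($root, $ext) = $files[0] =~ m/(.+)\.([^.]+)$/;
print "root=$root ext=$ext\n";

# Hypothetical helper: return the numeric suffix of a file name.
sub suffix { my ($n) = $_[0] =~ /\.(\d+)$/; return $n }

my @by_string = sort @files;                                  # character-string order
my @by_number = sort { suffix($a) <=> suffix($b) } @files;    # numeric order

# The -1 index is the last element in both cases.
print "string sort last:  $by_string[-1]\n";
print "numeric sort last: $by_number[-1]\n";
```

With zero-padded three-digit suffixes such as .001 to .999 the two orders agree, which is why the string sort only becomes a problem once the padding is inconsistent.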
I direct your attention to the code by BillKSmith, tybalt89 and CountZero.
This is clever in how it works. I think some further explanation may be helpful to you.
This builds a HoA (Hash of Array) called %names. What is special is that the array @{$names{"name"}}
is what is called a "sparse array" - not every element of the array has an assigned value.
Perl allows this. If say @array only has 3 things in it, you can still assign $array[14]="Something";.
A bunch of values will wind up being "undef" or undefined, but that is just fine.
A numeric sort to get the "largest suffix number" is unnecessary, just using the [-1] index
is enough. The sort of keys %names just puts the root names in alphabetical order.
This has nothing to do with determining the highest numbered suffix. Added: look at Laurent_R's code also.
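A minimal sketch of why the [-1] index is enough here. The array and file names below are invented for the demonstration; the pattern is a slightly stricter variant of the ones posted above, not the exact original.

```perl
use strict;
use warnings;

my @array = ('a', 'b', 'c');
$array[14] = 'Something';            # elements 3..13 now exist but are undef

my $count   = scalar @array;             # total number of slots
my $defined = grep { defined } @array;   # how many slots hold a value
print "elements: $count, defined: $defined, last: $array[-1]\n";

# Same idea as the HoA: the numeric suffix is used as the array index,
# so the highest suffix always ends up in the last slot.
my %names;
for my $file (qw(one.004 two.001 one.002 two.003)) {
    my ($root, $num) = $file =~ /^(.+)\.(\d+)$/ or next;
    $names{$root}[$num] = $file;
}
print "$_ => $names{$_}[-1]\n" for sort keys %names;
```

Perl allocates every slot up to the highest index used, so this is cheap for three-digit suffixes but would waste memory if the suffixes were very large numbers.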
I recommend that you use some adaptation of the HoA code or Laurent_R's code. Both look great to me.
Welcome to the group! You will get a lot of help here. In general more help is forthcoming when you demonstrate some effort on your part (which you did).
| [reply] [d/l] [select] |
|
okay. i haven't tested this code...but, here's what i got...
# firstly, i'm gonna use the working directory, for laziness' sake! lol
# secondly, i haven't thoroughly tested this. 'sub external_files($;$)' is tested, and does work according to my tests
# i'm working in a windows 10 environment, apache24 and activestate's perl 5.020002 (i think that version # is right)
#
# thirdly, this script assumes all the files in the folder are named with .xxx where each x is a digit 0..9
# fourth, this will do no error checking! it will work perfect, so long as you adhere to the file extension convention
# fifth, and finally, i have not tested this code
##############################
# i copied this from a project i'm working on
# yes. i use prototypes. SUE me!
sub external_files($;$) {
#*
# lists files within a specified folder (eg: config, txt)
# folders will not be included in this list - just the filenames only
# if no type is provided, *.* is assumed
# type should be just "png" or "txt", no need to include a leading dot
#*
my ($folder, $type) = @_; # a location (eg: users), relative to webroot && a file type
if ($type) {
# the following is just in case the user of this
# subroutine ignores instructions (mainly me lol)
$type =~ s/(\*)*//g; # remove stars
$type =~ s/(\.)*//; # remove dots
$type =~ s/\///g; # remove forward slashes
if ($type) { $type = ".$type"; }
}
if ($folder) {
# same idea here as for $type
# this one, however, may seem weird, but i've
# found it better to account for all possibilities
# rather than leave it up to the user of this
# code to ensure correct params are given
#
# besides, i tend to forget to follow my own
# instructions, so this saves me tons of head
# scratching, see?
$folder =~ s/(\/)*$//; # remove trailing /'s
$folder =~ s/^(\/)*//; # remove leading /'s
$folder =~ s/\/\//\//g; # convert //'s to /
$folder .= "/"; # attach trailing /*
}
my @fixed;
my $filespec = $folder . "*" . $type;
my @dirs = glob($filespec);
$folder =~ s/\./\\./g;
$folder =~ s/\//\\\//g;
foreach my $dir (@dirs) {
if (-f $dir) {
$dir =~ s/$folder//;
push (@fixed, $dir);
}
}
return @fixed; # an array
#usage: my @fileList = external_files("D:/", "txt");
} # end of sub external_files($$);
#sub get_last($) { # you could uncomment this line...and turn the following into a sub!
#my ($folder) = @_; # and yes, i do this, too! again, sue me (i believe wholeheartedly, and pedantically so, in the K.I.S.S concept)
# my @files = external_files($folder); # i'll leave it up to you to make sure $folder is a valid location, but give it whatever you like, really
my @files = external_files("d:/myNumberedFiles");
# @files should now contain all yer files stored in d:/myNumberedFiles/
# now, you want the file with an extension that works out to being the highest #?
# easy!
# first, i'm gonna rip through the list, and build a new one.
# the new one will contain just the extension with no dots.
# leading zeros will be removed from the extension. this should
# result in a list with elements that are just numbers.
# then, i'm gonna sort the bugger, and spit out the last element.
my @exts = ();
foreach my $file (@files) {
$file =~ s/^(.)*\.(0)*//; # remove everything before and including the dot and any leading zeros after the dot
# now, pop that into your list
push @exts, $file;
}
# now sort the list!
sort @exts;
print $exts[$#exts];
#return $exts[$#exts];
#}
# and you have yer answer...
#you could drop the above "main" code into a sub of its own, too, of course.
#just uncomment the #sub... line and the line after it, and the #return and #} lines at the bottom
i hope this one works, and doesn't get too butchered by the rest of the monks here :D
i like to think i'm pretty decent at this coding thing, so, go easy on me. i'm 100% self taught, and i have no personal group of PERL programmers in my midst - i'm alone, and i'm a one man band.
sincerely,
jamroll | [reply] [d/l] |
|
i haven't tested this code...
Having a variety of test cases is important. I admit I haven't tested your code myself, but if you had tested it with multiple cases, you might have found that, for example, sort @exts; isn't doing what you want. Also, I can warmly recommend one of the filename manipulation modules like Path::Class, or perhaps File::Spec (a core module) - if you use the former you can even use its methods to list files in the directory (->children). A few more suggestions: Be careful with if ($folder), since that will test negative when $folder happens to be "0" (Truth and Falsehood), you probably want to use length or defined tests instead (same goes for if ($type), of course). Also, I think you might have missed a /g on your "remove dots" regex?
Update 2019-08-17: Updated the link to "Truth and Falsehood".
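A short sketch of the three points above, using made-up values rather than the real directory contents. It is only meant to show the behaviour, not to replace the posted subroutine.

```perl
use strict;
use warnings;

my @exts = (3, 13, 2);      # made-up suffixes, leading zeros already removed

# sort returns a new list and leaves @exts untouched, so the result has to
# be assigned; the { $a <=> $b } block makes the comparison numeric.
my @sorted = sort { $a <=> $b } @exts;
print "highest: $sorted[-1]\n";

# "0" is a perfectly valid name but is false in boolean context.
my $folder      = "0";
my $plain_test  = $folder ? "true" : "false";
my $length_test = (defined $folder && length $folder) ? "true" : "false";
print "if (\$folder): $plain_test; defined/length test: $length_test\n";

# The "remove dots" pattern from the post: without /g it only matches once,
# at the start of the string; with /g it removes every dot.
(my $once = "tar.gz") =~ s/(\.)*//;
(my $all  = "tar.gz") =~ s/(\.)*//g;
print "without /g: $once, with /g: $all\n";
```

With warnings enabled, a bare sort @exts; statement is also reported as a useless use of sort in void context, which is a quick way to spot the first problem.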
| [reply] [d/l] [select] |
Re: Find the last item in a series of files
by 1nickt (Canon) on Jun 16, 2017 at 14:47 UTC
|
Hi, what do you have so far? Can you show your code please? An array of file names seems like a good start, after you get the file suffixes.
You might like Path::Tiny::iter() to find the files, and File::Basename::fileparse() for getting the filename suffix.
Hope this helps!
The way forward always starts with a minimal test.
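As a rough illustration of the fileparse() suggestion, here is a minimal sketch using the core File::Basename module. The path is hypothetical, and the suffix pattern is one common choice rather than the only one.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

# Hypothetical path, used only for illustration.
my $path = '/tmp/series/file.003';

# With a suffix pattern, fileparse returns (name, directory, suffix).
my ($name, $dir, $suffix) = fileparse($path, qr/\.[^.]*/);
print "name=$name dir=$dir suffix=$suffix\n";
```

The suffix comes back with its leading dot, so it would still need the dot stripped before being compared as a number.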
| [reply] |
Re: Find the last item in a series of files
by Laurent_R (Canon) on Jun 16, 2017 at 20:28 UTC
|
I'm a bit reluctant to sort a whole array if all that is needed is to find a maximum value, because the algorithmic complexity is higher (meaning it is in theory less efficient). Having said that, I must admit it probably does not matter unless the number of files is very high.
Anyway, you might avoid a sort with something like this:
my %hash;
for my $file (glob("*.*")) {
    my ($root, $ext) = split /\./, $file;
    if (defined $hash{$root}) {
        # keep the larger of the stored and the current extension
        $hash{$root} = $ext if $ext > $hash{$root};
    } else {
        # first time this root is seen
        $hash{$root} = $ext;
    }
}
The if ... else statement could be reduced to a simple if statement and a Boolean operator:
$hash{$root} = $ext if (not defined $hash{$root}) or $ext > $hash{$root};
but I wanted to make it as easy to read as possible.
| [reply] [d/l] [select] |
|
A simple split to two parts may not be enough depending upon the OP's filenames and how many '.' characters might be contained within those names. I think that
a regex assignment would be more appropriate instead of a split. But it could be that this is all the OP needs.
I don't like your one line if..else because it is "hard to understand" and
confers no execution advantage.
Anyway, a nice algorithm idea that compares favorably with the HoA ideas previously posted.
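The difference between the two approaches can be seen with a name that contains more than one dot. This is only a sketch; the file name is invented and the regex is one possible choice.

```perl
use strict;
use warnings;

my $file = 'not.good.001';   # made-up name with more than one dot

# split breaks on every dot, so only the first two pieces are kept.
my ($s_root, $s_ext) = split /\./, $file;

# A greedy regex anchors on the last dot instead.
my ($r_root, $r_ext) = $file =~ /^(.+)\.([^.]+)$/;

print "split: root=$s_root ext=$s_ext\n";
print "regex: root=$r_root ext=$r_ext\n";
```

With the split version, the numeric comparison would then be applied to "good", which is not a number and triggers a warning under use warnings.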
| [reply] |
Re: Find the last item in a series of files
by BillKSmith (Monsignor) on Jun 16, 2017 at 15:16 UTC
|
You need one hash. In one pass, store basename as key and the largest extension found so far as the corresponding value. At the end of that pass, the hash contains exactly what you want. All you have left to do is format it.
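A minimal sketch of that one-pass approach. The list of names stands in for a real directory listing (an assumption for the sake of a self-contained example), and %highest is just an illustrative variable name.

```perl
use strict;
use warnings;

# Sample names standing in for a directory listing.
my @files = qw(file.001 file.003 file.002 one.004 two.001 two.003 one.002);

my %highest;   # root => largest numeric extension seen so far
for my $file (@files) {
    # Skip anything that is not "root.digits".
    my ($root, $ext) = $file =~ /^(.+)\.(\d+)$/ or next;
    $highest{$root} = $ext
        if !defined $highest{$root} or $ext > $highest{$root};
}

# Formatting step: rebuild the file name from root and extension.
printf "%s.%s\n", $_, $highest{$_} for sort keys %highest;
```

The extension is stored as the original string, so the zero padding survives when the name is rebuilt for output.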
| [reply] |
Re: Find the last item in a series of files
by tybalt89 (Monsignor) on May 31, 2024 at 19:07 UTC
|
#!/usr/bin/perl
use strict; # https://perlmonks.org/?node_id=1192943
use warnings;
# If it's really true that all the files you are interested in
# have the same root with three digit extensions,
# then all that is needed is:
my $highest = (<file.???>)[-1];
print "File with Highest Extension => $highest\n";
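# A self-contained way to try the same idea without touching real files.
# This is a sketch under assumptions: it creates throwaway files in a
# temporary directory and relies on glob returning names in sorted order.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Temporary directory that is removed when the program exits.
my $dir = tempdir(CLEANUP => 1);

# Create three empty files with zero-padded three-digit extensions.
for my $name (qw(file.002 file.010 file.001)) {
    open my $fh, '>', "$dir/$name" or die "Cannot create $name: $!";
    close $fh;
}

# Each ? matches exactly one character, so only three-character
# extensions are picked up; the last name in sorted order is the highest.
my @found   = glob("$dir/file.???");
my $highest = $found[-1];
print "File with Highest Extension => $highest\n";
```

# The trick depends on the zero padding: with mixed-width suffixes such as
# file.2 and file.10, string order and numeric order no longer agree.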
| [reply] [d/l] |