PerlMonks  

Find the last item in a series of files

by fredho (Initiate)
on Jun 16, 2017 at 14:39 UTC (#1192943=perlquestion)
fredho has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,
I have several files whose names share the same root but have different suffixes (e.g. file.001, file.002, file.003), and for each unique root (file) I need to identify the file whose extension has the highest value (003).
Do I need to push the matching files into an array before sorting the elements on the extension?
Or is there an easier way to proceed?
Thanks for the help

Replies are listed 'Best First'.
Re: Find the last item in a series of files
by tybalt89 (Deacon) on Jun 16, 2017 at 15:27 UTC
    #!/usr/bin/perl
    # http://perlmonks.org/?node_id=1192943
    use strict;
    use warnings;

    my %names;
    /(.*)\.(.*)/ and $names{$1}[$2] = $_ while <DATA>;
    print $names{$_}[-1] for sort keys %names;

    __DATA__
    file.001
    file.003
    file.002
    one.004
    two.001
    two.003
    one.002
    one.001
    two.002
      Very nice solution!

      A small improvement makes the regex a bit more specific, so that it rejects filenames that do not match the expected file name template.

      use strict;
      use warnings;

      my %names;
      /^([^.]+)\.(\d{3})$/ and $names{$1}[$2] = $_ while <DATA>;
      print $names{$_}[-1] for sort keys %names;

      __DATA__
      file.001
      file.003
      not.good.001
      file.002
      file.10
      one.004
      two.001
      two.003
      two.five
      one.002
      one.001
      one.0039
      two.002
      .005

      CountZero

      A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      My blog: Imperial Deltronics
      Here is my piece of code. I'm not sure this is the best way to do it.

      chdir ($folder);
      while ($file = <*>){
          my ($ext) = $file =~ /(\.[^.]+)$/;   #Check file extension
          if (($ext =~ m/00./) and ($ext ne ".001")){
              next;
          }
          elsif ($ext eq ".001") {   # first file is required
              my $filenameroot = $file;
              $filenameroot =~ s/(.+)\.[^.]+$/$1/;   # File name root
              my @list = glob("$folder$filenameroot*");
              print "Last element : $list[(scalar @list-1)]\n";   # Last file of the series
          }
      }
        I see your thought process, this is very close. Since you were nice enough to comply with the "hey, show us what you got" request, I'll make a few comments which I hope will be helpful for you in writing future code...

        • if (($ext =~ m/00./) and ($ext ne ".001")){ The first condition of the "and" is not needed, since the goal is to skip every file that is not the first of its series.
          ($ext ne ".001") says it all.
        • elsif ($ext eq ".001") { With the test above reduced to ($ext ne ".001"), this "if" is not needed either. $ext has to be equal to ".001" if you get to this point, because the previous lines have rejected any value that wasn't equal to ".001".
        • Adding comment about:
          my $filenameroot = $file; $filenameroot =~ s/(.+)\.[^.]+$/$1/; # File name root

          This is fine: you make a copy of "$file" by assigning it to a new variable, "$filenameroot". Then you use a substitute operation to modify $filenameroot. This works. However, consider:
          (my $filenameroot) = $file =~ m/(.+)\.[^.]+$/;
          In general a substitute operation is more "expensive" than a simple "return a value" operation. That is because the input string must be modified instead of selected parts just being copied. If you put the LHS (left-hand side) into list context, you can assign $1, and even $2, $3.. from a match. Here $1 gets assigned to $filenameroot, with no substitution operation required. This of course also avoids the problem of assigning $filenameroot a value that is "not quite correct" yet. Here $filenameroot becomes $1.
        • my @list = glob("$folder$filenameroot*"); I am not sure if glob() returns a sorted list or not? Even if it does, it would be Character String sorted and not numerically sorted. This can make a big difference as "13" sorts lower than "3". This sorting difference between Character and Numeric is something to consider when you have numeric values. I don't know for sure whether this is a problem, but always include some double digit numbers in your test cases.
        • The big issue with the glob() is that you are re-reading the directory multiple times. File system operations are "expensive" in terms of CPU. Get in the habit of trying to do a directory read "only once". Store it if you have to in your own data structure. Of course in your application, I don't expect any performance issue, but this is something to be aware of in the future.
        • print "Last element : $list[(scalar @list-1)]\n"; That does indeed get the last element of @list. However there could be a problem because that last element might not be the file with the largest extension number due to previously mentioned potential sorting issues? Note better written as $list[-1]. In Perl the -1 index is the last item, -2 is next to last, etc. A very handy concept. Your code is correct, just mentioning that there is a better syntax for this.
        • I direct your attention to the code by BillKSmith, tybalt89 and CountZero. This is clever in how it works. I think some further explanation may be helpful to you.

          This builds a HoA (Hash of Array) called %names. What is special is that the array @{$names{"name"}} is what is called a "sparse array" - not every element of the array has an assigned value. Perl allows this. If say @array only has 3 things in it, you can still assign $array[14]="Something";. A bunch of values will wind up being "undef" or undefined, but that is just fine. A numeric sort to get the "largest suffix number" is unnecessary, just using the [-1] index is enough. The sort of keys %names just puts the root names in alphabetical order. This has nothing to do with determining the highest numbered suffix. Added: look at Laurent_R's code also.

          I recommend that you use some adaption of the HoA code or Laurent_R's code. Both look great to me.

          Welcome to the group! You will get a lot of help here. In general more help is forthcoming when you demonstrate some effort on your part (which you did).
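As a rough illustration of the points in the list above, here is a minimal sketch. It assumes an in-memory list of names instead of a real directory read; @files, %names and %last are made-up names for this example only. It captures root and suffix with a list-context match, fills a hash of arrays indexed by the numeric suffix, reads the highest entry back with [-1], and then shows how string and numeric sorts differ on unpadded numbers.

```perl
use strict;
use warnings;

# Hypothetical file names; real code would get these from one directory read.
my @files = qw(file.001 file.013 file.003 one.004 one.002);

my %names;    # root => array indexed by numeric suffix (unused slots stay undef)
for my $file (@files) {
    # List-context match: captures are assigned directly, no s/// needed.
    my ($root, $suffix) = $file =~ /^(.+)\.(\d+)$/ or next;
    $names{$root}[$suffix] = $file;    # "013" is used as array index 13
}

# The last slot of each array holds the highest-numbered file.
my %last = map { $_ => $names{$_}[-1] } keys %names;
print "$_: $last{$_}\n" for sort keys %last;

# String sort versus numeric sort on unpadded numbers.
my @as_strings = sort qw(3 13 2);                  # 13, 2, 3
my @as_numbers = sort { $a <=> $b } qw(3 13 2);    # 2, 3, 13
```

With fixed-width, zero-padded suffixes such as 001 .. 013 the two sort orders agree; the difference only shows up once the padding is missing.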

      okay. i haven't tested this code...but, here's what i got...
      # firstly, i'm gonna use the working directory, for laziness' sake! lol
      # secondly, i haven't thoroughly tested this. 'sub external_files($$)' is tested, and does work according to my tests
      # i'm working in a windows 10 environment, apache24 and activestate's perl 5.020002 (i think that version # is right)
      #
      # thirdly, this script assumes all the files in the folder are named with .xxx where each x is a digit 0..9
      # fourth, this will do no error checking! it will work perfect, so long as you adhere to the file extension convention
      # fifth, and finally, i have not tested this code

      ##############################
      # i copied this from a project i'm working on
      # yes. i use prototypes. SUE me!
      sub external_files($;$) {
        #*
        # lists files within a specified folder (eg: config, txt)
        # folders will not be included in this list - just the filenames only
        # if no type is provided, *.* is assumed
        # type should be just "png" or "txt", no need to include a leading dot
        #*

        my ($folder, $type) = @_; # a location (eg: users), relative to webroot && a file type

        if ($type) {
          # the following is just in case the user of this
          # subroutine ignores instructions (mainly me lol)
          $type =~ s/(\*)*//g; # remove stars
          $type =~ s/(\.)*//; # remove dots
          $type =~ s/\///g; # remove forward slashes
          if ($type) { $type = ".$type"; }
        }
        if ($folder) {
          # same idea here as for $type
          # this one, however, may seem weird, but i've
          # found it better to account for all possibilities
          # rather than leave it up to the user of this
          # code to ensure correct params are given
          #
          # besides, i tend to forget to follow my own
          # instructions, so this saves me tons of head
          # scratching, see?
          $folder =~ s/(\/)*$//; # remove trailing /'s
          $folder =~ s/^(\/)*//; # remove leading /'s
          $folder =~ s/\/\//\//g; # convert //'s to /
          $folder .= "/"; # attach trailing /*
        }

        my @fixed;
        my $filespec = $folder . "*" . $type;
        my @dirs = glob($filespec);

        $folder =~ s/\./\\./g;
        $folder =~ s/\//\\\//g;

        foreach my $dir (@dirs) {
          if (-f $dir) {
            $dir =~ s/$folder//;
            push (@fixed, $dir);
          }
        }

        return @fixed; # an array
        #usage: my @fileList = external_files("D:/", "txt");
      } # end of sub external_files($$);

      #sub get_last($) { # you could uncomment this line...and turn the following into a sub!
      #my ($folder) = @_; # and yes, i do this, too! again, sue me (i believe wholeheartedly, and pedantically so, in the K.I.S.S concept)

      # my @files = external_files($folder); # i'll leave it up to you to make sure $folder is a valid location, but give it whatever you like, really
      my @files = external_files("d:/myNumberedFiles");

      # @files should now contain all yer files stored in d:/myNumberedFiles/
      # now, you want the file with an extension that works out to being the highest #?
      # easy!
      # first, i'm gonna rip through the list, and build a new one.
      # the new one will contain just the extension with no dots.
      # leading zeros will be removed from the extension. this should
      # result in a list with elements that are just numbers.
      # then, i'm gonna sort the bugger, and spit out the last element.
      my @exts = ();
      foreach my $file (@files) {
        $file =~ s/^(.)*\.(0)*//; # remove everything before and including the dot and any leading zeros after the dot

        # now, pop that into your list
        push @exts, $file;
      }

      # now sort the list!
      sort @exts;

      print $exts[$#exts];
      #return $exts[$#exts];
      #}
      # and you have yer answer...
      #you could drop the above "main" code into a sub of its own, too, of course.
      #just uncomment the #sub... line and the line after it, and the #return and #} lines at the bottom

      i hope this one works, and doesn't get too butchered by the rest of the monks here :D i like to think i'm pretty decent at this coding thing, so, go easy on me. i'm 100% self taught, and i have no personal group of PERL programmers in my midst - i'm alone, and i'm a one man band.

      sincerely,

      jamroll
        i haven't tested this code...

        Having a variety of test cases is important. I admit I haven't tested your code myself, but if you had tested it with multiple cases, you might have found that, for example, sort @exts; isn't doing what you want. Also, I can warmly recommend one of the filename manipulation modules like Path::Class, or perhaps File::Spec (a core module) - if you use the former you can even use its methods to list files in the directory (->children). A few more suggestions: Be careful with if ($folder), since that will test negative when $folder happens to be "0" (Truth and Falsehood), you probably want to use length or defined tests instead (same goes for if ($type), of course). Also, I think you might have missed a /g on your "remove dots" regex?
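The three pitfalls mentioned above can be reproduced in isolation. This is only a sketch with made-up values (@exts, $folder and the "a.b.c" string are assumptions for illustration, not taken from the code under review):

```perl
use strict;
use warnings;

my @exts = qw(3 12 1);

# sort returns a new list; calling it without using the result changes nothing.
# The numeric comparison keeps 12 above 3 (a plain string sort would not).
my @sorted = sort { $a <=> $b } @exts;
print "highest: $sorted[-1]\n";

# A folder literally named "0" is false in a boolean test, but it has a length.
my $folder    = "0";
my $by_truth  = $folder         ? "set" : "empty";
my $by_length = length($folder) ? "set" : "empty";
print "truth test: $by_truth, length test: $by_length\n";

# Without /g only the first match is replaced.
(my $once = "a.b.c") =~ s/\.//;     # "ab.c"
(my $all  = "a.b.c") =~ s/\.//g;    # "abc"
print "once: $once, all: $all\n";
```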

Re: Find the last item in a series of files
by 1nickt (Prior) on Jun 16, 2017 at 14:47 UTC

    Hi, what do you have so far? Can you show your code please? An array of file names seems like a good start, after you get the file suffixes.

    You might like Path::Tiny::iter() to find the files, and File::Basename::fileparse() for getting the filename suffix.
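A minimal sketch of the fileparse() suggestion, using only the core File::Basename module (Path::Tiny is not shown here). The path and the suffix pattern are assumptions chosen for illustration; fileparse() works on the string alone, so the file does not need to exist.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

# Hypothetical path with an extra dot in the base name.
my $path = "/tmp/archive/not.good.003";

# The second argument is a suffix pattern, matched against the end of the
# file name; here it is a dot followed by digits.
my ($name, $dir, $suffix) = fileparse($path, qr/\.\d+/);

print "dir=$dir name=$name suffix=$suffix\n";
```

The returned suffix keeps its leading dot, so it would need to be stripped (or compared as a string) before any numeric comparison.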

    Hope this helps!


    The way forward always starts with a minimal test.
Re: Find the last item in a series of files
by Laurent_R (Abbot) on Jun 16, 2017 at 20:28 UTC
    I'm a bit reluctant to sort a whole array if all that is needed is to find a maximum value, because the algorithmic complexity is higher (meaning it is in theory less efficient). Having said that, I must admit it probably does not matter unless the number of files is very high.

    Anyway, you might avoid a sort with something like this:

    my %hash;
    for my $file (glob("*.*")) {
        my ($root, $ext) = split /\./, $file;
        if (defined $hash{$root}) {
            $hash{$root} = $ext if $ext > $hash{$root};
        } else {
            $hash{$root} = $ext;
        }
    }
    The if ... else statement could be reduced to a simple if statement and a Boolean operator:
    $hash{$root} = $ext if (not defined $hash{$root}) or $ext > $hash{$root};
    but I wanted to make it as easy to read as possible.
      A simple split to two parts may not be enough depending upon the OP's filenames and how many '.' characters might be contained within those names. I think that a regex assignment would be more appropriate instead of a split. But it could be that this is all the OP needs.

      I don't like your one line if..else because it is "hard to understand" and confers no execution advantage.

      Anyway, a nice algorithm idea that compares favorably with the HoA ideas previously posted.

        I agree with you. I chose split on the basis of the example provided in the OP (file.001, file.002, ...). A regex would be more robust for more complex filenames.

        I also think that the one-line version of the if ... else statement is less easy to understand (that's why I used the other version in the complete code version), I only wanted to show it could also be made more concise.
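To make the split-versus-regex point from this exchange concrete, here is a small sketch on a single made-up file name containing an extra dot. The variable names and the pattern are assumptions for illustration; a stricter pattern could instead reject such names outright.

```perl
use strict;
use warnings;

my $file = "not.good.001";    # hypothetical name with an extra dot

# split on every dot: the numeric suffix lands in the third field,
# so taking the first two fields gives the wrong "extension".
my ($split_root, $split_ext) = split /\./, $file;

# A greedy match anchored at the end treats the last dot as the separator.
my ($re_root, $re_ext) = $file =~ /^(.+)\.(\d+)$/;

print "split: $split_root / $split_ext\n";
print "regex: $re_root / $re_ext\n";
```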

        A simple split to two parts may not be enough depending upon the OP's filenames and how many '.' characters might be contained within those names. I think that a regex assignment would be more appropriate instead of a split. But it could be that this is all the OP needs.

        That's why using File::Basename seems a good idea as it provides an easy interface to extract path, files-basename and extension.

        CountZero

Re: Find the last item in a series of files
by BillKSmith (Vicar) on Jun 16, 2017 at 15:16 UTC
    You need one hash. In one pass, store basename as key and the largest extension found so far as the corresponding value. At the end of that pass, the hash contains exactly what you want. All you have left to do is format it.
    Bill
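One possible reading of the single-hash, single-pass idea described above, as a sketch. The helper name last_in_series, the sample names and the output format are assumptions for illustration; a second hash is kept here only so that the full file name can be printed at the end.

```perl
use strict;
use warnings;

# Hypothetical helper: maps each root to the name with the largest numeric suffix.
sub last_in_series {
    my @names = @_;
    my (%best_ext, %best_name);
    for my $name (@names) {
        my ($root, $ext) = $name =~ /^(.+)\.(\d+)$/ or next;    # skip other names
        next if exists $best_ext{$root} and $best_ext{$root} >= $ext;
        $best_ext{$root}  = $ext;     # largest extension found so far
        $best_name{$root} = $name;
    }
    return \%best_name;
}

my $last = last_in_series(qw(file.001 file.003 file.002 two.010 two.009 notes.txt));

# Formatting step: one line per root.
printf "%-6s %s\n", $_, $last->{$_} for sort keys %$last;
```

In real use the list of names would come from a single glob() or readdir() call rather than a literal list.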
Re: Find the last item in a series of files
by sundialsvc4 (Abbot) on Jun 19, 2017 at 13:44 UTC

    As Perl monks universally say, “there’s more than one way to do it,” and I would like to politely offer the opinion here that all of the proposed solutions so far are “less than ideal” from my point-of-view as a technical project manager or team lead, for specifically these reasons, none of which strictly have to do with whether-or-not the solution now produces the right answer:

    1. Even though, I am quite sure, “all of them work,” it is not particularly “drop-dead obvious” why they work, nor necessarily what they do.   But, it should be.
    2. It would be difficult to extend these solutions in several obvious areas, such as finding the files within a recursive directory structure, even though off-the-shelf modules such as File::Find::Object already do this completely.   “(Debuggum Ne Agas:   Do Not Debug A Thing Already Done.)”
    3. There are two distinct “concerns” here ((1) selecting the files to be sorted, and (2) sorting them correctly) which are not separated, but rather mixed-together.   If I wanted to change the logic just a little bit, say to include more than one file base-name or somesuch, suddenly my change would be error-prone and difficult.

    In short, these solutions would not be readily maintainable.

    The solution that I would approve would, first of all, leverage an existing CPAN file-finder to do the work, probably the one aforementioned. It would then use a simple next unless (condition or condition) statement to filter out unwanted files ... to the extent that a file-matching pattern given to the finder did not already do so. It would push the matched file-names onto an array. The final step, performed after the loop ended, would sort this array, perhaps using a custom sort-function. Now, the concerns are easily separated, and a git patch applied to change one behavior probably would change only one line ... not the whole thing. Also, if I am using a known-good module, I don’t have to worry if it does actually do the basic task of finding files.
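A rough sketch of the structure described above, with two loudly stated substitutions: it uses the core File::Find module rather than File::Find::Object, and because File::Find calls a callback instead of running a loop, the filter is written as return unless rather than next unless. The temporary directory, the file names and the name pattern are all made up so that the example is self-contained.

```perl
use strict;
use warnings;
use File::Find qw(find);
use File::Temp qw(tempdir);

# Set up a throwaway directory tree (hypothetical names) for the demonstration.
my $top = tempdir(CLEANUP => 1);
mkdir "$top/sub" or die "mkdir: $!";
for my $rel (qw(file.001 file.010 file.002 sub/file.003 notes.txt)) {
    open my $fh, '>', "$top/$rel" or die "open $rel: $!";
    close $fh;
}

# Concern 1: select the files, recursively, pushing matches onto an array.
my @matched;
find(sub {
    return unless -f $_;               # plain files only
    return unless /^file\.\d+$/;       # assumed name pattern
    push @matched, $File::Find::name;  # full path of the current file
}, $top);

# Concern 2: sort them, here with a custom comparison on the numeric suffix.
my @sorted = sort { ($a =~ /(\d+)$/)[0] <=> ($b =~ /(\d+)$/)[0] } @matched;
print "last: $sorted[-1]\n";
```

Changing which files are selected touches only the callback, and changing the ordering touches only the sort block.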

    (I will pause briefly while you hit the “--” downvote-button now. Or maybe you would like to vote it completely off the island ...)

    “Tim Toady” does not mean finding a way that works.   It means finding the best way for the project and the team.   For instance, when I do a search at http://search.cpan.org for “File::Find”, I find 213 hits.   Yes, some of them are esoteric ... but several of them solve this very problem.   I really don’t want to accept a probable future maintenance-headache into my project if I can avoid it entirely.

Node Type: perlquestion [id://1192943]
Front-paged by Corion