Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid
 
PerlMonks  

Collapsing a list of files...

by fiddler42 (Beadle)
on Sep 16, 2009 at 22:18 UTC ( #795721=perlquestion: print w/ replies, xml ) Need Help??
fiddler42 has asked for the wisdom of the Perl Monks concerning the following question:

Kind Folk,

I have a list of c. 5,000 files that I need to collapse into the smallest possible list of files with wildcards. For example, say I have some files named:-

abcpwbln08 abcpwbln07 abcpwbnn10 abcpwbnn06 adesicln09 adesicln08 adesicln01 adesicnn05
...and I need to reduce them to:-

abcpwbln0* abcpwbnn* adesicln0* adesicnn05
Any suggestions on how to do this as efficiently as possible in perl? The routine will start off by just reading in the contents of a user-specified unix directory. The names will be very similar to what's referenced above.

Thanks,

-fiddler42

Comment on Collapsing a list of files...
Select or Download Code
Re: Collapsing a list of files...
by kennethk (Monsignor) on Sep 16, 2009 at 22:25 UTC
    You need to consider your spec a little more, since the trivial answer for your question is:

    *

    which I assume is not what you want. Once you've considered what you want in terms of your constraints, modify the spec and I'm sure you'll get some more helpful suggestions.

    This also sounds like a case where some Dynamic_programming would get you a long way.

      Ha! No, I don't want a single * and, yes, I could have done a better job of articulating the constraints.

      Here's the deal. Each of the files in that original list of 5,000 will have a prefix of 3 characters, a middle pattern of [\w+\d]*[l|n]n, and a suffix of n digits. Now that I think about it, you can ignore the suffix altogether and just replace the n digits with *. So all we need to focus on is the prefix and middle pattern. Now I need to do this:-

      1. Fracture up the start list into n new lists, where n is the number of unique prefixes. So if n = 3, the sum total of the file names in the 3 lists will equal 5,000.

      2. Now take each list from #1 and reduce each one down to the fewest number of unique patterns that end in *. The prefix must be preserved, as well as, at a minimum, the first character of the middle string.

      3. There can only be one *, and it has to be at the end of each pattern. The original example should make this clear. (Again, it closely resembles the actual file names and the outcome I want.)

      Hopefully this helps!

      Thanks,

      fiddler42

        Maybe you can adopt this, I call it floyd
        #!/usr/bin/perl -- use strict; use warnings; my(@list ) = sort qw[ abcpwbln08 abcpwbln07 abcpwbnn10 abcpwbnn06 adesicln09 adesicln08 adesicln01 adesicnn05 ]; use Data::Dumper; my $len = length 'abcpwbnn'; my $ret = floyd( \@list, $len, ) ; print Dumper($ret),"\n"; for my $ar( @$ret ){ if($#{$ar} > 0){ print substr $ar->[0], 0, $len; print "*\n"; } else { print $ar->[0],"\n"; } } sub floyd { my ( $input, $len ) = @_; my %output; keys %output = scalar @$input; #presize %output, max possible my @keys; my $num_keys_l = 0; for my $i( @$input ){ my $ikey = substr $i, 0, $len; push @{$output{$ikey}}, $i; my $nkeys = 0+ keys %output; if( $nkeys > $num_keys_l ){ $num_keys_l = $nkeys; push @keys, $ikey; } } return [ @output{@keys} ]; } __END__ $VAR1 = [ [ 'abcpwbln07', 'abcpwbln08' ], [ 'abcpwbnn06', 'abcpwbnn10' ], [ 'adesicln01', 'adesicln08', 'adesicln09' ], [ 'adesicnn05' ] ]; abcpwbln* abcpwbnn* adesicln* adesicnn05
        Hopefully this helps!

        Not entirely, perhaps...

        ... a suffix of n digits. Now that I think about it, you can ignore the suffix altogether and just replace the n digits with *.

        Well, I'm assuming there must be some other constraint (which you haven't described yet) that would make this an inadequate solution:

        my %collapsed; opendir( D, "/whatever/path" ) or die "whatever: $!\n"; for my $name ( grep /\d+$/, readdir D ) { $name =~ s/\d+$/*/; $collapsed{$name}++; } for my $name ( sort keys %collapsed ) { print "$name\n"; }
        If you're trying to do something else besides ignoring the final digits -- if there's supposed to be some actual structure to the file names and you need to do something with that besides adding "*" at the end -- let us in on it. From what you've described so far, it's either too simple or too vague.
        I think you are still a little caught up in the source of your data, and not considering it purely from a string processing perspective. From my reading, your goals are:

        1. Determine the number of unique 4-character prefixes (n + 1, n == 3) within a list of strings.
        2. For each prefix group, determine the longest possible string anchored at the string start common between all members of the group

        If the second goal is incorrect (you've never stated it but it seems to pervade your solutions), your goal can be achieved very simply with:

        #!/usr/bin/perl use strict; use warnings; my %files = (); while (<DATA>) { if (/(.{4})/) { $files{$1} = 1; } } foreach my $prefix (keys %files) { print "$prefix*\n"; } __DATA__ abcpwbln08 abcpwbln07 abcpwbnn10 abcpwbnn06 adesicln09 adesicln08 adesicln01 adesicnn05

        If you want to keep the second goal, one way to accomplish your goal (TIMTOWTDI) would be to compare the new postfix with the existing solution for that prefix, and shorten that guess until it matches the to criterion:

        !/usr/bin/perl use strict; use warnings; my %files = (); while (<DATA>) { if (my ($prefix,$postfix) = /(.{4})(.*)/) { if (exists $files{$prefix}) { my $pattern = $files{$prefix}; while (not $postfix =~ /^$pattern/) { $pattern =~ s/.$//; } $files{$prefix} = $pattern; } else { $files{$prefix} = $postfix; } } } foreach my $prefix (keys %files) { print "$prefix$files{$prefix}*\n"; } __DATA__ abcpwbln08 abcpwbln07 abcpwbnn10 abcpwbnn06 adesicln09 adesicln08 adesicln01 adesicnn05
Re: Collapsing a list of files...
by Anonymous Monk on Sep 16, 2009 at 22:25 UTC
    Hmm, first try, Regex::PreSuf
    #!/usr/bin/perl -- use Regex::PreSuf; print presuf(qw[ abcpwbln08 abcpwbln07 abcpwbnn10 abcpwbnn06 adesicln09 adesicln08 adesicln01 adesicnn05 ]); __END__ a(?:bcpwb(?:ln0[78]|nn(?:06|10))|desic(?:ln0[189]|nn05))
    2nd try Regexp::Assemble
    #!/usr/bin/perl -- use strict; use warnings; use Regexp::Assemble; my $ra = Regexp::Assemble->new; $ra->add( quotemeta $_) for qw[ abcpwbln08 abcpwbln07 abcpwbnn10 abcpwbnn06 adesicln09 adesicln08 adesicln01 adesicnn05 ]; print $ra->re; __END__ (?-xism:a(?:bcpwb(?:nn(?:06|10)|ln0[78])|desic(?:ln0[189]|nn05)))
    3rd try Regexp::Assemble OR Regex::PreSuf site:perlmonks.org

    4th maybe String::Glob::Permute

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://795721]
Approved by kennethk
Front-paged by tye
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others chilling in the Monastery: (5)
As of 2014-09-16 01:06 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite cookbook is:










    Results (155 votes), past polls