Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW
 
PerlMonks  

Re^2: Collapsing a list of files...

by fiddler42 (Beadle)
on Sep 17, 2009 at 00:22 UTC ( #795738=note: print w/ replies, xml ) Need Help??


in reply to Re: Collapsing a list of files...
in thread Collapsing a list of files...

Ha! No, I don't want a single * and, yes, I could have done a better job of articulating the constraints.

Here's the deal. Each of the files in that original list of 5,000 will have a prefix of 3 characters, a middle pattern of [\w+\d]*[l|n]n, and a suffix of n digits. Now that I think about it, you can ignore the suffix altogether and just replace the n digits with *. So all we need to focus on is the prefix and middle pattern. Now I need to do this:-

1. Fracture up the start list into n new lists, where n is the number of unique prefixes. So if n = 3, the sum total of the file names in the 3 lists will equal 5,000.

2. Now take each list from #1 and reduce each one down to the fewest number of unique patterns that end in *. The prefix must be preserved, as well as, at a minimum, the first character of the middle string.

3. There can only be one *, and it has to be at the end of each pattern. The original example should make this clear. (Again, it closely resembles the actual file names and the outcome I want.)

Hopefully this helps!

Thanks,

fiddler42


Comment on Re^2: Collapsing a list of files...
Download Code
Re^3: Collapsing a list of files...
by Anonymous Monk on Sep 17, 2009 at 00:35 UTC
    Maybe you can adopt this, I call it floyd
    #!/usr/bin/perl -- use strict; use warnings; my(@list ) = sort qw[ abcpwbln08 abcpwbln07 abcpwbnn10 abcpwbnn06 adesicln09 adesicln08 adesicln01 adesicnn05 ]; use Data::Dumper; my $len = length 'abcpwbnn'; my $ret = floyd( \@list, $len, ) ; print Dumper($ret),"\n"; for my $ar( @$ret ){ if($#{$ar} > 0){ print substr $ar->[0], 0, $len; print "*\n"; } else { print $ar->[0],"\n"; } } sub floyd { my ( $input, $len ) = @_; my %output; keys %output = scalar @$input; #presize %output, max possible my @keys; my $num_keys_l = 0; for my $i( @$input ){ my $ikey = substr $i, 0, $len; push @{$output{$ikey}}, $i; my $nkeys = 0+ keys %output; if( $nkeys > $num_keys_l ){ $num_keys_l = $nkeys; push @keys, $ikey; } } return [ @output{@keys} ]; } __END__ $VAR1 = [ [ 'abcpwbln07', 'abcpwbln08' ], [ 'abcpwbnn06', 'abcpwbnn10' ], [ 'adesicln01', 'adesicln08', 'adesicln09' ], [ 'adesicnn05' ] ]; abcpwbln* abcpwbnn* adesicln* adesicnn05
Re^3: Collapsing a list of files...
by graff (Chancellor) on Sep 17, 2009 at 03:53 UTC
    Hopefully this helps!

    Not entirely, perhaps...

    ... a suffix of n digits. Now that I think about it, you can ignore the suffix altogether and just replace the n digits with *.

    Well, I'm assuming there must be some other constraint (which you haven't described yet) that would make this an inadequate solution:

    my %collapsed; opendir( D, "/whatever/path" ) or die "whatever: $!\n"; for my $name ( grep /\d+$/, readdir D ) { $name =~ s/\d+$/*/; $collapsed{$name}++; } for my $name ( sort keys %collapsed ) { print "$name\n"; }
    If you're trying to do something else besides ignoring the final digits -- if there's supposed to be some actual structure to the file names and you need to do something with that besides adding "*" at the end -- let us in on it. From what you've described so far, it's either too simple or too vague.
Re^3: Collapsing a list of files...
by kennethk (Monsignor) on Sep 17, 2009 at 15:07 UTC
    I think you are still a little caught up in the source of your data, and not considering it purely from a string processing perspective. From my reading, your goals are:

    1. Determine the number of unique 4-character prefixes (n + 1, n == 3) within a list of strings.
    2. For each prefix group, determine the longest possible string anchored at the string start common between all members of the group

    If the second goal is incorrect (you've never stated it but it seems to pervade your solutions), your goal can be achieved very simply with:

    #!/usr/bin/perl use strict; use warnings; my %files = (); while (<DATA>) { if (/(.{4})/) { $files{$1} = 1; } } foreach my $prefix (keys %files) { print "$prefix*\n"; } __DATA__ abcpwbln08 abcpwbln07 abcpwbnn10 abcpwbnn06 adesicln09 adesicln08 adesicln01 adesicnn05

    If you want to keep the second goal, one way to accomplish your goal (TIMTOWTDI) would be to compare the new postfix with the existing solution for that prefix, and shorten that guess until it matches the to criterion:

    !/usr/bin/perl use strict; use warnings; my %files = (); while (<DATA>) { if (my ($prefix,$postfix) = /(.{4})(.*)/) { if (exists $files{$prefix}) { my $pattern = $files{$prefix}; while (not $postfix =~ /^$pattern/) { $pattern =~ s/.$//; } $files{$prefix} = $pattern; } else { $files{$prefix} = $postfix; } } } foreach my $prefix (keys %files) { print "$prefix$files{$prefix}*\n"; } __DATA__ abcpwbln08 abcpwbln07 abcpwbnn10 abcpwbnn06 adesicln09 adesicln08 adesicln01 adesicnn05

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://795738]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others imbibing at the Monastery: (8)
As of 2014-07-10 06:43 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    When choosing user names for websites, I prefer to use:








    Results (202 votes), past polls