Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl Monk, Perl Meditation
 
PerlMonks  

Building a UNIX path from Irritating data

by roswell1329 (Acolyte)
on Nov 24, 2009 at 19:59 UTC ( #809192=perlquestion: print w/replies, xml ) Need Help??

roswell1329 has asked for the wisdom of the Perl Monks concerning the following question:

Oh wise and benevolent Monks, I beseech thee to help me with this Perl conundrum: I need to retire a 3rd party online document warehouse application, and I'm getting stuck on a specific problem. The data is stored in a hierarchy resembling a Windows file tree, like:
folder |--folder2 |--folder3 | |--folderX | |--folderY |-folder4
I need to recreate that hierarchy in a UNIX path structure. There are tools provided by the application to extract the hierarchy, but their information is very segmented. The 2 tools I have are: Folder info: will give me the folder id, name, and number of subfolders under it, like so:
Folder 4464 - foldername_1. count subfolders 0. Folder 4465 - foldername_2. count subfolders 0. Folder 4466 - foldername_3. count subfolders 4. ...
Folder/Folder info: will give me folder ID's for each sub-folder under each folder, like so:
Folder 1298 - foldername_ten. subfolder 1299. subfolder 1300. Folder 1299 - foldername_eleven. No sub folders. Folder 1300 - foldername_twelve. No sub folders. Folder 1311 - foldername_thirteen. subfolder 1317. subfolder 1318. subfolder 1958.
Based on this data, I wrote a script that would first collect the folder ID's and names. For each folder ID, it would then build a UNIX path by searching for the folder's parent folder, and then the parent for that parent folder recursively back to the root folder as shown here:
# %folders has each folder ID as a key and the name as the value # @subfolders is the Folder/Folder data as shown above, line-for-l +ine foreach my $k (sort (keys (%folders))) { $folderpaths{$k} = &build_path($k,@subfolders); print "$k => $folderpaths{$k}\n"; } sub build_path($@) { my $folderid = shift @_; my @dumpff = @_; my $path = "$folderid"; my $parentid = ""; foreach my $line (@dumpff) { if ($line =~ /^Folder (\d{3,5})\s+\-\s+.*\./) { $parentid = $1; } elsif ($line =~ /\s+subfolder\s+$folderid\s+\-\s+.*\./) { $path = join('/', &build_path($parentid,@subfolders),$fold +erid); } } return $path; }
The above code looked like it was working perfectly, but I saw lots of data was missing. I later discovered that there are a few folders that appear as children under multiple parent folders. My script can identify multiple parents for each node it's looking at, but it can only return 1 match. My question is, how can I account for the multiple paths, and how can I identify those multiple paths so I know to make each duplicate path a symlink to the original when I actually build the UNIX filesystem?

Replies are listed 'Best First'.
Re: Building a UNIX path from Irritating data
by GrandFather (Saint) on Nov 25, 2009 at 02:11 UTC

    I'd build a structure that maps pretty much directly on to the information provided by 'Folder/Folder info'. Along with a little error checking and a recursive sub to generate the output something like the following should fit the bill:

    use strict; use warnings; my %folders; my $parentId; while (<DATA>) { chomp; if (m/^Folder\s+(\d+)\s*-\s*(.*)\.$/) { # new folder my ($folderId, $folderName) = ($1, $2); die "Duplicate entry for $folderId ($folderName) at line $.\n" if ++$folders{$folderId}{idCount} > 1; # Create the new folder $folders{$folderId}{folders} = []; $folders{$folderId}{name} = $folderName; $parentId = $folderId; } elsif (m/^\s+subfolder (\d+)\.$/) { # New subfolder my $folderId = $1; die "No parent folder for $folderId. Data format error at line + $.\n" if ! defined $parentId; warn "Probably bad data. Subfolder $folderId has already been +seen\n" if exists $folders{$folderId}; push @{$folders{$parentId}{folders}}, $folderId; $folders{$folderId}{hasParent} = 1; } else { # Bogus line } } genPath ($_) for sort grep {! $folders{$_}{hasParent}} keys %folders; sub genPath { my ($folderId, $root) = @_; $root .= '/'; die "Missing folder information for $folderId\n" if ! exists $folders{$folderId}{name}; die "Cycle in 'tree' involving $folderId and path $root\n" if ++$folders{$folderId}{visits} > 1; $root .= $folders{$folderId}{name}; print "$root\n"; genPath ($_, $root) for @{$folders{$folderId}{folders}}; } __DATA__ Folder 1298 - foldername_ten. subfolder 1299. subfolder 1300. Folder 1299 - foldername_eleven. No sub folders. Folder 1300 - foldername_twelve. No sub folders. Folder 1311 - foldername_thirteen. subfolder 1317. subfolder 1318. Folder 1317 - foldername_twelve. No sub folders. Folder 1318 - foldername_twelve. No sub folders.

    Prints:

    /foldername_ten /foldername_ten/foldername_eleven /foldername_ten/foldername_twelve /foldername_thirteen /foldername_thirteen/foldername_twelve /foldername_thirteen/foldername_twelve

    True laziness is hard work
      Hi GrandFather -- Thanks for the suggestion. I tried your code and it works as you mentioned, but I tried the same code with a static data file from the application itself (I posted a copy here: subfolders.txt), and the list never got past the first "/foldername". I'm looking into that to find out why.

      However, I also don't see in your code how deep the paths could go. Some of these folder/subfolder relationships go down 5 or 6 levels so we could see something like this:

      /foldername_ten/foldername_eleven/foldername_twelve/foldername_thirteen/foldername_fourteen

      It looks like your code only handles the first level of subfolder. Is that correct?

      For an example of how this data could be nested, search for folder ID 3053 in the linked datafile above and tell me if your code could address the structure seen there. Thank you for your assistance!

        There is no inherent limit to how nested the folders may be.

        I tried your data and it seemed to work fine for me generating over 3000 lines of output. Could you have a line ending issue? If you are running the script on *nix, but the file is generated on Windows then the line ends generated (cr/lf) won't match those expected (lf).

        I notice that some of the names have / characters in them (Folder 7969 for example). That is likely to cause you grief if you use the names as folder names.

        It looks to me like you really need a database solution to this problem instead of a file based solution.


        True laziness is hard work
Re: Building a UNIX path from Irritating data
by kyle (Abbot) on Nov 25, 2009 at 02:23 UTC

    It might help to give us some runable code. I started by writing this, but the folder IDs don't match up.

    Off the topic of your question, one thing I notice about your code is that build_path is defined with a prototype (which is usually a bad idea), and then you call it with an ampersand sigil which bypasses the prototype anyway. Unless there's more code somewhere that requires the prototype to be in effect, I'd suggest you get rid of it and call build_path normally (without the ampersand).

    I'm not entirely clear on the question you're trying to answer. It seems as if you're saying that build_path could return more than one path for a given folder ID. If that's the case, I'd say have it return a reference to an array rather than a simple string. Later, when you walk through %folderpaths, you can create the first path listed and make the others symbolic links to it. The code below is supposed to give you an idea of what I'm talking about.

    use Data::Dumper; sub mock_build_path { my $folderid = shift @_; my @dumpff = @_; my @paths; push @paths, 'example/1'; push @paths, 'example/2'; push @paths, 'example/3'; return \@paths; } my %folderpaths; $folderpaths{ '1337' } = mock_build_path( 1234, 'stuff' ); print Dumper \%folderpaths; my ( $mkpath, @linkpaths ) = @{ $folderpaths{ '1337' } }; print "make path: $mkpath\n"; print "link paths: ", join( q{, }, @linkpaths ), "\n"; __END__ $VAR1 = { '1337' => [ 'example/1', 'example/2', 'example/3' ] }; make path: example/1 link paths: example/2, example/3

    This requires the use of references, which I don't see in the code you've posted. If you're not familiar with references, have a look at Perl reference tutorial, Perl References, and References quick reference (links I took from the home node of kennethk, which has a lot of other good information on it).

    I hope this helps.

      My mistake, kyle. My original code pulls the data directly from the command-line utilities provided with the software and formats it on-the-fly, and in my original post, I just created some sample data from memory. I've created some static versions of the data produced by those command-line utilities (see below code), and here's a modified version of code that uses those files:
      #!/usr/bin/perl -w use strict; my $foldername_file = shift @ARGV; my $folderfolder_file = shift @ARGV; my %folders = (); my %folderpaths = (); open(FNAMES,"$foldername_file") or die "Can't open $foldername_file: $ +!\n"; foreach my $folderline (<FNAMES>) { chomp $folderline; if ($folderline =~ /^Folder (\d{3,5})\s+\-\s+(.*)\.$/) { $folders{$1} = $2; } } close(FNAMES); open(FFINFO,"$folderfolder_file") or die "Can't open $folderfolder_fil +e: $!\n"; my @subfolders = <FFINFO>; foreach my $k (sort (keys (%folders))) { $folderpaths{$k} = build_path($k,@subfolders); print "$k => $folderpaths{$k}\n"; } exit 0; sub build_path { my $folderid = shift @_; my @dumpff = @_; my $path = "$folderid"; my $parentid = ""; foreach my $line (@dumpff) { if ($line =~ /^Folder (\d{3,5})\s+\-\s+.*\./) { $parentid = $1; } elsif ($line =~ /\s+subfolder\s+$folderid\s+\-\s+.*\./ +) { $path = join('/', build_path($parentid,@subfol +ders),$folderid); } } return $path; }
      And you can use the following two data files to make it work:

      foldernames.txt
      subfolders.txt

      Basically, the code above works fine for most all of the data, but there are about 60-70 folder ID's that appear as subfolders to more than one parent folder ID. Folder ID 3053 is a good example. It is a subfolder under 3051, 3057, 3063, and 3067. If you run my code, however, the path printed for 3053 is:

      3053 => 100/3051/3053

      To be totally complete, I would need to also print:

      100/3057/3053
      100/3063/3053
      100/3067/3053

      The only role build_dirs has is to find the parent folder ID of the folder ID it is given, and if the parent ID it finds isn't a root-level node, it calls itself again with the parent ID until a full path is created. However, since I'm building the path layer by layer, I can't figure out when to check for a duplicate path. Each iteration of build_dirs doesn't have any knowledge of any complete paths that have been found. I suppose I could return all the results from a search, but I'm not sure how that would look. Are you suggesting something like this?

      3053 => 100/3051:3057:3063:3067/3053

      Thank you for your assistance!

        I replaced everything after the second open with this:

        my %parents_of; my $parentid; while ( my $line = <FFINFO> ) { if ($line =~ /^Folder (\d{3,5})\s+\-\s+.*\./) { $parentid = $1; } elsif ($line =~ /\s+subfolder\s+(\S+)\s+\-\s+.*\./) { push @{ $parents_of{ $1 } }, $parentid; } } sub build_path { my $folderid = shift @_; my @parents = @{ $parents_of{ $folderid } || [] }; return @parents ? [ map { map { "$_/$folderid" } @{ build_path( +$_ ) } } @parents ] : [ $folderid ]; } foreach my $k (sort (keys (%folders))) { $folderpaths{$k} = build_path($k); print "$k => $_\n" for @{ $folderpaths{$k} }; }

        The highlights of the changes are:

        • I read the subfolders file only once.
        • I store the subfolders data in a %parents_of hash of arrays, which maps each folder ID to the folders that have it as a subfolder.
        • In build_path, I use the hash of arrays instead of iterating over the whole subfolders file. It's still recursive.
        • Since build_path returns an array reference now, the output loop has to iterate its contents.

        For the ID that you pointed out, the new output is:

        3053 => 100/3051/3053 3053 => 100/3057/3053 3053 => 100/3063/3053 3053 => 100/3066/3053

        The new code produces all the output of the old code plus another 1108 lines, and it runs in less than 1 second while the original takes about 220 seconds.

        I hope this helps.

Re: Building a UNIX path from Irritating data
by spx2 (Deacon) on Nov 25, 2009 at 09:02 UTC

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://809192]
Approved by keszler
Front-paged by biohisham
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others contemplating the Monastery: (4)
As of 2021-10-21 19:03 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    My first memorable Perl project was:







    Results (83 votes). Check out past polls.

    Notices?