http://www.perlmonks.org?node_id=81711

lindex has asked for the wisdom of the Perl Monks concerning the following question:

Once upon a time in a land far far away, there was a perl monk named Lindex.

This perl monk had an idea one day to write a file system "Web Interface". Ah, but there was a problem: the big bad old directory that the "file system web interface" would be managing had a whopping 93,440 files in it, and if he wrote his perl script to load the contents of said directory into an array, the script would have a huge memory footprint.

So Lindex decided to have his "web interface" page through the files in increments of 10 or 20. But there was a problem: Lindex didn't have a clue how to seek through a directory properly so that he could skip ahead through its contents and page through the files. After much tribulation, Lindex consulted the Great Tome of Perl Magick in the hope of getting help from his fellow perl monks.

To Be Continued

The breakdown...
Basically, I just don't know how to set a position on a directory filehandle so that it skips over x directory entries. Where I have:

use IO::Dir;

my($dir) = IO::Dir->new('/imageproc/files/ips/fullsize/') || die;
my($c, $f) = 0;
while ($f = $dir->read() and $c <= 9) {
    print "$c $f\n";
    $c++;
}

for listing a limited number of files from a directory, if I do a $dir->seek(POS), what should my POS be?


Brought to you by that crazy but lovable guy... lindex

Re: A story of a Perl Monk and Problem
by Hero Zzyzzx (Curate) on May 19, 2001 at 18:12 UTC

    Well, this would be easy to do with an RDBMS, and given the sheer number of files, this may be the best way to do it.

    Read the directory in and put it into a table, in the method of your choice. In MySQL, an AUTO_INCREMENT field can give you an "index" column that lets you order and manage the files separately from the filenames, which may or may not be sequential. Then, using the LIMIT clause, you can select and create links to the next "x" files, like so:

    select id, filename from files where id >= $startid order by id limit 10

    It's then a simple thing to create a script that would allow you to page through these files. This would be very fast also, given the RDBMS backend, and you could have users choose how many to see per page.

    If the files change frequently, you can set up a cron job that would regularly update the table at the interval of your choice.

    There are advantages and disadvantages to this system, of course, but I've done something similar. The script I wrote manages a directory with about 1600 image files in it, and it works excellently.
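
    A minimal sketch of that approach, assuming a MySQL table named files with an AUTO_INCREMENT id column and a filename column, plus DBI and DBD::mysql installed; the connection details and the reload_table/get_page helper names are illustrative, not something from the original post:

    use strict;
    use DBI;

    # Illustrative connection details; adjust the DSN, user and password.
    my $dbh = DBI->connect('DBI:mysql:database=imageproc', 'user', 'password',
                           { RaiseError => 1 });

    # Rebuild the table from the directory (suitable for a cron job).
    sub reload_table {
        my ($dirname) = @_;
        opendir(DIR, $dirname) or die "can't open $dirname: $!";
        $dbh->do('DELETE FROM files');
        my $sth = $dbh->prepare('INSERT INTO files (filename) VALUES (?)');
        while (defined(my $name = readdir(DIR))) {
            next if $name eq '.' or $name eq '..';
            $sth->execute($name);
        }
        closedir(DIR);
    }

    # Fetch one page of filenames, starting at a given id.
    sub get_page {
        my ($startid, $per_page) = @_;
        $per_page = int($per_page) || 10;   # keep LIMIT a plain integer
        my $sth = $dbh->prepare(
            "SELECT id, filename FROM files WHERE id >= ? ORDER BY id LIMIT $per_page");
        $sth->execute($startid);
        return @{ $sth->fetchall_arrayref };
    }

    A page script could then call get_page($startid, 10), print a link for each returned [id, filename] pair, and use the last id plus one as the $startid for the "next page" link.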

Re: A story of a Perl Monk and Problem
by chipmunk (Parson) on May 19, 2001 at 19:32 UTC
    When using IO::Dir, the value you pass to $dir->seek(POS) should be a value returned from $dir->tell().

    These methods are wrappers around Perl's builtins, seekdir and telldir.
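
    For example, a minimal sketch of that round trip, reusing the directory path from the original post:

    use IO::Dir;

    my $dir = IO::Dir->new('/imageproc/files/ips/fullsize/') or die "can't open directory: $!";

    # Remember where the handle is before reading a page of ten entries.
    my $start_of_page = $dir->tell();        # wraps telldir()
    my @page;
    for (1 .. 10) {
        my $name = $dir->read();             # wraps readdir()
        last unless defined $name;
        push @page, $name;
    }

    # Later, jump back to the remembered position and re-read the same page.
    $dir->seek($start_of_page);              # wraps seekdir()
    my @same_page;
    for (1 .. 10) {
        my $name = $dir->read();
        last unless defined $name;
        push @same_page, $name;
    }

    # @same_page now holds the same ten names as @page.

    The only values that are safe to pass to seek() are ones that tell() handed back earlier for that same open directory handle.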

      Will the POS change if data in the directory changes?
      What are the security implications of passing a seek POS through the web?

      Brought to you by that crazy but lovable guy... lindex
Re: A story of a Perl Monk and Problem
by Brovnik (Hermit) on May 19, 2001 at 20:07 UTC
    Seek isn't really for skipping forwards into unknown territory like this, unless you know enough about what you're seeking through to know exactly where you want to go. In particular, you can't do a "skip the next 100 files" command.

    However, if you do a

    push(@tells, $dir->tell());
    at the start of each page, it will allow you to use seek later to skip back to any of those particular points, using the values stored in the @tells array. E.g.
    # have now read through all files once and stored every
    # Nth position in @tells
    my $dirpos = @tells / 2;    # start in the middle
    my $browsing = 1;
    while ($browsing) {
        my $action = "";
        $f = $dir->seek($tells[$dirpos]);

        # code to go here to read next N files and display
        # results to user.
        # Come back here when we have a submit from the user
        # and $action set to the result.

        if ($action eq "pageforwards") {
            # should check for end
            $dirpos++;
        }
        elsif ($action eq "pagebackwards") {
            # should check for start
            $dirpos--;
        }
        else {
            # do other actions
            $browsing = 0;
        }
    }

    This way, you only have to store a value for every Nth file, which is a big reduction in storage.
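
    For instance, the @tells array could be built in one initial pass, along these lines (the page size N is illustrative, and the directory path is carried over from the original post):

    use IO::Dir;

    my $N   = 10;    # page size
    my $dir = IO::Dir->new('/imageproc/files/ips/fullsize/') or die "can't open directory: $!";

    # One pass through the directory, remembering the position at the
    # start of every page of N entries.
    my @tells = ($dir->tell());                      # position of the first page
    my $count = 0;
    while (defined(my $name = $dir->read())) {
        $count++;
        push @tells, $dir->tell() if $count % $N == 0;
    }
    pop @tells if $count and $count % $N == 0;       # avoid a trailing empty page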

    --
    Brovnik.

      But wouldn't it still need to go through the directory on every request from the web? I still think an RDBMS would be better. You'd only loop through the files when you create the table, and the memory requirements would be minimal, beyond the MySQL daemon running.

      After you had your table with filenames, you would then only select the few filenames you needed to create each index page. The list of files is already prepared and stored in the table, so there's minimal extra work involved in giving a user a page.

        Yes, it would. This falls into the "If I were trying to get there, I wouldn't start from here" category, but I was answering the specific point about "how do I use seek(POS)" rather than the broader "how do I present 90,000 files to the user".

        I agree with thpfft that trying to present them all to the user isn't the way, and a search would be much better.

        Unless the filenames are descriptive (and this is difficult if they are in 8.3 notation), the search needs to be on some content or keywords related to the file as well, so you really should have some sort of persistent database interface to the directory.
        --
        Brovnik.

        Edit: chipmunk 2001-05-19

Re: A story of a Perl Monk and Problem
by jepri (Parson) on May 19, 2001 at 18:05 UTC
    What happens when you work on a smaller directory and use seeks? Try printing ten at a time, then setting the POS back and printing the same ten out again.
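
    One way to run that experiment, sketched with the builtin directory functions on a hypothetical small test directory:

    use strict;

    # Hypothetical test directory; substitute any small directory.
    opendir(DIR, '/tmp/testdir') or die "can't open /tmp/testdir: $!";

    my $pos = telldir(DIR);                              # remember the position
    my @first = grep { defined } map { scalar readdir(DIR) } 1 .. 10;
    print "first pass:  @first\n";

    seekdir(DIR, $pos);                                  # jump back to it
    my @second = grep { defined } map { scalar readdir(DIR) } 1 .. 10;
    print "second pass: @second\n";                      # should match the first pass

    closedir(DIR);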

    ____________________
    Jeremy
    I didn't believe in evil until I dated it.

Re: A story of a Perl Monk and Problem
by thpfft (Chaplain) on May 19, 2001 at 21:09 UTC

    Seems to me that no amount of seeking will let someone find the right file from 90,000. Perhaps you need to offer a different interface to the classical file manager. Let people search by date range or title, for example, or find some regularity in the data which will let you break the collection into chunks. But then you'd probably need a database for that too.

    Anyway, perhaps the simplest paging mechanism would be just to pass the name of the file at the end of the previous list, rather than trying to carry numbers? Then you can use something like:

    # $filename comes from input
    my $marker = '';
    $marker = readdir(DIR) while defined($marker) and ($filename cmp $marker);
    print "$_: " . readdir(DIR) . "\n" for (1 .. 10);

    Which assumes an alphabetical list but should survive deletion of the marker file, at least.

    updated to remove stupid mistake before anyone notices.
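
    A slightly fuller, self-contained sketch of the same marker idea as a CGI script; the directory path, the "after" parameter name, and the plain-text output are all illustrative, and the exact-match skip loop below simply runs off the end of the directory if the marker file has been deleted:

    #!/usr/bin/perl -w
    use strict;
    use CGI;

    my $dirname  = '/imageproc/files/ips/fullsize/';
    my $q        = CGI->new;
    my $after    = $q->param('after') || '';   # last filename shown on the previous page
    my $per_page = 10;

    opendir(DIR, $dirname) or die "can't open $dirname: $!";

    # Skip forward until we have consumed the marker entry, if one was given.
    if ($after ne '') {
        while (defined(my $name = readdir(DIR))) {
            last if $name eq $after;
        }
    }

    # Print the next page of entries; the last one printed becomes the
    # "after" parameter for the link to the following page.
    print $q->header('text/plain');
    my $count = 0;
    while (defined(my $name = readdir(DIR))) {
        next if $name eq '.' or $name eq '..';
        print "$name\n";
        last if ++$count >= $per_page;
    }
    closedir(DIR);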

(dws)Re: A story of a Perl Monk and Problem
by dws (Chancellor) on May 20, 2001 at 02:54 UTC
    ... if he wrote his perl script to load the contents of said directory into an array, the script would have a huge memory footprint.

    Have you determined whether an occasional huge memory footprint is actually significant in the system you're building? Reading the directory into an array is simple to implement and test. If the CGI is going to be invoked relatively infrequently (e.g., a few times a minute) on a machine with adequate memory, the impact of the footprint might be insignificant in the grand scheme of things. Smaller-footprint alternatives are more difficult to implement, and might be more compute-intensive.

      Good question, ++
      Well, on the particular machine in question, I am developing this tool as a mod_perl application, so any memory used by perl is shared with apache.
      Thus, if perl has an array in memory that is about 2.3MB, then so does apache, and that is not acceptable for me.
      The machine also does web serving and other file processing, so good CPU and memory stats are a must.


      Brought to you by that crazy but lovable guy... lindex
        Thus, if perl has an array in memory that is about 2.3MB, then so does apache, and that is not acceptable for me.

        Is mod_perl mandatory, or can you do this particular task with a CGI?