Beefy Boxes and Bandwidth Generously Provided by pair Networks
Welcome to the Monastery

I need speed

by Galen (Beadle)
on Jun 21, 2002 at 17:44 UTC ( [id://176335]=perlquestion: print w/replies, xml ) Need Help??

Galen has asked for the wisdom of the Perl Monks concerning the following question:

We have about 200,000 files floating around in various subdirectories on a network server. To solve the problem of finding where someone has moved individual files, I created a simple cgi to search for a filename (or part of a filename) and when found, output a link to it. Obviously, this would be extremely slow if I actively searched for the current location of the file, so instead I search a delimited text file which has all the filenames and paths. I generate this text file every so often, usually at the end of the day. This text file is currently 31M in size.

But.. the problem I'm now running into is speed. So many people have started using the interface that it's getting rather slow. To make matters worse, we expect another 200,000 files or so to be dumped onto the server soon. I have been thinking about changing the interface slightly so that people only search subsets of the entire filebase. I've also thought that a DBM hash might be faster to search than a text file, but I'm not sure of this.

Would a DBM hash improve efficiency? Any other ideas? Thanks for any suggestions. Here is the little bit of code that I use to search the text file:

sub lookup { my $infile = "locs.txt"; my $href; $table = "<TABLE CELLPADDING=10><TR><TD><B>Path<TD><B>Size<TD><B>Mod +ified</TR>"; open(F, "+< $infile"); while (<F>) { ($filename, $cms, $path, $size, $day, $time) = split /,/, $_; if (index($filename, $to_find) > -1) { $href = "file:\\\\netd\\data".$path."\\$filename"; $href =~ s/\s/%20/g; $table .= "<TR><TD><A HREF=\"$href\">$path\\$filename</A><TD>$si +ze<TD>$day $time</TR>"; } } close(F); $table .= "</TABLE>"; $table =~ s/"//g; }

Replies are listed 'Best First'.
Re: I need speed
by grinder (Bishop) on Jun 21, 2002 at 18:07 UTC
    You don't say what platform you are running on. The following hold for Unix platforms.

    It may seem counter intuitive, but I think you'll get a fair bang for your buck by opening an output pipe from the locate(1) program. It's designed to be efficient at answering these kinds of queries.

    On the other hand, if you search for a given string "abc", it will output the name of the file even if "abc" is part of the pathname, not only the filename. But you could filter out the false hits by taking the basename of the returned file and checking it against your target string when you read it in.

    # remember to untaint $str open LOC, "/usr/bin/locate $str |" or die "Cannot open input pipe: $!\ +n"; while( <LOC> ) { chomp; my $file = basename $_; next unless index($filename, $str) > -1; # output this filename } close LOC;

    I'm not convinced a DBM hash will speed things up: at the end of the day you're still doing a linear scan of 200K records. (Assuming that since you're doing an index partial substrings of names are allowed).

    Antother idea would be to simplify the location file contents, to reduce its size. Just store the name of each file. If and when a person hits it in a search, stat the file at that point in time (the results will be fresher to boot). This will let you store more files per disk block, thus less disk blocks to store all of them.

    Depending on a certain ratio of CPU to disk I/O performance that only testing can demonstrate, it may be faster to zip the file and read and decompress it on the fly, on the assumption that the extra CPU cost is offset by reducing the number of disk blocks read.

    print@_{sort keys %_},$/if%_=split//,'= & *a?b:e\f/h^h!j+n,o@o;r$s-t%t#u'
Re: I need speed
by Aristotle (Chancellor) on Jun 21, 2002 at 19:34 UTC
    Here are some tips if you don't want to use external tools. The key is to be lazy and defer as much work as possible. One simple thing you can do is
    my ($filename, $rest) = split /,/, $_, 2; if (index($filename, $to_find) > -1) { my ($cms, $path, $size, $day, $time) = split /,/, $rest;
    Going quite a lot further:
    open(F, "+< $infile"); while (sysread F, $_, 32768) { $_ .= <F>; next unless /\Q$to_find\E/; # quickskip for(grep /^[^,]*?\Q$to_find\E/, split /\n/, $_) { ($filename, $cms, $path, $size, $day, $time) = split /,/; $href = "file:\\\\netd\\data".$path."\\$filename"; $href =~ s/\s/%20/g; $table .= "<TR><TD><A HREF=\"$href\">$path\\$filename</A><TD>$si +ze<TD>$day $tim +e</TR>"; } } close(F);
    This greatly reduces the number of IO operations and restricts the heavy splittage to known matches.

    I believe the following is an extra win, but I haven't benchmarked it.
    open(F, "+< $infile"); while (sysread F, $_, 32768) { $_ .= <F>; next unless /\Q$to_find\E/; # quickskip while(/^([^\n,]*?\Q$to_find\E[^\n]*)/gm) { ($filename, $cms, $path, $size, $day, $time) = split /,/, $1; $href = "file:\\\\netd\\data".$path."\\$filename"; $href =~ s/\s/%20/g; $table .= "<TR><TD><A HREF=\"$href\">$path\\$filename</A><TD>$si +ze<TD>$day $tim +e</TR>"; } } close(F);
    Beyond these tricks, you quickly get to the point of pretty much handrolling your own database engine..

    Makeshifts last the longest.

      Thanks for your reply - I had thought about limiting the split, but I didn't believe it would enhance performance much. It does appear to help though. The simple reduced split you noted seems to improve performance by about 20%. However, the other snippets you provided don't work. I'm trying to figure out why. They return no matches for files I know exist.
        Odd. Admitted, I didn't test them except for the regex in the last snippet when I posted them, but I did now and they seem to be working. *shrug* There's nothing obvious I can spot either.

        Makeshifts last the longest.

Re: I need speed
by Abigail-II (Bishop) on Jun 21, 2002 at 18:04 UTC
    DBM Files are good for exact matches, but you aren't doing an exact match. You are doing a substring match. To do that efficiently (that is, significantly faster than matching all files) you either need a complicated datastructure (that you will have to make persistant somehow) or use lots and lots of (disk)memory. Or some combination of both. It would also mean that updating the information is going to take longer.


Re: I need speed
by twerq (Deacon) on Jun 21, 2002 at 17:51 UTC
    Using DB_File would definately be faster for index lookups, provided you used the btree interface.

    By using the filename as the key, and the path as the value, you can quickly search for the location of your file.

    One problem you may run into, though -- is that many files (under different paths) share identical names. See the section "Handling Duplicate Keys" in man DB_File.

Re: I need speed
by perrin (Chancellor) on Jun 21, 2002 at 18:47 UTC
    Have you ever tried the command "locate"? It does this very very well and you can easilly make your CGI script run it. I believe it is even available on Windows as part of the Cygwin package.
Re: I need speed
by RMGir (Prior) on Jun 21, 2002 at 17:58 UTC
    If you already have a database server set up somewhere in your organisation, and it's not too heavily loaded, a database table and a bit of DBI code would probably help a lot.
Re: I need speed
by Galen (Beadle) on Jun 21, 2002 at 18:49 UTC
    To answer some questions: this is being done on a *cough* Win2k platform (no locate). I am a lowly knuckle-dragging grunt and access to a database server is not an option. This organization regulates that sort of thing tightly.

    The bottleneck here seems to be processing speed, as opposed to memory or disk space. I suppose that could change if the index file grows to over 100M. It might increase efficiency somewhat to search a file containing only filenames, and then to retrieve path into from a DBM hash using the filename as a key, (with a little workaround for duplicate filenames in different directories). However, I think my initial idea of breaking the index file into separate, smaller indexes and choosing which to look through based on the search string may produce a greater increase in speed. If I didn't want to search for substrings in the filename this would be easy. The problem is that people don't always know the exact filename they are looking for, and it is useful to be able to display multiple matches (all files of a given extension, for example).

      I am a lowly knuckle-dragging grunt and access to a database server is not an option. This organization regulates that sort of thing tightly.

      Would anyone object to you installing DBI with DBD::SQLite? It installs a data base server, but it's a really tiny one, and a DB is conveniently stored as a single file.In short it means no administrative hassle and probably no need to ask anyone about it. SQLite implements the LIKE operator and wildcards in queries, so it should help you with your problem.

Re: I need speed
by robobunny (Friar) on Jun 21, 2002 at 17:51 UTC
    a DBM hash wouldn't do you any good, because you need to process the entire file anyway. the benefit of a DBM is that it speeds up accessing specific elements of the file. you are sequentially reading the whole thing.
Re: I need speed
by Galen (Beadle) on Jun 21, 2002 at 19:20 UTC
    Cygwin is great! Thank you so much. It's almost as good as having a unix box in my cube again.. which this place won't allow.
Re: I need speed
by Galen (Beadle) on Jun 21, 2002 at 19:46 UTC
    I don't know a lot about relational databases, but I'm a quick study. I installed MySQL once and played with it for a while - long enough to know I would love to develop some perl-SQL skills. I just didn't quite know where to go with the daemon and eventually uninstalled it because it was wasting resources. However, if I were to get started with something like this file search utility, I can foresee many other potential applications. Do you recommend SQLite over MySQL?
Re: I need speed
by Galen (Beadle) on Jun 24, 2002 at 15:42 UTC
    OK - the regex is working fine in your code Aristotle, but there is a mistake. Here is how I changed it to get it working:
    sysopen(F, $infile, 0); while (sysread F, $_, 32768) { next unless /\Q$to_find\E/; while(/^([^\007,]*?\Q$to_find\E[^\007]*)/gm) { ($filename, $cms, $path, $size, $day, $time) = split /,/, $1;
Re: I need speed
by Galen (Beadle) on Jun 21, 2002 at 18:56 UTC
    cygwin huh? I shall try it. Thanks!
Re: I need speed
by Galen (Beadle) on Jun 21, 2002 at 20:34 UTC
    Accidently double posted.

Log In?

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://176335]
Approved by grinder
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others studying the Monastery: (2)
As of 2024-07-22 16:04 GMT
Find Nodes?
    Voting Booth?

    No recent polls found

    erzuuli‥ 🛈The London Perl and Raku Workshop takes place on 26th Oct 2024. If your company depends on Perl, please consider sponsoring and/or attending.