http://www.perlmonks.org?node_id=1170958


in reply to use threads for dir tree walking really hurts

Very interesting topic, and ++Corion for the answer. Some scattered suggestions:

From the little I know of the matter, I suppose that, besides the thread implementation you put in the field, you must rely on the speed of the filesystem and of all the underlying OS-specific API calls.

I doubt that more CPUs can read from a physical hard drive faster than a single one.

My experience fighting with Windows is very long, long enough to let me say that it is by far better and faster to use, as much as possible, the native tools offered (or, well, concealed) by the OS. An example is fetching file permissions: old Perl modules existed, but wrapping tools like icacls.exe is faster, less error prone, and has worked for decades.
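To make the icacls.exe idea concrete, here is a minimal sketch of such a wrapper. The sub names (parse_icacls, icacls_for) are made up for illustration, and the parser assumes the usual icacls output layout (path followed by the first entry, further entries indented, each as ACCOUNT:(flag)(flag)...); check it against what your Windows version really prints.

```perl
use strict;
use warnings;

# Turn the text printed by "icacls <path>" into a list of
# { account => ..., rights => [...] } hashes.
# Assumption: each entry looks like  ACCOUNT:(I)(F)  and the first
# line is prefixed with the path exactly as it was given.
sub parse_icacls {
    my ($path, @lines) = @_;
    my @aces;
    for my $raw (@lines) {
        my $line = $raw;                 # copy, so the caller's data is untouched
        $line =~ s/\r?\n\z//;            # drop the line ending
        $line =~ s/^\Q$path\E\s+//;      # drop the leading path on the first line
        # Summary and blank lines do not match this pattern and are skipped.
        if ($line =~ /^\s*(.+?):((?:\([^)]*\))+)\s*$/) {
            my ($account, $flags) = ($1, $2);
            push @aces, {
                account => $account,
                rights  => [ $flags =~ /\(([^)]*)\)/g ],
            };
        }
    }
    return @aces;
}

# Run icacls on one path and parse what it prints (Windows only).
sub icacls_for {
    my ($path) = @_;
    my @out = qx{icacls "$path"};
    die "icacls failed for $path (exit status $?)\n" if $? != 0;
    return parse_icacls($path, @out);
}

# Only try the real tool on Windows, and only when a path is given.
if ($^O eq 'MSWin32' && @ARGV) {
    for my $ace (icacls_for($ARGV[0])) {
        printf "%-40s %s\n", $ace->{account}, join ',', @{ $ace->{rights} };
    }
}
```

Keeping the parsing apart from the call to the external tool means the parser can be checked with canned output on any OS.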

So, going for native solutions, I think the best would be to read the Master File Table (MFT) directly: this intrigues me a lot, but I suspect it is a task far beyond my hacking skills.

Reading the MFT is easier than writing to it, and some tools can do it, so it is feasible. Look at ultrasearch and at this discussion that points to swiftsearch on sourceforge.
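As a read-only first step, the sketch below locates the MFT from an NTFS boot sector. The field offsets come from the commonly documented NTFS boot sector layout, not from this thread, so treat them as assumptions to verify; the sub name is made up, very large cluster sizes that use an encoded sectors-per-cluster byte are not handled, and the 'Q<' unpack template needs a 64-bit perl. It only parses a 512-byte buffer and never writes anything.

```perl
use strict;
use warnings;

# Extract the fields of an NTFS boot sector that tell where the MFT starts.
# Assumed layout: OEM id at 0x03, bytes per sector at 0x0B,
# sectors per cluster at 0x0D, MFT start cluster at 0x30,
# file record size code at 0x40.
sub parse_ntfs_boot_sector {
    my ($sector) = @_;
    die "need at least 512 bytes\n" if length($sector) < 512;

    my $oem = substr($sector, 3, 8);
    die "not an NTFS boot sector (OEM id '$oem')\n" unless $oem eq 'NTFS    ';

    my $bytes_per_sector    = unpack 'v',  substr($sector, 0x0B, 2);
    my $sectors_per_cluster = unpack 'C',  substr($sector, 0x0D, 1);
    my $mft_cluster         = unpack 'Q<', substr($sector, 0x30, 8);
    my $record_code         = unpack 'c',  substr($sector, 0x40, 1);

    my $cluster_size = $bytes_per_sector * $sectors_per_cluster;

    # A negative code means "2 ** abs(code) bytes", otherwise it is a cluster count.
    my $record_size = $record_code < 0
        ? 1 << (-$record_code)
        : $record_code * $cluster_size;

    return {
        cluster_size => $cluster_size,
        record_size  => $record_size,
        mft_offset   => $mft_cluster * $cluster_size,   # bytes from volume start
    };
}

# Usage: pass a file holding a copy of the volume's first sector
# (or a disk image), so the live disk is never touched.
if (@ARGV) {
    open my $fh, '<:raw', $ARGV[0] or die "open $ARGV[0]: $!\n";
    read($fh, my $sector, 512) == 512 or die "short read\n";
    close $fh;
    my $info = parse_ntfs_boot_sector($sector);
    printf "%-12s %d\n", $_, $info->{$_} for sort keys %$info;
}
```

Working from a saved copy of the boot sector or from a disk image keeps the experiment away from a live volume.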

Also, the Linux NTFS filesystem driver can be a rich source of information, if you are able to investigate a Linux driver.

The task of reading the MFT can be accomplished in other languages: see a very detailed answer on stackoverflow about C# and this Python example.

See also analyzeMFT, ntfswalk and MTF_parser.

For the thread part, you may be interested in marioroy's MCE, which comes with many useful examples. At the monastery, marioroy showed an example of dir walking using MCE in the thread Re: Perl threads - parallel run not working

Update: ATTENTION: run your MFT tests on a test hard disk, because the risk of corruption is always present!

Good luck and share your improvements!

L*

There are no rules, there are no thumbs..
Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.

Re^2: use threads for dir tree walking really hurts (MFT?)
by Tanktalus (Canon) on Sep 11, 2016 at 16:12 UTC

    Just as a side note, multiple threads (up to a limit) definitely can access the filesystem faster than a single one. On unix, for example, I regularly do rm -rf by forking off (in perl) two sub-processes per directory for recursion (so you can get a lot of processes going at once). I doubt Windows is significantly different in this aspect - when the filesystem loads a given directory, it likely loads the full inode/sector and then hands back one item at a time, meanwhile another thread could request a different inode/sector and start acting on that. Meanwhile, any changes would be made in memory and sync'd out when the OS felt it either had to or had time to, so this can overall be much faster.
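A minimal sketch of the forking idea, applied to counting files rather than deleting them. This is not Tanktalus's actual code: it forks one child per top-level subdirectory (not two per directory at every level), the sub names are invented, and it puts no cap on the number of children, which a real tool would need.

```perl
use strict;
use warnings;
use File::Find ();

# Count plain files under $dir in the current process.
sub count_files {
    my ($dir) = @_;
    my $n = 0;
    File::Find::find({ wanted => sub { $n++ if -f $_ }, no_chdir => 1 }, $dir);
    return $n;
}

# Count plain files under $root, walking each top-level
# subdirectory in its own child process.
sub parallel_count {
    my ($root) = @_;
    opendir my $dh, $root or die "opendir $root: $!\n";
    my @entries = grep { $_ ne '.' && $_ ne '..' } readdir $dh;
    closedir $dh;

    my $total = 0;
    my @jobs;
    for my $name (@entries) {
        my $path = "$root/$name";
        if (-d $path && !-l $path) {
            pipe(my $reader, my $writer) or die "pipe: $!\n";
            my $pid = fork();
            die "fork: $!\n" unless defined $pid;
            if ($pid == 0) {                 # child: walk one subtree, report, exit
                close $reader;
                print {$writer} count_files($path), "\n";
                close $writer;
                exit 0;
            }
            close $writer;                   # parent keeps only the read end
            push @jobs, [ $pid, $reader ];
        }
        elsif (-f $path) {
            $total++;                        # files directly in $root
        }
    }

    # Collect one number from each child, then reap it.
    for my $job (@jobs) {
        my ($pid, $reader) = @$job;
        my $n = <$reader>;
        close $reader;
        waitpid $pid, 0;
        $total += defined $n ? $n : 0;
    }
    return $total;
}

print parallel_count($ARGV[0]), "\n" if @ARGV;
```

Whether this beats a single process depends on the disk, the cache and the OS, so time it on your own tree before relying on it.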

Re^2: use threads for dir tree walking really hurts (MFT?)
by exilepanda (Friar) on Sep 01, 2016 at 14:02 UTC
    Thanks for your suggestion, but indeed I don't dare to touch the MFT (which is far beyond my ability and resources). Ha!

    I have added a rewrite to my OP which runs as fast as dir does, and I think that's good enough for me, but please help point out any potential problems there. Thanks a lot!