Re (tilly) 1: Slow at sorting?
by tilly (Archbishop) on Nov 22, 2001 at 02:06 UTC
My recommendation for data of this size, if you want to do your own sorting, is to use DB_File's BTREE method. Right now tilly's scratchpad has the following snippet:
use strict;
use DB_File;

my %sorted;
# arbitrary comparison: here, reverse string order
$DB_BTREE->{compare} = sub { $_[1] cmp $_[0] };
# undef as the filename keeps the BTREE in memory
tie(%sorted, 'DB_File', undef, O_RDWR|O_CREAT, 0640, $DB_BTREE) or die $!;
%sorted = 1..30;
print map "$_\n", keys %sorted;
which shows how to create an in-memory BTREE with an arbitrary sort order. If you follow tye's suggestion you can skip the arbitrary sort order, use a temporary file rather than memory, make the keys your sort key and the values your data. Just insert into the hash and then loop over it using the each construct. (Not keys, because that will slurp every key into memory at once.)
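For instance, a minimal sketch of that approach (untested; the filename, the key layout, and the line format are just placeholders):
use strict;
use DB_File;

# A file-backed BTREE: keys are the sort key, values are the data lines.
my %by_key;
tie(%by_key, 'DB_File', 'sort.btree', O_RDWR|O_CREAT, 0640, $DB_BTREE)
    or die "Cannot tie: $!";

while (my $line = <STDIN>) {
    my $key = substr($line, 0, 20);    # however you build your sort key
    $by_key{$key} = $line;             # note: duplicate keys overwrite here
}

# each() walks the BTREE in key order without slurping all the keys
while (my ($key, $line) = each %by_key) {
    print $line;
}
untie %by_key;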
If your OS has large file support, your limit is available disk space. If it does not, your limit is about 2 GB for the temp file, which (considering the BTREE overhead) is a data structure for somewhat over a GB of data. (I would guesstimate about 1.3 GB.)
There is also a File::Sort out there. I prefer DB_File's handling of arbitrary text to having a line-oriented format imposed on me. And if you want to sort data structures rather than text, be sure to look at MLDBM (and expect a big slowdown).
Re: Slow at sorting?
by dws (Chancellor) on Nov 22, 2001 at 00:09 UTC
My test logfile was 100MB, my first problem was memory, loading the whole log into memory eventually sucked all my resources (382 MB of ram)
Whoa! Even after you've made one pass to build your sort key, chances are quite good that you're falling prey to virtual memory thrashing. If that's the case, then the C++ code might well run faster on the 100MB sample (depending on how you code it, C++ will have a lower memory footprint), but at some point you're going to throw more data at the C++ code, and it'll fall over too. Do you have any visibility into virtual memory behavior (e.g., can you get a running page-fault count)?
As I see it, you have a few choices if you want to stay with Perl:
- Add more physical memory (which these days is an inexpensive proposition), or
- Break the file into smaller pieces, sort the pieces, then merge them later (see the merge sketch after this reply), or
- Do both
Depending on your production dataset size, you may have the same options with C++.
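For the second option, a minimal sketch of the merge step might look something like this (untested; it assumes chunk files chunk*.txt that have each already been sorted on the whole line):
use strict;
use warnings;

my @files = glob 'chunk*.txt';
my @fhs   = map { open my $fh, '<', $_ or die "$_: $!"; $fh } @files;
my @heads = map { scalar readline $_ } @fhs;

while (grep { defined } @heads) {
    # pick the stream whose current line sorts first
    my ($best) = sort { $heads[$a] cmp $heads[$b] }
                 grep { defined $heads[$_] } 0 .. $#heads;
    print $heads[$best];
    $heads[$best] = readline $fhs[$best];    # refill from that file
}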
Re: Slow at sorting?
by shotgunefx (Parson) on Nov 22, 2001 at 01:13 UTC
Perl's sorting algorithm is basically a quicksort. I don't know exactly what your data looks like, but quicksort usually behaves very badly with nearly sorted data. This is for a logfile, correct? It's possible that it's nearly sorted.
Another possibility: when faced with sorting an 800 MB search index, I gave up on pure Perl and just used system to punt it off to GNU sort. It went from over 8 hours to around 30 minutes. I had no desire to recreate the functionality of handling files this size; I just needed it to work.
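Something along these lines (the filenames here are made up, and the exact sort options depend on your key):
# Punt the heavy lifting to GNU sort; -o writes the sorted output file
# without needing a shell redirect.
my ($in, $out) = ('biglog.txt', 'biglog.sorted');
system('sort', '-o', $out, $in) == 0
    or die "sort failed: exit status " . ($? >> 8);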
You also might want to check out Mastering Algorithms with Perl from O'Reilly. It has a really good chapter on sorting.
-Lee
"To be civilized is to deny one's nature." | [reply] [Watch: Dir/Any] |
Thanks tilly. Didn't know that. Learn something new every day.
-Lee
"To be civilized is to deny one's nature."
Re: Slow at sorting?
by mirod (Canon) on Nov 22, 2001 at 02:54 UTC
Just one comment on sort efficiency: contrary to what a couple of posters seem to think, Perl sorts strings faster than numbers. The whole point of Uri and Larry's paper is that since the string sort is the default, hard-coded sort, it does not have to call any Perl subroutine, and thus works _real_ fast. So the trick to fast sorting is to be able to do just a sort @array, not a sort {$a <=> $b} @array.
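In other words, something like this (a toy illustration, not taken from the paper): pad the numeric keys to a fixed width so the default string sort produces numeric order without ever calling a comparison sub.
my @nums = (12, 3, 400, 57);

# slow path: the comparison sub is called for every pair
my @slow = sort { $a <=> $b } @nums;

# fast path: fixed-width string keys, then the default sort does the work
my @fast = map { $_ + 0 }            # turn "0000000003" back into 3
           sort                      # plain string sort, no sub call
           map { sprintf "%010d", $_ } @nums;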
And BTW if you are looking for a module to do this automagically, well it looks like File::Records is looking for a new owner.
Re: Slow at sorting?
by clintp (Curate) on Nov 22, 2001 at 01:11 UTC
A few thoughts:
- Your sample data set wasn't very large. Would it be possible for you to make even smaller sort keys? Getting the memory footprint of @sorton a bit smaller might help. Since the sort seems to be the killer, and not the preparation time, you might want to spend more time prepping the data.
- Can you do that by making a completely numeric sort key? Except for the bracketed field ([...]), everything else seems to be numeric, so you wouldn't need an alphabetic comparison here. Packing things down into a smaller value might prove useful. Watch for overflows.
- At the very least, can you get that bracketed field down to a couple of characters instead of a long string?
- How many records _are_ we talking about? Not "100MB" but, how many lines of data?
- (tiny optimization) Can you pack the offset into the sort key as well? Your sort would then be a plain sort @sorton instead of needing the indirection there (see the sketch after this reply).
Kinda makes me wish I had the actual file to try this out....
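Something like this is what that last point would look like (completely untested, and the field widths and the A20 padding are guesses):
my (@sorton, $offset);
$offset = 0;
while (<IN>) {
    if (m/^CD(\d+)\\(\d+)\.pdf\((\d+)\)\s-\s\[(.+?)\]/) {
        # key fields first, offset tacked onto the end, all fixed width
        push @sorton, pack("n N n A20 N", $1, $2, $3, $4, $offset);
    }
    $offset = tell(IN);
}
for my $rec (sort @sorton) {               # plain sort, no comparison sub
    seek IN, unpack("x28 N", $rec), 0;     # skip 2+4+2+20 bytes, read the offset
    print OUT scalar(<IN>);
}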
Re: Slow at sorting?
by guha (Priest) on Nov 22, 2001 at 00:29 UTC
Could you post a few lines of the input file? That would give at least me something more to go on.
Are, for instance, $1 and $3 of constant length?
Update
Oops, just checked the previous post and found it! :-\
Re: Slow at sorting?
by orbital (Scribe) on Nov 22, 2001 at 01:31 UTC
clintp, sorry about just listing the file size; I should have known better. The log file I was testing on is a total of 1,401,986 lines, with 1,401,950 matching my criteria.
I gave a sample of the matching lines in my previous post, but in case you missed it, here you go:
CD1\01100809.pdf(1) - [Account Number] Indexed key "654546654".
CD2\01100809.pdf(1) - [Invoice Date] Indexed key "10/08/2001".
CD1\01100809.pdf(1) - [Customer Name] Indexed key "FOOBAR".
CD2\01100809.pdf(1) - [Contact Name] Indexed key "Dr. FOO".
CD4\01100809.pdf(20) - [Account Number] Indexed key "54356564".
If you really want the full logfile I could strip out all sensitive data and put it somewhere you can grab it. If this is something you really want to play with, let me know...
CD1\01100809.pdf(1) - [Account Number] Indexed key "654546654".
According to an earlier post, the bold fields (the CD number, the file number, and the bracketed field name) are the ones you're sorting on. If that's truly the case, and if the text within brackets comes from a limited dictionary, then this sort can be done entirely with numbers.
First, combine the CD number with the file number, yielding (in this case)
101100809
Then map "Account Number" into its pre-determined sequence number. You're then left sorting something like
[ 101100809, 47, seek-address ]
on the first two fields, which are now numbers.
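Roughly, assuming a hypothetical %seq table that maps the bracketed names to their pre-determined sequence numbers (the numbers below are made up):
my %seq = ('Account Number' => 47, 'Invoice Date' => 48, 'Customer Name' => 49);
my @sorton;
# inside the read loop, with $offset tracking the line's seek address:
if (m/^CD(\d+)\\(\d+)\.pdf\(\d+\)\s-\s\[(.+?)\]/) {
    push @sorton, [ "$1$2" + 0, $seq{$3}, $offset ];  # [ 101100809, 47, seek-address ]
}
# then sort on the first two (numeric) fields
my @sorted = sort { $a->[0] <=> $b->[0] || $a->[1] <=> $b->[1] } @sorton;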
How about extending it to something like
pack("n N n", $1, $2, $3).$4
That way we would sort CD numbers > 9 correctly, decrease the memory footprint, and allow the fast intrinsic ASCIIbetical sort. Perhaps even to the point where fast hash lookups can be used.
It's a pity the data is in practice unavailable.
Here's what I would like to time ...
# ... open IN on the logfile and OUT on the sorted output ...
my %sort_hash;
my $offset = 0;
while (<IN>) {
    if ( m/^CD(\d+)\\(\d+)\.pdf\((\d+)\)\s-\s\[(.+?)\]/ ) {
        # fixed-width packed key: CD#, file#, page#, then the bracketed field
        $sort_hash{ pack("n N n", $1, $2, $3) . $4 } = $offset;
    }
    $offset = tell(IN);
}
foreach my $k (sort keys %sort_hash) {
    seek IN, $sort_hash{$k}, 0;
    print OUT scalar(<IN>);
}
Re: Slow at sorting?
by kwoff (Friar) on Nov 22, 2001 at 02:17 UTC
Is it possible to put the log into a database table? In that case, put an index on whatever columns need sorting and forget about Perl sorting. Unless that kind of thing amuses you. :)
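For instance, with DBI (the table and column names here are invented for the sake of the example):
use DBI;

# Hypothetical schema: one row per log line, with the sort fields split out.
my $dbh = DBI->connect('dbi:mysql:logs', 'user', 'password', { RaiseError => 1 });
$dbh->do('CREATE INDEX idx_sort ON log_lines (cd_num, file_num, field_name)');

my $sth = $dbh->prepare(
    'SELECT line FROM log_lines ORDER BY cd_num, file_num, field_name');
$sth->execute;
while (my ($line) = $sth->fetchrow_array) {
    print $line, "\n";
}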
Re: Slow at sorting?
by guha (Priest) on Nov 23, 2001 at 20:06 UTC
I have been thinking about the pros and cons of using a hash or an array. So I made myself some data with a script approximating something like orbital's example, using the rand() function on the CD#, the pdf name, the digits in parentheses, and the different strings in brackets. I know it's not the real thing, but for what I'm trying to find out it seems OK to me.
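A generator along these lines would do it (the field list and the ranges below are just guesses at orbital's format, not my actual script):
my @fields = ('Account Number', 'Invoice Date', 'Customer Name', 'Contact Name');
for (1 .. 100_000) {
    printf qq{CD%d\\%08d.pdf(%d) - [%s] Indexed key "%d".\n},
        1 + int rand 4,          # CD#
        int rand 99_999_999,     # pdf name
        1 + int rand 20,         # digits in parentheses
        $fields[rand @fields],   # bracketed string
        int rand 999_999_999;    # key value
}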
The two approaches I've tested are either using a hash (straightforward and easy on the programmer, see my previous post) or using an array with the filepos tucked onto the right side of the key (see petral's post).
The tests were run on a measly Pentium 133 with 64MB RAM, and the results are in the table below (YMWV). The cells marked "-" indicate heavy use of virtual memory; impatience being a virtue, I only gave it double the expected time before I hit Ctrl-C.
kLines   Filesize (MB)   Hash (sec)   Array (sec)
  100        6,16            11           12
  150        9,24            16           18
  175       10,78            19           22
  190       11,704           22           24
  200       12,32            23           25
  210       12,936           24           26
  220       13,552           25           27
  230       14,168           62           29
  250       15,4              -           31
  300       18,48             -           37
  350       21,56             -           44
  400       24,64             -          107
  500       30,8              -            -
My interpretation is that if you have the memory then the hash method is slightly faster, but the array method will take you about twice as far in terms of possible file sizes.
Thinking about it, that seems rather logical, considering that the hash stores both key and value whilst the array stores the value only.
It's also nice to see that both variations behave linearly with increasing volume until VM sets in, just as it's written in A Fresh Look at Efficient Perl Sorting by Uri Guttman and Larry Rosler.