gone2015
<p>So it's understood that:</p>
<ul>
<li><p>reading the file is ~ n work and inserting the keys into a hash is ~ 2m work (where m is the number of unique keys, taking into account the rearrangements required as the hash grows -- assuming no pathology).</p></li>
<li><p>reading the file into an array is ~ 2n work (one to read and one to insert), but to sort it is ~ n log n (assuming no pathology), then you need to scan the array to spot the duplicates, which is a further ~ n work.</p></li>
</ul>
<p>So hash: ~ n + 2m; array: ~ 2n + n log n.</p>
<p>Not all work is the same, however. Scanning the array certainly requires ~ n comparisons, but this is in a Perl loop -- which may dominate the running time. Further, this does not account for the cost of reorganising the array to remove duplicates, if that's what you want to do...</p>
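<p>To make the two approaches concrete, here is a minimal sketch of both counting strategies -- an in-memory list stands in for reading the file, and the hypothetical key_part() is taken to be the identity, so this is a sketch of the technique rather than a drop-in implementation:</p>

```perl
use strict ;
use warnings ;

# stand-in for the file's keys (key_part assumed already applied)
my @input = qw(pear apple plum apple pear apple) ;

# hash approach -- one pass, ~ n inserts
my %count ;
$count{$_}++ for @input ;

# array approach -- sort a copy, then scan adjacent entries in a Perl loop
my %scanned ;
my ($p, $c) = ('', 0) ;
for my $k (sort @input) {
    if ($p ne $k) {
        $scanned{$p} = $c if $c ;      # close off the previous run of keys
        ($p, $c) = ($k, 1) ;
    }
    else { ++$c }
}
$scanned{$p} = $c if $c ;              # close off the final run

# both now hold: apple => 3, pear => 2, plum => 1
```

Both produce the same counts; the difference is that the hash does its comparisons inside perl's C-level hashing, while the scan's comparisons sit in an explicit Perl loop.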
<p>However, that may not be the complete story. What do you intend to do with the information?</p>
<p>Suppose you want to print out the unique keys in key order:</p>
<ul>
<li><p>the keys need to be extracted from the hash ~ m work, and sorted ~ m log m, and then printed ~ m.</p></li>
<li><p>the array can simply be printed for ~ m work -- most efficiently if that is combined with the scan for duplicates step.</p></li>
</ul>
<p>So the totals now are hash: ~ n + 4m + m log m; and array: ~ 3n + n log n + m. Where there are a lot of duplicates the hash approach is doing less work; where there are few duplicates there's apparently not a lot in it.</p>
<p>Suppose you want to print out the unique keys in original order (duplicates sorted by first instance):</p>
<ul>
<li><p>with the hash you keep an auxiliary array with the keys in original order, so we have ~ n to read, ~ 2m to add to hash, ~ m to add to auxiliary array, ~ m to print, total: ~ n + 4m -- not a log n or log m in sight, so the total work decreases!</p></li>
<li><p>with an array... you need at least an auxiliary array containing the original array indexes; the indexes are sorted according to the key values, then scanned to count the number of times each key value appears; .... from a work perspective this is not obviously a big increase, but the extra code to work with the index array will affect the running time.</p></li>
</ul>
<p>In this case the hash is a clear winner: not only is there no n log n component, but the work in the array sorting and scanning has clearly increased.</p>
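<p>The hash-plus-auxiliary-array trick is short enough to show in full -- a minimal runnable sketch, again with an in-memory list standing in for the file and the identity standing in for key_part():</p>

```perl
use strict ;
use warnings ;

# stand-in for the file's keys (key_part assumed already applied)
my @input = qw(pear apple plum apple pear apple) ;

# hash counts the keys; auxiliary array remembers first-seen order
my (%count, @order) ;
for my $k (@input) {
    push @order, $k if !$count{$k} ;   # first sighting -- record the order
    $count{$k}++ ;
}

print "$_: $count{$_}\n" for @order ;  # prints pear: 2, apple: 3, plum: 1
```

No sort anywhere: the output order falls out of the single pass, which is why the log terms vanish from the work estimate.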
<p>So what's the point of all this? Well:</p>
<ul>
<li><p>big-O-wise filling the hash is O(n) and filling the array and sorting it is O(n log n). But, for the whole process (assuming key order output) the hash is O(n + m log m) and the array is O(n + n log n) -- big-O is broad-brush, but you do need to consider the entire problem.</p></li>
<li><p>hashes carry quite a hefty memory overhead (and for key order output the entire hash is overhead)... so a hash implementation will start to page earlier than an array one -- big-O is all very well, but it's not the whole story.</p></li>
<li><p>while the array implementation appears little different (unless m is a lot smaller than n), it does involve the scan step, which is a Perl loop -- so we might have, roughly:<code>
  # hash...
  while (<HUM>) {
    $hash{key_part($_)}++ ;
  } ;
  foreach (sort keys %hash) {
    print "$_: $hash{$_}\n" ;
  } ;

  # array...
  while (<HUM>) {
    push @array, key_part($_) ;
  } ;
  my $p = '' ;
  my $c = 0 ;
  foreach (sort @array) {
    if ($p ne $_) {
      print "$p: $c\n" if $c ;
      $p = $_ ;
      $c = 1 ;
    }
    else {
      ++$c ;
    } ;
  } ;
  print "$p: $c\n" if $c ;
</code>which puts the array implementation in its most favourable light, but the extra code for the array version will have weight not accounted for in the big-O analysis.</p></li>
</ul>
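<p>The two loops above can be raced directly with the core Benchmark module -- a minimal sketch, using synthetic in-memory keys in place of the file (so the key names and sizes here are illustrative assumptions, not measurements):</p>

```perl
use strict ;
use warnings ;
use Benchmark qw(timethese) ;

# synthetic keys standing in for the file (key_part assumed already applied)
my @input = map { 'k' . int rand 1000 } 1 .. 20_000 ;

# hash approach: count in a hash, then sort the (few) unique keys
sub by_hash {
    my %h ;
    $h{$_}++ for @input ;
    return map "$_: $h{$_}", sort keys %h ;
}

# array approach: sort everything, then scan adjacent entries in a Perl loop
sub by_array {
    my ($p, $c, @out) = ('', 0) ;
    for my $k (sort @input) {
        if ($p ne $k) {
            push @out, "$p: $c" if $c ;
            ($p, $c) = ($k, 1) ;
        }
        else { ++$c }
    }
    push @out, "$p: $c" if $c ;
    return @out ;
}

# same answer either way; the timings show what the big-O sketch hides
timethese(100, { hash => \&by_hash, array => \&by_array }) ;
```

With many duplicates (here m is at most 1000 against n of 20,000) the hash side sorts a far shorter list, which is exactly the n log n versus m log m difference in the totals above.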
<p>Or, in short, before embarking on big-O, you do have to consider at least the major components of the whole problem, and beware counting apples and oranges.</p>
<p>FWIW, the pseudo-code for keys in original file order is:
<readmore title="here."><code>
  # hash...
  while (<HUM>) {
    my $k = key_part($_) ;
    push @keys, $k if !$hash{$k} ;
    $hash{$k}++ ;
  } ;
  @counts = map $hash{$_}, @keys ;

  # array...
  while (<HUM>) {
    push @array, key_part($_) ;
  } ;
  my $p = '' ;
  my $c = 0 ;
  foreach (sort { $array[$a] cmp $array[$b]
                           || $a <=> $b } (0..$#array)) {
    my $k = $array[$_] ;
    if ($p ne $k) {
      if ($c) {
        push @keys,   $p ;
        push @counts, $c ;
      } ;
      $p = $k ;
      $c = 1 ;
    }
    else {
      ++$c ;
    } ;
  } ;
  if ($c) {
    push @keys,   $p ;
    push @counts, $c ;
  } ;
</code></readmore>... and with Perl, if it's less code, it's generally faster!</p>