Steve973 has asked for the wisdom of the Perl Monks concerning the following question:
I'm very new to Perl, so please forgive me if I ask a question that has already been answered. I am trying to use Perl instead of writing scripts in bash or other shells. In converting a script, I need to find an efficient way to perform the equivalent of "sort -u" and "uniq -c" (at different times, of course). A search of this site has already shown me the use of a hashmap for uniqueness, but it doesn't solve the problem of counting. Plus, I'm parsing several ~50 megabyte text files, so I would prefer not to keep structures in memory for this purpose. File::Sort would help to sort the file, but how might I, for example, perform the equivalent of "sort | uniq -c" for ~100 megs of data (or more)?
Thanks for any help!
Steve
Re: perl functionality like unix's "sort -u" and "uniq -c"
by Roy Johnson (Monsignor) on Apr 08, 2005 at 19:23 UTC
my %hash = map {($_ => undef)} <>;
print sort keys %hash;
and uniq -c is pretty much
my $prev = <>;
my $count = 1;
while (<>) {
    if ($prev eq $_) { $count++ }
    else {
        printf "%4d %s", $count, $prev;
        $count = 1;
        $prev = $_;
    }
}
printf "%4d %s", $count, $prev;
More complete implementations of Unix tools in Perl are available from CPAN as PPT::Util.
Caution: Contents may have been coded under pressure.
Re: perl functionality like unix's "sort -u" and "uniq -c"
by ambs (Pilgrim) on Apr 08, 2005 at 19:02 UTC
Steve, can I ask if you are doing that because you really need to do it in Perl, or just to learn? I mean, those two unix programs are very efficient, so why not just use them?
Also, you can open shell pipes as if they were files. So, you can easily do:
open(PIPE, "sort file | uniq -c |") or die "Cannot start pipeline: $!";
while (<PIPE>) {
    # each line is a count followed by the text, as printed by uniq -c
}
close(PIPE);
OK, I know I didn't show you how to write those scripts in Perl, but I hope I made you consider whether you really need to rewrite those tools in Perl.
I'll give you a 4-year observation. Having done something very similar to what the OP is asking, and not only for sort and uniq, but for many things that could be done in shell (I know, because I used to maintain the shell script that did it), I'll tell you that portability can be problematic. It's difficult enough ensuring portability for perl ;-), never mind a whole host of subcommands.
I make extensive, extensive use out of trivial-looking modules such as File::Copy, and especially File::Spec, and grepping through files is commonplace.
However, that's just the 1-year view. The following 3 years of the view is that once you've started down the road of conversion to pure Perl, you'll find better, faster, and easier ways of doing things. You'll find out, perhaps, that you don't really need the sort. In shell, you need the sort to make uniq work the way you want. In perl, you don't - just use a hash. Bang! Speed improvement. In shell, you need to use temporary files if you want to feed the same input into multiple filters (sort it for one output, sort and uniq for another output, then diff, just to see what is duplicated). In perl, you can keep it in memory (if it's small enough). Bang! More speed improvement.
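A minimal sketch of the "just use a hash" idea, which also covers the counting part of the original question: the hash value holds how many times each line was seen, so no prior sort is needed. The sub name count_lines and the sample data are made up for illustration. Memory use grows with the number of distinct lines, not with the total size of the input.

```perl
use strict;
use warnings;

# Hypothetical helper: tally identical lines read from a filehandle.
# Returns a hash reference mapping each distinct line to its count.
sub count_lines {
    my ($fh) = @_;
    my %count;
    while (my $line = <$fh>) {
        chomp $line;
        $count{$line}++;
    }
    return \%count;
}

# Usage: an in-memory filehandle stands in for a real file here.
my $data = "b\na\nb\nc\na\nb\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my $count = count_lines($fh);
close $fh;

# Print in the style of "sort | uniq -c".
printf "%4d %s\n", $count->{$_}, $_ for sort keys %$count;
```

Printing only sort keys %$count, without the counts, gives the "sort -u" result from the same hash.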
Reducing, and outright removing, your dependency on the shell is just the first step to cleaner, faster, and easier-to-maintain scripting.
Or, at least, that's my experience with perl.
Re: perl functionality like unix's "sort -u" and "uniq -c"
by xorl (Deacon) on Apr 08, 2005 at 19:15 UTC
As much as I would like it to be, Perl is not the answer to everything. I personally would stick with the shell commands.
I've not used File::Sort, but let's assume it sorts the file...
I'd do something like
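open(my $fh, '<', 'sorted.file') or die "Cannot open sorted.file: $!";
my ($previous, $count) = ('', 0);
while (<$fh>) {
    # string comparison: != would compare the lines numerically
    if ($_ ne $previous) {
        print;      # first line of each run of duplicates
        $count++;
    }
    $previous = $_;
}
close($fh);
print "There are $count unique records\n";
Code is untested and not well formatted, but hopefully you'll get the idea.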
Re: perl functionality like unix's "sort -u" and "uniq -c"
by runrig (Abbot) on Apr 08, 2005 at 20:40 UTC
If you don't want to use the unix commands (which are also available for Windows), then consider using a database.
If you don't want to install a database, you can use DBD::SQLite.
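A rough sketch of what the database route could look like with DBI and DBD::SQLite, assuming both modules are installed. The table name lines, the sub name count_lines_sqlite, and the sample data are invented for illustration. With a file path instead of ':memory:' the data lives on disk rather than in a Perl structure, which is the point for large inputs.

```perl
use strict;
use warnings;
use DBI;

# Hypothetical helper: load lines into SQLite and let SQL do the
# grouping and counting. $dbfile may be a file path or ':memory:'.
sub count_lines_sqlite {
    my ($fh, $dbfile) = @_;
    my $dbh = DBI->connect("dbi:SQLite:dbname=$dbfile", '', '',
        { RaiseError => 1, AutoCommit => 1 });
    $dbh->do('CREATE TABLE IF NOT EXISTS lines (line TEXT)');

    # One transaction for all inserts; per-row commits are much slower.
    $dbh->begin_work;
    my $ins = $dbh->prepare('INSERT INTO lines (line) VALUES (?)');
    while (my $line = <$fh>) {
        chomp $line;
        $ins->execute($line);
    }
    $dbh->commit;

    # Each row is [line, count], ordered like "sort | uniq -c".
    my $rows = $dbh->selectall_arrayref(
        'SELECT line, COUNT(*) FROM lines GROUP BY line ORDER BY line');
    $dbh->disconnect;
    return $rows;
}

# Usage: an in-memory filehandle and database stand in for real files.
my $data = "b\na\nb\nc\na\nb\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my $rows = count_lines_sqlite($fh, ':memory:');
close $fh;
printf "%4d %s\n", $_->[1], $_->[0] for @$rows;
```

Replacing the query with SELECT DISTINCT line FROM lines ORDER BY line would give the "sort -u" equivalent.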