Re: Removing repeated words

by Thelonius (Priest)
on Sep 20, 2002 at 18:40 UTC


in reply to Removing repeated words

Also note, it would be nice if there's a way to be just as fast, but less memory intensive, as the hash may end up holding as many as 4 million items, with keys of length 1 to 32.
There is a two-pass solution to this that uses very little memory. Parse the words from each line and output them to a pipe "|sort|uniq -d". Input the results of that pipe and you'll have a list of duplicate words to save in a hash.

The second time through your file you compare the words to that hash, something like:

if (!exists $dup{$_} || $dup{$_}++ == 0) { print "$_ " }
If you know that STDIN is seekable (i.e. a disk file, not pipe or socket or terminal), you can seek STDIN, 0, 0 to rewind. Otherwise you'll have to write a copy of the data somewhere for your second pass.
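
Putting the two passes together, something like the following rough sketch should work. The filename handling and the dups.tmp temporary file are just placeholders, and it assumes the input is an ordinary, seekable disk file:

#!/usr/bin/perl
# Two-pass duplicate-word filter - a rough sketch of the approach above.
use strict;
use warnings;

my $file = shift or die "usage: $0 file\n";   # placeholder: an ordinary disk file

# Pass 1: push every word through an external sort; "uniq -d" emits only
# the words that occur more than once.
open my $in,   '<',  $file                       or die "open $file: $!";
open my $sort, '|-', 'sort | uniq -d > dups.tmp' or die "pipe: $!";
while (<$in>) {
    print {$sort} "$1\n" while /(\w+)/g;
}
close $sort or die "sort | uniq -d failed: $!";

# Slurp the (hopefully much smaller) list of duplicated words into a hash.
my %dup;
open my $dups, '<', 'dups.tmp' or die "open dups.tmp: $!";
while (<$dups>) {
    chomp;
    $dup{$_} = 0;
}
close $dups;

# Pass 2: rewind and reprint each line, keeping only the first copy of
# any duplicated word.
seek $in, 0, 0 or die "seek: $!";
while (<$in>) {
    my @keep = grep { !exists $dup{$_} || $dup{$_}++ == 0 } /(\w+)/g;
    print "@keep\n";
}
close $in;
unlink 'dups.tmp';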

If what you are really after is a list of the unique words in a file and you don't care about the order or line breaks, you can just parse the words out to "|sort -u".
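
In that case the Perl side shrinks to a word splitter feeding an external "sort -u". A minimal sketch, reading whatever files are named on the command line (or STDIN):

#!/usr/bin/perl
# Emit the unique words of the input, one per line, in sorted order.
use strict;
use warnings;

open my $sort, '|-', 'sort -u' or die "pipe to sort: $!";
while (<>) {
    print {$sort} "$1\n" while /(\w+)/g;
}
close $sort or die "sort -u failed: $!";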

Replies are listed 'Best First'.
Re^2: Removing repeated words
by Aristotle (Chancellor) on Sep 21, 2002 at 08:51 UTC

    That won't fly.

    For one, sort reads its entire input before outputting a single character, so the sort process will not only grow comparably to the hash inside the perl process, it will in fact grow larger than the entire input file. It does considerably more work too - the single-pass approach doesn't need to sort the data since it uses a hash to keep words unique.

    Secondly, the hash you're creating in the second pass is exactly as large as the hash would be at the end of the single-pass script - they both contain all unique words in the file. But you create that hash before you start processing, so your second pass will start out with as much memory consumed as the single-pass scripts reach only by the end of their processing.

    Makeshifts last the longest.

      Well, I doubt anybody will read this since it's so late, but I was, like, busy, ya know?
      For one, sort reads its entire input before outputting a single character, so the sort process will not only grow comparably to the hash inside the perl process, it will in fact grow larger than the entire input file. It does considerably more work too - the single-pass approach doesn't need to sort the data since it uses a hash to keep words unique.
      Notice that I was proposing an external sort, so your first objection is not correct. The external sort does not use much memory at all, only disk space. It will be slower unless the hash approach runs out of memory, in which case the external sort would be much faster.
      Secondly, the hash you're creating in the second pass is exactly as large as the hash would be at the end of the single-pass script - they both contain all unique words in the file.
      No, read again; my hash contains only a list of the duplicated words. The words that are truly unique will never be in the hash at all. Of course, it's possible that all the words are duplicated at least once, in which case you are right.

      Also, I suspect that he really wants a list of all the unique words. If he doesn't care about the order, "sort -u" may well be faster.
