PerlMonks |
Not to be argumentative, but are you sure you need to sort this list?
If your goal were to identify duplicates in the list, you could take a hash (MD5, not %hash) of each list element, sort the hashes, and look for duplicates in the hash space. A hash function like MD5 or SHA1 should reliably distinguish 2 GB strings, but if you do find duplicates, you could always verify them against the primary data.

If these strings are expected to be dissimilar, it may suffice to sort them based on their first 1000 characters and then use a separate procedure to deal with cases where the first 1000 characters are identical.

This seems like a time to step back and think about the larger goal and any a priori knowledge of the data to be sorted.

In reply to Re: Sorting Gigabytes of Strings Without Storing Them
by eye
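
A minimal sketch of the hash-and-compare idea above, using the core Digest::MD5 module. The short sample strings here are just illustrative stand-ins for the real multi-gigabyte data:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Toy stand-in for the real multi-gigabyte strings.
my @strings = ('alpha', 'beta', 'alpha', 'gamma');

# Map each digest to the indices of the strings that produced it.
my %seen;
for my $i (0 .. $#strings) {
    my $digest = md5_hex($strings[$i]);
    push @{ $seen{$digest} }, $i;
}

# Any digest shared by more than one index is a candidate duplicate;
# verify against the primary data rather than trusting the hash alone.
for my $digest (sort keys %seen) {
    my @idx = @{ $seen{$digest} };
    next unless @idx > 1;
    my @confirmed = grep { $strings[$_] eq $strings[$idx[0]] } @idx[1 .. $#idx];
    print "indices @idx share digest $digest\n" if @confirmed;
}
```

For actual 2 GB strings you would not hold each string in memory; Digest::MD5's object interface can stream the data instead (e.g. `Digest::MD5->new->addfile($fh)->hexdigest` on a filehandle opened per element).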