Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??
After posting my example of a Schwartzian Transform in Complex Sorting Optimization? for suggestions on optimization I made a few more discoveries.

I need to Thank tye for his sorting alternative, by building a single sort key and then sort of that, instead of or`ing my sort priority. I instantly notice 4 times the performance working on a 3MB file.

This enlightnment from tye made me question what other ways can I sort by and how can I sort faster, that is when I found A Fresh Look at Efficient Perl Sorting by Uri Guttman and Larry Rosler. This paper is incredible, if you have any questions about your own sorting techniques go read this paper now!!!

Taking my new sense of enlightnment I used tye code example to try and sort a very large log file. My test logfile was 100MB, my first problem was memory, loading the whole log into memory eventually sucked all my resources (382 MB of ram), since I was also building sort arrays and then an index array. So I made a very slight modification by making an array with references to the matching line not the line itself, I did this by getting the offset of the line. This saved me quite a bit of memory overhead at least enough to run the script without hitting virtual memory. Here is the new version of the code snippet:

my $infile = $ARGV[0]; my $outfile = $infile; $outfile =~ s/.{3}$/out/i; open(IN, $infile ) || die "can't open logfile: $infile\n"; open(OUT, ">$outfile" ) || die "can't create datfile: $outfile\n"; my ($offset, @sorton); while (<IN>) { if( m/^(.+?)\((\d+)\)\s-\s\[(.+?)\].+?"(.*?)"\.$/ ) { push @sorton, $1.pack("N",$2).$3; push @lines, $offset; 1; } $offset = tell(IN); } my @idx= sort { $sorton[$a] cmp $sorton[$b] } 0..$#sorton; for( @lines[@idx] ) { seek IN, $_, 0; print OUT scalar(<IN>); }

Yes I know I should be putting an exclusive lock on my file esp.since I am keying off the line offsets.

What I found from running this script on the large file was horrific. It took over 12 hours to sort this file! With a few very crued checks I found that the bottleneck was the line where I sort into @idx. All the steps above are completed within minutes, and the write step is done in less time then that. I guess my frusration is that this code sample was ported into C++, using a lot of the same techinques and the whole 100MB file ran in 2 minutes!!!! Thats a huge difference.

Does perl built in sort have that much overhead? Would I be better off not using the built in sort at all?


In reply to Slow at sorting? by orbital

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others having an uproarious good time at the Monastery: (4)
As of 2024-03-19 11:42 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found