PerlMonks  

Re^5: statistics of a large text

by tilly (Archbishop)
on Jan 27, 2011 at 15:05 UTC (#884570)


in reply to Re^4: statistics of a large text
in thread statistics of a large text

Don't worry too much about micro-optimization. The key is to take advantage of the fact that all of the lines for a given n-gram are bunched together, so you only have to keep one n-gram's line numbers in memory at a time. I would do that something like this (untested):
#! /usr/bin/perl -w
use strict;

my $last_n_gram;
my @line_numbers;
while (<>) {
    # Each input line looks like "n-gram: line_number"; skip anything else.
    my ($n_gram, $line_number) = /(.*): (\d+)$/ or next;
    if (defined $last_n_gram and $n_gram ne $last_n_gram) {
        # The n-gram changed, so flush the group we just finished.
        @line_numbers = sort {$a <=> $b} @line_numbers;
        print "$last_n_gram: @line_numbers\n";
        @line_numbers = ();
    }
    $last_n_gram = $n_gram;
    push @line_numbers, $line_number;
}
# Flush the final group, if there was any input at all.
if (@line_numbers) {
    @line_numbers = sort {$a <=> $b} @line_numbers;
    print "$last_n_gram: @line_numbers\n";
}
This assumes that you're going to run reduce_step.pl intermediate_file > final_file.
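
For anyone who wants to try the reduce step on something small, here is a minimal sketch of how an intermediate file in the "n-gram: line_number" format might be produced. It is an assumption, not the map step from earlier in the thread: the sub name n_gram_records is hypothetical, and it treats an n-gram as n consecutive whitespace-separated words.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical map step: return one "n-gram: line_number" record for every
# run of $n consecutive whitespace-separated words in $text.
sub n_gram_records {
    my ($n, $line_number, $text) = @_;
    my @words = split ' ', $text;
    my @records;
    for my $i (0 .. @words - $n) {
        push @records, join(' ', @words[$i .. $i + $n - 1]) . ": $line_number";
    }
    return @records;
}

# Demo on in-memory lines; a real run would read the large text from a file.
my @lines = ('the cat sat', 'the cat ran');
my $line_number = 0;
for my $text (@lines) {
    $line_number++;
    print "$_\n" for n_gram_records(2, $line_number, $text);
}
```

Sorting that output before the reduce step puts the two "the cat" records next to each other, which is what lets the reduce step hold only one n-gram's line numbers at a time.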

Replies are listed 'Best First'.
Re^6: statistics of a large text
by perl_lover_always (Acolyte) on Feb 10, 2011 at 11:13 UTC
    Since you are more expert in memory usage and related issues, I have a question. Why, when I store my 5 GB file, which has about 7 million records in two columns, and build two hashes from two different files of the same format and size, do I run out of memory even with a large amount of RAM (50 GB)?

      What does perl -V output on your system?


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        This is perl, v5.8.8 built for x86_64-linux-thread-multi
      Why, when I store my 5 GB file, which has about 7 million records in two columns, and build two hashes from two different files of the same format and size, do I run out of memory even with a large amount of RAM (50 GB)?

      Assuming that your OS and Perl allow you full access to the full 50GB, you should not be running out of memory.

      On a 64-bit system, a HoAs with 7 million keys and an average of 10 numbers per array requires ~3.5 GB. For two, reckon on 10 GB max.

      I'm not aware of any restrictions or limits on the memory a 64-bit Perl can address, which leaves your OS. Linux can apply per-process (and per-user?) limits to memory and CPU usage. I don't know what the commands are for discovering this information, but maybe that is somewhere you should be looking.
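
One way to look for such limits from inside Perl is sketched below. It assumes a Linux kernel that exposes /proc/self/limits (older kernels may not), and the sub name parse_limits is hypothetical. In a shell, ulimit -a reports the same kind of information.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse text in the layout of Linux's /proc/<pid>/limits into
# { limit name => { soft => ..., hard => ..., units => ... } }.
sub parse_limits {
    my ($text) = @_;
    my %limits;
    for my $line (split /\n/, $text) {
        # Columns are separated by runs of two or more spaces; units may be absent.
        next unless $line =~ /^(Max .+?)\s{2,}(\S+)\s+(\S+)(?:\s+(\S+))?\s*$/;
        $limits{$1} = { soft => $2, hard => $3, units => defined $4 ? $4 : '' };
    }
    return \%limits;
}

# Report the limits that most directly cap how much memory a process can use.
my $file = '/proc/self/limits';
if (open my $fh, '<', $file) {
    my $limits = parse_limits(do { local $/; <$fh> });
    for my $name ('Max address space', 'Max data size', 'Max resident set') {
        my $l = $limits->{$name} or next;
        print "$name: soft=$l->{soft} hard=$l->{hard} $l->{units}\n";
    }
}
else {
    print "Cannot read $file ($!); try 'ulimit -a' in the shell instead.\n";
}
```

If any of these shows a number rather than "unlimited", the process can hit that ceiling long before the machine's 50 GB is used up.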


        I have no idea, since I can access the whole memory (all 50 GB). Do you think it has something to do with my code?
