http://www.perlmonks.org?node_id=1030476


in reply to Frequency Analysis Of A Subset Of A File

This will print a pretty good approximation to a randomly distributed 10% of the lines in any file, regardless of its size:

C:\test>wc -l 986831-01.dat
268 986831-01.dat

C:\test>perl -nle" rand() < 0.1 and print" 986831-01.dat | wc -l
33

C:\test>perl -nle" rand() < 0.1 and print" 986831-01.dat | wc -l
26

C:\test>perl -nle" rand() < 0.1 and print" 986831-01.dat | wc -l
32

C:\test>perl -nle" rand() < 0.1 and print" 986831-01.dat | wc -l
24
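For anyone who prefers a script to a one-liner, here is a minimal sketch of the same per-line test. The sub name `sample_lines` and the in-memory demo data are illustrative assumptions, not part of the original post. Each line is kept independently with probability 0.1, so for a 268-line file the count hovers around 27 and varies from run to run, as the transcript shows.

```perl
use strict;
use warnings;

# Keep each line read from $fh independently with probability $p.
# Returns the kept lines (chomped) as a list.
# (sample_lines is a hypothetical helper name, used only for illustration.)
sub sample_lines {
    my ($fh, $p) = @_;
    my @kept;
    while (my $line = <$fh>) {
        chomp $line;
        push @kept, $line if rand() < $p;
    }
    return @kept;
}

# Demo on an in-memory "file" of 268 lines (made-up data, same line
# count as the file in the transcript).
my $data = '';
$data .= "line $_\n" for 1 .. 268;

open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my @sample = sample_lines($fh, 0.1);
close $fh;

printf "kept %d of 268 lines\n", scalar @sample;
```

Because the decision is made line by line, nothing needs to be known about the file size in advance and only one line is held in memory at a time.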

Once you have randomly selected X% of the lines in the file, you only need to randomly select X% of the characters (pairs/triples) in each of those lines to satisfy your overall goal.
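The second stage can be sketched the same way: within each selected line, keep each character tuple with the chosen probability and tally it. The sub name `sample_ngrams`, the sample sentence and the 10% rates are assumptions for illustration only. Note that with independent selection at both stages, the fraction of all tuples that ends up counted is roughly the product of the two rates.

```perl
use strict;
use warnings;

# Count a random fraction $p of the $n-character tuples in $line.
# Each starting position is kept independently with probability $p.
# Counts accumulate in the hash referenced by $counts.
# (sample_ngrams is a hypothetical helper name.)
sub sample_ngrams {
    my ($line, $n, $p, $counts) = @_;
    for my $pos (0 .. length($line) - $n) {
        next unless rand() < $p;
        $counts->{ substr($line, $pos, $n) }++;
    }
    return $counts;
}

# Demo on made-up data: stage 1 keeps about 10% of the lines,
# stage 2 keeps about 10% of the pairs in each kept line.
my %freq;
my @lines = ('the quick brown fox jumps over the lazy dog') x 1000;
for my $line (@lines) {
    next unless rand() < 0.1;
    sample_ngrams($line, 2, 0.1, \%freq);
}

# Print up to five of the most frequently sampled pairs.
my @top = sort { $freq{$b} <=> $freq{$a} || $a cmp $b } keys %freq;
splice @top, 5 if @top > 5;
printf "'%s' => %d\n", $_, $freq{$_} for @top;
```

Passing 1 or 3 as the second argument counts single characters or triples instead of pairs.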


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^2: Frequency Analysis Of A Subset Of A File
by Limbic~Region (Chancellor) on Apr 24, 2013 at 18:51 UTC
    BrowserUk,
    And if the file contains 0 newlines? Update: Or, you want newline characters to be included in your tuples. In this approach, each read can result in at most one newline.

    Cheers - L~R

      And if the file contains 0 newlines? Update: Or, you want newline characters to be included in your tuples.

      Then read fixed-size blocks instead of lines:

      C:\test>perl -e"BEGIN{$/= \1024}" -nle" rand() < 0.1 and print length( $_ )" 986831-01.dat
      1024
      1024
      1024
      1024
      1024

      C:\test>perl -e"BEGIN{$/= \1024}" -nle" rand() < 0.1 and print length( $_ )" 986831-01.dat
      1024
      1024
      1024

      C:\test>perl -e"BEGIN{$/= \1024}" -nle" rand() < 0.1 and print length( $_ )" 986831-01.dat
      1024
      1024
      1024
      1024
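The same record-mode trick can be written as a script. This is only a sketch: the sub name `sample_blocks` and the in-memory demo data are assumptions, not from the thread. Setting `$/` to a reference to an integer makes the readline operator return fixed-size records, so newlines stay in the data like any other character and a file with no newlines at all is handled the same way.

```perl
use strict;
use warnings;

# Read $fh in fixed-size records of $size bytes and keep each record
# independently with probability $p. Returns the kept records.
# (sample_blocks is a hypothetical helper name.)
sub sample_blocks {
    my ($fh, $size, $p) = @_;
    local $/ = \$size;    # reference to an integer selects record mode
    binmode $fh;          # treat the data as raw bytes
    my @kept;
    while (defined(my $block = <$fh>)) {
        push @kept, $block if rand() < $p;
    }
    return @kept;
}

# Demo on 5000 bytes of made-up in-memory data: four full 1024-byte
# records followed by one shorter final record.
my $data = "x\n" x 2500;
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my @blocks = sample_blocks($fh, 1024, 0.1);
close $fh;

print length($_), "\n" for @blocks;
```

One thing to keep in mind with this sketch: tuples counted within each kept record will not include any that straddle a record boundary.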
