Beefy Boxes and Bandwidth Generously Provided by pair Networks
Just another Perl shrine
 
PerlMonks  

Re: Frequency Analysis Of A Subset Of A File

by BrowserUk (Pope)
on Apr 24, 2013 at 18:34 UTC ( #1030476=note: print w/ replies, xml ) Need Help??


in reply to Frequency Analysis Of A Subset Of A File

This will print a pretty good approximation to a randomly distributed 10% of the lines in any file, regardless of its size:

C:\test>wc -l 986831-01.dat 268 986831-01.dat C:\test>perl -nle" rand() < 0.1 and print" 986831-01.dat | wc -l 33 C:\test>perl -nle" rand() < 0.1 and print" 986831-01.dat | wc -l 26 C:\test>perl -nle" rand() < 0.1 and print" 986831-01.dat | wc -l 32 C:\test>perl -nle" rand() < 0.1 and print" 986831-01.dat | wc -l 24

Once you have randomly selected X% of the lines in the file, you only need randomly select X% of the characters (pairs/triples) in each of those lines to satisfy your overall goal.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.


Comment on Re: Frequency Analysis Of A Subset Of A File
Download Code
Re^2: Frequency Analysis Of A Subset Of A File
by Limbic~Region (Chancellor) on Apr 24, 2013 at 18:51 UTC
    BrowserUk,
    And if the file contains 0 newlines? Update: Or, you want newline characters to be include in your tuples. In this approach, each read can result in at most, one newline.

    Cheers - L~R

      And if the file contains 0 newlines? Update: Or, you want newline characters to be include in your tuples.

      Then read fixed sized blocks instead of lines:

      C:\test>perl -e"BEGIN{$/= \1024}" -nle" rand() < 0.1 and print length( +)" 986831-01.dat 1024 1024 1024 1024 1024 C:\test>perl -e"BEGIN{$/= \1024}" -nle" rand() < 0.1 and print length( +)" 986831-01.dat 1024 1024 1024 C:\test>perl -e"BEGIN{$/= \1024}" -nle" rand() < 0.1 and print length( +)" 986831-01.dat 1024 1024 1024 1024

      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1030476]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others surveying the Monastery: (12)
As of 2014-09-16 19:32 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    How do you remember the number of days in each month?











    Results (45 votes), past polls