Re: Frequency Analysis Of A Subset Of A File

in reply to Frequency Analysis Of A Subset Of A File

This one-liner prints approximately 10% of the lines in a file, chosen at random, regardless of the file's size. Each line is kept independently with probability 0.1, so the exact count varies from run to run:

C:\test>wc -l 986831-01.dat
    268 986831-01.dat

C:\test>perl -nle" rand() < 0.1 and print" 986831-01.dat | wc -l
     33

C:\test>perl -nle" rand() < 0.1 and print" 986831-01.dat | wc -l
     26

C:\test>perl -nle" rand() < 0.1 and print" 986831-01.dat | wc -l
     32

C:\test>perl -nle" rand() < 0.1 and print" 986831-01.dat | wc -l
     24
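For anyone who prefers a script to a one-liner, here is a minimal sketch of the same idea. The sub name `sample_lines` and the 268 made-up lines are my own placeholders, not part of the original data; the demo reads from an in-memory filehandle so it runs without any input file.

```perl
use strict;
use warnings;

# Keep each line independently with probability $p.
# (sample_lines is a hypothetical helper name, not a library function.)
# Every line is a separate trial, so the number kept varies from run to
# run around $p * (number of lines).
sub sample_lines {
    my ($fh, $p) = @_;
    my @kept;
    while (my $line = <$fh>) {
        chomp $line;
        push @kept, $line if rand() < $p;
    }
    return @kept;
}

# Demo on 268 placeholder lines held in memory (not the original file).
my $data = join '', map { "line $_\n" } 1 .. 268;
for my $run (1 .. 4) {
    open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
    my @kept = sample_lines($fh, 0.1);
    close $fh;
    printf "run %d: kept %d of 268 lines\n", $run, scalar @kept;
}
```

The counts should hover around 27, much like the four runs above, but they will differ each time because nothing here fixes the total.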

Once you have randomly selected X% of the lines in the file, you need only randomly select X% of the characters (or pairs, or triples) in each of those lines to satisfy your overall goal.
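A rough sketch of that second stage, under my own assumptions: `sample_ngrams` and `tally` are hypothetical helper names, the input lines are placeholder text, and "pairs/triples" is taken to mean overlapping substrings of length 2 or 3, each start position kept with probability `$p`.

```perl
use strict;
use warnings;

# Keep each length-$n substring of $line independently with probability $p.
# $n = 1 gives single characters, 2 gives pairs, 3 gives triples.
# (Hypothetical helper; overlapping substrings are an assumption.)
sub sample_ngrams {
    my ($line, $n, $p) = @_;
    my $last = length($line) - $n;    # last valid start offset; range is empty if negative
    return map  { substr $line, $_, $n }
           grep { rand() < $p } 0 .. $last;
}

# Count how often each item occurs; returns a hash reference.
sub tally {
    my %count;
    $count{$_}++ for @_;
    return \%count;
}

# Placeholder input, standing in for the lines of a real file.
my @lines = ('the quick brown fox', 'jumps over the lazy dog') x 50;
my $p     = 0.1;

# Stage 1: sample lines. Stage 2: sample characters within the kept lines.
my @sampled_lines = grep { rand() < $p } @lines;
my $freq = tally(map { sample_ngrams($_, 1, $p) } @sampled_lines);

# Print the sampled frequencies, most common first.
for my $item (sort { $freq->{$b} <=> $freq->{$a} || $a cmp $b } keys %$freq) {
    printf "'%s' %d\n", $item, $freq->{$item};
}
```

Bear in mind that with both stages at 0.1, the expected share of all characters reaching the tally is 0.1 * 0.1, or about 1%, so the two fractions may need adjusting to hit whatever overall percentage is wanted.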

