Beefy Boxes and Bandwidth Generously Provided by pair Networks
more useful options
 
PerlMonks  

using bits to print part of a string

by Anonymous Monk
on Mar 15, 2013 at 15:45 UTC ( #1023727=perlquestion: print w/ replies, xml ) Need Help??
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear all,

I have a whole bunch (~66k) strings that are long (~400k characters). For all strings the same smallish number of columns need to be removed before the result is printed. So basically:

string1 = 0121012102??????????12121212???????? string2 = 0111011102??????????12111112???????? string3 = 0111011102??????????12111112???????? mask = 0001111111001111001111111 (etc.)

...and now only print out the characters in the strings that correspond with 1 in the mask, omitting the ones that have 0. I've tried doing this by splitting the strings and then taking array slices, but this is way too slow. I suspect there is some clever way if doing this with bitwise operators but I'm still confused how it works after reading the relevant PODs. Anyone have any suggestions?

Thanks!

Comment on using bits to print part of a string
Download Code
Re: using bits to print part of a string
by choroba (Abbot) on Mar 15, 2013 at 16:03 UTC
    I created the bit mask from the mask by replacing 0's by \x00 and 1's by \xff. Then, you can use bitwise & and can easily remove the resulting \x00's:
    #!/usr/bin/perl use warnings; use strict; my $string = join q(), map int rand 10, 1 .. 400_000; my $mask = join q(), map int rand 2, 1 .. 400_000; $mask =~ tr/01/\x00\xff/; my $result = $string & $mask; $result =~ tr/\x00//d; print $result;
    Updated: Using tr/ instead of s/ at line 10 speeds the method up.
    لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
      Yup, that is a lot faster. Brilliant, thanks!
Re: using bits to print part of a string
by BrowserUk (Pope) on Mar 15, 2013 at 16:09 UTC

    Using bitwise operators won't work because te masked bytes will be null; which are still characters.

    Faster that split would be to use unpack:

    #! perl -slw use strict; print unpack 'x3a7x2a4x2a7', $_ while <DATA>; __DATA__ 0121012102??????????12121212???????? 0111011102??????????12111112???????? 0111011102??????????12111112????????

    Produces:

    C:\test>junk55 1012102??????12121 1011102??????12111 1011102??????12111

    Possibly faster still would be to set up an array of substr refs into a single buffer:

    #! perl -slw use strict; my $buf = chr(0) x 400_000; my @refs = map { \substr $buf, $_->[0], $_->[1] } [3,7],[12,4],[18,7]; while( <DATA> ) { substr( $buf, 0 ) = $_; print map $$_, @refs; } __DATA__ 0121012102??????????12121212???????? 0111011102??????????12111112???????? 0111011102??????????12111112????????

    Produces:

    C:\test>junk55 1012102??????12121 1011102??????12111 1011102??????12111

    You'll have to benchmark to see if whether the latter which was once faster on some earlier version of perl still is on yours.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Thanks! So, could you unpack (har, har) the part where you do "x3a7x2a4x2a7", I am still not getting these templates.
        "x3a7x2a4x2a7"

        Skip 3 bytes; grab seven bytes, skip 2 bytes; ...


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
      Anyone willing to improve the benchmark?

      Update: Added regexwise.

      لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

        On one of my machines:

        1..4 ok 1 - array ok 2 - substr ok 3 - pack ok 4 - regex Rate arraywise regexwise packwise bitwise substr +wise arraywise 38.9/s -- -75% -86% -98% - +100% regexwise 153/s 294% -- -43% -93% - +100% packwise 270/s 593% 76% -- -87% - +100% bitwise 2077/s 5239% 1255% 671% -- +-99% substrwise 187563/s 481985% 122274% 69480% 8930% + --
        I guessed that my solution is as slow as I am... ;-)

        McA

        Anyone willing to improve the benchmark?

        They couldn't make it much worse :)

        Incorporating one-off set-ups -- or even a test for one-off setups -- into the benchmark subs is like incorporating the build-time of a car in its race time.

        If the substr code is meant to reflect my second option, you've completely misunderstood the purpose of the substr refs and assigning through a fixed scalar buffer.

        Its also traditional to post the results of a typical run.

        I may have a go at producing a more realistic benchmark later tonight. Key ingredients are that you must not exclude the IO, buffer and memory handling when benchmarking IO processing of large files. Yours excludes all of these.

        Hint: You cannot do IO filter bechmarks using the Benchmark module. The only realistic test is to time processing actual files that are big enough that they do not fit into the filecache. And you must ensure that the cache is flushed between runs.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        BTW: There is a problem with your substrwise test.

        The first time through, $mask is undefined, so you set up @mask and set smask.

        But on the second and subsequent times through, $mask is defined, so the non-state variable: @mask is left empty, so you don't do any actual work.

        That probably explains the surprising apparent efficiency of substrwise in the figures McA posted.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        I finally got around to benchmarking. (Removing the nulls, left by the bitwise&, in-place using tr is the saving grace!):

        #! perl -slw use strict; use Time::HiRes qw[ time ]; my @benches = ( sub { printf 'unpack: '; my $mask = shift; my $templ; while( $mask =~ /((.)\2*)/g ) { $templ .= (qw(x a))[$2] . length $1; } return sub { my $fh = shift; my $count = 0; my $out; $out = join'', unpack( $templ, $_ ), ++$count while <$fh>; $count; } }, sub { printf 'substr: '; my $mask = shift; my $templ; my @mask; while ( $mask =~ /0+/g ) { push @mask, [ $-[0], ( $+[0] - $-[0] ) ]; } return sub { my $fh = shift; my $count = 0; my $out; while( defined( $out = <$fh> ) ) { substr( $out, $mask[-$_][0], $mask[-$_][1],'' ) for 1 +.. @mask; ++$count; } $count; } }, sub { printf 'substrref: '; my $mask = shift; my $templ; my $buf = chr(0); $buf x= 400_000; my @refs; push @refs, \substr( $buf, $-[0], $+[0] - $-[0] ) wh +ile $mask =~ /0+/g; return sub { my $fh = shift; my $count = 0; my $out; while( <$fh> ) { substr( $buf, 0 ) = $_; $out = join'', map $$_, @refs; ++$count; } $count; } }, sub { printf "bitops: "; my $mask = shift; $mask =~ tr[01][\x00\xff]; return sub { my $fh = shift; my $count = 0; $_ &= $mask, tr[\x00][]d, ++$count while <$fh>; $count; } }, ); $|++; our $OPT //= 0; our $FLUSHFILE //= '10gb.csv'; our $TESTFILE //= '1023727.dat'; our $S //= 1; srand $S; my $mask = join '', map int( rand 2 ), 1 .. 400_000; open I, '<', $FLUSHFILE or die $!; 1 while <I>; close I; my $start = time; my $run = $benches[ $OPT ]->( $mask ); open I, '<', $TESTFILE or die $!; my $records = $run->( \*I ); close I; my $stop = time; printf "Took %f seconds for %u records (%f recs/second)\n", $stop - $start, $records, $records / ($stop - $start); __END__ C:\test>for /l %n in (0,1,3) do @1023727 -OPT=%n unpack: Took 164.702357 seconds for 2606 records (15.822482 recs/secon +d) substr: Took 2971.481218 seconds for 2606 records (0.877004 recs/secon +d) substrref: Took 154.501948 seconds for 2606 records (16.867101 recs/se +cond) bitops: Took 12.534998 seconds for 2606 records (207.897916 recs/secon +d)

        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
        div class=
Re: using bits to print part of a string
by thundergnat (Deacon) on Mar 15, 2013 at 16:15 UTC

    Construct an array of offsets and use substr to delete the unwanted parts.

    Update: ... as BrowserUK already suggested....

    use warnings; use strict; #use Data::Dumper; my $mask = '0001111111001111001111111'; my @mask; while ( $mask =~ /0+/g ) { push @mask, [ $-[0], ( $+[0] - $-[0] ) ]; } #print Dumper \@mask; #print $mask,"\n"; while (my $string = <DATA>){ #print $string; chomp $string; for (1..@mask){ my $replace = ''; #'*' x $mask[-$_][1]; #check to see substr $string, $mask[-$_][0], $mask[-$_][1], $replace; } print $string, "\n"; } __DATA__ 0123456789abcdefghijklmnopqrstuvwxyz 0121012102??????????12121212???????? 0111011102??????????12111112???????? 0111011102??????????12111112????????
    Output:
    3456789cdefijklmnopqrstuvwxyz 1012102??????12121212???????? 1011102??????12111112???????? 1011102??????12111112????????
Re: using bits to print part of a string
by McA (Curate) on Mar 15, 2013 at 16:34 UTC

    Hi all,

    I like choroba's solution. Therefor a ++. I just want to show my solution to prove the TMTOWTDI of perl. ;-)

    my @strings = ( 'Abcderd', 'dkdnnjn', 'ddjfdjk', 'c,oktnx', ); my $mask = '0100110'; my $regex = '^' . join('', map { $_ ? '(.)' : '.' } split //, $mask) . + '$'; foreach my $string (@strings) { my $result = join '', ($string =~ m/$regex/o); print $result, "\n"; }

    Best regards
    McA

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1023727]
Approved by marto
Front-paged by choroba
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others perusing the Monastery: (4)
As of 2014-09-23 21:08 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    How do you remember the number of days in each month?











    Results (241 votes), past polls