PerlMonks  

regex or split

by mce (Curate)
on Feb 06, 2003 at 13:21 UTC ( [id://233131]=perlquestion )

mce has asked for the wisdom of the Perl Monks concerning the following question:

Hi there,
I think I am having a total blackout, but I cannot think of an easier way to do this.

I know I am missing something, but what?

$str = "\nthis\nis\na string\nexample";
@data = map { $_ .= "\n" } split (/\n/, $data);
# equals to
@data = ( $data =~ /(.*?\n)/g );

The latter is slightly more performant (about 25%), but this seems so far-fetched, doesn't it?
So, I want to split $str into an array like the one you would get if you read these lines from a file.

---------------------------
Dr. Mark Ceulemans
Senior Consultant
IT Masters, Belgium

Replies are listed 'Best First'.
Re: regex or split
by pike (Monk) on Feb 06, 2003 at 13:45 UTC
    If I understand your question right, you don't want the \n chopped off the end of each line. The answer is: use positive look-behind assertions (see perlre). These have zero width, therefore what they match is not lost in the split:

    @data = split /(?<=\n)/, $data;
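
To make the effect of the zero-width look-behind concrete, here is a minimal sketch. It assumes the sample string from the question (four newlines, no newline after the last word); the variable names are only illustrative.

```perl
use strict;
use warnings;

# Sample input as in the question; the final line has no trailing newline.
my $data = "\nthis\nis\na string\nexample";

# Split at every position that is preceded by a newline. The match has
# zero width, so no characters are consumed and each "\n" stays attached
# to the line it terminates.
my @lines = split /(?<=\n)/, $data;

# Bracket each element to make the kept newlines visible.
print map { "[$_]" } @lines;
print "\n";
```

Joining the pieces back together should reproduce the input exactly, which is a quick way to check that nothing was dropped.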
    HTH,

    pike

      And the winner is....

      I knew that it should be easier than split+map.


      ---------------------------
      Dr. Mark Ceulemans
      Senior Consultant
      IT Masters, Belgium

Re: regex or split
by ihb (Deacon) on Feb 06, 2003 at 13:56 UTC

    The split pattern can be made zero-width, so that it does not remove the newline from the result:

    @data = split /(?<=\n)/, $data;

    The advantage of this is that you don't have to make an exception, or pre-munge the string, to handle a possibly missing trailing newline. (In your case you likely want to make sure there's no missing newline, but it's a useful technique to know anyway.) Using split() also makes it very clear that you're performing a split and not an extraction. Another advantage is that you may set a limit, which can be an optimization if it's ever wanted.

    An issue with the pattern is that the quantifiers have a max limit. AFAIK it differs between builds, but for very long lines this might be a problem. Perhaps not likely, but anyway. (perl -Mre=debug -wle '"a" =~ /a*/' will tell you your limit.)
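
The limit mentioned above is the optional third argument to split. A small sketch, assuming a made-up three-line string and illustrative variable names:

```perl
use strict;
use warnings;

# Hypothetical input, used only for illustration.
my $text = "first\nsecond\nthird\n";

# With a limit of 2, split produces at most two fields: the first line
# (newline kept, thanks to the zero-width pattern) and the unsplit rest.
my ($head, $rest) = split /(?<=\n)/, $text, 2;

print "head: [$head]";
print "rest: [$rest]";
```

This can save work when only the first line or first few lines are needed, since the remainder of the string is not broken up.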

    Hope I've helped,
    ihb

    Update: Bah, pike got there first. :)

Re: regex or split
by helgi (Hermit) on Feb 06, 2003 at 14:00 UTC
    I am unable to confirm your results. Why are you using both map and split in the first method? Why not a simple split which is much faster than either?

    use warnings;
    use strict;
    use Benchmark;

    my $data = "\nthis\nis\na string\nexample";
    my @data;
    my $count = 100000;

    timethese($count, {
        'map_split'    => sub { @data = map { $_ .= "\n" } split (/\n/, $data); },
        'simple_split' => sub { @data = split "\n", $data; },
        'regex'        => sub { @data = ( $data =~ /(.*?\n)/g ); },
    });
    __END__
    Benchmark: timing 100000 iterations of map_split, regex, simple_split...
    map_split: 2 wallclock secs ( 2.04 usr + 0.00 sys = 2.04 CPU) @ 48947.63/s (n=100000)
    regex: 2 wallclock secs ( 1.24 usr + 0.00 sys = 1.24 CPU) @ 80515.30/s (n=100000)
    simple_split: 1 wallclock secs ( 0.75 usr + 0.00 sys = 0.75 CPU) @ 132978.72/s (n=100000)

    There is a mistake in the posted code. Your initial string is called $str when you initialise it, but $data when you split it.

    Perhaps this typo influenced your results. If I run your code (with an empty $data variable) the regex indeed looks faster, but this is a completely spurious result.

    $data = '';

    Benchmark: timing 1000000 iterations of map_split, regex, simple_split...
    map_split: 0 wallclock secs ( 0.99 usr + 0.00 sys = 0.99 CPU) @ 1009081.74/s (n=1000000)
    regex: 1 wallclock secs ( 0.54 usr + 0.00 sys = 0.54 CPU) @ 1848428.84/s (n=1000000)
    simple_split: 0 wallclock secs ( 0.94 usr + 0.00 sys = 0.94 CPU) @ 1062699.26/s (n=1000000)

    Enabling warnings would have caught this mistake!

    --
    Regards,
    Helgi Briem
    helgi AT decode DOT is

      I apologise. I missed the part about wanting to keep the newlines with the array items.

      --
      Regards,
      Helgi Briem
      helgi AT decode DOT is

Re: regex or split
by dug (Chaplain) on Feb 06, 2003 at 15:30 UTC
    Hello,

    It should also be noted that it *really* depends upon the size of your data sets. Pike++'s regex solution is efficient for small sets like the one you used in the example, but with a string that has only a thousand newlines it quickly falls behind your split/map solution.

    If you are using Perl 5.8.0, it may be worth looking at PerlIO's scalar layer as well. For larger datasets it is very efficient to simply use ye ol' file slurp trick. Due to the overhead of the open() call that is necessary, this won't be the most efficient for smaller datasets.


    Here is a bit of code:
    #!/usr/bin/perl
    use warnings;
    use strict;
    $|++;

    use Benchmark qw( cmpthese );

    my $str;

    # short example string
    $str = "\nthis\nis\na string\nexample";

    # longer string ( 1000 lines )
    # my @chars = ( 'a' .. 'z', 'A' .. 'Z' );
    # for ( 1..1000 ) {
    #     $str .= $chars[ rand @chars ] for 0 .. rand @chars;
    #     $str .= "\n";
    # }

    cmpthese( 5000, {
        perl_io => sub {
            open( my $fh, "<:scalar", \$str) or die "$!\n";
            my @data = <$fh>;
        },
        split_map => sub {
            my @data = map { $_ .= "\n" } split (/\n/, $str);
        },
        regex_pike => sub {
            my @data = split /(?<=\n)/, $str;
        },
    } );

    For the shorter strings, here are the results:
                  Rate    perl_io  split_map regex_pike
    perl_io    14085/s         --       -46%       -57%
    split_map  25907/s        84%         --       -20%
    regex_pike 32468/s       131%        25%         --

    and for the longer strings:
                 Rate regex_pike  split_map    perl_io
    regex_pike 79.4/s         --       -40%       -56%
    split_map   131/s        65%         --       -27%
    perl_io     181/s       128%        38%         --
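
For readers who want the in-memory filehandle approach on its own, outside a benchmark, here is a minimal sketch. It uses the plain '<' mode with a reference to a scalar, which perlfunc documents for in-memory files, rather than the explicit "<:scalar" spelling used above; the sample string and variable names are assumptions for illustration.

```perl
use strict;
use warnings;

# Sample input as in the question; the last line has no trailing newline.
my $str = "\nthis\nis\na string\nexample";

# Opening a reference to a scalar yields a read handle over its contents.
open my $fh, '<', \$str or die "Cannot open in-memory handle: $!";

# In list context, readline returns one element per line, newlines kept.
my @data = <$fh>;
close $fh;

print scalar(@data), " lines read\n";
```

Because this goes through the ordinary line-reading machinery, the result has the same shape as reading the lines from a real file, including a final element without a newline when the input lacks one.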

      -- dug
Re: regex or split
by bronto (Priest) on Feb 06, 2003 at 14:14 UTC

    Your second example doesn't work as you expect: it misses the word "example", which you would have had if you were reading from a file:

    $str = "\nthis\nis\na string\nexample";
    @data = ( $str =~ /(.*?\n)/g );
    print map "[$_]", @data;

    prints:

    [
    ][this
    ][is
    ][a string
    ]
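
One possible way to keep the regex approach while still catching a final line that lacks a newline is to add a second alternative to the pattern. This is only a sketch and not something proposed in the thread; the pattern and variable names are assumptions.

```perl
use strict;
use warnings;

my $str = "\nthis\nis\na string\nexample";

# First alternative: a (possibly empty) run of non-newline characters
# followed by a newline. Second alternative: a non-empty run of
# non-newline characters, which can only succeed on an unterminated
# last line. "." does not match "\n" because /s is not used.
my @pieces = ( $str =~ /(.*\n|.+)/g );

print map { "[$_]" } @pieces;
print "\n";
```

With this pattern the word "example" ends up as the fifth element, and a string that does end in a newline produces no extra empty element.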

    Besides, you are using $str at the beginning and $data elsewhere. You weren't using strict, but we had you :-)

    Ciao!
    --bronto


    The very nature of Perl to be like natural language--inconsistant and full of dwim and special cases--makes it impossible to know it all without simply memorizing the documentation (which is not complete or totally correct anyway).
    --John M. Dlugosz
Re: regex or split
by steves (Curate) on Feb 06, 2003 at 13:40 UTC

    Not so far fetched I think. I had a piece of code that was heavily used that was splitting on delimiters, very similar to what you show. I profiled the code with split and with the regexp and the regexp easily won.

Re: regex or split
by mce (Curate) on Feb 06, 2003 at 13:26 UTC
    Oops, it should read
    The former is slightly more performant
    -------------------------
    Dr. Mark Ceulemans
    Senior Consultant
    IT Masters, Belgium
