regex or split

mce has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: regex or split by pike (Monk) on Feb 06, 2003 at 13:45 UTC
If I understand your question correctly, you don't want the \n chopped off the end of each line. The answer is to use a positive look-behind assertion (see perlre). Look-behinds are zero-width, so what they match is not consumed by the split: `@data = split /(?<=\n)/, $data;` HTH, pike
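A minimal sketch of the look-behind split in action. The sample string is made up for illustration, and its last line deliberately lacks a trailing newline:

```perl
use strict;
use warnings;

# Hypothetical sample data; the last line has no trailing newline.
my $data = "this\nis\na string\nexample";

# Split at every position that is preceded by a newline.
# The look-behind is zero-width, so each newline stays on its own line.
my @data = split /(?<=\n)/, $data;

# Bracket each element so the kept newlines are visible.
print map { "[$_]" } @data;
print "\n";
```

Joining `@data` back together with an empty separator should give the original string again, since nothing was thrown away.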
Re: Re: regex or split by mce (Curate) on Feb 06, 2003 at 13:51 UTC
And the winner is.... I knew that it should be easier than split+map. --------------------------- Dr. Mark Ceulemans Senior Consultant IT Masters, Belgium
Re: regex or split by ihb (Deacon) on Feb 06, 2003 at 13:56 UTC
The split pattern can be made zero-width, so that the newline is not removed from the result: `@data = split /(?<=\n)/, $data;` The advantage is that you don't have to make an exception, or premunge the string, to handle a possibly missing trailing newline. (In your case you likely want to make sure there's no missing newline, but it's a useful technique to know anyway.) Using `split()` also makes it very clear that you're performing a split and not an extraction. Another advantage is that you may set a limit, which can be an optimization if it's ever wanted. An issue with the pattern is that the quantifiers have a maximum limit. AFAIK it differs between builds, but for very long lines this might be a problem. Perhaps not likely, but anyway. (`perl -Mre=debug -wle '"a" =~ /a/'` will tell you your limit.) Hope I've helped, `ihb` Update: Bah, pike got there first. `:)`
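To illustrate the limit mentioned above, here is a small sketch using the third argument of `split`. The sample data and the variable names `$head` and `$rest` are assumptions made for the example:

```perl
use strict;
use warnings;

# Hypothetical sample data, this time with a trailing newline.
my $data = "first\nsecond\nthird\n";

# A LIMIT of 2 stops after the first split point:
# the first line comes back alone, the remainder stays in one piece.
my ( $head, $rest ) = split /(?<=\n)/, $data, 2;

print "head: [$head]\n";
print "rest: [$rest]\n";
```

This can help when only the first line or two are needed, because the rest of the string is never broken up.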
Re: regex or split by helgi (Hermit) on Feb 06, 2003 at 14:00 UTC
I am unable to confirm your results. Why are you using both map and split in the first method? Why not a simple split, which is much faster than either?

    use warnings;
    use strict;
    use Benchmark;

    my $data="\nthis\nis\na string\nexample";
    my @data;
    my $count = 100000;
    timethese($count, {
        'map_split'    => sub { @data=map { $_.="\n" } split (/\n/, $data); },
        'simple_split' => sub { @data = split "\n",$data; },
        'regex'        => sub { @data= ( $data =~ /(.*?\n)/g ); },
    } );
    __END__
    Benchmark: timing 100000 iterations of map_split, regex, simple_split...
    map_split: 2 wallclock secs ( 2.04 usr + 0.00 sys = 2.04 CPU) @ 48947.63/s (n=100000)
    regex: 2 wallclock secs ( 1.24 usr + 0.00 sys = 1.24 CPU) @ 80515.30/s (n=100000)
    simple_split: 1 wallclock secs ( 0.75 usr + 0.00 sys = 0.75 CPU) @ 132978.72/s (n=100000)

There is a mistake in the posted code. Your initial string is called $str when you initialise it, but $data when you split it. Perhaps this typo influenced your results. If I run your code (with an empty $data variable) the regex indeed looks faster, but this is a completely spurious result.

    $data = '';
    Benchmark: timing 1000000 iterations of map_split, regex, simple_split...
    map_split: 0 wallclock secs ( 0.99 usr + 0.00 sys = 0.99 CPU) @ 1009081.74/s (n=1000000)
    regex: 1 wallclock secs ( 0.54 usr + 0.00 sys = 0.54 CPU) @ 1848428.84/s (n=1000000)
    simple_split: 0 wallclock secs ( 0.94 usr + 0.00 sys = 0.94 CPU) @ 1062699.26/s (n=1000000)

Enabling warnings would have caught this mistake! -- Regards, Helgi Briem helgi AT decode DOT is
Re: Re: regex or split by helgi (Hermit) on Feb 06, 2003 at 14:03 UTC
I apologise. I missed the part about wanting to keep the newlines with the array items. -- Regards, Helgi Briem helgi AT decode DOT is
Re: regex or split by dug (Chaplain) on Feb 06, 2003 at 15:30 UTC
Hello,

It should also be noted that it really depends on the size of your data sets. Pike++'s regex solution is efficient for small sets like the one you used in the example, but with a string that has only a thousand newlines it quickly falls behind your split/map solution. If you are using Perl 5.8.0, it may be worth looking at PerlIO's scalar layer as well. For larger datasets it is very efficient to simply use ye ol' file slurp trick. Due to the overhead of the necessary `open()` call, this won't be the most efficient for smaller datasets. Here is a bit of code:

    #!/usr/bin/perl
    use warnings;
    use strict;
    $|++;
    use Benchmark qw( cmpthese );

    my $str;
    # short example string
    $str="\nthis\nis\na string\nexample";

    # longer string ( 1000 lines )
    # my @chars = ( 'a' .. 'z', 'A' .. 'Z' );
    # for ( 1..1000 ) {
    #     $str .= $chars[ rand @chars ] for 0 .. rand @chars;
    #     $str .= "\n";
    # }

    cmpthese( 5000, {
        perl_io    => sub { open( my $fh, "<:scalar", \$str) or die "$!\n"; my @data = <$fh>; },
        split_map  => sub { my @data=map { $_.="\n" } split (/\n/, $str); },
        regex_pike => sub { my @data = split /(?<=\n)/, $str; },
    } );

For the shorter strings, here are the results:

                  Rate    perl_io  split_map regex_pike
    perl_io    14085/s         --       -46%       -57%
    split_map  25907/s        84%         --       -20%
    regex_pike 32468/s       131%        25%         --

and for the longer strings:

                 Rate regex_pike  split_map    perl_io
    regex_pike 79.4/s         --       -40%       -56%
    split_map   131/s        65%         --       -27%
    perl_io     181/s       128%        38%         --

-- dug
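Before comparing speeds it may be worth checking that the three approaches return the same list. The following is only a sketch under a few assumptions: the sample string is invented, every line in it ends with a newline, and the in-memory handle is opened with a plain `'<'` mode on a scalar reference (available from Perl 5.8.0) rather than the `"<:scalar"` spelling used above:

```perl
use strict;
use warnings;

# Hypothetical sample data; every line ends in a newline.
my $str = "this\nis\na string\nexample\n";

# In-memory filehandle on the scalar (needs Perl 5.8.0 or later).
open( my $fh, '<', \$str ) or die "Cannot open in-memory handle: $!\n";
my @perl_io = <$fh>;
close $fh;

# split/map: drop the newlines, then glue one back onto every piece.
my @split_map = map { "$_\n" } split /\n/, $str;

# Look-behind split: the newlines are never removed.
my @lookbehind = split /(?<=\n)/, $str;

# Each result should have the same line count and rebuild the input.
for my $result ( \@perl_io, \@split_map, \@lookbehind ) {
    print scalar(@$result), " lines, round-trips: ",
        ( join( '', @$result ) eq $str ? "yes" : "no" ), "\n";
}
```

If the final newline is removed from the sample, the split/map variant no longer round-trips, because it appends a newline to the last line that the input never had.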
Re: regex or split by bronto (Priest) on Feb 06, 2003 at 14:14 UTC
Your second example doesn't work as you expect: it misses the word "example", which you would have had if you were reading from a file:

    $str="\nthis\nis\na string\nexample";
    @data= ( $str =~ /(.*?\n)/g );
    print map "[$_]",@data ;

prints:

    [
    ][this
    ][is
    ][a string
    ]

Besides, you are using `$str` at the beginning and `$data` elsewhere. You weren't using `strict`, but we had you `:-)` Ciao! `--bronto`

The very nature of Perl to be like natural language--inconsistant and full of dwim and special cases--makes it impossible to know it all without simply memorizing the documentation (which is not complete or totally correct anyway). --John M. Dlugosz
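If the regex approach is kept, one possible way to stop it from dropping an unterminated last line is to add a second alternative to the pattern. This is a sketch, not the original poster's code, and the pattern `/(.*\n|.+)/g` is an assumption introduced here:

```perl
use strict;
use warnings;

# Same shape as the sample string in this thread: no trailing newline.
my $str = "\nthis\nis\na string\nexample";

# First alternative: a line including its newline.
# Second alternative: a final, non-empty line with no newline.
my @data = ( $str =~ /(.*\n|.+)/g );

# Bracket each element so the kept newlines are visible.
print map { "[$_]" } @data;
print "\n";
```

Neither alternative can match the empty string, so the global match cannot produce a stray empty element at the end.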
Re: regex or split by steves (Curate) on Feb 06, 2003 at 13:40 UTC
Not so far-fetched, I think. I had a heavily used piece of code that was splitting on delimiters, very similar to what you show. I profiled the code with split and with the regexp, and the regexp easily won.
Re: regex or split by mce (Curate) on Feb 06, 2003 at 13:26 UTC
Oops, it should read: The former is slightly more performant. ------------------------- Dr. Mark Ceulemans Senior Consultant IT Masters, Belgium

