Re: Re (tilly) 1: Supersplit

by jeroenes (Priest)
on Dec 31, 2000 at 20:11 UTC


in reply to Re (tilly) 1: Supersplit
in thread Supersplit

Well, tilly, thanx a lot for the thorough critique. And also for the last remark, I needed that ;-). I have taken your code into the script, but merged it with some of my code where appropriate.

Of course, I have some counter-remarks, here we go.

1. Why reverse the order of the arguments? join and split both start with the separator, and then the input. So I changed the order back to strings, input.

2. I don't like to pre-compile the regexes, because then the split couldn't cope with changing delimiters, as in a text file (see the SYNOPSIS), or with sprintf'ed data. So I changed that back. Furthermore, I couldn't find any reference to qr// in the manpages. Could you please explain?

3. On tie's comments: multi-dimensional arrays are a perl 5 feature, so I should check for that anyway. I'm out of time now, so that's for the next version of supersplit.

4. I really like the recursive approach.

5. I don't see the need for a separate IO version, so I changed that back, too. I just try to treat the string as a filehandle, or try to open it as a file (new feature). I didn't manage to get supersplit( INPUT ), with INPUT as a filehandle, to work. That's peculiar, because the manpage tells me that <$fh>, with $fh='INPUT', should work.

6. You are totally right on the matter of the inner/ outer naming convention.

7. And ++ for the join( $_, @_) stuff. I never would have dared to use it. But of course $_ and @_ have different namespaces...

8. I removed the BEGIN blocks. Is this something for the manpages (perldoc perlmod)?

Finally, I tested the code with 2D-arrays. It works. I'm leaving home for the remainder of this year, so we'll continue next year.

Happy new year everyone, best wishes, and thanx for the comments!

Jeroen

The new code, with POD, is here:

package SuperSplit;
use strict;

=head1 NAME

SuperSplit - Provides methods to split/join in two dimensions

=head1 SYNOPSIS

 use SuperSplit;

 #first example: split on newlines and whitespace and print
 #the same data joined on tabs and newlines. The split works on STDIN
 #
 print superjoin( supersplit() );

 #second: split a table in a text file, and join it to HTML
 #
 my $array2D = supersplit( \*INPUT ); #filehandle must be open
 my $htmltable = superjoin( '</TD><TD>', "</TD></TR>\n <TR><TD>", $array2D );
 $htmltable = "<TABLE>\n <TR><TD>" . $htmltable . "</TD></TR>\n</TABLE>";
 print $htmltable;

 #third: perl allows you to have varying number of columns in a row,
 # so don't stop with simple tables. To split a piece of text into
 # paragraphs, then words, try this:
 #
 undef $/;
 $_ = <>;
 tr/.!();:?/ /; #remove punctuation
 my $array = supersplit( '\s+', '\n\s*\n', $_ );
 # now you can do something nifty as counting the number of words in each
 # paragraph
 my @numwords = (); my $i=0;
 for my $rowref (@$array) {
     push( @numwords, scalar(@$rowref) ); #2D-array: array of refs!
     print "Found $numwords[$i] \twords in paragraph \t$i\n";
     $i++;
 }

=head1 DESCRIPTION

Supersplit is just a consequence of the possibility to use
multi-dimensional arrays in perl. Because that is possible, one also
wants a way to conveniently split data into a nD-array (at least I want
to). And vice versa, of course. Supersplit/join just do that.

Because I intend to use these methods in numerous one-liners and in my
collection of handy filters, an object interface is more often than not
cumbersome. So, this module exports two methods, but it's also all it
has. If you think modules shouldn't do that, period, use the object
interface, SuperSplit::Obj. TIMTOWTDI

=over 4

=item supersplit($colseparator,$rowseparator, (...,) $filehandleref || $string);

The first method, supersplit, returns a nD-array. To do that, it needs
data and the strings to split with. Data may be provided as a reference
to a filehandle, or as a string. If you want to use a string for the
data, you MUST provide the strings to split with (3 argument mode). If
you don't provide data, supersplit works on STDIN. If you provide a
filehandle (like \*INPUT) or filename, supersplit doesn't need the
splitting strings, and assumes columns are separated by whitespace, and
rows are separated by newlines. Strings are passed directly to split.
If you provide more strings, they will split the higher dimensions.

Supersplit returns a multi-dimensional array or undef if an error
occurred.

=item superjoin( $colseparator, $rowseparator, $array2D );

The second and last method, superjoin, takes a nD-array and returns it
as a string. The default behavior assumes a 2D-array. In the string,
columns (adjacent cells) are separated by the first argument provided.
Rows (normally lines) are separated by the second argument.
Alternatively, you may give the 2D-array as the only argument. In that
case, superjoin joins columns with a tab ("\t"), and rows with a
newline ("\n"). If you have more dimensions in your array, all
separators for all dimensions should be provided.

Superjoin returns an undef if an error occurred, for example if you
give a ref to a hash. If your first dimension points to hashes or
strings, superjoin will return undef. Mixed arrays will break the code.

=back

=head1 AUTHOR

Jeroen Elassaiss-Schaap, with great help from tilly, who rewrote most
of the code for version 0.03.

=head1 LICENSE

Perl/Artistic license

=head1 STATUS

Alpha

=cut

use Exporter;
use vars qw( @EXPORT @ISA $VERSION );
$VERSION = 0.03;    # a scalar, so that Exporter and "use SuperSplit 0.03" can see it
@ISA = qw( Exporter );
@EXPORT = qw( &supersplit &superjoin );

sub supersplit{
    my $text = _text( pop );
    $_[0] || ( $_[0] = '\s+' );
    $_[1] || ( $_[1] = '\n' );
    _split($text, @_);
}

sub _text{
    my $fh = shift;
    unless (defined($fh)) {
        $fh = \*STDIN;
    }
    if (open INPUT, "<$fh" ) {
        $fh = join '', <INPUT>;
        close INPUT;
    }
    no strict 'refs';
    (join '', <$fh>) || $fh;
}

sub _split {
    my $text = shift;
    my $re = pop;
    my @res = split($re, $text); # Consider the third arg?
    if (@_) {
        @res = map { _split( $_, @_) } @res;
    }
    \@res;
}

sub superjoin{
    my $array_ref = pop;
    push ( @_, "\t") if @_ < 1;
    push ( @_, "\n") if @_ < 2;
    return undef unless( ref( $array_ref ) eq 'ARRAY' );
    return undef unless( ref( $array_ref->[0] ) =~ /ARRAY/ );
    _join( @_, $array_ref);
}

sub _join{
    my $array_ref = pop;
    my $str = pop;
    if (@_) {
        @$array_ref = map {_join( @_, $_)} @$array_ref;
    }
    join $str, @$array_ref;
}
1;

I was dreaming of guitarnotes that would irritate an executive kind of guy (FZ)


Re (tilly) 3 (comments): Supersplit
by tilly (Archbishop) on Jan 01, 2001 at 00:50 UTC
    Here is explanation for my feedback:
    1. Why reverse the order of the arguments? join and split both start with the separator, and then the input. So I changed the order back to strings, input.

      Because if you have positionally determined arguments with one list being variable length, it is usual to have the variable list at the end of your argument list. In this case when I made it recursive (and therefore capable of handling n-dim arrays) you had a variable length list of things to join and split. So I moved those arguments to the end.
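To illustrate the point about putting the variable-length list last, here is a minimal sketch. It is not tilly's actual rewrite: the name nd_split and the outermost-separator-first order are assumptions made for the example. With the data first and the separators trailing, the recursion can simply peel one separator off per level:

```perl
use strict;
use warnings;

# Hypothetical helper: fixed argument (the text) first, then any number
# of separator patterns, outermost dimension first.
sub nd_split {
    my ($text, @seps) = @_;
    my $sep   = shift @seps;
    my @parts = split /$sep/, $text;

    # If separators remain, each part is split again one level deeper.
    @parts = map { nd_split($_, @seps) } @parts if @seps;
    return \@parts;
}

my $table = nd_split("a b\nc d e", '\n', '\s+');
print scalar(@$table), " rows; second row has ",
      scalar(@{ $table->[1] }), " cells\n";
```

Adding a third dimension then only means passing one more separator; nothing about the fixed argument changes.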

    2. I don't like to pre-compile the regexes, because then the split couldn't cope with changing delimiters, as in a text file (see the SYNOPSIS), or with sprintf'ed data. So I changed that back. Furthermore, I couldn't find any reference to qr// in the manpages. Could you please explain?

      Pre-compiling the regexes should not pose a problem for having patterns that handle multiple delimiters. Could you try it and report back? As for documents, the docs on this server are 5.003 specific. (Same as Camel 2.) Most people are on 5.005 or 5.6. On those machines you can find out about the feature from the local documentation using the perldoc utility. In fact in this case: perldoc -f qr directs you to perlop/"Regexp Quote-Like Operators". So try perldoc perlop and then type /Quote-Like to get to the relevant section. Then /qr and hit 'n' until you get to the right spot. (The same search/paging tricks work with utilities like man and less on *nix systems.)
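A small sketch of why pre-compiling does not prevent changing delimiters: qr// (available from perl 5.005) is evaluated at run time, so a new pattern can be compiled whenever the delimiter changes. The sample data and variable names below are made up for illustration:

```perl
use strict;
use warnings;

# Delimiters that are only known at run time, each with a sample line.
my %line_for = ( ',' => "a,b,c", '|' => "x|y|z" );

my @result;
for my $delim (sort keys %line_for) {
    # Compile once per delimiter; \Q...\E quotes regex metacharacters
    # such as '|'.
    my $re = qr/\Q$delim\E/;
    push @result, [ split $re, $line_for{$delim} ];
}

print join(' ', @$_), "\n" for @result;
```

split accepts the compiled object wherever it would accept a pattern, so a routine can take either a plain string or a qr// value from its caller.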

    3. ...multi-dimensional arrays are a perl 5 feature, so I should check for that anyway. I'm out of time now, so that's for the next version of supersplit.

      I already did that with the recursion. :-)

    4. I really like the recursive approach.

      So did I. :-)

    5. I don't see the need for a separate IO version, so I changed that back, too. I just try to treat the string as a filehandle, or try to open it as a file (new feature). I didn't manage to get supersplit( INPUT ), with INPUT as a filehandle, to work. That's peculiar, because the manpage tells me that <$fh>, with $fh='INPUT', should work.

      The need is due to your having overloaded the input too much. For instance if someone tried to use your current version of supersplit() on an uploaded file from CGI they would fail miserably. I also really don't like trying an open and silently failing.

      Additionally it is generally a bad idea to limit how your caller can pass information. What if I really want to pass you data from a socket? Or from IO::Scalar? Or from a string I have already pre-processed? Having two functions, one of which is a wrapper around the other, for that situation leaves you with a consistent interface and more flexibility.
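A minimal sketch of the two-function layout described above, with one routine that works on strings and a thin wrapper that accepts any readable handle. The names split_string and split_handle are hypothetical, not the module's API, and the in-memory filehandle used for the demonstration needs perl 5.8 or later:

```perl
use strict;
use warnings;

# Core routine: works on a string only.
sub split_string {
    my ($text, $row_sep, $col_sep) = @_;
    return [ map { [ split /$col_sep/ ] } split /$row_sep/, $text ];
}

# Wrapper: slurps from any readable handle (file, socket, in-memory
# handle, ...) and delegates to the string version.
sub split_handle {
    my ($fh, @seps) = @_;
    my $text = do { local $/; <$fh> };
    return split_string($text, @seps);
}

my $data = "1 2\n3 4\n";
open my $fh, '<', \$data or die "cannot open in-memory handle: $!";
my $rows = split_handle($fh, '\n', '\s+');
print "second row starts with $rows->[1][0]\n";
```

A caller with a pre-processed string calls split_string directly, and nothing has to guess whether an argument is data, a filename, or a handle.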

      As for your comment on what you are surprised is failing, I would not expect that to work. Which manpage led you to expect that it would?

    6. You are totally right on the matter of the inner/outer naming convention.

      Get bitten often enough and you become sensitive to potential confusions in names. :-)

      The real issue here is the same one which makes it hard for programmers to find their own bugs. You need to step out of your own pre-conceptions of how you are supposed to be working and thinking and see the problem from what another person's PoV is likely to be. This is frequently much easier for another person to do...

    7. And ++ for the join( $_, @_) stuff. I never would have dared to use it. But of course $_ and @_ have different namespaces...

      :-)

    8. I removed the BEGIN blocks. Is this something for the manpages (perldoc perlmod)?

      Well it is something that I know because I looked in some detail at Carp and Exporter a while ago. While the principles of what happens when are documented, I don't think that the conclusion is stated anywhere. I certainly had to learn it by reading and thinking through the code.

      I have to ponder this for a while, and my time for perl is limited at the moment. But for the time being I would like to go into more detail about the filehandle stuff. I read CGI, and I understand the problem, because it returns a variable containing both name and filehandle. I would guess that a check for a reference should avoid this problem.

      My assumption about the workings of $fh='INPUT'; join '',<$fh>; are based on perldoc:perlop, saying 'If the string inside the angle brackets is a reference to a scalar variable (e.g., <$foo>), then that variable contains the name of the filehandle to input from, or a reference to the same.'.
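One plausible explanation for the failure, offered as an assumption rather than a diagnosis of the posted module: an unqualified filehandle name used as a symbolic reference is looked up in the package of the code doing the read, so 'INPUT' inside a module is not the caller's main::INPUT. The package Reader below is a hypothetical stand-in, and the in-memory handle needs perl 5.8 or later:

```perl
use strict;
use warnings;

my $data = "first\nsecond\n";
open(INPUT, '<', \$data) or die "open failed: $!";   # handle lives in main::

package Reader;   # hypothetical stand-in for a module such as SuperSplit

sub by_name {
    my $name = shift;
    no strict 'refs';
    no warnings;             # reading an unopened handle would warn
    return scalar <$name>;   # an unqualified name is resolved in Reader::
}

sub by_ref {
    my $fh = shift;
    return scalar <$fh>;     # a glob reference carries its own package
}

package main;

my $by_bare      = Reader::by_name('INPUT');         # looks for Reader::INPUT
my $by_qualified = Reader::by_name('main::INPUT');
my $by_globref   = Reader::by_ref(\*INPUT);
close(INPUT);

print defined $by_bare ? "bare name read a line\n" : "bare name found no open handle\n";
print "qualified name read: $by_qualified";
print "glob reference read: $by_globref";
```

Passing \*INPUT (or a fully qualified name) sidesteps the lookup, which matches the module's documented \*INPUT usage.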

      I understand the argument about the pipe and stuff, and I think an open .. "$fh" || open .. "<$fh" should do the trick here. I don't have IO::Scalar here, I'll ponder that later.

      You worry about trying an open and silently failing. Well, I tried it, and the code simply returns the name as the first element if the open failed (and there is nothing to split on). That leaves room for a check.

      Furthermore, the code returns a prepared string as well, if reading as a filehandle fails. I quickly glanced at IO::Scalar, and I think the ref check should allow that as well. I think the following code should work.

      sub _text{
          my $fh = shift;
          unless (defined($fh)) {
              $fh = \*STDIN;
          }
          if ( (! ref $fh) && ((open INPUT, "$fh") || open INPUT, "<$fh" )) {
              $fh = join '', <INPUT>;
              close INPUT;
          }
          no strict 'refs';
          (join '', <$fh>) || $fh;
      }

      Bye,

      Jeroen
      I was dreaming of guitarnotes that would irritate an executive kind of guy (FZ)

      After some CB discussion, tilly convinced me why the open was not such good magic. So I devoted a separate routine to that. The filehandle still is in the normal supersplit routine.

      I personally vote for the split order of arguments, because people are more familiar with that. I think the variable length list can be coped with by the pop's. If that's unusual, so be it. I know that this order is mostly not a good idea.

      Moreover, you can provide a limit option now. Because there is some magic involved, I added a second function that doesn't try to find a limit parameter at the end.

      The code has been posted more than enough now, so I hooked it up to my homenode. POD has been updated. Code has been tested.

      Cheers,

      Jeroen
      I was dreaming of guitarnotes that would irritate an executive kind of guy (FZ)
