package SuperSplit;
use strict;

=head1 NAME

SuperSplit - Provides methods to split/join in two dimensions

=head1 SYNOPSIS

  use SuperSplit;

  # first example: split on newlines and whitespace and print
  # the same data joined on tabs and newlines. The split works on STDIN
  #
  print superjoin( supersplit() );

  # second: split a table in a text file, and join it to HTML
  #
  my $array2D   = supersplit( \*INPUT );   # filehandle must be open
  my $htmltable = superjoin( '</TD><TD>', "</TD></TR>\n <TR><TD>",
                             $array2D );
  $htmltable = "<TABLE>\n <TR><TD>" . $htmltable . "</TD></TR>\n</TABLE>";
  print $htmltable;

  # third: perl allows you to have a varying number of columns in a row,
  # so don't stop with simple tables. To split a piece of text into
  # paragraphs, then words, try this:
  #
  undef $/;
  $_ = <>;
  tr/.!();:?/ /;    # remove punctuation
  my $array = supersplit( '\s+', '\n\s*\n', $_ );
  # now you can do something nifty such as counting the number of words
  # in each paragraph
  my @numwords = (); my $i = 0;
  for my $rowref (@$array) {
      push( @numwords, scalar(@$rowref) );   # 2D-array: array of refs!
      print "Found $numwords[$i] \twords in paragraph \t$i\n";
      $i++;
  }

=head1 DESCRIPTION

Supersplit is just a consequence of the possibility to use 2D arrays in
perl. Because this is possible, one also wants a way to conveniently split
data into a 2D-array (at least I want to). And vice versa, of course.
Supersplit/join do just that.

Because I intend to use these methods in numerous one-liners and in my
collection of handy filters, an object interface is more often than not
cumbersome. So this module exports two methods, but that is also all it has.
If you think modules shouldn't do that, period, use the object interface,
SuperSplit::Obj. TIMTOWTDI

=over 4

=item supersplit($colseparator,$rowseparator,$filehandleref || $string);

The first method, supersplit, returns a 2D-array. To do that, it needs data
and the strings to split with. Data may be provided as a reference to a
filehandle, or as a string. If you want to use a string for the data, you
MUST provide the strings to split with (3 argument mode). If you don't
provide data, supersplit works on STDIN. If you provide a filehandle (a ref
to it, anyway), supersplit doesn't need the splitting strings, and assumes
columns are separated by whitespace and rows are separated by newlines.
Strings are passed directly to split.

Supersplit returns a 2D-array or undef if an error occurred.

=item superjoin( $colseparator, $rowseparator, $array2D );

The second and last method, superjoin, takes a 2D-array and returns it as a
string. In the string, columns (adjacent cells) are separated by the first
argument provided. Rows (normally lines) are separated by the second
argument. Alternatively, you may give the 2D-array as the only argument.
In that case, superjoin joins columns with a tab ("\t"), and rows with a
newline ("\n").

Superjoin returns undef if an error occurred, for example if you give a
ref to a hash. If your first dimension points to hashes, the interpreter
will give an error (use strict).

=back

=head1 AUTHOR

J. Elassaiss-Schaap

=head1 LICENSE

Perl / Artistic License

=head1 STATUS

Alpha

=cut

BEGIN{
    use Exporter;
    use vars qw( @EXPORT @ISA @VERSION);
    @VERSION = 0.01;
    @ISA = qw( Exporter );
    @EXPORT = qw( &supersplit &superjoin );
}

sub supersplit{
    # Last argument is either a filehandle ref or the text itself.
    my $handleref = pop || \*STDIN;
    unless (ref($handleref) =~ /GLOB/){
        push(@_, $handleref);
        undef $handleref;
    }
    my $second = $_[0] || '\s+';
    my $first = $_[1] || '\n';
    my $text;
    $text = $_[2] unless $handleref;
    my $index = 0;
    my $arrayref = [[]] ;
    local $/;
    undef $/;      # slurp the whole input at once
    $text = <$handleref> if( ref($handleref) );
    my @lines = split( $first, $text );
    for (@lines){
        # split must run in list context; "split(...) || $_" would
        # store only the field count.
        my @cells = split($second);
        $arrayref->[$index] = [ @cells ? @cells : $_ ];
        $index++;
    }
    return $arrayref;
}

sub superjoin{
    my $array = pop || return undef;
    my $first = shift || "\t";
    my $second = shift || "\n";
    my $text = '';
    return undef unless( ref($array) eq 'ARRAY' );
    return undef unless( ref($array->[0]) =~ /ARRAY|HASH/ );
    my $arrayarray = [];
    for $arrayarray (@$array) {
        $text .= join( $first, @$arrayarray );
        $text .= $second;
    }
    return $text;
}

1;
Re (tilly) 1: Supersplit
by tilly (Archbishop) on Dec 31, 2000 at 08:03 UTC
OK, I took the idea and did basically a complete rewrite.
In particular I noticed the following:
- I made the interfaces less magic. For instance you
have this magic stuff on the filehandle. I made that a
separate function. This will work with tied filehandles
as well. For the same reason I stopped using $/
because the author of a tied method may not pay attention
to that.
- If you are a module, there is no need to do
initializations in a BEGIN block.
- I would have moved your functions into @EXPORT_OK as
Exporter suggests, but you want this for one-offs. OK,
TIMTOWTDI. But if I was using it I would have made that
change.
- I wondered if your @VERSION was meant to be $VERSION.
- I note that there is no equivalent to the third
argument to split. I played both ways with that then
left it alone. Just note that trailing blanks will
get split.
- I am doing a rewrite and didn't include any POD. You
should.
- I made this n-dimensional because, well, because I can.
- You were not completely clear what the argument order
was, and naming the first one $second and the second one
$first is IMO confusing. I made it recursive, but still
you should note the naming issue. If you wanted 2-dim I
would suggest $inner and $outer as names.
- You are using explicit indexes. I almost never find
that necessary. In this version I use map. Otherwise
you could push onto the anon array. Avoiding ever
thinking about the index leads to fewer opportunities to
mess up, and often results in faster code as well!
- I am using qr// to avoid recompiling REs. Given the
function call overhead this probably isn't a win. I did
it mainly to mention that if you are going to do repeated
uses of an RE, you can and should avoid compilation
overhead.
- The reason for my wrappers is so that my recursion
won't mess up on the defaults. :-)
- I considered checking wantarray, but the complication
in the interface did not seem appropriate for short stuff.
- Note that this entire approach is going to fail
miserably on formats with things like escape characters
and escape sequences. For instance the CSV format is
never going to be easily handled using this. Something
to consider before using this for an interesting problem.
Oh right, and you want to see code? OK.
package SuperSplit;
use strict;
use Exporter;
use vars qw( @EXPORT @ISA $VERSION );
$VERSION = 0.02;
@ISA = 'Exporter';
@EXPORT = qw( superjoin supersplit supersplit_io );

# Takes a reference to an n-dim array followed by n strings.
# Joins the array on those strings (inner to outer),
# defaulting to "\t", "\n"
sub superjoin {
    my $a_ref = shift;
    push (@_, "\t") if @_ < 1;
    push (@_, "\n") if @_ < 2;
    _join($a_ref, @_);
}

sub _join {
    my $a_ref = shift;
    my $str = pop;
    # Work on a copy so the caller's array is not overwritten
    # with the joined strings.
    my @parts = @$a_ref;
    if (@_) {
        @parts = map {_join($_, @_)} @parts;
    }
    join $str, @parts;
}

# Splits the input from a filehandle
sub supersplit_io {
    my $fh = shift;
    unless (defined($fh)) {
        $fh = \*STDIN;
    }
    unshift @_, join '', <$fh>;
    supersplit(@_);
}

# n-dim split. First arg is text, rest are patterns, listed
# inner to outer. Defaults to /\t/, /\n/
sub supersplit {
    my $text = shift;
    if (@_ < 1) {
        push @_, "\t";
    }
    if (@_ < 2) {
        push @_, "\n";
    }
    _split($text, map {qr/$_/} @_);
}

sub _split {
    my $text = shift;
    my $re = pop;
    my @res = split($re, $text); # Consider the third arg?
    if (@_) {
        @res = map {_split($_, @_)} @res;
    }
    \@res;
}

1;
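On the third argument to split mentioned in the list above: here is a minimal sketch, using only core Perl, of what it changes. Without a limit, split drops trailing empty fields; a negative limit keeps them. The sample row is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample row: two empty cells at the end.
my $row = "a\tb\t\t";

# Default: trailing empty fields are discarded.
my @short = split /\t/, $row;

# Negative limit: every field is kept, including trailing empty ones.
my @full = split /\t/, $row, -1;

printf "default: %d fields, limit -1: %d fields\n",
    scalar @short, scalar @full;
```

Whether the table should keep those empty cells depends on the data, which is why I left the limit out of the code above.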
Cheers,
Ben
PS Please take the quantity and detail of my response as a
sign that I liked the idea enough to critique it, and
not as criticism of the effort you put in...
"If you are a module, there is no need to do initializations in a BEGIN block."
Maybe you haven't found any but I certainly have. They aren't easy to run into but I still find it a valuable habit. Leaving a potentially large window for a race condition to fit into isn't my idea of a good programming technique. I suspect that future features of Perl will make this even more important to avoid.
Along those lines, I'd use base qw(Exporter) myself, though I look forward to just being able to dispatch &import and avoid the sledgehammer of inheritance. (:
I consider qr// to be new enough of a feature that if I were to use it in a module I would make it optional:
BEGIN {
    my $sub;
    if( $] < 5.005 ) {
        $sub= 'return $_[0]';
    } else {
        $sub= 'return qr/$_[0]/';
    }
    eval "sub _qr { $sub }; 1"
        or die $@;
}
# ...
_split($text, map {_qr($_)} @_);
(untested, though).
-
tye
(but my friends call me "Tye")
I disagree quite strongly on the BEGIN issue.
When someone loads your module via use or require, there
is no gap between the finish of parsing your module and
the execution of code in your module. Therefore there is
no possibility of a race if you don't play games with
BEGIN in your module, and stupid games are not played
while checking who is loading you. I assume, of course,
that you are not using an unstable experimental feature.
(ie Threads. And if Perl 6 gets into races with
initialization code while loading modules, then that is a
showstopper in my books!)
If I am wrong then please show me how exactly this race can
happen in Perl. If it is something which I think could
possibly happen to me, then I will start blocking it. But
not unless. In general I won't put energy into defensively
coding against things that I don't think I will get wrong.
Conversely if it is something that I can conceivably get
wrong by accident, I will become a paranoid nut. :-)
Now I will give you a very good reason not to move
your initialization into BEGIN. If you don't move your
manipulation of @ISA into BEGIN, then on 5.005 (and if they
fix the stupid $Carp::CarpLevel games in Exporter on 5.6
as well) if you mess up a use statement then you will by
default get it correctly reported from the correct package
in your module rather than from the module which uses
you. If you move the manipulation of @ISA into BEGIN or
use base to achieve the same effect you will mess that
up. (Note that the fixed Carp that will appear in 5.8
does not have this issue.) Therefore by not playing games
with what parts of your initializations occur before your
module is done compiling, you will get error reporting which
is more likely to be informative.
So the BEGIN time initialization not only doesn't buy me
anything that I care about, it loses me something that I
consider very important!
The qr// point is a matter of taste and environment. No,
it is not supported in older Perls. If that is an issue
for you, then it is easy enough to drop it and just split
on /$re/.
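
To make the "no gap" claim concrete, here is a minimal sketch, not taken from either version of SuperSplit. It assumes a made-up module Demo::Init, served from a string through an @INC hook so the example is self-contained. The module initializes itself with plain run-time assignments and no BEGIN block, and the initialization has already run by the time require returns.

```perl
use strict;
use warnings;

# Hypothetical module source, kept in a string for the demonstration.
my $source = <<'END_MODULE';
package Demo::Init;
use strict;
use vars qw( @ISA $ready );
@ISA   = ('Exporter');   # plain run-time assignment, no BEGIN block
$ready = 1;
sub ready { return $ready }
1;
END_MODULE

# @INC hook that hands perl a filehandle on the string above
# (in-memory filehandles need perl 5.8 or later).
unshift @INC, sub {
    my ($self, $file) = @_;
    return unless $file eq 'Demo/Init.pm';
    open my $fh, '<', \$source
        or die "cannot open in-memory source: $!";
    return $fh;
};

# The module's top-level code runs as part of require itself.
require Demo::Init;
print "initialized when require returns: ",
    ( Demo::Init->ready ? "yes" : "no" ), "\n";
```

A use statement behaves the same way, since it is a require followed by a call to import, both done at compile time of the calling code.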
Well, tilly, thanx a lot for the thorough critique. And also for the last
remark, I needed that ;-). I have taken your code into the script, but
merged it with some of my code where appropriate.
Of course, I have some counter-remarks, here we go.
1. Why reverse the order for the arguments? join and split both
start with the separator, and then the input. So I changed the order back
to strings, input.
2. I don't like to pre-compile the regexes, otherwise the split couldn't
cope with changing delimiters, as in a text file (see the SYNOPSIS), or
with sprintf'ed data. So I changed that back. Furthermore, I couldn't
find any reference to qr// in manpages. Could you please explain?
3. On tye's comments: multi-dimensional arrays are a perl 5 feature,
so I should check for that anyway. Out of time now, so that is for the next
version of supersplit.
4. I really like the recursive approach.
5. I don't see the need for a separate IO version, so I changed that back,
too. I just try to treat the string as a filehandle, or try to open it as
file (new feature). I didn't succeed in getting supersplit( INPUT ), with
INPUT as a filehandle, to work. That's peculiar, because the manpage tells
me that <$fh>, with $fh='INPUT', should work.
6. You are totally right on the matter of the inner/ outer naming
convention.
7. And ++ for the join( $_, @_) stuff. I never would have
dared to use it. But of course $_ and @_ have different namespaces...
8. I removed the BEGIN blocks. Is this something for the manpages (perldoc
perlmod)?
Finally, I tested the code with 2D-arrays. It works. I'm leaving home for
the remainder of this year, so we'll continue next year.
Happy new year everyone, best wishes, and thanx for the comments!
Jeroen
The new code, with POD, is here:
Here is the explanation for my feedback:
- Why reverse the order for the arguments? join and
split both first start with the separator, and than the
input. So I changed the order back to strings, input.
Because if you have positionally determined arguments with
one list being variable length, it is usual to have the
variable list at the end of your argument list. In this
case when I made it recursive (and therefore capable of
handling n-dim arrays) you had a variable length list of
things to join and split. So I moved those arguments to
the end.
- I don't like to pre-compile the regex's, otherwise
the split couldn't cope with changing delimiters, as in
a text file (see the SYNOPSIS), or with sprintf'ed data.
So I changed that back. Furthermore, I couldn't find any
reference to qr// in manpages. Could you please explain?
Pre-compiling the regexes should not pose a problem for
having patterns that handle multiple delimiters. Could
you try it and report back? As for documents, the docs
on this server are 5.003 specific. (Same as Camel 2.)
Most people are on 5.005 or 5.6. On those machines you
can find out about the feature from the local
documentation using the perldoc utility. In fact in
this case:
perldoc -f qr
directs you to perlop/"Regexp Quote-Like
Operators". So try
perldoc perlop
and then type /Quote-Like to get to the relevant
section. Then /qr and hit 'n' until you get to
the right spot. (The same search/paging tricks work with
utilities like man and less on *nix systems.)
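
As a minimal sketch of the multiple-delimiter case (the pattern and the sample lines are made up for illustration), one pattern compiled once with qr// can cover several delimiters and be reused for every line:

```perl
use strict;
use warnings;

# Hypothetical pattern: a comma or semicolon with optional
# surrounding whitespace, or a tab.
my $delim = qr/\s*[,;]\s*|\t/;

# The compiled pattern is reused for each line.
my @rows = map { [ split $delim, $_ ] } ( "a, b;c\td", "1;2 ,3" );

print join( '|', @$_ ), "\n" for @rows;
```

If the delimiter really has to change from call to call, you compile a new pattern per call, which is what the map over qr// in supersplit already does.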
- ...more dimensional arrays are a perl 5 feature,
so I should check for that anyway. Out of time now, so
next version of supersplit.
I already did that with the recursion. :-)
- I really like the recursive approach.
So did I. :-)
- I don't see the need for a separate IO version, so
I changed that back, too. I just try to treat the string
as a filehandle, or try to open it as file (new feature).
I didn't succeed to get supersplit( INPUT ), with INPUT
as a filehandle, to work. That's peculiar, because the
manpage tells me that <$fh>, with $fh='INPUT',
should work.
The need is due to your having overloaded the input too
much. For instance if someone tried to use your current
version of supersplit() on an uploaded file from
CGI they would fail miserably. I also really
don't like trying an open and silently failing.
Additionally it is generally a bad idea to limit how your
caller can pass information. What if I really want to
pass you data from a socket? Or from IO::Scalar? Or
from a string I have already pre-processed? Having two
functions, one of which is a wrapper around the other,
for that situation leaves you with a consistent interface
and more flexibility.
As for your comment on what you are surprised is failing,
I would not expect that to work. Which manpage led you
to expect that it would?
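
A minimal sketch of the two-function layout, under the assumption of made-up names split_text and split_fh rather than the real supersplit interface: one function works on a string, and the other only reads a handle and delegates. An in-memory handle (perl 5.8 or later) stands in here for a socket, an upload, or a tied handle.

```perl
use strict;
use warnings;

# Hypothetical names: split_text does the work on a string,
# split_fh only drains a handle and passes the text along.
sub split_text {
    my ( $text, $inner, $outer ) = @_;
    return [ map { [ split /$inner/, $_ ] } split /$outer/, $text ];
}

sub split_fh {
    my $fh = shift;
    return split_text( join( '', <$fh> ), @_ );
}

# Any readable handle will do; this one reads from a string.
my $data = "a\tb\nc\td\n";
open my $fh, '<', \$data
    or die "cannot open in-memory handle: $!";
my $table = split_fh( $fh, "\t", "\n" );
close $fh;

print scalar(@$table), " rows, last cell: $table->[1][1]\n";
```

The caller decides where the data comes from, and nothing has to guess whether a scalar is text, a handle name, or a file name.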
- You are totally right on the matter of the
inner/outer naming convention.
Get bitten often enough and you become sensitive to
potential confusions in names. :-)
The real issue here is the same one which makes it hard
for programmers to find their own bugs. You need to
step out of your own pre-conceptions of how you are
supposed to be working and thinking and see the problem
from what another person's PoV is likely to be. This is
frequently much easier for another person to do...
- And ++ for the join( $_, @_) stuff. I never would
have dared to use it. But of course $_ and @_ have
different namespaces..
:-)
- I removed the BEGIN blocks. Is this something for
the manpages (perldoc perlmod)?
Well it is something that I know because I looked in some
detail at Carp and Exporter a while ago. While the
principles of what happens when are documented, I don't
think that the conclusion is stated anywhere. I certainly
had to learn it by reading and thinking through the code.
Re: Supersplit
by ichimunki (Priest) on Dec 20, 2000 at 18:31 UTC
Cool function ideas.
The POD doesn't work for me as posted. I put in some blank lines after/before heading directives, and changed the list sequence to a
=head1 METHODS
=head2 name
=B<usage>
text paragraph
=head2 name...
construction, and then it was readable, although not the usual way I've seen methods listed.
Granted: I should have checked the POD. Added empty lines,
fixed the items. Checked it, it's parsable now.
I'm not familiar with PODs for methods,
so if someone could give some pointers,
it would be appreciated. Eventually, I would like to submit
this as a module. So should I adhere to a module-like POD,
or should it be in the method form?
Thanks for the reply and the compliments, mrmick too.
Have fun,
Jeroen
I was dreaming of guitarnotes that would irritate an executive kind of guy (FZ)
Re: Supersplit
by mrmick (Curate) on Dec 20, 2000 at 18:40 UTC
Awesome!!
Me tries it and Me likes it!
I can see some definite uses for this one! Thanks, jeroenes, for sharing it with us. ++ for you.
Mick