benaw has asked for the wisdom of the Perl Monks concerning the following question:
Hello Monks,
Background: I have an excel spreadsheet and one of the columns has multiple values separated with a comma like an array. What I want to do is unwind the array so each value of the array is on a separate line and all other fields are duplicated or carried with it mongo documentation of $unwind does what I'd like or it may be a map reduce?
In the below example | is the delimiter in a CSV because I'm trying to target the commas in a particular cell and placing all the same text on each line. For example:
Source:
some text FOO | some text BAR | oh , no , commas | some text BAZ | some text QUX
Output:
some text FOO | some text BAR | oh | some text BAZ | some text QUX
some text FOO | some text BAR | no | some text BAZ | some text QUX
some text FOO | some text BAR |commas | some text BAZ | some text QUX
Re: splinting a line of text by comma
by tybalt89 (Monsignor) on Nov 22, 2018 at 22:36 UTC
|
#!/usr/bin/perl
# https://perlmonks.org/?node_id=1226189
use strict;
use warnings;
while( <DATA> )
{
if( /\|([^|,]*(?:,[^|,]*)+)\|/ )
{
print "$`|$_|$'" for split /,/, $1;
}
else
{
print;
}
}
__DATA__
some text FOO | some text BAR | oh , no , commas | some text BAZ | so
+me text QUX
some text FOO | some text BAR | single | some text BAZ | some text QU
+X
Outputs:
some text FOO | some text BAR | oh | some text BAZ | some text QUX
some text FOO | some text BAR | no | some text BAZ | some text QUX
some text FOO | some text BAR | commas | some text BAZ | some text QU
+X
some text FOO | some text BAR | single | some text BAZ | some text QU
+X
| [reply] [d/l] [select] |
Re: splinting a line of text by comma
by tybalt89 (Monsignor) on Nov 22, 2018 at 22:47 UTC
|
#!/usr/bin/perl
# https://perlmonks.org/?node_id=1226189
use strict;
use warnings;
while( <DATA> )
{
my @queue = $_;
while( @queue )
{
local $_ = shift @queue;
if( /\|([^|,]*(?:,[^|,]*)+)\|/ )
{
push @queue, map "$`|$_|$'", split /,/, $1;
}
else
{
print;
}
}
}
__DATA__
some text FOO | some text BAR | oh , no , commas | some text BAZ | so
+me text QUX
some text FOO | some text BAR | single | some text BAZ | some text QU
+X
some text FOO | some text BAR | oh , no , commas | one, two | some te
+xt QUX
Outputs:
some text FOO | some text BAR | oh | some text BAZ | some text QUX
some text FOO | some text BAR | no | some text BAZ | some text QUX
some text FOO | some text BAR | commas | some text BAZ | some text QU
+X
some text FOO | some text BAR | single | some text BAZ | some text QU
+X
some text FOO | some text BAR | oh | one| some text QUX
some text FOO | some text BAR | oh | two | some text QUX
some text FOO | some text BAR | no | one| some text QUX
some text FOO | some text BAR | no | two | some text QUX
some text FOO | some text BAR | commas | one| some text QUX
some text FOO | some text BAR | commas | two | some text QUX
| [reply] [d/l] [select] |
|
Rather than iterating through every permutation of every comma-separated sub-phrase, wouldn't it be better to print each set of sub-phrases once per line? Also, comma-separated sub-phrases at the beginning or end of the line will not be processed. How about something like:
c:\@Work\Perl\monks>perl -wMstrict -le
"my @lines = (
'some text foo | some text BAR | oh , no , commas | some text BAZ |
+ some text QUX',
'some text Oof | some, text, BAR | some , text , BAZ | some text QU
+X',
'some, text, FOO | some text BAR | one two | some , text , QUX',
);
;;
my $got_commas = qr{
(?: (?<= [|]) | \A) [^|,]* (?: , [^|,]*)+ (?= [|] | \z)
}xms;
;;
for my $line (@lines) {
while ($line =~ m{ ($got_commas) }xmsg) {
my $before = substr $line, 0, $-[1];
my $after = substr $line, $+[1];
print qq{:$before$_$after:} for split m{ , }xms, $1;
}
}
"
:some text foo | some text BAR | oh | some text BAZ | some text QUX:
:some text foo | some text BAR | no | some text BAZ | some text QUX:
:some text foo | some text BAR | commas | some text BAZ | some text QU
+X:
:some text Oof | some| some , text , BAZ | some text QUX:
:some text Oof | text| some , text , BAZ | some text QUX:
:some text Oof | BAR | some , text , BAZ | some text QUX:
:some text Oof | some, text, BAR | some | some text QUX:
:some text Oof | some, text, BAR | text | some text QUX:
:some text Oof | some, text, BAR | BAZ | some text QUX:
:some| some text BAR | one two | some , text , QUX:
: text| some text BAR | one two | some , text , QUX:
: FOO | some text BAR | one two | some , text , QUX:
:some, text, FOO | some text BAR | one two | some :
:some, text, FOO | some text BAR | one two | text :
:some, text, FOO | some text BAR | one two | QUX:
(Availability of \K with Perl versions 5.10+ can make the (?: (?<= [|]) | \A)) expression a little nicer.)
Update: An "impure" variation. No big improvement; maybe of some interest.
c:\@Work\Perl\monks>perl -wMstrict -le
"my @lines = (
'some text foo | some text BAR | oh , no , commas | some text BAZ |
+ some text QUX',
'some text Oof | some, text, BAR | some , text , BAZ | some text QU
+X',
'some, text, FOO | some text BAR | one two | some , text , QUX',
);
;;
my $got_commas = qr{
(?: (?<= [|]) | \A) [^|,]* (?: , [^|,]*)+ (?= [|] | \z)
}xms;
;;
local our @out;
use re 'eval';
m{ \A (.*?) ($got_commas) (?= (.*))
(?{ push @out, map qq{>$1$_$3<}, split m{ , }xms, $2 })
(?!)
}xmsg
for @lines;
;;
print for @out;
"
>some text foo | some text BAR | oh | some text BAZ | some text QUX<
>some text foo | some text BAR | no | some text BAZ | some text QUX<
>some text foo | some text BAR | commas | some text BAZ | some text QU
+X<
>some text Oof | some| some , text , BAZ | some text QUX<
>some text Oof | text| some , text , BAZ | some text QUX<
>some text Oof | BAR | some , text , BAZ | some text QUX<
>some text Oof | some, text, BAR | some | some text QUX<
>some text Oof | some, text, BAR | text | some text QUX<
>some text Oof | some, text, BAR | BAZ | some text QUX<
>some| some text BAR | one two | some , text , QUX<
> text| some text BAR | one two | some , text , QUX<
> FOO | some text BAR | one two | some , text , QUX<
>some, text, FOO | some text BAR | one two | some <
>some, text, FOO | some text BAR | one two | text <
>some, text, FOO | some text BAR | one two | QUX<
Give a man a fish: <%-{-{-{-<
| [reply] [d/l] [select] |
|
#!/usr/bin/perl
# https://perlmonks.org/?node_id=1226189
use strict;
use warnings;
while( <DATA> )
{
my @queue = $_;
while( @queue )
{
local $_ = shift @queue;
if( / [^|,\n]* (?: , [^|,\n]* )+ /x )
{
push @queue, map "$`$_$'", split /,/, $&;
}
else
{
print;
}
}
}
__DATA__
some text foo | some text BAR | oh , no , commas | some text BAZ | som
+e text QUX
some text Oof | some, text, BAR | some , text , BAZ | some text QUX
some,text,FOO | some text BAR | one two | some , text , QUX
Outputs:
some text foo | some text BAR | oh | some text BAZ | some text QUX
some text foo | some text BAR | no | some text BAZ | some text QUX
some text foo | some text BAR | commas | some text BAZ | some text QUX
some text Oof | some| some | some text QUX
some text Oof | some| text | some text QUX
some text Oof | some| BAZ | some text QUX
some text Oof | text| some | some text QUX
some text Oof | text| text | some text QUX
some text Oof | text| BAZ | some text QUX
some text Oof | BAR | some | some text QUX
some text Oof | BAR | text | some text QUX
some text Oof | BAR | BAZ | some text QUX
some| some text BAR | one two | some
some| some text BAR | one two | text
some| some text BAR | one two | QUX
text| some text BAR | one two | some
text| some text BAR | one two | text
text| some text BAR | one two | QUX
FOO | some text BAR | one two | some
FOO | some text BAR | one two | text
FOO | some text BAR | one two | QUX
| [reply] [d/l] [select] |
|
|
|
Re: splinting a line of text by comma
by Tux (Canon) on Nov 23, 2018 at 07:57 UTC
|
Why would you overcomplicate this? Use Spreadsheet::Read (or Spreadsheet::ParseXLSX directly) to read your Excel sheet and then use Text::CSV_XS (or Text::CSV) on the content of the cell(s) you want to split
Here is a tested example. The pm1226189.xlsx that I used has comma-separated fields that also include newlines and quotation. If your Excel also contains encoded fields, you probably have to deal with that inside the inner loop
#!/pro/bin/perl
use 5.14.1;
use warnings;
use Data::Peek;
use Spreadsheet::Read;
use Text::CSV_XS;
my $book = Spreadsheet::Read->new ("pm1226189.xlsx") or
die "Cannot read spreadsheet: $!\n";
my $sheet = $book->sheet (1) or die "Book has no sheets\n";
my $csv = Text::CSV_XS->new ({
binary => 1,
auto_diag => 1,
allow_whitespace => 1,
});
my @data;
foreach my $col (1 .. $sheet->maxcol) {
foreach my $row (1 .. $sheet->maxrow) {
my $cell = $sheet->cell ($col, $row);
if ($cell && $cell =~ m/,/) {
open my $fh, "<", \$cell;
$data[$col][$row] = $csv->getline ($fh);
close $fh;
}
else {
$data[$col][$row] = $cell;
}
}
}
DDumper \@data;
Enjoy, Have FUN! H.Merijn
| [reply] [d/l] |
Re: splinting a line of text by comma
by swl (Parson) on Nov 22, 2018 at 22:31 UTC
|
This appears to work given the data you provided. However, it does not handle multiple fields that contain commas. It also does not handle quoted commas and the like. You would need to use Text::CSV or similar for that.
Adapt the outer loop as needed (e.g. if reading from a file handle):
use 5.010;
my @lines
= ('some text FOO | some text BAR | oh , no , commas | some text BAZ
+ | some text QUX');
for my $line (@lines) {
chomp $line;
if ($line =~ /,/) {
my @items = split /,/, $line;
my $prefix = substr $items[0], 0, rindex $items[0], '|';
my $suffix = substr $items[-1], index $items[-1], '|';
#say $prefix;
#say $suffix;
foreach my $item (@items) {
say "$prefix | $item $suffix";
}
}
else {
say $line;
}
}
| [reply] [d/l] |
Re: splinting a line of text by comma
by stevieb (Canon) on Nov 22, 2018 at 22:18 UTC
|
Can't speak for everyone, but I'd like to see the code you've tried that isn't working.
Although Perlmonks isn't as stringent in the whole "we're not a code writing service" as some other sites, and periodically no-code posts sometimes do get workable answers, it's only fair that you provide the code you have tried, so we do know you've made some form of effort.
Note that you've got a typo in your Subject line: s/splinting/splitting/.
| [reply] [d/l] |
Re: splinting a line of text by comma
by LanX (Saint) on Nov 23, 2018 at 03:00 UTC
|
use strict;
use warnings;
sub globify {
local $" = ",";
my @commas = map qq{"$_"}, split /,/, $_;
@commas == 1
? qq{"$_"}
: "{@commas}";
}
while (<DATA>) {
print
glob
(
join '|',
map {globify}
split /\|/ ,$_
);
}
__DATA__
some text FOO | some text BAR | oh , no , commas | some text BAZ | som
+e text QUX
some text FOO | some text BAR | single | some text BAZ | some text QUX
some text FOO | some text BAR | oh | some text BAZ | some text QUX
some text FOO | some text BAR | no | some text BAZ | some text QUX
some text FOO | some text BAR | commas | some text BAZ | some text QUX
some text FOO | some text BAR | single | some text BAZ | some text QUX
| [reply] [d/l] [select] |
|
#!/usr/bin/perl
# https://perlmonks.org/?node_id=1226189
use strict;
use warnings;
print glob qq("$_") =~ s/[^|\n]+/{$&}/gr while <DATA>;
__DATA__
some text foo | some text BAR | oh , no , commas | some text BAZ | som
+e text QUX
some text Oof | some, text, BAR | some , text , BAZ | some text QUX
some,text,FOO | some text BAR | one two | some , text , QUX
Outputs:
some text foo | some text BAR | oh | some text BAZ | some text QUX
some text foo | some text BAR | no | some text BAZ | some text QUX
some text foo | some text BAR | commas | some text BAZ | some text QUX
some text Oof | some| some | some text QUX
some text Oof | some| text | some text QUX
some text Oof | some| BAZ | some text QUX
some text Oof | text| some | some text QUX
some text Oof | text| text | some text QUX
some text Oof | text| BAZ | some text QUX
some text Oof | BAR | some | some text QUX
some text Oof | BAR | text | some text QUX
some text Oof | BAR | BAZ | some text QUX
some| some text BAR | one two | some
some| some text BAR | one two | text
some| some text BAR | one two | QUX
text| some text BAR | one two | some
text| some text BAR | one two | text
text| some text BAR | one two | QUX
FOO | some text BAR | one two | some
FOO | some text BAR | one two | text
FOO | some text BAR | one two | QUX
| [reply] [d/l] [select] |
|
yeah, better.
I was already in bed half sleeping and was surprised about whitespace handling in glob.
> Still very much not recommended :)
Well if you find a way to escape curlies and other "wildcard characters" from the input... :)
| [reply] |
|
|
Re: splinting a line of text by comma
by kcott (Archbishop) on Nov 23, 2018 at 08:16 UTC
|
$ perl -nE '
chomp;
my @x = split /\|/;
my ($y) = grep $x[$_] =~ /,/, 0..$#x;
for (split /,/, $x[$y]) {
splice @x, $y, 1, $_;
say join "|", @x;
}
'
a,b,c|d|e
a|d|e
b|d|e
c|d|e
a|b,c,d|e
a|b|e
a|c|e
a|d|e
a|b|c,d,e
a|b|c
a|b|d
a|b|e
Although, for dealing with spreadsheet data,
Text::CSV
would always be my first port of call.
Update:
The code I originally posted had:
my @z = split /,/, $x[$y];
for (@z) {
I've reduced that to:
for (split /,/, $x[$y]) {
The output is identical for both cases.
| [reply] [d/l] [select] |
|
|