Regex for non-patterned input

sidsinha has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to format the contents of an array into a table. The input contains rows of data, with the fields in each row separated by spaces or tabs (as below). I want the table to have 4 columns, but one field in each row contains spaces itself and needs to be treated as a single value.

For example, the input below:

'cOne cTwo   cThree 13 sec  cFour           
cOne cTwo   cThree 11 sec  cFour 
cOne cTwo   cThree 1 min 2 sec  cFour 
cOne cTwo   cThree 13 sec  cFour';

should be printed as:

ColumnA    ColumnB   ColumnC      ColumnD
cOne       cTwo      13 sec       cFour
cOne       cTwo      11 sec       cFour             
cOne       cTwo      1 min 2 sec  cFour

Entries such as "13 sec" or "1 min 13 sec" should end up in a single column. Here's the code I tried, but it's terribly naive. Could someone help me? Thanks.


use strict;
use warnings;
use HTML::Table;
my $table = new HTML::Table(-border=>0.2,
 -bgcolor=>'#F4F5F7',
 -head=> ['ColumnA','ColumnB','ColumnC', 'ColumnD']);

my @wtodays=
'cOne cTwo   cThree 13 sec  cFour           
cOne cTwo   cThree 11 sec  cFour 
cOne cTwo   cThree 1 min 2 sec  cFour 
cOne cTwo   cThree 13 sec  cFour'; 
 
for ( @wtodays )
{
$table->addRow(split(/\s+/, "$_\n"));
}

print $table;


Replies are listed 'Best First'.
Re: Regex for non-patterned input by choroba (Cardinal) on Aug 15, 2013 at 11:57 UTC
If you know that only the third column might contain whitespace, you can split each line, and then join the cells from the third to the last but one.

for (@lines) {
    my @cells = split;
    my $time = join ' ', @cells[2 .. $#cells - 1];
    $table->addRow(@cells[0, 1], $time, $cells[-1]);
}

لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
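A minimal, self-contained sketch of the split-and-join idea above, run against two sample rows from the question. The helper name `split_row` and the pipe-separated printout are assumptions made for illustration; they are not part of the reply, and `HTML::Table` is left out so the snippet runs with core Perl only.

```perl
use strict;
use warnings;

# Hypothetical helper: split a row on whitespace, then rejoin everything
# between the second cell and the last cell into a single value.
sub split_row {
    my ($line) = @_;
    my @cells = split ' ', $line;      # split ' ' skips leading whitespace and collapses runs
    return unless @cells >= 4;         # too short to hold four columns
    my $middle = join ' ', @cells[2 .. $#cells - 1];
    return (@cells[0, 1], $middle, $cells[-1]);
}

my @rows = map { [ split_row($_) ] } (
    'cOne cTwo   cThree 13 sec  cFour',
    'cOne cTwo   cThree 1 min 2 sec  cFour ',
);

print join('|', @$_), "\n" for @rows;
```

With these slices the joined cell still carries the `cThree` label. Starting the slice at index 3 instead of 2 would drop it, which is what the desired output in the question shows.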
Re: Regex for non-patterned input by boftx (Deacon) on Aug 15, 2013 at 12:46 UTC
I humbly submit that the "array" definition and subsequent code will never do what is desired:

my @wtodays=
'cOne cTwo cThree 13 sec cFour
cOne cTwo cThree 11 sec cFour
cOne cTwo cThree 1 min 2 sec cFour
cOne cTwo cThree 13 sec cFour';

for ( @wtodays )
{
$table->addRow(split(/\s+/, "$_\n"));
}

The array is being assigned a single string constant as if it were a scalar. The data in the array must first be organized properly before any meaningful manipulations can be done with it. For example:

my @wtodays = (
    'cOne cTwo cThree 13 sec cFour',
    'cOne cTwo cThree 11 sec cFour',
    'cOne cTwo cThree 1 min 2 sec cFour',
    'cOne cTwo cThree 13 sec cFour',
);

for ( @wtodays ) {
    # do whatever text processing you need on a single row
}

And yes, if there are tabs between the column data, and spaces only occur in the column values, then the split becomes trivial once the array is properly defined.

Update: grammar correction in previous paragraph.
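If the data really does arrive as one multi-line string, as in the question, one way to get a properly organized array is to split that string on newlines first. A small sketch; the variable name `$raw` is an assumption, and blank lines are skipped as a precaution.

```perl
use strict;
use warnings;

# The question's data as a single multi-line string (assumed input shape).
my $raw = 'cOne cTwo   cThree 13 sec  cFour
cOne cTwo   cThree 11 sec  cFour
cOne cTwo   cThree 1 min 2 sec  cFour
cOne cTwo   cThree 13 sec  cFour';

# One array element per line, ignoring lines that contain only whitespace.
my @wtodays = grep { /\S/ } split /\n/, $raw;

print scalar(@wtodays), " rows\n";
```

After this step each element of `@wtodays` holds exactly one row, so the per-row processing suggested in the other replies can be applied inside the loop.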
Re: Regex for non-patterned input by Lawliet (Curate) on Aug 15, 2013 at 11:13 UTC
Ah, so you cannot simply split on whitespace, because one of your columns has whitespace in it. Luckily, the data looks simple enough that we can get around that. For example, try the following (untested) regex:

for ( @wtodays ) {
    if (/^(\w+)\s+(\w+)\s+(\w+)\s+([\w\s]+)\s+(\w+)$/) {
        $table->addRow($1, $2, $3, $4, $5);
    }
}

We individually capture each column. You can see that the regex for capturing the fourth column looks different from the others because of the whitespace it will contain. Specifically, instead of grabbing all the word-like characters, we grab all word-like and space-like characters, and then continue on our merry way to capturing the fifth column.

I hope this helps, and I hope you understand the logic behind it.
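The sample rows in the question have runs of spaces between fields and trailing whitespace at the end of some lines. A pattern that ends in `(\w+)$` cannot match a line with trailing spaces, and a greedy `[\w\s]+` keeps one of the separator spaces when two spaces precede the last field. The sketch below is one possible adjustment, not the reply's own code: the fourth group is made non-greedy and `\s*` is allowed before `$`. The array names are assumptions.

```perl
use strict;
use warnings;

my @lines = (
    'cOne cTwo   cThree 13 sec  cFour           ',
    'cOne cTwo   cThree 1 min 2 sec  cFour ',
);

my @parsed;
for my $line (@lines) {
    # ([\w\s]+?) grows only as far as needed for the rest of the pattern to match;
    # \s* before $ tolerates trailing whitespace on the line.
    if ($line =~ /^(\w+)\s+(\w+)\s+(\w+)\s+([\w\s]+?)\s+(\w+)\s*$/) {
        push @parsed, [$1, $2, $3, $4, $5];
    }
}

print join('|', @$_), "\n" for @parsed;
```

Each row still yields five captures, so passing `$1, $2, $4, $5` to `addRow` would give the four columns the question asks for.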
Re: Regex for non-patterned input by mtmcc (Hermit) on Aug 15, 2013 at 11:39 UTC
Does your data have to be all in a single array? If your data has tabs/spaces/newlines where they should be, you can split on tabs only rather than on any whitespace, something like this:

#!/usr/bin/perl
use strict;
use warnings;
use HTML::Table;

my $table = new HTML::Table(-border=>0.2,
 -bgcolor=>'#F4F5F7',
 -head=> ['ColumnA','ColumnB','ColumnC', 'ColumnD', 'ColumnE']);

# The columns in the __DATA__ section are separated by tabs.
while ( <DATA> )
{
    chomp;    # keep the trailing newline out of the last cell
    my @row = split (/\t/, $_);
    $table->addRow(@row);
}

print $table;

__DATA__
cOne	cTwo	cThree	13 sec	cFour
cOne	cTwo	cThree	11 sec	cFour
cOne	cTwo	cThree	1 min 2 sec	cFour
cOne	cTwo	cThree	13 sec	cFour
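To make the tab assumption explicit, here is a small sketch of the same split with the tabs written as `\t` escapes, so the separators stay visible even if a copy-and-paste turns literal tabs into spaces. The sample rows and array names are assumptions, and the result is printed pipe-separated instead of going through `HTML::Table`.

```perl
use strict;
use warnings;

# Sample rows with explicit tab separators (an assumption about the real input).
my @lines = (
    "cOne\tcTwo\tcThree\t13 sec\tcFour\n",
    "cOne\tcTwo\tcThree\t1 min 2 sec\tcFour\n",
);

my @rows;
for my $line (@lines) {
    chomp $line;                        # drop the newline so it does not stick to the last cell
    push @rows, [ split /\t/, $line ];  # spaces inside a cell are left alone
}

print join('|', @$_), "\n" for @rows;
```

This only works if the input really uses tabs between columns; if the separators are a mix of spaces and tabs, one of the regex-based replies is a better fit.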
Re: Regex for non-patterned input by kcott (Archbishop) on Aug 16, 2013 at 08:56 UTC
G'day sidsinha,

The data you present is at odds with what you've described. The following works on the data you've shown:

$ perl -Mstrict -Mwarnings -le '
    my @wtodays=
    q{cOne cTwo cThree 13 sec cFour
cOne cTwo cThree 11 sec cFour
cOne cTwo cThree 1 min 2 sec cFour
cOne cTwo cThree 13 sec cFour};

    my $re = qr{
        ^                                   # anchor: start of line
        \s*                                 # discard: possible whitespace
        (\w+)                               # capture: cOne
        \s+                                 # discard: whitespace
        (\w+)                               # capture: cTwo
        \s+                                 # discard: whitespace
        \w+                                 # discard: cThree
        \s+                                 # discard: whitespace
        ((?:\d+ \s+ min \s+)? \d+ \s+ sec)  # capture: possible min and sec
        \s+                                 # discard: whitespace
        (\w+)                               # capture: cFour
        \s*                                 # discard: possible whitespace
        $                                   # anchor: end of line
    }mx;

    my @table_data;

    for (@wtodays) {
        while (/$re/g) {
            push @table_data => [$1, $2, $3, $4];
        }
    }

    {
        local $" = "|";
        for (@table_data) {
            print "@$_";
        }
    }
'
cOne|cTwo|13 sec|cFour
cOne|cTwo|11 sec|cFour
cOne|cTwo|1 min 2 sec|cFour
cOne|cTwo|13 sec|cFour

-- Ken
Re: Regex for non-patterned input by Anonymous Monk on Aug 15, 2013 at 11:23 UTC
Eat the timestamp first, then eat the non-whitespace:

use Data::Dump;
$_ = q{cOne cTwo cThree 1 min 2 sec cFour };
dd[ m/(\d+\smin\s\d+\ssec|\S+)/g ];
__END__
["cOne", "cTwo", "cThree", "1 min 2 sec", "cFour"]

See also: perlintro ("More complex regular expressions"), perlrequick, split.
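The alternation above only recognizes the full "N min N sec" form, so an entry like "13 sec" would come back as two separate tokens. A possible variant, sketched here as an assumption and not as the reply's code, makes the minutes part optional (borrowing the optional-group idea from kcott's pattern in this thread) and uses `\s+` to tolerate runs of whitespace. Core Perl only; the array name `@results` is hypothetical.

```perl
use strict;
use warnings;

my @results;
for my $line ('cOne cTwo cThree 13 sec cFour',
              'cOne cTwo cThree 1 min 2 sec cFour ') {
    # Try the time entry first (minutes optional), then fall back to any
    # run of non-whitespace characters.
    my @cells = $line =~ m/((?:\d+\s+min\s+)?\d+\s+sec|\S+)/g;
    push @results, \@cells;
    print join('|', @cells), "\n";
}
```

The order of the alternatives matters: if `\S+` came first, it would always win and the time entry would be broken into separate tokens.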

