ak_mmx has asked for the wisdom of the Perl Monks concerning the following question:
Hello,
I am trying to process a text file and print the output in a desired format.
I am able to get a small piece of the code working, however i am finding it difficult to produce the results as expected, did some research and experimented with loading to hash etc. Didnt work as expected , any advice on how to proceed would really help.
SCRIPT
#!/usr/bin/perl
use strict;
use warnings;
my $data = 'input.txt';
my $placeholder = '';
unless(open(INPUT, $data)) { die "Can't open input file $data\n"; }
# Print heading
print "Product Release product Type color basic color all ov
+erseas shipping\n";
while(my $line = <INPUT>)
{
if($line =~ m/Release Date(\d+)\/(\d+)\/(\d+)/)
{
print "$line";
}
}
close INPUT;
INPUT FILE (input.txt)
Release Date2/2/2019
product clock1(analog)
color basic white
color all white,black,silver
warranty 1 year
not sold yet
Release Date2/2/2020
product none
Release Date2/2/2021
product clock1(digital)
color basic black
color all black,silver
warranty 1 year
not sold yet
Release Date2/2/2022
product clock2(digital)
color basic white
color all white
overseas shipping yes
shipping charges yes
warranty 1 year
not sold yet
RESULTS
$ ./generate.pl
Product Release product Type color basic color all overseas
+shipping
Release Date2/2/2019
Release Date2/2/2020
Release Date2/2/2021
Release Date2/2/2022
$
DESIRED RESULTS
Product Release product Type color basic color all
+ overseas shipping
Release Date2/2/2019 analog white white,black,silve
+r N/A
Release Date2/2/2020 none N/A N/A
+ N/A
Release Date2/2/2021 digital black black,silver
+ N/A
Release Date2/2/2022 digital white white
+ yes
Re: process multiline text and print in desired format
by kcott (Archbishop) on Mar 17, 2021 at 19:32 UTC
|
G'day ak_mmx,
Welcome to the Monastery.
The following script shows how I'd probably attack this problem.
See the notes at the end for some details.
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
my $data = 'pm_11129830_input.txt';
my $fmt = "%-24s%-15s%-14s%-24s%s\n";
my @headings = (
'Product Release',
'product Type',
'color basic',
'color all',
'overseas shipping',
);
printf $fmt, @headings;
{
open my $fh, '<', $data;
local $/ = '';
while (<$fh>) {
my ($rel, $type, $colb, $cola, $ship)
= ('', 'none', 'N/A', 'N/A', 'none');
for my $item (map /^\s*(.+)$/, split /\n/) {
for ($item) {
/^Release Date/ && do {
$rel = $item;
last;
};
/^product [^(]+\(([^()]+)/ && do {
$type = $1;
last;
};
/^color basic (.+)$/ && do {
$colb = $1;
last;
};
/^color all (.+)$/ && do {
$cola = $1;
last;
};
/^shipping charges (.+)$/ && do {
$ship = $1;
last;
};
}
}
printf $fmt, $rel, $type, $colb, $cola, $ship;
}
}
Output:
Product Release product Type color basic color all
+ overseas shipping
Release Date2/2/2019 analog white white,black,silve
+r none
Release Date2/2/2020 none N/A N/A
+ none
Release Date2/2/2021 digital black black,silver
+ none
Release Date2/2/2022 digital white white
+ yes
Notes:
-
Use a lexical filehandle in as small a scope as possible; that's my $fh in the code above.
Using a package variable (e.g. INPUT, OUTPUT, etc.) may work fine in a short script like you posted;
however, that looks to me like a proof-of-concept script and, when put into more substantial code,
there could easily be conflicts with other variables using the same name.
Also note that Perl will close $fh for you when it goes out of scope.
-
Use the 3-argument form of open.
This is the preferred method and there are a number of reasons for doing this.
See the linked documentation for more about that;
you may also find perlopentut useful.
-
Hand-crafting I/O error messages is tedious and error-prone.
Your message actually has a problem in that it doesn't report the reason for the failure
(e.g. file doesn't exist; you don't have read permission; something else).
Using the autodie pragma is much easier:
it saves you work and gives you better feedback if something goes wrong.
-
The 'local $/ = ''; line specifies reading in paragraph mode.
See "perlvar: $/" for more about that.
-
The 'for ($item) {...}' code is effectively a switch/case construct.
I find this easy to write, read and, where necessary, modify.
There are other methods for achieving this:
see "perlsyn: Basic BLOCKs";
be wary of those flagged as experimental and I'd strongly advise that you don't use those in production code.
-
When you want to lay things out in columns, printf can make this very easy
(sprintf has substantial documentation on the formats available).
| [reply] [d/l] [select] |
|
Thanks Ken,
Appreciate you putting time to write code and putting relevant notes to explain in detail.
I will embrace the 3-argument form of open and other recommendations.
| [reply] |
Re: process multiline text and print in desired format
by Fletch (Bishop) on Mar 17, 2021 at 14:39 UTC
|
You only ever do anything (print the line) for the "header" line of each of your blocks. Two approaches that come to mind:
- Keep reading the file line by line, but implement a small state machine. Start off looking for the release line; once you've seen that look for column values (stashing them off in a hash) until you encounter a blank line. When you get the blank line your "record" is complete, so dump out the output line information you've accumulated (clearing out the state back to a default) and start looking for the next release.
- Since your sample data looks amenable to it, use $/ to read in paragraph mode (see the entry for it in perlvar) and then parse things out from that record text.
The cake is a lie.
The cake is a lie.
The cake is a lie.
| [reply] [d/l] |
Re: process multiline text and print in desired format
by davido (Cardinal) on Mar 17, 2021 at 17:28 UTC
|
The code is only going to do as much as you program it to do. You're not handling all the other fields at all.
I would treat each "\n\n" as a record separator so that you can easily distinguish between records. Alternatively any row that starts with "Release Date" could begin a new record. But your sample input makes it seem that double-newline is sufficient.
#!/usr/bin/env perl
use strict;
use warnings;
local $/ = "\n\n"; # Set the input record separator to double-newline
+so that each
# grouping can be treated as a single record.
output('Product Release', 'product Type', 'color basic', 'color all',
+'overseas shipping');
while (my $record = <DATA>) {
my ($date, $product, $color_basic, $color_all, $overseas, $shippin
+g_charges, $warranty, $status);
if ($record =~ m{^Release\s+Date(\d+/\d+/\d+)\n\s*product\s+(.+)$}
+m) {
($date, $product) = ($1, $2);
$color_basic = $record =~ m/^\s*color\s+basic\s+(.+)$/m
+ ? $1 : 'N/A';
$color_all = $record =~ m/^\s*color\s+all\s+(.+)$/m
+ ? $1 : 'N/A';
$overseas = $record =~ m/^\s*overseas\s+shipping\s+(
+.+)$/m ? $1 : 'N/A';
$shipping_charges = $record =~ m/^\s*shipping\s+charges\s+(.
++)$/m ? $1 : 'N/A';
$warranty = $record =~ m/^\s*warranty\s+(.+)$/m
+ ? $1 : 'N/A';
$status = $record =~ m/^\s*(.+)\Z(?!.)/m
+ ? $1 : 'No status';
output("Release Date$date", $product, $color_basic, $color_all
+, $overseas);
}
else {
warn "Bad record: $record\n";
}
}
sub output {
my @fields = @_;
printf "%-24s%-32s%-24s%-24s%-24s\n", @fields;
}
__DATA__
Release Date2/2/2019
product clock1(analog)
color basic white
color all white,black,silver
warranty 1 year
not sold yet
Release Date2/2/2020
product none
Release Date2/2/2021
product clock1(digital)
color basic black
color all black,silver
warranty 1 year
not sold yet
Release Date2/2/2022
product clock2(digital)
color basic white
color all white
overseas shipping yes
shipping charges yes
warranty 1 year
not sold yet
This produces:
Product Release product Type color basic
+ color all overseas shipping
Release Date2/2/2019 clock1(analog) white
+ white,black,silver N/A
Release Date2/2/2020 none N/A
+ N/A N/A
Release Date2/2/2021 clock1(digital) black
+ black,silver N/A
Release Date2/2/2022 clock2(digital) white
+ white yes
Your sample output doesn't do anything with shipping charges, and with warranty, or with the inventory status. So although this sample captures them into variables, it's not getting printed in output.
| [reply] [d/l] [select] |
|
Thanks Dave for your support.
Yes, i was aware my script does nothing besides printing the headers and only Release dates.
My logic to handle all other fields flawed & there were too many errors to post , rather thought would be easier if i pointed where i was stuck. i should have explained better.
I will go through your script and correct my mistakes.
| [reply] |
Re: process multiline text and print in desired format
by tybalt89 (Monsignor) on Mar 17, 2021 at 17:25 UTC
|
#!/usr/bin/perl
use strict; # https://perlmonks.org/?node_id=11129830
use warnings;
my $format = "%-25s%-15s%-14s%-24s%s\n";
printf $format, split / +/, 'Product Release product Type c
+olor basic color all overseas shipping';
local ($/, @ARGV) = ('', 'input.txt');
while( <> )
{
printf $format,
/(Release Date\S+)/ ? "$1" : 'N/A',
/(?|product clock\d\((\w+)\)|product (\w+))/ ? "$1" : 'N/A',
/color basic (\S+)/ ? "$1" : 'N/A',
/color all (\S+)/ ? "$1" : 'N/A',
/overseas shipping (\w+)/ ? "$1" : 'N/A';
}
Outputs:
Product Release product Type color basic color all
+ overseas shipping
Release Date2/2/2019 analog white white,black,silv
+er N/A
Release Date2/2/2020 none N/A N/A
+ N/A
Release Date2/2/2021 digital black black,silver
+ N/A
Release Date2/2/2022 digital white white
+ yes
| [reply] [d/l] [select] |
|
Thanks for putting time to write the code.
would you please advise on the pattern match ?
input file has leading spaces for all lines except "Release Date.xxxxx"
your suggestion seems to match without the regex for whitespaces.
/color basic
/color all
/overseas shipping
| [reply] |
|
| [reply] |
Re: process multiline text and print in desired format
by LanX (Saint) on Mar 17, 2021 at 14:42 UTC
|
| [reply] |
Re: process multiline text and print in desired format
by jcb (Parson) on Mar 17, 2021 at 23:33 UTC
|
#!/usr/bin/perl
use strict;
use warnings;
our %rec; # formats need global variables, not mere lexicals
format STDOUT_TOP =
Product Product basic all overse
+as
Release Type color colors shippi
+ng
----------------------------------------------------------------------
+--------
.
;
format STDOUT =
^>>>>>>>>>> ^<<<<<<<<<<< ^<<<<<<<<<< ^<<<<<<<<<<<<<<<<<< ^<<
+<<<<< ~~
$rec{rel}, $rec{type}, $rec{bascol}, $rec{allcol}, $rec
+{ovship}
.
;
# to read a file instead, use this and change "<DATA>" below to "<IN>
+"
# open IN, '<', $ARGV[0] or die "open $ARGV[0]: $!";
$: .= ','; # split filled lines also on comma
while (<DATA>) {
chomp;
if (m/^Release Date([\d\/]+$)/) {
%rec = (type => 'none',
map { $_ => 'N/A' } qw(bascol allcol ovship));
$rec{rel} = $1;
}
$rec{type} = $1 if m/^\s+product [^(]+\(([^)]+)\)$/;
$rec{bascol} = $1 if m/^\s+color basic (.*)$/;
$rec{allcol} = $1 if m/^\s+color all (.*)$/;
$rec{ovship} = $1 if m/^\s+overseas shipping (.*)$/;
write if m/^$/;
}
write # emit the last record if there was no trailing blank line
__DATA__
Release Date2/2/2019
product clock1(analog)
color basic white
color all white,black,silver
warranty 1 year
not sold yet
Release Date2/2/2020
product none
Release Date2/2/2021
product clock1(digital)
color basic black
color all black,silver
warranty 1 year
not sold yet
Release Date2/2/2022
product clock2(digital)
color basic white
color all white
overseas shipping yes
shipping charges yes
warranty 1 year
not sold yet
See perlform for more information about the format mechanism, although it is somewhat obscure and most suited to simple scripts like this. If this is part of a larger system as some of my fellow monks suspect, this is probably a sub-optimal solution. | [reply] [d/l] |
|
Thanks for the response, yes the input data can be few 1000 lines or more. I just pasted a portion to get an idea on how to approach the problem. Would you suggest other optimal solutions if input lines are more ? the script logic is likely not to change much though.
| [reply] |
|
More input is no problem; just change the script as indicated to read from a file instead and delete the __DATA__ section. The only problems would come if the script logic were to be embedded in a larger script because formats are a very old feature and have some limitations due to being far older than many of the features that support modern Perl programming. Most notably, formats can only access global variables and are themselves in a global namespace.
| [reply] [d/l] |
Re: process multiline text and print in desired format
by karlgoethebier (Abbot) on Mar 17, 2021 at 14:43 UTC
|
| [reply] |
Re: process multiline text and print in desired format
by ak_mmx (Novice) on Mar 18, 2021 at 01:09 UTC
|
It is very encouraging to see the overwhelming response received for the subject.
I will go through all responses and try to write code & come back if any questions.
B/W i have seen an example in the web where one user removes all the leading and trailing space right after reading from the input data file. I tend to believe if that approach is taken one can avoid the "^\s*" repetition for all such match cases.
Is that a good approach in general, does anyone see any hidden problems
ie, AFTER READING to FH something like below
s/^\s+|\s+$//g
To avoid "^\s*"
m/^\s*color
m/^\s*overseas
m/^\s*shipping
| [reply] [d/l] [select] |
|
"... removes all the leading and trailing space ... avoid the "^\s*" repetition ..."
Your posted data had no trailing spaces.
In my example code, I did remove all leading spaces for this very reason: "... map /^\s*(.+)$/ ...".
Subsequent regexes looked like "/^color ..."; not "/^\s*color ...".
Removal of potential trailing spaces may well be a good idea.
I don't know the source of your input, but trailing spaces are often impossible to spot by inspection; e.g.
$ cat > fred
abc
def
$ cat fred
abc
def
$ cat -vet fred
abc$
def $
If you want to do this, you can modify my regex; however, be aware of a subtle gotcha.
You cannot simply tag another \s* on the end; you'll also need to change .+ to .+?.
Compare these examples:
$ perl -E 'my $x = " xyz "; say "|$_|" for map /^\s*(.+)$/, $x'
|xyz |
$ perl -E 'my $x = " xyz "; say "|$_|" for map /^\s*(.+)\s*$/, $x'
|xyz |
$ perl -E 'my $x = " xyz "; say "|$_|" for map /^\s*(.+?)\s*$/, $x'
|xyz|
| [reply] [d/l] [select] |
|
|