Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much
 
PerlMonks  

process multiline text and print in desired format

by ak_mmx (Novice)
on Mar 17, 2021 at 14:19 UTC ( [id://11129830]=perlquestion: print w/replies, xml ) Need Help??

ak_mmx has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I am trying to process a text file and print the output in a desired format. I am able to get a small piece of the code working, however i am finding it difficult to produce the results as expected, did some research and experimented with loading to hash etc. Didnt work as expected , any advice on how to proceed would really help. SCRIPT
#!/usr/bin/perl use strict; use warnings; my $data = 'input.txt'; my $placeholder = ''; unless(open(INPUT, $data)) { die "Can't open input file $data\n"; } # Print heading print "Product Release product Type color basic color all ov +erseas shipping\n"; while(my $line = <INPUT>) { if($line =~ m/Release Date(\d+)\/(\d+)\/(\d+)/) { print "$line"; } } close INPUT;
INPUT FILE (input.txt)
Release Date2/2/2019 product clock1(analog) color basic white color all white,black,silver warranty 1 year not sold yet Release Date2/2/2020 product none Release Date2/2/2021 product clock1(digital) color basic black color all black,silver warranty 1 year not sold yet Release Date2/2/2022 product clock2(digital) color basic white color all white overseas shipping yes shipping charges yes warranty 1 year not sold yet
RESULTS
$ ./generate.pl Product Release product Type color basic color all overseas +shipping Release Date2/2/2019 Release Date2/2/2020 Release Date2/2/2021 Release Date2/2/2022 $
DESIRED RESULTS
Product Release product Type color basic color all + overseas shipping Release Date2/2/2019 analog white white,black,silve +r N/A Release Date2/2/2020 none N/A N/A + N/A Release Date2/2/2021 digital black black,silver + N/A Release Date2/2/2022 digital white white + yes

Replies are listed 'Best First'.
Re: process multiline text and print in desired format
by kcott (Archbishop) on Mar 17, 2021 at 19:32 UTC

    G'day ak_mmx,

    Welcome to the Monastery.

    The following script shows how I'd probably attack this problem. See the notes at the end for some details.

    #!/usr/bin/env perl use strict; use warnings; use autodie; my $data = 'pm_11129830_input.txt'; my $fmt = "%-24s%-15s%-14s%-24s%s\n"; my @headings = ( 'Product Release', 'product Type', 'color basic', 'color all', 'overseas shipping', ); printf $fmt, @headings; { open my $fh, '<', $data; local $/ = ''; while (<$fh>) { my ($rel, $type, $colb, $cola, $ship) = ('', 'none', 'N/A', 'N/A', 'none'); for my $item (map /^\s*(.+)$/, split /\n/) { for ($item) { /^Release Date/ && do { $rel = $item; last; }; /^product [^(]+\(([^()]+)/ && do { $type = $1; last; }; /^color basic (.+)$/ && do { $colb = $1; last; }; /^color all (.+)$/ && do { $cola = $1; last; }; /^shipping charges (.+)$/ && do { $ship = $1; last; }; } } printf $fmt, $rel, $type, $colb, $cola, $ship; } }

    Output:

    Product Release product Type color basic color all + overseas shipping Release Date2/2/2019 analog white white,black,silve +r none Release Date2/2/2020 none N/A N/A + none Release Date2/2/2021 digital black black,silver + none Release Date2/2/2022 digital white white + yes

    Notes:

    • Use a lexical filehandle in as small a scope as possible; that's my $fh in the code above. Using a package variable (e.g. INPUT, OUTPUT, etc.) may work fine in a short script like you posted; however, that looks to me like a proof-of-concept script and, when put into more substantial code, there could easily be conflicts with other variables using the same name. Also note that Perl will close $fh for you when it goes out of scope.
    • Use the 3-argument form of open. This is the preferred method and there are a number of reasons for doing this. See the linked documentation for more about that; you may also find perlopentut useful.
    • Hand-crafting I/O error messages is tedious and error-prone. Your message actually has a problem in that it doesn't report the reason for the failure (e.g. file doesn't exist; you don't have read permission; something else). Using the autodie pragma is much easier: it saves you work and gives you better feedback if something goes wrong.
    • The 'local $/ = ''; line specifies reading in paragraph mode. See "perlvar: $/" for more about that.
    • The 'for ($item) {...}' code is effectively a switch/case construct. I find this easy to write, read and, where necessary, modify. There are other methods for achieving this: see "perlsyn: Basic BLOCKs"; be wary of those flagged as experimental and I'd strongly advise that you don't use those in production code.
    • When you want to lay things out in columns, printf can make this very easy (sprintf has substantial documentation on the formats available).

    — Ken

      Thanks Ken, Appreciate you putting time to write code and putting relevant notes to explain in detail. I will embrace the 3-argument form of open and other recommendations.
Re: process multiline text and print in desired format
by Fletch (Bishop) on Mar 17, 2021 at 14:39 UTC

    You only ever do anything (print the line) for the "header" line of each of your blocks. Two approaches that come to mind:

    • Keep reading the file line by line, but implement a small state machine. Start off looking for the release line; once you've seen that look for column values (stashing them off in a hash) until you encounter a blank line. When you get the blank line your "record" is complete, so dump out the output line information you've accumulated (clearing out the state back to a default) and start looking for the next release.
    • Since your sample data looks amenable to it, use $/ to read in paragraph mode (see the entry for it in perlvar) and then parse things out from that record text.

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.

Re: process multiline text and print in desired format
by davido (Cardinal) on Mar 17, 2021 at 17:28 UTC

    The code is only going to do as much as you program it to do. You're not handling all the other fields at all.

    I would treat each "\n\n" as a record separator so that you can easily distinguish between records. Alternatively any row that starts with "Release Date" could begin a new record. But your sample input makes it seem that double-newline is sufficient.

    #!/usr/bin/env perl use strict; use warnings; local $/ = "\n\n"; # Set the input record separator to double-newline +so that each # grouping can be treated as a single record. output('Product Release', 'product Type', 'color basic', 'color all', +'overseas shipping'); while (my $record = <DATA>) { my ($date, $product, $color_basic, $color_all, $overseas, $shippin +g_charges, $warranty, $status); if ($record =~ m{^Release\s+Date(\d+/\d+/\d+)\n\s*product\s+(.+)$} +m) { ($date, $product) = ($1, $2); $color_basic = $record =~ m/^\s*color\s+basic\s+(.+)$/m + ? $1 : 'N/A'; $color_all = $record =~ m/^\s*color\s+all\s+(.+)$/m + ? $1 : 'N/A'; $overseas = $record =~ m/^\s*overseas\s+shipping\s+( +.+)$/m ? $1 : 'N/A'; $shipping_charges = $record =~ m/^\s*shipping\s+charges\s+(. ++)$/m ? $1 : 'N/A'; $warranty = $record =~ m/^\s*warranty\s+(.+)$/m + ? $1 : 'N/A'; $status = $record =~ m/^\s*(.+)\Z(?!.)/m + ? $1 : 'No status'; output("Release Date$date", $product, $color_basic, $color_all +, $overseas); } else { warn "Bad record: $record\n"; } } sub output { my @fields = @_; printf "%-24s%-32s%-24s%-24s%-24s\n", @fields; } __DATA__ Release Date2/2/2019 product clock1(analog) color basic white color all white,black,silver warranty 1 year not sold yet Release Date2/2/2020 product none Release Date2/2/2021 product clock1(digital) color basic black color all black,silver warranty 1 year not sold yet Release Date2/2/2022 product clock2(digital) color basic white color all white overseas shipping yes shipping charges yes warranty 1 year not sold yet

    This produces:

    Product Release product Type color basic + color all overseas shipping Release Date2/2/2019 clock1(analog) white + white,black,silver N/A Release Date2/2/2020 none N/A + N/A N/A Release Date2/2/2021 clock1(digital) black + black,silver N/A Release Date2/2/2022 clock2(digital) white + white yes

    Your sample output doesn't do anything with shipping charges, and with warranty, or with the inventory status. So although this sample captures them into variables, it's not getting printed in output.


    Dave

      Thanks Dave for your support. Yes, i was aware my script does nothing besides printing the headers and only Release dates. My logic to handle all other fields flawed & there were too many errors to post , rather thought would be easier if i pointed where i was stuck. i should have explained better. I will go through your script and correct my mistakes.
Re: process multiline text and print in desired format
by tybalt89 (Monsignor) on Mar 17, 2021 at 17:25 UTC
    #!/usr/bin/perl use strict; # https://perlmonks.org/?node_id=11129830 use warnings; my $format = "%-25s%-15s%-14s%-24s%s\n"; printf $format, split / +/, 'Product Release product Type c +olor basic color all overseas shipping'; local ($/, @ARGV) = ('', 'input.txt'); while( <> ) { printf $format, /(Release Date\S+)/ ? "$1" : 'N/A', /(?|product clock\d\((\w+)\)|product (\w+))/ ? "$1" : 'N/A', /color basic (\S+)/ ? "$1" : 'N/A', /color all (\S+)/ ? "$1" : 'N/A', /overseas shipping (\w+)/ ? "$1" : 'N/A'; }

    Outputs:

    Product Release product Type color basic color all + overseas shipping Release Date2/2/2019 analog white white,black,silv +er N/A Release Date2/2/2020 none N/A N/A + N/A Release Date2/2/2021 digital black black,silver + N/A Release Date2/2/2022 digital white white + yes
      Thanks for putting time to write the code. would you please advise on the pattern match ? input file has leading spaces for all lines except "Release Date.xxxxx" your suggestion seems to match without the regex for whitespaces. /color basic /color all /overseas shipping

        There is no need to match the whitespace at the beginning of lines. A match can start anywhere.

Re: process multiline text and print in desired format
by LanX (Saint) on Mar 17, 2021 at 14:42 UTC
    Hi welcome to the monastery! :)

    TIMTOWTDI ...

    ... e.g. you could adapt this flip-flop solution to your needs.

    update

    with

    • $start flag set by the release-rexex
    • $end-flag set by the empty line or End-of-file -> eof
    • inside flip-flop-condition parse the field-lines into a hash
    • in end-condition print table-line with hash-elements and reset hash

    HTH :)

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

Re: process multiline text and print in desired format
by jcb (Parson) on Mar 17, 2021 at 23:33 UTC

    This is an example of a problem that formats can solve quite nicely:

    #!/usr/bin/perl use strict; use warnings; our %rec; # formats need global variables, not mere lexicals format STDOUT_TOP = Product Product basic all overse +as Release Type color colors shippi +ng ---------------------------------------------------------------------- +-------- . ; format STDOUT = ^>>>>>>>>>> ^<<<<<<<<<<< ^<<<<<<<<<< ^<<<<<<<<<<<<<<<<<< ^<< +<<<<< ~~ $rec{rel}, $rec{type}, $rec{bascol}, $rec{allcol}, $rec +{ovship} . ; # to read a file instead, use this and change "<DATA>" below to "<IN> +" # open IN, '<', $ARGV[0] or die "open $ARGV[0]: $!"; $: .= ','; # split filled lines also on comma while (<DATA>) { chomp; if (m/^Release Date([\d\/]+$)/) { %rec = (type => 'none', map { $_ => 'N/A' } qw(bascol allcol ovship)); $rec{rel} = $1; } $rec{type} = $1 if m/^\s+product [^(]+\(([^)]+)\)$/; $rec{bascol} = $1 if m/^\s+color basic (.*)$/; $rec{allcol} = $1 if m/^\s+color all (.*)$/; $rec{ovship} = $1 if m/^\s+overseas shipping (.*)$/; write if m/^$/; } write # emit the last record if there was no trailing blank line __DATA__ Release Date2/2/2019 product clock1(analog) color basic white color all white,black,silver warranty 1 year not sold yet Release Date2/2/2020 product none Release Date2/2/2021 product clock1(digital) color basic black color all black,silver warranty 1 year not sold yet Release Date2/2/2022 product clock2(digital) color basic white color all white overseas shipping yes shipping charges yes warranty 1 year not sold yet

    See perlform for more information about the format mechanism, although it is somewhat obscure and most suited to simple scripts like this. If this is part of a larger system as some of my fellow monks suspect, this is probably a sub-optimal solution.

      Thanks for the response, yes the input data can be few 1000 lines or more. I just pasted a portion to get an idea on how to approach the problem. Would you suggest other optimal solutions if input lines are more ? the script logic is likely not to change much though.

        More input is no problem; just change the script as indicated to read from a file instead and delete the __DATA__ section. The only problems would come if the script logic were to be embedded in a larger script because formats are a very old feature and have some limitations due to being far older than many of the features that support modern Perl programming. Most notably, formats can only access global variables and are themselves in a global namespace.

Re: process multiline text and print in desired format
by karlgoethebier (Abbot) on Mar 17, 2021 at 14:43 UTC
Re: process multiline text and print in desired format
by ak_mmx (Novice) on Mar 18, 2021 at 01:09 UTC

    It is very encouraging to see the overwhelming response received for the subject. I will go through all responses and try to write code & come back if any questions.

    B/W i have seen an example in the web where one user removes all the leading and trailing space right after reading from the input data file. I tend to believe if that approach is taken one can avoid the "^\s*" repetition for all such match cases. Is that a good approach in general, does anyone see any hidden problems

    ie, AFTER READING to FH something like below
    s/^\s+|\s+$//g
    To avoid "^\s*"
    m/^\s*color m/^\s*overseas m/^\s*shipping
      "... removes all the leading and trailing space ... avoid the "^\s*" repetition ..."

      Your posted data had no trailing spaces. In my example code, I did remove all leading spaces for this very reason: "... map /^\s*(.+)$/ ...". Subsequent regexes looked like "/^color ..."; not "/^\s*color ...".

      Removal of potential trailing spaces may well be a good idea. I don't know the source of your input, but trailing spaces are often impossible to spot by inspection; e.g.

      $ cat > fred abc def $ cat fred abc def $ cat -vet fred abc$ def $

      If you want to do this, you can modify my regex; however, be aware of a subtle gotcha. You cannot simply tag another \s* on the end; you'll also need to change .+ to .+?. Compare these examples:

      $ perl -E 'my $x = " xyz "; say "|$_|" for map /^\s*(.+)$/, $x' |xyz | $ perl -E 'my $x = " xyz "; say "|$_|" for map /^\s*(.+)\s*$/, $x' |xyz | $ perl -E 'my $x = " xyz "; say "|$_|" for map /^\s*(.+?)\s*$/, $x' |xyz|

      — Ken

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://11129830]
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others lurking in the Monastery: (3)
As of 2024-04-22 00:38 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found