PerlMonks  

Parsing a line of text items

by mikkoi (Beadle)
on Mar 30, 2021 at 10:57 UTC (#11130582)

mikkoi has asked for the wisdom of the Perl Monks concerning the following question:

I need to parse the arguments of a custom command.

The arguments are separated by whitespace except when they have quotation marks around them.

Example:

23 45.67 "John Marcus" Surname

After parsing, the result should be [23, 45.67, 'John Marcus', 'Surname']. Would it be better to use a lexer or a regex? If a lexer, are there any good modules on CPAN?

Replies are listed 'Best First'.
Re: Parsing a line of text items
by philipbailey (Deacon) on Mar 30, 2021 at 12:23 UTC
    I often use Text::ParseWords for this problem. It has the advantage of being a core module.
        use strict;
        use warnings;
        use feature "say";
        use Text::ParseWords;

        my $args = '23 45.67 "John Marcus" Surname';
        my @parsed = parse_line('\s+', 0, $args);
        say for @parsed;

    Output:

        23
        45.67
        John Marcus
        Surname
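
A minimal sketch of two details the snippet above does not show, assuming the same kind of input. The second argument of parse_line (the "keep" flag) controls whether the quotes are retained, and shellwords, which Text::ParseWords also exports, trims leading whitespace before splitting on runs of whitespace. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line shellwords);

# keep = 1: the quotes (and any backslashes) stay in the returned fields.
my @kept = parse_line('\s+', 1, '23 45.67 "John Marcus" Surname');

# shellwords trims leading whitespace first, splits on runs of whitespace,
# and drops the undefined field that trailing whitespace would otherwise leave.
my @messy = shellwords(qq{  23   45.67\t"John Marcus"   Surname  });
```

Here @kept still contains '"John Marcus"' with its quotes, while @messy holds the four bare items despite the irregular spacing.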

      Thanks. I had no idea Text-ParseWords existed. This is the ideal solution. And it is in the core!

      I also tested Text-CSV and, while good, it left some problems, especially with multiple whitespace characters between words.

        I had no idea Text-ParseWords existed

        Likewise...
        Perhaps more time is needed studying the list of core modules

      Oh Text::ParseWords is pretty cool, thanks for sharing. :)

      (So many core modules which need more attention)

      > It has the advantage of being a core module.

      Indeed.

      C:\Strawberry\perl\bin>corelist Text::ParseWords

      Data for 2021-01-23
      Text::ParseWords was first released with perl 5

      Though it exports a lot by default:

      our @EXPORT = qw(shellwords quotewords nested_quotewords parse_line);

      And you can tell the documentation is old; it could use more examples.
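
A small sketch of how to sidestep the default exports, assuming the standard Exporter behaviour: naming the functions you want imports only those, and everything else remains callable by its fully qualified name. The input strings are made up for illustration.

```perl
use strict;
use warnings;

# Import only parse_line; the other default exports stay out of this namespace.
use Text::ParseWords qw(parse_line);

my @words = parse_line('\s+', 0, 'alpha "beta gamma" delta');

# Functions that were not imported are still reachable with the package prefix.
# nested_quotewords returns one array reference per input line.
my @rows = Text::ParseWords::nested_quotewords('\s+', 0, 'a "b c"', 'd e');
```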

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

Re: Parsing a line of text items
by hippo (Chancellor) on Mar 30, 2021 at 11:25 UTC

    A Text::CSV solution:

        use strict;
        use warnings;
        use Text::CSV;
        use Test::More tests => 2;

        my $in   = '23 45.67 "John Marcus" Surname';
        my $want = [23, 45.67, 'John Marcus', 'Surname'];
        my $csv  = Text::CSV->new ({sep_char => ' '});
        ok $csv->parse ($in), 'Parsing';
        is_deeply [$csv->fields], $want, 'Fields match';

    You will probably want to extend the tests to better reflect your real-world requirements.


    🦛

Re: Parsing a line of text items
by choroba (Archbishop) on Mar 30, 2021 at 12:11 UTC
    Use glob. But make sure the input doesn't contain *, ?, or {}.
        #!/usr/bin/perl
        use warnings;
        use strict;
        use feature qw{ say };

        sub parse_args {
            my ($input) = @_;
            return [glob $input]
        }

        use Test::More tests => 1;
        is_deeply parse_args('23 45.67 "John Marcus" Surname'),
            [23, 45.67, 'John Marcus', 'Surname'];
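
To make the caveat concrete, here is a hedged sketch of a guard around the glob approach. The wrapper name parse_args_checked is hypothetical, and the character class is an assumption about which characters to refuse (wildcards, brackets, braces, tilde, backslash); adjust it to your input.

```perl
use strict;
use warnings;

# Hypothetical wrapper: refuse input containing characters that glob treats
# specially instead of letting them expand.
sub parse_args_checked {
    my ($input) = @_;
    die "glob metacharacter in input: $input\n"
        if $input =~ /[*?\[\]{}~\\]/;
    return [ glob $input ];
}

my $ok = parse_args_checked('23 45.67 "John Marcus" Surname');

# Brace expansion shows what goes wrong without the guard:
# one word in, two items out.
my @expanded = glob 'file{1,2}';
```

Dying on suspicious input is only one option; escaping the characters or switching to Text::ParseWords avoids the problem altogether.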

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re: Parsing a line of text items (updated)
by AnomalousMonk (Bishop) on Mar 30, 2021 at 16:19 UTC

    A Text::CSV (or Text::CSV_XS for speed) solution seems very appropriate, but if you need to roll your own, maybe something like:

        Win8 Strawberry 5.30.3.1 (64) Tue 03/30/2021 11:53:39
        C:\@Work\Perl\monks
        >perl -Mstrict -Mwarnings
        use 5.010;  # needs (?|...) branch reset

        my $rx_dq_body  = qr{ [^\\"]* (?: \\. [^\\"]* )* }xms;
        my $rx_unquoted = qr{ \S+ }xms;

        for my $args (
            '', ' ',
            '23 45.67 "John Marcus O\"Ddly" Surname',
            '"only \"quoted\" thing"',
            'no quoted stuff',
            ) {
            my $got_parsed_args = my @parsed_args =
                $args =~ m{ \G \s* (?| " ($rx_dq_body) " | ($rx_unquoted)) }xmsg;

            print ">$args< -> ";
            if ($got_parsed_args) {
                printf "%s \n", join ' ', map ">$_<", @parsed_args;
                }
            else {
                print "nada \n";
                }
            }
        ^Z
        >< -> nada
        > < -> nada
        >23 45.67 "John Marcus O\"Ddly" Surname< -> >23< >45.67< >John Marcus O\"Ddly< >Surname<
        >"only \"quoted\" thing"< -> >only \"quoted\" thing<
        >no quoted stuff< -> >no< >quoted< >stuff<

    This needs Perl version 5.10+ for the (?|...) "branch reset" operator, but modification for pre-5.10 Perls is simple; let me know if you need it. The $rx_dq_body regex to match a double-quoted body supports embedded escaped double-quotes (and any other escaped character). You can play with this regex to get exactly what you want/need.

    Of course, lots of tests should be done to verify this (or any other solution) really does what you want.

    Update: For some reason, I included a \G \s* group in the regex above. It is entirely unnecessary although it does no harm AFAICT. The match regex
        m{ (?| " ($rx_dq_body) " | ($rx_unquoted)) }xmsg
    should be exactly equivalent.
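
As a sketch of one way to "play with" the regex, the variant below wraps it in a sub, uses two plain capture groups instead of the branch reset (so it does not depend on 5.10), and removes the backslashes from quoted items. The sub name parse_args and the unescaping step are assumptions added for illustration, not part of the code above.

```perl
use strict;
use warnings;

my $rx_dq_body  = qr{ [^\\"]* (?: \\. [^\\"]* )* }xms;
my $rx_unquoted = qr{ \S+ }xms;

# Hypothetical helper: returns an array reference of parsed arguments.
sub parse_args {
    my ($line) = @_;
    my @args;
    while ($line =~ m{ \G \s* (?: " ($rx_dq_body) " | ($rx_unquoted) ) }xmsg) {
        # Copy the captures before running another regex.
        my ($quoted, $bare) = ($1, $2);
        if (defined $quoted) {
            $quoted =~ s{ \\ (.) }{$1}xmsg;    # \" becomes ", \\ becomes \
            push @args, $quoted;
        }
        else {
            push @args, $bare;
        }
    }
    return \@args;
}
```

Called as parse_args('23 45.67 "John Marcus O\"Ddly" Surname'), this yields the four items with the third one unescaped to John Marcus O"Ddly.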


    Give a man a fish:  <%-{-{-{-<

      I can understand the challenge to hack it by yourself ... :)

      But I think the suggested Text::ParseWords is core and offers everything I expect from parsing a command line.

      It also has tests, is customizable, and the source is well structured and documented.

      So if I "wanna roll my own" and need to make special adjustments (like e.g. paired {quotes} ) I can take the code as a base.

        DB<94> use Text::ParseWords qw/shellwords/

        DB<96> x shellwords(q{this is 'an example' "with different quoting and \" escaping" including\ escaped\ whitespace})
        0  'this'
        1  'is'
        2  'an example'
        3  'with different quoting and " escaping'
        4  'including escaped whitespace'
        DB<97>

      In case larger files need to be parsed I'll consider a dependency on Text::CSV, but this really looks good.


        I would tend to agree that an approach using a reliable, common module like Text::ParseWords (of which I had not previously been aware -- thanks, philipbailey++) or Text::CSV is usually best. But I wanted to give an example of a "pure" regex approach.

        As an aside, I think it's worth emphasizing again that whatever approach is taken, a thorough suite of tests for the final code is advisable even if the approach is based on well-tested modules.
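
A table-driven layout is one way to grow such a suite. The sketch below uses Text::ParseWords::shellwords as a stand-in for whichever parser is chosen; the case names and inputs are invented for illustration and should be replaced with real-world requirements.

```perl
use strict;
use warnings;
use Test::More;
use Text::ParseWords qw(shellwords);

# Each case: [ description, input line, expected list of items ]
my @cases = (
    [ 'plain words',         'a b c',                           [qw(a b c)] ],
    [ 'quoted item',         '23 45.67 "John Marcus" Surname',  ['23', '45.67', 'John Marcus', 'Surname'] ],
    [ 'repeated whitespace', "a   \t b",                        [qw(a b)] ],
    [ 'empty string',        '',                                [] ],
    [ 'escaped quote',       '"say \"hi\"" now',                ['say "hi"', 'now'] ],
);

for my $case (@cases) {
    my ($name, $input, $want) = @$case;
    is_deeply [ shellwords($input) ], $want, $name;
}

done_testing;
```

Adding a new requirement then means adding one line to @cases rather than another block of test code.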



Re: Parsing a line of text items
by LanX (Cardinal) on Mar 30, 2021 at 11:21 UTC
    update

    Scratch it, this doesn't work. It could, but it takes too much effort to figure out.

    Better use the Text::CSV approach


    maybe

        DB<45> p $_
        23 45.67 "John Marcus" Surname 23 45.67 "John Marcus" Surname
        DB<46> say $2 while /(?:^|("|\s+))(.*?)\1/g
        45.67
        John Marcus
        Surname
        45.67
        John Marcus
        Surname
        DB<47>

    Here be dragons, no guarantee whatsoever.

    edit

    as expected, it only works if it ends with a whitespace, and I had problems using (?:$|\1) at the end.


      Just for fun:

      This seems to work if the input is surrounded by exactly one whitespace character on each side, but don't try to escape double quotes.

      • $1 is the whitespace
      • $2 the optional doublequote
      • $3 the enclosed text

        DB<89> p "'$_'"
        ' 23 45.67 "John Marcus" Surname 23 45.67 "John Marcus" Surname extra '
        DB<90> say "'$3'" while /\s*(\s)("?)(.*?)\2(?=\1)/g
        '23'
        '45.67'
        'John Marcus'
        'Surname'
        '23'
        '45.67'
        'John Marcus'
        'Surname'
        'extra'
        DB<91>

      For testing, I'd suggest automatically generating random input strings. That way you can cover a large set of cases.
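
One possible shape for such a random test is a round trip: build a random list of items, render it as an input line, parse it, and compare. In this sketch all sub names are hypothetical, shellwords from Text::ParseWords stands in for the parser under test, and the generator only produces letters, digits and single inner spaces, so escaped quotes are not exercised.

```perl
use strict;
use warnings;
use Text::ParseWords qw(shellwords);

my @chars = ('a' .. 'z', 0 .. 9);

# One random item: 1 to 5 characters, sometimes followed by " x" (an inner space).
sub random_item {
    my $item = join '', map { $chars[ int(rand(@chars)) ] } 1 .. 1 + int(rand(5));
    $item .= ' ' . $chars[ int(rand(@chars)) ] if rand() < 0.3;
    return $item;
}

# Render items as one line: quote items containing a space and
# put one to three spaces in front of each item.
sub render_line {
    my @items = @_;
    my $line  = '';
    for my $item (@items) {
        my $sep = ' ' x (1 + int(rand(3)));
        $line .= $sep . ($item =~ / / ? qq{"$item"} : $item);
    }
    return $line;
}

# Returns the number of random lines that did not parse back to their items.
sub check_round_trip {
    my ($runs) = @_;
    my $failures = 0;
    for my $run (1 .. $runs) {
        my @items = map { random_item() } 1 .. 1 + int(rand(6));
        my @got   = shellwords(render_line(@items));
        $failures++
            unless @got == @items && join("\0", @got) eq join("\0", @items);
    }
    return $failures;
}
```

A call such as check_round_trip(1000) should return 0; printing the offending line on failure would make debugging easier.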

      NB: there are still dragons here.


Node Type: perlquestion [id://11130582]
Approved by Corion
Front-paged by davies