http://www.perlmonks.org?node_id=1199235


in reply to Re^2: Having to manually escape quote character in args to "system"?
in thread Having to manually escape quote character in args to "system"?

Everyone quotes command line arguments the wrong way is quite funny, in a sad way, and it is wrong. As wrong as any other program attempting to quote on Windows. It is a game that you simply can not win.

You explained the basic problem: Arguments are passed to programs as a single string on systems derived from CP/M (i.e. DOS, Windows, OS/2), and programs (or the underlying runtime libraries) decide how to split that single string into arguments (see also Re^3: Perl Rename). Backwards compatibility to ancient DOS and WinNT, including bugs in command.com and cmd.exe, have lead to a ridiculous amount of complex rules for quoting and escaping.

The CommandLineToArgV convention mentioned in "Everyone quotes command line arguments the wrong way" is just that - a convention. All programs are free to use different quoting rules, and at least legacy programs do have different rules. (I did not look up or test, but I would not be surprised if cygwin-based programs would implement very different quoting rules, or even use a cygwin-only way to pass argv[] around, with a command line string only as fallback for non-cygwin programs.)

Pretending that this convention is universal for all programs, and claiming that code that escapes and quotes according to the convention is the only correct solution, would be really funny, if it was posted by a noob in some dusty corner of the internet or our local universal expert. Posting that at microsoft.com is just sad.

Unix has gone a long way, but the authors got argument passing right at the first attempt (i.e. fork() and exec()). And based on that lucky API, they made argument-splitting a problem of the shell, so you can use exactly the same quoting for all invoked programs. Over time, the shells got rid of most argument-splitting and argument-passing problems. That made quoting rules on Unix quite simple (but still far from being perfect). The best thing is that on Unix, you don't have to invoke the shell at all, so you don't have to quote at all. You pass a list of arguments to exec(), and main() will get exactly that list in argv[].

Alexander

--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

Replies are listed 'Best First'.
Re^4: Having to manually escape quote character in args to "system"?
by afoken (Chancellor) on Feb 12, 2023 at 18:17 UTC
    The CommandLineToArgV convention mentioned in "Everyone quotes command line arguments the wrong way" is just that - a convention. All programs are free to use different quoting rules, and at least legacy programs do have different rules.

    It seems I was way too optimistic. Not only legacy programs, but also modern programs don't follow the CommandLineToArgV convention. I stumbled upon an article by Chris Wellons, The wild west of Windows command line parsing, from 2022. He ran down the rabbit hole, in an attempt to get rid of the standard libc, and it is even worse than I thought. Yes, there is an API function for splitting the command line string, called CommandLineToArgvW(), which needs to be called with the command line in "wide" (UCS-2) format from GetCommandLineW(). But that API function is burried in shell32.dll, which you might want to avoid linking in. And so:

    Many runtimes, including Microsoft’s own CRTs, don’t call CommandLineToArgvW and instead do their own parsing. It’s messier than I expected, and when I started digging into it I wasn’t expecting it to involve a few days of research.

    The GetCommandLineW has a rough explanation: split arguments on whitespace (not defined), quoting is involved, and there’s something about counting backslashes, but only if they stop on a quote. It’s not quite enough to implement your own, and if you test against it, it’s quickly apparent that this documentation is at best incomplete. It links to a deprecated page about parsing C++ command line arguments with a few more details. Unfortunately the algorithm described on this page is not the algorithm used by GetCommandLineW, nor is it used by any runtime I could find. It even varies between Microsoft’s own CRTs. There is no canonical command line parsing result, not even a de facto standard.

    I eventually came across David Deley’s How Command Line Parameters Are Parsed, which is the closest there is to an authoritative document on the matter (also). Unfortunately it focuses on runtimes rather than CommandLineToArgvW, and so some of those details aren’t captured. In particular, the first argument (i.e. argv[0]) follows entirely different rules, which really confused me for while. The Wine documentation was helpful particularly for CommandLineToArgvW. As far as I can tell, they’ve re-implemented it perfectly, matching it bug-for-bug as they do.

    (Emphasis mine)

    Chris Wellons also compares to other implementations:

    I also peeked at some language runtimes to see how others handle it. Just as expected, Mingw-w64 has the behavior of an old (pre-2008) Microsoft CRT. Also expected, CPython implicitly does whatever the underlying C runtime does, so its exact command line behavior depends on which version of Visual Studio was used to build the Python binary. OpenJDK pragmatically calls CommandLineToArgvW. Go (gc) does its own parsing, with behavior mixed between CommandLineToArgvW and some of Microsoft’s CRTs, but not quite matching either.

    And he also researched and implemented the inverse function, creating a command line from an array of strings for which CommandLineToArgvW() returns the same array of strings. Surprise: There is none.

    I’ve always been boggled as to why there’s no complementary inverse to CommandLineToArgvW. When spawning processes with arbitrary arguments, everyone is left to implement the inverse of this under-specified and non-trivial command line format to serialize an argv. Hopefully the receiver parses it compatibly! There’s no falling back on a system routine to help out. This has lead to a lot of repeated effort: it’s not limited to high level runtimes, but almost any extensible application (itself a kind of runtime). Fortunately serializing is not quite as complex as parsing since many of the edge cases simply don’t come up if done in a straightforward way.

    He searched for other implementations:

    How do others handle this?

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

      David Deley’s How Command Line Parameters Are Parsed is not only well written, it also shows how messy the entire DOS/Windows command line handling is.

      Printed to a DIN A4 PDF document, this fills 31 pages. The entire command line parameters on Unix are explained on a single page, plus a heading on the previous page, plus an extra line on the following page, plus four footnotes (19, 1, 17, 18). And that including three examples. Let's say one and a half page. Another page is used for the table of contents, and the final page just contains a copyright and updates. The remaining 27.5 pages explain what a stinking mess Windows command line parsing is, and how to work around the different parsing rules.

      Alexander

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)