Unconsistent command-line parsing in case of UTF-8 quoted arguments

Unconsistent command-line parsing in case of UTF-8 quoted arguments

Jérôme Froissart
Hello,

While discussing a pull request on another project [1], I think
billziss-gh found an oddity in the way Cygwin parses command-line
arguments when non-ASCII characters come into play.

EXPECTED BEHAVIOUR:
cygwin should parse the following command line
    binary.exe --non-ascii "charaçtérs" --ascii "nothing-fancy-here"
as
    argv = ["binary.exe",
            "--non-ascii",
            "chara\xXX\xXXt\xXX\xXXrs",
            "--ascii",
            "nothing-fancy-here"]
    // \xXX\xXX being the UTF-8 encoding of the special characters,
    // but this does not really matter here
before calling main()

ACTUAL BEHAVIOUR:
it parses it as
    argv = ["binary.exe",
            "--non-ascii",
            "\"chara\xXX\xXXt\xXX\xXXrs\"", // mind the unstripped quotes here...
            "--ascii",
            "nothing-fancy-here" // ...but not here
    ]

It looks like quoted words containing non-ASCII (UTF-8) characters do not
have their surrounding quotes stripped, unlike ASCII words.

More examples and a better description are available at [1] (thanks to
billziss-gh for his analysis, which is much more thorough than mine).
For the record, we wrote a workaround in our specific program, but
handling this issue in Cygwin might be a better way to solve it.

[1]: https://github.com/billziss-gh/sshfs-win/pull/208 (Checking for
quotes around non-ascii usernames passed by Windows)

Thanks for your help! If you do not have time, please tell me
where to look, and I might try to fix it myself and send a patch
proposal if that is easy enough (I have never read Cygwin's code).
Jérôme
--
Problem reports:      https://cygwin.com/problems.html
FAQ:                  https://cygwin.com/faq/
Documentation:        https://cygwin.com/docs.html
Unsubscribe info:     https://cygwin.com/ml/#unsubscribe-simple

Re: Unconsistent command-line parsing in case of UTF-8 quoted arguments

Cygwin list mailing list
On Fri, 2 Oct 2020 at 15:41, Jérôme Froissart wrote:
>
> By discussing a merge request on another project [1], I think
> billziss-gh found a weirdness in the way Cygwin parses the command
> line arguments when non-ASCII characters come into play.
>
> EXPECTED BEHAVIOUR:
> cygwin should parse the following command line
>     binary.exe --non-ascii "charaçtérs" --ascii "nothing-fancy-here"

Please show us the output from "uname -a" and "locale" run from the bash prompt.

Tell us more about "binary.exe". Is it compiled for Cygwin with gcc,
for Windows with mingw64, or for Windows with a native toolchain? Also, are
you running it from a bash prompt or a cmd.exe prompt?

--
Doug Henderson, Calgary, Alberta, Canada - from gmail.com

Re: Unconsistent command-line parsing in case of UTF-8 quoted arguments

Cygwin list mailing list
In reply to this post by Jérôme Froissart
Greetings, Jérôme Froissart!

> By discussing a merge request on another project [1], I think
> billziss-gh found a weirdness in the way Cygwin parses the command
> line arguments when non-ASCII characters come into play.

> EXPECTED BEHAVIOUR:
> cygwin should parse the following command line
>     binary.exe --non-ascii "charaçtérs" --ascii "nothing-fancy-here"
> as
>     argv = ["binary.exe",
>             "--non-ascii",
>             "chara\xXX\xXXt\xXX\xXXrs",
>             "--ascii",
>             "nothing-fancy-here"]
>     // \xXX\xXX being the UTF-8 encoding of the special characters,
>     // but this does not really matter here
> before calling main()

> ACTUAL BEHAVIOUR:
> it parses it as
>     argv = ["binary.exe",
>             "--non-ascii",
>             "\"chara\xXX\xXXt\xXX\xXXrs\"", // mind the unstripped quotes here...
>             "--ascii",
>             "nothing-fancy-here" // ...but not here
>     ]

> It looks that words containing UTF-8 characters are not properly
> stripped when they are surrounded by quotes, unlinke ASCII words.

> More examples and a better description is available at [1] (thanks to
> billziss-gh for his analysis, much more thorough than mine)
> For the record, we wrote a work-around in our specific program, but
> handling this issue in Cygwin might be a better way to solve it.

> [1]: https://github.com/billziss-gh/sshfs-win/pull/208 (Checking for
> quotes around non-ascii usernames passed by Windows)

> Thanks for your help! In case you didn't have time, please tell me
> where to look at, and I might try to fix it myself and send a patch
> proposal if that is easy enough (I have never read Cygwin's code yet).

This looks as if the Cygwin command was launched from a non-Cygwin terminal, or
from a terminal whose locale was not set to a Unicode (UTF-8) locale.

Please provide the output of the "locale" command, run right before your test
binary.


--
With best regards,
Andrey Repin
Sunday, October 4, 2020 14:16:17

Sorry for my terrible english...

Re: Unconsistent command-line parsing in case of UTF-8 quoted arguments

Jérôme Froissart
Thanks for your replies.
This issue only happens when a program is run from cmd.exe, not from a
Cygwin bash shell.
This is important for me, since I discovered this bug in a project
that must be run from Windows graphical shell (i.e. there is no
sensible way to run it through Cygwin and Bash).

> Please show us the output from "uname -a" and "locale" run from the bash prompt.

> Please provide the results of "locale" command right before running your test
> binary.
Here are more detailed steps to reproduce the issue (along with
answers to your requests about `uname`, `locale`, etc.).
(I mostly reproduced what billziss-gh had done before, so I do not take
all the credit :D)

Here is an example C file
    $ cat example.c
    #include <stdio.h>

    const char *GetCommandLineA(void);

    int main(int argc, char *argv[])
    {
        const char *s = GetCommandLineA();
        printf("C=%s\n", s);

        for (int i = 0; argc > i; i++)
            printf("%d=%s\n", i, argv[i]);

        return 0;
    }

I have built it with gcc from Cygwin
    $ gcc -o binary example.c

Running it from the same Cygwin bash prompt works as expected
    $ uname -a
    CYGWIN_NT-10.0 XPS 3.1.5(0.340/5/3) 2020-06-01 08:59 x86_64 Cygwin
    # (XPS is my Windows machine name)

    $ locale
    LANG=fr_FR.UTF-8
    LC_CTYPE="fr_FR.UTF-8"
    LC_NUMERIC="fr_FR.UTF-8"
    LC_TIME="fr_FR.UTF-8"
    LC_COLLATE="fr_FR.UTF-8"
    LC_MONETARY="fr_FR.UTF-8"
    LC_MESSAGES="fr_FR.UTF-8"
    LC_ALL=

    $ which gcc
    /usr/bin/gcc

    # The following runs as expected
    $ ./binary.exe "foo bar" "Jérôme"
    C="C:\Users\Public\binary.exe"
    0=./binary
    1=foo bar
    2=Jérôme

Now, let's start a Windows shell (cmd.exe)
Note that I had to copy cygwin1.dll from my Cygwin installation
directory, otherwise binary.exe would not start.
I do not know whether there is a `locale` equivalent in Windows
command prompt, so I merely ran my program.
    C:\Users\Public>binary.exe "foo bar" "Jérôme"
    C=binary.exe  "foo bar" "J□r□me"
    0=binary
    1=foo bar
    2="Jérôme"

This behaviour is not expected and is quite inconsistent with what
happened through Bash.
Besides the "strange squares" that appear on the first line, and the
extra space after binary.exe, I especially did not expect "Jérôme" to
remain quoted as a second argument.

Sorry for the delay in my answer. I hope this is now clear, please ask
me for more examples or investigation if you need.
Thanks for your help.

Jérôme

Re: Unconsistent command-line parsing in case of UTF-8 quoted arguments

Cygwin list mailing list
Greetings, Jérôme Froissart!

> Now, let's start a Windows shell (cmd.exe)

That explains it.

> Note that I had to copy cygwin1.dll from my Cygwin installation
> directory, otherwise binary.exe would not start.
> I do not know whether there is a `locale` equivalent in Windows

We specifically asked you to run Cygwin's /bin/locale.exe tool.

> command prompt, so I merely ran my program.
>     C:\Users\Public>binary.exe "foo bar" "Jérôme"
>     C=binary.exe  "foo bar" "J□r□me"
>     0=binary
>     1=foo bar
>     2="Jérôme"

> This behaviour is not expected and is quite inconsistent with what
> happened through Bash.
> Besides the "strange squares" that appear on the first line, and the

1. Run CMD in a more capable terminal: either Windows Terminal 1.0, or select a
TrueType font for your console.

> extra space after binary.exe, I especially did not expect "Jérôme" to
> remain quoted as a second argument.

2. Then you are parsing the command line wrong. On Windows, it is up to the called
program to parse the command line.
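
To make "the called program parses the command line" concrete, here is a minimal sketch of a quote-aware splitter. It assumes deliberately simplified rules (spaces and tabs separate arguments, double quotes group and are dropped, no backslash escapes), so it is neither Cygwin's startup parser nor the Microsoft C runtime algorithm. The names `split_cmdline` and `free_args` are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Free an argument vector returned by split_cmdline(). */
void free_args(char **argv)
{
    if (!argv)
        return;
    for (char **p = argv; *p; p++)
        free(*p);
    free(argv);
}

/* Split `cmdline` into arguments using simplified rules (an assumption for
 * illustration only):
 *   - spaces and tabs outside quotes separate arguments;
 *   - a double quote toggles "inside quotes" and is never copied;
 *   - every other byte is copied unchanged, so UTF-8 sequences survive.
 * Returns a malloc'ed, NULL-terminated array and stores the count in
 * *argc_out. Returns NULL on allocation failure. */
char **split_cmdline(const char *cmdline, int *argc_out)
{
    size_t len = strlen(cmdline);
    /* At most len/2 + 1 arguments fit in len bytes; one more slot for NULL. */
    char **argv = calloc(len / 2 + 2, sizeof *argv);
    char *buf = malloc(len + 1);      /* scratch space for the current argument */
    int argc = 0;
    size_t n = 0;
    int in_quotes = 0, have_arg = 0;

    if (!argv || !buf) {
        free(argv);
        free(buf);
        return NULL;
    }

    for (const char *p = cmdline; ; p++) {
        char c = *p;
        if (c == '"') {
            in_quotes = !in_quotes;
            have_arg = 1;             /* "" counts as an empty argument */
        } else if (c == '\0' || (!in_quotes && (c == ' ' || c == '\t'))) {
            if (have_arg) {
                argv[argc] = malloc(n + 1);
                if (!argv[argc]) {
                    free_args(argv);
                    free(buf);
                    return NULL;
                }
                memcpy(argv[argc], buf, n);
                argv[argc][n] = '\0';
                argc++;
                n = 0;
                have_arg = 0;
            }
            if (c == '\0')
                break;
        } else {
            buf[n++] = c;
            have_arg = 1;
        }
    }

    free(buf);
    argv[argc] = NULL;
    *argc_out = argc;
    return argv;
}
```

Under these rules the quotes around a non-ASCII argument are removed exactly like those around an ASCII one, because the splitter never inspects bytes other than the quote, space, and tab.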

> Sorry for the delay in my answer. I hope this is now clear, please ask
> me for more examples or investigation if you need.
> Thanks for your help.


--
With best regards,
Andrey Repin
Wednesday, October 7, 2020 1:02:59

Sorry for my terrible english...

Re: Unconsistent command-line parsing in case of UTF-8 quoted arguments

Brian Inglis
In reply to this post by Jérôme Froissart
On 2020-10-06 15:36, Jérôme Froissart wrote:

> Thanks for your replies.
> This issue only happens when a program is run from cmd.exe, not from a
> Cygwin bash shell.
> This is important for me, since I discovered this bug in a project
> that must be run from Windows graphical shell (i.e. there is no
> sensible way to run it through Cygwin and Bash).
>
>> Please show us the output from "uname -a" and "locale" run from the bash prompt.
>
>> Please provide the results of "locale" command right before running your test
>> binary.
> Here are the more detailed steps to reproduce the issue (along with
> answers to your requests about `uname`, `locale`, etc.).
> (I mostly reproduced what billziss-gh had done before, I do not take
> all the credits :D)
>
> Here is an example C file

> I have built it with gcc from Cygwin
>     $ gcc -o binary example.c
>
> Running it from the same Cygwin bash prompt works as expected
>     $ uname -a
>     CYGWIN_NT-10.0 XPS 3.1.5(0.340/5/3) 2020-06-01 08:59 x86_64 Cygwin
>     # (XPS is my Windows machine name)
>
>     $ locale
>     LANG=fr_FR.UTF-8
>     LC_CTYPE="fr_FR.UTF-8"
>     LC_NUMERIC="fr_FR.UTF-8"
>     LC_TIME="fr_FR.UTF-8"
>     LC_COLLATE="fr_FR.UTF-8"
>     LC_MONETARY="fr_FR.UTF-8"
>     LC_MESSAGES="fr_FR.UTF-8"
>     LC_ALL=
>
>     $ which gcc
>     /usr/bin/gcc
>
>     # The following runs as expected
>     $ ./binary.exe "foo bar" "Jérôme"
>     C="C:\Users\Public\binary.exe"
>     0=./binary
>     1=foo bar
>     2=Jérôme
>
> Now, let's start a Windows shell (cmd.exe)
> Note that I had to copy cygwin1.dll from my Cygwin installation
> directory, otherwise binary.exe would not start.
> I do not know whether there is a `locale` equivalent in Windows
> command prompt, so I merely ran my program.
>     C:\Users\Public>binary.exe "foo bar" "Jérôme"
>     C=binary.exe  "foo bar" "J□r□me"
>     0=binary
>     1=foo bar
>     2="Jérôme"
>
> This behaviour is not expected and is quite inconsistent with what
> happened through Bash.
> Besides the "strange squares" that appear on the first line, and the
> extra space after binary.exe, I especially did not expect "Jérôme" to
> remain quoted as a second argument.
>
> Sorry for the delay in my answer. I hope this is now clear, please ask
> me for more examples or investigation if you need.
> Thanks for your help.

Create a new Command Prompt shortcut, or change your current one, to run:

        "%windir%\system32\cmd /u"

"/U Causes the output of internal commands to a pipe or file to be Unicode"

and add "chcp 65001":

        "%windir%\system32\cmd /u /k chcp 65001"

or set

        HKEY_LOCAL_MACHINE\Software\Microsoft\Command Processor\AutoRun

or

        HKEY_CURRENT_USER\Software\Microsoft\Command Processor\AutoRun

to the command

        "@chcp 65001 > nul"

e.g. (the key name contains a space, so it must be quoted)

        > reg add "HKEY_CURRENT_USER\Software\Microsoft\Command Processor" ^
                /v AutoRun /d "@chcp 65001 > nul" /f

--
Take care. Thanks, Brian Inglis, Calgary, Alberta, Canada

This email may be disturbing to some readers as it contains
too much technical detail. Reader discretion is advised.
[Data in binary units and prefixes, physical quantities in SI.]

Re: Unconsistent command-line parsing in case of UTF-8 quoted arguments

Thomas Wolff
In reply to this post by Jérôme Froissart


On 06.10.2020 at 23:36, Jérôme Froissart wrote:

> Thanks for your replies.
> This issue only happens when a program is run from cmd.exe, not from a
> Cygwin bash shell.
> This is important for me, since I discovered this bug in a project
> that must be run from Windows graphical shell (i.e. there is no
> sensible way to run it through Cygwin and Bash).
>
>> Please show us the output from "uname -a" and "locale" run from the bash prompt.
> Running it from the same Cygwin bash prompt works as expected
>      $ uname -a
>      CYGWIN_NT-10.0 XPS 3.1.5(0.340/5/3) 2020-06-01 08:59 x86_64 Cygwin
Please update to Cygwin 3.1.7; there were issues with command-line
quoting before, and I'm not sure whether there has already been a tweak since 3.1.5.

Re: Unconsistent command-line parsing in case of UTF-8 quoted arguments

Jérôme Froissart
In reply to this post by Cygwin list mailing list
Thanks for your reply.

Andrey Repin wrote:
> 1. Run CMD in a more capable terminal. Either M$ Terminal 1.0, or select true
> type font for your console.
I tried Windows Terminal 1.3, but this did not change anything :-(
Besides, I think my cmd.exe was already using True Type fonts (if I
understand the icons from the settings window correctly)

Anyway, I now understand that the terminal I use matters. In my case,
however, I do not intend to run the binary (built with Cygwin) in a
terminal at all.
I am using sshfs-win [2]. It is built with Cygwin, but it is then used
as a standalone executable, without any GUI. It is called by a Windows
component/driver (with a command line that contains quoted UTF-8
arguments), triggered by some clicks and actions in the 'My computer'
window. What could I do so that this program handles the command line
correctly?
> 2. Then you are parsing the command line wrong. In Windows, it is up to called
> program to parse the command line.
Right, but my program starts at `int main(int argc, char *argv[])`,
where the parsing is already handled (by some Cygwin runtime
component?). How could I parse it differently?
And would it even make sense for me to parse it in a custom way? Since,
I suppose, every C program built with Cygwin faces the same issue,
wouldn't we rather want a "universal" change in how the Cygwin runtime
parses command lines?
For the record, this is what I have done in this program [1], but that
feels more like a workaround for some UTF-8-related bug than proper
custom command-line parsing :-S

...or maybe I'm completely mistaken about how Cygwin works, in which case I'd
be happy to be told :-)

[1] https://github.com/billziss-gh/sshfs-win/pull/208
[2] https://github.com/billziss-gh/sshfs-win

Thanks for your help
Jérôme

Re: Unconsistent command-line parsing in case of UTF-8 quoted arguments

Brian Inglis
In reply to this post by Thomas Wolff
On 2020-10-06 23:17, Thomas Wolff wrote:

>
>
> On 06.10.2020 at 23:36, Jérôme Froissart wrote:
>> Thanks for your replies.
>> This issue only happens when a program is run from cmd.exe, not from a
>> Cygwin bash shell.
>> This is important for me, since I discovered this bug in a project
>> that must be run from Windows graphical shell (i.e. there is no
>> sensible way to run it through Cygwin and Bash).
>>
>>> Please show us the output from "uname -a" and "locale" run from the bash prompt.
>> Running it from the same Cygwin bash prompt works as expected
>>      $ uname -a
>>      CYGWIN_NT-10.0 XPS 3.1.5(0.340/5/3) 2020-06-01 08:59 x86_64 Cygwin
> Please update to cygwin 3.1.7; there were issues about command line quoting
> before, I'm not sure whether there was a tweak since 3.1.5 already.

[PATCH] Cygwin: console: Replace WriteConsoleA() with WriteConsoleW():
        https://cygwin.com/pipermail/cygwin-patches/2020q3/010495.html

[PATCH v4 1/3] Cygwin: rewrite and make public cmdline parser:
        https://cygwin.com/pipermail/cygwin-patches/2020q3/010577.html
Issues were raised, with no v5 response so far.

--
Take care. Thanks, Brian Inglis, Calgary, Alberta, Canada

This email may be disturbing to some readers as it contains
too much technical detail. Reader discretion is advised.
[Data in binary units and prefixes, physical quantities in SI.]

Re: Unconsistent command-line parsing in case of UTF-8 quoted arguments

Eliot Moss
I think what we mean is that, under Windows cmd, some things the shell does for you under Linux and
Cygwin will not have been done.  For example, there is "glob" expansion of filenames.  If I write
*.txt under bash, it gets expanded to a space-separated list of names of files that match that
pattern.  This happens _before_ calling my program.  If the program is run from Windows cmd.exe, the
program will receive an argument *.txt, and it will have to do the "globbing" itself.  Etc.
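
As an illustration of a program doing the "globbing" itself, here is a minimal sketch built on the POSIX glob() function. The helper name `expand_pattern` is hypothetical, and the sketch assumes the pattern arrived unexpanded in an argument; whether that actually happens for a given Cygwin binary depends on how it was started.

```c
#include <glob.h>
#include <stddef.h>
#include <stdio.h>

/* Expand `pattern` the way a shell would and write one match per line to
 * `out`. Returns the number of matches, 0 if nothing matched, or -1 on
 * any other glob() error. */
int expand_pattern(const char *pattern, FILE *out)
{
    glob_t g;
    int rc = glob(pattern, 0, NULL, &g);

    if (rc == GLOB_NOMATCH)
        return 0;
    if (rc != 0)
        return -1;

    int count = (int)g.gl_pathc;
    for (size_t i = 0; i < g.gl_pathc; i++)
        fprintf(out, "%s\n", g.gl_pathv[i]);

    globfree(&g);   /* release the path list allocated by glob() */
    return count;
}
```

A program would call `expand_pattern(argv[i], stdout)` for each argument that may still contain wildcards.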

Regards - Eliot Moss

Re: Unconsistent command-line parsing in case of UTF-8 quoted arguments

Brian Inglis
On 2020-10-07 18:59, Eliot Moss wrote:
> I think what we mean is that, under Windows cmd, some things the shell does for
> you under Linux and Cygwin will not have been done.  For example, there is
> "glob" expansion of filenames.  If I write *.txt under bash, it gets expanded to
> a space-separated list of names of files that match that pattern.  This happens
> _before_ calling my program.  If the program is run from Windows cmd.exe, the
> program will receive an argument *.txt, and it will have to do the "globbing"
> itself.  Etc.

That's handled automatically by the Cygwin program startup command line parser
if it is not passed a "Cygwin" command line: that avoids the startup expanding
quoted args that contain wildcards passed from another Cygwin program or shell.

--
Take care. Thanks, Brian Inglis, Calgary, Alberta, Canada

This email may be disturbing to some readers as it contains
too much technical detail. Reader discretion is advised.
[Data in binary units and prefixes, physical quantities in SI.]

Re: Unconsistent command-line parsing in case of UTF-8 quoted arguments

Cygwin list mailing list
In reply to this post by Jérôme Froissart
Greetings, Jérôme Froissart!

> Andrey Repin wrote:
>> 1. Run CMD in a more capable terminal. Either M$ Terminal 1.0, or select true
>> type font for your console.
> I tried Windows Terminal 1.3, but this did not change anything :-(
> Besides, I think my cmd.exe was already using True Type fonts (if I
> understand the icons from the settings window correctly)

> Anyway, I now understand that the terminal I use matters. In my case
> however, I do not intend to run the binary (built with Cygwin) in a
> terminal at all.
> I am using win-sshfs [2]. It is built from Cygwin, but it is then used
> as a standalone executable, without any GUI. It is called by a Windows
> component/driver (with a command line that contains quoted UTF-8
> arguments), invoked by some clicks and actions from the 'My computer'
> window. What could I do so that this program correctly handles the
> command line?

I would at least run it with the LANG environment variable set,
e.g. LANG=ru_RU.UTF-8 in my case.
I further tweak it with
LC_MESSAGES=en_US.UTF-8
LC_NUMERIC=en_US.UTF-8

to make program output more consistent to parse.
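
The way these variables combine can be sketched as follows, assuming the usual POSIX precedence (LC_ALL first, then the category variable, then LANG). The function name `effective_locale` is hypothetical, and the "C" fallback is the POSIX default; a given runtime, Cygwin included, may choose a different default.

```c
#include <stddef.h>
#include <stdlib.h>

/* Return the locale name that applies to one category, given the name of
 * that category's environment variable (for example "LC_MESSAGES").
 * Precedence: LC_ALL, then the category variable, then LANG.
 * Empty values are treated as unset. Falls back to "C". */
const char *effective_locale(const char *category_var)
{
    const char *names[3];
    names[0] = "LC_ALL";
    names[1] = category_var;
    names[2] = "LANG";

    for (size_t i = 0; i < 3; i++) {
        const char *value = getenv(names[i]);
        if (value && *value)
            return value;
    }
    return "C";
}
```

With the settings above, `effective_locale("LC_MESSAGES")` would give en_US.UTF-8 while `effective_locale("LC_CTYPE")` would still give ru_RU.UTF-8, so character handling stays UTF-8 either way.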

>> 2. Then you are parsing the command line wrong. In Windows, it is up to called
>> program to parse the command line.
> Right, but my program starts at `int main(int argc, char *argv[])`,
> where the parsing is already handled (by some Cygwin runtime
> component?). How could I parse it differently?
> And would that even make sense that I parse it in a custom way? Since
> -I suppose- every C program built by Cygwin faces the same issues,
> wouldn't we rather want a "universal" change on how the Cygwin runtime
> parses command lines?
> For the record, this is what I have done in this program [1], but that
> feels more like a work around some UTF-8-related bug than a proper,
> custom command line parsing :-S

> ...or maybe I'm completely mistaken in how Cygwin works, in case I'd
> be happy to be told :-)

> [1] https://github.com/billziss-gh/sshfs-win/pull/208
> [2] https://github.com/billziss-gh/sshfs-win

P.S.

I suggest
ln -fs /proc/cygdrive/c/Windows/System32/chcp.com /usr/local/bin/chcp


--
With best regards,
Andrey Repin
Sunday, October 11, 2020 21:51:57

Sorry for my terrible english...

Re: Unconsistent command-line parsing in case of UTF-8 quoted arguments

Cygwin list mailing list
In reply to this post by Jérôme Froissart
On 2020-10-06 14:36, Jérôme Froissart wrote:

> Here is an example C file
>     $ cat example.c
>     #include <stdio.h>
>
>     const char *GetCommandLineA(void);
>
>     int main(int argc, char *argv[])
>     {
>         const char *s = GetCommandLineA();
>         printf("C=%s\n", s);
>
>         for (int i = 0; argc > i; i++)
>             printf("%d=%s\n", i, argv[i]);
>
>         return 0;
>     }

Your program's comparison seems to be based on the
hypothesis that Cygwin parses the GetCommandLineA() command line.

But this hypothesis is almost certainly wrong.

> Now, let's start a Windows shell (cmd.exe)
> Note that I had to copy cygwin1.dll from my Cygwin installation
> directory, otherwise binary.exe would not start.
> I do not know whether there is a `locale` equivalent in Windows
> command prompt, so I merely ran my program.
>     C:\Users\Public>binary.exe "foo bar" "Jérôme"
>     C=binary.exe  "foo bar" "J□r□me"
>     0=binary
>     1=foo bar
>     2="Jérôme"

The "A" command line from GetCommandLineA has "tofu"
characters: é and ô were not decoded properly.

The é and ô characters we see in the Cygwin-parsed
arguments coming into main could not have been recovered
from these "tofu" replacement characters.

What is actually being parsed must be the WCHAR command line
corresponding to what comes from GetCommandLineW().

It's necessary to show that one to get a more complete understanding.
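
One way to see why the "A" string cannot be the source of argv is to compare byte sequences. The sketch below assumes, purely for illustration, that GetCommandLineA() returned single-byte Windows-1252 text (é as 0xE9, ô as 0xF4); the thread does not state the actual ANSI code page. The helper names are hypothetical, and the validator checks only lead/continuation structure rather than every UTF-8 rule.

```c
#include <stddef.h>

/* Encode a code point below U+0800 as UTF-8.
 * Returns the number of bytes written (1 or 2), or 0 if out of range. */
size_t utf8_encode_small(unsigned cp, unsigned char out[2])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    return 0;
}

/* Structural UTF-8 check: every lead byte must be followed by the right
 * number of continuation bytes (10xxxxxx). Returns 1 if well formed. */
int is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t need;

        if (c < 0x80)
            need = 0;
        else if (c >= 0xC2 && c <= 0xDF)
            need = 1;
        else if (c >= 0xE0 && c <= 0xEF)
            need = 2;
        else if (c >= 0xF0 && c <= 0xF4)
            need = 3;
        else
            return 0;                 /* stray continuation or invalid lead */

        if (len - i - 1 < need)
            return 0;                 /* sequence truncated */
        for (size_t k = 1; k <= need; k++)
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;
        i += need + 1;
    }
    return 1;
}
```

U+00E9 encodes to the two bytes C3 A9, whereas a lone 0xE9 byte followed by "r" is not well-formed UTF-8, which would be consistent with a UTF-8 aware output path showing replacement squares for the "A" string.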


Re: Unconsistent command-line parsing in case of UTF-8 quoted arguments

Brian Inglis
In reply to this post by Jérôme Froissart
On 2020-10-06 15:36, Jérôme Froissart wrote:

> Here are the more detailed steps to reproduce the issue (along with
> answers to your requests about `uname`, `locale`, etc.).
> (I mostly reproduced what billziss-gh had done before, I do not take
> all the credits :D)
>
> Here is an example C file
>     $ cat example.c
>     #include <stdio.h>
>
>     const char *GetCommandLineA(void);
>
>     int main(int argc, char *argv[])
>     {
>         const char *s = GetCommandLineA();
>         printf("C=%s\n", s);
>
>         for (int i = 0; argc > i; i++)
>             printf("%d=%s\n", i, argv[i]);
>
>         return 0;
>     }
>
> I have built it with gcc from Cygwin
>     $ gcc -o binary example.c
>
> Running it from the same Cygwin bash prompt works as expected
>     $ uname -a
>     CYGWIN_NT-10.0 XPS 3.1.5(0.340/5/3) 2020-06-01 08:59 x86_64 Cygwin
>     # (XPS is my Windows machine name)
>
>     $ locale
>     LANG=fr_FR.UTF-8
>     LC_CTYPE="fr_FR.UTF-8"
>     LC_NUMERIC="fr_FR.UTF-8"
>     LC_TIME="fr_FR.UTF-8"
>     LC_COLLATE="fr_FR.UTF-8"
>     LC_MONETARY="fr_FR.UTF-8"
>     LC_MESSAGES="fr_FR.UTF-8"
>     LC_ALL=
>
>     $ which gcc
>     /usr/bin/gcc
>
>     # The following runs as expected
>     $ ./binary.exe "foo bar" "Jérôme"
>     C="C:\Users\Public\binary.exe"
>     0=./binary
>     1=foo bar
>     2=Jérôme
>
> Now, let's start a Windows shell (cmd.exe)
> Note that I had to copy cygwin1.dll from my Cygwin installation
> directory, otherwise binary.exe would not start.
> I do not know whether there is a `locale` equivalent in Windows
> command prompt, so I merely ran my program.
>     C:\Users\Public>binary.exe "foo bar" "Jérôme"
>     C=binary.exe  "foo bar" "J□r□me"
>     0=binary
>     1=foo bar
>     2="Jérôme"
>
> This behaviour is not expected and is quite inconsistent with what
> happened through Bash.
> Besides the "strange squares" that appear on the first line, and the
> extra space after binary.exe, I especially did not expect "Jérôme" to
> remain quoted as a second argument.

Don't call inappropriate Windows functions without understanding the limitations
of Windows and its APIs.
Cygwin args are consistent with what you ran and what we would all expect.
I don't see any Cygwin problems here.

--
Take care. Thanks, Brian Inglis, Calgary, Alberta, Canada

This email may be disturbing to some readers as it contains
too much technical detail. Reader discretion is advised.
[Data in binary units and prefixes, physical quantities in SI.]

Re: Unconsistent command-line parsing in case of UTF-8 quoted arguments

Jérôme Froissart
In reply to this post by Cygwin list mailing list
Thank you everyone, I now have a better understanding of how Windows
and Cygwin work (being rather a Linux guy, I was not really aware of
all of this).

However, there is still a question that is puzzling me. I now
understand _why_ things happen that way, but I am still wondering
whether this is really what we _want_. I mean, keeping the double
quotes around a UTF-8 argument just because the program is not run from
Cygwin's bash sounds like a bug to me, doesn't it? (Yet I definitely
understand the reasons that explain this behaviour.) Since I cannot
run my program from bash, I have to resort to manually trimming the
quotes, which I would have liked to avoid.
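
For readers wondering what such manual trimming can look like, here is a minimal sketch. The helper `strip_surrounding_quotes` is hypothetical and is not the code from the sshfs-win pull request; it only removes one matching pair of double quotes at the very start and end of an argument.

```c
#include <stddef.h>
#include <string.h>

/* If `arg` begins and ends with a double quote, remove that pair in place.
 * Anything else, including a lone quote character, is left untouched.
 * Returns `arg` for convenience. */
char *strip_surrounding_quotes(char *arg)
{
    size_t len = strlen(arg);

    if (len >= 2 && arg[0] == '"' && arg[len - 1] == '"') {
        memmove(arg, arg + 1, len - 2);   /* shift the contents left by one */
        arg[len - 2] = '\0';
    }
    return arg;
}
```

It could be applied to each argv[i] at the top of main(). Because it only looks at the first and last byte, multi-byte UTF-8 sequences inside the argument are not affected.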

I'd like to share a message that the maintainer of sshfs-win has
posted on GitHub [1] as a follow-up to our discussion (he did
not know whether he could post to the mailing list without
subscribing first).
Besides, I unfortunately don't have much time at the moment to
investigate this issue (for instance, I have not yet managed to
repeat the experiments with the very latest version of Cygwin), so
his feedback is very valuable.

Here is what he says:

> It seems to me that the list is missing the important point
> about the double quote characters that should NOT be there
> regardless of how the é and ô characters are being interpreted.
> (As evidence of this: the Cygwin command line parser was able
> to break the command line into arguments correctly, but chose
> to retain the double quotes.)
>
> The choice of GetCommandLineA was for illustration purposes;
> had I used GetCommandLineW I would not be able to printf
> using %ls under CMD.EXE, because of code page issues. However
> here is a modified version of the test program that uses
> GetCommandLineW.
>
>     #include <stdio.h>
>
>     wchar_t *GetCommandLineW(void);
>
>     int main(int argc, char *argv[])
>     {
>         wchar_t *s = GetCommandLineW();
>
>         for (wchar_t *p = s; *p; p++)
>             printf("%04x %c%s",
>                 *p,
>                 32 <= *p && *p < 127 ? *p : '.',
>                 (p - s) % 8 + 1 != 8 ? "   " : "\n");
>         printf("\n");
>
>         for (int i = 0; argc > i; i++)
>             printf("%d=%s\n", i, argv[i]);
>
>         return 0;
>     }
>
> I compiled this program under Cygwin to produce cyg.exe and ran
> it under Cygwin and CMD.EXE.
>
> Cygwin run:
>
> billziss@xps:~/Projects/t$ locale
> LANG=en_US.UTF-8
> LC_CTYPE="en_US.UTF-8"
> LC_NUMERIC="en_US.UTF-8"
> LC_TIME="en_US.UTF-8"
> LC_COLLATE="en_US.UTF-8"
> LC_MONETARY="en_US.UTF-8"
> LC_MESSAGES="en_US.UTF-8"
> LC_ALL=
> billziss@xps:~/Projects/t$ ./cyg.exe "foo bar" "Domain\Jérôme"
> 0022 "   0043 C   003a :   005c \   0055 U   0073 s   0065 e   0072 r
> 0073 s   005c \   0062 b   0069 i   006c l   006c l   007a z   0069 i
> 0073 s   0073 s   005c \   0050 P   0072 r   006f o   006a j   0065 e
> 0063 c   0074 t   0073 s   005c \   0074 t   005c \   0063 c   0079 y
> 0067 g   002e .   0065 e   0078 x   0065 e   0022 "
> 0=./cyg
> 1=foo bar
> 2=Domain\Jérôme
>
> CMD.EXE run:
>
> C:\Users\billziss\Projects\t>\Windows\System32\chcp.com
> Active code page: 437
>
> C:\Users\billziss\Projects\t>cyg.exe "foo bar" "Domain\Jérôme"
> 0063 c   0079 y   0067 g   002e .   0065 e   0078 x   0065 e   0020
> 0020     0022 "   0066 f   006f o   006f o   0020     0062 b   0061 a
> 0072 r   0022 "   0020     0022 "   0044 D   006f o   006d m   0061 a
> 0069 i   006e n   005c \   004a J   00e9 .   0072 r   00f4 .   006d m
> 0065 e   0022 "
> 0=cyg
> 1=foo bar
> 2="Domain\Jérôme"


[1] https://github.com/billziss-gh/sshfs-win/pull/208

Thank you very much
Jérôme

On Tue, 13 Oct 2020 at 18:30, Kaz Kylheku (Cygwin)
<[hidden email]> wrote:

>
> On 2020-10-06 14:36, Jérôme Froissart wrote:
> > Here is an example C file
> >     $ cat example.c
> >     #include <stdio.h>
> >
> >     const char *GetCommandLineA(void);
> >
> >     int main(int argc, char *argv[])
> >     {
> >         const char *s = GetCommandLineA();
> >         printf("C=%s\n", s);
> >
> >         for (int i = 0; argc > i; i++)
> >             printf("%d=%s\n", i, argv[i]);
> >
> >         return 0;
> >     }
>
> Your program's comparison seems to be based on the
> hypothesis that Cygwin parses the GetCommandLineA() command line.
>
> But this hypothesis is almost certainly wrong.
>
> > Now, let's start a Windows shell (cmd.exe)
> > Note that I had to copy cygwin1.dll from my Cygwin installation
> > directory, otherwise binary.exe would not start.
> > I do not know whether there is a `locale` equivalent in Windows
> > command prompt, so I merely ran my program.
> >     C:\Users\Public>binary.exe "foo bar" "Jérôme"
> >     C=binary.exe  "foo bar" "J□r□me"
> >     0=binary
> >     1=foo bar
> >     2="Jérôme"
>
> The "A" command line from GetCommandLineA has "tofu"
> characters: é and ô were not decoded properly.
>
> The é and ô characters we see in the Cygwin-parsed
> arguments coming into main could not have been recovered
> from these "tofu" replacement characters.
>
> What is actually being parsed must be the WCHAR command line
> corresponding to what comes from GetCommandLineW().
>
> It's necessary to show that one to get a more complete understanding.
>

Re: Unconsistent command-line parsing in case of UTF-8 quoted arguments

Jérôme Froissart
On Wed, 14 Oct 2020 at 23:47, Jérôme Froissart <[hidden email]> wrote:
> However, there is still a question that is puzzling me. I now
> understand _why_ things happen that way, but I am still wondering
> whether this is really what we _want_. I mean, keeping the double
> quotes around a UTF-8 argument just because it is not run from
> Cygwin's bash sounds like a bug to me, doesn't it? (yet I definitely
> understand the reasons that explain this behaviour). Since I cannot
> run my program from bash, I have to resort to manually trimming the
> quotes, which I would have liked to avoid.
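
For reference, here is a minimal sketch of what I mean by "manually
trimming the quotes". It is not the actual sshfs-win workaround, and
strip_outer_quotes is a name I made up for illustration. It assumes the
argument is a modifiable string (as argv entries are) and only removes
one surrounding pair of double quotes, leaving everything else alone.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of Cygwin or sshfs-win): if arg is
 * wrapped in exactly one leading and one trailing double quote, remove
 * that pair in place. Returns arg so calls can be chained. */
char *strip_outer_quotes(char *arg)
{
    size_t len = strlen(arg);

    if (len >= 2 && arg[0] == '"' && arg[len - 1] == '"') {
        memmove(arg, arg + 1, len - 2); /* shift contents over the opening quote */
        arg[len - 2] = '\0';            /* cut off the closing quote */
    }
    return arg;
}
```

Calling it on each argv[i] before use would turn "\"Jérôme\"" into
"Jérôme" and leave unquoted arguments untouched. The obvious drawback is
that an argument that legitimately starts and ends with a double quote
gets trimmed too.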

Just to rephrase what is puzzling me:
When I understood that sshfs-win had a bug when an argument contained
diacritics, I expected many possible issues: mismatched code pages,
poorly handled encodings, implicit conversions between UTF-8 and
Latin-1, etc., any of which would make some sense.
But I definitely did not expect that "double quotes were not properly
removed by the runtime", which (imho) does not make any sense.

I hope I have managed to rephrase my problem clearly :D
Thanks to all of you for your help!

Re: UTF-8 quoted args passed to program include quotes when run from cmd

Brian Inglis
In reply to this post by Jérôme Froissart
[changed subject]

On 2020-10-14 15:47, Jérôme Froissart wrote:

>> (As evidence of this: the Cygwin command line parser was able to break the
>> command line into arguments correctly, but chose to retain the double
>> quotes.)
>>
>>     #include <stdio.h>
>>
>>     int main(int argc, char *argv[])
>>     {
>>         for (int i = 0; argc > i; i++)
>>             printf("%d=%s\n", i, argv[i]);
>>
>>         return 0;
>>     }
>>
>> I compiled this program under Cygwin to produce cyg.exe and ran it under
>> Cygwin and CMD.EXE.

Please post compile and link command lines, as Cygwin can create native Windows
as well as its own Unix like executables, and the command line parsing may vary.

>> Cygwin run:
>> billziss@xps:~/Projects/t$ locale
>> LANG=en_US.UTF-8
>> LC_CTYPE="en_US.UTF-8"
>> LC_NUMERIC="en_US.UTF-8"
>> LC_TIME="en_US.UTF-8"
>> LC_COLLATE="en_US.UTF-8"
>> LC_MONETARY="en_US.UTF-8"
>> LC_MESSAGES="en_US.UTF-8"
>> LC_ALL=
>> billziss@xps:~/Projects/t$ ./cyg.exe "foo bar" "Domain\Jérôme"
>> 0=./cyg
>> 1=foo bar
>> 2=Domain\Jérôme

>> CMD.EXE run:
>> C:\Users\billziss\Projects\t>cyg.exe "foo bar" "Domain\Jérôme"
>> 0=cyg
>> 1=foo bar
>> 2="Domain\Jérôme"

>>> Now, let's start a Windows shell (cmd.exe)
>>> Note that I had to copy cygwin1.dll from my Cygwin installation
>>> directory, otherwise binary.exe would not start.
>>> I do not know whether there is a `locale` equivalent in Windows
>>> command prompt, so I merely ran my program.
>>>     C:\Users\Public>binary.exe "foo bar" "Jérôme"
>>>     0=binary
>>>     1=foo bar
>>>     2="Jérôme"

Your Windows CommandLineA/W outputs were confusing.

The point is that Cygwin programs run from cmd shell appear to receive UTF-8
arguments with the surrounding double quotes included intact, whereas the double
quotes are stripped when run from a Cygwin shell.

I think the charset needs to be verified by dumping each arg as hex bytes, e.g.:

//!/usr/bin/gcc -g -Og -Wall -Wextra -o quoted-arg-dump quoted-arg-dump.c
// quoted-arg-dump.c - dump quoted args under Cygwin and Windows shells
// outputs:
// $ ./quoted-arg-dump "foo bar" "Jérôme"
// 0 './quoted-arg-dump' 2e 2f 71 75 6f 74 65 64 2d 61 72 67 2d 64 75 6d 70
// 1 'foo bar' 66 6f 6f 20 62 61 72
// 2 'Jérôme' 4a c3 a9 72 c3 b4 6d 65
// >quoted-arg-dump "foo bar" "Jérôme"
// 0 'quoted-arg-dump' 71 75 6f 74 65 64 2d 61 72 67 2d 64 75 6d 70
// 1 'foo bar' 66 6f 6f 20 62 61 72
// 2 '"Jérôme"' 22 4a c3 a9 72 c3 b4 6d 65 22
// checks:
// $ grep -a '[éô]' unicode-symbols.txt
// é  U+00E9  LATIN SMALL LETTER E WITH ACUTE
// ô  U+00F4  LATIN SMALL LETTER O WITH CIRCUMFLEX
// $ grep -a '[éô]' unicode-symbols.txt | od -An -tx1z -w11
// c3 a9 20 20 55 2b 30 30 45 39 20  >..  U+00E9 <
// 20 4c 41 54 49 4e 20 53 4d 41 4c  > LATIN SMAL<
// 4c 20 4c 45 54 54 45 52 20 45 20  >L LETTER E <
// 57 49 54 48 20 41 43 55 54 45 0a  >WITH ACUTE.<
// c3 b4 20 20 55 2b 30 30 46 34 20  >..  U+00F4 <
// 20 4c 41 54 49 4e 20 53 4d 41 4c  > LATIN SMAL<
// 4c 20 4c 45 54 54 45 52 20 4f 20  >L LETTER O <
// 57 49 54 48 20 43 49 52 43 55 4d  >WITH CIRCUM<
// 46 4c 45 58 0a                    >FLEX.<
#include <stdio.h>
int
main(int argc, char *argv[]) {
        for (int a = 0; a < argc; ++a) {
                printf("%d '%s'", a, argv[a]);

                for (char *p = argv[a]; *p; ++p) {
                        printf(" %.2hhx", *p);
                } // for chars

                printf("\n");
        } // for args
} // main()

This verifies that Cygwin does not strip double quotes from UTF-8 args when run
from Windows cmd, and the args are received and output as UTF-8 characters.
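
As a cross-check of the dump, the two-byte sequences can be decoded by
hand: for c3 a9, ((0xc3 & 0x1f) << 6) | (0xa9 & 0x3f) = 0xc0 | 0x29 =
0xe9, i.e. U+00E9, and likewise c3 b4 gives U+00F4. A small sketch of
that arithmetic follows; decode_utf8_2 is a name invented here for
illustration and it only handles two-byte sequences.

```c
#include <stdint.h>

/* Decode one two-byte UTF-8 sequence (110xxxxx 10xxxxxx).
 * Returns the code point, or UINT32_MAX if the bytes do not form a
 * valid two-byte sequence (wrong prefix bits or overlong encoding). */
uint32_t decode_utf8_2(unsigned char b0, unsigned char b1)
{
    if ((b0 & 0xe0) != 0xc0 || (b1 & 0xc0) != 0x80)
        return UINT32_MAX;

    /* 5 payload bits from the lead byte, 6 from the continuation byte */
    uint32_t cp = ((uint32_t)(b0 & 0x1f) << 6) | (uint32_t)(b1 & 0x3f);

    if (cp < 0x80)          /* would fit in one byte: overlong, reject */
        return UINT32_MAX;
    return cp;
}
```

With the bytes from arg 2 above, decode_utf8_2(0xc3, 0xa9) yields 0xe9
and decode_utf8_2(0xc3, 0xb4) yields 0xf4, matching the code points in
the grep check.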

It might be interesting if you could also run from PowerShell and/or Terminal
for comparison to see if the Windows cmd behaviour is reproduced there.

--
Take care. Thanks, Brian Inglis, Calgary, Alberta, Canada

This email may be disturbing to some readers as it contains
too much technical detail. Reader discretion is advised.
[Data in binary units and prefixes, physical quantities in SI.]

Re: Unconsistent command-line parsing in case of UTF-8 quoted arguments

Cygwin list mailing list
In reply to this post by Jérôme Froissart
On 2020-10-14 14:47, Jérôme Froissart wrote:
>> The choice of GetCommandLineA was for illustration purposes;
>> had I used GetCommandLineW I would not be able to printf
>> using %ls under CMD.EXE, because of code page issues. However
>> here is a modified version of the test program that uses
>> GetCommandLineW.

[ ... ]

>> billziss@xps:~/Projects/t$ ./cyg.exe "foo bar" "Domain\Jérôme"
>> 0022 "   0043 C   003a :   005c \   0055 U   0073 s   0065 e   0072 r
>> 0073 s   005c \   0062 b   0069 i   006c l   006c l   007a z   0069 i
>> 0073 s   0073 s   005c \   0050 P   0072 r   006f o   006a j   0065 e
>> 0063 c   0074 t   0073 s   005c \   0074 t   005c \   0063 c   0079 y
>> 0067 g   002e .   0065 e   0078 x   0065 e   0022 "

[ ... ]

>> C:\Users\billziss\Projects\t>cyg.exe "foo bar" "Domain\Jérôme"
>> 0063 c   0079 y   0067 g   002e .   0065 e   0078 x   0065 e   0020
>> 0020     0022 "   0066 f   006f o   006f o   0020     0062 b   0061 a
>> 0072 r   0022 "   0020     0022 "   0044 D   006f o   006d m   0061 a
>> 0069 i   006e n   005c \   004a J   00e9 .   0072 r   00f4 .   006d m
>> 0065 e   0022 "

Aha! There is a hint of a problem here. Firstly, the command lines
are obviously different.

The Cygwin one starts with a quote that we did not see, wrapping
the full path to the executable:

   "C:\Users\billziss\Projects\t\cyg.exe"

It ends there. Why is that? I'm guessing that the command line was
tokenized destructively; a null character was written.

But under cmd.exe, we see the whole command line, without any null
character having been written in it. Moreover, the program name just
appears as the original relative path cyg.exe with no quotes.
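
To make the guess concrete, here is a sketch of what I mean by
destructive tokenization. This is not Cygwin's actual code, which I have
not checked; split_in_place is a made-up name, it works on char rather
than WCHAR for brevity, and it ignores quoting entirely. The point is
only that overwriting each separator with a null makes the original
buffer, read as a string, end after the first token.

```c
#include <stddef.h>

/* Hypothetical in-place splitter: breaks buf at spaces by overwriting
 * the separator after each token with '\0'. Stores up to max token
 * pointers in argv_out and returns how many were stored.
 * No quote handling; illustration only. */
size_t split_in_place(char *buf, char **argv_out, size_t max)
{
    size_t n = 0;
    char *p = buf;

    while (*p && n < max) {
        while (*p == ' ')           /* skip leading separators */
            p++;
        if (!*p)
            break;
        argv_out[n++] = p;          /* token starts here */
        while (*p && *p != ' ')     /* scan to the end of the token */
            p++;
        if (*p)
            *p++ = '\0';            /* the destructive step */
    }
    return n;
}
```

After splitting a buffer holding "cyg.exe foo bar", printing the buffer
itself shows just "cyg.exe", which is the kind of truncation seen in the
Cygwin run above.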

What a mess. :)


