[PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"

classic Classic list List threaded Threaded
60 messages Options
123
Reply | Threaded
Open this post in threaded view
|

[PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"

Johannes Schindelin
When `LANG=en_US.UTF-8`, the detected `LCID` is 0x0409, which is
correct, but after that (at least if Pseudo Console support is enabled),
we try to find the default code page for that `LCID`, which is ASCII
(437). Subsequently, we set the Console output code page to that value,
completely ignoring that we wanted to use UTF-8.

Let's not ignore the specifically asked-for UTF-8 character set.

While at it, let's also set the Console output code page even if Pseudo
Console support is disabled; contrary to the behavior of v3.0.7, the
Console output code page is not ignored in that case.

The most common symptom would be that console applications which do not
specifically call `SetConsoleOutputCP()` but output UTF-8-encoded text
seem to be broken with v3.1.x when they worked plenty fine with v3.0.x.

This fixes https://github.com/msys2/MSYS2-packages/issues/1974,
https://github.com/msys2/MSYS2-packages/issues/2012,
https://github.com/rust-lang/cargo/issues/8369,
https://github.com/git-for-windows/git/issues/2734,
https://github.com/git-for-windows/git/issues/2793,
https://github.com/git-for-windows/git/issues/2792, and possibly quite a
few others.

Signed-off-by: Johannes Schindelin <[hidden email]>
---
 winsup/cygwin/fhandler_tty.cc | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/winsup/cygwin/fhandler_tty.cc b/winsup/cygwin/fhandler_tty.cc
index 06789a500..414c26992 100644
--- a/winsup/cygwin/fhandler_tty.cc
+++ b/winsup/cygwin/fhandler_tty.cc
@@ -2859,6 +2859,15 @@ fhandler_pty_slave::setup_locale (void)
   char charset[ENCODING_LEN + 1] = "ASCII";
   LCID lcid = get_langinfo (locale, charset);

+  /* Special-case the UTF-8 character set */
+  if (strcasecmp (charset, "UTF-8") == 0)
+    {
+      get_ttyp ()->term_code_page = CP_UTF8;
+      SetConsoleCP (CP_UTF8);
+      SetConsoleOutputCP (CP_UTF8);
+      return;
+    }
+
   /* Set console code page from locale */
   if (get_pseudo_console ())
     {
--
2.27.0

Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"

Johannes Schindelin
Hi,

On Tue, 1 Sep 2020, Johannes Schindelin wrote:

> When `LANG=en_US.UTF-8`, the detected `LCID` is 0x0409, which is
> correct, but after that (at least if Pseudo Console support is enabled),
> we try to find the default code page for that `LCID`, which is ASCII
> (437). Subsequently, we set the Console output code page to that value,
> completely ignoring that we wanted to use UTF-8.
>
> Let's not ignore the specifically asked-for UTF-8 character set.
>
> While at it, let's also set the Console output code page even if Pseudo
> Console support is disabled; contrary to the behavior of v3.0.7, the
> Console output code page is not ignored in that case.
>
> The most common symptom would be that console applications which do not
> specifically call `SetConsoleOutputCP()` but output UTF-8-encoded text
> seem to be broken with v3.1.x when they worked plenty fine with v3.0.x.
>
> This fixes https://github.com/msys2/MSYS2-packages/issues/1974,
> https://github.com/msys2/MSYS2-packages/issues/2012,
> https://github.com/rust-lang/cargo/issues/8369,
> https://github.com/git-for-windows/git/issues/2734,
> https://github.com/git-for-windows/git/issues/2793,
> https://github.com/git-for-windows/git/issues/2792, and possibly quite a
> few others.
>
> Signed-off-by: Johannes Schindelin <[hidden email]>
> ---
>  winsup/cygwin/fhandler_tty.cc | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/winsup/cygwin/fhandler_tty.cc b/winsup/cygwin/fhandler_tty.cc
> index 06789a500..414c26992 100644
> --- a/winsup/cygwin/fhandler_tty.cc
> +++ b/winsup/cygwin/fhandler_tty.cc
> @@ -2859,6 +2859,15 @@ fhandler_pty_slave::setup_locale (void)
>    char charset[ENCODING_LEN + 1] = "ASCII";
>    LCID lcid = get_langinfo (locale, charset);
>
> +  /* Special-case the UTF-8 character set */
> +  if (strcasecmp (charset, "UTF-8") == 0)
> +    {
> +      get_ttyp ()->term_code_page = CP_UTF8;
> +      SetConsoleCP (CP_UTF8);
> +      SetConsoleOutputCP (CP_UTF8);
> +      return;
> +    }
> +

Just a word of warning: while this patch can be ported to a634adda5
(libm/machine/arm: Rename s*_fma.c -> s*_fma_arm.c, 2020-09-01) relatively
easily (and the first two patches of this patch series cannot, as they are
no longer applicable after the complete redesign of the Pseudo Console
support), it only works as intended in the `disable_pcon` mode.

The new design calls for Pseudo Consoles to be created per spawned console
application.

And I have not found any way to convince my local version of the runtime
to change the code page of these Pseudo Consoles away from the rather
unfortunate default 437.

This is a problem.

Take for example https://github.com/git-for-windows/git/issues/2793.
Telling the users that they should patch node.js and recompile is probably
not going to fly.

Hopefully there is a way to fix this, otherwise Pseudo Console support
will continue to be quite the support burden.

Ciao,
Johannes

>    /* Set console code page from locale */
>    if (get_pseudo_console ())
>      {
> --
> 2.27.0
>
>
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"

Corinna Vinschen-2
In reply to this post by Johannes Schindelin
On Sep  1 18:19, Johannes Schindelin wrote:

> When `LANG=en_US.UTF-8`, the detected `LCID` is 0x0409, which is
> correct, but after that (at least if Pseudo Console support is enabled),
> we try to find the default code page for that `LCID`, which is ASCII
> (437). Subsequently, we set the Console output code page to that value,
> completely ignoring that we wanted to use UTF-8.
>
> Let's not ignore the specifically asked-for UTF-8 character set.
>
> While at it, let's also set the Console output code page even if Pseudo
> Console support is disabled; contrary to the behavior of v3.0.7, the
> Console output code page is not ignored in that case.
>
> The most common symptom would be that console applications which do not
> specifically call `SetConsoleOutputCP()` but output UTF-8-encoded text
> seem to be broken with v3.1.x when they worked plenty fine with v3.0.x.
>
> This fixes https://github.com/msys2/MSYS2-packages/issues/1974,
> https://github.com/msys2/MSYS2-packages/issues/2012,
> https://github.com/rust-lang/cargo/issues/8369,
> https://github.com/git-for-windows/git/issues/2734,
> https://github.com/git-for-windows/git/issues/2793,
> https://github.com/git-for-windows/git/issues/2792, and possibly quite a
> few others.
>
> Signed-off-by: Johannes Schindelin <[hidden email]>
> ---
>  winsup/cygwin/fhandler_tty.cc | 9 +++++++++
>  1 file changed, 9 insertions(+)

Ok guys, I'm not opposed to this change in terms of its result,
but I'm starting to wonder why all this locale code in fhandler_tty
is necessary at all.

I see that get_langinfo() calls __loadlocale and performs a lot of stuff
on the charsets which looks like duplicates of the initial_setlocale()
call performed at DLL startup.

If there's anything missing in the initial_setlocale() call which would
be required by the pseudo tty code?  What exactly is it?  The codepage?
And why can't we just add the info to cygheap->locale at initial_setlocale()
time so it's available at exec time without going through all this hassle
every time?

Apart from that, all this locale/charset/lcid stuff should be concentrated
in nlsfunc.cc ideally.


Thanks,
Corinna
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"

Corinna Vinschen-2
On Sep  2 10:30, Corinna Vinschen wrote:

> On Sep  1 18:19, Johannes Schindelin wrote:
> > When `LANG=en_US.UTF-8`, the detected `LCID` is 0x0409, which is
> > correct, but after that (at least if Pseudo Console support is enabled),
> > we try to find the default code page for that `LCID`, which is ASCII
> > (437). Subsequently, we set the Console output code page to that value,
> > completely ignoring that we wanted to use UTF-8.
> >
> > Let's not ignore the specifically asked-for UTF-8 character set.
> >
> > While at it, let's also set the Console output code page even if Pseudo
> > Console support is disabled; contrary to the behavior of v3.0.7, the
> > Console output code page is not ignored in that case.
> >
> > The most common symptom would be that console applications which do not
> > specifically call `SetConsoleOutputCP()` but output UTF-8-encoded text
> > seem to be broken with v3.1.x when they worked plenty fine with v3.0.x.
> >
> > This fixes https://github.com/msys2/MSYS2-packages/issues/1974,
> > https://github.com/msys2/MSYS2-packages/issues/2012,
> > https://github.com/rust-lang/cargo/issues/8369,
> > https://github.com/git-for-windows/git/issues/2734,
> > https://github.com/git-for-windows/git/issues/2793,
> > https://github.com/git-for-windows/git/issues/2792, and possibly quite a
> > few others.
> >
> > Signed-off-by: Johannes Schindelin <[hidden email]>
> > ---
> >  winsup/cygwin/fhandler_tty.cc | 9 +++++++++
> >  1 file changed, 9 insertions(+)
>
> Ok guys, I'm not opposed to this change in terms of its result,
> but I'm starting to wonder why all this locale code in fhandler_tty
> is necessary at all.
>
> I see that get_langinfo() calls __loadlocale and performs a lot of stuff
> on the charsets which looks like duplicates of the initial_setlocale()
> call performed at DLL startup.
>
> If there's anything missing in the initial_setlocale() call which would
> be required by the pseudo tty code?  What exactly is it?  The codepage?
> And why can't we just add the info to cygheap->locale at initial_setlocale()
> time so it's available at exec time without going through all this hassle
> every time?
>
> Apart from that, all this locale/charset/lcid stuff should be concentrated
> in nlsfunc.cc ideally.

get_locale_from_env() and get_langinfo() should go away.  If we just
need a codepage for get_ttyp ()->term_code_page, we should really find a
way to do this from within internal_setlocale().


Corinna
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"

cygwin-patches mailing list
In reply to this post by Corinna Vinschen-2
On Wed, 2 Sep 2020 10:30:14 +0200
Corinna Vinschen wrote:

> On Sep  1 18:19, Johannes Schindelin wrote:
> > When `LANG=en_US.UTF-8`, the detected `LCID` is 0x0409, which is
> > correct, but after that (at least if Pseudo Console support is enabled),
> > we try to find the default code page for that `LCID`, which is ASCII
> > (437). Subsequently, we set the Console output code page to that value,
> > completely ignoring that we wanted to use UTF-8.
> >
> > Let's not ignore the specifically asked-for UTF-8 character set.
> >
> > While at it, let's also set the Console output code page even if Pseudo
> > Console support is disabled; contrary to the behavior of v3.0.7, the
> > Console output code page is not ignored in that case.
> >
> > The most common symptom would be that console applications which do not
> > specifically call `SetConsoleOutputCP()` but output UTF-8-encoded text
> > seem to be broken with v3.1.x when they worked plenty fine with v3.0.x.
> >
> > This fixes https://github.com/msys2/MSYS2-packages/issues/1974,
> > https://github.com/msys2/MSYS2-packages/issues/2012,
> > https://github.com/rust-lang/cargo/issues/8369,
> > https://github.com/git-for-windows/git/issues/2734,
> > https://github.com/git-for-windows/git/issues/2793,
> > https://github.com/git-for-windows/git/issues/2792, and possibly quite a
> > few others.
> >
> > Signed-off-by: Johannes Schindelin <[hidden email]>
> > ---
> >  winsup/cygwin/fhandler_tty.cc | 9 +++++++++
> >  1 file changed, 9 insertions(+)
>
> Ok guys, I'm not opposed to this change in terms of its result,

I am sorry, but I cannot agree with Johannes's patch.

For example, code page in Japan is CP932 by default.
In this case, cmd.exe, netsh.exe and so on are generate
messages in Japanese.

If the code page is set to CP_UTF8, messages from such
commands changes to english. I guess similar things happen
for other locales.

I do not prefer this result.

Furthermore, I looked into the issue:
https://github.com/git-for-windows/git/issues/2734
and I found that git-for-windows always use utf-8
encoding even if the locale is ja_JP.CP932.
It does not change coding based on locale or code
page.

Even with Johannes's patch, if mintty is started with
locale ja_JP.CP932, the file name will be garbled
bacause SetConsoleOutputCP(CP_UTF8) will not be called.

IMHO, it is the problem of git-for-windows rather
than cygwin and msys2.

To make current version of git-for-windows work, it is
necessary to set code page to CP_UTF8 regardless locale.
This does not make sense at all.

--
Takashi Yano <[hidden email]>
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"

Johannes Schindelin
Hi Takashi,

On Wed, 2 Sep 2020, Takashi Yano via Cygwin-patches wrote:

> On Wed, 2 Sep 2020 10:30:14 +0200
> Corinna Vinschen wrote:
> > On Sep  1 18:19, Johannes Schindelin wrote:
> > > When `LANG=en_US.UTF-8`, the detected `LCID` is 0x0409, which is
> > > correct, but after that (at least if Pseudo Console support is enabled),
> > > we try to find the default code page for that `LCID`, which is ASCII
> > > (437). Subsequently, we set the Console output code page to that value,
> > > completely ignoring that we wanted to use UTF-8.
> > >
> > > Let's not ignore the specifically asked-for UTF-8 character set.
> > >
> > > While at it, let's also set the Console output code page even if Pseudo
> > > Console support is disabled; contrary to the behavior of v3.0.7, the
> > > Console output code page is not ignored in that case.
> > >
> > > The most common symptom would be that console applications which do not
> > > specifically call `SetConsoleOutputCP()` but output UTF-8-encoded text
> > > seem to be broken with v3.1.x when they worked plenty fine with v3.0.x.
> > >
> > > This fixes https://github.com/msys2/MSYS2-packages/issues/1974,
> > > https://github.com/msys2/MSYS2-packages/issues/2012,
> > > https://github.com/rust-lang/cargo/issues/8369,
> > > https://github.com/git-for-windows/git/issues/2734,
> > > https://github.com/git-for-windows/git/issues/2793,
> > > https://github.com/git-for-windows/git/issues/2792, and possibly quite a
> > > few others.
> > >
> > > Signed-off-by: Johannes Schindelin <[hidden email]>
> > > ---
> > >  winsup/cygwin/fhandler_tty.cc | 9 +++++++++
> > >  1 file changed, 9 insertions(+)
> >
> > Ok guys, I'm not opposed to this change in terms of its result,
>
> I am sorry, but I cannot agree with Johannes's patch.
>
> For example, code page in Japan is CP932 by default.
> In this case, cmd.exe, netsh.exe and so on are generate
> messages in Japanese.
>
> If the code page is set to CP_UTF8, messages from such
> commands changes to english. I guess similar things happen
> for other locales.
>
> I do not prefer this result.
>
> Furthermore, I looked into the issue:
> https://github.com/git-for-windows/git/issues/2734
> and I found that git-for-windows always use utf-8
> encoding even if the locale is ja_JP.CP932.
> It does not change coding based on locale or code
> page.
>
> Even with Johannes's patch, if mintty is started with
> locale ja_JP.CP932, the file name will be garbled
> bacause SetConsoleOutputCP(CP_UTF8) will not be called.
>
> IMHO, it is the problem of git-for-windows rather
> than cygwin and msys2.
>
> To make current version of git-for-windows work, it is
> necessary to set code page to CP_UTF8 regardless locale.
> This does not make sense at all.

You are misrepresenting the problem. It has nothing to do with Git for
Windows. For example, if you run tests in an Angular project inside
Cygwin's MinTTY, the output will be garbled because node.js (or the
Angular libraries, I don't know) will interpret `LANG` or something.

This is a much bigger problem than you make it sound. The console
applications that do _not_ call `SetConsoleOutputCP()` are sooooo
ubiquituous. I think you are underestimating that problem rather
dramatically.

Ciao,
Johannes
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"

cygwin-patches mailing list
In reply to this post by Corinna Vinschen-2
Hi Corinna,

On Wed, 2 Sep 2020 10:38:18 +0200
Corinna Vinschen wrote:

> On Sep  2 10:30, Corinna Vinschen wrote:
> > Ok guys, I'm not opposed to this change in terms of its result,
> > but I'm starting to wonder why all this locale code in fhandler_tty
> > is necessary at all.
> >
> > I see that get_langinfo() calls __loadlocale and performs a lot of stuff
> > on the charsets which looks like duplicates of the initial_setlocale()
> > call performed at DLL startup.
> >
> > If there's anything missing in the initial_setlocale() call which would
> > be required by the pseudo tty code?  What exactly is it?  The codepage?
> > And why can't we just add the info to cygheap->locale at initial_setlocale()
> > time so it's available at exec time without going through all this hassle
> > every time?
> >
> > Apart from that, all this locale/charset/lcid stuff should be concentrated
> > in nlsfunc.cc ideally.
>
> get_locale_from_env() and get_langinfo() should go away.  If we just
> need a codepage for get_ttyp ()->term_code_page, we should really find a
> way to do this from within internal_setlocale().

I looked into internal_setlocale() code, but I could not found
the code which handles thecode page. I found the code handling
the code page in __set_charset_from_locale() function in nlsfuncs.cc,
but it does not return code page itself. Could you please explain
more detail of your idea?

--
Takashi Yano <[hidden email]>
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"

cygwin-patches mailing list
In reply to this post by Johannes Schindelin
On Wed, 2 Sep 2020 08:26:04 +0200 (CEST)
Johannes Schindelin wrote:

> Hi Takashi,
>
> On Wed, 2 Sep 2020, Takashi Yano via Cygwin-patches wrote:
>
> > On Wed, 2 Sep 2020 10:30:14 +0200
> > Corinna Vinschen wrote:
> > > On Sep  1 18:19, Johannes Schindelin wrote:
> > > > When `LANG=en_US.UTF-8`, the detected `LCID` is 0x0409, which is
> > > > correct, but after that (at least if Pseudo Console support is enabled),
> > > > we try to find the default code page for that `LCID`, which is ASCII
> > > > (437). Subsequently, we set the Console output code page to that value,
> > > > completely ignoring that we wanted to use UTF-8.
> > > >
> > > > Let's not ignore the specifically asked-for UTF-8 character set.
> > > >
> > > > While at it, let's also set the Console output code page even if Pseudo
> > > > Console support is disabled; contrary to the behavior of v3.0.7, the
> > > > Console output code page is not ignored in that case.
> > > >
> > > > The most common symptom would be that console applications which do not
> > > > specifically call `SetConsoleOutputCP()` but output UTF-8-encoded text
> > > > seem to be broken with v3.1.x when they worked plenty fine with v3.0.x.
> > > >
> > > > This fixes https://github.com/msys2/MSYS2-packages/issues/1974,
> > > > https://github.com/msys2/MSYS2-packages/issues/2012,
> > > > https://github.com/rust-lang/cargo/issues/8369,
> > > > https://github.com/git-for-windows/git/issues/2734,
> > > > https://github.com/git-for-windows/git/issues/2793,
> > > > https://github.com/git-for-windows/git/issues/2792, and possibly quite a
> > > > few others.
> > > >
> > > > Signed-off-by: Johannes Schindelin <[hidden email]>
> > > > ---
> > > >  winsup/cygwin/fhandler_tty.cc | 9 +++++++++
> > > >  1 file changed, 9 insertions(+)
> > >
> > > Ok guys, I'm not opposed to this change in terms of its result,
> >
> > I am sorry, but I cannot agree with Johannes's patch.
> >
> > For example, code page in Japan is CP932 by default.
> > In this case, cmd.exe, netsh.exe and so on are generate
> > messages in Japanese.
> >
> > If the code page is set to CP_UTF8, messages from such
> > commands changes to english. I guess similar things happen
> > for other locales.
> >
> > I do not prefer this result.
> >
> > Furthermore, I looked into the issue:
> > https://github.com/git-for-windows/git/issues/2734
> > and I found that git-for-windows always use utf-8
> > encoding even if the locale is ja_JP.CP932.
> > It does not change coding based on locale or code
> > page.
> >
> > Even with Johannes's patch, if mintty is started with
> > locale ja_JP.CP932, the file name will be garbled
> > bacause SetConsoleOutputCP(CP_UTF8) will not be called.
> >
> > IMHO, it is the problem of git-for-windows rather
> > than cygwin and msys2.
> >
> > To make current version of git-for-windows work, it is
> > necessary to set code page to CP_UTF8 regardless locale.
> > This does not make sense at all.
>
> You are misrepresenting the problem. It has nothing to do with Git for
> Windows. For example, if you run tests in an Angular project inside
> Cygwin's MinTTY, the output will be garbled because node.js (or the
> Angular libraries, I don't know) will interpret `LANG` or something.

You listed these issues in git-for-windows:
> > > > https://github.com/git-for-windows/git/issues/2734,
> > > > https://github.com/git-for-windows/git/issues/2792,
didn't you? Therefore, I looked into them.

OK, I will check Angular/CLI next. But I am not familier with
Agnular/CLI. Could you please provide simple steps to reproduce
the problem?

> This is a much bigger problem than you make it sound. The console
> applications that do _not_ call `SetConsoleOutputCP()` are sooooo
> ubiquituous. I think you are underestimating that problem rather
> dramatically.
>
> Ciao,
> Johannes


--
Takashi Yano <[hidden email]>
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"

Johannes Schindelin
Hi Takashi,


On Wed, 2 Sep 2020, Takashi Yano wrote:

> On Wed, 2 Sep 2020 08:26:04 +0200 (CEST)
> Johannes Schindelin wrote:
> > Hi Takashi,
> >
> > On Wed, 2 Sep 2020, Takashi Yano via Cygwin-patches wrote:
> >
> > > On Wed, 2 Sep 2020 10:30:14 +0200
> > > Corinna Vinschen wrote:
> > > > On Sep  1 18:19, Johannes Schindelin wrote:
> > > > > When `LANG=en_US.UTF-8`, the detected `LCID` is 0x0409, which is
> > > > > correct, but after that (at least if Pseudo Console support is enabled),
> > > > > we try to find the default code page for that `LCID`, which is ASCII
> > > > > (437). Subsequently, we set the Console output code page to that value,
> > > > > completely ignoring that we wanted to use UTF-8.
> > > > >
> > > > > Let's not ignore the specifically asked-for UTF-8 character set.
> > > > >
> > > > > While at it, let's also set the Console output code page even if Pseudo
> > > > > Console support is disabled; contrary to the behavior of v3.0.7, the
> > > > > Console output code page is not ignored in that case.
> > > > >
> > > > > The most common symptom would be that console applications which do not
> > > > > specifically call `SetConsoleOutputCP()` but output UTF-8-encoded text
> > > > > seem to be broken with v3.1.x when they worked plenty fine with v3.0.x.
> > > > >
> > > > > This fixes https://github.com/msys2/MSYS2-packages/issues/1974,
> > > > > https://github.com/msys2/MSYS2-packages/issues/2012,
> > > > > https://github.com/rust-lang/cargo/issues/8369,
> > > > > https://github.com/git-for-windows/git/issues/2734,
> > > > > https://github.com/git-for-windows/git/issues/2793,
> > > > > https://github.com/git-for-windows/git/issues/2792, and possibly quite a
> > > > > few others.
> > > > >
> > > > > Signed-off-by: Johannes Schindelin <[hidden email]>
> > > > > ---
> > > > >  winsup/cygwin/fhandler_tty.cc | 9 +++++++++
> > > > >  1 file changed, 9 insertions(+)
> > > >
> > > > Ok guys, I'm not opposed to this change in terms of its result,
> > >
> > > I am sorry, but I cannot agree with Johannes's patch.
> > >
> > > For example, code page in Japan is CP932 by default.
> > > In this case, cmd.exe, netsh.exe and so on are generate
> > > messages in Japanese.
> > >
> > > If the code page is set to CP_UTF8, messages from such
> > > commands changes to english. I guess similar things happen
> > > for other locales.
> > >
> > > I do not prefer this result.
> > >
> > > Furthermore, I looked into the issue:
> > > https://github.com/git-for-windows/git/issues/2734
> > > and I found that git-for-windows always use utf-8
> > > encoding even if the locale is ja_JP.CP932.
> > > It does not change coding based on locale or code
> > > page.
> > >
> > > Even with Johannes's patch, if mintty is started with
> > > locale ja_JP.CP932, the file name will be garbled
> > > bacause SetConsoleOutputCP(CP_UTF8) will not be called.
> > >
> > > IMHO, it is the problem of git-for-windows rather
> > > than cygwin and msys2.
> > >
> > > To make current version of git-for-windows work, it is
> > > necessary to set code page to CP_UTF8 regardless locale.
> > > This does not make sense at all.
> >
> > You are misrepresenting the problem. It has nothing to do with Git for
> > Windows. For example, if you run tests in an Angular project inside
> > Cygwin's MinTTY, the output will be garbled because node.js (or the
> > Angular libraries, I don't know) will interpret `LANG` or something.
>
> You listed these issues in git-for-windows:
> > > > > https://github.com/git-for-windows/git/issues/2734,
> > > > > https://github.com/git-for-windows/git/issues/2792,
> didn't you? Therefore, I looked into them.
>
> OK, I will check Angular/CLI next. But I am not familier with
> Agnular/CLI. Could you please provide simple steps to reproduce
> the problem?

Here is a report: https://github.com/git-for-windows/git/issues/2793

But why do you want to look into Angular/CLI? Do you really think that it
makes sense to try to fix every console app out there? Really? Wouldn't it
make more sense to just accept that there are many console applications
that are broken by the recent change, and accommodate them in the Cygwin
runtime? That would take substantially less time.

Ciao,
Johannes

>
> > This is a much bigger problem than you make it sound. The console
> > applications that do _not_ call `SetConsoleOutputCP()` are sooooo
> > ubiquituous. I think you are underestimating that problem rather
> > dramatically.
> >
> > Ciao,
> > Johannes
>
>
> --
> Takashi Yano <[hidden email]>
>
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"

cygwin-patches mailing list
On Wed, 2 Sep 2020 11:12:53 +0200 (CEST)
Johannes Schindelin wrote:
> On Wed, 2 Sep 2020, Takashi Yano wrote:
> > OK, I will check Angular/CLI next. But I am not familier with
> > Agnular/CLI. Could you please provide simple steps to reproduce
> > the problem?
>
> Here is a report: https://github.com/git-for-windows/git/issues/2793

I already read that thread, and I have try following step.

1) Install node.js
2) npm install --global @angular/cli
3) ng new test-app
4) cd test-app
5) ng test --code-coverage

However, the output is very differnt from the bug report,
and there seems to be no garbled output.

The output is something like:

Compiling @angular/core : es2015 as esm2015
Compiling @angular/animations : es2015 as esm2015
Compiling @angular/compiler/testing : es2015 as esm2015
Compiling @angular/common : es2015 as esm2015
Compiling @angular/animations/browser : es2015 as esm2015
Compiling @angular/core/testing : es2015 as esm2015
Compiling @angular/animations/browser/testing : es2015 as esm2015
Compiling @angular/platform-browser : es2015 as esm2015
Compiling @angular/common/http : es2015 as esm2015
Compiling @angular/common/testing : es2015 as esm2015
Compiling @angular/router : es2015 as esm2015
Compiling @angular/forms : es2015 as esm2015
Compiling @angular/platform-browser-dynamic : es2015 as esm2015
Compiling @angular/platform-browser/testing : es2015 as esm2015
Compiling @angular/platform-browser/animations : es2015 as esm2015
Compiling @angular/common/http/testing : es2015 as esm2015
Compiling @angular/platform-browser-dynamic/testing : es2015 as esm2015
Compiling @angular/router/testing : es2015 as esm2015
10% building 2/2 modules 0 active02 09 2020 23:49:25.110:WARN [karma]: No captur
ed browser, open http://localhost:9876/
02 09 2020 23:49:25.118:INFO [karma-server]: Karma v5.0.9 server started at http
://0.0.0.0:9876/
02 09 2020 23:49:25.119:INFO [launcher]: Launching browsers Chrome with concurre
ncy unlimited
02 09 2020 23:49:25.125:INFO [launcher]: Starting browser Chrome
92% additional asset processing copy-webpack-plugin02 09 2020 23:49:34.003:WARN
[karma]: No captured browser, open http://localhost:9876/
02 09 2020 23:49:34.907:INFO [Chrome 85.0.4183.83 (Windows 10)]: Connected on so
cket z7khH23JfW4v-_TtAAAA with id 59608191
Chrome 85.0.4183.83 (Windows 10): Executed 3 of 3 SUCCESS (0.62 secs / 0.273 sec
s)
TOTAL: 3 SUCCESS
TOTAL: 3 SUCCESS
TOTAL: 3 SUCCESS

=============================== Coverage summary ===============================
Statements   : 100% ( 6/6 )
Branches     : 100% ( 0/0 )
Functions    : 100% ( 0/0 )
Lines        : 100% ( 5/5 )
================================================================================

What's wrong?

> But why do you want to look into Angular/CLI? Do you really think that it
> makes sense to try to fix every console app out there? Really? Wouldn't it
> make more sense to just accept that there are many console applications
> that are broken by the recent change, and accommodate them in the Cygwin
> runtime? That would take substantially less time.

It is important to know what's happning actually to fix the issue.
Superficial patch makes the problem more complicated.

--
Takashi Yano <[hidden email]>
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"

Corinna Vinschen-2
In reply to this post by cygwin-patches mailing list
On Sep  2 19:54, Takashi Yano via Cygwin-patches wrote:

> Hi Corinna,
>
> On Wed, 2 Sep 2020 10:38:18 +0200
> Corinna Vinschen wrote:
> > On Sep  2 10:30, Corinna Vinschen wrote:
> > > Ok guys, I'm not opposed to this change in terms of its result,
> > > but I'm starting to wonder why all this locale code in fhandler_tty
> > > is necessary at all.
> > >
> > > I see that get_langinfo() calls __loadlocale and performs a lot of stuff
> > > on the charsets which looks like duplicates of the initial_setlocale()
> > > call performed at DLL startup.
> > >
> > > If there's anything missing in the initial_setlocale() call which would
> > > be required by the pseudo tty code?  What exactly is it?  The codepage?
> > > And why can't we just add the info to cygheap->locale at initial_setlocale()
> > > time so it's available at exec time without going through all this hassle
> > > every time?
> > >
> > > Apart from that, all this locale/charset/lcid stuff should be concentrated
> > > in nlsfunc.cc ideally.
> >
> > get_locale_from_env() and get_langinfo() should go away.  If we just
> > need a codepage for get_ttyp ()->term_code_page, we should really find a
> > way to do this from within internal_setlocale().
>
> I looked into internal_setlocale() code, but I could not found
> the code which handles thecode page. I found the code handling
> the code page in __set_charset_from_locale() function in nlsfuncs.cc,
> but it does not return code page itself. Could you please explain
> more detail of your idea?

I had none yet :)  I was just musing, without actually thinking about a
solution.  But I think this isn't very complicated.  Given this is
inside Cygwin, nothing keeps the function to have a well-defined
side-effect, as in setting a (not yet existing) member "term_code_page"
of cygheap->locale.

Kind of like this:

diff --git a/winsup/cygwin/cygheap.h b/winsup/cygwin/cygheap.h
index 8877cc358c39..2b84f4252071 100644
--- a/winsup/cygwin/cygheap.h
+++ b/winsup/cygwin/cygheap.h
@@ -341,6 +341,7 @@ struct cygheap_debug
 struct cygheap_locale
 {
   mbtowc_p mbtowc;
+  UINT term_code_page;
 };
 
 struct user_heap_info
diff --git a/winsup/cygwin/nlsfuncs.cc b/winsup/cygwin/nlsfuncs.cc
index 668d7eb9e778..752f4239d911 100644
--- a/winsup/cygwin/nlsfuncs.cc
+++ b/winsup/cygwin/nlsfuncs.cc
@@ -1298,6 +1298,9 @@ __set_charset_from_locale (const char *locale, char *charset)
     LOCALE_IDEFAULTANSICODEPAGE | LOCALE_RETURN_NUMBER,
     (PWCHAR) &cp, sizeof cp))
     cp = 0;
+  /* Store codepage in cygheap->locale so fhandler_tty can switch the
+     pseudo console to the correct codepage. */
+  cygheap->locale.term_code_page = cp ?: CP_UTF8;
   /* Translate codepage and lcid to a charset closely aligned with the default
      charsets defined in Glibc. */
   const char *cs;

Make sense?


Thanks,
Corinna
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"

Corinna Vinschen-2
On Sep  2 17:24, Corinna Vinschen wrote:

> On Sep  2 19:54, Takashi Yano via Cygwin-patches wrote:
> > Hi Corinna,
> >
> > On Wed, 2 Sep 2020 10:38:18 +0200
> > Corinna Vinschen wrote:
> > > On Sep  2 10:30, Corinna Vinschen wrote:
> > > > Ok guys, I'm not opposed to this change in terms of its result,
> > > > but I'm starting to wonder why all this locale code in fhandler_tty
> > > > is necessary at all.
> > > >
> > > > I see that get_langinfo() calls __loadlocale and performs a lot of stuff
> > > > on the charsets which looks like duplicates of the initial_setlocale()
> > > > call performed at DLL startup.
> > > >
> > > > If there's anything missing in the initial_setlocale() call which would
> > > > be required by the pseudo tty code?  What exactly is it?  The codepage?
> > > > And why can't we just add the info to cygheap->locale at initial_setlocale()
> > > > time so it's available at exec time without going through all this hassle
> > > > every time?
> > > >
> > > > Apart from that, all this locale/charset/lcid stuff should be concentrated
> > > > in nlsfunc.cc ideally.
> > >
> > > get_locale_from_env() and get_langinfo() should go away.  If we just
> > > need a codepage for get_ttyp ()->term_code_page, we should really find a
> > > way to do this from within internal_setlocale().
> >
> > I looked into internal_setlocale() code, but I could not found
> > the code which handles thecode page. I found the code handling
> > the code page in __set_charset_from_locale() function in nlsfuncs.cc,
> > but it does not return code page itself. Could you please explain
> > more detail of your idea?
>
> I had none yet :)  I was just musing, without actually thinking about a
> solution.  But I think this isn't very complicated.  Given this is
> inside Cygwin, nothing keeps the function to have a well-defined
> side-effect, as in setting a (not yet existing) member "term_code_page"
> of cygheap->locale.
>
> Kind of like this:

Actually, this is a bit too simple, but you get the idea.  We need to
align the terminal codepage with the actual Cygwin charset, along the
lines of what your setup_locale is doing standalone yet.  Except in case
of ASCII, where we default to UTF-8 internally.  The important part here
is that we do this once, and that we don't have unnecessary code
duplication.


Corinna
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"

cygwin-patches mailing list
In reply to this post by Corinna Vinschen-2
Hi Corinna,

On Wed, 2 Sep 2020 17:24:50 +0200
Corinna Vinschen  wrote:

> On Sep  2 19:54, Takashi Yano via Cygwin-patches wrote:
> > Hi Corinna,
> >
> > On Wed, 2 Sep 2020 10:38:18 +0200
> > Corinna Vinschen wrote:
> > > On Sep  2 10:30, Corinna Vinschen wrote:
> > > > Ok guys, I'm not opposed to this change in terms of its result,
> > > > but I'm starting to wonder why all this locale code in fhandler_tty
> > > > is necessary at all.
> > > >
> > > > I see that get_langinfo() calls __loadlocale and performs a lot of stuff
> > > > on the charsets which looks like duplicates of the initial_setlocale()
> > > > call performed at DLL startup.
> > > >
> > > > If there's anything missing in the initial_setlocale() call which would
> > > > be required by the pseudo tty code?  What exactly is it?  The codepage?
> > > > And why can't we just add the info to cygheap->locale at initial_setlocale()
> > > > time so it's available at exec time without going through all this hassle
> > > > every time?
> > > >
> > > > Apart from that, all this locale/charset/lcid stuff should be concentrated
> > > > in nlsfunc.cc ideally.
> > >
> > > get_locale_from_env() and get_langinfo() should go away.  If we just
> > > need a codepage for get_ttyp ()->term_code_page, we should really find a
> > > way to do this from within internal_setlocale().
> >
> > I looked into internal_setlocale() code, but I could not found
> > the code which handles thecode page. I found the code handling
> > the code page in __set_charset_from_locale() function in nlsfuncs.cc,
> > but it does not return code page itself. Could you please explain
> > more detail of your idea?
>
> I had none yet :)  I was just musing, without actually thinking about a
> solution.  But I think this isn't very complicated.  Given this is
> inside Cygwin, nothing keeps the function to have a well-defined
> side-effect, as in setting a (not yet existing) member "term_code_page"
> of cygheap->locale.
>
> Kind of like this:
>
> diff --git a/winsup/cygwin/cygheap.h b/winsup/cygwin/cygheap.h
> index 8877cc358c39..2b84f4252071 100644
> --- a/winsup/cygwin/cygheap.h
> +++ b/winsup/cygwin/cygheap.h
> @@ -341,6 +341,7 @@ struct cygheap_debug
>  struct cygheap_locale
>  {
>    mbtowc_p mbtowc;
> +  UINT term_code_page;
>  };
>  
>  struct user_heap_info
> diff --git a/winsup/cygwin/nlsfuncs.cc b/winsup/cygwin/nlsfuncs.cc
> index 668d7eb9e778..752f4239d911 100644
> --- a/winsup/cygwin/nlsfuncs.cc
> +++ b/winsup/cygwin/nlsfuncs.cc
> @@ -1298,6 +1298,9 @@ __set_charset_from_locale (const char *locale, char *charset)
>      LOCALE_IDEFAULTANSICODEPAGE | LOCALE_RETURN_NUMBER,
>      (PWCHAR) &cp, sizeof cp))
>      cp = 0;
> +  /* Store codepage in cygheap->locale so fhandler_tty can switch the
> +     pseudo console to the correct codepage. */
> +  cygheap->locale.term_code_page = cp ?: CP_UTF8;
>    /* Translate codepage and lcid to a charset closely aligned with the default
>       charsets defined in Glibc. */
>    const char *cs;
>
> Make sense?

I have tried your code, however, it does not work as expected.
It seems that __set_charset_from_locale() is not called.
cygheap->locale.term_code_page is always 0.

I have added following lines into setup_locale() to make sure
to call __set_charset_from_locale() for a test,

  setlocale (LC_ALL, "");
  __set_charset_from_locale (__get_global_locale()->categories[LC_CTYPE], charset);
  get_ttyp ()->term_code_page = cygheap->locale.term_code_page;

however, term_code_page is set to 932 if locale is ja_JP.UTF-8.
In this case term_code_page should be CP_UTF8 (65001).

The code page retrieved in __set_charset_from_locale() is not
based on "UTF-8" but "ja_JP".

Let me consider a while.

--
Takashi Yano <[hidden email]>
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"

Corinna Vinschen-2
Hi Takashi,

On Sep  3 01:25, Takashi Yano via Cygwin-patches wrote:

> Hi Corinna,
>
> On Wed, 2 Sep 2020 17:24:50 +0200
> Corinna Vinschen  wrote:
> > > > get_locale_from_env() and get_langinfo() should go away.  If we just
> > > > need a codepage for get_ttyp ()->term_code_page, we should really find a
> > > > way to do this from within internal_setlocale().
> > >
> > > I looked into internal_setlocale() code, but I could not found
> > > the code which handles thecode page. I found the code handling
> > > the code page in __set_charset_from_locale() function in nlsfuncs.cc,
> > > but it does not return code page itself. Could you please explain
> > > more detail of your idea?
> >
> > I had none yet :)  I was just musing, without actually thinking about a
> > solution.  But I think this isn't very complicated.  Given this is
> > inside Cygwin, nothing keeps the function to have a well-defined
> > side-effect, as in setting a (not yet existing) member "term_code_page"
> > of cygheap->locale.
> >
> > Kind of like this:
> > [...]
> I have tried your code, however, it does not work as expected.
> It seems that __set_charset_from_locale() is not called.
> cygheap->locale.term_code_page is always 0.

Ah, right!  Take a look into newlib/libc/locale/locale.c, function
__loadlocale().  This function is called from _setlocale_r().  However,
it calls __set_charset_from_locale() *only* if the charset isn't already
given explicitely in the LC_* or LANG string, because otherwise we
already know the charset, after all.

Darn!  That foils my plans for world domination...

> Let me consider a while.

Thanks, I'll do the same.  I'd really like to simplify this stuff
and doing the locale shuffle in two entirely different locations
at different times is prone to getting out of sync.


Corinna
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"

Corinna Vinschen-2
On Sep  2 18:38, Corinna Vinschen wrote:

> Hi Takashi,
>
> On Sep  3 01:25, Takashi Yano via Cygwin-patches wrote:
> > Hi Corinna,
> >
> > On Wed, 2 Sep 2020 17:24:50 +0200
> > Corinna Vinschen  wrote:
> > > > > get_locale_from_env() and get_langinfo() should go away.  If we just
> > > > > need a codepage for get_ttyp ()->term_code_page, we should really find a
> > > > > way to do this from within internal_setlocale().
> > > >
> > > > I looked into internal_setlocale() code, but I could not found
> > > > the code which handles thecode page. I found the code handling
> > > > the code page in __set_charset_from_locale() function in nlsfuncs.cc,
> > > > but it does not return code page itself. Could you please explain
> > > > more detail of your idea?
> > >
> > > I had none yet :)  I was just musing, without actually thinking about a
> > > solution.  But I think this isn't very complicated.  Given this is
> > > inside Cygwin, nothing keeps the function to have a well-defined
> > > side-effect, as in setting a (not yet existing) member "term_code_page"
> > > of cygheap->locale.
> > >
> > > Kind of like this:
> > > [...]
> > I have tried your code, however, it does not work as expected.
> > It seems that __set_charset_from_locale() is not called.
> > cygheap->locale.term_code_page is always 0.
>
> Ah, right!  Take a look into newlib/libc/locale/locale.c, function
> __loadlocale().  This function is called from _setlocale_r().  However,
> it calls __set_charset_from_locale() *only* if the charset isn't already
> given explicitely in the LC_* or LANG string, because otherwise we
> already know the charset, after all.
>
> Darn!  That foils my plans for world domination...
>
> > Let me consider a while.
>
> Thanks, I'll do the same.  I'd really like to simplify this stuff
> and doing the locale shuffle in two entirely different locations
> at different times is prone to getting out of sync.

The only idea I had so far was, changing the way __set_charset_from_locale
works from within _setlocale_r:

We could add a Cygwin-specific function only fetching the codepage and
call it unconditionally from _setlocale_r.  __set_charset_from_locale is
called with a new parameter "codepage", so it doesn't have to fetch the
CP by itself, but it's still only called from _setlocale_r if necessary.

Would that be sufficient?  The CP conversion from 20127/ASCII to 65001/UTF8
could be done at the point the codepage is actually required.


Corinna
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"

cygwin-patches mailing list
Hi Corinna,

On Thu, 3 Sep 2020 19:59:12 +0200
Corinna Vinschen wrote:

> On Sep  2 18:38, Corinna Vinschen wrote:
> > Hi Takashi,
> >
> > On Sep  3 01:25, Takashi Yano via Cygwin-patches wrote:
> > > Hi Corinna,
> > >
> > > On Wed, 2 Sep 2020 17:24:50 +0200
> > > Corinna Vinschen  wrote:
> > > > > > get_locale_from_env() and get_langinfo() should go away.  If we just
> > > > > > need a codepage for get_ttyp ()->term_code_page, we should really find a
> > > > > > way to do this from within internal_setlocale().
> > > > >
> > > > > I looked into internal_setlocale() code, but I could not found
> > > > > the code which handles thecode page. I found the code handling
> > > > > the code page in __set_charset_from_locale() function in nlsfuncs.cc,
> > > > > but it does not return code page itself. Could you please explain
> > > > > more detail of your idea?
> > > >
> > > > I had none yet :)  I was just musing, without actually thinking about a
> > > > solution.  But I think this isn't very complicated.  Given this is
> > > > inside Cygwin, nothing keeps the function to have a well-defined
> > > > side-effect, as in setting a (not yet existing) member "term_code_page"
> > > > of cygheap->locale.
> > > >
> > > > Kind of like this:
> > > > [...]
> > > I have tried your code, however, it does not work as expected.
> > > It seems that __set_charset_from_locale() is not called.
> > > cygheap->locale.term_code_page is always 0.
> >
> > Ah, right!  Take a look into newlib/libc/locale/locale.c, function
> > __loadlocale().  This function is called from _setlocale_r().  However,
> > it calls __set_charset_from_locale() *only* if the charset isn't already
> > given explicitely in the LC_* or LANG string, because otherwise we
> > already know the charset, after all.
> >
> > Darn!  That foils my plans for world domination...
> >
> > > Let me consider a while.
> >
> > Thanks, I'll do the same.  I'd really like to simplify this stuff
> > and doing the locale shuffle in two entirely different locations
> > at different times is prone to getting out of sync.
>
> The only idea I had so far was, changing the way __set_charset_from_locale
> works from within _setlocale_r:
>
> We could add a Cygwin-specific function only fetching the codepage and
> call it unconditionally from _setlocale_r.  __set_charset_from_locale is
> called with a new parameter "codepage", so it doesn't have to fetch the
> CP by itself, but it's still only called from _setlocale_r if necessary.
>
> Would that be sufficient?  The CP conversion from 20127/ASCII to 65001/UTF8
> could be done at the point the codepage is actually required.
I think I have found the answer to your request.
Patch attached. What do you think of this patch?

Calling initial_setlocale() is necessary because
nl_langinfo() always returns "ANSI_X3.4-1968"
regardless locale setting if this is not called.

--
Takashi Yano <[hidden email]>

0001-Cygwin-pty-Replace-pty-specific-locale-functions-wit.patch (6K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"

cygwin-patches mailing list
In reply to this post by Johannes Schindelin
Hi Johannes and Corinna,

On Tue, 1 Sep 2020 18:19:16 +0200 (CEST)
Johannes Schindelin wrote:

> When `LANG=en_US.UTF-8`, the detected `LCID` is 0x0409, which is
> correct, but after that (at least if Pseudo Console support is enabled),
> we try to find the default code page for that `LCID`, which is ASCII
> (437). Subsequently, we set the Console output code page to that value,
> completely ignoring that we wanted to use UTF-8.
>
> Let's not ignore the specifically asked-for UTF-8 character set.
>
> While at it, let's also set the Console output code page even if Pseudo
> Console support is disabled; contrary to the behavior of v3.0.7, the
> Console output code page is not ignored in that case.
>
> The most common symptom would be that console applications which do not
> specifically call `SetConsoleOutputCP()` but output UTF-8-encoded text
> seem to be broken with v3.1.x when they worked plenty fine with v3.0.x.
>
> This fixes https://github.com/msys2/MSYS2-packages/issues/1974,
> https://github.com/msys2/MSYS2-packages/issues/2012,
> https://github.com/rust-lang/cargo/issues/8369,
> https://github.com/git-for-windows/git/issues/2734,
> https://github.com/git-for-windows/git/issues/2793,
> https://github.com/git-for-windows/git/issues/2792, and possibly quite a
> few others.
>
> Signed-off-by: Johannes Schindelin <[hidden email]>
> ---
>  winsup/cygwin/fhandler_tty.cc | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/winsup/cygwin/fhandler_tty.cc b/winsup/cygwin/fhandler_tty.cc
> index 06789a500..414c26992 100644
> --- a/winsup/cygwin/fhandler_tty.cc
> +++ b/winsup/cygwin/fhandler_tty.cc
> @@ -2859,6 +2859,15 @@ fhandler_pty_slave::setup_locale (void)
>    char charset[ENCODING_LEN + 1] = "ASCII";
>    LCID lcid = get_langinfo (locale, charset);
>
> +  /* Special-case the UTF-8 character set */
> +  if (strcasecmp (charset, "UTF-8") == 0)
> +    {
> +      get_ttyp ()->term_code_page = CP_UTF8;
> +      SetConsoleCP (CP_UTF8);
> +      SetConsoleOutputCP (CP_UTF8);
> +      return;
> +    }
> +
>    /* Set console code page from locale */
>    if (get_pseudo_console ())
>      {
> --
> 2.27.0
I would like to propose a counter patch attached.
What do you think of this patch?

This patch does not treat UTF-8 as special.

--
Takashi Yano <[hidden email]>

0001-Cygwin-pty-Prevent-garbled-output-when-pseudo-consol.patch (1K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"

Johannes Schindelin
Hi Takashi,

On Fri, 4 Sep 2020, Takashi Yano via Cygwin-patches wrote:

> Hi Johannes and Corinna,
>
> On Tue, 1 Sep 2020 18:19:16 +0200 (CEST)
> Johannes Schindelin wrote:
>
> > When `LANG=en_US.UTF-8`, the detected `LCID` is 0x0409, which is
> > correct, but after that (at least if Pseudo Console support is enabled),
> > we try to find the default code page for that `LCID`, which is ASCII
> > (437). Subsequently, we set the Console output code page to that value,
> > completely ignoring that we wanted to use UTF-8.
> >
> > Let's not ignore the specifically asked-for UTF-8 character set.
> >
> > While at it, let's also set the Console output code page even if Pseudo
> > Console support is disabled; contrary to the behavior of v3.0.7, the
> > Console output code page is not ignored in that case.
> >
> > The most common symptom would be that console applications which do not
> > specifically call `SetConsoleOutputCP()` but output UTF-8-encoded text
> > seem to be broken with v3.1.x when they worked plenty fine with v3.0.x.
> >
> > This fixes https://github.com/msys2/MSYS2-packages/issues/1974,
> > https://github.com/msys2/MSYS2-packages/issues/2012,
> > https://github.com/rust-lang/cargo/issues/8369,
> > https://github.com/git-for-windows/git/issues/2734,
> > https://github.com/git-for-windows/git/issues/2793,
> > https://github.com/git-for-windows/git/issues/2792, and possibly quite a
> > few others.
> >
> > Signed-off-by: Johannes Schindelin <[hidden email]>
> > ---
> >  winsup/cygwin/fhandler_tty.cc | 9 +++++++++
> >  1 file changed, 9 insertions(+)
> >
> > diff --git a/winsup/cygwin/fhandler_tty.cc b/winsup/cygwin/fhandler_tty.cc
> > index 06789a500..414c26992 100644
> > --- a/winsup/cygwin/fhandler_tty.cc
> > +++ b/winsup/cygwin/fhandler_tty.cc
> > @@ -2859,6 +2859,15 @@ fhandler_pty_slave::setup_locale (void)
> >    char charset[ENCODING_LEN + 1] = "ASCII";
> >    LCID lcid = get_langinfo (locale, charset);
> >
> > +  /* Special-case the UTF-8 character set */
> > +  if (strcasecmp (charset, "UTF-8") == 0)
> > +    {
> > +      get_ttyp ()->term_code_page = CP_UTF8;
> > +      SetConsoleCP (CP_UTF8);
> > +      SetConsoleOutputCP (CP_UTF8);
> > +      return;
> > +    }
> > +
> >    /* Set console code page from locale */
> >    if (get_pseudo_console ())
> >      {
> > --
> > 2.27.0
>
> I would like to propose a counter patch attached.
> What do you think of this patch?
>
> This patch does not treat UTF-8 as special.

Sure, but it only fixes the issue in `disable_pcon` mode in the current
tip commit. That's because a new Pseudo Console is created for every
spawned non-Cygwin console application, and that new Pseudo Console does
_not_ have the code page set by your patch.

I verified that this patch works around the issue in `disable_pcon` mode,
which is good.

Ciao,
Dscho
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"

Corinna Vinschen-2
In reply to this post by cygwin-patches mailing list
Hi Takashi,

On Sep  4 18:21, Takashi Yano via Cygwin-patches wrote:

> Hi Corinna,
>
> On Thu, 3 Sep 2020 19:59:12 +0200
> Corinna Vinschen wrote:
> > The only idea I had so far was, changing the way __set_charset_from_locale
> > works from within _setlocale_r:
> >
> > We could add a Cygwin-specific function only fetching the codepage and
> > call it unconditionally from _setlocale_r.  __set_charset_from_locale is
> > called with a new parameter "codepage", so it doesn't have to fetch the
> > CP by itself, but it's still only called from _setlocale_r if necessary.
> >
> > Would that be sufficient?  The CP conversion from 20127/ASCII to 65001/UTF8
> > could be done at the point the codepage is actually required.
>
> I think I have found the answer to your request.
> Patch attached. What do you think of this patch?
>
> Calling initial_setlocale() is necessary because
> nl_langinfo() always returns "ANSI_X3.4-1968"
> regardless locale setting if this is not called.
Well, this is correct.  Per POSIX, a standard-conformant application is
running in the "C" locale unless it calls setlocale() explicitely.
That's one reason Cygwin defaults to UTF-8 internally.  Everything,
including the terminal, is supposed to default to UTF-8.  That's the
most sane default, even if an application is not locale-aware.

So, to follow POSIX, initial_setlocale() is used to set up the
environment and command line stuff and then, before calling the
application's main, Cygwin calls _setlocale_r (_REENT, LC_CTYPE, "C");
to reset the apps default locale to "C".  That's why nl_langinfo()
returns "ANSI_X3.4-1968".

However, the initial_setlocale() call in dll_crt0_1 calls
internal_setlocale(), and *that* function sets the conversion functions
for the internal conversions.  What it *doesn't* do yet at the moment is
to store the charset name itself or, better, the equivalent codepage.

If we change that, setup_locale can simply go away.  Below is a patch
doing just that.  Can you please check if that works in your test
scenarios?

However, there's something which worries me.  Why do we need or set the
pseudo terminal codepage at all?  I see that you convert from MB charset
to MB charset and then use the result in WriteFile to the connecting
pipes.  Question is this: Why not just converting the strings via
sys_mbstowcs to a UTF-16 string and then send that over the line, using
WriteConsoleW for the final output to the console?  That would simplify
this stuff quite a bit, wouldn't it?  After all, for writing UTF-16 to
the console, we simply don't need to know or care for the console CP.


Thanks,
Corinna

0001-Cygwin-fetch-codepage-for-pseudo-console-at-initial_.patch (11K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"

Brian Inglis
On 2020-09-04 06:44, Corinna Vinschen wrote:

> Hi Takashi,
>
> On Sep  4 18:21, Takashi Yano via Cygwin-patches wrote:
>> Hi Corinna,
>>
>> On Thu, 3 Sep 2020 19:59:12 +0200
>> Corinna Vinschen wrote:
>>> The only idea I had so far was, changing the way __set_charset_from_locale
>>> works from within _setlocale_r:
>>>
>>> We could add a Cygwin-specific function only fetching the codepage and
>>> call it unconditionally from _setlocale_r.  __set_charset_from_locale is
>>> called with a new parameter "codepage", so it doesn't have to fetch the
>>> CP by itself, but it's still only called from _setlocale_r if necessary.
>>>
>>> Would that be sufficient?  The CP conversion from 20127/ASCII to 65001/UTF8
>>> could be done at the point the codepage is actually required.
>>
>> I think I have found the answer to your request.
>> Patch attached. What do you think of this patch?
>>
>> Calling initial_setlocale() is necessary because
>> nl_langinfo() always returns "ANSI_X3.4-1968"
>> regardless locale setting if this is not called.
>
> Well, this is correct.  Per POSIX, a standard-conformant application is
> running in the "C" locale unless it calls setlocale() explicitely.
> That's one reason Cygwin defaults to UTF-8 internally.  Everything,
> including the terminal, is supposed to default to UTF-8.  That's the
> most sane default, even if an application is not locale-aware.
>
> So, to follow POSIX, initial_setlocale() is used to set up the
> environment and command line stuff and then, before calling the
> application's main, Cygwin calls _setlocale_r (_REENT, LC_CTYPE, "C");
> to reset the apps default locale to "C".  That's why nl_langinfo()
> returns "ANSI_X3.4-1968".
>
> However, the initial_setlocale() call in dll_crt0_1 calls
> internal_setlocale(), and *that* function sets the conversion functions
> for the internal conversions.  What it *doesn't* do yet at the moment is
> to store the charset name itself or, better, the equivalent codepage.
>
> If we change that, setup_locale can simply go away.  Below is a patch
> doing just that.  Can you please check if that works in your test
> scenarios?
>
> However, there's something which worries me.  Why do we need or set the
> pseudo terminal codepage at all?  I see that you convert from MB charset
> to MB charset and then use the result in WriteFile to the connecting
> pipes.  Question is this: Why not just converting the strings via
> sys_mbstowcs to a UTF-16 string and then send that over the line, using
> WriteConsoleW for the final output to the console?  That would simplify
> this stuff quite a bit, wouldn't it?  After all, for writing UTF-16 to
> the console, we simply don't need to know or care for the console CP.

IIRC his locale was ja_JP.UTF-8 but he got English messages on CP 932!

--
Take care. Thanks, Brian Inglis, Calgary, Alberta, Canada

This email may be disturbing to some readers as it contains
too much technical detail. Reader discretion is advised.
[Data in IEC units and prefixes, physical quantities in SI.]
123