Unicode width data inconsistent/outdated

Thomas Wolff
It would be good to keep wcwidth/wcswidth in sync with the installed
Unicode data version (package unicode-ucd).
Currently it seems to be hard-coded (in newlib/libc/string/wcwidth.c);
it refers to Unicode 5.0 while installed Unicode data suggest 9.0 would
be used.
I can provide some scripts to generate the respective tables if desired.
Thomas



--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple

Re: Unicode width data inconsistent/outdated

Corinna Vinschen
On Jul 26 08:49, Thomas Wolff wrote:
> It would be good to keep wcwidth/wcswidth in sync with the installed
> Unicode data version (package unicode-ucd).
> Currently it seems to be hard-coded (in newlib/libc/string/wcwidth.c);
> it refers to Unicode 5.0 while installed Unicode data suggest 9.0 would
> be used.
> I can provide some scripts to generate the respective tables if desired.
> Thomas

If you can update the newlib files this way and send matching patches
to the newlib list, this would be highly appreciated.


Thanks,
Corinna

--
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Maintainer                 cygwin AT cygwin DOT com
Red Hat


Re: Unicode width data inconsistent/outdated

Yaakov Selkowitz
On 2017-07-26 03:08, Corinna Vinschen wrote:

> On Jul 26 08:49, Thomas Wolff wrote:
>> [...]
>
> If you can update the newlib files this way and send matching patches
> to the newlib list, this would be highly appreciated.
Thomas, I just updated unicode-ucd to 10.0 for this purpose.

--
Yaakov



Re: Unicode width data inconsistent/outdated

Corinna Vinschen
On Jul 26 03:16, Yaakov Selkowitz wrote:

> On 2017-07-26 03:08, Corinna Vinschen wrote:
> > On Jul 26 08:49, Thomas Wolff wrote:
> >> [...]
> >
> > If you can update the newlib files this way and send matching patches
> > to the newlib list, this would be highly appreciated.
>
> Thomas, I just updated unicode-ucd to 10.0 for this purpose.
Oh, and, btw, the comment in wcwidth.c isn't quite correct.  The
cwstate in newlib is on Unicode 5.2, see newlib/libc/ctype/towupper.c.


Corinna


Re: Unicode width data inconsistent/outdated

Thomas Wolff
Am 26.07.2017 um 11:50 schrieb Corinna Vinschen:

> On Jul 26 03:16, Yaakov Selkowitz wrote:
>> On 2017-07-26 03:08, Corinna Vinschen wrote:
>>> On Jul 26 08:49, Thomas Wolff wrote:
>>>> [...]
>>> If you can update the newlib files this way and send matching patches
>>> to the newlib list, this would be highly appreciated.
>> Thomas, I just updated unicode-ucd to 10.0 for this purpose.
Thanks.
>
> Oh, and, btw, the comment in wcwidth.c isn't quite correct.  The
> cwstate in newlib is on Unicode 5.2, see newlib/libc/ctype/towupper.c.
Oh, a number of other embedded tables. To make the tow* and isw*
functions more easily adaptable to Unicode updates, there will be some
revisions to do here. And the to* and is* ones (without 'w') even refer
to locales in a way I do not understand. Maybe I'll restrict my effort
to wcwidth first...
Thomas


Re: Unicode width data inconsistent/outdated

Corinna Vinschen
On Jul 26 23:43, Thomas Wolff wrote:

> Am 26.07.2017 um 11:50 schrieb Corinna Vinschen:
> > On Jul 26 03:16, Yaakov Selkowitz wrote:
> > > On 2017-07-26 03:08, Corinna Vinschen wrote:
> > > > On Jul 26 08:49, Thomas Wolff wrote:
> > > > > [...]
> > > > If you can update the newlib files this way and send matching patches
> > > > to the newlib list, this would be highly appreciated.
> > > Thomas, I just updated unicode-ucd to 10.0 for this purpose.
> Thanks.
> >
> > Oh, and, btw, the comment in wcwidth.c isn't quite correct.  The
> > cwstate in newlib is on Unicode 5.2, see newlib/libc/ctype/towupper.c.
> Oh, a number of other embedded tables. To make the tow* and isw* functions
> more easily adaptable to Unicode updates, there will be some revisions to do
> here. And the to* and is* ones (without 'w') even refer to locales in a way
> I do not understand. Maybe I'll restrict my effort to wcwidth first...
The to* and is* ones (without 'w') don't matter at all and you don't
have to touch them.

The Unicode stuff only affects the tow and isw functions.

As for how to fetch the data, you may want to have a look into
newlib/libc/ctype/utf8alpha.h and newlib/libc/ctype/utf8print.h.  The
header comments contain the awk scripts used to collect the data.

All other isw* files like iswblank.c contain comments explaining
what Unicode character categories are covered.


Corinna


Re: Unicode width data inconsistent/outdated

Thomas Wolff
Am 28.07.2017 um 21:58 schrieb Corinna Vinschen:

> On Jul 26 23:43, Thomas Wolff wrote:
>> Am 26.07.2017 um 11:50 schrieb Corinna Vinschen:
>>> On Jul 26 03:16, Yaakov Selkowitz wrote:
>>>> On 2017-07-26 03:08, Corinna Vinschen wrote:
>>>>> On Jul 26 08:49, Thomas Wolff wrote:
>>>>>> [...]
>>>>> If you can update the newlib files this way and send matching patches
>>>>> to the newlib list, this would be highly appreciated.
>>>> Thomas, I just updated unicode-ucd to 10.0 for this purpose.
>> Thanks.
>>> Oh, and, btw, the comment in wcwidth.c isn't quite correct.  The
>>> cwstate in newlib is on Unicode 5.2, see newlib/libc/ctype/towupper.c.
>> Oh, a number of other embedded tables. To make the tow* and isw* functions
>> more easily adaptable to Unicode updates, there will be some revisions to do
>> here. And the to* and is* ones (without 'w') even refer to locales in a way
>> I do not understand. Maybe I'll restrict my effort to wcwidth first...
> The to* and is* ones (without 'w') don't matter at all and you don't
> have to touch them.
>
> The Unicode stuff only affects the tow and isw functions.
>
> As for how to fetch the data, you may want to have a look into
> newlib/libc/ctype/utf8alpha.h and newlib/libc/ctype/utf8print.h.  The
> header comments contain the awk scripts used to collect the data.
But there are no instructions to adapt the embedded conditional
statements referring to those data...
My attempt would be to base the functions on a common table of character
categories instead.

> All other isw* files like iswblank.c contain comments explaining
> what Unicode character categories are covered.
I'm comparing results based on Unicode 5.2 data. There will be some
deviations and maybe some things to discuss.
For example, I wonder why in the current implementation currency symbols
are considered as punctuation (which can be easily reproduced).

Also, there are 3 other issues:


Issue 1 is about handling non-BMP characters by wcwidth.
This has been discussed before.

On Mon, 31 Jan 2011 09:58:19 -0700
(https://sourceware.org/ml/cygwin/2011-01/msg00453.html)
Eric Blake wrote:
> POSIX requires that 1 wchar_t corresponds to 1 character; so any use
> of surrogates to get the full benefit of UTF-16 falls outside the
> bounds of POSIX.
> At which point, the POSIX definition of those functions no longer
> apply, and we can (try) to make the various wc* functions try to
> behave as smartly as possible (as is the case with Cygwin); where
> those smarts are only needed when you use surrogate pairs.

On Wed, 2 Feb 2011 12:29:03 +0100
(https://sourceware.org/ml/cygwin/2011-02/msg00037.html)
Bruno Haible wrote:
> Code that uses <wctype.h> and wcwidth() is written precisely according
> to POSIX.
> The problem is that this code cannot work correctly when wchar_t[] is
> in UTF-16 encoding.
> There simply is no way to define these functions in a reasonable way
> for surrogates.
I don’t agree with this, see below.

On Wed, 2 Feb 2011 13:21:02 +0100
(https://sourceware.org/ml/cygwin/2011-02/msg00040.html)
Corinna Vinschen wrote:
> And, please note the wording in SUSv4, for instance in
> http://calimero.vinschen.de/susv4/functions/iswalpha.html
(not found)
>   The wc argument is a wint_t, the value of which the application shall
>                        ^^^^^^                         ^^^^^^^^^^^
>   ensure is a wide-character code corresponding to a valid character
> in the current locale, or equal to the value of the macro WEOF. If the
> argument has any other value, the behavior is undefined.
> I don't see any words in that which would disallow to convert UTF-16
> wchar_t surrogates to a wint_t UTF-32 value before calling one of the
> wctype functions.  Just like you have to be careful not to call the
> ctype functions with a signed char.

While wcswidth works already (using internal __wcwidth), and the isw*
and tow* functions work as well because they use wint_t, wcwidth is the
only function (inconsistently insisting on wchar_t) that does not work.
But note https://linux.die.net/man/3/wcwidth which says
> Note that glibc before 2.2.5 used the prototype
> int wcwidth(wint_t c);
Why not revert to wcwidth(wint_t)?
I think for cygwin it is the only solution that makes wcwidth work for
non-BMP characters and is also compatible (unlike some proposals
discussed later in the quoted thread).


Issue 2 is the handling of titlecase characters (e.g. "Nj" as one
Unicode character U+01CB). The current implementation considers them to
be both upper and lower (iswupper: return towlower (c) != c); I'd rather
consider them as neither upper nor lower (iswalpha (c) && towupper (c)
== c).
https://linux.die.net/man/3/iswupper allows both interpretations:
> The wide-character class "upper" contains *at least* those characters
> wc which are equal to towupper(wc) and different from towlower(wc).


Issue 3 is the special conversion jp2uc, which seems half-baked; there
is no such handling for Chinese or Korean.
If by definition the arguments of isw* functions are not Unicode but
wide characters according to the current locale (I am not sure where
that is defined), they must be transformed for all locales (CJK and
also 8-bit ones);
also in towupper and towlower the result must be transformed back to the
current locale encoding (currently missing).


Thomas



Re: Unicode width data inconsistent/outdated

Corinna Vinschen
On Aug  3 21:44, Thomas Wolff wrote:

> Am 28.07.2017 um 21:58 schrieb Corinna Vinschen:
> > On Jul 26 23:43, Thomas Wolff wrote:
> > > Am 26.07.2017 um 11:50 schrieb Corinna Vinschen:
> > > > On Jul 26 03:16, Yaakov Selkowitz wrote:
> > > > > On 2017-07-26 03:08, Corinna Vinschen wrote:
> > > > > > On Jul 26 08:49, Thomas Wolff wrote:
> > > > > > > [...]
> > > > > > If you can update the newlib files this way and send matching patches
> > > > > > to the newlib list, this would be highly appreciated.
> > > > > Thomas, I just updated unicode-ucd to 10.0 for this purpose.
> > > Thanks.
> > > > Oh, and, btw, the comment in wcwidth.c isn't quite correct.  The
> > > > cwstate in newlib is on Unicode 5.2, see newlib/libc/ctype/towupper.c.
> > > Oh, a number of other embedded tables. To make the tow* and isw* functions
> > > more easily adaptable to Unicode updates, there will be some revisions to do
> > > here. And the to* and is* ones (without 'w') even refer to locales in a way
> > > I do not understand. Maybe I'll restrict my effort to wcwidth first...
> > The to* and is* ones (without 'w') don't matter at all and you don't
> > have to touch them.
> >
> > The Unicode stuff only affects the tow and isw functions.
> >
> > As for how to fetch the data, you may want to have a look into
> > newlib/libc/ctype/utf8alpha.h and newlib/libc/ctype/utf8print.h.  The
> > header comments contain the awk scripts used to collect the data.
> But there are no instructions to adapt the embedded conditional statements
> referring to those data...
Tables are scanned in order.  Each table handles a range of 256
characters.  A table entry holds the lower 8 bits of a character which
matches the condition.  A 0 entry (except in array position 0) is
a continuation marker, which means all chars between the previous and
the next value match the condition.

Here's an example from utf8alpha.h:

  static const unsigned char ua7[] = {
    0x17, 0x0, 0x1f, 0x22, 0x0, 0x88,
    0x8b, 0x8c,
    0xfb, 0x0, 0xff };

ua7 is the array handling the characters in the range 0xa700 up to 0xa7ff.
The first alpha character in this range is 0xa717.  The next char in the
array is a 0x0, followed by 0x1f.  That means all characters from 0xa717
up to 0xa71f are alphas.  Then we have a 0x22, a 0, and a 0x88.  So all
chars from 0xa722 up to 0xa788 are alphas.  Then we have two chars not
followed by a 0, so they just stand for themselves: 0xa78b and 0xa78c
are alpha chars.  The last group 0xfb, 0x0, 0xff of course means
0xa7fb up to 0xa7ff are alpha chars.

> My attempt would be to base the functions on a common table of character
> categories instead.

Keep in mind that the table is not loaded into memory on demand, as on
Linux.  Rather, it will be part of the Cygwin DLL and, worse in the case
of newlib, of any target using the wctype functions.

The idea here is that the tables take less space than a full-fledged
category table.  The tables in utf8print.h and utf8alpha.h and the code
in iswalpha and iswprint combined are 10K, code and data of the
tolower/toupper functions are 7K, wcwidth 3K, so a total of 20K,
covering Unicode 5.2 with 107K codepoints.

A category table would have to contain the category bits for the entire
Unicode codepoint range.  The number of potential bits is > 8 as far as I
know so it needs 2 bytes per char, but let's make that 1 byte for now.
For Unicode 5.2 only the table would be at least 107K, and that would
only cover the iswXXX functions.

> > All other isw* files like iswblank.c contain comments explaining
> > what Unicode character categories are covered.
> I'm comparing results based on Unicode 5.2 data. There will be some
> deviations and maybe some things to discuss.
> For example, I wonder why in the current implementation currency symbols are
> considered as punctuation (which can be easily reproduced).

  iswpunct (c) == !iswalnum (c) && iswgraph (c)

Linux man page claims:

  This function's name is a misnomer when dealing with Unicode
  characters, because the wide-character class "punct" contains both
  punctuation characters and symbol (math, currency, etc.) characters.

> Also, there are 3 other issues:
>
> Issue 1 is about handling non-BMP characters by wcwidth.
> This has been discussed before.
> [...]
> (https://sourceware.org/ml/cygwin/2011-02/msg00040.html)
> Corinna Vinschen wrote:
> > And, please note the wording in SUSv4, for instance in
> > http://calimero.vinschen.de/susv4/functions/iswalpha.html
> (not found)
oops, good one.  Just see the upstream SUSv4 iswalpha man page.

> >   The wc argument is a wint_t, the value of which the application shall
> >                        ^^^^^^                         ^^^^^^^^^^^
> >   ensure is a wide-character code corresponding to a valid character in
> > the current locale, or equal to the value of the macro WEOF. If the
> > argument has any other value, the behavior is undefined.
> > I don't see any words in that which would disallow to convert UTF-16
> > wchar_t surrogates to a wint_t UTF-32 value before calling one of the
> > wctype functions.  Just like you have to be careful not to call the
> > ctype functions with a signed char.
>
> While wcswidth works already (using internal __wcwidth), and the isw* and
> tow* functions work as well because they use wint_t, wcwidth is the only
> function (inconsistently insisting on wchar_t) that does not work.
Trying to be close to the standard here.

> But note https://linux.die.net/man/3/wcwidth which says
> > Note that glibc before 2.2.5 used the prototype
> > int wcwidth(wint_t c);
> Why not revert to wcwidth(wint_t)?
> I think for cygwin it is the only solution that makes wcwidth work for
> non-BMP characters and is also compatible (unlike some proposals discussed
> later in the quoted thread).

We can do this, but it may result in complaints from the other
newlib consumers.  If in doubt, use #ifdef __CYGWIN__

> Issue 2 is the handling of titlecase characters (e.g. "Nj" as one Unicode
> character U+01CB). The current implementation considers them to be both
> upper and lower (iswupper: return towlower (c) != c); I'd rather consider
> them as neither upper nor lower (iswalpha (c) && towupper (c) == c).
> https://linux.die.net/man/3/iswupper allows both interpretations:
> > The wide-character class "upper" contains *at least* those characters wc
> > which are equal to towupper(wc) and different from towlower(wc).

SUSv4 says "The iswupper() [...] functions shall test whether wc is a
wide-character code representing a character of class upper." Whatever
does that correctly with a low footprint is fine.

> Issue 3 is the special conversion jp2uc which seems to be half-bred; there
> is no such handling for Chinese or Korean.

This shouldn't matter to you, just keep it in place.  It's a historical,
low-footprint conversion for Japanese characters without pulling in the
Unicode stuff.  Not used on Cygwin, so just ignore it.


Thanks,
Corinna


Re: Unicode width data inconsistent/outdated

Thomas Wolff
Am 04.08.2017 um 19:01 schrieb Corinna Vinschen:

> On Aug  3 21:44, Thomas Wolff wrote:
>> Am 28.07.2017 um 21:58 schrieb Corinna Vinschen:
>>> On Jul 26 23:43, Thomas Wolff wrote:
>>>> Am 26.07.2017 um 11:50 schrieb Corinna Vinschen:
>>>>> On Jul 26 03:16, Yaakov Selkowitz wrote:
>>>>>> On 2017-07-26 03:08, Corinna Vinschen wrote:
>>>>>>> On Jul 26 08:49, Thomas Wolff wrote:
>>>>>>>> [...]
>>>>>>> If you can update the newlib files this way and send matching patches
>>>>>>> to the newlib list, this would be highly appreciated.
>>>>>> Thomas, I just updated unicode-ucd to 10.0 for this purpose.
>>>> Thanks.
>>>>> Oh, and, btw, the comment in wcwidth.c isn't quite correct.  The
>>>>> cwstate in newlib is on Unicode 5.2, see newlib/libc/ctype/towupper.c.
>>>> Oh, a number of other embedded tables. To make the tow* and isw* functions
>>>> more easily adaptable to Unicode updates, there will be some revisions to do
>>>> here. And the to* and is* ones (without 'w') even refer to locales in a way
>>>> I do not understand. Maybe I'll restrict my effort to wcwidth first...
>>> The to* and is* ones (without 'w') don't matter at all and you don't
>>> have to touch them.
>>>
>>> The Unicode stuff only affects the tow and isw functions.
>>>
>>> As for how to fetch the data, you may want to have a look into
>>> newlib/libc/ctype/utf8alpha.h and newlib/libc/ctype/utf8print.h.  The
>>> header comments contain the awk scripts used to collect the data.
>> But there are no instructions to adapt the embedded conditional statements
>> referring to those data...
> Tables are ...
I had gained an impression of how the tables work. Yet there is no
automatic mechanism to generate the data-based conditionals in the code,
which would also need to be adapted for Unicode updates. Therefore:
>> My attempt would be to base the functions on a common table of character categories instead.
> Keep in mind that the table is not loaded into memory on demand, as on
> Linux.  Rather it will be part of the Cygwin DLL, and worse in case
> newlib, any target using the wctype functions.
Maybe we could change that (load on demand, or put them in a shared
library perhaps), but...

> The idea here is that the tables take less space than a full-fledged
> category table.  The tables in utf8print.h and utf8alpha.h and the code
> in iswalpha and iswprint combined are 10K, code and data of the
> tolower/toupper functions are 7K, wcwidth 3K, so a total of 20K,
> covering Unicode 5.2 with 107K codepoints.
>
> A category table would have to contain the category bits for the entire
> Unicode codepoint range.  The number of potential bits is > 8 as far as I
> know so it needs 2 bytes per char, but let's make that 1 byte for now.
> For Unicode 5.2 only the table would be at least 107K, and that would
> only cover the iswXXX functions.
I have a working version now, and it uses much less space, as the
category table is range-based.
Another table is needed for case conversion. Size estimates are as
follows (based on Unicode 5.2 for a fair comparison; they go up a little
for 10.0, of course):

Categories: 2313 entries (10.0: 2715)
each entry needs 9 bytes, total 20817 bytes
I don't know whether that expands by some word-alignment.
I could pack entries to 7 bytes, or even 6 bytes if that helps (total
16191 or 13878).

Case conversion: 2062 entries (10.0: 2621)
each entry needs 12 bytes, total 24744
packed 8 bytes, total 16496

The Categories table could be boiled down to 1223 entries (penalty:
double runtime for iswupper and iswlower)
The Case conversion table could be transformed to a compact form
Case conversion compact: 1201 entries
each entry needs 16 bytes, total 19216
packed 12 or 11 (or even 10), total 14412 (or 12010)

So I think the increase is acceptable for the benefit of simple and
automatic generation, and also more efficient processing by some of the
functions. They would also apply to more functions, e.g. iswdigit, which
would then accept all Unicode digits, not just the ASCII ones.

> ...
>> Also, there are 3 other issues:
>>
>> Issue 1 is about handling non-BMP characters by wcwidth.
>> This has been discussed before.
>> [...]
>> ...
>>
>>
>> While wcswidth works already (using internal __wcwidth), and the isw* and
>> tow* functions work as well because they use wint_t, wcwidth is the only
>> function (inconsistently insisting on wchar_t) that does not work.
> Trying to be close to the standard here.
>
>> But note https://linux.die.net/man/3/wcwidth which says
>>> Note that glibc before 2.2.5 used the prototype
>>> int wcwidth(wint_t c);
>> Why not revert to wcwidth(wint_t)?
>> I think for cygwin it is the only solution that makes wcwidth work for
>> non-BMP characters and is also compatible (unlike some proposals discussed
>> later in the quoted thread).
> We can do this, but it may result in complaints from the other
> newlib consumers.  If in doubt, use #ifdef __CYGWIN__
Which other platforms actually use newlib?

>
>> Issue 2 is the handling of titlecase characters (e.g. "Nj" as one Unicode
>> character U+01CB). The current implementation considers them to be both
>> upper and lower (iswupper: return towlower (c) != c); I'd rather consider
>> them as neither upper nor lower (iswalpha (c) && towupper (c) == c).
>> https://linux.die.net/man/3/iswupper allows both interpretations:
>>> The wide-character class "upper" contains *at least* those characters wc
>>> which are equal to towupper(wc) and different from towlower(wc).
> Susv4 says "The iswupper() [...] functions shall test whether wc is a
> wide-character code representing a character of class upper." Whatever
> does that correctly with a low footprint is fine.
The question here is how "character of class upper" is defined, and how
to interpret pre-Unicode assumptions in a Unicode context.

>> Issue 3 is the special conversion jp2uc which seems to be half-bred; there
>> is no such handling for Chinese or Korean.
> This shouldn't matter to you, just keep it in place.  It's a historical,
> low footprint conversion for japanese characters without pulling in the
> unicode stuff.  Not used on Cygwin so just ignore.
I had noticed meanwhile that this is not active in Cygwin, but it's
broken anyway for multiple reasons:
    * platforms for which wchar_t is not Unicode should be explicitly listed
    * if used, the transformation needs to be applied to all non-Unicode
locales (also Chinese, Korean, and even 8-bit locales such as *.CP1252)
    * for towupper and towlower, the result must be back-transformed
into the respective locale encoding
    * particularly, the locale-specific _l functions inconsistently do not
use the transformation but have this note:
>      We're using a locale-independent representation of upper/lower case
>      based on Unicode data.  Thus, the locale doesn't matter.

So I'd suggest dropping that stuff unless someone would like to fix it.

Should I send my proposal to [hidden email] or
[hidden email]?

Thomas


Re: Unicode width data inconsistent/outdated

Brian Inglis
On 2017-08-05 13:06, Thomas Wolff wrote:

> Am 04.08.2017 um 19:01 schrieb Corinna Vinschen:
>> On Aug  3 21:44, Thomas Wolff wrote:
>>> Am 28.07.2017 um 21:58 schrieb Corinna Vinschen:
>>>> On Jul 26 23:43, Thomas Wolff wrote:
>>>>> Am 26.07.2017 um 11:50 schrieb Corinna Vinschen:
>>>>>> On Jul 26 03:16, Yaakov Selkowitz wrote:
>>>>>>> On 2017-07-26 03:08, Corinna Vinschen wrote:
>>>>>>>> On Jul 26 08:49, Thomas Wolff wrote:
>>>>>>>>> [...]

>>>>>>>> If you can update the newlib files this way and send matching
>>>>>>>> patches to the newlib list, this would be highly appreciated.

Submit to               ^

>>>>>>> I just updated unicode-ucd to 10.0 for this purpose.

>>>>>> Oh, and, btw, the comment in wcwidth.c isn't quite correct. The
>>>>>> cwstate in newlib is on Unicode 5.2, see
>>>>>> newlib/libc/ctype/towupper.c.

>>>>> Oh, a number of other embedded tables. To make the tow* and isw*
>>>>> functions more easily adaptable to Unicode updates, there will be
>>>>> some revisions to do here. And the to* and is* ones (without 'w')
>>>>> even refer to locales in a way I do not understand. Maybe I'll
>>>>> restrict my effort to wcwidth first...

>>>> The to* and is* ones (without 'w') don't matter at all and you don't
>>>> have to touch them.

>>>> The Unicode stuff only affects the tow and isw functions.
>>>>
>>>> As for how to fetch the data, you may want to have a look into
>>>> newlib/libc/ctype/utf8alpha.h and newlib/libc/ctype/utf8print.h. The
>>>> header comments contain the awk scripts used to collect the data.

>>> But there are no instructions to adapt the embedded conditional
>>> statements referring to those data...

>> Tables are ...

> I had an impression how the tables work. Yet there is no automatic mechanism
> to generate the data-based conditionals in the code which would need to be
> adapted too for Unicode updates. Therefore:

>>> My attempt would be to base the functions on a common table of character
>>> categories instead.

>> Keep in mind that the table is not loaded into memory on demand, as on
>> Linux.  Rather it will be part of the Cygwin DLL, and worse in case newlib,
>> any target using the wctype functions.
> Maybe we could change that (load on demand, or put them in a shared library
> perhaps), but...

>> The idea here is that the tables take less space than a full-fledged
>> category table. The tables in utf8print.h and utf8alpha.h and the code in
>> iswalpha and iswprint combined are 10K, code and data of the
>> tolower/toupper functions are 7K, wcwidth 3K, so a total of 20K, covering
>> Unicode 5.2 with 107K codepoints.
>>
>> A category table would have to contain the category bits for the entire
>> Unicode codepoint range. The number of potential bits is > 8 as far as I
>> know so it needs 2 bytes per char, but let's make that 1 byte for now. For
>> Unicode 5.2 only the table would be at least 107K, and that would only
>> cover the iswXXX functions.

> I have a working version now, and it uses much less space, as the
> category table is range-based.
> Another table is needed for case conversion. Size estimates are as follows
> (based on Unicode 5.2 for a fair comparison, going up a little bit for 10.0
> of course):
>
> Categories: 2313 entries (10.0: 2715)
> each entry needs 9 bytes, total 20817 bytes
> I don't know whether that expands by some word-alignment.
> I could pack entries to 7 bytes, or even 6 bytes if that helps (total 16191
> or 13878).
>
> Case conversion: 2062 entries (10.0: 2621)
> each entry needs 12 bytes, total 24744
> packed 8 bytes, total 16496
>
> The Categories table could be boiled down to 1223 entries (penalty: double
> runtime for iswupper and iswlower)
> The Case conversion table could be transformed to a compact form
> Case conversion compact: 1201 entries
> each entry needs 16 bytes, total 19216
> packed 12 or 11 (or even 10), total 14412 (or 12010)
>
> So I think the increase is acceptable for the benefit of simple and
> automatic generation and also more efficient processing by some of the
> functions. Also they would apply to more functions, e.g. iswdigit which would
> confirm all Unicode digits, not just the ASCII ones.
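The range-based category table proposed above can be sketched as follows. This is a toy illustration, not newlib's actual data or naming: the entry layout, the category names, and the three sample ranges are assumptions; lookup is a binary search over sorted, non-overlapping ranges.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of a range-based category table: sorted,
   non-overlapping codepoint ranges, each tagged with a category,
   looked up by binary search. */
enum category { CAT_NONE, CAT_UPPER, CAT_LOWER };

typedef struct { uint32_t first, last; uint8_t cat; } cat_range;

static const cat_range cats[] = {   /* must be kept sorted by first */
  { 0x0041, 0x005A, CAT_UPPER },    /* LATIN CAPITAL LETTER A..Z */
  { 0x0061, 0x007A, CAT_LOWER },    /* LATIN SMALL LETTER a..z */
  { 0x0391, 0x03A9, CAT_UPPER },    /* Greek capitals (toy range) */
};

static enum category category_of (uint32_t cp)
{
  size_t lo = 0, hi = sizeof cats / sizeof cats[0];
  while (lo < hi)                   /* binary search over ranges */
    {
      size_t mid = lo + (hi - lo) / 2;
      if (cp < cats[mid].first)
        hi = mid;
      else if (cp > cats[mid].last)
        lo = mid + 1;
      else
        return (enum category) cats[mid].cat;
    }
  return CAT_NONE;                  /* codepoint not in any range */
}
```

Since the table is range-based, its size grows with the number of category *transitions* in the Unicode data, not with the number of codepoints, which is where the quoted entry counts come from.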

>>> Also, there are 3 other issues:
>>>
>>> Issue 1 is about handling non-BMP characters by wcwidth.
>>> This has been discussed before.

>>> While wcswidth works already (using internal __wcwidth), and the isw*
>>> and tow* functions work as well because they use wint_t, wcwidth is the
>>> only function (inconsistently insisting on wchar_t) that does not work.
>> Trying to be close to the standard here.

>>> But note https://linux.die.net/man/3/wcwidth which says
>>>> Note that glibc before 2.2.5 used the prototype
>>>> int wcwidth(wint_t c);

>>> Why not revert to wcwidth(wint_t)?
>>> I think for cygwin it is the only solution that makes wcwidth work for
>>> non-BMP characters and is also compatible (unlike some proposals
>>> discussed later in the quoted thread).

>> We can do this, but it may result in complaints from the other newlib
>> consumers. If in doubt, use #ifdef __CYGWIN__
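A minimal sketch of the wint_t variant under discussion: because wint_t is at least 32 bits even where wchar_t is 16-bit UTF-16 (as on Cygwin), non-BMP codepoints can be measured directly. The function name and the two-entry width table are illustrative assumptions, not the real newlib implementation.

```c
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Sketch of a wcwidth taking wint_t.  The width table is a tiny
   illustrative subset of the East Asian Width data. */
typedef struct { uint32_t first, last; int width; } width_range;

static const width_range wide[] = {
  { 0x1100,  0x115F,  2 },          /* Hangul Jamo (wide, BMP) */
  { 0x20000, 0x2FFFD, 2 },          /* CJK Ext. B area (wide, non-BMP) */
};

static int my_wcwidth (wint_t wc)   /* wint_t per the proposal above */
{
  if (wc == 0)
    return 0;
  if (wc < 0x20 || (wc >= 0x7F && wc < 0xA0))
    return -1;                      /* control characters */
  for (size_t i = 0; i < sizeof wide / sizeof wide[0]; i++)
    if (wc >= wide[i].first && wc <= wide[i].last)
      return wide[i].width;
  return 1;                         /* default: narrow printable */
}
```

With a 16-bit wchar_t prototype, the 0x20000 call below could not even be expressed without a surrogate pair; that is the whole point of the wint_t signature.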

> Which other platforms do actually use newlib?

Many historical uPs and current uCs used in embedded systems supporting gcc but
not running Linux, including RTEMS, devkits for Nintendo and Sony game systems,
some Android, and Google NaCl.

>>> Issue 2 is the handling of titlecase characters (e.g. "Nj" as one Unicode
>>> character U+01CB). The current implementation considers them to be both
>>> upper and lower (iswupper: return towlower (c) != c); I'd rather consider
>>> them as neither upper nor lower (iswalpha (c) && towupper (c) == c).
>>> https://linux.die.net/man/3/iswupper allows both interpretations:
>>>> The wide-character class "upper" contains *at least* those characters wc
>>>> which are equal to towupper(wc) and different from towlower(wc).
>> Susv4 says "The iswupper() [...] functions shall test whether wc is a
>> wide-character code representing a character of class upper." Whatever
>> does that correctly with a low footprint is fine.
> The question here is how "character of class upper" is defined, and how to
> interpret pre-Unicode assumptions in a Unicode context.
>
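The two readings of iswupper for titlecase characters can be contrasted with a toy case table covering only the U+01CA/U+01CB/U+01CC (NJ/Nj/nj) triple. All helper names here are hypothetical; `upper_proposed` implements exactly the `iswalpha (c) && towupper (c) == c` test quoted above.

```c
#include <wchar.h>

/* Toy case mappings for the NJ triple only:
   U+01CA CAPITAL NJ, U+01CB titlecase Nj, U+01CC small nj. */
static wint_t toy_towupper (wint_t c)
{
  return (c == 0x01CB || c == 0x01CC) ? 0x01CA : c;
}

static wint_t toy_towlower (wint_t c)
{
  return (c == 0x01CA || c == 0x01CB) ? 0x01CC : c;
}

static int toy_iswalpha (wint_t c)
{
  return c >= 0x01CA && c <= 0x01CC;
}

/* current reading: "upper" iff lowercasing changes the character,
   so titlecase Nj counts as upper */
static int upper_current (wint_t c)
{
  return toy_towlower (c) != c;
}

/* proposed reading: "upper" iff it is an alphabetic character that is
   its own uppercase, so titlecase is neither upper nor lower */
static int upper_proposed (wint_t c)
{
  return toy_iswalpha (c) && toy_towupper (c) == c;
}
```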
>>> Issue 3 is the special conversion jp2uc, which seems to be half-baked;
>>> there is no such handling for Chinese or Korean.

>> This shouldn't matter to you, just keep it in place. It's a historical, low
>> footprint conversion for japanese characters without pulling in the unicode
>> stuff. Not used on Cygwin so just ignore.

> I had noticed meanwhile that this is not active in Cygwin, but it's broken
> anyway for multiple reasons:
> * platforms for which wchar_t is not Unicode should be explicitly listed
> * if used, the transformation needs to be applied to all non-Unicode locales
> (also Chinese, Korean, and even 8-bit locales such as *.CP1252)
> * for towupper and towlower, the result must be back-transformed into the
> respective locale encoding
> * particularly, the locale-specific _l functions inconsistently do not use the
> transformation but have this note:
>> We're using a locale-independent representation of upper/lower case based
>> on Unicode data. Thus, the locale doesn't matter.

> So I'd suggest to drop that stuff unless someone would like to fix it.

Looks like JIS support is under newlib/iconvdata

> Should I send my proposal to [hidden email] or
> [hidden email]?
See note near top:
newlib for anything under that directory,
patches for anything under winsup directory.

--
Take care. Thanks, Brian Inglis, Calgary, Alberta, Canada

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple


Re: Unicode width data inconsistent/outdated

Thomas Wolff
Am 05.08.2017 um 22:24 schrieb Brian Inglis:
> On 2017-08-05 13:06, Thomas Wolff wrote:
> ...
>> Which other platforms do actually use newlib?
> Many historical uPs and current uCs used in embedded systems supporting gcc but
> not running Linux, including RTEMS, devkits for Nintendo and Sony game systems,
> some Android, and Google NaCl.
Do they all expect wchar_t to be encoded locale-specifically? I doubt that.
https://www.gnu.org/software/libunistring/manual/html_node/The-wchar_005ft-mess.html
particularly points out Solaris and FreeBSD, no others.

>>>> Issue 3 is the special conversion jp2uc, which seems to be half-baked;
>>>> there is no such handling for Chinese or Korean.
>>> This shouldn't matter to you, just keep it in place. It's a historical, low
>>> footprint conversion for japanese characters without pulling in the unicode
>>> stuff. Not used on Cygwin so just ignore.
>> I had noticed meanwhile that this is not active in Cygwin, but it's broken
>> anyway for multiple reasons:
>> * platforms for which wchar_t is not Unicode should be explicitly listed
>> * if used, the transformation needs to be applied to all non-Unicode locales
>> (also Chinese, Korean, and even 8-bit locales such as *.CP1252)
>> * for towupper and towlower, the result must be back-transformed into the
>> respective locale encoding
>> * particularly, the locale-specific _l functions inconsistently do not use the
>> transformation but have this note:
>>> We're using a locale-independent representation of upper/lower case based
>>> on Unicode data. Thus, the locale doesn't matter.
>> So I'd suggest to drop that stuff unless someone would like to fix it.
> Looks like JIS support is under newlib/iconvdata
So maybe the conversion can call jisx0201_to_ucs4 etc. from there, and
also the back-conversion for towupper/lower is available.
But then the stuff is still broken for the other reasons. I could map
the _l functions properly, if that's really desired, but how to handle
other encodings and on which platforms?

Thomas



Re: Unicode width data inconsistent/outdated

Corinna Vinschen-2
In reply to this post by Thomas Wolff
On Aug  5 21:06, Thomas Wolff wrote:
> Am 04.08.2017 um 19:01 schrieb Corinna Vinschen:
> > On Aug  3 21:44, Thomas Wolff wrote:
> > > My attempt would be to base the functions on a common table of character categories instead.
> > Keep in mind that the table is not loaded into memory on demand, as on
> > Linux.  Rather it will be part of the Cygwin DLL, and worse in case
> > newlib, any target using the wctype functions.
> Maybe we could change that (load on demand, or put them in a shared library
> perhaps), but...

That won't work for embedded targets, especially small ones.

If you want to go that route, you would have to extend struct __locale_t
or lc_ctype_T (in newlib/libc/locale/setlocale.h) to contain pointers to
conversion tables (Cygwin-only), and the __set_lc_ctype_from_win function
or a new function inside Cygwin (but called from __ctype_load_locale)
could load the tables.

Then you could create new iswXXX, towXXX, and wcwidth functions inside
Cygwin using these tables, rather than relying on the newlib code.

Alternatively, if RTEMS is interested as well, we may strive for a
newlib solution which is opt-in.  Loading tables (or even big tables at
all) isn't a good solution for very small targets.

> > The idea here is that the tables take less space than a full-fledged
> > category table.  The tables in utf8print.h and utf8alpha.h and the code
> > in iswalpha and iswprint combined are 10K, code and data of the
> > tolower/toupper functions are 7K, wcwidth 3K, so a total of 20K,
> > covering Unicode 5.2 with 107K codepoints.
> >
> > A category table would have to contain the category bits for the entire
> > Unicode codepoint range.  The number of potential bits is > 8 as far as I
> > know so it needs 2 bytes per char, but let's make that 1 byte for now.
> > For Unicode 5.2 only the table would be at least 107K, and that would
> > only cover the iswXXX functions.
> I have a working version now, and it uses much less as the category table is
> range-based.
> Another table is needed for case conversion. Size estimates are as follows
> (based on Unicode 5.2 for a fair comparison, going up a little bit for 10.0
> of course):
>
> Categories: 2313 entries (10.0: 2715)
> each entry needs 9 bytes, total 20817 bytes
> I don't know whether that expands by some word-alignment.
> I could pack entries to 7 bytes, or even 6 bytes if that helps (total 16191
> or 13878).
>
> Case conversion: 2062 entries (10.0: 2621)
> each entry needs 12 bytes, total 24744
> packed 8 bytes, total 16496
>
> The Categories table could be boiled down to 1223 entries (penalty: double
> runtime for iswupper and iswlower)
> The Case conversion table could be transformed to a compact form
> Case conversion compact: 1201 entries
> each entry needs 16 bytes, total 19216
> packed 12 or 11 (or even 10), total 14412 (or 12010)
> So I think the increase is acceptable for the benefit of simple and
> automatic generation
So we're at 40K+ plus code then.

newlib: embedded targets, looking for small sized solutions.  Simple
and automatic generation is not the main goal.

> and also more efficient processing by some of the
> functions. Also they would apply to more functions, e.g. iswdigit which
> would confirm all Unicode digits, not just the ASCII ones.

Don't do that.  There's a collision with C99 if you define other
characters than ASCII digits to return nonzero from iswdigit.  Comment
from inside Glibc:

% The "digit" class must only contain the BASIC LATIN digits, says ISO C 99
% (sections 7.25.2.1.5 and 5.2.1).
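Per the C99 constraint quoted above, a conforming iswdigit reduces to a plain ASCII range test; other Unicode digits must be rejected. The function name is hypothetical.

```c
#include <wchar.h>

/* C99 restricts the "digit" class to U+0030..U+0039, so a conforming
   iswdigit rejects all other Unicode digits, e.g. Arabic-Indic U+0660
   or Devanagari U+0966 (those belong to iswalnum via other classes,
   not to "digit"). */
static int ascii_iswdigit (wint_t c)
{
  return c >= L'0' && c <= L'9';
}
```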

> > > > int wcwidth(wint_t c);
> > > Why not revert to wcwidth(wint_t)?
> > > I think for cygwin it is the only solution that makes wcwidth work for
> > > non-BMP characters and is also compatible (unlike some proposals discussed
> > > later in the quoted thread).
> > We can do this, but it may result in complaints from the other
> > newlib consumers.  If in doubt, use #ifdef __CYGWIN__
> Which other platforms do actually use newlib?

Lots of embedded and bare-metal targets.

> > > Issue 2 is the handling of titlecase characters (e.g. "Nj" as one Unicode
> > > character U+01CB). The current implementation considers them to be both
> > > upper and lower (iswupper: return towlower (c) != c); I'd rather consider
> > > them as neither upper nor lower (iswalpha (c) && towupper (c) == c).
> > > https://linux.die.net/man/3/iswupper allows both interpretations:
> > > > The wide-character class "upper" contains *at least* those characters wc
> > > > which are equal to towupper(wc) and different from towlower(wc).
> > Susv4 says "The iswupper() [...] functions shall test whether wc is a
> > wide-character code representing a character of class upper." Whatever
> > does that correctly with a low footprint is fine.
> The question here is how "character of class upper" is defined, and how to
> interpret pre-Unicode assumptions in a Unicode context.
In theory, do it as glibc does and you're fine.

> > > Issue 3 is the special conversion jp2uc, which seems to be half-baked; there
> > > is no such handling for Chinese or Korean.
> > This shouldn't matter to you, just keep it in place.  It's a historical,
> > low footprint conversion for japanese characters without pulling in the
> > unicode stuff.  Not used on Cygwin so just ignore.
> I had noticed meanwhile that this is not active in Cygwin, but it's broken
> anyway for multiple reasons:
>    * platforms for which wchar_t is not Unicode should be explicitly listed
>    * if used, the transformation needs to be applied to all non-Unicode
> locales (also Chinese, Korean, and even 8-bit locales such as *.CP1252)
>    * for towupper and towlower, the result must be back-transformed into the
> respective locale encoding
>    * particularly, the locale-specific _l functions inconsistently do not use
> the transformation but have this note:
No, no, no.  The functionality is restricted to certain use-cases and
always was.  It was a paid-for customer extension back in the day and it
was *sufficient* for the use-cases.  It's not clear how many newlib
users are still using it, but it's not a good idea to remove it without
checking first.  That means, ask on the newlib mailing list how many are
using the historical jp2uc code, and if we don't get a reply within,
say, a month, we can probably nuke it.


Corinna

--
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Maintainer                 cygwin AT cygwin DOT com
Red Hat


Re: Unicode width data inconsistent/outdated

Corinna Vinschen-2
On Aug  7 11:28, Corinna Vinschen wrote:

> On Aug  5 21:06, Thomas Wolff wrote:
> > Am 04.08.2017 um 19:01 schrieb Corinna Vinschen:
> > > This shouldn't matter to you, just keep it in place.  It's a historical,
> > > low footprint conversion for japanese characters without pulling in the
> > > unicode stuff.  Not used on Cygwin so just ignore.
> > I had noticed meanwhile that this is not active in Cygwin, but it's broken
> > anyway for multiple reasons:
> >    * platforms for which wchar_t is not Unicode should be explicitly listed
> >    * if used, the transformation needs to be applied to all non-Unicode
> > locales (also Chinese, Korean, and even 8-bit locales such as *.CP1252)
> >    * for towupper and towlower, the result must be back-transformed into the
> > respective locale encoding
> >    * particularly, the locale-specific _l functions inconsistently do not use
> > the transformation but have this note:
>
> No, no, no.  The functionality is restricted to certain use-cases and
> always was.  It was a paid-for customer extension back in the day and it
> was *sufficient* for the use-cases.  It's not clear how many newlib
> users are still using it, but it's not a good idea to remove it without
> checking first.  That means, ask on the newlib mailing list how many are
> using the historical jp2uc code, and if we don't get a reply within,
> say, a month, we can probably nuke it.
To clarify where we're coming from:

If you look into newlib/libc/locale/locale.c, function __loadlocale,
you'll notice that outside of Cygwin, only six single-, double-, or multi-byte
codesets are supported at all:

  ASCII
  ISO-8859-1
  EUCJP
  JIS
  SJIS
  UTF-8

The multichar/widechar conversion functions for EUCJP, JIS and SJIS were
implemented to have a low footprint in the first place, see, for
instance, __sjis_wctomb in newlib/libc/stdlib/wctomb_r.c.

This is all about simplification for small targets.  There was never a
requirement that converting a UTF-8 char to wchar_t, and converting the
equivalent SJIS char to wchar_t would result in the same wide char.

Consequently, Cygwin does not use these conversion functions.  Rather
it uses Windows conversion functions, see the conversion functions in
winsup/cygwin/strfuncs.cc, to get a consistent wide char representation
(UTF-16).  Another side-effect is that Cygwin does not support JIS at
all, only SJIS, see the comment in strfuncs.cc.


Corinna

--
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Maintainer                 cygwin AT cygwin DOT com
Red Hat


Re: Unicode width data inconsistent/outdated

Brian Inglis
In reply to this post by Corinna Vinschen-2
On 2017-08-07 03:28, Corinna Vinschen wrote:

> On Aug  5 21:06, Thomas Wolff wrote:
>> Am 04.08.2017 um 19:01 schrieb Corinna Vinschen:
>>> On Aug  3 21:44, Thomas Wolff wrote:
>>>> My attempt would be to base the functions on a common table of character categories instead.
>>> Keep in mind that the table is not loaded into memory on demand, as on
>>> Linux.  Rather it will be part of the Cygwin DLL, and worse in case
>>> newlib, any target using the wctype functions.
>> Maybe we could change that (load on demand, or put them in a shared library
>> perhaps), but...
>
> That won't work for embedded targets, especially small ones.
>
> If you want to go that route, you would have to extend struct __locale_t
> or lc_ctype_T (in newlib/libc/locale/setlocale.h) to contain pointers to
> conversion tables (Cygwin-only), and the __set_lc_ctype_from_win function
> or a new function inside Cygwin (but called from __ctype_load_locale)
> could load the tables.
>
> Then you could create new iswXXX, towXXX, and wcwidth functions inside
> Cygwin using these tables, rather than relying on the newlib code.
>
> Alternatively, if RTEMS is interested as well, we may strive for a
> newlib solution which is opt-in.  Loading tables (or even big tables at
> all) isn't a good solution for very small targets.
>
>>> The idea here is that the tables take less space than a full-fledged
>>> category table.  The tables in utf8print.h and utf8alpha.h and the code
>>> in iswalpha and iswprint combined are 10K, code and data of the
>>> tolower/toupper functions are 7K, wcwidth 3K, so a total of 20K,
>>> covering Unicode 5.2 with 107K codepoints.
>>>
>>> A category table would have to contain the category bits for the entire
>>> Unicode codepoint range.  The number of potential bits is > 8 as far as I
>>> know so it needs 2 bytes per char, but let's make that 1 byte for now.
>>> For Unicode 5.2 only the table would be at least 107K, and that would
>>> only cover the iswXXX functions.
>> I have a working version now, and it uses much less as the category table is
>> range-based.
>> Another table is needed for case conversion. Size estimates are as follows
>> (based on Unicode 5.2 for a fair comparison, going up a little bit for 10.0
>> of course):
>>
>> Categories: 2313 entries (10.0: 2715)
>> each entry needs 9 bytes, total 20817 bytes
>> I don't know whether that expands by some word-alignment.
>> I could pack entries to 7 bytes, or even 6 bytes if that helps (total 16191
>> or 13878).
>>
>> Case conversion: 2062 entries (10.0: 2621)
>> each entry needs 12 bytes, total 24744
>> packed 8 bytes, total 16496
>>
>> The Categories table could be boiled down to 1223 entries (penalty: double
>> runtime for iswupper and iswlower)
>> The Case conversion table could be transformed to a compact form
>> Case conversion compact: 1201 entries
>> each entry needs 16 bytes, total 19216
>> packed 12 or 11 (or even 10), total 14412 (or 12010)
>> So I think the increase is acceptable for the benefit of simple and
>> automatic generation
>
> So we're at 40K+ plus code then.
>
> newlib: embedded targets, looking for small sized solutions.  Simple
> and automatic generation is not the main goal.
>
>> and also more efficient processing by some of the
>> functions. Also they would apply to more functions, e.g. iswdigit which
>> would confirm all Unicode digits, not just the ASCII ones.
>
> Don't do that.  There's a collision with C99 if you define other
> characters than ASCII digits to return nonzero from iswdigit.  Comment
> from inside Glibc:
>
> % The "digit" class must only contain the BASIC LATIN digits, says ISO C 99
> % (sections 7.25.2.1.5 and 5.2.1).
>
>>>>> int wcwidth(wint_t c);
>>>> Why not revert to wcwidth(wint_t)?
>>>> I think for cygwin it is the only solution that makes wcwidth work for
>>>> non-BMP characters and is also compatible (unlike some proposals discussed
>>>> later in the quoted thread).
>>> We can do this, but it may result in complaints from the other
>>> newlib consumers.  If in doubt, use #ifdef __CYGWIN__
>> Which other platforms do actually use newlib?
>
> Lots of embedded and bare-metal tagets.
>
>>>> Issue 2 is the handling of titlecase characters (e.g. "Nj" as one Unicode
>>>> character U+01CB). The current implementation considers them to be both
>>>> upper and lower (iswupper: return towlower (c) != c); I'd rather consider
>>>> them as neither upper nor lower (iswalpha (c) && towupper (c) == c).
>>>> https://linux.die.net/man/3/iswupper allows both interpretations:
>>>>> The wide-character class "upper" contains *at least* those characters wc
>>>>> which are equal to towupper(wc) and different from towlower(wc).
>>> Susv4 says "The iswupper() [...] functions shall test whether wc is a
>>> wide-character code representing a character of class upper." Whatever
>>> does that correctly with a low footprint is fine.
>> The question here is how "character of class upper" is defined, and how to
>> interpret pre-Unicode assumptions in a Unicode context.
>
> In theory, do it as glibc does and you're fine.

Implementation considerations for handling the Unicode tables described in
        http://www.unicode.org/versions/Unicode10.0.0/ch05.pdf
and implemented in
        https://www.strchr.com/multi-stage_tables

ICU icu4[cj] uses a folded trie of the properties, where the unique property
combinations are indexed, strings of those indices are generated for fixed size
groups of character codes, unique values of those strings are then indexed, and
those indices assigned to each character code group. The result is a multi-level
indexing operation that returns the required property combination for each
character.

https://slidegur.com/doc/4172411/folded-trie--efficient-data-structure-for-all-of-unicode

The FOX Toolkit uses a similar approach, splitting the 21 bit character code
into 7 bit groups, with two higher levels of 7 bit indices, and more tweaks to
eliminate redundancy.

ftp://ftp.fox-toolkit.org/pub/FOX_Unicode_Tables.pdf
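A minimal two-stage lookup in the spirit of the multi-stage tables described above. Block size, row contents, and the "upper" property are toy assumptions, and the arrays are filled at runtime for clarity; real implementations generate them offline and gain their space savings by letting identical stage-2 rows be shared by many blocks.

```c
#include <stdint.h>
#include <string.h>

/* Two-stage property table: stage 1 maps each 256-codepoint block to a
   stage-2 row; blocks with identical contents share one row. */
enum { BLOCK = 256, NBLOCKS = 0x110000 / BLOCK, NROWS = 2 };

static uint8_t  stage2[NROWS][BLOCK]; /* unique property rows */
static uint16_t stage1[NBLOCKS];      /* block number -> row index */

static void build_toy_tables (void)
{
  memset (stage2, 0, sizeof stage2);  /* row 0: no property anywhere */
  memset (stage1, 0, sizeof stage1);  /* all blocks share row 0... */
  for (int c = 'A'; c <= 'Z'; c++)
    stage2[1][c] = 1;                 /* row 1: ASCII uppercase marked */
  stage1[0] = 1;                      /* ...except block U+0000..U+00FF */
}

static int toy_isupper (uint32_t cp)
{
  /* one indexed load per stage; no search needed */
  return cp < 0x110000 && stage2[stage1[cp >> 8]][cp & 0xFF];
}
```

The trade-off versus the range-based table discussed earlier is constant-time lookup and heavy row sharing, at the cost of a generator that must be rerun for each Unicode release.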

--
Take care. Thanks, Brian Inglis, Calgary, Alberta, Canada



Re: Unicode width data inconsistent/outdated

Thomas Wolff
In reply to this post by Corinna Vinschen-2
Am 07.08.2017 um 11:28 schrieb Corinna Vinschen:

> On Aug  5 21:06, Thomas Wolff wrote:
>> Am 04.08.2017 um 19:01 schrieb Corinna Vinschen:
>>> On Aug  3 21:44, Thomas Wolff wrote:
>>>> My attempt would be to base the functions on a common table of character categories instead.
>>> ...Keep in mind that the table is not loaded into memory on demand, as on
>>> Linux.  Rather it will be part of the Cygwin DLL, and worse in case
>>> newlib, any target using the wctype functions.
>> Maybe we could change that (load on demand, or put them in a shared library
>> perhaps), but...
> That won't work for embedded targets, especially small ones.
>
> If you want to go that route, you would have to extend struct __locale_t
> or lc_ctype_T (in newlib/libc/locale/setlocale.h) to contain pointers to
> conversion tables (Cygwin-only), and the __set_lc_ctype_from_win function
> or a new function inside Cygwin (but called from __ctype_load_locale)
> could load the tables.
>
> Then you could create new iswXXX, towXXX, and wcwidth functions inside
> Cygwin using these tables, rather than relying on the newlib code.
>
> Alternatively, if RTEMS is interested as well, we may strive for a
> newlib solution which is opt-in.  Loading tables (or even big tables at
> all) isn't a good solution for very small targets.
>
>>> The idea here is that the tables take less space than a full-fledged
>>> category table.  The tables in utf8print.h and utf8alpha.h and the code
>>> in iswalpha and iswprint combined are 10K, code and data of the
>>> tolower/toupper functions are 7K, wcwidth 3K, so a total of 20K,
>>> covering Unicode 5.2 with 107K codepoints.
>>>
>>> A category table would have to contain the category bits for the entire
>>> Unicode codepoint range.  The number of potential bits is > 8 as far as I
>>> know so it needs 2 bytes per char, but let's make that 1 byte for now.
>>> For Unicode 5.2 only the table would be at least 107K, and that would
>>> only cover the iswXXX functions.
>> I have a working version now, and it uses much less as the category table is
>> range-based.
>> Another table is needed for case conversion. Size estimates are as follows
>> (based on Unicode 5.2 for a fair comparison, going up a little bit for 10.0
>> of course):
>>
>> Categories: 2313 entries (10.0: 2715)
>> each entry needs 9 bytes, total 20817 bytes
>> I don't know whether that expands by some word-alignment.
>> I could pack entries to 7 bytes, or even 6 bytes if that helps (total 16191
>> or 13878).
>>
>> Case conversion: 2062 entries (10.0: 2621)
>> each entry needs 12 bytes, total 24744
>> packed 8 bytes, total 16496
>>
>> The Categories table could be boiled down to 1223 entries (penalty: double
>> runtime for iswupper and iswlower)
>> The Case conversion table could be transformed to a compact form
>> Case conversion compact: 1201 entries
>> each entry needs 16 bytes, total 19216
>> packed 12 or 11 (or even 10), total 14412 (or 12010)
>> So I think the increase is acceptable for the benefit of simple and
>> automatic generation
> So we're at 40K+ plus code then.
No, if I implement the packed versions, it's 19.3K, so even smaller than
currently.
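The packing mentioned above can be sketched as a 6-byte encoding of (first codepoint, run length, category). The field widths (21 + 19 + 8 = 48 bits) are an assumption for illustration, not the actual layout under discussion.

```c
#include <stdint.h>

/* Hypothetical sketch: pack a range entry into 6 bytes instead of a
   padded struct: 21-bit first codepoint + 19-bit run length + 8-bit
   category = 48 bits. */
typedef struct { uint8_t b[6]; } packed_range;

static packed_range pack_range (uint32_t first, uint32_t len, uint8_t cat)
{
  uint64_t v = ((uint64_t) (first & 0x1FFFFF) << 27)
             | ((uint64_t) (len & 0x7FFFF) << 8)
             | cat;
  packed_range r;
  for (int i = 0; i < 6; i++)         /* store big-endian, byte by byte */
    r.b[i] = (uint8_t) (v >> (8 * (5 - i)));
  return r;
}

static void unpack_range (packed_range r,
                          uint32_t *first, uint32_t *len, uint8_t *cat)
{
  uint64_t v = 0;
  for (int i = 0; i < 6; i++)
    v = (v << 8) | r.b[i];
  *first = (uint32_t) ((v >> 27) & 0x1FFFFF);
  *len   = (uint32_t) ((v >> 8) & 0x7FFFF);
  *cat   = (uint8_t) (v & 0xFF);
}
```

Storing bytes individually also sidesteps struct padding and alignment, which is why a packed array of N entries is exactly 6*N bytes.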

> newlib: embedded targets, looking for small sized solutions.  Simple
> and automatic generation is not the main goal.
>
>> and also more efficient processing by some of the
>> functions. Also they would apply to more functions, e.g. iswdigit which
>> would confirm all Unicode digits, not just the ASCII ones.
> Don't do that.  There's a collision with C99 if you define other
> characters than ASCII digits to return nonzero from iswdigit.  ...
OK.

>>>> Issue 3 is the special conversion jp2uc, which seems to be half-baked; there
>>>> is no such handling for Chinese or Korean.
>>> This shouldn't matter to you, just keep it in place.  It's a historical,
>>> low footprint conversion for japanese characters without pulling in the
>>> unicode stuff.  Not used on Cygwin so just ignore.
>> I had noticed meanwhile that this is not active in Cygwin, but it's broken
>> anyway for multiple reasons:
>>     * platforms for which wchar_t is not Unicode should be explicitly listed
>>     * if used, the transformation needs to be applied to all non-Unicode
>> locales (also Chinese, Korean, and even 8-bit locales such as *.CP1252)
>>     * for towupper and towlower, the result must be back-transformed into the
>> respective locale encoding
>>     * particularly, the locale-specific _l functions inconsistently do not use
>> the transformation but have this note:
> No, no, no.  The functionality is restricted to certain use-cases and
> always was.  It was a paid-for customer extension back in the day and it
> was *sufficient* for the use-cases.  It's not clear how many newlib
> users are still using it, but it's not a good idea to remove it without
> checking first.  That means, ask on the newlib mailing list how many are
> using the historical jp2uc code, and if we don't get a reply within,
> say, a month, we can probably nuke it.
OK, let's make such a request after the holiday season.
But even if this persists as a special solution, it's still broken
and should be fixed.
Can we then substitute the current table with calling the iconvdata
functions? In that case, as I said, the back-conversion would be
available too, and I could fix that and add the missing handling of the
_l functions, for a consistent solution.

Thomas



Re: Unicode width data inconsistent/outdated

Thomas Wolff
In reply to this post by Brian Inglis
Hi Brian,

Am 07.08.2017 um 21:07 schrieb Brian Inglis:

> ...
> Implementation considerations for handling the Unicode tables described in
> http://www.unicode.org/versions/Unicode10.0.0/ch05.pdf
> and implemented in
> https://www.strchr.com/multi-stage_tables
>
> ICU icu4[cj] uses a folded trie of the properties, where the unique property
> combinations are indexed, strings of those indices are generated for fixed size
> groups of character codes, unique values of those strings are then indexed, and
> those indices assigned to each character code group. The result is a multi-level
> indexing operation that returns the required property combination for each
> character.
>
> https://slidegur.com/doc/4172411/folded-trie--efficient-data-structure-for-all-of-unicode
>
> The FOX Toolkit uses a similar approach, splitting the 21 bit character code
> into 7 bit groups, with two higher levels of 7 bit indices, and more tweaks to
> eliminate redundancy.
>
> ftp://ftp.fox-toolkit.org/pub/FOX_Unicode_Tables.pdf
>
Thanks for the interesting links, I'll check them out.
But such multi-level tables don't really help without a given procedure
for updating them (that's only available for the lowest level, not for
the code-embedded levels).
Also, as I've demonstrated, my more straightforward and more efficient
approach will even use less total space than the multi-level approach if
packed table entries are used.
Thomas



Re: Unicode width data inconsistent/outdated

Brian Inglis
On 2017-08-07 13:30, Thomas Wolff wrote:

> Am 07.08.2017 um 21:07 schrieb Brian Inglis:
>> Implementation considerations for handling the Unicode tables described in
>>     http://www.unicode.org/versions/Unicode10.0.0/ch05.pdf
>> and implemented in
>>     https://www.strchr.com/multi-stage_tables
>>
>> ICU icu4[cj] uses a folded trie of the properties, where the unique property
>> combinations are indexed, strings of those indices are generated for fixed size
>> groups of character codes, unique values of those strings are then indexed, and
>> those indices assigned to each character code group. The result is a multi-level
>> indexing operation that returns the required property combination for each
>> character.
>>
>> https://slidegur.com/doc/4172411/folded-trie--efficient-data-structure-for-all-of-unicode
>>
>>
>> The FOX Toolkit uses a similar approach, splitting the 21 bit character code
>> into 7 bit groups, with two higher levels of 7 bit indices, and more tweaks to
>> eliminate redundancy.
>>
>> ftp://ftp.fox-toolkit.org/pub/FOX_Unicode_Tables.pdf
>>
> Thanks for the interesting links, I'll check them out.
> But such multi-level tables don't really help without a given procedure for
> updating them (that's only available for the lowest level, not for the
> code-embedded levels).

Unicode estimates that property tables can be reduced to 7-8KB using these
techniques, including using minimal integer sizes (e.g. char, short) for
indices and array elements, rather than pointers, if you can keep the
indices small.

Creation scripts used by the PCRE and Python projects are linked from the
bottom of the second link above. Source and docs for these packages and ICU
are available under Cygwin, and the FOX Toolkit is available in some distros
and by FTP.

> Also, as I've demonstrated, my more straight-forward and more efficient approach
> will even use less total space than the multi-level approach if packed table
> entries are used.

Unicode recommends the double table index approach as a means of eliminating
the massive redundancy that exists in char property entries and char groups,
and of using small integers instead of pointers; the result can be optimized
to meet conformance levels and platform speed and size limits, at the cost
of an annual review of properties and a rebuild. The amount of redundancy
removed by this approach is estimated in the FOX Toolkit doc and ranges
across orders of magnitude. Unfortunately none of these docs or sources
quote sizes for any Unicode release!

My own first take on these was to use run-length encoded bitstrings for each
binary property, similar to database bitmap indices, but the grouping of
property blocks in Unicode, and their recommendation, persuaded me that
their approach is likely backed by supporting corporations' and developers'
R&D, and is similar to techniques used for decades in database query
handling of (lots of) small-value-set equivalence-class columns to reduce
memory pressure while speeding up selections.

--
Take care. Thanks, Brian Inglis, Calgary, Alberta, Canada


Re: Unicode width data inconsistent/outdated

Thomas Wolff
On 07.08.2017 23:29, Brian Inglis wrote:

> On 2017-08-07 13:30, Thomas Wolff wrote:
>> On 07.08.2017 21:07, Brian Inglis wrote:
>>> Implementation considerations for handling the Unicode tables described in
>>>      http://www.unicode.org/versions/Unicode10.0.0/ch05.pdf
>>> and implemented in
>>>      https://www.strchr.com/multi-stage_tables
>>>
>>> ICU icu4[cj] uses a folded trie of the properties, where the unique property
>>> combinations are indexed, strings of those indices are generated for fixed size
>>> groups of character codes, unique values of those strings are then indexed, and
>>> those indices assigned to each character code group. The result is a multi-level
>>> indexing operation that returns the required property combination for each
>>> character.
>>>
>>> https://slidegur.com/doc/4172411/folded-trie--efficient-data-structure-for-all-of-unicode
>>>
>>>
>>> The FOX Toolkit uses a similar approach, splitting the 21 bit character code
>>> into 7 bit groups, with two higher levels of 7 bit indices, and more tweaks to
>>> eliminate redundancy.
>>>
>>> ftp://ftp.fox-toolkit.org/pub/FOX_Unicode_Tables.pdf
>>>
>> Thanks for the interesting links, I'll check them out.
>> But such multi-level tables don't really help without a given procedure for
>> updating them (that's only available for the lowest level, not for the
>> code-embedded levels).
> Unicode estimates property tables can be reduced to 7-8KB using these
> techniques, including using minimal int sizes for indices and array elements e.g
> char, short if you can keep the indices small, rather than pointers.
>
> Creation scripts used by PCRE and Python projects are linked from the bottom of
> the second link above. Source and docs for these packages and ICU is available
> under Cygwin, and FOX Toolkit is available in some distros and by FTP.
>
>> Also, as I've demonstrated, my more straight-forward and more efficient approach
>> will even use less total space than the multi-level approach if packed table
>> entries are used.
> Unicode recommends the double table index approach as a means of eliminating the
> massive redundancy that exists in char property entries and char groups, and
> using small integers instead of pointers, that can be optimized to meet
> conformance levels and platform speed and size limits, at the cost of an annual
> review of properties and rebuild. The amount of redundancy removed by this
> approach is estimated in the FOX Toolkit doc and ranges across orders of
> magnitude. Unfortunately none of these docs or sources quote sizes for any
> Unicode release!
>
> My own first take on these was to use run length encoded bitstrings for each
> binary property, similar to database bitmap indices, but the grouping of
> property blocks in Unicode, and their recommendation, persuaded me their
> approach was likely backed by a bunch of supporting corps' and devs' R&D, and is
> similar to those used for decades in database queries handling (lots of) small
> value set equivalence class columns to reduce memory pressure while speeding up
> selections.
I am not quite sure what you're trying to suggest or recommend now, but
the thing is, I just wanted to get an update of the width data in the first
place, which is an easy and undisputed change; then Corinna pointed out
that the ctype functions are based on old Unicode data too, so I made an
attempt to update them as well. I use the approach that I also use for two
other projects (mined and mintty), and I didn't mean this to become a
research project for me :/
I am certainly willing to consider specs and all that to achieve a
suitable result, but I don't feel like implementing any fancy algorithm
recommended by Unicode with unconvincing rationale, especially after
I've calculated that my method uses even less memory.
Thomas


Re: Unicode width data inconsistent/outdated

Corinna Vinschen-2
In reply to this post by Thomas Wolff
On Aug  7 21:27, Thomas Wolff wrote:

> On 07.08.2017 11:28, Corinna Vinschen wrote:
> > On Aug  5 21:06, Thomas Wolff wrote:
> > > I have a working version now, and it uses much less, as the category
> > > table is range-based.
> > > Another table is needed for case conversion. Size estimates are as follows
> > > (based on Unicode 5.2 for a fair comparison, going up a little bit for 10.0
> > > of course):
> > >
> > > Categories: 2313 entries (10.0: 2715)
> > > each entry needs 9 bytes, total 20817 bytes
> > > I don't know whether that expands by some word-alignment.
> > > I could pack entries to 7 bytes, or even 6 bytes if that helps (total 16191
> > > or 13878).
> > >
> > > Case conversion: 2062 entries (10.0: 2621)
> > > each entry needs 12 bytes, total 24744
> > > packed 8 bytes, total 16496
> > >
> > > The Categories table could be boiled down to 1223 entries (penalty: double
> > > runtime for iswupper and iswlower)
> > > The Case conversion table could be transformed to a compact form
> > > Case conversion compact: 1201 entries
> > > each entry needs 16 bytes, total 19216
> > > packed 12 or 11 (or even 10), total 14412 (or 12010)
> > > So I think the increase is acceptable for the benefit of simple and
> > > automatic generation
> > So we're at 40K+ plus code then.
> No, if I implement the packed versions, it's 19.3K, so even smaller than
> currently.
Apparently I added up wrongly.

> > > I had noticed meanwhile that this is not active in Cygwin, but it's broken
> > > anyway for multiple reasons:
> > >     * platforms for which wchar_t is not Unicode should be explicitly listed
> > >     * if used, the transformation needs to be applied to all non-Unicode
> > > locales (also Chinese, Korean, and even 8-bit locales such as *.CP1252)
> > >     * for towupper and towlower, the result must be back-transformed into the
> > > respective locale encoding
> > >     * particularly the locale-specific _l functions inconsistently do not use
> > > the transformation but have this note:
> > No, no, no.  The functionality is restricted to certain use-cases and
> > always was.  It was a paid-for customer extension back in the day and it
> > was *sufficient* for the use-cases.  It's not clear how many newlib
> > users are still using it, but it's not a good idea to remove it without
> > checking first.  That means, ask on the newlib mailing list how many are
> > using the historical jp2uc code, and if we don't get a reply within,
> > say, a month, we can probably nuke it.
> OK, let's make such a request after holiday time.
> But, even if this shall persist as a special solution, it's still broken and
> should be fixed.
> Can we then substitute the current table with calling the iconvdata
> functions? In that case, as I said, the back-conversion would be available
> too, and I could fix that and add the missing handling of the _l functions,
> for a consistent solution.
I'm not quite sure I follow.  Do you mean, iconvdata tables for the
three japanese codesets only?  Wouldn't that mean to convert the
multibyte stuff into unicode and vice versa, basically getting rid
of the jp2uc workaround?

After a night's sleep, that might actually be the best way anyway.  I
agree that the jp2uc workaround is a bit of a hack.  Well, not a bit.

However, given that this does not affect Cygwin, we should really discuss
this on the newlib list.


Thanks,
Corinna

--
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Maintainer                 cygwin AT cygwin DOT com
Red Hat
