Extreme slowdown due to malloc?

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

Extreme slowdown due to malloc?

Achim Gratz

I've been experimenting a bit with ZStandard dictionaries.  The
dictionary builder is probably not the most optimized piece of software
and if you feed it large amounts of data it needs quite a lot of
cycles.  So I thought I run some of this on Cygwin since that machine is
faster and has more threads than my Linux box.  Unfortunately that plan
shattered due to extreme slowness of the first (single-threaded) part of
the dictionary builder that sets up the partial suffix array.

|------+---------------+---------------|
|      | E3-1225v3     | E3-1276v3     |
|      | 4C/4T         | 4C/8T         |
|      | 3.2/3.6GHz    | 3.6/4.0GHz    |
|------+---------------+---------------|
|  100 | 00:14 /   55s | 00:23 /  126s |
|  200 | 00:39 /  145s | 01:10 /  241s |
|  400 | 01:12 /  266s | 01:25 /  322s |
|  800 | 02:06 /  466s | 11:12 / 1245s |
| 1600 | 03:57 /  872s | > 2hr         |
| 3200 | 08:03 / 1756s | n/a           |
| 6400 | 16:17 / 3581s | n/a           |
|------+---------------+---------------|

The obvious difference is that I/O takes a lot longer on Cygwin (roughly
a minute for reading all the data) and that I have an insane amount of
page faults on Windows (as reported by time) vs. none on Linux.

While doing that I also noticed that top shows the program taking 100%
CPU in the multithreaded portion of the program, while it should show
close to 800% at that time.  I'm not sure if that information just isn't
available on Windows or if procps-ng needs to look someplace else for
that to be shown as expected.


Regards,
Achim.
--
+<[Q+ Matrix-12 WAVE#46+305 Neuron microQkb Andromeda XTk Blofeld]>+

Waldorf MIDI Implementation & additional documentation:
http://Synth.Stromeko.net/Downloads.html#WaldorfDocs
Reply | Threaded
Open this post in threaded view
|

Re: Extreme slowdown due to malloc?

Mark Geisert
Hi Achim,

Achim Gratz wrote:
> I've been experimenting a bit with ZStandard dictionaries.  The
> dictionary builder is probably not the most optimized piece of software

Is this what leads you to suspect malloc?  Really heavy use of malloc?

> and if you feed it large amounts of data it needs quite a lot of
> cycles.  So I thought I run some of this on Cygwin since that machine is
> faster and has more threads than my Linux box.  Unfortunately that plan
> shattered due to extreme slowness of the first (single-threaded) part of
> the dictionary builder that sets up the partial suffix array.
>
> |------+---------------+---------------|
> |      | E3-1225v3     | E3-1276v3     |
> |      | 4C/4T         | 4C/8T         |
> |      | 3.2/3.6GHz    | 3.6/4.0GHz    |
> |------+---------------+---------------|
> |  100 | 00:14 /   55s | 00:23 /  126s |
> |  200 | 00:39 /  145s | 01:10 /  241s |
> |  400 | 01:12 /  266s | 01:25 /  322s |
> |  800 | 02:06 /  466s | 11:12 / 1245s |
> | 1600 | 03:57 /  872s | > 2hr         |
> | 3200 | 08:03 / 1756s | n/a           |
> | 6400 | 16:17 / 3581s | n/a           |
> |------+---------------+---------------|
>
> The obvious difference is that I/O takes a lot longer on Cygwin (roughly
> a minute for reading all the data) and that I have an insane amount of
> page faults on Windows (as reported by time) vs. none on Linux.

How much RAM does the Windows machine have?  Do you have a paging file?  Is it
fixed size or "let Windows manage"?  How big is it?

> While doing that I also noticed that top shows the program taking 100%
> CPU in the multithreaded portion of the program, while it should show
> close to 800% at that time.  I'm not sure if that information just isn't
> available on Windows or if procps-ng needs to look someplace else for
> that to be shown as expected.

No offense, but are you sure it's actually running multi-threaded on Windows?

I have a Cygwin malloc speedup patch that *might* help the m-t part.  I'll prepare
and submit that to cygwin-patches shortly.
Cheers,

..mark
Reply | Threaded
Open this post in threaded view
|

Re: Extreme slowdown due to malloc?

Achim Gratz
Mark Geisert writes:
>> I've been experimenting a bit with ZStandard dictionaries.  The
>> dictionary builder is probably not the most optimized piece of software
>
> Is this what leads you to suspect malloc?  Really heavy use of malloc?

That piece of code does a lot of smallish allocations when it builds up
the suffix array.  Also that is where I suspect the high number of page
faults come from as I have no other explanation.

>> The obvious difference is that I/O takes a lot longer on Cygwin
>> (roughly
>> a minute for reading all the data) and that I have an insane amount of
>> page faults on Windows (as reported by time) vs. none on Linux.
>
> How much RAM does the Windows machine have?  Do you have a paging
> file?  Is it fixed size or "let Windows manage"?  How big is it?

This machine is fully expanded to 32GiB, it won't run out of physical
memory for any of the tests shown.  The full dictionary build that I'd
wanted to have run on this machine would come in as something like
100000 in that table and takes around 6 hours consuming 15GiB on my
Linux box (4T), so if I could run that I might see a bit of memory
pressure.  The pagefile set up as fixed (I think 64GiB), but is unused.

>> While doing that I also noticed that top shows the program taking 100%
>> CPU in the multithreaded portion of the program, while it should show
>> close to 800% at that time.  I'm not sure if that information just isn't
>> available on Windows or if procps-ng needs to look someplace else for
>> that to be shown as expected.
>
> No offense, but are you sure it's actually running multi-threaded on Windows?

Yes, once it finally comes to the statistics part of the dictionary
build it'll use all eight threads no problem.  That's why it doesn't
take twice as long (wall time) at the 400 mark compared to the 200.

> I have a Cygwin malloc speedup patch that *might* help the m-t part.
> I'll prepare and submit that to cygwin-patches shortly.

Well, if you want to test it with the new ZStandard, give it a spin…
I'll check how far I can strip that test down so you can use the Cygwin
source tree for testing.


Regards,
Achim.
--
+<[Q+ Matrix-12 WAVE#46+305 Neuron microQkb Andromeda XTk Blofeld]>+

Waldorf MIDI Implementation & additional documentation:
http://Synth.Stromeko.net/Downloads.html#WaldorfDocs
Reply | Threaded
Open this post in threaded view
|

Re: Extreme slowdown due to malloc?

Achim Gratz
ASSI writes:
>> I have a Cygwin malloc speedup patch that *might* help the m-t part.
>> I'll prepare and submit that to cygwin-patches shortly.
>
> Well, if you want to test it with the new ZStandard, give it a spin…
> I'll check how far I can strip that test down so you can use the Cygwin
> source tree for testing.

OK, it's actually pretty simple, do this inside a checkout of
newlib-cygwin:

$ find newlib winsup texinfo -type f > flist
$ zstd --train-cover --ultra -22 -T0 -vv --filelist=flist -o dict-cover

On Linux, it reads in all the files in about two seconds, while it takes
quite a while longer on Cygwin.  But the real bummer is that
constructing the partial suffix arrays (which is single-threaded) will
seemingly take forever, while it's done much faster on Linux.  You can
pare down the number of files like that:

$ shuf -n 320 flist > slist

and then use that shorter file instead.  I get this:

*** Linux E3-1225v3 4C/4T 3.2/3.6GHz
|------+----------+-------+----------+---------+--------+--------+------------|
|    n |     user |   sys |    total |    wall |   util | serial | pagefaults |
|------+----------+-------+----------+---------+--------+--------+------------|
|  100 |  116.092 | 0.187 |  116.279 |   30.82 | 377.3% |   2.0% |          0 |
|  200 |  145.481 | 0.135 |  145.616 |   38.65 | 376.8% |   2.1% |          0 |
|  400 |  288.341 | 0.414 |  288.755 |   77.84 | 371.0% |   2.6% |          0 |
|  800 |  517.288 | 0.623 |  517.911 |  138.93 | 372.8% |   2.4% |          0 |
| 1600 | 1229.348 | 1.752 | 1231.100 |  333.37 | 369.3% |   2.8% |          0 |
| 3200 | 2508.250 | 3.632 | 2511.882 |  678.96 | 370.0% |   2.7% |          0 |
| 6400 | 4380.693 | 5.352 | 4386.045 | 1176.43 | 372.8% |   2.4% |          0 |
|------+----------+-------+----------+---------+--------+--------+------------|

*** Cygwin E3-1276v3 4C/8T 3.6/4.0GHz
|------+----------+--------+----------+---------+--------+--------+------------|
|    n |     user |    sys |    total |    wall |   util | serial | pagefaults |
|------+----------+--------+----------+---------+--------+--------+------------|
|  100 |  141.906 |  0.796 |  142.702 |   20.53 | 695.1% |   2.2% |     327860 |
|  200 |  198.140 |  1.328 |  199.468 |   29.39 | 678.7% |   2.6% |     452870 |
|  400 |  425.749 |  2.328 |  428.077 |   66.03 | 648.3% |   3.3% |     752357 |
|  800 |  822.250 |  3.499 |  825.749 |  150.42 | 549.0% |   6.5% |    1277198 |
| 1600 | 1773.578 |  8.483 | 1782.061 |  383.42 | 464.8% |  10.3% |    3011298 |
| 3200 | 4322.281 | 15.890 | 4338.171 | 1292.92 | 335.5% |  19.8% |    5746903 |
| 6400 | 8499.750 | 29.437 | 8529.187 | 3275.66 | 260.4% |  29.6% |   10543919 |
|------+----------+--------+----------+---------+--------+--------+------------|

So even with smaller number of files (where the serial portion of the
code is not dominating yet) you see that the faster machine expends more
cycles already.  Looking at the differences there is a strong indication
for those pagefaults to constitute the main portion of that extra time.
The last column is the time per pagefault in µs assuming that the extra
time was all spent there.  This is obviously not quite correct, as that
number should roughly be constant if that assumption holds, but it's
close enough to uphold the original hypothesis.

*** Linux vs. Cygwin
|------+--------------+--------------+----------------+--------------+------------+------|
|    n | Linux  total | Linux scaled | Cygwin   total | Cygwin-Linux | pagefaults | t/pf |
|------+--------------+--------------+----------------+--------------+------------+------|
|  100 |      116.279 |      104.651 |        142.702 |       38.051 |     327860 | 116. |
|  200 |      145.616 |      131.054 |        199.468 |       68.414 |     452870 | 151. |
|  400 |      288.755 |      259.880 |        428.077 |      168.197 |     752357 | 224. |
|  800 |      517.911 |      466.120 |        825.749 |      359.629 |    1277198 | 282. |
| 1600 |     1231.100 |     1107.990 |       1782.061 |      674.071 |    3011298 | 224. |
| 3200 |     2511.882 |     2260.694 |       4338.171 |     2077.477 |    5746903 | 361. |
| 6400 |     4386.045 |     3947.441 |       8529.187 |     4581.746 |   10543919 | 435. |
|------+--------------+--------------+----------------+--------------+------------+------|



Regards,
Achim.
--
+<[Q+ Matrix-12 WAVE#46+305 Neuron microQkb Andromeda XTk Blofeld]>+

SD adaptation for Waldorf rackAttack V1.04R1:
http://Synth.Stromeko.net/Downloads.html#WaldorfSDada
Reply | Threaded
Open this post in threaded view
|

Re: Extreme slowdown due to malloc?

Mark Geisert
Hi Achim,
Thank you very much for the detailed instructions and also the comparison data
Linux vs Cygwin for all those testcases.

Achim Gratz wrote:
> ASSI writes:
>>> I have a Cygwin malloc speedup patch that *might* help the m-t part.
>>> I'll prepare and submit that to cygwin-patches shortly.
>>
>> Well, if you want to test it with the new ZStandard, give it a spin…
>> I'll check how far I can strip that test down so you can use the Cygwin
>> source tree for testing.

I've now done this.  And I don't see any improvement.  Reasons below...

> OK, it's actually pretty simple, do this inside a checkout of
> newlib-cygwin:
>
> $ find newlib winsup texinfo -type f > flist
> $ zstd --train-cover --ultra -22 -T0 -vv --filelist=flist -o dict-cover
>
> On Linux, it reads in all the files in about two seconds, while it takes
> quite a while longer on Cygwin.  But the real bummer is that
> constructing the partial suffix arrays (which is single-threaded) will
> seemingly take forever, while it's done much faster on Linux.  You can
> pare down the number of files like that:
>
> $ shuf -n 320 flist > slist

I've settled on '-n 1600' for testing.  I'm running these Cygwin tests on a 2C/4T
i3-something with 8GB memory and an SSD used for filesystem and page file.  Not a
dog but clearly not a dire-wolf either.

The page fault numbers are comparable to what you've shown for Cygwin on your
system.  The long pause after zstd prints "Constructing partial suffix array" is
because zstd is cpu-bound in qsort() for a long time.  No paging during that time.
  Then when the statistics start being printed out, that's when the paging
insanity starts.

What I discovered is that zstd is repeatedly asking malloc() for large memory
blocks, presumably to mmap files in, then free()ing them.  Any malloc request 256K
or larger is fulfilled by mmap() rather than enlarging the heap for it.  But
crucially, there is no mechanism for our malloc to hang on to freed mmap()ed pages
for future use.  If you free an mmap()ed block, it is unmap()ed immediately.  So
for zstd's usage pattern you get an incredible number of page faults to satisfy
the mmap()s and Windows seems to take a non-trivial bit of time for each mmap().

I will be looking at our malloc implementation to see if tuning something can fix
this behavior.  Adding code is the last resort.
Thanks again for the great testcase.

..mark
Reply | Threaded
Open this post in threaded view
|

Re: Extreme slowdown due to malloc?

Achim Gratz
Mark Geisert writes:
> The page fault numbers are comparable to what you've shown for Cygwin
> on your system.  The long pause after zstd prints "Constructing
> partial suffix array" is because zstd is cpu-bound in qsort() for a
> long time.  No paging during that time.

Then there's probably another performance bug in qsort since that step
finishes much faster on Linux too.

> Then when the statistics start being printed out, that's when the
> paging insanity starts.

OK, so you see the same thing.  Good.

> What I discovered is that zstd is repeatedly asking malloc() for large
> memory blocks, presumably to mmap files in, then free()ing them.

Oh, so it's not recycling those… maybe that's not very hard to change.

> Any malloc request 256K or larger is fulfilled by mmap() rather than
> enlarging the heap for it.  But crucially, there is no mechanism for
> our malloc to hang on to freed mmap()ed pages for future use.  If you
> free an mmap()ed block, it is unmap()ed immediately.  So for zstd's
> usage pattern you get an incredible number of page faults to satisfy
> the mmap()s and Windows seems to take a non-trivial bit of time for
> each mmap().

It probably puts backing store behind it even when there's enough
memory.  Come to think of it I should try to move the page file to the
NVMe drive (off the SSD it's on currently) and see if that changes
things.


Regards,
Achim.
--
+<[Q+ Matrix-12 WAVE#46+305 Neuron microQkb Andromeda XTk Blofeld]>+

Factory and User Sound Singles for Waldorf Q+, Q and microQ:
http://Synth.Stromeko.net/Downloads.html#WaldorfSounds