[bitbake-devel,2/2] runqueue: Introduce load balanced task spawning

Submitted by Andreas Müller on Aug. 13, 2018, 9:04 p.m. | Patch ID: 153710

Details

Message ID 20180813210445.32496-3-schnitzeltony@gmail.com
State New

Commit Message

Andreas Müller Aug. 13, 2018, 9:04 p.m.
To get the most out of the build host, bitbake now detects whether the CPU
workload is low. If so, additional tasks are spawned. The maximum number of
'dynamic' tasks is set by BB_NUMBER_THREADS_LOW_CPU.

So now the user can set a range for the number of tasks:

Min: BB_NUMBER_THREADS
Max: BB_NUMBER_THREADS_LOW_CPU

in which bitbake can operate on demand.
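
For example, a hypothetical local.conf snippet (illustrative values only):

  BB_NUMBER_THREADS = "4"           # baseline task count
  BB_NUMBER_THREADS_LOW_CPU = "12"  # ceiling used only while CPU load is low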

Some numbers for a 6-core AMD Bulldozer with 12GB RAM / image built from
scratch with 3104 tasks / PARALLEL_MAKE = "-j6" / PARALLEL_MAKEINST = "-j6":

Before the patch (same as BB_NUMBER_THREADS_LOW_CPU = "0" or not set):

 BB_NUMBER_THREADS | Build time
------------------------------------
 2                 | 156m48.741s
------------------------------------
 3                 | 126m27.426s
------------------------------------
 4                 | 114m30.560s  <-- winner (as suggested in the docs!)
------------------------------------
 5                 | 117m2.679s
------------------------------------
 6                 | 116m37.515s
------------------------------------
 8                 | 116m37.515s
------------------------------------
 10                | 118m18.441s
------------------------------------
 12                | 117m38.264s

With the patch applied and BB_NUMBER_THREADS_LOW_CPU = "20" (the maximum
thread count suggested in the docs):

 BB_NUMBER_THREADS | Build time
------------------------------------
 3                 | 114m48.105s
------------------------------------
 4                 | 113m26.502s
------------------------------------

Some additional notes:

+ Although not tested, I expect a bigger improvement for setscene sessions
+ Back when do_package_qa was still a dependency of do_rootfs, the
  performance win would have been more significant: for the image tested, only
  very few do_package_qa tasks ran during do_rootfs. The static threads = 4
  winner had more of them and would have waited longer - sigh :)
+ It's more fun to watch bitbake at work torturing the CPU. If you want to do
  so and use GNOME's System Monitor, be aware that the CPU History is delayed
  by 2-3s. I was sometimes wondering 'why more tasks now?'
- For building an image from scratch the performance win is somewhat
  disappointing, ~1%
- The patch creates a dependency on psutil
- time.monotonic was not used in bitbake before. It was introduced in
  Python 3.3 (2012) and is considered supported in all environments (whatever
  that means) since 3.5 (2015).

Signed-off-by: Andreas Müller <schnitzeltony@gmail.com>
---
 lib/bb/runqueue.py | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/lib/bb/runqueue.py b/lib/bb/runqueue.py
index 7095ea5a..2690c2a2 100644
--- a/lib/bb/runqueue.py
+++ b/lib/bb/runqueue.py
@@ -37,6 +37,8 @@  from bb import monitordisk
 import subprocess
 import pickle
 from multiprocessing import Process
+import psutil
+import time
 
 bblogger = logging.getLogger("BitBake")
 logger = logging.getLogger("BitBake.RunQueue")
@@ -1668,6 +1670,7 @@  class RunQueueExecute:
         self.rqdata = rq.rqdata
 
         self.number_tasks = int(self.cfgData.getVar("BB_NUMBER_THREADS") or 1)
+        self.number_tasks_low_cpu = int(self.cfgData.getVar("BB_NUMBER_THREADS_LOW_CPU") or 0)
         self.scheduler = self.cfgData.getVar("BB_SCHEDULER") or "speed"
 
         self.runq_buildable = set()
@@ -1679,6 +1682,8 @@  class RunQueueExecute:
         self.failed_tids = []
 
         self.stampcache = {}
+        self.last_cpu_percent = psutil.cpu_percent()
+        self.last_cpu_percent_time = time.monotonic()
 
         for mc in rq.worker:
             rq.worker[mc].pipe.setrunqueueexec(self)
@@ -1687,6 +1692,8 @@  class RunQueueExecute:
 
         if self.number_tasks <= 0:
              bb.fatal("Invalid BB_NUMBER_THREADS %s" % self.number_tasks)
+        if self.number_tasks_low_cpu < 0:
+             bb.fatal("Invalid BB_NUMBER_THREADS_LOW_CPU %s" % self.number_tasks_low_cpu)
 
     def runqueue_process_waitpid(self, task, status):
 
@@ -1756,6 +1763,17 @@  class RunQueueExecute:
 
     def can_start_task(self):
         can_start = self.stats.active < self.number_tasks
+        # Can we inject extra tasks for low workload?
+        if not can_start and self.number_tasks_low_cpu > 0:
+            _time = time.monotonic()
+            # avoid workload inaccuracy
+            if _time - self.last_cpu_percent_time >= 0.1:
+                cpu_percent = psutil.cpu_percent()
+                self.last_cpu_percent = cpu_percent
+                self.last_cpu_percent_time = _time
+                if cpu_percent < 90 and self.stats.active < self.number_tasks_low_cpu:
+                    can_start = True
+
         return can_start
 
 class RunQueueExecuteDummy(RunQueueExecute):
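
A note on the psutil usage above: psutil.cpu_percent() called without an
interval reports utilization since the previous call, which is why the patch
keeps last_cpu_percent_time and skips readings taken less than 0.1s apart.
A minimal illustration (assumes psutil is installed; numbers are examples):

    import time
    import psutil

    psutil.cpu_percent()          # first call has no reference point, returns 0.0
    time.sleep(0.5)               # allow a sampling window to accumulate
    print(psutil.cpu_percent())   # utilization since the previous call, e.g. 12.3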

Comments

Alexander Kanavin Aug. 13, 2018, 9:20 p.m.
2018-08-13 23:04 GMT+02:00 Andreas Müller <schnitzeltony@gmail.com>:
> To get most out of build host, bitbake now detects if the CPU workload is low.
> If so, additional tasks are spawned. Maximum 'dynamic' tasks are set by
> BB_NUMBER_THREADS_LOW_CPU.

So the best improvement is going from 114.5 minutes to 113.5 minutes?
I don't think it's worth the trouble. Maybe it's time to invest in 16
(or even 32!) core amd threadripper? :)

Alex
Andreas Müller Aug. 13, 2018, 9:30 p.m.
On Mon, Aug 13, 2018 at 11:20 PM, Alexander Kanavin
<alex.kanavin@gmail.com> wrote:
> 2018-08-13 23:04 GMT+02:00 Andreas Müller <schnitzeltony@gmail.com>:
>> To get most out of build host, bitbake now detects if the CPU workload is low.
>> If so, additional tasks are spawned. Maximum 'dynamic' tasks are set by
>> BB_NUMBER_THREADS_LOW_CPU.
>
> So the best improvement is going from 114.5 minutes to 113.5 minutes?
> I don't think it's worth the trouble. Maybe it's time to invest in 16
> (or even 32!) core amd threadripper? :)
>
Indeed - but I have to wait till next holiday to assemble such kind of machine.

I still can't believe the results are that disappointing...

Andreas
Andre McCurdy Aug. 13, 2018, 10:37 p.m.
On Mon, Aug 13, 2018 at 2:30 PM, Andreas Müller <schnitzeltony@gmail.com> wrote:
> On Mon, Aug 13, 2018 at 11:20 PM, Alexander Kanavin
> <alex.kanavin@gmail.com> wrote:
>> 2018-08-13 23:04 GMT+02:00 Andreas Müller <schnitzeltony@gmail.com>:
>>> To get most out of build host, bitbake now detects if the CPU workload is low.
>>> If so, additional tasks are spawned. Maximum 'dynamic' tasks are set by
>>> BB_NUMBER_THREADS_LOW_CPU.
>>
>> So the best improvement is going from 114.5 minutes to 113.5 minutes?
>> I don't think it's worth the trouble. Maybe it's time to invest in 16
>> (or even 32!) core amd threadripper? :)
>>
> Indeed - but I have to wait till next holiday to assemble such kind of machine.
>
> I still can't believe the results are that disappointing...

I wonder if you've experimented with the opposite approach, ie
spawning fewer tasks when CPU load is very high? If a single
do_compile task can fully load all CPUs, not running other tasks in
parallel with it (especially another do_compile) might give some
benefit?

Dynamically boosting and dynamically lowering BB_NUMBER_THREADS based
on overall CPU load both seem like logical things to do.
Andreas Müller Aug. 14, 2018, 12:11 a.m.
On Tue, Aug 14, 2018 at 12:37 AM, Andre McCurdy <armccurdy@gmail.com> wrote:
> On Mon, Aug 13, 2018 at 2:30 PM, Andreas Müller <schnitzeltony@gmail.com> wrote:
>> On Mon, Aug 13, 2018 at 11:20 PM, Alexander Kanavin
>> <alex.kanavin@gmail.com> wrote:
>>> 2018-08-13 23:04 GMT+02:00 Andreas Müller <schnitzeltony@gmail.com>:
>>>> To get most out of build host, bitbake now detects if the CPU workload is low.
>>>> If so, additional tasks are spawned. Maximum 'dynamic' tasks are set by
>>>> BB_NUMBER_THREADS_LOW_CPU.
>>>
>>> So the best improvement is going from 114.5 minutes to 113.5 minutes?
>>> I don't think it's worth the trouble. Maybe it's time to invest in 16
>>> (or even 32!) core amd threadripper? :)
>>>
>> Indeed - but I have to wait till next holiday to assemble such kind of machine.
>>
>> I still can't believe the results are that disappointing...
>
> I wonder if you've experimented with the opposite approach, ie
> spawning fewer tasks when CPU load is very high? If a single
> do_compile task can fully load all CPUs, not running other tasks in
> parallel with it (especially another do_compile) might give some
> benefit?
How shall this work? 100% should be the target.
>
> Dynamically boosting and dynamically lowering BB_NUMBER_THREADS based
> on overall CPU load both seem like logical things to do.
Meanwhile I tested this on another machine (yeah, should have done that
before sending out): quite good processor / poor hard disk. As soon as the
hard disk is the bottleneck (or when swapping) -> workload goes down ->
task explosion. Not exactly a good idea.

So better to go with Alex's suggestion :)

Andreas
Andre McCurdy Aug. 14, 2018, 1:01 a.m.
On Mon, Aug 13, 2018 at 5:11 PM, Andreas Müller <schnitzeltony@gmail.com> wrote:
> On Tue, Aug 14, 2018 at 12:37 AM, Andre McCurdy <armccurdy@gmail.com> wrote:
>> On Mon, Aug 13, 2018 at 2:30 PM, Andreas Müller <schnitzeltony@gmail.com> wrote:
>>> On Mon, Aug 13, 2018 at 11:20 PM, Alexander Kanavin
>>> <alex.kanavin@gmail.com> wrote:
>>>> 2018-08-13 23:04 GMT+02:00 Andreas Müller <schnitzeltony@gmail.com>:
>>>>> To get most out of build host, bitbake now detects if the CPU workload is low.
>>>>> If so, additional tasks are spawned. Maximum 'dynamic' tasks are set by
>>>>> BB_NUMBER_THREADS_LOW_CPU.
>>>>
>>>> So the best improvement is going from 114.5 minutes to 113.5 minutes?
>>>> I don't think it's worth the trouble. Maybe it's time to invest in 16
>>>> (or even 32!) core amd threadripper? :)
>>>>
>>> Indeed - but I have to wait till next holiday to assemble such kind of machine.
>>>
>>> I still can't believe the results are that disappointing...
>>
>> I wonder if you've experimented with the opposite approach, ie
>> spawning fewer tasks when CPU load is very high? If a single
>> do_compile task can fully load all CPUs, not running other tasks in
>> parallel with it (especially another do_compile) might give some
>> benefit?
> How shall this work? 100% should be target.

Aim should be to limit the chance that the CPUs are completely
overloaded, e.g. with 4 CPUs, try to avoid running 4 x do_compile in
parallel.

If you define a single do_compile task which is able to load all CPUs
as "100%" then the limit (not target) should perhaps be 200%?

Some experimentation would be needed to fine tune.

>> Dynamically boosting and dynamically lowering BB_NUMBER_THREADS based
>> on overall CPU load both seem like logical things to do.
> Meanwhile I tested this on another machine (yeah should have done
> before sending out): Quite good processor / poor harddisk. As soon as
> harddisk is the bottleneck (or when swapping) -> workload goes down ->
> task explosion. Not exactly a good idea.

What does task explosion mean? Hitting the BB_NUMBER_THREADS_LOW_CPU limit?

If BB_NUMBER_THREADS_LOW_CPU is set to something fairly safe (1.5 x
BB_NUMBER_THREADS ?) hitting that limit doesn't seem like a big issue.

Or does "task explosion" mean your implementation was buggy and
BB_NUMBER_THREADS_LOW_CPU was exceeded?
Martin Jansa Aug. 14, 2018, 1:11 a.m.
On Mon, Aug 13, 2018 at 11:20:50PM +0200, Alexander Kanavin wrote:
> 2018-08-13 23:04 GMT+02:00 Andreas Müller <schnitzeltony@gmail.com>:
> > To get most out of build host, bitbake now detects if the CPU workload is low.
> > If so, additional tasks are spawned. Maximum 'dynamic' tasks are set by
> > BB_NUMBER_THREADS_LOW_CPU.
> 
> So the best improvement is going from 114.5 minutes to 113.5 minutes?
> I don't think it's worth the trouble. Maybe it's time to invest in 16
> (or even 32!) core amd threadripper? :)

IMHO the best improvement was for 3 BB_NUMBER_THREADS, and possibly an even
bigger one for 2 BB_NUMBER_THREADS.


>  2                 | 156m48.741s
> ------------------------------------
>  3                 | 126m27.426s
> ------------------------------------

...

> With the patch applied and BB_NUMBER_THREADS_LOW_CPU = "20" (as
> written in docs
> for max thread count)
> 
>  BB_NUMBER_THREADS | Build time [s]
> ------------------------------------
>  3                 | 114m48.105s

I'm running with 2 BB_NUMBER_THREADS on similar HW (8-core Bulldozer
FX(tm)-8120, 32GB RAM), because it leaves the desktop usable for other
stuff while some build is running in the background.

With 4 BB_NUMBER_THREADS, a big build and a bit of bad luck, I was getting
4 do_compile tasks like qtbase, chromium, firefox and gimp at the same
time, which either makes me drink too much coffee or even invites uncle
OOMK.

I quite like the idea behind BB_NUMBER_THREADS_LOW_CPU.
Mikko Rapeli Aug. 14, 2018, 6:32 a.m.
Just my 2 € cents to the discussion:

we had to limit the number of threads because heavy C++ projects were using
all of the RAM and causing heavy swapping. Single g++ processes were eating
up to 20 gigabytes of physical RAM. It's not just the CPU that is the
limiting factor for parallel task execution.

-Mikko
Andreas Müller Aug. 14, 2018, 7:57 a.m.
On Tue, Aug 14, 2018 at 8:32 AM,  <Mikko.Rapeli@bmw.de> wrote:
> Just my 2 € cents to the discussion:
>
> we had to limit number of threads because heavy C++ projects were using
> all of RAM and causing heavy swapping. Single g++ processes were eating
> up to 20 gigabytes of physical ram. It's not just the CPU which is the
> limiting factor to parallel task execution.
>
LOL: My first approach was to extra-spawn when less than 50% of memory was
occupied. That solution performed really badly. I think adding tasks to a
CPU already running at ~100% just causes overhead, reducing overall
performance. I think the target for best performance is running the CPU at
100% with as few tasks as possible.

Again: the major problem with this solution is that a low workload caused
by the CPU waiting for the hard disk or resuming from swap spawns
additional tasks. That is totally wrong.

Andreas
Alexander Kanavin Aug. 14, 2018, 8:07 a.m.
2018-08-14 8:32 GMT+02:00  <Mikko.Rapeli@bmw.de>:
> Just my 2 € cents to the discussion:
>
> we had to limit number of threads because heavy C++ projects were using
> all of RAM and causing heavy swapping. Single g++ processes were eating
> up to 20 gigabytes of physical ram. It's not just the CPU which is the
> limiting factor to parallel task execution.

I do believe some kind of clever dynamic limiter would be useful here.
Obviously it's an absurd situation when on a 32 core processor there
are 32 do_compile c++ tasks, each running 32 instances of g++ - which
is the default configuration. On the other hand running things like
do_configure or do_install should happen in parallel. I like Andre's
idea, but it should also watch the available RAM.

Alex
Andreas Müller Aug. 14, 2018, 9:05 a.m.
On Tue, Aug 14, 2018 at 12:37 AM, Andre McCurdy <armccurdy@gmail.com> wrote:
> On Mon, Aug 13, 2018 at 2:30 PM, Andreas Müller <schnitzeltony@gmail.com> wrote:
>> On Mon, Aug 13, 2018 at 11:20 PM, Alexander Kanavin
>> <alex.kanavin@gmail.com> wrote:
>>> 2018-08-13 23:04 GMT+02:00 Andreas Müller <schnitzeltony@gmail.com>:
>>>> To get most out of build host, bitbake now detects if the CPU workload is low.
>>>> If so, additional tasks are spawned. Maximum 'dynamic' tasks are set by
>>>> BB_NUMBER_THREADS_LOW_CPU.
>>>
>>> So the best improvement is going from 114.5 minutes to 113.5 minutes?
>>> I don't think it's worth the trouble. Maybe it's time to invest in 16
>>> (or even 32!) core amd threadripper? :)
>>>
>> Indeed - but I have to wait till next holiday to assemble such kind of machine.
>>
>> I still can't believe the results are that disappointing...
>
> I wonder if you've experimented with the opposite approach, ie
> spawning fewer tasks when CPU load is very high? If a single
> do_compile task can fully load all CPUs, not running other tasks in
> parallel with it (especially another do_compile) might give some
> benefit?
>
> Dynamically boosting and dynamically lowering BB_NUMBER_THREADS based
> on overall CPU load both seem like logical things to do.

I think the patch does this if you interpret the parameters differently.
So if you currently use BB_NUMBER_THREADS = 10 and want to go down to
2 in case of heavy load, set

BB_NUMBER_THREADS = 2
BB_NUMBER_THREADS_LOW_CPU = 10

Andreas
Alexander Kanavin Aug. 14, 2018, 9:18 a.m.
2018-08-14 11:05 GMT+02:00 Andreas Müller <schnitzeltony@gmail.com>:
>> I wonder if you've experimented with the opposite approach, ie
>> spawning fewer tasks when CPU load is very high? If a single
>> do_compile task can fully load all CPUs, not running other tasks in
>> parallel with it (especially another do_compile) might give some
>> benefit?
>>
>> Dynamically boosting and dynamically lowering BB_NUMBER_THREADS based
>> on overall CPU load both seem like logical things to do.
>
> I think the patch does this of you interpret parameters differently.
> So if you use BB_NUMBER_THREADS = 10 currently and want to go down to
> 2 in case heavy load set
>
> BB_NUMBER_THREADS = 2
> BB_NUMBER_THREADS_LOW_CPU = 10

Right, then that's quite useful! Is it also possible to detect the low
RAM situation?

Perhaps it's better to rename the parameters to BB_NUMBER_THREADS_MIN
and BB_NUMBER_THREADS_MAX? Then bitbake would run tasks within the
range, with the aim of keeping the CPU loaded, but not overloaded.
There could even be reasonable defaults: number of cores/threads for
MAX, and 2 for MIN.

Alex
Richard Purdie Aug. 14, 2018, 9:43 a.m.
On Tue, 2018-08-14 at 10:07 +0200, Alexander Kanavin wrote:
> 2018-08-14 8:32 GMT+02:00  <Mikko.Rapeli@bmw.de>:
> > Just my 2 € cents to the discussion:
> > 
> > we had to limit number of threads because heavy C++ projects were
> > using
> > all of RAM and causing heavy swapping. Single g++ processes were
> > eating
> > up to 20 gigabytes of physical ram. It's not just the CPU which is
> > the
> > limiting factor to parallel task execution.
> 
> I do believe some kind of clever dynamic limiter would be useful
> here.
> Obviously it's an absurd situation when on a 32 core processor there
> are 32 do_compile c++ tasks, each running 32 instances of g++ - which
> is the default configuration. On the other hand running things like
> do_configure or do_install should happen in parallel. I like Andre's
> idea, but it should also watch the available RAM.

If people want to play, I did experiment with "proper" thread pooling:

http://git.yoctoproject.org/cgit.cgi/poky-contrib/commit/?h=rpurdie/wipqueue4&id=d66a327fb6189db5de8bc489859235dcba306237

This implements a make job server within bitbake, then connects make to
it. The net result is that you can then put a limit on the number of
processes across all tasks.
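
In essence the GNU make jobserver is just a pipe pre-loaded with job tokens:
a job takes a token byte before it starts and writes it back when it
finishes. A rough sketch of the protocol (not the patch itself; make >= 4.2
MAKEFLAGS syntax assumed):

    import os

    def start_jobserver(slots):
        # Shared token pool: one byte per available job slot.
        r, w = os.pipe()
        # Children must inherit the pipe fds (e.g. via subprocess pass_fds).
        os.set_inheritable(r, True)
        os.set_inheritable(w, True)
        # Pre-load slots-1 tokens; every running job implicitly holds one.
        os.write(w, b"+" * (slots - 1))
        # Sub-makes discover the pipe through MAKEFLAGS.
        os.environ["MAKEFLAGS"] = "-j%d --jobserver-auth=%d,%d" % (slots, r, w)
        return r, w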

I seem to remember there were some bugs with it, and not all are listed
in the commit message; I don't remember what the other issues were...

Bonus marks for connecting in the other parallel "pool" tasks we have
in do_package_* and friends, but even a common pool for compile would be
nice!

Cheers,

Richard
Richard Purdie Aug. 14, 2018, 9:45 a.m.
On Tue, 2018-08-14 at 10:43 +0100, Richard Purdie wrote:
> On Tue, 2018-08-14 at 10:07 +0200, Alexander Kanavin wrote:
> > 2018-08-14 8:32 GMT+02:00  <Mikko.Rapeli@bmw.de>:
> > > Just my 2 € cents to the discussion:
> > > 
> > > we had to limit number of threads because heavy C++ projects were
> > > using
> > > all of RAM and causing heavy swapping. Single g++ processes were
> > > eating
> > > up to 20 gigabytes of physical ram. It's not just the CPU which
> > > is
> > > the
> > > limiting factor to parallel task execution.
> > 
> > I do believe some kind of clever dynamic limiter would be useful
> > here.
> > Obviously it's an absurd situation when on a 32 core processor
> > there
> > are 32 do_compile c++ tasks, each running 32 instances of g++ -
> > which
> > is the default configuration. On the other hand running things like
> > do_configure or do_install should happen in parallel. I like
> > Andre's
> > idea, but it should also watch the available RAM.
> 
> If people want to play, I did experiment with "proper" thread
> pooling:
> 
> http://git.yoctoproject.org/cgit.cgi/poky-contrib/commit/?h=rpurdie/w
> ipqueue4&id=d66a327fb6189db5de8bc489859235dcba306237

More recent version of the patch:

http://git.yoctoproject.org/cgit.cgi/poky-contrib/commit/?h=rpurdie/wipqueue7&id=236ca8be128ba7a4edb9fb2c9e512d181679eee8

Cheers,

Richard
Peter Kjellerstedt Aug. 14, 2018, 10:28 a.m.
> -----Original Message-----
> From: bitbake-devel-bounces@lists.openembedded.org <bitbake-devel-
> bounces@lists.openembedded.org> On Behalf Of Richard Purdie
> Sent: den 14 augusti 2018 11:45
> To: Alexander Kanavin <alex.kanavin@gmail.com>; Mikko.Rapeli@bmw.de;
> Andreas Müller <schnitzeltony@gmail.com>; Martin Jansa
> <martin.jansa@gmail.com>
> Cc: bitbake-devel@lists.openembedded.org
> Subject: Re: [bitbake-devel] [PATCH 2/2] runqueue: Introduce load
> balanced task spawning
>
> On Tue, 2018-08-14 at 10:43 +0100, Richard Purdie wrote:
> > On Tue, 2018-08-14 at 10:07 +0200, Alexander Kanavin wrote:
> > > 2018-08-14 8:32 GMT+02:00  <Mikko.Rapeli@bmw.de>:
> > > > Just my 2 € cents to the discussion:
> > > >
> > > > we had to limit number of threads because heavy C++ projects
> > > > were using all of RAM and causing heavy swapping. Single g++
> > > > processes were eating up to 20 gigabytes of physical ram.
> > > > It's not just the CPU which is the limiting factor to
> > > > parallel task execution.
> > >
> > > I do believe some kind of clever dynamic limiter would be
> > > useful here. Obviously it's an absurd situation when on a 32
> > > core processor there are 32 do_compile c++ tasks, each running
> > > 32 instances of g++ - which is the default configuration. On
> > > the other hand running things like do_configure or do_install
> > > should happen in parallel. I like Andre's idea, but it should
> > > also watch the available RAM.
> >
> > If people want to play, I did experiment with "proper" thread
> > pooling:
> >
> > http://git.yoctoproject.org/cgit.cgi/poky-contrib/commit/?h=rpurdie/wipqueue4&id=d66a327fb6189db5de8bc489859235dcba306237
>
> More recent version of the patch:
>
> http://git.yoctoproject.org/cgit.cgi/poky-contrib/commit/?h=rpurdie/wipqueue7&id=236ca8be128ba7a4edb9fb2c9e512d181679eee8
>
> Cheers,
>
> Richard

Even though make is the prevalent tool used to build code, we have 
to consider others such as cmake and meson... Any idea if their 
parallelism can be controlled in some similar way?

//Peter
Andreas Müller Aug. 14, 2018, 10:43 a.m.
On Tue, Aug 14, 2018 at 12:28 PM, Peter Kjellerstedt
<peter.kjellerstedt@axis.com> wrote:
>> -----Original Message-----
>> From: bitbake-devel-bounces@lists.openembedded.org <bitbake-devel-
>> bounces@lists.openembedded.org> On Behalf Of Richard Purdie
>> Sent: den 14 augusti 2018 11:45
>> To: Alexander Kanavin <alex.kanavin@gmail.com>; Mikko.Rapeli@bmw.de;
>> Andreas Müller <schnitzeltony@gmail.com>; Martin Jansa
>> <martin.jansa@gmail.com>
>> Cc: bitbake-devel@lists.openembedded.org
>> Subject: Re: [bitbake-devel] [PATCH 2/2] runqueue: Introduce load
>> balanced task spawning
>>
>> On Tue, 2018-08-14 at 10:43 +0100, Richard Purdie wrote:
>> > On Tue, 2018-08-14 at 10:07 +0200, Alexander Kanavin wrote:
>> > > 2018-08-14 8:32 GMT+02:00  <Mikko.Rapeli@bmw.de>:
>> > > > Just my 2 € cents to the discussion:
>> > > >
>> > > > we had to limit number of threads because heavy C++ projects
>> > > > were using all of RAM and causing heavy swapping. Single g++
>> > > > processes were eating up to 20 gigabytes of physical ram.
>> > > > It's not just the CPU which is the limiting factor to
>> > > > parallel task execution.
>> > >
>> > > I do believe some kind of clever dynamic limiter would be
>> > > useful here. Obviously it's an absurd situation when on a 32
>> > > core processor there are 32 do_compile c++ tasks, each running
>> > > 32 instances of g++ - which is the default configuration. On
>> > > the other hand running things like do_configure or do_install
>> > > should happen in parallel. I like Andre's idea, but it should
>> > > also watch the available RAM.
>> >
>> > If people want to play, I did experiment with "proper" thread
>> > pooling:
>> >
>> > http://git.yoctoproject.org/cgit.cgi/poky-contrib/commit/?h=rpurdie/wipqueue4&id=d66a327fb6189db5de8bc489859235dcba306237
>>
>> More recent version of the patch:
>>
>> http://git.yoctoproject.org/cgit.cgi/poky-contrib/commit/?h=rpurdie/wipqueue7&id=236ca8be128ba7a4edb9fb2c9e512d181679eee8
>>
>> Cheers,
>>
>> Richard
>
> Even though make is the prevalent tool used to build code, we have
> to consider others such as cmake and meson... Any idea if their
> parallelism can be controlled in some similar way?
>
BTW: do_package_ipk is also a CPU eater these days, although these tasks
do not last as long as compile.

Andreas
Alexander Kanavin Aug. 14, 2018, 11:03 a.m.
2018-08-14 11:43 GMT+02:00 Richard Purdie <richard.purdie@linuxfoundation.org>:
> If people want to play, I did experiment with "proper" thread pooling:
>
> http://git.yoctoproject.org/cgit.cgi/poky-contrib/commit/?h=rpurdie/wipqueue4&id=d66a327fb6189db5de8bc489859235dcba306237
>
> This implements a make job server within bitbake, then connects make to
> it. The net result is that you can then put a limit on the number of
> processes across all tasks.
>
> I seem to remember there were some bugs with it and not all are listed
> in the commit message, I don't remember what the other issues were...

Both make and ninja have a -l option:

       -l [load], --load-average[=load]
            Specifies that no new jobs (commands) should be started if
            there are other jobs running and the load average is at least
            load (a floating-point number). With no argument, removes a
            previous load limit.

       -l N   do not start new jobs if the load average is greater than N

Maybe that could be appended to PARALLEL_MAKE, which is far less
invasive than any other approach?
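
For example (hypothetical values; -l takes a load average, so something
around the number of cores is a plausible starting point):

    PARALLEL_MAKE = "-j 6 -l 8"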

Alex
Alexander Kanavin Aug. 14, 2018, 11:07 a.m.
2018-08-14 12:28 GMT+02:00 Peter Kjellerstedt <peter.kjellerstedt@axis.com>:
> Even though make is the prevalent tool used to build code, we have
> to consider others such as cmake and meson... Any idea if their
> parallelism can be controlled in some similar way?

These tools (and qmake etc.) only configure the builds, with makefiles
as output. They delegate the actual build job execution to make and/or
ninja.

Alex
Alexander Kanavin Aug. 15, 2018, 12:43 p.m.
2018-08-14 13:03 GMT+02:00 Alexander Kanavin <alex.kanavin@gmail.com>:
> Both make and ninja have -l option:
>
>       -l [load], --load-average[=load]
>             Specifies  that  no  new  jobs (commands) should be
> started if there are others jobs running and the load average is at
> least load (a floating-point number).  With no argument, removes a
>             previous load limit.
>
>        -l N   do not start new jobs if the load average is greater than N
>
> Maybe that could be appended to PARALLEL_MAKE, which is far less
> invasive than any other approach?

I've done some tests, and yes, -l does help. We currently have a nasty
quadratic growth rate with CPU cores as input, which hits especially
badly when the number of cores is high and a lot of long, heavy (e.g.
C++) do_compile tasks run at once. This potentially means n*n compiler
instances, where n is the number of available CPU cores. The '-l' option
neatly limits that to a constant number of compilers per core.

However, this does not solve the other resource problem: running out
of available RAM and going into swap thrashing. Neither make nor ninja
can currently watch the RAM, even though it is not complicated:
>>> import psutil
>>> psutil.virtual_memory()
svmem(total=16536903680, available=7968600064, percent=51.8,
used=16347615232, free=189288448, active=11750494208,
inactive=2882383872, buffers=3158528000, cached=4620783616)

I think we should teach both to do that, and then replace a static
'number of jobs' limit in PARALLEL_MAKE with limits on CPU load and
RAM usage.
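
A sketch of the kind of gate that would mean (hypothetical helper, threshold
chosen arbitrarily):

    import psutil

    def can_spawn_job(min_available_fraction=0.2):
        # Refuse new jobs when available RAM drops below the threshold,
        # analogous to what make's -l flag does for CPU load.
        vm = psutil.virtual_memory()
        return vm.available / vm.total >= min_available_fraction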

Alex
Andreas Müller Aug. 15, 2018, 3:01 p.m.
On Wed, Aug 15, 2018 at 2:43 PM, Alexander Kanavin
<alex.kanavin@gmail.com> wrote:
> 2018-08-14 13:03 GMT+02:00 Alexander Kanavin <alex.kanavin@gmail.com>:
>> Both make and ninja have -l option:
>>
>>       -l [load], --load-average[=load]
>>             Specifies  that  no  new  jobs (commands) should be
>> started if there are others jobs running and the load average is at
>> least load (a floating-point number).  With no argument, removes a
>>             previous load limit.
>>
>>        -l N   do not start new jobs if the load average is greater than N
>>
>> Maybe that could be appended to PARALLEL_MAKE, which is far less
>> invasive than any other approach?
>
> I've done some tests, and yes -l does help. We currently have a nasty
> quadratic growth rate with cpu cores as input, which hits especially
> badly when the amount of cores is high, and a lot of long, heavy (e.g.
> c++) do_compile tasks run at once. This potentially means n*n compiler
> instances, where n is how many cpu cores are available. '-l' option
> does neatly limit that to a constant amount of compilers per core.
>
> However, this does not solve the other resource problem: running out
> of available RAM and going into swap thrashing. Neither make nor ninja
> can currently watch the RAM, even though it is not complicated:
>>>> import psutil
>>>> psutil.virtual_memory()
> svmem(total=16536903680, available=7968600064, percent=51.8,
> used=16347615232, free=189288448, active=11750494208,
> inactive=2882383872, buffers=3158528000, cached=4620783616)
>
> I think we should teach both to do that, and then replace a static
> 'number of jobs' limit in PARALLEL_MAKE with limits on CPU load and
> RAM usage.
>
> Alex
1. I like -l - I have to try that too!!
2. Quadratic explosion: I think you should reduce the number of
parallel bitbake threads. From what I have seen: when running heavy
compiles with -j = number-of-cores, it takes only 2-3 do_compiles to
keep the CPU load at a constant ~100%. I think this should be
independent of the number of cores. Try 4-5 parallel bitbake threads -
I bet that speeds up your builds and reduces swap floods.

Andreas
Alexander Kanavin Aug. 15, 2018, 4:26 p.m.
2018-08-15 17:01 GMT+02:00 Andreas Müller <schnitzeltony@gmail.com>:
> 2. Quadratic explosion: I think it you should reduce number of
> parallel bitbake threads. From what I have seen: When running heavy
> compiles with -j = number-of-cores it takes only 2-3 do_compiles to
> have CPU load at constant ~100%. I think this should be independent of
> number of cores. Try 4-5 parallel bitbake threads - I bet that speeds
> up your builds and reduces swap floods.

What about the situations where bitbake is mostly busy with other
things than do_compile, or when do_compile takes only a small fraction
of the recipe's build time? Particularly do_configure can be
notoriously slow and single-threaded, so I do want to run those with
all available cores.

Alex