
runqueue: Add support for BB_LOADFACTOR_MAX

Message ID 20240221132103.794574-1-richard.purdie@linuxfoundation.org
State Accepted, archived
Commit 14a27306f6dceb4999c2804ccae5a09cc3d8dd49

Commit Message

Richard Purdie Feb. 21, 2024, 1:21 p.m. UTC
Some distros don't enable /proc/pressure and it tends to be those which we
see bitbake timeout issues on, seemingly as load gets too high and the bitbake
processes don't get scheduled in for minutes at a time.

Add support for stopping running extra tasks if the system load average goes
above a certain threshold by setting BB_LOADFACTOR_MAX.

The value used is scaled by CPU number, so a value of 1 corresponds to a
load average equal to the number of CPU cores of the system. A value under
one only starts tasks when the load average is below the number of cores.

This means you can centrally set a value such as 1.5 which will then
scale correctly to different sized machines with differing numbers
of CPUs.
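
As a rough illustration of the scaling described above, here is a minimal
sketch of the comparison. The function name and the optional arguments are
hypothetical and exist only so the example can be run with fixed numbers;
the fallback to one CPU when os.cpu_count() returns None is also an
assumption of this sketch and is not part of the patch.

```python
import os


def loadfactor_exceeded(max_loadfactor, loadavg_1min=None, cpu_count=None):
    """Return True when the 1-minute load average per CPU exceeds the limit.

    loadavg_1min and cpu_count are hypothetical overrides for illustration;
    when omitted, the values are read from the running system.
    """
    if loadavg_1min is None:
        loadavg_1min = os.getloadavg()[0]
    if cpu_count is None:
        # Assumption: treat an undetectable CPU count as a single CPU.
        cpu_count = os.cpu_count() or 1
    loadfactor = float(loadavg_1min) / cpu_count
    return loadfactor > max_loadfactor


# The same setting of 1.5 behaves differently on different sized machines:
# a load average of 13 is over the limit with 8 CPUs (13 / 8 = 1.625)
# but well under it with 16 CPUs (13 / 16 = 0.8125).
print(loadfactor_exceeded(1.5, loadavg_1min=13.0, cpu_count=8))
print(loadfactor_exceeded(1.5, loadavg_1min=13.0, cpu_count=16))
```

With a limit of 1.5, an 8 core machine stops starting new tasks once the
load average passes 12, and a 16 core machine once it passes 24.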

The pressure regulation is probably more accurate and responsive, however
our graphs do show significant load spikes on some workers and this
patch is aimed at trying to avoid those.

Pressure regulation is used where available in preference to this load
factor regulation when both are set.

Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
---
 lib/bb/runqueue.py | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

Comments

Randy MacLeod Feb. 26, 2024, 11:12 p.m. UTC | #1
On 2024-02-21 8:21 a.m., Richard Purdie wrote:
> Some distros don't enable /proc/pressure and it tends to be those which we
> see bitbake timeout issues on, seemingly as load gets too high and the bitbake
> processes don't get scheduled in for minutes at a time.
>
> Add support for stopping running extra tasks if the system load average goes
> above a certain threshold by setting BB_LOADFACTOR_MAX.
>
> The value used is scaled by CPU number, so a value of 1 corresponds to a
> load average equal to the number of CPU cores of the system. A value under
> one only starts tasks when the load average is below the number of cores.
>
> This means you can centrally set a value such as 1.5 which will then
> scale correctly to different sized machines with differing numbers
> of CPUs.
>
> The pressure regulation is probably more accurate and responsive, however
> our graphs do show significant load spikes on some workers and this
> patch is aimed at trying to avoid those.
>
> Pressure regulation is used where available in preference to this load
> factor regulation when both are set.

For anyone interested, they could enable PSI on Rocky9 by passing:

    psi=1

to the kernel:

https://git.rockylinux.org/staging/rpms/kernel/-/blob/r9/SOURCES/kernel-x86_64-rhel.config#L4219

https://cateee.net/lkddb/web-lkddb/PSI_DEFAULT_DISABLED.html

For now, Richard and I decided to leave the YP AB Rocky instance as we found it
and use it to test the load-based regulator fall-back, since we should test it
somewhere.


I ran some tests over the weekend and there's nothing surprising in that
lower load-factor limits delay the builds:


Note that for the load-factor data, I just multiplied the factor by 1000 
for easy graphing.

I also kept my basement server busy doing some world builds. Although those
runs didn't complete, they made it far enough in the build to conclude that
things were working and that the 1.0 load-factor made the build take ~30%
longer than a lightly regulated or unregulated build (1).
Compare that to ~20% longer builds for core-image-minimal. The increased
delay is likely due to more long-running jobs.

One interesting aspect of the load-factor job regulation is that the first
half of the jobs are largely unregulated, likely since they are mostly fetch
and configure rather than compile. Here's a graph of the job number on the
x-axis and the load / number of tasks on the y-axis from the scheduler logs:

Zooming in on the load factor, you can see how the long averaging time of the
CPU load causes the system to oscillate above and below the desired average:

and if you squint at a load-factor 0.50, you can see the builds start 
and stop in batches:

~/src/distro/yocto/poky.git/scripts/pybootchartgui/pybootchartgui.py buildstats-0.50/20240222030914/
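
The oscillation and batching described above can be reproduced with a toy
model. This is only a sketch under stated assumptions: the load average is
modelled the way Linux computes its 1-minute figure (an exponentially damped
average updated every 5 seconds), every task is assumed to run for exactly
60 seconds and to count as one runnable process, and all of the parameter
values (8 CPUs, 32 task slots) are made up for illustration. It is not
taken from the bitbake scheduler.

```python
import math


def simulate(ncpu=8, max_tasks=32, target=1.0, task_steps=12, steps=120):
    """Toy model: one step is 5 seconds; returns (running, loadfactor) pairs."""
    decay = math.exp(-5.0 / 60.0)  # 1-minute damping per 5-second sample
    load = 0.0
    running = []  # remaining steps for each running task
    history = []
    for _ in range(steps):
        # Retire tasks that have used up their run time.
        running = [r - 1 for r in running if r > 1]
        # Only start new tasks while the load factor is not over the limit.
        if load / ncpu <= target:
            running += [task_steps] * (max_tasks - len(running))
        # Exponentially damped average of the number of running tasks.
        load = load * decay + len(running) * (1.0 - decay)
        history.append((len(running), load / ncpu))
    return history


for step, (ntasks, loadfactor) in enumerate(simulate()):
    if step % 6 == 0:
        print("t=%4ds running=%2d loadfactor=%.2f" % (step * 5, ntasks, loadfactor))
```

Because the average lags the real number of running tasks, the model keeps
starting tasks well after the machine is already oversubscribed, then starts
nothing at all until the average has decayed again, so tasks run in batches.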

Anyhow....
The thing we really care about on the YP autobuilder is system latency, which
causes bitbake server timeouts and ptest timer problems, so let's see if the
load-factor helps reduce the frequency of those problems.

If anyone wants to instrument the bitbake client/server response time 
and graph it over
the time of a build, I'd like to see that data! ;-)

../Randy "I like to graph stuff" MacLeod

1)

$ cd .../b/poky-master-06aab81591
$ cat time-world-1.0.log
Command exited with non-zero status 1
126.66user 28.39system 5:54:20elapsed 0%CPU (0avgtext+0avgdata 43108maxresident)k
848inputs+9776outputs (4major+154315minor)pagefaults 0swaps

$ cat time-world-10.0.log
Command exited with non-zero status 1
138.36user 74.72system 4:30:26elapsed 1%CPU (0avgtext+0avgdata 42824maxresident)k
919080inputs+10408outputs (5646major+185828minor)pagefaults 0swaps

$ cat time-world-100.0.log
Command exited with non-zero status 1
140.67user 78.51system 4:30:42elapsed 1%CPU (0avgtext+0avgdata 43408maxresident)k
933272inputs+10504outputs (5660major+194022minor)pagefaults 0swaps
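
For reference, the ~30% figure follows from the elapsed times in the three
logs above. A small sketch of the arithmetic, treating the 10.0 run as the
baseline (that choice, and the helper name, are mine; none of the runs
completed, so the comparison is approximate):

```python
def elapsed_seconds(hms):
    """Convert an h:mm:ss elapsed time, as printed by time, to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Elapsed times copied from the time-world-*.log output above,
# keyed by the load-factor limit used for each run.
runs = {"1.0": "5:54:20", "10.0": "4:30:26", "100.0": "4:30:42"}

baseline = elapsed_seconds(runs["10.0"])
for factor, elapsed in runs.items():
    secs = elapsed_seconds(elapsed)
    print("load-factor %5s: %5d s (%+.1f%% vs 10.0)"
          % (factor, secs, (secs / baseline - 1) * 100))
```

That is 21260 s against 16226 s, so the 1.0 run took about 31% longer, while
the 10.0 and 100.0 runs are within 0.1% of each other.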

>
> Signed-off-by: Richard Purdie<richard.purdie@linuxfoundation.org>
> ---
>   lib/bb/runqueue.py | 16 ++++++++++++++++
>   1 file changed, 16 insertions(+)
>
> diff --git a/lib/bb/runqueue.py b/lib/bb/runqueue.py
> index e86ccd8c61..6987de3e29 100644
> --- a/lib/bb/runqueue.py
> +++ b/lib/bb/runqueue.py
> @@ -220,6 +220,16 @@ class RunQueueScheduler(object):
>                   bb.note("Pressure status changed to CPU: %s, IO: %s, Mem: %s (CPU: %s/%s, IO: %s/%s, Mem: %s/%s) - using %s/%s bitbake threads" % (pressure_state + pressure_values + (len(self.rq.runq_running.difference(self.rq.runq_complete)), self.rq.number_tasks)))
>               self.pressure_state = pressure_state
>               return (exceeds_cpu_pressure or exceeds_io_pressure or exceeds_memory_pressure)
> +        elif self.rq.max_loadfactor:
> +            limit = False
> +            loadfactor = float(os.getloadavg()[0]) / os.cpu_count()
> +            # bb.warn("Comparing %s to %s" % (loadfactor, self.rq.max_loadfactor))
> +            if loadfactor > self.rq.max_loadfactor:
> +                limit = True
> +            if hasattr(self, "loadfactor_limit") and limit != self.loadfactor_limit:
> +                bb.note("Load average limiting set to %s as load average: %s - using %s/%s bitbake threads" % (limit, loadfactor, len(self.rq.runq_running.difference(self.rq.runq_complete)), self.rq.number_tasks))
> +            self.loadfactor_limit = limit
> +            return limit
>           return False
>   
>       def next_buildable_task(self):
> @@ -1822,6 +1832,7 @@ class RunQueueExecute:
>           self.max_cpu_pressure = self.cfgData.getVar("BB_PRESSURE_MAX_CPU")
>           self.max_io_pressure = self.cfgData.getVar("BB_PRESSURE_MAX_IO")
>           self.max_memory_pressure = self.cfgData.getVar("BB_PRESSURE_MAX_MEMORY")
> +        self.max_loadfactor = self.cfgData.getVar("BB_LOADFACTOR_MAX")
>   
>           self.sq_buildable = set()
>           self.sq_running = set()
> @@ -1875,6 +1886,11 @@ class RunQueueExecute:
>                   bb.fatal("Invalid BB_PRESSURE_MAX_MEMORY %s, minimum value is %s." % (self.max_memory_pressure, lower_limit))
>               if self.max_memory_pressure > upper_limit:
>                   bb.warn("Your build will be largely unregulated since BB_PRESSURE_MAX_MEMORY is set to %s. It is very unlikely that such high pressure will be experienced." % (self.max_io_pressure))
> +
> +        if self.max_loadfactor:
> +            self.max_loadfactor = float(self.max_loadfactor)
> +            if self.max_loadfactor <= 0:
> +                bb.fatal("Invalid BB_LOADFACTOR_MAX %s, needs to be greater than zero." % (self.max_loadfactor))
>               
>           # List of setscene tasks which we've covered
>           self.scenequeue_covered = set()

Patch

diff --git a/lib/bb/runqueue.py b/lib/bb/runqueue.py
index e86ccd8c61..6987de3e29 100644
--- a/lib/bb/runqueue.py
+++ b/lib/bb/runqueue.py
@@ -220,6 +220,16 @@  class RunQueueScheduler(object):
                 bb.note("Pressure status changed to CPU: %s, IO: %s, Mem: %s (CPU: %s/%s, IO: %s/%s, Mem: %s/%s) - using %s/%s bitbake threads" % (pressure_state + pressure_values + (len(self.rq.runq_running.difference(self.rq.runq_complete)), self.rq.number_tasks)))
             self.pressure_state = pressure_state
             return (exceeds_cpu_pressure or exceeds_io_pressure or exceeds_memory_pressure)
+        elif self.rq.max_loadfactor:
+            limit = False
+            loadfactor = float(os.getloadavg()[0]) / os.cpu_count()
+            # bb.warn("Comparing %s to %s" % (loadfactor, self.rq.max_loadfactor))
+            if loadfactor > self.rq.max_loadfactor:
+                limit = True
+            if hasattr(self, "loadfactor_limit") and limit != self.loadfactor_limit:
+                bb.note("Load average limiting set to %s as load average: %s - using %s/%s bitbake threads" % (limit, loadfactor, len(self.rq.runq_running.difference(self.rq.runq_complete)), self.rq.number_tasks))
+            self.loadfactor_limit = limit
+            return limit
         return False
 
     def next_buildable_task(self):
@@ -1822,6 +1832,7 @@  class RunQueueExecute:
         self.max_cpu_pressure = self.cfgData.getVar("BB_PRESSURE_MAX_CPU")
         self.max_io_pressure = self.cfgData.getVar("BB_PRESSURE_MAX_IO")
         self.max_memory_pressure = self.cfgData.getVar("BB_PRESSURE_MAX_MEMORY")
+        self.max_loadfactor = self.cfgData.getVar("BB_LOADFACTOR_MAX")
 
         self.sq_buildable = set()
         self.sq_running = set()
@@ -1875,6 +1886,11 @@  class RunQueueExecute:
                 bb.fatal("Invalid BB_PRESSURE_MAX_MEMORY %s, minimum value is %s." % (self.max_memory_pressure, lower_limit))
             if self.max_memory_pressure > upper_limit:
                 bb.warn("Your build will be largely unregulated since BB_PRESSURE_MAX_MEMORY is set to %s. It is very unlikely that such high pressure will be experienced." % (self.max_io_pressure))
+
+        if self.max_loadfactor:
+            self.max_loadfactor = float(self.max_loadfactor)
+            if self.max_loadfactor <= 0:
+                bb.fatal("Invalid BB_LOADFACTOR_MAX %s, needs to be greater than zero." % (self.max_loadfactor))
             
         # List of setscene tasks which we've covered
         self.scenequeue_covered = set()