Message ID | 20240221132103.794574-1-richard.purdie@linuxfoundation.org |
---|---|
State | Accepted, archived |
Commit | 14a27306f6dceb4999c2804ccae5a09cc3d8dd49 |
Series | runqueue: Add support for BB_LOADFACTOR_MAX |
On 2024-02-21 8:21 a.m., Richard Purdie wrote:
> Some distros don't enable /proc/pressure and it tends to be those which we
> see bitbake timeout issues on, seemingly as load gets too high and the bitbake
> processes don't get scheduled in for minutes at a time.
>
> Add support for stopping running extra tasks if the system load average goes
> above a certain threshold by setting BB_LOADFACTOR_MAX.
>
> The value used is scaled by CPU number, so a value of 1 would be when
> the load average equals the number of cpu cores of the system; a value under one
> only starts tasks when the load average is below the number of cores.
>
> This means you can centrally set a value such as 1.5 which will then
> scale correctly to different sized machines with differing numbers
> of CPUs.
>
> The pressure regulation is probably more accurate and responsive; however,
> our graphs do show significant load spikes on some workers and this
> patch is aimed at trying to avoid those.
>
> Pressure regulation is used where available in preference to this load
> factor regulation when both are set.

Anyone interested can enable PSI on Rocky 9 by passing psi=1 to the kernel:

https://git.rockylinux.org/staging/rpms/kernel/-/blob/r9/SOURCES/kernel-x86_64-rhel.config#L4219
https://cateee.net/lkddb/web-lkddb/PSI_DEFAULT_DISABLED.html

For now, Richard and I decided to leave the YP AB Rocky instance as we found it and use it to test the load-based regulator fall-back, since we should test it somewhere.

I ran some tests over the weekend and there's nothing surprising: lower load-factor limits delay the builds:

Note that for the load-factor data, I just multiplied the factor by 1000 for easy graphing.

I also kept my basement server busy doing some world builds. Although those runs didn't complete, they made it far enough into the build to conclude that things were working and that the 1.0 load-factor made the build take ~30% longer than a lightly/not regulated build (1). Compare that to ~20% longer builds for core-image-minimal. The increased delay is likely due to more long-running jobs.

One interesting aspect of the load-factor job regulation is that the first half of the jobs are largely unregulated, likely because they are mostly fetch and configure rather than compile. Here's a graph of the job number on the x-axis and the load / number of tasks on the y-axis from the scheduler logs:

Zooming in on the load factor, you can see how the long averaging time of the CPU load causes the system to oscillate above and below the desired average:

and if you squint at a load-factor of 0.50, you can see the builds start and stop in batches:

~/src/distro/yocto/poky.git/scripts/pybootchartgui/pybootchartgui.py buildstats-0.50/20240222030914/

Anyhow....

The thing we really care about on the YP autobuilder is system latency, which causes bitbake server timeouts and ptest timer problems, so let's see if the load-factor helps reduce the frequency of those problems.

If anyone wants to instrument the bitbake client/server response time and graph it over the time of a build, I'd like to see that data!
;-)

../Randy "I like to graph stuff" MacLeod

1)
$ cd .../b/poky-master-06aab81591
$ cat time-world-1.0.log
Command exited with non-zero status 1
126.66user 28.39system 5:54:20elapsed 0%CPU (0avgtext+0avgdata 43108maxresident)k
848inputs+9776outputs (4major+154315minor)pagefaults 0swaps

$ cat time-world-10.0.log
Command exited with non-zero status 1
138.36user 74.72system 4:30:26elapsed 1%CPU (0avgtext+0avgdata 42824maxresident)k
919080inputs+10408outputs (5646major+185828minor)pagefaults 0swaps

$ cat time-world-100.0.log
Command exited with non-zero status 1
140.67user 78.51system 4:30:42elapsed 1%CPU (0avgtext+0avgdata 43408maxresident)k
933272inputs+10504outputs (5660major+194022minor)pagefaults 0swaps

>
> Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
> ---
>  lib/bb/runqueue.py | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
>
> diff --git a/lib/bb/runqueue.py b/lib/bb/runqueue.py
> index e86ccd8c61..6987de3e29 100644
> --- a/lib/bb/runqueue.py
> +++ b/lib/bb/runqueue.py
> @@ -220,6 +220,16 @@ class RunQueueScheduler(object):
>                  bb.note("Pressure status changed to CPU: %s, IO: %s, Mem: %s (CPU: %s/%s, IO: %s/%s, Mem: %s/%s) - using %s/%s bitbake threads" % (pressure_state + pressure_values + (len(self.rq.runq_running.difference(self.rq.runq_complete)), self.rq.number_tasks)))
>              self.pressure_state = pressure_state
>              return (exceeds_cpu_pressure or exceeds_io_pressure or exceeds_memory_pressure)
> +        elif self.rq.max_loadfactor:
> +            limit = False
> +            loadfactor = float(os.getloadavg()[0]) / os.cpu_count()
> +            # bb.warn("Comparing %s to %s" % (loadfactor, self.rq.max_loadfactor))
> +            if loadfactor > self.rq.max_loadfactor:
> +                limit = True
> +            if hasattr(self, "loadfactor_limit") and limit != self.loadfactor_limit:
> +                bb.note("Load average limiting set to %s as load average: %s - using %s/%s bitbake threads" % (limit, loadfactor, len(self.rq.runq_running.difference(self.rq.runq_complete)), self.rq.number_tasks))
> +            self.loadfactor_limit = limit
> +            return limit
>          return False
>
>      def next_buildable_task(self):
> @@ -1822,6 +1832,7 @@ class RunQueueExecute:
>          self.max_cpu_pressure = self.cfgData.getVar("BB_PRESSURE_MAX_CPU")
>          self.max_io_pressure = self.cfgData.getVar("BB_PRESSURE_MAX_IO")
>          self.max_memory_pressure = self.cfgData.getVar("BB_PRESSURE_MAX_MEMORY")
> +        self.max_loadfactor = self.cfgData.getVar("BB_LOADFACTOR_MAX")
>
>          self.sq_buildable = set()
>          self.sq_running = set()
> @@ -1875,6 +1886,11 @@ class RunQueueExecute:
>                  bb.fatal("Invalid BB_PRESSURE_MAX_MEMORY %s, minimum value is %s." % (self.max_memory_pressure, lower_limit))
>              if self.max_memory_pressure > upper_limit:
>                  bb.warn("Your build will be largely unregulated since BB_PRESSURE_MAX_MEMORY is set to %s. It is very unlikely that such high pressure will be experienced." % (self.max_io_pressure))
> +
> +        if self.max_loadfactor:
> +            self.max_loadfactor = float(self.max_loadfactor)
> +            if self.max_loadfactor <= 0:
> +                bb.fatal("Invalid BB_LOADFACTOR_MAX %s, needs to be greater than zero." % (self.max_loadfactor))
>
>          # List of setscene tasks which we've covered
>          self.scenequeue_covered = set()
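The ~30% figure from footnote (1) of the reply can be checked against the elapsed wall-clock times in the time logs. This is only a rough sketch: the helper name `elapsed_seconds` is made up for this illustration, the mapping of log file names to load-factor values is inferred from the file names, and since none of the runs completed the result is approximate at best.

```python
def elapsed_seconds(hms: str) -> int:
    """Convert an 'h:mm:ss' elapsed time, as printed by time(1), to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Elapsed times copied from the world-build logs in footnote (1).
# Assumption: time-world-1.0.log is the load-factor 1.0 run and
# time-world-10.0.log is the lightly regulated run.
regulated = elapsed_seconds("5:54:20")
lightly_regulated = elapsed_seconds("4:30:26")

slowdown = regulated / lightly_regulated - 1.0
print(f"{slowdown:.1%} longer")  # prints "31.0% longer"
```

That is consistent with the "~30% longer" estimate for the world build.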
diff --git a/lib/bb/runqueue.py b/lib/bb/runqueue.py
index e86ccd8c61..6987de3e29 100644
--- a/lib/bb/runqueue.py
+++ b/lib/bb/runqueue.py
@@ -220,6 +220,16 @@ class RunQueueScheduler(object):
                 bb.note("Pressure status changed to CPU: %s, IO: %s, Mem: %s (CPU: %s/%s, IO: %s/%s, Mem: %s/%s) - using %s/%s bitbake threads" % (pressure_state + pressure_values + (len(self.rq.runq_running.difference(self.rq.runq_complete)), self.rq.number_tasks)))
             self.pressure_state = pressure_state
             return (exceeds_cpu_pressure or exceeds_io_pressure or exceeds_memory_pressure)
+        elif self.rq.max_loadfactor:
+            limit = False
+            loadfactor = float(os.getloadavg()[0]) / os.cpu_count()
+            # bb.warn("Comparing %s to %s" % (loadfactor, self.rq.max_loadfactor))
+            if loadfactor > self.rq.max_loadfactor:
+                limit = True
+            if hasattr(self, "loadfactor_limit") and limit != self.loadfactor_limit:
+                bb.note("Load average limiting set to %s as load average: %s - using %s/%s bitbake threads" % (limit, loadfactor, len(self.rq.runq_running.difference(self.rq.runq_complete)), self.rq.number_tasks))
+            self.loadfactor_limit = limit
+            return limit
         return False
 
     def next_buildable_task(self):
@@ -1822,6 +1832,7 @@ class RunQueueExecute:
         self.max_cpu_pressure = self.cfgData.getVar("BB_PRESSURE_MAX_CPU")
         self.max_io_pressure = self.cfgData.getVar("BB_PRESSURE_MAX_IO")
         self.max_memory_pressure = self.cfgData.getVar("BB_PRESSURE_MAX_MEMORY")
+        self.max_loadfactor = self.cfgData.getVar("BB_LOADFACTOR_MAX")
 
         self.sq_buildable = set()
         self.sq_running = set()
@@ -1875,6 +1886,11 @@ class RunQueueExecute:
                 bb.fatal("Invalid BB_PRESSURE_MAX_MEMORY %s, minimum value is %s." % (self.max_memory_pressure, lower_limit))
             if self.max_memory_pressure > upper_limit:
                 bb.warn("Your build will be largely unregulated since BB_PRESSURE_MAX_MEMORY is set to %s. It is very unlikely that such high pressure will be experienced." % (self.max_io_pressure))
+
+        if self.max_loadfactor:
+            self.max_loadfactor = float(self.max_loadfactor)
+            if self.max_loadfactor <= 0:
+                bb.fatal("Invalid BB_LOADFACTOR_MAX %s, needs to be greater than zero." % (self.max_loadfactor))
 
         # List of setscene tasks which we've covered
         self.scenequeue_covered = set()
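For anyone wanting to poke at the logic outside of bitbake, the check the diff adds can be approximated in a few lines of standalone Python. This is a sketch, not the runqueue code: the function names `parse_max_loadfactor` and `exceeds_loadfactor` are invented here, a `ValueError` stands in for `bb.fatal`, the `or 1` fallback for `os.cpu_count()` is an extra guard the patch does not have, and the load average and CPU count can be passed in so the behaviour can be tried on any machine.

```python
import os


def parse_max_loadfactor(raw):
    """Validate a BB_LOADFACTOR_MAX-style string; None or '' means 'not set'."""
    if not raw:
        return None
    value = float(raw)
    if value <= 0:
        # The patch calls bb.fatal() here; ValueError is a stand-in.
        raise ValueError("Invalid BB_LOADFACTOR_MAX %s, needs to be greater than zero." % value)
    return value


def exceeds_loadfactor(max_loadfactor, loadavg_1min=None, cpu_count=None):
    """Return True when the per-CPU 1-minute load average is above the limit."""
    if max_loadfactor is None:
        return False
    if loadavg_1min is None:
        loadavg_1min = os.getloadavg()[0]  # Unix only
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1  # extra guard, not in the patch
    return loadavg_1min / cpu_count > max_loadfactor


# Example with injected values: load average 20 on an 8-CPU machine.
limit = parse_max_loadfactor("1.5")
print(exceeds_loadfactor(limit, loadavg_1min=20.0, cpu_count=8))  # 20/8 = 2.5 > 1.5
```

As in the patch, only the 1-minute load average is consulted, which fits the slow oscillation around the target seen in the graphs discussed in the reply.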
Some distros don't enable /proc/pressure and it tends to be those which we
see bitbake timeout issues on, seemingly as load gets too high and the bitbake
processes don't get scheduled in for minutes at a time.

Add support for stopping running extra tasks if the system load average goes
above a certain threshold by setting BB_LOADFACTOR_MAX.

The value used is scaled by CPU number, so a value of 1 would be when
the load average equals the number of cpu cores of the system; a value under one
only starts tasks when the load average is below the number of cores.

This means you can centrally set a value such as 1.5 which will then
scale correctly to different sized machines with differing numbers
of CPUs.

The pressure regulation is probably more accurate and responsive; however,
our graphs do show significant load spikes on some workers and this
patch is aimed at trying to avoid those.

Pressure regulation is used where available in preference to this load
factor regulation when both are set.

Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
---
 lib/bb/runqueue.py | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)
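
To make the scaling described in the commit message concrete: a single BB_LOADFACTOR_MAX value corresponds to a different absolute load average on each machine. The sketch below just does that multiplication; the function name `load_average_limit` and the core counts are hypothetical examples, not values from the autobuilder.

```python
def load_average_limit(max_loadfactor: float, cpu_count: int) -> float:
    """Load average above which new tasks would stop being started."""
    return max_loadfactor * cpu_count


# One centrally set value of 1.5, applied to differently sized (made-up) workers.
for cpus in (4, 16, 64):
    print(cpus, load_average_limit(1.5, cpus))
# prints:
# 4 6.0
# 16 24.0
# 64 96.0
```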