[bitbake-devel] bitbake: add --noreply-timeout option

Message ID 20230510031631.1813338-1-Qi.Chen@windriver.com
State New
Series [bitbake-devel] bitbake: add --noreply-timeout option

Commit Message

ChenQi May 10, 2023, 3:16 a.m. UTC
From: Chen Qi <Qi.Chen@windriver.com>

Currently, if the client gets no reply from the server when running a
command, it exits after a fixed timeout. That timeout is 60s. Looking
at the history, it was raised from 8s to 30s, and then again to 60s.

On a 128-core, 512G server that is already running one world build,
starting a second build sometimes fails at updateConfig.

Instead of increasing this value again and again, let's add an option
for easier customization of this value.

Signed-off-by: Chen Qi <Qi.Chen@windriver.com>
---
 bitbake/lib/bb/main.py           |  7 ++++++-
 bitbake/lib/bb/server/process.py | 19 ++++++++++---------
 2 files changed, 16 insertions(+), 10 deletions(-)
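
As a rough illustration of the option described above, here is a minimal standalone sketch using Python's argparse. The parser name `bitbake-sketch` and the helper `build_parser` are made up for this example; only the `--noreply-timeout` argument itself mirrors the patch.

```python
import argparse


def build_parser():
    # Hypothetical stand-in for bitbake's real parser; only the option
    # proposed in this patch is modelled here.
    parser = argparse.ArgumentParser(prog="bitbake-sketch")
    server_group = parser.add_argument_group("server options")
    server_group.add_argument(
        "--noreply-timeout", type=int, dest="noreply_timeout", default=60,
        help="Seconds to wait for a server reply before the client exits, "
             "default: 60s.")
    return parser


# Without the option the value stays at the 60s default.
args_default = build_parser().parse_args([])
# An autobuilder could pass a larger value on the command line.
args_custom = build_parser().parse_args(["--noreply-timeout", "300"])
print(args_default.noreply_timeout, args_custom.noreply_timeout)
```

The parsed value would then be handed to the code that waits for server replies, as the patch does through connectProcessServer.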

Comments

Richard Purdie May 10, 2023, 7:52 a.m. UTC | #1
On Wed, 2023-05-10 at 11:16 +0800, Chen Qi via lists.openembedded.org
wrote:
> From: Chen Qi <Qi.Chen@windriver.com>
> 
> For now, if the client gets no reply from server when running a
> command, it exits after a period of time. The value is currently
> 60s. Looking at the history, this value was increased from 8s to
> 30s, and then it was increased again to 60s.
> 
> For now, what I can see is that when running one world build on a
> 128 core, 512G server, starting a second build has a chance to fail
> at updateConfig.
> 
> Instead of increasing this value again and again, let's add an option
> for easier customization of this value.
> 
> Signed-off-by: Chen Qi <Qi.Chen@windriver.com>
> ---
>  bitbake/lib/bb/main.py           |  7 ++++++-
>  bitbake/lib/bb/server/process.py | 19 ++++++++++---------
>  2 files changed, 16 insertions(+), 10 deletions(-)

You're saying that the server can take over a minute to start up and
this is considered acceptable?

What configuration is the other world build running with? I'd suggest
the system is so overloaded it isn't useful any more and you should
have pressure control enabled or other mechanisms to allow the system
to function.

I'm not keen to make this any more configurable as 60s should be enough
and if it isn't, there are different issues at play.

Having bitbake silently hang for minutes at a time isn't good for users
and isn't good for CI either.

Cheers,

Richard
ChenQi May 10, 2023, 9:43 a.m. UTC | #2
Hi Richard,

It's not the bitbake server that takes more than 1 minute to start; it's the 'updateConfig' command that takes more than 1 minute. From what I see, other commands finish quite fast, except the one that actually builds.
The server was still usable, even after the second world build managed to start.

I'm not using any special configuration. All default values. I do have extra layers added, such as meta-openembedded/*, meta-virtualization, meta-browser, meta-security, meta-selinux, etc. Anyway, a world build has about 40000+ tasks.

In fact, what I really want to do is set the timeout to some large value on our autobuilders so that we can avoid this start-up timeout failure. If such a patch is not suitable for the official bitbake, I'll keep it as a local patch until our autobuilders are configured more properly. Do you have any suggestions?

That's all the information and the background.

Regards,
Qi

Richard Purdie May 10, 2023, 10:35 a.m. UTC | #3
On Wed, 2023-05-10 at 09:43 +0000, Chen, Qi wrote:
> Hi Richard,
> 
> It's not the bitbake server that takes more than 1 minute to start,
> it's that the 'updateConfig' command that takes more than 1 minute.
> From what I see, other commands finish quite fast except the actual
> building one.

The updateConfig command doesn't actually build anything. All it does
is trigger a parse of the base configuration; it isn't even parsing
recipes, just the bitbake.conf and layer.conf files.

Think of updateConfig as just bringing the two ends of the connection,
client and server, into sync.

I'm a bit worried that if the system is so overloaded it can't do that
in 60s, we have bigger problems.

> The server was still usable, even after the second world build
> managed to start.
> 
> I'm not using any special configuration. All default values. I do
> have extra layers added, such as meta-openembedded/*,
> meta-virtualization, meta-browser, meta-security, meta-selinux, etc.
> Anyway, the task number of a world build is about 40000+.
> 
> In fact, what I really want to do is to set the timeout to be some
> large value on our autobuilders so that we can avoid this start-up
> timeout failure. If such patch is not suitable for the official
> bitbake, I'll make it a local patch until our autobuilders are
> configured more properly. Do you have any suggestion?
> 
> That's all the information and the background.

What is BB_NUMBER_THREADS set to and how many CPU cores? Spinning media
or SSDs? What are the system load numbers when this happens?

I think that if this startup timeout is happening, there will be other
issues and you need to resolve the overloaded system problem rather
than just move the timeout problem to somewhere else.

Cheers,

Richard
ChenQi May 10, 2023, 11:40 a.m. UTC | #4
Hi Richard,

Thanks for your reply. I totally agree that it is the server configuration (e.g., BB_NUMBER_THREADS for each build, the number of builds started in parallel, and maybe PSI-related settings) that should be adjusted to avoid this timeout issue.

To add more information: the server where I saw this start-up timeout failure has 128 cores, 512G of memory, and 4 RAID0 disks. The load value shown by uptime when the timeout happened was around 500. BB_NUMBER_THREADS remains at its default value.

I also found that this startup timeout is more likely to happen on machines with more cores: on another server, which has 40 cores, I have to start 3~5 builds to trigger the timeout error.

Regards,
Qi
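
To put the reported numbers in perspective, here is a small sketch that divides a load figure by the core count. It assumes the "uptime value" of 500 is the system load average, which the message does not state outright; the helper `load_per_core` is made up for this example.

```python
import os


def load_per_core(load_average, cores):
    """Return the load average divided by the number of CPU cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_average / cores


# Figures reported in this thread, taking the uptime value as load average.
reported = load_per_core(500, 128)
print("reported: %.2f load per core" % reported)

# Same ratio for the machine running this script (Unix-only call).
try:
    one_minute = os.getloadavg()[0]
    print("now: %.2f load per core" % load_per_core(one_minute, os.cpu_count() or 1))
except (AttributeError, OSError):
    print("load average not available on this platform")
```

A ratio well above 1 means more work is queued than the cores can run at once, which fits the overloaded-system concern raised earlier in the thread.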


ChenQi May 10, 2023, 3:05 p.m. UTC | #5
Hi Richard,

Thanks for your info. After checking the code of updateConfig and looking at my project's special configuration, I finally found why my updateConfig takes so long.
We have an event handler triggered by ConfigParsed; in that handler, we do file searches, mtime checks, etc., and then set BB_INVALIDCONF to True, which triggers a new parse.

Regards,
Qi
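
The effect described in this message can be illustrated with a toy model. This is not BitBake's code: the datastore is a plain dict, and `parse_base_configuration`, `make_handler`, `parse_files` and `expensive_check` are hypothetical names. It only shows how a handler that sets BB_INVALIDCONF makes the base configuration get parsed, and the handler's checks run, more than once.

```python
def parse_base_configuration(parse_files, config_parsed_handlers, max_passes=10):
    """Parse, fire the handlers, and parse again while a handler asks for it."""
    passes = 0
    while True:
        passes += 1
        if passes > max_passes:
            raise RuntimeError("configuration never settled")
        data = parse_files()
        for handler in config_parsed_handlers:
            handler(data)
        # A handler requests another parse by setting this flag to True.
        if data.get("BB_INVALIDCONF") is not True:
            return data, passes


def make_handler(expensive_check):
    state = {"refreshed": False}

    def handler(data):
        expensive_check()  # stands in for the file search and mtime checks
        if not state["refreshed"]:
            state["refreshed"] = True
            data["BB_INVALIDCONF"] = True

    return handler


calls = {"parse": 0, "check": 0}


def parse_files():
    calls["parse"] += 1
    return {}  # a fresh datastore on every pass


def expensive_check():
    calls["check"] += 1


data, passes = parse_base_configuration(parse_files, [make_handler(expensive_check)])
print(passes, calls)
```

In this model the slow checks and the parse both run twice for a single request, so anything expensive in the handler counts double against the client's reply timeout.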

Patch

diff --git a/bitbake/lib/bb/main.py b/bitbake/lib/bb/main.py
index 92d8dc0293..47c000b03e 100755
--- a/bitbake/lib/bb/main.py
+++ b/bitbake/lib/bb/main.py
@@ -267,6 +267,11 @@  def create_bitbake_parser():
                              "set to -1 means no unload, "
                              "default: Environment variable BB_SERVER_TIMEOUT.")
 
+    server_group.add_argument("--noreply-timeout", type=int, dest="noreply_timeout",
+                        default=60,
+                        help="Set timeout to exit bitbake client due to no reply from bitbake server, "
+                             "default: 60s.")
+
     server_group.add_argument("--remote-server",
                         default=os.environ.get("BBSERVER"),
                         help="Connect to the specified server.")
@@ -484,7 +489,7 @@  def setup_bitbake(configParams, extrafeatures=None):
                             bb.utils.unlockfile(lock)
                         raise bb.server.process.ProcessTimeout("Bitbake still shutting down as socket exists but no lock?")
                 if not configParams.server_only:
-                    server_connection = bb.server.process.connectProcessServer(sockname, featureset)
+                    server_connection = bb.server.process.connectProcessServer(sockname, featureset, configParams.noreply_timeout)
 
                 if server_connection or configParams.server_only:
                     break
diff --git a/bitbake/lib/bb/server/process.py b/bitbake/lib/bb/server/process.py
index db417c8428..8c7b6da64a 100644
--- a/bitbake/lib/bb/server/process.py
+++ b/bitbake/lib/bb/server/process.py
@@ -495,16 +495,17 @@  class ProcessServer():
 
 
 class ServerCommunicator():
-    def __init__(self, connection, recv):
+    def __init__(self, connection, recv, timeout):
         self.connection = connection
         self.recv = recv
+        self.timeout = timeout
 
     def runCommand(self, command):
         self.connection.send(command)
-        if not self.recv.poll(30):
-            logger.info("No reply from server in 30s")
-            if not self.recv.poll(30):
-                raise ProcessTimeout("Timeout while waiting for a reply from the bitbake server (60s)")
+        if not self.recv.poll(self.timeout / 2):
+            logger.info("No reply from server in %ss" % (self.timeout / 2))
+            if not self.recv.poll(self.timeout / 2):
+                raise ProcessTimeout("Timeout while waiting for a reply from the bitbake server (%ss)" % self.timeout)
         ret, exc = self.recv.get()
         # Should probably turn all exceptions in exc back into exceptions?
         # For now, at least handle BBHandledException
@@ -531,8 +532,8 @@  class ServerCommunicator():
         return
 
 class BitBakeProcessServerConnection(object):
-    def __init__(self, ui_channel, recv, eq, sock):
-        self.connection = ServerCommunicator(ui_channel, recv)
+    def __init__(self, ui_channel, recv, eq, sock, timeout):
+        self.connection = ServerCommunicator(ui_channel, recv, timeout)
         self.events = eq
         # Save sock so it doesn't get gc'd for the life of our connection
         self.socket_connection = sock
@@ -666,7 +667,7 @@  def execServer(lockfd, readypipeinfd, lockname, sockname, server_timeout, xmlrpc
         sys.stdout.flush()
         sys.stderr.flush()
 
-def connectProcessServer(sockname, featureset):
+def connectProcessServer(sockname, featureset, timeout):
     # Connect to socket
     sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
     # AF_UNIX has path length issues so chdir here to workaround
@@ -704,7 +705,7 @@  def connectProcessServer(sockname, featureset):
 
         sendfds(sock, [writefd, readfd1, writefd2])
 
-        server_connection = BitBakeProcessServerConnection(command_chan, command_chan_recv, eq, sock)
+        server_connection = BitBakeProcessServerConnection(command_chan, command_chan_recv, eq, sock, timeout)
 
         # Close the ends of the pipes we won't use
         for i in [writefd, readfd1, writefd2]:
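
For readers who want to try the two-stage wait outside of bitbake, here is a minimal standalone sketch built on multiprocessing.Pipe. The names `ReplyTimeout` and `wait_for_reply` are made up for this example, and a plain Connection object stands in for bitbake's receive channel. The budget is split into two equal halves so that the total wait matches the configured timeout.

```python
import multiprocessing
import time


class ReplyTimeout(Exception):
    """Raised when no reply arrives within the configured timeout."""


def wait_for_reply(recv, timeout, log=print):
    # Wait for half the budget, log a notice, then wait for the other half.
    half = timeout / 2
    if not recv.poll(half):
        log("No reply from server in %ss" % half)
        if not recv.poll(half):
            raise ReplyTimeout("Timeout while waiting for a reply (%ss)" % timeout)
    return recv.recv()


# One-way pipe: the first end only receives, the second only sends.
recv_end, send_end = multiprocessing.Pipe(duplex=False)

# A reply that is already queued is returned immediately.
send_end.send("pong")
reply = wait_for_reply(recv_end, timeout=0.2)
print(reply)

# With nothing queued, the call gives up after roughly the full timeout.
start = time.monotonic()
try:
    wait_for_reply(recv_end, timeout=0.2)
    timed_out = False
except ReplyTimeout as exc:
    timed_out = True
    print(exc)
elapsed = time.monotonic() - start
```

The short 0.2s timeout only keeps the demonstration quick; the same function works with the 60s default or any larger value.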