diff mbox series

bitbake: Add task timeout support

Message ID 20230525102105.1480610-1-michalwsieron@gmail.com
State New
Headers show
Series bitbake: Add task timeout support | expand

Commit Message

Michal Sieron May 25, 2023, 10:21 a.m. UTC
By setting `BB_TASK_TIMEOUT` you can control the timeout for tasks.
Using the flag `BB_TASK_TIMEOUT[do_mytask]` you can override the
timeout for a specific task.

This is especially useful when some server doesn't work properly and
the `do_fetch` task takes too long. We may want to kill it early
instead of waiting for a fetch that will never complete.
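
For illustration, usage could look like this (the values are arbitrary;
an empty flag clears the timeout for that task):

```
# kill any task still running after 30 minutes
BB_TASK_TIMEOUT = "1800"
# give do_fetch only 5 minutes
BB_TASK_TIMEOUT[do_fetch] = "300"
# no timeout for do_compile
BB_TASK_TIMEOUT[do_compile] = ""
```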

Signed-off-by: Michal Sieron <michalwsieron@gmail.com>
Signed-off-by: Tomasz Dziendzielski <tomasz.dziendzielski@gmail.com>
Signed-off-by: Mateusz Marciniec <mateuszmar2@gmail.com>
---
 .../bitbake-user-manual-ref-variables.rst     |  7 +++
 bitbake/lib/bb/build.py                       | 43 ++++++++++++++++++-
 2 files changed, 49 insertions(+), 1 deletion(-)

Comments

Mikko Rapeli May 25, 2023, 10:44 a.m. UTC | #1
Hi,

On Thu, May 25, 2023 at 12:21:05PM +0200, Michal Sieron wrote:
> By setting `BB_TASK_TIMEOUT` you can control timeout of for tasks.
> Using flags `BB_TASK_TIMEOUT[do_mytask]` you can override timeout
> settings for specific tasks.
> 
> This is especially useful when some server doesn't work properly and
> `do_fetch` task takes too long. We may want to kill it early instead
> of waiting for a fetch that won't happen.

Is the problem related to specific tools and network protocols? Which ones?

An alternative would be to configure tool-specific settings (git, svn,
curl, etc.), or even the default socket timeouts on the running system.

I've seen this kind of problem too, but I'm not sure this task timeout is the
correct way to solve it. Time is too relative :)

Cheers,

-Mikko
Alexander Kanavin May 25, 2023, 10:59 a.m. UTC | #2
I tend to agree. If the problem is that some server is too slow or
unresponsive, this does not solve it. If you want to forcibly
terminate stuff (which is still not a good idea; you should look into
why something 'doesn't work properly', not work around it), you can
terminate the whole bitbake process instead.

Alex

On Thu, 25 May 2023 at 12:45, Mikko Rapeli <mikko.rapeli@linaro.org> wrote:
>
> Hi,
>
> On Thu, May 25, 2023 at 12:21:05PM +0200, Michal Sieron wrote:
> > By setting `BB_TASK_TIMEOUT` you can control timeout of for tasks.
> > Using flags `BB_TASK_TIMEOUT[do_mytask]` you can override timeout
> > settings for specific tasks.
> >
> > This is especially useful when some server doesn't work properly and
> > `do_fetch` task takes too long. We may want to kill it early instead
> > of waiting for a fetch that won't happen.
>
> Is the problem related some tools and network protocols? Which?
>
> An alternative would be to configure the tool (git, svn, curl etc) specific
> settings, or even the default socket timeouts on the running system.
>
> I've seen this kind of problems too, but I'm not sure this task timeout is the
> correct way to solve it. Time is too relative :)
>
> Cheers,
>
> -Mikko
>
Tomasz Dziendzielski May 25, 2023, 11:44 a.m. UTC | #3
Hello,
let me give some more context on the patch. We use self-hosted gerrit and
artifactory (with git-lfs content and source code tarballs) instances
which tend to be slow sometimes due to overload and other reasons. We
have a 3h timeout for the whole build. Now let's assume do_fetch tasks
already take 60 minutes instead of 10. With the timeout solution proposed
by Michał we can detect that quickly, stop the build job, and unblock the
jenkins nodes for other builds, or just retry the build in case the fetch
on a second run takes less time. Without that, with a timeout only on the
whole bitbake process, we're still holding the jenkins node for another
2-3 hours even though we know this build has no future.

Best regards,
Tomasz Dziendzielski
Alexander Kanavin May 25, 2023, noon UTC | #4
Yocto has a download cache mechanism, so you should not have to fetch
more than once. Are you throwing away that cache between builds?

Again, the actual problem is slow gerrit/artifactory instances, not
stuck fetch tasks. The idea of 'retrying' the fetch so that it maybe
works better on second try is especially off-putting.

Alex

On Thu, 25 May 2023 at 13:41, Tomasz Dziendzielski
<tomasz.dziendzielski@gmail.com> wrote:
>
> Hello,
> let me give some more context on the patch. We use self hosted gerrit and artifactory (with git-lfs content and tarballs with source code) instances which tend to be slow sometimes due to some overloads and other reasons. We have a 3h timeout for the whole build. Now let's assume do_fetch tasks take already 60 hours instead of 10 minutes. With the timeout solution proposed by Michał we can detect it quite fast and stop the build job and unblock the jenkins nodes for other builds or just retry the build in case the fetch on the second run would take less time. Now without that and with timeout on the whole bitbake process we're still holding the jenkins node for another 2-3 hours even though we know this build has no future.
>
> Best regards,
> Tomasz Dziendzielski
Tomasz Dziendzielski May 25, 2023, 12:13 p.m. UTC | #5
>Yocto has a download cache mechanism, so you should not have to fetch
>more than once. Are you throwing away that cache between builds?

Do you mean DL_DIR, or something else, as the cache mechanism? We do keep
DL_DIR between builds, but we're talking about cases where there was a
change in the repository.

>the actual problem is slow gerrit/artifactory instances, not
>stuck fetch tasks
Totally agree, but we're not able to fix the slow gerrit/artifactory in
the short term.

>The idea of 'retrying' the fetch
Well, it's not like I'm presenting this as a solution; usually we just
disable jobs until the slowness is gone. But isn't that what we already do
with wget in bitbake?
"wget -t 2 -T 30" from bitbake/lib/bb/fetch2/wget.py, where -t is the
number of tries and -T is the timeout.
The problem is that we see the slowness for git repositories, and git
does not support such a thing.

Best regards,
Tomasz Dziendzielski
Alexander Kanavin May 25, 2023, 12:21 p.m. UTC | #6
On Thu, 25 May 2023 at 14:09, Tomasz Dziendzielski
<tomasz.dziendzielski@gmail.com> wrote:
> Well, it's  not like I'm saying this as a solution, usually we just disable jobs until the slowness is gone. But isn't that what we already do with wget in bitbake?
> "wget -t 2 -T 30" from bitbake/lib/bb/fetch2/wget.py, where -t is tries, and -T is timeout
> Problem is that we get the slowness for git repositories and git does not support such a thing.

But then you can simply wrap git with a timeout command.
https://man7.org/linux/man-pages/man1/timeout.1.html

No?

Alex
Tomasz Dziendzielski May 25, 2023, 12:35 p.m. UTC | #7
>But then you can simply wrap git with a timeout command.
>https://man7.org/linux/man-pages/man1/timeout.1.html
Well, this is a good point. I don't know how we didn't come up with that
idea :D We can simply use the FETCHCMD_git variable with timeout.
Please disregard this patch then.
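Something along these lines should work (untested; 300 s is an arbitrary
value, and the -c option mirrors bitbake's default git command line):

```
FETCHCMD_git = "timeout 300 git -c core.fsyncobjectfiles=0"
```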

Best regards,
Tomasz Dziendzielski
Mikko Rapeli May 25, 2023, 12:57 p.m. UTC | #8
Hi,

On Thu, May 25, 2023 at 02:13:10PM +0200, Tomasz Dziendzielski wrote:
> >Yocto has a download cache mechanism, so you should not have to fetch
> >more than once. Are you throwing away that cache between builds?
> 
> Do you mean the DL_DIR or something else as cache mechanism? We do keep
> DL_DIR between builds, but we're talking about cases where there was a
> change in the repository.
> 
> >the actual problem is slow gerrit/artifactory instances, not
> >stuck fetch tasks
> Totally agree, but we're not able to fix the slow gerrit/artifactory in a
> short term.
> 
> >The idea of 'retrying' the fetch
> Well, it's  not like I'm saying this as a solution, usually we just
> disable jobs until the slowness is gone. But isn't that what we already do
> with wget in bitbake?
> "wget -t 2 -T 30" from bitbake/lib/bb/fetch2/wget.py, where -t is tries,
> and -T is timeout
> Problem is that we get the slowness for git repositories and git does not
> support such a thing.

git uses curl; can't we shorten the timeouts or even add retries there?
Maybe even specific ones for the lfs support?

I've also seen a lot of problematic artifactory and gerrit instances, in addition
to maven, npm etc...
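
For reference, git's HTTP transport does expose curl's low-speed abort
settings; they abort stalled transfers, though they don't bound the total
duration. For example, in a gitconfig (thresholds are arbitrary):

```
# abort HTTP(S) transfers slower than 1000 bytes/s for 30 seconds
[http]
	lowSpeedLimit = 1000
	lowSpeedTime = 30
```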

Cheers,

-Mikko
Michal Sieron June 5, 2023, 11:46 a.m. UTC | #9
Hi,

On Thu, 25 May 2023 15:57:12 +0300
Mikko Rapeli <mikko.rapeli@linaro.org> wrote:

> git uses curl, can't we shorten the timeouts or even add retries
> there? Maybe even specific ones for the lfs support?
> 
> I've also seen a lot of problematic artifactory and gerrit instances,
> in addition to maven, npm etc...

I checked this and couldn't find any environment variables that would
control the timeout for the whole operation.

The only solution I found (which doesn't require patching/recompiling)
was using the previously mentioned timeout command, like so:

```
FETCHCMD_git = "git \
    -c core.fsyncobjectfiles=0 \
    -c filter.lfs.clean='/usr/bin/timeout 300 git-lfs clean -- %f' \
    -c filter.lfs.smudge='/usr/bin/timeout 300 git-lfs smudge -- %f' \
    -c filter.lfs.process='/usr/bin/timeout 300 git-lfs filter-process' \
"
```

You can of course put those settings in your gitconfig file instead.

However, this method is not so nice: when the fetch task fails, the
error message printed by git is "fatal: the remote end hung up
unexpectedly", which doesn't really explain what went wrong.

Michal

Patch

diff --git a/bitbake/doc/bitbake-user-manual/bitbake-user-manual-ref-variables.rst b/bitbake/doc/bitbake-user-manual/bitbake-user-manual-ref-variables.rst
index 01d4f8d14a..eaaa307960 100644
--- a/bitbake/doc/bitbake-user-manual/bitbake-user-manual-ref-variables.rst
+++ b/bitbake/doc/bitbake-user-manual/bitbake-user-manual-ref-variables.rst
@@ -686,6 +686,13 @@  overview of their function and contents.
       in images is given a higher priority as compared to build tasks to
       ensure that images do not suffer timeouts on loaded systems.
 
+   :term:`BB_TASK_TIMEOUT`
+      Specifies the time after which still-running tasks are terminated.
+
+      The time is given as an integer number of seconds.
+      You can set a timeout for a specific task using ``BB_TASK_TIMEOUT[do_mytask]``.
+      To remove the timeout, use an empty value: ``BB_TASK_TIMEOUT = ""``.
+
    :term:`BB_TASKHASH`
       Within an executing task, this variable holds the hash of the task as
       returned by the currently enabled signature generator.
diff --git a/bitbake/lib/bb/build.py b/bitbake/lib/bb/build.py
index 44d08f5c55..2057dd9415 100644
--- a/bitbake/lib/bb/build.py
+++ b/bitbake/lib/bb/build.py
@@ -17,6 +17,7 @@  import sys
 import logging
 import glob
 import itertools
+import multiprocessing
 import time
 import re
 import stat
@@ -594,7 +595,7 @@  def _task_data(fn, task, d):
     bb.data.expandKeys(localdata)
     return localdata
 
-def _exec_task(fn, task, d, quieterr):
+def __exec_task(fn, task, d, quieterr):
     """Execute a BB 'task'
 
     Execution of a task involves a bit more setup than executing a function,
@@ -761,6 +762,46 @@  def _exec_task(fn, task, d, quieterr):
 
     return 0
 
+def get_task_timeout(d, task):
+    # task specific timeout
+    timeout = d.getVarFlag("BB_TASK_TIMEOUT", task)
+
+    if timeout is None:
+        # any task timeout
+        timeout = d.getVar("BB_TASK_TIMEOUT")
+
+    if timeout is None or timeout == "":
+        return None
+    else:
+        try:
+            return int(timeout)
+        except ValueError:
+            raise ValueError("invalid `BB_TASK_TIMEOUT` value '%s'" % timeout)
+
+def _exec_task(fn, task, d, quieterr):
+    """Intermediate function between exec_task and __exec_task, to support task timeouts."""
+
+    def targetFunc(func):
+        def inner(*args, **kwargs):
+            sys.exit(func(*args, **kwargs))
+
+        return inner
+
+    try:
+        timeout = get_task_timeout(d, task)
+    except ValueError as e:
+        logger.error("%s: %s" % (d.getVar("PF"), e.args[0]))
+        return 1
+
+    task_proc = multiprocessing.Process(target=targetFunc(__exec_task), args=(fn, task, d, quieterr))
+    task_proc.start()
+    task_proc.join(timeout)
+    if task_proc.exitcode is None:
+        logger.error("timeout exceeded (%s s)" % timeout)
+        return 1
+    else:
+        return task_proc.exitcode
+
 def exec_task(fn, task, d, profile = False):
     try:
         quieterr = False