Patchwork [bitbake-devel,1/1] process.py, build.py: Fix log truncation problems with flush()

login
register
mail settings
Submitter Jason Wessel
Date June 14, 2012, 2:58 p.m.
Message ID <1339685898-24757-2-git-send-email-jason.wessel@windriver.com>
Download mbox | patch
Permalink /patch/29839/
State New
Headers show

Comments

Jason Wessel - June 14, 2012, 2:58 p.m.
There are two problems with the _logged_communicate that are both
caused as a result of buffering I/O, instead of flushing it out to the
log files as it arrives.

1) log truncation when python dumps
   I have seen the task logs missing data that was not flushed when
   bitbake crashes.

2) While a bitbake task is running it is impossible to see what is
   going on if it is only writing a small incremental log that is
   smaller than the buffer, or you get only a partial log, up until
   the task exists.  It is worse in the case that stderr and stdout
   are separate file handles, because previous code blocks on the read
   of stdout and then stderr, serially.

The right approach is simply to use select() to determine if there is
data available and then to flush the log buffers as they arrive.  This
still makes use of buffering in the cases where there is more than 1
byte of data, but the buffers are much more dynamic.

Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
---
 lib/bb/build.py   |    3 ++-
 lib/bb/process.py |   29 +++++++++++++++++++++++++++--
 2 files changed, 29 insertions(+), 3 deletions(-)
Richard Purdie - June 21, 2012, 3:55 p.m.
On Thu, 2012-06-14 at 09:58 -0500, Jason Wessel wrote:
> There are two problems with the _logged_communicate that are both
> caused as a result of buffering I/O, instead of flushing it out to the
> log files as it arrives.
> 
> 1) log truncation when python dumps
>    I have seen the task logs missing data that was not flushed when
>    bitbake crashes.
> 
> 2) While a bitbake task is running it is impossible to see what is
>    going on if it is only writing a small incremental log that is
>    smaller than the buffer, or you get only a partial log, up until
>    the task exists.  It is worse in the case that stderr and stdout
>    are separate file handles, because previous code blocks on the read
>    of stdout and then stderr, serially.
> 
> The right approach is simply to use select() to determine if there is
> data available and then to flush the log buffers as they arrive.  This
> still makes use of buffering in the cases where there is more than 1
> byte of data, but the buffers are much more dynamic.
> 
> Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
> ---
>  lib/bb/build.py   |    3 ++-
>  lib/bb/process.py |   29 +++++++++++++++++++++++++++--
>  2 files changed, 29 insertions(+), 3 deletions(-)

For what its worth I'm seeing a small but consistent increase in real,
sys and user times with this patch which is why I'm pausing to look at
it a little further :(

Cheers,

Richard



> diff --git a/lib/bb/build.py b/lib/bb/build.py
> index a9ba02d..7e9bb23 100644
> --- a/lib/bb/build.py
> +++ b/lib/bb/build.py
> @@ -135,7 +135,8 @@ class LogTee(object):
>  
>      def __repr__(self):
>          return '<LogTee {0}>'.format(self.name)
> -
> +    def flush(self):
> +        self.outfile.flush()
>  
>  def exec_func(func, d, dirs = None):
>      """Execute an BB 'function'"""
> diff --git a/lib/bb/process.py b/lib/bb/process.py
> index b74cb18..9848fc3 100644
> --- a/lib/bb/process.py
> +++ b/lib/bb/process.py
> @@ -1,6 +1,9 @@
>  import logging
>  import signal
>  import subprocess
> +import fcntl
> +import errno
> +import select
>  
>  logger = logging.getLogger('BitBake.Process')
>  
> @@ -70,18 +73,40 @@ def _logged_communicate(pipe, log, input):
>  
>      bufsize = 512
>      outdata, errdata = [], []
> +    rin = []
> +
> +    if pipe.stdout is not None:
> +        fd = pipe.stdout.fileno()
> +        fl = fcntl.fcntl(fd, fcntl.F_GETFL)
> +        fcntl.fcntl(fd, fcntl.F_SETFL, fl | os.O_NONBLOCK)
> +        rin.append(pipe.stdout)
> +    if pipe.stderr is not None:
> +        fd = pipe.stderr.fileno()
> +        fl = fcntl.fcntl(fd, fcntl.F_GETFL)
> +        fcntl.fcntl(fd, fcntl.F_SETFL, fl | os.O_NONBLOCK)
> +        rin.append(pipe.stderr)
> +
>      while pipe.poll() is None:
> -        if pipe.stdout is not None:
> +        rlist = rin
> +        try:
> +            r,w,e = select.select (rlist, [], [])
> +        except OSError, e:
> +            if e.errno != errno.EINTR:
> +                raise
> +
> +        if pipe.stdout in r:
>              data = pipe.stdout.read(bufsize)
>              if data is not None:
>                  outdata.append(data)
>                  log.write(data)
> +                log.flush()
>  
> -        if pipe.stderr is not None:
> +        if pipe.stderr in r:
>              data = pipe.stderr.read(bufsize)
>              if data is not None:
>                  errdata.append(data)
>                  log.write(data)
> +                log.flush()
>      return ''.join(outdata), ''.join(errdata)
>  
>  def run(cmd, input=None, log=None, **options):
Jason Wessel - June 21, 2012, 4:04 p.m.
On 06/21/2012 10:55 AM, Richard Purdie wrote:
> On Thu, 2012-06-14 at 09:58 -0500, Jason Wessel wrote:
>> There are two problems with the _logged_communicate that are both
>> caused as a result of buffering I/O, instead of flushing it out to the
>> log files as it arrives.
>>
>> 1) log truncation when python dumps
>>    I have seen the task logs missing data that was not flushed when
>>    bitbake crashes.
>>
>> 2) While a bitbake task is running it is impossible to see what is
>>    going on if it is only writing a small incremental log that is
>>    smaller than the buffer, or you get only a partial log, up until
>>    the task exists.  It is worse in the case that stderr and stdout
>>    are separate file handles, because previous code blocks on the read
>>    of stdout and then stderr, serially.
>>
>> The right approach is simply to use select() to determine if there is
>> data available and then to flush the log buffers as they arrive.  This
>> still makes use of buffering in the cases where there is more than 1
>> byte of data, but the buffers are much more dynamic.
>>
>> Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
>> ---
>>  lib/bb/build.py   |    3 ++-
>>  lib/bb/process.py |   29 +++++++++++++++++++++++++++--
>>  2 files changed, 29 insertions(+), 3 deletions(-)
> For what its worth I'm seeing a small but consistent increase in real,
> sys and user times with this patch which is why I'm pausing to look at
> it a little further :(


In my experience, it varied quite a bit.  If need be, we simply allow allow a config variable like BB_FORCE_LOG_FLUSH = "1", and get rid of the flushes by default.

At the distro level, I'll turn this on, but still allow users to override it, because timely logging of builds is critical, and loss of data due to a python crash is completely unacceptable, it just makes build failures even harder to diagnose.

Jason.
Richard Purdie - June 21, 2012, 4:12 p.m.
On Thu, 2012-06-21 at 11:04 -0500, Jason Wessel wrote:
> On 06/21/2012 10:55 AM, Richard Purdie wrote:
> > On Thu, 2012-06-14 at 09:58 -0500, Jason Wessel wrote:
> >> There are two problems with the _logged_communicate that are both
> >> caused as a result of buffering I/O, instead of flushing it out to the
> >> log files as it arrives.
> >>
> >> 1) log truncation when python dumps
> >>    I have seen the task logs missing data that was not flushed when
> >>    bitbake crashes.
> >>
> >> 2) While a bitbake task is running it is impossible to see what is
> >>    going on if it is only writing a small incremental log that is
> >>    smaller than the buffer, or you get only a partial log, up until
> >>    the task exists.  It is worse in the case that stderr and stdout
> >>    are separate file handles, because previous code blocks on the read
> >>    of stdout and then stderr, serially.
> >>
> >> The right approach is simply to use select() to determine if there is
> >> data available and then to flush the log buffers as they arrive.  This
> >> still makes use of buffering in the cases where there is more than 1
> >> byte of data, but the buffers are much more dynamic.
> >>
> >> Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
> >> ---
> >>  lib/bb/build.py   |    3 ++-
> >>  lib/bb/process.py |   29 +++++++++++++++++++++++++++--
> >>  2 files changed, 29 insertions(+), 3 deletions(-)
> > For what its worth I'm seeing a small but consistent increase in real,
> > sys and user times with this patch which is why I'm pausing to look at
> > it a little further :(
> 
> 
> In my experience, it varied quite a bit.  If need be, we simply allow
> allow a config variable like BB_FORCE_LOG_FLUSH = "1", and get rid of
> the flushes by default.
> 
> At the distro level, I'll turn this on, but still allow users to
> override it, because timely logging of builds is critical, and loss of
> data due to a python crash is completely unacceptable, it just makes
> build failures even harder to diagnose.

Python has some better tools to handle things like this, such as
something like:

try:
    <normal case>
finally:
    x.flush()

so it would be interesting to see if that helps.

Cheers,

Richard
Jason Wessel - June 21, 2012, 4:26 p.m.
On 06/21/2012 11:12 AM, Richard Purdie wrote:
> On Thu, 2012-06-21 at 11:04 -0500, Jason Wessel wrote:
>> On 06/21/2012 10:55 AM, Richard Purdie wrote:
>>> On Thu, 2012-06-14 at 09:58 -0500, Jason Wessel wrote:
>>>> There are two problems with the _logged_communicate that are both
>>>> caused as a result of buffering I/O, instead of flushing it out to the
>>>> log files as it arrives.
>>>>
>>>> 1) log truncation when python dumps
>>>>    I have seen the task logs missing data that was not flushed when
>>>>    bitbake crashes.
>>>>
>>>> 2) While a bitbake task is running it is impossible to see what is
>>>>    going on if it is only writing a small incremental log that is
>>>>    smaller than the buffer, or you get only a partial log, up until
>>>>    the task exists.  It is worse in the case that stderr and stdout
>>>>    are separate file handles, because previous code blocks on the read
>>>>    of stdout and then stderr, serially.
>>>>
>>>> The right approach is simply to use select() to determine if there is
>>>> data available and then to flush the log buffers as they arrive.  This
>>>> still makes use of buffering in the cases where there is more than 1
>>>> byte of data, but the buffers are much more dynamic.
>>>>
>>>> Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
>>>> ---
>>>>  lib/bb/build.py   |    3 ++-
>>>>  lib/bb/process.py |   29 +++++++++++++++++++++++++++--
>>>>  2 files changed, 29 insertions(+), 3 deletions(-)
>>> For what its worth I'm seeing a small but consistent increase in real,
>>> sys and user times with this patch which is why I'm pausing to look at
>>> it a little further :(
>>
>> In my experience, it varied quite a bit.  If need be, we simply allow
>> allow a config variable like BB_FORCE_LOG_FLUSH = "1", and get rid of
>> the flushes by default.
>>
>> At the distro level, I'll turn this on, but still allow users to
>> override it, because timely logging of builds is critical, and loss of
>> data due to a python crash is completely unacceptable, it just makes
>> build failures even harder to diagnose.
> Python has some better tools to handle things like this, such as
> something like:
>
> try:
>     <normal case>
> finally:
>     x.flush()
>
>

There is a secondary case you have to consider.   There is the case where I want the data "now" like in the real time logging, or tailing of log files.

As for your example above, the core log framework needs some kind of major change in this case.  When the task engine crashes for example.  The data in the buffers is gone with the crash.  Perhaps we are talking about two different things however.

Jason.

Patch

diff --git a/lib/bb/build.py b/lib/bb/build.py
index a9ba02d..7e9bb23 100644
--- a/lib/bb/build.py
+++ b/lib/bb/build.py
@@ -135,7 +135,8 @@  class LogTee(object):
 
     def __repr__(self):
         return '<LogTee {0}>'.format(self.name)
-
+    def flush(self):
+        self.outfile.flush()
 
 def exec_func(func, d, dirs = None):
     """Execute an BB 'function'"""
diff --git a/lib/bb/process.py b/lib/bb/process.py
index b74cb18..9848fc3 100644
--- a/lib/bb/process.py
+++ b/lib/bb/process.py
@@ -1,6 +1,9 @@ 
 import logging
 import signal
 import subprocess
+import fcntl
+import errno
+import select
 
 logger = logging.getLogger('BitBake.Process')
 
@@ -70,18 +73,40 @@  def _logged_communicate(pipe, log, input):
 
     bufsize = 512
     outdata, errdata = [], []
+    rin = []
+
+    if pipe.stdout is not None:
+        fd = pipe.stdout.fileno()
+        fl = fcntl.fcntl(fd, fcntl.F_GETFL)
+        fcntl.fcntl(fd, fcntl.F_SETFL, fl | os.O_NONBLOCK)
+        rin.append(pipe.stdout)
+    if pipe.stderr is not None:
+        fd = pipe.stderr.fileno()
+        fl = fcntl.fcntl(fd, fcntl.F_GETFL)
+        fcntl.fcntl(fd, fcntl.F_SETFL, fl | os.O_NONBLOCK)
+        rin.append(pipe.stderr)
+
     while pipe.poll() is None:
-        if pipe.stdout is not None:
+        rlist = rin
+        try:
+            r,w,e = select.select (rlist, [], [])
+        except OSError, e:
+            if e.errno != errno.EINTR:
+                raise
+
+        if pipe.stdout in r:
             data = pipe.stdout.read(bufsize)
             if data is not None:
                 outdata.append(data)
                 log.write(data)
+                log.flush()
 
-        if pipe.stderr is not None:
+        if pipe.stderr in r:
             data = pipe.stderr.read(bufsize)
             if data is not None:
                 errdata.append(data)
                 log.write(data)
+                log.flush()
     return ''.join(outdata), ''.join(errdata)
 
 def run(cmd, input=None, log=None, **options):