
[v2,2/8] oeqa ssh.py: fix hangs in run()

Message ID 20230209080936.148489-3-mikko.rapeli@linaro.org
State Accepted, archived
Commit 9c63970fce3a3d6029745252a6ec2bf9b9da862d
Series fix oeqa runtime test framework when qemu hangs

Commit Message

Mikko Rapeli Feb. 9, 2023, 8:09 a.m. UTC
When a qemu machine hangs, the ssh commands run by tests
do not time out. The do_testimage() task's last logs look like this:

DEBUG: time: 1673531086.3155053, endtime: 1673531686.315502

The test process is stuck for hours, or forever if the
executed command or test case did not set a timeout correctly.
The default 300 second timeout does not work when the target hangs.
Note that the timeout is really an "inactivity timeout", since data
returned by the process resets it.

Make the process stdout non-blocking with os.set_blocking(), available
in Python 3.5 and later, so read() always returns right away.

Then switch from the Python codec reader to a plain read() on the
non-blocking ssh subprocess stdout. Even with select() confirming that
the file had input to read, the codec reader kept trying to read more
and blocked forever when the process hangs.

While at it, add a small sleep before reading so that data can be read
in larger chunks where possible. This avoids reading one or a few
characters at a time and makes the debug logs more readable.

close() the stdout file in all cases after the read loop completes.

Then make sure to wait for or kill the ssh subprocess in all cases.
Reading the output stream to EOF does not mean that the process has
exited, and wait() needs a timeout in case the process is hanging. As a
last resort kill the process, and return the return value and the
captured output decoded as utf-8, just like before these changes.

This fixes ssh run() related deadlocks when a qemu target hangs
completely.

Signed-off-by: Mikko Rapeli <mikko.rapeli@linaro.org>
---
 meta/lib/oeqa/core/target/ssh.py | 33 +++++++++++++++++++++++++-------
 1 file changed, 26 insertions(+), 7 deletions(-)
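
For orientation, here is a condensed, standalone sketch of the pattern the
patch implements (the helper name and defaults are made up for illustration;
the real change is the diff at the bottom of this page):

    import os
    import select
    import subprocess
    import time

    def run_with_inactivity_timeout(cmd, timeout=300):
        # Hypothetical helper, not the ssh.py API: run cmd, giving up only
        # after `timeout` seconds with no output at all.
        process = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                   stderr=subprocess.STDOUT)
        os.set_blocking(process.stdout.fileno(), False)  # read() never blocks
        output = b""
        endtime = time.time() + timeout
        while time.time() < endtime:
            if select.select([process.stdout], [], [], 5)[0]:
                time.sleep(0.2)               # let a larger chunk accumulate
                data = process.stdout.read()  # whatever is buffered, b'' at EOF
                if not data:                  # EOF: write end of pipe closed
                    break
                output += data
                endtime = time.time() + timeout  # activity resets the deadline
        process.stdout.close()
        try:
            process.wait(timeout=5)           # EOF does not mean the child exited
        except subprocess.TimeoutExpired:
            process.kill()
            process.wait()
        return process.returncode, output.decode('utf-8', errors='ignore')

The key points are that the deadline is pushed forward whenever output
arrives, and that the child is always reaped with wait()/kill() even after
EOF on its stdout.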

Comments

Richard Purdie Feb. 9, 2023, 9:15 a.m. UTC | #1
On Thu, 2023-02-09 at 10:09 +0200, Mikko Rapeli wrote:
> When qemu machine hangs, the ssh commands done by tests
> are not timing out. do_testimage() task has last logs like this:
> 
> DEBUG: time: 1673531086.3155053, endtime: 1673531686.315502
> 
> The test process is stuck for hours, or for ever if the
> executing command or test case did not set a timeout correctly.
> The default 300 second timeout is not working when target hangs.
> Note that timeout is really a "inactive timeout" since data returned
> by the process will reset the timeout.
> 
> Make the process stdout non-blocking so read() will always return
> right away using os.set_blocking() available in python 3.5 and later.
> 
> Then change from python codec reader to plain read() and make
> the ssh subprocess stdout non-blocking. Even with select()
> making sure the file had input to be read, the codec reader was
> trying to find more stuff and blocking for ever when process hangs.
> 
> While at it, add a small timeout to read data in larger chunks if
> possible. This avoids reading data one or few characters at a time
> and makes the debug logs more readable.
> 
> close() the stdout file in all cases after read loop is complete.
> 
> Then make sure to wait or kill the ssh subprocess in all cases.
> Just reading the output stream and receiving EOF there does not mean
> that the process exited, and wait() needs a timeout if the process
> is hanging. In the end kill the process and return the return value
> and captured output utf-8 encoded, just like before these changes.
> 
> This fixes ssh run() related deadlocks when a qemu target hangs
> completely.
> 
> Signed-off-by: Mikko Rapeli <mikko.rapeli@linaro.org>
> ---
>  meta/lib/oeqa/core/target/ssh.py | 33 +++++++++++++++++++++++++-------
>  1 file changed, 26 insertions(+), 7 deletions(-)
> 
> diff --git a/meta/lib/oeqa/core/target/ssh.py b/meta/lib/oeqa/core/target/ssh.py
> index 13fd5b2a49..466a795eb4 100644
> --- a/meta/lib/oeqa/core/target/ssh.py
> +++ b/meta/lib/oeqa/core/target/ssh.py
> @@ -229,22 +229,28 @@ def SSHCall(command, logger, timeout=None, **opts):
>          if timeout:
>              endtime = starttime + timeout
>              eof = False
> +            os.set_blocking(process.stdout.fileno(), False)
>              while time.time() < endtime and not eof:
> -                logger.debug('time: %s, endtime: %s' % (time.time(), endtime))
>                  try:
> +                    logger.debug('Waiting for process output: time: %s, endtime: %s' % (time.time(), endtime))
>                      if select.select([process.stdout], [], [], 5)[0] != []:
> -                        reader = codecs.getreader('utf-8')(process.stdout, 'ignore')
> -                        data = reader.read(1024, 4096)
> +                        # wait a bit for more data, tries to avoid reading single characters
> +                        time.sleep(0.2)
> +                        data = process.stdout.read()
>                          if not data:
> -                            process.stdout.close()
>                              eof = True

I'm not sure I understand the use of eof now. If stdout is non-
blocking, it would be perfectly normal for it to return with no data
but that doesn't mean we're at eof, it just means there is no data
ready?

I suspect the code might be ok but we should stop calling it eof since
it no longer is?

I also tend to dislike sleep() codepaths as it usually means the code
is sub optimal. I was going to make further suggestions but I stepped
back and looked at the bigger picture of the code.

The code is basically doing:

        if timeout:
            <complex code>
        else:
            output = process.communicate()


Python 3.3 added a timeout parameter to subprocess.communicate(). Can
we just use that instead? Or is there something special about our needs
which means we can't?
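
For context, the simpler shape being suggested would look roughly like this
(a hedged sketch only, not something proposed verbatim in the thread;
communicate() has accepted a timeout since Python 3.3):

    import subprocess

    def run_with_total_timeout(cmd, timeout=300):
        # Hypothetical helper illustrating the simpler shape: one overall
        # deadline for the whole command rather than an inactivity deadline.
        process = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                   stderr=subprocess.STDOUT)
        try:
            output = process.communicate(timeout=timeout)[0]
        except subprocess.TimeoutExpired:
            process.kill()
            output = process.communicate()[0]  # collect whatever was produced
        return process.returncode, output.decode('utf-8', errors='ignore')

As the reply below explains, the catch is that this timeout is a total
deadline for the whole command, whereas the existing loop treats it as an
inactivity deadline that is reset whenever output arrives.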

Cheers,

Richard
Mikko Rapeli Feb. 9, 2023, 9:36 a.m. UTC | #2
Hi,

On Thu, Feb 09, 2023 at 09:15:10AM +0000, Richard Purdie wrote:
> On Thu, 2023-02-09 at 10:09 +0200, Mikko Rapeli wrote:
> > When qemu machine hangs, the ssh commands done by tests
> > are not timing out. do_testimage() task has last logs like this:
> > 
> > DEBUG: time: 1673531086.3155053, endtime: 1673531686.315502
> > 
> > The test process is stuck for hours, or for ever if the
> > executing command or test case did not set a timeout correctly.
> > The default 300 second timeout is not working when target hangs.
> > Note that timeout is really a "inactive timeout" since data returned
> > by the process will reset the timeout.
> > 
> > Make the process stdout non-blocking so read() will always return
> > right away using os.set_blocking() available in python 3.5 and later.
> > 
> > Then change from python codec reader to plain read() and make
> > the ssh subprocess stdout non-blocking. Even with select()
> > making sure the file had input to be read, the codec reader was
> > trying to find more stuff and blocking for ever when process hangs.
> > 
> > While at it, add a small timeout to read data in larger chunks if
> > possible. This avoids reading data one or few characters at a time
> > and makes the debug logs more readable.
> > 
> > close() the stdout file in all cases after read loop is complete.
> > 
> > Then make sure to wait or kill the ssh subprocess in all cases.
> > Just reading the output stream and receiving EOF there does not mean
> > that the process exited, and wait() needs a timeout if the process
> > is hanging. In the end kill the process and return the return value
> > and captured output utf-8 encoded, just like before these changes.
> > 
> > This fixes ssh run() related deadlocks when a qemu target hangs
> > completely.
> > 
> > Signed-off-by: Mikko Rapeli <mikko.rapeli@linaro.org>
> > ---
> >  meta/lib/oeqa/core/target/ssh.py | 33 +++++++++++++++++++++++++-------
> >  1 file changed, 26 insertions(+), 7 deletions(-)
> > 
> > diff --git a/meta/lib/oeqa/core/target/ssh.py b/meta/lib/oeqa/core/target/ssh.py
> > index 13fd5b2a49..466a795eb4 100644
> > --- a/meta/lib/oeqa/core/target/ssh.py
> > +++ b/meta/lib/oeqa/core/target/ssh.py
> > @@ -229,22 +229,28 @@ def SSHCall(command, logger, timeout=None, **opts):
> >          if timeout:
> >              endtime = starttime + timeout
> >              eof = False
> > +            os.set_blocking(process.stdout.fileno(), False)
> >              while time.time() < endtime and not eof:
> > -                logger.debug('time: %s, endtime: %s' % (time.time(), endtime))
> >                  try:
> > +                    logger.debug('Waiting for process output: time: %s, endtime: %s' % (time.time(), endtime))
> >                      if select.select([process.stdout], [], [], 5)[0] != []:
> > -                        reader = codecs.getreader('utf-8')(process.stdout, 'ignore')
> > -                        data = reader.read(1024, 4096)
> > +                        # wait a bit for more data, tries to avoid reading single characters
> > +                        time.sleep(0.2)
> > +                        data = process.stdout.read()
> >                          if not data:
> > -                            process.stdout.close()
> >                              eof = True
> 
> I'm not sure I understand the use of eof now. If stdout is non-
> blocking, it would be perfectly normal for it to return with no data
> but that doesn't mean we're at eof, it just means there is no data
> ready?

The select() makes sure that there is data available for read(), or EOF
when the file descriptor has been closed. But a closed file descriptor
does not mean that the subprocess is dead. For that we must wait() with
a timeout and possibly kill() after that.
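
A minimal sketch of that wait-then-kill tail (helper name and grace period
are illustrative, not code from the patch):

    import subprocess

    def reap(process, grace=5):
        # Seeing EOF on stdout only means the write end of the pipe was
        # closed; the child may still be running or hung, so wait() with a
        # timeout and fall back to kill().
        if process.returncode is None:
            try:
                process.wait(timeout=grace)
            except subprocess.TimeoutExpired:
                process.kill()
                process.wait()
        return process.returncode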

> I suspect the code might be ok but we should stop calling it eof since
> it no longer is?
> 
> I also tend to dislike sleep() codepaths as it usually means the code
> is sub optimal. I was going to make further suggestions but I stepped
> back and looked at the bigger picture of the code.

I think sleep() is ok in this case to reduce busy looping over serial
console and read() more data than what select() initially saw. Reading
single character at a time is a bit silly and just wastes CPU time when
we don't need ms level accuracy.

> The code is basically doing:
> 
>         if timeout:
>             <complex code>
>         else:
>             output = process.communicate()
> 
> 
> Python 3.3 added a timeout parameter to subprocess.communicate(). Can
> we just use that instead? Or is there something special about our needs
> which means we can't?

The current code uses timeout as an "inactivity timeout" and thus reads
output using select() and read() until EOF, or until there has been no
output for the timeout period.

The code could use communicate() with a timeout but would still need the
select() and read() loop to check whether there was any output. I did
not see a simple way to rewrite this in a more pythonic way. The reader
iterators did the wrong thing and ended up deadlocking when the
subprocess hangs.

Cheers,

-Mikko
Richard Purdie Feb. 9, 2023, 10:20 a.m. UTC | #3
On Thu, 2023-02-09 at 10:09 +0200, Mikko Rapeli wrote:
> When qemu machine hangs, the ssh commands done by tests
> are not timing out. do_testimage() task has last logs like this:
> 
> DEBUG: time: 1673531086.3155053, endtime: 1673531686.315502
> 
> The test process is stuck for hours, or for ever if the
> executing command or test case did not set a timeout correctly.
> The default 300 second timeout is not working when target hangs.
> Note that timeout is really a "inactive timeout" since data returned
> by the process will reset the timeout.
> 
> Make the process stdout non-blocking so read() will always return
> right away using os.set_blocking() available in python 3.5 and later.
> 
> Then change from python codec reader to plain read() and make
> the ssh subprocess stdout non-blocking. Even with select()
> making sure the file had input to be read, the codec reader was
> trying to find more stuff and blocking for ever when process hangs.
> 
> While at it, add a small timeout to read data in larger chunks if
> possible. This avoids reading data one or few characters at a time
> and makes the debug logs more readable.
> 
> close() the stdout file in all cases after read loop is complete.
> 
> Then make sure to wait or kill the ssh subprocess in all cases.
> Just reading the output stream and receiving EOF there does not mean
> that the process exited, and wait() needs a timeout if the process
> is hanging. In the end kill the process and return the return value
> and captured output utf-8 encoded, just like before these changes.
> 
> This fixes ssh run() related deadlocks when a qemu target hangs
> completely.
> 
> Signed-off-by: Mikko Rapeli <mikko.rapeli@linaro.org>
> ---
>  meta/lib/oeqa/core/target/ssh.py | 33 +++++++++++++++++++++++++-------
>  1 file changed, 26 insertions(+), 7 deletions(-)
> 
> diff --git a/meta/lib/oeqa/core/target/ssh.py b/meta/lib/oeqa/core/target/ssh.py
> index 13fd5b2a49..466a795eb4 100644
> --- a/meta/lib/oeqa/core/target/ssh.py
> +++ b/meta/lib/oeqa/core/target/ssh.py
> @@ -229,22 +229,28 @@ def SSHCall(command, logger, timeout=None, **opts):
>          if timeout:
>              endtime = starttime + timeout
>              eof = False
> +            os.set_blocking(process.stdout.fileno(), False)
>              while time.time() < endtime and not eof:
> -                logger.debug('time: %s, endtime: %s' % (time.time(), endtime))
>                  try:
> +                    logger.debug('Waiting for process output: time: %s, endtime: %s' % (time.time(), endtime))
>                      if select.select([process.stdout], [], [], 5)[0] != []:
> -                        reader = codecs.getreader('utf-8')(process.stdout, 'ignore')
> -                        data = reader.read(1024, 4096)
> +                        # wait a bit for more data, tries to avoid reading single characters
> +                        time.sleep(0.2)
> +                        data = process.stdout.read()
>                          if not data:
> -                            process.stdout.close()
>                              eof = True
>                          else:
> -                            output += data
> -                            logger.debug('Partial data from SSH call:\n%s' % data)
> +                            # ignore errors to capture as much as possible
> +                            string = data.decode('utf-8', errors='ignore')

I've been trying to remember why we have the reader here and I think
the issue was that you can't call decode() on partial data :(.

If a multi-byte character on the stream happens to cross a read
boundary then you'll get errors, and "ignoring" them just means you'll
corrupt data. The case is rare, but with long-running commands that
produce utf-8 data, like some of the ptests, it can be an issue.

That is why the code was written the way it was...
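
A tiny illustration of that boundary problem, and of the incremental decoder
the codecs module provides for exactly this case (the example data is made
up; this is not what the patch does):

    import codecs

    # 'ä' is two bytes in UTF-8 (0xc3 0xa4); split it across two reads.
    chunks = ["hyv".encode("utf-8") + b"\xc3", b"\xa4"]

    # Decoding each chunk separately with errors='ignore' silently drops it:
    naive = "".join(c.decode("utf-8", errors="ignore") for c in chunks)
    print(repr(naive))   # 'hyv' - the character is lost

    # An incremental decoder buffers the trailing partial sequence instead:
    dec = codecs.getincrementaldecoder("utf-8")(errors="ignore")
    safe = "".join(dec.decode(c) for c in chunks) + dec.decode(b"", final=True)
    print(repr(safe))    # 'hyvä'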

Cheers,

Richard
Mikko Rapeli Feb. 9, 2023, 10:35 a.m. UTC | #4
Hi,

On Thu, Feb 09, 2023 at 10:20:36AM +0000, Richard Purdie wrote:
> On Thu, 2023-02-09 at 10:09 +0200, Mikko Rapeli wrote:
> > When qemu machine hangs, the ssh commands done by tests
> > are not timing out. do_testimage() task has last logs like this:
> > 
> > DEBUG: time: 1673531086.3155053, endtime: 1673531686.315502
> > 
> > The test process is stuck for hours, or for ever if the
> > executing command or test case did not set a timeout correctly.
> > The default 300 second timeout is not working when target hangs.
> > Note that timeout is really a "inactive timeout" since data returned
> > by the process will reset the timeout.
> > 
> > Make the process stdout non-blocking so read() will always return
> > right away using os.set_blocking() available in python 3.5 and later.
> > 
> > Then change from python codec reader to plain read() and make
> > the ssh subprocess stdout non-blocking. Even with select()
> > making sure the file had input to be read, the codec reader was
> > trying to find more stuff and blocking for ever when process hangs.
> > 
> > While at it, add a small timeout to read data in larger chunks if
> > possible. This avoids reading data one or few characters at a time
> > and makes the debug logs more readable.
> > 
> > close() the stdout file in all cases after read loop is complete.
> > 
> > Then make sure to wait or kill the ssh subprocess in all cases.
> > Just reading the output stream and receiving EOF there does not mean
> > that the process exited, and wait() needs a timeout if the process
> > is hanging. In the end kill the process and return the return value
> > and captured output utf-8 encoded, just like before these changes.
> > 
> > This fixes ssh run() related deadlocks when a qemu target hangs
> > completely.
> > 
> > Signed-off-by: Mikko Rapeli <mikko.rapeli@linaro.org>
> > ---
> >  meta/lib/oeqa/core/target/ssh.py | 33 +++++++++++++++++++++++++-------
> >  1 file changed, 26 insertions(+), 7 deletions(-)
> > 
> > diff --git a/meta/lib/oeqa/core/target/ssh.py b/meta/lib/oeqa/core/target/ssh.py
> > index 13fd5b2a49..466a795eb4 100644
> > --- a/meta/lib/oeqa/core/target/ssh.py
> > +++ b/meta/lib/oeqa/core/target/ssh.py
> > @@ -229,22 +229,28 @@ def SSHCall(command, logger, timeout=None, **opts):
> >          if timeout:
> >              endtime = starttime + timeout
> >              eof = False
> > +            os.set_blocking(process.stdout.fileno(), False)
> >              while time.time() < endtime and not eof:
> > -                logger.debug('time: %s, endtime: %s' % (time.time(), endtime))
> >                  try:
> > +                    logger.debug('Waiting for process output: time: %s, endtime: %s' % (time.time(), endtime))
> >                      if select.select([process.stdout], [], [], 5)[0] != []:
> > -                        reader = codecs.getreader('utf-8')(process.stdout, 'ignore')
> > -                        data = reader.read(1024, 4096)
> > +                        # wait a bit for more data, tries to avoid reading single characters
> > +                        time.sleep(0.2)
> > +                        data = process.stdout.read()
> >                          if not data:
> > -                            process.stdout.close()
> >                              eof = True
> >                          else:
> > -                            output += data
> > -                            logger.debug('Partial data from SSH call:\n%s' % data)
> > +                            # ignore errors to capture as much as possible
> > +                            string = data.decode('utf-8', errors='ignore')
> 
> I've been trying to remember why we have the reader here and  I think
> the issue was you can't call decode on partial data :(.

I get this. Sadly the readers read more than the file has available,
and they don't have timeouts...

> If it happens that there is a multi byte character on the stream
> crossing the boundary then you'll get errors. "ignoring" them just
> means you'll corrupt data. The case is rare but with long running
> commands with utf8 data like some of the ptests, it can be an issue.
>
> That is why the code was written the way it was...

For logging purposes the errors are fine. Should the raw data be
captured to output then? Or should there be another conversion to utf-8
after all data has been read?

Reading more than one character at a time certainly helps here, but
granted, there may still be windows where a read boundary falls in the
middle of a character's bytes...

Cheers,

-Mikko
Richard Purdie Feb. 9, 2023, 10:52 a.m. UTC | #5
On Thu, 2023-02-09 at 12:35 +0200, Mikko Rapeli wrote:
> Hi,
> 
> On Thu, Feb 09, 2023 at 10:20:36AM +0000, Richard Purdie wrote:
> > On Thu, 2023-02-09 at 10:09 +0200, Mikko Rapeli wrote:
> > > When qemu machine hangs, the ssh commands done by tests
> > > are not timing out. do_testimage() task has last logs like this:
> > > 
> > > DEBUG: time: 1673531086.3155053, endtime: 1673531686.315502
> > > 
> > > The test process is stuck for hours, or for ever if the
> > > executing command or test case did not set a timeout correctly.
> > > The default 300 second timeout is not working when target hangs.
> > > Note that timeout is really a "inactive timeout" since data returned
> > > by the process will reset the timeout.
> > > 
> > > Make the process stdout non-blocking so read() will always return
> > > right away using os.set_blocking() available in python 3.5 and later.
> > > 
> > > Then change from python codec reader to plain read() and make
> > > the ssh subprocess stdout non-blocking. Even with select()
> > > making sure the file had input to be read, the codec reader was
> > > trying to find more stuff and blocking for ever when process hangs.
> > > 
> > > While at it, add a small timeout to read data in larger chunks if
> > > possible. This avoids reading data one or few characters at a time
> > > and makes the debug logs more readable.
> > > 
> > > close() the stdout file in all cases after read loop is complete.
> > > 
> > > Then make sure to wait or kill the ssh subprocess in all cases.
> > > Just reading the output stream and receiving EOF there does not mean
> > > that the process exited, and wait() needs a timeout if the process
> > > is hanging. In the end kill the process and return the return value
> > > and captured output utf-8 encoded, just like before these changes.
> > > 
> > > This fixes ssh run() related deadlocks when a qemu target hangs
> > > completely.
> > > 
> > > Signed-off-by: Mikko Rapeli <mikko.rapeli@linaro.org>
> > > ---
> > >  meta/lib/oeqa/core/target/ssh.py | 33 +++++++++++++++++++++++++-------
> > >  1 file changed, 26 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/meta/lib/oeqa/core/target/ssh.py b/meta/lib/oeqa/core/target/ssh.py
> > > index 13fd5b2a49..466a795eb4 100644
> > > --- a/meta/lib/oeqa/core/target/ssh.py
> > > +++ b/meta/lib/oeqa/core/target/ssh.py
> > > @@ -229,22 +229,28 @@ def SSHCall(command, logger, timeout=None, **opts):
> > >          if timeout:
> > >              endtime = starttime + timeout
> > >              eof = False
> > > +            os.set_blocking(process.stdout.fileno(), False)
> > >              while time.time() < endtime and not eof:
> > > -                logger.debug('time: %s, endtime: %s' % (time.time(), endtime))
> > >                  try:
> > > +                    logger.debug('Waiting for process output: time: %s, endtime: %s' % (time.time(), endtime))
> > >                      if select.select([process.stdout], [], [], 5)[0] != []:
> > > -                        reader = codecs.getreader('utf-8')(process.stdout, 'ignore')
> > > -                        data = reader.read(1024, 4096)
> > > +                        # wait a bit for more data, tries to avoid reading single characters
> > > +                        time.sleep(0.2)
> > > +                        data = process.stdout.read()
> > >                          if not data:
> > > -                            process.stdout.close()
> > >                              eof = True
> > >                          else:
> > > -                            output += data
> > > -                            logger.debug('Partial data from SSH call:\n%s' % data)
> > > +                            # ignore errors to capture as much as possible
> > > +                            string = data.decode('utf-8', errors='ignore')
> > 
> > I've been trying to remember why we have the reader here and  I think
> > the issue was you can't call decode on partial data :(.
> 
> I get this. Sadly readers read more than file has available and they don't have
> timeouts...
> 
> > If it happens that there is a multi byte character on the stream
> > crossing the boundary then you'll get errors. "ignoring" them just
> > means you'll corrupt data. The case is rare but with long running
> > commands with utf8 data like some of the ptests, it can be an issue.
> > 
> > That is why the code was written the way it was...
> 
> For logging purposes errors are fine. Should the raw data be captured to
> output then? Or another conversion to utf-8 after all data has been
> read?

Converting at the end is one option.

The other option would be to feed the reader our own stream object
whose read method only returns the data we already have. I think you
can create dummy "file-like" objects in Python; it might be easiest
with io.BytesIO(). The codec reader could then interact with that as a
dummy stream.
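
A rough sketch of that second option (illustrative only; the class name is
invented): a persistent StreamReader over a never-blocking dummy stream keeps
any trailing partial UTF-8 sequence buffered until the rest of its bytes
arrive.

    import codecs

    class FedBuffer:
        """Dummy file-like object whose read() only returns bytes already
        fed in, so a codecs StreamReader wrapped around it can never block."""
        def __init__(self):
            self._buf = b""
        def feed(self, data):
            self._buf += data
        def read(self, size=-1):
            data, self._buf = self._buf, b""
            return data

    buf = FedBuffer()
    reader = codecs.getreader("utf-8")(buf, errors="ignore")

    buf.feed("hyv".encode("utf-8") + b"\xc3")   # chunk ends mid-character
    print(repr(reader.read()))                  # 'hyv' - 0xc3 stays buffered
    buf.feed(b"\xa4")
    print(repr(reader.read()))                  # 'ä'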


Cheers,

Richard
Alexander Kanavin Feb. 9, 2023, 10:53 a.m. UTC | #6
On Thu, 9 Feb 2023 at 10:36, Mikko Rapeli <mikko.rapeli@linaro.org> wrote:
> I think sleep() is ok in this case to reduce busy looping over serial
> console and read() more data than what select() initially saw. Reading
> single character at a time is a bit silly and just wastes CPU time when
> we don't need ms level accuracy.

Something doesn't compute here. The writer should be sending data in
chunks, and there are buffers all the way to the reader, so where does
the splitting into single characters happen?

Alex
Richard Purdie Feb. 9, 2023, 10:56 a.m. UTC | #7
On Thu, 2023-02-09 at 11:53 +0100, Alexander Kanavin wrote:
> On Thu, 9 Feb 2023 at 10:36, Mikko Rapeli <mikko.rapeli@linaro.org> wrote:
> > I think sleep() is ok in this case to reduce busy looping over serial
> > console and read() more data than what select() initially saw. Reading
> > single character at a time is a bit silly and just wastes CPU time when
> > we don't need ms level accuracy.
> 
> Something doesn't compute here. The writer should be sending data in
> chunks, and there are buffers all the way to the reader, so where does
> the splitting into single characters happen?

Just guessing, but in the case of the serial emulation in qemu,
115200 bps is probably "slow" compared to the system speed, so the
characters arrive as individual bytes?
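
Back-of-the-envelope numbers behind that guess (not from the thread):

    # With 8N1 framing each byte takes 10 bit times on the (emulated) wire.
    baud = 115200
    bytes_per_s = baud / 10
    print(bytes_per_s)                    # 11520.0 bytes/s
    print(1_000_000 / bytes_per_s, "us")  # ~86.8 us between bytes
    # A select()/read() loop on the host can easily wake more often than
    # that, so reads of one or a few bytes at a time are plausible.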

Cheers,

Richard
Mikko Rapeli Feb. 9, 2023, 11:12 a.m. UTC | #8
Hi,

On Thu, Feb 09, 2023 at 11:53:00AM +0100, Alexander Kanavin wrote:
> On Thu, 9 Feb 2023 at 10:36, Mikko Rapeli <mikko.rapeli@linaro.org> wrote:
> > I think sleep() is ok in this case to reduce busy looping over serial
> > console and read() more data than what select() initially saw. Reading
> > single character at a time is a bit silly and just wastes CPU time when
> > we don't need ms level accuracy.
> 
> Something doesn't compute here. The writer should be sending data in
> chunks, and there are buffers all the way to the reader, so where does
> the splitting into single characters happen?

Sorry I mixed with the serial console patch, but the output was similar
over ssh.

select() returns with any data, even a single character. Fancy readers
will read larger chunks, but they try to read more data than is
available and thus block forever. A plain read() only reads the data
that select() saw as available, which can be really small amounts, even
single characters, if that is all that has arrived over the socket.
This is visible in the partial read output of the do_testimage() task
log, which becomes quite unreadable without this change.

Cheers,

-Mikko
Alexander Kanavin Feb. 9, 2023, 11:42 a.m. UTC | #9
On Thu, 9 Feb 2023 at 12:12, Mikko Rapeli <mikko.rapeli@linaro.org> wrote:
> select() returns with any data, even single character. Fancy readers
> will read larger chunks but will read more data than is available and thus
> block for ever. Plain read() will read the data which is available from
> select() and thus really small amounts like single characters if that's
> available over the socket. This is visible in the partial read output of
> do_testimage() task output log which becomes quite unreadable without change.

Yes, but this does not explain why single characters happen to begin
with - RP offered a hypothesis, but the problem is that data arrives
that way, not the tight-loop manner in which we're reading it.

Alex
Mikko Rapeli Feb. 9, 2023, 12:23 p.m. UTC | #10
Hi,

On Thu, Feb 09, 2023 at 12:42:09PM +0100, Alexander Kanavin wrote:
> On Thu, 9 Feb 2023 at 12:12, Mikko Rapeli <mikko.rapeli@linaro.org> wrote:
> > select() returns with any data, even single character. Fancy readers
> > will read larger chunks but will read more data than is available and thus
> > block for ever. Plain read() will read the data which is available from
> > select() and thus really small amounts like single characters if that's
> > available over the socket. This is visible in the partial read output of
> > do_testimage() task output log which becomes quite unreadable without change.
> 
> Yes, but this does not explain why single characters happen to begin
> with - RP offered a hypothesis, but the problem is that data arrives
> that way, not the tight-loop manner in which we're reading it.

Without the sleep(0.2) with ssh, the do_testimage task output can be like this (which
isn't too bad, just annoying):

DEBUG: [Running]$ ssh -l root -o ServerAliveCountMax=2 -o ServerAliveInterval=30 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o LogLevel=ERROR -p 2222 127.0.0.1 export PATH=/usr/sbin:/sbin:/usr/bin:/bin; cat /etc/controllerimage
DEBUG: Waiting for process output: time: 1675942983.6053402, endtime: 1675943283.6033137
DEBUG: Partial data from SSH call:
cat: 
DEBUG: Waiting for process output: time: 1675942986.2085373, endtime: 1675943286.208536
DEBUG: Partial data from SSH call:
/etc/controllerimage
DEBUG: Waiting for process output: time: 1675942986.2111514, endtime: 1675943286.2111504
DEBUG: Partial data from SSH call:
: No such file or directory
DEBUG: Waiting for process output: time: 1675942986.2140906, endtime: 1675943286.2140894
DEBUG: Partial data from SSH call:


DEBUG: Waiting for process output: time: 1675942986.216261, endtime: 1675943286.21626
DEBUG: [Command returned '1' after 2.64 seconds]

I think qemu is just slower than anyone expects, including the ssh
protocol, which does some buffering but then gives up. It may also be
that the kernel and image running under qemu are busy starting up, e.g.
generating entropy or keys.

With the sleep(0.2) output is:

DEBUG: [Running]$ ssh -l root -o ServerAliveCountMax=2 -o ServerAliveInterval=30 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o LogLevel=ERROR -p 2222 127.0.0.1 export PATH=/usr/sbin:/sbin:/usr/bin:/bin; cat /etc/controllerimage
DEBUG: Waiting for process output: time: 1675945065.6815689, endtime: 1675945365.6790204
DEBUG: Partial data from SSH call:
cat: /etc/controllerimage: No such file or directory

DEBUG: Waiting for process output: time: 1675945068.308055, endtime: 1675945368.3080513
DEBUG: [Command returned '1' after 2.83 seconds]
DEBUG: Command: cat /etc/controllerimage
Status: 1 Output:
cat: /etc/controllerimage: No such file or directory

NOTE:  ... ok

A similar thing happens over the serial console, which is even slower.
I know debug logs of the data read from and written to serial are not
wanted, but I had the changes in place and saw bootloader and kernel
messages being read initially one character at a time, and later on
5-10 characters at a time, which made debugging problems really
annoying. The small sleep does improve things; it doesn't solve
everything, but it also doesn't really do any harm.

No data is lost, and less time is spent looping around the file
descriptor.

Cheers,

-Mikko

Patch

diff --git a/meta/lib/oeqa/core/target/ssh.py b/meta/lib/oeqa/core/target/ssh.py
index 13fd5b2a49..466a795eb4 100644
--- a/meta/lib/oeqa/core/target/ssh.py
+++ b/meta/lib/oeqa/core/target/ssh.py
@@ -229,22 +229,28 @@  def SSHCall(command, logger, timeout=None, **opts):
         if timeout:
             endtime = starttime + timeout
             eof = False
+            os.set_blocking(process.stdout.fileno(), False)
             while time.time() < endtime and not eof:
-                logger.debug('time: %s, endtime: %s' % (time.time(), endtime))
                 try:
+                    logger.debug('Waiting for process output: time: %s, endtime: %s' % (time.time(), endtime))
                     if select.select([process.stdout], [], [], 5)[0] != []:
-                        reader = codecs.getreader('utf-8')(process.stdout, 'ignore')
-                        data = reader.read(1024, 4096)
+                        # wait a bit for more data, tries to avoid reading single characters
+                        time.sleep(0.2)
+                        data = process.stdout.read()
                         if not data:
-                            process.stdout.close()
                             eof = True
                         else:
-                            output += data
-                            logger.debug('Partial data from SSH call:\n%s' % data)
+                            # ignore errors to capture as much as possible
+                            string = data.decode('utf-8', errors='ignore')
+                            output += string
+                            logger.debug('Partial data from SSH call:\n%s' % string)
                             endtime = time.time() + timeout
                 except InterruptedError:
+                    logger.debug('InterruptedError')
                     continue
 
+            process.stdout.close()
+
             # process hasn't returned yet
             if not eof:
                 process.terminate()
@@ -252,6 +258,7 @@  def SSHCall(command, logger, timeout=None, **opts):
                 try:
                     process.kill()
                 except OSError:
+                    logger.debug('OSError when killing process')
                     pass
                 endtime = time.time() - starttime
                 lastline = ("\nProcess killed - no output for %d seconds. Total"
@@ -263,6 +270,17 @@  def SSHCall(command, logger, timeout=None, **opts):
             output = process.communicate()[0].decode('utf-8', errors='ignore')
             logger.debug('Data from SSH call:\n%s' % output.rstrip())
 
+        # timout or not, make sure process exits and is not hanging
+        if process.returncode == None:
+            try:
+                process.wait(timeout=5)
+            except TimeoutExpired:
+                try:
+                    process.kill()
+                except OSError:
+                    logger.debug('OSError')
+                    pass
+
     options = {
         "stdout": subprocess.PIPE,
         "stderr": subprocess.STDOUT,
@@ -290,4 +308,5 @@  def SSHCall(command, logger, timeout=None, **opts):
             process.kill()
         logger.debug('Something went wrong, killing SSH process')
         raise
-    return (process.wait(), output.rstrip())
+
+    return (process.returncode, output.rstrip())