diff mbox series

[2/2] qemurunner: Impove handling of serial port output blocking

Message ID 20231218152951.3398964-2-richard.purdie@linuxfoundation.org
State New
Headers show
Series [1/2] qemurunner: Impove stdout logging handling | expand

Commit Message

Richard Purdie Dec. 18, 2023, 3:29 p.m. UTC
Similar to stdout in the previous commit, we need to ensure serial output
if written is read and put somewhere, else qemu might block on writes to
the serial port leading to hangs in the kernel. Use our existing logging
thread to log data when run_serial is not in use.

Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
---
 meta/lib/oeqa/utils/qemurunner.py | 97 ++++++++++++++++++++++---------
 1 file changed, 68 insertions(+), 29 deletions(-)

Comments

Alex Bennée Dec. 19, 2023, 3:10 p.m. UTC | #1
"Richard Purdie" <richard.purdie@linuxfoundation.org> writes:

> Similar to stdout in the previous commit, we need to ensure serial output
> if written is read and put somewhere, else qemu might block on writes to
> the serial port leading to hangs in the kernel. Use our existing logging
> thread to log data when run_serial is not in use.
>
> Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>

I'm still seeing hanging with my test case:

  o regression_1010.5 Invalid memory access 5
    regression_1010.5 OK
  o regression_1010.6 Invalid memory access 6

  Stdout:
  Tried to dump info from target but serial console failed
  Failed CMD: top -bn1
  Tried to dump info from target but serial console failed
  Failed CMD: ps
  Tried to dump info from target but serial console failed
  Failed CMD: free

  NOTE: test_soafee (soafeetestsuite.SoafeeTestSuite.test_soafee)
  DEBUG: Checking if at least one of soafee-test-suite is installed
  DEBUG: Setting up a 400 second(s) timeout
  DEBUG: [Running]$ ssh -l root -o ServerAliveCountMax=2 -o ServerAliveInterval=30 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o LogLevel=ERROR -p 2222 127.0.0.1 export PATH=/usr/sbin:/sbin:/usr/bin:/bin; sudo -i -u trs soafee-test-suite run -r -t -T ; rm -rf /tmp/soafee-test-suite
  DEBUG: Waiting for process output: time: 1702998407.0291, endtime: 1702998707.0285847
  DEBUG: Waiting for process output: time: 1702998412.0338314, endtime: 1702998707.0285847
  DEBUG: Waiting for process output: time: 1702998417.0378022, endtime: 1702998707.0285847
  DEBUG: Waiting for process output: time: 1702998422.0417945, endtime: 1702998707.0285847
  <snip>

The regression tests are hanging because that has triggered output on
the serial port as the firmware attempted to service the test and the
port isn't being drained. Then when that times out we start spinning
waiting for more output.
Richard Purdie Dec. 19, 2023, 3:15 p.m. UTC | #2
On Tue, 2023-12-19 at 15:10 +0000, Alex Bennée wrote:
> "Richard Purdie" <richard.purdie@linuxfoundation.org> writes:
> 
> > Similar to stdout in the previous commit, we need to ensure serial output
> > if written is read and put somewhere, else qemu might block on writes to
> > the serial port leading to hangs in the kernel. Use our existing logging
> > thread to log data when run_serial is not in use.
> > 
> > Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
> 
> I'm still seeing hanging with my test case:
> 
>   o regression_1010.5 Invalid memory access 5
>     regression_1010.5 OK
>   o regression_1010.6 Invalid memory access 6
> 
>   Stdout:
>   Tried to dump info from target but serial console failed
>   Failed CMD: top -bn1
>   Tried to dump info from target but serial console failed
>   Failed CMD: ps
>   Tried to dump info from target but serial console failed
>   Failed CMD: free
> 
>   NOTE: test_soafee (soafeetestsuite.SoafeeTestSuite.test_soafee)
>   DEBUG: Checking if at least one of soafee-test-suite is installed
>   DEBUG: Setting up a 400 second(s) timeout
>   DEBUG: [Running]$ ssh -l root -o ServerAliveCountMax=2 -o ServerAliveInterval=30 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o LogLevel=ERROR -p 2222 127.0.0.1 export PATH=/usr/sbin:/sbin:/usr/bin:/bin; sudo -i -u trs soafee-test-suite run -r -t -T ; rm -rf /tmp/soafee-test-suite
>   DEBUG: Waiting for process output: time: 1702998407.0291, endtime: 1702998707.0285847
>   DEBUG: Waiting for process output: time: 1702998412.0338314, endtime: 1702998707.0285847
>   DEBUG: Waiting for process output: time: 1702998417.0378022, endtime: 1702998707.0285847
>   DEBUG: Waiting for process output: time: 1702998422.0417945, endtime: 1702998707.0285847
>   <snip>
> 
> The regression tests are hanging because that has triggered output on
> the serial port as the firmware attempted to service the test and the
> port isn't being drained. Then when that times out we start spinning
> waiting for more output.

I don't really understand where it might be hanging unfortunately as in
my local tests it is draining stdout and the serial ports. Does it
login to the serial console correctly? Are there any WARNING messages?

I did use this debug patch:

https://git.yoctoproject.org/poky-contrib/commit/?h=rpurdie/t222&id=0d9029bad610736c114e0770d2433b53133fa620

to prove that it can interleave serial commands with left over output
successfully. A similar patch may help work out where your setup may be
hanging?

Cheers,

Richard
Alex Bennée Dec. 19, 2023, 4:44 p.m. UTC | #3
Richard Purdie <richard.purdie@linuxfoundation.org> writes:

> On Tue, 2023-12-19 at 15:10 +0000, Alex Bennée wrote:
>> "Richard Purdie" <richard.purdie@linuxfoundation.org> writes:
>> 
>> > Similar to stdout in the previous commit, we need to ensure serial output
>> > if written is read and put somewhere, else qemu might block on writes to
>> > the serial port leading to hangs in the kernel. Use our existing logging
>> > thread to log data when run_serial is not in use.
>> > 
>> > Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
>> 
>> I'm still seeing hanging with my test case:
>> 
>>   o regression_1010.5 Invalid memory access 5
>>     regression_1010.5 OK
>>   o regression_1010.6 Invalid memory access 6
>> 
>>   Stdout:
>>   Tried to dump info from target but serial console failed
>>   Failed CMD: top -bn1
>>   Tried to dump info from target but serial console failed
>>   Failed CMD: ps
>>   Tried to dump info from target but serial console failed
>>   Failed CMD: free
>> 
>>   NOTE: test_soafee (soafeetestsuite.SoafeeTestSuite.test_soafee)
>>   DEBUG: Checking if at least one of soafee-test-suite is installed
>>   DEBUG: Setting up a 400 second(s) timeout
>>   DEBUG: [Running]$ ssh -l root -o ServerAliveCountMax=2 -o ServerAliveInterval=30 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o LogLevel=ERROR -p 2222 127.0.0.1 export PATH=/usr/sbin:/sbin:/usr/bin:/bin; sudo -i -u trs soafee-test-suite run -r -t -T ; rm -rf /tmp/soafee-test-suite
>>   DEBUG: Waiting for process output: time: 1702998407.0291, endtime: 1702998707.0285847
>>   DEBUG: Waiting for process output: time: 1702998412.0338314, endtime: 1702998707.0285847
>>   DEBUG: Waiting for process output: time: 1702998417.0378022, endtime: 1702998707.0285847
>>   DEBUG: Waiting for process output: time: 1702998422.0417945, endtime: 1702998707.0285847
>>   <snip>
>> 
>> The regression tests are hanging because that has triggered output on
>> the serial port as the firmware attempted to service the test and the
>> port isn't being drained. Then when that times out we start spinning
>> waiting for more output.
>
> I don't really understand where it might be hanging unfortunately as in
> my local tests it is draining stdout and the serial ports. Does it
> login to the serial console correctly? Are there any WARNING messages?
>
> I did use this debug patch:
>
> https://git.yoctoproject.org/poky-contrib/commit/?h=rpurdie/t222&id=0d9029bad610736c114e0770d2433b53133fa620
>
> to prove that it can interleave serial commands with left over output
> successfully. A similar patch may help work out where your setup may be
> hanging?

Yeah I'm only seeing A10/A11:

  o regression_1010.5 Invalid memory access 5
    regression_1010.5 OK
  o regression_1010.6 Invalid memory access 6

  WARNING: here A10
  WARNING: here A11
  WARNING: here A10
  WARNING: here A11
  WARNING: here A10
  WARNING: here A11
  WARNING: here A10
  WARNING: here A11
  WARNING: here A10
  WARNING: here A11
  WARNING: here A10
  WARNING: here A11
  WARNING: here A10
  WARNING: here A11
  WARNING: here A10
  WARNING: here A11
  WARNING: here A10
  WARNING: here A11
  WARNING: here A10
  WARNING: here A11
  NOTE:  ... FAIL
  DEBUG: Removed SIGALRM handler
  Traceback (most recent call last):
    File "/home/alex/lsrc/tests/trs/build/../poky/meta/lib/oeqa/core/decorator/__init__.py", line 35, in wrapped_f
      return func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
    File "/home/alex/lsrc/tests/trs/build/../poky/meta/lib/oeqa/core/decorator/__init__.py", line 35, in wrapped_f
      return func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
    File "/home/alex/lsrc/tests/trs/meta-trs/lib/oeqa/runtime/cases/opteetest.py", line 27, in test_opteetest_xtest
      self.assertEqual(status, 0, msg='\n'.join([cmd, output]))
  AssertionError: 255 != 0 :  xtest               
  Run test suite with level=0

Those are the only two bits hit in the logging:

  grep "WARNING: here" build/tmp_trs-qemuarm64/work/trs_qemuarm64-trs-linux/trs-image/1.0/temp/log.do_testimage
  WARNING: here A10
  WARNING: here A11
  WARNING: here A10
  WARNING: here A11
  WARNING: here A10
  WARNING: here A11
  WARNING: here A10
  WARNING: here A11
  WARNING: here A10
  WARNING: here A11
  WARNING: here A10
  WARNING: here A11
  WARNING: here A10
  WARNING: here A11
  WARNING: here A10
  WARNING: here A11
  WARNING: here A10
  WARNING: here A11
  WARNING: here A10
  WARNING: here A11
  WARNING: here A10
  WARNING: here A11
  WARNING: here A10
  WARNING: here A11
  WARNING: here A10
  WARNING: here A11
  WARNING: here A10
  WARNING: here A11
  WARNING: here A10
  WARNING: here A11
  WARNING: here A10
  WARNING: here A11
  WARNING: here A10
  WARNING: here A11
  WARNING: here A10
  WARNING: here A11
  WARNING: here A10
  WARNING: here A11
  WARNING: here A10
  WARNING: here A11
  WARNING: here A10
  WARNING: here A11

>
> Cheers,
>
> Richard
Richard Purdie Dec. 19, 2023, 10:57 p.m. UTC | #4
On Tue, 2023-12-19 at 16:44 +0000, Alex Bennée wrote:
> Richard Purdie <richard.purdie@linuxfoundation.org> writes:
> 
> > On Tue, 2023-12-19 at 15:10 +0000, Alex Bennée wrote:
> > > "Richard Purdie" <richard.purdie@linuxfoundation.org> writes:
> > > 
> > > > Similar to stdout in the previous commit, we need to ensure serial output
> > > > if written is read and put somewhere, else qemu might block on writes to
> > > > the serial port leading to hangs in the kernel. Use our existing logging
> > > > thread to log data when run_serial is not in use.
> > > > 
> > > > Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
> > > 
> > > I'm still seeing hanging with my test case:
> > > 
> > >   o regression_1010.5 Invalid memory access 5
> > >     regression_1010.5 OK
> > >   o regression_1010.6 Invalid memory access 6
> > > 
> > >   Stdout:
> > >   Tried to dump info from target but serial console failed
> > >   Failed CMD: top -bn1
> > >   Tried to dump info from target but serial console failed
> > >   Failed CMD: ps
> > >   Tried to dump info from target but serial console failed
> > >   Failed CMD: free
> > > 
> > >   NOTE: test_soafee (soafeetestsuite.SoafeeTestSuite.test_soafee)
> > >   DEBUG: Checking if at least one of soafee-test-suite is installed
> > >   DEBUG: Setting up a 400 second(s) timeout
> > >   DEBUG: [Running]$ ssh -l root -o ServerAliveCountMax=2 -o ServerAliveInterval=30 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o LogLevel=ERROR -p 2222 127.0.0.1 export PATH=/usr/sbin:/sbin:/usr/bin:/bin; sudo -i -u trs soafee-test-suite run -r -t -T ; rm -rf /tmp/soafee-test-suite
> > >   DEBUG: Waiting for process output: time: 1702998407.0291, endtime: 1702998707.0285847
> > >   DEBUG: Waiting for process output: time: 1702998412.0338314, endtime: 1702998707.0285847
> > >   DEBUG: Waiting for process output: time: 1702998417.0378022, endtime: 1702998707.0285847
> > >   DEBUG: Waiting for process output: time: 1702998422.0417945, endtime: 1702998707.0285847
> > >   <snip>
> > > 
> > > The regression tests are hanging because that has triggered output on
> > > the serial port as the firmware attempted to service the test and the
> > > port isn't being drained. Then when that times out we start spinning
> > > waiting for more output.
> > 
> > I don't really understand where it might be hanging unfortunately as in
> > my local tests it is draining stdout and the serial ports. Does it
> > login to the serial console correctly? Are there any WARNING messages?
> > 
> > I did use this debug patch:
> > 
> > https://git.yoctoproject.org/poky-contrib/commit/?h=rpurdie/t222&id=0d9029bad610736c114e0770d2433b53133fa620
> > 
> > to prove that it can interleave serial commands with left over output
> > successfully. A similar patch may help work out where your setup may be
> > hanging?
> 
> Yeah I'm only seeing A10/A11:
> 
>   o regression_1010.5 Invalid memory access 5
>     regression_1010.5 OK
>   o regression_1010.6 Invalid memory access 6
> 
>   WARNING: here A10
>   WARNING: here A11
>   WARNING: here A10
>   WARNING: here A11
>   WARNING: here A10
>   WARNING: here A11
>   WARNING: here A10
>   WARNING: here A11
>   WARNING: here A10
>   WARNING: here A11
>   WARNING: here A10
>   WARNING: here A11
>   WARNING: here A10
>   WARNING: here A11
>   WARNING: here A10
>   WARNING: here A11
>   WARNING: here A10
>   WARNING: here A11
>   WARNING: here A10
>   WARNING: here A11
>   NOTE:  ... FAIL
>   DEBUG: Removed SIGALRM handler
>   Traceback (most recent call last):
>     File "/home/alex/lsrc/tests/trs/build/../poky/meta/lib/oeqa/core/decorator/__init__.py", line 35, in wrapped_f
>       return func(*args, **kwargs)
>              ^^^^^^^^^^^^^^^^^^^^^
>     File "/home/alex/lsrc/tests/trs/build/../poky/meta/lib/oeqa/core/decorator/__init__.py", line 35, in wrapped_f
>       return func(*args, **kwargs)
>              ^^^^^^^^^^^^^^^^^^^^^
>     File "/home/alex/lsrc/tests/trs/meta-trs/lib/oeqa/runtime/cases/opteetest.py", line 27, in test_opteetest_xtest
>       self.assertEqual(status, 0, msg='\n'.join([cmd, output]))
>   AssertionError: 255 != 0 :  xtest               
>   Run test suite with level=0

How long does it take for the test data to "back up" and timeout? The
logging thread should take a maximum of 2s to notice data not being
read/handled on the serial port.

What the code should be doing is noticing the run_serial function isn't
using the port and if there is data there, reading it in the logging
thread. You could put further warnings around the code there to see if
it ever detects data to read on that port. A warning when it exits the
run_serial command may help too to see if it is waiting for data there
or not.

I also ended up using:

https://git.yoctoproject.org/poky-contrib/commit/?h=rpurdie/t222&id=febb0d892e4c3648cd88b8ecb9c0260edc45b68e

to show what was going over the serial interface when I found it wasn't
what I expected.

I suspect/hope my patch is a step in the right direction but it looks
like there may be cases it doesn't catch yet :/.

Cheers,

Richard
Alex Bennée Dec. 20, 2023, 9:53 a.m. UTC | #5
Richard Purdie <richard.purdie@linuxfoundation.org> writes:

> On Tue, 2023-12-19 at 16:44 +0000, Alex Bennée wrote:
>> Richard Purdie <richard.purdie@linuxfoundation.org> writes:
>> 
>> > On Tue, 2023-12-19 at 15:10 +0000, Alex Bennée wrote:
>> > > "Richard Purdie" <richard.purdie@linuxfoundation.org> writes:
>> > > 
>> > > > Similar to stdout in the previous commit, we need to ensure serial output
>> > > > if written is read and put somewhere, else qemu might block on writes to
>> > > > the serial port leading to hangs in the kernel. Use our existing logging
>> > > > thread to log data when run_serial is not in use.
>> > > > 
>> > > > Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
>> > > 
>> > > I'm still seeing hanging with my test case:
>> > > 
>> > >   o regression_1010.5 Invalid memory access 5
>> > >     regression_1010.5 OK
>> > >   o regression_1010.6 Invalid memory access 6
>> > > 
>> > >   Stdout:
>> > >   Tried to dump info from target but serial console failed
>> > >   Failed CMD: top -bn1
>> > >   Tried to dump info from target but serial console failed
>> > >   Failed CMD: ps
>> > >   Tried to dump info from target but serial console failed
>> > >   Failed CMD: free
>> > > 
>> > >   NOTE: test_soafee (soafeetestsuite.SoafeeTestSuite.test_soafee)
>> > >   DEBUG: Checking if at least one of soafee-test-suite is installed
>> > >   DEBUG: Setting up a 400 second(s) timeout
>> > >   DEBUG: [Running]$ ssh -l root -o ServerAliveCountMax=2 -o ServerAliveInterval=30 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o LogLevel=ERROR -p 2222 127.0.0.1 export PATH=/usr/sbin:/sbin:/usr/bin:/bin; sudo -i -u trs soafee-test-suite run -r -t -T ; rm -rf /tmp/soafee-test-suite
>> > >   DEBUG: Waiting for process output: time: 1702998407.0291, endtime: 1702998707.0285847
>> > >   DEBUG: Waiting for process output: time: 1702998412.0338314, endtime: 1702998707.0285847
>> > >   DEBUG: Waiting for process output: time: 1702998417.0378022, endtime: 1702998707.0285847
>> > >   DEBUG: Waiting for process output: time: 1702998422.0417945, endtime: 1702998707.0285847
>> > >   <snip>
>> > > 
>> > > The regression tests are hanging because that has triggered output on
>> > > the serial port as the firmware attempted to service the test and the
>> > > port isn't being drained. Then when that times out we start spinning
>> > > waiting for more output.
>> > 
>> > I don't really understand where it might be hanging unfortunately as in
>> > my local tests it is draining stdout and the serial ports. Does it
>> > login to the serial console correctly? Are there any WARNING messages?
>> > 
>> > I did use this debug patch:
>> > 
>> > https://git.yoctoproject.org/poky-contrib/commit/?h=rpurdie/t222&id=0d9029bad610736c114e0770d2433b53133fa620
>> > 
>> > to prove that it can interleave serial commands with left over output
>> > successfully. A similar patch may help work out where your setup may be
>> > hanging?
>> 
>> Yeah I'm only seeing A10/A11:
>> 
>>   o regression_1010.5 Invalid memory access 5
>>     regression_1010.5 OK
>>   o regression_1010.6 Invalid memory access 6
>> 
>>   WARNING: here A10
>>   WARNING: here A11
>>   WARNING: here A10
>>   WARNING: here A11
>>   WARNING: here A10
>>   WARNING: here A11
>>   WARNING: here A10
>>   WARNING: here A11
>>   WARNING: here A10
>>   WARNING: here A11
>>   WARNING: here A10
>>   WARNING: here A11
>>   WARNING: here A10
>>   WARNING: here A11
>>   WARNING: here A10
>>   WARNING: here A11
>>   WARNING: here A10
>>   WARNING: here A11
>>   WARNING: here A10
>>   WARNING: here A11
>>   NOTE:  ... FAIL
>>   DEBUG: Removed SIGALRM handler
>>   Traceback (most recent call last):
>>     File "/home/alex/lsrc/tests/trs/build/../poky/meta/lib/oeqa/core/decorator/__init__.py", line 35, in wrapped_f
>>       return func(*args, **kwargs)
>>              ^^^^^^^^^^^^^^^^^^^^^
>>     File "/home/alex/lsrc/tests/trs/build/../poky/meta/lib/oeqa/core/decorator/__init__.py", line 35, in wrapped_f
>>       return func(*args, **kwargs)
>>              ^^^^^^^^^^^^^^^^^^^^^
>>     File "/home/alex/lsrc/tests/trs/meta-trs/lib/oeqa/runtime/cases/opteetest.py", line 27, in test_opteetest_xtest
>>       self.assertEqual(status, 0, msg='\n'.join([cmd, output]))
>>   AssertionError: 255 != 0 :  xtest               
>>   Run test suite with level=0
>
> How long does it take for the test data to "back up" and timeout? The
> logging thread should take a maximum of 2s to notice data not being
> read/handled on the serial port.

I don't think it goes around the logging loop more than once. I only
see:

  grep "got" build/tmp_trs-qemuarm64/work/trs_qemuarm64-trs-linux/trs-image/1.0/temp/log.do_testimage
  DEBUG: got [(12, 1)]

from the poll.

>
> What the code should be doing is noticing the run_serial function isn't
> using the port and if there is data there, reading it in the logging
> thread. You could put further warnings around the code there to see if
> it ever detects data to read on that port. A warning when it exits the
> run_serial command may help too to see if it is waiting for data there
> or not.
>
> I also ended up using:
>
> https://git.yoctoproject.org/poky-contrib/commit/?h=rpurdie/t222&id=febb0d892e4c3648cd88b8ecb9c0260edc45b68e
>
> to show what was going over the serial interface when I found it wasn't
> what I expected.

I'll keep digging.

>
> I suspect/hope my patch is a step in the right direction but it looks
> like there may be cases it doesn't catch yet :/.
>
> Cheers,
>
> Richard
Richard Purdie Dec. 20, 2023, 10:04 a.m. UTC | #6
On Wed, 2023-12-20 at 09:53 +0000, Alex Bennée wrote:
> Richard Purdie <richard.purdie@linuxfoundation.org> writes:
> 
> > On Tue, 2023-12-19 at 16:44 +0000, Alex Bennée wrote:
> > > Richard Purdie <richard.purdie@linuxfoundation.org> writes:
> > > Yeah I'm only seeing A10/A11:
> > > 
> > >   o regression_1010.5 Invalid memory access 5
> > >     regression_1010.5 OK
> > >   o regression_1010.6 Invalid memory access 6
> > > 
> > >   WARNING: here A10
> > >   WARNING: here A11
> > >   WARNING: here A10
> > >   WARNING: here A11
> > >   WARNING: here A10
> > >   WARNING: here A11
> > >   WARNING: here A10
> > >   WARNING: here A11
> > >   WARNING: here A10
> > >   WARNING: here A11
> > >   WARNING: here A10
> > >   WARNING: here A11
> > >   WARNING: here A10
> > >   WARNING: here A11
> > >   WARNING: here A10
> > >   WARNING: here A11
> > >   WARNING: here A10
> > >   WARNING: here A11
> > >   WARNING: here A10
> > >   WARNING: here A11
> > >   NOTE:  ... FAIL
> > >   DEBUG: Removed SIGALRM handler
> > >   Traceback (most recent call last):
> > >     File "/home/alex/lsrc/tests/trs/build/../poky/meta/lib/oeqa/core/decorator/__init__.py", line 35, in wrapped_f
> > >       return func(*args, **kwargs)
> > >              ^^^^^^^^^^^^^^^^^^^^^
> > >     File "/home/alex/lsrc/tests/trs/build/../poky/meta/lib/oeqa/core/decorator/__init__.py", line 35, in wrapped_f
> > >       return func(*args, **kwargs)
> > >              ^^^^^^^^^^^^^^^^^^^^^
> > >     File "/home/alex/lsrc/tests/trs/meta-trs/lib/oeqa/runtime/cases/opteetest.py", line 27, in test_opteetest_xtest
> > >       self.assertEqual(status, 0, msg='\n'.join([cmd, output]))
> > >   AssertionError: 255 != 0 :  xtest               
> > >   Run test suite with level=0
> > 
> > How long does it take for the test data to "back up" and timeout? The
> > logging thread should take a maximum of 2s to notice data not being
> > read/handled on the serial port.
> 
> I don't think it goes around the logging loop more than once. I only
> see:
> 
>   grep "got" build/tmp_trs-qemuarm64/work/trs_qemuarm64-trs-linux/trs-image/1.0/temp/log.do_testimage
>   DEBUG: got [(12, 1)]
> 
> from the poll.
> 
> > 
> > What the code should be doing is noticing the run_serial function isn't
> > using the port and if there is data there, reading it in the logging
> > thread. You could put further warnings around the code there to see if
> > it ever detects data to read on that port. A warning when it exits the
> > run_serial command may help too to see if it is waiting for data there
> > or not.
> > 
> > I also ended up using:
> > 
> > https://git.yoctoproject.org/poky-contrib/commit/?h=rpurdie/t222&id=febb0d892e4c3648cd88b8ecb9c0260edc45b68e
> > 
> > to show what was going over the serial interface when I found it wasn't
> > what I expected.
> 
> I'll keep digging.

I've merged the patches into master since they are helping the stdout
issue and helped a number of issues others were seeing.

In the final version of the patch I did fix an issue where the
exception handling from the thread was not working so well, it does at
least print a, exception/traceback properly now (in threadtarget()).

One of the reasons I asked about the warnings was in case that was
showing an issue, previously it silently failed.

I can believe there are other issues hiding in here :/.

Cheers,

Richard
Alex Bennée Dec. 20, 2023, 4:28 p.m. UTC | #7
Richard Purdie <richard.purdie@linuxfoundation.org> writes:

> On Wed, 2023-12-20 at 09:53 +0000, Alex Bennée wrote:
>> Richard Purdie <richard.purdie@linuxfoundation.org> writes:
>> 
>> > On Tue, 2023-12-19 at 16:44 +0000, Alex Bennée wrote:
>> > > Richard Purdie <richard.purdie@linuxfoundation.org> writes:
>> > > Yeah I'm only seeing A10/A11:
>> > > 
>> > >   o regression_1010.5 Invalid memory access 5
>> > >     regression_1010.5 OK
>> > >   o regression_1010.6 Invalid memory access 6
>> > > 
>> > >   WARNING: here A10
>> > >   WARNING: here A11
<snip>
>> > 
>> > How long does it take for the test data to "back up" and timeout? The
>> > logging thread should take a maximum of 2s to notice data not being
>> > read/handled on the serial port.
>> 
>> I don't think it goes around the logging loop more than once. I only
>> see:
>> 
>>   grep "got" build/tmp_trs-qemuarm64/work/trs_qemuarm64-trs-linux/trs-image/1.0/temp/log.do_testimage
>>   DEBUG: got [(12, 1)]
>> 
>> from the poll.
>> 
<snip>
>> I'll keep digging.
>
> I've merged the patches into master since they are helping the stdout
> issue and helped a number of issues others were seeing.
>
> In the final version of the patch I did fix an issue where the
> exception handling from the thread was not working so well, it does at
> least print a, exception/traceback properly now (in threadtarget()).
>
> One of the reasons I asked about the warnings was in case that was
> showing an issue, previously it silently failed.
>
> I can believe there are other issues hiding in here :/.

So the failure was due to an exception being thrown at:

  elif self.readsock.fileno() == event[1]:

when my config hadn't triggered its opening. A bit of defensive coding
and it works. See:

  Subject: [PATCH] qemurunner: more cleanups for output blocking
  Date: Wed, 20 Dec 2023 16:14:50 +0000
  Message-Id: <20231220161450.1422484-1-alex.bennee@linaro.org>

>
> Cheers,
>
> Richard
Richard Purdie Dec. 20, 2023, 4:52 p.m. UTC | #8
On Wed, 2023-12-20 at 16:28 +0000, Alex Bennée wrote:
> Richard Purdie <richard.purdie@linuxfoundation.org> writes:
> 
> > On Wed, 2023-12-20 at 09:53 +0000, Alex Bennée wrote:
> > > Richard Purdie <richard.purdie@linuxfoundation.org> writes:
> > > 
> > > > On Tue, 2023-12-19 at 16:44 +0000, Alex Bennée wrote:
> > > > > Richard Purdie <richard.purdie@linuxfoundation.org> writes:
> > > > > Yeah I'm only seeing A10/A11:
> > > > > 
> > > > >   o regression_1010.5 Invalid memory access 5
> > > > >     regression_1010.5 OK
> > > > >   o regression_1010.6 Invalid memory access 6
> > > > > 
> > > > >   WARNING: here A10
> > > > >   WARNING: here A11
> <snip>
> > > > 
> > > > How long does it take for the test data to "back up" and timeout? The
> > > > logging thread should take a maximum of 2s to notice data not being
> > > > read/handled on the serial port.
> > > 
> > > I don't think it goes around the logging loop more than once. I only
> > > see:
> > > 
> > >   grep "got" build/tmp_trs-qemuarm64/work/trs_qemuarm64-trs-linux/trs-image/1.0/temp/log.do_testimage
> > >   DEBUG: got [(12, 1)]
> > > 
> > > from the poll.
> > > 
> <snip>
> > > I'll keep digging.
> > 
> > I've merged the patches into master since they are helping the stdout
> > issue and helped a number of issues others were seeing.
> > 
> > In the final version of the patch I did fix an issue where the
> > exception handling from the thread was not working so well, it does at
> > least print a, exception/traceback properly now (in threadtarget()).
> > 
> > One of the reasons I asked about the warnings was in case that was
> > showing an issue, previously it silently failed.
> > 
> > I can believe there are other issues hiding in here :/.
> 
> So the failure was due to an exception being thrown at:
> 
>   elif self.readsock.fileno() == event[1]:
> 
> when my config hadn't triggered its opening. A bit of defensive coding
> and it works. See:
> 
>   Subject: [PATCH] qemurunner: more cleanups for output blocking
>   Date: Wed, 20 Dec 2023 16:14:50 +0000
>   Message-Id: <20231220161450.1422484-1-alex.bennee@linaro.org>

Glad it was something relatively simple! The fix looks good to me.

Cheers,

Richard
diff mbox series

Patch

diff --git a/meta/lib/oeqa/utils/qemurunner.py b/meta/lib/oeqa/utils/qemurunner.py
index 36bdfb96ae6..16aa1c08d3e 100644
--- a/meta/lib/oeqa/utils/qemurunner.py
+++ b/meta/lib/oeqa/utils/qemurunner.py
@@ -21,6 +21,7 @@  import threading
 import codecs
 import tempfile
 from collections import defaultdict
+from contextlib import contextmanager
 import importlib
 
 # Get Unicode non printable control chars
@@ -517,8 +518,12 @@  class QemuRunner:
                 except Exception as e:
                     self.logger.warning('Extra log data exception %s' % repr(e))
                     data = None
+            self.thread.serial_lock.release()
             return False
 
+        with self.thread.serial_lock:
+            self.thread.set_serialsock(self.server_socket)
+
         # If we are not able to login the tests can continue
         try:
             (status, output) = self.run_serial(self.boot_patterns['send_login_user'], raw=True, timeout=120)
@@ -653,31 +658,32 @@  class QemuRunner:
 
         data = ''
         status = 0
-        self.server_socket.sendall(command.encode('utf-8'))
-        start = time.time()
-        end = start + timeout
-        while True:
-            now = time.time()
-            if now >= end:
-                data += "<<< run_serial(): command timed out after %d seconds without output >>>\r\n\r\n" % timeout
-                break
-            try:
-                sread, _, _ = select.select([self.server_socket],[],[], end - now)
-            except InterruptedError:
-                continue
-            if sread:
-                # try to avoid reading single character at a time
-                time.sleep(0.1)
-                answer = self.server_socket.recv(1024)
-                if answer:
-                    data += answer.decode('utf-8')
-                    # Search the prompt to stop
-                    if re.search(self.boot_patterns['search_cmd_finished'], data):
-                        break
-                else:
-                    if self.canexit:
-                        return (1, "")
-                    raise Exception("No data on serial console socket, connection closed?")
+        with self.thread.serial_lock:
+            self.server_socket.sendall(command.encode('utf-8'))
+            start = time.time()
+            end = start + timeout
+            while True:
+                now = time.time()
+                if now >= end:
+                    data += "<<< run_serial(): command timed out after %d seconds without output >>>\r\n\r\n" % timeout
+                    break
+                try:
+                    sread, _, _ = select.select([self.server_socket],[],[], end - now)
+                except InterruptedError:
+                    continue
+                if sread:
+                    # try to avoid reading single character at a time
+                    time.sleep(0.1)
+                    answer = self.server_socket.recv(1024)
+                    if answer:
+                        data += answer.decode('utf-8')
+                        # Search the prompt to stop
+                        if re.search(self.boot_patterns['search_cmd_finished'], data):
+                            break
+                    else:
+                        if self.canexit:
+                            return (1, "")
+                        raise Exception("No data on serial console socket, connection closed?")
 
         if data:
             if raw:
@@ -696,6 +702,15 @@  class QemuRunner:
                     status = 1
         return (status, str(data))
 
+@contextmanager
+def nonblocking_lock(lock):
+    locked = lock.acquire(False)
+    try:
+        yield locked
+    finally:
+        if locked:
+            lock.release()
+
 # This class is for reading data from a socket and passing it to logfunc
 # to be processed. It's completely event driven and has a straightforward
 # event loop. The mechanism for stopping the thread is a simple pipe which
@@ -703,8 +718,10 @@  class QemuRunner:
 class LoggingThread(threading.Thread):
     def __init__(self, logfunc, sock, logger, qemuoutput):
         self.connection_established = threading.Event()
+        self.serial_lock = threading.Lock()
 
         self.serversock = sock
+        self.serialsock = None
         self.qemuoutput = qemuoutput
         self.logfunc = logfunc
         self.logger = logger
@@ -717,9 +734,14 @@  class LoggingThread(threading.Thread):
 
         threading.Thread.__init__(self, target=self.threadtarget)
 
+    def set_serialsock(self, serialsock):
+        self.serialsock = serialsock
+
     def threadtarget(self):
         try:
             self.eventloop()
+        except Exception as e:
+            self.logger.warning("Exception %s in logging thread" % repr(e))
         finally:
             self.teardown()
 
@@ -753,6 +775,7 @@  class LoggingThread(threading.Thread):
         event_read_mask = self.errorevents | self.readevents
         if self.serversock:
             poll.register(self.serversock.fileno())
+        serial_registered = False
         poll.register(self.qemuoutput.fileno())
         poll.register(self.readpipe, event_read_mask)
 
@@ -760,7 +783,7 @@  class LoggingThread(threading.Thread):
         self.running = True
         self.logger.debug("Starting thread event loop")
         while not breakout:
-            events = poll.poll()
+            events = poll.poll(2)
             for event in events:
                 # An error occurred, bail out
                 if event[1] & self.errorevents:
@@ -785,18 +808,34 @@  class LoggingThread(threading.Thread):
 
                 # Actual data to be logged
                 elif self.readsock.fileno() == event[0]:
-                    data = self.recv(1024)
+                    data = self.recv(1024, self.readsock)
                     self.logfunc(data)
                 elif self.qemuoutput.fileno() == event[0]:
                     data = getOutput(self.qemuoutput)
                     self.logger.debug("Data received on qemu stdout %s" % data)
                     self.logfunc(data, ".stdout")
+                elif self.serialsock and self.serialsock.fileno() == event[0]:
+                    if self.serial_lock.acquire(blocking=False):
+                        data = self.recv(1024, self.serialsock)
+                        self.logger.debug("Data received serial thread %s" % data.decode('utf-8', 'replace'))
+                        self.logfunc(data, ".2")
+                        self.serial_lock.release()
+                    else:
+                        serial_registered = False
+                        poll.unregister(self.serialsock.fileno())
+
+            if not serial_registered and self.serialsock:
+                with nonblocking_lock(self.serial_lock) as l:
+                    if l:
+                        serial_registered = True
+                        poll.register(self.serialsock.fileno(), event_read_mask)
+
 
     # Since the socket is non-blocking make sure to honor EAGAIN
     # and EWOULDBLOCK.
-    def recv(self, count):
+    def recv(self, count, sock):
         try:
-            data = self.readsock.recv(count)
+            data = sock.recv(count)
         except socket.error as e:
             if e.errno == errno.EAGAIN or e.errno == errno.EWOULDBLOCK:
                 return b''