diff mbox series

[5/5] qemurunner: Show both the login console log and all logging upon failure

Message ID 20231004123823.2990916-5-richard.purdie@linuxfoundation.org
State Accepted, archived
Commit 605938b3bb19dcf6c7218648b4d16df9eaa675fc
Headers show
Series [1/5] wic: Add console parameters to qemux86 canned-wks | expand

Commit Message

Richard Purdie Oct. 4, 2023, 12:38 p.m. UTC
It is unclear when things fail which output was on which serial port. Improve the output
to show the last lines of both data to improve debugging.

Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
---
 meta/lib/oeqa/utils/qemurunner.py | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

Comments

Mikko Rapeli Oct. 4, 2023, 1:07 p.m. UTC | #1
Hi,

Acked-by: Mikko Rapeli <mikko.rapeli@linaro.org>

for the full series. I hope our discussion over #yocto irc channel
helped resolve these and the test failures seen with master-next.

Cheers,

-Mikko
Richard Purdie Oct. 4, 2023, 1:29 p.m. UTC | #2
On Wed, 2023-10-04 at 16:07 +0300, Mikko Rapeli wrote:
> Hi,
> 
> Acked-by: Mikko Rapeli <mikko.rapeli@linaro.org>
> 
> for the full series. I hope our discussion over #yocto irc channel
> helped resolve these and the test failures seen with master-next.

Yes, thanks! It was useful to have someone to talk the issues through
with and much appreciated.

Sadly I don't think we've found all the issues yet.

On the build N-1 we have a selftest failure on debian12-ty-1:

https://autobuilder.yoctoproject.org/typhoon/#/builders/80/builds/5803/steps/14/logs/stdio

with these logs saved:

https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/d/qemurunner_log.20231004022851
https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/d/qemu_boot_log.20231004022851

which suggest that ttyS1 never got any output but a getty did appear on
ttyS0. I confirmed there were only two serial devices in that
configuration. The exact image downloaded off that worker and run
locally worked so it isn't the image, it has to be some kind of race.
The image does have gettys configured for both serial ports. The kernel
boot log isn't present which a patch in this series attempts to fix.

There was also:

https://autobuilder.yoctoproject.org/typhoon/#/builders/82/builds/5528/steps/13/logs/stdio

which looks again like a getty didn't show up on a serial port. We
don't have more logs for that one but "Running postinst /etc/rpm-
postinsts/100-sysvinit-inittab" worries Ross and I a bit. It may or may
not be related.

That brings me to:

https://autobuilder.yoctoproject.org/typhoon/#/builders/127/builds/2199

which has two failures, I think both in commands run over the serial
interface. These ones appear to be more reproducible on ARM.

I'm trying a new build run with this series applied to see if that
helps shed any light on what is going on.

Cheers,

Richard
Mikko Rapeli Oct. 4, 2023, 7:34 p.m. UTC | #3
Hi,

On Wed, Oct 04, 2023 at 02:29:40PM +0100, Richard Purdie wrote:
> On Wed, 2023-10-04 at 16:07 +0300, Mikko Rapeli wrote:
> > Hi,
> > 
> > Acked-by: Mikko Rapeli <mikko.rapeli@linaro.org>
> > 
> > for the full series. I hope our discussion over #yocto irc channel
> > helped resolve these and the test failures seen with master-next.
> 
> Yes, thanks! It was useful to have someone to talk the issues through
> with and much appreciated.
> 
> Sadly I don't think we've found all the issues yet.
> 
> On the build N-1 we have a selftest failure on debian12-ty-1:
>
> https://autobuilder.yoctoproject.org/typhoon/#/builders/80/builds/5803/steps/14/logs/stdio

Can I get from this test run stdio to the more detailed task, qemurunner or qemu boot logs somehow?
Or do you have to dig into the system manually to capture them, like in the example below?

I would like to check into the detailed logs from both passing and failing test runs.
Of course kernel dmesg would need to be there so maybe after your patch to the runqemu
side at least.

If there is a regression in kernel to the serial device detection and numbering,
then we need some way to see e.g. 8250 uart/tty0/ttyAMA0 related dmesg messages.

Cheers,

-Mikko

> with these logs saved:
> 
> https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/d/qemurunner_log.20231004022851
> https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/d/qemu_boot_log.20231004022851
> 
> which suggest that ttyS1 never got any output but a getty did appear on
> ttyS0. I confirmed there were only two serial devices in that
> configuration. The exact image downloaded off that worker and run
> locally worked so it isn't the image, it has to be some kind of race.
> The image does have gettys configured for both serial ports. The kernel
> boot log isn't present which a patch in this series attempts to fix.
> 
> There was also:
> 
> https://autobuilder.yoctoproject.org/typhoon/#/builders/82/builds/5528/steps/13/logs/stdio
> 
> which looks again like a getty didn't show up on a serial port. We
> don't have more logs for that one but "Running postinst /etc/rpm-
> postinsts/100-sysvinit-inittab" worries Ross and I a bit. It may or may
> not be related.
> 
> That brings me to:
> 
> https://autobuilder.yoctoproject.org/typhoon/#/builders/127/builds/2199
> 
> which has two failures, I think both in commands run over the serial
> interface. These ones appear to be more reproducible on ARM.
> 
> I'm trying a new build run with this series applied to see if that
> helps shed any light on what is going on.
> 
> Cheers,
> 
> Richard
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
>
Richard Purdie Oct. 4, 2023, 8:35 p.m. UTC | #4
On Wed, 2023-10-04 at 22:34 +0300, Mikko Rapeli wrote:
> Hi,
> 
> On Wed, Oct 04, 2023 at 02:29:40PM +0100, Richard Purdie wrote:
> > On Wed, 2023-10-04 at 16:07 +0300, Mikko Rapeli wrote:
> > > Hi,
> > > 
> > > Acked-by: Mikko Rapeli <mikko.rapeli@linaro.org>
> > > 
> > > for the full series. I hope our discussion over #yocto irc channel
> > > helped resolve these and the test failures seen with master-next.
> > 
> > Yes, thanks! It was useful to have someone to talk the issues through
> > with and much appreciated.
> > 
> > Sadly I don't think we've found all the issues yet.
> > 
> > On the build N-1 we have a selftest failure on debian12-ty-1:
> > 
> > https://autobuilder.yoctoproject.org/typhoon/#/builders/80/builds/5803/steps/14/logs/stdio
> 
> Can I get from this test run stdio to the more detailed task, qemurunner or qemu boot logs somehow?
> Or do you have to dig into the system manually to capture them, like in the example below?
> 
> I would like to check into the detailed logs from both passing and failing test runs.
> Of course kernel dmesg would need to be there so maybe after your patch to the runqemu
> side at least.
> 
> If there is a regression in kernel to the serial device detection and numbering,
> then we need some way to see e.g. 8250 uart/tty0/ttyAMA0 related dmesg messages.

I can't capture everything in time in each case but I did grab:

https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/e/

which is from

https://autobuilder.yoctoproject.org/typhoon/#/builders/76/builds/7863/steps/21/logs/stdio

which has this threads patches applied.

I've put two sets of logs there from the same builds, one success, one
is missing the getty on one log.


I also have the logs from:

https://autobuilder.yoctoproject.org/typhoon/#/builders/86/builds/5848/steps/14/logs/stdio

which I've saved here:

https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/f/

noting that: 

https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/f/qemu_boot_log.20231004134427

is 3 bytes in size.

Also interestingly, 

https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/f/qemu_boot_log.20231004140326

contains:

"bootlogd: writing to console: Input/output error"

which appears to come from sysvinit:

sysvinit/3.04/sysvinit-3.04/src/bootlogd.c:

/*
 *      We got a write error on the real console. If its an EIO,
 *      somebody hung up our filedescriptor, so try to re-open it.
 */
int write_err(int pts, int realfd, char *realcons, int e)
{
        int     fd;

        if (e != EIO) {
werr:
                close(pts);
                fprintf(stderr, "bootlogd: writing to console: %s\n",
                        strerror(e));
                return -1;
        }
        close(realfd);
        if ((fd = open_nb(realcons)) < 0)
                goto werr;

        return fd;
}

and the werr goto must be triggering. I think we've seen these for a
while but it is interesting something is closing console fds.

Cheers,

Richard
Mikko Rapeli Oct. 5, 2023, 3:27 p.m. UTC | #5
Hi,

On Wed, Oct 04, 2023 at 09:35:20PM +0100, Richard Purdie wrote:
> On Wed, 2023-10-04 at 22:34 +0300, Mikko Rapeli wrote:
> > Hi,
> > 
> > On Wed, Oct 04, 2023 at 02:29:40PM +0100, Richard Purdie wrote:
> > > On Wed, 2023-10-04 at 16:07 +0300, Mikko Rapeli wrote:
> > > > Hi,
> > > > 
> > > > Acked-by: Mikko Rapeli <mikko.rapeli@linaro.org>
> > > > 
> > > > for the full series. I hope our discussion over #yocto irc channel
> > > > helped resolve these and the test failures seen with master-next.
> > > 
> > > Yes, thanks! It was useful to have someone to talk the issues through
> > > with and much appreciated.
> > > 
> > > Sadly I don't think we've found all the issues yet.
> > > 
> > > On the build N-1 we have a selftest failure on debian12-ty-1:
> > > 
> > > https://autobuilder.yoctoproject.org/typhoon/#/builders/80/builds/5803/steps/14/logs/stdio
> > 
> > Can I get from this test run stdio to the more detailed task, qemurunner or qemu boot logs somehow?
> > Or do you have to dig into the system manually to capture them, like in the example below?
> > 
> > I would like to check into the detailed logs from both passing and failing test runs.
> > Of course kernel dmesg would need to be there so maybe after your patch to the runqemu
> > side at least.
> > 
> > If there is a regression in kernel to the serial device detection and numbering,
> > then we need some way to see e.g. 8250 uart/tty0/ttyAMA0 related dmesg messages.
> 
> I can't capture everything in time in each case but I did grab:
> 
> https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/e/
> 
> which is from
> 
> https://autobuilder.yoctoproject.org/typhoon/#/builders/76/builds/7863/steps/21/logs/stdio
> 
> which has this threads patches applied.
> 
> I've put two sets of logs there from the same builds, one success, one
> is missing the getty on one log.
> 
> 
> I also have the logs from:
> 
> https://autobuilder.yoctoproject.org/typhoon/#/builders/86/builds/5848/steps/14/logs/stdio
> 
> which I've saved here:
> 
> https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/f/
> 
> noting that:�
> 
> https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/f/qemu_boot_log.20231004134427
> 
> is 3 bytes in size.

All this is very confusing indeed. I sent another version of the detect login prompt without
utf-8 conversion, with some luck that could help. At least on some of the logs the login prompt
is there but with lots of garbage around it. Serial port core got refactored in 6.5 and diff to
6.4 is quite large so there could be bugs there, and those could be triggering new
errors visible to userspace too.

Cheers,

-Mikko
Richard Purdie Oct. 5, 2023, 3:47 p.m. UTC | #6
On Thu, 2023-10-05 at 18:27 +0300, Mikko Rapeli wrote:
> Hi,
> 
> On Wed, Oct 04, 2023 at 09:35:20PM +0100, Richard Purdie wrote:
> > On Wed, 2023-10-04 at 22:34 +0300, Mikko Rapeli wrote:
> > > Hi,
> > > 
> > > On Wed, Oct 04, 2023 at 02:29:40PM +0100, Richard Purdie wrote:
> > > > On Wed, 2023-10-04 at 16:07 +0300, Mikko Rapeli wrote:
> > > > > Hi,
> > > > > 
> > > > > Acked-by: Mikko Rapeli <mikko.rapeli@linaro.org>
> > > > > 
> > > > > for the full series. I hope our discussion over #yocto irc channel
> > > > > helped resolve these and the test failures seen with master-next.
> > > > 
> > > > Yes, thanks! It was useful to have someone to talk the issues through
> > > > with and much appreciated.
> > > > 
> > > > Sadly I don't think we've found all the issues yet.
> > > > 
> > > > On the build N-1 we have a selftest failure on debian12-ty-1:
> > > > 
> > > > https://autobuilder.yoctoproject.org/typhoon/#/builders/80/builds/5803/steps/14/logs/stdio
> > > 
> > > Can I get from this test run stdio to the more detailed task, qemurunner or qemu boot logs somehow?
> > > Or do you have to dig into the system manually to capture them, like in the example below?
> > > 
> > > I would like to check into the detailed logs from both passing and failing test runs.
> > > Of course kernel dmesg would need to be there so maybe after your patch to the runqemu
> > > side at least.
> > > 
> > > If there is a regression in kernel to the serial device detection and numbering,
> > > then we need some way to see e.g. 8250 uart/tty0/ttyAMA0 related dmesg messages.
> > 
> > I can't capture everything in time in each case but I did grab:
> > 
> > https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/e/
> > 
> > which is from
> > 
> > https://autobuilder.yoctoproject.org/typhoon/#/builders/76/builds/7863/steps/21/logs/stdio
> > 
> > which has this threads patches applied.
> > 
> > I've put two sets of logs there from the same builds, one success, one
> > is missing the getty on one log.
> > 
> > 
> > I also have the logs from:
> > 
> > https://autobuilder.yoctoproject.org/typhoon/#/builders/86/builds/5848/steps/14/logs/stdio
> > 
> > which I've saved here:
> > 
> > https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/f/
> > 
> > noting that: 
> > 
> > https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/f/qemu_boot_log.20231004134427
> > 
> > is 3 bytes in size.
> 
> All this is very confusing indeed. I sent another version of the detect login prompt without
> utf-8 conversion, with some luck that could help. At least on some of the logs the login prompt
> is there but with lots of garbage around it. Serial port core got refactored in 6.5 and diff to
> 6.4 is quite large so there could be bugs there, and those could be triggering new
> errors visible to userspace too.

With the changes I've made to the logging and the raw log files we have
without any utf-8 conversion, I'm now 99% certain this isn't any utf
conversion issue since the login prompts are not in the raw binary
files and those are written with no conversion. It is much more likely
the getty isn't running or isn't communicating somehow.

I'm trying to work out how to add some further debug/testing when this
fails in a way it will help us debug.

Cheers,

Richard
Mikko Rapeli Oct. 6, 2023, 7:51 a.m. UTC | #7
Hi,

On Thu, Oct 05, 2023 at 04:47:16PM +0100, Richard Purdie wrote:
> On Thu, 2023-10-05 at 18:27 +0300, Mikko Rapeli wrote:
> > Hi,
> > 
> > On Wed, Oct 04, 2023 at 09:35:20PM +0100, Richard Purdie wrote:
> > > On Wed, 2023-10-04 at 22:34 +0300, Mikko Rapeli wrote:
> > > > Hi,
> > > > 
> > > > On Wed, Oct 04, 2023 at 02:29:40PM +0100, Richard Purdie wrote:
> > > > > On Wed, 2023-10-04 at 16:07 +0300, Mikko Rapeli wrote:
> > > > > > Hi,
> > > > > > 
> > > > > > Acked-by: Mikko Rapeli <mikko.rapeli@linaro.org>
> > > > > > 
> > > > > > for the full series. I hope our discussion over #yocto irc channel
> > > > > > helped resolve these and the test failures seen with master-next.
> > > > > 
> > > > > Yes, thanks! It was useful to have someone to talk the issues through
> > > > > with and much appreciated.
> > > > > 
> > > > > Sadly I don't think we've found all the issues yet.
> > > > > 
> > > > > On the build N-1 we have a selftest failure on debian12-ty-1:
> > > > > 
> > > > > https://autobuilder.yoctoproject.org/typhoon/#/builders/80/builds/5803/steps/14/logs/stdio
> > > > 
> > > > Can I get from this test run stdio to the more detailed task, qemurunner or qemu boot logs somehow?
> > > > Or do you have to dig into the system manually to capture them, like in the example below?
> > > > 
> > > > I would like to check into the detailed logs from both passing and failing test runs.
> > > > Of course kernel dmesg would need to be there so maybe after your patch to the runqemu
> > > > side at least.
> > > > 
> > > > If there is a regression in kernel to the serial device detection and numbering,
> > > > then we need some way to see e.g. 8250 uart/tty0/ttyAMA0 related dmesg messages.
> > > 
> > > I can't capture everything in time in each case but I did grab:
> > > 
> > > https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/e/
> > > 
> > > which is from
> > > 
> > > https://autobuilder.yoctoproject.org/typhoon/#/builders/76/builds/7863/steps/21/logs/stdio
> > > 
> > > which has this threads patches applied.
> > > 
> > > I've put two sets of logs there from the same builds, one success, one
> > > is missing the getty on one log.
> > > 
> > > 
> > > I also have the logs from:
> > > 
> > > https://autobuilder.yoctoproject.org/typhoon/#/builders/86/builds/5848/steps/14/logs/stdio
> > > 
> > > which I've saved here:
> > > 
> > > https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/f/
> > > 
> > > noting that:�
> > > 
> > > https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/f/qemu_boot_log.20231004134427
> > > 
> > > is 3 bytes in size.
> > 
> > All this is very confusing indeed. I sent another version of the detect login prompt without
> > utf-8 conversion, with some luck that could help. At least on some of the logs the login prompt
> > is there but with lots of garbage around it. Serial port core got refactored in 6.5 and diff to
> > 6.4 is quite large so there could be bugs there, and those could be triggering new
> > errors visible to userspace too.
> 
> With the changes I've made to the logging and the raw log files we have
> without any utf-8 conversion, I'm now 99% certain this isn't any utf
> conversion issue since the login prompts are not in the raw binary
> files and those are written with no conversion. It is much more likely
> the getty isn't running or isn't communicating somehow.
> 
> I'm trying to work out how to add some further debug/testing when this
> fails in a way it will help us debug.

Alright, I was confused by logs from working login detection. I presume
qemu is configured with two serial ports to avoid mixing kernel messages with
the login prompt. It would be easy to grep for the login prompt also from the first
serial output though it would not explain the root cause for these issues
but could allow testing to move forward to the ssh connection. In all of the logs
I saw, the first serial port had a login prompt.

The serial ports are not used in testing, AFAIK. Oh now there are the debug
commands in master-next. Nice, I was hoping for this kind of data in logs when
things go horribly wrong.

Cheers,

-Mikko
diff mbox series

Patch

diff --git a/meta/lib/oeqa/utils/qemurunner.py b/meta/lib/oeqa/utils/qemurunner.py
index ea867a2934f..9aa2e7b3bea 100644
--- a/meta/lib/oeqa/utils/qemurunner.py
+++ b/meta/lib/oeqa/utils/qemurunner.py
@@ -503,9 +503,8 @@  class QemuRunner:
                                   (self.boottime, time.strftime("%D %H:%M:%S")))
             tail = lambda l: "\n".join(l.splitlines()[-25:])
             bootlog = self.decode_qemulog(bootlog)
-            # in case bootlog is empty, use tail qemu log store at self.msg
-            lines = tail(bootlog if bootlog else self.msg)
-            self.logger.warning("Last 25 lines of text (%d):\n%s" % (len(bootlog), lines))
+            self.logger.warning("Last 25 lines of login console (%d):\n%s" % (len(bootlog), tail(bootlog)))
+            self.logger.warning("Last 25 lines of all logging (%d):\n%s" % (len(self.msg), tail(self.msg)))
             self.logger.warning("Check full boot log: %s" % self.logfile)
             self.stop()
             return False