
[v3,0/6] scripts/resulttool/regression: add metadata filtering

Message ID 20230224164555.67634-1-alexis.lothore@bootlin.com
Series scripts/resulttool/regression: add metadata filtering

Message

Alexis Lothoré Feb. 24, 2023, 4:45 p.m. UTC
From: Alexis Lothoré <alexis.lothore@bootlin.com>

Hello,
this new series is the follow-up to [1], aiming to make regression reports more
meaningful by reducing noise and false positives.

Changes since v2:
- add filtering on the MACHINE field from the test results configuration: the
  MACHINE should always match
- add a "metadata guessing" mechanism based on Richard's proposal ([2]). Until
  this series is merged, test results stored in git are not enriched with
  OESELFTEST_METADATA. To allow proper test comparison even for those tests, try
  to guess which oe-selftest command line was used to run the corresponding
  tests, and generate OESELFTEST_METADATA accordingly (an illustrative sketch
  follows this list)
- add a new tool to ease test results usage: yocto_testresults_query. For now
  the tool only manages regression reports and is a thin layer between
  send-qa-email (in yocto-autobuilder-helper) and resulttool. Its main role is
  to translate regression report arguments (which are tags or branches) to
  fixed revisions and to call resulttool accordingly. Most of its code is moved
  from send-qa-email (another series for the autobuilder will follow this one
  to make send-qa-email use this new helper, but this current series works
  independently)
  Example: "yocto_testresults_query.py regression-report 4.2_M1 4.2_M2" will
  replay the regression report that was generated when 4.2_M2 was built.
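
For illustration, here is a minimal sketch of what the guessing step could look
like. The helper name, the heuristics and the OESELFTEST_ARGS field below are
assumptions made for this cover letter, not necessarily what the series
implements:

    # Hypothetical sketch: derive an OESELFTEST_METADATA block for results that
    # were stored before this series, based on which test ids are present.
    def guess_oeselftest_metadata(result_ids):
        # Assumed mapping, for illustration only.
        if any(i.startswith("reproducible") for i in result_ids):
            return {"TEST_TYPE": "oeselftest",
                    "OESELFTEST_ARGS": ["-r", "reproducible"]}
        if result_ids:
            return {"TEST_TYPE": "oeselftest", "OESELFTEST_ARGS": []}
        return None

    def enrich_if_needed(target):
        # Enrich a result entry's configuration (same layout as used by
        # can_be_compared() in resulttool/regression.py) when it predates
        # OESELFTEST_METADATA.
        config = target['configuration']
        if 'OESELFTEST_METADATA' not in config:
            guess = guess_oeselftest_metadata(target['result'].keys())
            if guess:
                config['OESELFTEST_METADATA'] = guess
        return target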

Changes since v1:
- properly configure "From" field in series

With those improvements, the regression report is significantly reduced and
some useful data starts to emerge from the removed noise:
- with the MACHINE filtering, the 4.2_M2 report goes from 5.5GB to 627MB
- with the OESELFTEST_METADATA enrichment + metadata guessing for older tests,
  the report goes from 627MB to 1.5MB

After manual inspection of some entries, the remaining oe-selftest regressions
raised in the report seem valid. There are still some issues to tackle:
- it seems that one major remaining source of noise is now the "runtime" tests
  (comparison against tests not run in the "target" results)
- when a ptest managed by oe-selftest fails, I guess the remaining tests are
  not run, so when one failure is logged we get many "PASSED->None" transitions
  in the regression report; we should probably silence those
- some transitions appear as regressions while they are in fact improvements
  (e.g. "UNRESOLVED->PASSED"); a possible filter is sketched after this list
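
As an illustration of the last point, here is a hedged sketch of how
improvement-like transitions could be filtered out of the regression list. The
helper below is only an assumption for discussion, not code from this series:

    # Assumed helper, for discussion only: treat a transition whose new status
    # is PASSED as an improvement rather than a regression,
    # e.g. "UNRESOLVED" -> "PASSED".
    def is_improvement(base_status, target_status):
        return target_status == "PASSED" and base_status != "PASSED"

    def should_report(base_status, target_status):
        if base_status == target_status:
            return False
        return not is_improvement(base_status, target_status)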

[1] https://lore.kernel.org/openembedded-core/20230214165309.63527-1-alexis.lothore@bootlin.com/
[2] https://lore.kernel.org/openembedded-core/124b9c9667b038b8502f6457ba7d894fc4ef3c58.camel@linuxfoundation.org/

Alexis Lothoré (6):
  scripts/oe-selftest: append metadata to tests results
  scripts/resulttool/regression: remove unused import
  scripts/resulttool/regression: add metadata filtering for oeselftest
  oeqa/selftest/resulttool: add test for metadata filtering on
    regression
  scripts: add new helper for regression report generation
  oeqa/selftest: add test for yocto_testresults_query.py

 .../oeqa/selftest/cases/resulttooltests.py    | 137 +++++++++++++++
 .../cases/yoctotestresultsquerytests.py       |  39 +++++
 meta/lib/oeqa/selftest/context.py             |  15 +-
 scripts/lib/resulttool/regression.py          | 163 +++++++++++++++++-
 scripts/yocto_testresults_query.py            | 106 ++++++++++++
 5 files changed, 458 insertions(+), 2 deletions(-)
 create mode 100644 meta/lib/oeqa/selftest/cases/yoctotestresultsquerytests.py
 create mode 100755 scripts/yocto_testresults_query.py

Comments

Richard Purdie Feb. 24, 2023, 6:06 p.m. UTC | #1
Hi Alexis,

Firstly, this looks very much improved, thanks. It is great to start to
see some meaningful data from this.

On Fri, 2023-02-24 at 17:45 +0100, Alexis Lothoré via
lists.openembedded.org wrote:
> From: Alexis Lothoré <alexis.lothore@bootlin.com>
> 
> Hello,
> this new series is the follow-up of [1] to make regression reports more
> meaningful, by reducing noise and false positives.
> 
> Change since v2:
> - add filtering on MACHINE field from test results configuration: the MACHINE
>   should always match
> - add "metadata guessing" mechanism based on Richard proposal ([2]). Up to the
>   point where this series will be merged, tests results stored in git are not
>   enriched with OESELFTEST_METADATA. To allow proper test comparison even with
>   those tests, try to guess what oeselftest command line has been used to run
>   the corresponding tests, and generate OESELFTEST_METADATA accordingly
> - add new tool to ease test results usage: yocto_testresults_query. For now the
>   tool only manages regression report and is a thin layer between send-qa-email
>   (in yocto-autobuilder-helper) and resulttool. Its main role is to translate
>   regression reports arguments (which are tags or branches) to fixed revisions
>   and to call resulttool accordingly. Most of its code is a transfer from
>   send-qa-email (another series for the autobuilder will follow this one to make
>   send-qa-email use this new helper, but this current series works
>   independently)
>   Example: "yocto_testresults_query.py regression-report 4.2_M1 4.2_M2" will
>   replay the regression report generated when the 4.2_M2 has been generated.
> 
> Change since v1:
> - properly configure "From" field in series
> 
> With those improvements, the regression report is significantly reduced and some
> useful data start to emerge from the removed noise:
> - with the MACHINE filtering, the 4.2_M2 report goes from 5.5GB to 627MB
> - with the OESELFTEST_METADATA enrichment + metadata guessing for older tests,
>   the report goes from 627MB to 1.5MB

That is just a bit more readable!

> 
> After manual inspection on some entries, the remaining oeselftest regression
> raised in the report seems valid. There are still some issues to tackle:
> - it seems that now one major remaining source of noise is on the "runtime"
>   tests (comparison to tests not run on "target" results)
> - when a ptest managed by oe-selftest fails, I guess the remaining tests are not
>   run, so when 1 failure is logged, we have many "PASSED->None" transitions in
>   regression report, we should probably silence it.
> - some transitions appear as regression while those are in fact improvements
>   (e.g: "UNRESOLVED->PASSED")

I had a quick play. Firstly, if I try "yocto_testresults_query.py
regression-report 4.2_M1 4.2_M2" in an openembedded-core repository
instead of poky, it breaks. That isn't surprising, but we should either
make it work or show a sensible error.

I also took a look at the report and wondered why the matching isn't quite
right and why we have these "regressions". If we could remove that
noise, I think we'd get down to the real issues. I ended up doing:

resulttool report --commit 4d19594b8bdacde6d809d3f2a25cff7c5a42295e  . > /tmp/repa
resulttool report --commit 5e249ec855517765f4b99e8039cb888ffa09c211  . > /tmp/repb
meld /tmp/rep*

which was interesting as it gave lots of warnings like:

"Warning duplicate ptest result 'acl.test/cp.test' for qemuarm64"

so it looks like we had a couple of different test runs for the qemuarm64
ptests, which is confusing your new code. I suspect this happened due to
some autobuilder glitch during the release build which restarted some
of the build pieces. I'm not sure how to handle that yet; I'll give it some
further thought, but I wanted to share what I think is the source of
some of the issues. Basically we need to get the regression report
looking more like that meld output!

Cheers,

Richard
Richard Purdie Feb. 25, 2023, 9:15 a.m. UTC | #2
On Fri, 2023-02-24 at 18:06 +0000, Richard Purdie via
lists.openembedded.org wrote:
> Hi Alexis,
> 
> Firstly, this looks very much improved, thanks. It is great to start to
> see some meaningful data from this.
> 
> On Fri, 2023-02-24 at 17:45 +0100, Alexis Lothoré via
> lists.openembedded.org wrote:
> > From: Alexis Lothoré <alexis.lothore@bootlin.com>
> > 
> > Hello,
> > this new series is the follow-up of [1] to make regression reports more
> > meaningful, by reducing noise and false positives.
> > 
> > Change since v2:
> > - add filtering on MACHINE field from test results configuration: the MACHINE
> >   should always match
> > - add "metadata guessing" mechanism based on Richard proposal ([2]). Up to the
> >   point where this series will be merged, tests results stored in git are not
> >   enriched with OESELFTEST_METADATA. To allow proper test comparison even with
> >   those tests, try to guess what oeselftest command line has been used to run
> >   the corresponding tests, and generate OESELFTEST_METADATA accordingly
> > - add new tool to ease test results usage: yocto_testresults_query. For now the
> >   tool only manages regression report and is a thin layer between send-qa-email
> >   (in yocto-autobuilder-helper) and resulttool. Its main role is to translate
> >   regression reports arguments (which are tags or branches) to fixed revisions
> >   and to call resulttool accordingly. Most of its code is a transfer from
> >   send-qa-email (another series for the autobuilder will follow this one to make
> >   send-qa-email use this new helper, but this current series works
> >   independently)
> >   Example: "yocto_testresults_query.py regression-report 4.2_M1 4.2_M2" will
> >   replay the regression report generated when the 4.2_M2 has been generated.
> > 
> > Change since v1:
> > - properly configure "From" field in series
> > 
> > With those improvements, the regression report is significantly reduced and some
> > useful data start to emerge from the removed noise:
> > - with the MACHINE filtering, the 4.2_M2 report goes from 5.5GB to 627MB
> > - with the OESELFTEST_METADATA enrichment + metadata guessing for older tests,
> >   the report goes from 627MB to 1.5MB
> 
> That is just a bit more readable!
> 
> > 
> > After manual inspection on some entries, the remaining oeselftest regression
> > raised in the report seems valid. There are still some issues to tackle:
> > - it seems that now one major remaining source of noise is on the "runtime"
> >   tests (comparison to tests not run on "target" results)
> > - when a ptest managed by oe-selftest fails, I guess the remaining tests are not
> >   run, so when 1 failure is logged, we have many "PASSED->None" transitions in
> >   regression report, we should probably silence it.
> > - some transitions appear as regression while those are in fact improvements
> >   (e.g: "UNRESOLVED->PASSED")
> 
> I had quick play. Firstly, if I try "yocto_testresults_query.py
> regression-report 4.2_M1 4.2_M2" in an openembedded-core repository
> instead of poky, it breaks. That isn't surprising but we should either
> make it work or show a sensible error.
> 
> I also took a look the report and wondered why the matching isn't quite
> right and why we have these "regressions". If we could remove that
> noise, I think we'd get down to the real issues. I ended up doing:
> 
> resulttool report --commit 4d19594b8bdacde6d809d3f2a25cff7c5a42295e  . > /tmp/repa
> resulttool report --commit 5e249ec855517765f4b99e8039cb888ffa09c211  . > /tmp/repb
> meld /tmp/rep*
> 
> which was interesting as gave lots of warnings like:
> 
> "Warning duplicate ptest result 'acl.test/cp.test' for qemuarm64"
> 
> so it looks like we had a couple of different test runs for qemuarm64
> ptests which is confusing your new code. I suspect this happened due to
> some autobuilder glitch during the release build which restarted some
> of the build pieces. Not sure how to handle that yet, I'll give it some
> further thought but I wanted to share what I think is the source of
> some of the issues. Basically we need to get the regression report
> looking more like that meld output!

I was wrong about the duplication; that isn't the issue, or at least I
found some other more pressing ones. For the ltp issue, I found an easy
fix:

diff --git a/scripts/lib/resulttool/regression.py b/scripts/lib/resulttool/regression.py
index 1b0c8335a39..9d7c35942a6 100644
--- a/scripts/lib/resulttool/regression.py
+++ b/scripts/lib/resulttool/regression.py
@@ -146,6 +146,7 @@ def can_be_compared(logger, base, target):
     run with different tests sets or parameters. Return true if tests can be
     compared
     """
+    ret = True
     base_configuration = base['configuration']
     target_configuration = target['configuration']
 
@@ -165,7 +166,10 @@ def can_be_compared(logger, base, target):
             logger.debug(f"Enriching {target_configuration['STARTTIME']} with {guess}")
             target_configuration['OESELFTEST_METADATA'] = guess
 
-    return metadata_matches(base_configuration, target_configuration) \
+    if base_configuration.get('TEST_TYPE') == 'runtime' and any(result.startswith("ltpresult") for result in base['result']):
+        ret = target_configuration.get('TEST_TYPE') == 'runtime' and any(result.startswith("ltpresult") for result in target['result'])
+
+    return ret and metadata_matches(base_configuration, target_configuration) \
         and machine_matches(base_configuration, target_configuration)
 
 
i.e. only compare ltp to ltp. The issue is that we don't use a special image
name for the ltp test runs; we just extend a standard one, so it was
comparing ltp to non-ltp.

We should also perhaps consider a clause in there which only compares
runs containing ptests against other runs containing ptests? Our test matrix
won't trigger that, but other usage might in future and it is a safe check.
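
For reference, such a clause could mirror the ltpresult check in the diff
above; this is only a sketch of the suggestion, not code from the series:

    # Sketch of the suggested additional check: only compare a runtime run
    # containing ptest results against another runtime run that also has them.
    # The result could be and-ed into the return value of can_be_compared().
    def ptest_sets_match(base, target):
        def has_ptests(entry):
            return entry['configuration'].get('TEST_TYPE') == 'runtime' and \
                any(result.startswith("ptestresult") for result in entry['result'])
        if not has_ptests(base):
            return True
        return has_ptests(target)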

A lot of the rest of the noise is poor test naming for ptests, e.g.:

ptestresult.lttng-tools.ust/buffers-pid/test_buffers_pid_10_-_Create_session_buffers-pid_in_-o_/tmp/tmp.XXXXXXXXXXrs_pid_trace_path.XTnDY5

which has a random string at the end. I'm wondering if we should pre-
filter ptest result names and truncate a known list of them at the "-"
(lttng-tools, babeltrace, babeltrace2). Curl could also be truncated at
the ",":

ptestresult.curl.test_0010__10_out_of_1506,_remaining:_06:44,_took_1.075s,_duration:_00:02_

We can adjust the ptest generation code to do this at source (we should
perhaps file a bug for that for the four above?) but that won't fix the
older results so we'll probably need some filtering in the code too.

There is something more going on with the ptest results too; I don't
understand why quilt/python3 changed, but I suspect we just have to go
through the issues step by step now.

I did look into the:

ptestresult.glibc-user.debug/tst-fortify-c-default-1

'regression' and it is because the test was renamed in the new glibc. I
was therefore thinking a summary of added/removed tests would be useful,
but only in these cases. Something along the lines of: if only tests were
added, just summarise "X added" and call it a match; if tests were removed
and added, list them, show a count summary (X removed, Y added) and call
it a regression.
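
Roughly, and only as a sketch of that idea (the helper below is an assumption
for this discussion, not code from the series):

    # Hypothetical summary of added/removed test ids between two runs.
    def summarise_renames(base_ids, target_ids):
        added = sorted(set(target_ids) - set(base_ids))
        removed = sorted(set(base_ids) - set(target_ids))
        if added and not removed:
            return "match", f"{len(added)} new tests added"
        if removed:
            detail = ", ".join(removed + added)
            return "regression", f"{len(removed)} removed, {len(added)} added: {detail}"
        return "match", ""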

I think I might be tempted to merge this series and then we can change
the code to improve from here as this is clearly a vast improvement on
where we were! Improvements can be incremental on top of these changes.

Cheers,

Richard
Richard Purdie Feb. 25, 2023, 12:32 p.m. UTC | #3
On Sat, 2023-02-25 at 09:15 +0000, Richard Purdie via
lists.openembedded.org wrote:
> On Fri, 2023-02-24 at 18:06 +0000, Richard Purdie via
> lists.openembedded.org wrote:
> > Hi Alexis,
> > 
> > Firstly, this looks very much improved, thanks. It is great to start to
> > see some meaningful data from this.
> > 
> > On Fri, 2023-02-24 at 17:45 +0100, Alexis Lothoré via
> > lists.openembedded.org wrote:
> > > From: Alexis Lothoré <alexis.lothore@bootlin.com>
> > > 
> > > Hello,
> > > this new series is the follow-up of [1] to make regression reports more
> > > meaningful, by reducing noise and false positives.
> > > 
> > > Change since v2:
> > > - add filtering on MACHINE field from test results configuration: the MACHINE
> > >   should always match
> > > - add "metadata guessing" mechanism based on Richard proposal ([2]). Up to the
> > >   point where this series will be merged, tests results stored in git are not
> > >   enriched with OESELFTEST_METADATA. To allow proper test comparison even with
> > >   those tests, try to guess what oeselftest command line has been used to run
> > >   the corresponding tests, and generate OESELFTEST_METADATA accordingly
> > > - add new tool to ease test results usage: yocto_testresults_query. For now the
> > >   tool only manages regression report and is a thin layer between send-qa-email
> > >   (in yocto-autobuilder-helper) and resulttool. Its main role is to translate
> > >   regression reports arguments (which are tags or branches) to fixed revisions
> > >   and to call resulttool accordingly. Most of its code is a transfer from
> > >   send-qa-email (another series for the autobuilder will follow this one to make
> > >   send-qa-email use this new helper, but this current series works
> > >   independently)
> > >   Example: "yocto_testresults_query.py regression-report 4.2_M1 4.2_M2" will
> > >   replay the regression report generated when the 4.2_M2 has been generated.
> > > 
> > > Change since v1:
> > > - properly configure "From" field in series
> > > 
> > > With those improvements, the regression report is significantly reduced and some
> > > useful data start to emerge from the removed noise:
> > > - with the MACHINE filtering, the 4.2_M2 report goes from 5.5GB to 627MB
> > > - with the OESELFTEST_METADATA enrichment + metadata guessing for older tests,
> > >   the report goes from 627MB to 1.5MB
> > 
> > That is just a bit more readable!
> > 
> > > 
> > > After manual inspection on some entries, the remaining oeselftest regression
> > > raised in the report seems valid. There are still some issues to tackle:
> > > - it seems that now one major remaining source of noise is on the "runtime"
> > >   tests (comparison to tests not run on "target" results)
> > > - when a ptest managed by oe-selftest fails, I guess the remaining tests are not
> > >   run, so when 1 failure is logged, we have many "PASSED->None" transitions in
> > >   regression report, we should probably silence it.
> > > - some transitions appear as regression while those are in fact improvements
> > >   (e.g: "UNRESOLVED->PASSED")
> > 
> > I had quick play. Firstly, if I try "yocto_testresults_query.py
> > regression-report 4.2_M1 4.2_M2" in an openembedded-core repository
> > instead of poky, it breaks. That isn't surprising but we should either
> > make it work or show a sensible error.
> > 
> > I also took a look the report and wondered why the matching isn't quite
> > right and why we have these "regressions". If we could remove that
> > noise, I think we'd get down to the real issues. I ended up doing:
> > 
> > resulttool report --commit 4d19594b8bdacde6d809d3f2a25cff7c5a42295e  . > /tmp/repa
> > resulttool report --commit 5e249ec855517765f4b99e8039cb888ffa09c211  . > /tmp/repb
> > meld /tmp/rep*
> > 
> > which was interesting as gave lots of warnings like:
> > 
> > "Warning duplicate ptest result 'acl.test/cp.test' for qemuarm64"
> > 
> > so it looks like we had a couple of different test runs for qemuarm64
> > ptests which is confusing your new code. I suspect this happened due to
> > some autobuilder glitch during the release build which restarted some
> > of the build pieces. Not sure how to handle that yet, I'll give it some
> > further thought but I wanted to share what I think is the source of
> > some of the issues. Basically we need to get the regression report
> > looking more like that meld output!
> 
> I was wrong about the duplication, that isn't the issue, or at least I
> found some other more pressing ones. For the ltp issue, I found an easy
> fix:
> 
> diff --git a/scripts/lib/resulttool/regression.py b/scripts/lib/resulttool/regression.py
> index 1b0c8335a39..9d7c35942a6 100644
> --- a/scripts/lib/resulttool/regression.py
> +++ b/scripts/lib/resulttool/regression.py
> @@ -146,6 +146,7 @@ def can_be_compared(logger, base, target):
>      run with different tests sets or parameters. Return true if tests can be
>      compared
>      """
> +    ret = True
>      base_configuration = base['configuration']
>      target_configuration = target['configuration']
>  
> @@ -165,7 +166,10 @@ def can_be_compared(logger, base, target):
>              logger.debug(f"Enriching {target_configuration['STARTTIME']} with {guess}")
>              target_configuration['OESELFTEST_METADATA'] = guess
>  
> -    return metadata_matches(base_configuration, target_configuration) \
> +    if base_configuration.get('TEST_TYPE') == 'runtime' and any(result.startswith("ltpresult") for result in base['result']):
> +        ret = target_configuration.get('TEST_TYPE') == 'runtime' and any(result.startswith("ltpresult") for result in target['result'])
> +
> +    return ret and metadata_matches(base_configuration, target_configuration) \
>          and machine_matches(base_configuration, target_configuration)
>  
>  
> i.e. only compare ltp to ltp. The issue is we don't use a special image
> name for the ltp test runs, we just extend a standard one so it was
> comparing ltp to non-ltp.
> 
> We should also perhaps consider a clause in there which only compares
> runs with ptests with other runs with ptests? Our test matrix won't
> trigger that but other usage might in future and it is a safe check?
> 
> A lot of the rest of the noise is poor test naming for ptests, e.g.:
> 
> ptestresult.lttng-tools.ust/buffers-pid/test_buffers_pid_10_-_Create_session_buffers-pid_in_-o_/tmp/tmp.XXXXXXXXXXrs_pid_trace_path.XTnDY5
> 
> which has a random string at the end. I'm wondering if we should pre-
> filter ptest result names and truncate a known list of them at the "-"
> (lttng-tools, babeltrace, babeltrace2). Curl could also be truncated at
> the ",":
> 
> ptestresult.curl.test_0010__10_out_of_1506,_remaining:_06:44,_took_1.075s,_duration:_00:02_
> 
> We can adjust the ptest generation code to do this at source (we should
> perhaps file a bug for that for the four above?) but that won't fix the
> older results so we'll probably need some filtering in the code too.
> 
> There is something more going on with the ptest results too, I don't
> understand why quilt/python3 changed but I suspect we just have to go
> through the issues step by step now.
> 
> I did look into the:
> 
> ptestresult.glibc-user.debug/tst-fortify-c-default-1
> 
> 'regression' and it is because the test was renamed in the new glibc. I
> was therefore thinking a summary of added/removed would be useful in
> but only in these cases. Something along the line of if only tests
> added, just summarise X new added and call it a match. If tests removed
> and added, list and show a count summary (X removed, Y added) and call
> it a regression.
> 
> I think I might be tempted to merge this series and then we can change
> the code to improve from here as this is clearly a vast improvement on
> where we were! Improvements can be incremental on top of these changes.

This goes a long way to shrinking the report even further. It looks like
the curl test reporting needs some work as the IDs appear to change, but
this at least makes the issue clearer and the real deltas are becoming
much easier to see outside the noise.

diff --git a/scripts/lib/resulttool/regression.py b/scripts/lib/resulttool/regression.py
index 1b0c8335a39..0d8948f012f 100644
--- a/scripts/lib/resulttool/regression.py
+++ b/scripts/lib/resulttool/regression.py
@@ -243,6 +247,21 @@ def regression_common(args, logger, base_results, target_results):
 
     return 0
 
+def fixup_ptest_names(results, logger):
+    for r in results:
+        for i in results[r]:
+            tests = list(results[r][i]['result'].keys())
+            for test in tests:
+                new = None
+                if test.startswith(("ptestresult.lttng-tools.", "ptestresult.babeltrace.", "ptestresult.babeltrace2")) and "_-_" in test:
+                    new = test.split("_-_")[0]
+                elif test.startswith(("ptestresult.curl.")) and "__" in test:
+                    new = test.split("__")[0]
+                if new:
+                    results[r][i]['result'][new] = results[r][i]['result'][test]
+                    del results[r][i]['result'][test]
+
+
 def regression_git(args, logger):
     base_results = {}
     target_results = {}
@@ -304,6 +323,9 @@ def regression_git(args, logger):
     base_results = resultutils.git_get_result(repo, revs[index1][2])
     target_results = resultutils.git_get_result(repo, revs[index2][2])
 
+    fixup_ptest_names(base_results, logger)
+    fixup_ptest_names(target_results, logger)
+
     regression_common(args, logger, base_results, target_results)
 
     return 0
Richard Purdie Feb. 25, 2023, 12:44 p.m. UTC | #4
I'll try and stop poking at this, but it is all rather interesting and I
think we have spotted our first nasty regression. The quilt ptests really
did stop running properly and reporting test results!

Looking at a recent master report:

https://autobuilder.yocto.io/pub/non-release/20230224-14/testresults/testresult-report.txt

you can see the quilt ptest count is still zero as it was in M2 but not
in M1.

I'm thinking Ross might have been responsible with:

https://git.yoctoproject.org/poky/commit/?id=61bb4d8e75dfaaf980c32fbe992d34f794b7c537

!

The reason there are python3 and python3-cryptography changes is that
there are 35,000 python3 ptests and about 250 changes, so that isn't
really unexpected for what was probably a version change. I'd guess
python3-cryptography made changes to their test suite. Overall the pass
counts increased and there were no failures, so those overall counts may
be good to have in the report too.

We might want to log the version of the recipe in the data somewhere so
we can show if this was a python3 version change.

Cheers,

Richard
Alexis Lothoré Feb. 25, 2023, 3:59 p.m. UTC | #5
Hello Richard,
as usual, thanks for the prompt feedback!

On 2/25/23 13:32, Richard Purdie wrote:
> On Sat, 2023-02-25 at 09:15 +0000, Richard Purdie via
> lists.openembedded.org wrote:
>> On Fri, 2023-02-24 at 18:06 +0000, Richard Purdie via
>> lists.openembedded.org wrote:
>>> Hi Alexis,
>>>
>>> Firstly, this looks very much improved, thanks. It is great to start to
>>> see some meaningful data from this.
>>>
>>> On Fri, 2023-02-24 at 17:45 +0100, Alexis Lothoré via
>>> lists.openembedded.org wrote:
>>>> After manual inspection on some entries, the remaining oeselftest regression
>>>> raised in the report seems valid. There are still some issues to tackle:
>>>> - it seems that now one major remaining source of noise is on the "runtime"
>>>>   tests (comparison to tests not run on "target" results)
>>>> - when a ptest managed by oe-selftest fails, I guess the remaining tests are not
>>>>   run, so when 1 failure is logged, we have many "PASSED->None" transitions in
>>>>   regression report, we should probably silence it.
>>>> - some transitions appear as regression while those are in fact improvements
>>>>   (e.g: "UNRESOLVED->PASSED")
>>>
>>> I had quick play. Firstly, if I try "yocto_testresults_query.py
>>> regression-report 4.2_M1 4.2_M2" in an openembedded-core repository
>>> instead of poky, it breaks. That isn't surprising but we should either
>>> make it work or show a sensible error.

Oh right, I am working in a Poky build configuration, so I had assumed that
this would be the only use case.
Since the test results commits are tightly coupled to revisions in poky (not
oecore), I plan to merely log an error about the revision not being found (and
suggest that the user check that the repository is poky and not oecore).
But please let me know if I am missing a major use case here and a smarter
fallback (shallow-cloning poky if we are running in oecore?) is needed.
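
For the record, a minimal sketch of what that error path could look like; the
function and the message wording are assumptions, not the submitted code:

    # Hypothetical error handling in yocto_testresults_query.py: resolve a
    # tag/branch to a revision and exit with a human-readable error if it
    # cannot be found (e.g. when running from openembedded-core, not poky).
    import subprocess
    import sys

    def resolve_revision(ref):
        try:
            return subprocess.check_output(
                ["git", "rev-list", "-n", "1", ref],
                stderr=subprocess.DEVNULL, text=True).strip()
        except subprocess.CalledProcessError:
            sys.exit(f"Could not find a revision for '{ref}'. Please check that "
                     "this repository is poky and not openembedded-core.")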


>> I think I might be tempted to merge this series and then we can change
>> the code to improve from here as this is clearly a vast improvement on
>> where we were! Improvements can be incremental on top of these changes.

I am in favor of this :) If it is OK for you, I will just re-submit a series
with the fix for proper error logging when running the tool from oecore rather
than poky.

Next we could introduce all the improvements you have suggested, but I feel
that, with the quickly growing number of "hotfixes" needed to support issues
with older test results, and for the sake of maintainability of resulttool and
its submodules, those specific hotfixes should be properly isolated (and
documented), for example in a "regression_quirks.py" or something like that (a
purely illustrative sketch follows). What do you think?

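Purely as an illustration of what I mean (names are assumptions, not existing
code):

    # Sketch of a possible regression_quirks.py: keep result "hotfixes" for
    # older test results in one documented place, applied in a known order.
    QUIRKS = [
        # (description, fixup callable), e.g.:
        # ("truncate random suffixes in lttng-tools/babeltrace/curl ptest ids "
        #  "for results recorded before the ptest naming fix", fixup_ptest_names),
    ]

    def apply_quirks(results, logger):
        for description, fixup in QUIRKS:
            logger.debug("Applying quirk: %s", description)
            fixup(results, logger)
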
Alexis
Richard Purdie Feb. 26, 2023, 12:15 p.m. UTC | #6
On Sat, 2023-02-25 at 16:59 +0100, Alexis Lothoré wrote:
> Hello Richard,
> as usual, thanks for the prompt feedback !
> 
> On 2/25/23 13:32, Richard Purdie wrote:
> > On Sat, 2023-02-25 at 09:15 +0000, Richard Purdie via
> > lists.openembedded.org wrote:
> > > On Fri, 2023-02-24 at 18:06 +0000, Richard Purdie via
> > > lists.openembedded.org wrote:
> > > > Hi Alexis,
> > > > 
> > > > Firstly, this looks very much improved, thanks. It is great to start to
> > > > see some meaningful data from this.
> > > > 
> > > > On Fri, 2023-02-24 at 17:45 +0100, Alexis Lothoré via
> > > > lists.openembedded.org wrote:
> > > > > After manual inspection on some entries, the remaining oeselftest regression
> > > > > raised in the report seems valid. There are still some issues to tackle:
> > > > > - it seems that now one major remaining source of noise is on the "runtime"
> > > > >   tests (comparison to tests not run on "target" results)
> > > > > - when a ptest managed by oe-selftest fails, I guess the remaining tests are not
> > > > >   run, so when 1 failure is logged, we have many "PASSED->None" transitions in
> > > > >   regression report, we should probably silence it.
> > > > > - some transitions appear as regression while those are in fact improvements
> > > > >   (e.g: "UNRESOLVED->PASSED")
> > > > 
> > > > I had quick play. Firstly, if I try "yocto_testresults_query.py
> > > > regression-report 4.2_M1 4.2_M2" in an openembedded-core repository
> > > > instead of poky, it breaks. That isn't surprising but we should either
> > > > make it work or show a sensible error.
> 
> Oh right, I am working in a Poky build configuration, so I have assumed that this
> would be the unique use case.
> Since the test results commits are tightly coupled to revisions in poky (so not
> oecore), I plan to merely log an error about not found revision (and suggesting
> the user to check that the repository is poky and not oecore).
> But please let me know if I miss a major use case here and that a smarter
> fallback plan (shallow-clone poky if we are running in oecore ?) is needed

I'm happy for it to just give a human-readable error; someone can
add this functionality if they need/want it.

> > > I think I might be tempted to merge this series and then we can change
> > > the code to improve from here as this is clearly a vast improvement on
> > > where we were! Improvements can be incremental on top of these changes.
> 
> I am in favor of this :) If it is OK for you, I will just re-submit a series with
> the fix for the proper error logging when running the tool from oecore and not poky.
> 
> Next we could introduce all the suggestions you have suggested, but I feel that
> with the quick increase of "hotfixes" count to support issues with older test
> results, and for the sake of maintainability of resulttool and its submodules,
> those specific hotfixes need to be properly isolated (and documented), like in a
> "regression_quirks.py" or something like that. What do you think ?

I'm hoping we don't have many of these quirks. We have a huge history
at this point so it would be sad if the tool can't work with it. From
what I've seen so far, we can manage with the code in the regression
module itself. I've tried to add some comments.

I wondered what to do with this series since I needed to get M3 built.
Since this series was available and mostly usable, it would be better
to have a nicer report this time; it is a good test of the code.

In the end I've merged most of it, along with my two tweaks to handle
LTP and the bigger ptest results issue. I couldn't take one set of the
selftests since they simply don't work. This will give us a useful
real-world test of the M3 report.

I'm working on the assumption you'll send a follow-up series with the
tests, the oe-core check and some of the other issues I've mentioned in
other emails?

Cheers,

Richard
Alexis Lothoré Feb. 26, 2023, 3:42 p.m. UTC | #7
Hello Richard,
On 2/26/23 13:15, Richard Purdie wrote:
> On Sat, 2023-02-25 at 16:59 +0100, Alexis Lothoré wrote:
>> Hello Richard,
>> as usual, thanks for the prompt feedback !
>>
>> On 2/25/23 13:32, Richard Purdie wrote:
>>> On Sat, 2023-02-25 at 09:15 +0000, Richard Purdie via
>>> lists.openembedded.org wrote:
>>>> On Fri, 2023-02-24 at 18:06 +0000, Richard Purdie via
>>>> lists.openembedded.org wrote:
>>>>> Hi Alexis,
>>>>>
>>>>> Firstly, this looks very much improved, thanks. It is great to start to
>>>>> see some meaningful data from this.
>>>>>
>>>>> On Fri, 2023-02-24 at 17:45 +0100, Alexis Lothoré via
>>>>> lists.openembedded.org wrote:
>>>>>> After manual inspection on some entries, the remaining oeselftest regression
>>>>>> raised in the report seems valid. There are still some issues to tackle:
>>>>>> - it seems that now one major remaining source of noise is on the "runtime"
>>>>>>   tests (comparison to tests not run on "target" results)
>>>>>> - when a ptest managed by oe-selftest fails, I guess the remaining tests are not
>>>>>>   run, so when 1 failure is logged, we have many "PASSED->None" transitions in
>>>>>>   regression report, we should probably silence it.
>>>>>> - some transitions appear as regression while those are in fact improvements
>>>>>>   (e.g: "UNRESOLVED->PASSED")
>>>>>
>>>>> I had quick play. Firstly, if I try "yocto_testresults_query.py
>>>>> regression-report 4.2_M1 4.2_M2" in an openembedded-core repository
>>>>> instead of poky, it breaks. That isn't surprising but we should either
>>>>> make it work or show a sensible error.
>>
>> Oh right, I am working in a Poky build configuration, so I have assumed that this
>> would be the unique use case.
>> Since the test results commits are tightly coupled to revisions in poky (so not
>> oecore), I plan to merely log an error about not found revision (and suggesting
>> the user to check that the repository is poky and not oecore).
>> But please let me know if I miss a major use case here and that a smarter
>> fallback plan (shallow-clone poky if we are running in oecore ?) is needed
> 
> I'm happy to for it just to give an human readable error, someone can
> add this functionality if they need/want it.

ACK

>>>> I think I might be tempted to merge this series and then we can change
>>>> the code to improve from here as this is clearly a vast improvement on
>>>> where we were! Improvements can be incremental on top of these changes.
>>
>> I am in favor of this :) If it is OK for you, I will just re-submit a series with
>> the fix for the proper error logging when running the tool from oecore and not poky.
>>
>> Next we could introduce all the suggestions you have suggested, but I feel that
>> with the quick increase of "hotfixes" count to support issues with older test
>> results, and for the sake of maintainability of resulttool and its submodules,
>> those specific hotfixes need to be properly isolated (and documented), like in a
>> "regression_quirks.py" or something like that. What do you think ?
> 
> I'm hoping we don't have many of these quirks. We have a huge history
> at this point so it would be sad if the tool can't work with it. From
> what I've seen so far, we can manage with the code in the regression
> module itself. I've tried to add some comments.
> 
> I wondered what to do with this series since I needed to get M3 built.
> Since this series was available and mostly usable, it would be better
> to have a nicer report this time, it is a good test of the code.
> 
> In the end I've merged most of it, along with my two tweaks to handle
> LTP and the bigger ptest results issue. I couldn't take one set of the
> selftests since they simply don't work. This will give us a useful
> realworld test of the M3 report.

Ok, great. For the selftests failing: my bad, adding proper logging was one of
those "one last change before sending" tweaks, and obviously I forgot to re-run
the tests before sending.
> 
> I'm working on the assumption you'll send a follow up series with the
> tests, the oe-core check and some of the other issues I've mentioned in
> other emails?

Absolutely. Besides the tests, the oe-core check and the improvements mentioned
in this mail thread, the next thing I have in mind is fixing the report
generation against the "master-next" branches you mentioned a few weeks ago.

Regards,
Ross Burton Feb. 27, 2023, 1:14 p.m. UTC | #8
On 25 Feb 2023, at 12:44, Richard Purdie via lists.openembedded.org <richard.purdie=linuxfoundation.org@lists.openembedded.org> wrote:
> 
> I'll try and stop poking at this but it is all rather interesting and I
> think we have spotted our first nasty regression. The quilt ptests did
> really stop running properly and reporting test results!
> 
> Looking at a recent master report:
> 
> https://autobuilder.yocto.io/pub/non-release/20230224-14/testresults/testresult-report.txt
> 
> you can see the quilt ptest count is still zero as it was in M2 but not
> in M1.
> 
> I'm thinking Ross might have been responsible with:
> 
> https://git.yoctoproject.org/poky/commit/?id=61bb4d8e75dfaaf980c32fbe992d34f794b7c537
> 
> !

Grumble.  Yes, I was.

Fixes sent.

Cheers,
Ross
Richard Purdie Feb. 27, 2023, 1:41 p.m. UTC | #9
On Sun, 2023-02-26 at 16:42 +0100, Alexis Lothoré wrote:
> > I'm hoping we don't have many of these quirks. We have a huge history
> > at this point so it would be sad if the tool can't work with it. From
> > what I've seen so far, we can manage with the code in the regression
> > module itself. I've tried to add some comments.
> > 
> > I wondered what to do with this series since I needed to get M3 built.
> > Since this series was available and mostly usable, it would be better
> > to have a nicer report this time, it is a good test of the code.
> > 
> > In the end I've merged most of it, along with my two tweaks to handle
> > LTP and the bigger ptest results issue. I couldn't take one set of the
> > selftests since they simply don't work. This will give us a useful
> > realworld test of the M3 report.
> 
> Ok, great. For the selftest failing, my bad, adding proper logging was one of
> those "one last change before sending", and obviously I did forget to re-run the
> tests before sending.
> > 
> > I'm working on the assumption you'll send a follow up series with the
> > tests, the oe-core check and some of the other issues I've mentioned in
> > other emails?
> 
> Absolutely. Besides the tests, oecore check and the improvements mentioned in
> this mail thread, the next thing I was keeping in mind is fixing the report
> generation against "master-next" branches you have mentioned a few weeks ago.


FWIW, here is the result from M3 rc1:

https://autobuilder.yocto.io/pub/releases/yocto-4.2_M3.rc1/testresults/testresult-regressions-report.txt

It looks like a couple of runs were mismatched, so the ltp issue isn't
quite resolved with my tweak. All in all it is definitely a lot more
readable than it was though, so progress :)

Looks like dbus may need the same fix as curl.

Cheers,

Richard