diff mbox series

[RFC,3/3] testimage: implement test artifacts retriever for failing tests

Message ID 20230602095037.97981-4-alexis.lothore@bootlin.com
State New
Headers show
Series add failed test artifacts retriever | expand

Commit Message

Alexis Lothoré June 2, 2023, 9:50 a.m. UTC
Add a basic artifacts retrievers in testimage class which:
- triggers when at least one runtime test fails
- reads a list of files to retrieve from ARTIFACTS_LIST_PATH
- retrieve those files over scp thanks to existing ssh class
- store those files in an "artifacts" directory in "tmp/log/oeqa/<image>"

Bring partial solution to [YOCTO #14901]

Signed-off-by: Alexis Lothoré <alexis.lothore@bootlin.com>
---
 meta/classes-recipe/testimage.bbclass | 51 +++++++++++++++++++++++++++
 1 file changed, 51 insertions(+)

Comments

Alexander Kanavin June 2, 2023, 11:46 a.m. UTC | #1
On Fri, 2 Jun 2023 at 11:50, Alexis Lothoré via lists.openembedded.org
<alexis.lothore=bootlin.com@lists.openembedded.org> wrote:
> +# When any test fails, if path to an artifacts configuration (listing
> +# files/directories to retrieve on target) has been provided, all listed
> +# elements will be downloaded from target before shutting it down
> +
> +ARTIFACTS_LIST_PATH ??= ""

I think this should be simplified further by simply specifying the
default list directly instead:

TESTIMAGE_FAILED_QA_ARTIFACTS ?= "${libdir}/*/ptest/ /var/log/ /etc/version"

Then, if desired, the list can be overridden globally, or per image as
needed  -  from the image recipes and/or regular bitbake configuration
mechanisms used by CI (local.conf etc). If the feature needs to be
switched off altogether, that can be done by setting an empty list.

Alex
Mikko Rapeli June 2, 2023, 11:52 a.m. UTC | #2
Hi,

On Fri, Jun 02, 2023 at 01:46:03PM +0200, Alexander Kanavin wrote:
> On Fri, 2 Jun 2023 at 11:50, Alexis Lothor� via lists.openembedded.org
> <alexis.lothore=bootlin.com@lists.openembedded.org> wrote:
> > +# When any test fails, if path to an artifacts configuration (listing
> > +# files/directories to retrieve on target) has been provided, all listed
> > +# elements will be downloaded from target before shutting it down
> > +
> > +ARTIFACTS_LIST_PATH ??= ""
> 
> I think this should be simplified further by simply specifying the
> default list directly instead:
> 
> TESTIMAGE_FAILED_QA_ARTIFACTS ?= "${libdir}/*/ptest/ /var/log/ /etc/version"

Try fetching /etc/os-release too, please.

I like this patch, thanks!

Cheers,

-Mikko
Alexis Lothoré June 2, 2023, 12:15 p.m. UTC | #3
Hello Alexander, Mikko,

On 6/2/23 13:52, Mikko Rapeli wrote:
> Hi,
> 
> On Fri, Jun 02, 2023 at 01:46:03PM +0200, Alexander Kanavin wrote:
>> On Fri, 2 Jun 2023 at 11:50, Alexis Lothoré via lists.openembedded.org
>> <alexis.lothore=bootlin.com@lists.openembedded.org> wrote:
>>> +# When any test fails, if path to an artifacts configuration (listing
>>> +# files/directories to retrieve on target) has been provided, all listed
>>> +# elements will be downloaded from target before shutting it down
>>> +
>>> +ARTIFACTS_LIST_PATH ??= ""
>>
>> I think this should be simplified further by simply specifying the
>> default list directly instead:
>>
>> TESTIMAGE_FAILED_QA_ARTIFACTS ?= "${libdir}/*/ptest/ /var/log/ /etc/version"

I wondered about this possibility while adding this variable, and at this time I
thought it would encode to much "CI logic" in testimage class. But since you
raise the suggestion, I can update to pass directly the file list, which could
indeed be overridden by CI tests runners. We can start with those three paths in
the default value.

Another point raised by your example is the lack of wildcards management in
artifacts path, which would clearly ease the listing, especially for ptests.
I'll see if I can come up with something better for this
> 
> Try fetching /etc/os-release too, please.

ACK, will add it to default list
> 
> I like this patch, thanks!
> 
> Cheers,
> 
> -Mikko
> 
> 
> 
> -=-=-=-=-=-=-=-=-=-=-=-
> Links: You receive all messages sent to this group.
> View/Reply Online (#182307): https://lists.openembedded.org/g/openembedded-core/message/182307
> Mute This Topic: https://lists.openembedded.org/mt/99283034/7394887
> Group Owner: openembedded-core+owner@lists.openembedded.org
> Unsubscribe: https://lists.openembedded.org/g/openembedded-core/unsub [alexis.lothore@bootlin.com]
> -=-=-=-=-=-=-=-=-=-=-=-
>
Alexander Kanavin June 2, 2023, 12:21 p.m. UTC | #4
On Fri, 2 Jun 2023 at 14:14, Alexis Lothoré <alexis.lothore@bootlin.com> wrote:
> I wondered about this possibility while adding this variable, and at this time I
> thought it would encode to much "CI logic" in testimage class. But since you
> raise the suggestion, I can update to pass directly the file list, which could
> indeed be overridden by CI tests runners. We can start with those three paths in
> the default value.

But this isn't CI logic. It's bitbake's QA logic, which can and should
be available in local builds. Running testimage locally is a supported
and regularly used operation, so it should work exactly same as in CI.

Alex
Mikko Rapeli June 2, 2023, 1:07 p.m. UTC | #5
Hi,

These changes are an improvement, but based on my experience in product test automation,
instead of collecting logs after testing is completed, it is better to capture logs
from the target device while tests are being executed. Serial console, systemd journal
etc logs can be captured as streams from the device under test (DUT), and time stamps added
from the test controller so that aligning events like test command execution and log messages
is possible. It is also a good idea to capture core dumps via this kind of stream(s).
Problems with HW and BSP SW may not be visible after testing completes (device has
rebooted, reset) or capturing data is not possible at all (due to system, kernel, userspace hang,
oom or too much load).

This capturing of logs could be implemented by adding some configurable variables to execute
commands on the test controller and inside oeqa environment at certain stages of testing, for example
after serial console login prompt has been detected. Command to execute could be a simple
"ssh -c 'journalctl -b -a'" to capture boot logs and "ssh 'journalctl -a -f'" and log the output data with
additional time stamps to bitbake task output or a separate file.

Just thinking out loud here, these changes are an improvement over current situation already.
Thanks for sending these patches!

Cheers,

-Mikko
Alexis Lothoré June 7, 2023, 7:36 a.m. UTC | #6
Hi Mikko, sorry for late reply, and thanks for the additional feedback

On 6/2/23 15:07, Mikko Rapeli wrote:
> Hi,
> 
> These changes are an improvement, but based on my experience in product test automation,
> instead of collecting logs after testing is completed, it is better to capture logs
> from the target device while tests are being executed. Serial console, systemd journal
> etc logs can be captured as streams from the device under test (DUT), and time stamps added
> from the test controller so that aligning events like test command execution and log messages
> is possible. It is also a good idea to capture core dumps via this kind of stream(s).
> Problems with HW and BSP SW may not be visible after testing completes (device has
> rebooted, reset) or capturing data is not possible at all (due to system, kernel, userspace hang,
> oom or too much load).
> 
> This capturing of logs could be implemented by adding some configurable variables to execute
> commands on the test controller and inside oeqa environment at certain stages of testing, for example
> after serial console login prompt has been detected. Command to execute could be a simple
> "ssh -c 'journalctl -b -a'" to capture boot logs and "ssh 'journalctl -a -f'" and log the output data with
> additional time stamps to bitbake task output or a separate file.
> 
> Just thinking out loud here, these changes are an improvement over current situation already.
> Thanks for sending these patches!

While I tend to agree with what you are suggesting, I feel like there are some
new constraints, because of targets variety: is the system a real target or a
qemu image ? Does it run systemd/journald ? Are coredumps enabled by default ?
Of course those constraints may be quite easy to circumvent, moreover we can
think of a "best effort" solution which gathers "what it can" from running
target. Also, there may be some corner cases that would need some specific
handling: some tests can be quite long (hours), what about broken pipes during
those ? This would definitely need some re-connection strategies for example.

Since current proposed solution is not very invasive/has not much requirements
except a valid ssh access, I plan to update the series and let some real testing
show if it is relevant. If not (or not enough), we may give a try to a "runtime"
version like the one you are suggesting ?

Thanks,
Alexis
> 
> Cheers,
> 
> -Mikko
> 
> 
> 
> -=-=-=-=-=-=-=-=-=-=-=-
> Links: You receive all messages sent to this group.
> View/Reply Online (#182316): https://lists.openembedded.org/g/openembedded-core/message/182316
> Mute This Topic: https://lists.openembedded.org/mt/99283034/7394887
> Group Owner: openembedded-core+owner@lists.openembedded.org
> Unsubscribe: https://lists.openembedded.org/g/openembedded-core/unsub [alexis.lothore@bootlin.com]
> -=-=-=-=-=-=-=-=-=-=-=-
>
Mikko Rapeli June 7, 2023, 7:46 a.m. UTC | #7
Hi,

On Wed, Jun 07, 2023 at 09:36:07AM +0200, Alexis Lothor� wrote:
> Hi Mikko, sorry for late reply, and thanks for the additional feedback
> 
> On 6/2/23 15:07, Mikko Rapeli wrote:
> > Hi,
> > 
> > These changes are an improvement, but based on my experience in product test automation,
> > instead of collecting logs after testing is completed, it is better to capture logs
> > from the target device while tests are being executed. Serial console, systemd journal
> > etc logs can be captured as streams from the device under test (DUT), and time stamps added
> > from the test controller so that aligning events like test command execution and log messages
> > is possible. It is also a good idea to capture core dumps via this kind of stream(s).
> > Problems with HW and BSP SW may not be visible after testing completes (device has
> > rebooted, reset) or capturing data is not possible at all (due to system, kernel, userspace hang,
> > oom or too much load).
> > 
> > This capturing of logs could be implemented by adding some configurable variables to execute
> > commands on the test controller and inside oeqa environment at certain stages of testing, for example
> > after serial console login prompt has been detected. Command to execute could be a simple
> > "ssh -c 'journalctl -b -a'" to capture boot logs and "ssh 'journalctl -a -f'" and log the output data with
> > additional time stamps to bitbake task output or a separate file.
> > 
> > Just thinking out loud here, these changes are an improvement over current situation already.
> > Thanks for sending these patches!
> 
> While I tend to agree with what you are suggesting, I feel like there are some
> new constraints, because of targets variety: is the system a real target or a
> qemu image ? Does it run systemd/journald ? Are coredumps enabled by default ?
> Of course those constraints may be quite easy to circumvent, moreover we can
> think of a "best effort" solution which gathers "what it can" from running
> target. Also, there may be some corner cases that would need some specific
> handling: some tests can be quite long (hours), what about broken pipes during
> those ? This would definitely need some re-connection strategies for example.

Agreed. Currently our qemu reboot tests are failing for due some kind
of reconnection issue, for example. I'll try to find time to fix this...

> Since current proposed solution is not very invasive/has not much requirements
> except a valid ssh access, I plan to update the series and let some real testing
> show if it is relevant. If not (or not enough), we may give a try to a "runtime"
> version like the one you are suggesting ?

Yes, the series is a clear improvement over the current state.

Cheers,

-Mikko
diff mbox series

Patch

diff --git a/meta/classes-recipe/testimage.bbclass b/meta/classes-recipe/testimage.bbclass
index 6b10c1db09f9..3fd3dd5b0264 100644
--- a/meta/classes-recipe/testimage.bbclass
+++ b/meta/classes-recipe/testimage.bbclass
@@ -18,6 +18,12 @@  inherit image-artifact-names
 
 TESTIMAGE_AUTO ??= "0"
 
+# When any test fails, if path to an artifacts configuration (listing
+# files/directories to retrieve on target) has been provided, all listed
+# elements will be downloaded from target before shutting it down
+
+ARTIFACTS_LIST_PATH ??= ""
+
 # You can set (or append to) TEST_SUITES in local.conf to select the tests
 # which you want to run for your target.
 # The test names are the module names in meta/lib/oeqa/runtime/cases.
@@ -192,6 +198,45 @@  def get_testimage_boot_patterns(d):
                 boot_patterns[flag] = flagval.encode().decode('unicode-escape')
     return boot_patterns
 
+def load_artifacts_list(artifacts_conf_path):
+    import json
+
+    if not artifacts_conf_path:
+        return None
+
+    if not os.path.isfile(artifacts_conf_path):
+        return None
+
+    try:
+        with open(artifacts_conf_path) as f:
+            artifacts_list = json.load(f)
+    except json.decoder.JSONDecodeError as e:
+        bb.warn(f"Invalid tests artifact list format: {e.lineno}:{e.colno} : {e.msg}")
+        return None
+
+    if 'artifacts' in artifacts_list:
+        return artifacts_list['artifacts']
+
+    return None
+
+def retrieve_test_artifacts(target, artifacts_list, target_dir):
+    import shutil
+
+    local_artifacts_dir = os.path.join(target_dir, "artifacts")
+    if os.path.isdir(local_artifacts_dir):
+        shutil.rmtree(local_artifacts_dir)
+
+    os.makedirs(local_artifacts_dir)
+    for artifact_path in artifacts_list:
+        if not os.path.isabs(artifact_path):
+            bb.warn(f"{artifact_path} is not an absolute path")
+            continue
+        try:
+            dest_dir = os.path.join(local_artifacts_dir, os.path.dirname(artifact_path[1:]))
+            os.makedirs(dest_dir, exist_ok=True)
+            target.copyFrom(artifact_path, dest_dir)
+        except:
+            bb.warn(f"Can not retrieve {artifact_path} from test target")
 
 def testimage_main(d):
     import os
@@ -402,6 +447,12 @@  def testimage_main(d):
                         get_testimage_result_id(configuration),
                         dump_streams=d.getVar('TESTREPORT_FULLLOGS'))
         results.logSummary(pn)
+        if not results.wasSuccessful():
+            artifacts_list = load_artifacts_list(d.getVar("ARTIFACTS_LIST_PATH"))
+            if not artifacts_list:
+                bb.warn("Could not load artifacts list, skip artifacts retrieval")
+            else:
+                retrieve_test_artifacts(tc.target, artifacts_list, get_testimage_json_result_dir(d))
 
     tc.target.stop()