[v2] create-spdx: Get SPDX-License-Identifier from source

Message ID 20220207192915.70095-1-saul.wold@windriver.com
State Accepted, archived
Commit 51e5f328635eb022143178c3169bae719509697a
Series [v2] create-spdx: Get SPDX-License-Identifier from source

Commit Message

Saul Wold Feb. 7, 2022, 7:29 p.m. UTC
This patch reads the beginning of each source file and tries to find
the SPDX-License-Identifier to populate the licenseInfoInFiles
field for that file. It does not populate licenseConcluded
at this time, nor does it roll the result up to the package level.

We read each file in binary mode, since some source files seem to contain
binary characters; the license is then converted to an ASCII string.

Signed-off-by: Saul Wold <saul.wold@windriver.com>
---
v2: Updated commit message, and fixed REGEX based on Peter's suggestion

 meta/classes/create-spdx.bbclass | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)
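
The approach described in the commit message can be sketched outside of
BitBake roughly as follows. This is a minimal illustration, not the code from
the patch: the names scan_head, file_entry and READ_LIMIT are hypothetical,
the strip() call is an addition, and leaving licenseConcluded as NOASSERTION
is an assumption about how an unpopulated field would be represented. The
dictionary keys follow the SPDX 2.2 JSON field names.

```python
import os
import re
import tempfile

# Simplified form of the tag pattern; the identifier is captured as bytes.
SPDX_TAG = re.compile(rb'SPDX-License-Identifier:\s+([-A-Za-z\d. ]+)')
READ_LIMIT = 15000  # only the head of the file is inspected, as in the patch


def scan_head(path, limit=READ_LIMIT):
    """Return the SPDX identifiers found in the first `limit` bytes of path."""
    # Binary mode: stray non-text bytes cannot raise a decoding error here.
    with open(path, 'rb') as f:
        head = f.read(limit)
    # Only the captured identifier is decoded, never the whole file.
    return [m.decode('ascii').strip() for m in SPDX_TAG.findall(head)]


def file_entry(path):
    """Build a per-file record: declared tags only, no conclusion drawn."""
    ids = scan_head(path)
    return {
        "fileName": os.path.basename(path),
        "licenseInfoInFiles": ids or ["NOASSERTION"],
        "licenseConcluded": "NOASSERTION",  # assumption: left unpopulated
    }


# Demo: a source file with a tag followed by bytes that are not valid UTF-8.
with tempfile.NamedTemporaryFile(suffix=".c", delete=False) as tmp:
    tmp.write(b"/* SPDX-License-Identifier: MIT */\n\xff\xfe\nint x;\n")
print(file_entry(tmp.name))
os.unlink(tmp.name)
```

Reading the same demo file in text mode with a UTF-8 decoder would fail on the
\xff byte, which is the situation the binary read avoids.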

Comments

Scott Murray Feb. 7, 2022, 8:33 p.m. UTC | #1
On Mon, 7 Feb 2022, Saul Wold wrote:

> diff --git a/meta/classes/create-spdx.bbclass b/meta/classes/create-spdx.bbclass
> index 8b4203fdb5..588489cc2b 100644
> --- a/meta/classes/create-spdx.bbclass
> +++ b/meta/classes/create-spdx.bbclass
> @@ -37,6 +37,24 @@ SPDX_SUPPLIER[doc] = "The SPDX PackageSupplier field for SPDX packages created f
>
>  do_image_complete[depends] = "virtual/kernel:do_create_spdx"
>
> +def extract_licenses(filename):
> +    import re
> +    import oe.spdx
> +
> +    lic_regex = re.compile(b'SPDX-License-Identifier:\s+([-A-Za-z\d. ]+)[ |\n|\r\n]*?')
> +
> +    try:
> +        with open(filename, 'rb') as f:
> +            size = min(15000, os.stat(filename).st_size)
> +            txt = f.read(size)
> +            licenses = re.findall(lic_regex, txt)
> +            if licenses:
> +                ascii_licenses = [lic.decode('ascii') for lic in licenses]
> +                return ascii_licenses
> +    except Exception as e:
> +        bb.warn(f"Exception reading {filename}: {e}")
> +    return None
> +
>  def get_doc_namespace(d, doc):
>      import uuid
>      namespace_uuid = uuid.uuid5(uuid.NAMESPACE_DNS, d.getVar("SPDX_UUID_NAMESPACE"))
> @@ -232,6 +250,11 @@ def add_package_files(d, doc, spdx_pkg, topdir, get_spdxid, get_types, *, archiv
>                          checksumValue=bb.utils.sha256_file(filepath),
>                      ))
>
> +                if "SOURCE" in spdx_file.fileTypes:
> +                    extracted_lics = extract_licenses(filepath)
> +                    if extracted_lics:
> +                        spdx_file.licenseInfoInFiles = extracted_lics
> +
>                  doc.files.append(spdx_file)
>                  doc.add_relationship(spdx_pkg, "CONTAINS", spdx_file)
>                  spdx_pkg.hasFiles.append(spdx_file.SPDXID)

IMO this seems like perhaps either going too far, or not far enough.  If
we go to the trouble to scan source files for explicit SPDX license
declarations, but do not go as far as pattern detection like the
meta-spdxscanner layer does with its use of Scancode Toolkit
(https://github.com/nexB/scancode-toolkit), then it seems there's
more potential for giving users a false impression as to the completeness
of the resulting report/SBOM.  Perhaps that can be handled by making it
very clear that further scanning and auditing is still required in the
hopefully forthcoming create-spdx.bbclass documentation, but I can
imagine having to explain this to customers.

Scott
Joshua Watt Feb. 7, 2022, 8:35 p.m. UTC | #2
On 2/7/22 14:33, Scott Murray wrote:
> IMO this seems like perhaps either going too far, or not far enough.  If
> we go to the trouble to scan source files for explicit SPDX license
> declarations, but do not go as far as pattern detection like the
> meta-spdxscanner layer does with its use of Scancode Toolkit
> (https://github.com/nexB/scancode-toolkit), then it seems there's
> more potential for giving users a false impression as to the completeness
> of the resulting report/SBOM.  Perhaps that can be handled by making it
> very clear that further scanning and auditing is still required in the
> hopefully forthcoming create-spdx.bbclass documentation, but I can
> imagine having to explain this to customers.

Can you give an overview of what meta-spdxscanner does? I'm not quite
clear what extra processing would be required here.

Scott Murray Feb. 7, 2022, 8:59 p.m. UTC | #3
On Mon, 7 Feb 2022, Joshua Watt wrote:

>
> On 2/7/22 14:33, Scott Murray wrote:
> > IMO this seems like perhaps either going too far, or not far enough.  If
> > we go to the trouble to scan source files for explicit SPDX license
> > declarations, but do not go as far as pattern detection like the
> > meta-spdxscanner layer does with its use of Scancode Toolkit
> > (https://github.com/nexB/scancode-toolkit), then it seems there's
> > more potential for giving users a false impression as to the completeness
> > of the resulting report/SBOM.  Perhaps that can be handled by making it
> > very clear that further scanning and auditing is still required in the
> > hopefully forthcoming create-spdx.bbclass documentation, but I can
> > imagine having to explain this to customers.
>
> Can you give an overview of what meta-spdxscanner does? I'm not quite clear
> what extra processing would be required here.

Jan-Simon can speak to it better, as he's done some dev work on the layer
and run tests with it against AGL (and the subsequent Fossology instance
experimentation), but AFAIK for the actual scanning scancode-toolkit
does pattern-matching-based license detection, so in theory it'll catch
excerpts of, or slightly modified versions of, the licenses in its
database, as opposed to just searching for SPDX-License-Identifier
declarations.  If everyone else is happy with the latter, I'm willing to
believe I'm off base in my concerns, but either way I do think the
limitations are going to need to be documented so users (and their
lawyers) are aware of them.
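
To make the distinction concrete, here is a rough sketch of a file that a
tag search misses but a text comparison would flag. It is only an
illustration: difflib from the Python standard library stands in for "some
form of text matching" and is not how scancode-toolkit works internally,
and REFERENCE, SOURCE, words and coverage are hypothetical names.

```python
import re
from difflib import SequenceMatcher

TAG = re.compile(rb'SPDX-License-Identifier:\s+([-A-Za-z\d. ]+)')

# Opening words of the MIT license text, used as a one-entry "database".
REFERENCE = ("Permission is hereby granted, free of charge, to any person "
             "obtaining a copy of this software and associated documentation files")

# A source file with a slightly reworded license notice and no SPDX tag.
SOURCE = b"""/*
 * Permission is hereby granted, without charge, to any person
 * obtaining a copy of this program and associated documentation files.
 */
int main(void) { return 0; }
"""


def words(text):
    """Lowercase the text and keep only runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def coverage(reference, data):
    """Fraction of the reference words found in matching runs inside data."""
    ref = words(reference)
    src = words(data.decode('ascii', errors='ignore'))
    matcher = SequenceMatcher(None, ref, src, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref)


print(TAG.findall(SOURCE))          # tag search: finds no declaration
print(coverage(REFERENCE, SOURCE))  # text comparison: most of the notice matches
```

A scan that relies on tags alone reports nothing for this file, which is the
kind of gap that would need to be spelled out in the documentation.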

Scott
Robert Berger Feb. 8, 2022, 12:50 p.m. UTC | #4
Hi,

On 07/02/2022 22:59, Scott Murray wrote:
> Jan-Simon can talk to it better, as he's done some dev work on the layer
> and done tests with it against AGL (and the subsequent Fossology instance
> experimentation), but AFAIK for the actual scanning scancode-toolkit
> does pattern matching based license detection, so in theory it'll catch
> excerpts of or slightly modified versions of the licenses in its
> database, as opposed to just searching for SPDX-License-Identifier
> declarations.

If I understand it correctly, scancode-toolkit[1] claims to do this:

"ScanCode provides the most accurate license detection engine and does a 
full comparison (also known as diff or red line comparison) between a 
database of license texts and your code instead of relying only on 
approximate regex patterns or probabilistic search, edit distance or 
machine learning."

[1] https://github.com/nexB/scancode-toolkit

meta-spdxscanner can also create tarballs for each recipe and upload
them to a Fossology instance. Fossology then takes care of license
scanning and so on, and manual intervention is required.

A major issue here is that the combined work (linking) is not taken
into account at all; everything is done per recipe.

Regards,

Robert
Jan-Simon Moeller Feb. 8, 2022, 1:19 p.m. UTC | #5
Hi all

> > Can you give an overview of what meta-spdxscanner does? I'm not quite
> > clear what extra processing would be required here.
>
> Jan-Simon can talk to it better, as he's done some dev work on the layer
> and done tests with it against AGL (and the subsequent Fossology instance
> experimentation), but AFAIK for the actual scanning scancode-toolkit
> does pattern matching based license detection, so in theory it'll catch
> excerpts of or slightly modified versions of the licenses in its
> database, as opposed to just searching for SPDX-License-Identifier
> declarations.  If everyone else is happy with the latter, I'm willing to
> believe I'm offbase in my concerns, but either way I do think the
> limitations are going to need to be documented so users (and their
> lawyers) are aware of them.

TLDR: meta-spdxscanner integrates with scanning tools, either fossology
or scancode-tk. An upload to Black Duck is also possible by now.

Let's focus on fossology and scancode-tk.

a) fossology

Here we essentially hook into the task chain and archive the sources after
patching to upload them to a fossology instance. All the scanning/processing
then happens on the server, and after some time (a lot! ;) ) we get an SPDX
report back that we store alongside the package. This is the result of a scan,
so it might catch licenses of files deep in the source tree that may not be
declared in the recipe, and so on.

Also, fossology then offers a web interface for manual inspection and review.
So this is a thorough but quite manual process, better suited to release work
than to daily or occasional use.


b) scancode-tk
scancode, in contrast, runs on your host during the build and gathers the
data.  It writes out the SPDX file as well.


I think for us the interesting part would be to compare e.g. the scancode-tk
scan from b) with what we have declared in the recipe.
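
Such a comparison could look roughly like the sketch below. It assumes that
both the recipe's LICENSE string and the scan result already use the same
SPDX identifiers (mapping older recipe license names is out of scope here),
and declared_ids and compare are hypothetical helper names, not functions
from any existing layer.

```python
import re


def declared_ids(license_expr):
    """Split a recipe-style LICENSE string into its individual identifiers."""
    # Operators (&, |) and parentheses are dropped; only the names are kept.
    return set(re.findall(r"[A-Za-z0-9.+_-]+", license_expr))


def compare(declared_expr, scanned_ids):
    """Report identifiers seen by only one side of the comparison."""
    declared = declared_ids(declared_expr)
    scanned = set(scanned_ids)
    return {
        "undeclared": sorted(scanned - declared),  # found by the scan only
        "unseen": sorted(declared - scanned),      # declared but not found
    }


result = compare("GPL-2.0-only & (MIT | BSD-3-Clause)",
                 ["GPL-2.0-only", "MIT", "Apache-2.0"])
print(result)
# {'undeclared': ['Apache-2.0'], 'unseen': ['BSD-3-Clause']}
```

The "undeclared" list is the interesting one for recipe maintenance, since it
points at licenses in the source tree that the recipe does not mention.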


Best,
JS
Mikko Rapeli Feb. 8, 2022, 1:35 p.m. UTC | #6
Hi,

On Tue, Feb 08, 2022 at 02:19:51PM +0100, Jan-Simon Möller wrote:
> TLDR: meta-spdxscanner integrates with scanning tools. Either with fossology
> or scancode-tk. An upload to blackduck is also possible meanwhile.
> 
> Let's focus on fossology and scancode-tk.
> 
> a) fossology
> 
> Here we essentially integrate in the task chain and archive the sources after
> patching to upload them to a fossology instance. All the scanning/processing
> happens then on the server and after some time (a lot ! ;) ) we get a SPDX
> report back that we store alongside the package. This is a result of a scan,
> so it might catch licenses of files deep in the source tree that may not be
> declared in the recipe and so on.
> 
> Also, fossology offers then a webinterface for manual inspection and review.
> So this is a thorough but quite manual process. More for release work than
> daily or occasional stuff.
> 
> 
> b) scancode-tk
> scancode on the contrary will run on your host during the build and gather the
> data.  It will write the spdx file out as well.
> 
> 
> I think for us the interesting part would be to compare e.g. the scancode-tk
> scan from b) with what we have declared in the recipe.

I guess reports from both will be a superset of the licenses actually used (and possibly
of the copyright statements too), since the list of source files that are actually compiled
is not known to these services.

Currently, source trees of recipes that carry multiple licenses, including problematic ones,
are not cleaned up before the license compliance scan. E.g. GPLv3-licensed source code is not
deleted at do_patch() time. Thus reports need to be manually adjusted.

Cheers,

-Mikko
Jan-Simon Moeller Feb. 8, 2022, 1:56 p.m. UTC | #7
Hi Mikko,

> I guess reports from both will be a superset of used licenses (and possibly
> copyright statements too) since the list of source files which are actually
> compiled is not known to these services.
Yes, you're right. The input is the 'bare' patched source archives, so we
only know the 'input' side of things, i.e. all licenses of the source tree.

We do *not* yet know the 'output' side of things, i.e. which of these end up
in the image.

But IMHO let's raise the bar step by step.

> Currently the source recipes which have multiple licenses including
> problematic ones, are not cleaned up for license compliance scan. E.g.
> GPLv3 licensed source code are not deleted at do_patch() time. Thus reports
> need to be manually adjusted.
Well, that's a different topic and should be discussed alongside meta-gplv2.

Best,
JS
Joshua Watt Feb. 8, 2022, 2:16 p.m. UTC | #8
On Tue, Feb 8, 2022 at 7:19 AM Jan-Simon Moeller <dl9pf@gmx.de> wrote:
> TLDR: meta-spdxscanner integrates with scanning tools. Either with fossology
> or scancode-tk. An upload to blackduck is also possible meanwhile.
>
> Let's focus on fossology and scancode-tk.
>
> a) fossology
>
> Here we essentially integrate in the task chain and archive the sources after
> patching to upload them to a fossology instance. All the scanning/processing
> happens then on the server and after some time (a lot ! ;) ) we get a SPDX
> report back that we store alongside the package. This is a result of a scan,
> so it might catch licenses of files deep in the source tree that may not be
> declared in the recipe and so on.
>
> Also, fossology offers then a webinterface for manual inspection and review.
> So this is a thorough but quite manual process. More for release work than
> daily or occasional stuff.
>
>
> b) scancode-tk
> scancode on the contrary will run on your host during the build and gather the
> data.  It will write the spdx file out as well.
>
>
> I think for us the interesting part would be to compare e.g. the scancode-tk
> scan from b) with what we have declared in the recipe.

Excellent. Thanks for the synopsis.

Our perspective on all of the SPDX work we've done hasn't been to
necessarily replace or replicate the behavior of these advanced
licensing tools. Instead, we want to supplement them because there is
a lot of information about recipes and packages that is pretty easy
for us to know since we track it as metadata and are the ones actually
*doing* the build. I don't know if we can claim to have a "complete"
view of the SBoM, but I also don't think any single tool can provide a
complete view. The amount of detail you want is going to be driven by
your requirements, and some tools are simply better at providing
different views than others. Our goal has always been to provide what
we can (and especially what is hard for others to discover), and allow
users to supplement with other tools of their choice.

The license scanning may not fall under the "hard to discover" for
others (depending on how integrated they are to scan the build
source), but it's certainly something we *can* provide and can be
quite useful. I would expect that different scanning tools will
generate different results anyway, so as long as we indicate that our
scanning uses a simplistic method, I don't think including it is
necessarily wrong or a bad idea. If you want a more detailed view,
then some other tool should be used to supplement the SBoM.

I'm hopeful that someday our core SPDX work can make it a little
easier on some of these tools so that they can "build" on our existing
framework instead of replicating parts of it.


Patch

diff --git a/meta/classes/create-spdx.bbclass b/meta/classes/create-spdx.bbclass
index 8b4203fdb5..588489cc2b 100644
--- a/meta/classes/create-spdx.bbclass
+++ b/meta/classes/create-spdx.bbclass
@@ -37,6 +37,24 @@  SPDX_SUPPLIER[doc] = "The SPDX PackageSupplier field for SPDX packages created f
 
 do_image_complete[depends] = "virtual/kernel:do_create_spdx"
 
+def extract_licenses(filename):
+    import re
+    import oe.spdx
+
+    lic_regex = re.compile(b'SPDX-License-Identifier:\s+([-A-Za-z\d. ]+)[ |\n|\r\n]*?')
+
+    try:
+        with open(filename, 'rb') as f:
+            size = min(15000, os.stat(filename).st_size)
+            txt = f.read(size)
+            licenses = re.findall(lic_regex, txt)
+            if licenses:
+                ascii_licenses = [lic.decode('ascii') for lic in licenses]
+                return ascii_licenses
+    except Exception as e:
+        bb.warn(f"Exception reading {filename}: {e}")
+    return None
+
 def get_doc_namespace(d, doc):
     import uuid
     namespace_uuid = uuid.uuid5(uuid.NAMESPACE_DNS, d.getVar("SPDX_UUID_NAMESPACE"))
@@ -232,6 +250,11 @@  def add_package_files(d, doc, spdx_pkg, topdir, get_spdxid, get_types, *, archiv
                         checksumValue=bb.utils.sha256_file(filepath),
                     ))
 
+                if "SOURCE" in spdx_file.fileTypes:
+                    extracted_lics = extract_licenses(filepath)
+                    if extracted_lics:
+                        spdx_file.licenseInfoInFiles = extracted_lics
+
                 doc.files.append(spdx_file)
                 doc.add_relationship(spdx_pkg, "CONTAINS", spdx_file)
                 spdx_pkg.hasFiles.append(spdx_file.SPDXID)
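
For readers who want to see what the pattern in the patch captures, the
following standalone sketch runs it against a few typical tag lines.
PATCH_REGEX is the raw-bytes spelling of the pattern above (same character
classes). ALT_REGEX is a hypothetical alternative that is not part of the
patch; it is shown only to illustrate how '+', parentheses and trailing
whitespace could be handled, and it has blind spots of its own (an HTML-style
"-->" terminator would still be partially captured).

```python
import re

# Raw-bytes spelling of the pattern from the patch.
PATCH_REGEX = re.compile(
    rb'SPDX-License-Identifier:\s+([-A-Za-z\d. ]+)[ |\n|\r\n]*?')

# Hypothetical alternative: space-separated tokens that may contain '+'
# and parentheses, so the capture never ends in whitespace.
ALT_REGEX = re.compile(
    rb'SPDX-License-Identifier:[ \t]+'
    rb'([-A-Za-z0-9.+()]+(?: +[-A-Za-z0-9.+()]+)*)')

SAMPLES = [
    b"// SPDX-License-Identifier: MIT\n",
    b"/* SPDX-License-Identifier: GPL-2.0-only */\n",
    b"# SPDX-License-Identifier: GPL-2.0+\n",
    b"# SPDX-License-Identifier: (GPL-2.0-only OR MIT)\n",
]


def captures(regex, data):
    """Return every captured identifier in data as an ASCII string."""
    return [m.decode('ascii') for m in regex.findall(data)]


for sample in SAMPLES:
    print(captures(PATCH_REGEX, sample), captures(ALT_REGEX, sample))
# ['MIT'] ['MIT']
# ['GPL-2.0-only '] ['GPL-2.0-only']
# ['GPL-2.0'] ['GPL-2.0+']
# [] ['(GPL-2.0-only OR MIT)']
```

With the committed pattern, a tag inside a C block comment keeps its trailing
space, a '+' suffix is dropped, and a parenthesized expression is not matched
at all, so consumers of licenseInfoInFiles may want to normalize the values.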