
[bitbake-devel,v2] fetch2: add Google Cloud Platform (GCP) fetcher

Message ID 20230801124715.752243-1-eekmecic@snap.com
State New

Commit Message

Emil Ekmečić Aug. 1, 2023, 12:47 p.m. UTC
From: Emil Ekmečić <eekmecic@snap.com>

This fetcher allows BitBake to fetch from a Google Cloud Storage
bucket. The fetcher expects a gs:// URI of the following form:

SSTATE_MIRRORS = "file://.* gs://<bucket name>/PATH"

The fetcher uses the Google Cloud Storage Python Client, and
expects it to be installed, configured, and authenticated prior
to use.
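
For reference, a quick way to confirm that the client is installed,
configured and authenticated is a short standalone check along these
lines (the bucket and object names are hypothetical):

    from google.cloud import storage

    # Uses the ambient credentials and project, just like the fetcher.
    client = storage.Client(project=None)
    # Prints True if the object is visible with the current credentials.
    print(client.bucket("my-sstate-bucket").blob("PATH/archive.tar.zst").exists())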

Here is how this fetcher conforms to the fetcher expectations
described at this link:
https://git.yoctoproject.org/poky/tree/bitbake/lib/bb/fetch2/README

a) Yes, network fetching only happens in the fetcher

b) The fetcher has nothing to do with the unpack phase so there is no
network access there

c) This change doesn't affect the behavior of DL_DIR. The GCP fetcher
only downloads to DL_DIR in the same way that other fetchers, namely
the S3 and Azure fetchers, do.

d) The fetcher is identical to the S3 and Azure fetchers in this
context

e) Yes, the fetcher output is deterministic because it is downloading
tarballs from a bucket and not modifying them in any way.

f) I set up a local proxy using tinyproxy and set the http_proxy
variable to test whether the Python API respected the proxy. It
appears that it did, as I could see traffic passing through the
proxy; a sketch of the check is included after this list. I also did
some searching online and found posts indicating that the Google
Cloud Python APIs support the classic Linux proxy variables, namely:
  - https://github.com/googleapis/google-api-python-client/issues/1260

g) Access is minimal, only checking if the file exists and downloading
it if it does.

h) Not applicable; BitBake already knows which version it wants, and
the version information is encoded in the filename. The fetcher has no
concept of versions.

i) Not applicable

j) Not applicable

k) No tests were added as part of this change. I didn't see any tests
for the S3 or Azure fetchers either; is that OK?

l) I'm not 100% familiar with this area, but I don't believe this
fetcher uses any tools during parse time. Please correct me if I'm
wrong.
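
As mentioned in point f, the proxy check amounted to roughly the
following (tinyproxy listens on port 8888 by default; the bitbake
target is just a stand-in):

    http_proxy=http://127.0.0.1:8888 https_proxy=http://127.0.0.1:8888 bitbake core-image-minimal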

Signed-off-by: Emil Ekmečić <eekmecic@snap.com>
---
 lib/bb/fetch2/__init__.py |   4 +-
 lib/bb/fetch2/gcp.py      | 104 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 107 insertions(+), 1 deletion(-)
 create mode 100644 lib/bb/fetch2/gcp.py

Comments

Alexander Kanavin Aug. 1, 2023, 1:18 p.m. UTC | #1
On Tue, 1 Aug 2023 at 14:47, eekmecic via lists.openembedded.org
<eekmecic=snap.com@lists.openembedded.org> wrote:
> e) Yes, the fetcher output is deterministic because it is downloading
> tarballs from a bucket and not modifying them in any way.

Let's say there's a supply chain attack on the bucket side, and the
tarball has been replaced with a maliciously modified one. Will this
fetcher be able to detect and reject such a download? For http tarballs,
checksums are provided in the recipe. For git, the commit revision
doubles as an integrity checksum (we trust that the local git client
performs the check correctly). How is this handled here?

Alex
Emil Ekmečić Aug. 1, 2023, 2:29 p.m. UTC | #2
On Tue, Aug 1, 2023 at 3:18 PM Alexander Kanavin <alex.kanavin@gmail.com> wrote:
>
> On Tue, 1 Aug 2023 at 14:47, eekmecic via lists.openembedded.org
> <eekmecic=snap.com@lists.openembedded.org> wrote:
> > e) Yes, the fetcher output is deterministic because it is downloading
> > tarballs from a bucket and not modifying them in any way.
>
> Let's say there's a supply chain attack on the bucket side, and the
> tarball has been replaced with a maliciously modified one. Will this
> fetcher be able to detect and reject such a download? For http tarballs,
> checksums are provided in the recipe. For git, the commit revision
> doubles as an integrity checksum (we trust that the local git client
> performs the check correctly). How is this handled here?
>
> Alex

This fetcher, in its current state, will not be able to detect a supply
chain attack of the type you've described. It does no verification of
any kind.

We are using this fetcher to support fetching sstate artifacts from a GCS
bucket. From my understanding, these artifacts have signatures that are
based on their inputs so BitBake would be able to verify that the
expected signature matches the signature of the file downloaded by the
fetcher.

I am unsure of how (if?) BitBake verifies the integrity of source
tarballs downloaded from mirrors. We are not using it to fetch source,
only prebuilt artifacts. I took a look at the other two fetchers
that implement support for S3 and Azure and I don't see any differences
in how they would handle a supply chain attack of the type you
described. Please correct me if I'm wrong. Likewise please let me know
if there's something that needs to be implemented in the fetcher code
to prevent such an attack.

Emil
Alexander Kanavin Aug. 1, 2023, 5:30 p.m. UTC | #3
On Tue, 1 Aug 2023 at 16:30, Emil Ekmečić <eekmecic@snap.com> wrote:

> I am unsure of how (if?) BitBake verifies the integrity of source
> tarballs downloaded from mirrors. We are not using it to fetch source,
> only prebuilt artifacts. I took a look at the other two fetchers
> that implement support for S3 and Azure and I don't see any differences
> in how they would handle a supply chain attack of the type you
> described. Please correct me if I'm wrong. Likewise please let me know
> if there's something that needs to be implemented in the fetcher code
> to prevent such an attack.

Right, I thought you had actual recipes using the fetcher.

I think this bit:

    def recommends_checksum(self, urldata):
        return True

will require checksumming when the fetcher is used in a recipe. I
don't actually know for sure what happens when the fetcher is used
from sstate code, or where the code that prevents tampering with
sstate objects lives.

Given that the fetcher is in the same 'class' as s3/azure (it seems
primarily intended for sstate mirrors), I'm not sure if it should have
tests in bitbake/lib/bb/tests/fetch.py. Maybe RP can clarify.

Otherwise, I'm fine with this.

Alex
Emil Ekmečić Aug. 2, 2023, 11:48 a.m. UTC | #4
On Tue, Aug 1, 2023 at 7:30 PM Alexander Kanavin <alex.kanavin@gmail.com> wrote:
>
> On Tue, 1 Aug 2023 at 16:30, Emil Ekmečić <eekmecic@snap.com> wrote:
>
> > I am unsure of how (if?) BitBake verifies the integrity of source
> > tarballs downloaded from mirrors. We are not using it to fetch source,
> > only prebuilt artifacts. I took a look at the other two fetchers
> > that implement support for S3 and Azure and I don't see any differences
> > in how they would handle a supply chain attack of the type you
> > described. Please correct me if I'm wrong. Likewise please let me know
> > if there's something that needs to be implemented in the fetcher code
> > to prevent such an attack.
>
> Right, I thought you had actual recipes using the fetcher.
>
> I think this bit:
>
>     def recommends_checksum(self, urldata):
>         return True
>
> will require checksumming when the fetcher is used in a recipe. I
> don't actually know for sure what happens when the fetcher is used
> from sstate code, or where the code that prevents tampering with
> sstate objects lives.
>
> Given that the fetcher is in the same 'class' as s3/azure (it seems
> primarily intended for sstate mirrors), I'm not sure if it should have
> tests in bitbake/lib/bb/tests/fetch.py. Maybe RP can clarify.
>
> Otherwise, I'm fine with this.
>
> Alex

Yes, that bit will require checksumming. My colleague Étienne Cordonnier
verified this by creating a recipe that uses this fetcher without
specifying a checksum, which returned the following warning:

WARNING: uutils-coreutils-0.0.16-r0 do_fetch: Missing checksum for
'/workdir/downloads/sstate:wayland::1.20.0:r0::10:0004d8aa6640ecb43319a1c2e68111d0522036c86fe0f331c1c830c0e0a30c31_populate_lic.tar.zst',
consider adding at least one to the recipe:
SRC_URI[sha256sum] = "16c025cd29bc9eb535676160e0edaf90832b9b07204fe0f5e180384b66c904ab"

After adding the checksum to the recipe as indicated by the warning, the
warning message went away. Lastly, when he intentionally supplied an
incorrect checksum, BitBake threw an error:

ERROR: uutils-coreutils-0.0.16-r0 do_fetch: Fetcher failure for URL:
'gs://< our internal bucket name >/00/04/sstate:wayland::1.20.0:r0::10:0004d8aa6640ecb43319a1c2e68111d0522036c86fe0f331c1c830c0e0a30c31_populate_lic.tar.zst'.
Checksum mismatch!
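
For anyone reproducing this, the recipe addition amounts to something
like the following (the URI and checksum are hypothetical placeholders,
not our real values):

    SRC_URI = "gs://my-bucket/path/to/archive.tar.zst"
    SRC_URI[sha256sum] = "<sha256 of archive.tar.zst>"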

I will create another revision of the patch that includes documentation
for the fetcher in the user manual. I will also create a patch in
openembedded-core with some small changes needed for this feature
to work. I'll ensure that I mention all the patches so that they can
be integrated simultaneously with minimal breakage should they be
accepted.

Thank you Alex for taking a look.

Emil

Patch

diff --git a/lib/bb/fetch2/__init__.py b/lib/bb/fetch2/__init__.py
index 8afe012e..0a3d7a58 100644
--- a/lib/bb/fetch2/__init__.py
+++ b/lib/bb/fetch2/__init__.py
@@ -1290,7 +1290,7 @@  class FetchData(object):
 
             if checksum_name in self.parm:
                 checksum_expected = self.parm[checksum_name]
-            elif self.type not in ["http", "https", "ftp", "ftps", "sftp", "s3", "az", "crate"]:
+            elif self.type not in ["http", "https", "ftp", "ftps", "sftp", "s3", "az", "crate", "gs"]:
                 checksum_expected = None
             else:
                 checksum_expected = d.getVarFlag("SRC_URI", checksum_name)
@@ -1973,6 +1973,7 @@  from . import npm
 from . import npmsw
 from . import az
 from . import crate
+from . import gcp
 
 methods.append(local.Local())
 methods.append(wget.Wget())
@@ -1994,3 +1995,4 @@  methods.append(npm.Npm())
 methods.append(npmsw.NpmShrinkWrap())
 methods.append(az.Az())
 methods.append(crate.Crate())
+methods.append(gcp.GCP())
diff --git a/lib/bb/fetch2/gcp.py b/lib/bb/fetch2/gcp.py
new file mode 100644
index 00000000..66efbbdd
--- /dev/null
+++ b/lib/bb/fetch2/gcp.py
@@ -0,0 +1,104 @@ 
+"""
+BitBake 'Fetch' implementation for Google Cloud Platform Storage.
+
+Class for fetching files from Google Cloud Storage using the
+Google Cloud Storage Python Client. The GCS Python Client must
+be correctly installed, configured and authenticated prior to use.
+Additionally, gsutil must be installed.
+
+"""
+
+# Copyright (C) 2023, Snap Inc.
+#
+# Based in part on bb.fetch2.s3:
+#    Copyright (C) 2017 Andre McCurdy
+#
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# Based on functions from the base bb module, Copyright 2003 Holger Schurig
+
+import os
+import bb
+import urllib.parse, urllib.error
+from bb.fetch2 import FetchMethod
+from bb.fetch2 import FetchError
+from bb.fetch2 import logger
+from google.cloud import storage
+
+class GCP(FetchMethod):
+    """
+    Class to fetch urls via GCP's Python API.
+    """
+    def __init__(self):
+        self.gcp_client = None
+
+    def init(self, d):
+        """
+        Initialize GCP client.
+        """
+        self.get_gcp_client()
+
+    def supports(self, ud, d):
+        """
+        Check to see if a given url can be fetched with GCP.
+        """
+        return ud.type in ['gs']
+
+    def recommends_checksum(self, urldata):
+        return True
+
+    def urldata_init(self, ud, d):
+        if 'downloadfilename' in ud.parm:
+            ud.basename = ud.parm['downloadfilename']
+        else:
+            ud.basename = os.path.basename(ud.path)
+
+        ud.localfile = d.expand(urllib.parse.unquote(ud.basename))
+
+    def get_gcp_client(self):
+        self.gcp_client = storage.Client(project=None)
+
+    def download(self, ud, d):
+        """
+        Fetch urls using the GCP API.
+        Assumes localpath was called first.
+        """
+        logger.debug2(f"Trying to download gs://{ud.host}{ud.path} to {ud.localpath}")
+        if self.gcp_client is None:
+            self.get_gcp_client()
+
+        bb.fetch2.check_network_access(d, "gsutil stat", ud.url)
+
+        # Path sometimes has leading slash, so strip it
+        path = ud.path.lstrip("/")
+        blob = self.gcp_client.bucket(ud.host).blob(path)
+        blob.download_to_filename(ud.localpath)
+
+        # Additional sanity checks copied from the wget class (although there
+        # are no known issues which mean these are required, treat the GCP API
+        # tool with a little healthy suspicion).
+        if not os.path.exists(ud.localpath):
+            raise FetchError(f"The GCP API returned success for gs://{ud.host}{ud.path} but {ud.localpath} doesn't exist?!")
+
+        if os.path.getsize(ud.localpath) == 0:
+            os.remove(ud.localpath)
+            raise FetchError(f"The downloaded file for gs://{ud.host}{ud.path} resulted in a zero size file?! Deleting and failing since this isn't right.")
+
+        return True
+
+    def checkstatus(self, fetch, ud, d):
+        """
+        Check the status of a URL.
+        """
+        logger.debug2(f"Checking status of gs://{ud.host}{ud.path}")
+        if self.gcp_client is None:
+            self.get_gcp_client()
+
+        bb.fetch2.check_network_access(d, "gsutil stat", ud.url)
+
+        # Path sometimes has leading slash, so strip it
+        path = ud.path.lstrip("/")
+        if not self.gcp_client.bucket(ud.host).blob(path).exists():
+            raise FetchError(f"The GCP API reported that gs://{ud.host}{ud.path} does not exist")
+        else:
+            return True