siggen: fix very inefficient string concatenation

Message ID

20230201141900.1478768-1-ecordonnier@snap.com

State

New

Headers

From: ecordonnier@snap.com
To: bitbake-devel@lists.openembedded.org
Cc: Etienne Cordonnier <ecordonnier@snap.com>,
	JJ Robertson <jrobertson@snap.com>
Subject: [PATCH] siggen: fix very inefficient string concatenation
Date: Wed,  1 Feb 2023 15:19:00 +0100
Message-Id: <20230201141900.1478768-1-ecordonnier@snap.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Series

siggen: fix very inefficient string concatenation

Commit Message

Etienne Cordonnier Feb. 1, 2023, 2:19 p.m. UTC

From: Etienne Cordonnier <ecordonnier@snap.com>

As discussed in https://stackoverflow.com/a/4435752/1710392 , CPython
has an optimization for statements of the form "a = a + b" or "a += b".
It seems that this line does not get optimized, because it has the form
"a = a + b + c":
data = data + "./" + f.split("/./")[1]

As a result, data is copied in full on every iteration, which can mean
copying megabytes of data per iteration.
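
The difference between the two forms can be reproduced outside bitbake with a
small sketch like the one below. The function names, the 200,000-character
base string and the chunk count are made up for illustration and are not taken
from siggen.py; the timings depend on the interpreter and machine, so none are
quoted here.

```python
import timeit


def build_three_operand(base, chunks):
    """Mimics the old loop body: data = data + "./" + chunk."""
    data = base
    for chunk in chunks:
        # Evaluated as (data + "./") + chunk. The first addition builds a
        # new temporary string, so data is copied on every pass.
        data = data + "./" + chunk
    return data


def build_in_place(base, chunks):
    """Mimics the new loop body: data += "./" + chunk."""
    data = base
    for chunk in chunks:
        # The small right-hand side is built first and then appended to
        # data, which CPython can usually extend without a full copy.
        data += "./" + chunk
    return data


if __name__ == "__main__":
    # Hypothetical sizes, chosen only to make the difference visible.
    base = "x" * 200_000
    chunks = ["file%d" % i for i in range(2000)]

    # Both forms must produce exactly the same string.
    assert build_three_operand(base, chunks) == build_in_place(base, chunks)

    for fn in (build_three_operand, build_in_place):
        seconds = timeit.timeit(lambda: fn(base, chunks), number=1)
        print(f"{fn.__name__}: {seconds:.4f} s")
```

Both functions return identical strings, so the change does not affect the
resulting hash input; only the cost of building it differs.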

Changing this line causes SignatureGeneratorBasic::get_taskhash to take 0.06 seconds
instead of 45 seconds on my test setup where SRC_URI points to a big directory.

Note that PEP 8 explicitly recommends against relying on this optimization, which is specific to CPython:
"do not rely on CPython’s efficient implementation of in-place string concatenation for statements in the form a += b or a = a + b"

However, the PEP8 recommended form using "join()" also does not avoid the copy and takes 45 seconds in my test setup:
data = ''.join((data, "./", f.split("/./")[1]))
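
For comparison, a different way to use join() is to collect the fragments in a
list and join them once after the loop, so that no large string is rebuilt
inside the loop. This is not what the patch does and it was not measured in
the test setup above; the sketch below only illustrates the idea. The function
names and the (file, checksum) sample format are assumptions that mirror the
loop in the diff.

```python
def build_with_single_join(base, file_checksums):
    """Collect fragments in a list and join them once at the end."""
    parts = [base]
    for f, cs in file_checksums:
        if cs:
            if "/./" in f:
                parts.append("./" + f.split("/./")[1])
            parts.append(cs)
    # One pass over all fragments, done a single time after the loop.
    return "".join(parts)


def build_with_in_place_concat(base, file_checksums):
    """Same result, using += as in the patch."""
    data = base
    for f, cs in file_checksums:
        if cs:
            if "/./" in f:
                data += "./" + f.split("/./")[1]
            data += cs
    return data


if __name__ == "__main__":
    # Hypothetical sample entries: (path, checksum or None).
    sample = [
        ("/src/./a/b.c", "111"),
        ("/plain/file", "222"),
        ("/skipped/./x", None),
    ]
    joined = build_with_single_join("base", sample)
    concatenated = build_with_in_place_concat("base", sample)
    assert joined == concatenated
    print(joined)
```

The single join does not depend on the CPython-specific optimization, at the
price of a slightly larger change to the function.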

I have also changed the other lines to use +=, for consistency only: those were in the form a = a + b
and were optimized already.

Co-authored-by: JJ Robertson <jrobertson@snap.com>
Signed-off-by: Etienne Cordonnier <ecordonnier@snap.com>
---
 lib/bb/siggen.py | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/lib/bb/siggen.py b/lib/bb/siggen.py
index 0e79404f..26e0243b 100644
--- a/lib/bb/siggen.py
+++ b/lib/bb/siggen.py
@@ -349,19 +349,19 @@  class SignatureGeneratorBasic(SignatureGenerator):
 
         data = self.basehash[tid]
         for dep in self.runtaskdeps[tid]:
-            data = data + self.get_unihash(dep)
+            data += self.get_unihash(dep)
 
         for (f, cs) in self.file_checksum_values[tid]:
             if cs:
                 if "/./" in f:
-                    data = data + "./" + f.split("/./")[1]
-                data = data + cs
+                    data += "./" + f.split("/./")[1]
+                data += cs
 
         if tid in self.taints:
             if self.taints[tid].startswith("nostamp:"):
-                data = data + self.taints[tid][8:]
+                data += self.taints[tid][8:]
             else:
-                data = data + self.taints[tid]
+                data += self.taints[tid]
 
         h = hashlib.sha256(data.encode("utf-8")).hexdigest()
         self.taskhash[tid] = h
