[bitbake-devel,1/2] utils/md5_file: don't iterate line-by-line

Submitted by Ross Burton on Aug. 13, 2018, 6:02 p.m. | Patch ID: 153706

Details

Message ID 20180813180226.18299-1-ross.burton@intel.com
State New

Commit Message

Ross Burton Aug. 13, 2018, 6:02 p.m.
Opening a file in binary mode and iterating it seems like the simple solution
but will still break on newlines, which for binary files isn't really useful as
the size of the chunks could be huge or tiny.

Instead, let's be a bit more clever: we'll be MD5ing lots of files, but we don't
want to fill up memory: use mmap() to open the file and read the file in 8k
blocks.

Signed-off-by: Ross Burton <ross.burton@intel.com>
---
 bitbake/lib/bb/utils.py | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

Patch

diff --git a/bitbake/lib/bb/utils.py b/bitbake/lib/bb/utils.py
index 9903183213b..b20cdabcf01 100644
--- a/bitbake/lib/bb/utils.py
+++ b/bitbake/lib/bb/utils.py
@@ -524,12 +524,17 @@  def md5_file(filename):
     """
     Return the hex string representation of the MD5 checksum of filename.
     """
-    import hashlib
-    m = hashlib.md5()
+    import hashlib, mmap
 
     with open(filename, "rb") as f:
-        for line in f:
-            m.update(line)
+        m = hashlib.md5()
+        try:
+            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
+                for chunk in iter(lambda: mm.read(8192), b''):
+                    m.update(chunk)
+        except ValueError:
+            # You can't mmap() an empty file so silence this exception
+            pass
     return m.hexdigest()
 
 def sha256_file(filename):
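
The chunk-size point in the commit message can be seen with a small experiment. This is only an illustrative sketch: the helper names (line_chunk_sizes, block_chunk_sizes, demo) and the sample file contents are made up for the example and are not part of bitbake.

```python
import os
import tempfile

def line_chunk_sizes(path):
    """Sizes of the chunks produced by iterating a binary file object."""
    with open(path, "rb") as f:
        return [len(line) for line in f]

def block_chunk_sizes(path, blocksize=8192):
    """Sizes of the chunks produced by fixed-size read() calls."""
    with open(path, "rb") as f:
        return [len(chunk) for chunk in iter(lambda: f.read(blocksize), b"")]

def demo():
    with tempfile.TemporaryDirectory() as tmp:
        # 1 MiB without a single newline: iteration yields one huge "line".
        no_newlines = os.path.join(tmp, "no_newlines.bin")
        with open(no_newlines, "wb") as f:
            f.write(b"\x00" * (1 << 20))
        # Nothing but newlines: iteration yields one chunk per byte.
        only_newlines = os.path.join(tmp, "only_newlines.bin")
        with open(only_newlines, "wb") as f:
            f.write(b"\n" * 1000)
        return (
            line_chunk_sizes(no_newlines),
            len(line_chunk_sizes(only_newlines)),
            max(block_chunk_sizes(no_newlines)),
        )

print(demo())
```

In the first file, line iteration hands the hash a single 1 MiB chunk; in the second it makes 1000 one-byte update() calls. Fixed-size reads never exceed the block size in either case.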

Comments

Armin Kuster Aug. 13, 2018, 6:20 p.m.
On 08/13/2018 11:02 AM, Ross Burton wrote:
> Opening a file in binary mode and iterating it seems like the simple solution
> but will still break on newlines, which for binary files isn't really useful as
> the size of the chunks could be huge or tiny.
>
> Instead, let's be a bit more clever: we'll be MD5ing lots of files, but we don't
> want to fill up memory: use mmap() to open the file and read the file in 8k
> blocks.
>
> Signed-off-by: Ross Burton <ross.burton@intel.com>
> ---
>  bitbake/lib/bb/utils.py | 13 +++++++++----
>  1 file changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/bitbake/lib/bb/utils.py b/bitbake/lib/bb/utils.py
> index 9903183213b..b20cdabcf01 100644
> --- a/bitbake/lib/bb/utils.py
> +++ b/bitbake/lib/bb/utils.py
> @@ -524,12 +524,17 @@ def md5_file(filename):
>      """
>      Return the hex string representation of the MD5 checksum of filename.
>      """
> -    import hashlib
> -    m = hashlib.md5()
> +    import hashlib, mmap
>  
>      with open(filename, "rb") as f:
> -        for line in f:
> -            m.update(line)
> +        m = hashlib.md5()
> +        try:
> +            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
> +                for chunk in iter(lambda: mm.read(8192), b''):
> +                    m.update(chunk)
> +        except ValueError:
> +            # You can't mmap() an empty file so silence this exception
> +            pass
>      return m.hexdigest()
>  
>  def sha256_file(filename):
Fixes the problem I was having. Thanks.

Acked-by: Armin Kuster <akuster808@gmail.com>

Rasmus Villemoes Aug. 14, 2018, 8:33 a.m.
On 2018-08-13 20:02, Ross Burton wrote:
> Opening a file in binary mode and iterating it seems like the simple solution
> but will still break on newlines, which for binary files isn't really useful as
> the size of the chunks could be huge or tiny.

And what exactly was wrong with

http://lists.openembedded.org/pipermail/bitbake-devel/2018-July/009407.html
http://lists.openembedded.org/pipermail/bitbake-devel/2018-July/009409.html

? I have lots of patches and ideas for performance improvements in
bitbake, pseudo and other components, but I'm not getting any response
at all, which is rather frustrating.

Rasmus

Ross Burton Aug. 14, 2018, 1:12 p.m.
Nothing whatsoever; I don't read bitbake-devel often, so I didn't spot
those. The mmap approach is slightly faster in my benchmarks, though.

If you have other patches which are not getting merged or commented
on, please ping the thread to remind the maintainers.

Ross

On 14 August 2018 at 09:33, Rasmus Villemoes <rasmus.villemoes@prevas.dk> wrote:
> On 2018-08-13 20:02, Ross Burton wrote:
>> Opening a file in binary mode and iterating it seems like the simple solution
>> but will still break on newlines, which for binary files isn't really useful as
>> the size of the chunks could be huge or tiny.
>
> And what exactly was wrong with
>
> http://lists.openembedded.org/pipermail/bitbake-devel/2018-July/009407.html
> http://lists.openembedded.org/pipermail/bitbake-devel/2018-July/009409.html
>
> ? I have lots of patches and ideas for performance improvements in
> bitbake, pseudo and other components, but I'm not getting any response
> at all, which is rather frustrating.
>
> Rasmus
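
The thread does not include the benchmark itself. As a rough sketch of how such a comparison might be set up (the function names, file size and repeat count are assumptions for illustration, not Ross's actual benchmark), one could time both variants on the same file:

```python
import hashlib
import mmap
import os
import tempfile
import timeit

def md5_by_lines(filename):
    """The pre-patch approach: feed the hash whatever 'lines' iteration yields."""
    m = hashlib.md5()
    with open(filename, "rb") as f:
        for line in f:
            m.update(line)
    return m.hexdigest()

def md5_by_mmap(filename, blocksize=8192):
    """The patched approach: map the file and read it in fixed-size blocks."""
    m = hashlib.md5()
    with open(filename, "rb") as f:
        try:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                for chunk in iter(lambda: mm.read(blocksize), b""):
                    m.update(chunk)
        except ValueError:
            # An empty file cannot be mapped; the digest of no data is still valid.
            pass
    return m.hexdigest()

def compare(size=16 * 1024 * 1024, repeats=5):
    """Return the best wall-clock time in seconds for each approach."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.bin")
        with open(path, "wb") as f:
            f.write(os.urandom(size))
        # Both approaches must agree before timing means anything.
        assert md5_by_lines(path) == md5_by_mmap(path)
        return {
            "lines": min(timeit.repeat(lambda: md5_by_lines(path), number=1, repeat=repeats)),
            "mmap": min(timeit.repeat(lambda: md5_by_mmap(path), number=1, repeat=repeats)),
        }

if __name__ == "__main__":
    print(compare())
```

The numbers depend heavily on file contents, page cache state and Python version, so any result from a harness like this should be read as indicative only.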

Richard Purdie Aug. 14, 2018, 3:47 p.m.
On Tue, 2018-08-14 at 10:33 +0200, Rasmus Villemoes wrote:
> On 2018-08-13 20:02, Ross Burton wrote:
> > Opening a file in binary mode and iterating it seems like the simple solution
> > but will still break on newlines, which for binary files isn't really useful as
> > the size of the chunks could be huge or tiny.
>
> And what exactly was wrong with
>
> http://lists.openembedded.org/pipermail/bitbake-devel/2018-July/009407.html
> http://lists.openembedded.org/pipermail/bitbake-devel/2018-July/009409.html
>
> ? I have lots of patches and ideas for performance improvements in
> bitbake, pseudo and other components, but I'm not getting any response
> at all, which is rather frustrating.

Sorry, that is my fault, I hadn't gotten around to reviewing/testing
them.

Ross' patch using mmap should be slightly more efficient, so I'll likely
go with that. I'd take a patch to sort out the other hash functions
(e.g. sha256) similarly, and now that I'm much more familiar with the
code and expecting the change, review should be much simpler/faster...

Sorry once again...

Cheers,

Richard
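
One way the follow-up for the other hash functions might look is to share a single mmap-based reader. The sketch below is an assumption about a possible shape for that change, not a patch from this thread: the helper name _hash_file_mmap and its parameters are hypothetical.

```python
import hashlib
import mmap

def _hash_file_mmap(filename, algorithm, blocksize=8192):
    """Hypothetical shared helper: hex digest of a file, read via mmap in blocks."""
    h = hashlib.new(algorithm)
    with open(filename, "rb") as f:
        try:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                for chunk in iter(lambda: mm.read(blocksize), b""):
                    h.update(chunk)
        except ValueError:
            # An empty file cannot be mapped; the digest of no data is still valid.
            pass
    return h.hexdigest()

def md5_file(filename):
    """Return the hex string representation of the MD5 checksum of filename."""
    return _hash_file_mmap(filename, "md5")

def sha256_file(filename):
    """Return the hex string representation of the SHA-256 checksum of filename."""
    return _hash_file_mmap(filename, "sha256")
```

As in the patch above, the except clause treats any ValueError from mmap as an empty file. Checking os.fstat(f.fileno()).st_size before mapping would be a narrower alternative.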