[bitbake-devel,RFC,WIP,v1] lib/bb/checksum.py: Speed-up checksum gen when directory is git

Submitted by Aníbal Limón on Oct. 8, 2019, 6:45 p.m. | Patch ID: 165622

Details

Message ID 20191008184512.20130-1-anibal.limon@linaro.org
State New
Headers show

Commit Message

Aníbal Limón Oct. 8, 2019, 6:45 p.m.
In some cases people/organizations are using a SRC_URI with
file:///PATH_TO_DIR that contains a git repository for different
reasons, it is useful when want to do an internal build without
clone sources from outside.

This could consume a lot of CPU time because the current taskhash
generation mechanism didn't identify that the folder is a VCS
(git, svn, cvs) and makes the cksum for every file including the
.git repository in this case.

There are different ways to improve the situation,

* Add protocol=gitscm in file:// SRC_URI but the taskhash is
  calculated before the fetcher identifies the protocol, will require
  some changes in bitbake codebase.
* This patch: When directory is a git repository (contains .git)
  use HEAD rev + git diff to calculate checksum instead of do it
  in every file, that is hackish because make some assumptions about
  .git directory contents.
* Variant of this patch: Make a list of VCS directories (.git, .svn,
  .cvs) and take out for cksum calculations, same as before making
  assumptions about the . folders content.

Signed-off-by: Aníbal Limón <anibal.limon@linaro.org>
---
 lib/bb/checksum.py | 13 +++++++++++++
 1 file changed, 13 insertions(+)

Patch hide | download patch | download mbox

diff --git a/lib/bb/checksum.py b/lib/bb/checksum.py
index 5bc8a8fc..ee125cb5 100644
--- a/lib/bb/checksum.py
+++ b/lib/bb/checksum.py
@@ -86,6 +86,19 @@  class FileChecksumCache(MultiProcessCache):
             return checksum
 
         def checksum_dir(pth):
+            git_dir = os.path.join(pth, '.git')
+            if os.path.exists(git_dir):
+                import subprocess, hashlib
+                m = hashlib.md5()
+                head = subprocess.check_output("cd %s && git rev-parse HEAD" % pth, shell=True)
+                diff = subprocess.check_output("cd %s && git diff" % pth, shell=True)
+                m.update(head)
+                if diff:
+                    m.update(diff)
+
+                return [(pth, m.hexdigest())]
+
+
             # Handle directories recursively
             if pth == "/":
                 bb.fatal("Refusing to checksum /")

Comments

Mark Hatle Oct. 8, 2019, 8:53 p.m.
On 10/8/19 1:45 PM, Aníbal Limón wrote:
> In some cases people/organizations are using a SRC_URI with
> file:///PATH_TO_DIR that contains a git repository for different
> reasons, it is useful when want to do an internal build without
> clone sources from outside.
> 
> This could consume a lot of CPU time because the current taskhash
> generation mechanism didn't identify that the folder is a VCS
> (git, svn, cvs) and makes the cksum for every file including the
> .git repository in this case.
> 
> There are different ways to improve the situation,
> 
> * Add protocol=gitscm in file:// SRC_URI but the taskhash is
>   calculated before the fetcher identifies the protocol, will require
>   some changes in bitbake codebase.

When I have done this before, I've -always- defined it as a git repository by
SRCURI:

git://<local file path>;protocol=file

Then it does exactly what this patch appears to do and uses the git logic to
handle everything automatically.  (The file protocol is already implemented.)

Wouldn't this be better then specifying file:// and then attempting to infer
what file refers to?

--MArk

> * This patch: When directory is a git repository (contains .git)
>   use HEAD rev + git diff to calculate checksum instead of do it
>   in every file, that is hackish because make some assumptions about
>   .git directory contents.
> * Variant of this patch: Make a list of VCS directories (.git, .svn,
>   .cvs) and take out for cksum calculations, same as before making
>   assumptions about the . folders content.
> 
> Signed-off-by: Aníbal Limón <anibal.limon@linaro.org>
> ---
>  lib/bb/checksum.py | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/lib/bb/checksum.py b/lib/bb/checksum.py
> index 5bc8a8fc..ee125cb5 100644
> --- a/lib/bb/checksum.py
> +++ b/lib/bb/checksum.py
> @@ -86,6 +86,19 @@ class FileChecksumCache(MultiProcessCache):
>              return checksum
>  
>          def checksum_dir(pth):
> +            git_dir = os.path.join(pth, '.git')
> +            if os.path.exists(git_dir):
> +                import subprocess, hashlib
> +                m = hashlib.md5()
> +                head = subprocess.check_output("cd %s && git rev-parse HEAD" % pth, shell=True)
> +                diff = subprocess.check_output("cd %s && git diff" % pth, shell=True)
> +                m.update(head)
> +                if diff:
> +                    m.update(diff)
> +
> +                return [(pth, m.hexdigest())]
> +
> +
>              # Handle directories recursively
>              if pth == "/":
>                  bb.fatal("Refusing to checksum /")
>
Nicolas Dechesne Oct. 8, 2019, 9:07 p.m.
Le mar. 8 oct. 2019 à 22:54, Mark Hatle <mark.hatle@kernel.crashing.org> a
écrit :

>
>
> On 10/8/19 1:45 PM, Aníbal Limón wrote:
> > In some cases people/organizations are using a SRC_URI with
> > file:///PATH_TO_DIR that contains a git repository for different
> > reasons, it is useful when want to do an internal build without
> > clone sources from outside.
> >
> > This could consume a lot of CPU time because the current taskhash
> > generation mechanism didn't identify that the folder is a VCS
> > (git, svn, cvs) and makes the cksum for every file including the
> > .git repository in this case.
> >
> > There are different ways to improve the situation,
> >
> > * Add protocol=gitscm in file:// SRC_URI but the taskhash is
> >   calculated before the fetcher identifies the protocol, will require
> >   some changes in bitbake codebase.
>
> When I have done this before, I've -always- defined it as a git repository
> by
> SRCURI:
>
> git://<local file path>;protocol=file
>
> Then it does exactly what this patch appears to do and uses the git logic
> to
> handle everything automatically.  (The file protocol is already
> implemented.)
>
> Wouldn't this be better then specifying file:// and then attempting to
> infer
> what file refers to?


Anibal and I are working on the same use case / problem. Folks who rely on
this setup also use it as a workspace, eg. They make local changes in the
sources workspace. So when doing fetch again the local changes are used. So
this is not strictly equivalent to what you propose, right?

Another major flaw with the current file:// with a git folder is that doing
a git fetch in the source tree will invalidate the fetch task even if the
actual sources haven’t changed!



>
> --MArk
>
> > * This patch: When directory is a git repository (contains .git)
> >   use HEAD rev + git diff to calculate checksum instead of do it
> >   in every file, that is hackish because make some assumptions about
> >   .git directory contents.
> > * Variant of this patch: Make a list of VCS directories (.git, .svn,
> >   .cvs) and take out for cksum calculations, same as before making
> >   assumptions about the . folders content.
> >
> > Signed-off-by: Aníbal Limón <anibal.limon@linaro.org>
> > ---
> >  lib/bb/checksum.py | 13 +++++++++++++
> >  1 file changed, 13 insertions(+)
> >
> > diff --git a/lib/bb/checksum.py b/lib/bb/checksum.py
> > index 5bc8a8fc..ee125cb5 100644
> > --- a/lib/bb/checksum.py
> > +++ b/lib/bb/checksum.py
> > @@ -86,6 +86,19 @@ class FileChecksumCache(MultiProcessCache):
> >              return checksum
> >
> >          def checksum_dir(pth):
> > +            git_dir = os.path.join(pth, '.git')
> > +            if os.path.exists(git_dir):
> > +                import subprocess, hashlib
> > +                m = hashlib.md5()
> > +                head = subprocess.check_output("cd %s && git rev-parse
> HEAD" % pth, shell=True)
> > +                diff = subprocess.check_output("cd %s && git diff" %
> pth, shell=True)
> > +                m.update(head)
> > +                if diff:
> > +                    m.update(diff)
> > +
> > +                return [(pth, m.hexdigest())]
> > +
> > +
> >              # Handle directories recursively
> >              if pth == "/":
> >                  bb.fatal("Refusing to checksum /")
> >
> --
> _______________________________________________
> bitbake-devel mailing list
> bitbake-devel@lists.openembedded.org
> http://lists.openembedded.org/mailman/listinfo/bitbake-devel
>
Mark Hatle Oct. 8, 2019, 10:31 p.m.
On 10/8/19 4:07 PM, Nicolas Dechesne wrote:
> 
> 
> Le mar. 8 oct. 2019 à 22:54, Mark Hatle <mark.hatle@kernel.crashing.org
> <mailto:mark.hatle@kernel.crashing.org>> a écrit :
> 
> 
> 
>     On 10/8/19 1:45 PM, Aníbal Limón wrote:
>     > In some cases people/organizations are using a SRC_URI with
>     > file:///PATH_TO_DIR that contains a git repository for different
>     > reasons, it is useful when want to do an internal build without
>     > clone sources from outside.
>     >
>     > This could consume a lot of CPU time because the current taskhash
>     > generation mechanism didn't identify that the folder is a VCS
>     > (git, svn, cvs) and makes the cksum for every file including the
>     > .git repository in this case.
>     >
>     > There are different ways to improve the situation,
>     >
>     > * Add protocol=gitscm in file:// SRC_URI but the taskhash is
>     >   calculated before the fetcher identifies the protocol, will require
>     >   some changes in bitbake codebase.
> 
>     When I have done this before, I've -always- defined it as a git repository by
>     SRCURI:
> 
>     git://<local file path>;protocol=file
> 
>     Then it does exactly what this patch appears to do and uses the git logic to
>     handle everything automatically.  (The file protocol is already implemented.)
> 
>     Wouldn't this be better then specifying file:// and then attempting to infer
>     what file refers to?
> 
> 
> Anibal and I are working on the same use case / problem. Folks who rely on this
> setup also use it as a workspace, eg. They make local changes in the sources
> workspace. So when doing fetch again the local changes are used. So this is not
> strictly equivalent to what you propose, right?

Sometimes I do my work on a local clone of a repository.  Then I redirect the
recipe to my local git repository using SRC_URI, and SRCREV.

--Mark

> Another major flaw with the current file:// with a git folder is that doing a
> git fetch in the source tree will invalidate the fetch task even if the actual
> sources haven’t changed!
> 
> 
> 
> 
>     --MArk
> 
>     > * This patch: When directory is a git repository (contains .git)
>     >   use HEAD rev + git diff to calculate checksum instead of do it
>     >   in every file, that is hackish because make some assumptions about
>     >   .git directory contents.
>     > * Variant of this patch: Make a list of VCS directories (.git, .svn,
>     >   .cvs) and take out for cksum calculations, same as before making
>     >   assumptions about the . folders content.
>     >
>     > Signed-off-by: Aníbal Limón <anibal.limon@linaro.org
>     <mailto:anibal.limon@linaro.org>>
>     > ---
>     >  lib/bb/checksum.py | 13 +++++++++++++
>     >  1 file changed, 13 insertions(+)
>     >
>     > diff --git a/lib/bb/checksum.py b/lib/bb/checksum.py
>     > index 5bc8a8fc..ee125cb5 100644
>     > --- a/lib/bb/checksum.py
>     > +++ b/lib/bb/checksum.py
>     > @@ -86,6 +86,19 @@ class FileChecksumCache(MultiProcessCache):
>     >              return checksum
>     > 
>     >          def checksum_dir(pth):
>     > +            git_dir = os.path.join(pth, '.git')
>     > +            if os.path.exists(git_dir):
>     > +                import subprocess, hashlib
>     > +                m = hashlib.md5()
>     > +                head = subprocess.check_output("cd %s && git rev-parse
>     HEAD" % pth, shell=True)
>     > +                diff = subprocess.check_output("cd %s && git diff" % pth,
>     shell=True)
>     > +                m.update(head)
>     > +                if diff:
>     > +                    m.update(diff)
>     > +
>     > +                return [(pth, m.hexdigest())]
>     > +
>     > +
>     >              # Handle directories recursively
>     >              if pth == "/":
>     >                  bb.fatal("Refusing to checksum /")
>     >
>     -- 
>     _______________________________________________
>     bitbake-devel mailing list
>     bitbake-devel@lists.openembedded.org
>     <mailto:bitbake-devel@lists.openembedded.org>
>     http://lists.openembedded.org/mailman/listinfo/bitbake-devel
>
Richard Purdie Oct. 8, 2019, 11:15 p.m.
On Tue, 2019-10-08 at 13:45 -0500, Aníbal Limón wrote:
> In some cases people/organizations are using a SRC_URI with
> file:///PATH_TO_DIR that contains a git repository for different
> reasons, it is useful when want to do an internal build without
> clone sources from outside.
> 
> This could consume a lot of CPU time because the current taskhash
> generation mechanism didn't identify that the folder is a VCS
> (git, svn, cvs) and makes the cksum for every file including the
> .git repository in this case.
> 
> There are different ways to improve the situation,
> 
> * Add protocol=gitscm in file:// SRC_URI but the taskhash is
>   calculated before the fetcher identifies the protocol, will require
>   some changes in bitbake codebase.
> * This patch: When directory is a git repository (contains .git)
>   use HEAD rev + git diff to calculate checksum instead of do it
>   in every file, that is hackish because make some assumptions about
>   .git directory contents.
> * Variant of this patch: Make a list of VCS directories (.git, .svn,
>   .cvs) and take out for cksum calculations, same as before making
>   assumptions about the . folders content.

This is an interesting one.

File checksums are added to the hashes "late" so that we don't have to
reparse entire recipes when files change. We do need a mechanism to
know when we need to reparse the checksum. I think this means you can
skip the checksum calculation for each file but you do still end up
having to stat all files in the tree separately for bitbake's tracking
and for git. We also have to notice when new files are added.

As such I'm not convinced this patch will work correctly (e.g. would it
notice if I copy in a new file to the directory untracked by git). 

A first step may be to add some further tests to bitbake-selftest to
better cover this area...

Cheers,

Richard
Nicolas Dechesne Oct. 9, 2019, 9:02 a.m.
On Wed, Oct 9, 2019 at 1:15 AM <richard.purdie@linuxfoundation.org> wrote:
>
> On Tue, 2019-10-08 at 13:45 -0500, Aníbal Limón wrote:
> > In some cases people/organizations are using a SRC_URI with
> > file:///PATH_TO_DIR that contains a git repository for different
> > reasons, it is useful when want to do an internal build without
> > clone sources from outside.
> >
> > This could consume a lot of CPU time because the current taskhash
> > generation mechanism didn't identify that the folder is a VCS
> > (git, svn, cvs) and makes the cksum for every file including the
> > .git repository in this case.
> >
> > There are different ways to improve the situation,
> >
> > * Add protocol=gitscm in file:// SRC_URI but the taskhash is
> >   calculated before the fetcher identifies the protocol, will require
> >   some changes in bitbake codebase.
> > * This patch: When directory is a git repository (contains .git)
> >   use HEAD rev + git diff to calculate checksum instead of do it
> >   in every file, that is hackish because make some assumptions about
> >   .git directory contents.
> > * Variant of this patch: Make a list of VCS directories (.git, .svn,
> >   .cvs) and take out for cksum calculations, same as before making
> >   assumptions about the . folders content.
>
> This is an interesting one.

Are you referring to the last bullet here? I suspect it's the second one.

Also to give a bit more background to everyone, as it might not be
obvious. I've seen the same pattern used several times , especially in
large/corporate deployment of OE/YP. the whole build workspace is
built as:

|- sources
|---- kernel
|---- component_A
|---- component_B
|- layers
|---- poky
|---- meta-mycompany
|-------- recipes for kernel, component_A, ...

The whole workspace is managed with a repo manifest, and the recipes
are written to use source code from the 'sources' local folder.

I am not trying to argue whether this is a good practice or not ;-)
but from the perspectives of the folks I've talked to , there are a
couple of critical advantages of doing something like that:
* it looks like Android development workflow ;-)
* it relates to the company license/legal process and review. e.g. all
the software that gets out of the company is managed by a single repo
manifest xml file
* it solves "nicely" the problem of being able to iteratively develop
using bitbake natively. e.g. "bibtake myimage" always work, and uses
local changes from 'sources'

So overall, i am being convinced that this is a valid use case for OE
end users. I don't think we can use the git:// fetcher as we need the
snapshot of the current 'sources' (with local changes), and using the
file:// fetcher has important performance impacts:
* checksum for 'each' file (which can be large, especially for kernel)
* un-expected rebuild when running repo sync, if any new git objects
are put in .git (even when no changes are made to the local worktree
of the git project).

>
> File checksums are added to the hashes "late" so that we don't have to
> reparse entire recipes when files change. We do need a mechanism to
> know when we need to reparse the checksum. I think this means you can
> skip the checksum calculation for each file but you do still end up
> having to stat all files in the tree separately for bitbake's tracking
> and for git. We also have to notice when new files are added.
>
> As such I'm not convinced this patch will work correctly (e.g. would it
> notice if I copy in a new file to the directory untracked by git).

At least I confirm that with the file:// fetcher everything works
fine, when modifying files. I don't think I have tried adding new
files. But I will try that.

Are you trying to say that to fix this properly we might need another
Fetcher , something in between file:// and git://, e.g. localgit://?
Would that make this problem easier to solve?

>
> A first step may be to add some further tests to bitbake-selftest to
> better cover this area...
>
> Cheers,
>
> Richard
>
>
>
>
>
Richard Purdie Oct. 9, 2019, 11:53 a.m.
On Wed, 2019-10-09 at 11:02 +0200, Nicolas Dechesne wrote:
> On Wed, Oct 9, 2019 at 1:15 AM <richard.purdie@linuxfoundation.org>
> wrote:
> > On Tue, 2019-10-08 at 13:45 -0500, Aníbal Limón wrote:
> > > In some cases people/organizations are using a SRC_URI with
> > > file:///PATH_TO_DIR that contains a git repository for different
> > > reasons, it is useful when want to do an internal build without
> > > clone sources from outside.
> > > 
> > > This could consume a lot of CPU time because the current taskhash
> > > generation mechanism didn't identify that the folder is a VCS
> > > (git, svn, cvs) and makes the cksum for every file including the
> > > .git repository in this case.
> > > 
> > > There are different ways to improve the situation,
> > > 
> > > * Add protocol=gitscm in file:// SRC_URI but the taskhash is
> > >   calculated before the fetcher identifies the protocol, will
> > > require
> > >   some changes in bitbake codebase.
> > > * This patch: When directory is a git repository (contains .git)
> > >   use HEAD rev + git diff to calculate checksum instead of do it
> > >   in every file, that is hackish because make some assumptions
> > > about
> > >   .git directory contents.
> > > * Variant of this patch: Make a list of VCS directories (.git,
> > > .svn,
> > >   .cvs) and take out for cksum calculations, same as before
> > > making
> > >   assumptions about the . folders content.
> > 
> > This is an interesting one.
> 
> Are you referring to the last bullet here? I suspect it's the second
> one.

Sorry I wasn't clear, I was meaning in general.

> 
> Also to give a bit more background to everyone, as it might not be
> obvious. I've seen the same pattern used several times , especially
> in
> large/corporate deployment of OE/YP. the whole build workspace is
> built as:
> 
> > - sources
> > ---- kernel
> > ---- component_A
> > ---- component_B
> > - layers
> > ---- poky
> > ---- meta-mycompany
> > -------- recipes for kernel, component_A, ...
> 
> The whole workspace is managed with a repo manifest, and the recipes
> are written to use source code from the 'sources' local folder.
> 
> I am not trying to argue whether this is a good practice or not ;-)
> but from the perspectives of the folks I've talked to , there are a
> couple of critical advantages of doing something like that:
> * it looks like Android development workflow ;-)
> * it relates to the company license/legal process and review. e.g.
> all
> the software that gets out of the company is managed by a single repo
> manifest xml file
> * it solves "nicely" the problem of being able to iteratively develop
> using bitbake natively. e.g. "bibtake myimage" always work, and uses
> local changes from 'sources'
> 
> So overall, i am being convinced that this is a valid use case for OE
> end users. I don't think we can use the git:// fetcher as we need the
> snapshot of the current 'sources' (with local changes), and using the
> file:// fetcher has important performance impacts:
> * checksum for 'each' file (which can be large, especially for
> kernel)
> * un-expected rebuild when running repo sync, if any new git objects
> are put in .git (even when no changes are made to the local worktree
> of the git project).
> 
> > File checksums are added to the hashes "late" so that we don't have
> > to
> > reparse entire recipes when files change. We do need a mechanism to
> > know when we need to reparse the checksum. I think this means you
> > can
> > skip the checksum calculation for each file but you do still end up
> > having to stat all files in the tree separately for bitbake's
> > tracking
> > and for git. We also have to notice when new files are added.
> > 
> > As such I'm not convinced this patch will work correctly (e.g.
> > would it
> > notice if I copy in a new file to the directory untracked by git).
> 
> At least I confirm that with the file:// fetcher everything works
> fine, when modifying files. I don't think I have tried adding new
> files. But I will try that.

I'd like to check that bitbake's hashes are changing correctly in the
different modification cases.

I did also wondering about this kind of trick, borrowed from stack
overflow:

if [ ! -e  .git/allfilesindex ]; then
    cp .git/index .git/allfilesindex
fi
GIT_INDEX_FILE=.git/allfilesindex git add -u; git write-tree

to get a hash which represents the state of the tree. Using git add -A
might track untracked files too.

> Are you trying to say that to fix this properly we might need another
> Fetcher , something in between file:// and git://, e.g. localgit://?
> Would that make this problem easier to solve?

I'm not sure about that. I'm mainly worried that we have reports that
file:// doesn't work correctly today before we add this kind of
complexity on top. Hence my comments about needing better tests in this
area, with and without git involved.

I'm a little bit too focused on the release to be able to think clearly
about this right now which doesn't help.

Cheers,

Richard

> > A first step may be to add some further tests to bitbake-selftest
> > to
> > better cover this area...
> > 
> > Cheers,
> > 
> > Richard
> > 
> > 
> > 
> > 
> >
Mark Hatle Oct. 10, 2019, 11:53 p.m.
On 10/9/19 4:02 AM, Nicolas Dechesne wrote:
> On Wed, Oct 9, 2019 at 1:15 AM <richard.purdie@linuxfoundation.org> wrote:
>>
>> On Tue, 2019-10-08 at 13:45 -0500, Aníbal Limón wrote:
>>> In some cases people/organizations are using a SRC_URI with
>>> file:///PATH_TO_DIR that contains a git repository for different
>>> reasons, it is useful when want to do an internal build without
>>> clone sources from outside.
>>>
>>> This could consume a lot of CPU time because the current taskhash
>>> generation mechanism didn't identify that the folder is a VCS
>>> (git, svn, cvs) and makes the cksum for every file including the
>>> .git repository in this case.
>>>
>>> There are different ways to improve the situation,
>>>
>>> * Add protocol=gitscm in file:// SRC_URI but the taskhash is
>>>   calculated before the fetcher identifies the protocol, will require
>>>   some changes in bitbake codebase.
>>> * This patch: When directory is a git repository (contains .git)
>>>   use HEAD rev + git diff to calculate checksum instead of do it
>>>   in every file, that is hackish because make some assumptions about
>>>   .git directory contents.
>>> * Variant of this patch: Make a list of VCS directories (.git, .svn,
>>>   .cvs) and take out for cksum calculations, same as before making
>>>   assumptions about the . folders content.
>>
>> This is an interesting one.
> 
> Are you referring to the last bullet here? I suspect it's the second one.
> 
> Also to give a bit more background to everyone, as it might not be
> obvious. I've seen the same pattern used several times , especially in
> large/corporate deployment of OE/YP. the whole build workspace is
> built as:
> 
> |- sources
> |---- kernel
> |---- component_A
> |---- component_B
> |- layers
> |---- poky
> |---- meta-mycompany
> |-------- recipes for kernel, component_A, ...
> 
> The whole workspace is managed with a repo manifest, and the recipes
> are written to use source code from the 'sources' local folder.

This exact situation is why (when I was at WR) we patched git-repo to allow for
bare repository checkouts.. There is no reason for source/* to be checked out,
but it does need a local clone for performance.

Once you have a local clone then you can use the mirroring to point to it and
everything works properly using git://....;protocol=file

> I am not trying to argue whether this is a good practice or not ;-)
> but from the perspectives of the folks I've talked to , there are a
> couple of critical advantages of doing something like that:

The problem with google's restrictive patch submission requirements is I was
never able to push the changes back to google to enable these bare
repositories.. but the patches have been published and are regularly updated at:

https://github.com/WindRiver-OpenSourceLabs/git-repo

Look in the master-next / master-wr-next branches for the rebased versions.

> * it looks like Android development workflow ;-)
> * it relates to the company license/legal process and review. e.g. all
> the software that gets out of the company is managed by a single repo
> manifest xml file
> * it solves "nicely" the problem of being able to iteratively develop
> using bitbake natively. e.g. "bibtake myimage" always work, and uses
> local changes from 'sources'

Yes, all of that can be accomplished with a bare checkout...  but even if you
don't do a bare checkout, you can still do this with bitbake the way it is.

Instead of pointing to sources/kernel you would point to 'sources/kernel/.git'..

PREMIRRORS_append := "\
     git://.*/.* file://${LAYERDIR}/downloads/ \n \
     git://.*/.* git://${LAYERDIR}/../../git/BASENAME;protocol=file \n \
     git://.*/.* git://${LAYERDIR}/../../git/MIRRORNAME;protocol=file \n \
"

For the bare version the above works...  (The file one is for a 'download
tarball'...)

Otherwise, '/.git' after the 'NAME' to do the checked out version.  (I've not
tried this recently but it used to work...  but dramatically increases the
length of time required for a download.)

> So overall, i am being convinced that this is a valid use case for OE
> end users. I don't think we can use the git:// fetcher as we need the
> snapshot of the current 'sources' (with local changes), and using the

I think we need to be precise on what you mean by local changes.  Local as in
not yet committed?  Or local as in on the local disk.  The later you can
definitely use the mirroring.. the former you need to use other approaches to
satisfy (externalsrc... but then you run into the any file changes and it will
rebuild.   If anything, I'd say maybe externalsrc needs to be enhanced?)

> file:// fetcher has important performance impacts:
> * checksum for 'each' file (which can be large, especially for kernel)
> * un-expected rebuild when running repo sync, if any new git objects
> are put in .git (even when no changes are made to the local worktree
> of the git project).

This is why using git://... is needed to tell the system to use the git
formatting and ignore just new files.

--Mark

>>
>> File checksums are added to the hashes "late" so that we don't have to
>> reparse entire recipes when files change. We do need a mechanism to
>> know when we need to reparse the checksum. I think this means you can
>> skip the checksum calculation for each file but you do still end up
>> having to stat all files in the tree separately for bitbake's tracking
>> and for git. We also have to notice when new files are added.
>>
>> As such I'm not convinced this patch will work correctly (e.g. would it
>> notice if I copy in a new file to the directory untracked by git).
> 
> At least I confirm that with the file:// fetcher everything works
> fine, when modifying files. I don't think I have tried adding new
> files. But I will try that.
> 
> Are you trying to say that to fix this properly we might need another
> Fetcher , something in between file:// and git://, e.g. localgit://?
> Would that make this problem easier to solve?
> 
>>
>> A first step may be to add some further tests to bitbake-selftest to
>> better cover this area...
>>
>> Cheers,
>>
>> Richard
>>
>>
>>
>>
>>
Ola X Nilsson Oct. 11, 2019, 12:36 p.m.
On Wed, Oct 09 2019, richard.purdie@linuxfoundation.org wrote:

> On Wed, 2019-10-09 at 11:02 +0200, Nicolas Dechesne wrote:
>> On Wed, Oct 9, 2019 at 1:15 AM <richard.purdie@linuxfoundation.org>
>> wrote:
>> > On Tue, 2019-10-08 at 13:45 -0500, Aníbal Limón wrote:
>> > > In some cases people/organizations are using a SRC_URI with
>> > > file:///PATH_TO_DIR that contains a git repository for different
>> > > reasons, it is useful when want to do an internal build without
>> > > clone sources from outside.
>> > > 
>> > > This could consume a lot of CPU time because the current taskhash
>> > > generation mechanism didn't identify that the folder is a VCS
>> > > (git, svn, cvs) and makes the cksum for every file including the
>> > > .git repository in this case.
>> > > 
>> > > There are different ways to improve the situation,
>> > > 
>> > > * Add protocol=gitscm in file:// SRC_URI but the taskhash is
>> > >   calculated before the fetcher identifies the protocol, will
>> > > require
>> > >   some changes in bitbake codebase.
>> > > * This patch: When directory is a git repository (contains .git)
>> > >   use HEAD rev + git diff to calculate checksum instead of do it
>> > >   in every file, that is hackish because make some assumptions
>> > > about
>> > >   .git directory contents.
>> > > * Variant of this patch: Make a list of VCS directories (.git,
>> > > .svn,
>> > >   .cvs) and take out for cksum calculations, same as before
>> > > making
>> > >   assumptions about the . folders content.
>> > 
>> > This is an interesting one.
>> 
>> Are you referring to the last bullet here? I suspect it's the second
>> one.
>
> Sorry I wasn't clear, I was meaning in general.
>
>> 
>> Also to give a bit more background to everyone, as it might not be
>> obvious. I've seen the same pattern used several times , especially
>> in
>> large/corporate deployment of OE/YP. the whole build workspace is
>> built as:
>> 
>> > - sources
>> > ---- kernel
>> > ---- component_A
>> > ---- component_B
>> > - layers
>> > ---- poky
>> > ---- meta-mycompany
>> > -------- recipes for kernel, component_A, ...
>> 
>> The whole workspace is managed with a repo manifest, and the recipes
>> are written to use source code from the 'sources' local folder.
>> 
>> I am not trying to argue whether this is a good practice or not ;-)
>> but from the perspectives of the folks I've talked to , there are a
>> couple of critical advantages of doing something like that:
>> * it looks like Android development workflow ;-)
>> * it relates to the company license/legal process and review. e.g.
>> all
>> the software that gets out of the company is managed by a single repo
>> manifest xml file
>> * it solves "nicely" the problem of being able to iteratively develop
>> using bitbake natively. e.g. "bibtake myimage" always work, and uses
>> local changes from 'sources'
>> 
>> So overall, i am being convinced that this is a valid use case for OE
>> end users. I don't think we can use the git:// fetcher as we need the
>> snapshot of the current 'sources' (with local changes), and using the
>> file:// fetcher has important performance impacts:
>> * checksum for 'each' file (which can be large, especially for
>> kernel)
>> * un-expected rebuild when running repo sync, if any new git objects
>> are put in .git (even when no changes are made to the local worktree
>> of the git project).
>> 
>> > File checksums are added to the hashes "late" so that we don't have
>> > to
>> > reparse entire recipes when files change. We do need a mechanism to
>> > know when we need to reparse the checksum. I think this means you
>> > can
>> > skip the checksum calculation for each file but you do still end up
>> > having to stat all files in the tree separately for bitbake's
>> > tracking
>> > and for git. We also have to notice when new files are added.
>> > 
>> > As such I'm not convinced this patch will work correctly (e.g.
>> > would it
>> > notice if I copy in a new file to the directory untracked by git).
>> 
>> At least I confirm that with the file:// fetcher everything works
>> fine, when modifying files. I don't think I have tried adding new
>> files. But I will try that.
>
> I'd like to check that bitbake's hashes are changing correctly in the
> different modification cases.
>
> I did also wondering about this kind of trick, borrowed from stack
> overflow:
>
> if [ ! -e  .git/allfilesindex ]; then
>     cp .git/index .git/allfilesindex
> fi
> GIT_INDEX_FILE=.git/allfilesindex git add -u; git write-tree
>
> to get a hash which represents the state of the tree. Using git add -A
> might track untracked files too.

This is what externalsrc.bbclass does to detect changes.

/Ola

>> Are you trying to say that to fix this properly we might need another
>> Fetcher , something in between file:// and git://, e.g. localgit://?
>> Would that make this problem easier to solve?
>
> I'm not sure about that. I'm mainly worried that we have reports that
> file:// doesn't work correctly today before we add this kind of
> complexity on top. Hence my comments about needing better tests in this
> area, with and without git involved.
>
> I'm a little bit too focused on the release to be able to think clearly
> about this right now which doesn't help.
>
> Cheers,
>
> Richard
>
>> > A first step may be to add some further tests to bitbake-selftest
>> > to
>> > better cover this area...
>> > 
>> > Cheers,
>> > 
>> > Richard
>> > 
>> > 
>> > 
>> > 
>> >
Nicolas Dechesne Nov. 6, 2019, 4:50 a.m.
hi,

On Tue, Oct 8, 2019 at 8:42 PM Aníbal Limón <anibal.limon@linaro.org> wrote:
>
> In some cases people/organizations are using a SRC_URI with
> file:///PATH_TO_DIR that contains a git repository for different
> reasons, it is useful when want to do an internal build without
> clone sources from outside.
>
> This could consume a lot of CPU time because the current taskhash
> generation mechanism didn't identify that the folder is a VCS
> (git, svn, cvs) and makes the cksum for every file including the
> .git repository in this case.
>
> There are different ways to improve the situation,
>
> * Add protocol=gitscm in file:// SRC_URI but the taskhash is
>   calculated before the fetcher identifies the protocol, will require
>   some changes in bitbake codebase.
> * This patch: When directory is a git repository (contains .git)
>   use HEAD rev + git diff to calculate checksum instead of do it
>   in every file, that is hackish because make some assumptions about
>   .git directory contents.
> * Variant of this patch: Make a list of VCS directories (.git, .svn,
>   .cvs) and take out for cksum calculations, same as before making
>   assumptions about the . folders content.

I've discussed with Khem and Richard today (@ELCE) about this patch.
We kind of agreed that the current approach is not really good since
the local fetcher isn't supposed to 'deal' with scm commands. however
we agreed that the last variant proposed above might be a much better
approach. The idea would be to exclude VCS directories from the
checksum computation. It could potentially be extended in a slightly
more generic way, like using a variable to specify a list of
directories to exclude, which could be unset by default, and set in oe
core as
BB_LOCAL_DIRS_EXCLUDE = ".git .cvs .svn"

Anibal: do you think you can give it a try?


>
> Signed-off-by: Aníbal Limón <anibal.limon@linaro.org>
> ---
>  lib/bb/checksum.py | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
>
> diff --git a/lib/bb/checksum.py b/lib/bb/checksum.py
> index 5bc8a8fc..ee125cb5 100644
> --- a/lib/bb/checksum.py
> +++ b/lib/bb/checksum.py
> @@ -86,6 +86,19 @@ class FileChecksumCache(MultiProcessCache):
>              return checksum
>
>          def checksum_dir(pth):
> +            git_dir = os.path.join(pth, '.git')
> +            if os.path.exists(git_dir):
> +                import subprocess, hashlib
> +                m = hashlib.md5()
> +                head = subprocess.check_output("cd %s && git rev-parse HEAD" % pth, shell=True)
> +                diff = subprocess.check_output("cd %s && git diff" % pth, shell=True)
> +                m.update(head)
> +                if diff:
> +                    m.update(diff)
> +
> +                return [(pth, m.hexdigest())]
> +
> +
>              # Handle directories recursively
>              if pth == "/":
>                  bb.fatal("Refusing to checksum /")
> --
> 2.23.0
>
Aníbal Limón Nov. 13, 2019, 7:44 p.m.
On Tue, 5 Nov 2019 at 22:50, Nicolas Dechesne <nicolas.dechesne@linaro.org>
wrote:

> hi,
>
> On Tue, Oct 8, 2019 at 8:42 PM Aníbal Limón <anibal.limon@linaro.org>
> wrote:
> >
> > In some cases people/organizations are using a SRC_URI with
> > file:///PATH_TO_DIR that contains a git repository for different
> > reasons, it is useful when want to do an internal build without
> > clone sources from outside.
> >
> > This could consume a lot of CPU time because the current taskhash
> > generation mechanism didn't identify that the folder is a VCS
> > (git, svn, cvs) and makes the cksum for every file including the
> > .git repository in this case.
> >
> > There are different ways to improve the situation,
> >
> > * Add protocol=gitscm in file:// SRC_URI but the taskhash is
> >   calculated before the fetcher identifies the protocol, will require
> >   some changes in bitbake codebase.
> > * This patch: When directory is a git repository (contains .git)
> >   use HEAD rev + git diff to calculate checksum instead of do it
> >   in every file, that is hackish because make some assumptions about
> >   .git directory contents.
> > * Variant of this patch: Make a list of VCS directories (.git, .svn,
> >   .cvs) and take out for cksum calculations, same as before making
> >   assumptions about the . folders content.
>
> I've discussed with Khem and Richard today (@ELCE) about this patch.
> We kind of agreed that the current approach is not really good since
> the local fetcher isn't supposed to 'deal' with scm commands. however
> we agreed that the last variant proposed above might be a much better
> approach. The idea would be to exclude VCS directories from the
> checksum computation. It could potentially be extended in a slightly
> more generic way, like using a variable to specify a list of
> directories to exclude, which could be unset by default, and set in oe
> core as
> BB_LOCAL_DIRS_EXCLUDE = ".git .cvs .svn"
>
> Anibal: do you think you can give it a try?
>

It sounds good, this new variable will be only used for cksum exclude?, if
yes may be to change
for more specific variable  BB_LOCAL_DIRS_TASKHASH_EXCLUDE.

Regards,
Anibal


>
>
> >
> > Signed-off-by: Aníbal Limón <anibal.limon@linaro.org>
> > ---
> >  lib/bb/checksum.py | 13 +++++++++++++
> >  1 file changed, 13 insertions(+)
> >
> > diff --git a/lib/bb/checksum.py b/lib/bb/checksum.py
> > index 5bc8a8fc..ee125cb5 100644
> > --- a/lib/bb/checksum.py
> > +++ b/lib/bb/checksum.py
> > @@ -86,6 +86,19 @@ class FileChecksumCache(MultiProcessCache):
> >              return checksum
> >
> >          def checksum_dir(pth):
> > +            git_dir = os.path.join(pth, '.git')
> > +            if os.path.exists(git_dir):
> > +                import subprocess, hashlib
> > +                m = hashlib.md5()
> > +                head = subprocess.check_output("cd %s && git rev-parse
> HEAD" % pth, shell=True)
> > +                diff = subprocess.check_output("cd %s && git diff" %
> pth, shell=True)
> > +                m.update(head)
> > +                if diff:
> > +                    m.update(diff)
> > +
> > +                return [(pth, m.hexdigest())]
> > +
> > +
> >              # Handle directories recursively
> >              if pth == "/":
> >                  bb.fatal("Refusing to checksum /")
> > --
> > 2.23.0
> >
>