Update from obsidian - thinkpad

Affected files:
.obsidian/plugins/obsidian-omnivore/data.json
Read Later/2024-02-18 - Git Tips 1- Oldies but Goodies.md
Read Later/2024-02-18 - Git Tips 2- New Stuff in Git.md
Read Later/2024-02-18 - Git Tips 3- Really Large Repositories.md
Read Later/2024-03-26 - How Do Open Source Software Lifecycles Work-.md
Read Later/2024-04-18 - Unix sockets, the basics in Rust - Emmanuel Bosquet.md
Read Later/2024-05-10 - Bullet Journal in 5 Minutes a Day (for busy people).md
Alexander Navarro 2024-05-12 12:31:31 -04:00
parent a6c324506a
commit 993f1f4795
7 changed files with 1166 additions and 11 deletions

---
id: 0f04cdaa-c871-4ba1-a882-7aecf65a2505
title: |
Git Tips 3: Really Large Repositories
status: ARCHIVED
tags:
- read-later
- dotfiles
- git
date_added: 2024-02-18 11:02:27
url_omnivore: |
https://omnivore.app/me/git-tips-3-really-large-repositories-18dbc867c96
url_original: |
https://blog.gitbutler.com/git-tips-3-really-large-repositories/
---
# Git Tips 3: Really Large Repositories
## Highlights
Git has had shallow clones. You could almost always run something like `git clone --depth=1` to get only the last commit and the objects it needs and then `git fetch --unshallow` to get the rest of the history later if needed. But it did break lots of things. You can't `blame`, you can't `log`, etc.
[source](https://omnivore.app/me/git-tips-3-really-large-repositories-18dbc867c96#d319f349-d2ea-4aed-9558-d67e4b708d74)
---
If you want to do a blobless clone, you just pass `--filter=blob:none`
[source](https://omnivore.app/me/git-tips-3-really-large-repositories-18dbc867c96#d296df77-d1f6-4a1b-99c6-9716824f2ee2)
---
If each of those subdirectories was huge, this could be annoying to manage. Instead, we can use sparse checkouts to filter the checkouts to just specified directories.
To do this, we run `git sparse-checkout set [dir1] [dir2] ...`
[source](https://omnivore.app/me/git-tips-3-really-large-repositories-18dbc867c96#d92785cc-9dde-46f8-9764-af2195722289)
---
[Scalar](https://git-scm.com/docs/scalar?ref=blog.gitbutler.com) is mostly just used to clone with the correct defaults and config settings (blobless clone, no checkout by default, setting up maintenance properly, etc).
[source](https://omnivore.app/me/git-tips-3-really-large-repositories-18dbc867c96#96d0ab7b-1bf9-4b10-a08a-3d13c01bf56c)
---
## Original
![Git Tips 3: Really Large Repositories](https://proxy-prod.omnivore-image-cache.app/0x0,siZlz2SPUjb175XwxdyoxP42iDGrPNR_i-XDguj36PyY/https://blog.gitbutler.com/content/images/size/w500/2024/02/CleanShot-2024-02-08-at-15.55.00@2x.png)
Did you know that Git can handle enormous monorepos like Windows? Let's take a look at how.
In the third and final installment of our [Git Tips series](https://blog.gitbutler.com/git-tips-and-tricks/), we're going to talk about how well Git now handles **very large** repositories and monorepos.
Do you want to use vanilla Git today to manage a 300GB repo of 3.5M files receiving a push every 20 seconds from 4000 developers without problems? **Read on!**
Here is our blog agenda. Our blogenda.
* [Prefetching](#prefetching)
* [Commit Graph](#commit-graph)
* [Filesystem Monitor](#filesystem-monitor)
* [Partial Cloning](#partial-cloning)
* [Sparse Checkouts](#sparse-checkouts)
* [Scalar](#scalar)
## First, Let's Thank Windows
Before we get started though, the first thing we have to do is thank Microsoft for nearly all of this.
In 2017, Microsoft successfully moved the Windows codebase to Git. Brian Harry wrote a really great blog post about it called [The Largest Git Repo on the Planet](https://devblogs.microsoft.com/bharry/the-largest-git-repo-on-the-planet/?ref=blog.gitbutler.com) that you should read if you're interested, but the size and scope of this repository is astounding.
* **3.5M** files
  * for reference, the Linux kernel is about 80k files, or 2% of that
* **300GB** repository (vs. ~4.5GB for the Linux kernel)
* **4,000** active developers
* **8,421** pushes per day (on average)
* **4,352** active topic branches
In order to get that to work at all, Microsoft had a lot of work to do. With vanilla Git at the time, a lot of commands (e.g., `git status`) would take hours, if they finished at all. They needed to make every command nearly as fast as Source Depot had been.
The first solution to this problem was a new project called [VFS for Git](https://github.com/microsoft/VFSForGit?ref=blog.gitbutler.com) which was a virtual filesystem layer that did virtual checkouts and then requested files from a central server on demand.
Eventually they moved more and more of the solutions they developed to the [Scalar](https://github.com/microsoft/scalar?ref=blog.gitbutler.com) project, which got rid of the virtualization layer, and would instead request file contents on checkout rather than on demand.
Then they moved everything, piece by piece, into the [Microsoft Git](https://github.com/microsoft/git?ref=blog.gitbutler.com) fork and then finally every part of _that_ was moved into [core Git](https://git-scm.com/docs/scalar?ref=blog.gitbutler.com).
So, as promised, if you want to use Git out of the box today to manage a 300GB repo of 3.5M files receiving a push every 20 seconds from 4000 developers, you can _100%_ do so.
Let's get into everything they added for us.
## Prefetching
So, in the last blog post I talked about how running `git maintenance` on a repository will run prefetching and maintain your commit graph every hour. Let's cover the first of those. What is "prefetching"?
One of the things that the Windows devs found annoying was that fetches would often be slow because there was _so much_ activity going on all the time. Whenever they fetched, they had to download _all_ the data pushed since the last time they had manually fetched.
So in the cron job they added something called "prefetching", which essentially runs a fetch command for you automatically every hour.
However, it does not update your remote references like a normal fetch would; instead it populates a special `refs/prefetch` area of your references.
![](https://proxy-prod.omnivore-image-cache.app/2000x1124,s5PGJtd3NbLb1LIeeq_YWX0VJY09p-71Zf0LwHdEMMsg/https://blog.gitbutler.com/content/images/2024/02/image-3.png)
These references aren't shown normally; even if you run something like `git log --all`, they stay hidden from you. However, they are used in the negotiation with the remote server, which means that if you have this turned on, whenever you go to fetch, your computer is never more than an hour of pushes behind, data-wise.
Basically it makes manual fetches fast.
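If you want this on one of your own repositories, a minimal sketch looks something like the following (it assumes a reasonably recent Git that ships the `prefetch` maintenance task):

```shell
# Register this repo for background maintenance (hourly prefetch, commit-graph, etc.)
git maintenance start

# Or run a single prefetch by hand to see what it does
git maintenance run --task=prefetch

# The prefetched refs live under refs/prefetch/ and stay out of normal log output
git for-each-ref refs/prefetch
```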
![](https://proxy-prod.omnivore-image-cache.app/1596x888,sb4Lbt2QVISxOxom_DRM5UVNMt1lEn4ZkkAm8Cl062eg/https://blog.gitbutler.com/content/images/2024/02/CleanShot-2024-02-08-at-14.41.05@2x.png)
Joke stolen from Martin Woodward :)
## Commit Graph
The other thing that `git maintenance` does every hour is update your commit graph data. What does that mean exactly?
Well, essentially, it makes walking your commit history faster.
Instead of opening up one commit object at a time to see what its parent is, then opening up that object to see what its parent is, and so on, the commit graph is basically an index of all that information that can be read quickly in one go. This makes things like `git log --graph` or `git branch --contains` much, much faster. For most repositories this probably isn't much of a problem, but when you start getting into millions of commits, it makes a huge difference.
Here's a benchmark of some log-related subcommands run on the Linux kernel codebase with and without the commit graph data (from the [GitHub blog](https://github.blog/2022-08-30-gits-database-internals-ii-commit-history-queries/?ref=blog.gitbutler.com#the-commit-graph)):
| **Command** | **Without commit-graph** | **With commit-graph** |
| ------------------------- | ----------------------- | -------------------- |
| git rev-list v5.19 | 6.94s | 0.98s |
| git rev-list v5.0..v5.19 | 2.51s | 0.30s |
| git merge-base v5.0 v5.19 | 2.59s | 0.24s |
Here is a quick test that I did on the same Linux repo running `git log --graph --oneline` before and after writing a commit graph file.
![](https://proxy-prod.omnivore-image-cache.app/2000x1097,sTzCPFXduXYtUdrHYMfcLWmrV9-98RbuoCIsu4c0788Q/https://blog.gitbutler.com/content/images/2024/02/CleanShot-2024-02-08-at-14.52.44@2x.png)
Again, if you have your `maintenance` cron jobs running, this is just magically done for you; you don't really have to do anything explicit.
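If you don't have the maintenance jobs running, you can also build and maintain the graph yourself; a minimal sketch:

```shell
# Write a commit-graph file covering all reachable commits
git commit-graph write --reachable

# Optionally keep it up to date whenever you fetch
git config fetch.writeCommitGraph true
```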
## Filesystem Monitor
One of the things that VFS for Git needed was a filesystem monitor so that it could detect when virtual file contents were being requested and fetch them from a central server if needed.
This monitor was eventually used to speed up the `git status` command by updating the index based on filesystem modification events rather than running `stat` on every file every time you run it.
While the former became unnecessary when the virtualization layer was abandoned, the latter was integrated into Git core. If you want much, much faster `git status` runs for very large working directories, the new Git filesystem monitor is a lifesaver.
Basically you just set this config setting:
```shell
$ git config core.fsmonitor true
```
What this will do is add a setting that the `git status` command will see when it runs, indicating that it should be using the [fsmonitor-daemon](https://git-scm.com/docs/git-fsmonitor--daemon?ref=blog.gitbutler.com). If this daemon is not running, it will start it.
![](https://proxy-prod.omnivore-image-cache.app/2000x1183,sygera0cWIyNzCE8oYQ1maIUOqQuX7xqwA9VIR7-NzmI/https://blog.gitbutler.com/content/images/2024/02/CleanShot-2024-02-08-at-14.59.36@2x.png)
fsmonitor on Chromium makes status 20x faster
So, the first time you run `status` after setting this config value, it won't be much faster. But every time after that it will be massively faster.
Again, there's nothing really to explicitly do after setting the value, things just get faster.
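If you want to check that the daemon actually came up, a minimal sketch (assuming Git 2.37 or later, where the built-in daemon ships):

```shell
# Report whether the filesystem monitor daemon is running for this repo
git fsmonitor--daemon status

# You can also start or stop it explicitly; `git status` will start it on demand
git fsmonitor--daemon start
git fsmonitor--daemon stop
```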
## Partial Cloning
Another big issue with large repositories is clone size. As you probably know, by default Git will fetch everything. You don't even need to have a 300GB repository for this to be a problem. [Linux](https://github.com/torvalds/linux?ref=blog.gitbutler.com) is over 4GB, [Chromium](https://github.com/chromium/chromium?ref=blog.gitbutler.com) is over 20GB. A full Chromium clone can easily take an hour, even over a pretty fast connection.
Now, for a long time ==Git has had shallow clones. You could almost always run something like `git clone --depth=1` to get only the last commit and the objects it needs and then `git fetch --unshallow` to get the rest of the history later if needed. But it did break lots of things. You can't `blame`, you can't `log`, etc.==
However, Git now has both blobless and treeless clones. So you do get the whole history (all of the commits), but you don't have the actual file contents locally. Let's ignore treeless clones for now because they're not generally recommended; blobless clones are.
![](https://proxy-prod.omnivore-image-cache.app/2000x841,s5O63kDCQr5bDr0tglWOF_SKPPIWXMui5ZnxgxZCTCTU/https://blog.gitbutler.com/content/images/2024/02/CleanShot-2024-02-08-at-15.12.29@2x.png)
A full clone of the Linux repository is 4.6G, or (for me) a 20 min process
==If you want to do a blobless clone, you just pass `--filter=blob:none`==
This will change the process a little. It will download all the commits and trees, which in the case of cloning the Linux kernel reduces 4.6G to 1.4G, and it will then do a _second_ object fetch for just the blobs that it needs in order to populate the checkout.
![](https://proxy-prod.omnivore-image-cache.app/2000x949,shA_6-2W1enm5l40LjCQ2iZ15sTdVVB7NVv3Qjmiv-O0/https://blog.gitbutler.com/content/images/2024/02/CleanShot-2024-02-08-at-15.14.06@2x.png)
So you can see that instead of 20 minutes for the clone, it took me 4.5 minutes. You can also see that it did two fetches: one for the 1.4GB of commit and tree data, and a second for the 243MB of files it needs for my local checkout.
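For reference, the commands behind a test like that look roughly like this (the GitHub mirror URL is just used as an example):

```shell
# Blobless clone: full history, but file contents are fetched on demand
git clone --filter=blob:none https://github.com/torvalds/linux.git

# Treeless clone: smaller still, but generally not recommended
git clone --filter=tree:0 https://github.com/torvalds/linux.git
```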
Now, there are downsides to this too. If you want to run a command that needs data that is not there, Git will have to go back to the server and request those objects. Luckily, it does this on-demand as it needs the objects, but it can make something like `blame` do a bunch of roundtrips.
![](https://proxy-prod.omnivore-image-cache.app/2000x1130,sQnjRfPMnR0N-snkIfA9TVQUCVl8twPR_ReMNUGlvA_o/https://blog.gitbutler.com/content/images/2024/02/CleanShot-2024-02-08-at-15.17.14@2x.png)
round and round we go
In the case of Linux, my tests showed that a file blame that normally takes about 4 seconds now takes 45 seconds. But that's only the first time you need to do it.
## Sparse Checkouts
The last big thing to look at is not only useful for large repositories, but specifically for monorepos. That is, repositories that contain multiple projects in subdirectories.
For example, at GitButler, our web services are in a monorepo, with each service we run on AWS in a subdirectory.
```shell
tree -L 1
.
├── Gemfile
├── Gemfile.lock
├── README.md
├── auth-proxy
├── butler
├── chain
├── check.rb
├── copilot
└── git
6 directories, 4 files
```
==If each of those subdirectories was huge, this could be annoying to manage. Instead, we can use sparse checkouts to filter the checkouts to just specified directories.==
==To do this, we run `git sparse-checkout set [dir1] [dir2] ...`==
```shell
git sparse-checkout set butler copilot
tree -L 1
.
├── Gemfile
├── Gemfile.lock
├── README.md
├── butler
├── check.rb
└── copilot
```
So we still have the top-level files, but only the two subdirectories that we specified. This is called "cone mode" and tends to be pretty fast. It also makes `status` and related commands faster because there are fewer files to care about. You can also, however, set patterns rather than subdirectories, but it's more complicated.
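As a rough sketch of the pattern form (using the same example directories as above; non-cone mode takes gitignore-style patterns):

```shell
# Keep top-level files, drop all directories, then re-include the two we want
git sparse-checkout set --no-cone '/*' '!/*/' '/butler/' '/copilot/'
```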
Here's a local test I did with the Chromium repository:
```shell
time git status
On branch main
Your branch is up to date with 'origin/main'.
nothing to commit, working tree clean
real 0m5.931s
git sparse-checkout set build base
time git status
On branch main
Your branch is up to date with 'origin/main'.
You are in a sparse checkout with 2% of tracked files present.
nothing to commit, working tree clean
real 0m0.386s
```
This is without the `fsmonitor` stuff. You can see that `status` went from taking about 6 seconds down to 0.4 seconds because there just aren't as many files.
If you're using large monorepos, this means you can do a blobless clone to get a much smaller local database (you can also run `clone --no-checkout` to skip the initial checkout), then do a `sparse-checkout` to filter to the directories you need, and everything is massively faster.
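Put together, that workflow looks roughly like this (the URL, directory names, and branch name are placeholders):

```shell
# Blobless clone without populating the working tree
git clone --filter=blob:none --no-checkout https://example.com/big/monorepo.git
cd monorepo

# Restrict the working tree to the directories you actually need (cone mode)
git sparse-checkout init --cone
git sparse-checkout set butler copilot

# Now populate the (sparse) working tree
git checkout main
```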
## Scalar
Finally, Git now (since Oct 2022, Git 2.38) ships with an _alternative_ command line invocation that wraps some of this stuff.
This command is called `scalar`. Just go ahead and type it:
```shell
scalar
usage: scalar [-C <directory>] [-c <key>=<value>] <command> [<options>]
Commands:
clone
list
register
unregister
run
reconfigure
delete
help
version
diagnose
```
==[Scalar](https://git-scm.com/docs/scalar?ref=blog.gitbutler.com) is mostly just used to clone with the correct defaults and config settings (blobless clone, no checkout by default, setting up maintenance properly, etc).==
If you are managing large repositories, cloning with this negates the need to run `git maintenance start`, pass the `--no-checkout` flag, remember `--filter=tree:0`, and whatnot.
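In practice that might look like this (the URL is a placeholder):

```shell
# Clone with the recommended large-repo defaults: partial clone, sparse checkout,
# and background maintenance all set up for you
scalar clone https://example.com/big/monorepo.git

# Or register an existing clone so it picks up the same maintenance settings
scalar register
```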
Now you're ready to scale! ...-ar.
## Some More Reading
If you want to read about all of this in great detail, GitHub has done an amazing job covering lots of this too:
* [The Commit Graph](https://github.blog/2022-08-30-gits-database-internals-ii-commit-history-queries/?ref=blog.gitbutler.com#the-commit-graph)
* [Sparse checkout](https://github.blog/2020-01-17-bring-your-monorepo-down-to-size-with-sparse-checkout/?ref=blog.gitbutler.com)
* [Filesystem Monitor](https://github.blog/2022-06-29-improve-git-monorepo-performance-with-a-file-system-monitor/?ref=blog.gitbutler.com)
* [Sparse index](https://github.blog/2021-11-10-make-your-monorepo-feel-small-with-gits-sparse-index/?ref=blog.gitbutler.com)
* [Partial and shallow clone](https://github.blog/2020-12-21-get-up-to-speed-with-partial-clone-and-shallow-clone/?ref=blog.gitbutler.com)
* [The Story of Scalar](https://github.blog/2022-10-13-the-story-of-scalar/?ref=blog.gitbutler.com)
There is also a ton more fascinating information on this from [Derrick Stolee](https://stolee.dev/?ref=blog.gitbutler.com), who did a lot of work on these projects.
Ok, that's it for our Git Tips series! Hope you enjoyed it and let us know in Discord if you have any questions or comments, or would like to see us do any other topics in Git land.
Thanks!