How are Git’s objects linked together?

Introduction

Most Git users are familiar with how each commit refers to one or more parents, except for the initial commit that has no parents.

As a result, chains of commits are created, which we call branches. This also produces a structure known as a directed acyclic graph (DAG): a set of nodes connected to other nodes by directed edges, with no cycles possible.

However, most Git users don’t realize that Git’s other core objects (trees and blobs) are interconnected and linked together using similar types of references.

In this article, we’ll discuss how Git commits connect to Git trees, and further how these trees connect to blobs that form a connected network of objects. This is what allows Git to do all the amazing tasks we developers use every day.

Git commits connect to root trees

As a casual Git user, you may not have known that every commit you make refers to a root tree. We have a whole article on trees if you need a refresher.

A root tree is just the top-level tree object that a commit points to. Each commit points to one and only one root tree, which identifies the collection of objects of which the commit is a “snapshot”. The root tree is built from the contents of the staging area at the time of the commit.

Like all Git objects, each root tree is stored in the Git object database and identified by a unique SHA-1 hash value of its contents. This SHA-1 value is also known as the object ID. Each commit refers to its root tree by recording the object ID of the tree along with the contents of the commit.
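To make the idea of an object ID concrete, here is a minimal Python sketch of how such an ID can be computed. It assumes a repository using the default SHA-1 object format, and it relies on the fact that Git hashes a short header (the object type, the content size, and a null byte) together with the content. The function name git_object_id is our own and is not part of Git.

```python
import hashlib


def git_object_id(obj_type: str, content: bytes) -> str:
    """Return the SHA-1 object ID for an object of the given type and content."""
    # Git hashes "<type> <size>\0" followed by the raw content bytes.
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    # Should match the output of: printf 'hello world\n' | git hash-object --stdin
    print(git_object_id("blob", b"hello world\n"))
```

Because the ID depends only on the type and the bytes, two objects with identical content always receive the same ID.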

A commit object has the format shown below, which includes the object ID of the root tree, along with the rest of the commit data:

commit <size-of-commit-data-in-bytes>'\0'
<root-tree-SHA1-hash>
<parent-1-commit-id>
<parent-2-commit-id>
...
<parent-N-commit-id>
author ID email date
committer ID email date

user comment

By storing the object ID of the root tree as part of the commit, Git can figure out what file content is associated with that commit, as we’ll see later.
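As a rough illustration, the following Python sketch pulls the root tree ID and the parent IDs out of a decompressed commit object. In the objects Git actually writes, the tree and parent lines begin with the keywords tree and parent, which the sketch relies on. The sample commit, its placeholder IDs, and the function names are invented for this example.

```python
def parse_commit(raw: bytes) -> dict:
    """Split a decompressed commit object into header fields and message."""
    header, _, body = raw.partition(b"\x00")      # b"commit <size>" | the rest
    obj_type, _, size = header.partition(b" ")
    if obj_type != b"commit" or int(size) != len(body):
        raise ValueError("not a well-formed commit object")

    fields, _, message = body.partition(b"\n\n")  # a blank line ends the headers
    commit = {"tree": None, "parents": [], "message": message.decode()}
    for line in fields.decode().split("\n"):
        if line.startswith(" "):
            continue  # continuation of a multi-line header; ignored in this sketch
        key, _, value = line.partition(" ")
        if key == "tree":
            commit["tree"] = value
        elif key == "parent":
            commit["parents"].append(value)
        else:
            commit[key] = value                   # author, committer, ...
    return commit


def example_commit() -> bytes:
    """Build a made-up commit object; the IDs are placeholders, not real objects."""
    body = (
        f"tree {'a' * 40}\n"
        f"parent {'b' * 40}\n"
        "author A U Thor <author@example.com> 1700000000 +0000\n"
        "committer A U Thor <author@example.com> 1700000000 +0000\n"
        "\n"
        "Initial import\n"
    ).encode()
    return b"commit " + str(len(body)).encode() + b"\x00" + body


if __name__ == "__main__":
    parsed = parse_commit(example_commit())
    print(parsed["tree"], parsed["parents"])
```

A merge commit would simply contribute several parent lines, so the parents list grows while the tree field stays a single ID.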

Git root trees connect to blobs and subtrees

Since root trees in Git are just regular trees, they refer to the set of blobs and subtrees contained in a commit. Now would be a good time to brush up on blobs, if you need a refresher.

Here is the format of a tree as stored by Git:

tree <size-of-tree-in-bytes>'\0'
<file-1-permission> <file-1-name>'\0'<file-1-blob-object-id>
<file-2-permission> <file-2-name>'\0'<file-2-blob-object-id>
...
<file-n-permission> <file-n-name>'\0'<file-n-blob-object-id>

As you can see, a tree is just a list of file permissions, file names, and their corresponding blob object IDs. The tree helps Git understand which blobs are included in each commit.
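The tree layout can be decoded with a few lines of Python. This is a sketch under the assumption of SHA-1 object IDs, which Git stores inside a tree as 20 raw bytes rather than as 40 hex characters. The function names and the sample entries are made up for illustration.

```python
from __future__ import annotations


def parse_tree(body: bytes) -> list[tuple[str, str, str]]:
    """Parse the body of a tree object (the bytes after the 'tree <size>' header and its NUL)."""
    entries = []
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)        # the mode ends at the first space
        nul = body.index(b"\x00", space)     # the name ends at the next NUL byte
        mode = body[pos:space].decode("ascii")
        name = body[space + 1:nul].decode("utf-8")
        object_id = body[nul + 1:nul + 21].hex()  # 20 raw bytes shown as hex
        entries.append((mode, name, object_id))
        pos = nul + 21
    return entries


def build_entry(mode: str, name: str, object_id_hex: str) -> bytes:
    """Encode a single tree entry; a helper for building the sample below."""
    return mode.encode() + b" " + name.encode() + b"\x00" + bytes.fromhex(object_id_hex)


def example_tree_body() -> bytes:
    """Two illustrative entries: one file (mode 100644) and one subtree (mode 40000)."""
    return (
        build_entry("100644", "README.md", "3b18e512dba79e4c8300dd08aeb37f8e728b8dad")
        + build_entry("40000", "src", "4b825dc642cb6eb9a060e54bf8d69288fbee4904")
    )


if __name__ == "__main__":
    for mode, name, object_id in parse_tree(example_tree_body()):
        print(mode, name, object_id)
```

The parser skips exactly 20 bytes for each ID instead of searching for a delimiter, since the raw ID bytes may themselves contain spaces or null bytes.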

Git subtree nesting

Note that trees can also refer to other trees (known as subtrees), which in turn refer to other blobs and even deeper subtrees. You can think of it as a sort of hierarchy or family tree where all the files and folders in a commit are connected together.
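The nesting can be pictured as a recursive walk. The Python sketch below uses a toy in-memory object store instead of a real repository; the short IDs, file names, and the OBJECTS dictionary are all hypothetical stand-ins for Git’s object database.

```python
# Toy object store: object ID -> ("tree", entries) or ("blob", content).
# The short IDs and file names are made up for illustration.
OBJECTS = {
    "t-root": ("tree", [("100644", "README.md", "b-readme"),
                        ("40000", "src", "t-src")]),
    "t-src": ("tree", [("100644", "main.py", "b-main"),
                       ("40000", "util", "t-util")]),
    "t-util": ("tree", [("100644", "io.py", "b-io")]),
    "b-readme": ("blob", b"# demo\n"),
    "b-main": ("blob", b"print('hi')\n"),
    "b-io": ("blob", b""),
}


def list_files(tree_id, prefix=""):
    """Yield (path, blob_id) for every file reachable from the given tree."""
    kind, entries = OBJECTS[tree_id]
    if kind != "tree":
        raise ValueError(f"{tree_id} is not a tree")
    for _mode, name, object_id in entries:
        path = prefix + name
        if OBJECTS[object_id][0] == "tree":
            # A subtree: descend and extend the path prefix.
            yield from list_files(object_id, path + "/")
        else:
            yield path, object_id


if __name__ == "__main__":
    for path, blob_id in list_files("t-root"):
        print(path, blob_id)
```

Starting from a root tree, the walk reaches every file in the snapshot, which is roughly how the full set of paths for a commit can be reconstructed.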

Git stores a brand new blob for each version of each file

Many Git users think that Git stores a series of differences or changes to files, which are applied or reversed incrementally to restore each version of a file. However, this is not true at all. Although legacy version control systems like SCCS and RCS did things this way, it turns out to be very slow to apply all these differences when the history gets very large.

For this reason, Git stores an entirely new blob for each version of each file you commit. This makes recovering the contents of the file as simple and fast as accessing the contents of the corresponding blob.

This may sound like a waste of storage, especially if some files only have very small changes. But Git uses several layers of compression to compensate for this. The zlib library is used to compress each individual Git object. Furthermore, individual objects are packed together into packfiles. This allows similar objects (such as blobs originating from files with only minor changes) to be efficiently stored and transferred across networks.
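The per-object zlib layer is easy to try out in Python. The sketch below compresses and restores a single object in the style of a loose object, that is, the header plus the content deflated as one stream. It does not attempt to model packfiles, and the function names are our own.

```python
import zlib


def encode_loose_object(obj_type: str, content: bytes) -> bytes:
    """Compress '<type> <size>\\0<content>' with zlib, as done for a single object."""
    store = f"{obj_type} {len(content)}".encode("ascii") + b"\x00" + content
    return zlib.compress(store)


def decode_loose_object(compressed: bytes) -> tuple:
    """Inverse of encode_loose_object: return (type, content)."""
    store = zlib.decompress(compressed)
    header, _, content = store.partition(b"\x00")
    obj_type, _, size = header.decode("ascii").partition(" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return obj_type, content


if __name__ == "__main__":
    text = b"the same line over and over\n" * 200
    packed = encode_loose_object("blob", text)
    print(len(text), "bytes before,", len(packed), "bytes after compression")
```

Repetitive text like the sample shrinks considerably, while data that is already compressed gains little from this step.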

Git trees point to existing blobs for unchanged files

However, for files that sit in your working directory unchanged when you commit, Git can simply reuse the existing blob already sitting in the object database. This is done by recording the existing blob’s object ID in the new tree when a commit is made. All changed files will have a new blob created and included in the tree, but unchanged files will simply use their existing blob.
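This reuse falls out naturally from content addressing, as the following Python sketch suggests. It models the object database as a dictionary and a root tree as a simple name-to-ID mapping. The ObjectStore class, the snapshot function, and the file contents are hypothetical simplifications, not Git’s actual implementation.

```python
import hashlib


class ObjectStore:
    """A toy content-addressed store standing in for Git's object database."""

    def __init__(self):
        self.objects = {}

    def write_blob(self, content: bytes) -> str:
        store = b"blob " + str(len(content)).encode("ascii") + b"\x00" + content
        object_id = hashlib.sha1(store).hexdigest()
        # Identical content hashes to the same ID, so nothing new is stored.
        self.objects.setdefault(object_id, content)
        return object_id


def snapshot(store: ObjectStore, files: dict) -> dict:
    """Return a {name: blob_id} mapping that plays the role of a root tree."""
    return {name: store.write_blob(data) for name, data in sorted(files.items())}


if __name__ == "__main__":
    store = ObjectStore()
    first = snapshot(store, {"a.txt": b"one\n", "b.txt": b"two\n"})
    second = snapshot(store, {"a.txt": b"one\n", "b.txt": b"two, edited\n"})
    print("a.txt blob reused:", first["a.txt"] == second["a.txt"])
    print("objects stored:", len(store.objects))
```

Two snapshots of two files end up storing three blobs in total: the unchanged file contributes one blob that both trees point to, and the edited file contributes one blob per version.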

Visualizing Git objects as a hierarchy under the commit chain

Putting it all together, I want you to try to imagine what Git’s root tree, subtrees, and blobs look like.

  1. Start by drawing a simple chain of 3 commits (i.e. a small branch). Each commit is a circle with an arrow pointing back to its parent.
  2. Then imagine a root tree as a triangle sitting under each commit circle. An arrow points from each commit to its root tree.
  3. Finally, picture a group of 2-3 blobs as boxes sitting under each root tree, with arrows pointing from the tree to each blob.

I find that mentally visualizing this “marionette” structure helps me get a feel for how Git’s objects fit together under the hood.

Summary

In this article, we discussed how Git’s core objects – blobs, trees, and commits – all link together into an interconnected network.

We saw how each commit refers to a root tree, and each root tree refers to a set of blobs and possibly subtrees. A new blob is created for each and every modified version of a file, and trees will make use of existing blobs whenever possible for storage efficiency.

All of this is best represented as a network of objects hanging under a chain of commits, with arrows representing the references between the various Git objects.

Next steps

If you’re interested in learning more about how Git works under the hood, check out our Baby Git guide for developers, which dives into the Git code in an accessible way. We wrote it for developers who are curious to learn how Git works at the code level. To that end, we’ve documented the first version of Git’s code and discussed it in detail.

We hope you enjoyed this post! Feel free to email me at jacob@initialcommit.io with any questions or comments.

References

  1. Git SCM Packfiles – https://git-scm.com/book/en/v2/Git-Internals-Packfiles
