git restful: a thought experiment to help understand how git works, part 1
This is not an implementation of any kind but a thought experiment, one I have found to be an intuitive way for ordinary developers to understand how git works and what happens behind its commands. For sanity’s sake, many details will be simplified or ignored completely.
Plus, an imaginary API for git sounds fun.
Assume we don’t have a git implementation and we’re building a service to mimic one. Our service consists of:
- a server with data models and APIs to mutate the git repository
- a client to accept git commands and take actions
server
storage
Git uses a file-based data store. However, the underlying database is irrelevant to our imaginary service. All we need is a magical key-value store (an SQLite database, or multiple JSON files) with essential functions:
- to isolate values into different models
- to save a value as a specific model while generating a unique id (or use a provided one)
- to fetch a value by model and id
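The three functions above can be sketched as a tiny in-memory store. This is only an illustration under stated assumptions: the class name `Store`, the method names `save` and `fetch`, and the use of random UUIDs as generated ids are all hypothetical choices, not part of git or of the service described here.

```python
import uuid
from collections import defaultdict


class Store:
    """In-memory stand-in for the 'magical' key-value store (hypothetical)."""

    def __init__(self):
        # One dict per model name keeps the models isolated from each other.
        self._tables = defaultdict(dict)

    def save(self, model, value, id=None):
        # Use the provided id, or generate a unique one.
        record_id = id if id is not None else uuid.uuid4().hex
        record = {**value, "id": record_id}
        self._tables[model][record_id] = record
        return record

    def fetch(self, model, id):
        # Returns None when nothing is stored under that model and id.
        return self._tables[model].get(id)


store = Store()
blob = store.save("blobs", {"content": "We the People of…"})
ref = store.save("refs", {"commit_id": "commit1"}, id="heads/master")
```

Looking up `heads/master` under `blobs` returns nothing, because each model lives in its own table.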
the core models
Git itself is not that complex. It relies on only three core models to operate. Additionally, there are references, which act like annotated shortcuts.
blob
- a blob is simply a segment of binary content, without metadata like filename and timestamp
- it’s a snapshot of a specific version of a file. After all, it’s a version control service, and this is our tiniest unit to control
- we’re using a plain string to represent the content despite its binary nature
id: id
content: string
tree
- a tree contains its path plus a list of (pointers of) other blobs or trees, just like a directory containing the path of itself plus files and sub-directories within
- we could recursively build a whole tree starting from a single root node
id: id
path: string
children: [{
child_type: tree | blob
child_id: id
}]
commit
- a commit is like a little yellow sticker with a message on it, attached to the trunk of a tree
- therefore, it consists of a message string and a pointer to a tree
- additionally, a commit may refer to one or more parent commits, collectively forming a network of different chains
- a normal commit has one parent, and a merge commit has two. You may use git merge to create one with more than two parents, a.k.a. an octopus merge
id: id
tree_id: id
parent_ids: id[]
message: string
reference
- A reference is a named commit. That’s it
- In reality, git doesn’t treat references the same way as the other three core models, yet it doesn’t matter to our fantasy service
- A branch is an ordinary reference, pointing to the head of a chain of commits. The id of a branch reference is always formatted as heads/{branch}
- A tag, as you have probably guessed, is a reference as well, with an id like tags/{tag}
- HEAD is a particular local-only reference, marking your current position
id: id
commit_id: id
to recap
- a blob is a snapshot of one file
- a tree is a snapshot of one directory
- a commit is a snapshot of the repository, with a message attached
- a reference points to a commit that you often care about
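The four models recapped above can be written down as plain data classes. This is a sketch that mirrors the field lists given earlier; the class name `Child` for a tree entry is an assumption introduced here for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Literal


@dataclass
class Blob:
    id: str
    content: str  # a plain string standing in for binary content


@dataclass
class Child:  # hypothetical name for one entry in a tree's children list
    child_type: Literal["tree", "blob"]
    child_id: str


@dataclass
class Tree:
    id: str
    path: str
    children: List[Child] = field(default_factory=list)


@dataclass
class Commit:
    id: str
    tree_id: str
    message: str
    parent_ids: List[str] = field(default_factory=list)


@dataclass
class Ref:
    id: str  # "heads/{branch}", "tags/{tag}", or "HEAD"
    commit_id: str


first = Commit(id="commit1", tree_id="tree1", message="first commit")
merge = Commit(id="commit3", tree_id="tree3", message="merge",
               parent_ids=["commit1", "commit2"])
```

A root commit has an empty `parent_ids` list, a normal commit has one entry, and a merge commit has two or more.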
a sample scenario
We will define some preliminary data to represent a repository with exactly one file and one commit.
files
path | content |
---|---|
/constitution.md | We the People of… |
blobs
id | content |
---|---|
blob1 | We the People of… |
trees
id | path | children |
---|---|---|
tree1 | / | [{ child_type: blob, child_id: blob1 }] |
commits
id | tree_id | message | parent_ids |
---|---|---|---|
commit1 | tree1 | first commit | [] |
references
id | commit_id |
---|---|
heads/master | commit1 |
HEAD | commit1 |
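To see how the pieces connect, here is a sketch that loads the sample data into plain dictionaries and walks from a reference down to file content. It assumes the root tree lists blob1 as its only child; the function names `blobs_in_tree` and `snapshot_at` are hypothetical helpers.

```python
# Sample repository as plain dicts, keyed by model name and then by id.
repo = {
    "blobs": {"blob1": {"content": "We the People of…"}},
    # Assumption: the root tree lists blob1 as its only child.
    "trees": {"tree1": {"path": "/", "children": [
        {"child_type": "blob", "child_id": "blob1"},
    ]}},
    "commits": {"commit1": {"tree_id": "tree1", "message": "first commit",
                            "parent_ids": []}},
    "refs": {"heads/master": {"commit_id": "commit1"},
             "HEAD": {"commit_id": "commit1"}},
}


def blobs_in_tree(repo, tree_id):
    """Recursively collect the content of every blob under a tree."""
    contents = []
    for child in repo["trees"][tree_id]["children"]:
        if child["child_type"] == "blob":
            contents.append(repo["blobs"][child["child_id"]]["content"])
        else:
            contents.extend(blobs_in_tree(repo, child["child_id"]))
    return contents


def snapshot_at(repo, ref_id):
    """Follow reference -> commit -> tree -> blobs."""
    commit = repo["commits"][repo["refs"][ref_id]["commit_id"]]
    return blobs_in_tree(repo, commit["tree_id"])


print(snapshot_at(repo, "HEAD"))
```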
client
Following are the pseudocode snippets to handle different git commands.
branch & checkout
The command git checkout -b is a sweet shortcut for one git branch plus one git checkout. git branch creates the new branch, and git checkout updates the HEAD reference.
# git branch fix/amendment-1
current_head = GET /refs/HEAD
branch_name = 'fix/amendment-1'
branch_ref = POST /refs { id: 'heads/{branch_name}', commit_id: current_head.commit_id }
# git checkout fix/amendment-1
# HEAD stores a commit id, so copy the branch's commit_id rather than the ref's own id
PATCH /refs/HEAD { commit_id: branch_ref.commit_id }
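For readers who prefer something executable, the same two steps can be sketched against an in-memory dictionary of references. The function names are hypothetical, and HEAD is modelled as in this post (a reference holding a commit id), which is simpler than what real git does.

```python
def git_branch(refs, name):
    """Create heads/<name> at the commit HEAD currently points to."""
    ref_id = f"heads/{name}"
    if ref_id in refs:
        raise ValueError(f"branch {name!r} already exists")
    refs[ref_id] = {"commit_id": refs["HEAD"]["commit_id"]}
    return ref_id


def git_checkout(refs, name):
    """Move HEAD to the commit the branch points to."""
    refs["HEAD"]["commit_id"] = refs[f"heads/{name}"]["commit_id"]


def git_checkout_b(refs, name):
    """The shortcut: one git branch plus one git checkout."""
    git_branch(refs, name)
    git_checkout(refs, name)


refs = {
    "heads/master": {"commit_id": "commit1"},
    "HEAD": {"commit_id": "commit1"},
}
git_checkout_b(refs, "fix/amendment-1")
```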
a lucid notation
The pseudocode above looks tedious. For better readability, let’s switch to a lighter notation:
- GET /models/id → Model.find
- POST /models → Model.create
- PATCH /models/id → Model.update
- …
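The mapping can be made concrete with a small translator that turns the short notation back into request descriptions. This is a sketch only: the `Model` class is hypothetical, it returns (method, path, body) tuples instead of sending anything, and `update` takes the id as its first argument rather than being called on a found record.

```python
class Model:
    """Translate the short notation into (method, path, body) tuples."""

    def __init__(self, route):
        self.route = route  # e.g. "refs" for /refs

    def find(self, id):
        return ("GET", f"/{self.route}/{id}", None)

    def create(self, **attrs):
        return ("POST", f"/{self.route}", attrs)

    def update(self, id, **attrs):
        return ("PATCH", f"/{self.route}/{id}", attrs)


Ref = Model("refs")
Commit = Model("commits")

request = Ref.update("HEAD", commit_id="commit1")
```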
add & commit
We’re ignoring concepts like local, remote, and staging, so adding and committing must be combined as an atomic action.
To add and commit files, we need to create new blobs for the files, then build a new snapshot by modifying a clone of the current tree.
# git add amendment-1.md && git commit -m 'fix: implemented first amendment'
current_ref = Ref.find('heads/fix/amendment-1')
current_commit = Commit.find(current_ref.commit_id)
current_tree = Tree.find(current_commit.tree_id)
new_blob = Blob.create(content: 'Congress shall make…')
# clone the current tree and append the new blob, leaving the old snapshot untouched
new_tree = Tree.create(
  path: current_tree.path,
  children: current_tree.children + [{ child_type: 'blob', child_id: new_blob.id }]
)
new_commit = Commit.create(
  tree_id: new_tree.id,
  message: 'fix: implemented first amendment',
  parent_ids: [current_commit.id]
)
current_ref.update(commit_id: new_commit.id)
Ref.find('HEAD').update(commit_id: new_commit.id)
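A runnable sketch of the same add-and-commit flow, using plain dictionaries as the data store. The function name `add_and_commit` and the sequential ids (blob2, tree2, commit2) are illustrative assumptions, as is the sample tree that lists blob1 as a child.

```python
import copy


def add_and_commit(repo, branch, content, message):
    """Store a new blob, snapshot the tree with it, and commit on `branch`."""
    ref = repo["refs"][f"heads/{branch}"]
    parent_id = ref["commit_id"]
    parent_tree = repo["trees"][repo["commits"][parent_id]["tree_id"]]

    blob_id = f"blob{len(repo['blobs']) + 1}"
    repo["blobs"][blob_id] = {"content": content}

    # Clone the current tree and append the new blob to the clone,
    # so the previous snapshot stays untouched.
    new_tree = copy.deepcopy(parent_tree)
    new_tree["children"].append({"child_type": "blob", "child_id": blob_id})
    tree_id = f"tree{len(repo['trees']) + 1}"
    repo["trees"][tree_id] = new_tree

    commit_id = f"commit{len(repo['commits']) + 1}"
    repo["commits"][commit_id] = {
        "tree_id": tree_id, "message": message, "parent_ids": [parent_id],
    }

    # Move both the branch and HEAD to the new commit.
    ref["commit_id"] = commit_id
    repo["refs"]["HEAD"]["commit_id"] = commit_id
    return commit_id


repo = {
    "blobs": {"blob1": {"content": "We the People of…"}},
    "trees": {"tree1": {"path": "/", "children": [
        {"child_type": "blob", "child_id": "blob1"}]}},
    "commits": {"commit1": {"tree_id": "tree1", "message": "first commit",
                            "parent_ids": []}},
    "refs": {"heads/master": {"commit_id": "commit1"},
             "heads/fix/amendment-1": {"commit_id": "commit1"},
             "HEAD": {"commit_id": "commit1"}},
}
new_id = add_and_commit(repo, "fix/amendment-1", "Congress shall make…",
                        "fix: implemented first amendment")
```

The old tree keeps its single child while the new tree has two, which is what makes every commit a full snapshot rather than a diff.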
merge
Merging is troublesome. The positions and number of the branches involved demand different actions, and the merge may still fail in the end due to conflicts.
We’re mainly talking about merging branches right now. The mechanism of merging the actual content of files (blobs) is beyond the scope of this part of the post.
Merge is an operation performed against branches, but branches are only references pointing to commits, so technically, we’re merging head commits of different chains.
The most straightforward case would be where the incoming commit is on the same chain, ahead of the base commit. In this case, we may simply move the pointer, a.k.a fast-forwarding, without creating a new commit.
# git merge fix/amendment-1
   +--->---+
   |       |
---b---x---i
head_ref = Ref.find('HEAD')
incoming_ref = Ref.find('heads/fix/amendment-1')
# b is an ancestor of i, so just move HEAD forward to the incoming commit
head_ref.update(commit_id: incoming_ref.commit_id)
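Before moving the pointer, the client has to confirm that the base commit really is on the incoming commit's chain. A minimal sketch of that check follows, walking `parent_ids` backwards; the function names are hypothetical, and the commit ids b, x and i are taken from the diagram.

```python
def is_ancestor(commits, ancestor_id, descendant_id):
    """True if ancestor_id is reachable from descendant_id via parent_ids."""
    stack, seen = [descendant_id], set()
    while stack:
        commit_id = stack.pop()
        if commit_id == ancestor_id:
            return True
        if commit_id in seen:
            continue
        seen.add(commit_id)
        stack.extend(commits[commit_id]["parent_ids"])
    return False


def fast_forward(commits, refs, incoming_ref_id):
    """Move HEAD to the incoming commit if it is strictly ahead on the same chain."""
    head, incoming = refs["HEAD"], refs[incoming_ref_id]
    if not is_ancestor(commits, head["commit_id"], incoming["commit_id"]):
        return False  # the chains have diverged, a real merge is needed
    head["commit_id"] = incoming["commit_id"]
    return True


commits = {
    "b": {"parent_ids": []},
    "x": {"parent_ids": ["b"]},
    "i": {"parent_ids": ["x"]},
}
refs = {"HEAD": {"commit_id": "b"},
        "heads/fix/amendment-1": {"commit_id": "i"}}
moved = fast_forward(commits, refs, "heads/fix/amendment-1")
```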
This is often too good to be true. The branches (two or even more) to merge rarely rest on the same sub-chain, and we have to find one base commit, usually one of their common ancestors.
Finding that base commit involves different strategies. Let’s ease the headache and say we’re merging only two branches, using their nearest common ancestor as the base.
# git merge fix/amendment-1
             (fix/amendment-1)
---c1---c2---x---|
---c4---x--------+
        (head)   (new commit)
head_ref = Ref.find('HEAD')
incoming_ref = Ref.find('heads/fix/amendment-1')
head_commit = Commit.find(head_ref.commit_id)
incoming_commit = Commit.find(incoming_ref.commit_id)
new_commit = Commit.create(
  tree_id: Tree.create(…).id,
  message: 'merge fix/amendment-1',
  parent_ids: [head_commit.id, incoming_commit.id]
)
# only the current position moves; the incoming branch keeps pointing at its own head
head_ref.update(commit_id: new_commit.id)
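The pseudocode takes the base commit for granted. One simple way to find a nearest common ancestor is sketched below: collect everything reachable from the head, then walk breadth-first from the incoming commit until hitting one of those. This is an illustrative simplification, not git's actual merge-base algorithm, and the shared root commit `a` is a hypothetical addition to the commits in the diagram.

```python
from collections import deque


def ancestors(commits, start_id):
    """All commits reachable from start_id, including start_id itself."""
    seen, queue = set(), deque([start_id])
    while queue:
        commit_id = queue.popleft()
        if commit_id not in seen:
            seen.add(commit_id)
            queue.extend(commits[commit_id]["parent_ids"])
    return seen


def nearest_common_ancestor(commits, head_id, incoming_id):
    """First commit found walking back from incoming_id that head_id also reaches."""
    reachable_from_head = ancestors(commits, head_id)
    seen, queue = set(), deque([incoming_id])
    while queue:
        commit_id = queue.popleft()
        if commit_id in reachable_from_head:
            return commit_id
        if commit_id not in seen:
            seen.add(commit_id)
            queue.extend(commits[commit_id]["parent_ids"])
    return None  # unrelated histories


commits = {
    "a": {"parent_ids": []},       # hypothetical shared root
    "c1": {"parent_ids": ["a"]},
    "c2": {"parent_ids": ["c1"]},  # head of fix/amendment-1
    "c4": {"parent_ids": ["a"]},   # current head
}
base = nearest_common_ancestor(commits, "c4", "c2")
```

When the returned base equals the head commit itself, the merge degenerates into the fast-forward case.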
next steps
advanced merging and rebasing
We haven’t touched the following topics:
- conflict detection, merge strategy, octopus merge, etc.
- rebase command and the options like amend and squash
branches as resources
A resource-oriented API service can be pretty flexible: a resource doesn’t have to map one-to-one to a model in the data store. In a friendly git service, for example, branches and tags could be defined as resources, providing functions like
- GET /branches?prefix='fix/'
- POST /branches?base=base
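A branch resource like this needs no table of its own, since it can be derived from the references. Below is a sketch of what the prefix query might do behind the scenes; the function name `list_branches` and the tag `tags/v1` in the sample data are hypothetical.

```python
def list_branches(refs, prefix=""):
    """Sketch of GET /branches?prefix=...: names derived from heads/* references."""
    names = (ref_id[len("heads/"):] for ref_id in refs
             if ref_id.startswith("heads/"))
    return sorted(name for name in names if name.startswith(prefix))


refs = {
    "heads/master": {"commit_id": "commit1"},
    "heads/fix/amendment-1": {"commit_id": "commit2"},
    "tags/v1": {"commit_id": "commit1"},  # hypothetical tag, ignored by the query
    "HEAD": {"commit_id": "commit2"},
}
fixes = list_branches(refs, prefix="fix/")
```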
conflicts and mergeability
Git hosting services tend to invent concepts like pull/merge requests to handle mergeability, which is necessary for collaboration between committers. But for git itself, we could propose a more transparent suite of APIs like
- GET /conflicts?base=head&incoming=some-commit
operation as a service
We could also design routes for operations like
- POST /operations/rebases?branch=some-branch
And now the status 409 Conflict sounds perfectly appropriate.
Some would say this is not entirely representational since there’s no resource created. I’d rather think of this as a shortcut for creating an ephemeral resource that has side effects.