Categories
Programming

The Secret to a Green CI: Efficient Pre-commit Hooks with checkout-index

tl;dr

Linting in the CI

Most people are familiar with the concept of “CI” as in “CI/CD”.

For the ones who are not, it usually refers to a set of automatic tests that are run when a developer sends their code for review. These tests typically include “linters”, which will check that the code follows certain stylistic rules, and detect common mistakes.

It can be pretty annoying to wait for the CI to complete, only to discover that it failed because that I forgot a “;” in TypeScript, added an extra “;” in Python, did not format my Rust code properly, etc.

Of course, I can always run the proper linting command before sending my code to make sure everything is okay. And what were the options to pass to cppcheck again? Oh right, I am supposed to use clang-tidy in this project. What’s the proper syntax to call mypy? Oh, and I do not want to run eslint on all the source files, it takes too long, which files did I change since the last time?

I know, I’ll just write a script. I will have to find out which files were changed since the last time it was call to avoid verifying everything every time. I could use the git history for that, but what was the last commit when the script was run? I would have to store that somewhere…

Things would be simpler if I just ran it on every commit. But then I am definitely going to forget. If only I could run that script for every commit. Oh, and I should ignore changes that I have not committed, since I am not going to send these, and it could create false positives and false negatives. I could use git stash, but I should be extra cautious about not messing things up.

The pre-commit hook

Fortunately, this is a common predicament, and a solution for this exists: pre-commit hooks.

In git, a hook is a script (or executable in general) that is run automatically at certain times. For instance, when running git push, or when making a commit.

When creating a repository, git automatically populates the hooks with some examples. Take a look at .git/hooks/ in your repository.

The hook when making a commit is .git/hooks/pre-commit. So if you make a script and put it there, it will be run when you do git commit. Try it by writing the following code in .git/hooks/pre-commit. Do not forget to make it executable with chmod +x .git/hooks/pre-commit.

#!/usr/bin/env bash
echo "Hello from the pre-commit hook!"
exit 1

If you git add some files and run git commit, you should see:

$ git commit
Hello from the pre-commit hook!

git did not open up your editor to let you write a commit message. In fact, not commit was actually made. The reason is that we made the script exit with the error status “1”. For success, that would be “0”. The status of the script is the same as that of the last executed command. Here, the last executed command is exit 1, which unsurprisingly exits with status “1”.

So, for your TypeScript project, you can just use something like the script below, and no more errors!

#!/usr/bin/env bash
npx eslint src/
npx tsc

Well, not quite.

Making it Work Right

First, we are writing a shell script. And, although they can be convenient, they come with their lot of surprises. In the previous pre-commit script, a linting error found by eslint won’t stop the commit. We can see why in this example:

$ cat false-true
#!/usr/bin/env bash
false
true
$ ./false-true
$ echo $?
0

The false command exits with “1”. But it does not stop the execution of the script! So true is then run, which exits with “0”. This behavior is usually unwanted. Thankfully, we can disable it by adding set -e:

$ cat false-true
#!/usr/bin/env bash
set -e
false
true
$ ./false-true
$ echo $?
1

That’s better!

We can also set the -E option, to ensure that the ERR trap is inherited in functions. What that means is that, if you do some clean-up in your bash script (usually done with traps), and happen to write a function where a failure occurs, the clean-up is done properly. Setting the -u option will give you a proper error when you make a typo in a variable name, instead of just behaving like the variable contains the empty string.

Note: We can also set the “-o pipefail” option, but it comes with its own surprises. Basically, without “-o pipefail”, if you have “command1 | command2”, the script will only fail if command2 fails. With “-o pipefail”, the script will also fail if command1 fails. This can be what you want. But sometimes, command1 failing just means that there is nothing to check. In short, use this option on a case-by-case basis.

Alright, we are not linting the files in the current directory before making a commit. That’s good. But that’s not what we want. What if we only commit some of the changes?

To run the linting command on the exact state corresponding to the commit we are about to do, we could just stash the changes, but that’s dangerous. What if the script fails? What if the script succeeds? What if there is nothing to stash? What if we make a mistake somewhere and randomly corrupt the working directory?

Okay, there is a simple rule to avoid all that: don’t touch the working directory. Don’t stash, don’t move files, don’t do anything in the current directory.

Instead, just copy the state corresponding to the commit in a temporary directory far away from the working directory.

But my repository has so many files! It’s going to take so long to copy the full thing!

We’re not cloning the repository. Just doing a checkout-index. And git is fast.

The proper way to create a working directory is with mktemp, which will avoid any collision. Also, we are not going to leave random files lying around, so we ensure that the temporary directory is removed once the script finishes (succeeds, fails, crashes, is killed, etc.). In Bash, the proper way to do this is with traps. So we do that immediately. Finally, we copy the state of the current commit to that temporary directory with checkout-index:

TEMPDIR=$(mktemp -d)
trap "rm -rf $TEMPDIR" EXIT SIGHUP SIGINT SIGQUIT SIGTERM
git checkout-index --prefix=$TEMPDIR/ -af

Once this is done, we can just cd to that temporary directory and run our linting commands. And we’re done.

Well, almost. You did not expect it to be that easy, do you?

Making it Work Fast

See, modern build tools such as npm and cargo do this clever thing where they put dependencies in a local subdirectory (node_modules/ for npm, target/ for cargo).

It’s definitely an improvement about Python’s virtual environments. However, it does mean that, when we run our linting command from the temporary directory, we start from scratch1. Every time we try to make a commit.

Thankfully, we can just tell npm and cargo to use the usual subdirectory2. But where is this subdirectory again? We could use pwd, but maybe the user is in another subdirectory of the project. Maybe we could try visiting the parent’s until we find node_modules/ or target/?

There is much simpler. The thing we are looking for is usually easy to locate from the repository’s root. Often, it is just directly at the repository’s root. So, if the repository is at ~/src/my-project, we should look in ~/src/my-project/node-modyles/ and ~/src/my-project/target/. Conveniently, we can just ask git for the repository’s root:

GIT_ROOT=$(git rev-parse --show-toplevel)

Then, we’ll just set the relevant environment variables:

export NODE_PATH="${GIT_ROOT}/node_modules"
export CARGO_TARGET_DIR="${GIT_ROOT}/target"

Now things are starting to look pretty good!

But we can do better.

If you only change 1 file out of 10,000, you might want to just check that one file and skip the rest. We can do just that.

First, let’s get the list of the files we should check:

git diff --cached --name-only --diff-filter=AM

Here, the “AM” value passed to diff-filter is two flags, “A” for “added files” and “M” for “modified files”. We can filter further with grep to keep only the files ending in “.py” for example.

Then, we feed this list as arguments for our linting executable. This is done easily with xargs, which allows will take each line from the input and give it as an argument to the command. For instance:

(echo a; echo b; echo c) | xargs ls
# same as
ls a b c

Since the list might be empty, remember to use --no-run-if-empty. Without it, you will run the linting command without argument, which might fail, or worse3, lint all the files. So we could have:

git diff --cached --name-only --diff-filter=AM |
    grep '\.[hc]$' |
    { cd $TEMPDIR; xargs --no-run-if-empty ./lint-fast c; }  # lint from temp directory

Note: The paths listed by git diff are relative to the root of the repository, so you’ll want to run the linting command from that root, even if all the files are in a subdirectory. This could also mean that you may have to explicitly point the linting command to the configuration file. For instance, if your TypeScript files are in a front/ directory along with the Biome configuration file, you will need to do:

git diff --cached --name-only --diff-filter=AM | grep -E '\.tsx?$' |
    (cd $TEMPDIR; xargs --no-run-if-empty npx @biomejs/biome check --config-path="front/")

The Script

Finally, with all the pieces coming together, we have a pre-commit hook which is both reliable and fast. Don’t forget to add the pre-commit script itself to your repository, and enable it with ln -s ../../pre-commit .git/hooks/ (you cannot directly track files in .git/).

You might not need TypeScript and Rust and Python and C and C++. But you can just pick the relevant parts.

#!/usr/bin/env bash
# Usage: copy this file to .git/hooks/

# Exit at first error
set -Eeu

# To handle partially committed files, we must copy the staged changes to a
# separate location
# See also https://stackoverflow.com/a/36793330
TEMPDIR=$(mktemp -d)
trap "rm -rf $TEMPDIR" EXIT SIGHUP SIGINT SIGQUIT SIGTERM
git checkout-index --prefix=$TEMPDIR/ -af

# keep using the same node_modules/ and target/ directories, not a new one in
# the temporary directory this avoids re-parsing everything from scratch every
# time we run the script
GIT_ROOT=$(git rev-parse --show-toplevel)

# TypeScript
export NODE_PATH="${GIT_ROOT}/node_modules"
git diff --cached --name-only --diff-filter=AM | grep -E '\.tsx?$' |
    (cd $TEMPDIR; xargs --no-run-if-empty npx @biomejs/biome check)

# Rust
export CARGO_TARGET_DIR="${GIT_ROOT}/target"
(cd $TEMPDIR; cargo fmt --check)
(cd $TEMPDIR; cargo clippy --all -- --deny warnings)

# Python
(cd $TEMPDIR/back; ruff check .)

# C
git diff --cached --name-only --diff-filter=AM |  # list modified files
    grep '\.[hc]$' |  # filter C files
    { cd $TEMPDIR; xargs --no-run-if-empty ./lint-fast c; }  # lint from temp directory

# C++
git diff --cached --name-only --diff-filter=AM |  # list modified files
    grep '\.[hc]pp$' |  # filter C++ files
    { cd $TEMPDIR; xargs --no-run-if-empty ./lint-fast cxx; }  # lint from temp directory

For C and C++, I use the following lint-fast script:

#!/usr/bin/env bash
# Exit at first error
set -Eeuo pipefail

MODE="${1}"
shift

# trailing space check
if grep -Hn '\s$' "$@"; then
    echo "Error: trailing space"
    exit 1
fi

# check for NULL instead of nullptr in C++ files
if test "${MODE}" = "cxx"; then
    grep -Hn --color NULL "$@" && exit 1
fi


GCCFLAGS=(
    # do not actually compile
    -fsyntax-only
    # fail script when issue is found
    -Werror
    # usual paranoid options
    -Wall
    -Wconversion
    -Wextra
    -Wpedantic
    -Wshadow
    -Wvla
)

# C++ specific options
if test "${MODE}" = "cxx"; then
    GCCFLAGS+=(
        -std=c++17
        -Wold-style-cast
    )
fi

gcc "${GCCFLAGS[@]}" "$@"

#note: one flag might come from pre-commit hook
CPPCHECK_FLAGS=(
    --suppress=noExplicitConstructor
    # no point when we only check a subset of the files
    --suppress=unusedFunction
    # disable lookup for system includes
    --suppress=missingIncludeSystem
    # sometimes does stupid suggestions
    --suppress=useStlAlgorithm
    # otherwise, it complain when not enough files are included
    --suppress=unusedStructMember
    # otherwise, it will complain it has not found a suppressed warning
    --suppress=unmatchedSuppression
    # enable everything else
    --enable=all
    # fail script when issue is found
    --error-exitcode=1
    # only print something when there is an issue
    --quiet
    # ignore inlined JS code
    -DEM_ASM=
)

cppcheck "${CPPCHECK_FLAGS[@]}" "$@"

# C++ specific options
CLANG_FLAGS=(
    --checks=-clang-analyzer-security.FloatLoopCounter,-clang-analyzer-valist.Uninitialized,-clang-analyzer-security.insecureAPI.DeprecatedOrUnsafeBufferHandling,-clang-analyzer-optin.performance.Padding
    --warnings-as-errors=*
)
if test "${MODE}" = "cxx"; then
    CLANG_FLAGS+=(
        -extra-arg=-std=c++17
    )
fi
clang-tidy "${CLANG_FLAGS[@]}" "$@" --
  1. Of course, if we’re linting with a call to npx, it will probably just fail since the command is not installed in the local project. ↩︎
  2. Yes, I know, I just said we shouldn’t touch the working directory. But we’re not, npm and cargo are! … Okay, let’s just say it’s a special case and we are being very careful here. ↩︎
  3. This is worse because this could stay under the radar for a long time and waste time. An error can usually be diagnosticed quickly and fixed once and for all. ↩︎

Leave a Reply

Your email address will not be published. Required fields are marked *