Friday, August 07, 2009

git: shrinking Subversion import

At $WORK, we've been attempting for years—but fairly infrequently—to do distributed development with centralized Subversion. We finally had enough and decided to move to git.

Part of that move involved importing a couple of projects with 6+ years of history. Early revisions carried lots of binary test data, so git svn clone produced repositories weighing in at 3.5 and 4.5 gigabytes.

Another less-than-satisfactory result was the umpteen bazillion git branches corresponding to Subversion tags. Some of these branches formed families with names of the form name-x.y.z@1234, where name-x.y.z is the name of a Subversion release tag and 1234 is a Subversion revision that modified the tag. A happy design choice made the branch name-x.y.z (with no @nnn) the head revision of that Subversion tag, so we easily picked off some targets:

$ git branch -r -D `git branch -r | grep @`

Cribbing from svn2git, converting the git branches to git tags was a series of commands of the form

$ git checkout 1.2.3
$ git tag -a -m "Tagging release 1.2.3" v1.2.3
$ git branch -r -D 1.2.3
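With more than a handful of tags, the per-tag commands above lend themselves to a loop. What follows is a minimal sketch, not the commands we actually ran. It assumes the Subversion tags ended up under refs/remotes/tags/ (adjust the prefix to match your import), builds a throwaway repository with two made-up tag branches so it can run anywhere, tags each ref directly instead of checking it out first, and drops the ref with git update-ref -d rather than git branch -r -D.

```shell
# Throwaway repository that mimics the refs an import leaves behind.
# The tag names 1.2.3 and 1.3.0 are made up for illustration.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "hello" > README
git add README
git commit -q -m "Initial import"
git update-ref refs/remotes/tags/1.2.3 HEAD
git update-ref refs/remotes/tags/1.3.0 HEAD

# Assumption: Subversion tags live under refs/remotes/tags/.
# Create an annotated tag for each one, then delete the remote branch.
git for-each-ref --format='%(refname)' refs/remotes/tags |
while read -r ref; do
    name=${ref#refs/remotes/tags/}
    git tag -a -m "Tagging release $name" "v$name" "$ref"
    git update-ref -d "$ref"
done

# List the resulting tags.
git tag -l
```

On a real import you would run only the loop, after the @nnn branches have been deleted so they don't turn into tags of their own.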
Then to make the Subversion trunk the git master branch:

$ git branch -D master
$ git checkout trunk
$ git checkout -f -b master
$ git branch -r -D trunk

Here's a good point to checkpoint your work in case you hose something later.

Using Antony Stubbs's script to find the biggest objects in a repo's packs, we determined that much of the bulk came from huge HDF5-format test baselines along with a few others. So we cut them out:

$ git filter-branch -d /dev/shm/scratch --index-filter \
  "git rm --cached -f --ignore-unmatch '*.h5'; \
   git rm --cached -f --ignore-unmatch '*.sig'; \
   git rm --cached -f --ignore-unmatch '*.2dsc'" \
  --tag-name-filter cat -- --all

The use of --index-filter makes the long process (remember, it has to hit all possible revisions) quicker because it operates directly on the index rather than checking out every snapshot, munging the filesystem, and shoving the new snapshot back in. Also, /dev/shm is a tmpfs mount for better throughput, and the directory named with -d must not already exist.
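Choosing which patterns to filter starts with a list of the largest objects. The following is not Antony Stubbs's script, only a rough sketch of the same idea built on git verify-pack and git rev-list. The throwaway repository and the file names in it are invented so the example can run on its own.

```shell
# Throwaway repository with one large and one small file (made-up names).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"
head -c 200000 /dev/urandom > baseline.h5
echo "hello" > README
git add baseline.h5 README
git commit -q -m "Add test baseline"

# verify-pack only sees packed objects, so pack everything first.
git gc --quiet

# Columns of verify-pack -v: object name, type, size, size in pack, offset.
# Keep the blobs, sort by size (largest first), and map names back to paths.
largest=$(
    git verify-pack -v "$(git rev-parse --git-dir)"/objects/pack/pack-*.idx |
    grep ' blob ' |
    sort -k3,3nr |
    head -n 10 |
    while read -r sha type size rest; do
        path=$(git rev-list --objects --all | grep "^$sha " | cut -d' ' -f2-)
        echo "$size $path"
    done
)
echo "$largest"
```

Each output line is a size in bytes followed by a path. Running git rev-list once per object is slow on a big repository, so for anything beyond the top few entries it would be worth saving its output to a file first.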

The git filter-branch manpage has a checklist for shrinking a repository that recommends running filter-branch and then cloning to leave behind the cruft.

Cloning with a filesystem path makes hardlinks, so use a URL:

$ git clone file:///home/gbacon/src/dBTools.git

Even after doing this, some big unnamed blobs had survived the clone. Thanks to #git on freenode for the suggestion to excise the reflog:

$ git reflog expire --verbose --expire=0 --all
$ git gc --prune=0

Note that these options will require a fairly recent git.
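To see why the reflog matters, here is a small sketch on a throwaway repository (made-up file and branch names, not our real import). It commits a large blob on a scratch branch, deletes the branch, and checks whether the blob survives a plain git gc and then a gc after the reflog has been expired. It uses the now spelling for both options rather than the 0 shown above.

```shell
# Throwaway repository; all names here are invented for the demo.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "hello" > README
git add README
git commit -q -m "Keep this"

# Commit a big blob on a scratch branch, then throw the branch away.
git checkout -q -b scratch
head -c 100000 /dev/urandom > baseline.h5
git add baseline.h5
git commit -q -m "Add big baseline"
blob=$(git rev-parse HEAD:baseline.h5)
git checkout -q -
git branch -q -D scratch

# No branch reaches the blob any more, but HEAD's reflog still does,
# so an ordinary gc keeps it.
git gc --quiet
if git cat-file -e "$blob" 2>/dev/null; then before=present; else before=gone; fi
echo "after plain gc: $before"

# Expire every reflog entry, then prune unreachable objects immediately.
git reflog expire --expire=now --all
git gc --quiet --prune=now
if git cat-file -e "$blob" 2>/dev/null; then after=present; else after=gone; fi
echo "after reflog expire and gc: $after"
```

The blob should still be there after the first gc and gone after the second, which matches what we saw with the unnamed blobs that outlived the clone.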

After all these steps, the git repositories went from gigabytes to 75 and 100 megabytes. Much nicer!

4 comments:

Antony Stubbs said...

Glad you found it useful! :) How did you come across it?

Greg said...

I forget the exact search terms, something like git largest objects, that led to a post of yours on the git mailing list.

Unknown said...

Greg, you might want to mention that we keep the old SVN repos around so that we can pull those binaries out. Otherwise we've just dumped our entire test history and can't revert to a checkpoint and get working tests.

Anonymous said...

Did you figure out why you have @nnn branches? I am seeing this with a git-svn clone I'm working with but am not 100% sure why they're there.

Some are due to past merges where people deleted trunk and copied a branch over the top instead, and one is due to a reorganization on the server. However, I haven't been able to make a toy example that replicates either of these cases.