Remove a Large File from Commit History in Git

Last modified: December 26, 2021

1. Overview

In this tutorial, we’ll learn how to remove large files from the commit history of a git repository using various tools.

2. Using git filter-branch

This is the long-standing built-in method, and it lets us rewrite the history of the branches we name on the command line.

For example, suppose we mistakenly drop a blob file inside a project folder, and after deleting it, we still notice the file in our git history:

$ git log --graph --full-history --all --pretty=format:"%h%x09%d%x20%s"
* 9e87646        (HEAD -> master) blob file removed
* 2583677        blob file
* 34ea256        my first commit

We can remove the blob file from our git history by rewriting the tree and its content with this command:

$ git filter-branch --tree-filter 'rm -f blob.txt' HEAD

Here, the --tree-filter option checks out each commit and runs the given shell command, rm -f blob.txt, in that tree. The -f option keeps rm from failing in commits that don’t contain the file. Without -f, rm returns an error in such a commit, and the whole filter-branch run aborts.

Here is our git log after we ran the command:

* 8f39d86        (HEAD -> master) blob file removed
* e99a81d        blob file
| * 9e87646      (refs/original/refs/heads/master) blob file removed
| * 2583677      blob file
|/  
* 34ea256        my first commit

We can replace HEAD with a revision range, such as 34ea256..HEAD, to rewrite only the commits made after a given SHA-1 hash and so minimize the rewrite.

Our git log still shows the original commits, including the one that added the blob file, because filter-branch keeps a backup under refs/original. We can delete this backup reference by updating our repo:

$ git update-ref -d refs/original/refs/heads/master

The -d option deletes the named ref. If we also pass the ref’s old value, Git verifies that the ref still contains it before deleting.

The old commits are still reachable from the reflog, so next we need to expire its entries:

$ git reflog expire --expire=now --all

The expire subcommand prunes older reflog entries. With --expire=now and --all, it removes every entry from the reflogs of all references.

Finally, we need to clean up and optimize our repo:

$ git gc --prune=now

The --prune=now option prunes loose objects regardless of their age.

After running the command, here is our git log:

* 6f49d86        (HEAD -> master) my first commit

We can see that the refs have been removed.
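
To see the whole sequence in one place, here is a minimal sketch that replays it in a throwaway repository. The scratch directory, the file contents, and the demo identity are assumptions made for illustration, and FILTER_BRANCH_SQUELCH_WARNING is set only to skip the warning and delay that recent Git versions print. The script records the object ID of blob.txt and checks at the end that the object no longer exists:

```shell
#!/bin/sh
# Sketch only: replays the steps above in a scratch repository (all names are illustrative).
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip filter-branch's startup warning and delay

repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "hello" > index.html
git add index.html && git commit -q -m "my first commit"

echo "pretend this is huge" > blob.txt
git add blob.txt && git commit -q -m "blob file"
blob_id=$(git rev-parse HEAD:blob.txt)   # object ID of the unwanted blob

git rm -q blob.txt && git commit -q -m "blob file removed"

# 1. Rewrite history so that no commit contains blob.txt.
git filter-branch -f --tree-filter 'rm -f blob.txt' HEAD >/dev/null 2>&1

# 2. Delete the backup refs that filter-branch left under refs/original.
git for-each-ref --format='delete %(refname)' refs/original | git update-ref --stdin

# 3. Expire the reflog, then prune the objects that are now unreachable.
git reflog expire --expire=now --all
git gc -q --prune=now

# The blob should no longer exist in the object database.
if git cat-file -e "$blob_id" 2>/dev/null; then
  blob_status="still present"
else
  blob_status="gone"
fi
commit_count=$(git rev-list --count HEAD)
echo "blob.txt object is $blob_status, commits on this branch: $commit_count"
```

Because --tree-filter keeps commits that end up empty, all three commits remain in this sketch; passing --prune-empty to filter-branch would drop the empty ones.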

Alternatively, we can run:

$ git filter-branch --index-filter 'git rm --cached --ignore-unmatch blob.txt' HEAD

This produces the same result as --tree-filter, but it is faster because it only rewrites the index (the staging area) and never checks out the files of each commit into the working directory. The --ignore-unmatch option prevents git rm from failing in commits that don’t contain the file.

We should note that both filter-branch variants can be slow on a repository with a long history, since the filter runs once for every commit being rewritten.
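
One way to reduce that cost is to rewrite only part of the history. The following sketch combines the index filter with a revision range in a throwaway repository; the scratch directory, file contents, and demo identity are assumptions for illustration. It checks that the root commit, which lies outside the range, keeps its hash:

```shell
#!/bin/sh
# Sketch only: limit an index-filter rewrite to the commits after a known-good commit.
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip filter-branch's startup warning and delay

repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "hello" > index.html
git add index.html && git commit -q -m "my first commit"
first_before=$(git rev-parse HEAD)       # last commit known not to contain the blob

echo "pretend this is huge" > blob.txt
git add blob.txt && git commit -q -m "blob file"
git rm -q blob.txt && git commit -q -m "blob file removed"

# Only the commits in the range first_before..HEAD are rewritten.
git filter-branch -f \
  --index-filter 'git rm -q --cached --ignore-unmatch blob.txt' \
  "$first_before"..HEAD >/dev/null 2>&1

first_after=$(git rev-list --max-parents=0 HEAD)   # root commit after the rewrite
remaining=$(git log --oneline -- blob.txt)         # empty when no commit touches blob.txt
echo "root commit unchanged: $first_before -> $first_after"
```

The backup under refs/original still has to be deleted, and the reflog expired, before the blob can be pruned.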

3. Using git-filter-repo

An alternate approach is to use the git-filter-repo command. It is a third-party add-on, simpler to use, and faster than other approaches. Moreover, it is the solution recommended in the git official documentation.

3.1. Installation

It requires python3 >= 3.5 and git >= 2.22.0 at a minimum; some features require git 2.24.0 or higher.

We are going to install git-filter-repo on our Linux machine. For the Windows installation guide, we can refer to the documentation. 

Firstly, we are going to install python3-pip and git-filter-repo with the following commands:

$ sudo apt install python3-pip
$ pip install --user git-filter-repo

Alternatively, we can use the below commands to install git-filter-repo:

# Add to bashrc.
export PATH="${HOME}/bin:${PATH}"

mkdir -p ~/bin
wget -O ~/bin/git-filter-repo https://raw.githubusercontent.com/newren/git-filter-repo/7b3e714b94a6e5b9f478cb981c7f560ef3f36506/git-filter-repo
chmod +x ~/bin/git-filter-repo
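
After installing, we can check the prerequisites with a short script. This is a minimal sketch: the version_ge helper is a hypothetical name introduced here, and it assumes a sort command that supports the -V (version sort) option, as GNU coreutils does:

```shell
#!/bin/sh
# Sketch only: check the prerequisites for git-filter-repo.

# version_ge A B: succeeds when version A is greater than or equal to version B.
# Assumes sort supports -V (version sort).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# "git --version" prints something like "git version 2.34.1"; the third field is the number.
git_version=$(git --version 2>/dev/null | awk '{print $3}')
if [ -n "$git_version" ] && version_ge "$git_version" "2.22.0"; then
  echo "git $git_version is new enough"
else
  echo "git 2.22.0 or newer is required (found: ${git_version:-none})"
fi

# The tool is usable as "git filter-repo" once the script is on PATH.
if command -v git-filter-repo >/dev/null 2>&1; then
  echo "git-filter-repo found at $(command -v git-filter-repo)"
else
  echo "git-filter-repo is not on PATH yet"
fi
```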

3.2. Removing the File

Let’s run the command to check our git log:

$ git log --graph --full-history --all --pretty=format:"%h%x09%d%x20%s"
* ee36517        (HEAD -> master) blob.txt removed
* a480073        project folder

The next thing is for us to analyze our repo:

$ git filter-repo --analyze
Processed 5 blob sizes
Processed 2 commits
Writing reports to .git/filter-repo/analysis...done.

This generates a directory of reports of the state of our repo. The report can be found at .git/filter-repo/analysis. This information may help determine what to filter in a subsequent run. It can also help us determine if our previous filtering command actually did what we wanted it to do.
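
If git-filter-repo is not installed yet, plain Git commands can give a similar overview of the biggest files in the history. The sketch below is not what --analyze runs internally; largest_blobs is a hypothetical helper name, and the scratch repository with its 4096-byte blob.txt is an assumption for the demo:

```shell
#!/bin/sh
# Sketch only: list the largest blobs in the history with plain Git commands.

# largest_blobs N: print "size-in-bytes path" for the N largest blobs reachable from any ref.
largest_blobs() {
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk '$1 == "blob" { $1 = ""; sub(/^ /, ""); print }' |
    sort -rn |
    head -n "${1:-10}"
}

# Demo in a throwaway repository: blob.txt is deleted again but stays in the history.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git config user.name "Demo User"
git config user.email "demo@example.com"

printf 'small\n' > index.html
head -c 4096 /dev/zero > blob.txt        # stand-in for a large file
git add . && git commit -q -m "project folder"
git rm -q blob.txt && git commit -q -m "blob.txt removed"

largest=$(largest_blobs 1)
echo "$largest"
```

The path printed for each blob is the one to pass to the filtering command in the next step.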

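Then, let’s run filter-repo with the --path-match option, which selects the file by its path, and --invert-paths, which inverts the selection so that the matching file is removed from the filtered history instead of being the only one kept. The --force option lets the command run on a repository that is not a fresh clone: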

$ git filter-repo --force --invert-paths --path-match blob.txt

Here is our new git log:

* 8940776        (HEAD -> master) project folder

After execution, every commit that contained the file, along with all of its descendants, gets a new commit hash.
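
To confirm that a rewrite did what we wanted, we can ask Git whether any commit still touches the path. The sketch below uses a hypothetical helper, path_in_history, in a throwaway repository (the directory, file contents, and demo identity are assumptions). It runs the filter-repo command from this section only when the tool is actually installed:

```shell
#!/bin/sh
# Sketch only: check whether a path still appears anywhere in the history.

# path_in_history PATH: succeeds when a commit reachable from any ref touches PATH.
path_in_history() {
  [ -n "$(git rev-list --all --max-count=1 -- "$1")" ]
}

repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git config user.name "Demo User"
git config user.email "demo@example.com"

mkdir project
echo "hello" > project/index.html
echo "pretend this is huge" > blob.txt
git add . && git commit -q -m "project folder"
git rm -q blob.txt && git commit -q -m "blob.txt removed"

if path_in_history blob.txt; then before="present"; else before="absent"; fi
echo "before rewrite: blob.txt is $before"

# Run the rewrite only when git-filter-repo is installed on this machine.
if command -v git-filter-repo >/dev/null 2>&1; then
  git filter-repo --force --invert-paths --path-match blob.txt
  if path_in_history blob.txt; then after="present"; else after="absent"; fi
  echo "after rewrite: blob.txt is $after"
fi
```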

4. Using BFG Repo-Cleaner

Another great option is BFG Repo-Cleaner, which is a third-party tool written in Scala that runs on the JVM.

It is faster than the git filter-branch approach. Additionally, it is good for removing large files, passwords, credentials, and other private data.

Let’s assume we want to remove blob files larger than 200 MB. This tool makes it easy to do this:

$ java -jar bfg.jar --strip-blobs-bigger-than 200M my-repo.git

Then, from inside the repository, let’s run these commands to expire the reflog and clean up the dead data:

$ git reflog expire --expire=now --all
$ git gc --prune=now --aggressive

5. Using git-rebase

We need the SHA1 key from the git log to use this approach:

$ git log --graph --full-history --all --pretty=format:"%h%x09%d%x20%s"
* 535f7ea        (HEAD -> master) blob file removed
* 8bffdfa        blob file
* 5bac30b        index.html

Our aim is to remove the blob file from our commit history. So we’ll use the SHA1 key of the commit that precedes the one we want to remove, which is 5bac30b here.

With this command, we enter into an interactive rebase:

$ git rebase -i 5bac30b

This opens our configured editor (nano in this case), which lists the commits from oldest to newest:

pick 8bffdfa blob file
pick 535f7ea blob file removed

# Rebase 5bac30b..535f7ea onto 5bac30b (2 commands)
#
# Commands:
# p, pick <commit> = use commit
# r, reword <commit> = use commit, but edit the commit message
# e, edit <commit> = use commit, but stop for amending
# s, squash <commit> = use commit, but meld into previous commit
# f, fixup <commit> = like "squash", but discard this commit's log message
# x, exec <command> = run command (the rest of the line) using shell
# b, break = stop here (continue rebase later with 'git rebase --continue')
# d, drop <commit> = remove commit
# l, label <label> = label current HEAD with a name
# t, reset <label> = reset HEAD to a label
# m, merge [-C <commit> | -c <commit>] <label> [# <oneline>]
# .  create a merge commit using the original merge commit's
# .  message (or the oneline, if no original merge commit was
# .  specified). Use -c <commit> to reword the commit message.

Now, we’ll modify this by deleting the line “pick 8bffdfa blob file”, which drops the commit that added the blob file. The remaining commit, “blob file removed”, then has nothing left to remove and becomes empty. This alters the commit history and takes the file we deleted earlier out of it.

We then save the file and quit the editor. Because the remaining commit is now empty, the rebase can stop and drop us at the terminal with the following status message:

interactive rebase in progress; onto 5bac30b
Last command done (1 command done):
pick 535f7ea blob file removed
No commands remaining.
You are currently rebasing branch 'master' on '5bac30b'.
(all conflicts fixed: run "git rebase --continue")

Finally, let’s continue the rebase operation:

$ git rebase --continue
Successfully rebased and updated refs/heads/master.

We can then verify our commit history:

$ git log --graph --full-history --all --pretty=format:"%h%x09%d%x20%s"
* 5bac30b        (HEAD -> master) index.html

We should note that this approach is not as fast as git-filter-repo.
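
When the unwanted commits sit in the middle of the history, the same result can be scripted without opening an editor by using git rebase --onto. The sketch below is a variation on the scenario above rather than the exact same steps: it assumes an extra, hypothetical later commit (style.css) that must survive, and it runs in a throwaway repository with a demo identity:

```shell
#!/bin/sh
# Sketch only: drop a run of commits non-interactively with "git rebase --onto".

repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "<h1>hi</h1>" > index.html
git add index.html && git commit -q -m "index.html"
keep=$(git rev-parse HEAD)            # last commit before the blob appeared

echo "pretend this is huge" > blob.txt
git add blob.txt && git commit -q -m "blob file"
git rm -q blob.txt && git commit -q -m "blob file removed"
skip_until=$(git rev-parse HEAD)      # last commit we want to leave out

echo "body { margin: 0 }" > style.css # hypothetical later work that must survive
git add style.css && git commit -q -m "style.css"

# Replay everything after $skip_until on top of $keep, leaving out the two blob commits.
git rebase -q --onto "$keep" "$skip_until"

subjects=$(git log --format=%s | tr '\n' '|')   # commit subjects, newest first
touched=$(git log --oneline -- blob.txt)        # empty when no commit touches blob.txt
echo "history: $subjects"
```

As with the other approaches, the old commits stay reachable from the reflog until it is expired and the repository is garbage-collected.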

6. Conclusion

In this article, we learned different approaches to remove large files from the commit history of a git repository. We also saw that the git documentation recommends git filter-repo because it is faster and has fewer drawbacks than the other approaches.
