A NixOS Boot Speed Debugging Log

Disclaimer: this post is a personal troubleshooting log (a diary), not a thorough tutorial. It includes some "dumb" moves and trial-and-error; I hope it offers a few ideas if you run into a similar problem.

Boot Stalls: 20 Seconds of Blank-Screen Agony

Recently, my NixOS laptop suddenly started booting very slowly. On every boot, the screen showed no output at all for a long time (20-odd seconds), and I suspected some bug like a deadlock during startup. I was too pressed for time to deal with it then, but the problem kept bothering me until I finally resolved to get to the bottom of it.

Initial Checks: Looking for Clues

First, I looked at the kernel log with dmesg, hoping to spot something abnormal:

[    1.543706] usb 3-10: New USB device strings: Mfr=0, Product=0, SerialNumber=0
[ 27.779755] systemd[1]: Inserted module 'autofs4'

As the log shows, there is a conspicuous gap of roughly 26 seconds between USB device initialization and the insertion of the autofs4 module.
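Rather than eyeballing timestamps, such gaps can be found mechanically by parsing the dmesg timestamps and reporting the largest jump between consecutive lines. A minimal sketch (the sample lines are the two from the log above):

```python
import re

def largest_gap(dmesg_lines):
    """Return (gap_seconds, line_before, line_after) for the biggest
    jump between consecutive dmesg timestamps."""
    stamped = []
    for line in dmesg_lines:
        m = re.match(r"\[\s*([0-9]+\.[0-9]+)\]", line)
        if m:
            stamped.append((float(m.group(1)), line))
    best = (0.0, None, None)
    for (t1, l1), (t2, l2) in zip(stamped, stamped[1:]):
        if t2 - t1 > best[0]:
            best = (t2 - t1, l1, l2)
    return best

log = [
    "[    1.543706] usb 3-10: New USB device strings: Mfr=0, Product=0, SerialNumber=0",
    "[   27.779755] systemd[1]: Inserted module 'autofs4'",
]
gap, before, after = largest_gap(log)
print(f"{gap:.1f}s gap before: {after}")  # ~26.2s gap
```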

Next, I checked journalctl. There was no obvious gap like in dmesg, but that may be because the system clock gets adjusted early in boot, or because journald aggregates or delays early kernel messages.

Jun 11 22:15:19 miLaptop kernel: usb 3-10: New USB device strings: Mfr=0, Product=0, SerialNumber=0
Jun 11 22:15:19 miLaptop systemd[1]: Inserted module 'autofs4'

To get more detailed boot information, I tried raising the kernel log level and switching kernel versions:

boot.kernelParams = [ "debug" "initcall_debug" "loglevel=8" "noquiet" ];
boot.kernelPackages = pkgs.linuxPackages_6_14;

But I still found nothing useful.

Then I ran systemd-analyze. I didn't keep the output from the broken state, but I am certain the kernel stage took as long as 30 seconds! For comparison, here is a normal systemd-analyze output:

Startup finished in 4.914s (firmware) + 3.721s (loader) + 1.610s (kernel) + 8.354s (userspace) = 18.600s
graphical.target reached after 3.835s in userspace.

This pointed away from userspace and toward something lower down, in the kernel boot stage.
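For reference, the summary line printed by systemd-analyze is just a sum of per-stage durations, so it can be parsed to see which stage dominates. A small sketch using the normal output shown above:

```python
import re

def parse_stages(summary):
    """Parse 'X.XXXs (name)' pairs out of a systemd-analyze summary line."""
    return {name: float(sec) for sec, name in
            re.findall(r"([0-9.]+)s \((\w+)\)", summary)}

line = ("Startup finished in 4.914s (firmware) + 3.721s (loader) "
        "+ 1.610s (kernel) + 8.354s (userspace) = 18.600s")
stages = parse_stages(line)
slowest = max(stages, key=stages.get)
print(slowest, stages[slowest])  # here userspace dominates; in my broken
                                 # boots it was the kernel stage
```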

Digging Deeper: The Stage 2 Perl Script

I kept digging. Given how NixOS works, I decided to fork nixpkgs (patching a flake's nixpkgs input seems to apply only to derivations) and print logs at several stages of the NixOS boot process.

Side Quest: Maintaining a Downstream nixpkgs with Scheduled Upstream Sync

I expected this investigation could take a long time (I suspected a kernel bug), so to make exploration easier I set up automatic syncing via GitHub Actions:

on:
  schedule:
    - cron: "0 2 * * *"

  workflow_dispatch:

jobs:
  fetch_upstream:
    name: sync upstream of nixpkgs_unstable
    runs-on: ubuntu-22.04
    steps:
      - name: main
        run: |
          git config --global user.name 'Seeker472-sync-bot'
          git config --global user.email 'gmx472@qq.com'
          git clone --depth 20 https://${{ secrets.ACCESS_TOKEN }}@github.com/Seeker0472/nixpkgs.git -b unstable-test tmp
          cd tmp
          git remote add upstream https://${{ secrets.ACCESS_TOKEN }}@github.com/NixOS/nixpkgs.git
          git fetch --depth 20 upstream nixos-unstable
          git merge upstream/nixos-unstable
          git push origin unstable-test

secrets.ACCESS_TOKEN is a GitHub Personal Access Token (classic).

Finding the Key Problem

However, I soon discovered the problem was in stage 2, in a Perl script responsible for initializing the /etc directory.

I added a log line in nixos/modules/system/etc/etc.nix:

diff --git a/nixos/modules/system/etc/etc.nix b/nixos/modules/system/etc/etc.nix
index 15387f261..f1e43521a 100644
--- a/nixos/modules/system/etc/etc.nix
+++ b/nixos/modules/system/etc/etc.nix
@@ -362,6 +362,7 @@ in
# Set up the statically computed bits of /etc.
echo "setting up /etc..."
${pkgs.perl.withPackages (p: [ p.FileSlurp ])}/bin/perl ${./setup-etc.pl} ${etc}/etc
+ echo "perl exec finished"
'';

After a reboot, the log confirmed my guess:

[    1.489603] stage-2-init: setting up /etc...
[ 1.495332] usb 3-10: new full-speed USB device number 4 using xhci_hcd
[ 1.620655] usb 3-10: New USB device found, idVendor=8087, idProduct=0026, bcdDevice= 0.02
[ 1.620661] usb 3-10: New USB device strings: Mfr=0, Product=0, SerialNumber=0
[ 1.622555] probe of 3-10 returned 0 after 1478 usecs
[ 28.326322] stage-2-init: perl exec finished

Between "setting up /etc..." and "perl exec finished" there was a delay of a full 27 seconds! This clearly shows the setup-etc.pl script itself was taking far too long to run.

Locating the Culprit: 2 Million cleanup Calls

To find out which part of the script was slow, I added a call counter to every function in setup-etc.pl:

diff --git a/nixos/modules/system/etc/setup-etc.pl b/nixos/modules/system/etc/setup-etc.pl
index ea0a38308..2defc6653 100644
--- a/nixos/modules/system/etc/setup-etc.pl
+++ b/nixos/modules/system/etc/setup-etc.pl
@@ -8,7 +8,10 @@ use File::Slurp;
my $etc = $ARGV[0] or die;
my $static = "/etc/static";

+my %function_call_counts;
+
sub atomicSymlink {
+ $function_call_counts{atomicSymlink}++;
my ($source, $target) = @_;
my $tmp = "$target.tmp";
unlink $tmp;
@@ -21,7 +24,6 @@ sub atomicSymlink {
}
}

-
# Atomically update /etc/static to point at the etc files of the
# current configuration.
atomicSymlink $etc, $static or die;
@@ -30,6 +32,8 @@ atomicSymlink $etc, $static or die;
# means either argument is a symlink to a file in /etc/static or a
# directory with all children being static.
sub isStatic {
+ $function_call_counts{isStatic}++;
+
my $path = shift;

if (-l $path) {
@@ -59,6 +63,7 @@ sub isStatic {
# in the current one. For efficiency, don't look under /etc/nixos
# (where all the NixOS sources live).
sub cleanup {
+ $function_call_counts{cleanup}++;
if ($File::Find::name eq "/etc/nixos") {
$File::Find::prune = 1;
return;
@@ -66,6 +71,7 @@ sub cleanup {
if (-l $_) {
my $target = readlink $_;
if (substr($target, 0, length $static) eq $static) {
+ print "cleanup-remove:FILE:---$File::Find::name\n";
my $x = "/etc/static/" . substr($File::Find::name, length "/etc/");
unless (-l $x) {
print STDERR "removing obsolete symlink ‘$File::Find::name’...\n";
@@ -73,8 +79,10 @@ sub cleanup {
}
}
}
+
}

+
find(\&cleanup, "/etc");


@@ -82,14 +90,15 @@ find(\&cleanup, "/etc");
my @oldCopied = read_file("/etc/.clean", chomp => 1, err_mode => 'quiet');
open CLEAN, ">>/etc/.clean";

-
# For every file in the etc tree, create a corresponding symlink in
# /etc to /etc/static. The indirection through /etc/static is to make
# switching to a new configuration somewhat more atomic.
my %created;
my @copied;

+
sub link {
+ $function_call_counts{link}++;
my $fn = substr $File::Find::name, length($etc) + 1 or next;

# nixos-enter sets up /etc/resolv.conf as a bind mount, so skip it.
@@ -136,15 +145,18 @@ sub link {

find(\&link, $etc);

-
# Delete files that were copied in a previous version but not in the
# current.
foreach my $fn (@oldCopied) {
+ $function_call_counts{foreach}++;
+
if (!defined $created{$fn}) {
+ $function_call_counts{foreach_y}++;
$fn = "/etc/$fn";
print STDERR "removing obsolete file ‘$fn’...\n";
unlink "$fn";
}
+
}


@@ -157,3 +169,9 @@ write_file("/etc/.clean", map { "$_\n" } sort @copied);
# so we need to check and re-create it during activation.
open TAG, ">>/etc/NIXOS";
close TAG;
+
+print "\n--- Function Call Summary ---\n";
+foreach my $func_name (sort keys %function_call_counts) {
+ print sprintf "%-20s: %d times\n", $func_name, $function_call_counts{$func_name};
+}
+print "---------------------------\n";

This time the log output was astonishing:

[   27.649163] stage-2-init: --- Function Call Summary ---
[ 27.649176] stage-2-init: atomicSymlink : 153 times
[ 27.649191] stage-2-init: cleanup : 2011360 times
[ 27.649203] stage-2-init: foreach : 3 times
[ 27.649216] stage-2-init: isStatic : 16 times
[ 27.649230] stage-2-init: link : 203 times
[ 27.649242] stage-2-init: ---------------------------
[ 27.655408] stage-2-init: perl exec finished

The cleanup function had been called 2,011,360 times! On each boot, this function traverses the entire /etc directory, checking for and removing old symlinks that point into /etc/static.
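Stripped of the counter, the logic cleanup implements can be sketched in Python: walk a tree and delete any symlink that points into the static directory but whose static counterpart no longer exists. This is a simplified re-implementation for illustration, operating on throwaway temp directories rather than the real /etc:

```python
import os
import tempfile

def cleanup_dangling(root, static):
    """Remove symlinks under `root` that point into `static` but whose
    static counterpart is gone (simplified analogue of setup-etc.pl's
    cleanup)."""
    removed = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames + dirnames:
            p = os.path.join(dirpath, name)
            if os.path.islink(p) and os.readlink(p).startswith(static):
                counterpart = os.path.join(static, os.path.relpath(p, root))
                if not os.path.lexists(counterpart):
                    os.unlink(p)
                    removed.append(p)
    return removed

root = tempfile.mkdtemp()
static = tempfile.mkdtemp()
# "live" symlink: its static counterpart still exists
open(os.path.join(static, "kept.conf"), "w").close()
os.symlink(os.path.join(static, "kept.conf"), os.path.join(root, "kept.conf"))
# dangling symlink: the counterpart was removed from the static tree
os.symlink(os.path.join(static, "gone.conf"), os.path.join(root, "gone.conf"))
print(cleanup_dangling(root, static))  # only gone.conf is removed
```

The key point is that this walk touches every entry under the tree, so its cost is proportional to the total number of files, which is exactly what made /etc expensive to process.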

After some investigation, I found the culprit: the /etc/docker directory. It contained a huge number of files, so cleanup spent an enormous amount of time traversing it.

I considered simply skipping /etc/docker in the Perl script, but that would not be an elegant solution:

diff --git a/nixos/modules/system/etc/setup-etc.pl b/nixos/modules/system/etc/setup-etc.pl
index 2defc6653..743c6dd25 100644
--- a/nixos/modules/system/etc/setup-etc.pl
+++ b/nixos/modules/system/etc/setup-etc.pl
@@ -58,6 +58,11 @@ sub isStatic {
return 0;
}

+my $static_docker_path = "/etc/static/docker";
+
+if (-e $static_docker_path) {
+ print STDERR "Entering /etc/docker to check for obsolete symlinks (as $static_docker_path is not a symlink).\n";
+}
# Remove dangling symlinks that point to /etc/static. These are
# configuration files that existed in a previous configuration but not
# in the current one. For efficiency, don't look under /etc/nixos
@@ -68,6 +73,12 @@ sub cleanup {
$File::Find::prune = 1;
return;
}
+ if ($File::Find::name eq "/etc/docker") {
+ unless (-e $static_docker_path) {
+ $File::Find::prune = 1;
+ return;
+ }
+ }
if (-l $_) {
my $target = readlink $_;
if (substr($target, 0, length $static) eq $static) {

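Python's os.walk supports the same pruning trick used in the patch above: removing a directory name from dirnames in place stops the walk from descending into it, which is what setting $File::Find::prune does in File::Find. A sketch with hypothetical paths:

```python
import os
import tempfile

def walk_pruned(root, skip_names):
    """Walk `root`, skipping any directory whose name is in skip_names
    (the os.walk equivalent of $File::Find::prune)."""
    visited = []
    for dirpath, dirnames, filenames in os.walk(root):
        # mutate dirnames in place so os.walk never descends into them
        dirnames[:] = [d for d in dirnames if d not in skip_names]
        visited.extend(os.path.join(dirpath, f) for f in filenames)
    return visited

root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "docker"))
# a large subtree we do not want to traverse
for i in range(100):
    open(os.path.join(root, "docker", f"layer{i}"), "w").close()
open(os.path.join(root, "fstab"), "w").close()
print(len(walk_pruned(root, {"docker"})))  # only fstab is visited
```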
The Real Culprit: An Early Misconfiguration

Thinking it over further, I realized this could not be Docker's default behavior: /etc is meant to hold system-level configuration files, not piles of junk, and Docker's default data directory is /var/lib/docker. So I started checking my own configuration.

The problem was in one of my early "junk configs", which set Docker's data-root:

virtualisation.docker = {
  enable = true;
  enableOnBoot = false;
  rootless.setSocketVariable = true;
  daemon.settings = { data-root = "/etc/docker"; }; # <--- right here!
};

Setting Docker's data-root to /etc/docker was the root cause of the bloated /etc. Docker stores its data in /var/lib/docker by default, but I had pointed it at /etc, turning /etc into a huge data store. As a result, setup-etc.pl had to walk and process an enormous number of files on every boot, causing the boot delay of tens of seconds.

The fix was simple: remove or correct the bad data-root setting and let Docker use its default data path. After the change, boot speed returned to normal.
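For completeness, the corrected configuration simply drops the data-root override so Docker falls back to its default of /var/lib/docker (a sketch of the fixed config):

```nix
virtualisation.docker = {
  enable = true;
  enableOnBoot = false;
  rootless.setSocketVariable = true;
  # No daemon.settings.data-root override: Docker uses its default
  # /var/lib/docker, keeping bulk data out of /etc.
};
```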

Outlook: Trying a "Stateless" Setup?

NixOS has an impermanence module that wipes everything on each reboot except a few explicitly specified directories.

I'll give it a try when I have time.


A NixOS Boot Speed Debugging Log
https://20040702.xyz/2025/06/13/nixos-boot-slow/
Author: Seeker
Published: June 13, 2025