A NixOS Boot Speed Debugging Log

Disclaimer: this post is a personal troubleshooting log (a diary), not a thorough tutorial. It includes some "dumb" moves and trial-and-error; I hope it offers a few ideas if you run into a similar problem.

Boot Stalls: 20 Seconds of Blank-Screen Agony

Recently, my NixOS laptop suddenly started booting very slowly. On every boot, the screen showed no output at all for a long time (20-odd seconds), and I suspected some bug like a deadlock during startup. I was too pressed for time to deal with it then, but the problem kept bothering me until I finally resolved to get to the bottom of it.

Initial Checks: Looking for Clues

First, I looked at the kernel log with dmesg, hoping to spot something abnormal:

[    1.543706] usb 3-10: New USB device strings: Mfr=0, Product=0, SerialNumber=0
[ 27.779755] systemd[1]: Inserted module 'autofs4'

As the log shows, there is a conspicuous gap of roughly 26 seconds between USB device initialization and the insertion of the autofs4 module.
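Rather than eyeballing timestamps, such gaps can be found mechanically by parsing the dmesg timestamps and reporting the largest jump between consecutive lines. A minimal sketch (the sample lines are the two from the log above):

```python
import re

def largest_gap(dmesg_lines):
    """Return (gap_seconds, line_before, line_after) for the biggest
    jump between consecutive dmesg timestamps."""
    stamped = []
    for line in dmesg_lines:
        m = re.match(r"\[\s*([0-9]+\.[0-9]+)\]", line)
        if m:
            stamped.append((float(m.group(1)), line))
    best = (0.0, None, None)
    for (t1, l1), (t2, l2) in zip(stamped, stamped[1:]):
        if t2 - t1 > best[0]:
            best = (t2 - t1, l1, l2)
    return best

log = [
    "[    1.543706] usb 3-10: New USB device strings: Mfr=0, Product=0, SerialNumber=0",
    "[   27.779755] systemd[1]: Inserted module 'autofs4'",
]
gap, before, after = largest_gap(log)
print(f"{gap:.1f}s gap before: {after}")  # ~26.2s gap
```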

Next, I checked journalctl. There was no obvious gap like in dmesg, but that may be because the system clock gets adjusted early in boot, or because journald aggregates or delays early kernel messages.

Jun 11 22:15:19 miLaptop kernel: usb 3-10: New USB device strings: Mfr=0, Product=0, SerialNumber=0
Jun 11 22:15:19 miLaptop systemd[1]: Inserted module 'autofs4'

To get more detailed boot information, I tried raising the kernel log level and switching kernel versions:

boot.kernelParams = [ "debug" "initcall_debug" "loglevel=8" "noquiet" ];
boot.kernelPackages = pkgs.linuxPackages_6_14;

But I still found nothing useful.

Then I ran systemd-analyze. I didn't keep the output from the broken state, but I am certain the kernel stage took as long as 30 seconds! For comparison, here is a normal systemd-analyze output:

Startup finished in 4.914s (firmware) + 3.721s (loader) + 1.610s (kernel) + 8.354s (userspace) = 18.600s
graphical.target reached after 3.835s in userspace.

This pointed away from userspace and toward something lower down, in the kernel boot stage.
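For reference, the summary line printed by systemd-analyze is just a sum of per-stage durations, so it can be parsed to see which stage dominates. A small sketch using the normal output shown above:

```python
import re

def parse_stages(summary):
    """Parse 'X.XXXs (name)' pairs out of a systemd-analyze summary line."""
    return {name: float(sec) for sec, name in
            re.findall(r"([0-9.]+)s \((\w+)\)", summary)}

line = ("Startup finished in 4.914s (firmware) + 3.721s (loader) "
        "+ 1.610s (kernel) + 8.354s (userspace) = 18.600s")
stages = parse_stages(line)
slowest = max(stages, key=stages.get)
print(slowest, stages[slowest])  # here userspace dominates; in my broken
                                 # boots it was the kernel stage
```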

Digging Deeper: The Stage 2 Perl Script

I kept digging. Given how NixOS works, I decided to fork nixpkgs (patching a flake's nixpkgs input seems to apply only to derivations) and print logs at several stages of the NixOS boot process.

Side Quest: Maintaining a Downstream nixpkgs with Scheduled Upstream Sync

I expected this investigation could take a long time (I suspected a kernel bug), so to make exploration easier I set up automatic syncing via GitHub Actions:

on:
  schedule:
    - cron: "0 2 * * *"

  workflow_dispatch:

jobs:
  fetch_upstream:
    name: sync upstream of nixpkgs_unstable
    runs-on: ubuntu-22.04
    steps:
      - name: main
        run: |
          git config --global user.name 'Seeker472-sync-bot'
          git config --global user.email 'gmx472@qq.com'
          git clone --depth 20 https://${{ secrets.ACCESS_TOKEN }}@github.com/Seeker0472/nixpkgs.git -b unstable-test tmp
          cd tmp
          git remote add upstream https://${{ secrets.ACCESS_TOKEN }}@github.com/NixOS/nixpkgs.git
          git fetch --depth 20 upstream nixos-unstable
          git merge upstream/nixos-unstable
          git push origin unstable-test

secrets.ACCESS_TOKEN is a GitHub Personal Access Token (classic).

Finding the Key Problem

However, I soon discovered the problem was in stage 2, in a Perl script responsible for initializing the /etc directory.

I added a log line in nixos/modules/system/etc/etc.nix:

diff --git a/nixos/modules/system/etc/etc.nix b/nixos/modules/system/etc/etc.nix
index 15387f261..f1e43521a 100644
--- a/nixos/modules/system/etc/etc.nix
+++ b/nixos/modules/system/etc/etc.nix
@@ -362,6 +362,7 @@ in
# Set up the statically computed bits of /etc.
echo "setting up /etc..."
${pkgs.perl.withPackages (p: [ p.FileSlurp ])}/bin/perl ${./setup-etc.pl} ${etc}/etc
+ echo "perl exec finished"
'';

After a reboot, the log confirmed my guess:

[    1.489603] stage-2-init: setting up /etc...
[ 1.495332] usb 3-10: new full-speed USB device number 4 using xhci_hcd
[ 1.620655] usb 3-10: New USB device found, idVendor=8087, idProduct=0026, bcdDevice= 0.02
[ 1.620661] usb 3-10: New USB device strings: Mfr=0, Product=0, SerialNumber=0
[ 1.622555] probe of 3-10 returned 0 after 1478 usecs
[ 28.326322] stage-2-init: perl exec finished

Between "setting up /etc..." and "perl exec finished" there was a delay of a full 27 seconds! This clearly shows the setup-etc.pl script itself was taking far too long to run.

Locating the Culprit: 2 Million cleanup Calls

To find out which part of the script was slow, I added a call counter to every function in setup-etc.pl:

diff --git a/nixos/modules/system/etc/setup-etc.pl b/nixos/modules/system/etc/setup-etc.pl
index ea0a38308..2defc6653 100644
--- a/nixos/modules/system/etc/setup-etc.pl
+++ b/nixos/modules/system/etc/setup-etc.pl
@@ -8,7 +8,10 @@ use File::Slurp;
my $etc = $ARGV[0] or die;
my $static = "/etc/static";

+my %function_call_counts;
+
sub atomicSymlink {
+ $function_call_counts{atomicSymlink}++;
my ($source, $target) = @_;
my $tmp = "$target.tmp";
unlink $tmp;
@@ -21,7 +24,6 @@ sub atomicSymlink {
}
}

-
# Atomically update /etc/static to point at the etc files of the
# current configuration.
atomicSymlink $etc, $static or die;
@@ -30,6 +32,8 @@ atomicSymlink $etc, $static or die;
# means either argument is a symlink to a file in /etc/static or a
# directory with all children being static.
sub isStatic {
+ $function_call_counts{isStatic}++;
+
my $path = shift;

if (-l $path) {
@@ -59,6 +63,7 @@ sub isStatic {
# in the current one. For efficiency, don't look under /etc/nixos
# (where all the NixOS sources live).
sub cleanup {
+ $function_call_counts{cleanup}++;
if ($File::Find::name eq "/etc/nixos") {
$File::Find::prune = 1;
return;
@@ -66,6 +71,7 @@ sub cleanup {
if (-l $_) {
my $target = readlink $_;
if (substr($target, 0, length $static) eq $static) {
+ print "cleanup-remove:FILE:---$File::Find::name\n";
my $x = "/etc/static/" . substr($File::Find::name, length "/etc/");
unless (-l $x) {
print STDERR "removing obsolete symlink ‘$File::Find::name’...\n";
@@ -73,8 +79,10 @@ sub cleanup {
}
}
}
+
}

+
find(\&cleanup, "/etc");


@@ -82,14 +90,15 @@ find(\&cleanup, "/etc");
my @oldCopied = read_file("/etc/.clean", chomp => 1, err_mode => 'quiet');
open CLEAN, ">>/etc/.clean";

-
# For every file in the etc tree, create a corresponding symlink in
# /etc to /etc/static. The indirection through /etc/static is to make
# switching to a new configuration somewhat more atomic.
my %created;
my @copied;

+
sub link {
+ $function_call_counts{link}++;
my $fn = substr $File::Find::name, length($etc) + 1 or next;

# nixos-enter sets up /etc/resolv.conf as a bind mount, so skip it.
@@ -136,15 +145,18 @@ sub link {

find(\&link, $etc);

-
# Delete files that were copied in a previous version but not in the
# current.
foreach my $fn (@oldCopied) {
+ $function_call_counts{foreach}++;
+
if (!defined $created{$fn}) {
+ $function_call_counts{foreach_y}++;
$fn = "/etc/$fn";
print STDERR "removing obsolete file ‘$fn’...\n";
unlink "$fn";
}
+
}


@@ -157,3 +169,9 @@ write_file("/etc/.clean", map { "$_\n" } sort @copied);
# so we need to check and re-create it during activation.
open TAG, ">>/etc/NIXOS";
close TAG;
+
+print "\n--- Function Call Summary ---\n";
+foreach my $func_name (sort keys %function_call_counts) {
+ print sprintf "%-20s: %d times\n", $func_name, $function_call_counts{$func_name};
+}
+print "---------------------------\n";

This time the log output was astonishing:

[   27.649163] stage-2-init: --- Function Call Summary ---
[ 27.649176] stage-2-init: atomicSymlink : 153 times
[ 27.649191] stage-2-init: cleanup : 2011360 times
[ 27.649203] stage-2-init: foreach : 3 times
[ 27.649216] stage-2-init: isStatic : 16 times
[ 27.649230] stage-2-init: link : 203 times
[ 27.649242] stage-2-init: ---------------------------
[ 27.655408] stage-2-init: perl exec finished

The cleanup function had been called 2,011,360 times! On each boot, this function traverses the entire /etc directory, checking for and removing old symlinks that point into /etc/static.
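Stripped of the counter, the logic cleanup implements can be sketched in Python: walk a tree and delete any symlink that points into the static directory but whose static counterpart no longer exists. This is a simplified re-implementation for illustration, operating on throwaway temp directories rather than the real /etc:

```python
import os
import tempfile

def cleanup_dangling(root, static):
    """Remove symlinks under `root` that point into `static` but whose
    static counterpart is gone (simplified analogue of setup-etc.pl's
    cleanup)."""
    removed = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames + dirnames:
            p = os.path.join(dirpath, name)
            if os.path.islink(p) and os.readlink(p).startswith(static):
                counterpart = os.path.join(static, os.path.relpath(p, root))
                if not os.path.lexists(counterpart):
                    os.unlink(p)
                    removed.append(p)
    return removed

root = tempfile.mkdtemp()
static = tempfile.mkdtemp()
# "live" symlink: its static counterpart still exists
open(os.path.join(static, "kept.conf"), "w").close()
os.symlink(os.path.join(static, "kept.conf"), os.path.join(root, "kept.conf"))
# dangling symlink: the counterpart was removed from the static tree
os.symlink(os.path.join(static, "gone.conf"), os.path.join(root, "gone.conf"))
print(cleanup_dangling(root, static))  # only gone.conf is removed
```

The key point is that this walk touches every entry under the tree, so its cost is proportional to the total number of files, which is exactly what made /etc expensive to process.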

After some investigation, I found the culprit: the /etc/docker directory. It contained a huge number of files, so cleanup spent an enormous amount of time traversing it.

I considered simply skipping /etc/docker in the Perl script, but that would not be an elegant solution:

diff --git a/nixos/modules/system/etc/setup-etc.pl b/nixos/modules/system/etc/setup-etc.pl
index 2defc6653..743c6dd25 100644
--- a/nixos/modules/system/etc/setup-etc.pl
+++ b/nixos/modules/system/etc/setup-etc.pl
@@ -58,6 +58,11 @@ sub isStatic {
return 0;
}

+my $static_docker_path = "/etc/static/docker";
+
+if (-e $static_docker_path) {
+ print STDERR "Entering /etc/docker to check for obsolete symlinks (as $static_docker_path is not a symlink).\n";
+}
# Remove dangling symlinks that point to /etc/static. These are
# configuration files that existed in a previous configuration but not
# in the current one. For efficiency, don't look under /etc/nixos
@@ -68,6 +73,12 @@ sub cleanup {
$File::Find::prune = 1;
return;
}
+ if ($File::Find::name eq "/etc/docker") {
+ unless (-e $static_docker_path) {
+ $File::Find::prune = 1;
+ return;
+ }
+ }
if (-l $_) {
my $target = readlink $_;
if (substr($target, 0, length $static) eq $static) {

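Python's os.walk supports the same pruning trick used in the patch above: removing a directory name from dirnames in place stops the walk from descending into it, which is what setting $File::Find::prune does in File::Find. A sketch with hypothetical paths:

```python
import os
import tempfile

def walk_pruned(root, skip_names):
    """Walk `root`, skipping any directory whose name is in skip_names
    (the os.walk equivalent of $File::Find::prune)."""
    visited = []
    for dirpath, dirnames, filenames in os.walk(root):
        # mutate dirnames in place so os.walk never descends into them
        dirnames[:] = [d for d in dirnames if d not in skip_names]
        visited.extend(os.path.join(dirpath, f) for f in filenames)
    return visited

root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "docker"))
# a large subtree we do not want to traverse
for i in range(100):
    open(os.path.join(root, "docker", f"layer{i}"), "w").close()
open(os.path.join(root, "fstab"), "w").close()
print(len(walk_pruned(root, {"docker"})))  # only fstab is visited
```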
The Real Culprit: An Early Misconfiguration

Thinking it over further, I realized this could not be Docker's default behavior: /etc is meant to hold system-level configuration files, not piles of junk, and Docker's default data directory is /var/lib/docker. So I started checking my own configuration.

The problem was in one of my early "junk configs", which set Docker's data-root:

virtualisation.docker = {
  enable = true;
  enableOnBoot = false;
  rootless.setSocketVariable = true;
  daemon.settings = { data-root = "/etc/docker"; }; # <--- right here!
};

Setting Docker's data-root to /etc/docker was the root cause of the bloated /etc. Docker stores its data in /var/lib/docker by default, but I had pointed it at /etc, turning /etc into a huge data store. As a result, setup-etc.pl had to walk and process an enormous number of files on every boot, causing the boot delay of tens of seconds.

The fix was simple: remove or correct the bad data-root setting and let Docker use its default data path. After the change, boot speed returned to normal.
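For completeness, the corrected configuration simply drops the data-root override so Docker falls back to its default of /var/lib/docker (a sketch of the fixed config):

```nix
virtualisation.docker = {
  enable = true;
  enableOnBoot = false;
  rootless.setSocketVariable = true;
  # No daemon.settings.data-root override: Docker uses its default
  # /var/lib/docker, keeping bulk data out of /etc.
};
```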

Outlook: Trying a "Stateless" Setup?

NixOS has an impermanence module that wipes everything on each reboot except a few explicitly specified directories.

I'll give it a try when I have time.


A NixOS Boot Speed Debugging Log
https://20040702.xyz/2025/06/13/nixos-boot-slow/
Author: Seeker
Published: June 13, 2025