How do I adjust ulimit settings for Linux running in Docker?

On Linux, the default locked-in-memory size is usually 64 KB, which is often not enough for io_uring. As the liburing README puts it:

io_uring accounts memory it needs under the rlimit memlocked option, which
can be quite low on some setups (64K). The default is usually enough for
most use cases, but bigger rings or things like registered buffers deplete
it quickly.
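
To see what limit a process actually runs with, the value can be read directly. The following is a minimal sketch in Python, assuming a Linux system where the standard resource module is available; the helper names format_limit and memlock_limits are made up for this illustration.

```python
import resource


def format_limit(value: int) -> str:
    """Render an rlimit value that is expressed in bytes."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value // 1024} KiB"


def memlock_limits() -> tuple:
    """Return the (soft, hard) RLIMIT_MEMLOCK values in bytes."""
    return resource.getrlimit(resource.RLIMIT_MEMLOCK)


if __name__ == "__main__":
    soft, hard = memlock_limits()
    print("memlock soft:", format_limit(soft))
    print("memlock hard:", format_limit(hard))
```

With the 64 KB default described above, both lines would be expected to report 64 KiB.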

Take io_uring-echo-server as an example: it fails as soon as it calls io_uring_queue_init_params, and you will see the following error output:

root@aa26ad058377:/github/io_uring-echo-server# ./io_uring_echo_server 8110
io_uring echo server listening for connections on port: 8110
io_uring_init_failed...
: Cannot allocate memory

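The tests in the liburing project fail as well. Look for ring setup failed and io_uring_queue_init=-12 (ENOMEM, the same "Cannot allocate memory" error) in the output: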

Running test 232c93d07b74-test                                      232c93d07b74-test: 232c93d07b74-test.c:116: rcv: Assertion `res >= 0' failed.
./runtests.sh: line 67: 25 Aborted timeout -s INT -k $TIMEOUT $TIMEOUT "${test_exec[@]}"
Test 232c93d07b74-test failed with ret 134
Running test 35fa71a030ca-test 5 sec [5]
Running test 500f9fbadef8-test ring setup failed
Test 500f9fbadef8-test failed with ret 1
Running test 7ad0e4b2f83c-test io_uring_queue_init=-12
Test 7ad0e4b2f83c-test failed with ret 1
Running test 8a9973408177-test ring setup failed
Test 8a9973408177-test failed with ret 1

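You cannot fix this from inside the container by running ulimit -l 1048576 to raise the locked-in-memory size (ulimit -l takes kilobytes, so this asks for 1 GiB). A process without extra privileges cannot change a hard limit to a larger value once it has been set: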

A hard limit can only be decreased. Once it is set it cannot be increased; a soft limit may be
increased up to the value of the hard limit. If neither -H nor -S is specified, both the soft and hard
limits are updated when assigning a new limit value, and the soft limit is used when reporting the
current value.
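
The soft/hard rule can be observed from a program as well. Below is a small sketch in Python, assuming Linux and the standard resource module; try_set_memlock is a hypothetical helper written for this illustration. It reports whether the kernel accepts a given pair of limits instead of raising an exception.

```python
import resource


def try_set_memlock(soft: int, hard: int) -> bool:
    """Try to set RLIMIT_MEMLOCK; return False if the request is refused."""
    try:
        resource.setrlimit(resource.RLIMIT_MEMLOCK, (soft, hard))
        return True
    except (ValueError, OSError):
        # Refused: the soft limit exceeds the hard limit, or an
        # unprivileged process tried to raise its hard limit.
        return False


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    # Re-applying the current values is always allowed.
    print("keep current limits:", try_set_memlock(soft, hard))
    if hard != resource.RLIM_INFINITY:
        # Expected to be refused in a default container.
        print("raise hard limit:", try_set_memlock(soft, hard + 4096))
```

In a default container the second call would be expected to print False, which is the same refusal that ulimit -l runs into.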

For this, liburing offers a brief note on how to fix it:

Going into detail on how to bump the limit on various systems is beyond the scope
of this little blurb, but check /etc/security/limits.conf for user specific
settings, or /etc/systemd/user.conf and /etc/systemd/system.conf for systemd
setups.

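For anyone who runs their Linux environment in Docker, however, editing these configuration files is inconvenient. I tried for a while without success (possibly because I do not understand Docker well enough).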

In fact, Docker lets you set ulimit values when starting a container, which is much more convenient:

Since setting ulimit settings in a container requires extra privileges not available in the default container, you can set these using the --ulimit flag. --ulimit is specified with a soft and hard limit as such: <type>=<soft limit>[:<hard limit>]

To adjust the locked-in-memory size, just add the following flag when starting the container:

--ulimit memlock=8192

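The flag means the following:

  • --ulimit says that a ulimit setting is being changed
  • memlock=8192 sets the locked-in-memory size to 8192 bytes, i.e. 8 KB. Docker passes the number to the kernel without converting it, and the kernel measures this limit in bytes. Note that 8 KB is below the usual 64 KB default, so use a larger number to actually raise the limit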
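
The <type>=<soft limit>[:<hard limit>] format quoted above can be made concrete with a small parser. This is only a sketch in Python of how such a value could be read, not Docker's own implementation; Ulimit and parse_ulimit are hypothetical names, and the sketch assumes that a missing hard limit defaults to the soft limit and that -1 stands for unlimited.

```python
from typing import NamedTuple


class Ulimit(NamedTuple):
    name: str
    soft: int
    hard: int


def parse_ulimit(spec: str) -> Ulimit:
    """Parse a '<type>=<soft limit>[:<hard limit>]' string."""
    name, sep, limits = spec.partition("=")
    if not sep or not name or not limits:
        raise ValueError(f"expected <type>=<soft>[:<hard>], got {spec!r}")
    parts = limits.split(":")
    if len(parts) > 2:
        raise ValueError(f"too many limit values in {spec!r}")
    soft = int(parts[0])
    # Assumption: without an explicit hard limit, soft is used for both.
    hard = int(parts[1]) if len(parts) == 2 else soft
    # -1 means unlimited, so it can never sit below a finite hard limit.
    if hard != -1 and (soft == -1 or soft > hard):
        raise ValueError("soft limit exceeds hard limit")
    return Ulimit(name, soft, hard)


if __name__ == "__main__":
    print(parse_ulimit("memlock=8192"))
    print(parse_ulimit("memlock=65536:131072"))
```

Under these assumptions, memlock=8192 sets both the soft and the hard limit to 8192 bytes, while memlock=65536:131072 leaves room to raise the soft limit later from inside the container.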

You may wonder where the name memlock comes from, since docker run --help does not explain it:

--ulimit ulimit                  Ulimit options (default [])

It can be found in the comments of the /etc/security/limits.conf file on Linux:

#<item> can be one of the following:
# - core - limits the core file size (KB)
# - data - max data size (KB)
# - fsize - maximum filesize (KB)
# - memlock - max locked-in-memory address space (KB)
# - nofile - max number of open file descriptors
# - rss - max resident set size (KB)
# - stack - max stack size (KB)
# - cpu - max CPU time (MIN)
# - nproc - max number of processes
# - as - address space limit (KB)
# - maxlogins - max number of logins for this user
# - maxsyslogins - max number of logins on the system
# - priority - the priority to run user process with
# - locks - max number of file locks the user can hold
# - sigpending - max number of pending signals
# - msgqueue - max memory used by POSIX message queues (bytes)
# - nice - max nice priority allowed to raise to values: [-20, 19]
# - rtprio - max realtime priority
# - chroot - change root to directory (Debian-specific)

If you need to change other ulimit options as well, run ulimit -a to see the description of each option (along with its current value), then find the matching item name in the list above.

root@32b9ffca4a3a:/# ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 7474
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1048576
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
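
The same values can be listed by item name from a program. Here is a sketch in Python, assuming Linux and the standard resource module; the table ITEM_TO_RLIMIT and the function current_limits are made up for this illustration, and items without a matching constant in the module (such as locks) are left out.

```python
import resource

# limits.conf item name -> name of the matching constant in the resource module
ITEM_TO_RLIMIT = {
    "core": "RLIMIT_CORE",
    "data": "RLIMIT_DATA",
    "fsize": "RLIMIT_FSIZE",
    "memlock": "RLIMIT_MEMLOCK",
    "nofile": "RLIMIT_NOFILE",
    "rss": "RLIMIT_RSS",
    "stack": "RLIMIT_STACK",
    "cpu": "RLIMIT_CPU",
    "nproc": "RLIMIT_NPROC",
    "as": "RLIMIT_AS",
    "sigpending": "RLIMIT_SIGPENDING",
    "msgqueue": "RLIMIT_MSGQUEUE",
    "nice": "RLIMIT_NICE",
    "rtprio": "RLIMIT_RTPRIO",
}


def current_limits() -> dict:
    """Return {item name: (soft, hard)} for every limit this platform exposes."""
    limits = {}
    for item, const_name in ITEM_TO_RLIMIT.items():
        const = getattr(resource, const_name, None)
        if const is None:
            continue  # constant not available on this platform
        limits[item] = resource.getrlimit(const)
    return limits


if __name__ == "__main__":
    for item, (soft, hard) in current_limits().items():
        print(f"{item:12} soft={soft} hard={hard}")
```

The numbers are in the kernel's own units (bytes for sizes, seconds for cpu, and -1 for unlimited), so they will not match the kbytes shown by ulimit -a digit for digit.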

Some of the values above are unlimited. To make the locked-in-memory size unlimited as well when starting the container, set the value to -1:

--ulimit memlock=-1

Summary

When starting the Docker container, add --ulimit memlock=-1 to set the locked-in-memory size to unlimited, then run ulimit -l inside the container to check that it took effect.

Here is a complete docker run command:

docker run -it --ulimit memlock=-1 --mount type=bind,src=/Users/gorden5566/github,dst=/github --name gorden5566 gubuntu:0.0.2

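Besides setting ulimit -l to unlimited, this command also does the following:

  • --mount type=bind,src=/Users/gorden5566/github,dst=/github maps the local /Users/gorden5566/github directory to /github inside the container, which makes it easy to share files with the host (/Users/gorden5566/github must first be added as a shared resource in the Docker settings).
  • --name gorden5566 names the container gorden5566, which makes it easier to refer to later
  • gubuntu:0.0.2 is the image to run. This is an image I built myself on top of ubuntu:latest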
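
If the container is started from a script, the same command can be assembled as an argument list. This is only a sketch in Python; docker_run_args is a hypothetical helper, and the image, container name and paths are the ones from the command above.

```python
import shlex


def docker_run_args(image, name, host_dir, container_dir, memlock=-1):
    """Build the argv for a docker run call with a memlock ulimit."""
    return [
        "docker", "run", "-it",
        "--ulimit", f"memlock={memlock}",
        "--mount", f"type=bind,src={host_dir},dst={container_dir}",
        "--name", name,
        image,
    ]


if __name__ == "__main__":
    args = docker_run_args(
        image="gubuntu:0.0.2",
        name="gorden5566",
        host_dir="/Users/gorden5566/github",
        container_dir="/github",
    )
    # Print the command instead of running it; pass args to
    # subprocess.run(args) to actually start the container.
    print(shlex.join(args))
```

Keeping the limit as a parameter makes it easy to swap -1 for a finite number of bytes when unlimited is more than is needed.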

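What setting memlock does

According to the getrlimit man page, this limit sets the maximum amount of memory that can be locked into RAM. The value is rounded down to the nearest multiple of the system page size, and it affects mlock, mlockall, and mmap with MAP_LOCKED.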

This is the maximum number of bytes of memory that may be locked into RAM. This limit is in
effect rounded down to the nearest multiple of the system page size. This limit affects
mlock(2), mlockall(2), and the mmap(2) MAP_LOCKED operation. Since Linux 2.6.9, it also
affects the shmctl(2) SHM_LOCK operation, where it sets a maximum on the total bytes in shared
memory segments (see shmget(2)) that may be locked by the real user ID of the calling process.
The shmctl(2) SHM_LOCK locks are accounted for separately from the per-process memory locks
established by mlock(2), mlockall(2), and mmap(2) MAP_LOCKED; a process can lock bytes up to this
limit in each of these two categories.
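
The rounding rule is easy to check with a little arithmetic. The sketch below, in Python, computes the effective limit for a given page size; effective_memlock is a hypothetical helper, and the 4096-byte page size in the examples is an assumption that holds on many but not all systems.

```python
import mmap


def effective_memlock(limit_bytes: int, page_size: int = mmap.PAGESIZE) -> int:
    """Round a memlock limit down to a whole number of pages."""
    return (limit_bytes // page_size) * page_size


if __name__ == "__main__":
    for limit in (4095, 8192, 10000, 65536):
        print(limit, "->", effective_memlock(limit, page_size=4096))
```

With 4096-byte pages, a limit of 10000 bytes therefore behaves like 8192 bytes, and anything below one page behaves like 0.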

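Once memory is locked, the operating system guarantees that the data stays in RAM and is not swapped out, which avoids the latency caused by page faults: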

mlock(), mlock2(), and mlockall() lock part or all of the calling process’s virtual address space into
RAM, preventing that memory from being paged to the swap area.
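
To see the limit in action, a program can try to lock a region and report what happens. The following is a rough sketch in Python that calls mlock(2) and munlock(2) from the C library through ctypes, assuming Linux; try_mlock is a hypothetical helper, and whether the call succeeds depends on the memlock limit and privileges of the process running it.

```python
import ctypes
import errno
import mmap
from typing import Tuple


def try_mlock(num_bytes: int) -> Tuple[bool, str]:
    """Map an anonymous region, try to lock it, and report the outcome."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.mlock.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
    libc.mlock.restype = ctypes.c_int
    libc.munlock.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
    libc.munlock.restype = ctypes.c_int

    region = mmap.mmap(-1, num_bytes)          # anonymous, page-aligned mapping
    view = ctypes.c_char.from_buffer(region)   # used only to obtain the address
    try:
        addr = ctypes.addressof(view)
        if libc.mlock(addr, num_bytes) != 0:
            err = ctypes.get_errno()
            return False, errno.errorcode.get(err, str(err))
        libc.munlock(addr, num_bytes)
        return True, "ok"
    finally:
        del view        # release the buffer export before closing the mapping
        region.close()


if __name__ == "__main__":
    for size in (mmap.PAGESIZE, 64 * 1024 * 1024):
        print(size, "bytes ->", try_mlock(size))
```

Under a 64 KB limit without extra privileges, the single page would be expected to lock while the 64 MiB request fails, typically with ENOMEM.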

The submission queue (SQ) and completion queue (CQ) in io_uring have to stay in RAM and must not be swapped out.

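RocketMQ uses memory locking too: after mapping the commit log file into memory with mmap, it locks the mapping with mlock so that it is not swapped out and does not cause page faults.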

References

https://github.com/axboe/liburing

https://docs.docker.com/engine/reference/commandline/run/#set-ulimits-in-container---ulimit

man ulimit from fish-shell

man getrlimit

man mlock