Configuring shared memory (shm) for k8s Pods

Linux shm/tmpfs

Linux supports specifying a size when mounting a tmpfs, as shown below:

```bash
mount -t tmpfs tmpfs /home/test/ -o size=1M
```

This command only reserves the size logically; it does not consume real memory up front. Memory is consumed only as files are written into the directory.
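A quick check of the mount confirms this (a sketch using the `/home/test/` path from above; the output is illustrative):

```bash
# The tmpfs advertises its 1M logical size but uses no memory yet.
df -h /home/test/
# Filesystem      Size  Used Avail Use% Mounted on
# tmpfs           1.0M     0  1.0M   0% /home/test
```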

Testing the shm size

Use the following command to write a file into /dev/shm with a 1M block size and a count of 1024, i.e. 1G in total:

```bash
dd if=/dev/zero of=/dev/shm/test.random bs=1M count=1024
```

Writing into shm consumes memory, which can be observed with `free -mh`, as shown below:

```
root@ajkqpkmajvc6q-0:/dev/shm# free -mh
              total        used        free      shared  buff/cache   available
Mem:           172G         17G        2.8G        4.2G        152G        148G
Swap:            0B          0B          0B
root@ajkqpkmajvc6q-0:/dev/shm# dd if=/dev/zero of=/dev/shm/test1.random bs=1M count=4096
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 2.6405 s, 1.6 GB/s
root@ajkqpkmajvc6q-0:/dev/shm# free -mh
              total        used        free      shared  buff/cache   available
Mem:           172G         17G        820M        8.2G        154G        144G
Swap:            0B          0B          0B
```
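Note that the `shared` column grows from 4.2G to 8.2G, exactly the 4GiB just written. Deleting the backing files returns the memory (a minimal sketch; file names follow the session above):

```bash
# tmpfs memory is released as soon as the backing files are removed;
# `free -mh` afterwards shows `shared` dropping back down.
rm /dev/shm/test*.random
```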

Kubernetes before 1.20

Although Docker supports an shm size parameter (`--shm-size`), Kubernetes does not expose it. The community workaround is based on emptyDir: define a memory-backed emptyDir volume and mount it at the container's /dev/shm, as shown below.

```yaml
...
    volumeMounts:
    - mountPath: /dev/shm
      name: shm
...
  volumes:
  - name: shm
    emptyDir:
      medium: Memory
      sizeLimit: 4Gi
```
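Put together, a minimal complete Pod using this pattern might look like the following (a sketch; the Pod name `shm-demo`, the image, and the command are placeholders for illustration):

```bash
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: shm-demo              # hypothetical name, for illustration only
spec:
  containers:
  - name: app
    image: ubuntu:22.04       # any image with coreutils works
    command: ["sleep", "infinity"]
    volumeMounts:
    - mountPath: /dev/shm
      name: shm
  volumes:
  - name: shm
    emptyDir:
      medium: Memory
      sizeLimit: 4Gi
EOF
```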

Although a sizeLimit is set here, it does not actually take effect: the default size is half of the host's memory. When the Kubernetes code performs the mount, it does not add a size option; it simply mounts a tmpfs:

```go
// k8s.io/kubernetes/pkg/volume/emptydir/empty_dir.go

// setupTmpfs creates a tmpfs mount at the specified directory.
func (ed *emptyDir) setupTmpfs(dir string) error {
	if ed.mounter == nil {
		return fmt.Errorf("memory storage requested, but mounter is nil")
	}
	if err := ed.setupDir(dir); err != nil {
		return err
	}
	// Make SetUp idempotent.
	medium, isMnt, err := ed.mountDetector.GetMountMedium(dir)
	if err != nil {
		return err
	}
	// If the directory is a mountpoint with medium memory, there is no
	// work to do since we are already in the desired state.
	if isMnt && medium == v1.StorageMediumMemory {
		return nil
	}

	klog.V(3).Infof("pod %v: mounting tmpfs for volume %v", ed.pod.UID, ed.volName)
	return ed.mounter.Mount("tmpfs", dir, "tmpfs", nil /* options */)
}
```
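You can confirm this from inside a Pod on a pre-1.20 cluster: with no size option, the tmpfs defaults to half of the node's physical memory (a sketch with illustrative output, assuming the 172G node from the session above and the hypothetical `shm-demo` Pod):

```bash
kubectl exec shm-demo -- df -h /dev/shm
# Filesystem      Size  Used Avail Use% Mounted on
# tmpfs            86G     0   86G   0% /dev/shm
```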

However, the sizeLimit: 4Gi set above still has an effect, just a different one: kubelet's eviction manager monitors the space used by each Pod's emptyDir volumes, and when usage exceeds that value the Pod is evicted (kubelet obtains per-Pod emptyDir usage via k8s.io/kubernetes/pkg/kubelet/server/stats).

```go
// k8s.io/kubernetes/pkg/kubelet/eviction/eviction_manager.go

// localStorageEviction checks the EmptyDir volume usage for each pod and determine whether it exceeds the specified limit and needs
// to be evicted. It also checks every container in the pod, if the container overlay usage exceeds the limit, the pod will be evicted too.
func (m *managerImpl) localStorageEviction(summary *statsapi.Summary, pods []*v1.Pod) []*v1.Pod {
	statsFunc := cachedStatsFunc(summary.Pods)
	evicted := []*v1.Pod{}
	for _, pod := range pods {
		podStats, ok := statsFunc(pod)
		if !ok {
			continue
		}

		if m.emptyDirLimitEviction(podStats, pod) {
			evicted = append(evicted, pod)
			continue
		}

		if m.podEphemeralStorageLimitEviction(podStats, pod) {
			evicted = append(evicted, pod)
			continue
		}

		if m.containerEphemeralStorageLimitEviction(podStats, pod) {
			evicted = append(evicted, pod)
		}
	}

	return evicted
}
```
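In practice, exceeding the limit looks roughly like this (a sketch; timing and the exact status text depend on the kubelet's eviction loop and version):

```bash
# Write more than the 4Gi sizeLimit from inside the Pod...
kubectl exec shm-demo -- dd if=/dev/zero of=/dev/shm/big bs=1M count=5120

# ...and shortly afterwards the eviction manager evicts it:
kubectl get pod shm-demo
# NAME       READY   STATUS    RESTARTS   AGE
# shm-demo   0/1     Evicted   0          5m
```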

So if you want to use shm on an older Kubernetes version, do not set sizeLimit.

Kubernetes 1.20

Kubernetes 1.20 merged a [PR](https://github.com/kubernetes/kubernetes/pull/94444/commits) that adds a feature gate (`SizeMemoryBackedVolumes`) which can be enabled on the kubelet. With the gate on, kubelet sizes the shm mount when creating the Pod's containers; you still expose shm to the Pod as a mounted volume, as above. The resulting size is:

  1. If the Pod sets no memory limit, the shm size is the node's allocatable memory.
  2. If the Pod sets a memory limit but the memory-medium emptyDir sets no sizeLimit, the shm size is the Pod's memory limit.
  3. If the memory-medium emptyDir sets a sizeLimit, the shm size is that sizeLimit.

In other words, the smallest positive value among the three wins, as `calculateEmptyDirMemorySize` shows:
```go
func calculateEmptyDirMemorySize(nodeAllocatableMemory *resource.Quantity, spec *volume.Spec, pod *v1.Pod) *resource.Quantity {
	// if feature is disabled, continue the default behavior of linux host default
	sizeLimit := &resource.Quantity{}
	if !utilfeature.DefaultFeatureGate.Enabled(features.SizeMemoryBackedVolumes) {
		return sizeLimit
	}

	// size limit defaults to node allocatable (pods cant consume more memory than all pods)
	sizeLimit = nodeAllocatableMemory
	zero := resource.MustParse("0")

	// determine pod resource allocation
	// we use the same function for pod cgroup assigment to maintain consistent behavior
	// NOTE: this could be nil on systems that do not support pod memory containment (i.e. windows)
	podResourceConfig := cm.ResourceConfigForPod(pod, false, uint64(100000))
	if podResourceConfig != nil && podResourceConfig.Memory != nil {
		podMemoryLimit := resource.NewQuantity(*(podResourceConfig.Memory), resource.BinarySI)
		// ensure 0 < value < size
		if podMemoryLimit.Cmp(zero) > 0 && podMemoryLimit.Cmp(*sizeLimit) < 1 {
			sizeLimit = podMemoryLimit
		}
	}

	// volume local size is used if and only if less than what pod could consume
	if spec.Volume.EmptyDir.SizeLimit != nil {
		volumeSizeLimit := spec.Volume.EmptyDir.SizeLimit
		// ensure 0 < value < size
		if volumeSizeLimit.Cmp(zero) > 0 && volumeSizeLimit.Cmp(*sizeLimit) < 1 {
			sizeLimit = volumeSizeLimit
		}
	}
	return sizeLimit
}
```
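To use this, enable the gate on each kubelet (it is alpha in 1.20 and off by default) and set sizeLimit on the volume. A sketch, reusing the hypothetical `shm-demo` Pod from above; the exact flag placement depends on how your kubelet is launched:

```bash
# Enable the feature gate on the kubelet:
kubelet --feature-gates=SizeMemoryBackedVolumes=true ...

# With sizeLimit: 4Gi and no Pod memory limit, the tmpfs is now
# mounted with an explicit size (illustrative output):
kubectl exec shm-demo -- df -h /dev/shm
# Filesystem      Size  Used Avail Use% Mounted on
# tmpfs           4.0G     0  4.0G   0% /dev/shm
```

If the Pod also set a memory limit of, say, 2Gi, the smallest positive value would win and /dev/shm would be sized at 2Gi.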