Kubernetes apps deletion flow

## Problem description

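Recently, several of our Kubernetes clusters hit the same problem: after a StatefulSet was deleted, its Pods were left behind. The cause turned out to be how our developers deleted StatefulSets through the fabric k8s API client. They called the delete-StatefulSet endpoint and then the delete-Pod endpoint, but both calls used the default deletion mode, which in this setup is non-cascading deletion (the Orphan policy). In some cases only the delete-StatefulSet call was made and the delete-Pod call never happened, so the Pods were not deleted. The Pod object then still exists. The garbage collector in kube-controller-manager removes the ownerReferences entry from the Pod's metadata, but the labels still carry controller-revision-hash and the StatefulSet information, as shown in the figure below.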
[Figure: metadata of a Pod left behind after its StatefulSet was deleted]

For comparison, a Pod that properly belongs to a StatefulSet is marked as follows:

[Figure: metadata of a Pod owned by a StatefulSet]
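The difference between the two Pods can be checked mechanically. The following is a minimal sketch, not code from Kubernetes: the `Pod` and `OwnerReference` types are simplified stand-ins for the real API types, and `isOrphanedStatefulSetPod` is a hypothetical helper. It flags a Pod that still carries StatefulSet labels but has no StatefulSet owner reference.

```go
package main

import "fmt"

// OwnerReference and Pod are simplified stand-ins for the Kubernetes API
// types; only the fields needed for the check are modeled.
type OwnerReference struct {
	Kind string
	Name string
}

type Pod struct {
	Name            string
	Labels          map[string]string
	OwnerReferences []OwnerReference
}

// isOrphanedStatefulSetPod (hypothetical helper) reports whether a Pod still
// carries labels set for StatefulSet Pods but no longer has a StatefulSet
// owner reference.
func isOrphanedStatefulSetPod(p Pod) bool {
	_, hasRevision := p.Labels["controller-revision-hash"]
	_, hasPodName := p.Labels["statefulset.kubernetes.io/pod-name"]
	if !hasRevision && !hasPodName {
		return false
	}
	for _, ref := range p.OwnerReferences {
		if ref.Kind == "StatefulSet" {
			return false
		}
	}
	return true
}

func main() {
	labels := map[string]string{"controller-revision-hash": "web-5d8f7c"}
	orphan := Pod{Name: "web-0", Labels: labels}
	owned := Pod{
		Name:            "web-1",
		Labels:          labels,
		OwnerReferences: []OwnerReference{{Kind: "StatefulSet", Name: "web"}},
	}
	fmt.Println(orphan.Name, isOrphanedStatefulSetPod(orphan))
	fmt.Println(owned.Name, isOrphanedStatefulSetPod(owned))
}
```

In a real cluster the same check would run over Pods listed through a client library; the label values used here are made up for illustration.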

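I took this opportunity to read the source code for StatefulSet deletion. I had assumed the StatefulSet controller drives it, but the code shows that the controller only reconciles changes in the number of replicas and does nothing for a StatefulSet that is being deleted. The relevant code in the StatefulSet controller: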

// If the StatefulSet is being deleted, don't do anything other than updating
// status.
if set.DeletionTimestamp != nil {
return &status, nil
}

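## How Kubernetes deletes resources
When deleting a resource, Kubernetes supports both cascading and non-cascading deletion.
### Controlling how the garbage collector deletes dependents
#### Cascading deletion
When you delete an object, you can specify whether its dependents are also deleted automatically. Deleting dependents automatically is called cascading deletion. Kubernetes has two cascading deletion modes: background and foreground.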

kubectl delete statefulset  -n de2ca8d1-94b4-4faa-8077-e9374ca9db4e 5bagk2rivkjno --cascade=true

Background cascading deletion: Kubernetes deletes the owner object immediately, and the garbage collector then deletes the dependents in the background.

   curl -X DELETE 127.0.0.1:8080/apis/extensions/v1beta1/namespaces/default/replicasets/my-repset \
-d '{"kind":"DeleteOptions","apiVersion":"v1","propagationPolicy":"Background"}' \
-H "Content-Type: application/json"

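Foreground cascading deletion: the root object first enters a "deletion in progress" state. In that state the object's deletionTimestamp field is set, its metadata.finalizers field contains the value "foregroundDeletion", and the object is still visible through the REST API. Once the object is in this state, the garbage collector deletes its dependents. After it has deleted all "blocking" dependents (objects with ownerReference.blockOwnerDeletion=true), it deletes the owner object.
If an object's ownerReferences field is set by a controller (such as a Deployment or ReplicaSet), blockOwnerDeletion is set automatically and there is no need to modify the field by hand.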

curl -X DELETE localhost:8080/apis/extensions/v1beta1/namespaces/default/replicasets/my-repset \
-d '{"kind":"DeleteOptions","apiVersion":"v1","propagationPolicy":"Foreground"}' \
-H "Content-Type: application/json"

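#### Non-cascading deletion

If an object is deleted without automatically deleting its dependents, those dependents are said to be orphaned. This can be done with the following command: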

kubectl delete statefulset  -n de2ca8d1-94b4-4faa-8077-e9374ca9db4e 5bagk2rivkjno --cascade=false

Or:

curl -X DELETE 127.0.0.1:8080/apis/extensions/v1beta1/namespaces/default/replicasets/my-repset \
-d '{"kind":"DeleteOptions","apiVersion":"v1","propagationPolicy":"Orphan"}' \
-H "Content-Type: application/json"
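Since the original problem came from API calls that did not set a propagation policy, it may help to see the same request built programmatically. This is a minimal sketch using only the Go standard library. `deleteOptions` and `newDeleteRequest` are hypothetical names, the struct mirrors only the fields used in the curl examples, and the request is built but not sent.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// deleteOptions mirrors only the DeleteOptions fields used in the curl
// examples; it is not the real Kubernetes type.
type deleteOptions struct {
	Kind              string `json:"kind"`
	APIVersion        string `json:"apiVersion"`
	PropagationPolicy string `json:"propagationPolicy"`
}

// newDeleteRequest (hypothetical helper) builds, but does not send, a DELETE
// request that states the propagation policy explicitly.
func newDeleteRequest(url, policy string) (*http.Request, []byte, error) {
	switch policy {
	case "Orphan", "Background", "Foreground":
	default:
		return nil, nil, fmt.Errorf("unknown propagation policy %q", policy)
	}
	body, err := json.Marshal(deleteOptions{
		Kind:              "DeleteOptions",
		APIVersion:        "v1",
		PropagationPolicy: policy,
	})
	if err != nil {
		return nil, nil, err
	}
	req, err := http.NewRequest(http.MethodDelete, url, bytes.NewReader(body))
	if err != nil {
		return nil, nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, body, nil
}

func main() {
	url := "http://127.0.0.1:8080/apis/extensions/v1beta1/namespaces/default/replicasets/my-repset"
	req, body, err := newDeleteRequest(url, "Foreground")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.Path)
	fmt.Println(string(body))
}
```

Passing the policy explicitly avoids depending on whatever default the API group version or client library applies.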

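## Analysis of the Kubernetes apps deletion flow

The deletion flow has a few key stages, starting with the REST service provided by kube-apiserver:

  1. A REST API call updates the state of the app object in etcd, which includes setting the deletionTimestamp and finalizers fields, and this triggers an update event.

  2. kube-controller-manager receives the update event for the app object. It updates its internal graph (a graph of the ownership relations between cluster resources), and the garbage collector deletes the resource and its dependents. This only deletes the objects from etcd.

  3. kubelet receives the resource deletion event from step 1 and deletes and reclaims the underlying resources.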
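The first two steps hinge on one rule: an object that still has finalizers is only marked for deletion, and it disappears from storage once its last finalizer is removed. The sketch below models that rule with an in-memory map standing in for etcd. The `object` and `store` types and their methods are hypothetical simplifications, not the apiserver implementation.

```go
package main

import (
	"fmt"
	"time"
)

// object is a simplified stand-in for the metadata of a stored API object.
type object struct {
	Name              string
	DeletionTimestamp *time.Time
	Finalizers        []string
}

// store is an in-memory stand-in for etcd.
type store map[string]*object

// requestDelete models step 1: with no finalizers the object is removed at
// once; otherwise it is only marked with a deletion timestamp.
func (s store) requestDelete(name string) (deletedImmediately bool) {
	obj, ok := s[name]
	if !ok {
		return false
	}
	if len(obj.Finalizers) == 0 {
		delete(s, name)
		return true
	}
	if obj.DeletionTimestamp == nil {
		now := time.Now()
		obj.DeletionTimestamp = &now
	}
	return false
}

// removeFinalizer models the end of step 2: once the last finalizer of an
// object marked for deletion is gone, the object leaves the store.
func (s store) removeFinalizer(name, finalizer string) {
	obj, ok := s[name]
	if !ok {
		return
	}
	kept := []string{}
	for _, f := range obj.Finalizers {
		if f != finalizer {
			kept = append(kept, f)
		}
	}
	obj.Finalizers = kept
	if obj.DeletionTimestamp != nil && len(kept) == 0 {
		delete(s, name)
	}
}

func main() {
	s := store{"web": {Name: "web", Finalizers: []string{"foregroundDeletion"}}}
	fmt.Println("deleted immediately:", s.requestDelete("web"))
	_, visible := s["web"]
	fmt.Println("still visible:", visible)
	s.removeFinalizer("web", "foregroundDeletion")
	_, visible = s["web"]
	fmt.Println("visible after finalizer removal:", visible)
}
```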

### kube-apiserver
At startup, kube-apiserver registers the handlers for its REST services on top of go-restful. The main code is shown below:
pkg/master/master.go

// InstallAPIs will install the APIs for the restStorageProviders if they are enabled.
func (m *Master) InstallAPIs(apiResourceConfigSource serverstorage.APIResourceConfigSource, restOptionsGetter generic.RESTOptionsGetter, restStorageProviders ...RESTStorageProvider) {
apiGroupsInfo := []*genericapiserver.APIGroupInfo{}

for _, restStorageBuilder := range restStorageProviders {
groupName := restStorageBuilder.GroupName()
if !apiResourceConfigSource.AnyVersionForGroupEnabled(groupName) {
klog.V(1).Infof("Skipping disabled API group %q.", groupName)
continue
}
apiGroupInfo, enabled := restStorageBuilder.NewRESTStorage(apiResourceConfigSource, restOptionsGetter)
if !enabled {
klog.Warningf("Problem initializing API group %q, skipping.", groupName)
continue
}
klog.V(1).Infof("Enabling API group %q.", groupName)

if postHookProvider, ok := restStorageBuilder.(genericapiserver.PostStartHookProvider); ok {
name, hook, err := postHookProvider.PostStartHook()
if err != nil {
klog.Fatalf("Error building PostStartHook: %v", err)
}
m.GenericAPIServer.AddPostStartHookOrDie(name, hook)
}

apiGroupsInfo = append(apiGroupsInfo, &apiGroupInfo)
}
// Install the API groups.
if err := m.GenericAPIServer.InstallAPIGroups(apiGroupsInfo...); err != nil {
klog.Fatalf("Error in registering group versions: %v", err)
}
}

The main call chain:

k8s.io/apiserver/pkg/server/genericapiserver.go:InstallAPIGroups()
k8s.io/apiserver/pkg/server/genericapiserver.go:installAPIResources()
k8s.io/apiserver/pkg/endpoints/groupversion.go:InstallREST()
k8s.io/apiserver/pkg/endpoints/installer.go:Install()
k8s.io/apiserver/pkg/endpoints/installer.go:registerResourceHandlers()

The core function for REST registration, which maps REST requests directly to operations on the etcd-backed storage:

staging/src/k8s.io/apiserver/pkg/endpoints/installer.go:183  registerResourceHandlers()

The handler for DELETE requests is registered here:

   .....
actions = appendIf(actions, action{"GET", itemPath, nameParams, namer, false}, isGetter)
if getSubpath {
actions = appendIf(actions, action{"GET", itemPath + "/{path:*}", proxyParams, namer, false}, isGetter)
}
actions = appendIf(actions, action{"PUT", itemPath, nameParams, namer, false}, isUpdater)
actions = appendIf(actions, action{"PATCH", itemPath, nameParams, namer, false}, isPatcher)
// Register the handler for DELETE requests.
actions = appendIf(actions, action{"DELETE", itemPath, nameParams, namer, false}, isGracefulDeleter)
.....

The interface is defined as shown below. Its Delete function may delete the data immediately or may delete the resource asynchronously:

// GracefulDeleter knows how to pass deletion options to allow delayed deletion of a
// RESTful object.
type GracefulDeleter interface {
// Delete finds a resource in the storage and deletes it.
// If options are provided, the resource will attempt to honor them or return an invalid
// request error.
// Although it can return an arbitrary error value, IsNotFound(err) is true for the
// returned error value err when the specified resource is not found.
// Delete *may* return the object that was deleted, or a status object indicating additional
// information about deletion.
// It also returns a boolean which is set to true if the resource was instantly
// deleted or false if it will be deleted asynchronously.
Delete(ctx genericapirequest.Context, name string, options *metav1.DeleteOptions) (runtime.Object, bool, error)
}
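The registration pattern and the interface above fit together as follows: a DELETE route is only registered when the storage object satisfies the deleter interface. The sketch below is a simplified illustration. `gracefulDeleter` is a trimmed-down, hypothetical version of the real interface, `appendIf` is written here with an assumed signature, and the storage types are made up.

```go
package main

import (
	"errors"
	"fmt"
)

// gracefulDeleter is a trimmed-down, hypothetical version of the interface:
// it keeps only the name argument and the "deleted immediately" result.
type gracefulDeleter interface {
	Delete(name string) (deletedImmediately bool, err error)
}

// action is a simplified route description.
type action struct {
	Verb string
	Path string
}

// appendIf appends a only when shouldAppend is true (assumed signature).
func appendIf(actions []action, a action, shouldAppend bool) []action {
	if shouldAppend {
		actions = append(actions, a)
	}
	return actions
}

// readOnlyStorage has no Delete method; deletableStorage has one.
type readOnlyStorage struct{}

type deletableStorage struct{ items map[string]bool }

func (s *deletableStorage) Delete(name string) (bool, error) {
	if !s.items[name] {
		return false, errors.New("not found")
	}
	delete(s.items, name)
	return true, nil
}

// actionsFor registers DELETE only if storage implements gracefulDeleter.
// GET is always registered here to keep the sketch short.
func actionsFor(storage interface{}, itemPath string) []action {
	_, isGracefulDeleter := storage.(gracefulDeleter)
	actions := []action{{Verb: "GET", Path: itemPath}}
	actions = appendIf(actions, action{Verb: "DELETE", Path: itemPath}, isGracefulDeleter)
	return actions
}

func main() {
	path := "statefulsets/{name}"
	fmt.Println(len(actionsFor(readOnlyStorage{}, path)), "route(s) for read-only storage")
	deletable := &deletableStorage{items: map[string]bool{"web": true}}
	fmt.Println(len(actionsFor(deletable, path)), "route(s) for deletable storage")
}
```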

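An important part of the Delete path is the processing done before the delete itself, which includes setting the deletionTimestamp and finalizers fields emphasized earlier. The StatefulSet-specific pre-processing for create, update, delete, and get lives in the k8s.io/kubernetes/pkg/registry/apps/statefulset directory, while the generic pre-processing lives in k8s.io/kubernetes/staging/src/k8s.io/apiserver/pkg/registry/rest/. The generic BeforeDelete is in staging/src/k8s.io/apiserver/pkg/registry/rest/delete.go:BeforeDelete().
The core Delete function is in staging/src/k8s.io/apiserver/pkg/registry/generic/registry/store.go:Delete(). Pay attention to the following snippet, which sets the deletion policy: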

....
var preconditions storage.Preconditions
if options.Preconditions != nil {
preconditions.UID = options.Preconditions.UID
preconditions.ResourceVersion = options.Preconditions.ResourceVersion
}
// Call BeforeDelete to determine whether the deletion must be graceful.
graceful, pendingGraceful, err := rest.BeforeDelete(e.DeleteStrategy, ctx, obj, options)
if err != nil {
return nil, false, err
}
......
// Handle combinations of graceful deletion and finalization by issuing
// the correct updates.
// Set the finalizers according to the orphan or foregroundDeletion policy.
shouldUpdateFinalizers, _ := deletionFinalizersForGarbageCollection(ctx, e, accessor, options)
// TODO: remove the check, because we support no-op updates now.
if graceful || pendingFinalizers || shouldUpdateFinalizers {
// Update the resource's metadata in etcd.
err, ignoreNotFound, deleteImmediately, out, lastExisting = e.updateForGracefulDeletionAndFinalizers(ctx, name, key, options, preconditions, obj)
}

// !deleteImmediately covers all cases where err != nil. We keep both to be future-proof.
if !deleteImmediately || err != nil {
return out, false, err
}
.....
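How an explicit propagation policy translates into finalizers can be sketched as below. This is a simplified model, not the body of deletionFinalizersForGarbageCollection: `finalizersForPolicy` and `sameSet` are hypothetical helpers, and the sketch assumes the policy is always given explicitly, ignoring the fallback to existing finalizers or to a strategy default. The finalizer names "orphan" and "foregroundDeletion" are the values used by Kubernetes.

```go
package main

import "fmt"

const (
	finalizerOrphan     = "orphan"
	finalizerForeground = "foregroundDeletion"
)

// finalizersForPolicy (hypothetical helper) drops both garbage-collection
// finalizers and adds back the one the explicit policy asks for. It reports
// whether the finalizer set changed, which decides if an update is needed.
func finalizersForPolicy(existing []string, policy string) (bool, []string) {
	result := []string{}
	for _, f := range existing {
		if f == finalizerOrphan || f == finalizerForeground {
			continue
		}
		result = append(result, f)
	}
	switch policy {
	case "Orphan":
		result = append(result, finalizerOrphan)
	case "Foreground":
		result = append(result, finalizerForeground)
	}
	// "Background" adds neither finalizer.
	return !sameSet(existing, result), result
}

// sameSet compares two string slices as sets.
func sameSet(a, b []string) bool {
	inA := map[string]bool{}
	for _, s := range a {
		inA[s] = true
	}
	inB := map[string]bool{}
	for _, s := range b {
		inB[s] = true
	}
	if len(inA) != len(inB) {
		return false
	}
	for s := range inA {
		if !inB[s] {
			return false
		}
	}
	return true
}

func main() {
	for _, policy := range []string{"Orphan", "Foreground", "Background"} {
		changed, finalizers := finalizersForPolicy(nil, policy)
		fmt.Println(policy, changed, finalizers)
	}
}
```

With Background, no finalizer is added, so the owner can be removed from storage right away and the garbage collector cleans up the dependents afterwards.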

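### kube-controller-manager
kube-controller-manager contains a GarbageCollector that cleans up and deletes resources. At startup it first runs a dependencyGraphBuilder, which builds a dependency graph of cluster resources. The graph builder lists all resources in the cluster, builds the graph from their metadata, and keeps the graph up to date by watching events.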

pkg/controller/garbagecollector/graph_builder.go:startMonitors()
func (gb *GraphBuilder) startMonitors() {
gb.monitorLock.Lock()
defer gb.monitorLock.Unlock()

if !gb.running {
return
}

// we're waiting until after the informer start that happens once all the controllers are initialized. This ensures
// that they don't get unexpected events on their work queues.
<-gb.informersStarted

// Determine which resources get a place in the graph; each one is watched through an informer. The garbage collector's graph only covers resources that can be deleted, see pkg/controller/garbagecollector/garbagecollector.go:GetDeletableResources().
monitors := gb.monitors
started := 0
for _, monitor := range monitors {
if monitor.stopCh == nil {
monitor.stopCh = make(chan struct{})
gb.sharedInformers.Start(gb.stopCh)
go monitor.Run()
started++
}
}
....

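The graph builder handles the resource state change events it receives and puts the objects that produced them on work queues. The main logic is in pkg/controller/garbagecollector/graph_builder.go:processGraphChanges():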

// Dequeueing an event from graphChanges, updating graph, populating dirty_queue.
func (gb *GraphBuilder) processGraphChanges() bool {
item, quit := gb.graphChanges.Get()
....
case (event.eventType == addEvent || event.eventType == updateEvent) && found:
// handle changes in ownerReferences
added, removed, changed := referencesDiffs(existingNode.owners, accessor.GetOwnerReferences())
if len(added) != 0 || len(removed) != 0 || len(changed) != 0 {
// check if the changed dependency graph unblock owners that are
// waiting for the deletion of their dependents.
gb.addUnblockedOwnersToDeleteQueue(removed, changed)
// update the node itself
existingNode.owners = accessor.GetOwnerReferences()
// Add the node to its new owners' dependent lists.
gb.addDependentToOwners(existingNode, added)
// remove the node from the dependent list of node that are no longer in
// the node's owners list.
gb.removeDependentFromOwners(existingNode, removed)
}

if beingDeleted(accessor) {
existingNode.markBeingDeleted()
}
gb.processTransitions(event.oldObj, accessor, existingNode)
case event.eventType == deleteEvent:
if !found {
klog.V(5).Infof("%v doesn't exist in the graph, this shouldn't happen", accessor.GetUID())
return true
}
// removeNode updates the graph
gb.removeNode(existingNode)
existingNode.dependentsLock.RLock()
defer existingNode.dependentsLock.RUnlock()
if len(existingNode.dependents) > 0 {
gb.absentOwnerCache.Add(accessor.GetUID())
}
for dep := range existingNode.dependents {
// Enqueue the dependents of the deleted object on the attemptToDelete queue.
gb.attemptToDelete.Add(dep)
}
for _, owner := range existingNode.owners {
ownerNode, found := gb.uidToNode.Read(owner.UID)
if !found || !ownerNode.isDeletingDependents() {
continue
}
// this is to let attempToDeleteItem check if all the owner's
// dependents are deleted, if so, the owner will be deleted.
// Re-enqueue the owner on the attemptToDelete queue so that its remaining dependents are checked again.
gb.attemptToDelete.Add(ownerNode)
}
}
return true
}
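The bookkeeping done by processGraphChanges can be illustrated with a much smaller model. The sketch below keeps a map from UID to node, links dependents to owners on add or update events, and on a delete event enqueues the dependents of the removed node. All type and method names are hypothetical, a slice stands in for the work queue, and the sketch assumes owners are added before their dependents (it does not model placeholder nodes for missing owners).

```go
package main

import (
	"fmt"
	"sort"
)

// node is a simplified graph node keyed by UID.
type node struct {
	uid        string
	owners     []string
	dependents map[string]*node
}

// graph models the UID index plus a slice standing in for the
// attemptToDelete work queue.
type graph struct {
	uidToNode       map[string]*node
	attemptToDelete []string
}

func newGraph() *graph { return &graph{uidToNode: map[string]*node{}} }

// upsert handles an add or update event: it refreshes the owner list and
// moves the node between the dependent lists of its old and new owners.
func (g *graph) upsert(uid string, owners []string) {
	n, found := g.uidToNode[uid]
	if !found {
		n = &node{uid: uid, dependents: map[string]*node{}}
		g.uidToNode[uid] = n
	}
	for _, o := range n.owners {
		if owner, ok := g.uidToNode[o]; ok {
			delete(owner.dependents, uid)
		}
	}
	n.owners = owners
	for _, o := range owners {
		if owner, ok := g.uidToNode[o]; ok {
			owner.dependents[uid] = n
		}
	}
}

// remove handles a delete event: the node leaves the graph and each of its
// dependents is queued for a deletion attempt.
func (g *graph) remove(uid string) {
	n, found := g.uidToNode[uid]
	if !found {
		return
	}
	delete(g.uidToNode, uid)
	for _, o := range n.owners {
		if owner, ok := g.uidToNode[o]; ok {
			delete(owner.dependents, uid)
		}
	}
	for depUID := range n.dependents {
		g.attemptToDelete = append(g.attemptToDelete, depUID)
	}
}

func main() {
	g := newGraph()
	g.upsert("sts-web", nil)
	g.upsert("pod-web-0", []string{"sts-web"})
	g.upsert("pod-web-1", []string{"sts-web"})
	g.remove("sts-web")
	sort.Strings(g.attemptToDelete) // map iteration order is not fixed
	fmt.Println(g.attemptToDelete)
}
```

If a Pod's owner reference is removed first (as happens with the Orphan policy), the update event takes it out of the owner's dependent list, so a later delete of the owner enqueues nothing for that Pod.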

pkg/controller/garbagecollector/garbagecollector.go:attemptToDeleteWorker() handles each object queued for deletion:

func (gc *GarbageCollector) attemptToDeleteItem(item *node) error {
....
// When the item to delete (for example a StatefulSet) is waiting for its dependents to be deleted, handle the dependents first.
// attemptToOrphanWorker() into attemptToDeleteItem() as well.
if item.isDeletingDependents() {
return gc.processDeletingDependentsItem(item)
}
....

switch {
case hasOrphanFinalizer(latest):
// if an existing orphan finalizer is already on the object, honor it.
policy = metav1.DeletePropagationOrphan
case hasDeleteDependentsFinalizer(latest):
// if an existing foreground finalizer is already on the object, honor it.
policy = metav1.DeletePropagationForeground
default:
// otherwise, default to background.
policy = metav1.DeletePropagationBackground
}
// Delete the object (for example a Pod of the StatefulSet) directly from the database.
klog.V(2).Infof("delete object %s with propagation policy %s", item.identity, policy)
return gc.deleteObject(item.identity, &policy)
}
}
}

// Finish deleting the dependents.
// process item that's waiting for its dependents to be deleted
func (gc *GarbageCollector) processDeletingDependentsItem(item *node) error {
blockingDependents := item.blockingDependents()
if len(blockingDependents) == 0 {
klog.V(2).Infof("remove DeleteDependents finalizer for item %s", item.identity)
// Remove the finalizer from the item in etcd, which also lets the resource be deleted from the database.
return gc.removeFinalizer(item, metav1.FinalizerDeleteDependents)
}
for _, dep := range blockingDependents {
if !dep.isDeletingDependents() {
klog.V(2).Infof("adding %s to attemptToDelete, because its owner %s is deletingDependents", dep.identity, item.identity)
gc.attemptToDelete.Add(dep)
}
}
return nil
}
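The two decisions in the code above, picking a propagation policy from the finalizers and waiting for blocking dependents, can be modeled in a few lines. The sketch below uses hypothetical types and function names and plain strings for the policies. It is meant to show the control flow, not to reproduce the garbage collector.

```go
package main

import "fmt"

// dependent and owner are simplified stand-ins for graph nodes.
type dependent struct {
	Name               string
	BlockOwnerDeletion bool
	DeletingDependents bool
}

type owner struct {
	Name       string
	Finalizers []string
	Dependents []dependent
}

// policyFor mirrors the switch: an existing finalizer wins, otherwise the
// policy defaults to Background.
func policyFor(finalizers []string) string {
	has := func(want string) bool {
		for _, f := range finalizers {
			if f == want {
				return true
			}
		}
		return false
	}
	switch {
	case has("orphan"):
		return "Orphan"
	case has("foregroundDeletion"):
		return "Foreground"
	default:
		return "Background"
	}
}

// processDeletingDependents reports unblocked=true when no blocking
// dependents remain, meaning the owner's finalizer can be removed. Otherwise
// it returns the blocking dependents that should be queued for deletion.
func processDeletingDependents(o owner) (unblocked bool, enqueue []string) {
	blocking := []dependent{}
	for _, d := range o.Dependents {
		if d.BlockOwnerDeletion {
			blocking = append(blocking, d)
		}
	}
	if len(blocking) == 0 {
		return true, nil
	}
	for _, d := range blocking {
		if !d.DeletingDependents {
			enqueue = append(enqueue, d.Name)
		}
	}
	return false, enqueue
}

func main() {
	o := owner{
		Name:       "web",
		Finalizers: []string{"foregroundDeletion"},
		Dependents: []dependent{
			{Name: "web-0", BlockOwnerDeletion: true},
			{Name: "web-1", BlockOwnerDeletion: true},
		},
	}
	fmt.Println(policyFor(o.Finalizers))
	fmt.Println(processDeletingDependents(o))
	o.Dependents = nil // all Pods are gone
	fmt.Println(processDeletingDependents(o))
}
```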

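### kubelet
kubelet is mainly responsible for releasing the underlying resources. The main Pod deletion logic is in kubelet's main function syncPod, which handles a kill-Pod event immediately: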

func (kl *Kubelet) syncPod(o syncPodOptions) error {
// pull out the required options
pod := o.pod
mirrorPod := o.mirrorPod
podStatus := o.podStatus
updateType := o.updateType

// if we want to kill a pod, do it now!
if updateType == kubetypes.SyncPodKill {
killPodOptions := o.killPodOptions
if killPodOptions == nil || killPodOptions.PodStatusFunc == nil {
return fmt.Errorf("kill pod options are required if update type is kill")
}
apiPodStatus := killPodOptions.PodStatusFunc(pod, podStatus)
kl.statusManager.SetPodStatus(pod, apiPodStatus)
// we kill the pod with the specified grace period since this is a termination
if err := kl.killPod(pod, nil, podStatus, killPodOptions.PodTerminationGracePeriodSecondsOverride); err != nil {
kl.recorder.Eventf(pod, v1.EventTypeWarning, events.FailedToKillPod, "error killing pod: %v", err)
// there was an error killing the pod, so we return that error directly
utilruntime.HandleError(err)
return err
}
return nil
}