Graceful restart in k8s

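In theory, k8s removes a pod from its service as soon as the pod enters the Terminating state, so configuring a graceful shutdown period should be enough. You can verify this with kubectl get endpoints.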
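One way to script that check is to read the JSON form of the Endpoints object and see whether a pod's IP is still listed among the ready addresses. The sketch below is only an illustration: `endpoint_ips` is a hypothetical helper, and `SAMPLE` is made-up, trimmed output of `kubectl get endpoints my-app -o json`, not data from a real cluster.

```python
import json


def endpoint_ips(endpoints: dict) -> tuple[set[str], set[str]]:
    """Return (ready, not_ready) pod IPs from an Endpoints object."""
    ready: set[str] = set()
    not_ready: set[str] = set()
    for subset in endpoints.get("subsets") or []:
        for addr in subset.get("addresses") or []:
            ready.add(addr["ip"])
        for addr in subset.get("notReadyAddresses") or []:
            not_ready.add(addr["ip"])
    return ready, not_ready


# Hypothetical, trimmed output of `kubectl get endpoints my-app -o json`.
SAMPLE = json.loads("""
{
  "kind": "Endpoints",
  "metadata": {"name": "my-app"},
  "subsets": [
    {
      "addresses": [{"ip": "10.0.0.11"}, {"ip": "10.0.0.12"}],
      "notReadyAddresses": [{"ip": "10.0.0.13"}],
      "ports": [{"port": 8080}]
    }
  ]
}
""")

ready, not_ready = endpoint_ips(SAMPLE)
print("ready:", sorted(ready))
print("not ready:", sorted(not_ready))
```

A pod that should no longer receive traffic is expected to be absent from the ready set.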

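So the core problem of a graceful restart is how to close idle keep-alive connections and then wait for in-flight requests to finish.
Some underlying HTTP servers (such as uvicorn) shut the process down gracefully when they receive SIGTERM, which includes cleaning up all active connections (including idle HTTP Keep-Alive connections). You can verify this as follows: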

telnet <ip> <port>
# Type the following, then press Enter twice
GET /health HTTP/1.1
Host: <ip>
Connection: keep-alive

You will see a normal HTTP response, and the connection stays open:

date: Fri, 24 Jan 2025 02:05:43 GMT
server: uvicorn
content-length: 4
content-type: application/json

"ok"

Now put the pod into the Terminating state, and you will see that the connection gets closed: Connection closed by foreign host.
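The same check can be scripted instead of typed into telnet. The sketch below opens a TCP connection, sends the keep-alive request, and then reports whether the server closes the idle connection within a timeout. `build_keepalive_request` and `watch_keepalive` are hypothetical helper names, and the 5-second default timeout is an arbitrary assumption.

```python
import socket


def build_keepalive_request(host: str, path: str = "/health") -> bytes:
    """Build the same request that the telnet session above types by hand."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: keep-alive\r\n"
        "\r\n"
    ).encode("ascii")


def watch_keepalive(host: str, port: int, idle_timeout: float = 5.0) -> str:
    """Send one request, then report whether the server closes the connection.

    Returns "closed" if the peer closes the connection before idle_timeout
    seconds pass without any data, otherwise "open".
    """
    with socket.create_connection((host, port), timeout=idle_timeout) as sock:
        sock.sendall(build_keepalive_request(host))
        try:
            while True:
                # An empty read means the peer closed the connection.
                if not sock.recv(65536):
                    return "closed"
        except socket.timeout:
            # No data and no close within the timeout: still connected.
            return "open"
        except ConnectionResetError:
            return "closed"
```

Running `watch_keepalive("<ip>", <port>, idle_timeout=60)` and deleting the pod during that window should return "closed" if the server behaves as described.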

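Introduction

When running containers on Kubernetes, you usually configure probes to keep pods healthy, and use terminationGracePeriodSeconds to cap how long a pod may spend cleaning up after it receives the termination signal.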

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  labels:
    app: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: my-app-container
          image: my-app:latest
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
            timeoutSeconds: 2
            successThreshold: 1
            failureThreshold: 3
          livenessProbe:
            tcpSocket:
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 10
            timeoutSeconds: 2
            successThreshold: 1
            failureThreshold: 10

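With the readiness and liveness probes, traffic is only forwarded to a container once it has started and is ready, and a failed container is restarted automatically.
However, for applications with strict request success rate requirements, this setup has a fairly serious problem:
During a rolling update, terminationGracePeriodSeconds delays the container's exit and gives in-flight requests some time to finish, but new requests keep arriving while the pod is terminating, so some requests are still affected.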

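How graceful restart works

The core problem of a graceful restart is that no new requests should be forwarded to a pod while it is being destroyed. When a pod switches to the Terminating state, its containers are sent a SIGTERM signal. The application needs to catch this signal and make the readiness probe's health check endpoint return a status code of 400 or above (503 indicates that the service is not ready). After failureThreshold consecutive failures, k8s stops forwarding new requests to the pod, and the remaining time is left for in-flight requests to complete.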

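In the YAML example from the introduction, once the pod receives SIGTERM it marks the health check endpoint as unavailable. The readiness probe runs every 10 seconds, and after 3 consecutive failures traffic is no longer forwarded to the pod (30-40 seconds). terminationGracePeriodSeconds is set to 60 seconds, so in-flight requests then have 20-30 seconds left to finish. If that is not enough, consider increasing terminationGracePeriodSeconds.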
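The arithmetic above can be written down as a small helper for trying other probe settings. This sketch simply mirrors the estimate in the text: it assumes the pod is marked unready somewhere between failureThreshold and failureThreshold + 1 probe periods after the signal. Actual kubelet probe timing and endpoint propagation can differ, so treat the result as a rough budget. `drain_budget` is a hypothetical name.

```python
def drain_budget(
    period_seconds: int,
    failure_threshold: int,
    grace_period_seconds: int,
) -> tuple[int, int]:
    """Return (min, max) seconds left for in-flight requests after the pod
    stops receiving traffic, under the rough estimate described above."""
    fastest_unready = failure_threshold * period_seconds
    slowest_unready = (failure_threshold + 1) * period_seconds
    return (
        max(0, grace_period_seconds - slowest_unready),
        max(0, grace_period_seconds - fastest_unready),
    )


# Values from the YAML example: periodSeconds=10, failureThreshold=3,
# terminationGracePeriodSeconds=60.
print(drain_budget(10, 3, 60))  # (20, 30)
```

A result of (0, 0) means the grace period is too short for the probe settings, which is the case where increasing terminationGracePeriodSeconds is worth considering.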

Graceful restart examples

python

Python can listen for signals with the built-in signal library.

import logging
import signal
import threading
from functools import partial

from fastapi import FastAPI, Request
from fastapi.responses import PlainTextResponse

app = FastAPI()
stop_event = threading.Event()


def _handler_termination_signal(signum, frame, app: FastAPI) -> None:
    match signum:
        case signal.SIGINT:
            logging.info("Received SIGINT signal, marking service as unhealthy.")
        case signal.SIGTERM:
            logging.info("Received SIGTERM signal, marking service as unhealthy.")
        case _:
            logging.warning(f"Received unexpected signal: {signum}")
            return
    # Set the flag so that /health starts returning 503.
    stop_event.set()


signal.signal(signal.SIGTERM, partial(_handler_termination_signal, app=app))
signal.signal(signal.SIGINT, partial(_handler_termination_signal, app=app))  # Ctrl+C to stop


@app.get("/health")
async def health_check(request: Request):
    if stop_event.is_set():
        return PlainTextResponse("stopped", status_code=503)
    return "ok"

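gunicorn

gunicorn manages its own master and worker processes, so calling signal directly in application code cannot catch SIGTERM. The handler has to be registered through gunicorn's own configuration hooks.

Create a gunicorn_config.py file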

import logging
import signal


# Handler for the SIGTERM signal
def handle_sigterm(signum, frame):
    from main import stop_event

    logging.info("Worker received SIGTERM, setting health to unhealthy...")
    stop_event.set()


# Install the signal handler when the worker is initialized
def post_worker_init(worker):
    signal.signal(signal.SIGTERM, handle_sigterm)
    logging.info("Signal handler for SIGTERM set in worker")

Pass the config file when starting gunicorn

gunicorn -c gunicorn_config.py main:app

The health check endpoint in main.py uses stop_event

import json
import os
import threading

from flask import Flask, Response

app = Flask(__name__)
stop_event = threading.Event()


@app.route("/health")
def health():
    if stop_event.is_set():
        return Response(
            json.dumps({"pid": os.getpid(), "status": "unhealthy"}),
            status=503,
            content_type="application/json",
        )
    else:
        return Response(
            json.dumps({"pid": os.getpid(), "status": "ok"}),
            status=200,
            content_type="application/json",
        )
