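1: Concept
Android WatchDog is the mechanism the Android system uses to monitor the running state of its critical system services. Its core goal is to detect when a system service has been unresponsive for a long time because of a deadlock, a blocked call, or another fault, and to trigger system recovery (such as a restart) when necessary.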
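2: Core Features
2.1 Service state monitoring
Periodically checks whether key system services (such as ActivityManager and WindowManager) respond normally, so that a blocked service does not freeze the system.
2.2 Timeout handling
If a service does not update its "heartbeat" (monitor) within the allotted time, WatchDog treats it as timed out and starts the follow-up handling.
2.3 Log collection and debugging
When a timeout occurs, WatchDog automatically collects system stack information (including the call stacks of all threads) to help locate the root cause.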
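The idea behind 2.3 can be illustrated on a plain JVM. The sketch below is not how Android's Watchdog collects its dumps; it only uses the standard `Thread.getAllStackTraces()` call to show what "the call stacks of all threads" looks like. The class and method names (`StackDumpSketch`, `dumpAllThreads`) are hypothetical.

```java
import java.util.Map;

// Hypothetical helper, not part of AOSP: formats the stack of every live thread in this JVM.
class StackDumpSketch {
    static String dumpAllThreads() {
        StringBuilder sb = new StringBuilder();
        // getAllStackTraces() returns a snapshot of all live threads and their frames.
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            sb.append('"').append(thread.getName()).append('"')
              .append(" state=").append(thread.getState()).append('\n');
            for (StackTraceElement frame : entry.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }
}
```

A thread that is stuck waiting for a lock shows up in such a dump in the BLOCKED state, with the frame that is waiting at the top of its stack, which is usually the starting point for the analysis.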
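2.4 System recovery
On a severe timeout, WatchDog may force a restart of the system process (system_server) or of the whole device, so that the user is not left facing an unresponsive screen for a long time.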
3: Implementation
3.1 Startup
Like the system services, WatchDog is started by SystemServer. The startup logic is as follows:
frameworks/base/services/java/com/android/server/SystemServer.java

    private void startBootstrapServices(@NonNull TimingsTraceAndSlog t) {
        ...
        // Create the Watchdog object and start its thread
        t.traceBegin("StartWatchdog");
        final Watchdog watchdog = Watchdog.getInstance();
        watchdog.start();
        mDumper.addDumpable(watchdog);
        t.traceEnd();
        ...
        // Initialize Watchdog
        t.traceBegin("InitWatchdog");
        watchdog.init(mSystemContext, mActivityManagerService);
        t.traceEnd();
        ...
    }
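3.2 Initialization
- WatchDog runs in the "watchdog" thread
- "watchdog.monitor" is a background thread that handles the tasks sent to it through a Handler
- HandlerChecker wraps a Handler bound to the ServiceThread's Looper. The Handler posts messages and tasks to the ServiceThread for execution, which is how the thread's responsiveness is monitored
- Critical system threads such as "monitor thread", "foreground thread", "main thread", "ui thread", "i/o thread", "display thread", "animation thread", and "surface animation thread" are added to the monitored set
- The monitor for Binder threads is initialized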
frameworks/base/services/core/java/com/android/server/Watchdog.java

    private Watchdog() {
        // Create a thread named "watchdog"
        mThread = new Thread(this::run, "watchdog");
        ...
        // Start a background thread named "watchdog.monitor" that handles the tasks sent through a Handler
        ServiceThread t = new ServiceThread("watchdog.monitor",
                android.os.Process.THREAD_PRIORITY_DEFAULT, true /*allowIo*/);
        t.start();
        // Create a Handler bound to the ServiceThread's Looper. The Handler posts messages and
        // tasks to the ServiceThread, which is how the thread's responsiveness is monitored
        mMonitorChecker = new HandlerChecker(new Handler(t.getLooper()), "monitor thread", mLock);
        // Monitor critical system threads: "monitor thread", "foreground thread", "main thread",
        // "ui thread", "i/o thread", "display thread", "animation thread", "surface animation thread"
        mHandlerCheckers.add(withDefaultTimeout(mMonitorChecker));
        ...
        // Initialize the monitor for Binder threads
        addMonitor(new BinderThreadMonitor());
        ...
    }

    public void start() {
        // Start the "watchdog" thread
        mThread.start();
    }

    public void init(Context context, ActivityManagerService activity) {
        mActivity = activity;
        // Register the receiver for the reboot broadcast
        context.registerReceiver(new RebootRequestReceiver(),
                new IntentFilter(Intent.ACTION_REBOOT),
                android.Manifest.permission.REBOOT, null);
        ...
    }
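The HandlerChecker idea (post a task to the monitored thread and see whether it gets executed in time) can be tried outside Android. The following is a minimal sketch under stated assumptions: a single-thread `ExecutorService` stands in for a thread with a Looper and Handler, and `ProbeCheckerSketch` and `isResponsive` are hypothetical names, not AOSP APIs.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical stand-in for the HandlerChecker idea, using only the JDK.
class ProbeCheckerSketch {
    // Posts a no-op probe to the worker and reports whether it ran within timeoutMillis.
    static boolean isResponsive(ExecutorService worker, long timeoutMillis) {
        Future<?> probe = worker.submit(() -> { });
        try {
            probe.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true; // the worker picked up and finished the probe in time
        } catch (TimeoutException e) {
            probe.cancel(false); // the worker is busy or stuck; drop the pending probe
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            return false;
        } catch (ExecutionException e) {
            return false; // the probe itself failed
        }
    }
}
```

If the worker is blocked inside an earlier task, the probe stays queued behind it and `isResponsive` returns false once the timeout expires, which is the same signal the watchdog relies on.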
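3.3 Periodic checks
- The check interval is 15s and the maximum wait time is 60s
- Different states represent different wait times: WAITING (has waited less than 15s) / WAITED_UNTIL_PRE_WATCHDOG (has waited 15s-60s) / OVERDUE (has waited more than 60s) / COMPLETED (finished)
- The first check that finds a wait in the 15s-60s range (in practice 15s-30s, because checks run every 15s) collects stack traces but does not restart; later checks in the 30s-60s range take no further action. Once the wait exceeds 60s, stack traces are collected and the restart is triggered immediately
- Detection method: the checker calls the monitor function of each monitored target and judges the timeout by how long the call takes to return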
    private void run() {
        boolean waitedHalf = false;
        while (true) {
            List<HandlerChecker> blockedCheckers = Collections.emptyList();
            ...
            boolean doWaitedPreDump = false;
            // Watchdog timeout (60s)
            final long watchdogTimeoutMillis = mWatchdogTimeoutMillis;
            // Watchdog check interval (15s)
            final long checkIntervalMillis = watchdogTimeoutMillis / PRE_WATCHDOG_TIMEOUT_RATIO;
            ...
            synchronized (mLock) {
                long sfHangTime;
                long timeout = checkIntervalMillis;
                ...
                for (int i=0; i<mHandlerCheckers.size(); i++) {
                    HandlerCheckerAndTimeout hc = mHandlerCheckers.get(i);
                    // scheduleCheckLocked moves the monitors in mMonitorQueue into mMonitors
                    // and clears mMonitorQueue
                    hc.checker().scheduleCheckLocked(hc.customTimeoutMillis()
                            .orElse(watchdogTimeoutMillis * Build.HW_TIMEOUT_MULTIPLIER));
                }
                ...
                long start = SystemClock.uptimeMillis();
                while (timeout > 0) {
                    ...
                    try {
                        // Wait for one check interval (15s)
                        mLock.wait(timeout);
                        // Note: mHandlerCheckers and mMonitorChecker may have changed after waiting
                    } catch (InterruptedException e) {
                        Log.wtf(TAG, e);
                    }
                    ...
                    timeout = checkIntervalMillis - (SystemClock.uptimeMillis() - start);
                }
                ...
                if (sfHangTime > TIME_SF_WAIT * 2) {
                    ...
                } else {
                    // Map the wait time of the HandlerCheckers to a state: WAITING (less than 15s),
                    // WAITED_UNTIL_PRE_WATCHDOG (15s-60s), OVERDUE (more than 60s), COMPLETED (finished)
                    final int waitState = evaluateCheckerCompletionLocked();
                    if (waitState == COMPLETED) {
                        // The monitored threads returned their heartbeat in time:
                        // reset waitedHalf and start the next iteration
                        ...
                        waitedHalf = false;
                        continue;
                    } else if (waitState == WAITING) {
                        // No heartbeat yet, but the wait is still under 15s:
                        // keep waiting, take no action, start the next iteration
                        continue;
                    } else if (waitState == WAITED_UNTIL_PRE_WATCHDOG) {
                        // No heartbeat after 15s-60s
                        if (!waitedHalf) {
                            // First time in this state (15s-30s): prepare the pre-watchdog dump
                            Slog.i(TAG, "WAITED_UNTIL_PRE_WATCHDOG");
                            waitedHalf = true;
                            // Get the blocked checkers
                            blockedCheckers = getCheckersWithStateLocked(WAITED_UNTIL_PRE_WATCHDOG);
                            subject = describeCheckersLocked(blockedCheckers);
                            pids = new ArrayList<>(mInterestingJavaPids);
                            doWaitedPreDump = true;
                        } else {
                            // Already dumped once (30s-60s): keep waiting
                            continue;
                        }
                    } else {
                        // No heartbeat within 60s
                        // something is overdue!
                        blockedCheckers = getCheckersWithStateLocked(OVERDUE);
                        subject = describeCheckersLocked(blockedCheckers);
                        allowRestart = mAllowRestart;
                        pids = new ArrayList<>(mInterestingJavaPids);
                    }
                }
            } // END synchronized (mLock)
            // Write the stack traces to the log
            logWatchog(doWaitedPreDump, subject, pids);
            if (doWaitedPreDump) {
                // This was only the pre-watchdog dump (15s-30s): start the next iteration
                continue;
            }
            ...
            if (debuggerWasConnected >= 2) {
                Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
            } else if (debuggerWasConnected > 0) {
                Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
            } else if (!allowRestart) {
                Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
            } else {
                Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
                // Diagnose the blocked checkers and log the related information
                WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
                Slog.w(TAG, "*** GOODBYE!");
                ...
                exceptionHang.WDTMatterJava(330);
                if (mSfHang) {
                    ...
                } else {
                    // Kill the current process, i.e. the process the Watchdog thread runs in
                    // (system_server)
                    Process.killProcess(Process.myPid());
                }
                // Terminate the running VM
                System.exit(10);
            }
            waitedHalf = false;
        }
    }

    public static class HandlerChecker implements Runnable {
        public void scheduleCheckLocked(long handlerCheckerTimeoutMillis) {
            mWaitMaxMillis = handlerCheckerTimeoutMillis;
            if (mCompleted) {
                // Move the monitors in mMonitorQueue into mMonitors and clear mMonitorQueue
                mMonitors.addAll(mMonitorQueue);
                mMonitorQueue.clear();
            }
            ...
        }

        public void run() {
            final int size = mMonitors.size();
            for (int i = 0 ; i < size ; i++) {
                synchronized (mLock) {
                    mCurrentMonitor = mMonitors.get(i);
                }
                // Check the monitored target; if it is stuck, this call blocks as well
                mCurrentMonitor.monitor();
            }
            synchronized (mLock) {
                mCompleted = true;
                mCurrentMonitor = null;
            }
        }
    }
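The state classification and the "dump once, then kill" decision described in 3.3 can be written as two small pure functions. This is a sketch under stated assumptions, not the AOSP code: the ratio of 4 is inferred from the 60s timeout and 15s interval above, the behavior at exactly 15s and 60s is an assumption, and the names `WaitStateSketch`, `classify`, `decide`, and `Action` are hypothetical.

```java
// Hypothetical model of the wait states from section 3.3; plain Java, no Android classes.
class WaitStateSketch {
    enum WaitState { COMPLETED, WAITING, WAITED_UNTIL_PRE_WATCHDOG, OVERDUE }
    enum Action { NONE, PRE_DUMP, DUMP_AND_KILL }

    static final long TIMEOUT_MILLIS = 60_000L; // maximum wait time (60s)
    static final int PRE_WATCHDOG_RATIO = 4;    // assumed: 60s / 4 = 15s check interval

    // Maps "has the checker finished, and how long has it been waiting" to a state.
    static WaitState classify(boolean completed, long waitedMillis) {
        if (completed) {
            return WaitState.COMPLETED;
        }
        if (waitedMillis < TIMEOUT_MILLIS / PRE_WATCHDOG_RATIO) {
            return WaitState.WAITING;
        }
        if (waitedMillis < TIMEOUT_MILLIS) {
            return WaitState.WAITED_UNTIL_PRE_WATCHDOG;
        }
        return WaitState.OVERDUE;
    }

    // Decides what one loop iteration does; waitedHalf records that the pre-dump already happened.
    static Action decide(WaitState state, boolean waitedHalf) {
        switch (state) {
            case WAITED_UNTIL_PRE_WATCHDOG:
                return waitedHalf ? Action.NONE : Action.PRE_DUMP;
            case OVERDUE:
                return Action.DUMP_AND_KILL;
            default:
                return Action.NONE;
        }
    }
}
```

For example, a checker that has been blocked for 20s is classified as WAITED_UNTIL_PRE_WATCHDOG; the first iteration that sees it produces PRE_DUMP, and the following iterations produce NONE until the wait reaches 60s, at which point the result is DUMP_AND_KILL.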
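4: Summary
4.1 How to add a specific thread to WatchDog monitoring
Answer: use AMS (ActivityManagerService) as a reference.
4.1.1 Implement the Watchdog.Monitor interface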
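    public class ActivityManagerService extends IActivityManager.Stub
            implements Watchdog.Monitor, BatteryStatsImpl.BatteryCallback, ActivityManagerGlobalLock {
        public void monitor() {
            // Acquire and immediately release the AMS lock; if another thread holds the lock
            // for too long, this call blocks and WatchDog detects the timeout
            synchronized (this) { }
        }
    }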
4.1.2 Register itself with WatchDog during initialization
    public ActivityManagerService(Context systemContext, ActivityTaskManagerService atm) {
        ...
        // AMS registers itself as a monitor and adds its handler thread to the monitored threads
        Watchdog.getInstance().addMonitor(this);
        Watchdog.getInstance().addThread(mHandler);
        ...
    }
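To see why an empty `synchronized` block is enough for monitor(), the pattern can be reproduced on a plain JVM. In the sketch below, `Monitor` is a stand-in for `Watchdog.Monitor`, and `MyService` and `monitorReturnsWithin` are hypothetical names; the real Watchdog class is internal to system_server and is not used here.

```java
// Hypothetical stand-ins for the AMS pattern; plain Java, no Android classes.
class MonitorSketch {
    // Stand-in for Watchdog.Monitor.
    interface Monitor {
        void monitor();
    }

    // A service whose monitor() only acquires and releases the service lock, as AMS does.
    static class MyService implements Monitor {
        final Object lock = new Object();

        @Override
        public void monitor() {
            synchronized (lock) { }
        }
    }

    // Runs monitor() on a helper thread and reports whether it returned within timeoutMillis.
    static boolean monitorReturnsWithin(Monitor target, long timeoutMillis) {
        Thread checker = new Thread(target::monitor, "monitor-check");
        checker.setDaemon(true); // a stuck check must not keep the JVM alive
        checker.start();
        try {
            checker.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            return false;
        }
        return !checker.isAlive();
    }
}
```

While some other thread holds `lock`, monitor() cannot enter the synchronized block, so `monitorReturnsWithin` returns false. The check needs no knowledge of what the service is doing; it only observes whether the service lock can still be acquired.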
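4.2 Advantages
4.2.1 Avoiding false kills
WatchDog distinguishes mild timeouts from severe ones, so a short burst of high load does not cause a false restart.
4.2.2 Performance overhead
The check interval (15 seconds by default) balances detection latency against resource consumption.
4.2.3 Deadlock detection
Checking the heartbeats of multiple services together exposes deadlocks that span services: when services hold each other's locks, their monitor() calls never return, and the timeout is detected.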