[Kylin Linux Advanced Server] Handling a soft lockup on a VMware virtual machine

System environment and configuration

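System environment	Physical machine / virtual machine / cloud / container	VMware virtual machine; host model YK SR750
Network environment	Public network / private network / no network	Private network
Hardware environment	Model	VMware Virtual Platform
	Processor	Intel(R) Xeon(R) Gold 6348 CPU @ 2.60GHz
	Memory	64 GB
	Machine type / architecture	x86
	Firmware version	Phoenix Technologies LTD
Software environment	Operating system version	Kylin Linux Advanced Server V10: Kylin Linux Advanced Server release V10 (Tercel)
	Kernel version	4.19.90-23.49.v2101.ky10.x86_64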

Symptom

A host running Kylin Linux Advanced Server V10 SP1 (build 0518) as a VMware virtual machine was reported to have hit a soft lockup.

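Analysis

Analyzing the vmcore

Analyzing vmcore-dmesg

The kernel watchdog detected a soft lockup on CPU0 and then panicked; the "softlockup: hung tasks" panic is the one the watchdog raises when kernel.softlockup_panic is enabled. The current task was the idle task swapper/0 (PID 0), and CPU0 had not let the watchdog thread run within the soft lockup threshold. The RIP in the lockup stack, RIP: 0010:__do_softirq+0x77/0x2e9, shows that CPU0 was processing softirqs when the watchdog timer fired.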

[6367835.162265] Kernel panic - not syncing: softlockup: hung tasks
[6367835.162266] CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Tainted: G L 4.19.90-23.49.v2101.ky10.x86_64 #1
[6367835.162267] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
[6367835.162267] Call Trace:
[6367835.162268] <IRQ>
[6367835.162272] dump_stack+0x66/0x8b
[6367835.162275] panic+0xe4/0x296
[6367835.162278] watchdog_timer_fn+0x25b/0x270
[6367835.162279] ? softlockup_fn+0x40/0x40
[6367835.162281] __hrtimer_run_queues+0x108/0x290
[6367835.162282] hrtimer_interrupt+0xe5/0x240
[6367835.162284] ? sched_clock+0x5/0x10
[6367835.162286] smp_apic_timer_interrupt+0x6a/0x130
[6367835.162287] apic_timer_interrupt+0xf/0x20
[6367835.162289] RIP: 0010:__do_softirq+0x77/0x2e9
[6367835.162290] Code: 05 ba 63 c1 5d 00 01 00 00 c7 44 24 20 0a 00 00 00 44 89 74 24 04 48 c7 c0 00 9f 02 00 65 66 c7 00 00 00 fb 66 0f 1f 44 00 00 <b8> ff ff ff ff 0f bc 44 24 04 83 c0 01 89 44 24 10 0f 84 9c 00 00
[6367835.162291] RSP: 0018:ffff89ae9d003f88 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13
[6367835.162292] RAX: 0000000000029f00 RBX: 0000000000000000 RCX: 0000000000000020
[6367835.162292] RDX: 00000000000000af RSI: ffff89ae90cdd6d0 RDI: 0000000000000000
[6367835.162292] RBP: 0000000000000000 R08: ffff89ae9d0286a0 R09: 0000000000000000
[6367835.162293] R10: ffff89ae9d003f48 R11: 0000000000000000 R12: 0000000000000000
[6367835.162293] R13: 0000000000000000 R14: 0000000000000010 R15: 0000000000000000
[6367835.162294] ? apic_timer_interrupt+0xa/0x20
[6367835.162296] ? __do_softirq+0x4b/0x2e9
[6367835.162297] ? sched_clock+0x5/0x10
[6367835.162298] irq_exit+0xfe/0x110
[6367835.162299] call_function_single_interrupt+0xf/0x20
[6367835.162299] </IRQ>
[6367835.162300] RIP: 0010:native_safe_halt+0xe/0x10
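When a log contains several RIP lines, as the one above does, it can help to list them together: the first is the interrupted softirq context and the second is the idle code that CPU0 was running before the interrupt arrived. The following is only a minimal sketch; the function name extract_rips and the regular expression are illustrative assumptions, not part of any Kylin or crash tooling.

```python
import re

# Matches kernel log lines such as:
#   [6367835.162289] RIP: 0010:__do_softirq+0x77/0x2e9
# The pattern is an assumption based on the log format shown in this article.
RIP_RE = re.compile(
    r"RIP:\s+[0-9a-f]{4}:(?P<symbol>[\w.]+)"
    r"\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)


def extract_rips(dmesg_text):
    """Return (symbol, offset, size) for every RIP line, in log order."""
    return [
        (m.group("symbol"), int(m.group("offset"), 16), int(m.group("size"), 16))
        for m in RIP_RE.finditer(dmesg_text)
    ]


# Two RIP lines copied from the vmcore-dmesg excerpt above.
sample = """\
[6367835.162289] RIP: 0010:__do_softirq+0x77/0x2e9
[6367835.162300] RIP: 0010:native_safe_halt+0xe/0x10
"""

for symbol, offset, size in extract_rips(sample):
    print(f"{symbol} at +{offset:#x} of {size:#x} bytes")
```

In practice the whole vmcore-dmesg file would be read and passed to extract_rips() instead of the short sample string.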

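Analyzing the vmcore

The vmcore shows an extremely high load average (76.15, 50.69, 39.92), meaning the system was under very heavy load over the previous 1, 5 and 15 minutes. Load this high can keep the CPUs saturated and delay critical tasks and interrupt handling.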
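A load average is easier to judge relative to the number of CPUs. The sketch below divides the three values by a CPU count; the count of 16 is an assumption, inferred from the "CPUs 0-7,9-15" NMI line in the kswapd0 log later in this article, and may not match the machine this vmcore came from. The function name load_per_cpu is hypothetical.

```python
def load_per_cpu(load_averages, cpu_count):
    """Divide each load average by the number of online CPUs."""
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return tuple(load / cpu_count for load in load_averages)


# Load averages reported for this vmcore (1, 5 and 15 minutes).
loads = (76.15, 50.69, 39.92)

# ASSUMPTION: 16 online CPUs; replace with the real count for the machine.
ASSUMED_CPUS = 16

windows = ("1 min", "5 min", "15 min")
for window, ratio in zip(windows, load_per_cpu(loads, ASSUMED_CPUS)):
    status = "above CPU capacity" if ratio > 1.0 else "within CPU capacity"
    print(f"{window}: {ratio:.2f} tasks per CPU ({status})")
```

On Linux the load average counts tasks in uninterruptible sleep as well as runnable tasks, so a ratio above 1 indicates pressure but does not by itself show that the CPUs were the bottleneck.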

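Call chain on CPU0, outermost first:

secondary_startup_64: early CPU startup code
 native_safe_halt: the CPU was idle, halted with interrupts enabled, when the interrupt arrived
  call_function_single_interrupt: IPI handler used for cross-CPU function calls and synchronization
   irq_exit: ends hard-interrupt handling and runs any pending softirqs before returning to the interrupted context
    __softirqentry_text_start: softirq entry point (shown as __do_softirq in the dmesg trace), which handles softirq work such as network packet processing and timers

Checking the hardware interrupts on CPU0 shows no hard interrupts on that CPU, so an interrupt storm is ruled out.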

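Analyzing the vmcore

The log shows a soft lockup: kswapd0 on CPU#8 ran in kernel mode for too long (the watchdog reports 23 s) without yielding the CPU, so other tasks could not run there. This kind of lockup is usually caused by a kernel bug or by resource contention. kswapd0 is the kernel thread that reclaims memory pages in the background, including swapping pages out to disk, so its activity usually points to memory pressure or a memory-management performance problem.

The RIP is smp_call_function_many+0x224/0x250. smp_call_function_many() runs a given function on a set of other CPUs by sending them IPIs and, when asked to, waits for them to finish. The call trace shows that CPU8 reached it from arch_tlbbatch_flush(), so it had issued a TLB-flush IPI and was waiting for the other CPUs to respond. At least one of them apparently never responded, most likely because it was stuck with interrupts disabled, so CPU8 kept spinning until the soft lockup was reported.

vmcore-dmesg does not reveal which CPU failed to respond to the IPI, and the dump was saved as vmcore-incomplete, so no further analysis is possible.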

[8824094.110781] watchdog: BUG: soft lockup - CPU#8 stuck for 23s! [kswapd0:118]
...
[8824094.115075] CPU: 8 PID: 118 Comm: kswapd0 Kdump: loaded Not tainted 4.19.90-23.49.v2101.ky10.x86_64 #1
[8824094.115743] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
[8824094.116477] RIP: 0010:smp_call_function_many+0x224/0x250
[8824094.116847] Code: c7 e8 a0 eb 74 00 3b 05 2e 5a 51 01 0f 83 5f fe ff ff 48 63 c8 48 8b 13 48 03 14 cd 40 a7 d5 83 8b 4a 18 83 e1 01 74 0a f3 90 <8b> 4a 18 83 e1 01 75 f6 eb c7 48 c7 c2 e0 9d 06 84 48 89 ee 89 c7
[8824094.117954] RSP: 0018:ffffba1746be3af0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
[8824094.118324] RAX: 0000000000000009 RBX: ffff98d45d22ba00 RCX: 0000000000000001
[8824094.118713] RDX: ffff98d45d2713c0 RSI: 0000000000000000 RDI: ffff98c5810f2d00
[8824094.119092] RBP: 0000000000000080 R08: 000000000002f060 R09: ffffffff82a5040a
[8824094.119482] R10: ffffe4f0e828d600 R11: 0000000000000000 R12: 0000000000000001
[8824094.119922] R13: 000000000002b9c0 R14: ffffffff82a73980 R15: ffffba1746be3b30
[8824094.120311] FS: 0000000000000000(0000) GS:ffff98d45d200000(0000) knlGS:0000000000000000
[8824094.120739] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[8824094.121121] CR2: 0000000005d0b6a4 CR3: 000000075f40a005 CR4: 00000000007606e0
[8824094.121534] PKRU: 55555554
[8824094.121921] Call Trace:
[8824094.122308] arch_tlbbatch_flush+0x6f/0xd0
[8824094.122715] try_to_unmap_flush+0x26/0x40
[8824094.123090] shrink_page_list+0x43c/0xd60
[8824094.123465] shrink_inactive_list+0x2c1/0x760
[8824094.123837] shrink_node_memcg+0x365/0x770
[8824094.124203] ? shrink_slab+0x54/0x2b0
[8824094.124575] ? shrink_slab+0x54/0x2b0
[8824094.124927] ? shrink_node+0xcf/0x410
[8824094.125263] shrink_node+0xcf/0x410
[8824094.125601] kswapd+0x2b1/0x6e0
[8824094.125925] ? mem_cgroup_shrink_node+0x170/0x170
[8824094.126237] kthread+0x113/0x130
[8824094.126554] ? kthread_create_worker_on_cpu+0x70/0x70
[8824094.126852] ret_from_fork+0x1f/0x40
[8824094.127141] Sending NMI from CPU 8 to CPUs 0-7,9-15:
[8824104.064527] Kernel panic - not syncing: softlockup: hung tasks
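The watchdog line and the NMI line in the log above carry the key facts of this incident: which CPU was stuck, for how long, in which task, and which other CPUs were then sent an NMI. The sketch below pulls those fields out of the two lines. The helper names parse_soft_lockup and expand_cpu_list and the regular expression are illustrative assumptions based only on the log format shown here.

```python
import re

# Matches e.g. "watchdog: BUG: soft lockup - CPU#8 stuck for 23s! [kswapd0:118]"
LOCKUP_RE = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^\]]+):(?P<pid>\d+)\]"
)


def parse_soft_lockup(line):
    """Return the stuck CPU, duration, task name and PID, or None if no match."""
    m = LOCKUP_RE.search(line)
    if m is None:
        return None
    return {
        "cpu": int(m.group("cpu")),
        "seconds": int(m.group("secs")),
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
    }


def expand_cpu_list(spec):
    """Expand a kernel CPU list such as "0-7,9-15" into a list of CPU numbers."""
    cpus = []
    for part in spec.split(","):
        first, _, last = part.strip().partition("-")
        cpus.extend(range(int(first), int(last or first) + 1))
    return cpus


# Lines copied from the log above.
lockup_line = "[8824094.110781] watchdog: BUG: soft lockup - CPU#8 stuck for 23s! [kswapd0:118]"
nmi_targets = "0-7,9-15"  # from "Sending NMI from CPU 8 to CPUs 0-7,9-15:"

info = parse_soft_lockup(lockup_line)
others = expand_cpu_list(nmi_targets)
print(info)
print(f"{len(others)} other CPUs received the NMI: {others}")
```

The CPU that failed to answer the IPI would be one of the CPUs in that list, but as noted above the log does not say which one.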