Linux Process Scheduling Series 1 - How a User-Space Process Gets Scheduled Off the CPU
This series is about Linux process scheduling. It does not cover scheduling algorithms; it looks at how the actual Linux source code performs a process switch.
Two things should be clear up front. A process leaves the CPU in one of two ways:
- It actively asks the scheduler to run (it gives up the CPU voluntarily).
- Its time slice runs out, or it is preempted (it gives up the CPU involuntarily).
This article only covers in detail the case where a process running in user mode is scheduled out because its time slice has expired.
The remaining cases will be explored in the following articles!
1. A user-mode process is scheduled out when its time slice expires
1. A process (call it process A) is running in user mode when a timer interrupt arrives (this timer interrupt exists specifically to drive time sharing).
2. On taking the interrupt, the CPU saves the user-mode `PC` into `ELR_EL1`, saves `PSTATE` into `SPSR_EL1`, masks interrupts, and switches to the process's kernel stack (entering the kernel from user mode always starts from the very beginning of that kernel stack).
3. The interrupt entry code then saves the process's user-mode execution context onto the kernel stack, laid out as a stack frame.
4. Inside the timer interrupt handler:
   - The interrupt count in this CPU's `preempt count` is incremented (it records the interrupt nesting depth, and a value greater than zero means we are in interrupt context).
   - The time slice of the task that `current` points to is decremented (other bookkeeping details are not covered here), and once it is used up the `need_resched` flag is set to indicate that a reschedule is needed.
5. On the way out of the handler, the interrupt count in the `preempt count` is decremented and execution enters `ret_to_user`.
6. `ret_to_user` checks whether the interrupt count and the preemption count in the `preempt count` are both zero. Both being zero means we are back in process context and are allowed to schedule; otherwise scheduling is not allowed and the code goes to `finish_ret_to_user`, which restores the process's execution context from the stack frame on its kernel stack, points the kernel `SP` back at the start of the kernel stack, and executes `eret` to return to user mode and keep running.
7. From here on we follow the case where scheduling is allowed (both the interrupt count and the preemption count are zero): `schedule()` is entered and the scheduler takes over. (A minimal C sketch of this tick / `need_resched` / return-to-user flow follows right after this list.)
8. Inside the scheduler, the scheduling algorithm decides whether a switch is needed and, if so, picks the task that should go on the CPU next (assume here that a switch is needed).
9. With the `task_struct` of the chosen task (call it process B) in hand, `cpu_switch_to` performs the context switch. It saves process A's `x19`~`x29`, `sp` (the kernel stack pointer) and `lr` into A's `task_struct`, and restores B's `x19`~`x29`, `sp` and `lr` from B's `task_struct` into the registers. This switches the kernel stack, and execution now continues **at the `cpu_switch_to` call point where process B was switched off its kernel stack earlier**. (Every task is scheduled out at this same point in `cpu_switch_to`, and every task that is scheduled back in resumes at this same point; because `sp` and `lr` are restored, the execution environment at the original scheduling point is fully re-established.) It follows that the only state distinguishing one task's scheduling point from another's is `x19`~`x29`, `sp` (kernel stack) and `lr`, and since these registers can be saved and restored, scheduling works cleanly. Note that at this point we have not returned to the process's user mode yet; it simply keeps running in kernel mode.
10. Process A is now off the CPU and process B runs (after time 2 in the figure).
11. At some later moment (time 3 in the figure) the scheduler runs again, goes through `cpu_switch_to` once more, and does exactly what step 9 described (times 3 and 4 in the figure). Process A is now back on the CPU, still in kernel mode.
12. Process A then keeps executing (after time 4 in the figure). The code that follows produces exactly the same result as if the process had never been scheduled out, because its `SP` has not changed and its `lr` (return address) has not changed, i.e. its function call stack is intact. So it naturally continues with the final part of the earlier timer-interrupt handling, restoring the user-mode context from the stack frame on its kernel stack back into the registers.
13. Finally, the `eret` instruction is executed and the process returns to user mode and continues there.
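To make the tick-driven path above concrete, here is a minimal, self-contained C sketch of the same logic. It is a user-space simulation, not kernel code: every name in it (`time_slice`, `irq_count`, and so on) is made up for illustration, and the real kernel implements the equivalent steps in `scheduler_tick()`, the thread-info flags and the `ret_to_user` assembly listed in section 4.

```c
#include <stdbool.h>
#include <stdio.h>

static int  time_slice    = 3;      /* ticks left for the current task  */
static bool need_resched  = false;  /* "please reschedule" mark         */
static int  irq_count     = 0;      /* > 0 means interrupt context      */
static int  preempt_count = 0;      /* > 0 means preemption is disabled */

/* Timer interrupt handler: only bookkeeping, never a direct switch. */
static void timer_tick(void)
{
	irq_count++;                    /* enter interrupt context */
	if (--time_slice <= 0)
		need_resched = true;    /* mark only; the switch happens later */
	irq_count--;                    /* leave interrupt context */
}

/* Interrupt-return path: the only place where the mark is acted upon. */
static void ret_to_user(void)
{
	if (need_resched && irq_count == 0 && preempt_count == 0) {
		need_resched = false;
		time_slice   = 3;
		printf("schedule(): pick the next task and call cpu_switch_to\n");
	}
	printf("restore the user frame from the kernel stack, then eret\n");
}

int main(void)
{
	for (int i = 0; i < 4; i++) {   /* four timer interrupts */
		timer_tick();
		ret_to_user();
	}
	return 0;
}
```

The point the sketch mirrors is that the timer interrupt only marks the current task as needing a reschedule; the switch itself happens later, on the interrupt-return path, and only once the counts show we are back in process context.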
The scheduling flow is illustrated in the diagram below:
2. Why doesn't a switch like this change the process's state?
- The process's kernel-stack `SP` does not change, its `lr` does not change, and its `x19`~`x29` do not change ------> the process's call chain is preserved.
- In particular, every switch happens at the same point inside `cpu_switch_to` ------> the kernel-mode `PC` does not change either: the PC when the task is scheduled off the CPU and the PC when it comes back on are the same.
- From these two points it follows that the process's execution environment (its state) has not changed.
- Of course one might ask: `x0`~`x18` are not saved, so doesn't that mean the execution environment has changed?
  - At the physical level, yes, the values in `x0`~`x18` may indeed be different afterwards.
  - But from the point of view of the process's execution state, `cpu_switch_to` itself does not use `x0`~`x18`, so even if they change it makes no difference at all.
- What if the function that calls `cpu_switch_to` does use `x0`~`x18`?
  - This is where the arm64 procedure call standard comes in: the caller must save `x0`~`x18` on the stack if it needs them across a call, while the callee must save `x19`~`x29` (if it is going to use them).
  - So the function that calls `cpu_switch_to` has already saved any `x0`~`x18` values it cares about on its stack; the stack contents are not touched by the switch, and `SP` itself is saved and restored, so the execution environment is still unchanged. (The per-task save area this relies on is sketched right below.)
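For reference, the per-task save area that `cpu_switch_to` writes and reads is `task_struct->thread.cpu_context` (that is what the `THREAD_CPU_CONTEXT` offset in the assembly in section 4 points at). The layout below follows arch/arm64/include/asm/processor.h, shown slightly simplified, so treat it as a reference sketch rather than an exact copy; what matters is that it holds only the callee-saved registers plus `sp` and a resume address, exactly the set discussed above.

```c
/*
 * Per-task register context saved and restored by cpu_switch_to (arm64).
 * Only the callee-saved registers, the kernel stack pointer and the
 * resume address live here; x0~x18 never need to be stored because the
 * caller of cpu_switch_to has already spilled anything it still needs.
 */
struct cpu_context {
	unsigned long x19;
	unsigned long x20;
	unsigned long x21;
	unsigned long x22;
	unsigned long x23;
	unsigned long x24;
	unsigned long x25;
	unsigned long x26;
	unsigned long x27;
	unsigned long x28;
	unsigned long fp;	/* x29, frame pointer                           */
	unsigned long sp;	/* kernel stack pointer of the task             */
	unsigned long pc;	/* resume address: the lr saved by cpu_switch_to */
};
```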
3. Closing
This article examined the case where a process is scheduled out of user mode because its time slice expired; the other cases will be explored in depth in follow-up articles!
4. Appendix: key scheduling-related kernel source
The most important piece is `cpu_switch_to`; that is where the actual process switch takes place.
`el0_irq`: handling of an interrupt taken while in user mode
```asm
SYM_CODE_START_LOCAL_NOALIGN(el0_irq)
	kernel_entry 0
el0_irq_naked:
	el0_interrupt_handler handle_arch_irq
	b	ret_to_user
SYM_CODE_END(el0_irq)
```
`ret_to_user`: the return path to user mode
- This is where the interrupt count and the preemption count are checked.
- If the interrupt count is zero we have already left interrupt context and entered process context; if it is non-zero we are still in interrupt context and process scheduling is not allowed.
```asm
/*
#define _TIF_WORK_MASK	(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
			 _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \
			 _TIF_UPROBE | _TIF_FSCHECK | _TIF_MTE_ASYNC_FAULT | \
			 _TIF_NOTIFY_SIGNAL)
*/
SYM_CODE_START_LOCAL(ret_to_user)
	disable_daif
	gic_prio_kentry_setup tmp=x3
#ifdef CONFIG_TRACE_IRQFLAGS
	bl	trace_hardirqs_off
#endif
	ldr	x19, [tsk, #TSK_TI_FLAGS]
	and	x2, x19, #_TIF_WORK_MASK	// test the _TIF_WORK_MASK bits of TSK_TI_FLAGS
	cbnz	x2, work_pending		// if any bit is set, branch to work_pending
finish_ret_to_user:
	user_enter_irqoff
	/* Ignore asynchronous tag check faults in the uaccess routines */
	clear_mte_async_tcf
	enable_step_tsk x19, x2
#ifdef CONFIG_GCC_PLUGIN_STACKLEAK
	bl	stackleak_erase
#endif
	kernel_exit 0

/*
 * Ok, we need to do extra processing, enter the slow path.
 */
work_pending:
	mov	x0, sp				// 'regs'
	mov	x1, x19
	bl	do_notify_resume
	ldr	x19, [tsk, #TSK_TI_FLAGS]	// re-check for single-step
	b	finish_ret_to_user
SYM_CODE_END(ret_to_user)
```
`do_notify_resume`: the function that calls `schedule()`
```c
asmlinkage void do_notify_resume(struct pt_regs *regs,
				 unsigned long thread_flags)
{
	do {
		/* Check valid user FS if needed */
		addr_limit_user_check();

		if (thread_flags & _TIF_NEED_RESCHED) {
			/* Unmask Debug and SError for the next task */
			local_daif_restore(DAIF_PROCCTX_NOIRQ);

			schedule();	// -> __schedule()
		} else {
			local_daif_restore(DAIF_PROCCTX);

			if (thread_flags & _TIF_UPROBE)
				uprobe_notify_resume(regs);

			if (thread_flags & _TIF_MTE_ASYNC_FAULT) {
				clear_thread_flag(TIF_MTE_ASYNC_FAULT);
				send_sig_fault(SIGSEGV, SEGV_MTEAERR,
					       (void __user *)NULL, current);
			}

			if (thread_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
				do_signal(regs);

			if (thread_flags & _TIF_NOTIFY_RESUME) {
				tracehook_notify_resume(regs);
				rseq_handle_notify_resume(NULL, regs);
			}

			if (thread_flags & _TIF_FOREIGN_FPSTATE)
				fpsimd_restore_current_state();
		}

		local_daif_mask();
		thread_flags = READ_ONCE(current_thread_info()->flags);
	} while (thread_flags & _TIF_WORK_MASK);
}
```
`__schedule`: the main scheduler function
```c
/*
 * __schedule() is the main scheduler function.
 *
 * The main means of driving the scheduler and thus entering this function are:
 *
 * 1. Explicit blocking: mutex, semaphore, waitqueue, etc.
 *
 * 2. TIF_NEED_RESCHED flag is checked on interrupt and userspace return
 *    paths. For example, see arch/x86/entry_64.S.
 *
 *    To drive preemption between tasks, the scheduler sets the flag in timer
 *    interrupt handler scheduler_tick().
 *
 * 3. Wakeups don't really cause entry into schedule(). They add a
 *    task to the run-queue and that's it.
 *
 *    Now, if the new task added to the run-queue preempts the current
 *    task, then the wakeup sets TIF_NEED_RESCHED and schedule() gets
 *    called on the nearest possible occasion:
 *
 *    - If the kernel is preemptible (CONFIG_PREEMPTION=y):
 *
 *      - in syscall or exception context, at the next outmost
 *        preempt_enable(). (this might be as soon as the wake_up()'s
 *        spin_unlock()!)
 *
 *      - in IRQ context, return from interrupt-handler to
 *        preemptible context
 *
 *    - If the kernel is not preemptible (CONFIG_PREEMPTION is not set)
 *      then at the next:
 *
 *       - cond_resched() call
 *       - explicit schedule() call
 *       - return from syscall or exception to user-space
 *       - return from interrupt-handler to user-space
 *
 * WARNING: must be called with preemption disabled!
 */
static void __sched notrace __schedule(bool preempt)
{
	struct task_struct *prev, *next;
	unsigned long *switch_count;
	unsigned long prev_state;
	struct rq_flags rf;
	struct rq *rq;
	int cpu;

	cpu = smp_processor_id();
	rq = cpu_rq(cpu);
	prev = rq->curr;

	schedule_debug(prev, preempt);

	if (sched_feat(HRTICK))
		hrtick_clear(rq);

	local_irq_disable();
	rcu_note_context_switch(preempt);

	/*
	 * Make sure that signal_pending_state()->signal_pending() below
	 * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
	 * done by the caller to avoid the race with signal_wake_up():
	 *
	 * __set_current_state(@state)		signal_wake_up()
	 * schedule()				  set_tsk_thread_flag(p, TIF_SIGPENDING)
	 *					  wake_up_state(p, state)
	 *   LOCK rq->lock			    LOCK p->pi_state
	 *   smp_mb__after_spinlock()		    smp_mb__after_spinlock()
	 *     if (signal_pending_state())	    if (p->state & @state)
	 *
	 * Also, the membarrier system call requires a full memory barrier
	 * after coming from user-space, before storing to rq->curr.
	 */
	rq_lock(rq, &rf);
	smp_mb__after_spinlock();

	/* Promote REQ to ACT */
	rq->clock_update_flags <<= 1;
	update_rq_clock(rq);

	switch_count = &prev->nivcsw;

	/*
	 * We must load prev->state once (task_struct::state is volatile), such
	 * that:
	 *
	 *  - we form a control dependency vs deactivate_task() below.
	 *  - ptrace_{,un}freeze_traced() can change ->state underneath us.
	 */
	prev_state = prev->state;
	if (!preempt && prev_state) {
		if (signal_pending_state(prev_state, prev)) {
			prev->state = TASK_RUNNING;
		} else {
			prev->sched_contributes_to_load =
				(prev_state & TASK_UNINTERRUPTIBLE) &&
				!(prev_state & TASK_NOLOAD) &&
				!(prev->flags & PF_FROZEN);

			if (prev->sched_contributes_to_load)
				rq->nr_uninterruptible++;

			/*
			 * __schedule()			ttwu()
			 *   prev_state = prev->state;    if (p->on_rq && ...)
			 *   if (prev_state)		    goto out;
			 *     p->on_rq = 0;		  smp_acquire__after_ctrl_dep();
			 *				  p->state = TASK_WAKING
			 *
			 * Where __schedule() and ttwu() have matching control dependencies.
			 *
			 * After this, schedule() must not care about p->state any more.
			 */
			deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);

			if (prev->in_iowait) {
				atomic_inc(&rq->nr_iowait);
				delayacct_blkio_start();
			}
		}
		switch_count = &prev->nvcsw;
	}

	next = pick_next_task(rq, prev, &rf);
	clear_tsk_need_resched(prev);
	clear_preempt_need_resched();

	if (likely(prev != next)) {
		rq->nr_switches++;
		/*
		 * RCU users of rcu_dereference(rq->curr) may not see
		 * changes to task_struct made by pick_next_task().
		 */
		RCU_INIT_POINTER(rq->curr, next);
		/*
		 * The membarrier system call requires each architecture
		 * to have a full memory barrier after updating
		 * rq->curr, before returning to user-space.
		 *
		 * Here are the schemes providing that barrier on the
		 * various architectures:
		 * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
		 *   switch_mm() rely on membarrier_arch_switch_mm() on PowerPC.
		 * - finish_lock_switch() for weakly-ordered
		 *   architectures where spin_unlock is a full barrier,
		 * - switch_to() for arm64 (weakly-ordered, spin_unlock
		 *   is a RELEASE barrier),
		 */
		++*switch_count;

		psi_sched_switch(prev, next, !task_on_rq_queued(prev));

		trace_sched_switch(preempt, prev, next);

		/* Also unlocks the rq: */
		rq = context_switch(rq, prev, next, &rf);
	} else {
		rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
		rq_unlock_irq(rq, &rf);
	}

	balance_callback(rq);
}
```
`context_switch`: calls `__switch_to()`
```c
/*
 * context_switch - switch to the new MM and the new thread's register state.
 */
static __always_inline struct rq *
context_switch(struct rq *rq, struct task_struct *prev,
	       struct task_struct *next, struct rq_flags *rf)
{
	prepare_task_switch(rq, prev, next);

	/*
	 * For paravirt, this is coupled with an exit in switch_to to
	 * combine the page table reload and the switch backend into
	 * one hypercall.
	 */
	arch_start_context_switch(prev);

	/*
	 * kernel -> kernel   lazy + transfer active
	 *   user -> kernel   lazy + mmgrab() active
	 *
	 * kernel ->   user   switch + mmdrop() active
	 *   user ->   user   switch
	 */
	if (!next->mm) {				// to kernel
		enter_lazy_tlb(prev->active_mm, next);

		next->active_mm = prev->active_mm;
		if (prev->mm)				// from user
			mmgrab(prev->active_mm);
		else
			prev->active_mm = NULL;
	} else {					// to user
		membarrier_switch_mm(rq, prev->active_mm, next->mm);
		/*
		 * sys_membarrier() requires an smp_mb() between setting
		 * rq->curr / membarrier_switch_mm() and returning to userspace.
		 *
		 * The below provides this either through switch_mm(), or in
		 * case 'prev->active_mm == next->mm' through
		 * finish_task_switch()'s mmdrop().
		 */
		switch_mm_irqs_off(prev->active_mm, next->mm, next);

		if (!prev->mm) {			// from kernel
			/* will mmdrop() in finish_task_switch(). */
			rq->prev_mm = prev->active_mm;
			prev->active_mm = NULL;
		}
	}

	rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);

	prepare_lock_switch(rq, next, rf);

	/* Here we just switch the register state and the stack. */
	switch_to(prev, next, prev);	// -> __switch_to()
	barrier();

	return finish_task_switch(prev);
}
```
`__switch_to`: the process context switch (per-thread hardware state such as FP/SIMD, TLS and debug registers, followed by the actual register and stack switch)
```c
/*
 * Thread switching.
 */
__notrace_funcgraph struct task_struct *__switch_to(struct task_struct *prev,
				struct task_struct *next)
{
	struct task_struct *last;

	fpsimd_thread_switch(next);
	tls_thread_switch(next);
	hw_breakpoint_thread_switch(next);
	contextidr_thread_switch(next);
	entry_task_switch(next);
	uao_thread_switch(next);
	ssbs_thread_switch(next);
	erratum_1418040_thread_switch(next);

	/*
	 * Complete any pending TLB or cache maintenance on this CPU in case
	 * the thread migrates to a different CPU.
	 * This full barrier is also required by the membarrier system
	 * call.
	 */
	dsb(ish);

	/*
	 * MTE thread switching must happen after the DSB above to ensure that
	 * any asynchronous tag check faults have been logged in the TFSR*_EL1
	 * registers.
	 */
	mte_thread_switch(next);

	/* the actual thread switch */
	last = cpu_switch_to(prev, next);

	return last;
}
```
`cpu_switch_to`: the function that actually performs the process switch
```asm
/*
 * Register switch for AArch64. The callee-saved registers need to be saved
 * and restored. On entry:
 *   x0 = previous task_struct (must be preserved across the switch)
 *   x1 = next task_struct
 * Previous and next are guaranteed not to be the same.
 *
 */
SYM_FUNC_START(cpu_switch_to)
	mov	x10, #THREAD_CPU_CONTEXT
	add	x8, x0, x10
	mov	x9, sp				// at EL1, sp means sp_el1
	stp	x19, x20, [x8], #16		// store callee-saved registers
	stp	x21, x22, [x8], #16
	stp	x23, x24, [x8], #16
	stp	x25, x26, [x8], #16
	stp	x27, x28, [x8], #16
	stp	x29, x9, [x8], #16
	str	lr, [x8]
	add	x8, x1, x10
	ldp	x19, x20, [x8], #16		// restore callee-saved registers
	ldp	x21, x22, [x8], #16
	ldp	x23, x24, [x8], #16
	ldp	x25, x26, [x8], #16
	ldp	x27, x28, [x8], #16
	ldp	x29, x9, [x8], #16
	ldr	lr, [x8]
	mov	sp, x9
	msr	sp_el0, x1			// switch the current pointer to the task going on the CPU
	ptrauth_keys_install_kernel x1, x8, x9, x10
	scs_save x0, x8
	scs_load_current
	ret
SYM_FUNC_END(cpu_switch_to)
```
`kernel_exit`: the exception exit path
```asm
	.macro	kernel_exit, el
	.if	\el != 0
	disable_daif

	/* Restore the task's original addr_limit. */
	ldr	x20, [sp, #S_ORIG_ADDR_LIMIT]
	str	x20, [tsk, #TSK_TI_ADDR_LIMIT]

	/* No need to restore UAO, it will be restored from SPSR_EL1 */
	.endif

	/* Restore pmr */
alternative_if ARM64_HAS_IRQ_PRIO_MASKING
	ldr	x20, [sp, #S_PMR_SAVE]
	msr_s	SYS_ICC_PMR_EL1, x20
	mrs_s	x21, SYS_ICC_CTLR_EL1
	tbz	x21, #6, .L__skip_pmr_sync\@	// Check for ICC_CTLR_EL1.PMHE
	dsb	sy				// Ensure priority change is seen by redistributor
.L__skip_pmr_sync\@:
alternative_else_nop_endif

	ldp	x21, x22, [sp, #S_PC]		// load ELR, SPSR

#ifdef CONFIG_ARM64_SW_TTBR0_PAN
alternative_if_not ARM64_HAS_PAN
	bl	__swpan_exit_el\el
alternative_else_nop_endif
#endif

	.if	\el == 0
	ldr	x23, [sp, #S_SP]		// load return stack pointer
	msr	sp_el0, x23
	tst	x22, #PSR_MODE32_BIT		// native task?
	b.eq	3f

#ifdef CONFIG_ARM64_ERRATUM_845719
alternative_if ARM64_WORKAROUND_845719
#ifdef CONFIG_PID_IN_CONTEXTIDR
	mrs	x29, contextidr_el1
	msr	contextidr_el1, x29
#else
	msr	contextidr_el1, xzr
#endif
alternative_else_nop_endif
#endif
3:
	scs_save tsk, x0

	/* No kernel C function calls after this as user keys are set. */
	ptrauth_keys_install_user tsk, x0, x1, x2

	apply_ssbd 0, x0, x1
	.endif

	msr	elr_el1, x21			// set up the return data
	msr	spsr_el1, x22
	ldp	x0, x1, [sp, #16 * 0]
	ldp	x2, x3, [sp, #16 * 1]
	ldp	x4, x5, [sp, #16 * 2]
	ldp	x6, x7, [sp, #16 * 3]
	ldp	x8, x9, [sp, #16 * 4]
	ldp	x10, x11, [sp, #16 * 5]
	ldp	x12, x13, [sp, #16 * 6]
	ldp	x14, x15, [sp, #16 * 7]
	ldp	x16, x17, [sp, #16 * 8]
	ldp	x18, x19, [sp, #16 * 9]
	ldp	x20, x21, [sp, #16 * 10]
	ldp	x22, x23, [sp, #16 * 11]
	ldp	x24, x25, [sp, #16 * 12]
	ldp	x26, x27, [sp, #16 * 13]
	ldp	x28, x29, [sp, #16 * 14]

	.if	\el == 0
alternative_if_not ARM64_UNMAP_KERNEL_AT_EL0
	ldr	lr, [sp, #S_LR]
	add	sp, sp, #S_FRAME_SIZE		// restore sp
	eret
alternative_else_nop_endif
#ifdef CONFIG_UNMAP_KERNEL_AT_EL0
	bne	4f
	msr	far_el1, x29
	tramp_alias	x30, tramp_exit_native, x29
	br	x30
4:
	tramp_alias	x30, tramp_exit_compat, x29
	br	x30
#endif
	.else
	ldr	lr, [sp, #S_LR]
	add	sp, sp, #S_FRAME_SIZE		// restore sp

	/* Ensure any device/NC reads complete */
	alternative_insn nop, "dmb sy", ARM64_WORKAROUND_1508412

	eret
	.endif
	sb
	.endm
```
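Finally, the stack frame that `kernel_entry` builds on the kernel stack and that `kernel_exit` unwinds above (the `[sp, #16 * n]`, `S_PC`, `S_SP` and `S_LR` offsets) corresponds to arm64's `struct pt_regs`. The sketch below is a simplified view of its core fields under a hypothetical name; the real definition in arch/arm64/include/asm/ptrace.h carries additional bookkeeping fields.

```c
/*
 * Simplified view of the per-exception stack frame (arm64 struct pt_regs).
 * kernel_entry stores the interrupted context here; kernel_exit reloads
 * it and then erets back to wherever the task was interrupted.
 */
struct pt_regs_sketch {			/* hypothetical name for illustration      */
	unsigned long regs[31];		/* x0 ~ x30 at the time of the exception   */
	unsigned long sp;		/* user SP (S_SP), goes back into sp_el0   */
	unsigned long pc;		/* user PC (S_PC), goes back into elr_el1  */
	unsigned long pstate;		/* user PSTATE, goes back into spsr_el1    */
	/* ... orig_x0, syscallno and other bookkeeping fields omitted ... */
};
```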