Linux Process Scheduling Series 1 - How a User-Space Process Gets Scheduled Off the CPU
This series is about Linux process scheduling. It does not cover scheduling algorithms; it looks at how the actual Linux source code performs a process switch.
Two things should be clear up front. A process leaves the CPU in one of two ways:
- It actively asks the scheduler to run (it gives up the CPU voluntarily).
- Its time slice runs out, or it is preempted (it gives up the CPU involuntarily).
This article only covers in detail the case where a process running in user mode is scheduled out because its time slice has expired.
The remaining cases will be explored in the following articles!
1. A user-mode process is scheduled out when its time slice expires
1. A process (call it process A) is running in user mode when a timer interrupt arrives (this timer interrupt exists specifically to drive time sharing).
2. On taking the interrupt, the CPU saves the user-mode `PC` into `ELR_EL1`, saves `PSTATE` into `SPSR_EL1`, masks interrupts, and switches to the process's kernel stack (entering the kernel from user mode always starts from the very beginning of that kernel stack).
3. The interrupt entry code then saves the process's user-mode execution context onto the kernel stack, laid out as a stack frame.
4. Inside the timer interrupt handler:
   - The interrupt count in this CPU's `preempt count` is incremented (it records the interrupt nesting depth, and a value greater than zero means we are in interrupt context).
   - The time slice of the task that `current` points to is decremented (other bookkeeping details are not covered here), and once it is used up the `need_resched` flag is set to indicate that a reschedule is needed.
5. On the way out of the handler, the interrupt count in the `preempt count` is decremented and execution enters `ret_to_user`.
6. `ret_to_user` checks whether the interrupt count and the preemption count in the `preempt count` are both zero. Both being zero means we are back in process context and are allowed to schedule; otherwise scheduling is not allowed and the code goes to `finish_ret_to_user`, which restores the process's execution context from the stack frame on its kernel stack, points the kernel `SP` back at the start of the kernel stack, and executes `eret` to return to user mode and keep running.
7. From here on we follow the case where scheduling is allowed (both the interrupt count and the preemption count are zero): `schedule()` is entered and the scheduler takes over. (A minimal C sketch of this tick / `need_resched` / return-to-user flow follows right after this list.)
8. Inside the scheduler, the scheduling algorithm decides whether a switch is needed and, if so, picks the task that should go on the CPU next (assume here that a switch is needed).
9. With the `task_struct` of the chosen task (call it process B) in hand, `cpu_switch_to` performs the context switch. It saves process A's `x19`~`x29`, `sp` (the kernel stack pointer) and `lr` into A's `task_struct`, and restores B's `x19`~`x29`, `sp` and `lr` from B's `task_struct` into the registers. This switches the kernel stack, and execution now continues **at the `cpu_switch_to` call point where process B was switched off its kernel stack earlier**. (Every task is scheduled out at this same point in `cpu_switch_to`, and every task that is scheduled back in resumes at this same point; because `sp` and `lr` are restored, the execution environment at the original scheduling point is fully re-established.) It follows that the only state distinguishing one task's scheduling point from another's is `x19`~`x29`, `sp` (kernel stack) and `lr`, and since these registers can be saved and restored, scheduling works cleanly. Note that at this point we have not returned to the process's user mode yet; it simply keeps running in kernel mode.
10. Process A is now off the CPU and process B runs (after time 2 in the figure).
11. At some later moment (time 3 in the figure) the scheduler runs again, goes through `cpu_switch_to` once more, and does exactly what step 9 described (times 3 and 4 in the figure). Process A is now back on the CPU, still in kernel mode.
12. Process A then keeps executing (after time 4 in the figure). The code that follows produces exactly the same result as if the process had never been scheduled out, because its `SP` has not changed and its `lr` (return address) has not changed, i.e. its function call stack is intact. So it naturally continues with the final part of the earlier timer-interrupt handling, restoring the user-mode context from the stack frame on its kernel stack back into the registers.
13. Finally, the `eret` instruction is executed and the process returns to user mode and continues there.
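To make the tick-driven path above concrete, here is a minimal, self-contained C sketch of the same logic. It is a user-space simulation, not kernel code: every name in it (`time_slice`, `irq_count`, and so on) is made up for illustration, and the real kernel implements the equivalent steps in `scheduler_tick()`, the thread-info flags and the `ret_to_user` assembly listed in section 4.

```c
#include <stdbool.h>
#include <stdio.h>

static int  time_slice    = 3;      /* ticks left for the current task  */
static bool need_resched  = false;  /* "please reschedule" mark         */
static int  irq_count     = 0;      /* > 0 means interrupt context      */
static int  preempt_count = 0;      /* > 0 means preemption is disabled */

/* Timer interrupt handler: only bookkeeping, never a direct switch. */
static void timer_tick(void)
{
	irq_count++;                    /* enter interrupt context */
	if (--time_slice <= 0)
		need_resched = true;    /* mark only; the switch happens later */
	irq_count--;                    /* leave interrupt context */
}

/* Interrupt-return path: the only place where the mark is acted upon. */
static void ret_to_user(void)
{
	if (need_resched && irq_count == 0 && preempt_count == 0) {
		need_resched = false;
		time_slice   = 3;
		printf("schedule(): pick the next task and call cpu_switch_to\n");
	}
	printf("restore the user frame from the kernel stack, then eret\n");
}

int main(void)
{
	for (int i = 0; i < 4; i++) {   /* four timer interrupts */
		timer_tick();
		ret_to_user();
	}
	return 0;
}
```

The point the sketch mirrors is that the timer interrupt only marks the current task as needing a reschedule; the switch itself happens later, on the interrupt-return path, and only once the counts show we are back in process context.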
The scheduling flow is illustrated in the diagram below:
2. Why doesn't a switch like this change the process's state?
- The process's kernel-stack `SP` does not change, its `lr` does not change, and its `x19`~`x29` do not change ------> the process's call chain is preserved.
- In particular, every switch happens at the same point inside `cpu_switch_to` ------> the kernel-mode `PC` does not change either: the PC when the task is scheduled off the CPU and the PC when it comes back on are the same.
- From these two points it follows that the process's execution environment (its state) has not changed.
- Of course one might ask: `x0`~`x18` are not saved, so doesn't that mean the execution environment has changed?
  - At the physical level, yes, the values in `x0`~`x18` may indeed be different afterwards.
  - But from the point of view of the process's execution state, `cpu_switch_to` itself does not use `x0`~`x18`, so even if they change it makes no difference at all.
- What if the function that calls `cpu_switch_to` does use `x0`~`x18`?
  - This is where the arm64 procedure call standard comes in: the caller must save `x0`~`x18` on the stack if it needs them across a call, while the callee must save `x19`~`x29` (if it is going to use them).
  - So the function that calls `cpu_switch_to` has already saved any `x0`~`x18` values it cares about on its stack; the stack contents are not touched by the switch, and `SP` itself is saved and restored, so the execution environment is still unchanged. (The per-task save area this relies on is sketched right below.)
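For reference, the per-task save area that `cpu_switch_to` writes and reads is `task_struct->thread.cpu_context` (that is what the `THREAD_CPU_CONTEXT` offset in the assembly in section 4 points at). The layout below follows arch/arm64/include/asm/processor.h, shown slightly simplified, so treat it as a reference sketch rather than an exact copy; what matters is that it holds only the callee-saved registers plus `sp` and a resume address, exactly the set discussed above.

```c
/*
 * Per-task register context saved and restored by cpu_switch_to (arm64).
 * Only the callee-saved registers, the kernel stack pointer and the
 * resume address live here; x0~x18 never need to be stored because the
 * caller of cpu_switch_to has already spilled anything it still needs.
 */
struct cpu_context {
	unsigned long x19;
	unsigned long x20;
	unsigned long x21;
	unsigned long x22;
	unsigned long x23;
	unsigned long x24;
	unsigned long x25;
	unsigned long x26;
	unsigned long x27;
	unsigned long x28;
	unsigned long fp;	/* x29, frame pointer                           */
	unsigned long sp;	/* kernel stack pointer of the task             */
	unsigned long pc;	/* resume address: the lr saved by cpu_switch_to */
};
```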
3. Closing
This article examined the case where a process is scheduled out of user mode because its time slice expired; the other cases will be explored in depth in follow-up articles!
4. Appendix: key scheduling-related kernel source
The most important piece is `cpu_switch_to`; that is where the actual process switch takes place.
`el0_irq`: handling of an interrupt taken while in user mode
```asm
SYM_CODE_START_LOCAL_NOALIGN(el0_irq)
	kernel_entry 0
el0_irq_naked:
	el0_interrupt_handler handle_arch_irq
	b	ret_to_user
SYM_CODE_END(el0_irq)
```
`ret_to_user`: the return path to user mode
- This is where the interrupt count and the preemption count are checked.
- If the interrupt count is zero we have already left interrupt context and entered process context; if it is non-zero we are still in interrupt context and process scheduling is not allowed.
```asm
/*
#define _TIF_WORK_MASK	(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
			 _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \
			 _TIF_UPROBE | _TIF_FSCHECK | _TIF_MTE_ASYNC_FAULT | \
			 _TIF_NOTIFY_SIGNAL)
*/
SYM_CODE_START_LOCAL(ret_to_user)
	disable_daif
	gic_prio_kentry_setup tmp=x3
#ifdef CONFIG_TRACE_IRQFLAGS
	bl	trace_hardirqs_off
#endif
	ldr	x19, [tsk, #TSK_TI_FLAGS]
	and	x2, x19, #_TIF_WORK_MASK	// test the _TIF_WORK_MASK bits of TSK_TI_FLAGS
	cbnz	x2, work_pending		// if any bit is set, branch to work_pending
finish_ret_to_user:
	user_enter_irqoff
	/* Ignore asynchronous tag check faults in the uaccess routines */
	clear_mte_async_tcf
	enable_step_tsk x19, x2
#ifdef CONFIG_GCC_PLUGIN_STACKLEAK
	bl	stackleak_erase
#endif
	kernel_exit 0

/*
 * Ok, we need to do extra processing, enter the slow path.
 */
work_pending:
	mov	x0, sp				// 'regs'
	mov	x1, x19
	bl	do_notify_resume
	ldr	x19, [tsk, #TSK_TI_FLAGS]	// re-check for single-step
	b	finish_ret_to_user
SYM_CODE_END(ret_to_user)
```
`do_notify_resume`: the function that calls `schedule()`
```c
asmlinkage void do_notify_resume(struct pt_regs *regs,
				 unsigned long thread_flags)
{
	do {
		/* Check valid user FS if needed */
		addr_limit_user_check();

		if (thread_flags & _TIF_NEED_RESCHED) {
			/* Unmask Debug and SError for the next task */
			local_daif_restore(DAIF_PROCCTX_NOIRQ);

			schedule();	// -> __schedule()
		} else {
			local_daif_restore(DAIF_PROCCTX);

			if (thread_flags & _TIF_UPROBE)
				uprobe_notify_resume(regs);

			if (thread_flags & _TIF_MTE_ASYNC_FAULT) {
				clear_thread_flag(TIF_MTE_ASYNC_FAULT);
				send_sig_fault(SIGSEGV, SEGV_MTEAERR,
					       (void __user *)NULL, current);
			}

			if (thread_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
				do_signal(regs);

			if (thread_flags & _TIF_NOTIFY_RESUME) {
				tracehook_notify_resume(regs);
				rseq_handle_notify_resume(NULL, regs);
			}

			if (thread_flags & _TIF_FOREIGN_FPSTATE)
				fpsimd_restore_current_state();
		}

		local_daif_mask();
		thread_flags = READ_ONCE(current_thread_info()->flags);
	} while (thread_flags & _TIF_WORK_MASK);
}
```
`__schedule`: the main scheduler function
```c
/*
 * __schedule() is the main scheduler function.
 *
 * The main means of driving the scheduler and thus entering this function are:
 *
 * 1. Explicit blocking: mutex, semaphore, waitqueue, etc.
 *
 * 2. TIF_NEED_RESCHED flag is checked on interrupt and userspace return
 *    paths. For example, see arch/x86/entry_64.S.
 *
 *    To drive preemption between tasks, the scheduler sets the flag in timer
 *    interrupt handler scheduler_tick().
 *
 * 3. Wakeups don't really cause entry into schedule(). They add a
 *    task to the run-queue and that's it.
 *
 *    Now, if the new task added to the run-queue preempts the current
 *    task, then the wakeup sets TIF_NEED_RESCHED and schedule() gets
 *    called on the nearest possible occasion:
 *
 *    - If the kernel is preemptible (CONFIG_PREEMPTION=y):
 *
 *      - in syscall or exception context, at the next outmost
 *        preempt_enable(). (this might be as soon as the wake_up()'s
 *        spin_unlock()!)
 *
 *      - in IRQ context, return from interrupt-handler to
 *        preemptible context
 *
 *    - If the kernel is not preemptible (CONFIG_PREEMPTION is not set)
 *      then at the next:
 *
 *       - cond_resched() call
 *       - explicit schedule() call
 *       - return from syscall or exception to user-space
 *       - return from interrupt-handler to user-space
 *
 * WARNING: must be called with preemption disabled!
 */
static void __sched notrace __schedule(bool preempt)
{
	struct task_struct *prev, *next;
	unsigned long *switch_count;
	unsigned long prev_state;
	struct rq_flags rf;
	struct rq *rq;
	int cpu;

	cpu = smp_processor_id();
	rq = cpu_rq(cpu);
	prev = rq->curr;

	schedule_debug(prev, preempt);

	if (sched_feat(HRTICK))
		hrtick_clear(rq);

	local_irq_disable();
	rcu_note_context_switch(preempt);

	/*
	 * Make sure that signal_pending_state()->signal_pending() below
	 * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
	 * done by the caller to avoid the race with signal_wake_up():
	 *
	 * __set_current_state(@state)		signal_wake_up()
	 * schedule()				  set_tsk_thread_flag(p, TIF_SIGPENDING)
	 *					  wake_up_state(p, state)
	 *   LOCK rq->lock			    LOCK p->pi_state
	 *   smp_mb__after_spinlock()		    smp_mb__after_spinlock()
	 *     if (signal_pending_state())	    if (p->state & @state)
	 *
	 * Also, the membarrier system call requires a full memory barrier
	 * after coming from user-space, before storing to rq->curr.
	 */
	rq_lock(rq, &rf);
	smp_mb__after_spinlock();

	/* Promote REQ to ACT */
	rq->clock_update_flags <<= 1;
	update_rq_clock(rq);

	switch_count = &prev->nivcsw;

	/*
	 * We must load prev->state once (task_struct::state is volatile), such
	 * that:
	 *
	 *  - we form a control dependency vs deactivate_task() below.
	 *  - ptrace_{,un}freeze_traced() can change ->state underneath us.
	 */
	prev_state = prev->state;
	if (!preempt && prev_state) {
		if (signal_pending_state(prev_state, prev)) {
			prev->state = TASK_RUNNING;
		} else {
			prev->sched_contributes_to_load =
				(prev_state & TASK_UNINTERRUPTIBLE) &&
				!(prev_state & TASK_NOLOAD) &&
				!(prev->flags & PF_FROZEN);

			if (prev->sched_contributes_to_load)
				rq->nr_uninterruptible++;

			/*
			 * __schedule()			ttwu()
			 *   prev_state = prev->state;    if (p->on_rq && ...)
			 *   if (prev_state)		    goto out;
			 *     p->on_rq = 0;		  smp_acquire__after_ctrl_dep();
			 *				  p->state = TASK_WAKING
			 *
			 * Where __schedule() and ttwu() have matching control dependencies.
			 *
			 * After this, schedule() must not care about p->state any more.
			 */
			deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);

			if (prev->in_iowait) {
				atomic_inc(&rq->nr_iowait);
				delayacct_blkio_start();
			}
		}
		switch_count = &prev->nvcsw;
	}

	next = pick_next_task(rq, prev, &rf);
	clear_tsk_need_resched(prev);
	clear_preempt_need_resched();

	if (likely(prev != next)) {
		rq->nr_switches++;
		/*
		 * RCU users of rcu_dereference(rq->curr) may not see
		 * changes to task_struct made by pick_next_task().
		 */
		RCU_INIT_POINTER(rq->curr, next);
		/*
		 * The membarrier system call requires each architecture
		 * to have a full memory barrier after updating
		 * rq->curr, before returning to user-space.
		 *
		 * Here are the schemes providing that barrier on the
		 * various architectures:
		 * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
		 *   switch_mm() rely on membarrier_arch_switch_mm() on PowerPC.
		 * - finish_lock_switch() for weakly-ordered
		 *   architectures where spin_unlock is a full barrier,
		 * - switch_to() for arm64 (weakly-ordered, spin_unlock
		 *   is a RELEASE barrier),
		 */
		++*switch_count;

		psi_sched_switch(prev, next, !task_on_rq_queued(prev));

		trace_sched_switch(preempt, prev, next);

		/* Also unlocks the rq: */
		rq = context_switch(rq, prev, next, &rf);
	} else {
		rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
		rq_unlock_irq(rq, &rf);
	}

	balance_callback(rq);
}
```
`context_switch`: calls `__switch_to()`
```c
/*
 * context_switch - switch to the new MM and the new thread's register state.
 */
static __always_inline struct rq *
context_switch(struct rq *rq, struct task_struct *prev,
	       struct task_struct *next, struct rq_flags *rf)
{
	prepare_task_switch(rq, prev, next);

	/*
	 * For paravirt, this is coupled with an exit in switch_to to
	 * combine the page table reload and the switch backend into
	 * one hypercall.
	 */
	arch_start_context_switch(prev);

	/*
	 * kernel -> kernel   lazy + transfer active
	 *   user -> kernel   lazy + mmgrab() active
	 *
	 * kernel ->   user   switch + mmdrop() active
	 *   user ->   user   switch
	 */
	if (!next->mm) {				// to kernel
		enter_lazy_tlb(prev->active_mm, next);

		next->active_mm = prev->active_mm;
		if (prev->mm)				// from user
			mmgrab(prev->active_mm);
		else
			prev->active_mm = NULL;
	} else {					// to user
		membarrier_switch_mm(rq, prev->active_mm, next->mm);
		/*
		 * sys_membarrier() requires an smp_mb() between setting
		 * rq->curr / membarrier_switch_mm() and returning to userspace.
		 *
		 * The below provides this either through switch_mm(), or in
		 * case 'prev->active_mm == next->mm' through
		 * finish_task_switch()'s mmdrop().
		 */
		switch_mm_irqs_off(prev->active_mm, next->mm, next);

		if (!prev->mm) {			// from kernel
			/* will mmdrop() in finish_task_switch(). */
			rq->prev_mm = prev->active_mm;
			prev->active_mm = NULL;
		}
	}

	rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);

	prepare_lock_switch(rq, next, rf);

	/* Here we just switch the register state and the stack. */
	switch_to(prev, next, prev);	// -> __switch_to()
	barrier();

	return finish_task_switch(prev);
}
```
`__switch_to`: the process context switch (per-thread hardware state such as FP/SIMD, TLS and debug registers, followed by the actual register and stack switch)
```c
/*
 * Thread switching.
 */
__notrace_funcgraph struct task_struct *__switch_to(struct task_struct *prev,
				struct task_struct *next)
{
	struct task_struct *last;

	fpsimd_thread_switch(next);
	tls_thread_switch(next);
	hw_breakpoint_thread_switch(next);
	contextidr_thread_switch(next);
	entry_task_switch(next);
	uao_thread_switch(next);
	ssbs_thread_switch(next);
	erratum_1418040_thread_switch(next);

	/*
	 * Complete any pending TLB or cache maintenance on this CPU in case
	 * the thread migrates to a different CPU.
	 * This full barrier is also required by the membarrier system
	 * call.
	 */
	dsb(ish);

	/*
	 * MTE thread switching must happen after the DSB above to ensure that
	 * any asynchronous tag check faults have been logged in the TFSR*_EL1
	 * registers.
	 */
	mte_thread_switch(next);

	/* the actual thread switch */
	last = cpu_switch_to(prev, next);

	return last;
}
```
`cpu_switch_to`: the function that actually performs the process switch
```asm
/*
 * Register switch for AArch64. The callee-saved registers need to be saved
 * and restored. On entry:
 *   x0 = previous task_struct (must be preserved across the switch)
 *   x1 = next task_struct
 * Previous and next are guaranteed not to be the same.
 *
 */
SYM_FUNC_START(cpu_switch_to)
	mov	x10, #THREAD_CPU_CONTEXT
	add	x8, x0, x10
	mov	x9, sp				// at EL1, sp means sp_el1
	stp	x19, x20, [x8], #16		// store callee-saved registers
	stp	x21, x22, [x8], #16
	stp	x23, x24, [x8], #16
	stp	x25, x26, [x8], #16
	stp	x27, x28, [x8], #16
	stp	x29, x9, [x8], #16
	str	lr, [x8]
	add	x8, x1, x10
	ldp	x19, x20, [x8], #16		// restore callee-saved registers
	ldp	x21, x22, [x8], #16
	ldp	x23, x24, [x8], #16
	ldp	x25, x26, [x8], #16
	ldp	x27, x28, [x8], #16
	ldp	x29, x9, [x8], #16
	ldr	lr, [x8]
	mov	sp, x9
	msr	sp_el0, x1			// switch the current pointer to the task going on the CPU
	ptrauth_keys_install_kernel x1, x8, x9, x10
	scs_save x0, x8
	scs_load_current
	ret
SYM_FUNC_END(cpu_switch_to)
```
`kernel_exit`: the exception exit path
```asm
	.macro	kernel_exit, el
	.if	\el != 0
	disable_daif

	/* Restore the task's original addr_limit. */
	ldr	x20, [sp, #S_ORIG_ADDR_LIMIT]
	str	x20, [tsk, #TSK_TI_ADDR_LIMIT]

	/* No need to restore UAO, it will be restored from SPSR_EL1 */
	.endif

	/* Restore pmr */
alternative_if ARM64_HAS_IRQ_PRIO_MASKING
	ldr	x20, [sp, #S_PMR_SAVE]
	msr_s	SYS_ICC_PMR_EL1, x20
	mrs_s	x21, SYS_ICC_CTLR_EL1
	tbz	x21, #6, .L__skip_pmr_sync\@	// Check for ICC_CTLR_EL1.PMHE
	dsb	sy				// Ensure priority change is seen by redistributor
.L__skip_pmr_sync\@:
alternative_else_nop_endif

	ldp	x21, x22, [sp, #S_PC]		// load ELR, SPSR

#ifdef CONFIG_ARM64_SW_TTBR0_PAN
alternative_if_not ARM64_HAS_PAN
	bl	__swpan_exit_el\el
alternative_else_nop_endif
#endif

	.if	\el == 0
	ldr	x23, [sp, #S_SP]		// load return stack pointer
	msr	sp_el0, x23
	tst	x22, #PSR_MODE32_BIT		// native task?
	b.eq	3f

#ifdef CONFIG_ARM64_ERRATUM_845719
alternative_if ARM64_WORKAROUND_845719
#ifdef CONFIG_PID_IN_CONTEXTIDR
	mrs	x29, contextidr_el1
	msr	contextidr_el1, x29
#else
	msr	contextidr_el1, xzr
#endif
alternative_else_nop_endif
#endif
3:
	scs_save tsk, x0

	/* No kernel C function calls after this as user keys are set. */
	ptrauth_keys_install_user tsk, x0, x1, x2

	apply_ssbd 0, x0, x1
	.endif

	msr	elr_el1, x21			// set up the return data
	msr	spsr_el1, x22
	ldp	x0, x1, [sp, #16 * 0]
	ldp	x2, x3, [sp, #16 * 1]
	ldp	x4, x5, [sp, #16 * 2]
	ldp	x6, x7, [sp, #16 * 3]
	ldp	x8, x9, [sp, #16 * 4]
	ldp	x10, x11, [sp, #16 * 5]
	ldp	x12, x13, [sp, #16 * 6]
	ldp	x14, x15, [sp, #16 * 7]
	ldp	x16, x17, [sp, #16 * 8]
	ldp	x18, x19, [sp, #16 * 9]
	ldp	x20, x21, [sp, #16 * 10]
	ldp	x22, x23, [sp, #16 * 11]
	ldp	x24, x25, [sp, #16 * 12]
	ldp	x26, x27, [sp, #16 * 13]
	ldp	x28, x29, [sp, #16 * 14]

	.if	\el == 0
alternative_if_not ARM64_UNMAP_KERNEL_AT_EL0
	ldr	lr, [sp, #S_LR]
	add	sp, sp, #S_FRAME_SIZE		// restore sp
	eret
alternative_else_nop_endif
#ifdef CONFIG_UNMAP_KERNEL_AT_EL0
	bne	4f
	msr	far_el1, x29
	tramp_alias	x30, tramp_exit_native, x29
	br	x30
4:
	tramp_alias	x30, tramp_exit_compat, x29
	br	x30
#endif
	.else
	ldr	lr, [sp, #S_LR]
	add	sp, sp, #S_FRAME_SIZE		// restore sp

	/* Ensure any device/NC reads complete */
	alternative_insn nop, "dmb sy", ARM64_WORKAROUND_1508412

	eret
	.endif
	sb
	.endm
```
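Finally, the stack frame that `kernel_entry` builds on the kernel stack and that `kernel_exit` unwinds above (the `[sp, #16 * n]`, `S_PC`, `S_SP` and `S_LR` offsets) corresponds to arm64's `struct pt_regs`. The sketch below is a simplified view of its core fields under a hypothetical name; the real definition in arch/arm64/include/asm/ptrace.h carries additional bookkeeping fields.

```c
/*
 * Simplified view of the per-exception stack frame (arm64 struct pt_regs).
 * kernel_entry stores the interrupted context here; kernel_exit reloads
 * it and then erets back to wherever the task was interrupted.
 */
struct pt_regs_sketch {			/* hypothetical name for illustration      */
	unsigned long regs[31];		/* x0 ~ x30 at the time of the exception   */
	unsigned long sp;		/* user SP (S_SP), goes back into sp_el0   */
	unsigned long pc;		/* user PC (S_PC), goes back into elr_el1  */
	unsigned long pstate;		/* user PSTATE, goes back into spsr_el1    */
	/* ... orig_x0, syscallno and other bookkeeping fields omitted ... */
};
```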