Decompressed vmlinux: Linux Kernel Initialization
from Page Table Configuration Perspective
Adrian Huang | June, 2021
* Based on kernel 5.11 (x86_64) – QEMU
* SMP (4 CPUs) and 8GB memory
* Kernel parameter: nokaslr
* Legacy BIOS
Agenda
• Recap – CPU booting flow and page table before entering decompressed vmlinux
• 64-bit Virtual Address
• Decompressed vmlinux: Important functions
• Entry point: startup_64()
• x86_64_start_kernel() -> start_kernel() -> setup_arch()
• Apart from focusing on page table configuration, the following are covered as well:
• Fixed-mapped addresses
• Early ioremap: based on fixed-mapped addresses
• Physical memory models
• Especially for sparse memory
• vsyscall - virtual system call (Built on top of fixed-mapped addresses)
• percpu variable
• PTI (Page Table Isolation)
• kernel thread fork & context switch: struct pt_regs and struct inactive_task_frame in the kernel stack
• How to boot secondary CPUs? Where is the entry address?
Recap – CPU booting flow before entering decompressed vmlinux
setup.bin
(arch/x86/boot/setup.bin)
Compressed vmlinux
(Protected-mode kernel)
Note
ELF: arch/x86/boot/compressed/vmlinux
Binary: arch/x86/boot/vmlinux.bin
CRC
bzImage
Long Mode:
Recap - Compressed vmlinux: Page table before entering decompressed
vmlinux
[Diagram] 4-level page table walk with 2MB pages:
• Linear address fields: sign-extend (bits 63:48), Page Map Level-4 offset (47:39), Page Directory Pointer offset (38:30), Page Directory offset (29:21), physical page offset (20:0)
• CR3 → Page Map Level-4 Table (PML4E #0) → Page Directory Pointer Table (PDPTE #0–#3) → Page Directory Tables (PDE #0–#2047) → 2MB physical pages
[Paging] Identity mapping for 0-4GB memory space
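As a quick illustration of the address decomposition above, a small user-space sketch (not kernel code) that extracts the four index fields from a linear address under this 2MB-page layout:

```c
#include <stdio.h>

/* Extract the four index fields of a linear address for the 4-level,
 * 2MB-page layout shown above. */
static void decode(unsigned long long vaddr)
{
	unsigned long long pml4 = (vaddr >> 39) & 0x1ff;   /* bits 47:39 */
	unsigned long long pdpt = (vaddr >> 30) & 0x1ff;   /* bits 38:30 */
	unsigned long long pd   = (vaddr >> 21) & 0x1ff;   /* bits 29:21 */
	unsigned long long off  = vaddr & 0x1fffff;        /* bits 20:0  */

	printf("PML4E #%llu, PDPTE #%llu, PDE #%llu, offset 0x%llx\n",
	       pml4, pdpt, pd, off);
}

int main(void)
{
	decode(0x1000000ULL);   /* 16MB -> PML4E #0, PDPTE #0, PDE #8, offset 0 */
	return 0;
}
```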
64-bit Virtual Address
• User space: 0 – 0x0000_7FFF_FFFF_FFFF (128TB)
• Empty space (non-canonical hole)
• Kernel space: 0xFFFF_8000_0000_0000 – 0xFFFF_FFFF_FFFF_FFFF
• 0xFFFF_8000_0000_0000: guard hole (8TB)
• LDT remap for PTI (0.5TB)
• page_offset_base: page frame direct mapping (64TB) – maps physical memory 0–64TB, i.e. ZONE_DMA (0–16MB), ZONE_DMA32 and ZONE_NORMAL
• Unused hole (0.5TB)
• vmalloc_base: vmalloc/ioremap (32TB)
• Unused hole (1TB)
• vmemmap_base: virtual memory map (1TB) – array of page frame descriptors (one *page per page frame)
• __START_KERNEL_map = 0xFFFF_FFFF_8000_0000: kernel text mapping from physical address 0 (1GB or 512MB); __START_KERNEL = 0xFFFF_FFFF_8100_0000 – kernel code [.text, .data…]
• MODULES_VADDR: modules (1GB or 1.5GB)
• FIXADDR_START: fix-mapped address space, expanded to 4MB by commit 05ab1d8a4b36; FIXADDR_TOP = 0xFFFF_FFFF_FF7F_F000
• 0xFFFF_FFFF_FFE0_0000: unused hole (2MB)
Default configuration (can be dynamically configured by KASLR – Kernel Address Space Layout Randomization, "arch/x86/mm/kaslr.c"):
• page_offset_base = 0xFFFF_8880_0000_0000
• vmalloc_base = 0xFFFF_C900_0000_0000
• vmemmap_base = 0xFFFF_EA00_0000_0000
Reference: Documentation/x86/x86_64/mm.rst
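A minimal user-space sketch (not kernel code) of the arithmetic behind the 64TB direct mapping, assuming the default (non-KASLR) page_offset_base above; the kernel's __va() does the same addition, and __pa() additionally handles the kernel-text mapping region:

```c
#include <stdio.h>

#define PAGE_OFFSET_BASE 0xFFFF888000000000ULL   /* default page_offset_base */

/* Direct mapping: kernel virtual address = physical address + page_offset_base */
static unsigned long long phys_to_virt_direct(unsigned long long phys)
{
	return phys + PAGE_OFFSET_BASE;
}

static unsigned long long virt_to_phys_direct(unsigned long long virt)
{
	return virt - PAGE_OFFSET_BASE;
}

int main(void)
{
	/* Physical 16MB lands at 0xFFFF_8880_0100_0000 in the direct mapping. */
	printf("0x%llx\n", phys_to_virt_direct(0x1000000ULL));
	printf("0x%llx\n", virt_to_phys_direct(0xFFFF888001000000ULL));
	return 0;
}
```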
Decompressed vmlinux – entry point: startup_64
1. The entry point is still at physical address 0x1000000 (16MB) – not at a kernel virtual address
2. Kernel virtual addresses are used only after the corresponding page tables have been set up
Decompressed vmlinux – entry point: startup_64
Decompressed vmlinux – entry point: startup_64
Decompressed vmlinux – entry point: startup_64
Switch to the kernel virtual address by issuing a ‘jmp’ instruction
Decompressed vmlinux – entry point: startup_64
1. Temporarily use the original per-cpu copy ‘init_per_cpu__gdt_page’
2. Switch to the CPU’s own per-cpu ‘gdt_page’ when switch_to_new_gdt() is called
Decompressed vmlinux – entry point: startup_64
1. Temporarily use the original per-cpu copy ‘init_per_cpu__gdt_page’
2. Switch to the CPU’s own per-cpu ‘gdt_page’ when switch_to_new_gdt() is called
When to switch to CPU’s own gdt_page (percpu)?
Decompressed vmlinux – entry point: startup_64
Decompressed vmlinux – x86_64_start_kernel()
Page Table Configuration in startup_64 vs. Page Table Configuration in x86_64_start_kernel
init_top_pgt
Decompressed vmlinux – x86_64_start_kernel()
Decompressed vmlinux – x86_64_start_kernel()
Decompressed vmlinux – early_idt_handler_common
[Kernel stack layout] struct pt_regs: r15–r12, bp, bx, r11–r8, ax, cx, dx, si, di, orig_ax, followed by the return frame for iretq: ip, cs, flags, sp, ss
orig_ax: syscall number, error code of a CPU exception, or IRQ number of a HW interrupt
Callee-saved registers: check the x86_64 ABI
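For reference, the x86_64 struct pt_regs declaration (arch/x86/include/asm/ptrace.h, kernel 5.11), in the order the fields sit on the kernel stack from low to high address; comments abridged:

```c
struct pt_regs {
	unsigned long r15;
	unsigned long r14;
	unsigned long r13;
	unsigned long r12;
	unsigned long bp;
	unsigned long bx;
	/* These regs are callee-clobbered. Always saved on kernel entry. */
	unsigned long r11;
	unsigned long r10;
	unsigned long r9;
	unsigned long r8;
	unsigned long ax;
	unsigned long cx;
	unsigned long dx;
	unsigned long si;
	unsigned long di;
	/* orig_ax: syscall number, exception error code or IRQ number */
	unsigned long orig_ax;
	/* Return frame for iretq */
	unsigned long ip;
	unsigned long cs;
	unsigned long flags;
	unsigned long sp;
	unsigned long ss;
};
```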
early_make_pgtable Memory Map
early_make_pgtable
vmlinux – early_make_pgtable
vmlinux – x86_64_start_kernel()
vmlinux – start_kernel()
setup_arch() – Part 1
memblock: boot time memory management
Memblock
• Memory allocator used during the boot-time stage (a usage sketch follows below)
• Set up in setup_arch()
• Torn down in mem_init(): free pages are released to the buddy allocator
[memblock] Reserve page 0
• Security: mitigates the L1TF (L1 Terminal Fault) vulnerability
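A minimal kernel-code sketch of the memblock interface; the function below is illustrative only (setup_arch() issues similar calls for real ranges):

```c
#include <linux/memblock.h>

/* Illustrative boot-time reservation/allocation, not an actual call site. */
static void __init memblock_usage_sketch(void)
{
	void *buf;

	/* Reserve page 0 so the allocator never hands it out (L1TF mitigation). */
	memblock_reserve(0, PAGE_SIZE);

	/* Zeroed boot-time allocation; anything still free in memblock is
	 * handed to the buddy allocator when mem_init() tears it down. */
	buf = memblock_alloc(PAGE_SIZE, PAGE_SIZE);
	if (!buf)
		return;
}
```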
Fixed-mapped Addresses: Compile-time virtual memory allocation
Enumeration: fixed_addresses (higher indices map to lower virtual addresses below FIXADDR_TOP)
• vsyscall #0 … vsyscall #511 – vsyscalls (2MB space); VSYSCALL_ADDR = 0xFFFF_FFFF_FF60_0000, FIXADDR_TOP = 0xFFFF_FFFF_FF7F_F000
• FIX_DBGP_BASE, FIX_EARLYCON_MEM_BASE, … __end_of_permanent_fixed_addresses – permanent fixed addresses; FIXADDR_START = 0xFFFF_FFFF_FF57_C000
• FIX_BTMAP_END = 1024 … FIX_BTMAP_BEGIN = 1535 – 512 temporary boot-time mappings used by early_ioremap(); 0xFFFF_FFFF_FF3F_F000 down to 0xFFFF_FFFF_FF20_0000
• __end_of_fixed_addresses = 1536
Breakdown of the top of the address space:
• 0xFFFF_FFFF_FFE0_0000 – 0xFFFF_FFFF_FFFF_FFFF: unused hole (2MB)
• Fix-mapped address space below it: 4MB in total (expanded to 4MB by commit 05ab1d8a4b36), 2MB of which is borrowed from the ‘Modules’ space (MODULES_VADDR); bounded by FIXADDR_START and FIXADDR_TOP
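A minimal sketch of the fixmap index-to-address arithmetic; the kernel's __fix_to_virt() macro (arch/x86/include/asm/fixmap.h) performs the same computation:

```c
#include <stdio.h>

#define FIXADDR_TOP 0xFFFFFFFFFF7FF000ULL   /* from the layout above */
#define PAGE_SHIFT  12

/* Each fixmap slot is one 4KB page below FIXADDR_TOP; higher enum
 * indices map to lower virtual addresses. */
static unsigned long long fix_to_virt(unsigned int idx)
{
	return FIXADDR_TOP - ((unsigned long long)idx << PAGE_SHIFT);
}

int main(void)
{
	printf("FIX_BTMAP_END  (1024): 0x%llx\n", fix_to_virt(1024));
	printf("FIX_BTMAP_BEGIN(1535): 0x%llx\n", fix_to_virt(1535));
	return 0;   /* prints 0xffffffffff3ff000 and 0xffffffffff200000 */
}
```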
Fixed-mapped Addresses: Compile-time virtual memory allocation
(Same fixed-mapped address layout as the previous slide, shown without the ‘Modules’ breakdown.)
Fixed-mapped Addresses: Compile-time virtual memory allocation
Fixed-mapped Addresses: Use Case
Early ioremap: based on fixed-mapped address
early_ioremap_setup(): 8 boot-time mapping slots, slot_virt[0]–slot_virt[7], carved out of the FIX_BTMAP range (FIX_BTMAP_END = 1024 … FIX_BTMAP_BEGIN = 1535)
• slot_virt[0] = 0xFFFF_FFFF_FF20_0000, slot_virt[7] = 0xFFFF_FFFF_FF3C_0000
• Covered by PDE #505 (0xFFFF_FFFF_FF20_0000), PDE #506 (0xFFFF_FFFF_FF40_0000) and PDE #507 (0xFFFF_FFFF_FF60_0000)
Early ioremap
• Maps/unmaps I/O physical addresses to virtual addresses before the regular ioremap mechanism is ready (see the sketch below)
• early_ioremap() & early_iounmap()
Fixed-mapped Addresses
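A minimal kernel-code sketch of how early boot code typically uses this interface; the physical address argument is a placeholder:

```c
#include <asm/early_ioremap.h>
#include <linux/io.h>

/* Illustrative only: temporarily map a device's registers before the
 * normal ioremap()/vmalloc machinery is available. */
static void __init early_mmio_peek(phys_addr_t phys)
{
	void __iomem *base;
	u32 val;

	base = early_ioremap(phys, PAGE_SIZE);   /* takes a FIX_BTMAP slot */
	if (!base)
		return;

	val = readl(base);                       /* access the mapping */
	pr_info("early MMIO read: 0x%x\n", val);

	early_iounmap(base, PAGE_SIZE);          /* release the slot */
}
```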
setup_arch() – Part 1
setup_arch() – Part 1
[Linux x86 Boot Protocol]
setup_data: 64-bit physical pointer to a linked list of struct setup_data
setup_arch() – Part 2
setup_arch() – Part 2 - cleanup_highmap
setup_arch() – Part 2
setup_arch() – Part 2: init_mem_mapping() -- Page Table
Configuration for Direct Mapping
setup_arch() – Part 2: init_mem_mapping() -- Page Table
Configuration for Direct Mapping
setup_arch() – Part 2: init_mem_mapping() -- Page Table
Configuration for Direct Mapping
Split the memory range into sub-ranges that can be mapped with 4K, 2M or 1G pages (see the sketch below).
split_mem_range()
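A simplified user-space sketch of the idea behind split_mem_range(); the real function in arch/x86/mm/init.c also handles PSE/GBPAGES capability checks and 32-bit quirks:

```c
#include <stdio.h>

#define SZ_2M (2ULL << 20)
#define SZ_1G (1ULL << 30)

static unsigned long long min_u64(unsigned long long a, unsigned long long b)
{
	return a < b ? a : b;
}

/* Print how [start, end) would be split so that only naturally aligned
 * chunks get 2M/1G pages while the unaligned head/tail fall back to 4K. */
static void split_mem_range(unsigned long long start, unsigned long long end)
{
	unsigned long long pos = start;
	unsigned long long head_2m = min_u64((pos + SZ_2M - 1) & ~(SZ_2M - 1), end);
	unsigned long long head_1g = min_u64((pos + SZ_1G - 1) & ~(SZ_1G - 1), end);
	unsigned long long tail_1g = end & ~(SZ_1G - 1);
	unsigned long long tail_2m = end & ~(SZ_2M - 1);

	if (pos < head_2m) { printf("4K: [%#llx, %#llx)\n", pos, head_2m); pos = head_2m; }
	if (head_1g <= tail_1g) {
		if (pos < head_1g) { printf("2M: [%#llx, %#llx)\n", pos, head_1g); pos = head_1g; }
		if (pos < tail_1g) { printf("1G: [%#llx, %#llx)\n", pos, tail_1g); pos = tail_1g; }
	}
	if (pos < tail_2m) { printf("2M: [%#llx, %#llx)\n", pos, tail_2m); pos = tail_2m; }
	if (pos < end)       printf("4K: [%#llx, %#llx)\n", pos, end);
}

int main(void)
{
	split_mem_range(0x1000, 0x23c000000ULL);   /* hypothetical range */
	return 0;
}
```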
setup_arch() – Part 2: init_mem_mapping() -- Page Table
Configuration for Direct Mapping
kernel_physical_mapping_init(): Page Table Configuration for Direct Mapping
setup_arch() – Part 3
Initialize the IDT with the early page-fault handler.
idt_setup_early_pf()
setup_arch() – Part 3 - x86_init.paging.pagetable_init()
Call chain: x86_init.paging.pagetable_init → native_pagetable_init → paging_init → { sparse_init; zone_sizes_init → free_area_init (configure the maximum page frame number (pfn) for each zone) }
Zone allocator – each zone has a buddy system and a per-CPU page frame cache:
• ZONE_DMA (physical address: 0–16MB)
• ZONE_DMA32 (physical address: 16MB–4GB)
• ZONE_NORMAL (physical address > 4GB)
• ZONE_MOVABLE, ZONE_DEVICE
paging_init()
• Initialize sparse memory and zone sizes
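For reference, zone_sizes_init() on x86 (arch/x86/mm/init.c, kernel 5.11) essentially fills max_zone_pfns and hands it to free_area_init(); lightly trimmed (the CONFIG_HIGHMEM branch is dropped, since it does not apply to x86_64):

```c
/* arch/x86/mm/init.c (5.11), lightly trimmed. */
void __init zone_sizes_init(void)
{
	unsigned long max_zone_pfns[MAX_NR_ZONES];

	memset(max_zone_pfns, 0, sizeof(max_zone_pfns));

#ifdef CONFIG_ZONE_DMA
	max_zone_pfns[ZONE_DMA]   = min(MAX_DMA_PFN, max_low_pfn);    /* 0-16MB   */
#endif
#ifdef CONFIG_ZONE_DMA32
	max_zone_pfns[ZONE_DMA32] = min(MAX_DMA32_PFN, max_low_pfn);  /* 16MB-4GB */
#endif
	max_zone_pfns[ZONE_NORMAL] = max_low_pfn;                     /* rest of RAM */

	free_area_init(max_zone_pfns);   /* configure pfn counts for each zone */
}
```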
Physical Memory Models
• Flat Memory Model (CONFIG_FLATMEM)
• UMA (Uniform Memory Access)
• Discontinuous Memory Model (CONFIG_DISCONTIGMEM)
• NUMA (Non-Uniform Memory Access)
• Sparse Memory Virtual Memmap (CONFIG_SPARSEMEM_VMEMMAP)
• NUMA
• Default configuration
• Sparse Memory
• NUMA
Sparse Memory Virtual Memmap
(CONFIG_SPARSEMEM_VMEMMAP=y)
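With CONFIG_SPARSEMEM_VMEMMAP=y, pfn_to_page()/page_to_pfn() reduce to array indexing against the virtually contiguous vmemmap region. A user-space sketch, assuming the default vmemmap_base and a 64-byte struct page (both assumptions, not guaranteed by every config):

```c
#include <stdio.h>

#define VMEMMAP_BASE     0xFFFFEA0000000000ULL  /* default vmemmap_base */
#define STRUCT_PAGE_SIZE 64                     /* typical sizeof(struct page) */

/* pfn_to_page() under SPARSEMEM_VMEMMAP is just 'vmemmap + pfn'. */
static unsigned long long pfn_to_page_addr(unsigned long long pfn)
{
	return VMEMMAP_BASE + pfn * STRUCT_PAGE_SIZE;
}

int main(void)
{
	/* struct page for the frame at physical address 4GB (pfn = 0x100000) */
	printf("0x%llx\n", pfn_to_page_addr(0x100000ULL));
	return 0;
}
```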
sparse_init() – Page Table Configuration for ‘struct page’
sparse_init()
sparse_init(): memory section index = ALIGN_DOWN(physical address, 128MB) >> 27
• ALIGN_DOWN(0xbffd_efff, 128MB) >> 27 = 0xb800_0000 >> 27 = 23
• ALIGN_DOWN(0x1_0000_0000, 128MB) >> 27 = 0x1_0000_0000 >> 27 = 32
• ALIGN_DOWN(0x2_403f_ffff, 128MB) >> 27 = 0x2_4000_0000 >> 27 = 72
setup_arch() – Part 3 – map_vsyscall
vsyscall (Virtual System Call) – Issue Statement
• The context-switch overhead (user <-> kernel) of some system calls (gettimeofday, time, getcpu) is greater than the execution time of those functions.
• Quote from the Linux Programmer's Manual – VDSO(7):
• Making system calls can be slow. In x86 32-bit systems, you can trigger a software interrupt (int $0x80) to tell the kernel you wish to make a system call. However, this instruction is expensive: it goes through the full interrupt-handling paths in the processor's microcode as well as in the kernel. Newer processors have faster (but backward incompatible) instructions to initiate system calls.
• Built on top of the fixed-mapped address
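For illustration, the kind of legacy binary the vsyscall page still has to support: gettimeofday() called directly through the fixed address. This only works when the kernel runs with vsyscall=emulate or vsyscall=native:

```c
#include <stdio.h>
#include <sys/time.h>

/* Legacy vsyscall ABI: gettimeofday lives at a fixed kernel address.
 * Modern kernels trap and emulate this (see the XD/#PF discussion below). */
#define VSYSCALL_GTOD 0xffffffffff600000UL

typedef int (*gtod_fn)(struct timeval *tv, struct timezone *tz);

int main(void)
{
	struct timeval tv;
	gtod_fn gtod = (gtod_fn)VSYSCALL_GTOD;

	if (gtod(&tv, NULL) == 0)
		printf("%ld.%06ld\n", (long)tv.tv_sec, (long)tv.tv_usec);
	return 0;
}
```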
vsyscall – Implementation (Emulate)
[PTE] Bit 63: Execute Disable (XD)
• If IA32_EFER.NXE = 1 and XD = 1, instruction fetches are not allowed from the page referenced by this PTE; such a fetch generates a #PF exception.
vsyscall - Experiment
vsyscall – Experiment – gdb + backtrace
Terminal #1
Terminal #2
vsyscall – Experiment – gdb + backtrace
Terminal #1
Terminal #2
error_code = 21 (0x15)
vsyscall – Experiment – gdb + backtrace
Terminal #1
Terminal #2
Replacement of vsyscall: vDSO (virtual Dynamic
Shared Object)
• vsyscall limitation
• Security concern: fixed virtual address (0xFFFF_FFFF_FF60_0000)
• vDSO
• Leverages ASLR (Address Space Layout Randomization)
• ASLR can be enabled/disabled via /proc/sys/kernel/randomize_va_space
• [Enable] echo 1 > /proc/sys/kernel/randomize_va_space
• [Disable] echo 0 > /proc/sys/kernel/randomize_va_space
• Mapped at a user-space address (see the sketch below)
• Security enhancement
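A small user-space check that the vDSO really is mapped at a randomized user address; AT_SYSINFO_EHDR is the ELF auxiliary-vector entry carrying the vDSO base:

```c
#include <elf.h>
#include <stdio.h>
#include <sys/auxv.h>

int main(void)
{
	/* The kernel passes the vDSO load address in the auxiliary vector. */
	unsigned long vdso = getauxval(AT_SYSINFO_EHDR);

	printf("vDSO base: 0x%lx\n", vdso);   /* changes across runs with ASLR on */
	return 0;
}
```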
setup_arch() – Part 3
[Recap] Page Table Configuration after finishing setup_arch()
[Recap] Page Table Configuration after finishing setup_arch()
vmlinux – start_kernel() – Part 2
Original .data..percpu
.data..percpu for core 2
.data..percpu for core 3
.data..percpu for core 0
.data..percpu for core 1
Physical Memory
memcpy in
setup_per_cpu_areas()
percpu section
*(.data..percpu..shared_aligned)
*(.data..percpu)
*(.data..percpu..read_mostly)
*(.data..percpu..page_aligned)
*(.data..percpu..first)
.data..percpu
__per_cpu_load
(kernel virtual address)
__per_cpu_end
__per_cpu_start = 0
percpu section
*(.data..percpu..shared_aligned)
*(.data..percpu)
*(.data..percpu..read_mostly)
*(.data..percpu..page_aligned)
*(.data..percpu..first)
.data..percpu
__per_cpu_load
(kernel virtual address)
__per_cpu_end
__per_cpu_start = 0
percpu variable access option #1: __per_cpu_offset
APIs (include/linux/percpu-defs.h):
* per_cpu_ptr(ptr, cpu): via __per_cpu_offset
Original .data..percpu
.data..percpu for core 2
.data..percpu for core 3
.data..percpu for core 0
.data..percpu for core 1
Physical Memory
memcpy with source
address ‘__per_cpu_load’
in setup_per_cpu_areas()
__per_cpu_offset[0]
__per_cpu_offset[1]
__per_cpu_offset[2]
__per_cpu_offset[3]
percpu variable access option #1: __per_cpu_offset
*(.data..percpu..shared_aligned)
*(.data..percpu)
*(.data..percpu..read_mostly)
*(.data..percpu..page_aligned)
*(.data..percpu..first)
.data..percpu
__per_cpu_load
(kernel virtual address)
__per_cpu_end
__per_cpu_start = 0
[Example]
gdt_page = 0xb000
Original .data..percpu
.data..percpu for core 2
.data..percpu for core 3
.data..percpu for core 0
.data..percpu for core 1
Physical Memory
memcpy with source
address ‘__per_cpu_load’
in setup_per_cpu_areas()
__per_cpu_offset[0]
__per_cpu_offset[1]
__per_cpu_offset[2]
__per_cpu_offset[3]
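A minimal kernel-code sketch of access option #1; the per-cpu variable name is made up for illustration:

```c
#include <linux/percpu.h>
#include <linux/smp.h>

/* Hypothetical per-cpu counter, placed in .data..percpu at link time. */
static DEFINE_PER_CPU(unsigned long, demo_counter);

static void demo_sum_all_cpus(void)
{
	unsigned long total = 0;
	int cpu;

	/* per_cpu_ptr() adds __per_cpu_offset[cpu] to the section offset of
	 * demo_counter to reach each CPU's private copy. */
	for_each_online_cpu(cpu)
		total += *per_cpu_ptr(&demo_counter, cpu);

	pr_info("demo_counter sum: %lu\n", total);
}
```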
percpu variable access option #2: gs register (MSR: IA32_GS_BASE)
APIs (include/linux/percpu-defs.h):
* this_cpu_read(pcp)
* this_cpu_write(pcp, val)
* this_cpu_add(pcp, val)
* this_cpu_ptr(ptr) & raw_cpu_ptr(ptr)
1. Use the gs register
2. If the gs-based access (point 1) is not supported, fall back to the this_cpu_off per-cpu variable (read-mostly)
Original .data..percpu
.data..percpu for core 2
.data..percpu for core 3
.data..percpu for core 0
.data..percpu for core 1
Physical Memory
memcpy with source
address
‘__per_cpu_load’ in
setup_per_cpu_areas()
CPU #0: IA32_GS_BASE
CPU #1: IA32_GS_BASE
CPU #2: IA32_GS_BASE
CPU #3: IA32_GS_BASE
gs register (MSR: IA32_GS_BASE) vs __per_cpu_offset
DEFINE_PER_CPU(int, x);

this_cpu_read() via the gs register:
    int z;
    z = this_cpu_read(x);    /* converts to a single instruction: mov %gs:x,%edx */
    Atomic: no need to disable preemption and interrupts

this_cpu_inc() via the gs register:
    this_cpu_inc(x);         /* converts to a single instruction: inc %gs:x */

this_cpu_inc() implementation via __per_cpu_offset:
    int *y;
    int cpu;
    cpu = get_cpu();
    y = per_cpu_ptr(&x, cpu);
    (*y)++;
    put_cpu();
    Non-atomic: need to disable preemption
vmlinux – start_kernel() – Part 2
vmlinux – start_kernel() – Part 2 – trap_init()
CPU Entry Area (percpu)
• Page Table Isolation (PTI)
o Mitigates Meltdown
o Isolates user-space and kernel-space memory
o When the kernel is entered via syscalls, interrupts or exceptions, the page tables are switched to the full "kernel" copy.
▪ The entry/exit functions and the IDT (Interrupt Descriptor Table) must be mapped in the user-space page table
PTI: Concept
• Without PTI, a single page table maps both user space and kernel space, in user mode and in kernel mode.
• With PTI, user mode runs on a page table that contains user space plus only a minimal kernel part, while kernel mode runs on the full kernel page table.
PTI: High-level implementation
• [User mode] User page table: user space, per-cpu TSS and the kernel entry code.
• On a syscall, the entry code switches to the kernel page table.
• [Kernel mode] Kernel page table: user space plus the whole kernel space.
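A hedged sketch of how x86 PTI pairs the two page tables: the kernel allocates an 8KB, order-1 PGD, keeps the kernel-mode PGD in the lower 4KB and the user-mode PGD in the upper 4KB, so switching between them is flipping one address bit (compare kernel_to_user_pgdp() in arch/x86/include/asm/pgtable.h and the CR3 switch in the entry code):

```c
#include <stdint.h>

#define PAGE_SHIFT              12
#define PTI_PGTABLE_SWITCH_BIT  PAGE_SHIFT   /* bit 12 selects the 4KB half */

/* Kernel PGD sits in the low 4KB of the order-1 allocation, the user PGD
 * in the high 4KB; the same bit is flipped in CR3 on kernel entry/exit. */
static inline uint64_t kernel_to_user_pgd(uint64_t kernel_pgd)
{
	return kernel_pgd | (1ULL << PTI_PGTABLE_SWITCH_BIT);
}

static inline uint64_t user_to_kernel_pgd(uint64_t user_pgd)
{
	return user_pgd & ~(1ULL << PTI_PGTABLE_SWITCH_BIT);
}
```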
vmlinux – start_kernel() – Part 2 – setup_cpu_entry_area()
vmlinux – start_kernel() – Part 2 – trap_init()
vmlinux – start_kernel() – Part 2 – mm_init()
mm_init
• Set up different parts of Linux kernel memory managers
vmlinux – start_kernel() – Part 2 - preallocate_vmalloc_pages()
vmlinux – start_kernel() – Part 2
pti_init()
pti_init()
vmlinux – start_kernel() – Part 2
vmlinux – start_kernel() – Part 3
vmlinux – start_kernel() – Part 4
vmlinux – start_kernel() – Part 4
CommitLimit: the total amount of memory currently available to be allocated on the system (formula sketched below).
Committed_AS: the amount of memory already requested (committed) by processes.
Overcommit: Committed_AS > CommitLimit
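For reference, a sketch of how /proc/meminfo derives CommitLimit from the overcommit ratio (see Documentation/vm/overcommit-accounting.rst; vm.overcommit_kbytes can override the ratio-based form):

```c
/* CommitLimit = (total RAM - hugetlb pages) * overcommit_ratio / 100 + swap
 * All values are in pages; overcommit_ratio defaults to 50. */
static unsigned long commit_limit_pages(unsigned long totalram_pages,
					unsigned long hugetlb_pages,
					unsigned long totalswap_pages,
					unsigned long overcommit_ratio)
{
	return (totalram_pages - hugetlb_pages) * overcommit_ratio / 100
	       + totalswap_pages;
}
```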
vmlinux – start_kernel() – Part 4
Idle Process (swapper) = init_task (pid = 0)
STACK_END_MAGIC = 0x57AC6E9D
struct pt_regs (save CPU registers for
userspace application)
task.stack
THREAD_SIZE = 16KB
kernel stack
usage space
task.stack + THREAD_SIZE
struct inactive_task_frame
task.thread_struct.sp
struct fork_frame
Kernel Stack
Context Switch – Kernel Stack
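For reference, the structures behind this stack layout (arch/x86/include/asm/switch_to.h, kernel 5.11; x86_64 fields only):

```c
/* Frame left at the top of a sleeping task's kernel stack; only the
 * callee-saved registers plus a return address need to be preserved. */
struct inactive_task_frame {
	unsigned long r15;
	unsigned long r14;
	unsigned long r13;
	unsigned long r12;
	unsigned long bx;
	/* bp and ret_addr form a stack frame header (get_frame_pointer()). */
	unsigned long bp;
	unsigned long ret_addr;
};

/* Layout copy_thread() builds for a new task: the inactive_task_frame sits
 * immediately below the task's struct pt_regs on the kernel stack. */
struct fork_frame {
	struct inactive_task_frame frame;
	struct pt_regs regs;
};
```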
Context Switch – Kernel Stack
[Kernel stack layout] struct pt_regs: r15–r12, bp, bx, r11–r8, ax, cx, dx, si, di, orig_ax, followed by the return frame for iretq: ip, cs, flags, sp, ss
orig_ax: syscall number, error code of a CPU exception, or IRQ number of a HW interrupt
thread_struct
tls_array
es, ds
fsindex, gsindex
fsbase, gsbase
sp
…
inactive_task_frame
r15-r13
bx (kernel thread function)
bp
ret_addr = ret_from_fork
r12 (kernel thread argument)
Configured by copy_thread() – kernel thread
callee-saved registers
STACK_END_MAGIC = 0x57AC6E9D
struct pt_regs (save CPU registers for
userspace application)
task.stack
THREAD_SIZE = 16KB
kernel stack
usage space
task.stack + THREAD_SIZE
struct inactive_task_frame
task.thread_struct.sp
struct fork_frame
Kernel Stack
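A trimmed excerpt of the kernel-thread path in copy_thread() (arch/x86/kernel/process.c, kernel 5.11), which fills the frame exactly as drawn above; the user-thread path and error handling are omitted:

```c
/* arch/x86/kernel/process.c (5.11), kernel-thread path only, trimmed. */
int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
		struct task_struct *p, unsigned long tls)
{
	struct inactive_task_frame *frame;
	struct fork_frame *fork_frame;
	struct pt_regs *childregs;

	childregs  = task_pt_regs(p);                    /* top of kernel stack */
	fork_frame = container_of(childregs, struct fork_frame, regs);
	frame      = &fork_frame->frame;

	frame->bp       = encode_frame_pointer(childregs);
	frame->ret_addr = (unsigned long)ret_from_fork;  /* first 'ret' target  */
	p->thread.sp    = (unsigned long)fork_frame;     /* task.thread.sp      */

	if (unlikely(p->flags & PF_KTHREAD)) {
		memset(childregs, 0, sizeof(struct pt_regs));
		/* bx = kernel thread function (passed in 'sp'), r12 = argument */
		kthread_frame_init(frame, sp, arg);
		return 0;
	}

	return 0;   /* (real code continues here for user threads) */
}
```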
Context Switch – Kernel Thread
inactive_task_frame
r15-r13
bx (kernel thread function)
bp
ret_addr = ret_from_fork
r12 (kernel thread argument)
Configured by copy_thread() – kernel thread
callee-saved registers
STACK_END_MAGIC = 0x57AC6E9D
struct pt_regs (save CPU registers
for userspace application)
task.stack
kernel stack
usage space
Kernel Stack
bx (kernel thread function)
r13
r14
r15
r12 (kernel thread argument)
ret_addr = ret_from_fork
bp
task.stack +
THREAD_SIZE
rsp
rip
STACK_END_MAGIC = 0x57AC6E9D
struct pt_regs (save CPU registers
for userspace application)
task.stack
kernel stack
usage space
Kernel Stack
bx (kernel thread function)
r13
r14
r15
r12 (kernel thread argument)
ret_addr = ret_from_fork
bp
task.stack +
THREAD_SIZE
rsp
rip
inactive_task_frame
r15-r13
bx (kernel thread function)
bp
ret_addr = ret_from_fork
r12 (kernel thread argument)
Configured by copy_thread() – kernel thread
callee-saved registers
Context Switch – Kernel Thread
STACK_END_MAGIC = 0x57AC6E9D
struct pt_regs (save CPU registers
for userspace application)
task.stack
kernel stack
usage space
Kernel Stack
bx (kernel thread function)
r13
r14
r15
r12 (kernel thread argument)
ret_addr = ret_from_fork
bp
task.stack +
THREAD_SIZE
rsp
rip
inactive_task_frame
r15-r13
bx (kernel thread function)
bp
ret_addr = ret_from_fork
r12 (kernel thread argument)
Configured by copy_thread() – kernel thread
callee-saved registers
Context Switch – Kernel Thread
Context Switch – Kernel Thread
jump
[Prev task] Returns to the instruction after the switch_to() call when the previous task is re-scheduled.
task.stack
Kernel Stack
STACK_END_MAGIC = 0x57AC6E9D
struct pt_regs (save/restore CPU
registers for userspace tasks)
kernel stack
usage space
bx (kernel thread function)
r13
r14
r15
r12 (kernel thread argument)
ret_addr = ret_from_fork
bp
task.stack +
THREAD_SIZE
rsp
rsp
`return prev_p`
Context Switch – Kernel Thread
jump
Context Switch – When to run ‘context switch’?
Explicitly call ‘schedule()’ Call ‘cond_resched()’ to yield CPU resource
Context Switch
Context Switch – init_task is rescheduled
[Prev task] Returns to the instruction after the switch_to() call when the previous task is re-scheduled.
Backtrace when init_task (pid = 0) is rescheduled because the kernel_init thread (pid = 1) is scheduled out
Kernel Thread Context Switch
mm_struct
mmap (list of VMAs)
pgd
pgd_t
pgd
task_struct
mm = NULL
active_mm = NULL
task_struct
mm = NULL
active_mm = NULL
task_struct
mm = NULL
active_mm
scheduler
init_task (pid = 0) init_mm
swapper_pg_dir =
init_top_pgt
init process (pid = 1)
kthreadd (pid = 2)
Kernel Thread Context Switch
mm_struct
mmap (list of VMAs)
pgd
pgd_t
pgd
init_task (pid = 0) init_mm
swapper_pg_dir =
init_top_pgt
task_struct
mm = NULL
active_mm
init process (pid = 1)
kthreadd (pid = 2)
task_struct
mm = NULL
active_mm = NULL
task_struct
mm = NULL
active_mm = NULL
scheduler
pid = 0
pid = 1
Kernel Thread Context Switch – Start Here (Aug 2, 2021)
mm_struct
mmap (list of VMAs)
pgd
pgd_t
pgd
task_struct
mm = NULL
active_mm
task_struct
mm = NULL
active_mm = NULL
task_struct
mm = NULL
active_mm = NULL
scheduler
init_task (pid = 0) init_mm
swapper_pg_dir =
init_top_pgt
init process (pid = 1)
kthreadd (pid = 2)
Kernel Thread Context Switch
mm_struct
mmap (list of VMAs)
pgd
pgd_t
pgd
task_struct
mm = NULL
active_mm = NULL
task_struct
mm = NULL
active_mm
task_struct
mm = NULL
active_mm = NULL
scheduler
init_task (pid = 0) init_mm
swapper_pg_dir =
init_top_pgt
init process (pid = 1)
kthreadd (pid = 2)
pid = 1
pid = 2
Kernel Thread Context Switch
mm_struct
mmap (list of VMAs)
pgd
pgd_t
pgd
task_struct
mm = NULL
active_mm = NULL
task_struct
mm = NULL
active_mm
task_struct
mm = NULL
active_mm = NULL
scheduler
init_task (pid = 0) init_mm
swapper_pg_dir =
init_top_pgt
init process (pid = 1)
kthreadd (pid = 2)
pid = 1
pid = 2
1. A kernel thread does not have its own ‘mm’.
2. The next task's active_mm is inherited from the previous task, so both use the same page table (see the sketch below).
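A trimmed excerpt of the mm handling in context_switch() (kernel/sched/core.c, kernel 5.11) that implements this inheritance; membarrier and lockdep details are dropped:

```c
/*
 * Trimmed from context_switch() (kernel/sched/core.c, 5.11):
 * kernel threads (next->mm == NULL) borrow the previous task's active_mm,
 * so CR3 stays the same and no TLB flush is needed.
 */
if (!next->mm) {                            /* switching to a kernel thread */
	enter_lazy_tlb(prev->active_mm, next);
	next->active_mm = prev->active_mm;
	if (prev->mm)                       /* coming from a user task */
		mmgrab(prev->active_mm);    /* keep the borrowed mm alive */
	else
		prev->active_mm = NULL;
} else {                                    /* switching to a user task */
	switch_mm_irqs_off(prev->active_mm, next->mm, next);
	if (!prev->mm) {                    /* coming from a kernel thread */
		rq->prev_mm = prev->active_mm;
		prev->active_mm = NULL;
	}
}
```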
Context Switch: Kernel Thread <-> User Space Task
mm_struct
mmap (list of VMAs)
pgd
pgd_t
pgd
task_struct
scheduler
init_task (pid = 0)
sleep program (pid = 40)
task_struct
mm = NULL
active_mm
cpu = 2
mm_struct
mmap (list of VMAs)
pgd
pgd_t
pgd
mm
active_mm
cpu = 2
Two breakpoints
breakpoint #1
breakpoint #2
gdb breakpoint configuration
Context Switch: Kernel Thread <-> User Space Task
mm_struct
mmap (list of VMAs)
pgd
pgd_t
pgd
task_struct
scheduler
init_task (pid = 0)
sleep program (pid = 40)
task_struct
mm = NULL
active_mm = NULL
cpu = 2
mm_struct
mmap (list of VMAs)
pgd
pgd_t
pgd
mm
active_mm
cpu = 2
`sleep` userspace task is
selected to run
Context Switch: Kernel Thread <-> User Space Task
mm_struct
mmap (list of VMAs)
pgd
pgd_t
pgd
task_struct
scheduler
init_task (pid = 0)
sleep program (pid = 40)
task_struct
mm = NULL
active_mm = NULL
cpu = 2
mm_struct
mmap (list of VMAs)
pgd
pgd_t
pgd
mm
active_mm
cpu = 2
pid = 0
pid = 40
`sleep` userspace task is
selected to run
Context Switch: Kernel Thread <-> User Space Task
task_struct
scheduler
sleep program (pid = 40)
mm_struct
mmap (list of VMAs)
pgd
pgd_t
pgd
mm
active_mm
cpu = 2
`sleep` userspace task is
scheduled out
Context Switch: Kernel Thread <-> User Space Task
task_struct
scheduler
sleep program (pid = 40)
mm_struct
mmap (list of VMAs)
pgd
pgd_t
pgd
mm
active_mm
cpu = 2
task_struct
ksoftirqd/2 (pid = 20)
mm = NULL
active_mm
cpu = 2
pid = 40
pid = 20
[Kernel Thread]
Inherits the active_mm of the previous task
(no need to flush the TLB because CR3 is not changed)
`sleep` userspace task is
scheduled out
vmlinux – start_kernel() – Part 4
init process = kernel_init() (pid = 1)
[pid = 1 – init process] When are mm & active_mm allocated?
[pid = 1 – init process] When are mm & active_mm allocated?
[pid = 1 – init process] When are mm & active_mm allocated?
clone_pgd_range()
[pid = 1 – init process] When are mm & active_mm allocated?
[pid = 1] Before running run_init_process()
[pid = 1] After finishing run_init_process():
kernel thread -> user process
clone_pgd_range(): mm.pgd verification
[pid = 1] mm_struct
smp_init() - boot secondary CPUs
smp_init() - boot secondary CPUs
smp_init() - boot secondary CPUs
cpuhp/<cpu_id> kernel thread
• Executes callbacks (teardown, startup and so on) when the CPU hotplug state changes (see the sketch below)
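For orientation, a trimmed excerpt of smp_init() (kernel/smp.c, kernel 5.11), the place where the secondary CPUs are walked through the CPU-hotplug state machine:

```c
/* kernel/smp.c (5.11), trimmed. */
void __init smp_init(void)
{
	idle_threads_init();            /* fork one idle task per CPU       */
	cpuhp_threads_init();           /* create the cpuhp/<cpu> threads   */

	pr_info("Bringing up secondary CPUs ...\n");

	/* Walk each present CPU through the hotplug states up to ONLINE;
	 * this ends up sending INIT/SIPI and entering startup_32/64. */
	bringup_nonboot_cpus(setup_max_cpus);

	smp_cpus_done(setup_max_cpus);  /* arch hook: finish SMP bring-up   */
}
```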
smp_init() - boot secondary CPUs
smp_init() - boot secondary CPUs – Boot Flow
startup_32: setup cr3 @trampoline_pgd
secondary_startup_64: setup cr3 @init_top_pgt
[Secondary CPUs] CR3 Register Configuration
startup_32() - boot secondary CPUs – Page Table Configuration
startup_32: setup cr3 @trampoline_pgd
secondary_startup_64: setup cr3 @init_top_pgt
[Secondary CPUs] CR3 Register Configuration
startup_32() - boot secondary CPUs – Page Table Configuration
startup_32: setup cr3 @trampoline_pgd
secondary_startup_64: setup cr3 @init_top_pgt
[Secondary CPUs] CR3 Register Configuration
secondary_startup_64() - boot secondary CPUs – Page Table
startup_32: setup cr3 @trampoline_pgd
secondary_startup_64: setup cr3 @init_top_pgt
[Secondary CPUs] CR3 Register Configuration
Secondary CPUs – When to configure active_mm for idle_threads?
pstree after finishing start_kernel()
Reference
• The Linux/x86 Boot Protocol, Documentation/x86/boot.rst
• Intel® 64 and IA-32 Architectures Software Developer’s Manual
• https://wdv4758h.github.io/notes/blog/linux-kernel-boot.html
• Linux insides, https://0xax.gitbooks.io/linux-insides/content/
• Debugging kernel and modules via gdb, https://www.kernel.org/doc/Documentation/dev-tools/gdb-kernel-debugging.rst
