Decompressed vmlinux: linux kernel initialization from page table configuration perspective
This document explores the Linux kernel's initialization process from the perspective of page table configuration, based on kernel 5.11 (x86_64). Key topics include the boot flow into the decompressed vmlinux, the 64-bit virtual address layout, fixed-mapped addresses, early ioremap, physical memory models (especially sparse memory), virtual system calls (vsyscall), percpu variables, Page Table Isolation (PTI), and how kernel threads and user tasks are set up and context-switched during initialization.
1.
Decompressed vmlinux: Linux Kernel Initialization
from Page Table Configuration Perspective
Adrian Huang | June, 2021
* Based on kernel 5.11 (x86_64) – QEMU
* SMP (4 CPUs) and 8GB memory
* Kernel parameter: nokaslr
* Legacy BIOS
2.
Agenda
• Recap – CPU booting flow and page table before entering decompressed vmlinux
• 64-bit Virtual Address
• Decompressed vmlinux: Important functions
• Entry point: startup_64()
• x86_64_start_kernel() -> start_kernel() -> setup_arch()
• Apart from focusing on page table configuration, the following are covered as well:
• Fixed-mapped addresses
• Early ioremap: based on fixed-mapped addresses
• Physical memory models
• Especially for sparse memory
• vsyscall - virtual system call (Built on top of fixed-mapped addresses)
• percpu variable
• PTI (Page Table Isolation)
• Kernel thread fork & context switch: struct pt_regs and struct inactive_task_frame in the kernel stack
• How to boot secondary CPUs? Where is the entry address?
64-bit Virtual Address
• User space: 0x0000_0000_0000_0000 – 0x0000_7FFF_FFFF_FFFF (128TB)
• Non-canonical hole (empty space), then kernel space: 0xFFFF_8000_0000_0000 – 0xFFFF_FFFF_FFFF_FFFF
Kernel virtual address layout (default configuration, low to high):
• Guard hole (8TB)
• LDT remap for PTI (0.5TB)
• page_offset_base = 0xFFFF_8880_0000_0000: page frame direct mapping (64TB) of physical memory – ZONE_DMA (first 16MB), ZONE_DMA32, ZONE_NORMAL
• Unused hole (0.5TB)
• vmalloc_base = 0xFFFF_C900_0000_0000: vmalloc/ioremap (32TB)
• Unused hole (1TB)
• vmemmap_base = 0xFFFF_EA00_0000_0000: virtual memory map (1TB) – an array of page frame descriptors (struct page), one *page entry per page frame of the direct mapping
• __START_KERNEL_map = 0xFFFF_FFFF_8000_0000: kernel text mapping, mapped from physical address 0 (1GB or 512MB)
• __START_KERNEL = 0xFFFF_FFFF_8100_0000: kernel code [.text, .data…]
• MODULES_VADDR: module mapping space (1GB or 1.5GB)
• FIXADDR_START – FIXADDR_TOP = 0xFFFF_FFFF_FF7F_F000: fix-mapped address space (expanded to 4MB: commit 05ab1d8a4b36)
• Unused hole (2MB) at 0xFFFF_FFFF_FFE0_0000
* page_offset_base, vmalloc_base and vmemmap_base can be dynamically configured by KASLR (Kernel Address Space Layout Randomization – "arch/x86/mm/kaslr.c")
Reference: Documentation/x86/x86_64/mm.rst
Decompressed vmlinux – entry point: startup_64
1. The entry point is still at 0x1000000 (16MB) – not at a kernel virtual address
2. The kernel virtual addresses are executed only after the corresponding page tables are all set
3. Change to the kernel virtual address by issuing a 'jmp' instruction
Decompressed vmlinux – entry point: startup_64
1. Use the original per_cpu copy of 'init_per_cpu__gdt_page' temporarily
2. Switch to the CPU's own per_cpu 'gdt_page' when calling switch_to_new_gdt()
When to switch to the CPU's own gdt_page (percpu)?
Decompressed vmlinux – early_idt_handler_common
Kernel stack layout (struct pt_regs), from low to high address:
r15, r14, r13, r12, bp, bx, r11, r10, r9, r8, ax, cx, dx, si, di, orig_ax, ip, cs, flags, sp, ss
• ip, cs, flags, sp, ss: return frame for iretq (pushed by the CPU)
• orig_ax: syscall number, error code for a CPU exception, or IRQ number of a HW interrupt
• Callee-saved registers (bx, bp, r12-r15): check the x86_64 ABI
setup_arch() – Part 1
memblock: boot-time memory management
Memblock
• Memory allocation during the boot-time stage
• Set up in setup_arch()
• Torn down in mem_init(): releases free pages to the buddy allocator
[memblock] Reserve page 0
• Security: mitigate the L1TF (L1 Terminal Fault) vulnerability
setup_arch() – Part 2: init_mem_mapping() – Page Table Configuration for Direct Mapping
split_mem_range
• Split the memory range into sub-ranges whose boundaries fulfill 4K, 2M or 1G page alignment.
vsyscall (Virtual System Call) – Issue Statement
• The context-switch overhead (user <-> kernel) of some system calls (gettimeofday, time, getcpu) is greater than the execution time of those functions.
• Quote from the Linux Programmer's Manual – VDSO(7): Making system calls can be slow. On x86 32-bit systems, you can trigger a software interrupt (int $0x80) to tell the kernel you wish to make a system call. However, this instruction is expensive: it goes through the full interrupt-handling paths in the processor's microcode as well as in the kernel. Newer processors have faster (but backward incompatible) instructions to initiate system calls.
• Built on top of the fixed-mapped address
vsyscall – Implementation (Emulate)
[PTE] Bit 63: Execute Disable (XD)
• If IA32_EFER.NXE = 1 and XD = 1, instruction fetches are not allowed from the page referenced by this PTE. Such a fetch generates a #PF exception.
vmlinux – start_kernel() – Part 2
setup_per_cpu_areas(): memcpy the original .data..percpu section (source address '__per_cpu_load') into one copy per core in physical memory – .data..percpu for core 0, 1, 2 and 3.
percpu variable access option #1: __per_cpu_offset
APIs (include/linux/percpu-defs.h):
* per_cpu_ptr(ptr, cpu): via __per_cpu_offset
__per_cpu_offset[0..3]: the offset from the original section to each core's copy.
.data..percpu section layout (__per_cpu_start = 0; placed at kernel virtual addresses '__per_cpu_load' to '__per_cpu_end'), low to high:
*(.data..percpu..first)
*(.data..percpu..page_aligned)
*(.data..percpu..read_mostly)
*(.data..percpu)
*(.data..percpu..shared_aligned)
[Example] gdt_page = 0xb000
percpu variable access option #2: gs register (MSR: IA32_GS_BASE)
APIs (include/linux/percpu-defs.h):
* this_cpu_read(pcp)
* this_cpu_write(pcp, val)
* this_cpu_add(pcp, val)
* this_cpu_ptr(ptr) & raw_cpu_ptr(ptr)
1. Use the gs register: each CPU's IA32_GS_BASE points to that CPU's own copy of .data..percpu (copied from '__per_cpu_load' in setup_per_cpu_areas())
2. If option #1 is not supported, use the 'this_cpu_off' percpu variable (read mostly)
gs register (MSR: IA32_GS_BASE) vs __per_cpu_offset
DEFINE_PER_CPU(int, x);
[gs register]
• this_cpu_read(): int z = this_cpu_read(x); converts to a single instruction: mov %gs:x,%edx
• this_cpu_inc(x): converts to a single instruction: inc %gs:x
• Atomic: no need to disable preemption and interrupts
[this_cpu_inc() implementation via __per_cpu_offset]
• int *y; int cpu; cpu = get_cpu(); y = per_cpu_ptr(&x, cpu); (*y)++; put_cpu();
• Non-atomic: need to disable preemption
vmlinux – start_kernel() – Part 2 – trap_init()
CPU Entry Area (percpu)
• Page Table Isolation (PTI)
o Mitigates Meltdown
o Isolates user-space and kernel-space memory
o When the kernel is entered via syscalls, interrupts or exceptions, the page tables are switched to the full "kernel" copy.
▪ Entry/exit functions and the IDT (Interrupt Descriptor Table) still need to be mapped in the user-space page table
PTI: Concept
• Without PTI: one page table maps both user space and kernel space, and is used in both user mode and kernel mode.
• With PTI: two page tables per process – the user page table maps user space plus only a minimal kernel portion (percpu TSS, entry code), while the kernel page table maps user space and the full kernel space.
PTI: High-level implementation
• [User mode] Run on the user page table: user space + percpu TSS + entry stubs.
• On a syscall (or interrupt/exception), the entry code switches to the kernel page table.
• [Kernel mode] Run on the full kernel page table; switch back to the user page table on return to user mode.
vmlinux – start_kernel() – Part 4
CommitLimit: the total amount of memory currently available to be allocated on the system (see /proc/meminfo).
Committed_AS: the amount of memory requested (committed) by processes.
Overcommit: Committed_AS > CommitLimit
Context Switch – init_task is rescheduled
[Prev task] Return to the next instruction after the call to switch_to() when the previous task is rescheduled.
Backtrace when init_task (pid = 0) is rescheduled because the kernel_init thread (pid = 1) is scheduled out
Kernel Thread Context Switch
The scheduler switches among init_task (pid = 0), the init process (pid = 1) and kthreadd (pid = 2); all of them have mm = NULL. The running task's active_mm points to init_mm, whose pgd is swapper_pg_dir = init_top_pgt.
1. A kernel thread does not have its own 'mm'.
2. The active_mm of the next task inherits the one of the previous task (they use the same page table).
Context Switch: Kernel Thread <-> User Space Task
gdb breakpoint configuration: two breakpoints (breakpoint #1 and breakpoint #2) to observe the switches on cpu = 2.
1. Initially the scheduler runs init_task (pid = 0, mm = NULL). The `sleep` program (pid = 40) has its own mm_struct (mmap list of VMAs, pgd).
2. The `sleep` user-space task is selected to run: both its mm and active_mm point to its own mm_struct, and its pgd is loaded.
3. The `sleep` user-space task is scheduled out and ksoftirqd/2 (pid = 20, mm = NULL) runs next: [Kernel Thread] it inherits the active_mm of the previous task. (No need to flush the TLB because cr3 is not changed.)
[pid = 1 – init process] When are mm & active_mm allocated?
clone_pgd_range()
[pid = 1] Before running run_init_process()
[pid = 1] After finishing run_init_process(): kernel thread -> user process
clone_pgd_range(): mm.pgd verification
[pid = 1] mm_struct