feat: SMP load balancing (per-CPU TSS, AP GDT reload, BSP-only timer work)

Five changes enable kernel thread dispatch to any CPU:
1. Per-CPU TSS (gdt.c, gdt.h): Replace single TSS with tss_array[SMP_MAX_CPUS].
Each AP gets its own TSS via tss_init_ap() so ring 3→0 transitions use
the correct per-task kernel stack on any CPU.
2. AP GDT virtual base reload (smp.c): The AP trampoline loads the GDT with
a physical base for real→protected mode. After paging is active, reload
the GDTR with the virtual base and flush all segment registers. Without
this, ring transitions on APs read GDT entries from the identity-mapped
physical address, causing silent failures for user-mode processes.
3. BSP-only timer work (timer.c): Gate the tick increment, vdso update,
vga_flush, hal_uart_poll_rx, and process_wake_check so they run only on
CPU 0; APs call only schedule(). This prevents non-atomic tick races,
concurrent VGA/UART access, and duplicate wake processing.
4. Per-CPU SYSENTER stacks (sysenter_init.c): Each AP gets its own
SYSENTER ESP MSR pointing to a dedicated stack.
5. Load balancing (scheduler.c): process_create_kernel dispatches to
the least-loaded CPU via sched_pcpu_least_loaded(). Each CPU updates
its own TSS ESP0 during context switch.
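
For reviewers, the shape of items 3 and 5 can be sketched roughly as below.
This is an illustrative sketch, not the code in this commit: the pcpu_load
array, cpu_count, the value of SMP_MAX_CPUS, the timer_tick() name, and the
signature of sched_pcpu_least_loaded() are all assumptions made for the
example.

```c
#include <stdint.h>

#define SMP_MAX_CPUS 8  /* assumed value; the real constant is in the kernel headers */

/* Hypothetical per-CPU runnable-task counts; the real scheduler state may differ. */
static uint32_t pcpu_load[SMP_MAX_CPUS];
static uint32_t cpu_count = 4;  /* assumed number of online CPUs */
static uint32_t tick;           /* global tick, written by CPU 0 only */

/* Return the online CPU with the fewest runnable tasks (lowest index wins ties). */
static uint32_t sched_pcpu_least_loaded(void)
{
    uint32_t best = 0;
    for (uint32_t cpu = 1; cpu < cpu_count; cpu++) {
        if (pcpu_load[cpu] < pcpu_load[best])
            best = cpu;
    }
    return best;
}

/* Timer handler shape: global work on the BSP only, scheduling on every CPU. */
static void timer_tick(uint32_t cpu_id)
{
    if (cpu_id == 0) {
        tick++;  /* single writer, so a plain increment cannot race */
        /* vdso update, vga_flush, hal_uart_poll_rx, process_wake_check go here */
    }
    /* schedule() is called here on every CPU */
}
```

With a single writer for the tick counter, no atomic increment is needed,
which is the reason for gating on CPU 0 rather than adding locking.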