aboutsummaryrefslogtreecommitdiffstats
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
4 daysMerge tag 'slab-for-6.18-rc7' of ↵Linus Torvalds1-6/+26
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab fix from Vlastimil Babka: - Fix mempool poisoning order>0 pages with CONFIG_HIGHMEM (Vlastimil Babka) * tag 'slab-for-6.18-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: mm/mempool: fix poisoning order>0 pages with HIGHMEM
5 daysMerge tag 'fixes-2025-11-19' of ↵Linus Torvalds1-1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock Pull memblock fix from Mike Rapoport: "Fix memblock_estimated_nr_free_pages() for soft-reserved memory The "soft-reserved" memory regions (EFI_MEMORY_SP) are added to the memblock.reserved, but not to the memblock.memory. It causes memblock_estimated_nr_free_pages() to return a value smaller value than expected, or if it underflows, an extremely large value. Calculate the number of estimated free pages using memblock_reserved_kern_size() instead of memblock_reserved_size() to fix the issue" * tag 'fixes-2025-11-19' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock: memblock: fix memblock_estimated_nr_free_pages() for soft-reserved memory
6 daysmm/huge_memory: Fix initialization of huge zero folioLinus Torvalds1-7/+2
The recent fix to properly initialize the tags of the huge zero folio had an unfortunate not-so-subtle side effect: it caused the actual *contents* of the huge zero folio to not be initialized at all when the hardware didn't support the memory tagging. The reason was the unfortunate semantics of tag_clear_highpage(): on hardware that didn't do the tagging, it would silently just not do anything at all. And since this is done only on arm64 with MTE support, that basically meant most hardware. It wasn't necessarily immediately obvious since the huge zero page isn't necessarily very heavily used - or because it might already be zero because all-zeroes is the most common pattern. But it ends up causing random odd user space failures when you do hit it. The unfortunate semantics have been around for a while, but became a real bug only when we started actively using __GFP_ZEROTAGS in the generic get_huge_zero_folio() function - before that, it had only ever been used in code that checked that the hardware supported it. Fix this by simply changing the semantics of tag_clear_highpage() to return whether it actually successfully did something or not. While at it, also make it initialize multiple pages in one go, since that's actually what the only caller wants it to do and it simplifies the whole logic. Fixes: adfb6609c680 ("mm/huge_memory: initialise the tags of the huge zero folio") Link: https://lore.kernel.org/all/20251117082023.90176-1-00107082@163.com/ Reviewed-by: David Hildenbrand (Red Hat) <david@kernel.org> Reported-and-tested-by: David Wang <00107082@163.com> Reported-and-tested-by: Carlos Llamas <cmllamas@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 daysMerge tag 'vfs-6.18-rc7.fixes' of ↵Linus Torvalds1-8/+7
gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: - Fix unitialized variable in statmount_string() - Fix hostfs mounting when passing host root during boot - Fix dynamic lookup to fail on cell lookup failure - Fix missing file type when reading bfs inodes from disk - Enforce checking of sb_min_blocksize() calls and update all callers accordingly - Restore write access before closing files opened by open_exec() in binfmt_misc - Always freeze efivarfs during suspend/hibernate cycles - Fix statmount()'s and listmount()'s grab_requested_mnt_ns() helper to actually allow mount namespace file descriptor in addition to mount namespace ids - Fix tmpfs remount when noswap is specified - Switch Landlock to iput_not_last() to remove false-positives from might_sleep() annotations in iput() - Remove dead node_to_mnt_ns() code - Ensure that per-queue kobjects are successfully created * tag 'vfs-6.18-rc7.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: landlock: fix splats from iput() after it started calling might_sleep() fs: add iput_not_last() shmem: fix tmpfs reconfiguration (remount) when noswap is set fs/namespace: correctly handle errors returned by grab_requested_mnt_ns power: always freeze efivarfs binfmt_misc: restore write access before closing files opened by open_exec() block: add __must_check attribute to sb_min_blocksize() virtio-fs: fix incorrect check for fsvq->kobj xfs: check the return value of sb_min_blocksize() in xfs_fs_fill_super isofs: check the return value of sb_min_blocksize() in isofs_fill_super exfat: check return value of sb_min_blocksize in exfat_read_boot_sector vfat: fix missing sb_min_blocksize() return value checks mnt: Remove dead code which might prevent from building bfs: Reconstruct file type when loading from disk afs: Fix dynamic lookup to fail on cell lookup failure hostfs: Fix only passing host root in boot stage with new mount fs: Fix uninitialized 'offp' in statmount_string()
8 daysMerge tag 'mm-hotfixes-stable-2025-11-16-10-40' of ↵Linus Torvalds3-2/+24
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "7 hotfixes. 5 are cc:stable, 4 are against mm/ All are singletons - please see the respective changelogs for details" * tag 'mm-hotfixes-stable-2025-11-16-10-40' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm, swap: fix potential UAF issue for VMA readahead selftests/user_events: fix type cast for write_index packed member in perf_test lib/test_kho: check if KHO is enabled mm/huge_memory: fix folio split check for anon folios in swapcache MAINTAINERS: update David Hildenbrand's email address crash: fix crashkernel resource shrink mm: fix MAX_FOLIO_ORDER on powerpc configs with hugetlb
9 daysmm, swap: fix potential UAF issue for VMA readaheadKairui Song1-0/+13
Since commit 78524b05f1a3 ("mm, swap: avoid redundant swap device pinning"), the common helper for allocating and preparing a folio in the swap cache layer no longer tries to get a swap device reference internally, because all callers of __read_swap_cache_async are already holding a swap entry reference. The repeated swap device pinning isn't needed on the same swap device. Caller of VMA readahead is also holding a reference to the target entry's swap device, but VMA readahead walks the page table, so it might encounter swap entries from other devices, and call __read_swap_cache_async on another device without holding a reference to it. So it is possible to cause a UAF when swapoff of device A raced with swapin on device B, and VMA readahead tries to read swap entries from device A. It's not easy to trigger, but in theory, it could cause real issues. Make VMA readahead try to get the device reference first if the swap device is a different one from the target entry. Link: https://lkml.kernel.org/r/20251111-swap-fix-vma-uaf-v1-1-41c660e58562@tencent.com Fixes: 78524b05f1a3 ("mm, swap: avoid redundant swap device pinning") Suggested-by: Huang Ying <ying.huang@linux.alibaba.com> Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 daysmm/huge_memory: fix folio split check for anon folios in swapcacheZi Yan1-2/+4
Both uniform and non uniform split check missed the check to prevent splitting anon folios in swapcache to non-zero order. Splitting anon folios in swapcache to non-zero order can cause data corruption since swapcache only support PMD order and order-0 entries. This can happen when one use split_huge_pages under debugfs to split anon folios in swapcache. In-tree callers do not perform such an illegal operation. Only debugfs interface could trigger it. I will put adding a test case on my TODO list. Fix the check. Link: https://lkml.kernel.org/r/20251105162910.752266-1-ziy@nvidia.com Fixes: 58729c04cf10 ("mm/huge_memory: add buddy allocator like (non-uniform) folio_split()") Signed-off-by: Zi Yan <ziy@nvidia.com> Reported-by: "David Hildenbrand (Red Hat)" <david@kernel.org> Closes: https://lore.kernel.org/all/dc0ecc2c-4089-484f-917f-920fdca4c898@kernel.org/ Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 daysmm: fix MAX_FOLIO_ORDER on powerpc configs with hugetlbDavid Hildenbrand (Red Hat)1-0/+7
In the past, CONFIG_ARCH_HAS_GIGANTIC_PAGE indicated that we support runtime allocation of gigantic hugetlb folios. In the meantime it evolved into a generic way for the architecture to state that it supports gigantic hugetlb folios. In commit fae7d834c43c ("mm: add __dump_folio()") we started using CONFIG_ARCH_HAS_GIGANTIC_PAGE to decide MAX_FOLIO_ORDER: whether we could have folios larger than what the buddy can handle. In the context of that commit, we started using MAX_FOLIO_ORDER to detect page corruptions when dumping tail pages of folios. Before that commit, we assumed that we cannot have folios larger than the highest buddy order, which was obviously wrong. In commit 7b4f21f5e038 ("mm/hugetlb: check for unreasonable folio sizes when registering hstate"), we used MAX_FOLIO_ORDER to detect inconsistencies, and in fact, we found some now. Powerpc allows for configs that can allocate gigantic folio during boot (not at runtime), that do not set CONFIG_ARCH_HAS_GIGANTIC_PAGE and can exceed PUD_ORDER. To fix it, let's make powerpc select CONFIG_ARCH_HAS_GIGANTIC_PAGE with hugetlb on powerpc, and increase the maximum folio size with hugetlb to 16 GiB on 64bit (possible on arm64 and powerpc) and 1 GiB on 32 bit (powerpc). Note that on some powerpc configurations, whether we actually have gigantic pages depends on the setting of CONFIG_ARCH_FORCE_MAX_ORDER, but there is nothing really problematic about setting it unconditionally: we just try to keep the value small so we can better detect problems in __dump_folio() and inconsistencies around the expected largest folio in the system. Ideally, we'd have a better way to obtain the maximum hugetlb folio size and detect ourselves whether we really end up with gigantic folios. Let's defer bigger changes and fix the warnings first. While at it, handle gigantic DAX folios more clearly: DAX can only end up creating gigantic folios with HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD. Add a new Kconfig option HAVE_GIGANTIC_FOLIOS to make both cases clearer. In particular, worry about ARCH_HAS_GIGANTIC_PAGE only with HUGETLB_PAGE. Note: with enabling CONFIG_ARCH_HAS_GIGANTIC_PAGE on powerpc, we will now also allow for runtime allocations of folios in some more powerpc configs. I don't think this is a problem, but if it is we could handle it through __HAVE_ARCH_GIGANTIC_PAGE_RUNTIME_SUPPORTED. While __dump_page()/__dump_folio was also problematic (not handling dumping of tail pages of such gigantic folios correctly), it doesn't seem critical enough to mark it as a fix. Link: https://lkml.kernel.org/r/20251114214920.2550676-1-david@kernel.org Fixes: 7b4f21f5e038 ("mm/hugetlb: check for unreasonable folio sizes when registering hstate") Reported-by: Christophe Leroy <christophe.leroy@csgroup.eu> Closes: https://lore.kernel.org/r/3e043453-3f27-48ad-b987-cc39f523060a@csgroup.eu/ Reported-by: Sourabh Jain <sourabhjain@linux.ibm.com> Closes: https://lore.kernel.org/r/94377f5c-d4f0-4c0f-b0f6-5bf1cd7305b1@linux.ibm.com/ Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Donet Tom <donettom@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 daysmm/mempool: fix poisoning order>0 pages with HIGHMEMVlastimil Babka1-6/+26
The kernel test has reported: BUG: unable to handle page fault for address: fffba000 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page *pde = 03171067 *pte = 00000000 Oops: Oops: 0002 [#1] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Tainted: G T 6.18.0-rc2-00031-gec7f31b2a2d3 #1 NONE a1d066dfe789f54bc7645c7989957d2bdee593ca Tainted: [T]=RANDSTRUCT Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 EIP: memset (arch/x86/include/asm/string_32.h:168 arch/x86/lib/memcpy_32.c:17) Code: a5 8b 4d f4 83 e1 03 74 02 f3 a4 83 c4 04 5e 5f 5d 2e e9 73 41 01 00 90 90 90 3e 8d 74 26 00 55 89 e5 57 56 89 c6 89 d0 89 f7 <f3> aa 89 f0 5e 5f 5d 2e e9 53 41 01 00 cc cc cc 55 89 e5 53 57 56 EAX: 0000006b EBX: 00000015 ECX: 001fefff EDX: 0000006b ESI: fffb9000 EDI: fffba000 EBP: c611fbf0 ESP: c611fbe8 DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068 EFLAGS: 00010287 CR0: 80050033 CR2: fffba000 CR3: 0316e000 CR4: 00040690 Call Trace: poison_element (mm/mempool.c:83 mm/mempool.c:102) mempool_init_node (mm/mempool.c:142 mm/mempool.c:226) mempool_init_noprof (mm/mempool.c:250 (discriminator 1)) ? mempool_alloc_pages (mm/mempool.c:640) bio_integrity_initfn (block/bio-integrity.c:483 (discriminator 8)) ? mempool_alloc_pages (mm/mempool.c:640) do_one_initcall (init/main.c:1283) Christoph found out this is due to the poisoning code not dealing properly with CONFIG_HIGHMEM because only the first page is mapped but then the whole potentially high-order page is accessed. We could give up on HIGHMEM here, but it's straightforward to fix this with a loop that's mapping, poisoning or checking and unmapping individual pages. Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202511111411.9ebfa1ba-lkp@intel.com Analyzed-by: Christoph Hellwig <hch@lst.de> Fixes: bdfedb76f4f5 ("mm, mempool: poison elements backed by slab allocator") Cc: stable@vger.kernel.org Tested-by: kernel test robot <oliver.sang@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20251113-mempool-poison-v1-1-233b3ef984c3@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
11 daysMerge tag 'slab-for-6.18-rc6' of ↵Linus Torvalds1-2/+6
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab fix from Vlastimil Babka: - Fix memory leak of objects from remote NUMA node when bulk freeing to a cache with sheaves (Harry Yoo) * tag 'slab-for-6.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: mm/slub: fix memory leak in free_to_pcs_bulk()
11 daysmm/slub: fix memory leak in free_to_pcs_bulk()Harry Yoo1-2/+6
The commit 989b09b73978 ("slab: skip percpu sheaves for remote object freeing") introduced the remote_objects array in free_to_pcs_bulk() to skip sheaves when objects from a remote node are freed. However, the array is flushed only when: 1) the array becomes full (++remote_nr >= PCS_BATCH_MAX), or 2) slab_free_hook() returns false and size becomes zero. When neither of the conditions is met, objects in the array are leaked. This resulted in a memory leak [1], where 82 GiB of memory was allocated for the maple_node cache. Flush the array after successfully freeing objects to sheaves in the do_free: path. In the meantime, move the snippet if (!size) goto flush_remote; outside the while loop for readability. Let's say all objects in the array are from a remote node: then we acquire s->cpu_sheaves->lock and try to free an object even when size is zero. This doesn't appear to be harmful, but isn't really readable. Reported-by: Tytus Rogalewski <admin@simplepod.ai> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220765 [1] Closes: https://lore.kernel.org/linux-mm/20251107094809.12e9d705b7bf4815783eb184@linux-foundation.org Closes: https://lore.kernel.org/all/aRGDTwbt2EIz2CYn@hyeyoo Fixes: 989b09b73978 ("slab: skip percpu sheaves for remote object freeing") Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20251111125331.12246-1-harry.yoo@oracle.com Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com> Tested-by: Darrick J. Wong <djwong@kernel.org> Tested-by: Tytus Rogalewski <admin@simplepod.ai> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
12 daysshmem: fix tmpfs reconfiguration (remount) when noswap is setMike Yuan1-8/+7
In systemd we're trying to switch the internal credentials setup logic to new mount API [1], and I noticed fsconfig(FSCONFIG_CMD_RECONFIGURE) consistently fails on tmpfs with noswap option. This can be trivially reproduced with the following: ``` int fs_fd = fsopen("tmpfs", 0); fsconfig(fs_fd, FSCONFIG_SET_FLAG, "noswap", NULL, 0); fsconfig(fs_fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0); fsmount(fs_fd, 0, 0); fsconfig(fs_fd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0); <------ EINVAL ``` After some digging the culprit is shmem_reconfigure() rejecting !(ctx->seen & SHMEM_SEEN_NOSWAP) && sbinfo->noswap, which is bogus as ctx->seen serves as a mask for whether certain options are touched at all. On top of that, noswap option doesn't use fsparam_flag_no, hence it's not really possible to "reenable" swap to begin with. Drop the check and redundant SHMEM_SEEN_NOSWAP flag. [1] https://github.com/systemd/systemd/pull/39637 Fixes: 2c6efe9cf2d7 ("shmem: add support to ignore swap") Signed-off-by: Mike Yuan <me@yhndnzj.com> Link: https://patch.msgid.link/20251108190930.440685-1-me@yhndnzj.com Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: stable@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
13 daysmemblock: fix memblock_estimated_nr_free_pages() for soft-reserved memoryAkinobu Mita1-1/+2
memblock_estimated_nr_free_pages() returns the difference between the total size of the "memory" memblock type and the "reserved" memblock type. The "soft-reserved" memory regions are added to the "reserved" memblock type, but not to the "memory" memblock type. Therefore, memblock_estimated_nr_free_pages() may return a smaller value than expected, or if it underflows, an extremely large value. /proc/sys/kernel/threads-max is determined by the value of memblock_estimated_nr_free_pages(). This issue was discovered on machines with CXL memory because kernel.threads-max was either smaller than expected or extremely large for the installed DRAM size. This fixes the issue by replacing memblock_reserved_size() with memblock_reserved_kern_size() that tells how much memory was reserved from the actual RAM. Suggested-by: Mike Rapoport <rppt@kernel.org> Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Link: https://patch.msgid.link/20251111010010.7800-1-akinobu.mita@gmail.com Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
2025-11-09mm/secretmem: fix use-after-free race in fault handlerLance Yang1-1/+1
When a page fault occurs in a secret memory file created with `memfd_secret(2)`, the kernel will allocate a new folio for it, mark the underlying page as not-present in the direct map, and add it to the file mapping. If two tasks cause a fault in the same page concurrently, both could end up allocating a folio and removing the page from the direct map, but only one would succeed in adding the folio to the file mapping. The task that failed undoes the effects of its attempt by (a) freeing the folio again and (b) putting the page back into the direct map. However, by doing these two operations in this order, the page becomes available to the allocator again before it is placed back in the direct mapping. If another task attempts to allocate the page between (a) and (b), and the kernel tries to access it via the direct map, it would result in a supervisor not-present page fault. Fix the ordering to restore the direct map before the folio is freed. Link: https://lkml.kernel.org/r/20251031120955.92116-1-lance.yang@linux.dev Fixes: 1507f51255c9 ("mm: introduce memfd_secret system call to create "secret" memory areas") Signed-off-by: Lance Yang <lance.yang@linux.dev> Reported-by: Google Big Sleep <big-sleep-vuln-reports@google.com> Closes: https://lore.kernel.org/linux-mm/CAEXGt5QeDpiHTu3K9tvjUTPqo+d-=wuCNYPa+6sWKrdQJ-ATdg@mail.gmail.com/ Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-09mm/huge_memory: initialise the tags of the huge zero folioCatalin Marinas1-1/+2
On arm64 with MTE enabled, a page mapped as Normal Tagged (PROT_MTE) in user space will need to have its allocation tags initialised. This is normally done in the arm64 set_pte_at() after checking the memory attributes. Such page is also marked with the PG_mte_tagged flag to avoid subsequent clearing. Since this relies on having a struct page, pte_special() mappings are ignored. Commit d82d09e48219 ("mm/huge_memory: mark PMD mappings of the huge zero folio special") maps the huge zero folio special and the arm64 set_pmd_at() will no longer zero the tags. There is no guarantee that the tags are zero, especially if parts of this huge page have been previously tagged. It's fairly easy to detect this by regularly dropping the caches to force the reallocation of the huge zero folio. Allocate the huge zero folio with the __GFP_ZEROTAGS flag. In addition, do not warn in the arm64 __access_remote_tags() when reading tags from the huge zero page. I bundled the arm64 change in here as well since they are both related to the commit mapping the huge zero folio as special. [catalin.marinas@arm.com: handle arch mte_zero_clear_page_tags() code issuing MTE instructions] Link: https://lkml.kernel.org/r/aQi8dA_QpXM8XqrE@arm.com Link: https://lkml.kernel.org/r/20251031170133.280742-1-catalin.marinas@arm.com Fixes: d82d09e48219 ("mm/huge_memory: mark PMD mappings of the huge zero folio special") Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Tested-by: Beleswar Padhi <b-padhi@ti.com> Cc: Will Deacon <will@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Aishwarya TCV <aishwarya.tcv@arm.com> Cc: David Hildenbrand (Red Hat) <david@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-09mm/damon/sysfs: change next_update_jiffies to a global variableQuanmin Yan1-3/+7
In DAMON's damon_sysfs_repeat_call_fn(), time_before() is used to compare the current jiffies with next_update_jiffies to determine whether to update the sysfs files at this moment. On 32-bit systems, the kernel initializes jiffies to "-5 minutes" to make jiffies wrap bugs appear earlier. However, this causes time_before() in damon_sysfs_repeat_call_fn() to unexpectedly return true during the first 5 minutes after boot on 32-bit systems (see [1] for more explanation, which fixes another jiffies-related issue before). As a result, DAMON does not update sysfs files during that period. There is also an issue unrelated to the system's word size[2]: if the user stops DAMON just after next_update_jiffies is updated and restarts it after 'refresh_ms' or a longer delay, next_update_jiffies will retain an older value, causing time_before() to return false and the update to happen earlier than expected. Fix these issues by making next_update_jiffies a global variable and initializing it each time DAMON is started. Link: https://lkml.kernel.org/r/20251030020746.967174-3-yanquanmin1@huawei.com Link: https://lkml.kernel.org/r/20250822025057.1740854-1-ekffu200098@gmail.com [1] Link: https://lore.kernel.org/all/20251029013038.66625-1-sj@kernel.org/ [2] Fixes: d809a7c64ba8 ("mm/damon/sysfs: implement refresh_ms file internal work") Suggested-by: SeongJae Park <sj@kernel.org> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Quanmin Yan <yanquanmin1@huawei.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: ze zuo <zuoze1@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-09mm/damon/stat: change last_refresh_jiffies to a global variableQuanmin Yan1-3/+6
Patch series "mm/damon: fixes for the jiffies-related issues", v2. On 32-bit systems, the kernel initializes jiffies to "-5 minutes" to make jiffies wrap bugs appear earlier. However, this may cause the time_before() series of functions to return unexpected values, resulting in DAMON not functioning as intended. Meanwhile, similar issues exist in some specific user operation scenarios. This patchset addresses these issues. The first patch is about the DAMON_STAT module, and the second patch is about the core layer's sysfs. This patch (of 2): In DAMON_STAT's damon_stat_damon_call_fn(), time_before_eq() is used to avoid unnecessarily frequent stat update. On 32-bit systems, the kernel initializes jiffies to "-5 minutes" to make jiffies wrap bugs appear earlier. However, this causes time_before_eq() in DAMON_STAT to unexpectedly return true during the first 5 minutes after boot on 32-bit systems (see [1] for more explanation, which fixes another jiffies-related issue before). As a result, DAMON_STAT does not update any monitoring results during that period, which becomes more confusing when DAMON_STAT_ENABLED_DEFAULT is enabled. There is also an issue unrelated to the system's word size[2]: if the user stops DAMON_STAT just after last_refresh_jiffies is updated and restarts it after 5 seconds or a longer delay, last_refresh_jiffies will retain an older value, causing time_before_eq() to return false and the update to happen earlier than expected. Fix these issues by making last_refresh_jiffies a global variable and initializing it each time DAMON_STAT is started. Link: https://lkml.kernel.org/r/20251030020746.967174-2-yanquanmin1@huawei.com Link: https://lkml.kernel.org/r/20250822025057.1740854-1-ekffu200098@gmail.com [1] Link: https://lore.kernel.org/all/20251028143250.50144-1-sj@kernel.org/ [2] Fixes: fabdd1e911da ("mm/damon/stat: calculate and expose estimated memory bandwidth") Signed-off-by: Quanmin Yan <yanquanmin1@huawei.com> Suggested-by: SeongJae Park <sj@kernel.org> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: ze zuo <zuoze1@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-09codetag: debug: handle existing CODETAG_EMPTY in mark_objexts_empty for ↵Hao Ge1-1/+5
slabobj_ext When alloc_slab_obj_exts() fails and then later succeeds in allocating a slab extension vector, it calls handle_failed_objexts_alloc() to mark all objects in the vector as empty. As a result all objects in this slab (slabA) will have their extensions set to CODETAG_EMPTY. Later on if this slabA is used to allocate a slabobj_ext vector for another slab (slabB), we end up with the slabB->obj_exts pointing to a slabobj_ext vector that itself has a non-NULL slabobj_ext equal to CODETAG_EMPTY. When slabB gets freed, free_slab_obj_exts() is called to free slabB->obj_exts vector. free_slab_obj_exts() calls mark_objexts_empty(slabB->obj_exts) which will generate a warning because it expects slabobj_ext vectors to have a NULL obj_ext, not CODETAG_EMPTY. Modify mark_objexts_empty() to skip the warning and setting the obj_ext value if it's already set to CODETAG_EMPTY. To quickly detect this WARN, I modified the code from WARN_ON(slab_exts[offs].ref.ct) to BUG_ON(slab_exts[offs].ref.ct == 1); We then obtained this message: [21630.898561] ------------[ cut here ]------------ [21630.898596] kernel BUG at mm/slub.c:2050! [21630.898611] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP [21630.900372] Modules linked in: squashfs isofs vfio_iommu_type1 vhost_vsock vfio vhost_net vmw_vsock_virtio_transport_common vhost tap vhost_iotlb iommufd vsock binfmt_misc nfsv3 nfs_acl nfs lockd grace netfs tls rds dns_resolver tun brd overlay ntfs3 exfat btrfs blake2b_generic xor xor_neon raid6_pq loop sctp ip6_udp_tunnel udp_tunnel nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables rfkill ip_set sunrpc vfat fat joydev sg sch_fq_codel nfnetlink virtio_gpu sr_mod cdrom drm_client_lib virtio_dma_buf drm_shmem_helper drm_kms_helper drm ghash_ce backlight virtio_net virtio_blk virtio_scsi net_failover virtio_console failover virtio_mmio dm_mirror dm_region_hash dm_log dm_multipath dm_mod fuse i2c_dev virtio_pci virtio_pci_legacy_dev virtio_pci_modern_dev virtio virtio_ring autofs4 aes_neon_bs aes_ce_blk [last unloaded: hwpoison_inject] [21630.909177] CPU: 3 UID: 0 PID: 3787 Comm: kylin-process-m Kdump: loaded Tainted: G        W           6.18.0-rc1+ #74 PREEMPT(voluntary) [21630.910495] Tainted: [W]=WARN [21630.910867] Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022 [21630.911625] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [21630.912392] pc : __free_slab+0x228/0x250 [21630.912868] lr : __free_slab+0x18c/0x250[21630.913334] sp : ffff8000a02f73e0 [21630.913830] x29: ffff8000a02f73e0 x28: fffffdffc43fc800 x27: ffff0000c0011c40 [21630.914677] x26: ffff0000c000cac0 x25: ffff00010fe5e5f0 x24: ffff000102199b40 [21630.915469] x23: 0000000000000003 x22: 0000000000000003 x21: ffff0000c0011c40 [21630.916259] x20: fffffdffc4086600 x19: fffffdffc43fc800 x18: 0000000000000000 [21630.917048] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 [21630.917837] x14: 0000000000000000 x13: 0000000000000000 x12: ffff70001405ee66 [21630.918640] x11: 1ffff0001405ee65 x10: ffff70001405ee65 x9 : ffff800080a295dc [21630.919442] x8 : ffff8000a02f7330 x7 : 0000000000000000 x6 : 0000000000003000 [21630.920232] x5 : 0000000024924925 x4 : 0000000000000001 x3 : 0000000000000007 [21630.921021] x2 : 0000000000001b40 x1 : 000000000000001f x0 : 0000000000000001 [21630.921810] Call trace: [21630.922130]  __free_slab+0x228/0x250 (P) [21630.922669]  free_slab+0x38/0x118 [21630.923079]  free_to_partial_list+0x1d4/0x340 [21630.923591]  __slab_free+0x24c/0x348 [21630.924024]  ___cache_free+0xf0/0x110 [21630.924468]  qlist_free_all+0x78/0x130 [21630.924922]  kasan_quarantine_reduce+0x114/0x148 [21630.925525]  __kasan_slab_alloc+0x7c/0xb0 [21630.926006]  kmem_cache_alloc_noprof+0x164/0x5c8 [21630.926699]  __alloc_object+0x44/0x1f8 [21630.927153]  __create_object+0x34/0xc8 [21630.927604]  kmemleak_alloc+0xb8/0xd8 [21630.928052]  kmem_cache_alloc_noprof+0x368/0x5c8 [21630.928606]  getname_flags.part.0+0xa4/0x610 [21630.929112]  getname_flags+0x80/0xd8 [21630.929557]  vfs_fstatat+0xc8/0xe0 [21630.929975]  __do_sys_newfstatat+0xa0/0x100 [21630.930469]  __arm64_sys_newfstatat+0x90/0xd8 [21630.931046]  invoke_syscall+0xd4/0x258 [21630.931685]  el0_svc_common.constprop.0+0xb4/0x240 [21630.932467]  do_el0_svc+0x48/0x68 [21630.932972]  el0_svc+0x40/0xe0 [21630.933472]  el0t_64_sync_handler+0xa0/0xe8 [21630.934151]  el0t_64_sync+0x1ac/0x1b0 [21630.934923] Code: aa1803e0 97ffef2b a9446bf9 17ffff9c (d4210000) [21630.936461] SMP: stopping secondary CPUs [21630.939550] Starting crashdump kernel... [21630.940108] Bye! Link: https://lkml.kernel.org/r/20251029014317.1533488-1-hao.ge@linux.dev Fixes: 09c46563ff6d ("codetag: debug: introduce OBJEXTS_ALLOC_FAIL to mark failed slab_ext allocations") Signed-off-by: Hao Ge <gehao@kylinos.cn> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Christoph Lameter (Ampere) <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Cc: gehao <gehao@kylinos.cn> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-09mm/mremap: honour writable bit in mremap pte batchingDev Jain1-1/+1
Currently mremap folio pte batch ignores the writable bit during figuring out a set of similar ptes mapping the same folio. Suppose that the first pte of the batch is writable while the others are not - set_ptes will end up setting the writable bit on the other ptes, which is a violation of mremap semantics. Therefore, use FPB_RESPECT_WRITE to check the writable bit while determining the pte batch. Link: https://lkml.kernel.org/r/20251028063952.90313-1-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Fixes: f822a9a81a31 ("mm: optimize mremap() by PTE batching") Reported-by: David Hildenbrand <david@redhat.com> Debugged-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Barry Song <baohua@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> [6.17+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-09mm/mm_init: fix hash table order logging in alloc_large_system_hash()Isaac J. Manjarres1-1/+1
When emitting the order of the allocation for a hash table, alloc_large_system_hash() unconditionally subtracts PAGE_SHIFT from log base 2 of the allocation size. This is not correct if the allocation size is smaller than a page, and yields a negative value for the order as seen below: TCP established hash table entries: 32 (order: -4, 256 bytes, linear) TCP bind hash table entries: 32 (order: -2, 1024 bytes, linear) Use get_order() to compute the order when emitting the hash table information to correctly handle cases where the allocation size is smaller than a page: TCP established hash table entries: 32 (order: 0, 256 bytes, linear) TCP bind hash table entries: 32 (order: 0, 1024 bytes, linear) Link: https://lkml.kernel.org/r/20251028191020.413002-1-isaacmanjarres@google.com Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Isaac J. Manjarres <isaacmanjarres@google.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-09mm/truncate: unmap large folio on split failureKiryl Shutsemau1-6/+29
Accesses within VMA, but beyond i_size rounded up to PAGE_SIZE are supposed to generate SIGBUS. This behavior might not be respected on truncation. During truncation, the kernel splits a large folio in order to reclaim memory. As a side effect, it unmaps the folio and destroys PMD mappings of the folio. The folio will be refaulted as PTEs and SIGBUS semantics are preserved. However, if the split fails, PMD mappings are preserved and the user will not receive SIGBUS on any accesses within the PMD. Unmap the folio on split failure. It will lead to refault as PTEs and preserve SIGBUS semantics. Make an exception for shmem/tmpfs that for long time intentionally mapped with PMDs across i_size. Link: https://lkml.kernel.org/r/20251027115636.82382-3-kirill@shutemov.name Fixes: b9a8a4195c7d ("truncate,shmem: Handle truncates that split large folios") Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Christian Brauner <brauner@kernel.org> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-09mm/memory: do not populate page table entries beyond i_sizeKiryl Shutsemau2-9/+39
Patch series "Fix SIGBUS semantics with large folios", v3. Accessing memory within a VMA, but beyond i_size rounded up to the next page size, is supposed to generate SIGBUS. Darrick reported[1] an xfstests regression in v6.18-rc1. generic/749 failed due to missing SIGBUS. This was caused by my recent changes that try to fault in the whole folio where possible: 19773df031bc ("mm/fault: try to map the entire file folio in finish_fault()") 357b92761d94 ("mm/filemap: map entire large folio faultaround") These changes did not consider i_size when setting up PTEs, leading to xfstest breakage. However, the problem has been present in the kernel for a long time - since huge tmpfs was introduced in 2016. The kernel happily maps PMD-sized folios as PMD without checking i_size. And huge=always tmpfs allocates PMD-size folios on any writes. I considered this corner case when I implemented a large tmpfs, and my conclusion was that no one in their right mind should rely on receiving a SIGBUS signal when accessing beyond i_size. I cannot imagine how it could be useful for the workload. But apparently filesystem folks care a lot about preserving strict SIGBUS semantics. Generic/749 was introduced last year with reference to POSIX, but no real workloads were mentioned. It also acknowledged the tmpfs deviation from the test case. POSIX indeed says[3]: References within the address range starting at pa and continuing for len bytes to whole pages following the end of an object shall result in delivery of a SIGBUS signal. The patchset fixes the regression introduced by recent changes as well as more subtle SIGBUS breakage due to split failure on truncation. This patch (of 2): Accesses within VMA, but beyond i_size rounded up to PAGE_SIZE are supposed to generate SIGBUS. Recent changes attempted to fault in full folio where possible. They did not respect i_size, which led to populating PTEs beyond i_size and breaking SIGBUS semantics. Darrick reported generic/749 breakage because of this. However, the problem existed before the recent changes. With huge=always tmpfs, any write to a file leads to PMD-size allocation. Following the fault-in of the folio will install PMD mapping regardless of i_size. Fix filemap_map_pages() and finish_fault() to not install: - PTEs beyond i_size; - PMD mappings across i_size; Make an exception for shmem/tmpfs that for long time intentionally mapped with PMDs across i_size. Link: https://lkml.kernel.org/r/20251027115636.82382-1-kirill@shutemov.name Link: https://lkml.kernel.org/r/20251027115636.82382-2-kirill@shutemov.name Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Fixes: 6795801366da ("xfs: Support large folios") Reported-by: "Darrick J. Wong" <djwong@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-09mm/huge_memory: preserve PG_has_hwpoisoned if a folio is split to >0 orderZi Yan1-3/+20
folio split clears PG_has_hwpoisoned, but the flag should be preserved in after-split folios containing pages with PG_hwpoisoned flag if the folio is split to >0 order folios. Scan all pages in a to-be-split folio to determine which after-split folios need the flag. An alternatives is to change PG_has_hwpoisoned to PG_maybe_hwpoisoned to avoid the scan and set it on all after-split folios, but resulting false positive has undesirable negative impact. To remove false positive, caller of folio_test_has_hwpoisoned() and folio_contain_hwpoisoned_page() needs to do the scan. That might be causing a hassle for current and future callers and more costly than doing the scan in the split code. More details are discussed in [1]. This issue can be exposed via: 1. splitting a has_hwpoisoned folio to >0 order from debugfs interface; 2. truncating part of a has_hwpoisoned folio in truncate_inode_partial_folio(). And later accesses to a hwpoisoned page could be possible due to the missing has_hwpoisoned folio flag. This will lead to MCE errors. Link: https://lore.kernel.org/all/CAHbLzkoOZm0PXxE9qwtF4gKR=cpRXrSrJ9V9Pm2DJexs985q4g@mail.gmail.com/ [1] Link: https://lkml.kernel.org/r/20251023030521.473097-1-ziy@nvidia.com Fixes: c010d47f107f ("mm: thp: split huge page to any lower order pages") Signed-off-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Yang Shi <yang@os.amperecomputing.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Cc: Pankaj Raghav <kernel@pankajraghav.com> Cc: Barry Song <baohua@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Luis Chamberalin <mcgrof@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-09ksm: use range-walk function to jump over holes in scan_get_next_rmap_itemPedro Demarchi Gomes1-9/+104
Currently, scan_get_next_rmap_item() walks every page address in a VMA to locate mergeable pages. This becomes highly inefficient when scanning large virtual memory areas that contain mostly unmapped regions, causing ksmd to use large amount of cpu without deduplicating much pages. This patch replaces the per-address lookup with a range walk using walk_page_range(). The range walker allows KSM to skip over entire unmapped holes in a VMA, avoiding unnecessary lookups. This problem was previously discussed in [1]. Consider the following test program which creates a 32 TiB mapping in the virtual address space but only populates a single page: #include <unistd.h> #include <stdio.h> #include <sys/mman.h> /* 32 TiB */ const size_t size = 32ul * 1024 * 1024 * 1024 * 1024; int main() { char *area = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_NORESERVE | MAP_PRIVATE | MAP_ANON, -1, 0); if (area == MAP_FAILED) { perror("mmap() failed\n"); return -1; } /* Populate a single page such that we get an anon_vma. */ *area = 0; /* Enable KSM. */ madvise(area, size, MADV_MERGEABLE); pause(); return 0; } $ ./ksm-sparse & $ echo 1 > /sys/kernel/mm/ksm/run Without this patch ksmd uses 100% of the cpu for a long time (more then 1 hour in my test machine) scanning all the 32 TiB virtual address space that contain only one mapped page. This makes ksmd essentially deadlocked not able to deduplicate anything of value. With this patch ksmd walks only the one mapped page and skips the rest of the 32 TiB virtual address space, making the scan fast using little cpu. Link: https://lkml.kernel.org/r/20251023035841.41406-1-pedrodemargomes@gmail.com Link: https://lkml.kernel.org/r/20251022153059.22763-1-pedrodemargomes@gmail.com Link: https://lore.kernel.org/linux-mm/423de7a3-1c62-4e72-8e79-19a6413e420c@redhat.com/ [1] Fixes: 31dbd01f3143 ("ksm: Kernel SamePage Merging") Signed-off-by: Pedro Demarchi Gomes <pedrodemargomes@gmail.com> Co-developed-by: David Hildenbrand <david@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: craftfever <craftfever@airmail.cc> Closes: https://lkml.kernel.org/r/020cf8de6e773bb78ba7614ef250129f11a63781@murena.io Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: xu xin <xu.xin16@zte.com.cn> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-09mm/kmsan: fix kmsan kmalloc hook when no stack depots are allocated yetAleksei Nikiforov3-6/+5
If no stack depot is allocated yet, due to masking out __GFP_RECLAIM flags kmsan called from kmalloc cannot allocate stack depot. kmsan fails to record origin and report issues. This may result in KMSAN failing to report issues. Reusing flags from kmalloc without modifying them should be safe for kmsan. For example, such chain of calls is possible: test_uninit_kmalloc -> kmalloc -> __kmalloc_cache_noprof -> slab_alloc_node -> slab_post_alloc_hook -> kmsan_slab_alloc -> kmsan_internal_poison_memory. Only when it is called in a context without flags present should __GFP_RECLAIM flags be masked. With this change all kmsan tests start working reliably. Eric reported: : Yes, KMSAN seems to be at least partially broken currently. Besides the : fact that the kmsan KUnit test is currently failing (which I reported at : https://lore.kernel.org/r/20250911175145.GA1376@sol), I've confirmed that : the poly1305 KUnit test causes a KMSAN warning with Aleksei's patch : applied but does not cause a warning without it. The warning did get : reached via syzbot somehow : (https://lore.kernel.org/r/751b3d80293a6f599bb07770afcef24f623c7da0.1761026343.git.xiaopei01@kylinos.cn/), : so KMSAN must still work in some cases. But it didn't work for me. Link: https://lkml.kernel.org/r/20250930115600.709776-2-aleksei.nikiforov@linux.ibm.com Link: https://lkml.kernel.org/r/20251022030213.GA35717@sol Fixes: 97769a53f117 ("mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation") Signed-off-by: Aleksei Nikiforov <aleksei.nikiforov@linux.ibm.com> Reviewed-by: Alexander Potapenko <glider@google.com> Tested-by: Eric Biggers <ebiggers@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Ilya Leoshkevich <iii@linux.ibm.com> Cc: Marco Elver <elver@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-09mm/shmem: fix THP allocation and fallback loopKairui Song1-3/+6
The order check and fallback loop is updating the index value on every loop. This will cause the index to be wrongly aligned by a larger value while the loop shrinks the order. This may result in inserting and returning a folio of the wrong index and cause data corruption with some userspace workloads [1]. [kasong@tencent.com: introduce a temporary variable to improve code] Link: https://lkml.kernel.org/r/20251023065913.36925-1-ryncsn@gmail.com Link: https://lore.kernel.org/linux-mm/CAMgjq7DqgAmj25nDUwwu1U2cSGSn8n4-Hqpgottedy0S6YYeUw@mail.gmail.com/ [1] Link: https://lkml.kernel.org/r/20251022105719.18321-1-ryncsn@gmail.com Link: https://lore.kernel.org/linux-mm/CAMgjq7DqgAmj25nDUwwu1U2cSGSn8n4-Hqpgottedy0S6YYeUw@mail.gmail.com/ [1] Fixes: e7a2ab7b3bb5 ("mm: shmem: add mTHP support for anonymous shmem") Closes: https://lore.kernel.org/linux-mm/CAMgjq7DqgAmj25nDUwwu1U2cSGSn8n4-Hqpgottedy0S6YYeUw@mail.gmail.com/ Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-09mm/huge_memory: do not change split_huge_page*() target order silentlyZi Yan2-10/+5
Page cache folios from a file system that support large block size (LBS) can have minimal folio order greater than 0, thus a high order folio might not be able to be split down to order-0. Commit e220917fa507 ("mm: split a folio in minimum folio order chunks") bumps the target order of split_huge_page*() to the minimum allowed order when splitting a LBS folio. This causes confusion for some split_huge_page*() callers like memory failure handling code, since they expect after-split folios all have order-0 when split succeeds but in reality get min_order_for_split() order folios and give warnings. Fix it by failing a split if the folio cannot be split to the target order. Rename try_folio_split() to try_folio_split_to_order() to reflect the added new_order parameter. Remove its unused list parameter. [The test poisons LBS folios, which cannot be split to order-0 folios, and also tries to poison all memory. The non split LBS folios take more memory than the test anticipated, leading to OOM. The patch fixed the kernel warning and the test needs some change to avoid OOM.] Link: https://lkml.kernel.org/r/20251017013630.139907-1-ziy@nvidia.com Fixes: e220917fa507 ("mm: split a folio in minimum folio order chunks") Signed-off-by: Zi Yan <ziy@nvidia.com> Reported-by: syzbot+e6367ea2fdab6ed46056@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/68d2c943.a70a0220.1b52b.02b3.GAE@google.com/ Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Pankaj Raghav <p.raghav@samsung.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-07Merge tag 'slab-for-6.18-rc5' of ↵Linus Torvalds1-1/+5
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab fix from Vlastimil Babka: - Fix for potential infinite loop in kmalloc_nolock() when debugging is enabled for the cache (Vlastimil Babka) * tag 'slab-for-6.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: slab: prevent infinite loop in kmalloc_nolock() with debugging
2025-11-06slab: prevent infinite loop in kmalloc_nolock() with debuggingVlastimil Babka1-1/+5
In review of a followup work, Harry noticed a potential infinite loop. Upon closed inspection, it already exists for kmalloc_nolock() on a cache with debugging enabled, since commit af92793e52c3 ("slab: Introduce kmalloc_nolock() and kfree_nolock().") When alloc_single_from_new_slab() fails to trylock node list_lock, we keep retrying to get partial slab or allocate a new slab. If we indeed interrupted somebody holding the list_lock, the trylock fill fail deterministically and we end up allocating and defer-freeing slabs indefinitely with no progress. To fix it, fail the allocation if spinning is not allowed. This is acceptable in the restricted context of kmalloc_nolock(), especially with debugging enabled. Reported-by: Harry Yoo <harry.yoo@oracle.com> Closes: https://lore.kernel.org/all/aQLqZjjq1SPD3Fml@hyeyoo/ Fixes: af92793e52c3 ("slab: Introduce kmalloc_nolock() and kfree_nolock().") Acked-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20251103-fix-nolock-loop-v1-1-6e2b3e82b9da@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-10-24Merge tag 'slab-for-6.18-rc3' of ↵Linus Torvalds1-10/+21
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab fixes from Vlastimil Babka: - Two fixes for race conditions in obj_exts allocation (Hao Ge) - Fix for slab accounting imbalance due to deferred slab decativation (Vlastimil Babka) * tag 'slab-for-6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: slab: Fix obj_ext mistakenly considered NULL due to race condition slab: fix slab accounting imbalance due to defer_deactivate_slab() slab: Avoid race on slab->obj_exts in alloc_slab_obj_exts
2025-10-24slab: Fix obj_ext mistakenly considered NULL due to race conditionHao Ge1-5/+11
If two competing threads enter alloc_slab_obj_exts(), and the one that allocates the vector wins the cmpxchg(), the other thread that failed allocation mistakenly assumes that slab->obj_exts is still empty due to its own allocation failure. This will then trigger warnings with CONFIG_MEM_ALLOC_PROFILING_DEBUG checks in the subsequent free path. Therefore, let's check the result of cmpxchg() to see if marking the allocation as failed was successful. If it wasn't, check whether the winning side has succeeded its allocation (it might have been also marking it as failed) and if yes, return success. Suggested-by: Harry Yoo <harry.yoo@oracle.com> Fixes: f7381b911640 ("slab: mark slab->obj_exts allocation failures unconditionally") Cc: <stable@vger.kernel.org> Signed-off-by: Hao Ge <gehao@kylinos.cn> Link: https://patch.msgid.link/20251023143313.1327968-1-hao.ge@linux.dev Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-10-23slab: fix slab accounting imbalance due to defer_deactivate_slab()Vlastimil Babka1-3/+5
Since commit af92793e52c3 ("slab: Introduce kmalloc_nolock() and kfree_nolock().") there's a possibility in alloc_single_from_new_slab() that we discard the newly allocated slab if we can't spin and we fail to trylock. As a result we don't perform inc_slabs_node() later in the function. Instead we perform a deferred deactivate_slab() which can either put the unacounted slab on partial list, or discard it immediately while performing dec_slabs_node(). Either way will cause an accounting imbalance. Fix this by not marking the slab as frozen, and using free_slab() instead of deactivate_slab() for non-frozen slabs in free_deferred_objects(). For CONFIG_SLUB_TINY, that's the only possible case. By not using discard_slab() we avoid dec_slabs_node(). Fixes: af92793e52c3 ("slab: Introduce kmalloc_nolock() and kfree_nolock().") Link: https://patch.msgid.link/20251023-fix-slab-accounting-v2-1-0e62d50986ea@suse.cz Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-10-22Merge tag 'mm-hotfixes-stable-2025-10-22-12-43' of ↵Linus Torvalds7-18/+25
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull hotfixes from Andrew Morton: "17 hotfixes. 12 are cc:stable and 14 are for MM. There's a two-patch DAMON series from SeongJae Park which addresses a missed check and possible memory leak. Apart from that it's all singletons - please see the changelogs for details" * tag 'mm-hotfixes-stable-2025-10-22-12-43' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: csky: abiv2: adapt to new folio flags field mm/damon/core: use damos_commit_quota_goal() for new goal commit mm/damon/core: fix potential memory leak by cleaning ops_filter in damon_destroy_scheme hugetlbfs: move lock assertions after early returns in huge_pmd_unshare() vmw_balloon: indicate success when effectively deflating during migration mm/damon/core: fix list_add_tail() call on damon_call() mm/mremap: correctly account old mapping after MREMAP_DONTUNMAP remap mm: prevent poison consumption when splitting THP ocfs2: clear extent cache after moving/defragmenting extents mm: don't spin in add_stack_record when gfp flags don't allow dma-debug: don't report false positives with DMA_BOUNCE_UNALIGNED_KMALLOC mm/damon/sysfs: dealloc commit test ctx always mm/damon/sysfs: catch commit test ctx alloc failure hung_task: fix warnings caused by unaligned lock pointers
2025-10-21mm/damon/core: use damos_commit_quota_goal() for new goal commitSeongJae Park1-1/+1
When damos_commit_quota_goals() is called for adding new DAMOS quota goals of DAMOS_QUOTA_USER_INPUT metric, current_value fields of the new goals should be also set as requested. However, damos_commit_quota_goals() is not updating the field for the case, since it is setting only metrics and target values using damos_new_quota_goal(), and metric-optional union fields using damos_commit_quota_goal_union(). As a result, users could see the first current_value parameter that committed online with a new quota goal is ignored. Users are assumed to commit the current_value for DAMOS_QUOTA_USER_INPUT quota goals, since it is being used as a feedback. Hence the real impact would be subtle. That said, this is obviously not intended behavior. Fix the issue by using damos_commit_quota_goal() which sets all quota goal parameters, instead of damos_commit_quota_goal_union(), which sets only the union fields. Link: https://lkml.kernel.org/r/20251014001846.279282-1-sj@kernel.org Fixes: 1aef9df0ee90 ("mm/damon/core: commit damos_quota_goal->nid") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> [6.16+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-10-21mm/damon/core: fix potential memory leak by cleaning ops_filter in ↵Enze Li1-0/+3
damon_destroy_scheme Currently, damon_destroy_scheme() only cleans up the filter list but leaves ops_filter untouched, which could lead to memory leaks when a scheme is destroyed. This patch ensures both filter and ops_filter are properly freed in damon_destroy_scheme(), preventing potential memory leaks. Link: https://lkml.kernel.org/r/20251014084225.313313-1-lienze@kylinos.cn Fixes: ab82e57981d0 ("mm/damon/core: introduce damos->ops_filters") Signed-off-by: Enze Li <lienze@kylinos.cn> Reviewed-by: SeongJae Park <sj@kernel.org> Tested-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-10-21hugetlbfs: move lock assertions after early returns in huge_pmd_unshare()Deepanshu Kartikey1-3/+2
When hugetlb_vmdelete_list() processes VMAs during truncate operations, it may encounter VMAs where huge_pmd_unshare() is called without the required shareable lock. This triggers an assertion failure in hugetlb_vma_assert_locked(). The previous fix in commit dd83609b8898 ("hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list") skipped entire VMAs without shareable locks to avoid the assertion. However, this prevented pages from being unmapped and freed, causing a regression in fallocate(PUNCH_HOLE) operations where pages were not freed immediately, as reported by Mark Brown. Instead of checking locks in the caller or skipping VMAs, move the lock assertions in huge_pmd_unshare() to after the early return checks. The assertions are only needed when actual PMD unsharing work will be performed. If the function returns early because sz != PMD_SIZE or the PMD is not shared, no locks are required and assertions should not fire. This approach reverts the VMA skipping logic from commit dd83609b8898 ("hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list") while moving the assertions to avoid the assertion failure, keeping all the logic within huge_pmd_unshare() itself and allowing page unmapping and freeing to proceed for all VMAs. Link: https://lkml.kernel.org/r/20251014113344.21194-1-kartikey406@gmail.com Fixes: dd83609b8898 ("hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list") Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com> Reported-by: <syzbot+f26d7c75c26ec19790e7@syzkaller.appspotmail.com> Reported-by: Mark Brown <broonie@kernel.org> Closes: https://syzkaller.appspot.com/bug?extid=f26d7c75c26ec19790e7 Suggested-by: David Hildenbrand <david@redhat.com> Suggested-by: Oscar Salvador <osalvador@suse.de> Tested-by: <syzbot+f26d7c75c26ec19790e7@syzkaller.appspotmail.com> Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-10-21mm/damon/core: fix list_add_tail() call on damon_call()SeongJae Park1-1/+1
Each damon_ctx maintains callback requests using a linked list (damon_ctx->call_controls). When a new callback request is received via damon_call(), the new request should be added to the list. However, the function is making a mistake at list_add_tail() invocation: putting the new item to add and the list head to add it before, in the opposite order. Because of the linked list manipulation implementation, the new request can still be reached from the context's list head. But the list items that were added before the new request are dropped from the list. As a result, the callbacks are unexpectedly not invocated. Worse yet, if the dropped callback requests were dynamically allocated, the memory is leaked. Actually DAMON sysfs interface is using a dynamically allocated repeat-mode callback request for automatic essential stats update. And because the online DAMON parameters commit is using a non-repeat-mode callback request, the issue can easily be reproduced, like below. # damo start --damos_action stat --refresh_stat 1s # damo tune --damos_action stat --refresh_stat 1s The first command dynamically allocates the repeat-mode callback request for automatic essential stat update. Users can see the essential stats are automatically updated for every second, using the sysfs interface. The second command calls damon_commit() with a new callback request that was made for the commit. As a result, the previously added repeat-mode callback request is dropped from the list. The automatic stats refresh stops working, and the memory for the repeat-mode callback request is leaked. It can be confirmed using kmemleak. Fix the mistake on the list_add_tail() call. Link: https://lkml.kernel.org/r/20251014205939.1206-1-sj@kernel.org Fixes: 004ded6bee11 ("mm/damon: accept parallel damon_call() requests") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> [6.17+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-10-21mm/mremap: correctly account old mapping after MREMAP_DONTUNMAP remapLorenzo Stoakes1-9/+6
Commit b714ccb02a76 ("mm/mremap: complete refactor of move_vma()") mistakenly introduced a new behaviour - clearing the VM_ACCOUNT flag of the old mapping when a mapping is mremap()'d with the MREMAP_DONTUNMAP flag set. While we always clear the VM_LOCKED and VM_LOCKONFAULT flags for the old mapping (the page tables have been moved, so there is no data that could possibly be locked in memory), there is no reason to touch any other VMA flags. This is because after the move the old mapping is in a state as if it were freshly mapped. This implies that the attributes of the mapping ought to remain the same, including whether or not the mapping is accounted. Link: https://lkml.kernel.org/r/20251013165836.273113-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Fixes: b714ccb02a76 ("mm/mremap: complete refactor of move_vma()") Reviewed-by: Pedro Falcato <pfalcato@suse.de> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-10-21slab: Avoid race on slab->obj_exts in alloc_slab_obj_extsHao Ge1-3/+6
If two competing threads enter alloc_slab_obj_exts() and one of them fails to allocate the object extension vector, it might override the valid slab->obj_exts allocated by the other thread with OBJEXTS_ALLOC_FAIL. This will cause the thread that lost this race and expects a valid pointer to dereference a NULL pointer later on. Update slab->obj_exts atomically using cmpxchg() to avoid slab->obj_exts overrides by racing threads. Thanks for Vlastimil and Suren's help with debugging. Fixes: f7381b911640 ("slab: mark slab->obj_exts allocation failures unconditionally") Cc: <stable@vger.kernel.org> Suggested-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Hao Ge <gehao@kylinos.cn> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Link: https://patch.msgid.link/20251021010353.1187193-1-hao.ge@linux.dev Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-10-16slab: reset slab->obj_ext when freeing and it is OBJEXTS_ALLOC_FAILHao Ge1-1/+8
If obj_exts allocation failed, slab->obj_exts is set to OBJEXTS_ALLOC_FAIL, But we do not clear it when freeing the slab. Since OBJEXTS_ALLOC_FAIL and MEMCG_DATA_OBJEXTS currently share the same bit position, during the release of the associated folio, a VM_BUG_ON_FOLIO() check in folio_memcg_kmem() is triggered because the OBJEXTS_ALLOC_FAIL flag was not cleared, causing it to be interpreted as a kmem folio (non-slab) with MEMCG_OBJEXTS_DATA flag set, which is invalid because MEMCG_OBJEXTS_DATA is supposed to be set only on slabs. Another problem that predates sharing the OBJEXTS_ALLOC_FAIL and MEMCG_DATA_OBJEXTS bits is that on configurations with is_check_pages_enabled(), the non-cleared bit in page->memcg_data will trigger a free_page_is_bad() failure "page still charged to cgroup" When freeing a slab, we clear slab->obj_exts if the obj_ext array has been successfully allocated. So let's clear it also when the allocation has failed. Fixes: 09c46563ff6d ("codetag: debug: introduce OBJEXTS_ALLOC_FAIL to mark failed slab_ext allocations") Fixes: 7612833192d5 ("slab: Reuse first bit for OBJEXTS_ALLOC_FAIL") Link: https://lore.kernel.org/all/20251015141642.700170-1-hao.ge@linux.dev/ Cc: <stable@vger.kernel.org> Signed-off-by: Hao Ge <gehao@kylinos.cn> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-10-15mm: prevent poison consumption when splitting THPQiuxu Zhuo2-1/+5
When performing memory error injection on a THP (Transparent Huge Page) mapped to userspace on an x86 server, the kernel panics with the following trace. The expected behavior is to terminate the affected process instead of panicking the kernel, as the x86 Machine Check code can recover from an in-userspace #MC. mce: [Hardware Error]: CPU 0: Machine Check Exception: f Bank 3: bd80000000070134 mce: [Hardware Error]: RIP 10:<ffffffff8372f8bc> {memchr_inv+0x4c/0xf0} mce: [Hardware Error]: TSC afff7bbff88a ADDR 1d301b000 MISC 80 PPIN 1e741e77539027db mce: [Hardware Error]: PROCESSOR 0:d06d0 TIME 1758093249 SOCKET 0 APIC 0 microcode 80000320 mce: [Hardware Error]: Run the above through 'mcelog --ascii' mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel Kernel panic - not syncing: Fatal local machine check The root cause of this panic is that handling a memory failure triggered by an in-userspace #MC necessitates splitting the THP. The splitting process employs a mechanism, implemented in try_to_map_unused_to_zeropage(), which reads the pages in the THP to identify zero-filled pages. However, reading the pages in the THP results in a second in-kernel #MC, occurring before the initial memory_failure() completes, ultimately leading to a kernel panic. See the kernel panic call trace on the two #MCs. First Machine Check occurs // [1] memory_failure() // [2] try_to_split_thp_page() split_huge_page() split_huge_page_to_list_to_order() __folio_split() // [3] remap_page() remove_migration_ptes() remove_migration_pte() try_to_map_unused_to_zeropage() // [4] memchr_inv() // [5] Second Machine Check occurs // [6] Kernel panic [1] Triggered by accessing a hardware-poisoned THP in userspace, which is typically recoverable by terminating the affected process. [2] Call folio_set_has_hwpoisoned() before try_to_split_thp_page(). [3] Pass the RMP_USE_SHARED_ZEROPAGE remap flag to remap_page(). [4] Try to map the unused THP to zeropage. [5] Re-access pages in the hw-poisoned THP in the kernel. [6] Triggered in-kernel, leading to a panic kernel. In Step[2], memory_failure() sets the poisoned flag on the page in the THP by TestSetPageHWPoison() before calling try_to_split_thp_page(). As suggested by David Hildenbrand, fix this panic by not accessing to the poisoned page in the THP during zeropage identification, while continuing to scan unaffected pages in the THP for possible zeropage mapping. This prevents a second in-kernel #MC that would cause kernel panic in Step[4]. Thanks to Andrew Zaborowski for his initial work on fixing this issue. Link: https://lkml.kernel.org/r/20251015064926.1887643-1-qiuxu.zhuo@intel.com Link: https://lkml.kernel.org/r/20251011075520.320862-1-qiuxu.zhuo@intel.com Fixes: b1f202060afe ("mm: remap unused subpages to shared zeropage when splitting isolated thp") Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Reported-by: Farrah Chen <farrah.chen@intel.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Tested-by: Farrah Chen <farrah.chen@intel.com> Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Acked-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Barry Song <baohua@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jiaqi Yan <jiaqiyan@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Mariano Pache <npache@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-10-15mm: don't spin in add_stack_record when gfp flags don't allowAlexei Starovoitov1-0/+3
syzbot was able to find the following path: add_stack_record_to_list mm/page_owner.c:182 [inline] inc_stack_record_count mm/page_owner.c:214 [inline] __set_page_owner+0x2c3/0x4a0 mm/page_owner.c:333 set_page_owner include/linux/page_owner.h:32 [inline] post_alloc_hook+0x240/0x2a0 mm/page_alloc.c:1851 prep_new_page mm/page_alloc.c:1859 [inline] get_page_from_freelist+0x21e4/0x22c0 mm/page_alloc.c:3858 alloc_pages_nolock_noprof+0x94/0x120 mm/page_alloc.c:7554 Don't spin in add_stack_record_to_list() when it is called from *_nolock() context. Link: https://lkml.kernel.org/r/CAADnVQK_8bNYEA7TJYgwTYR57=TTFagsvRxp62pFzS_z129eTg@mail.gmail.com Fixes: 97769a53f117 ("mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reported-by: syzbot+8259e1d0e3ae8ed0c490@syzkaller.appspotmail.com Reported-by: syzbot+665739f456b28f32b23d@syzkaller.appspotmail.com Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-10-15mm/damon/sysfs: dealloc commit test ctx alwaysSeongJae Park1-3/+2
The damon_ctx for testing online DAMON parameters commit inputs is deallocated only when the test fails. This means memory is leaked for every successful online DAMON parameters commit. Fix the leak by always deallocating it. Link: https://lkml.kernel.org/r/20251003201455.41448-3-sj@kernel.org Fixes: 4c9ea539ad59 ("mm/damon/sysfs: validate user inputs from damon_sysfs_commit_input()") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> [6.15+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-10-15mm/damon/sysfs: catch commit test ctx alloc failureSeongJae Park1-0/+2
Patch series "mm/damon/sysfs: fix commit test damon_ctx [de]allocation". DAMON sysfs interface dynamically allocates and uses a damon_ctx object for testing if given inputs for online DAMON parameters update is valid. The object is being used without an allocation failure check, and leaked when the test succeeds. Fix the two bugs. This patch (of 2): The damon_ctx for testing online DAMON parameters commit inputs is used without its allocation failure check. This could result in an invalid memory access. Fix it by directly returning an error when the allocation failed. Link: https://lkml.kernel.org/r/20251003201455.41448-1-sj@kernel.org Link: https://lkml.kernel.org/r/20251003201455.41448-2-sj@kernel.org Fixes: 4c9ea539ad59 ("mm/damon/sysfs: validate user inputs from damon_sysfs_commit_input()") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> [6.15+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-10-14slab: fix clearing freelist in free_deferred_objects()Vlastimil Babka1-3/+4
defer_free() links pending objects using the slab's freelist offset which is fine as they are not free yet. free_deferred_objects() then clears this pointer to avoid confusing the debugging consistency checks that may be enabled for the cache. However, with CONFIG_SLAB_FREELIST_HARDENED, even the NULL pointer needs to be encoded appropriately using set_freepointer(), otherwise it's decoded as something else and triggers the consistency checks, as found by the kernel test robot. Use set_freepointer() to prevent the issue. Fixes: af92793e52c3 ("slab: Introduce kmalloc_nolock() and kfree_nolock().") Reported-and-tested-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202510101652.7921fdc6-lkp@intel.com Acked-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-10-11Merge tag 'slab-for-6.18-rc1-hotfix' of ↵Linus Torvalds1-14/+51
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab fix from Vlastimil Babka: "A NULL pointer deref hotfix" * tag 'slab-for-6.18-rc1-hotfix' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: slab: fix barn NULL pointer dereference on memoryless nodes
2025-10-11Merge tag 'mm-nonmm-stable-2025-10-10-15-03' of ↵Linus Torvalds1-1/+3
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more updates from Andrew Morton: "Just one series here - Mike Rappoport has taught KEXEC handover to preserve vmalloc allocations across handover" * tag 'mm-nonmm-stable-2025-10-10-15-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: lib/test_kho: use kho_preserve_vmalloc instead of storing addresses in fdt kho: add support for preserving vmalloc allocations kho: replace kho_preserve_phys() with kho_preserve_pages() kho: check if kho is finalized in __kho_preserve_order() MAINTAINERS, .mailmap: update Umang's email address
2025-10-11Merge tag 'mm-hotfixes-stable-2025-10-10-15-00' of ↵Linus Torvalds6-34/+24
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "7 hotfixes. All 7 are cc:stable and all 7 are for MM. All singletons, please see the changelogs for details" * tag 'mm-hotfixes-stable-2025-10-10-15-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm: hugetlb: avoid soft lockup when mprotect to large memory area fsnotify: pass correct offset to fsnotify_mmap_perm() mm/ksm: fix flag-dropping behavior in ksm_madvise mm/damon/vaddr: do not repeat pte_offset_map_lock() until success mm/rmap: fix soft-dirty and uffd-wp bit loss when remapping zero-filled mTHP subpage to shared zeropage mm/thp: fix MTE tag mismatch when replacing zero-filled subpages memcg: skip cgroup_file_notify if spinning is not allowed
2025-10-11slab: fix barn NULL pointer dereference on memoryless nodesVlastimil Babka1-14/+51
Phil reported a boot failure once sheaves become used in commits 59faa4da7cd4 ("maple_tree: use percpu sheaves for maple_node_cache") and 3accabda4da1 ("mm, vma: use percpu sheaves for vm_area_struct cache"): BUG: kernel NULL pointer dereference, address: 0000000000000040 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: Oops: 0000 [#1] SMP NOPTI CPU: 21 UID: 0 PID: 818 Comm: kworker/u398:0 Not tainted 6.17.0-rc3.slab+ #5 PREEMPT(voluntary) Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.26.0 07/30/2025 RIP: 0010:__pcs_replace_empty_main+0x44/0x1d0 Code: ec 08 48 8b 46 10 48 8b 76 08 48 85 c0 74 0b 8b 48 18 85 c9 0f 85 e5 00 00 00 65 48 63 05 e4 ee 50 02 49 8b 84 c6 e0 00 00 00 <4c> 8b 68 40 4c 89 ef e8 b0 81 ff ff 48 89 c5 48 85 c0 74 1d 48 89 RSP: 0018:ffffd2d10950bdb0 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff8a775dab74b0 RCX: 00000000ffffffff RDX: 0000000000000cc0 RSI: ffff8a6800804000 RDI: ffff8a680004e300 RBP: ffffd2d10950be40 R08: 0000000000000060 R09: ffffffffb9367388 R10: 00000000000149e8 R11: ffff8a6f87a38000 R12: 0000000000000cc0 R13: 0000000000000cc0 R14: ffff8a680004e300 R15: 00000000000000c0 FS: 0000000000000000(0000) GS:ffff8a77a3541000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000040 CR3: 0000000e1aa24000 CR4: 00000000003506f0 Call Trace: <TASK> ? srso_return_thunk+0x5/0x5f ? vm_area_alloc+0x1e/0x60 kmem_cache_alloc_noprof+0x4ec/0x5b0 vm_area_alloc+0x1e/0x60 create_init_stack_vma+0x26/0x210 alloc_bprm+0x139/0x200 kernel_execve+0x4a/0x140 call_usermodehelper_exec_async+0xd0/0x190 ? __pfx_call_usermodehelper_exec_async+0x10/0x10 ret_from_fork+0xf0/0x110 ? __pfx_call_usermodehelper_exec_async+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> Modules linked in: CR2: 0000000000000040 ---[ end trace 0000000000000000 ]--- RIP: 0010:__pcs_replace_empty_main+0x44/0x1d0 Code: ec 08 48 8b 46 10 48 8b 76 08 48 85 c0 74 0b 8b 48 18 85 c9 0f 85 e5 00 00 00 65 48 63 05 e4 ee 50 02 49 8b 84 c6 e0 00 00 00 <4c> 8b 68 40 4c 89 ef e8 b0 81 ff ff 48 89 c5 48 85 c0 74 1d 48 89 RSP: 0018:ffffd2d10950bdb0 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff8a775dab74b0 RCX: 00000000ffffffff RDX: 0000000000000cc0 RSI: ffff8a6800804000 RDI: ffff8a680004e300 RBP: ffffd2d10950be40 R08: 0000000000000060 R09: ffffffffb9367388 R10: 00000000000149e8 R11: ffff8a6f87a38000 R12: 0000000000000cc0 R13: 0000000000000cc0 R14: ffff8a680004e300 R15: 00000000000000c0 FS: 0000000000000000(0000) GS:ffff8a77a3541000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000040 CR3: 0000000e1aa24000 CR4: 00000000003506f0 Kernel panic - not syncing: Fatal exception Kernel Offset: 0x36a00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) ---[ end Kernel panic - not syncing: Fatal exception ]--- And noted "this is an AMD EPYC 7401 with 8 NUMA nodes configured such that memory is only on 2 of them." # numactl --hardware available: 8 nodes (0-7) node 0 cpus: 0 8 16 24 32 40 48 56 64 72 80 88 node 0 size: 0 MB node 0 free: 0 MB node 1 cpus: 2 10 18 26 34 42 50 58 66 74 82 90 node 1 size: 31584 MB node 1 free: 30397 MB node 2 cpus: 4 12 20 28 36 44 52 60 68 76 84 92 node 2 size: 0 MB node 2 free: 0 MB node 3 cpus: 6 14 22 30 38 46 54 62 70 78 86 94 node 3 size: 0 MB node 3 free: 0 MB node 4 cpus: 1 9 17 25 33 41 49 57 65 73 81 89 node 4 size: 0 MB node 4 free: 0 MB node 5 cpus: 3 11 19 27 35 43 51 59 67 75 83 91 node 5 size: 32214 MB node 5 free: 31625 MB node 6 cpus: 5 13 21 29 37 45 53 61 69 77 85 93 node 6 size: 0 MB node 6 free: 0 MB node 7 cpus: 7 15 23 31 39 47 55 63 71 79 87 95 node 7 size: 0 MB node 7 free: 0 MB Linus decoded the stacktrace to get_barn() and get_node() and determined that kmem_cache->node[numa_mem_id()] is NULL. The problem is due to a wrong assumption that memoryless nodes only exist on systems with CONFIG_HAVE_MEMORYLESS_NODES, where numa_mem_id() points to the nearest node that has memory. SLUB has been allocating its kmem_cache_node structures only on nodes with memory and so it does with struct node_barn. For kmem_cache_node, get_partial_node() checks if get_node() result is not NULL, which I assumed was for protection from a bogus node id passed to kmalloc_node() but apparently it's also for systems where numa_mem_id() (used when no specific node is given) might return a memoryless node. Fix the sheaves code the same way by checking the result of get_node() and bailing out if it's NULL. Note that cpus on such memoryless nodes will have degraded sheaves performance, which can be improved later, preferably by making numa_mem_id() work properly on such systems. Fixes: 2d517aa09bbc ("slab: add opt-in caching layer of percpu sheaves") Reported-and-tested-by: Phil Auld <pauld@redhat.com> Closes: https://lore.kernel.org/all/20251010151116.GA436967@pauld.westford.csb/ Analyzed-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/all/CAHk-%3Dwg1xK%2BBr%3DFJ5QipVhzCvq7uQVPt5Prze6HDhQQ%3DQD_BcQ@mail.gmail.com/ Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-10-09Merge tag 'slab-for-6.18-rc1' of ↵Linus Torvalds1-4/+13
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab fixes from Vlastimil Babka: - Fixes for several corner cases in error paths and debugging options, related to the new kmalloc_nolock() functionality (Kuniyuki Iwashima, Ran Xiaokai) * tag 'slab-for-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: slub: Don't call lockdep_unregister_key() for immature kmem_cache. slab: Fix using this_cpu_ptr() in preemptible context slab: Add allow_spin check to eliminate kmemleak warnings
2025-10-07mm: hugetlb: avoid soft lockup when mprotect to large memory areaYang Shi1-0/+2
When calling mprotect() to a large hugetlb memory area in our customer's workload (~300GB hugetlb memory), soft lockup was observed: watchdog: BUG: soft lockup - CPU#98 stuck for 23s! [t2_new_sysv:126916] CPU: 98 PID: 126916 Comm: t2_new_sysv Kdump: loaded Not tainted 6.17-rc7 Hardware name: GIGACOMPUTING R2A3-T40-AAV1/Jefferson CIO, BIOS 5.4.4.1 07/15/2025 pstate: 20400009 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : mte_clear_page_tags+0x14/0x24 lr : mte_sync_tags+0x1c0/0x240 sp : ffff80003150bb80 x29: ffff80003150bb80 x28: ffff00739e9705a8 x27: 0000ffd2d6a00000 x26: 0000ff8e4bc00000 x25: 00e80046cde00f45 x24: 0000000000022458 x23: 0000000000000000 x22: 0000000000000004 x21: 000000011b380000 x20: ffff000000000000 x19: 000000011b379f40 x18: 0000000000000000 x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 x11: 0000000000000000 x10: 0000000000000000 x9 : ffffc875e0aa5e2c x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000 x5 : fffffc01ce7a5c00 x4 : 00000000046cde00 x3 : fffffc0000000000 x2 : 0000000000000004 x1 : 0000000000000040 x0 : ffff0046cde7c000 Call trace:   mte_clear_page_tags+0x14/0x24   set_huge_pte_at+0x25c/0x280   hugetlb_change_protection+0x220/0x430   change_protection+0x5c/0x8c   mprotect_fixup+0x10c/0x294   do_mprotect_pkey.constprop.0+0x2e0/0x3d4   __arm64_sys_mprotect+0x24/0x44   invoke_syscall+0x50/0x160   el0_svc_common+0x48/0x144   do_el0_svc+0x30/0xe0   el0_svc+0x30/0xf0   el0t_64_sync_handler+0xc4/0x148   el0t_64_sync+0x1a4/0x1a8 Soft lockup is not triggered with THP or base page because there is cond_resched() called for each PMD size. Although the soft lockup was triggered by MTE, it should be not MTE specific. The other processing which takes long time in the loop may trigger soft lockup too. So add cond_resched() for hugetlb to avoid soft lockup. Link: https://lkml.kernel.org/r/20250929202402.1663290-1-yang@os.amperecomputing.com Fixes: 8f860591ffb2 ("[PATCH] Enable mprotect on huge pages") Signed-off-by: Yang Shi <yang@os.amperecomputing.com> Tested-by: Carl Worth <carl@os.amperecomputing.com> Reviewed-by: Christoph Lameter (Ampere) <cl@gentwo.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-10-07fsnotify: pass correct offset to fsnotify_mmap_perm()Ryan Roberts1-1/+2
fsnotify_mmap_perm() requires a byte offset for the file about to be mmap'ed. But it is called from vm_mmap_pgoff(), which has a page offset. Previously the conversion was done incorrectly so let's fix it, being careful not to overflow on 32-bit platforms. Discovered during code review. Link: https://lkml.kernel.org/r/20251003155238.2147410-1-ryan.roberts@arm.com Fixes: 066e053fe208 ("fsnotify: add pre-content hooks on mmap()") Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Kiryl Shutsemau <kas@kernel.org> Cc: Amir Goldstein <amir73il@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-10-07mm/damon/vaddr: do not repeat pte_offset_map_lock() until successSeongJae Park1-6/+2
DAMON's virtual address space operation set implementation (vaddr) calls pte_offset_map_lock() inside the page table walk callback function. This is for reading and writing page table accessed bits. If pte_offset_map_lock() fails, it retries by returning the page table walk callback function with ACTION_AGAIN. pte_offset_map_lock() can continuously fail if the target is a pmd migration entry, though. Hence it could cause an infinite page table walk if the migration cannot be done until the page table walk is finished. This indeed caused a soft lockup when CPU hotplugging and DAMON were running in parallel. Avoid the infinite loop by simply not retrying the page table walk. DAMON is promising only a best-effort accuracy, so missing access to such pages is no problem. Link: https://lkml.kernel.org/r/20250930004410.55228-1-sj@kernel.org Fixes: 7780d04046a2 ("mm/pagewalkers: ACTION_AGAIN if pte_offset_map_lock() fails") Signed-off-by: SeongJae Park <sj@kernel.org> Reported-by: Xinyu Zheng <zhengxinyu6@huawei.com> Closes: https://lore.kernel.org/20250918030029.2652607-1-zhengxinyu6@huawei.com Acked-by: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> [6.5+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-10-07mm/rmap: fix soft-dirty and uffd-wp bit loss when remapping zero-filled mTHP ↵Lance Yang1-5/+10
subpage to shared zeropage When splitting an mTHP and replacing a zero-filled subpage with the shared zeropage, try_to_map_unused_to_zeropage() currently drops several important PTE bits. For userspace tools like CRIU, which rely on the soft-dirty mechanism for incremental snapshots, losing the soft-dirty bit means modified pages are missed, leading to inconsistent memory state after restore. As pointed out by David, the more critical uffd-wp bit is also dropped. This breaks the userfaultfd write-protection mechanism, causing writes to be silently missed by monitoring applications, which can lead to data corruption. Preserve both the soft-dirty and uffd-wp bits from the old PTE when creating the new zeropage mapping to ensure they are correctly tracked. Link: https://lkml.kernel.org/r/20250930081040.80926-1-lance.yang@linux.dev Fixes: b1f202060afe ("mm: remap unused subpages to shared zeropage when splitting isolated thp") Signed-off-by: Lance Yang <lance.yang@linux.dev> Suggested-by: David Hildenbrand <david@redhat.com> Suggested-by: Dev Jain <dev.jain@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Byungchul Park <byungchul@sk.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Jann Horn <jannh@google.com> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Rik van Riel <riel@surriel.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yu Zhao <yuzhao@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-10-07mm/thp: fix MTE tag mismatch when replacing zero-filled subpagesLance Yang2-19/+4
When both THP and MTE are enabled, splitting a THP and replacing its zero-filled subpages with the shared zeropage can cause MTE tag mismatch faults in userspace. Remapping zero-filled subpages to the shared zeropage is unsafe, as the zeropage has a fixed tag of zero, which may not match the tag expected by the userspace pointer. KSM already avoids this problem by using memcmp_pages(), which on arm64 intentionally reports MTE-tagged pages as non-identical to prevent unsafe merging. As suggested by David[1], this patch adopts the same pattern, replacing the memchr_inv() byte-level check with a call to pages_identical(). This leverages existing architecture-specific logic to determine if a page is truly identical to the shared zeropage. Having both the THP shrinker and KSM rely on pages_identical() makes the design more future-proof, IMO. Instead of handling quirks in generic code, we just let the architecture decide what makes two pages identical. [1] https://lore.kernel.org/all/ca2106a3-4bb2-4457-81af-301fd99fbef4@redhat.com Link: https://lkml.kernel.org/r/20250922021458.68123-1-lance.yang@linux.dev Fixes: b1f202060afe ("mm: remap unused subpages to shared zeropage when splitting isolated thp") Signed-off-by: Lance Yang <lance.yang@linux.dev> Reported-by: Qun-wei Lin <Qun-wei.Lin@mediatek.com> Closes: https://lore.kernel.org/all/a7944523fcc3634607691c35311a5d59d1a3f8d4.camel@mediatek.com Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Usama Arif <usamaarif642@gmail.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: andrew.yang <andrew.yang@mediatek.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Byungchul Park <byungchul@sk.com> Cc: Charlie Jenkins <charlie@rivosinc.com> Cc: Chinwen Chang <chinwen.chang@mediatek.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Palmer Dabbelt <palmer@rivosinc.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Samuel Holland <samuel.holland@sifive.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-10-07memcg: skip cgroup_file_notify if spinning is not allowedShakeel Butt1-3/+4
Generally memcg charging is allowed from all the contexts including NMI where even spinning on spinlock can cause locking issues. However one call chain was missed during the addition of memcg charging from any context support. That is try_charge_memcg() -> memcg_memory_event() -> cgroup_file_notify(). The possible function call tree under cgroup_file_notify() can acquire many different spin locks in spinning mode. Some of them are cgroup_file_kn_lock, kernfs_notify_lock, pool_workqeue's lock. So, let's just skip cgroup_file_notify() from memcg charging if the context does not allow spinning. Alternative approach was also explored where instead of skipping cgroup_file_notify(), we defer the memcg event processing to irq_work [1]. However it adds complexity and it was decided to keep things simple until we need more memcg events with !allow_spinning requirement. Link: https://lore.kernel.org/all/5qi2llyzf7gklncflo6gxoozljbm4h3tpnuv4u4ej4ztysvi6f@x44v7nz2wdzd/ [1] Link: https://lkml.kernel.org/r/20250922220203.261714-1-shakeel.butt@linux.dev Fixes: 3ac4638a734a ("memcg: make memcg_rstat_updated nmi safe") Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Closes: https://lore.kernel.org/all/20250905061919.439648-1-yepeilin@google.com/ Cc: Alexei Starovoitov <ast@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peilin Ye <yepeilin@google.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-10-07kho: replace kho_preserve_phys() with kho_preserve_pages()Mike Rapoport (Microsoft)1-1/+3
to make it clear that KHO operates on pages rather than on a random physical address. The kho_preserve_pages() will be also used in upcoming support for vmalloc preservation. Link: https://lkml.kernel.org/r/20250921054458.4043761-3-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Alexander Graf <graf@amazon.com> Cc: Baoquan He <bhe@redhat.com> Cc: Changyuan Lyu <changyuanl@google.com> Cc: Chris Li <chrisl@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-10-07Merge tag 'dma-mapping-6.18-2025-10-07' of ↵Linus Torvalds1-2/+1
git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux Pull dma-mapping fixes from Marek Szyprowski: "Two small fixes for the recently performed code refactoring (Shigeru Yoshida) and missing handling of direction parameter in DMA debug code (Petr Tesarik)" * tag 'dma-mapping-6.18-2025-10-07' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux: dma-mapping: fix direction in dma_alloc direction traces kmsan: fix kmsan_handle_dma() to avoid false positives
2025-10-07slub: Don't call lockdep_unregister_key() for immature kmem_cache.Kuniyuki Iwashima1-1/+2
syzbot reported the lockdep splat below in __kmem_cache_release(). [0] The problem is that __kmem_cache_release() could be called from do_kmem_cache_create() before init_kmem_cache_cpus() registers the lockdep key. Let's perform lockdep_unregister_key() only when init_kmem_cache_cpus() has been done, which we can determine by checking s->cpu_slab [0]: WARNING: CPU: 1 PID: 6128 at kernel/locking/lockdep.c:6606 lockdep_unregister_key+0x2ca/0x310 kernel/locking/lockdep.c:6606 Modules linked in: CPU: 1 UID: 0 PID: 6128 Comm: syz.4.21 Not tainted syzkaller #0 PREEMPT_{RT,(full)} Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/18/2025 RIP: 0010:lockdep_unregister_key+0x2ca/0x310 kernel/locking/lockdep.c:6606 Code: 50 e4 0f 48 3b 44 24 10 0f 84 26 fe ff ff e8 bd cd 17 09 e8 e8 ce 17 09 41 f7 c7 00 02 00 00 74 bd fb 40 84 ed 75 bc eb cd 90 <0f> 0b 90 e9 19 ff ff ff 90 0f 0b 90 e9 2a ff ff ff 48 c7 c7 d0 ac RSP: 0018:ffffc90003e870d0 EFLAGS: 00010002 RAX: eb1525397f5bdf00 RBX: ffff88803c121148 RCX: 1ffff920007d0dfc RDX: 0000000000000000 RSI: ffffffff8acb1500 RDI: ffffffff8b1dd0e0 RBP: 00000000ffffffea R08: ffffffff8eb5aa37 R09: 1ffffffff1d6b546 R10: dffffc0000000000 R11: fffffbfff1d6b547 R12: 0000000000000000 R13: ffff88814d1b8900 R14: 0000000000000000 R15: 0000000000000203 FS: 00007f773f75e6c0(0000) GS:ffff88812712f000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ffdaea3af52 CR3: 000000003a5ca000 CR4: 00000000003526f0 Call Trace: <TASK> __kmem_cache_release+0xe3/0x1e0 mm/slub.c:7696 do_kmem_cache_create+0x74e/0x790 mm/slub.c:8575 create_cache mm/slab_common.c:242 [inline] __kmem_cache_create_args+0x1ce/0x330 mm/slab_common.c:340 nfsd_file_cache_init+0x1d6/0x530 fs/nfsd/filecache.c:816 nfsd_startup_generic fs/nfsd/nfssvc.c:282 [inline] nfsd_startup_net fs/nfsd/nfssvc.c:377 [inline] nfsd_svc+0x393/0x900 fs/nfsd/nfssvc.c:786 nfsd_nl_threads_set_doit+0x84a/0x960 fs/nfsd/nfsctl.c:1639 genl_family_rcv_msg_doit+0x212/0x300 net/netlink/genetlink.c:1115 genl_family_rcv_msg net/netlink/genetlink.c:1195 [inline] genl_rcv_msg+0x60e/0x790 net/netlink/genetlink.c:1210 netlink_rcv_skb+0x208/0x470 net/netlink/af_netlink.c:2552 genl_rcv+0x28/0x40 net/netlink/genetlink.c:1219 netlink_unicast_kernel net/netlink/af_netlink.c:1320 [inline] netlink_unicast+0x846/0xa10 net/netlink/af_netlink.c:1346 netlink_sendmsg+0x805/0xb30 net/netlink/af_netlink.c:1896 sock_sendmsg_nosec net/socket.c:727 [inline] __sock_sendmsg+0x219/0x270 net/socket.c:742 ____sys_sendmsg+0x508/0x820 net/socket.c:2630 ___sys_sendmsg+0x21f/0x2a0 net/socket.c:2684 __sys_sendmsg net/socket.c:2716 [inline] __do_sys_sendmsg net/socket.c:2721 [inline] __se_sys_sendmsg net/socket.c:2719 [inline] __x64_sys_sendmsg+0x1a1/0x260 net/socket.c:2719 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f77400eeec9 Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f773f75e038 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00007f7740345fa0 RCX: 00007f77400eeec9 RDX: 0000000000008004 RSI: 0000200000000180 RDI: 0000000000000006 RBP: 00007f7740171f91 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007f7740346038 R14: 00007f7740345fa0 R15: 00007ffce616f8d8 </TASK> [alexei.starovoitov@gmail.com: simplify the fix] Link: https://lore.kernel.org/all/20251007052534.2776661-1-kuniyu@google.com/ Fixes: 83382af9ddc3 ("slab: Make slub local_(try)lock more precise for LOCKDEP") Reported-by: syzbot+a6f4d69b9b23404bbabf@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/68e4a3d1.a00a0220.298cc0.0471.GAE@google.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-10-06slab: Fix using this_cpu_ptr() in preemptible contextRan Xiaokai1-2/+9
defer_free() maybe called in preemptible context, this will trigger the below warning message: BUG: using smp_processor_id() in preemptible [00000000] code: swapper/0/1 caller is defer_free+0x1b/0x60 Call Trace: <TASK> dump_stack_lvl+0xac/0xc0 check_preemption_disabled+0xbe/0xe0 defer_free+0x1b/0x60 kfree_nolock+0x1eb/0x2b0 alloc_slab_obj_exts+0x356/0x390 __alloc_tagging_slab_alloc_hook+0xa0/0x300 __kmalloc_cache_noprof+0x1c4/0x5c0 __set_page_owner+0x10d/0x1c0 post_alloc_hook+0x84/0xf0 get_page_from_freelist+0x73b/0x1380 __alloc_frozen_pages_noprof+0x110/0x2c0 alloc_pages_mpol+0x44/0x140 alloc_slab_page+0xac/0x150 allocate_slab+0x78/0x3a0 ___slab_alloc+0x76b/0xed0 __slab_alloc.constprop.0+0x5a/0xb0 __kmalloc_noprof+0x3dc/0x6d0 __list_lru_init+0x6c/0x210 alloc_super+0x3b6/0x470 sget_fc+0x5f/0x3a0 get_tree_nodev+0x27/0x90 vfs_get_tree+0x26/0xc0 vfs_kern_mount.part.0+0xb6/0x140 kern_mount+0x24/0x40 init_pipe_fs+0x4f/0x70 do_one_initcall+0x62/0x2e0 kernel_init_freeable+0x25b/0x4b0 kernel_init+0x1a/0x1c0 ret_from_fork+0x290/0x2e0 ret_from_fork_asm+0x11/0x20 </TASK> Disable preemption in defer_free() and also defer_deactivate_slab() to make it safe. [vbabka@suse.cz: disable preemption instead of using raw_cpu_ptr() per the discussion ] Fixes: af92793e52c3 ("slab: Introduce kmalloc_nolock() and kfree_nolock().") Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn> Link: https://lore.kernel.org/r/20250930083402.782927-1-ranxiaokai627@163.com Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-10-05Merge tag 'mm-stable-2025-10-03-16-49' of ↵Linus Torvalds4-34/+34
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more MM updates from Andrew Morton: "Only two patch series in this pull request: - "mm/memory_hotplug: fixup crash during uevent handling" from Hannes Reinecke fixes a race that was causing udev to trigger a crash in the memory hotplug code - "mm_slot: following fixup for usage of mm_slot_entry()" from Wei Yang adds some touchups to the just-merged mm_slot changes" * tag 'mm-stable-2025-10-03-16-49' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm/khugepaged: use KMEM_CACHE() mm/ksm: cleanup mm_slot_entry() invocation Documentation/mm: drop pxx_mkdevmap() descriptions from page table helpers mm: clean up is_guard_pte_marker() drivers/base: move memory_block_add_nid() into the caller mm/memory_hotplug: activate node before adding new memory blocks drivers/base/memory: add node id parameter to add_memory_block()
2025-10-04Merge tag 'memblock-v6.18-rc1' of ↵Linus Torvalds2-196/+65
git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock Pull mm-init update from Mike Rapoport: "Simplify deferred initialization of struct pages Refactor and simplify deferred initialization of the memory map. Beside the negative diffstat it gives 3ms (55ms vs 58ms) reduction in the initialization of deferred pages on single node system with 64GiB of RAM" * tag 'memblock-v6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock: memblock: drop for_each_free_mem_pfn_range_in_zone_from() mm/mm_init: drop deferred_init_maxorder() mm/mm_init: deferred_init_memmap: use a job per zone mm/mm_init: use deferred_init_memmap_chunk() in deferred_grow_zone()
2025-10-03Merge tag 'dma-mapping-6.18-2025-09-30' of ↵Linus Torvalds2-15/+17
git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux Pull dma-mapping updates from Marek Szyprowski: - Refactoring of DMA mapping API to physical addresses as the primary interface instead of page+offset parameters This gets much closer to Matthew Wilcox's long term wish for struct-pageless IO to cacheable DRAM and is supporting memdesc project which seeks to substantially transform how struct page works. An advantage of this approach is the possibility of introducing DMA_ATTR_MMIO, which covers existing 'dma_map_resource' flow in the common paths, what in turn lets to use recently introduced dma_iova_link() API to map PCI P2P MMIO without creating struct page Developped by Leon Romanovsky and Jason Gunthorpe - Minor clean-up by Petr Tesarik and Qianfeng Rong * tag 'dma-mapping-6.18-2025-09-30' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux: kmsan: fix missed kmsan_handle_dma() signature conversion mm/hmm: properly take MMIO path mm/hmm: migrate to physical address-based DMA mapping API dma-mapping: export new dma_*map_phys() interface xen: swiotlb: Open code map_resource callback dma-mapping: implement DMA_ATTR_MMIO for dma_(un)map_page_attrs() kmsan: convert kmsan_handle_dma to use physical addresses dma-mapping: convert dma_direct_*map_page to be phys_addr_t based iommu/dma: implement DMA_ATTR_MMIO for iommu_dma_(un)map_phys() iommu/dma: rename iommu_dma_*map_page to iommu_dma_*map_phys dma-mapping: rename trace_dma_*map_page to trace_dma_*map_phys dma-debug: refactor to use physical addresses for page mapping iommu/dma: implement DMA_ATTR_MMIO for dma_iova_link(). dma-mapping: introduce new DMA attribute to indicate MMIO memory swiotlb: Remove redundant __GFP_NOWARN dma-direct: clean up the logic in __dma_direct_alloc_pages()
2025-10-03mm/khugepaged: use KMEM_CACHE()Wei Yang1-4/+1
We got some late review commits during review of commit b4c9ffb54b32 ("mm/khugepaged: remove definition of struct khugepaged_mm_slot"). No need to keep the old cache name "khugepaged_mm_slot", let's simply use KMEM_CACHE(). Link: https://lkml.kernel.org/r/20251001091900.20041-3-richard.weiyang@gmail.com Fixes: b4c9ffb54b32 ("mm/khugepaged: remove definition of struct khugepaged_mm_slot") Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Dev Jain <dev.jain@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com> Cc: Kiryl Shutsemau <kirill@shutemov.name> Cc: xu xin <xu.xin16@zte.com.cn> Cc: SeongJae Park <sj@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Kiryl Shutsemau <kas@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-10-03mm/ksm: cleanup mm_slot_entry() invocationWei Yang1-13/+14
Patch series "mm_slot: following fixup for usage of mm_slot_entry()", v2. We got some late review commits during review of "mm_slot: fix the usage of mm_slot_entry()" in [1]. This patch (of 2): We got some late review commits during review of commit 08498be43ee6 ("mm/ksm: get mm_slot by mm_slot_entry() when slot is !NULL"). Let's reduce the indentation level and make the code easier to follow by using gotos to a new label. Link: https://lkml.kernel.org/r/20251001091900.20041-1-richard.weiyang@gmail.com Link: https://lkml.kernel.org/r/20251001091900.20041-2-richard.weiyang@gmail.com Link: https://lkml.kernel.org/r/20250927004539.19308-1-richard.weiyang@gmail.com [1] Fixes: 08498be43ee6 ("mm/ksm: get mm_slot by mm_slot_entry() when slot is !NULL") Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Kiryl Shutsemau <kas@kernel.org> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: xu xin <xu.xin16@zte.com.cn> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Raghavendra K T <raghavendra.kt@amd.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Kiryl Shutsemau <kirill@shutemov.name> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-10-03mm: clean up is_guard_pte_marker()Lance Yang1-2/+2
Let's simplify the implementation. The current code is redundant as it effectively expands to: is_swap_pte(pte) && is_pte_marker_entry(...) && // from is_pte_marker() is_pte_marker_entry(...) // from is_guard_swp_entry() While a modern compiler could likely optimize this away, let's have clean code and not rely on it. Link: https://lkml.kernel.org/r/20250924045830.3817-1-lance.yang@linux.dev Signed-off-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Kairui Song <kasong@tencent.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Mika Penttilä <mpenttil@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-10-03mm/memory_hotplug: activate node before adding new memory blocksHannes Reinecke1-15/+17
The sysfs attributes for memory blocks require the node ID to be set and initialized, so move the node activation before adding new memory blocks. This also has the nice side effect that the BUG_ON() can be converted into a WARN_ON() as we now can handle registration errors. Link: https://lkml.kernel.org/r/20250729064637.51662-3-hare@kernel.org Fixes: b9ff036082cd ("mm/memory_hotplug.c: make add_memory_resource use __try_online_node") Signed-off-by: Hannes Reinecke <hare@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Donet Tom <donettom@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-10-03Merge tag 'nfs-for-6.18-1' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds1-8/+26
Pull NFS client updates from Anna Schumaker: "New Features: - Add a Kconfig option to redirect dfprintk() to the trace buffer - Enable use of the RWF_DONTCACHE flag on the NFS client - Add striped layout handling to pNFS flexfiles - Add proper localio handling for READ and WRITE O_DIRECT Bugfixes: - Handle NFS4ERR_GRACE errors during delegation recall - Fix NFSv4.1 backchannel max_resp_sz verification check - Fix mount hang after CREATE_SESSION failure - Fix d_parent->d_inode locking in nfs4_setup_readdir() Other Cleanups and Improvements: - Improvements to write handling tracepoints - Fix a few trivial spelling mistakes - Cleanups to the rpcbind cleanup call sites - Convert the SUNRPC xdr_buf to use a scratch folio instead of scratch page - Remove unused NFS_WBACK_BUSY() macro - Remove __GFP_NOWARN flags - Unexport rpc_malloc() and rpc_free()" * tag 'nfs-for-6.18-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (46 commits) NFS: add basic STATX_DIOALIGN and STATX_DIO_READ_ALIGN support nfs/localio: add tracepoints for misaligned DIO READ and WRITE support nfs/localio: add proper O_DIRECT support for READ and WRITE nfs/localio: refactor iocb initialization nfs/localio: refactor iocb and iov_iter_bvec initialization nfs/localio: avoid issuing misaligned IO using O_DIRECT nfs/localio: make trace_nfs_local_open_fh more useful NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support sunrpc: unexport rpc_malloc() and rpc_free() NFSv4/flexfiles: Add support for striped layouts NFSv4/flexfiles: Update layout stats & error paths for striped layouts NFSv4/flexfiles: Write path updates for striped layouts NFSv4/flexfiles: Commit path updates for striped layouts NFSv4/flexfiles: Read path updates for striped layouts NFSv4/flexfiles: Update low level helper functions to be DS stripe aware. NFSv4/flexfiles: Add data structure support for striped layouts NFSv4/flexfiles: Use ds_commit_idx when marking a write commit NFSv4/flexfiles: Remove cred local variable dependency nfs4_setup_readdir(): insufficient locking for ->d_parent->d_inode dereferencing NFS: Enable use of the RWF_DONTCACHE flag on the NFS client ...
2025-10-03Merge tag 'fuse-update-6.18' of ↵Linus Torvalds2-26/+21
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: - Extend copy_file_range interface to be fully 64bit capable (Miklos) - Add selftest for fusectl (Chen Linxuan) - Move fuse docs into a separate directory (Bagas Sanjaya) - Allow fuse to enter freezable state in some cases (Sergey Senozhatsky) - Clean up writeback accounting after removing tmp page copies (Joanne) - Optimize virtiofs request handling (Li RongQing) - Add synchronous FUSE_INIT support (Miklos) - Allow server to request prune of unused inodes (Miklos) - Fix deadlock with AIO/sync release (Darrick) - Add some prep patches for block/iomap support (Darrick) - Misc fixes and cleanups * tag 'fuse-update-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (26 commits) fuse: move CREATE_TRACE_POINTS to a separate file fuse: move the backing file idr and code into a new source file fuse: enable FUSE_SYNCFS for all fuseblk servers fuse: capture the unique id of fuse commands being sent fuse: fix livelock in synchronous file put from fuseblk workers mm: fix lockdep issues in writeback handling fuse: add prune notification fuse: remove redundant calls to fuse_copy_finish() in fuse_notify() fuse: fix possibly missing fuse_copy_finish() call in fuse_notify() fuse: remove FUSE_NOTIFY_CODE_MAX from <uapi/linux/fuse.h> fuse: remove fuse_readpages_end() null mapping check fuse: fix references to fuse.rst -> fuse/fuse.rst fuse: allow synchronous FUSE_INIT fuse: zero initialize inode private data fuse: remove unused 'inode' parameter in fuse_passthrough_open virtio_fs: fix the hash table using in virtio_fs_enqueue_req() mm: remove BDI_CAP_WRITEBACK_ACCT fuse: use default writeback accounting virtio_fs: Remove redundant spinlock in virtio_fs_request_complete() fuse: remove unneeded offset assignment when filling write pages ...
2025-10-03slab: Add allow_spin check to eliminate kmemleak warningsRan Xiaokai1-1/+2
In slab_post_alloc_hook(), kmemleak check is skipped when gfpflags_allow_spinning() returns false since commit af92793e52c3 ("slab: Introduce kmalloc_nolock() and kfree_nolock().") Therefore, unconditionally calling kmemleak_not_leak() in alloc_slab_obj_exts() would trigger the following warning: kmemleak: Trying to color unknown object at 0xffff8881057f5000 as Grey Call Trace: alloc_slab_obj_exts+0x1b5/0x370 __alloc_tagging_slab_alloc_hook+0x9f/0x2d0 __kmalloc_cache_noprof+0x1c4/0x5c0 __set_page_owner+0x10d/0x1c0 post_alloc_hook+0x84/0xf0 get_page_from_freelist+0x73b/0x1380 __alloc_frozen_pages_noprof+0x110/0x2c0 alloc_pages_mpol+0x44/0x140 alloc_slab_page+0xac/0x150 allocate_slab+0x78/0x3a0 ___slab_alloc+0x76b/0xed0 __slab_alloc.constprop.0+0x5a/0xb0 Add the allow_spin check in alloc_slab_obj_exts() to eliminate the above warning. Fixes: af92793e52c3 ("slab: Introduce kmalloc_nolock() and kfree_nolock().") Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/r/20250930063831.782815-1-ranxiaokai627@163.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-10-03kmsan: fix kmsan_handle_dma() to avoid false positivesShigeru Yoshida1-2/+1
KMSAN reports an uninitialized value issue in dma_map_phys()[1]. This is a false positive caused by the way the virtual address is handled in kmsan_handle_dma(). Fix it by translating the physical address to a virtual address using phys_to_virt(). [1] BUG: KMSAN: uninit-value in dma_map_phys+0xdc5/0x1060 dma_map_phys+0xdc5/0x1060 dma_map_page_attrs+0xcf/0x130 e1000_xmit_frame+0x3c51/0x78f0 dev_hard_start_xmit+0x22f/0xa30 sch_direct_xmit+0x3b2/0xcf0 __dev_queue_xmit+0x3588/0x5e60 neigh_resolve_output+0x9c5/0xaf0 ip6_finish_output2+0x24e0/0x2d30 ip6_finish_output+0x903/0x10d0 ip6_output+0x331/0x600 mld_sendpack+0xb4a/0x1770 mld_ifc_work+0x1328/0x19b0 process_scheduled_works+0xb91/0x1d80 worker_thread+0xedf/0x1590 kthread+0xd5c/0xf00 ret_from_fork+0x1f5/0x4c0 ret_from_fork_asm+0x1a/0x30 Uninit was created at: __kmalloc_cache_noprof+0x8f5/0x16b0 syslog_print+0x9a/0xef0 do_syslog+0x849/0xfe0 __x64_sys_syslog+0x97/0x100 x64_sys_call+0x3cf8/0x3e30 do_syscall_64+0xd9/0xfa0 entry_SYSCALL_64_after_hwframe+0x77/0x7f Bytes 0-89 of 90 are uninitialized Memory access of size 90 starts at ffff8880367ed000 CPU: 1 UID: 0 PID: 1552 Comm: kworker/1:2 Not tainted 6.17.0-next-20250929 #26 PREEMPT(none) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-5.fc42 04/01/2014 Workqueue: mld mld_ifc_work Fixes: 6eb1e769b2c1 ("kmsan: convert kmsan_handle_dma to use physical addresses") Signed-off-by: Shigeru Yoshida <syoshida@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20251002051024.3096061-1-syoshida@redhat.com
2025-10-02Merge tag 'mm-stable-2025-10-01-19-00' of ↵Linus Torvalds90-2730/+3835
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "mm, swap: improve cluster scan strategy" from Kairui Song improves performance and reduces the failure rate of swap cluster allocation - "support large align and nid in Rust allocators" from Vitaly Wool permits Rust allocators to set NUMA node and large alignment when perforning slub and vmalloc reallocs - "mm/damon/vaddr: support stat-purpose DAMOS" from Yueyang Pan extend DAMOS_STAT's handling of the DAMON operations sets for virtual address spaces for ops-level DAMOS filters - "execute PROCMAP_QUERY ioctl under per-vma lock" from Suren Baghdasaryan reduces mmap_lock contention during reads of /proc/pid/maps - "mm/mincore: minor clean up for swap cache checking" from Kairui Song performs some cleanup in the swap code - "mm: vm_normal_page*() improvements" from David Hildenbrand provides code cleanup in the pagemap code - "add persistent huge zero folio support" from Pankaj Raghav provides a block layer speedup by optionalls making the huge_zero_pagepersistent, instead of releasing it when its refcount falls to zero - "kho: fixes and cleanups" from Mike Rapoport adds a few touchups to the recently added Kexec Handover feature - "mm: make mm->flags a bitmap and 64-bit on all arches" from Lorenzo Stoakes turns mm_struct.flags into a bitmap. To end the constant struggle with space shortage on 32-bit conflicting with 64-bit's needs - "mm/swapfile.c and swap.h cleanup" from Chris Li cleans up some swap code - "selftests/mm: Fix false positives and skip unsupported tests" from Donet Tom fixes a few things in our selftests code - "prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised" from David Hildenbrand "allows individual processes to opt-out of THP=always into THP=madvise, without affecting other workloads on the system". It's a long story - the [1/N] changelog spells out the considerations - "Add and use memdesc_flags_t" from Matthew Wilcox gets us started on the memdesc project. Please see https://kernelnewbies.org/MatthewWilcox/Memdescs and https://blogs.oracle.com/linux/post/introducing-memdesc - "Tiny optimization for large read operations" from Chi Zhiling improves the efficiency of the pagecache read path - "Better split_huge_page_test result check" from Zi Yan improves our folio splitting selftest code - "test that rmap behaves as expected" from Wei Yang adds some rmap selftests - "remove write_cache_pages()" from Christoph Hellwig removes that function and converts its two remaining callers - "selftests/mm: uffd-stress fixes" from Dev Jain fixes some UFFD selftests issues - "introduce kernel file mapped folios" from Boris Burkov introduces the concept of "kernel file pages". Using these permits btrfs to account its metadata pages to the root cgroup, rather than to the cgroups of random inappropriate tasks - "mm/pageblock: improve readability of some pageblock handling" from Wei Yang provides some readability improvements to the page allocator code - "mm/damon: support ARM32 with LPAE" from SeongJae Park teaches DAMON to understand arm32 highmem - "tools: testing: Use existing atomic.h for vma/maple tests" from Brendan Jackman performs some code cleanups and deduplication under tools/testing/ - "maple_tree: Fix testing for 32bit compiles" from Liam Howlett fixes a couple of 32-bit issues in tools/testing/radix-tree.c - "kasan: unify kasan_enabled() and remove arch-specific implementations" from Sabyrzhan Tasbolatov moves KASAN arch-specific initialization code into a common arch-neutral implementation - "mm: remove zpool" from Johannes Weiner removes zspool - an indirection layer which now only redirects to a single thing (zsmalloc) - "mm: task_stack: Stack handling cleanups" from Pasha Tatashin makes a couple of cleanups in the fork code - "mm: remove nth_page()" from David Hildenbrand makes rather a lot of adjustments at various nth_page() callsites, eventually permitting the removal of that undesirable helper function - "introduce kasan.write_only option in hw-tags" from Yeoreum Yun creates a KASAN read-only mode for ARM, using that architecture's memory tagging feature. It is felt that a read-only mode KASAN is suitable for use in production systems rather than debug-only - "mm: hugetlb: cleanup hugetlb folio allocation" from Kefeng Wang does some tidying in the hugetlb folio allocation code - "mm: establish const-correctness for pointer parameters" from Max Kellermann makes quite a number of the MM API functions more accurate about the constness of their arguments. This was getting in the way of subsystems (in this case CEPH) when they attempt to improving their own const/non-const accuracy - "Cleanup free_pages() misuse" from Vishal Moola fixes a number of code sites which were confused over when to use free_pages() vs __free_pages() - "Add Rust abstraction for Maple Trees" from Alice Ryhl makes the mapletree code accessible to Rust. Required by nouveau and by its forthcoming successor: the new Rust Nova driver - "selftests/mm: split_huge_page_test: split_pte_mapped_thp improvements" from David Hildenbrand adds a fix and some cleanups to the thp selftesting code - "mm, swap: introduce swap table as swap cache (phase I)" from Chris Li and Kairui Song is the first step along the path to implementing "swap tables" - a new approach to swap allocation and state tracking which is expected to yield speed and space improvements. This patchset itself yields a 5-20% performance benefit in some situations - "Some ptdesc cleanups" from Matthew Wilcox utilizes the new memdesc layer to clean up the ptdesc code a little - "Fix va_high_addr_switch.sh test failure" from Chunyu Hu fixes some issues in our 5-level pagetable selftesting code - "Minor fixes for memory allocation profiling" from Suren Baghdasaryan addresses a couple of minor issues in relatively new memory allocation profiling feature - "Small cleanups" from Matthew Wilcox has a few cleanups in preparation for more memdesc work - "mm/damon: add addr_unit for DAMON_LRU_SORT and DAMON_RECLAIM" from Quanmin Yan makes some changes to DAMON in furtherance of supporting arm highmem - "selftests/mm: Add -Wunreachable-code and fix warnings" from Muhammad Anjum adds that compiler check to selftests code and fixes the fallout, by removing dead code - "Improvements to Victim Process Thawing and OOM Reaper Traversal Order" from zhongjinji makes a number of improvements in the OOM killer: mainly thawing a more appropriate group of victim threads so they can release resources - "mm/damon: misc fixups and improvements for 6.18" from SeongJae Park is a bunch of small and unrelated fixups for DAMON - "mm/damon: define and use DAMON initialization check function" from SeongJae Park implement reliability and maintainability improvements to a recently-added bug fix - "mm/damon/stat: expose auto-tuned intervals and non-idle ages" from SeongJae Park provides additional transparency to userspace clients of the DAMON_STAT information - "Expand scope of khugepaged anonymous collapse" from Dev Jain removes some constraints on khubepaged's collapsing of anon VMAs. It also increases the success rate of MADV_COLLAPSE against an anon vma - "mm: do not assume file == vma->vm_file in compat_vma_mmap_prepare()" from Lorenzo Stoakes moves us further towards removal of file_operations.mmap(). This patchset concentrates upon clearing up the treatment of stacked filesystems - "mm: Improve mlock tracking for large folios" from Kiryl Shutsemau provides some fixes and improvements to mlock's tracking of large folios. /proc/meminfo's "Mlocked" field became more accurate - "mm/ksm: Fix incorrect accounting of KSM counters during fork" from Donet Tom fixes several user-visible KSM stats inaccuracies across forks and adds selftest code to verify these counters - "mm_slot: fix the usage of mm_slot_entry" from Wei Yang addresses some potential but presently benign issues in KSM's mm_slot handling * tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (372 commits) mm: swap: check for stable address space before operating on the VMA mm: convert folio_page() back to a macro mm/khugepaged: use start_addr/addr for improved readability hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list alloc_tag: fix boot failure due to NULL pointer dereference mm: silence data-race in update_hiwater_rss mm/memory-failure: don't select MEMORY_ISOLATION mm/khugepaged: remove definition of struct khugepaged_mm_slot mm/ksm: get mm_slot by mm_slot_entry() when slot is !NULL hugetlb: increase number of reserving hugepages via cmdline selftests/mm: add fork inheritance test for ksm_merging_pages counter mm/ksm: fix incorrect KSM counter handling in mm_struct during fork drivers/base/node: fix double free in register_one_node() mm: remove PMD alignment constraint in execmem_vmalloc() mm/memory_hotplug: fix typo 'esecially' -> 'especially' mm/rmap: improve mlock tracking for large folios mm/filemap: map entire large folio faultaround mm/fault: try to map the entire file folio in finish_fault() mm/rmap: mlock large folios in try_to_unmap_one() mm/rmap: fix a mlock race condition in folio_referenced_one() ...
2025-10-02Merge tag 'slab-for-6.18' of ↵Linus Torvalds8-160/+2320
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab updates from Vlastimil Babka: - A new layer for caching objects for allocation and free via percpu arrays called sheaves. The aim is to combine the good parts of SLAB (lower-overhead and simpler percpu caching, compared to SLUB) without the past issues with arrays for freeing remote NUMA node objects and their flushing. It also allows more efficient kfree_rcu(), and cheaper object preallocations for cases where the exact number of objects is unknown, but an upper bound is. Currently VMAs and maple nodes are using this new caching, with a plan to enable it for all caches and remove the complex SLUB fastpath based on cpu (partial) slabs and this_cpu_cmpxchg_double(). (Vlastimil Babka, with Liam Howlett and Pedro Falcato for the maple tree changes) - Re-entrant kmalloc_nolock(), which allows opportunistic allocations from NMI and tracing/kprobe contexts. Building on prior page allocator and memcg changes, it will result in removing BPF-specific caches on top of slab (Alexei Starovoitov) - Various fixes and cleanups. (Kuan-Wei Chiu, Matthew Wilcox, Suren Baghdasaryan, Ye Liu) * tag 'slab-for-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: (40 commits) slab: Introduce kmalloc_nolock() and kfree_nolock(). slab: Reuse first bit for OBJEXTS_ALLOC_FAIL slab: Make slub local_(try)lock more precise for LOCKDEP mm: Introduce alloc_frozen_pages_nolock() mm: Allow GFP_ACCOUNT to be used in alloc_pages_nolock(). locking/local_lock: Introduce local_lock_is_locked(). maple_tree: Convert forking to use the sheaf interface maple_tree: Add single node allocation support to maple state maple_tree: Prefilled sheaf conversion and testing tools/testing: Add support for prefilled slab sheafs maple_tree: Replace mt_free_one() with kfree() maple_tree: Use kfree_rcu in ma_free_rcu testing/radix-tree/maple: Hack around kfree_rcu not existing tools/testing: include maple-shim.c in maple.c maple_tree: use percpu sheaves for maple_node_cache mm, vma: use percpu sheaves for vm_area_struct cache tools/testing: Add support for changes to slab for sheaves slab: allow NUMA restricted allocations to use percpu sheaves tools/testing/vma: Implement vm_refcnt reset slab: skip percpu sheaves for remote object freeing ...
2025-10-02Merge tag 'net-next-6.18' of ↵Linus Torvalds1-9/+31
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Paolo Abeni: "Core & protocols: - Improve drop account scalability on NUMA hosts for RAW and UDP sockets and the backlog, almost doubling the Pps capacity under DoS - Optimize the UDP RX performance under stress, reducing contention, revisiting the binary layout of the involved data structs and implementing NUMA-aware locking. This improves UDP RX performance by an additional 50%, even more under extreme conditions - Add support for PSP encryption of TCP connections; this mechanism has some similarities with IPsec and TLS, but offers superior HW offloads capabilities - Ongoing work to support Accurate ECN for TCP. AccECN allows more than one congestion notification signal per RTT and is a building block for Low Latency, Low Loss, and Scalable Throughput (L4S) - Reorganize the TCP socket binary layout for data locality, reducing the number of touched cachelines in the fastpath - Refactor skb deferral free to better scale on large multi-NUMA hosts, this improves TCP and UDP RX performances significantly on such HW - Increase the default socket memory buffer limits from 256K to 4M to better fit modern link speeds - Improve handling of setups with a large number of nexthop, making dump operating scaling linearly and avoiding unneeded synchronize_rcu() on delete - Improve bridge handling of VLAN FDB, storing a single entry per bridge instead of one entry per port; this makes the dump order of magnitude faster on large switches - Restore IP ID correctly for encapsulated packets at GSO segmentation time, allowing GRO to merge packets in more scenarios - Improve netfilter matching performance on large sets - Improve MPTCP receive path performance by leveraging recently introduced core infrastructure (skb deferral free) and adopting recent TCP autotuning changes - Allow bridges to redirect to a backup port when the bridge port is administratively down - Introduce MPTCP 'laminar' endpoint that con be used only once per connection and simplify common MPTCP setups - Add RCU safety to dst->dev, closing a lot of possible races - A significant crypto library API for SCTP, MPTCP and IPv6 SR, reducing code duplication - Supports pulling data from an skb frag into the linear area of an XDP buffer Things we sprinkled into general kernel code: - Generate netlink documentation from YAML using an integrated YAML parser Driver API: - Support using IPv6 Flow Label in Rx hash computation and RSS queue selection - Introduce API for fetching the DMA device for a given queue, allowing TCP zerocopy RX on more H/W setups - Make XDP helpers compatible with unreadable memory, allowing more easily building DevMem-enabled drivers with a unified XDP/skbs datapath - Add a new dedicated ethtool callback enabling drivers to provide the number of RX rings directly, improving efficiency and clarity in RX ring queries and RSS configuration - Introduce a burst period for the health reporter, allowing better handling of multiple errors due to the same root cause - Support for DPLL phase offset exponential moving average, controlling the average smoothing factor Device drivers: - Add a new Huawei driver for 3rd gen NIC (hinic3) - Add a new SpacemiT driver for K1 ethernet MAC - Add a generic abstraction for shared memory communication devices (dibps) - Ethernet high-speed NICs: - nVidia/Mellanox: - Use multiple per-queue doorbell, to avoid MMIO contention issues - support adjacent functions, allowing them to delegate their SR-IOV VFs to sibling PFs - support RSS for IPSec offload - support exposing raw cycle counters in PTP and mlx5 - support for disabling host PFs. - Intel (100G, ice, idpf): - ice: support for SRIOV VFs over an Active-Active link aggregate - ice: support for firmware logging via debugfs - ice: support for Earliest TxTime First (ETF) hardware offload - idpf: support basic XDP functionalities and XSk - Broadcom (bnxt): - support Hyper-V VF ID - dynamic SRIOV resource allocations for RoCE - Meta (fbnic): - support queue API, zero-copy Rx and Tx - support basic XDP functionalities - devlink health support for FW crashes and OTP mem corruptions - expand hardware stats coverage to FEC, PHY, and Pause - Wangxun: - support ethtool coalesce options - support for multiple RSS contexts - Ethernet virtual: - Macsec: - replace custom netlink attribute checks with policy-level checks - Bonding: - support aggregator selection based on port priority - Microsoft vNIC: - use page pool fragments for RX buffers instead of full pages to improve memory efficiency - Ethernet NICs consumer, and embedded: - Qualcomm: support Ethernet function for IPQ9574 SoC - Airoha: implement wlan offloading via NPU - Freescale - enetc: add NETC timer PTP driver and add PTP support - fec: enable the Jumbo frame support for i.MX8QM - Renesas (R-Car S4): - support HW offloading for layer 2 switching - support for RZ/{T2H, N2H} SoCs - Cadence (macb): support TAPRIO traffic scheduling - TI: - support for Gigabit ICSS ethernet SoC (icssm-prueth) - Synopsys (stmmac): a lot of cleanups - Ethernet PHYs: - Support 10g-qxgmi phy-mode for AQR412C, Felix DSA and Lynx PCS driver - Support bcm63268 GPHY power control - Support for Micrel lan8842 PHY and PTP - Support for Aquantia AQR412 and AQR115 - CAN: - a large CAN-XL preparation work - reorganize raw_sock and uniqframe struct to minimize memory usage - rcar_canfd: update the CAN-FD handling - WiFi: - extended Neighbor Awareness Networking (NAN) support - S1G channel representation cleanup - improve S1G support - WiFi drivers: - Intel (iwlwifi): - major refactor and cleanup - Broadcom (brcm80211): - support for AP isolation - RealTek (rtw88/89) rtw88/89: - preparation work for RTL8922DE support - MediaTek (mt76): - HW restart improvements - MLO support - Qualcomm/Atheros (ath10k): - GTK rekey fixes - Bluetooth drivers: - btusb: support for several new IDs for MT7925 - btintel: support for BlazarIW core - btintel_pcie: support for _suspend() / _resume() - btintel_pcie: support for Scorpious, Panther Lake-H484 IDs" * tag 'net-next-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1536 commits) net: stmmac: Add support for Allwinner A523 GMAC200 dt-bindings: net: sun8i-emac: Add A523 GMAC200 compatible Revert "Documentation: net: add flow control guide and document ethtool API" octeontx2-pf: fix bitmap leak octeontx2-vf: fix bitmap leak net/mlx5e: Use extack in set rxfh callback net/mlx5e: Introduce mlx5e_rss_params for RSS configuration net/mlx5e: Introduce mlx5e_rss_init_params net/mlx5e: Remove unused mdev param from RSS indir init net/mlx5: Improve QoS error messages with actual depth values net/mlx5e: Prevent entering switchdev mode with inconsistent netns net/mlx5: HWS, Generalize complex matchers net/mlx5: Improve write-combining test reliability for ARM64 Grace CPUs selftests/net: add tcp_port_share to .gitignore Revert "net/mlx5e: Update and set Xon/Xoff upon MTU set" net: add NUMA awareness to skb_attempt_defer_free() net: use llist for sd->defer_list net: make softnet_data.defer_count an atomic selftests: drv-net: psp: add tests for destroying devices selftests: drv-net: psp: add test for auto-adjusting TCP MSS ...
2025-09-29Merge tag 'arm64-upstream' of ↵Linus Torvalds1-12/+24
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Will Deacon: "There's good stuff across the board, including some nice mm improvements for CPUs with the 'noabort' BBML2 feature and a clever patch to allow ptdump to play nicely with block mappings in the vmalloc area. Confidential computing: - Add support for accepting secrets from firmware (e.g. ACPI CCEL) and mapping them with appropriate attributes. CPU features: - Advertise atomic floating-point instructions to userspace - Extend Spectre workarounds to cover additional Arm CPU variants - Extend list of CPUs that support break-before-make level 2 and guarantee not to generate TLB conflict aborts for changes of mapping granularity (BBML2_NOABORT) - Add GCS support to our uprobes implementation. Documentation: - Remove bogus SME documentation concerning register state when entering/exiting streaming mode. Entry code: - Switch over to the generic IRQ entry code (GENERIC_IRQ_ENTRY) - Micro-optimise syscall entry path with a compiler branch hint. Memory management: - Enable huge mappings in vmalloc space even when kernel page-table dumping is enabled - Tidy up the types used in our early MMU setup code - Rework rodata= for closer parity with the behaviour on x86 - For CPUs implementing BBML2_NOABORT, utilise block mappings in the linear map even when rodata= applies to virtual aliases - Don't re-allocate the virtual region between '_text' and '_stext', as doing so confused tools parsing /proc/vmcore. Miscellaneous: - Clean-up Kconfig menuconfig text for architecture features - Avoid redundant bitmap_empty() during determination of supported SME vector lengths - Re-enable warnings when building the 32-bit vDSO object - Avoid breaking our eggs at the wrong end. Perf and PMUs: - Support for v3 of the Hisilicon L3C PMU - Support for Hisilicon's MN and NoC PMUs - Support for Fujitsu's Uncore PMU - Support for SPE's extended event filtering feature - Preparatory work to enable data source filtering in SPE - Support for multiple lanes in the DWC PCIe PMU - Support for i.MX94 in the IMX DDR PMU driver - MAINTAINERS update (Thank you, Yicong) - Minor driver fixes (PERF_IDX2OFF() overflow, CMN register offsets). Selftests: - Add basic LSFE check to the existing hwcaps test - Support nolibc in GCS tests - Extend SVE ptrace test to pass unsupported regsets and invalid vector lengths - Minor cleanups (typos, cosmetic changes). System registers: - Fix ID_PFR1_EL1 definition - Fix incorrect signedness of some fields in ID_AA64MMFR4_EL1 - Sync TCR_EL1 definition with the latest Arm ARM (L.b) - Be stricter about the input fed into our AWK sysreg generator script - Typo fixes and removal of redundant definitions. ACPI, EFI and PSCI: - Decouple Arm's "Software Delegated Exception Interface" (SDEI) support from the ACPI GHES code so that it can be used by platforms booted with device-tree - Remove unnecessary per-CPU tracking of the FPSIMD state across EFI runtime calls - Fix a node refcount imbalance in the PSCI device-tree code. CPU Features: - Ensure register sanitisation is applied to fields in ID_AA64MMFR4 - Expose AIDR_EL1 to userspace via sysfs, primarily so that KVM guests can reliably query the underlying CPU types from the VMM - Re-enabling of SME support (CONFIG_ARM64_SME) as a result of fixes to our context-switching, signal handling and ptrace code" * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (93 commits) arm64: cpufeature: Remove duplicate asm/mmu.h header arm64: Kconfig: Make CPU_BIG_ENDIAN depend on BROKEN perf/dwc_pcie: Fix use of uninitialized variable arm/syscalls: mark syscall invocation as likely in invoke_syscall Documentation: hisi-pmu: Add introduction to HiSilicon V3 PMU Documentation: hisi-pmu: Fix of minor format error drivers/perf: hisi: Add support for L3C PMU v3 drivers/perf: hisi: Refactor the event configuration of L3C PMU drivers/perf: hisi: Extend the field of tt_core drivers/perf: hisi: Extract the event filter check of L3C PMU drivers/perf: hisi: Simplify the probe process of each L3C PMU version drivers/perf: hisi: Export hisi_uncore_pmu_isr() drivers/perf: hisi: Relax the event ID check in the framework perf: Fujitsu: Add the Uncore PMU driver arm64: map [_text, _stext) virtual address range non-executable+read-only arm64/sysreg: Update TCR_EL1 register arm64: Enable vmalloc-huge with ptdump arm64: cpufeature: add Neoverse-V3AE to BBML2 allow list arm64: errata: Apply workarounds for Neoverse-V3AE arm64: cputype: Add Neoverse-V3AE definitions ...
2025-09-29Merge tag 'vfs-6.18-rc1.writeback' of ↵Linus Torvalds1-0/+5
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs writeback updates from Christian Brauner: "This contains work adressing lockups reported by users when a systemd unit reading lots of files from a filesystem mounted with the lazytime mount option exits. With the lazytime mount option enabled we can be switching many dirty inodes on cgroup exit to the parent cgroup. The numbers observed in practice when systemd slice of a large cron job exits can easily reach hundreds of thousands or millions. The logic in inode_do_switch_wbs() which sorts the inode into appropriate place in b_dirty list of the target wb however has linear complexity in the number of dirty inodes thus overall time complexity of switching all the inodes is quadratic leading to workers being pegged for hours consuming 100% of the CPU and switching inodes to the parent wb. Simple reproducer of the issue: FILES=10000 # Filesystem mounted with lazytime mount option MNT=/mnt/ echo "Creating files and switching timestamps" for (( j = 0; j < 50; j ++ )); do mkdir $MNT/dir$j for (( i = 0; i < $FILES; i++ )); do echo "foo" >$MNT/dir$j/file$i done touch -a -t 202501010000 $MNT/dir$j/file* done wait echo "Syncing and flushing" sync echo 3 >/proc/sys/vm/drop_caches echo "Reading all files from a cgroup" mkdir /sys/fs/cgroup/unified/mycg1 || exit echo $$ >/sys/fs/cgroup/unified/mycg1/cgroup.procs || exit for (( j = 0; j < 50; j ++ )); do cat /mnt/dir$j/file* >/dev/null & done wait echo "Switching wbs" # Now rmdir the cgroup after the script exits This can be solved by: - Avoiding contention on the wb->list_lock when switching inodes by running a single work item per wb and managing a queue of items switching to the wb - Allowing rescheduling when switching inodes over to a different cgroup to avoid softlockups - Maintaining the b_dirty list ordering instead of sorting it" * tag 'vfs-6.18-rc1.writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: writeback: Add tracepoint to track pending inode switches writeback: Avoid excessively long inode switching times writeback: Avoid softlockup when switching many inodes writeback: Avoid contention on wb->list_lock when switching inodes
2025-09-29Merge tag 'vfs-6.18-rc1.misc' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual selections of misc updates for this cycle. Features: - Add "initramfs_options" parameter to set initramfs mount options. This allows to add specific mount options to the rootfs to e.g., limit the memory size - Add RWF_NOSIGNAL flag for pwritev2() Add RWF_NOSIGNAL flag for pwritev2. This flag prevents the SIGPIPE signal from being raised when writing on disconnected pipes or sockets. The flag is handled directly by the pipe filesystem and converted to the existing MSG_NOSIGNAL flag for sockets - Allow to pass pid namespace as procfs mount option Ever since the introduction of pid namespaces, procfs has had very implicit behaviour surrounding them (the pidns used by a procfs mount is auto-selected based on the mounting process's active pidns, and the pidns itself is basically hidden once the mount has been constructed) This implicit behaviour has historically meant that userspace was required to do some special dances in order to configure the pidns of a procfs mount as desired. Examples include: * In order to bypass the mnt_too_revealing() check, Kubernetes creates a procfs mount from an empty pidns so that user namespaced containers can be nested (without this, the nested containers would fail to mount procfs) But this requires forking off a helper process because you cannot just one-shot this using mount(2) * Container runtimes in general need to fork into a container before configuring its mounts, which can lead to security issues in the case of shared-pidns containers (a privileged process in the pidns can interact with your container runtime process) While SUID_DUMP_DISABLE and user namespaces make this less of an issue, the strict need for this due to a minor uAPI wart is kind of unfortunate Things would be much easier if there was a way for userspace to just specify the pidns they want. So this pull request contains changes to implement a new "pidns" argument which can be set using fsconfig(2): fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd); fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0); or classic mount(2) / mount(8): // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid"); Cleanups: - Remove the last references to EXPORT_OP_ASYNC_LOCK - Make file_remove_privs_flags() static - Remove redundant __GFP_NOWARN when GFP_NOWAIT is used - Use try_cmpxchg() in start_dir_add() - Use try_cmpxchg() in sb_init_done_wq() - Replace offsetof() with struct_size() in ioctl_file_dedupe_range() - Remove vfs_ioctl() export - Replace rwlock() with spinlock in epoll code as rwlock causes priority inversion on preempt rt kernels - Make ns_entries in fs/proc/namespaces const - Use a switch() statement() in init_special_inode() just like we do in may_open() - Use struct_size() in dir_add() in the initramfs code - Use str_plural() in rd_load_image() - Replace strcpy() with strscpy() in find_link() - Rename generic_delete_inode() to inode_just_drop() and generic_drop_inode() to inode_generic_drop() - Remove unused arguments from fcntl_{g,s}et_rw_hint() Fixes: - Document @name parameter for name_contains_dotdot() helper - Fix spelling mistake - Always return zero from replace_fd() instead of the file descriptor number - Limit the size for copy_file_range() in compat mode to prevent a signed overflow - Fix debugfs mount options not being applied - Verify the inode mode when loading it from disk in minixfs - Verify the inode mode when loading it from disk in cramfs - Don't trigger automounts with RESOLVE_NO_XDEV If openat2() was called with RESOLVE_NO_XDEV it didn't traverse through automounts, but could still trigger them - Add FL_RECLAIM flag to show_fl_flags() macro so it appears in tracepoints - Fix unused variable warning in rd_load_image() on s390 - Make INITRAMFS_PRESERVE_MTIME depend on BLK_DEV_INITRD - Use ns_capable_noaudit() when determining net sysctl permissions - Don't call path_put() under namespace semaphore in listmount() and statmount()" * tag 'vfs-6.18-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (38 commits) fcntl: trim arguments listmount: don't call path_put() under namespace semaphore statmount: don't call path_put() under namespace semaphore pid: use ns_capable_noaudit() when determining net sysctl permissions fs: rename generic_delete_inode() and generic_drop_inode() init: INITRAMFS_PRESERVE_MTIME should depend on BLK_DEV_INITRD initramfs: Replace strcpy() with strscpy() in find_link() initrd: Use str_plural() in rd_load_image() initramfs: Use struct_size() helper to improve dir_add() initrd: Fix unused variable warning in rd_load_image() on s390 fs: use the switch statement in init_special_inode() fs/proc/namespaces: make ns_entries const filelock: add FL_RECLAIM to show_fl_flags() macro eventpoll: Replace rwlock with spinlock selftests/proc: add tests for new pidns APIs procfs: add "pidns" mount option pidns: move is-ancestor logic to helper openat2: don't trigger automounts with RESOLVE_NO_XDEV namei: move cross-device check to __traverse_mounts namei: remove LOOKUP_NO_XDEV check from handle_mounts ...
2025-09-29Merge series "slab: Re-entrant kmalloc_nolock()"Vlastimil Babka7-74/+527
From the cover letter [1]: This patch set introduces kmalloc_nolock() which is the next logical step towards any context allocation necessary to remove bpf_mem_alloc and get rid of preallocation requirement in BPF infrastructure. In production BPF maps grew to gigabytes in size. Preallocation wastes memory. Alloc from any context addresses this issue for BPF and other subsystems that are forced to preallocate too. This long task started with introduction of alloc_pages_nolock(), then memcg and objcg were converted to operate from any context including NMI, this set completes the task with kmalloc_nolock() that builds on top of alloc_pages_nolock() and memcg changes. After that BPF subsystem will gradually adopt it everywhere. Link: https://lore.kernel.org/all/20250909010007.1660-1-alexei.starovoitov@gmail.com/ [1]
2025-09-29Merge series "SLUB percpu sheaves"Vlastimil Babka4-47/+1737
This series adds an opt-in percpu array-based caching layer to SLUB. It has evolved to a state where kmem caches with sheaves are compatible with all SLUB features (slub_debug, SLUB_TINY, NUMA locality considerations). The plan is therefore that it will be later enabled for all kmem caches and replace the complicated cpu (partial) slabs code. Note the name "sheaf" was invented by Matthew Wilcox so we don't call the arrays magazines like the original Bonwick paper. The per-NUMA-node cache of sheaves is thus called "barn". This caching may seem similar to the arrays we had in SLAB, but there are some important differences: - deals differently with NUMA locality of freed objects, thus there are no per-node "shared" arrays (with possible lock contention) and no "alien" arrays that would need periodical flushing - instead, freeing remote objects (which is rare) bypasses the sheaves - percpu sheaves thus contain only local objects (modulo rare races and local node exhaustion) - NUMA restricted allocations and strict_numa mode is still honoured - improves kfree_rcu() handling by reusing whole sheaves - there is an API for obtaining a preallocated sheaf that can be used for guaranteed and efficient allocations in a restricted context, when the upper bound for needed objects is known but rarely reached - opt-in, not used for every cache (for now) The motivation comes mainly from the ongoing work related to VMA locking scalability and the related maple tree operations. This is why VMA and maple nodes caches are sheaf-enabled in the patchset. A sheaf-enabled cache has the following expected advantages: - Cheaper fast paths. For allocations, instead of local double cmpxchg, thanks to local_trylock() it becomes a preempt_disable() and no atomic operations. Same for freeing, which is otherwise a local double cmpxchg only for short term allocations (so the same slab is still active on the same cpu when freeing the object) and a more costly locked double cmpxchg otherwise. - kfree_rcu() batching and recycling. kfree_rcu() will put objects to a separate percpu sheaf and only submit the whole sheaf to call_rcu() when full. After the grace period, the sheaf can be used for allocations, which is more efficient than freeing and reallocating individual slab objects (even with the batching done by kfree_rcu() implementation itself). In case only some cpus are allowed to handle rcu callbacks, the sheaf can still be made available to other cpus on the same node via the shared barn. The maple_node cache uses kfree_rcu() and thus can benefit from this. Note: this path is currently limited to !PREEMPT_RT - Preallocation support. A prefilled sheaf can be privately borrowed to perform a short term operation that is not allowed to block in the middle and may need to allocate some objects. If an upper bound (worst case) for the number of allocations is known, but only much fewer allocations actually needed on average, borrowing and returning a sheaf is much more efficient then a bulk allocation for the worst case followed by a bulk free of the many unused objects. Maple tree write operations should benefit from this. - Compatibility with slub_debug. When slub_debug is enabled for a cache, we simply don't create the percpu sheaves so that the debugging hooks (at the node partial list slowpaths) are reached as before. The same thing is done for CONFIG_SLUB_TINY. Sheaf preallocation still works by reusing the (ineffective) paths for requests exceeding the cache's sheaf_capacity. This is in line with the existing approach where debugging bypasses the fast paths and SLUB_TINY preferes memory savings over performance. The above is adapted from the cover letter [1], which contains also in-kernel microbenchmark results showing the lower overhead of sheaves. Results from Suren Baghdasaryan [2] using a mmap/munmap microbenchmark also show improvements. Results from Sudarsan Mahendran [3] using will-it-scale show both benefits and regressions, probably due to overall noisiness of those tests. Link: https://lore.kernel.org/all/20250910-slub-percpu-caches-v8-0-ca3099d8352c@suse.cz/ [1] Link: https://lore.kernel.org/all/CAJuCfpEQ%3DRUgcAvRzE5jRrhhFpkm8E2PpBK9e9GhK26ZaJQt%3DQ@mail.gmail.com/ [2] Link: https://lore.kernel.org/all/20250913000935.1021068-1-sudarsanm@google.com/ [3]
2025-09-29slab: Introduce kmalloc_nolock() and kfree_nolock().Alexei Starovoitov5-50/+469
kmalloc_nolock() relies on ability of local_trylock_t to detect the situation when per-cpu kmem_cache is locked. In !PREEMPT_RT local_(try)lock_irqsave(&s->cpu_slab->lock, flags) disables IRQs and marks s->cpu_slab->lock as acquired. local_lock_is_locked(&s->cpu_slab->lock) returns true when slab is in the middle of manipulating per-cpu cache of that specific kmem_cache. kmalloc_nolock() can be called from any context and can re-enter into ___slab_alloc(): kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> NMI -> bpf -> kmalloc_nolock() -> ___slab_alloc(cache_B) or kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> tracepoint/kprobe -> bpf -> kmalloc_nolock() -> ___slab_alloc(cache_B) Hence the caller of ___slab_alloc() checks if &s->cpu_slab->lock can be acquired without a deadlock before invoking the function. If that specific per-cpu kmem_cache is busy the kmalloc_nolock() retries in a different kmalloc bucket. The second attempt will likely succeed, since this cpu locked different kmem_cache. Similarly, in PREEMPT_RT local_lock_is_locked() returns true when per-cpu rt_spin_lock is locked by current _task_. In this case re-entrance into the same kmalloc bucket is unsafe, and kmalloc_nolock() tries a different bucket that is most likely is not locked by the current task. Though it may be locked by a different task it's safe to rt_spin_lock() and sleep on it. Similar to alloc_pages_nolock() the kmalloc_nolock() returns NULL immediately if called from hard irq or NMI in PREEMPT_RT. kfree_nolock() defers freeing to irq_work when local_lock_is_locked() and (in_nmi() or in PREEMPT_RT). SLUB_TINY config doesn't use local_lock_is_locked() and relies on spin_trylock_irqsave(&n->list_lock) to allocate, while kfree_nolock() always defers to irq_work. Note, kfree_nolock() must be called _only_ for objects allocated with kmalloc_nolock(). Debug checks (like kmemleak and kfence) were skipped on allocation, hence obj = kmalloc(); kfree_nolock(obj); will miss kmemleak/kfence book keeping and will cause false positives. large_kmalloc is not supported by either kmalloc_nolock() or kfree_nolock(). Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-09-29slab: Reuse first bit for OBJEXTS_ALLOC_FAILAlexei Starovoitov1-1/+1
Since the combination of valid upper bits in slab->obj_exts with OBJEXTS_ALLOC_FAIL bit can never happen, use OBJEXTS_ALLOC_FAIL == (1ull << 0) as a magic sentinel instead of (1ull << 2) to free up bit 2. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-09-29slab: Make slub local_(try)lock more precise for LOCKDEPAlexei Starovoitov2-0/+21
kmalloc_nolock() can be called from any context the ___slab_alloc() can acquire local_trylock_t (which is rt_spin_lock in PREEMPT_RT) and attempt to acquire a different local_trylock_t while in the same task context. The calling sequence might look like: kmalloc() -> tracepoint -> bpf -> kmalloc_nolock() or more precisely: __lock_acquire+0x12ad/0x2590 lock_acquire+0x133/0x2d0 rt_spin_lock+0x6f/0x250 ___slab_alloc+0xb7/0xec0 kmalloc_nolock_noprof+0x15a/0x430 my_debug_callback+0x20e/0x390 [testmod] ___slab_alloc+0x256/0xec0 __kmalloc_cache_noprof+0xd6/0x3b0 Make LOCKDEP understand that local_trylock_t-s protect different kmem_caches. In order to do that add lock_class_key for each kmem_cache and use that key in local_trylock_t. This stack trace is possible on both PREEMPT_RT and !PREEMPT_RT, but teach lockdep about it only for PREEMPT_RT, since in !PREEMPT_RT the ___slab_alloc() code is using local_trylock_irqsave() when lockdep is on. Note, this patch applies this logic to local_lock_t while the next one converts it to local_trylock_t. Both are mapped to rt_spin_lock in PREEMPT_RT. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-09-29mm: Introduce alloc_frozen_pages_nolock()Alexei Starovoitov2-21/+32
Split alloc_pages_nolock() and introduce alloc_frozen_pages_nolock() to be used by alloc_slab_page(). Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-09-29mm: Allow GFP_ACCOUNT to be used in alloc_pages_nolock().Alexei Starovoitov1-4/+6
Change alloc_pages_nolock() to default to __GFP_COMP when allocating pages, since upcoming reentrant alloc_slab_page() needs __GFP_COMP. Also allow __GFP_ACCOUNT flag to be specified, since most of BPF infra needs __GFP_ACCOUNT except BPF streams. Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-09-29mm, vma: use percpu sheaves for vm_area_struct cacheVlastimil Babka1-0/+1
Create the vm_area_struct cache with percpu sheaves of size 32 to improve its performance. Reviewed-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-09-29slab: allow NUMA restricted allocations to use percpu sheavesVlastimil Babka1-7/+46
Currently allocations asking for a specific node explicitly or via mempolicy in strict_numa node bypass percpu sheaves. Since sheaves contain mostly local objects, we can try allocating from them if the local node happens to be the requested node or allowed by the mempolicy. If we find the object from percpu sheaves is not from the expected node, we skip the sheaves - this should be rare. Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-09-29slab: skip percpu sheaves for remote object freeingVlastimil Babka2-8/+40
Since we don't control the NUMA locality of objects in percpu sheaves, allocations with node restrictions bypass them. Allocations without restrictions may however still expect to get local objects with high probability, and the introduction of sheaves can decrease it due to freed object from a remote node ending up in percpu sheaves. The fraction of such remote frees seems low (5% on an 8-node machine) but it can be expected that some cache or workload specific corner cases exist. We can either conclude that this is not a problem due to the low fraction, or we can make remote frees bypass percpu sheaves and go directly to their slabs. This will make the remote frees more expensive, but if it's only a small fraction, most frees will still benefit from the lower overhead of percpu sheaves. This patch thus makes remote object freeing bypass percpu sheaves, including bulk freeing, and kfree_rcu() via the rcu_free sheaf. However it's not intended to be 100% guarantee that percpu sheaves will only contain local objects. The refill from slabs does not provide that guarantee in the first place, and there might be cpu migrations happening when we need to unlock the local_lock. Avoiding all that could be possible but complicated so we can leave it for later investigation whether it would be worth it. It can be expected that the more selective freeing will itself prevent accumulation of remote objects in percpu sheaves so any such violations would have only short-term effects. Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-09-29slab: determine barn status racily outside of lockVlastimil Babka1-7/+20
The possibility of many barn operations is determined by the current number of full or empty sheaves. Taking the barn->lock just to find out that e.g. there are no empty sheaves results in unnecessary overhead and lock contention. Thus perform these checks outside of the lock with a data_race() annotated variable read and fail quickly without taking the lock. Checks for sheaf availability that racily succeed have to be obviously repeated under the lock for correctness, but we can skip repeating checks if there are too many sheaves on the given list as the limits don't need to be strict. Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-09-29slab: sheaf prefilling for guaranteed allocationsVlastimil Babka1-0/+263
Add functions for efficient guaranteed allocations e.g. in a critical section that cannot sleep, when the exact number of allocations is not known beforehand, but an upper limit can be calculated. kmem_cache_prefill_sheaf() returns a sheaf containing at least given number of objects. kmem_cache_alloc_from_sheaf() will allocate an object from the sheaf and is guaranteed not to fail until depleted. kmem_cache_return_sheaf() is for giving the sheaf back to the slab allocator after the critical section. This will also attempt to refill it to cache's sheaf capacity for better efficiency of sheaves handling, but it's not stricly necessary to succeed. kmem_cache_refill_sheaf() can be used to refill a previously obtained sheaf to requested size. If the current size is sufficient, it does nothing. If the requested size exceeds cache's sheaf_capacity and the sheaf's current capacity, the sheaf will be replaced with a new one, hence the indirect pointer parameter. kmem_cache_sheaf_size() can be used to query the current size. The implementation supports requesting sizes that exceed cache's sheaf_capacity, but it is not efficient - such "oversize" sheaves are allocated fresh in kmem_cache_prefill_sheaf() and flushed and freed immediately by kmem_cache_return_sheaf(). kmem_cache_refill_sheaf() might be especially ineffective when replacing a sheaf with a new one of a larger capacity. It is therefore better to size cache's sheaf_capacity accordingly to make oversize sheaves exceptional. CONFIG_SLUB_STATS counters are added for sheaf prefill and return operations. A prefill or return is considered _fast when it is able to grab or return a percpu spare sheaf (even if the sheaf needs a refill to satisfy the request, as those should amortize over time), and _slow otherwise (when the barn or even sheaf allocation/freeing has to be involved). sheaf_prefill_oversize is provided to determine how many prefills were oversize (counter for oversize returns is not necessary as all oversize refills result in oversize returns). When slub_debug is enabled for a cache with sheaves, no percpu sheaves exist for it, but the prefill functionality is still provided simply by all prefilled sheaves becoming oversize. If percpu sheaves are not created for a cache due to not passing the sheaf_capacity argument on cache creation, the prefills also work through oversize sheaves, but there's a WARN_ON_ONCE() to indicate the omission. Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-09-29maple_tree: remove redundant __GFP_NOWARNQianfeng Rong0-0/+0
Commit 16f5dfbc851b ("gfp: include __GFP_NOWARN in GFP_NOWAIT") made GFP_NOWAIT implicitly include __GFP_NOWARN. Therefore, explicit __GFP_NOWARN combined with GFP_NOWAIT (e.g., `GFP_NOWAIT | __GFP_NOWARN`) is now redundant. Let's clean up these redundant flags across subsystems. No functional changes. Link: https://lkml.kernel.org/r/20250804125657.482109-1-rongqianfeng@vivo.com Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-09-29slab: add sheaf support for batching kfree_rcu() operationsVlastimil Babka3-2/+295
Extend the sheaf infrastructure for more efficient kfree_rcu() handling. For caches with sheaves, on each cpu maintain a rcu_free sheaf in addition to main and spare sheaves. kfree_rcu() operations will try to put objects on this sheaf. Once full, the sheaf is detached and submitted to call_rcu() with a handler that will try to put it in the barn, or flush to slab pages using bulk free, when the barn is full. Then a new empty sheaf must be obtained to put more objects there. It's possible that no free sheaves are available to use for a new rcu_free sheaf, and the allocation in kfree_rcu() context can only use GFP_NOWAIT and thus may fail. In that case, fall back to the existing kfree_rcu() implementation. Expected advantages: - batching the kfree_rcu() operations, that could eventually replace the existing batching - sheaves can be reused for allocations via barn instead of being flushed to slabs, which is more efficient - this includes cases where only some cpus are allowed to process rcu callbacks (CONFIG_RCU_NOCB_CPU) Possible disadvantage: - objects might be waiting for more than their grace period (it is determined by the last object freed into the sheaf), increasing memory usage - but the existing batching does that too. Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny implementation favors smaller memory footprint over performance. Also for now skip the usage of rcu sheaf for CONFIG_PREEMPT_RT as the contexts where kfree_rcu() is called might not be compatible with taking a barn spinlock or a GFP_NOWAIT allocation of a new sheaf taking a spinlock - the current kfree_rcu() implementation avoids doing that. Teach kvfree_rcu_barrier() to flush all rcu_free sheaves from all caches that have them. This is not a cheap operation, but the barrier usage is rare - currently kmem_cache_destroy() or on module unload. Add CONFIG_SLUB_STATS counters free_rcu_sheaf and free_rcu_sheaf_fail to count how many kfree_rcu() used the rcu_free sheaf successfully and how many had to fall back to the existing implementation. Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-09-28mm: swap: check for stable address space before operating on the VMACharan Teja Kalla1-0/+3
It is possible to hit a zero entry while traversing the vmas in unuse_mm() called from swapoff path and accessing it causes the OOPS: Unable to handle kernel NULL pointer dereference at virtual address 0000000000000446--> Loading the memory from offset 0x40 on the XA_ZERO_ENTRY as address. Mem abort info: ESR = 0x0000000096000005 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 FSC = 0x05: level 1 translation fault The issue is manifested from the below race between the fork() on a process and swapoff: fork(dup_mmap()) swapoff(unuse_mm) --------------- ----------------- 1) Identical mtree is built using __mt_dup(). 2) copy_pte_range()--> copy_nonpresent_pte(): The dst mm is added into the mmlist to be visible to the swapoff operation. 3) Fatal signal is sent to the parent process(which is the current during the fork) thus skip the duplication of the vmas and mark the vma range with XA_ZERO_ENTRY as a marker for this process that helps during exit_mmap(). 4) swapoff is tried on the 'mm' added to the 'mmlist' as part of the 2. 5) unuse_mm(), that iterates through the vma's of this 'mm' will hit the non-NULL zero entry and operating on this zero entry as a vma is resulting into the oops. The proper fix would be around not exposing this partially-valid tree to others when droping the mmap lock, which is being solved with [1]. A simpler solution would be checking for MMF_UNSTABLE, as it is set if mm_struct is not fully initialized in dup_mmap(). Thanks to Liam/Lorenzo/David for all the suggestions in fixing this issue. Link: https://lkml.kernel.org/r/20250924181138.1762750-1-charan.kalla@oss.qualcomm.com Link: https://lore.kernel.org/all/20250815191031.3769540-1-Liam.Howlett@oracle.com/ [1] Fixes: d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()") Signed-off-by: Charan Teja Kalla <charan.kalla@oss.qualcomm.com> Suggested-by: David Hildenbrand <david@redhat.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-28mm/khugepaged: use start_addr/addr for improved readabilityWei Yang1-21/+22
When collapsing a pmd, there are two address in use: * address points to the start of pmd * address points to each individual page Current naming makes it difficult to distinguish these two and is hence error prone. Considering the plan to collapse mTHP, name the first one `start_addr' and the second `addr' for better readability and consistency. Link: https://lkml.kernel.org/r/20250922140938.27343-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Suggested-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Nico Pache <npache@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-28alloc_tag: fix boot failure due to NULL pointer dereferenceRan Xiaokai1-9/+9
There is a boot failure when both CONFIG_DEBUG_KMEMLEAK and CONFIG_MEM_ALLOC_PROFILING are enabled. BUG: kernel NULL pointer dereference, address: 0000000000000000 RIP: 0010:__alloc_tagging_slab_alloc_hook+0x181/0x2f0 Call Trace: kmem_cache_alloc_noprof+0x1c8/0x5c0 __alloc_object+0x2f/0x290 __create_object+0x22/0x80 kmemleak_init+0x122/0x190 mm_core_init+0xb6/0x160 start_kernel+0x39f/0x920 x86_64_start_reservations+0x18/0x30 x86_64_start_kernel+0x104/0x120 common_startup_64+0x12c/0x138 In kmemleak, mem_pool_alloc() directly calls kmem_cache_alloc_noprof(), as a result, current->alloc_tag is NULL, leading to a null pointer dereference. Move the checks for SLAB_NO_OBJ_EXT, SLAB_NOLEAKTRACE, and __GFP_NO_OBJ_EXT to the parent function __alloc_tagging_slab_alloc_hook() to fix this. Also this distinguishes the SLAB_NOLEAKTRACE case between the actual memory allocation failures case, make CODETAG_FLAG_INACCURATE more accurate. Link: https://lkml.kernel.org/r/20250926080659.741991-1-ranxiaokai627@163.com Fixes: b9e2f58ffb84 ("alloc_tag: mark inaccurate allocation counters in /proc/allocinfo output") Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Christoph Lameter (Ampere) <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-28mm/memory-failure: don't select MEMORY_ISOLATIONXie Yuanbin1-1/+0
We added that "select MEMORY_ISOLATION" in commit ee6f509c3274 ("mm: factor out memory isolate functions"). However, in commit add05cecef80 ("mm: soft-offline: don't free target page in successful page migration") we remove the need for it, where we removed the calls to set_migratetype_isolate() etc. What CONFIG_MEMORY_FAILURE soft-offline support wants is migrate_pages() support. But that comes with CONFIG_MIGRATION. And isolate_folio_to_list() has nothing to do with CONFIG_MEMORY_ISOLATION. Therefore, we can remove "select MEMORY_ISOLATION" of MEMORY_FAILURE. Link: https://lkml.kernel.org/r/20250922143618.48640-1-xieyuanbin1@huawei.com Signed-off-by: Xie Yuanbin <xieyuanbin1@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-28mm/khugepaged: remove definition of struct khugepaged_mm_slotWei Yang1-37/+21
Current code is not correct to get struct khugepaged_mm_slot by mm_slot_entry() without checking mm_slot is !NULL. There is no problem reported since slot is the first element of struct khugepaged_mm_slot. While struct khugepaged_mm_slot is just a wrapper of struct mm_slot, there is no need to define it. Remove the definition of struct khugepaged_mm_slot, so there is not chance to miss use mm_slot_entry(). [richard.weiyang@gmail.com: fix use-after-free crash] Link: https://lkml.kernel.org/r/20250922002834.vz6ntj36e75ehkyp@master Link: https://lkml.kernel.org/r/20250919071244.17020-3-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Kiryl Shutsemau <kirill@shutemov.name> Cc: xu xin <xu.xin16@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-28mm/ksm: get mm_slot by mm_slot_entry() when slot is !NULLWei Yang1-9/+11
Patch series "mm_slot: fix the usage of mm_slot_entry", v2. When using mm_slot in ksm, there is code like: slot = mm_slot_lookup(mm_slots_hash, mm); mm_slot = mm_slot_entry(slot, struct ksm_mm_slot, slot); if (mm_slot && ..) { } The mm_slot_entry() won't return a valid value if slot is NULL generally. But currently it works since slot is the first element of struct ksm_mm_slot. To reduce the ambiguity and make it robust, access mm_slot_entry() when slot is !NULL. Link: https://lkml.kernel.org/r/20250919071244.17020-1-richard.weiyang@gmail.com Link: https://lkml.kernel.org/r/20250919071244.17020-2-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Cc: Kiryl Shutsemau <kirill@shutemov.name> Cc: xu xin <xu.xin16@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-28hugetlb: increase number of reserving hugepages via cmdlineLi Zhe1-1/+8
Commit 79359d6d24df ("hugetlb: perform vmemmap optimization on a list of pages") batches the submission of HugeTLB vmemmap optimization (HVO) during hugepage reservation. With HVO enabled, hugepages obtained from the buddy allocator are not submitted for optimization and their struct-page memory is therefore not released—until the entire reservation request has been satisfied. As a result, any struct-page memory freed in the course of the allocation cannot be reused for the ongoing reservation, artificially limiting the number of huge pages that can ultimately be provided. As commit b1222550fbf7 ("mm/hugetlb: do pre-HVO for bootmem allocated pages") already applies early HVO to bootmem-allocated huge pages, this patch extends the same benefit to non-bootmem pages by incrementally submitting them for HVO as they are allocated, thereby returning struct-page memory to the buddy allocator in real time. The change raises the maximum 2 MiB hugepage reservation from just under 376 GB to more than 381 GB on a 384 GB x86 VM. Link: https://lkml.kernel.org/r/20250919092353.41671-1-lizhe.67@bytedance.com Signed-off-by: Li Zhe <lizhe.67@bytedance.com> Cc: David Hildenbrand <david@redhat.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-28mm: remove PMD alignment constraint in execmem_vmalloc()Dev Jain1-3/+0
When using vmalloc with VM_ALLOW_HUGE_VMAP flag, it will set the alignment to PMD_SIZE internally, if it deems huge mappings to be eligible. Therefore, setting the alignment in execmem_vmalloc is redundant. Apart from this, it also reduces the probability of allocation in case vmalloc fails to allocate hugepages - in the fallback case, vmalloc tries to use the original alignment and allocate basepages, which unfortunately will again be PMD_SIZE passed over from execmem_vmalloc, thus constraining the search for a free space in vmalloc region. Therefore, remove this constraint. Link: https://lkml.kernel.org/r/20250918093453.75676-1-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-28mm/memory_hotplug: fix typo 'esecially' -> 'especially'Manish Kumar1-1/+1
Link: https://lkml.kernel.org/r/20250918174528.90879-1-manish1588@gmail.com Signed-off-by: Manish Kumar <manish1588@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-28mm/rmap: improve mlock tracking for large foliosKiryl Shutsemau1-7/+12
The kernel currently does not mlock large folios when adding them to rmap, stating that it is difficult to confirm that the folio is fully mapped and safe to mlock it. This leads to a significant undercount of Mlocked in /proc/meminfo, causing problems in production where the stat was used to estimate system utilization and determine if load shedding is required. However, nowadays the caller passes a number of pages of the folio that are getting mapped, making it easy to check if the entire folio is mapped to the VMA. mlock the folio on rmap if it is fully mapped to the VMA. Mlocked in /proc/meminfo can still undercount, but the value is closer the truth and is useful for userspace. Link: https://lkml.kernel.org/r/20250923110711.690639-7-kirill@shutemov.name Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-28mm/filemap: map entire large folio faultaroundKiryl Shutsemau1-0/+15
Currently, kernel only maps part of large folio that fits into start_pgoff/end_pgoff range. Map entire folio where possible. It will match finish_fault() behaviour that user hits on cold page cache. Mapping large folios at once will allow the rmap code to mlock it on add, as it will recognize that it is fully mapped and mlocking is safe. Link: https://lkml.kernel.org/r/20250923110711.690639-6-kirill@shutemov.name Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-28mm/fault: try to map the entire file folio in finish_fault()Kiryl Shutsemau1-7/+2
finish_fault() uses per-page fault for file folios. This only occurs for file folios smaller than PMD_SIZE. The comment suggests that this approach prevents RSS inflation. However, it only prevents RSS accounting. The folio is still mapped to the process, and the fact that it is mapped by a single PTE does not affect memory pressure. Additionally, the kernel's ability to map large folios as PMD if they are large enough does not support this argument. When possible, map large folios in one shot. This reduces the number of minor page faults and allows for TLB coalescing. Mapping large folios at once will allow the rmap code to mlock it on add, as it will recognize that it is fully mapped and mlocking is safe. Link: https://lkml.kernel.org/r/20250923110711.690639-5-kirill@shutemov.name Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-28mm/rmap: mlock large folios in try_to_unmap_one()Kiryl Shutsemau1-3/+28
Currently, try_to_unmap_once() only tries to mlock small folios. Use logic similar to folio_referenced_one() to mlock large folios: only do this for fully mapped folios and under page table lock that protects all page table entries. [akpm@linux-foundation.org: s/CROSSSED/CROSSED/] Link: https://lkml.kernel.org/r/20250923110711.690639-4-kirill@shutemov.name Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-28mm/rmap: fix a mlock race condition in folio_referenced_one()Kiryl Shutsemau1-37/+20
The mlock_vma_folio() function requires the page table lock to be held in order to safely mlock the folio. However, folio_referenced_one() mlocks a large folios outside of the page_vma_mapped_walk() loop where the page table lock has already been dropped. Rework the mlock logic to use the same code path inside the loop for both large and small folios. Use PVMW_PGTABLE_CROSSED to detect when the folio is mapped across a page table boundary. [akpm@linux-foundation.org: s/CROSSSED/CROSSED/] Link: https://lkml.kernel.org/r/20250923110711.690639-3-kirill@shutemov.name Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-28mm/page_vma_mapped: track if the page is mapped across page table boundaryKiryl Shutsemau1-0/+1
Patch series "mm: Improve mlock tracking for large folios", v3. The patchset includes several fixes and improvements related to mlock tracking of large folios. The main objective is to reduce the undercount of Mlocked memory in /proc/meminfo and improve the accuracy of the statistics. Patches 1-2: These patches address a minor race condition in folio_referenced_one() related to mlock_vma_folio(). Currently, mlock_vma_folio() is called on large folio without the page table lock, which can result in a race condition with unmap (i.e. MADV_DONTNEED). This can lead to partially mapped folios on the unevictable LRU list. While not a significant issue, I do not believe backporting is necessary. Patch 3: This patch adds mlocking logic similar to folio_referenced_one() to try_to_unmap_one(), allowing for mlocking of large folios where possible. Patch 4-5: These patches modifies finish_fault() and faultaround to map in the entire folio when possible, enabling efficient mlocking upon addition to the rmap. Patch 6: This patch makes rmap mlock large folios if they are fully mapped, addressing the primary source of mlock undercount for large folios. This patch (of 6): Add a PVMW_PGTABLE_CROSSSED flag that page_vma_mapped_walk() will set if the page is mapped across page table boundary. Unlike other PVMW_* flags, this one is result of page_vma_mapped_walk() and not set by the caller. folio_referenced_one() will use it to detect if it safe to mlock the folio. [akpm@linux-foundation.org: s/CROSSSED/CROSSED/] Link: https://lkml.kernel.org/r/20250923110711.690639-1-kirill@shutemov.name Link: https://lkml.kernel.org/r/20250923110711.690639-2-kirill@shutemov.name Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-28mm/compaction: fix low_pfn advance on isolating hugetlbWei Yang1-1/+1
Commit 56ae0bb349b4 ("mm: compaction: convert to use a folio in isolate_migratepages_block()") converts api from page to folio. But the low_pfn advance for hugetlb page seems wrong when low_pfn doesn't point to head page. Originally, if page is a hugetlb tail page, compound_nr() return 1, which means low_pfn only advance one in next iteration. After the change, low_pfn would advance more than the hugetlb range, since folio_nr_pages() always return total number of the large page. This results in skipping some range to isolate and then to migrate. The worst case for alloc_contig is it does all the isolation and migration, but finally find some range is still not isolated. And then undo all the work and try a new range. Advance low_pfn to the end of hugetlb. Link: https://lkml.kernel.org/r/20250910092240.3981-1-richard.weiyang@gmail.com Fixes: 56ae0bb349b4 ("mm: compaction: convert to use a folio in isolate_migratepages_block()") Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: Zi Yan <ziy@nvidia.com> Cc: "Vishal Moola (Oracle)" <vishal.moola@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-28Merge tag 'mm-hotfixes-stable-2025-09-27-22-35' of ↵Linus Torvalds4-14/+31
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "7 hotfixes. 4 are cc:stable and the remainder address post-6.16 issues or aren't considered necessary for -stable kernels. 6 of these fixes are for MM. All singletons, please see the changelogs for details" * tag 'mm-hotfixes-stable-2025-09-27-22-35' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: include/linux/pgtable.h: convert arch_enter_lazy_mmu_mode() and friends to static inlines mm/damon/sysfs: do not ignore callback's return value in damon_sysfs_damon_call() mailmap: add entry for Bence Csókás fs/proc/task_mmu: check p->vec_buf for NULL kmsan: fix out-of-bounds access to shadow memory mm/hugetlb: fix copy_hugetlb_page_range() to use ->pt_share_count mm/hugetlb: fix folio is still mapped when deleted
2025-09-26slab: add opt-in caching layer of percpu sheavesVlastimil Babka3-46/+1097
Specifying a non-zero value for a new struct kmem_cache_args field sheaf_capacity will setup a caching layer of percpu arrays called sheaves of given capacity for the created cache. Allocations from the cache will allocate via the percpu sheaves (main or spare) as long as they have no NUMA node preference. Frees will also put the object back into one of the sheaves. When both percpu sheaves are found empty during an allocation, an empty sheaf may be replaced with a full one from the per-node barn. If none are available and the allocation is allowed to block, an empty sheaf is refilled from slab(s) by an internal bulk alloc operation. When both percpu sheaves are full during freeing, the barn can replace a full one with an empty one, unless over a full sheaves limit. In that case a sheaf is flushed to slab(s) by an internal bulk free operation. Flushing sheaves and barns is also wired to the existing cpu flushing and cache shrinking operations. The sheaves do not distinguish NUMA locality of the cached objects. If an allocation is requested with kmem_cache_alloc_node() (or a mempolicy with strict_numa mode enabled) with a specific node (not NUMA_NO_NODE), the sheaves are bypassed. The bulk operations exposed to slab users also try to utilize the sheaves as long as the necessary (full or empty) sheaves are available on the cpu or in the barn. Once depleted, they will fallback to bulk alloc/free to slabs directly to avoid double copying. The sheaf_capacity value is exported in sysfs for observability. Sysfs CONFIG_SLUB_STATS counters alloc_cpu_sheaf and free_cpu_sheaf count objects allocated or freed using the sheaves (and thus not counting towards the other alloc/free path counters). Counters sheaf_refill and sheaf_flush count objects filled or flushed from or to slab pages, and can be used to assess how effective the caching is. The refill and flush operations will also count towards the usual alloc_fastpath/slowpath, free_fastpath/slowpath and other counters for the backing slabs. For barn operations, barn_get and barn_put count how many full sheaves were get from or put to the barn, the _fail variants count how many such requests could not be satisfied mainly because the barn was either empty or full. While the barn also holds empty sheaves to make some operations easier, these are not as critical to mandate own counters. Finally, there are sheaf_alloc/sheaf_free counters. Access to the percpu sheaves is protected by local_trylock() when potential callers include irq context, and local_lock() otherwise (such as when we already know the gfp flags allow blocking). The trylock failures should be rare and we can easily fallback. Each per-NUMA-node barn has a spin_lock. When slub_debug is enabled for a cache with sheaf_capacity also specified, the latter is ignored so that allocations and frees reach the slow path where debugging hooks are processed. Similarly, we ignore it with CONFIG_SLUB_TINY which prefers low memory usage to performance. [boot failure: https://lore.kernel.org/all/583eacf5-c971-451a-9f76-fed0e341b815@linux.ibm.com/ ] Reported-and-tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-09-26slab: simplify init_kmem_cache_nodes() error handlingVlastimil Babka1-3/+1
We don't need to call free_kmem_cache_nodes() immediately when failing to allocate a kmem_cache_node, because when we return 0, do_kmem_cache_create() calls __kmem_cache_release() which also performs free_kmem_cache_nodes(). Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-09-25mm/damon/sysfs: do not ignore callback's return value in ↵Akinobu Mita1-1/+3
damon_sysfs_damon_call() The callback return value is ignored in damon_sysfs_damon_call(), which means that it is not possible to detect invalid user input when writing commands such as 'commit' to /sys/kernel/mm/damon/admin/kdamonds/<K>/state. Fix it. Link: https://lkml.kernel.org/r/20250920132546.5822-1-akinobu.mita@gmail.com Fixes: f64539dcdb87 ("mm/damon/sysfs: use damon_call() for update_schemes_stats") Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> [6.14+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-25kmsan: fix out-of-bounds access to shadow memoryEric Biggers2-3/+23
Running sha224_kunit on a KMSAN-enabled kernel results in a crash in kmsan_internal_set_shadow_origin(): BUG: unable to handle page fault for address: ffffbc3840291000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 1810067 P4D 1810067 PUD 192d067 PMD 3c17067 PTE 0 Oops: 0000 [#1] SMP NOPTI CPU: 0 UID: 0 PID: 81 Comm: kunit_try_catch Tainted: G N 6.17.0-rc3 #10 PREEMPT(voluntary) Tainted: [N]=TEST Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014 RIP: 0010:kmsan_internal_set_shadow_origin+0x91/0x100 [...] Call Trace: <TASK> __msan_memset+0xee/0x1a0 sha224_final+0x9e/0x350 test_hash_buffer_overruns+0x46f/0x5f0 ? kmsan_get_shadow_origin_ptr+0x46/0xa0 ? __pfx_test_hash_buffer_overruns+0x10/0x10 kunit_try_run_case+0x198/0xa00 This occurs when memset() is called on a buffer that is not 4-byte aligned and extends to the end of a guard page, i.e. the next page is unmapped. The bug is that the loop at the end of kmsan_internal_set_shadow_origin() accesses the wrong shadow memory bytes when the address is not 4-byte aligned. Since each 4 bytes are associated with an origin, it rounds the address and size so that it can access all the origins that contain the buffer. However, when it checks the corresponding shadow bytes for a particular origin, it incorrectly uses the original unrounded shadow address. This results in reads from shadow memory beyond the end of the buffer's shadow memory, which crashes when that memory is not mapped. To fix this, correctly align the shadow address before accessing the 4 shadow bytes corresponding to each origin. Link: https://lkml.kernel.org/r/20250911195858.394235-1-ebiggers@kernel.org Fixes: 2ef3cec44c60 ("kmsan: do not wipe out origin when doing partial unpoisoning") Signed-off-by: Eric Biggers <ebiggers@kernel.org> Tested-by: Alexander Potapenko <glider@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-25mm/hugetlb: fix copy_hugetlb_page_range() to use ->pt_share_countJane Chu1-10/+5
commit 59d9094df3d79 ("mm: hugetlb: independent PMD page table shared count") introduced ->pt_share_count dedicated to hugetlb PMD share count tracking, but omitted fixing copy_hugetlb_page_range(), leaving the function relying on page_count() for tracking that no longer works. When lazy page table copy for hugetlb is disabled, that is, revert commit bcd51a3c679d ("hugetlb: lazy page table copies in fork()") fork()'ing with hugetlb PMD sharing quickly lockup - [ 239.446559] watchdog: BUG: soft lockup - CPU#75 stuck for 27s! [ 239.446611] RIP: 0010:native_queued_spin_lock_slowpath+0x7e/0x2e0 [ 239.446631] Call Trace: [ 239.446633] <TASK> [ 239.446636] _raw_spin_lock+0x3f/0x60 [ 239.446639] copy_hugetlb_page_range+0x258/0xb50 [ 239.446645] copy_page_range+0x22b/0x2c0 [ 239.446651] dup_mmap+0x3e2/0x770 [ 239.446654] dup_mm.constprop.0+0x5e/0x230 [ 239.446657] copy_process+0xd17/0x1760 [ 239.446660] kernel_clone+0xc0/0x3e0 [ 239.446661] __do_sys_clone+0x65/0xa0 [ 239.446664] do_syscall_64+0x82/0x930 [ 239.446668] ? count_memcg_events+0xd2/0x190 [ 239.446671] ? syscall_trace_enter+0x14e/0x1f0 [ 239.446676] ? syscall_exit_work+0x118/0x150 [ 239.446677] ? arch_exit_to_user_mode_prepare.constprop.0+0x9/0xb0 [ 239.446681] ? clear_bhb_loop+0x30/0x80 [ 239.446684] ? clear_bhb_loop+0x30/0x80 [ 239.446686] entry_SYSCALL_64_after_hwframe+0x76/0x7e There are two options to resolve the potential latent issue: 1. warn against PMD sharing in copy_hugetlb_page_range(), 2. fix it. This patch opts for the second option. While at it, simplify the comment, the details are not actually relevant anymore. Link: https://lkml.kernel.org/r/20250916004520.1604530-1-jane.chu@oracle.com Fixes: 59d9094df3d7 ("mm: hugetlb: independent PMD page table shared count") Signed-off-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Liu Shixin <liushixin2@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-23mm/damon/sysfs: set damon_ctx->min_sz_region only for paddr use caseSeongJae Park1-1/+4
damon_ctx->addr_unit is respected only for physical address space monitoring use case. Meanwhile, damon_ctx->min_sz_region is used by the core layer for aligning regions, regardless of whether it is set for physical address space monitoring or virtual address spaces monitoring. And it is set as 'DAMON_MIN_REGION / damon_ctx->addr_unit'. Hence, if user sets ->addr_unit on virtual address spaces monitoring mode, regions can be unexpectedly aligned in <PAGE_SIZE granularity. It shouldn't cause crash-like issues but make monitoring and DAMOS behavior difficult to understand. Fix the unexpected behavior by setting ->min_sz_region only when it is configured for physical address space monitoring. The issue was found from a result of Chris' experiments that thankfully shared with me off-list. Link: https://lkml.kernel.org/r/20250917160041.53187-1-sj@kernel.org Fixes: d8f867fa0825 ("mm/damon: add damon_ctx->min_sz_region") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: ze zuo <zuoze1@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-23mm/vmalloc: move resched point into alloc_vmap_area()Uladzislau Rezki (Sony)1-2/+6
Currently vm_area_alloc_pages() contains two cond_resched() points. However, the page allocator already has its own in slow path so an extra resched is not optimal because it delays the loops. The place where CPU time can be consumed is in the VA-space search in alloc_vmap_area(), especially if the space is really fragmented using synthetic stress tests, after a fast path falls back to a slow one. Move a single cond_resched() there, after dropping free_vmap_area_lock in a slow path. This keeps fairness where it matters while removing redundant yields from the page-allocation path. [akpm@linux-foundation.org: tweak comment grammar] Link: https://lkml.kernel.org/r/20250917185906.1595454-1-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-23ksm: use a folio inside cmp_and_merge_page()Matthew Wilcox (Oracle)1-9/+6
This removes the last call to page_stable_node(), so delete the wrapper. It also removes a call to trylock_page() and saves a call to compound_head(), as well as removing a reference to folio->page. Link: https://lkml.kernel.org/r/20250916181219.2400258-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Cc: Longlong Xia <xialonglong@kylinos.cn> Cc: xu xin <xu.xin16@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-23mm: page_alloc: avoid kswapd thrashing due to NUMA restrictionsJohannes Weiner1-0/+24
On NUMA systems without bindings, allocations check all nodes for free space, then wake up the kswapds on all nodes and retry. This ensures all available space is evenly used before reclaim begins. However, when one process or certain allocations have node restrictions, they can cause kswapds on only a subset of nodes to be woken up. Since kswapd hysteresis targets watermarks that are *higher* than needed for allocation, even *unrestricted* allocations can now get suckered onto such nodes that are already pressured. This ends up concentrating all allocations on them, even when there are idle nodes available for the unrestricted requests. This was observed with two numa nodes, where node0 is normal and node1 is ZONE_MOVABLE to facilitate hotplugging: a kernel allocation wakes kswapd on node0 only (since node1 is not eligible); once kswapd0 is active, the watermarks hover between low and high, and then even the movable allocations end up on node0, only to be kicked out again; meanwhile node1 is empty and idle. Similar behavior is possible when a process with NUMA bindings is causing selective kswapd wakeups. To fix this, on NUMA systems augment the (misleading) watermark test with a check for whether kswapd is already active during the first iteration through the zonelist. If this fails to place the request, kswapd must be running everywhere already, and the watermark test is good enough to decide placement. With this patch, unrestricted requests successfully make use of node1, even while kswapd is reclaiming node0 for restricted allocations. [gourry@gourry.net: don't retry if no kswapds were active] Link: https://lkml.kernel.org/r/20250919162134.1098208-1-hannes@cmpxchg.org Signed-off-by: Gregory Price <gourry@gourry.net> Tested-by: Joshua Hahn <joshua.hahnjy@gmail.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-23mm/oom_kill.c: fix inverted checkLorenzo Stoakes1-1/+1
Fix an incorrect logic conversion in process_mrelease(). Link: https://lkml.kernel.org/r/3b7f0faf-4dbc-4d67-8a71-752fbcdf0906@lucifer.local Fixes: 12e423ba4eae ("mm: convert core mm to mm_flags_*() accessors") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reported-by: Chris Mason <clm@meta.com> Closes: https://lkml.kernel.org/r/c2e28e27-d84b-4671-8784-de5fe0d14f41@lucifer.local Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-23mm/khugepaged: do not fail collapse_pte_mapped_thp() on SCAN_PMD_NULLKiryl Shutsemau1-1/+19
MADV_COLLAPSE on a file mapping behaves inconsistently depending on if PMD page table is installed or not. Consider following example: p = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); err = madvise(p, 2UL << 20, MADV_COLLAPSE); fd is a populated tmpfs file. The result depends on the address that the kernel returns on mmap(). If it is located in an existing PMD table, the madvise() will succeed. However, if the table does not exist, it will fail with -EINVAL. This occurs because find_pmd_or_thp_or_none() returns SCAN_PMD_NULL when a page table is missing, which causes collapse_pte_mapped_thp() to fail. SCAN_PMD_NULL and SCAN_PMD_NONE should be treated the same in collapse_pte_mapped_thp(): install the PMD leaf entry and allocate page tables as needed. Link: https://lkml.kernel.org/r/v5ivpub6z2n2uyemlnxgbilzs52ep4lrary7lm7o6axxoneb75@yfacfl5rkzeh Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Cc: Barry Song <baohua@kernel.org> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-23filemap: Add a version of folio_end_writeback that ignores dropbehindTrond Myklebust1-6/+23
Filesystems such as NFS may need to defer dropbehind until after their 2-stage writes are done. This adds a helper folio_end_writeback_no_dropbehind() that allows them to release the writeback flag without immediately dropping the folio. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-09-23filemap: Add a helper for filesystems implementing dropbehindTrond Myklebust1-2/+3
Add a helper to allow filesystems to attempt to free the 'dropbehind' folio. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Link: https://lore.kernel.org/all/5588a06f6d5a2cf6746828e2d36e7ada668b1739.1745381692.git.trond.myklebust@hammerspace.com/ Reviewed-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-09-22mm: do not assume file == vma->vm_file in compat_vma_mmap_prepare()Lorenzo Stoakes1-23/+39
In commit bb666b7c2707 ("mm: add mmap_prepare() compatibility layer for nested file systems") we introduced the ability for stacked drivers and file systems to correctly invoke the f_op->mmap_prepare() handler from an f_op->mmap() handler via a compatibility layer implemented in compat_vma_mmap_prepare(). This populates vm_area_desc fields according to those found in the (not yet fully initialised) VMA passed to f_op->mmap(). However this function implicitly assumes that the struct file which we are operating upon is equal to vma->vm_file. This is not a safe assumption in all cases. The only really sane situation in which this matters would be something like e.g. i915_gem_dmabuf_mmap() which invokes vfs_mmap() against obj->base.filp: ret = vfs_mmap(obj->base.filp, vma); if (ret) return ret; And then sets the VMA's file to this, should the mmap operation succeed: vma_set_file(vma, obj->base.filp); That is - it is the file that is intended to back the VMA mapping. This is not an issue currently, as so far we have only implemented f_op->mmap_prepare() handlers for some file systems and internal mm uses, and the only stacked f_op->mmap() operations that can be performed upon these are those in backing_file_mmap() and coda_file_mmap(), both of which use vma->vm_file. However, moving forward, as we convert drivers to using f_op->mmap_prepare(), this will become a problem. Resolve this issue by explicitly setting desc->file to the provided file parameter and update callers accordingly. Callers are expected to read desc->file and update desc->vm_file - the former will be the file provided by the caller (if stacked, this may differ from vma->vm_file). If the caller needs to differentiate between the two they therefore now can. While we are here, also provide a variant of compat_vma_mmap_prepare() that operates against a pointer to any file_operations struct and does not assume that the file_operations struct we are interested in is file->f_op. This function is __compat_vma_mmap_prepare() and we invoke it from compat_vma_mmap_prepare() so that we share code between the two functions. This is important, because some drivers provide hooks in a separate struct, for instance struct drm_device provides an fops field for this purpose. Also update the VMA selftests accordingly. Link: https://lkml.kernel.org/r/dd0c72df8a33e8ffaa243eeb9b01010b670610e9.1756920635.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: David Hildenbrand <david@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-22mm: specify separate file and vm_file params in vm_area_descLorenzo Stoakes5-31/+22
Patch series "mm: do not assume file == vma->vm_file in compat_vma_mmap_prepare()", v2. As part of the efforts to eliminate the problematic f_op->mmap callback, a new callback - f_op->mmap_prepare was provided. While we are converting these callbacks, we must deal with 'stacked' filesystems and drivers - those which in their own f_op->mmap callback invoke an inner f_op->mmap callback. To accomodate for this, a compatibility layer is provided that, via vfs_mmap(), detects if f_op->mmap_prepare is provided and if so, generates a vm_area_desc containing the VMA's metadata and invokes the call. So far, we have provided desc->file equal to vma->vm_file. However this is not necessarily valid, especially in the case of stacked drivers which wish to assign a new file after the inner hook is invoked. To account for this, we adjust vm_area_desc to have both file and vm_file fields. The .vm_file field is strictly set to vma->vm_file (or in the case of a new mapping, what will become vma->vm_file). However, .file is set to whichever file vfs_mmap() is invoked with when using the compatibilty layer. Therefore, if the VMA's file needs to be updated in .mmap_prepare, desc->vm_file should be assigned, whilst desc->file should be read. No current f_op->mmap_prepare users assign desc->file so this is safe to do. This makes the .mmap_prepare callback in the context of a stacked filesystem or driver completely consistent with the existing .mmap implementations. While we're here, we do a few small cleanups, and ensure that we const-ify things correctly in the vm_area_desc struct to avoid hooks accidentally trying to assign fields they should not. This patch (of 2): Stacked filesystems and drivers may invoke mmap hooks with a struct file pointer that differs from the overlying file. We will make this functionality possible in a subsequent patch. In order to prepare for this, let's update vm_area_struct to separately provide desc->file and desc->vm_file parameters. The desc->file parameter is the file that the hook is expected to operate upon, and is not assignable (though the hok may wish to e.g. update the file's accessed time for instance). The desc->vm_file defaults to what will become vma->vm_file and is what the hook must reassign should it wish to change the VMA"s vma->vm_file. For now we keep desc->file, vm_file the same to remain consistent. No f_op->mmap_prepare() callback sets a new vma->vm_file currently, so this is safe to change. While we're here, make the mm_struct desc->mm pointers at immutable as well as the desc->mm field itself. As part of this change, also update the single hook which this would otherwise break - mlock_future_ok(), invoked by secretmem_mmap_prepare()). We additionally update set_vma_from_desc() to compare fields in a more logical fashion, checking the (possibly) user-modified fields as the first operand against the existing value as the second one. Additionally, update VMA tests to accommodate changes. Link: https://lkml.kernel.org/r/cover.1756920635.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/3fa15a861bb7419f033d22970598aa61850ea267.1756920635.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Pedro Falcato <pfalcato@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm: drop all references of writable and SCAN_PAGE_RODev Jain1-11/+3
Now that all actionable outcomes from checking pte_write() are gone, drop the related references. Link: https://lkml.kernel.org/r/20250908075028.38431-3-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm: enable khugepaged anonymous collapse on non-writable regionsDev Jain1-7/+2
Patch series "Expand scope of khugepaged anonymous collapse", v2. Currently khugepaged does not collapse an anonymous region which does not have a single writable pte. This is wasteful since a region mapped with non-writable ptes, for example, non-writable VMAs mapped by the application, won't benefit from THP collapse. An additional consequence of this constraint is that MADV_COLLAPSE does not perform a collapse on a non-writable VMA, and this restriction is nowhere to be found on the manpage - the restriction itself sounds wrong to me since the user knows the protection of the memory it has mapped, so collapsing read-only memory via madvise() should be a choice of the user which shouldn't be overridden by the kernel. Therefore, remove this constraint. On an arm64 bare metal machine, comparing with vanilla 6.17-rc2, an average of 5% improvement is seen on some mmtests benchmarks, particularly hackbench, with a maximum improvement of 12%. In the following table, (I) denotes statistically significant improvement, (R) denotes statistically significant regression. +-------------------------+--------------------------------+---------------+ | mmtests/hackbench | process-pipes-1 (seconds) | -0.06% | | | process-pipes-4 (seconds) | -0.27% | | | process-pipes-7 (seconds) | (I) -12.13% | | | process-pipes-12 (seconds) | (I) -5.32% | | | process-pipes-21 (seconds) | (I) -2.87% | | | process-pipes-30 (seconds) | (I) -3.39% | | | process-pipes-48 (seconds) | (I) -5.65% | | | process-pipes-79 (seconds) | (I) -6.74% | | | process-pipes-110 (seconds) | (I) -6.26% | | | process-pipes-141 (seconds) | (I) -4.99% | | | process-pipes-172 (seconds) | (I) -4.45% | | | process-pipes-203 (seconds) | (I) -3.65% | | | process-pipes-234 (seconds) | (I) -3.45% | | | process-pipes-256 (seconds) | (I) -3.47% | | | process-sockets-1 (seconds) | 2.13% | | | process-sockets-4 (seconds) | 1.02% | | | process-sockets-7 (seconds) | -0.26% | | | process-sockets-12 (seconds) | -1.24% | | | process-sockets-21 (seconds) | 0.01% | | | process-sockets-30 (seconds) | -0.15% | | | process-sockets-48 (seconds) | 0.15% | | | process-sockets-79 (seconds) | 1.45% | | | process-sockets-110 (seconds) | -1.64% | | | process-sockets-141 (seconds) | (I) -4.27% | | | process-sockets-172 (seconds) | 0.30% | | | process-sockets-203 (seconds) | -1.71% | | | process-sockets-234 (seconds) | -1.94% | | | process-sockets-256 (seconds) | -0.71% | | | thread-pipes-1 (seconds) | 0.66% | | | thread-pipes-4 (seconds) | 1.66% | | | thread-pipes-7 (seconds) | -0.17% | | | thread-pipes-12 (seconds) | (I) -4.12% | | | thread-pipes-21 (seconds) | (I) -2.13% | | | thread-pipes-30 (seconds) | (I) -3.78% | | | thread-pipes-48 (seconds) | (I) -5.77% | | | thread-pipes-79 (seconds) | (I) -5.31% | | | thread-pipes-110 (seconds) | (I) -6.12% | | | thread-pipes-141 (seconds) | (I) -4.00% | | | thread-pipes-172 (seconds) | (I) -3.01% | | | thread-pipes-203 (seconds) | (I) -2.62% | | | thread-pipes-234 (seconds) | (I) -2.00% | | | thread-pipes-256 (seconds) | (I) -2.30% | | | thread-sockets-1 (seconds) | (R) 2.39% | +-------------------------+--------------------------------+---------------+ +-------------------------+------------------------------------------------+ | mmtests/sysbench-mutex | sysbenchmutex-1 (usec) | -0.02% | | | sysbenchmutex-4 (usec) | -0.02% | | | sysbenchmutex-7 (usec) | 0.00% | | | sysbenchmutex-12 (usec) | 0.12% | | | sysbenchmutex-21 (usec) | -0.40% | | | sysbenchmutex-30 (usec) | 0.08% | | | sysbenchmutex-48 (usec) | 2.59% | | | sysbenchmutex-79 (usec) | -0.80% | | | sysbenchmutex-110 (usec) | -3.87% | | | sysbenchmutex-128 (usec) | (I) -4.46% | +-------------------------+--------------------------------+---------------+ This patch (of 2): Currently khugepaged does not collapse an anonymous region which does not have a single writable pte. This is wasteful since a region mapped with non-writable ptes, for example, non-writable VMAs mapped by the application, won't benefit from THP collapse. An additional consequence of this constraint is that MADV_COLLAPSE does not perform a collapse on a non-writable VMA, and this restriction is nowhere to be found on the manpage - the restriction itself sounds wrong to me since the user knows the protection of the memory it has mapped, so collapsing read-only memory via madvise() should be a choice of the user which shouldn't be overridden by the kernel. Therefore, remove this restriction by not honouring SCAN_PAGE_RO. Link: https://lkml.kernel.org/r/20250908075028.38431-1-dev.jain@arm.com Link: https://lkml.kernel.org/r/20250908075028.38431-2-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Barry Song <baohua@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm/damon/stat: expose negative idle timeSeongJae Park1-5/+5
DAMON_STAT calculates the idle time of a region using the region's age if the region's nr_accesses is zero. If the nr_accesses value is non-zero (positive), the idle time of the region becomes zero. This means the users cannot know how warm and hot data is distributed, using DAMON_STAT's memory_idle_ms_percentiles output. The other stat, namely estimated_memory_bandwidth, can help understanding how the overall access temperature of the system is, but it is still very rough information. On production systems, actually, a significant portion of the system memory is observed with zero idle time, and we cannot break it down based on its internal hotness distribution. Define the idle time of the region using its age, similar to those having zero nr_accesses, but multiples '-1' to distinguish it. And expose that using the same parameter interface, memory_idle_ms_percentiles. Link: https://lkml.kernel.org/r/20250916183127.65708-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm/damon/stat: expose the current tuned aggregation intervalSeongJae Park1-0/+6
Patch series "mm/damon/stat: expose auto-tuned intervals and non-idle ages". DAMON_STAT is intentionally providing limited information for easy consumption of the information. From production fleet level usages, below limitations are found, though. The aggregation interval of DAMON_STAT represents the granularity of the memory_idle_ms_percentiles. But the interval is auto-tuned and not exposed to users, so users cannot know the granularity. All memory regions of non-zero (positive) nr_accesses are treated as having zero idle time. A significant portion of production systems have such zero idle time. Hence breakdown of warm and hot data is nearly impossible. Make following changes to overcome the limitations. Expose the auto-tuned aggregation interval with a new parameter named aggr_interval_us. Expose the age of non-zero nr_accesses (how long >0 access frequency the region retained) regions as a negative idle time. This patch (of 2): DAMON_STAT calculates the idle time for a region as the region's age multiplied by the aggregation interval. That is, the aggregation interval is the granularity of the idle time. Since the aggregation interval is auto-tuned and not exposed to users, however, users cannot easily know in what granularity the stat is made. Expose the tuned aggregation interval in microseconds via a new parameter, aggr_interval_us. Link: https://lkml.kernel.org/r/20250916183127.65708-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250916183127.65708-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm/damon/lru_sort: use damon_initialized()SeongJae Park1-2/+7
DAMON_LRU_SORT is assuming DAMON is ready to use in module_init time, and uses its own hack to see if it is the time. Use damon_initialized(), which is a way for seeing if DAMON is ready to be used that is more reliable and better to maintain instead of the hack. Link: https://lkml.kernel.org/r/20250916033511.116366-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm/damon/reclaim: use damon_initialized()SeongJae Park1-2/+7
DAMON_RECLAIM is assuming DAMON is ready to use in module_init time, and uses its own hack to see if it is the time. Use damon_initialized(), which is a way for seeing if DAMON is ready to be used that is more reliable and better to maintain instead of the hack. Link: https://lkml.kernel.org/r/20250916033511.116366-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm/damon/stat: use damon_initialized()SeongJae Park1-4/+6
DAMON_STAT is assuming DAMON is ready to use in module_init time, and uses its own hack to see if it is the time. Use damon_initialized(), which is a way for seeing if DAMON is ready to be used that is more reliable and better to maintain instead of the hack. Link: https://lkml.kernel.org/r/20250916033511.116366-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm/damon/core: implement damon_initialized() functionSeongJae Park1-0/+10
Patch series "mm/damon: define and use DAMON initialization check function". DAMON is initialized in subsystem initialization time, by damon_init(). If DAMON API functions are called before the initialization, the system could crash. Actually such issues happened and were fixed [1] in the past. For the fix, DAMON API callers have updated to check if DAMON is initialized or not, using their own hacks. The hacks are unnecessarily duplicated on every DAMON API callers and therefore it would be difficult to reliably maintain in the long term. Make it reliable and easy to maintain. For this, implement a new DAMON core layer API function that returns if DAMON is successfully initialized. If it returns true, it means DAMON API functions are safe to be used. After the introduction of the new API, update DAMON API callers to use the new function instead of their own hacks. This patch (of 7): If DAMON is tried to be used when it is not yet successfully initialized, the caller could be crashed. DAMON core layer is not providing a reliable way to see if it is successfully initialized and therefore ready to be used, though. As a result, DAMON API callers are implementing their own hacks to see it. The hacks simply assume DAMON should be ready on module init time. It is not reliable as DAMON initialization can indeed fail if KMEM_CACHE() fails, and difficult to maintain as those are duplicates. Implement a core layer API function for better reliability and maintainability to replace the hacks with followup commits. Link: https://lkml.kernel.org/r/20250916033511.116366-2-sj@kernel.org Link: https://lkml.kernel.org/r/20250916033511.116366-2-sj@kernel.org Link: https://lore.kernel.org/20250909022238.2989-1-sj@kernel.org [1] Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm/damon/core: set effective quota on first charge windowSeongJae Park1-1/+3
The effective quota of a scheme is initialized zero, which means there is no quota. It is set based on user-specified time/quota/quota goals. But the later value set is done only from the second charge window. As a result, a scheme having a user-specified quota can work as not having the quota (unexpectedly fast) for the first charge window. In practical and common use cases the quota interval is not too long, and the scheme's target access pattern is restrictive. Hence the issue should be modest. That said, it is apparently an unintended misbehavior. Fix the problem by setting esz on the first charge window. Link: https://lkml.kernel.org/r/20250916032339.115817-3-sj@kernel.org Fixes: 1cd243030059 ("mm/damon/schemes: implement time quota") # 5.16.x Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm/damon/core: reset age if nr_accesses changes between non-zero and zeroSeongJae Park1-0/+2
Patch series "mm/damon: misc fixups and improvements for 6.18", v2. Misc fixes and improvements for DAMON that are not critical and therefore aims to be merged into Linux 6.18-rc1. The first patch improves DAMON's age counting for nr_accesses zero to/from non-zero changes. The second patch fixes an initial DAMOS apply interval delay issue that is not realistic but still could happen on an odd setup. The third and the fourth patches update DAMON community meetup description and DAMON user-space tool example command for DAMOS usage, respectively. Finally, the fifth patch updates MAINTAINERS section name for DAMON to just DAMON. This patch (of 5): DAMON resets the age of a region if its nr_accesses value has significantly changed. Specifically, the threshold is calculated as 20% of largest nr_accesses of the current snapshot. This means that regions changing the nr_accesses from zero to small non-zero value or from a small non-zero value to zero will keep the age. Since many users treat zero nr_accesses regions special, this can be confusing. Kernel code including DAMOS' regions priority calculation and DAMON_STAT's idle time calculation also treat zero nr_accesses regions special. Make it unconfusing by resetting the age when the nr_accesses changes between zero and a non-zero value. Link: https://lkml.kernel.org/r/20250916032339.115817-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250916032339.115817-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21alloc_tag: mark inaccurate allocation counters in /proc/allocinfo outputSuren Baghdasaryan1-0/+2
While rare, memory allocation profiling can contain inaccurate counters if slab object extension vector allocation fails. That allocation might succeed later but prior to that, slab allocations that would have used that object extension vector will not be accounted for. To indicate incorrect counters, "accurate:no" marker is appended to the call site line in the /proc/allocinfo output. Bump up /proc/allocinfo version to reflect the change in the file format and update documentation. Example output with invalid counters: allocinfo - version: 2.0 0 0 arch/x86/kernel/kdebugfs.c:105 func:create_setup_data_nodes 0 0 arch/x86/kernel/alternative.c:2090 func:alternatives_smp_module_add 0 0 arch/x86/kernel/alternative.c:127 func:__its_alloc accurate:no 0 0 arch/x86/kernel/fpu/regset.c:160 func:xstateregs_set 0 0 arch/x86/kernel/fpu/xstate.c:1590 func:fpstate_realloc 0 0 arch/x86/kernel/cpu/aperfmperf.c:379 func:arch_enable_hybrid_capacity_scale 0 0 arch/x86/kernel/cpu/amd_cache_disable.c:258 func:init_amd_l3_attrs 49152 48 arch/x86/kernel/cpu/mce/core.c:2709 func:mce_device_create accurate:no 32768 1 arch/x86/kernel/cpu/mce/genpool.c:132 func:mce_gen_pool_create 0 0 arch/x86/kernel/cpu/mce/amd.c:1341 func:mce_threshold_create_device [surenb@google.com: document new "accurate:no" marker] Fixes: 39d117e04d15 ("alloc_tag: mark inaccurate allocation counters in /proc/allocinfo output") [akpm@linux-foundation.org: simplification per Usama, reflow text] [akpm@linux-foundation.org: add newline to prevent docs warning, per Randy] Link: https://lkml.kernel.org/r/20250915230224.4115531-1-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Usama Arif <usamaarif642@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: David Wang <00107082@163.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Sourav Panda <souravpanda@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm/oom_kill: the OOM reaper traverses the VMA maple tree in reverse orderzhongjinji1-2/+8
Although the oom_reaper is delayed and it gives the oom victim chance to clean up its address space this might take a while especially for processes with a large address space footprint. In those cases oom_reaper might start racing with the dying task and compete for shared resources - e.g. page table lock contention has been observed. Reduce those races by reaping the oom victim from the other end of the address space. It is also a significant improvement for process_mrelease(). When a process is killed, process_mrelease is used to reap the killed process and often runs concurrently with the dying task. The test data shows that after applying the patch, lock contention is greatly reduced during the procedure of reaping the killed process. The test is conducted on arm64. The following basic perf numbers show that applying this patch significantly reduces pte spin lock contention. Without the patch: |--99.57%-- oom_reaper | |--73.58%-- unmap_page_range | | |--8.67%-- [hit in function] | | |--41.59%-- __pte_offset_map_lock | | |--29.47%-- folio_remove_rmap_ptes | | |--16.11%-- tlb_flush_mmu | |--19.94%-- tlb_finish_mmu | |--3.21%-- folio_remove_rmap_ptes With the patch: |--99.53%-- oom_reaper | |--55.77%-- unmap_page_range | | |--20.49%-- [hit in function] | | |--58.30%-- folio_remove_rmap_ptes | | |--11.48%-- tlb_flush_mmu | | |--3.33%-- folio_mark_accessed | |--32.21%-- tlb_finish_mmu | |--6.93%-- folio_remove_rmap_ptes | |--0.69%-- __pte_offset_map_lock Detailed breakdowns for both scenarios are provided below. The cumulative time for oom_reaper plus exit_mmap(victim) in both cases is also summarized, making the performance improvements clear. +----------------------------------------------------------------+ | Category | Applying patch | Without patch | +-------------------------------+----------------+---------------+ | Total running time | 132.6 | 167.1 | | (exit_mmap + reaper work) | 72.4 + 60.2 | 90.7 + 76.4 | +-------------------------------+----------------+---------------+ | Time waiting for pte spinlock | 1.0 | 33.1 | | (exit_mmap + reaper work) | 0.4 + 0.6 | 10.0 + 23.1 | +-------------------------------+----------------+---------------+ | folio_remove_rmap_ptes time | 42.0 | 41.3 | | (exit_mmap + reaper work) | 18.4 + 23.6 | 22.4 + 18.9 | +----------------------------------------------------------------+ From this report, we can see that: 1. The reduction in total time comes mainly from the decrease in time spent on pte spinlock and other locks. 2. oom_reaper performs more work in some areas, but at the same time, exit_mmap also handles certain tasks more efficiently, such as folio_remove_rmap_ptes. Here is a more detailed perf report. [1] Link: https://lkml.kernel.org/r/20250915162946.5515-3-zhongjinji@honor.com Link: https://lore.kernel.org/all/20250915162619.5133-1-zhongjinji@honor.com/ [1] Signed-off-by: zhongjinji <zhongjinji@honor.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Len Brown <lenb@kernel.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm/oom_kill: thaw the entire OOM victim processzhongjinji1-5/+5
Patch series "Improvements to Victim Process Thawing and OOM Reaper Traversal Order", v10. This patch series focuses on optimizing victim process thawing and refining the traversal order of the OOM reaper. Since __thaw_task() is used to thaw a single thread of the victim, thawing only one thread cannot guarantee the exit of the OOM victim when it is frozen. Patch 1 thaw the entire process of the OOM victim to ensure that OOM victims are able to terminate themselves. Even if the oom_reaper is delayed, patch 2 is still beneficial for reaping processes with a large address space footprint, and it also greatly improves process_mrelease. This patch (of 10): OOM killer is a mechanism that selects and kills processes when the system runs out of memory to reclaim resources and keep the system stable. But the oom victim cannot terminate on its own when it is frozen, even if the OOM victim task is thawed through __thaw_task(). This is because __thaw_task() can only thaw a single OOM victim thread, and cannot thaw the entire OOM victim process. In addition, freezing_slow_path() determines whether a task is an OOM victim by checking the task's TIF_MEMDIE flag. When a task is identified as an OOM victim, the freezer bypasses both PM freezing and cgroup freezing states to thaw it. Historically, TIF_MEMDIE was a "this is the oom victim & it has access to memory reserves" flag in the past. It has that thread vs. process problems and tsk_is_oom_victim was introduced later to get rid of them and other issues as well as the guarantee that we can identify the oom victim's mm reliably for other oom_reaper. Therefore, thaw_process() is introduced to unfreeze all threads within the OOM victim process, ensuring that every thread is properly thawed. The freezer now uses tsk_is_oom_victim() to determine OOM victim status, allowing all victim threads to be unfrozen as necessary. With this change, the entire OOM victim process will be thawed when an OOM event occurs, ensuring that the victim can terminate on its own. Link: https://lkml.kernel.org/r/20250915162946.5515-1-zhongjinji@honor.com Link: https://lkml.kernel.org/r/20250915162946.5515-2-zhongjinji@honor.com Signed-off-by: zhongjinji <zhongjinji@honor.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Len Brown <lenb@kernel.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm/damon/lru_sort: use param_ctx for damon_attrs stagingSeongJae Park1-1/+1
damon_lru_sort_apply_parameters() allocates a new DAMON context, stages user-specified DAMON parameters on it, and commits to running DAMON context at once, using damon_commit_ctx(). The code is, however, directly updating the monitoring attributes of the running context. And the attributes are over-written by later damon_commit_ctx() call. This means that the monitoring attributes parameters are not really working. Fix the wrong use of the parameter context. Link: https://lkml.kernel.org/r/20250916031549.115326-1-sj@kernel.org Fixes: a30969436428 ("mm/damon/lru_sort: use damon_commit_ctx()") Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: <stable@vger.kernel.org> [6.11+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm/damon/reclaim: support addr_unit for DAMON_RECLAIMQuanmin Yan1-0/+40
Implement a sysfs file to expose addr_unit for DAMON_RECLAIM users. During parameter application, use the configured addr_unit parameter to perform the necessary initialization. Similar to the core layer, prevent setting addr_unit to zero. It is worth noting that when monitor_region_start and monitor_region_end are unset (i.e., 0), their values will later be set to biggest_system_ram. At that point, addr_unit may not be the default value 1. Although we could divide the biggest_system_ram value by addr_unit, changing addr_unit without setting monitor_region_start/end should be considered a user misoperation. And biggest_system_ram is only within the 0~ULONG_MAX range, system can clearly work correctly with addr_unit=1. Therefore, if monitor_region_start/end are unset, always silently reset addr_unit to 1. Link: https://lkml.kernel.org/r/20250910113221.1065764-3-yanquanmin1@huawei.com Signed-off-by: Quanmin Yan <yanquanmin1@huawei.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: ze zuo <zuoze1@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm/damon/lru_sort: support addr_unit for DAMON_LRU_SORTQuanmin Yan1-0/+40
Patch series "mm/damon: add addr_unit for DAMON_LRU_SORT and DAMON_RECLAIM". In DAMON_LRU_SORT and DAMON_RECLAIM, damon_ctx is independent of the core. Add addr_unit to these modules to support systems like ARM32 with LPAE. This patch (of 2): Implement a sysfs file to expose addr_unit for DAMON_LRU_SORT users. During parameter application, use the configured addr_unit parameter to perform the necessary initialization. Similar to the core layer, prevent setting addr_unit to zero. It is worth noting that when monitor_region_start and monitor_region_end are unset (i.e., 0), their values will later be set to biggest_system_ram. At that point, addr_unit may not be the default value 1. Although we could divide the biggest_system_ram value by addr_unit, changing addr_unit without setting monitor_region_start/end should be considered a user misoperation. And biggest_system_ram is only within the 0~ULONG_MAX range, system can clearly work correctly with addr_unit=1. Therefore, if monitor_region_start/end are unset, always silently reset addr_unit to 1. Link: https://lkml.kernel.org/r/20250910113221.1065764-1-yanquanmin1@huawei.com Link: https://lkml.kernel.org/r/20250910113221.1065764-2-yanquanmin1@huawei.com Signed-off-by: Quanmin Yan <yanquanmin1@huawei.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: ze zuo <zuoze1@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm: remove page->orderMatthew Wilcox (Oracle)1-2/+2
We already use page->private for storing the order of a page while it's in the buddy allocator system; extend that to also storing the order while it's in the pcp_llist. Link: https://lkml.kernel.org/r/20250910142923.2465470-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm: remove redundant test in validate_page_before_insert()Matthew Wilcox (Oracle)1-2/+1
The page_has_type() call would have included slab since commit 46df8e73a4a3 and now we don't even get that far because slab pages have a zero refcount since commit 9aec2fb0fd5e. Link: https://lkml.kernel.org/r/20250910142923.2465470-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm: lru_add_drain_all() do local lru_add_drain() firstHugh Dickins1-0/+3
No numbers to back this up, but it seemed obvious to me, that if there are competing lru_add_drain_all()ers, the work will be minimized if each flushes its own local queues before locking and doing cross-CPU drains. Link: https://lkml.kernel.org/r/33389bf8-f79d-d4dd-b7a4-680c4aa21b23@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Chris Li <chrisl@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Keir Fraser <keirf@google.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Li Zhe <lizhe.67@bytedance.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Xu <weixugc@google.com> Cc: Will Deacon <will@kernel.org> Cc: yangge <yangge1116@126.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21alloc_tag: avoid warnings when freeing non-compound "tail" pagesSuren Baghdasaryan1-1/+8
When freeing "tail" pages of a non-compount high-order page, we properly subtract the allocation tag counters, however later when these pages are released, alloc_tag_sub() will issue warnings because tags for these pages are NULL. This issue was originally anticipated by Vlastimil in his review [1] and then recently reported by David. Prevent warnings by marking the tags empty. Link: https://lkml.kernel.org/r/20250915212756.3998938-4-surenb@google.com Link: https://lore.kernel.org/all/6db0f0c8-81cb-4d04-9560-ba73d63db4b8@suse.cz/ [1] Signed-off-by: Suren Baghdasaryan <surenb@google.com> Suggested-by: David Wang <00107082@163.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Usama Arif <usamaarif642@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Sourav Panda <souravpanda@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm/shmem: remove unused entry_order after large swapin reworkJackie Liu1-7/+4
After commit 93c0476e7057 ("mm/shmem, swap: rework swap entry and index calculation for large swapin"), xas_get_order() will never return a non-zero value for `entry_order` in shmem_split_large_entry(). As a result, the local variable `entry_order` is effectively unused. Clean up the code by removing `entry_order` and directly using `cur_order`. This change is purely a refactor and has no functional impact. No functional change intended. Link: https://lkml.kernel.org/r/20250908062614.89880-1-liu.yun@linux.dev Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kairui Song <kasong@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm: skip mlocked THPs that are underused early in deferred_split_scan()Lance Yang1-0/+7
When we stumble over a fully-mapped mlocked THP in the deferred shrinker, it does not make sense to try to detect whether it is underused, because try_to_map_unused_to_zeropage(), called while splitting the folio, will not actually replace any zeroed pages by the shared zeropage. Splitting the folio in that case does not make any sense, so let's not even scan to check if the folio is underused. Link: https://lkml.kernel.org/r/20250908090741.61519-1-lance.yang@linux.dev Signed-off-by: Lance Yang <lance.yang@linux.dev> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Usama Arif <usamaarif642@gmail.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Kiryl Shutsemau <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm/hmm: populate PFNs from PMD swap entryFrancois Dugast1-5/+65
Once support for THP migration of zone device pages is enabled, device private swap entries will be found during the walk not only for PTEs but also for PMDs. Therefore, it is necessary to extend to PMDs the special handling which is already in place for PTEs when device private pages are owned by the caller: instead of faulting or skipping the range, the correct behavior is to use the swap entry to populate HMM PFNs. This change is a prerequisite to make use of device-private THP in drivers using drivers/gpu/drm/drm_pagemap, such as xe. Even though subsequent PFNs can be inferred when handling large order PFNs, the PFN list is still fully populated because this is currently expected by HMM users. In case this changes in the future, that is all HMM users support a sparsely populated PFN list, the for() loop can be made to skip remaining PFNs for the current order. A quick test shows the loop takes about 10 ns, roughly 20 times faster than without this optimization. Link: https://lkml.kernel.org/r/20250908091052.612303-1-francois.dugast@intel.com Signed-off-by: Francois Dugast <francois.dugast@intel.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Leon Romanovsky <leonro@nvidia.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: David Airlie <airlied@gmail.com> Cc: Christian König <christian.koenig@amd.com> Cc: Mika Penttilä <mpenttil@redhat.com> Cc: Thomas Hellstrom <thomas.hellstrom@linux.intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm/gup: fix handling of errors from arch_make_folio_accessible() in ↵David Hildenbrand1-6/+3
follow_page_pte() In case we call arch_make_folio_accessible() and it fails, we would incorrectly return a value that is "!= 0" to the caller, indicating that we pinned all requested pages and that the caller can keep going. follow_page_pte() is not supposed to return error values, but instead "0" on failure and "1" on success -- we'll clean that up separately. In case we return "!= 0", the caller will just keep going pinning more pages. If we happen to pin a page afterwards, we're in trouble, because we essentially skipped some pages in the requested range. Staring at the arch_make_folio_accessible() implementation on s390x, I assume it should actually never really fail unless something unexpected happens (BUG?). So let's not CC stable and just fix common code to do the right thing. Clean up the code a bit now that there is no reason to store the return value of arch_make_folio_accessible(). Link: https://lkml.kernel.org/r/20250908094517.303409-1-david@redhat.com Fixes: f28d43636d6f ("mm/gup/writeback: add callbacks for inaccessible pages") Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm: re-enable kswapd when memory pressure subsides or demotion is toggledChanwon Park5-16/+44
If kswapd fails to reclaim pages from a node MAX_RECLAIM_RETRIES in a row, kswapd on that node gets disabled. That is, the system won't wakeup kswapd for that node until page reclamation is observed at least once. That reclamation is mostly done by direct reclaim, which in turn enables kswapd back. However, on systems with CXL memory nodes, workloads with high anon page usage can disable kswapd indefinitely, without triggering direct reclaim. This can be reproduced with following steps: numa node 0 (32GB memory, 48 CPUs) numa node 2~5 (512GB CXL memory, 128GB each) (numa node 1 is disabled) swap space 8GB 1) Set /sys/kernel/mm/demotion_enabled to 0. 2) Set /proc/sys/kernel/numa_balancing to 0. 3) Run a process that allocates and random accesses 500GB of anon pages. 4) Let the process exit normally. During 3), free memory on node 0 gets lower than low watermark, and kswapd runs and depletes swap space. Then, kswapd fails consecutively and gets disabled. Allocation afterwards happens on CXL memory, so node 0 never gains more memory pressure to trigger direct reclaim. After 4), kswapd on node 0 remains disabled, and tasks running on that node are unable to swap. If you turn on NUMA_BALANCING_MEMORY_TIERING and demotion now, it won't work properly since kswapd is disabled. To mitigate this problem, reset kswapd_failures to 0 on following conditions: a) ZONE_BELOW_HIGH bit of a zone in hopeless node with a fallback memory node gets cleared. b) demotion_enabled is changed from false to true. Rationale for a): ZONE_BELOW_HIGH bit being cleared might be a sign that the node may be reclaimable afterwards. This won't help much if the memory-hungry process keeps running without freeing anything, but at least the node will go back to reclaimable state when the process exits. Rationale for b): When demotion_enabled is false, kswapd can only reclaim anon pages by swapping them out to swap space. If demotion_enabled is turned on, kswapd can demote anon pages to another node for reclaiming. So, the original failure count for determining reclaimability is no longer valid. Since kswapd_failures resets may be missed by ++ operation, it is changed from int to atomic_t. [akpm@linux-foundation.org: tweak whitespace] Link: https://lkml.kernel.org/r/aL6qGi69jWXfPc4D@pcw-MS-7D22 Signed-off-by: Chanwon Park <flyinrm@gmail.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21readahead: add trace pointsJan Kara1-0/+8
Add a couple of trace points to make debugging readahead logic easier. [jack@suse.cz: v2] Link: https://lkml.kernel.org/r/20250909145849.5090-2-jack@suse.cz Link: https://lkml.kernel.org/r/20250908145533.31528-2-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Tested-by: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm/memcg: v1: account event registrations and drop world-writable ↵Stanislav Fort1-4/+4
cgroup.event_control In cgroup v1, the legacy cgroup.event_control file is world-writable and allows unprivileged users to register unbounded events and thresholds. Each registration allocates kernel memory without capping or memcg charging, which can be abused to exhaust kernel memory in affected configurations. Make the following minimal changes: - Account allocations with __GFP_ACCOUNT in event and threshold registration. - Remove CFTYPE_WORLD_WRITABLE from cgroup.event_control to make it owner-writable. This does not affect cgroup v2. Allocations are still subject to kmem accounting being enabled, but this reduces unbounded global growth. Link: https://lkml.kernel.org/r/20250905093851.80596-1-disclosure@aisle.com Signed-off-by: Stanislav Fort <disclosure@aisle.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm, swap: use a single page for swap table when the size fitsKairui Song2-10/+43
We have a cluster size of 512 slots. Each slot consumes 8 bytes in swap table so the swap table size of each cluster is exactly one page (4K). If that condition is true, allocate one page direct and disable the slab cache to reduce the memory usage of swap table and avoid fragmentation. Link: https://lkml.kernel.org/r/20250916160100.31545-16-ryncsn@gmail.com Co-developed-by: Chris Li <chrisl@kernel.org> Signed-off-by: Chris Li <chrisl@kernel.org> Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Suggested-by: Chris Li <chrisl@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm, swap: implement dynamic allocation of swap tableKairui Song4-51/+194
Now swap table is cluster based, which means free clusters can free its table since no one should modify it. There could be speculative readers, like swap cache look up, protect them by making them RCU protected. All swap table should be filled with null entries before free, so such readers will either see a NULL pointer or a null filled table being lazy freed. On allocation, allocate the table when a cluster is used by any order. This way, we can reduce the memory usage of large swap device significantly. This idea to dynamically release unused swap cluster data was initially suggested by Chris Li while proposing the cluster swap allocator and it suits the swap table idea very well. Link: https://lkml.kernel.org/r/20250916160100.31545-15-ryncsn@gmail.com Co-developed-by: Chris Li <chrisl@kernel.org> Signed-off-by: Chris Li <chrisl@kernel.org> Signed-off-by: Kairui Song <kasong@tencent.com> Suggested-by: Chris Li <chrisl@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm, swap: remove contention workaround for swap cacheKairui Song3-30/+13
Swap cluster setup will try to shuffle the clusters on initialization. It was helpful to avoid contention for the swap cache space. The cluster size (2M) was much smaller than each swap cache space (64M), so shuffling the cluster means the allocator will try to allocate swap slots that are in different swap cache spaces for each CPU, reducing the chance of two CPUs using the same swap cache space, and hence reducing the contention. Now, swap cache is managed by swap clusters, this shuffle is pointless. Just remove it, and clean up related macros. This also improves the HDD swap performance as shuffling IO is a bad idea for HDD, and now the shuffling is gone. Test have shown a ~40% performance gain for HDD [1]: Doing sequential swap in of 8G data using 8 processes with usemem, average of 3 test runs: Before: 1270.91 KB/s per process After: 1849.54 KB/s per process Link: https://lore.kernel.org/linux-mm/CAMgjq7AdauQ8=X0zeih2r21QoV=-WWj1hyBxLWRzq74n-C=-Ng@mail.gmail.com/ [1] Link: https://lkml.kernel.org/r/20250916160100.31545-14-ryncsn@gmail.com Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202504241621.f27743ec-lkp@intel.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Suggested-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm, swap: mark swap address space ro and add context debug checkKairui Song2-2/+13
Swap cache is now backed by swap table, and the address space is not holding any mutable data anymore. And swap cache is now protected by the swap cluster lock, instead of the XArray lock. All access to swap cache are wrapped by swap cache helpers. Locking is mostly handled internally by swap cache helpers, only a few __swap_cache_* helpers require the caller to lock the cluster by themselves. Worth noting that, unlike XArray, the cluster lock is not IRQ safe. The swap cache was very different compared to filemap, and now it's completely separated from filemap. Nothing wants to mark or change anything or do a writeback callback in IRQ. So explicitly document this and add a debug check to avoid further potential misuse. And mark the swap cache space as read-only to avoid any user wrongly mixing unexpected filemap helpers with swap cache. Link: https://lkml.kernel.org/r/20250916160100.31545-13-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Suggested-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm, swap: use the swap table for the swap cache and switch APIKairui Song8-246/+458
Introduce basic swap table infrastructures, which are now just a fixed-sized flat array inside each swap cluster, with access wrappers. Each cluster contains a swap table of 512 entries. Each table entry is an opaque atomic long. It could be in 3 types: a shadow type (XA_VALUE), a folio type (pointer), or NULL. In this first step, it only supports storing a folio or shadow, and it is a drop-in replacement for the current swap cache. Convert all swap cache users to use the new sets of APIs. Chris Li has been suggesting using a new infrastructure for swap cache for better performance, and that idea combined well with the swap table as the new backing structure. Now the lock contention range is reduced to 2M clusters, which is much smaller than the 64M address_space. And we can also drop the multiple address_space design. All the internal works are done with swap_cache_get_* helpers. Swap cache lookup is still lock-less like before, and the helper's contexts are same with original swap cache helpers. They still require a pin on the swap device to prevent the backing data from being freed. Swap cache updates are now protected by the swap cluster lock instead of the XArray lock. This is mostly handled internally, but new __swap_cache_* helpers require the caller to lock the cluster. So, a few new cluster access and locking helpers are also introduced. A fully cluster-based unified swap table can be implemented on top of this to take care of all count tracking and synchronization work, with dynamic allocation. It should reduce the memory usage while making the performance even better. Link: https://lkml.kernel.org/r/20250916160100.31545-12-ryncsn@gmail.com Co-developed-by: Chris Li <chrisl@kernel.org> Signed-off-by: Chris Li <chrisl@kernel.org> Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Suggested-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm, swap: wrap swap cache replacement with a helperKairui Song5-20/+44
There are currently three swap cache users that are trying to replace an existing folio with a new one: huge memory splitting, migration, and shmem replacement. What they are doing is quite similar. Introduce a common helper for this. In later commits, this can be easily switched to use the swap table by updating this helper. The newly added helper also makes the swap cache API better defined, and make debugging easier by adding a few more debug checks. Migration and shmem replace are meant to clone the folio, including content, swap entry value, and flags. And splitting will adjust each sub folio's swap entry according to order, which could be non-uniform in the future. So document it clearly that it's the caller's responsibility to set up the new folio's swap entries and flags before calling the helper. The helper will just follow the new folio's entry value. This also prepares for replacing high-order folios in the swap cache. Currently, only splitting to order 0 is allowed for swap cache folios. Using the new helper, we can handle high-order folio splitting better. Link: https://lkml.kernel.org/r/20250916160100.31545-11-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Chris Li <chrisl@kernel.org> Suggested-by: Chris Li <chrisl@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm/shmem, swap: remove redundant error handling for replacing folioKairui Song1-25/+7
Shmem may replace a folio in the swap cache if the cached one doesn't fit the swapin's GFP zone. When doing so, shmem has already double checked that the swap cache folio is locked, still has the swap cache flag set, and contains the wanted swap entry. So it is impossible to fail due to an XArray mismatch. There is even a comment for that. Delete the defensive error handling path, and add a WARN_ON instead: if that happened, something has broken the basic principle of how the swap cache works, we should catch and fix that. Link: https://lkml.kernel.org/r/20250916160100.31545-10-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Suggested-by: Chris Li <chrisl@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm, swap: cleanup swap cache API and add kerneldocKairui Song9-59/+103
In preparation for replacing the swap cache backend with the swap table, clean up and add proper kernel doc for all swap cache APIs. Now all swap cache APIs are well-defined with consistent names. No feature change, only renaming and documenting. Link: https://lkml.kernel.org/r/20250916160100.31545-9-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Suggested-by: Chris Li <chrisl@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm, swap: tidy up swap device and cluster info helpersKairui Song4-31/+60
swp_swap_info is the most commonly used helper for retrieving swap info. It has an internal check that may lead to a NULL return value, but almost none of its caller checks the return value, making the internal check pointless. In fact, most of these callers already ensured the entry is valid and never expect a NULL value. Tidy this up and improve the function names. If the caller can make sure the swap entry/type is valid and the device is pinned, use the new introduced __swap_entry_to_info/__swap_type_to_info instead. They have more debug sanity checks and lower overhead as they are inlined. Callers that may expect a NULL value should use swap_entry_to_info/swap_type_to_info instead. No feature change. The rearranged codes should have had no effect, or they should have been hitting NULL de-ref bugs already. Only some new sanity checks are added so potential issues may show up in debug build. The new helpers will be frequently used with swap table later when working with swap cache folios. A locked swap cache folio ensures the entries are valid and stable so these helpers are very helpful. Link: https://lkml.kernel.org/r/20250916160100.31545-8-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Suggested-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm, swap: rename and move some swap cluster definition and helpersKairui Song2-68/+99
No feature change, move cluster related definitions and helpers to mm/swap.h, also tidy up and add a "swap_" prefix for cluster lock/unlock helpers, so they can be used outside of swap files. And while at it, add kerneldoc. Link: https://lkml.kernel.org/r/20250916160100.31545-7-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Barry Song <baohua@kernel.org> Acked-by: Chris Li <chrisl@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Suggested-by: Chris Li <chrisl@kernel.org> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm, swap: always lock and check the swap cache folio before useKairui Song4-6/+41
Swap cache lookup only increases the reference count of the returned folio. That's not enough to ensure a folio is stable in the swap cache, so the folio could be removed from the swap cache at any time. The caller should always lock and check the folio before using it. We have just documented this in kerneldoc, now introduce a helper for swap cache folio verification with proper sanity checks. Also, sanitize a few current users to use this convention and the new helper for easier debugging. They were not having observable problems yet, only trivial issues like wasted CPU cycles on swapoff or reclaiming. They would fail in some other way, but it is still better to always follow this convention to make things robust and make later commits easier to do. Link: https://lkml.kernel.org/r/20250916160100.31545-6-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Chris Li <chrisl@kernel.org> Acked-by: Nhat Pham <nphamcs@gmail.com> Suggested-by: Chris Li <chrisl@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm, swap: check page poison flag after locking itKairui Song1-11/+11
Instead of checking the poison flag only in the fast swap cache lookup path, always check the poison flags after locking a swap cache folio. There are two reasons to do so. The folio is unstable and could be removed from the swap cache anytime, so it's totally possible that the folio is no longer the backing folio of a swap entry, and could be an irrelevant poisoned folio. We might mistakenly kill a faulting process. And it's totally possible or even common for the slow swap in path (swapin_readahead) to bring in a cached folio. The cache folio could be poisoned, too. Only checking the poison flag in the fast path will miss such folios. The race window is tiny, so it's very unlikely to happen, though. While at it, also add a unlikely prefix. Link: https://lkml.kernel.org/r/20250916160100.31545-5-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Nhat Pham <nphamcs@gmail.com> Suggested-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm, swap: fix swap cache index error when retrying reclaimKairui Song1-4/+4
The allocator will reclaim cached slots while scanning. Currently, it will try again if reclaim found a folio that is already removed from the swap cache due to a race. But the following lookup will be using the wrong index. It won't cause any OOB issue since the swap cache index is truncated upon lookup, but it may lead to reclaiming of an irrelevant folio. This should not cause a measurable issue, but we should fix it. Link: https://lkml.kernel.org/r/20250916160100.31545-4-ryncsn@gmail.com Fixes: fae859550531 ("mm, swap: avoid reclaiming irrelevant swap cache") Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Chris Li <chrisl@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Suggested-by: Chris Li <chrisl@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm, swap: use unified helper for swap cache look upKairui Song7-70/+81
The swap cache lookup helper swap_cache_get_folio currently does readahead updates as well, so callers that are not doing swapin from any VMA or mapping are forced to reuse filemap helpers instead, and have to access the swap cache space directly. So decouple readahead update with swap cache lookup. Move the readahead update part into a standalone helper. Let the caller call the readahead update helper if they do readahead. And convert all swap cache lookups to use swap_cache_get_folio. After this commit, there are only three special cases for accessing swap cache space now: huge memory splitting, migration, and shmem replacing, because they need to lock the XArray. The following commits will wrap their accesses to the swap cache too, with special helpers. And worth noting, currently dropbehind is not supported for anon folio, and we will never see a dropbehind folio in swap cache. The unified helper can be updated later to handle that. While at it, add proper kernedoc for touched helpers. No functional change. Link: https://lkml.kernel.org/r/20250916160100.31545-3-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Barry Song <baohua@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Chris Li <chrisl@kernel.org> Acked-by: Nhat Pham <nphamcs@gmail.com> Suggested-by: Chris Li <chrisl@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm/memremap: remove unused get_dev_pagemap() parameterAlistair Popple3-20/+6
GUP no longer uses get_dev_pagemap(). As it was the only user of the get_dev_pagemap() pgmap caching feature it can be removed. Link: https://lkml.kernel.org/r/20250903225926.34702-2-apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm/gup: remove dead pgmap refcounting codeAlistair Popple1-41/+26
Prior to commit aed877c2b425 ("device/dax: properly refcount device dax pages when mapping") ZONE_DEVICE pages were not fully reference counted when mapped into user page tables. Instead GUP would take a reference on the associated pgmap to ensure the results of pfn_to_page() remained valid. This is no longer required and most of the code was removed by commit fd2825b0760a ("mm/gup: remove pXX_devmap usage from get_user_pages()"). Finish cleaning this up by removing the dead calls to put_dev_pagemap() and the temporary context struct. Link: https://lkml.kernel.org/r/20250903225926.34702-1-apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm/page_alloc: check the correct buddy if it is a starting blockWei Yang1-17/+8
find_large_buddy() search buddy based on start_pfn, which maybe different from page's pfn, e.g. when page is not pageblock aligned, because prep_move_freepages_block() always align start_pfn to pageblock. This means when we found a starting block at start_pfn, it may check on the wrong page theoretically. And not split the free page as it is supposed to, causing a freelist migratetype mismatch. The good news is the page passed to __move_freepages_block_isolate() has only two possible cases: * page is pageblock aligned * page is __first_valid_page() of this block So it is safe for the first case, and it won't get a buddy larger than pageblock for the second case. To fix the issue, check the returned pfn of find_large_buddy() to decide whether to split the free page: 1. if it is not a PageBuddy pfn, no split; 2. if it is a PageBuddy pfn but order <= pageblock_order, no split; 3. if it is a PageBuddy pfn with order > pageblock_order, start_pfn is either in the starting block or tail block, split the PageBuddy at pageblock_order level. Link: https://lkml.kernel.org/r/20250905140358.28849-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm/hwpoison: decouple hwpoison_filter from mm/memory-failure.cMiaohe Lin4-97/+113
mm/memory-failure.c defines and uses hwpoison_filter_* parameters but the values of those parameters can only be modified via mm/hwpoison-inject.c from userspace. They have a potentially different life time. Decouple those parameters from mm/memory-failure.c to fix this broken layering. Link: https://lkml.kernel.org/r/20250904062258.3336092-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Suggested-by: Michal Hocko <mhocko@suse.com> Cc: David Hildenbrand <david@redhat.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21filemap: optimize folio refount update in filemap_map_pagesJinjiang Tu1-6/+14
There are two meaningless folio refcount update for order0 folio in filemap_map_pages(). First, filemap_map_order0_folio() adds folio refcount after the folio is mapped to pte. And then, filemap_map_pages() drops a refcount grabbed by next_uptodate_folio(). We could remain the refcount unchanged in this case. As Matthew metenioned in [1], it is safe to call folio_unlock() before calling folio_put() here, because the folio is in page cache with refcount held, and truncation will wait for the unlock. Optimize filemap_map_folio_range() with the same method too. With this patch, we can get 8% performance gain for lmbench testcase 'lat_pagefault -P 1 file' in order0 folio case, the size of file is 512M. Link: https://lkml.kernel.org/r/20250904132737.1250368-1-tujinjiang@huawei.com Link: https://lore.kernel.org/all/aKcU-fzxeW3xT5Wv@casper.infradead.org/ [1] Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm: shmem: fix the strategy for the tmpfs 'huge=' optionsBaolin Wang1-42/+5
After commit acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs"), we have extended tmpfs to allow any sized large folios, rather than just PMD-sized large folios. The strategy discussed previously was: : Considering that tmpfs already has the 'huge=' option to control the : PMD-sized large folios allocation, we can extend the 'huge=' option to : allow any sized large folios. The semantics of the 'huge=' mount option : are: : : huge=never: no any sized large folios : huge=always: any sized large folios : huge=within_size: like 'always' but respect the i_size : huge=advise: like 'always' if requested with madvise() : : Note: for tmpfs mmap() faults, due to the lack of a write size hint, still : allocate the PMD-sized huge folios if huge=always/within_size/advise is : set. : : Moreover, the 'deny' and 'force' testing options controlled by : '/sys/kernel/mm/transparent_hugepage/shmem_enabled', still retain the same : semantics. The 'deny' can disable any sized large folios for tmpfs, while : the 'force' can enable PMD sized large folios for tmpfs. This means that when tmpfs is mounted with 'huge=always' or 'huge=within_size', tmpfs will allow getting a highest order hint based on the size of write() and fallocate() paths. It will then try each allowable large order, rather than continually attempting to allocate PMD-sized large folios as before. However, this might break some user scenarios for those who want to use PMD-sized large folios, such as the i915 driver which did not supply a write size hint when allocating shmem [1]. Moreover, Hugh also complained that this will cause a regression in userspace with 'huge=always' or 'huge=within_size'. So, let's revisit the strategy for tmpfs large page allocation. A simple fix would be to always try PMD-sized large folios first, and if that fails, fall back to smaller large folios. This approach differs from the strategy for large folio allocation used by other file systems, however, tmpfs is somewhat different from other file systems, as quoted from David's opinion: : There were opinions in the past that tmpfs should just behave like any : other fs, and I think that's what we tried to satisfy here: use the write : size as an indication. : : I assume there will be workloads where either approach will be beneficial. : I also assume that workloads that use ordinary fs'es could benefit from : the same strategy (start with PMD), while others will clearly not. Link: https://lkml.kernel.org/r/10e7ac6cebe6535c137c064d5c5a235643eebb4a.1756888965.git.baolin.wang@linux.alibaba.com Link: https://lore.kernel.org/lkml/0d734549d5ed073c80b11601da3abdd5223e1889.1753689802.git.baolin.wang@linux.alibaba.com/ [1] Fixes: acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm/show_mem: add trylock while printing alloc infoYueyang Pan1-1/+4
In production, show_mem() can be called concurrently from two different entities, for example one from oom_kill_process() another from __alloc_pages_slowpath from another kthread. This patch adds a spinlock and invokes trylock before printing out the kernel alloc info in show_mem(). This way two alloc info won't interleave with each other, which then makes parsing easier. Link: https://lkml.kernel.org/r/4ed91296e0c595d945a38458f7a8d9611b0c1e52.1756897825.git.pyyjason@gmail.com Signed-off-by: Yueyang Pan <pyyjason@gmail.com> Acked-by: Usama Arif <usamaarif642@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm/show_mem: dump the status of the mem alloc profiling before printingYueyang Pan1-1/+2
This patchset fixes two issues we saw in production rollout. The first issue is that we saw all zero output of memory allocation profiling information from show_mem() if CONFIG_MEM_ALLOC_PROFILING is set and sysctl.vm.mem_profiling=0. This cause ambiguity as we don't know what 0B actually means in the output. It can mean either memory allocation profiling is temporary disabled or the allocation at that position is actually 0. Such ambiguity will make further parsing harder as we cannot differentiate between two case. The second issue is that multiple entities can call show_mem() which messed up the allocation info in dmesg. We saw outputs like this: 327 MiB 83635 mm/compaction.c:1880 func:compaction_alloc 48.4 GiB 12684937 mm/memory.c:1061 func:folio_prealloc 7.48 GiB 10899 mm/huge_memory.c:1159 func:vma_alloc_anon_folio_pmd 298 MiB 95216 kernel/fork.c:318 func:alloc_thread_stack_node 250 MiB 63901 mm/zsmalloc.c:987 func:alloc_zspage 1.42 GiB 372527 mm/memory.c:1063 func:folio_prealloc 1.17 GiB 95693 mm/slub.c:2424 func:alloc_slab_page 651 MiB 166732 mm/readahead.c:270 func:page_cache_ra_unbounded 419 MiB 107261 net/core/page_pool.c:572 func:__page_pool_alloc_pages_slow 404 MiB 103425 arch/x86/mm/pgtable.c:25 func:pte_alloc_one The above example is because one kthread invokes show_mem() from __alloc_pages_slowpath while kernel itself calls oom_kill_process() This patch (of 2): This patch prints the status of the memory allocation profiling before __show_mem actually prints the detailed allocation info. This way will let us know the `0B` we saw in allocation info is because the profiling is disabled or the allocation is actually 0B. Link: https://lkml.kernel.org/r/cover.1756897825.git.pyyjason@gmail.com Link: https://lkml.kernel.org/r/d7998ea0ddc2ea1a78bb6e89adf530526f76679a.1756897825.git.pyyjason@gmail.com Signed-off-by: Yueyang Pan <pyyjason@gmail.com> Acked-by: Usama Arif <usamaarif642@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm/page_alloc: add kernel-docs for free_pages()Vishal Moola (Oracle)1-0/+9
Patch series "Cleanup free_pages() misuse", v3. free_pages() is supposed to be called when we only have a virtual address. __free_pages() is supposed to be called when we have a page. There are a number of callers that use page_address() to get a page's virtual address then call free_pages() on it when they should just call __free_pages() directly. Add kernel-docs for free_pages() to help callers better understand which function they should be calling, and replace the obvious cases of misuse. This patch (of 7): Add kernel-docs to free_pages(). This will help callers understand when to use it instead of __free_pages(). Link: https://lkml.kernel.org/r/20250903185921.1785167-1-vishal.moola@gmail.com Link: https://lkml.kernel.org/r/20250903185921.1785167-2-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: SeongJae Park <sj@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Justin Sanders <justin@coraid.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm/filemap: align last_index to folio sizeYouling Tang1-2/+3
On XFS systems with pagesize=4K, blocksize=16K, and CONFIG_TRANSPARENT_HUGEPAGE enabled, We observed the following readahead behaviors: # echo 3 > /proc/sys/vm/drop_caches # dd if=test of=/dev/null bs=64k count=1 # ./tools/mm/page-types -r -L -f /mnt/xfs/test foffset offset flags 0 136d4c __RU_l_________H______t_________________F_1 1 136d4d __RU_l__________T_____t_________________F_1 2 136d4e __RU_l__________T_____t_________________F_1 3 136d4f __RU_l__________T_____t_________________F_1 ... c 136bb8 __RU_l_________H______t_________________F_1 d 136bb9 __RU_l__________T_____t_________________F_1 e 136bba __RU_l__________T_____t_________________F_1 f 136bbb __RU_l__________T_____t_________________F_1 <-- first read 10 13c2cc ___U_l_________H______t______________I__F_1 <-- readahead flag 11 13c2cd ___U_l__________T_____t______________I__F_1 12 13c2ce ___U_l__________T_____t______________I__F_1 13 13c2cf ___U_l__________T_____t______________I__F_1 ... 1c 1405d4 ___U_l_________H______t_________________F_1 1d 1405d5 ___U_l__________T_____t_________________F_1 1e 1405d6 ___U_l__________T_____t_________________F_1 1f 1405d7 ___U_l__________T_____t_________________F_1 [ra_size = 32, req_count = 16, async_size = 16] # echo 3 > /proc/sys/vm/drop_caches # dd if=test of=/dev/null bs=60k count=1 # ./page-types -r -L -f /mnt/xfs/test foffset offset flags 0 136048 __RU_l_________H______t_________________F_1 ... c 110a40 __RU_l_________H______t_________________F_1 d 110a41 __RU_l__________T_____t_________________F_1 e 110a42 __RU_l__________T_____t_________________F_1 <-- first read f 110a43 __RU_l__________T_____t_________________F_1 <-- first readahead flag 10 13e7a8 ___U_l_________H______t_________________F_1 ... 20 137a00 ___U_l_________H______t_______P______I__F_1 <-- second readahead flag (20 - 2f) 21 137a01 ___U_l__________T_____t_______P______I__F_1 ... 3f 10d4af ___U_l__________T_____t_______P_________F_1 [first readahead: ra_size = 32, req_count = 15, async_size = 17] When reading 64k data (same for 61-63k range, where last_index is page-aligned in filemap_get_pages()), 128k readahead is triggered via page_cache_sync_ra() and the PG_readahead flag is set on the next folio (the one containing 0x10 page). When reading 60k data, 128k readahead is also triggered via page_cache_sync_ra(). However, in this case the readahead flag is set on the 0xf page. Although the requested read size (req_count) is 60k, the actual read will be aligned to folio size (64k), which triggers the readahead flag and initiates asynchronous readahead via page_cache_async_ra(). This results in two readahead operations totaling 256k. The root cause is that when the requested size is smaller than the actual read size (due to folio alignment), it triggers asynchronous readahead. By changing last_index alignment from page size to folio size, we ensure the requested size matches the actual read size, preventing the case where a single read operation triggers two readahead operations. After applying the patch: # echo 3 > /proc/sys/vm/drop_caches # dd if=test of=/dev/null bs=60k count=1 # ./page-types -r -L -f /mnt/xfs/test foffset offset flags 0 136d4c __RU_l_________H______t_________________F_1 1 136d4d __RU_l__________T_____t_________________F_1 2 136d4e __RU_l__________T_____t_________________F_1 3 136d4f __RU_l__________T_____t_________________F_1 ... c 136bb8 __RU_l_________H______t_________________F_1 d 136bb9 __RU_l__________T_____t_________________F_1 e 136bba __RU_l__________T_____t_________________F_1 <-- first read f 136bbb __RU_l__________T_____t_________________F_1 10 13c2cc ___U_l_________H______t______________I__F_1 <-- readahead flag 11 13c2cd ___U_l__________T_____t______________I__F_1 12 13c2ce ___U_l__________T_____t______________I__F_1 13 13c2cf ___U_l__________T_____t______________I__F_1 ... 1c 1405d4 ___U_l_________H______t_________________F_1 1d 1405d5 ___U_l__________T_____t_________________F_1 1e 1405d6 ___U_l__________T_____t_________________F_1 1f 1405d7 ___U_l__________T_____t_________________F_1 [ra_size = 32, req_count = 16, async_size = 16] The same phenomenon will occur when reading from 49k to 64k. Set the readahead flag to the next folio. Because the minimum order of folio in address_space equals the block size (at least in xfs and bcachefs that already support bs > ps), having request_count aligned to block size will not cause overread. [klarasmodin@gmail.com: fix overflow on 32-bit] Link: https://lkml.kernel.org/r/yru7qf5gvyzccq5ohhpylvxug5lr5tf54omspbjh4sm6pcdb2r@fpjgj2pxw7va [akpm@linux-foundation.org: update it for Max's constification efforts] Link: https://lkml.kernel.org/r/20250711055509.91587-1-youling.tang@linux.dev Co-developed-by: Chi Zhiling <chizhiling@kylinos.cn> Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn> Signed-off-by: Youling Tang <tangyouling@kylinos.cn> Signed-off-by: Klara Modin <klarasmodin@gmail.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Youling Tang <youling.tang@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Klara Modin <klarasmodin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm: constify highmem related functions for improved const-correctnessMax Kellermann1-5/+5
Lots of functions in mm/highmem.c do not write to the given pointers and do not call functions that take non-const pointers and can therefore be constified. This includes functions like kunmap() which might be implemented in a way that writes to the pointer (e.g. to update reference counters or mapping fields), but currently are not. kmap() on the other hand cannot be made const because it calls set_page_address() which is non-const in some architectures/configurations. [akpm@linux-foundation.org: "fix" folio_page() build failure] Link: https://lkml.kernel.org/r/20250901205021.3573313-13-max.kellermann@ionos.com Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Borislav Betkov <bp@alien8.de> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christian Zankel <chris@zankel.net> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Bottomley <james.bottomley@HansenPartnership.com> Cc: Jan Kara <jack@suse.cz> Cc: Jocelyn Falempe <jfalempe@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: "Nysal Jan K.A" <nysal@linux.ibm.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russel King <linux@armlinux.org.uk> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Huth <thuth@redhat.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm: constify arch_pick_mmap_layout() for improved const-correctnessMax Kellermann1-3/+3
This function only reads from the rlimit pointer (but writes to the mm_struct pointer which is kept without `const`). All callees are already const-ified or (internal functions) are being constified by this patch. Link: https://lkml.kernel.org/r/20250901205021.3573313-9-max.kellermann@ionos.com Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Borislav Betkov <bp@alien8.de> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christian Zankel <chris@zankel.net> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Bottomley <james.bottomley@HansenPartnership.com> Cc: Jan Kara <jack@suse.cz> Cc: Jocelyn Falempe <jfalempe@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: "Nysal Jan K.A" <nysal@linux.ibm.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russel King <linux@armlinux.org.uk> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Huth <thuth@redhat.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm, s390: constify mapping related test/getter functionsMax Kellermann1-5/+5
For improved const-correctness. We select certain test functions which either invoke each other, functions that are already const-ified, or no further functions. It is therefore relatively trivial to const-ify them, which provides a basis for further const-ification further up the call stack. (Even though seemingly unrelated, this also constifies the pointer parameter of mmap_is_legacy() in arch/s390/mm/mmap.c because a copy of the function exists in mm/util.c.) Link: https://lkml.kernel.org/r/20250901205021.3573313-7-max.kellermann@ionos.com Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Borislav Betkov <bp@alien8.de> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christian Zankel <chris@zankel.net> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Bottomley <james.bottomley@HansenPartnership.com> Cc: Jan Kara <jack@suse.cz> Cc: Jocelyn Falempe <jfalempe@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: "Nysal Jan K.A" <nysal@linux.ibm.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russel King <linux@armlinux.org.uk> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Huth <thuth@redhat.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm: constify process_shares_mm() for improved const-correctnessMax Kellermann1-3/+3
This function only reads from the pointer arguments. Local (loop) variables are also annotated with `const` to clarify that these will not be written to. Link: https://lkml.kernel.org/r/20250901205021.3573313-6-max.kellermann@ionos.com Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Borislav Betkov <bp@alien8.de> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christian Zankel <chris@zankel.net> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Bottomley <james.bottomley@HansenPartnership.com> Cc: Jan Kara <jack@suse.cz> Cc: Jocelyn Falempe <jfalempe@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: "Nysal Jan K.A" <nysal@linux.ibm.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russel King <linux@armlinux.org.uk> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Huth <thuth@redhat.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm: constify shmem related test functions for improved const-correctnessMax Kellermann1-3/+3
Patch series "mm: establish const-correctness for pointer parameters", v6. This series is to improved const-correctness in the low-level memory-management subsystem, which provides a basis for further constification further up the call stack (e.g. filesystems). I started this work when I tried to constify the Ceph filesystem code, but found that to be impossible because many "mm" functions accept non-const pointers, even though they modify nothing. This patch (of 12): We select certain test functions which either invoke each other, functions that are already const-ified, or no further functions. It is therefore relatively trivial to const-ify them, which provides a basis for further const-ification further up the call stack. Link: https://lkml.kernel.org/r/20250901205021.3573313-1-max.kellermann@ionos.com Link: https://lkml.kernel.org/r/20250901205021.3573313-2-max.kellermann@ionos.com Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Borislav Betkov <bp@alien8.de> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christian Zankel <chris@zankel.net> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Bottomley <james.bottomley@HansenPartnership.com> Cc: Jan Kara <jack@suse.cz> Cc: Jocelyn Falempe <jfalempe@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: "Nysal Jan K.A" <nysal@linux.ibm.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russel King <linux@armlinux.org.uk> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Huth <thuth@redhat.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm: hugeltb: check NUMA_NO_NODE in only_alloc_fresh_hugetlb_folio()Kefeng Wang1-4/+3
Move the NUMA_NO_NODE check out of buddy and gigantic folio allocation to cleanup code a bit, also this will avoid NUMA_NO_NODE passed as 'nid' to node_isset() in alloc_buddy_hugetlb_folio(). Link: https://lkml.kernel.org/r/20250910133958.301467-6-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm: hugetlb: remove struct hstate from init_new_hugetlb_folio()Kefeng Wang1-4/+4
The struct hstate is never used since commit d67e32f26713 ("hugetlb: restructure pool allocations”), remove it. Link: https://lkml.kernel.org/r/20250910133958.301467-5-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm: hugetlb: directly pass order when allocate a hugetlb folioKefeng Wang3-20/+18
Use order instead of struct hstate to remove huge_page_order() call from all hugetlb folio allocation, also order_is_gigantic() is added to check whether it is a gigantic order. Link: https://lkml.kernel.org/r/20250910133958.301467-4-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm: hugetlb: convert to account_new_hugetlb_folio()Kefeng Wang1-8/+8
In order to avoid the wrong nid passed into the account, and we did make such mistake before, so it's better to move folio_nid() into account_new_hugetlb_folio(). Link: https://lkml.kernel.org/r/20250910133958.301467-3-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm: hugetlb: convert to use more alloc_fresh_hugetlb_folio()Kefeng Wang1-32/+14
Patch series "mm: hugetlb: cleanup hugetlb folio allocation", v3. Some cleanups for hugetlb folio allocation. This patch (of 3): Simplify alloc_fresh_hugetlb_folio() and convert more functions to use it, which help us to remove prep_new_hugetlb_folio() and __prep_new_hugetlb_folio(). Link: https://lkml.kernel.org/r/20250910133958.301467-1-wangkefeng.wang@huawei.com Link: https://lkml.kernel.org/r/20250910133958.301467-2-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm: show_mem: show number of zspages in show_free_areasThadeu Lima de Souza Cascardo1-0/+6
When OOM is triggered, it will show where the pages might be for each zone. When using zram or zswap, it might look like lots of pages are missing. After this patch, zspages are shown as below. [ 48.792859] Node 0 DMA free:2812kB boost:0kB min:60kB low:72kB high:84kB reserved_highatomic:0KB free_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB zspages:11160kB present:15992kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB [ 48.792962] lowmem_reserve[]: 0 956 956 956 956 [ 48.792988] Node 0 DMA32 free:3512kB boost:0kB min:3912kB low:4888kB high:5864kB reserved_highatomic:0KB free_highatomic:0KB active_anon:0kB inactive_anon:28kB active_file:8kB inactive_file:16kB unevictable:0kB writepending:0kB zspages:916780kB present:1032064kB managed:978944kB mlocked:0kB bounce:0kB free_pcp:500kB local_pcp:248kB free_cma:0kB [ 48.793118] lowmem_reserve[]: 0 0 0 0 0 Link: https://lkml.kernel.org/r/20250902-show_mem_zspages-v2-1-545daaa8b410@igalia.com Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: SeongJae Park <sj@kernel.org> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm/hugetlb: retry to allocate for early boot hugepage allocationLi RongQing1-4/+22
In cloud environments with massive hugepage reservations (95%+ of system RAM), single-attempt allocation during early boot often fails due to memory pressure. Commit 91f386bf0772 ("hugetlb: batch freeing of vmemmap pages") intensified this by deferring page frees, increase peak memory usage during allocation. Introduce a retry mechanism that leverages vmemmap optimization reclaim (~1.6% memory) when available. Upon initial allocation failure, the system retries until successful or no further progress is made, ensuring reliable hugepage allocation while preserving batched vmemmap freeing benefits. Testing on a 256G machine allocating 252G of hugepages: Before: 128056/129024 hugepages allocated After: Successfully allocated all 129024 hugepages Link: https://lkml.kernel.org/r/20250901082052.3247-1-lirongqing@baidu.com Signed-off-by: Li RongQing <lirongqing@baidu.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Li RongQing <lirongqing@baidu.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21kasan: apply write-only mode in kasan kunit testcasesYeoreum Yun1-69/+136
When KASAN is configured in write-only mode, fetch/load operations do not trigger tag check faults. As a result, the outcome of some test cases may differ compared to when KASAN is configured without write-only mode. Therefore, by modifying pre-exist testcases check the write only makes tag check fault (TCF) where writing is perform in "allocated memory" but tag is invalid (i.e) redzone write in atomic_set() testcases. Otherwise check the invalid fetch/read doesn't generate TCF. Also, skip some testcases affected by initial value (i.e) atomic_cmpxchg() testcase maybe successd if it passes valid atomic_t address and invalid oldaval address. In this case, if invalid atomic_t doesn't have the same oldval, it won't trigger write operation so the test will pass. Link: https://lkml.kernel.org/r/20250916222755.466009-3-yeoreum.yun@arm.com Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Reviewed-by: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Breno Leitao <leitao@debian.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: D Scott Phillips <scott@os.amperecomputing.com> Cc: Hardevsinh Palaniya <hardevsinh.palaniya@siliconsignals.io> Cc: James Morse <james.morse@arm.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Oliver Upton <oliver.upton@linux.dev> Cc: Pankaj Gupta <pankaj.gupta@amd.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21kasan/hw-tags: introduce kasan.write_only optionYeoreum Yun2-2/+50
Patch series "introduce kasan.write_only option in hw-tags", v8. Hardware tag based KASAN is implemented using the Memory Tagging Extension (MTE) feature. MTE is built on top of the ARMv8.0 virtual address tagging TBI (Top Byte Ignore) feature and allows software to access a 4-bit allocation tag for each 16-byte granule in the physical address space. A logical tag is derived from bits 59-56 of the virtual address used for the memory access. A CPU with MTE enabled will compare the logical tag against the allocation tag and potentially raise an tag check fault on mismatch, subject to system registers configuration. Since ARMv8.9, FEAT_MTE_STORE_ONLY can be used to restrict raise of tag check fault on store operation only. Using this feature (FEAT_MTE_STORE_ONLY), introduce KASAN write-only mode which restricts KASAN check write (store) operation only. This mode omits KASAN check for read (fetch/load) operation. Therefore, it might be used not only debugging purpose but also in normal environment. This patch (of 2): Since Armv8.9, FEATURE_MTE_STORE_ONLY feature is introduced to restrict raise of tag check fault on store operation only. Introduce KASAN write only mode based on this feature. KASAN write only mode restricts KASAN checks operation for write only and omits the checks for fetch/read operations when accessing memory. So it might be used not only debugging enviroment but also normal enviroment to check memory safty. This features can be controlled with "kasan.write_only" arguments. When "kasan.write_only=on", KASAN checks write operation only otherwise KASAN checks all operations. This changes the MTE_STORE_ONLY feature as BOOT_CPU_FEATURE like ARM64_MTE_ASYMM so that makes it initialise in kasan_init_hw_tags() with other function together. Link: https://lkml.kernel.org/r/20250916222755.466009-1-yeoreum.yun@arm.com Link: https://lkml.kernel.org/r/20250916222755.466009-2-yeoreum.yun@arm.com Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Breno Leitao <leitao@debian.org> Cc: David Hildenbrand <david@redhat.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: D Scott Phillips <scott@os.amperecomputing.com> Cc: Hardevsinh Palaniya <hardevsinh.palaniya@siliconsignals.io> Cc: James Morse <james.morse@arm.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: levi.yun <yeoreum.yun@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Oliver Upton <oliver.upton@linux.dev> Cc: Pankaj Gupta <pankaj.gupta@amd.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21kfence: drop nth_page() usageDavid Hildenbrand1-5/+7
We want to get rid of nth_page(), and kfence init code is the last user. Unfortunately, we might actually walk a PFN range where the pages are not contiguous, because we might be allocating an area from memblock that could span memory sections in problematic kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP). We could check whether the page range is contiguous using page_range_contiguous() and failing kfence init, or making kfence incompatible these problemtic kernel configs. Let's keep it simple and simply use pfn_to_page() by iterating PFNs. Link: https://lkml.kernel.org/r/20250901150359.867252-36-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Marco Elver <elver@google.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm/gup: drop nth_page() usage in unpin_user_page_range_dirty_lock()David Hildenbrand1-1/+7
There is the concern that unpin_user_page_range_dirty_lock() might do some weird merging of PFN ranges -- either now or in the future -- such that PFN range is contiguous but the page range might not be. Let's sanity-check for that and drop the nth_page() usage. Link: https://lkml.kernel.org/r/20250901150359.867252-35-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21scatterlist: disallow non-contigous page ranges in a single SG entryDavid Hildenbrand1-0/+1
The expectation is that there is currently no user that would pass in non-contigous page ranges: no allocator, not even VMA, will hand these out. The only problematic part would be if someone would provide a range obtained directly from memblock, or manually merge problematic ranges. If we find such cases, we should fix them to create separate SG entries. Let's check in sg_set_page() that this is really the case. No need to check in sg_set_folio(), as pages in a folio are guaranteed to be contiguous. As sg_set_page() gets inlined into modules, we have to export the page_range_contiguous() helper -- use EXPORT_SYMBOL, there is nothing special about this helper such that we would want to enforce GPL-only modules. We can now drop the nth_page() usage in sg_page_iter_page(). Link: https://lkml.kernel.org/r/20250901150359.867252-25-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Marek Szyprowski <m.szyprowski@samsung.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm/cma: refuse handing out non-contiguous page rangesDavid Hildenbrand2-15/+59
Let's disallow handing out PFN ranges with non-contiguous pages, so we can remove the nth-page usage in __cma_alloc(), and so any callers don't have to worry about that either when wanting to blindly iterate pages. This is really only a problem in configs with SPARSEMEM but without SPARSEMEM_VMEMMAP, and only when we would cross memory sections in some cases. Will this cause harm? Probably not, because it's mostly 32bit that does not support SPARSEMEM_VMEMMAP. If this ever becomes a problem we could look into allocating the memmap for the memory sections spanned by a single CMA region in one go from memblock. [david@redhat.com: we can have NUMMU configs with SPARSEMEM enabled] Link: https://lkml.kernel.org/r/6ec933b1-b3f7-41c0-95d8-e518bb87375e@redhat.com Link: https://lkml.kernel.org/r/20250901150359.867252-23-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm/gup: remove record_subpages()David Hildenbrand1-17/+10
We can just cleanup the code by calculating the #refs earlier, so we can just inline what remains of record_subpages(). Calculate the number of references/pages ahead of times, and record them only once all our tests passed. [david@redhat.com: fix `pages' adjustment] Link: https://lkml.kernel.org/r/cc7f03f8-da8b-407e-a03a-e8e5a9ec5462@redhat.com Link: https://lkml.kernel.org/r/20250901150359.867252-20-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm/gup: drop nth_page() usage within folio when recording subpagesDavid Hildenbrand1-4/+3
nth_page() is no longer required when iterating over pages within a single folio, so let's just drop it when recording subpages. Link: https://lkml.kernel.org/r/20250901150359.867252-19-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm/pagewalk: drop nth_page() usage within folio in folio_walk_start()David Hildenbrand1-1/+1
It's no longer required to use nth_page() within a folio, so let's just drop the nth_page() in folio_walk_start(). Link: https://lkml.kernel.org/r/20250901150359.867252-18-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm/mm/percpu-km: drop nth_page() usage within single allocationDavid Hildenbrand1-1/+1
We're allocating a higher-order page from the buddy. For these pages (that are guaranteed to not exceed a single memory section) there is no need to use nth_page(). Link: https://lkml.kernel.org/r/20250901150359.867252-15-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap()David Hildenbrand1-8/+12
We can now safely iterate over all pages in a folio, so no need for the pfn_to_page(). Also, as we already force the refcount in __init_single_page() to 1 through init_page_count(), we can just set the refcount to 0 and avoid page_ref_freeze() + VM_BUG_ON. Likely, in the future, we would just want to tell __init_single_page() to which value to initialize the refcount. Further, adjust the comments to highlight that we are dealing with an open-coded prep_compound_page() variant, and add another comment explaining why we really need the __init_single_page() only on the tail pages. Note that the current code was likely problematic, but we never ran into it: prep_compound_tail() would have been called with an offset that might exceed a memory section, and prep_compound_tail() would have simply added that offset to the page pointer -- which would not have done the right thing on sparsemem without vmemmap. Link: https://lkml.kernel.org/r/20250901150359.867252-14-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm: sanity-check maximum folio size in folio_set_order()David Hildenbrand1-0/+1
Let's sanity-check in folio_set_order() whether we would be trying to create a folio with an order that would make it exceed MAX_FOLIO_ORDER. This will enable the check whenever a folio/compound page is initialized through prepare_compound_head() / prepare_compound_page() with CONFIG_DEBUG_VM set. Link: https://lkml.kernel.org/r/20250901150359.867252-11-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm/mm_init: make memmap_init_compound() look more like prep_compound_page()David Hildenbrand1-8/+7
Grepping for "prep_compound_page" leaves on clueless how devdax gets its compound pages initialized. Let's add a comment that might help finding this open-coded prep_compound_page() initialization more easily. Further, let's be less smart about the ordering of initialization and just perform the prep_compound_head() call after all tail pages were initialized: just like prep_compound_page() does. No need for a comment to describe the initialization order: again, just like prep_compound_page(). Link: https://lkml.kernel.org/r/20250901150359.867252-10-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm/hugetlb: check for unreasonable folio sizes when registering hstateDavid Hildenbrand1-0/+2
Let's check that no hstate that corresponds to an unreasonable folio size is registered by an architecture. If we were to succeed registering, we could later try allocating an unsupported gigantic folio size. Further, let's add a BUILD_BUG_ON() for checking that HUGETLB_PAGE_ORDER is sane at build time. As HUGETLB_PAGE_ORDER is dynamic on powerpc, we have to use a BUILD_BUG_ON_INVALID() to make it compile. No existing kernel configuration should be able to trigger this check: either SPARSEMEM without SPARSEMEM_VMEMMAP cannot be configured or gigantic folios will not exceed a memory section (the case on sparse). Link: https://lkml.kernel.org/r/20250901150359.867252-9-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>