Opened 3 weeks ago
Last modified 3 weeks ago
#36700 new Cleanup/optimization
ASGIHandler creates reference cycles that require a gc pass to free
| Reported by: | Patryk Zawadzki | Owned by: | |
|---|---|---|---|
| Component: | HTTP handling | Version: | 5.2 |
| Severity: | Normal | Keywords: | memory asgihandler gc |
| Cc: | Patryk Zawadzki, Carlton Gibson | Triage Stage: | Accepted |
| Has patch: | no | Needs documentation: | no |
| Needs tests: | no | Patch needs improvement: | no |
| Easy pickings: | no | UI/UX: | no |
Description
Disclaimer: it's impossible for pure Python code to truly leak memory (in the sense that valgrind would detect), however it's quite easy to create structures that effectively occupy memory for a long time because they require the deepest (generation 2) garbage collection cycle to collect and that happens very rarely. In addition to that, the more such structures aggregate, the more expensive the garbage collection cycle becomes, because it effectively stops the entire interpreter to do its job and it can take seconds. On top of that, it's entirely possible for a container to run out of memory before the garbage collection happens and we (Saleor Commerce) see containers being terminated by the kernel OOM killer due to high memory pressure where most of that memory is locked by garbage.
One such case is found in the ASGIHandler. When handling a request, the ASGIHandler.handle spawns two async tasks. One for the actual app code (process_request) and one for the disconnection handler (ASGIHandler.listen_for_disconnect). The latter will raise RequestAborted every time it receives the http.disconnect ASGI message.
In our setup (uvicorn), the http.disconnect message is received for every request, even after successfully processing the view code and delivering the response, but that's not critical for this issue, it just makes it easy to reproduce this on our end.
Here's where the problem is:
- When
RequestAbortedis raised, its stack trace includes the call toASGIHandler.handle, which is whereASGIHandler.listen_for_disconnectwas called. - In turn, the
ASGIHandler.handlestack frame includes references to all local variables. - Among those variables is
taskswhich holds the references to both async tasks. - Now, one of those tasks is the task created from
ASGIHandler.listen_for_disconnect. - The task future is already resolved and now holds a reference back to the
RequestAbortedexception from step 1. And thus the cycle completes, creating an unfreeable reference cycle.
All of those objects hold references to other objects and stack frames that also become unfreeable, ending up holding a sizeable list of objects hostage until the next time gc.collect(2) happens (which can be minutes, depending on how much code your app executes).
Making ASGIHandler.handle explicitly call tasks.clear() or just del tasks after the tasks are no longer needed breaks the cycle by removing the link between the exception stack frame locals and the future referencing the exception.
PS: I've classified this as a bug as high memory use can lead to OOM kills and crashes but feel free to reclassify as "cleanup/optimization" if that's more fitting.
Attachments (1)
Change History (3)
by , 3 weeks ago
| Attachment: | asgi-ref-cycle.svg added |
|---|
comment:1 by , 3 weeks ago
| Cc: | added |
|---|---|
| Triage Stage: | Unreviewed → Accepted |
| Type: | Bug → Cleanup/optimization |
comment:2 by , 3 weeks ago
Hi Patryk, this seems plausible, yes. Thanks for the report.
See also #36315 — which is to use TaskGroup here. There's two related PRs there: gh19366 and gh19370. If you could have a look at those, it may be that that's the better way forward. (They're on my list to look at this cycle, but your eyes would be welcome. 🤹)
Thank you Patryk Zawadzki for the complete ticket report. Accepting on the basis that the solution provided can be accompanied by a proper test case ensuring proper garbage collection. Looking forward to your contribution!