Dr. Mahmoud Mabrouk’s Post

Co-Founder @ Agenta | Helping teams ship reliable LLM Apps

A new paper proposes a framework to test for AGI. GPT-5 made a huge jump of 30 points in just two years, but it still has a fundamental gap.

The researchers tested 10 cognitive domains. Math ability climbed by more than 100%. Reasoning went from 0% to roughly 60%. Visual processing rose from 0% to 20%. Reading and writing jumped from 60% to 100%.

𝐓𝐡𝐞 𝐟𝐮𝐧𝐝𝐚𝐦𝐞𝐧𝐭𝐚𝐥 𝐠𝐚𝐩

Two capabilities stayed very low: long-term memory storage and memory retrieval precision. Long-term memory storage scored 0% for both GPT-4 and GPT-5. Memory retrieval precision scored 40% for both models, meaning hallucinations persist at the same rate.

This gap maps directly onto real integration problems. A report from MIT on enterprise AI adoption found the top complaint from companies: AI repeats the same mistakes. It does not learn from corrections. It does not remember user preferences. Your chatbot forgets context across sessions.

𝐖𝐡𝐲 𝐜𝐨𝐧𝐭𝐞𝐱𝐭 𝐰𝐢𝐧𝐝𝐨𝐰𝐬 𝐝𝐞𝐠𝐫𝐚𝐝𝐞

Humans abstract context all the time. We summarize what matters and let the details fade; we do not hold every word equally in memory. LLMs work differently: they hold everything with equal weight until the context degrades. You have seen this in practice: large context windows lose quality over time, important details get buried, and the model struggles to surface what matters.

The fundamental issue is how memory works. In a long conversation, humans build abstractions (what is this conversation about? what are the key points?). LLMs treat all tokens equally, so over time the important information gets lost in noise.

𝐂𝐮𝐫𝐫𝐞𝐧𝐭 𝐰𝐨𝐫𝐤𝐚𝐫𝐨𝐮𝐧𝐝𝐬

Building AI applications today means working around these gaps. Coding agents, for instance, write summaries at regular intervals. They use agentic workflows to iterate on key points (to-do lists, important findings) and keep them visible in context, which prevents important information from getting buried. A rough sketch of this pattern follows below.

RAG systems compensate for memory failures: they retrieve information from external storage because the model cannot reliably access its own knowledge.

𝐖𝐡𝐚𝐭 𝐢𝐭 𝐦𝐞𝐚𝐧𝐬

There is clearly a lot of value in the engineering needed to build reliable intelligent systems for specific use cases: designing agentic workflows, refining prompts, and swapping models. We're building Agenta, an open-source LLMOps platform that lets you manage the whole AI engineering process. If you're building in this space, check it out.
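To make the summarization workaround concrete, here is a minimal sketch of a rolling-summary context manager. It assumes a generic call_llm(prompt) helper standing in for whatever provider you use; the class name, the interval, and the prompt wording are illustrative assumptions, not taken from the paper or any specific agent.

```python
# Minimal sketch of the "summarize at intervals, keep key points visible" pattern.
# call_llm is a hypothetical stand-in for your provider's completion call.

def call_llm(prompt: str) -> str:
    # Placeholder: swap in your actual chat/completions client here.
    return "summary: " + prompt[:200]

SUMMARIZE_EVERY = 10  # compress after this many raw turns (arbitrary choice)

class RollingContext:
    def __init__(self):
        self.summary = ""   # abstracted history, updated periodically
        self.todo = []      # key points kept verbatim and always visible
        self.recent = []    # raw recent turns, dropped after compression

    def add_turn(self, role: str, text: str) -> None:
        self.recent.append(f"{role}: {text}")
        if len(self.recent) >= SUMMARIZE_EVERY:
            self._compress()

    def _compress(self) -> None:
        # Fold raw turns into the running summary so details are abstracted
        # instead of sitting in context with equal weight until they get buried.
        prompt = (
            "Update this summary with the new conversation turns.\n\n"
            f"Current summary:\n{self.summary}\n\n"
            "New turns:\n" + "\n".join(self.recent)
        )
        self.summary = call_llm(prompt)
        self.recent = []

    def build_prompt(self, user_message: str) -> str:
        # Every request leads with the summary and open to-do items,
        # followed only by the most recent raw turns.
        todo_block = "\n".join(f"- {item}" for item in self.todo) or "- (none)"
        return (
            f"Summary so far:\n{self.summary}\n\n"
            f"Open items:\n{todo_block}\n\n"
            "Recent turns:\n" + "\n".join(self.recent) + "\n\n"
            f"User: {user_message}"
        )
```

The same idea scales to the to-do lists and findings that coding agents rewrite each iteration: the abstraction happens explicitly in the workflow, because the model will not do it on its own.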
