Last week, my colleagues at Apple introduced FlashSigmoid, an optimization of the transformer attention layer that achieves a 17% inference speed-up on H100 GPUs. Attention is the backbone of the transformer architecture, crucial for tasks like language modeling and image understanding. However, it has also been a key bottleneck in scaling transformers to longer sequences, since its runtime and memory grow quadratically with sequence length. In recent years, many have worked to optimize attention computation, pushing the boundaries of what transformers can handle. Efforts like FlashAttention (Dao et al., 2022) and FlashAttention-2 (Dao, 2023) introduced methods to increase computational parallelism and reduce memory I/O, but there is still room for improvement. FlashSigmoid takes this further by replacing the traditional softmax with a sigmoid function, paired with specific normalization and regularization techniques. The authors have shared both the paper and code — definitely worth a look! Links in the comments below. #AI #MachineLearning #ComputerVision #LLM #NLP
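To make the idea concrete, here is a minimal sketch of sigmoid attention in PyTorch: the row-wise softmax is replaced by an elementwise sigmoid on the scaled scores plus a bias term. The constant bias of roughly -log(sequence length) is one normalization choice discussed in the sigmoid-attention literature and is an assumption here; the actual FlashSigmoid kernel, its normalization, and its regularization are not reproduced.

```python
import torch

def sigmoid_attention(q, k, v, bias=None):
    """Minimal sigmoid-attention sketch (not the FlashSigmoid kernel).
    q, k, v: tensors of shape (batch, heads, seq_len, head_dim)."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5          # (batch, heads, seq, seq)
    if bias is None:
        # Assumed normalization: a constant offset of about -log(seq_len) keeps
        # each row's total attention mass on the order of 1.
        bias = -torch.log(torch.tensor(float(k.size(-2))))
    weights = torch.sigmoid(scores + bias)               # elementwise, no row-wise normalization
    return weights @ v

# Usage on random tensors.
q, k, v = (torch.randn(1, 8, 128, 64) for _ in range(3))
print(sigmoid_attention(q, k, v).shape)                  # torch.Size([1, 8, 128, 64])
```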
Latest Innovations in Transformer Architectures
Explore top LinkedIn content from expert professionals.
Summary
The latest innovations in transformer architectures are pushing the boundaries of what artificial intelligence (AI) systems can achieve in areas like language modeling, computer vision, and long-term memory. Transformers, a type of deep learning model, use a mechanism called "attention" to focus on the most relevant parts of the input data, enabling them to process and understand complex patterns. Recent advancements such as FlashSigmoid, Titans, and Transformer² introduce new techniques to improve speed, scalability, and adaptability, while new applications demonstrate the versatility of transformers in fields like graph reasoning and machine learning optimization.
- Explore memory breakthroughs: Discover how new AI architectures like Titans use human-like memory systems to prioritize and adapt to critical data during real-time tasks, improving efficiency and output quality.
- Focus on decoding speed: Learn how innovations such as Block Transformers reduce memory overhead and achieve up to 20x faster decoding speeds without sacrificing computational efficiency.
- Adopt specialized designs: Consider solutions like Transformer² and Flux for dynamic task-specific adjustments and creative control in visual and language-based applications.
1/ Google Research unveils a new paper: "Titans: Learning to Memorize at Test Time." It introduces human-like memory structures to overcome the limits of Transformers, with one "SURPRISING" feature. Here's why this is huge for AI. 🧵👇

2/ The Problem: Transformers, the backbone of most AI today, struggle with long-term memory due to quadratic memory complexity. Basically, there's a big penalty for long context windows! Titans aims to solve this with massive scalability.

3/ What Makes Titans Different? Inspired by human memory, Titans integrate:
• Short-term memory (real-time processing)
• Long-term memory (retaining key past information)
• Persistent memory (task-specific baked-in knowledge)
This modular approach mimics how the brain works.

4/ Game-Changer: Memory at Test Time. Titans can learn and adapt during inference (test time), unlike Transformers, which rely on pre-training. This means:
• Dynamic updating of memory during real-time use.
• Better generalization and contextual understanding.

5/ The "Surprise" Mechanism: Humans remember surprising events better. Titans use a "surprise" metric to prioritize what to memorize and forget (see the sketch after this thread).
• Adaptive forgetting ensures efficiency.
• Surprising inputs create stronger memory retention.
This leads to smarter, leaner models.

6/ Three Architectural Variants: Titans offer flexible implementations based on use cases:
• Memory as Context (MAC): best for tasks needing detailed historical context.
• Memory as Gate (MAG): balances short- and long-term memory.
• Memory as Layer (MAL): most efficient, slightly less powerful.
Trade-offs for every need!

7/ Performance: Titans outperform Transformers and other models in:
• Language modeling.
• Common-sense reasoning.
• Needle-in-a-haystack tasks (retrieving data in vast contexts).
• DNA modeling and time-series forecasting.
They maintain high accuracy even with millions of tokens.

8/ Why This Matters:
• Massive context: no more limits on how much info models can process.
• Real-time adaptation: models learn dynamically, like humans.
• Scalability: opens the door for AI in genomics, long-video understanding, and reasoning across massive datasets.

9/ Key Innovations:
• Surprise-based memory prioritization.
• Efficient, scalable architectures with adaptive forgetting.
• Parallelizable training algorithms for better hardware utilization.
Titans bridge the gap between AI and human-like reasoning.

10/ What's Next? With Titans, we could see breakthroughs in AI applications that demand massive context, from personalized healthcare to real-time video analytics. Read the paper here: https://lnkd.in/gBSPtkpf Check out my video breakdown here: https://lnkd.in/gbcdbN8S What do you think of Titans? Let's discuss. 💬
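The surprise-driven update in point 5 can be pictured with a toy sketch: treat the long-term memory as a small MLP whose weights are updated at test time, where the momentary surprise is the gradient of a reconstruction loss, a momentum buffer carries past surprise, and a decay term plays the role of adaptive forgetting. The MLP, hyperparameters, and exact update rule below are simplifying assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class NeuralLongTermMemorySketch(nn.Module):
    """Toy Titans-style long-term memory: an MLP updated at test time with a
    gradient-based 'surprise' signal (illustrative, not the paper's code)."""
    def __init__(self, dim, lr=0.01, momentum=0.9, forget=0.01):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.lr, self.momentum, self.forget = lr, momentum, forget
        # Running "past surprise": one momentum buffer per memory parameter.
        self.past_surprise = [torch.zeros_like(p) for p in self.mlp.parameters()]

    @torch.no_grad()
    def write(self, key, value):
        # Momentary surprise = gradient of the memory's reconstruction error.
        with torch.enable_grad():
            loss = ((self.mlp(key) - value) ** 2).mean()
            grads = torch.autograd.grad(loss, list(self.mlp.parameters()))
        for p, s, g in zip(self.mlp.parameters(), self.past_surprise, grads):
            s.mul_(self.momentum).add_(g)                 # blend past and momentary surprise
            p.mul_(1.0 - self.forget).sub_(self.lr * s)   # adaptive forgetting + memory update

    def read(self, query):
        return self.mlp(query)

# Usage: write a key/value association at test time, then read it back.
mem = NeuralLongTermMemorySketch(dim=32)
k, v = torch.randn(4, 32), torch.randn(4, 32)
mem.write(k, v)
print(mem.read(k).shape)   # torch.Size([4, 32])
```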
🧠 Titans and Transformers Unite! What if AI models could learn and memorize information in real time, just like humans do? New research from Google (link in comments) introduces "Titans", a new architecture that's challenging how we think about AI memory.

The key innovation? A neural long-term memory module that learns to identify and store surprising or important information during inference, similar to how human memory prioritizes unexpected events.

Three fascinating findings:
- Titans outperformed both Transformers and modern recurrent models across multiple tasks, while scaling to massive 2M+ token context windows, far beyond traditional limits.
- The architecture introduces a "surprise-based" memory system, measuring both immediate surprise and the flow of information over time. This helps it determine what's truly worth remembering.
- In needle-in-a-haystack tasks, Titans achieved 98.6% accuracy on 16K sequences, significantly outperforming GPT-4 and other large language models despite using far fewer parameters.

Titans introduces a two-tier memory system (a sketch of how the tiers can be combined follows below):
- Short-term: uses attention for precise, immediate understanding.
- Long-term: a neural memory module that learns what's worth remembering, just like our brains prioritize surprising or important events.

The real breakthrough? Titans can learn during deployment:
- Adapts its memory in real time.
- Uses "surprise metrics" to decide what to remember.
- Maintains fast training AND inference speeds.

The implications? We might be seeing the emergence of AI systems that can learn and adapt during deployment, rather than remaining static after training. What do you think: could this approach to AI memory revolutionize how we build adaptive systems? #MachineLearning #AI #DeepLearning #NeuralNetworks #Innovation
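One way the two tiers can be combined at read time is a "memory as context" pattern: tokens read from the long-term memory, plus persistent task tokens, are prepended to the current segment so ordinary attention can condition on them. The sketch below only illustrates that pattern; the three token streams, their shapes, and the use of nn.MultiheadAttention are placeholders, not the Titans implementation.

```python
import torch
import torch.nn as nn

dim, n_heads = 64, 4
attention = nn.MultiheadAttention(dim, n_heads, batch_first=True)

def memory_as_context(segment, memory_tokens, persistent_tokens):
    """Toy 'memory as context' read: prepend persistent and long-term memory
    tokens to the current segment, then attend over the concatenation."""
    ctx = torch.cat([persistent_tokens, memory_tokens, segment], dim=1)  # (batch, P+M+S, dim)
    out, _ = attention(segment, ctx, ctx)   # queries: current segment; keys/values: full context
    return out                              # (batch, S, dim)

# Usage with random placeholders for the three token streams.
segment = torch.randn(1, 16, dim)           # current input segment
memory_tokens = torch.randn(1, 8, dim)      # would come from the neural memory's read
persistent_tokens = torch.randn(1, 4, dim)  # learned, task-specific tokens
print(memory_as_context(segment, memory_tokens, persistent_tokens).shape)  # torch.Size([1, 16, 64])
```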
New research from Google shows that Transformers are not just for text: they redefine how we approach graph reasoning tasks. Graph reasoning tasks involve analyzing relationships and dependencies in graph-structured data, such as social networks, molecular structures, or transportation systems. These tasks are critical for solving problems like finding the shortest path between points, detecting network connectivity, or understanding complex dependencies across large systems.

In the study, Google researchers demonstrated that standard transformer models can outperform Graph Neural Networks (GNNs) on complex graph problems, like connectivity and shortest paths. The paper introduces a new representational hierarchy that classifies graph reasoning tasks by their complexity and maps them to the transformer architectures capable of solving them.

Key transformer capabilities covered in the paper:
(1) Parallelizable tasks (e.g., connectivity): achieved efficiently by logarithmic-depth transformers, showing superior performance over GNNs.
(2) Search tasks (e.g., shortest paths): require deeper and wider transformer architectures, but are within reach, unlike for GNNs.
(3) Retrieval tasks (e.g., edge or node counts): solved by single-layer transformers, proving their adaptability even with minimal resources.

This work provides not only a theoretical understanding of transformers' reasoning capabilities but also empirical evidence of their potential to replace specialized models for many graph-related tasks. The implications are vast, spanning AI research, network analysis, and computational optimization.

Blog post: https://lnkd.in/gK69ajtB

— Join thousands of world-class researchers and engineers from Google, Stanford, OpenAI, and Meta staying ahead on AI: http://aitidbits.ai
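To make the setting concrete, here is a hedged sketch of how a graph problem can be serialized into a token sequence that a standard transformer could consume. The token vocabulary and layout are illustrative assumptions, not the encoding used in the paper.

```python
def encode_connectivity_instance(edges, source, target):
    """Serialize a graph-connectivity query into a flat token list that could be
    fed to a standard transformer (hypothetical token scheme)."""
    tokens = ["[TASK=connectivity]", f"[SRC={source}]", f"[TGT={target}]"]
    for u, v in edges:
        tokens += [f"[NODE={u}]", f"[NODE={v}]", "[EDGE_END]"]
    return tokens

# Example: a 4-node path graph, asking "is node 0 connected to node 3?"
print(encode_connectivity_instance([(0, 1), (1, 2), (2, 3)], source=0, target=3))
```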
Did you know your LLM uses less than 1% of your GPU at inference? A new research paper from KAIST AI, LG AI Research, and Google DeepMind introduces the Block Transformer: a global-to-local architecture that speeds up decoding by up to 20x. 🚀

🐢 LLMs typically generate one token at a time, requiring memory access for all previous tokens at each step. This means your GPU spends ~99% of the time on memory access.

🤖 The Block Transformer tackles this by isolating global attention to the lower layers (block decoder), reducing the context length by 4x and the quadratic memory-access overhead by 16x. Fine-grained attention is applied in the upper layers (token decoder) within local blocks, preserving detail and nearly eliminating KV-cache memory overhead (a rough structural sketch follows below).

🧐 With our default block length of 4 tokens, Block Transformers achieve 10-20x throughput gains over vanilla Transformers and reach 44% MFU on H100 GPUs, compared to 1% for vanilla models.

🤩 Block Transformers also offer cheaper training for specific performance targets and can be uptrained from pre-trained LLMs at minimal cost, requiring only about 20% of the original pre-training steps.

🆒 Key takeaway: by optimizing the parameter allocation between block and token decoders and tweaking block lengths, we achieve a significant boost in performance and efficiency.
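Below is a hedged structural sketch of the global-to-local idea: a coarse block decoder attends over pooled block embeddings (so the global context is shortened by the block length), and a token decoder attends only within each block, conditioned on its block's context. The pooling scheme, layer types, and absence of causal masking are simplifying assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class BlockTransformerSketch(nn.Module):
    """Global-to-local sketch: block decoder over block embeddings, token
    decoder within blocks (illustrative only)."""
    def __init__(self, dim=256, block_len=4, n_heads=4):
        super().__init__()
        self.block_len = block_len
        self.embedder = nn.Linear(block_len * dim, dim)   # pool block_len token embeddings
        self.block_decoder = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.token_decoder = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)

    def forward(self, x):                                 # x: (batch, seq, dim), seq divisible by block_len
        b, t, d = x.shape
        n_blocks = t // self.block_len
        blocks = x.reshape(b, n_blocks, self.block_len * d)
        ctx = self.block_decoder(self.embedder(blocks))   # global attention over t / block_len positions
        # Broadcast each block's context to its tokens, then attend locally within the block.
        local = x.reshape(b * n_blocks, self.block_len, d) + ctx.reshape(b * n_blocks, 1, d)
        return self.token_decoder(local).reshape(b, t, d)

# Usage: a sequence of 64 tokens processed as 16 blocks of 4.
model = BlockTransformerSketch()
print(model(torch.randn(2, 64, 256)).shape)   # torch.Size([2, 64, 256])
```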
Sakana AI Introduces Transformer²: A Machine Learning System that Dynamically Adjusts Its Weights for Various Tasks

Researchers at Sakana AI and the Institute of Science Tokyo introduced Transformer², a novel self-adaptive machine learning framework for large language models. Transformer² employs a method called Singular Value Fine-tuning (SVF), which adapts LLMs in real time to new tasks without extensive retraining. By selectively modifying the singular components of the model's weight matrices, Transformer² enables dynamic task-specific adjustments. This reduces the computational burden associated with fine-tuning, offering a scalable and efficient path to self-adaptation.

At the heart of Transformer² is the SVF method, which fine-tunes the singular values of weight matrices. This approach drastically reduces the number of trainable parameters compared to traditional methods. Instead of altering the entire model, SVF uses reinforcement learning to create compact "expert" vectors specialized for specific tasks. At inference, Transformer² uses a two-pass mechanism: the first pass analyzes the incoming task and its requirements; the second dynamically combines the relevant expert vectors to produce suitable behavior. This modularity lets Transformer² address a wide array of tasks efficiently.

Read the full article: https://lnkd.in/gRMAW6p9
Paper: https://lnkd.in/g-QqBtnm
GitHub Page: https://lnkd.in/gExWBKR2
Sakana AI Qi Sun Edoardo Cetin Yujin Tang
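A hedged sketch of the SVF reparametrization: decompose a weight matrix once with an SVD, then adapt it per task by scaling its singular values with a small vector z (the compact "expert"). Only W' = U diag(sigma * z) V^T is shown; training z with reinforcement learning and the two-pass routing are outside this sketch, and the matrices below are random placeholders.

```python
import torch

def svf_adapt(weight, z):
    """Scale the singular values of a weight matrix by a per-task expert vector z:
    W' = U diag(sigma * z) V^T."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    return U @ torch.diag(S * z) @ Vh

# Usage: adapt one projection matrix with a (hypothetical) expert vector.
W = torch.randn(512, 512)                      # placeholder for a pretrained weight matrix
z = torch.ones(512) + 0.05 * torch.randn(512)  # in the paper, z would be learned with RL
W_task = svf_adapt(W, z)
print(W_task.shape)                            # torch.Size([512, 512])
```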
The BASE Transformer is a great example of thinking outside the box to advance the state of the art in ML. The standard way to assign tokens to experts in Mixture-of-Experts-based LLMs is greedy routing: simply assign each token to the most suitable expert, as determined by the gating network. This works, but it creates load imbalance, where some experts end up with much more work than others. This problem is usually addressed by introducing load-balancing losses; however, that also introduces the challenge of tuning additional knobs to get the desired performance. BASE ("balanced assignment of experts") challenges the assumption that we need greedy routing in the first place and instead uses an auction algorithm to assign tokens to experts, where the tokens are bidders bidding on experts and their bids are simply determined by the gating network (a rough stand-in implementation is sketched below). Most importantly, BASE guarantees perfect load balance without any auxiliary losses: in experiments, the authors (Lewis et al., 2021) were able to achieve 16% higher training throughput (tokens per second) compared to the Switch Transformer! Learn more about it in my blog: https://lnkd.in/giuU-yJH
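Here is a hedged stand-in for the balanced-assignment step. Instead of the auction algorithm used in the paper, it calls SciPy's Hungarian solver (linear_sum_assignment) on the same problem: each expert column is replicated up to its capacity, so every expert receives exactly n_tokens / n_experts tokens while the total gating score is maximized.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def balanced_assignment(scores, n_experts):
    """BASE-style balanced routing sketch. 'scores' is (n_tokens, n_experts) from
    the gating network; assumes n_tokens is divisible by n_experts."""
    n_tokens = scores.shape[0]
    capacity = n_tokens // n_experts
    cost = -np.repeat(scores, capacity, axis=1)    # maximize score == minimize negative score
    token_idx, slot_idx = linear_sum_assignment(cost)
    expert_idx = slot_idx // capacity              # map replicated expert slots back to experts
    return dict(zip(token_idx.tolist(), expert_idx.tolist()))

# Usage: 8 tokens, 4 experts -> each expert is assigned exactly 2 tokens.
rng = np.random.default_rng(0)
print(balanced_assignment(rng.random((8, 4)), n_experts=4))
```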
A detailed dive into Flux by Black Forest Labs:

🔹 Transformer-First Design: unlike traditional models, Flux replaces the UNet with a 12B-parameter multimodal transformer, enabling sharper outputs, better prompt understanding, and richer compositions.

⚡ Fast Yet High-Fidelity: using flow-matching training and advanced distillation, Flux achieves stunning results in fewer diffusion steps, especially in the Flux Schnell variant optimized for speed.

🛠️ Controllable Generation: with tools like Flux Fill, Depth, and Canny, users can guide generation using masks, structure maps, or even sketches, perfect for inpainting, extensions, and creative control.

🎨 Easy Style Personalization: fine-tune the model on just 5 sample images to personalize its visual output, a game-changer for creators and brands wanting unique, consistent aesthetics.

🌍 Open & Scalable: with open weights (Flux Dev), fast local versions (Schnell), and full-power APIs (Pro), Flux balances performance with broad accessibility.

#GenAI #Transformers #AI #Flux #GPT #Vision #ComputerVision #LLM #OpenAI
"Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality"

While Transformers have been the main architecture behind deep learning's success in language modeling, state-space models (SSMs) such as Mamba have recently been shown to match or outperform Transformers at small to medium scale. We show that these families of models are actually quite closely related, and we develop a rich framework of theoretical connections between SSMs and variants of attention, connected through various decompositions of a well-studied class of structured semiseparable matrices. Our state space duality (SSD) framework allows us to design a new architecture (Mamba-2) whose core layer is a refinement of Mamba's selective SSM that is 2-8x faster, while continuing to be competitive with Transformers on language modeling.
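In the scalar case the duality is easy to see: the selective SSM recurrence h_t = a_t h_{t-1} + b_t x_t, y_t = c_t h_t is exactly multiplication by a lower-triangular 1-semiseparable matrix, i.e. an attention-like masked matmul. The toy check below verifies that numerically; it is only a hedged illustration of the duality, not Mamba-2's blocked, hardware-efficient SSD algorithm.

```python
import numpy as np

def ssm_recurrent(a, b, c, x):
    """Scalar SSM recurrence: h_t = a_t * h_{t-1} + b_t * x_t, y_t = c_t * h_t."""
    h, ys = 0.0, []
    for a_t, b_t, c_t, x_t in zip(a, b, c, x):
        h = a_t * h + b_t * x_t
        ys.append(c_t * h)
    return np.array(ys)

def ssm_as_masked_matmul(a, b, c, x):
    """Same computation as one matrix multiply by a lower-triangular
    1-semiseparable matrix: M[i, j] = c_i * (a_{j+1} * ... * a_i) * b_j for j <= i."""
    T = len(x)
    log_cum = np.cumsum(np.log(a))                   # cumulative log-decays
    M = np.zeros((T, T))
    for i in range(T):
        for j in range(i + 1):
            decay = np.exp(log_cum[i] - log_cum[j])  # product a_{j+1} * ... * a_i
            M[i, j] = c[i] * decay * b[j]
    return M @ x

T = 6
rng = np.random.default_rng(0)
a = rng.uniform(0.5, 1.0, T)                         # positive decay factors
b, c, x = rng.normal(size=T), rng.normal(size=T), rng.normal(size=T)
print(np.allclose(ssm_recurrent(a, b, c, x), ssm_as_masked_matmul(a, b, c, x)))  # True
```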
Chapter 4 of the Big Book of Large Language Models is finally here! That was a difficult chapter to write! Originally, I wanted to cram into that chapter all the improvements related to the Transformer architecture since the "Attention Is All You Need" paper, but I realized that it would be too long for one chapter. I ended up focusing only on improvements related to the attention layer and delaying things like relative positional encoding and Mixture of Experts to the next chapter.

In this chapter, I address the following improvements:
- Sparse Attention Mechanisms
- Linear Attention Mechanisms
- Memory-Efficient Attention
- Faster Decoding Attention Mechanisms
- Long Sequence Attentions

Obviously, I could not include everything that was ever invented in the context of the attention layer, but I believe these topics capture well the different research routes that have been explored since then. I believe it is a very important chapter, as most materials available online tend to focus on vanilla self-attention, which is starting to become an outdated concept by today's standards. I also found that trying to understand how to improve self-attention is a very good way to understand what it is we are trying to improve in the first place! Self-attention may appear odd at first, but diving into the inner workings of the layer in order to improve it gives us a level of understanding beyond anything we can learn just by looking at the original self-attention. I hope you will enjoy it!

The book and chapter: https://book.theaiedge.io/