Celera.AI is pleased to announce the public release of its latest research: the SConvTransform code optimization algorithm. This compiler optimization, implemented with the MLIR compiler infrastructure, converts convolution operations from the Linalg dialect into an efficient loop nest that performs tiling and packing and invokes a highly optimized microkernel. SConvTransform is exposed as an operation in the Transform dialect, allowing both the code being transformed (Payload IR) and the transformation itself (Transform IR) to be represented in MLIR. SConvTransform uses the cache hierarchy efficiently by running a convolution analysis algorithm called Convolution Slicing Analysis (CSA), which determines how many tiles of each tensor fit in each cache level. The user provides parameters such as cache size, cache latency, and microkernel size, from which CSA outputs the ideal scheduling and partitioning of the tensors. With this information, Convolution Slicing Optimization (CSO) tiles the convolution operation into a loop nest, adds packing operations, and invokes a microkernel from the OpenBLAS library. The code for SConvTransform is publicly available on the Celera AI GitHub: https://lnkd.in/dVW8tmwp If you have questions or suggestions, reach out via email at contact@celera.ai, send us a DM, or contact Guido Araujo (Universidade Estadual de Campinas). #compilers #mlir #llvm #convolution #deeplearning #optimization
SConvTransform: A New Compiler Optimization for Convolution Operations
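The cache-fitting question CSA answers can be illustrated with simple capacity arithmetic. This is a hypothetical sketch, not Celera.AI's implementation: the `CacheLevel` struct, `tiles_per_level` function, and the assumption of 4-byte `float` elements are all invented for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of the question CSA answers: given a cache
// capacity and a tile footprint, how many whole tiles of a tensor
// can be resident at each cache level?
struct CacheLevel {
    std::size_t capacity_bytes;  // usable capacity of this cache level
};

// Bytes occupied by one (rows x cols) tile of 4-byte float elements.
std::size_t tile_bytes(std::size_t rows, std::size_t cols) {
    return rows * cols * sizeof(float);
}

// Number of whole tiles that fit in each cache level, innermost first.
std::vector<std::size_t> tiles_per_level(const std::vector<CacheLevel>& levels,
                                         std::size_t rows, std::size_t cols) {
    std::vector<std::size_t> fit;
    for (const auto& lv : levels)
        fit.push_back(lv.capacity_bytes / tile_bytes(rows, cols));
    return fit;
}
```

The real CSA additionally weighs cache latency and microkernel shape to choose a schedule; this sketch only shows the capacity side of that analysis.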
-
#Cpp #VirtualPointer #VirtualTable #Polymorphism #Dynamic_cast The virtual table (vtable) is a table of function pointers that the compiler creates to support dynamic polymorphism. If a class contains a virtual function, the compiler creates a vtable for that class. Each object of the class then carries a hidden pointer to this table, known as the vptr. Dynamic casting relies on this mechanism: when a 'dynamic_cast' is performed on a polymorphic type, the runtime uses the vptr to reach type information stored alongside the vtable. This lets the cast check at runtime whether the object being cast actually belongs to the target type. If the cast is valid, the pointer is safely converted; if not, 'dynamic_cast' returns 'nullptr' (for pointers) or throws 'std::bad_cast' (for references). In conclusion, tampering with the virtual tables breaks dynamic casting. The code below works as an example.
-
When we first started working on auto-formalization for DRAMs at Normal Computing, we searched for the most elegant formal language to describe all checkable behavior of a chip. That language is DRAMml by Matthias Jung and team. Now, together, we've built an extension in the form of #DRAMPyML! An even more powerful language built right inside of Python. It will be able to model any chip, not just DRAM, and we have built extensions to export SystemVerilog Assertions and other formally checkable artifacts. Watch this space, much more to come!
On Wednesday at DVCon Europe, Derek Christ will present #DRAMPyML, our next-generation #DRAM description language, at 2:15 PM in Session 6D 👨🏫. Joint work of Fraunhofer IESE, Julius-Maximilians-Universität Würzburg, and Normal Computing 🤓. With DRAMPyML, you can describe JEDEC standards 📚 in just a few lines of Python code 🐍 using our Python-based DSL and, from that, generate simulation and verification IP. Our simulator #DRAMSys uses this approach for correct-by-construction model generation. Dmitri Saberi, Thomas Dybdahl Ahle, Matthias Tan, Thomas Z., Philippe Barbie #DRAM #JEDEC #AI #EDA #DVCON #COOLSTUFF #COMPUTING #VERIFICATION #SIMULATION #SystemC #Python #SystemVerilog #MEMORY
-
What if ODE solvers weren’t just numerical black boxes? In this talk, Dr. Chris Rackauckas breaks convention with Julia by integrating symbolic-numeric techniques into stiff ODE/DAE solvers. Discover how tools like #ModelingToolkit and codegen transform solver performance—boosting speed and robustness in ways traditional methods can’t match. A must-watch for computational scientists and engineers. https://lnkd.in/eCBdXvMx #JuliaLang #ModelingToolkit #ScientificComputing #NumericalMethods #SymbolicComputing #ODESolvers #DAESolvers #HighPerformanceComputing #TechnicalComputing #DifferentialEquations
Fast Stiff ODE/DAE Solvers via Symbolic-Numeric Compiler Tricks | Rackauckas
https://www.youtube.com/
-
💡 Understanding Minimum Spanning Tree (MST) in Graphs An MST connects all the vertices in a graph with the minimum total edge weight — ensuring no cycles and minimum cost. In the example below, even though there are multiple possible MSTs, each gives the same total cost = 6 ⚡ 📘 Concepts Used: Graph Theory Kruskal’s Algorithm / Prim’s Algorithm Edge Weight Optimization #DataStructures #Algorithms #GraphTheory #MST #Coding #LearningDSA #ComputerScience #Kruskal #PrimAlgorithm
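Kruskal's algorithm can be sketched compactly: sort the edges by weight and accept each edge that joins two different components, tracked with a union-find. The graph below is an illustrative 4-vertex example chosen so the MST cost comes out to 6, matching the post (the post's original figure is not reproduced here).

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

struct Edge { int u, v, w; };

// Union-find "find" with path compression.
int find(std::vector<int>& p, int x) {
    return p[x] == x ? x : p[x] = find(p, p[x]);
}

// Kruskal: sort edges by weight; accept each edge whose endpoints are
// in different components. The accepted edges form the MST.
int mst_cost(int n, std::vector<Edge> edges) {
    std::sort(edges.begin(), edges.end(),
              [](const Edge& a, const Edge& b) { return a.w < b.w; });
    std::vector<int> parent(n);
    std::iota(parent.begin(), parent.end(), 0);
    int cost = 0;
    for (const auto& e : edges) {
        int ru = find(parent, e.u), rv = find(parent, e.v);
        if (ru != rv) { parent[ru] = rv; cost += e.w; }
    }
    return cost;
}
```

For a 4-vertex graph with edges (0,1,1), (1,2,2), (2,3,3), (0,3,4), (0,2,5), the three lightest edges are accepted and the heavier two would close cycles, so `mst_cost` returns 1 + 2 + 3 = 6.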
-
🎉 Very happy to announce that our paper, "VLIM: Verified Loop Interchange for Optimised Matrix Multiplication", has been accepted as a regular paper for the #DATE2026 conference! Huge congratulations to my co-author, Oliver T., my first undergraduate project student at Durham University. This acceptance to a conference like DATE is a testament to his hard work and dedication. So proud to see his work heading to Verona, Italy in April 2026! 🇮🇹 Title: VLIM: Verified Loop Interchange for Optimised Matrix Multiplication Abstract: Loop optimisations are essential for achieving high performance in modern computing, particularly for memory-intensive operations. However, while unverified optimisers achieve impressive speedups, their manual application is error-prone and challenging to verify, making them risky in high-assurance computing platforms. This paper introduces VLIM, a novel rewrite algebra to overcome these difficulties, enabling the development and automatic verification of loop transformations within the Capla programming language, a formally defined front-end for the CompCert verified compiler. Our framework allows compiler developers to define rewrite rules, with correctness proofs automatically derived through rewrite composition, ensuring semantic preservation during optimisation. We demonstrate the effectiveness of our approach, VLIM, by implementing a loop interchange optimisation and evaluating its impact on matrix multiplication performance. Empirical analyses show significant performance improvements: for a 1000 × 1000 matrix, loop interchange using VLIM reduced runtime by 36.6% and 74.6% when compiled with CompCert and Clang, respectively. This work advances the state-of-the-art in verified compilation, offering a promising direction for developing high-performance, formally verified software. #DATE2026 #ConferenceAcceptance #ComputerScience #SciComp #DurhamUniversity
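Loop interchange itself can be sketched independently of VLIM (this is an illustration, not the paper's code): swapping the j and k loops of a naive matrix multiply turns the innermost accesses to B and C into stride-1 row walks, which is where the cache-locality gain measured in the abstract comes from. The verified rewrite is exactly this reordering plus a machine-checked proof that it preserves semantics.

```cpp
#include <vector>

using Mat = std::vector<std::vector<double>>;

// Naive i-j-k order: the innermost loop walks B by column (stride n),
// a poor access pattern for row-major storage.
Mat matmul_ijk(const Mat& A, const Mat& B) {
    std::size_t n = A.size();
    Mat C(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < n; ++k)
                C[i][j] += A[i][k] * B[k][j];
    return C;
}

// Interchanged i-k-j order: the innermost loop walks B and C by row
// (stride 1). Both orders compute the same product, because the
// accumulations into each C[i][j] commute.
Mat matmul_ikj(const Mat& A, const Mat& B) {
    std::size_t n = A.size();
    Mat C(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                C[i][j] += A[i][k] * B[k][j];
    return C;
}
```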
-
https://lnkd.in/grw8jcye This is one of the best step-by-step tutorials I've seen for VLMs. Most other videos and courses either stay too high-level or never explain why a given line of code is there in these architectures. The author, Umar Jamil, covers every topic needed to implement PaliGemma from scratch. The six-hour video covers everything from SigLIP, the projection layer, text encoding, rotary positional embeddings, and normalisation to the decoder, the KV cache, and grouped-query attention. It's a well-rounded practical lecture with good theoretical backing at each step.
Coding a Multimodal (Vision) Language Model from scratch in PyTorch with full explanation
https://www.youtube.com/
-
🧩 CGO v0.2.0 — Substrate Freeze: Deterministic Core Stable
Every intelligent system starts with something simple — stability. Today's pre-release locks the Causal Graph Orchestrator (CGO) substrate inside BraineousAI as a stable, deterministic foundation. No rules yet, no LLMs — just a graph that never lies. This release finalizes the substrate and validator interfaces, along with a declarative Rulepack schema that defines how future reasoning layers will plug in. The goal wasn't to move fast — it was to freeze motion itself. Determinism is fundamental. Docs & Design Notes: Architecture Overview — https://lnkd.in/ekBg6YCu Validator Design — https://lnkd.in/eYrFqGDS Rulepack Template — https://lnkd.in/e5qj_TDY Versioned Release Notes — https://lnkd.in/eAVFmUFk Next milestone: → Enable validator and rulepack runtime (v0.3.0-alpha.2). → Integrate validation flow into the deterministic graph substrate. These Tuesday builds are my version of a long-form book written in code — one reproducible layer at a time. The story isn't speed. It's structure. #BraineousAI #GraphReasoning #CausalAI #OpenSource #Java #AIEngineering #DeterministicSystems #SystemDesign
-
Rust frees memory deterministically using ownership + RAII (drop on scope exit). The compiler proves who owns what, inserts “drop glue” to run destructors at the right time, and forbids use-after-free. No background GC thread, no tracing pauses. When you opt in, you can also use reference counting (Rc/Arc)—which is GC-like in spirit but explicit, local, and non-tracing.
-
🚀 Major Performance Updates to Enhanced GBM Application (v1.3) 🎯 Improvements: ✅ 10–100x speedup — Vectorized GBM simulation (replaced O(n²) nested loops with NumPy operations) ✅ 5–20x faster portfolio analysis — Optimized triple nested loops using broadcasting and matrix operations ✅ GPU acceleration — Fixed regime-switching to use true GPU parallelization ✅ Memory management — Added GPU memory cleanup to prevent OOM errors ✅ Numerical stability — Enhanced division-by-zero checks and edge case handling ✅ Accuracy — Fixed maximum drawdown calculations using actual price paths ✅ Code quality — Made random seeds optional parameters, fixed memory leaks, improved error handling 💡 Impact: Large-scale Monte Carlo simulations run 10–100x faster Portfolio analysis handles thousands of simulations in seconds GPU memory usage is optimized, preventing crashes More accurate risk metrics via improved drawdown calculations To use it or fork it: https://lnkd.in/eW38NAXj Stay tuned for future articles I will share on the practical use and analysis of all methods on financial markets: https://lnkd.in/gQvhXVvU #QuantitativeFinance #FinTech #Python #GPUComputing #MonteCarlo #RiskAnalysis #PerformanceOptimization #DataScience