The road ahead for scientific
computing with Python
Ralf Gommers
7 Dec 2021
The open source PyData ecosystem
The state of PyData today
#1 language for scientific computing, data science, ML & AI
Estimated user base: 25-40 million users
Incredibly broad ecosystem & high-profile scientific successes
Array-based computing in Python
Today’s Python data ecosystem
Many other array libraries - fragmentation!
Why?
Technical: performance, GPU & distributed computing, and autograd.
Social/commercial: large tech companies want control, and to move fast.
Key technical challenges for the
PyData ecosystem
My top 5 technical challenges
1. Fragmentation in array/tensor libraries & support for heterogeneous
computing
2. (lack of) parallel execution capabilities in the NumPy-based PyData stack
3. Packaging: PyPI (the Python Package Index) imposes serious constraints
4. Performance for algorithms that can’t be expressed in vectorized form.
5. Technical debt: old Fortran 77 & C code hard to maintain; long double
support probably needs removing; distutils is end-of-life
(5) technical debt
● SciPy: old Fortran 77 & C code is hard to maintain.
● long double: basically obsolete (MSVC on Windows, macOS on arm64 and
Linux/aarch64 all alias it to double) and takes a ton of time to maintain.
It needs to be replaced by a proper quad-precision float128.
● distutils will be removed in Python 3.12 (see PEP 632); packages with a
lot of compiled code should migrate to Meson or CMake.
These are just 3 out of many examples.
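To make the long double point concrete, here is a minimal sketch (plain NumPy, not code from SciPy itself) that checks whether long double is merely an alias for double on the current platform:

    import numpy as np

    # On MSVC/Windows, macOS arm64 and Linux/aarch64, `long double` is an
    # alias for `double`, so both dtypes report the same machine epsilon.
    eps_double = np.finfo(np.float64).eps
    eps_longdouble = np.finfo(np.longdouble).eps

    print("float64 eps:   ", eps_double)
    print("longdouble eps:", eps_longdouble)
    print("longdouble aliases double:", eps_longdouble == eps_double)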
(4) performance of numerical code
Approaches to speeding up numerical Python code:
● Vectorization
● Use compiled code
● Python compilers (e.g., Pythran)
● Python interpreters: CPython; plus Cinder, Pyston, and more - very
experimental, and limited gains for numerical code
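To make the "Python compilers" option concrete, here is a hedged sketch of using Pythran on a loop-heavy kernel that is awkward to vectorize (the `smooth` function and its signature are illustrative, not from the talk). The `#pythran export` comment marks the function for compilation; running `pythran` on the file produces a native extension, while the file remains valid plain Python:

    #pythran export smooth(float64[])
    import numpy as np

    def smooth(x):
        # Three-point moving average as an explicit loop; stencils like
        # this are hard to express efficiently in vectorized NumPy.
        out = np.empty_like(x)
        out[0] = x[0]
        out[-1] = x[-1]
        for i in range(1, len(x) - 1):
            out[i] = (x[i - 1] + x[i] + x[i + 1]) / 3.0
        return out

    if __name__ == "__main__":
        print(smooth(np.linspace(0.0, 1.0, 10)))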
(3) packaging constraints
● PyPI does not offer fundamentals for scientific computing like
BLAS/LAPACK or OpenMP. It was not designed for non-Python
dependencies.
● PyPI + pip != a package manager. As a result, PyPI’s author-led model
cannot ensure that a set of packages was built with a consistent set of
rules (e.g., a single compiler toolchain).
● PyPI serves 3 distinct purposes:
○ Flow of source code from authors to redistributors (OK)
○ End-user binary installs via wheels (OK-ish)
○ End-user installs from source (very problematic)
(2) lack of parallelism in PyData stack
1. NumPy: single-threaded, except for calls to BLAS/LAPACK
2. SciPy:
a. single-threaded by default, except for calls to BLAS/LAPACK
b. `workers=` API to let the user enable multiple threads (see the
sketch below)
3. Scikit-learn:
a. Most functionality single-threaded, with an `n_jobs=` API to let the
user enable multiple threads
b. Increasingly uses OpenMP for automatic parallelization
c. Complex control (see the threadpoolctl package) of NumPy/SciPy’s
BLAS and LAPACK libraries to prevent oversubscription when
multiprocessing runs on top of multi-threading.
With 32/64-core CPUs becoming more common, this isn’t a tenable situation.
A common threading layer is needed.
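A short sketch of what these opt-in APIs look like in practice (array sizes are arbitrary; `scipy.fft`’s `workers=` argument and threadpoolctl’s `threadpool_limits` are the real APIs):

    import numpy as np
    from scipy import fft
    from threadpoolctl import threadpool_limits

    rng = np.random.default_rng(0)
    x = rng.standard_normal(2**20)

    # SciPy's `workers=` API: explicitly opt in to multi-threaded FFTs.
    y = fft.fft(x, workers=4)

    # threadpoolctl: cap the threads used by NumPy/SciPy's BLAS, e.g. to
    # avoid oversubscription when multiprocessing runs on top of it.
    a = rng.standard_normal((1024, 1024))
    with threadpool_limits(limits=1, user_api="blas"):
        b = a @ a.T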
(1) fragmentation & heterogeneous
computing
We need common APIs to address CPUs, GPUs, TPUs, FPGAs & other emerging
hardware. Separate libraries for each type of hardware are not composable.
To address this, we created a
standardization effort - see
https://data-apis.org
Consortium for Python Data API Standards
Goals for and scope of the array API
Goal 1: enable writing code & packages that support multiple array libraries
Goal 2: make it easy for end users to switch between array libraries
In Scope:
● Syntax and semantics of functions and objects in the API
● Casting rules, broadcasting, indexing, Python operator support
● Data interchange & device support
Out of Scope:
● Execution semantics (e.g. task scheduling, parallelism, lazy eval)
● Non-standard dtypes, masked arrays, I/O, subclassing array object, C API
● Error handling & behaviour for invalid inputs to functions and methods
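What Goal 1 looks like in code: a minimal sketch of a function written purely against the standard (`standardize` is illustrative, not part of the API). `__array_namespace__()` is the standard’s protocol for retrieving the namespace of whichever conforming array you were handed:

    def standardize(x):
        # Works for any array object conforming to the array API standard
        # (NumPy, CuPy, PyTorch, ...), with no library-specific imports.
        xp = x.__array_namespace__()
        return (x - xp.mean(x)) / xp.std(x)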
Use case: the einops package
● A popular package for array manipulation
● Supports 8 popular array/tensor libraries.
● Almost 50% of the code can be removed through array API standardization!
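For context, einops exposes one call that works across backends; a small usage sketch with NumPy (shapes are arbitrary):

    import numpy as np
    from einops import rearrange

    x = np.zeros((8, 3, 32, 32))              # batch, channels, height, width
    y = rearrange(x, 'b c h w -> b (h w) c')  # same call works for torch, cupy, ...
    print(y.shape)                            # (8, 1024, 3)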
Array API - participation & adoption
API adoption done or close to done:
● NumPy: in the numpy.array_api namespace
● CuPy: in the cupy.array_api namespace
● PyTorch: in the torch (main) namespace
Other libraries: design participation, with adoption in progress or being
discussed.
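Trying out NumPy’s strict, standard-compliant namespace (shipped in NumPy 1.22; importing it warns that the module is experimental):

    import numpy.array_api as xp

    a = xp.asarray([1.0, 2.0, 3.0])
    print(xp.mean(a))
    print(a.__array_namespace__() is xp)  # True: the standard's namespace protocol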
Distributed & GPU arrays with SciPy,
scikit-learn and scikit-image
Key social challenges for the
PyData ecosystem
My top 3 social challenges
1. Sustainability of key projects
2. Big tech has discovered PyData
3. Academia still needs to find its role
Sustainability
For NumPy, SciPy, scikit-learn: each ~10-20 active maintainers, O(10-20 million) users
Funding is still very hard to obtain. Number of funded devs:
● NumPy: 1 full-time, 3 part-time
● SciPy: 2 part-time
● Scikit-learn: 2 full-time, 3 part-time
● Most other projects: volunteer-only
Burnout of maintainers is a real risk/problem.
Funding is mostly coming from independent funders: Sloan Foundation, Moore
Foundation, Chan Zuckerberg Initiative.
Diversity & inclusivity remains challenging: >90% of maintainers are white men
Big tech has discovered PyData
Big tech focuses on a narrower set of capabilities for deep learning than is
needed for scientific computing (example: complex dtype support only arrived
in 2021).
Corporate-backed vs. community-driven culture:
● Tension in funded vs. volunteer efforts
○ “Move fast and break things” vs.
○ “Backwards compatibility for 10+ year old scientific models”
● Bandwidth problem:
TensorFlow/PyTorch have ~200-300 full-time engineers, RAPIDS >50.
Role of academia
Academia relies heavily on scientific open source, however:
● Open source contributions are valued significantly less than papers today,
even though they are often much more impactful (e.g., 13% of all papers on
arXiv used Matplotlib).
● Institutional funders are behind the times. The investment in, e.g., exascale
facilities is >100x (or >1000x?) that in open source.
● Publishers & reviewers need to require source code for papers (slowly happening).
Recommendations:
1. Work on fixing career paths and funding
2. Focus on
a. missing science-specific key pieces (e.g., sparse linear algebra)
b. production-quality code
c. skills building for grad students
A look at NumPy’s technical and
social roadmap
Where is NumPy going - community
NumPy is a community-driven project.
Where is NumPy going - technical
● Interoperability: array API standard support
● Extensibility: easier custom dtypes
● Performance: SIMD acceleration on x86, arm64, PPC, …?
● C++ (?): just dipping our toes in the water here - so far it was just
Python and C
● Platform support: PPC, AIX, s390x, cross-compiling to embedded ARM
systems, ...
● Type annotations: main namespace annotations just completed
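On the type annotations point, a minimal sketch of what the completed main-namespace annotations enable via numpy.typing (`normalize` is illustrative):

    import numpy as np
    import numpy.typing as npt

    def normalize(x: npt.NDArray[np.float64]) -> npt.NDArray[np.float64]:
        # Annotations like this are checkable with static type checkers
        # such as mypy.
        return x / np.linalg.norm(x)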
Where is NumPy going - community
NumPy is a community-driven project. Most people are volunteers.
Goals for the coming 1-2 years include:
● Grow more (autonomous) teams: web, docs, triage, f2py, SIMD, API, …
● Further increase the diversity of the team & inclusivity of the project
● Better communication channels: move to Discourse, with separate user
forum
● Increasing active alignment across PyData projects (from processes like
when to drop Python versions, to hairy technical topics like parallelism)
Getting involved
How can you get involved?
Yes, you can! People are friendly, and you have
talent and knowledge that have value!
Domain knowledge (linear algebra, special
functions, statistics, etc.) is just as important as
coding skills.
How can you get involved?
How do you pick a project to start contributing to?
● First, pick something that interests you!
● Look at the activity on GitHub - do pull requests get reviewed
and merged in a reasonable amount of time? Is the feedback
constructive and given in a friendly manner?
● Small vs. large projects
Find me at: ralf.gommers@gmail.com, rgommers, ralfgommers
Thank you!
