The road ahead for scientific
computing with Python
Ralf Gommers
7 Dec 2021
The open source PyData ecosystem
The state of PyData today
#1 language for scientific computing, data science, ML & AI
Estimated user base: 25-40 million users
Incredibly broad ecosystem & high-profile scientific successes
Array-based computing in Python
Today’s Python data ecosystem
Many other array libraries - fragmentation!
Why?
Technical: performance, GPU & distributed computing, and autograd.
Social/commercial: large tech companies want control, and to move fast.
Key technical challenges for the
PyData ecosystem
My top 5 technical challenges
1. Fragmentation in array/tensor libraries & support for heterogeneous
computing
2. (lack of) parallel execution capabilities in the NumPy-based PyData stack
3. Packaging: PyPI (the Python Package Index) imposes serious constraints
4. Performance for algorithms that can’t be expressed in vectorized form.
5. Technical debt: old Fortran 77 & C code hard to maintain; long double
support probably needs removing; distutils is end-of-life
(5) technical debt
● SciPy: old Fortran 77 & C code is hard to maintain.
● long double: basically obsolete (MSVC on Windows, macOS on arm64 and
Linux/aarch64 all alias it to double) and takes a ton of time to maintain.
It needs to be replaced by a proper quad-precision float128.
● distutils will be removed in Python 3.12 (see PEP 632); packages with a
lot of compiled code should migrate to Meson or CMake.
These are just 3 out of many examples.
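To make the long double point concrete, here is a minimal sketch (plain NumPy, not code from SciPy itself) that checks whether long double is merely an alias for double on the current platform:

    import numpy as np

    # On MSVC/Windows, macOS arm64 and Linux/aarch64, `long double` is an
    # alias for `double`, so both dtypes report the same machine epsilon.
    eps_double = np.finfo(np.float64).eps
    eps_longdouble = np.finfo(np.longdouble).eps

    print("float64 eps:   ", eps_double)
    print("longdouble eps:", eps_longdouble)
    print("longdouble aliases double:", eps_longdouble == eps_double)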
(4) performance of numerical code
Approaches to speeding up numerical Python code:
● Vectorization
● Use compiled code
● Python compilers (e.g., Pythran)
● Python interpreters: CPython; plus Cinder, Pyston, and more - very
experimental, and limited gains for numerical code
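To make the "Python compilers" option concrete, here is a hedged sketch of using Pythran on a loop-heavy kernel that is awkward to vectorize (the `smooth` function and its signature are illustrative, not from the talk). The `#pythran export` comment marks the function for compilation; running `pythran` on the file produces a native extension, while the file remains valid plain Python:

    #pythran export smooth(float64[])
    import numpy as np

    def smooth(x):
        # Three-point moving average as an explicit loop; stencils like
        # this are hard to express efficiently in vectorized NumPy.
        out = np.empty_like(x)
        out[0] = x[0]
        out[-1] = x[-1]
        for i in range(1, len(x) - 1):
            out[i] = (x[i - 1] + x[i] + x[i + 1]) / 3.0
        return out

    if __name__ == "__main__":
        print(smooth(np.linspace(0.0, 1.0, 10)))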
(3) packaging constraints
● PyPI does not offer fundamentals for scientific computing like
BLAS/LAPACK or OpenMP. It was not designed for non-Python
dependencies.
● PyPI + pip != a package manager. As a result, PyPI’s author-led model
cannot ensure that a set of packages was built with a consistent set of
rules (e.g., a single compiler toolchain).
● PyPI serves 3 distinct purposes:
○ Flow of source code from authors to redistributors (OK)
○ End-user binary installs via wheels (OK-ish)
○ End-user installs from source (very problematic)
(2) lack of parallelism in PyData stack
1. NumPy: single-threaded, except for calls to BLAS/LAPACK
2. SciPy:
a. single-threaded by default, except for calls to BLAS/LAPACK
b. `workers=` API to let the user enable multiple threads (see the
sketch below)
3. Scikit-learn:
a. Most functionality single-threaded, with an `n_jobs=` API to let the
user enable multiple threads
b. Increasingly uses OpenMP for automatic parallelization
c. Complex control (see the threadpoolctl package) of NumPy/SciPy’s
BLAS and LAPACK libraries to prevent oversubscription when
multiprocessing runs on top of multi-threading.
With 32/64-core CPUs becoming more common, this isn’t a tenable situation.
A common threading layer is needed.
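A short sketch of what these opt-in APIs look like in practice (array sizes are arbitrary; `scipy.fft`’s `workers=` argument and threadpoolctl’s `threadpool_limits` are the real APIs):

    import numpy as np
    from scipy import fft
    from threadpoolctl import threadpool_limits

    rng = np.random.default_rng(0)
    x = rng.standard_normal(2**20)

    # SciPy's `workers=` API: explicitly opt in to multi-threaded FFTs.
    y = fft.fft(x, workers=4)

    # threadpoolctl: cap the threads used by NumPy/SciPy's BLAS, e.g. to
    # avoid oversubscription when multiprocessing runs on top of it.
    a = rng.standard_normal((1024, 1024))
    with threadpool_limits(limits=1, user_api="blas"):
        b = a @ a.T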
(1) fragmentation & heterogeneous
computing
We need common APIs to address CPUs, GPUs, TPUs, FPGAs & other emerging
hardware. Separate libraries for each type of hardware are not composable.
To address this, we created a
standardization effort - see
https://data-apis.org
Consortium for Python Data API Standards
Goals for and scope of the array API
Goal 1: enable writing code & packages that support multiple array libraries
Goal 2: make it easy for end users to switch between array libraries
In Scope:
● Syntax and semantics of functions and objects in the API
● Casting rules, broadcasting, indexing, Python operator support
● Data interchange & device support
Out of Scope:
● Execution semantics (e.g. task scheduling, parallelism, lazy eval)
● Non-standard dtypes, masked arrays, I/O, subclassing array object, C API
● Error handling & behaviour for invalid inputs to functions and methods
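What Goal 1 looks like in code: a minimal sketch of a function written purely against the standard (`standardize` is illustrative, not part of the API). `__array_namespace__()` is the standard’s protocol for retrieving the namespace of whichever conforming array you were handed:

    def standardize(x):
        # Works for any array object conforming to the array API standard
        # (NumPy, CuPy, PyTorch, ...), with no library-specific imports.
        xp = x.__array_namespace__()
        return (x - xp.mean(x)) / xp.std(x)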
Use case: the einops package
● A popular package for array manipulation
● Supports 8 popular array/tensor libraries.
● Almost 50% of the code can be removed through array API standardization!
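For context, einops exposes one call that works across backends; a small usage sketch with NumPy (shapes are arbitrary):

    import numpy as np
    from einops import rearrange

    x = np.zeros((8, 3, 32, 32))              # batch, channels, height, width
    y = rearrange(x, 'b c h w -> b (h w) c')  # same call works for torch, cupy, ...
    print(y.shape)                            # (8, 1024, 3)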
Array API - participation & adoption
API adoption done or close to done:
● NumPy: in the numpy.array_api namespace
● CuPy: in the cupy.array_api namespace
● PyTorch: in the torch (main) namespace
Other libraries: design participation, with adoption in progress or being
discussed.
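Trying out NumPy’s strict, standard-compliant namespace (shipped in NumPy 1.22; importing it warns that the module is experimental):

    import numpy.array_api as xp

    a = xp.asarray([1.0, 2.0, 3.0])
    print(xp.mean(a))
    print(a.__array_namespace__() is xp)  # True: the standard's namespace protocol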
Distributed & GPU arrays with SciPy,
scikit-learn and scikit-image
Key social challenges for the
PyData ecosystem
My top 3 social challenges
1. Sustainability of key projects
2. Big tech has discovered PyData
3. Academia still needs to find its role
Sustainability
For NumPy, SciPy, scikit-learn: each ~10-20 active maintainers, O(10-20 million) users
Funding is still very hard to obtain. Number of funded devs:
● NumPy: 1 full-time, 3 part-time
● SciPy: 2 part-time
● Scikit-learn: 2 full-time, 3 part-time
● Most other projects: volunteer-only
Burnout of maintainers is a real risk/problem.
Funding is mostly coming from independent funders: Sloan Foundation, Moore
Foundation, Chan Zuckerberg Initiative.
Diversity & inclusivity remains challenging: >90% of maintainers are white men
Big tech has discovered PyData
Big tech focuses on a narrower set of capabilities for deep learning than is
needed for scientific computing (example: complex dtype support only arrived
in 2021).
Corporate-backed vs. community-driven culture:
● Tension in funded vs. volunteer efforts
○ “Move fast and break things” vs.
○ “Backwards compatibility for 10+ year old scientific models”
● Bandwidth problem:
TensorFlow/PyTorch have ~200-300 full-time engineers, RAPIDS >50.
Role of academia
Academia relies heavily on scientific open source, however:
● Open source contributions are valued significantly less than papers today,
even though they are often much more impactful (e.g., 13% of all papers on
arXiv used Matplotlib).
● Institutional funders are behind the times. The investment in, e.g., exascale
facilities is >100x (or >1000x?) that in open source.
● Publishers & reviewers need to require source code for papers (slowly happening).
Recommendations:
1. Work on fixing career paths and funding
2. Focus on
a. missing science-specific key pieces (e.g., sparse linear algebra)
b. production-quality code
c. skills building for grad students
A look at NumPy’s technical and
social roadmap
Where is NumPy going - community
NumPy is a community-driven project.
Where is NumPy going - technical
● Interoperability: array API standard support
● Extensibility: easier custom dtypes
● Performance: SIMD acceleration on x86, arm64, PPC, …?
● C++ (?): just dipping our toes in the water here - so far it was just
Python and C
● Platform support: PPC, AIX, s390x, cross-compiling to embedded ARM
systems, ...
● Type annotations: main namespace annotations just completed
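On the type annotations point, a minimal sketch of what the completed main-namespace annotations enable via numpy.typing (`normalize` is illustrative):

    import numpy as np
    import numpy.typing as npt

    def normalize(x: npt.NDArray[np.float64]) -> npt.NDArray[np.float64]:
        # Annotations like this are checkable with static type checkers
        # such as mypy.
        return x / np.linalg.norm(x)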
Where is NumPy going - community
NumPy is a community-driven project. Most people are volunteers.
Goals for the coming 1-2 years include:
● Grow more (autonomous) teams: web, docs, triage, f2py, SIMD, API, …
● Further increase the diversity of the team & inclusivity of the project
● Better communication channels: move to Discourse, with separate user
forum
● Increasing active alignment across PyData projects (from processes like
when to drop Python versions, to hairy technical topics like parallelism)
Getting involved
How can you get involved?
Yes, you can! People are friendly, and you have
talent and knowledge that have value!
Domain knowledge (linear algebra, special
functions, statistics, etc.) is just as important as
coding skills.
How can you get involved?
How do you pick a project to start contributing to?
● First, pick something that interests you!
● Look at the activity on GitHub - do pull requests get reviewed
and merged in a reasonable amount of time? Is the feedback
constructive and given in a friendly manner?
● Small vs. large projects
Find me at: ralf.gommers@gmail.com, rgommers, ralfgommers
Thank you!
