Python Linters at Scale
Jimmy Lai, Staff Software Engineer at Carta
April 22, 2023
2
What to Expect?
● ❗Problems
● 🛠 Tools
● ☑ Checklists
3
[Diagram: Tax, Fund Admin., Compensation, and Valuation, linking Startup Founder, Employee (Stock Option), and Investor (Stock, Money)]
4
Python Codebases
Monolith: a large codebase
3 million lines of code
Micro Services
Service 1 Library 1
Service 2 Library 2
… …
Many Developers
Popular Python Linters
6
Black: code formatting
https://github.com/psf/black
🛠
pyproject.toml
[tool.black]
line-length = 120 # defaults to 88
target-version = ["py39"]
exclude = "some_path"
# include = defaults to ".pyi?$"
7
isort: import sorting
https://github.com/PyCQA/isort
🛠
pyproject.toml
[tool.isort]
profile = 'black'
line_length = 120
8
Flake8: code style, syntax errors and bugs
https://github.com/PyCQA/flake8
🛠
In setup.cfg, tox.ini or .flake8
[flake8]
max-line-length = 120
# select = E,W,F (E,W: pycodestyle; F: pyflakes)
ignore =
    # conflict with Black on .py files
    E203,E501,
    # conflict with Black on .pyi files
    E301,E302
9
mypy: type checking
% mypy example.py
example.py:4: error: Argument 1 to "greeting" has incompatible type "int"; expected "str" [arg-type]
example.py:5: error: Argument 1 to "greeting" has incompatible type "bytes"; expected "str" [arg-type]
Found 2 errors in 1 file (checked 1 source file)
https://github.com/python/mypy 🛠
pyproject.toml
[tool.mypy]
# strict type annotation
# explicit over implicit
warn_return_any = true
warn_unused_configs = true
warn_unused_ignores = true
warn_redundant_casts = true
disallow_incomplete_defs = true
disallow_untyped_defs = true
no_implicit_optional = true
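For context, here is a minimal sketch of a file that mypy would flag with `arg-type` errors of the kind shown above. The name `greeting` comes from the error messages; the function body and the `TYPE_CHECKING` guard are assumptions added for illustration, so the reported line numbers will differ from the slide.

```python
from typing import TYPE_CHECKING


def greeting(name: str) -> str:
    """Return a greeting for ``name``."""
    return "Hello " + name


# mypy treats TYPE_CHECKING as True, so it checks these calls and reports
# an [arg-type] error for each. At runtime the constant is False, so the
# calls never execute and the module still runs cleanly.
if TYPE_CHECKING:
    greeting(3)         # "int" where "str" is expected
    greeting(b"Alice")  # "bytes" where "str" is expected

print(greeting("Alice"))
```

The guard is only there to keep the sketch runnable; in real code the bad calls would simply appear inline and fail at runtime.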
Common Linter Practices
11
Version Control
Version control in a codebase:
● Linter versions
● Linter configs
Python package management:
● pip with a requirements.txt
● poetry with a pyproject.toml and a lock file
Install specific versions of linters and use the linter config file in the codebase
pyproject.toml
[tool.isort]
profile = 'black'
line_length = 120
requirements-dev.txt
isort==5.10.0
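One way to check that a local environment matches the pinned linter versions is sketched below. It is an illustration, not the tooling described in the talk: `parse_pins` and `find_mismatches` are hypothetical helper names, and only simple `name==version` lines are handled.

```python
import re
from importlib import metadata
from typing import Dict, Optional, Tuple

# Matches simple pins such as "isort==5.10.0"; other requirement forms are skipped.
PIN = re.compile(r"^\s*([A-Za-z0-9_.\-]+)==([^\s#;]+)")


def parse_pins(requirements_text: str) -> Dict[str, str]:
    """Map package name -> pinned version for every `name==version` line."""
    pins = {}
    for line in requirements_text.splitlines():
        match = PIN.match(line)
        if match:
            pins[match.group(1).lower()] = match.group(2)
    return pins


def find_mismatches(pins: Dict[str, str]) -> Dict[str, Tuple[str, Optional[str]]]:
    """Return {name: (pinned, installed)} for packages that differ or are missing."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # not installed in this environment
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


# Sample text standing in for requirements-dev.txt.
pins = parse_pins("isort==5.10.0\n# a comment\nblack==22.10.0  # formatter\n")
print(pins)
print(find_mismatches(pins))
```

An empty result from `find_mismatches` means every pinned linter is installed at the pinned version.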
12
Local Runs
Goal: detect linter errors at development time to iterate fast
Set up the local environment:
● pip/poetry install
● docker
Run linters at:
● Commit time via git commit hooks
● Edit time via an IDE plugin or a file watcher on file changes
● Ad hoc via the linter CLI command
13
Continuous Integration (CI) Runs
Pre-install and cache dependencies in CI runners:
● Remote cache
● Docker image
Run linters when a commit is pushed
Scaling Challenges
15
Large Codebase
❗Slow Linters: 10-30+ minutes
Monolith: a large codebase
30,000+ Python files
16
Multiple Codebases
❗Poor Developer Experience:
● Inconsistent linter versions and configurations
● Endless effort spent upgrading linters and configs
[Diagram: Service 1, Library 1, Service 2, Library 2, …, each with its own linters (Linter A, Linter B)]
17
Many Developers
❗Poor Developer Experience:
● Observability is missing
● Linter/test errors may be merged to the main branch
● Developers are slowed down by linter suggestions
● Missing best practices on things other than Python, e.g. GitHub, Docker, etc.
[Diagram: Pull Request 1 through Pull Request 5]
Solutions
19
❗Checklist for Speeding up Linters
Strategy: avoid unnecessary analysis of large amounts of code
Checklist:
❏ Only run on updated code
❏ Run in parallel
❏ Reuse prior results
❏ Faster implementation
20
Only run necessary analysis on updated code
Local:
● get updated files from git:
○ git diff --name-only --diff-filter=d
● Run linters in daemon mode with a file watcher (e.g. watchman)
○ mypy daemon
CI: get updated files from the GitHub Pulls API (application/vnd.github.diff)
● gh api -H "Accept: application/vnd.github.diff" /repos/python/cpython/pulls/100957
✓ Only run on updated code
❏ Run in parallel
❏ Reuse prior results
❏ Faster implementation
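The local step can be sketched in Python as follows. The `git diff` flags are the ones from the slide; the default base `origin/main` and the helper names are assumptions for illustration.

```python
import subprocess
from pathlib import PurePosixPath
from typing import List


def python_files(diff_output: str) -> List[str]:
    """Keep only .py/.pyi paths from `git diff --name-only` output."""
    paths = [line.strip() for line in diff_output.splitlines() if line.strip()]
    return [p for p in paths if PurePosixPath(p).suffix in {".py", ".pyi"}]


def changed_python_files(base: str = "origin/main") -> List[str]:
    """Ask git for files changed relative to `base`, skipping deleted ones."""
    result = subprocess.run(
        ["git", "diff", "--name-only", "--diff-filter=d", base],
        capture_output=True, text=True, check=True,
    )
    return python_files(result.stdout)


# Sample output standing in for a real `git diff --name-only` call.
sample = "app/models.py\nREADME.md\nstubs/models.pyi\n"
print(python_files(sample))  # ['app/models.py', 'stubs/models.pyi']
```

The resulting list can then be passed as arguments to a linter command, so only updated files are analyzed.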
21
pre-commit: manage pre-commit hooks
Features:
● Run on committed files
● Run linters in parallel
● Reuse installed linters with a virtualenv
https://github.com/pre-commit/pre-commit 🛠
.pre-commit-config.yaml
repos:
  - repo: 'https://github.com/psf/black'
    rev: 22.10.0
    hooks:
      - id: black
✓ Only run on updated code
✓ Run in parallel
❏ Reuse prior results
❏ Faster implementation
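To illustrate the "run in parallel" idea outside of pre-commit, the sketch below splits files into batches and lints the batches concurrently. It is not pre-commit's implementation; the batch size, worker count, and helper names are assumptions, and a no-op Python process stands in for a real linter.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from typing import List


def chunk(items: List[str], size: int) -> List[List[str]]:
    """Split `items` into consecutive batches of at most `size` entries."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_batch(command: List[str], files: List[str]) -> int:
    """Run one linter process over a batch of files; return its exit code."""
    return subprocess.run([*command, *files]).returncode


def run_parallel(command: List[str], files: List[str], size: int = 50, workers: int = 4) -> int:
    """Lint batches concurrently; return 1 if any batch failed, else 0."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        codes = list(pool.map(lambda batch: run_batch(command, batch), chunk(files, size)))
    return 1 if any(codes) else 0


# Stand-in "linter" so the sketch runs anywhere: a Python process that exits 0.
noop = [sys.executable, "-c", "import sys; sys.exit(0)"]
print(run_parallel(noop, ["a.py", "b.py", "c.py"], size=2))
```

Threads are sufficient here because each worker only waits on a child process.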
22
Remote Cache
Some linters (e.g. mypy) require knowledge of the dependency graph
Cache the prior results of the entire codebase
Workflow:
● Download the most recent cache based on Git revision
● Run linters with the cache
● Upload the cache to be reused later
Use case: using a mypy remote cache reduced our mypy CI run from 20+ minutes to less than 5 minutes
❏ Only run on updated code
❏ Run in parallel
✓ Reuse prior results
❏ Faster implementation
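The download, run, upload workflow can be sketched as below. This is an assumption-laden illustration, not the setup from the talk: a temporary directory stands in for the remote store, the revision names are made up, and `pick_cache`, `restore`, and `upload` are hypothetical helpers.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List, Optional, Set


def pick_cache(history: List[str], cached: Set[str]) -> Optional[str]:
    """Return the most recent revision in `history` (newest first) that has a cache."""
    for revision in history:
        if revision in cached:
            return revision
    return None


def restore(store: Path, revision: str, cache_dir: Path) -> None:
    """Copy the cache stored for `revision` into the linter's cache directory."""
    shutil.copytree(store / revision, cache_dir, dirs_exist_ok=True)


def upload(store: Path, revision: str, cache_dir: Path) -> None:
    """Save the linter's cache directory under `revision` for later runs."""
    shutil.copytree(cache_dir, store / revision, dirs_exist_ok=True)


# A temporary directory stands in for the remote store (hypothetical layout).
with tempfile.TemporaryDirectory() as tmp:
    store, cache_dir = Path(tmp) / "store", Path(tmp) / ".mypy_cache"
    (store / "rev_b").mkdir(parents=True)
    (store / "rev_b" / "meta.json").write_text("{}")
    best = pick_cache(["rev_d", "rev_c", "rev_b", "rev_a"], {"rev_a", "rev_b"})
    if best is not None:
        restore(store, best, cache_dir)
    # ... run the linter here so it reads and updates cache_dir ...
    upload(store, "rev_d", cache_dir)
    restored = sorted(p.name for p in (store / "rev_d").iterdir())
print(best, restored)
```

In a real repository the `history` list could come from `git rev-list HEAD`, which prints commits newest first.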
23
Ruff: fast linter implementation in Rust
Implements:
● Flake8
● isort
Parse source code once across supported linters
Cache results and skip unchanged files
https://github.com/charliermarsh/ruff 🛠
❏ Only run on updated code
❏ Run in parallel
✓ Reuse prior results
✓ Faster implementation
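The "skip unchanged files" idea can be illustrated with a content-hash cache. This is a generic sketch, not how Ruff implements its cache; the JSON cache file and the helper names are assumptions.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def digest(path: Path) -> str:
    """Hash file contents so edits are detected regardless of timestamps."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def load(cache_file: Path) -> Dict[str, str]:
    return json.loads(cache_file.read_text()) if cache_file.exists() else {}


def files_to_lint(files: List[Path], cache_file: Path) -> List[Path]:
    """Return files whose content changed since the last recorded run."""
    seen = load(cache_file)
    return [f for f in files if seen.get(str(f)) != digest(f)]


def record(files: List[Path], cache_file: Path) -> None:
    """Remember the current content hash of each file that was linted."""
    seen = load(cache_file)
    seen.update({str(f): digest(f) for f in files})
    cache_file.write_text(json.dumps(seen))


with tempfile.TemporaryDirectory() as tmp:
    a, b = Path(tmp) / "a.py", Path(tmp) / "b.py"
    a.write_text("x = 1\n")
    b.write_text("y = 2\n")
    cache = Path(tmp) / "lint-cache.json"
    first = [f.name for f in files_to_lint([a, b], cache)]
    record([a, b], cache)
    b.write_text("y = 3\n")  # only b.py changes before the second run
    second = [f.name for f in files_to_lint([a, b], cache)]
print(first, second)  # ['a.py', 'b.py'] ['b.py']
```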
24
❗Checklist for Improving Developer Experience
Problems:
● Inconsistent linter versions and configurations
● Endless effort spent upgrading linters and configs
● Observability is missing
● Linter/test errors may be merged to the main branch
● Developers are slowed down by linter suggestions
● Missing best practices on things other than Python, e.g. GitHub, Docker, etc.
Strategy: build linters for best practices and provide autofixes for productivity
Checklist:
❏ Telemetry
❏ Custom Linter
❏ Autofix
25
Understand Developer Experience
Collect metrics from CI and local runs:
● Where: environment, Git codebase and branch
● What: linter suggestions
● How: latency, exception stack trace
✓ Telemetry
❏ Custom Linter
❏ Autofix
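A record covering the where, what, and how of a linter run might look like the sketch below. The field names, the `measure` wrapper, and the fake linters are assumptions for illustration; the talk does not describe its telemetry schema.

```python
import time
import traceback
from dataclasses import asdict, dataclass, field
from typing import Callable, List, Optional


@dataclass
class LinterRun:
    environment: str                                       # where: "ci" or "local"
    codebase: str                                          # where: Git codebase
    branch: str                                            # where: Git branch
    linter: str
    suggestions: List[str] = field(default_factory=list)  # what
    latency_seconds: float = 0.0                           # how
    exception: Optional[str] = None                        # how: stack trace, if any


def measure(run: LinterRun, lint: Callable[[], List[str]]) -> LinterRun:
    """Run `lint`, recording its suggestions, latency, and any exception."""
    start = time.perf_counter()
    try:
        run.suggestions = lint()
    except Exception:
        run.exception = traceback.format_exc()
    run.latency_seconds = time.perf_counter() - start
    return run


def fake_linter() -> List[str]:
    return ["example.py:1:1: F401 'os' imported but unused"]


def crashing_linter() -> List[str]:
    raise RuntimeError("linter crashed")


ok = measure(LinterRun("local", "monolith", "main", "flake8"), fake_linter)
failed = measure(LinterRun("ci", "monolith", "main", "flake8"), crashing_linter)
print(asdict(ok))
```

`asdict` gives a plain dictionary that can be serialized and sent to whatever metrics backend is in use.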
26
fixit: Python linters and autofixes using LibCST
ExplicitFrozenDataclassRule
@dataclass
class Data:
    name: str
# linter suggestion:
# When using dataclasses, explicitly specify a frozen keyword argument.
# suggested fix
@dataclass(frozen=True)
class Data:
    name: str
UseFstringRule
"%s" % "hi"
# linter suggestion:
# Do not use printf style formatting or .format().
# Use f-string instead to be more readable and efficient.
# See https://www.python.org/dev/peps/pep-0498/
# suggested fix
f"{'hi'}"
https://github.com/Instagram/Fixit 🛠
❏ Telemetry
✓ Custom Linter
✓ Autofix
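To show the kind of check behind a rule like ExplicitFrozenDataclassRule, here is a detection-only sketch using the standard-library `ast` module. It is a stand-in: Fixit builds on LibCST and also produces the autofix, which this sketch does not attempt, and the helper names are hypothetical.

```python
import ast
from typing import List, Tuple


def is_dataclass_decorator(node: ast.expr) -> bool:
    """True for `@dataclass`, `@dataclass(...)`, and `@dataclasses.dataclass`."""
    target = node.func if isinstance(node, ast.Call) else node
    if isinstance(target, ast.Name):
        return target.id == "dataclass"
    return isinstance(target, ast.Attribute) and target.attr == "dataclass"


def missing_frozen(source: str) -> List[Tuple[int, str]]:
    """Return (line, class name) for dataclasses without an explicit frozen=..."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ClassDef):
            continue
        for decorator in node.decorator_list:
            if not is_dataclass_decorator(decorator):
                continue
            keywords = decorator.keywords if isinstance(decorator, ast.Call) else []
            if not any(k.arg == "frozen" for k in keywords):
                findings.append((node.lineno, node.name))
    return findings


code = '''
from dataclasses import dataclass

@dataclass
class Data:
    name: str

@dataclass(frozen=True)
class Frozen:
    name: str
'''
print(missing_frozen(code))
```

Only `Data` is reported; `Frozen` already passes `frozen` explicitly.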
27
Our Custom Python Linters: GitHub Check with annotations
GitHub Check:
● Use a required check for branch protection
● Use annotations to provide inline context to speed up the fix
❏ Telemetry
✓ Custom Linter
❏ Autofix
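Annotations are sent as a list of objects in a check run's output. The sketch below converts linter output into that shape; it assumes flake8's default `path:line:column: CODE message` format, and `to_annotations` is a hypothetical helper, not part of any library. The keys used (`path`, `start_line`, `end_line`, `annotation_level`, `title`, `message`) are annotation fields of the GitHub Checks API.

```python
import re
from typing import Dict, List, Union

# Matches flake8's default output format: path:line:column: CODE message
FLAKE8_LINE = re.compile(
    r"^(?P<path>[^:]+):(?P<line>\d+):(?P<col>\d+): (?P<code>\w+) (?P<msg>.+)$"
)


def to_annotations(linter_output: str, level: str = "failure") -> List[Dict[str, Union[str, int]]]:
    """Convert linter output lines into check-run annotation dictionaries."""
    annotations = []
    for line in linter_output.splitlines():
        match = FLAKE8_LINE.match(line)
        if match is None:
            continue  # skip summary lines and anything not shaped like a finding
        annotations.append({
            "path": match["path"],
            "start_line": int(match["line"]),
            "end_line": int(match["line"]),
            "annotation_level": level,  # "notice", "warning", or "failure"
            "title": match["code"],
            "message": match["msg"],
        })
    return annotations


output = "app/models.py:12:1: F401 'os' imported but unused\nnot a finding\n"
print(to_annotations(output))
```

The API limits how many annotations one request may carry, so a long list would need to be sent in batches.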
28
Our Custom non-Python Linters: rebase reminder
Errors may be merged into the main branch
[Diagram: commits A, B, C in sequence, with PR1 at B and PR2 at C; C is marked (x)]
✓ Telemetry
✓ Custom Linter
❏ Autofix
29
Our Custom Python Linters: deprecation toolkit
Too many pre-existing linter errors
Need to resolve them incrementally
Linters to prevent new usages
Run linters to collect historical data to drive progress over time
✓ Telemetry
✓ Custom Linter
❏ Autofix
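One common way to block new usages while tolerating pre-existing ones is to compare current findings against a recorded baseline. The sketch below assumes that approach; it is not the deprecation toolkit itself, and the rule name `old_api` and the helper names are made up for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

Finding = Tuple[str, str]  # (file path, rule name)


def count_by_rule(findings: List[Finding]) -> Counter:
    """Count findings per (file, rule) pair."""
    return Counter(findings)


def new_violations(baseline: Counter, current: Counter) -> Dict[Finding, int]:
    """Return pairs whose count grew relative to the baseline, with the increase."""
    return {key: n - baseline[key] for key, n in current.items() if n > baseline[key]}


baseline = count_by_rule([("a.py", "old_api"), ("a.py", "old_api"), ("b.py", "old_api")])
current = count_by_rule([("a.py", "old_api"), ("b.py", "old_api"), ("c.py", "old_api")])

print(new_violations(baseline, current))  # {('c.py', 'old_api'): 1}
print(sum(baseline.values()), "->", sum(current.values()))  # 3 -> 3
```

Failing only on `new_violations` lets existing errors be fixed incrementally, while the totals recorded per run give the historical data for tracking progress.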
30
Reusable Workflows
Build reusable workflows to be shared across codebases easily, e.g. GitHub reusable workflows
Build a reusable framework:
● Simple APIs for building linters and autofixes
● Collect metrics
● Generate GitHub Checks with annotations easily
✓ Telemetry
✓ Custom Linter
✓ Autofix
31
Automated Refactoring
Auto bump versions: GitHub Dependabot
Auto fix linter errors:
● LibCST: custom codemods
● PyGithub: create pull requests
Build an automated refactoring framework:
● Create pull requests and manage their life cycle until they are merged
● [talk] Automated Refactoring in Large Python Codebases (EuroPython 2022)
● [blog] Type annotation via automated refactoring (link)
❏ Telemetry
❏ Custom Linter
✓ Autofix
32
Our Custom Python Autofixes: Flake8
❏ Telemetry
❏ Custom Linter
✓ Autofix
33
Our Custom Python Autofixes: mypy
❏ Telemetry
❏ Custom Linter
✓ Autofix
34
Our Custom non-Python Autofixes: notify-reviewer-teams
Sometimes PRs are blocked on code reviews.
❏ Telemetry
❏ Custom Linter
✓ Autofix
35
Our Custom non-Python Autofixes: release-on-merge
❏ Telemetry
❏ Custom Linter
✓ Autofix
36
Results
Support 200+ developers in 30+ codebases to run common Python linters with consistent configs and autofixes
Each week, the linters run 10k+ times and provide 25k+ suggestions.
So far, the autofixes have been used 7000+ times and saved lots of developer time.
37
Recap
Slow Linter Checklist:
✓ Only run on updated code
✓ Run in parallel
✓ Reuse prior results
✓ Faster implementation
Developer Experience Checklist:
✓ Telemetry
✓ Custom Linter
✓ Autofix
38
Thank you for your attention!
Carta Engineering Blog https://medium.com/building-carta
Carta Jobs https://boards.greenhouse.io/carta
