After a decade in data engineering, I’ve seen hundreds of hours wasted developing on top of messy, unmaintainable code. Here’s how to make your code easy to maintain in just 5 minutes:

🚀 1. Create a Validation Script
Before refactoring, ensure your output remains consistent.
✅ Check row count differences
✅ Validate metric consistency across key dimensions
✅ Use tools like datacompy to automate checks (see the sketch after this post)

🔄 2. Split Large Code Blocks into Individual Parts
Refactor complex logic into modular components.
💡 Break down CTEs/subqueries into individual parts
💡 In Python, use functions
💡 In dbt, create separate models

🔌 3. Separate I/O from Transformation Logic
Decouple data reading/writing from transformations.
🔹 Easier testing & debugging
🔹 Re-running transformations becomes simpler

🛠️ 4. Make Each Function Independent
Your transformation functions should have no side effects.
🔑 Inputs = DataFrames → Outputs = DataFrames
🔑 External writes (e.g., logging) should use objects

🧪 5. Write Extensive Tests
Tests ensure your pipelines don’t break with new changes.
✅ Catch issues before they hit production
✅ Gain confidence in refactoring

🔗 6. Think in Chains of Functions
ETL should be a chain of reusable transformation functions.
💡 Modular functions = easier debugging, maintenance, and scaling

Following these principles will save you hours of frustration while keeping your code clean, scalable, and easy to modify.

What’s your biggest challenge with maintaining ETL pipelines? Drop it in the comments! 👇

#data #dataengineering #datapipeline
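A minimal sketch of what steps 1, 3, 4, and 6 can look like together, assuming pandas and datacompy are available. The file paths, column names, and helper names (load_orders, add_revenue, summarize_by_region) are hypothetical placeholders, not from the post:

```python
# Sketch only: paths, columns, and function names are illustrative.
import pandas as pd
import datacompy

# --- I/O lives at the edges (step 3) ---
def load_orders(path: str) -> pd.DataFrame:
    return pd.read_csv(path)

def save_summary(df: pd.DataFrame, path: str) -> None:
    df.to_csv(path, index=False)

# --- Pure transformations: DataFrame in, DataFrame out (step 4) ---
def add_revenue(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["revenue"] = out["quantity"] * out["unit_price"]
    return out

def summarize_by_region(df: pd.DataFrame) -> pd.DataFrame:
    return df.groupby("region", as_index=False)["revenue"].sum()

# --- The pipeline is just a chain of functions (step 6) ---
def run_pipeline(in_path: str, out_path: str) -> pd.DataFrame:
    summary = summarize_by_region(add_revenue(load_orders(in_path)))
    save_summary(summary, out_path)
    return summary

# --- Validation script for the refactor (step 1) ---
def validate(old: pd.DataFrame, new: pd.DataFrame) -> None:
    assert len(old) == len(new), "Row counts differ after refactor"
    cmp = datacompy.Compare(old, new, join_columns="region",
                            df1_name="before", df2_name="after")
    print(cmp.report())  # row/column mismatch report
    assert cmp.matches(), "Refactored output does not match the original"
```

Because every transformation is a pure DataFrame-in/DataFrame-out function, each step can also be unit tested in isolation (step 5), and validate() can be run against the pre-refactor output before the change is merged.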
Clean Code Practices For Data Science Projects
Summary
Clean code practices for data science projects involve applying software engineering principles to ensure your code is readable, maintainable, and scalable. These practices help streamline workflows, prevent errors, and improve collaboration in data-driven environments.
- Modularize your code: Break down large blocks of code into smaller, reusable functions or components to make debugging and future updates simpler.
- Document thoroughly: Add clear comments and explanations to your code, including business logic and decisions, to support collaboration and long-term usability.
- Implement testing: Use unit tests to validate your data inputs, processing, and outputs, ensuring that changes to your code don’t break functionality.
-
🚨 Data professionals NEED to utilize software engineering best practices. Gone are the days of a scrappy Jupyter notebook or quick SQL queries to get stuff done.

👀 While both of those scrappy methods serve a purpose, the reality is that our industry as a whole has matured: data is no longer a means to an end, but actual products with dependencies for critical business processes.

👇🏽 What does this look like?
- Clean code where each action is encapsulated in a specific function and/or class.
- Version control where each pull request has a discrete purpose (compared to 1k+ line PRs).
- Clear documentation of business logic and reasoning (on my SQL queries I would leave comments with Slack message links to show when public decisions were made).
- Unit tests that cover your functions as well as the data when possible (see the sketch after this post).
- CI/CD on your pull requests, which is very approachable now with GitHub Actions.

💻 In my LinkedIn Learning course I was adamant about not just teaching you dbt, but how to create a dbt project that is production ready and uses engineering best practices. Specifically, it's hands-on, and you will learn:
- How to use the command line
- How to set up databases
- Using requirements.txt files for reproducibility
- Creating discrete PRs for building your project
- Documentation as code
- Applying the DRY principle (don't repeat yourself)
- Implementing tests on your code
- Creating a dev and prod environment
- Setting up GitHub Actions workflows (CI/CD)

🔗 Link to the course in the comments!
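The "unit tests that cover your functions as well as the data" point is the one most often skipped. A minimal pytest sketch, assuming a pandas-based project; the clean_customers function and its columns are hypothetical, not from the course:

```python
# Hypothetical sketch: clean_customers and its columns are stand-ins.
import pandas as pd

def clean_customers(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicate customers and normalize email casing."""
    out = df.drop_duplicates(subset="customer_id").copy()
    out["email"] = out["email"].str.lower().str.strip()
    return out

# --- Test the function ---
def test_clean_customers_deduplicates_and_lowercases():
    raw = pd.DataFrame({
        "customer_id": [1, 1, 2],
        "email": ["A@X.COM ", "A@X.COM ", "b@y.com"],
    })
    cleaned = clean_customers(raw)
    assert len(cleaned) == 2
    assert set(cleaned["email"]) == {"a@x.com", "b@y.com"}

# --- Test the data itself ---
def test_customer_ids_are_unique_and_not_null():
    cleaned = clean_customers(pd.DataFrame({
        "customer_id": [1, 2, 3],
        "email": ["a@x.com", "b@y.com", "c@z.com"],
    }))
    assert cleaned["customer_id"].notna().all()
    assert cleaned["customer_id"].is_unique
```

In a dbt project, the second style of check maps onto generic schema tests like unique and not_null, which can then run on every pull request through a GitHub Actions workflow.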
-
Why does learning about software engineering matter in data science? Your model code is a crucial component of the software. Though we may not have as much technical depth as a full-time software engineer, it's vital that we hold our modeling code to the same standard. The model code you deploy is not just Pandas (or Spark) for data manipulation and scikit-learn (or TensorFlow) for modeling. You have to understand how to "package" your solution with some of the principles and techniques of a software engineer.

1. Practice DRY
Wrote 10+ lines of numpy and pandas manipulation for a feature engineering technique? Modularize it as much as you can and put the code in a repo that you can reuse for your next project and that colleagues can use too.

2. Optimize Code Efficiency
Many complain about algorithms & data structures as though they are not applicable in real-life projects. Pandas and numpy are the only things you will use, right? Well, look under the hood of those libraries; I assure you that object-oriented programming and algorithmic patterns like dynamic programming, queues, and sorting are applied in DF.sort_values, DF.groupby, and the like. And if you need to write custom functions, you don't want to blindly write a preprocessing function that looks like:
for i in array1: for j in array2: for k in array3: ...
(see the vectorization sketch after this post).

3. Code Reviews
Pair programming and code reviews help ensure that your code is functional, optimal, and error-free. We need to apply the same sanity check to our modeling solutions. We are subject to our own biases: your model solution may look great, but how would it perform with edge cases (e.g., outliers) or bottlenecks in your functions? These are things that can be caught in code reviews.

4. Unit Testing
Unit test your data input, data processing, and modeling output functions. If your model breaks in production, one of the first things you want to assess is whether there is an issue with your code. Perhaps a change you or a colleague made caused the breakage. Unit tests are one of the first guardrails against this.

5. Gather Non-Functional Requirements
This is the part that's, honestly, frustrating about current data science courses, books, and Kaggle. In software engineering 101, you learn to consider throughput, latency, and data volume in your design. In data science courses, you only learn about model metrics, as though that's the only thing that matters. In real-life projects, you rarely optimize your model on accuracy alone; latency constraints and QPS should also factor into your model design.

⭐ If you want to ensure the successful delivery of your modeling project, it's not just coding with Pandas and scikit-learn in Jupyter. Best practices in software engineering are vital.

👉 Found this post helpful? Smash 👍 and follow Daniel Lee 📚
👉 Land your dream data job on 𝗗𝗮𝘁𝗮𝗜𝗻𝘁𝗲𝗿𝘃𝗶𝗲𝘄[.]𝗰𝗼𝗺 🚀
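To make point 2 concrete, here is a sketch of replacing a nested-loop feature computation with vectorized pandas. The "prior orders per customer" feature and the column names are hypothetical examples, not from the post:

```python
# Hypothetical sketch: the "prior_orders" feature and columns are illustrative.
import pandas as pd

def add_prior_order_count_slow(df: pd.DataFrame) -> pd.DataFrame:
    """Naive O(n^2) version: nested loops over every pair of rows."""
    out = df.copy()
    counts = []
    for i in range(len(out)):
        n = 0
        for j in range(len(out)):
            if (out.iloc[j]["customer_id"] == out.iloc[i]["customer_id"]
                    and out.iloc[j]["order_date"] < out.iloc[i]["order_date"]):
                n += 1
        counts.append(n)
    out["prior_orders"] = counts
    return out

def add_prior_order_count(df: pd.DataFrame) -> pd.DataFrame:
    """Vectorized version: sort once, then use a grouped cumulative count."""
    out = df.sort_values(["customer_id", "order_date"]).copy()
    out["prior_orders"] = out.groupby("customer_id").cumcount()
    return out

if __name__ == "__main__":
    orders = pd.DataFrame({
        "customer_id": [1, 1, 2, 1, 2],
        "order_date": pd.to_datetime(
            ["2024-01-01", "2024-01-05", "2024-01-02",
             "2024-01-09", "2024-01-07"]),
    })
    slow = add_prior_order_count_slow(orders)
    fast = add_prior_order_count(orders).sort_index()
    assert slow["prior_orders"].tolist() == fast["prior_orders"].tolist()
```

The vectorized version is also the one worth unit testing and reusing across projects (points 1 and 4); the slow version is included only to show what the nested-loop anti-pattern looks like.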