Last modified: Nov 24, 2025 by Alexander Williams
Clean and Normalize Spreadsheet Data in Python with pyexcel
Data cleaning is a crucial step before analysis: messy data leads to wrong insights. Python's pyexcel library makes cleaning easy.
This guide shows practical techniques for handling common data issues. Follow along with the examples.
Why Clean Spreadsheet Data?
Real-world data is often messy: it contains errors and inconsistencies, and cleaning it ensures accurate results.
Common problems include missing values, duplicate records, and inconsistent formatting.
Clean data saves time later, prevents analysis errors, and makes your reports more reliable.
Install pyexcel and Dependencies
First, install the required packages using pip:
pip install pyexcel pyexcel-xlsx pyexcel-ods
If you encounter installation issues, refer to our guide on Install pyexcel in Python with pip and Virtualenv for detailed instructions.
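To confirm the installation worked, you can check that the packages are importable. The sketch below assumes the pip command above; note that the plugin distributions import with underscores (pyexcel_xlsx, pyexcel_ods), not hyphens:

```python
import importlib.util

# Package names assume the pip install above; the plugins import
# as pyexcel_xlsx and pyexcel_ods (underscores, not hyphens).
for pkg in ("pyexcel", "pyexcel_xlsx", "pyexcel_ods"):
    found = importlib.util.find_spec(pkg) is not None
    print(f"{pkg}: {'ok' if found else 'missing'}")
```

If any package prints "missing", rerun the pip command before continuing.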
Loading Spreadsheet Data
Start by loading your data file. pyexcel supports multiple formats; the get_array function reads a sheet as a list of rows.
import pyexcel as pe
# Load data from Excel file
data_array = pe.get_array(file_name="raw_data.xlsx")
print("Original data:")
for row in data_array:
    print(row)
Original data:
['Name', 'Age', 'Salary', 'Department']
['John Doe', 25, 50000, 'Marketing']
['Jane Smith', '', 60000, 'Sales']
['Bob Johnson', 30, '55,000', 'IT']
['Alice Brown', 28, 48000, '']
['John Doe', 25, 50000, 'Marketing']
Notice the data issues: some ages are missing, the salary column mixes numbers and comma-formatted strings, and there is a duplicate record.
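You can also detect these issues programmatically before cleaning. The sketch below scans the rows shown above (hardcoded here for illustration) and counts empty cells and duplicate rows:

```python
# Sample rows copied from the output above (hardcoded for illustration)
rows = [
    ['Name', 'Age', 'Salary', 'Department'],
    ['John Doe', 25, 50000, 'Marketing'],
    ['Jane Smith', '', 60000, 'Sales'],
    ['Bob Johnson', 30, '55,000', 'IT'],
    ['Alice Brown', 28, 48000, ''],
    ['John Doe', 25, 50000, 'Marketing'],
]

# Count empty cells in the data rows (header excluded)
empty_cells = sum(1 for row in rows[1:] for cell in row if cell == '')

# Count duplicate data rows
seen = set()
duplicates = 0
for row in rows[1:]:
    key = tuple(str(cell) for cell in row)
    if key in seen:
        duplicates += 1
    seen.add(key)

print(f"Empty cells: {empty_cells}")    # 2
print(f"Duplicate rows: {duplicates}")  # 1
```

A quick audit like this tells you which cleaning steps the file actually needs.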
Handling Missing Values
Missing values disrupt analysis. You can fill or remove them; choose based on your needs.
Detect them by checking for empty strings or None, and replace them with a sensible placeholder to maintain data integrity.
def handle_missing_values(data):
    cleaned_data = []
    header = data[0]
    cleaned_data.append(header)

    for row in data[1:]:
        cleaned_row = []
        for cell in row:
            # Replace empty or missing cells with 'Unknown'
            if cell == '' or cell is None:
                cleaned_row.append('Unknown')
            else:
                cleaned_row.append(cell)
        cleaned_data.append(cleaned_row)

    return cleaned_data
cleaned_data = handle_missing_values(data_array)
print("After handling missing values:")
for row in cleaned_data:
    print(row)
After handling missing values:
['Name', 'Age', 'Salary', 'Department']
['John Doe', 25, 50000, 'Marketing']
['Jane Smith', 'Unknown', 60000, 'Sales']
['Bob Johnson', 30, '55,000', 'IT']
['Alice Brown', 28, 48000, 'Unknown']
['John Doe', 25, 50000, 'Marketing']
Removing Duplicate Rows
Duplicates skew your analysis by inflating counts. Remove them carefully:
compare all row values and keep only the unique records.
def remove_duplicates(data):
    seen = set()
    unique_data = []

    for row in data:
        # Convert the row to a tuple so it can be hashed
        row_tuple = tuple(str(cell) for cell in row)
        if row_tuple not in seen:
            seen.add(row_tuple)
            unique_data.append(row)

    return unique_data
unique_data = remove_duplicates(cleaned_data)
print("After removing duplicates:")
for row in unique_data:
    print(row)
After removing duplicates:
['Name', 'Age', 'Salary', 'Department']
['John Doe', 25, 50000, 'Marketing']
['Jane Smith', 'Unknown', 60000, 'Sales']
['Bob Johnson', 30, '55,000', 'IT']
['Alice Brown', 28, 48000, 'Unknown']
Normalizing Data Formats
Inconsistent formats cause errors, so normalize text case and number formats for consistency throughout.
Here we standardize the salary column by removing commas and converting the values to integers.
def normalize_formats(data):
    normalized_data = [data[0]]  # Keep header

    for row in data[1:]:
        name, age, salary, dept = row

        # Normalize department to title case
        if dept != 'Unknown':
            dept = dept.title()

        # Normalize salary: remove commas and convert to integer
        if isinstance(salary, str):
            salary = int(salary.replace(',', ''))

        normalized_data.append([name, age, salary, dept])

    return normalized_data
normalized_data = normalize_formats(unique_data)
print("After normalizing formats:")
for row in normalized_data:
    print(row)
After normalizing formats:
['Name', 'Age', 'Salary', 'Department']
['John Doe', 25, 50000, 'Marketing']
['Jane Smith', 'Unknown', 60000, 'Sales']
['Bob Johnson', 30, 55000, 'It']
['Alice Brown', 28, 48000, 'Unknown']
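One caveat visible in the output above: str.title() turns 'IT' into 'It'. If you want to preserve all-caps acronyms, a small helper (hypothetical, not part of pyexcel) does the trick:

```python
def smart_title(text):
    """Title-case text, but leave all-uppercase values (acronyms) alone."""
    return text if text.isupper() else text.title()

print(smart_title('marketing'))  # Marketing
print(smart_title('IT'))         # IT
```

Swap smart_title in for str.title() in normalize_formats if acronym departments matter in your data.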
Saving Cleaned Data
Save your cleaned data in an appropriate file format to preserve your cleaning work.
pyexcel makes saving easy and supports multiple formats. Our Python pyexcel Guide: Convert CSV XLSX JSON provides more conversion options.
# Save cleaned data to new Excel file
pe.save_as(array=normalized_data, dest_file_name="cleaned_data.xlsx")
print("Cleaned data saved to cleaned_data.xlsx")
# You can also save as CSV
pe.save_as(array=normalized_data, dest_file_name="cleaned_data.csv")
print("Cleaned data saved to cleaned_data.csv")
Complete Data Cleaning Script
Here is the complete workflow, combining all the cleaning steps. Use it as a template.
import pyexcel as pe

def clean_spreadsheet_data(input_file, output_file):
    # Load data
    data_array = pe.get_array(file_name=input_file)

    # Handle missing values
    def handle_missing(data):
        cleaned = [data[0]]
        for row in data[1:]:
            cleaned_row = ['Unknown' if cell in ('', None) else cell for cell in row]
            cleaned.append(cleaned_row)
        return cleaned

    # Remove duplicates
    def remove_dups(data):
        seen = set()
        unique = []
        for row in data:
            row_tuple = tuple(str(cell) for cell in row)
            if row_tuple not in seen:
                seen.add(row_tuple)
                unique.append(row)
        return unique

    # Normalize formats
    def normalize(data):
        normalized = [data[0]]
        for row in data[1:]:
            name, age, salary, dept = row
            if dept != 'Unknown':
                dept = dept.title()
            if isinstance(salary, str):
                salary = int(salary.replace(',', ''))
            normalized.append([name, age, salary, dept])
        return normalized

    # Apply cleaning steps in order
    step1 = handle_missing(data_array)
    step2 = remove_dups(step1)
    cleaned_data = normalize(step2)

    # Save result
    pe.save_as(array=cleaned_data, dest_file_name=output_file)
    return cleaned_data

# Usage
final_data = clean_spreadsheet_data("raw_data.xlsx", "final_cleaned_data.xlsx")
print("Data cleaning completed successfully!")
Best Practices for Data Cleaning
Always keep your original files and run the cleaning on a copy; this prevents data loss.
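A quick way to keep the original safe is to copy it before cleaning. A minimal sketch (the file name in the usage comment is just an example):

```python
import shutil
from pathlib import Path

def backup_file(path):
    """Copy the file to path + '.bak' before cleaning; return the backup path."""
    backup = Path(str(path) + ".bak")
    shutil.copy2(path, backup)  # copy2 also preserves file metadata
    return backup

# Example (adjust the file name to your own data):
# backup_file("raw_data.xlsx")  # creates raw_data.xlsx.bak
```

Run the backup at the top of your cleaning script so the raw file is never modified in place.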
Document your cleaning steps and note every transformation applied; this keeps the process reproducible.
Validate the results thoroughly and check for remaining issues; quality assurance is essential.
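The validation pass can be automated: check that no empty cells remain, that every row has the same number of columns as the header, and that there are no duplicates. A minimal sketch (the helper name is my own, not part of pyexcel):

```python
def validate_rows(data):
    """Return a list of problem descriptions found in cleaned rows."""
    problems = []
    header = data[0]
    seen = set()
    for i, row in enumerate(data[1:], start=2):  # spreadsheet-style row numbers
        if len(row) != len(header):
            problems.append(f"row {i}: expected {len(header)} columns, got {len(row)}")
        if any(cell == '' or cell is None for cell in row):
            problems.append(f"row {i}: empty cell")
        key = tuple(str(cell) for cell in row)
        if key in seen:
            problems.append(f"row {i}: duplicate row")
        seen.add(key)
    return problems

# Example on already-cleaned rows:
clean = [['Name', 'Age'], ['John Doe', 25], ['Jane Smith', 30]]
print(validate_rows(clean))  # []
```

An empty list means the cleaned data passed all three checks; otherwise each entry points at the offending row.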
For more advanced operations, our Python pyexcel Tutorial: Read Write Excel CSV Files covers additional functionality.
Conclusion
Data cleaning is essential for reliable analysis, and pyexcel handles common data issues effectively.
You learned to handle missing values, remove duplicates, and normalize data formats.
Start with these basic techniques and adapt them to your specific needs; clean data leads to better decisions.
Remember to always back up your original files and test your cleaning scripts thoroughly. Happy data cleaning!