Last modified: Nov 24, 2025 by Alexander Williams
Clean and Normalize Spreadsheet Data in Python with pyexcel
Data cleaning is a crucial step before analysis: messy data leads to wrong insights. Python's pyexcel library makes cleaning easy.
This guide shows practical techniques for handling common data issues. Follow along with the examples.
Why Clean Spreadsheet Data?
Real-world data is often messy: it contains errors and inconsistencies, and cleaning it ensures accurate results.
Common problems include missing values, duplicate records, and inconsistent formatting.
Clean data saves time later, prevents analysis errors, and makes your reports more reliable.
Install pyexcel and Dependencies
First, install the required packages using pip:
pip install pyexcel pyexcel-xlsx pyexcel-ods
If you encounter installation issues, refer to our guide on Install pyexcel in Python with pip and Virtualenv for detailed instructions.
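To confirm the installation worked, you can check that the packages are importable. The sketch below assumes the pip command above; note that the plugin distributions import with underscores (pyexcel_xlsx, pyexcel_ods), not hyphens:

```python
import importlib.util

# Package names assume the pip install above; the plugins import
# as pyexcel_xlsx and pyexcel_ods (underscores, not hyphens).
for pkg in ("pyexcel", "pyexcel_xlsx", "pyexcel_ods"):
    found = importlib.util.find_spec(pkg) is not None
    print(f"{pkg}: {'ok' if found else 'missing'}")
```

If any package prints "missing", rerun the pip command before continuing.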
Loading Spreadsheet Data
Start by loading your data file. pyexcel supports multiple formats; the get_array function reads a sheet as a list of rows.
import pyexcel as pe
# Load data from Excel file
data_array = pe.get_array(file_name="raw_data.xlsx")
print("Original data:")
for row in data_array:
    print(row)
Original data:
['Name', 'Age', 'Salary', 'Department']
['John Doe', 25, 50000, 'Marketing']
['Jane Smith', '', 60000, 'Sales']
['Bob Johnson', 30, '55,000', 'IT']
['Alice Brown', 28, 48000, '']
['John Doe', 25, 50000, 'Marketing']
Notice the data issues: some ages are missing, the salary column mixes numbers and comma-formatted strings, and there is a duplicate record.
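You can also detect these issues programmatically before cleaning. The sketch below scans the rows shown above (hardcoded here for illustration) and counts empty cells and duplicate rows:

```python
# Sample rows copied from the output above (hardcoded for illustration)
rows = [
    ['Name', 'Age', 'Salary', 'Department'],
    ['John Doe', 25, 50000, 'Marketing'],
    ['Jane Smith', '', 60000, 'Sales'],
    ['Bob Johnson', 30, '55,000', 'IT'],
    ['Alice Brown', 28, 48000, ''],
    ['John Doe', 25, 50000, 'Marketing'],
]

# Count empty cells in the data rows (header excluded)
empty_cells = sum(1 for row in rows[1:] for cell in row if cell == '')

# Count duplicate data rows
seen = set()
duplicates = 0
for row in rows[1:]:
    key = tuple(str(cell) for cell in row)
    if key in seen:
        duplicates += 1
    seen.add(key)

print(f"Empty cells: {empty_cells}")    # 2
print(f"Duplicate rows: {duplicates}")  # 1
```

A quick audit like this tells you which cleaning steps the file actually needs.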
Handling Missing Values
Missing values disrupt analysis. You can fill or remove them; choose based on your needs.
Detect them by checking for empty strings or None, and replace them with a sensible placeholder to maintain data integrity.
def handle_missing_values(data):
    cleaned_data = []
    header = data[0]
    cleaned_data.append(header)

    for row in data[1:]:
        cleaned_row = []
        for cell in row:
            # Replace empty or missing cells with 'Unknown'
            if cell == '' or cell is None:
                cleaned_row.append('Unknown')
            else:
                cleaned_row.append(cell)
        cleaned_data.append(cleaned_row)

    return cleaned_data
cleaned_data = handle_missing_values(data_array)
print("After handling missing values:")
for row in cleaned_data:
    print(row)
After handling missing values:
['Name', 'Age', 'Salary', 'Department']
['John Doe', 25, 50000, 'Marketing']
['Jane Smith', 'Unknown', 60000, 'Sales']
['Bob Johnson', 30, '55,000', 'IT']
['Alice Brown', 28, 48000, 'Unknown']
['John Doe', 25, 50000, 'Marketing']
Removing Duplicate Rows
Duplicates skew your analysis by inflating counts. Remove them carefully:
compare all row values and keep only the unique records.
def remove_duplicates(data):
    seen = set()
    unique_data = []

    for row in data:
        # Convert the row to a tuple so it can be hashed
        row_tuple = tuple(str(cell) for cell in row)
        if row_tuple not in seen:
            seen.add(row_tuple)
            unique_data.append(row)

    return unique_data
unique_data = remove_duplicates(cleaned_data)
print("After removing duplicates:")
for row in unique_data:
    print(row)
After removing duplicates:
['Name', 'Age', 'Salary', 'Department']
['John Doe', 25, 50000, 'Marketing']
['Jane Smith', 'Unknown', 60000, 'Sales']
['Bob Johnson', 30, '55,000', 'IT']
['Alice Brown', 28, 48000, 'Unknown']
Normalizing Data Formats
Inconsistent formats cause errors, so normalize text case and number formats for consistency throughout.
Here we standardize the salary column by removing commas and converting the values to integers.
def normalize_formats(data):
    normalized_data = [data[0]]  # Keep header

    for row in data[1:]:
        name, age, salary, dept = row

        # Normalize department to title case
        if dept != 'Unknown':
            dept = dept.title()

        # Normalize salary: remove commas and convert to integer
        if isinstance(salary, str):
            salary = int(salary.replace(',', ''))

        normalized_data.append([name, age, salary, dept])

    return normalized_data
normalized_data = normalize_formats(unique_data)
print("After normalizing formats:")
for row in normalized_data:
    print(row)
After normalizing formats:
['Name', 'Age', 'Salary', 'Department']
['John Doe', 25, 50000, 'Marketing']
['Jane Smith', 'Unknown', 60000, 'Sales']
['Bob Johnson', 30, 55000, 'It']
['Alice Brown', 28, 48000, 'Unknown']
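One caveat visible in the output above: str.title() turns 'IT' into 'It'. If you want to preserve all-caps acronyms, a small helper (hypothetical, not part of pyexcel) does the trick:

```python
def smart_title(text):
    """Title-case text, but leave all-uppercase values (acronyms) alone."""
    return text if text.isupper() else text.title()

print(smart_title('marketing'))  # Marketing
print(smart_title('IT'))         # IT
```

Swap smart_title in for str.title() in normalize_formats if acronym departments matter in your data.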
Saving Cleaned Data
Save your cleaned data in an appropriate file format to preserve your cleaning work.
pyexcel makes saving easy and supports multiple formats. Our Python pyexcel Guide: Convert CSV XLSX JSON provides more conversion options.
# Save cleaned data to new Excel file
pe.save_as(array=normalized_data, dest_file_name="cleaned_data.xlsx")
print("Cleaned data saved to cleaned_data.xlsx")
# You can also save as CSV
pe.save_as(array=normalized_data, dest_file_name="cleaned_data.csv")
print("Cleaned data saved to cleaned_data.csv")
Complete Data Cleaning Script
Here is the complete workflow, combining all the cleaning steps. Use it as a template.
import pyexcel as pe

def clean_spreadsheet_data(input_file, output_file):
    # Load data
    data_array = pe.get_array(file_name=input_file)

    # Handle missing values
    def handle_missing(data):
        cleaned = [data[0]]
        for row in data[1:]:
            cleaned_row = ['Unknown' if cell in ('', None) else cell for cell in row]
            cleaned.append(cleaned_row)
        return cleaned

    # Remove duplicates
    def remove_dups(data):
        seen = set()
        unique = []
        for row in data:
            row_tuple = tuple(str(cell) for cell in row)
            if row_tuple not in seen:
                seen.add(row_tuple)
                unique.append(row)
        return unique

    # Normalize formats
    def normalize(data):
        normalized = [data[0]]
        for row in data[1:]:
            name, age, salary, dept = row
            if dept != 'Unknown':
                dept = dept.title()
            if isinstance(salary, str):
                salary = int(salary.replace(',', ''))
            normalized.append([name, age, salary, dept])
        return normalized

    # Apply cleaning steps in order
    step1 = handle_missing(data_array)
    step2 = remove_dups(step1)
    cleaned_data = normalize(step2)

    # Save result
    pe.save_as(array=cleaned_data, dest_file_name=output_file)
    return cleaned_data

# Usage
final_data = clean_spreadsheet_data("raw_data.xlsx", "final_cleaned_data.xlsx")
print("Data cleaning completed successfully!")
Best Practices for Data Cleaning
Always keep your original files and run the cleaning on a copy; this prevents data loss.
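A quick way to keep the original safe is to copy it before cleaning. A minimal sketch (the file name in the usage comment is just an example):

```python
import shutil
from pathlib import Path

def backup_file(path):
    """Copy the file to path + '.bak' before cleaning; return the backup path."""
    backup = Path(str(path) + ".bak")
    shutil.copy2(path, backup)  # copy2 also preserves file metadata
    return backup

# Example (adjust the file name to your own data):
# backup_file("raw_data.xlsx")  # creates raw_data.xlsx.bak
```

Run the backup at the top of your cleaning script so the raw file is never modified in place.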
Document your cleaning steps and note every transformation applied; this keeps the process reproducible.
Validate the results thoroughly and check for remaining issues; quality assurance is essential.
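The validation pass can be automated: check that no empty cells remain, that every row has the same number of columns as the header, and that there are no duplicates. A minimal sketch (the helper name is my own, not part of pyexcel):

```python
def validate_rows(data):
    """Return a list of problem descriptions found in cleaned rows."""
    problems = []
    header = data[0]
    seen = set()
    for i, row in enumerate(data[1:], start=2):  # spreadsheet-style row numbers
        if len(row) != len(header):
            problems.append(f"row {i}: expected {len(header)} columns, got {len(row)}")
        if any(cell == '' or cell is None for cell in row):
            problems.append(f"row {i}: empty cell")
        key = tuple(str(cell) for cell in row)
        if key in seen:
            problems.append(f"row {i}: duplicate row")
        seen.add(key)
    return problems

# Example on already-cleaned rows:
clean = [['Name', 'Age'], ['John Doe', 25], ['Jane Smith', 30]]
print(validate_rows(clean))  # []
```

An empty list means the cleaned data passed all three checks; otherwise each entry points at the offending row.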
For more advanced operations, our Python pyexcel Tutorial: Read Write Excel CSV Files covers additional functionality.
Conclusion
Data cleaning is essential for reliable analysis, and pyexcel handles common data issues effectively.
You learned to handle missing values, remove duplicates, and normalize data formats.
Start with these basic techniques and adapt them to your specific needs; clean data leads to better decisions.
Remember to always back up your original files and test your cleaning scripts thoroughly. Happy data cleaning!