
I have been looking for a way to convert an Excel file with multiple header rows into columns using the pandas library.

I have successfully imported the data into a DataFrame by reading the file and parsing it with ExcelFile. I have also been able to identify the headers using header=[0, 4]. Where I run into issues is reindexing and/or using the melt function to convert the headers into columns.

When I use the melt function, I can convert the columns into rows. However, I want each header to become its own column rather than be stacked with the rest of the data.

Currently, this is how the data is structured:

Excel file displaying data with multiple headers

After the conversion, the data should look like this:

Data that is unpivot with headers converted into columns

I have been reading about indexing, but I am not sure I understand how it would apply here.

I'm new to Python, like really new, and any support or direction is greatly appreciated. I have been reading the following cheatsheets but haven't found the right way to do the conversion:

https://www.datacamp.com/community/data-science-cheatsheets

Here is some sample code:

import pandas as pd

xl = pd.ExcelFile('help.xlsx')
df1 = xl.parse('Sheet1')

df2 = pd.melt(df1,
          id_vars=['PW'],
          value_vars=['Fruit','Conventional'])

Here are the results after running the code. df1, the data with multiple headers:

The following is the error with the data (headers are not converted into columns, headers are stacked with the rest of the data):

after using pandas melt the headers are stacked with the data and not converted into their own column

This is how the final product should look:

Headers converted into columns

Do you have any code that shows the state of your problem thus far? Commented Jan 6, 2018 at 0:21

2 Answers


Try this:

import pandas as pd

# Read the file: first column as the row index, first four rows as header levels
df = pd.read_excel('help.xlsx', index_col=0, header=[0, 1, 2, 3])
# Join the four header levels into one comma-separated column name
df.columns = df.columns.map(','.join)
# Name the index 'PW' and turn it into a regular column
df = df.rename_axis('PW').reset_index()
# Unpivot: one row per (PW, original column), with the values in 'Volume'
df2 = pd.melt(df, id_vars='PW', value_vars=list(df.columns)[1:], value_name='Volume')
# Split the joined header back into one column per level, using ',' as the delimiter
df2[['Category', 'Field_Type', 'Growing_Method', 'Product']] = df2['variable'].str.split(',', expand=True)
# Drop the now-redundant 'variable' column
df2 = df2.drop(columns='variable')
# Reorder the columns so that 'Volume' comes last
cols = df2.columns.tolist()
df2 = df2[[cols[0]] + cols[2:] + [cols[1]]]
df2
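
As a possible shortcut, melt can also operate on the MultiIndex columns directly, which avoids joining and re-splitting the header names. The sketch below is only an illustration under assumptions: it builds a small hypothetical DataFrame in memory (the product names and numbers are made up) in place of help.xlsx, and it relies on melt naming the variable columns after the column level names and on ignore_index=False (pandas 1.1 or later) to keep the PW index.

```python
import pandas as pd

# Hypothetical stand-in for the spreadsheet: made-up header values and volumes
header = pd.MultiIndex.from_tuples(
    [
        ("Fruit", "Field", "Conventional", "Apple"),
        ("Fruit", "Field", "Organic", "Apple"),
        ("Vegetable", "Greenhouse", "Conventional", "Tomato"),
    ],
    names=["Category", "Field_Type", "Growing_Method", "Product"],
)
wide_df = pd.DataFrame(
    [[10, 20, 30], [11, 21, 31]],
    index=pd.Index([1, 2], name="PW"),
    columns=header,
)

# Each named column level becomes its own column; the PW index is kept
# (ignore_index=False) and then moved into a regular column
long_df = wide_df.melt(value_name="Volume", ignore_index=False).reset_index()
print(long_df)
```

With a real file, wide_df would instead come from pd.read_excel with index_col=0 and header=[0, 1, 2, 3], followed by assigning the level names.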

2 Comments

When I run this code the 'PW' data values are absent from the result dataframe. I was under the impression that the question author wants to have this data included though.
It was because the melt method reset the index. I fixed it!

One way to accomplish this type of reshaping is with the stack operation of pandas:

import pandas as pd

# Read excel file. Use first column as row index, and use first four rows as
# column index levels
df = pd.read_excel('test.xlsx', index_col=0, header=[0, 1, 2, 3])

# Assign names to row index and column index levels
df.index.name = 'PW'
df.columns.names = ['Category', 'Field_Type', 'Growing_Method', 'Product']

# Convert all column index levels into row index levels
s = df.stack([0, 1, 2, 3])

# Assign name to the single data values column
s.name = 'Volume'
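
Note that stack returns a Series whose index holds PW plus the four former header levels, so one more step is needed to get a flat table with one column per header. A minimal sketch of that step, using a small hypothetical DataFrame built in memory (the header values and numbers are made up) instead of test.xlsx:

```python
import pandas as pd

# Hypothetical stand-in for the spreadsheet: made-up header values and volumes
columns = pd.MultiIndex.from_tuples(
    [
        ("Fruit", "Field", "Conventional", "Apple"),
        ("Fruit", "Field", "Organic", "Apple"),
        ("Vegetable", "Greenhouse", "Conventional", "Tomato"),
    ],
    names=["Category", "Field_Type", "Growing_Method", "Product"],
)
df = pd.DataFrame(
    [[10, 20, 30], [11, 21, 31]],
    index=pd.Index([1, 2], name="PW"),
    columns=columns,
)

# Move all four column levels into the row index; the result is a Series
s = df.stack([0, 1, 2, 3])
s.name = "Volume"

# Turn every index level (PW and the four header levels) into a regular column
result = s.reset_index()
print(result)
```

Depending on the pandas version, stack may emit a FutureWarning about its new implementation; the reshaping itself is the same for data without missing values.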
