Sorry for the noob question, but I've been stuck for hours on this problem:
If I type:
df['avg_wind_speed_9am'].head()
It returns:
TypeError                                 Traceback (most recent call last)
<ipython-input-42-c01967246c17> in <module>()
----> 1 df['avg_wind_speed_9am'].head()

TypeError: 'Column' object is not callable
And if I type:
df[['avg_wind_speed_9am']].head()
It returns:
Row(avg_wind_speed_9am=2.080354199999768)
I don't understand: I expected it to print the column, not a single Row.
Here is how I loaded the dataframe:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.load('file:///home/cloudera/Downloads/big-data-4/daily_weather.csv', format='com.databricks.spark.csv', header='true', inferSchema='true')
Here is what my dataset looks like:
number,air_pressure_9am,air_temp_9am,avg_wind_direction_9am,avg_wind_speed_9am,max_wind_direction_9am,max_wind_speed_9am,rain_accumulation_9am,rain_duration_9am,relative_humidity_9am,relative_humidity_3pm
0,918.0600000000087,74.82200000000041,271.1,2.080354199999768,295.39999999999986,2.863283199999908,0.0,0.0,42.42000000000046,36.160000000000494
1,917.3476881177097,71.40384263106537,101.93517935618371,2.4430092157340217,140.47154847112498,3.5333236016106238,0.0,0.0,24.328697291802207,19.4265967985621
This is pyspark, not pandas. In pyspark, df['avg_wind_speed_9am'] is a Column expression rather than a series of data, so it has no .head() method. Use df.select('avg_wind_speed_9am').head() to keep it more conventional.