
While running a simple Spark SQL query

spark.sql("select count(1) from edw.result_base").show()

...I'm getting this error:

py4j.protocol.Py4JJavaError: An error occurred while calling o69.sql. java.lang.AssertionError: assertion failed

The table is stored in parquet format, and the Spark version is 2.4.8.

I tried setting the property below and still got the error.

spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")

The complete stack trace for the error looks like this:

java.lang.AssertionError: assertion failed
      at scala.Predef$.assert(Predef.scala:208)
      at org.apache.spark.sql.hive.HiveMetastoreCatalog.convertToLogicalRelation(HiveMetastoreCatalog.scala:261)
      at org.apache.spark.sql.hive.HiveMetastoreCatalog.convert(HiveMetastoreCatalog.scala:137)
      at org.apache.spark.sql.hive.RelationConversions$$anonfun$apply$4.applyOrElse(HiveStrategies.scala:220)
      at org.apache.spark.sql.hive.RelationConversions$$anonfun$apply$4.applyOrElse(HiveStrategies.scala:207)
      at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$2(AnalysisHelper.scala:108)
      at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
      at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:108)
      at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
      at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown(AnalysisHelper.scala:106)
      at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown$(AnalysisHelper.scala:104)
      at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29)
      at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$4(AnalysisHelper.scala:113)
      at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:376)
      at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:214)
      at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:374)
      at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327)
      at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:113)
      at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
      at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown(AnalysisHelper.scala:106)
      at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown$(AnalysisHelper.scala:104)
      at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29)
      at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators(AnalysisHelper.scala:73)
      at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators$(AnalysisHelper.scala:72)
      at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:29)
      at org.apache.spark.sql.hive.RelationConversions.apply(HiveStrategies.scala:207)
      at org.apache.spark.sql.hive.RelationConversions.apply(HiveStrategies.scala:191)
      at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:130)
      at scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60)
      at scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68)
      at scala.collection.mutable.ArrayBuffer.foldLeft(ArrayBuffer.scala:49)
      at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:127)
      at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:119)
      at scala.collection.immutable.List.foreach(List.scala:392)
      at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:119)
      at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:168)
      at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:162)
      at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:122)
      at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:98)
      at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
      at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:98)
      at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:146)
      at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
      at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:145)
      at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:66)
      at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
      at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:63)
      at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:63)
      at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:55)
  • Hi! We're missing the interesting part of your error. Can you share what the full error message is (including the Java error stack trace if there is one) by editing your question? Also, is edw.result_base a view? A table? Commented Oct 22, 2023 at 9:40
  • @Koedlt It's an external table, and we have some internal managed tables that are getting the same error. Commented Oct 22, 2023 at 16:49
  • @Koedlt I have added the detailed error. Any suggestions please? Commented Oct 22, 2023 at 17:31
  • Ahh damn, I also forgot to ask you: what is your Spark version? Commented Oct 22, 2023 at 19:04

2 Answers


After having looked around a bit, it seems like you're on the right path by looking at that spark.sql.hive.convertMetastoreOrc parameter (see e.g. this and this for similar errors). Since you're using parquet as underlying storage though, it will be the spark.sql.hive.convertMetastoreParquet parameter that you will need.

Try setting spark.sql.hive.convertMetastoreParquet to false instead of true. Setting it to true won't get you very far, since true is already the default value and the default behaviour is what's giving you this error. A quick sketch of how to apply it is below.
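For what it's worth, here is a minimal sketch of applying that setting in PySpark, assuming your existing spark session and that the flag is set before the query runs:

# Fall back to the Hive SerDe path instead of Spark's native Parquet reader
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")

spark.sql("select count(1) from edw.result_base").show()

You can also pass it at launch, e.g. spark-submit --conf spark.sql.hive.convertMetastoreParquet=false.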


Comments

@Koedlt I tried with "false" as well and I'm getting the same error. Is there any other way I can try?
@Koedlt I tried and got this error: py4j.protocol.Py4JJavaError: An error occurred while calling o285.showString java.lang.NoClassDefFoundError: parquet/hadoop/ParquetOutputFormat
It seems like you don't have the necessary parquet-hadoop jar. This is fixable, but maybe it's easier to reinstall Spark. How did you install Spark?
@Koedlt I tried adding the specific jar "parquet-hadoop-bundle-1.6.0.jar", but it's not being picked up by the SparkSession builder. Any suggestions please?
When I run the spark-submit command and explicitly specify --driver-class-path, I am able to get the count. But when I use the "pyspark" shell manually, none of the ways I've tried to set this --driver-class-path inside the SparkSession get picked up. Any suggestions here if possible?

This assertion error indicates that the logical schema of your Hive table edw.result_base doesn't match the physical schema of the underlying parquet files. Here is the relevant code in HiveMetastoreCatalog.scala:

// ...
// The inferred schema may have different field names as the table schema, we should respect
// it, but also respect the exprId in table relation output.
assert(result.output.length == relation.output.length &&
  result.output.zip(relation.output).forall { case (a1, a2) => a1.dataType == a2.dataType })
// ...

You can certainly just disable spark.sql.hive.convertMetastoreParquet, but I think a safer approach would be to figure out where the differences are, because otherwise they may lead to incorrect query results (especially if data types don't match or your table definition has extra columns). One way to compare the two schemas is sketched below.
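This is not from the original answer, just a rough sketch of how you could hunt down the mismatch: it compares the column types registered in the metastore with the schema Spark infers from the files. The parquet path is a placeholder; take the real location from DESCRIBE FORMATTED edw.result_base. Partition columns will show up only on the metastore side, which is expected for a partitioned table.

# Metastore side: column names and types as registered in Hive
hive_cols = {c.name.lower(): c.dataType
             for c in spark.catalog.listColumns("result_base", "edw")}

# File side: schema Spark infers from the parquet files (path is a placeholder)
parquet_cols = {f.name.lower(): f.dataType.simpleString()
                for f in spark.read.parquet("/warehouse/edw.db/result_base").schema}

print("only in metastore:", sorted(set(hive_cols) - set(parquet_cols)))
print("only in parquet  :", sorted(set(parquet_cols) - set(hive_cols)))
for name in sorted(set(hive_cols) & set(parquet_cols)):
    if hive_cols[name] != parquet_cols[name]:
        print("type mismatch on", name, ":", hive_cols[name], "vs", parquet_cols[name])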

Please also note that in the newer Spark versions this assert was refactored to throw more detailed and accurate AnalysisExceptions.

For setting the driver classpath in Jupyter, this SO answer seems to work, although I don't think any extra jars are needed in your case since you're able to infer the schema from parquet just fine. A launch-time sketch is included below for completeness.
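On the driver-classpath point from the comments, here is a hedged sketch (the jar path is just a placeholder): in an interactive pyspark shell the driver JVM is already running before your code executes, so the classpath generally has to be supplied at launch rather than through the session builder.

# At launch: pyspark --driver-class-path /path/to/parquet-hadoop-bundle-1.6.0.jar
# Or, from a plain Python/Jupyter process, before any SparkSession exists:
import os
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--driver-class-path /path/to/parquet-hadoop-bundle-1.6.0.jar pyspark-shell"
)

from pyspark.sql import SparkSession
spark = SparkSession.builder.enableHiveSupport().getOrCreate()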

