Is there any way to collect an entire RDD[(String, String)] into a single-element RDD[Map[String, String]]?
E.g., for file input.csv:
1,one
2,two
3,three
Code:
val file = sc.textFile("input.csv")
val pairs = file.map(line => { val a = line.split(","); (a(0), a(1)) })
val rddMap = ??? // how do I turn pairs into an RDD[Map[String, String]]?
Desired output (approximate):
val map = rddMap.collect
map: Array[scala.collection.immutable.Map[String,String]] = Array(Map(1 -> one, 2 -> two, 3 -> three))
I tried pairs.collectAsMap, but it returns a plain Map on the driver, not a Map wrapped in an RDD.
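For reference, the attempt in a spark-shell session (the printed result is approximate; collectAsMap makes no ordering guarantee):
val m = pairs.collectAsMap()
// m: scala.collection.Map[String,String] = Map(2 -> two, 1 -> one, 3 -> three)
// a plain Map on the driver, not an RDD[Map[String,String]]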
An RDD[Map[String, String]] like this defeats the purpose: with everything packed into one element you can't take advantage of the parallelism. If the map is small and you really need a Map, take a look at broadcast variables and accumulators (spark.apache.org/docs/latest/…).
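A minimal sketch of both options from that comment, reusing sc and pairs from the question (the variable names introduced here are illustrative, not part of any API):
// Option 1: build the whole Map on the driver, then wrap it in a
// one-element RDD. The data must fit in driver memory, and the single
// element gives no parallelism over the map's contents.
val rddMap = sc.parallelize(Seq(pairs.collectAsMap().toMap))
// rddMap: org.apache.spark.rdd.RDD[Map[String,String]] (one element)

// Option 2: if the map is small and only needed as a lookup table
// inside other transformations, broadcast it instead of making an RDD:
val lookup = sc.broadcast(pairs.collectAsMap().toMap)
val withNames = sc.parallelize(Seq("1", "2", "3"))
  .map(id => (id, lookup.value.getOrElse(id, "?")))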