How does foldLeft in Scala work on DataFrame?

Consider a simplified foldLeft example that mirrors your DataFrame version:

List(3, 2, 1).foldLeft("abcde")((acc, x) => acc.take(x))

If you look closely at what the (acc, x) => acc.take(x) function does in each iteration, the foldLeft is no different from the following:

"abcde".take(3).take(2).take(1)
// Result: "a"
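One way to watch the accumulator change from step to step is scanLeft, which takes the same arguments as foldLeft but returns every intermediate value instead of only the last one. The following is a small sketch; the object name FoldTrace is only for illustration.

```scala
object FoldTrace {
  // scanLeft keeps the initial value plus the result of every step.
  val steps: List[String] =
    List(3, 2, 1).scanLeft("abcde")((acc, x) => acc.take(x))

  // foldLeft returns only the final accumulator.
  val folded: String =
    List(3, 2, 1).foldLeft("abcde")((acc, x) => acc.take(x))

  def main(args: Array[String]): Unit = {
    steps.foreach(println)          // abcde, abc, ab, a (one per line)
    println(folded == steps.last)   // true
  }
}
```

The last element of the scanLeft result is always the value foldLeft returns, so the two calls describe the same computation.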

Going back to the foldLeft for your DataFrame:

stringColumns.foldLeft(yearDF){ (tempdf, colName) =>
  tempdf.withColumn(colName, regexp_replace(col(colName), "\n", ""))
}

Similarly, it is no different from:

val sz = stringColumns.size

yearDF.
  withColumn(stringColumns(0), regexp_replace(col(stringColumns(0)), "\n", "")).
  withColumn(stringColumns(1), regexp_replace(col(stringColumns(1)), "\n", "")).
  ...
  withColumn(stringColumns(sz - 1), regexp_replace(col(stringColumns(sz - 1)), "\n", ""))

  1. What value does tempDF hold? If it is the same as yearDF, how is it mapped to yearDF?

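In iteration i (i = 0, 1, 2, ...), tempDF (tempdf in the code above) holds the DataFrame produced by the previous iteration; in the first iteration it is yearDF itself, because yearDF is the initial value passed to foldLeft. Applying withColumn(stringColumns(i), ...) to it returns a new DataFrame, which becomes tempDF in the next iteration. yearDF itself is never modified.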
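The same threading of the accumulator can be illustrated without Spark. The sketch below is only an analogy: an immutable Map stands in for one row of yearDF, and Map.updated stands in for withColumn. The object name, the sample data, and the column names are all made up for illustration.

```scala
object AccumulatorSketch {
  // Hypothetical stand-in for one row of yearDF: column name -> value.
  val row: Map[String, String] =
    Map("title" -> "a\nb", "note" -> "c\nd", "year" -> "2019")

  // Hypothetical list of string columns to clean.
  val stringColumns: Seq[String] = Seq("title", "note")

  // steps(i) is the value the function receives as `temp` in iteration i;
  // steps.last is what foldLeft with the same arguments would return.
  val steps: Seq[Map[String, String]] =
    stringColumns.scanLeft(row) { (temp, colName) =>
      // updated returns a new Map; `temp` and `row` are left untouched.
      temp.updated(colName, temp(colName).replace("\n", ""))
    }

  def main(args: Array[String]): Unit =
    steps.zipWithIndex.foreach { case (m, i) => println(s"step $i: $m") }
}
```

The first element of steps is row itself, each later element differs from its predecessor in exactly one key, and row still contains its newlines afterwards.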

  2. If withColumn is used in the function and the result is added to yearDF, how come it is not creating duplicate columns when

withColumn returns a new DataFrame. When the name passed to it matches an existing column, as it does in withColumn(stringColumns(i), regexp_replace(col(stringColumns(i)), "\n", "")), that column is replaced rather than added a second time. The result is therefore a new DataFrame with the same column list as the original yearDF.
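Both points can be checked with a small local Spark job. This is a minimal sketch, assuming Spark SQL is on the classpath; the sample row, the column names (year, title, note), and the object and helper names are hypothetical and not taken from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, regexp_replace}

object FoldLeftColumns {
  // Same fold as in the answer, wrapped in a helper (hypothetical name).
  def stripNewlines(df: DataFrame, cols: Seq[String]): DataFrame =
    cols.foldLeft(df) { (tempdf, colName) =>
      tempdf.withColumn(colName, regexp_replace(col(colName), "\n", ""))
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("foldLeft-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical data standing in for yearDF.
    val yearDF = Seq((2019, "a\nb", "c\nd")).toDF("year", "title", "note")
    val stringColumns = Seq("title", "note")

    val cleaned = stripNewlines(yearDF, stringColumns)

    // Same columns, in the same order: nothing was duplicated.
    assert(cleaned.columns.sameElements(yearDF.columns))
    // The newlines are gone in the result ...
    assert(cleaned.first().getString(1) == "ab")
    assert(cleaned.first().getString(2) == "cd")
    // ... while yearDF still holds the original values.
    assert(yearDF.first().getString(1) == "a\nb")

    spark.stop()
  }
}
```

If withColumn were given a name that does not exist yet, it would append a new column instead; the column list stays fixed here only because each name already exists in yearDF.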
