When I write Spark programs, I often deal with transformation functions. These functions need to be serialized to enable parallelization. Scala uses lambda closures as syntactic sugar for writing transformation functions. But it is really frustrating when a closure cannot be serialized, which makes the Spark program fail at runtime.
Example code snippet
To figure out what a Scala closure really is, take the following code snippet as an example.
Methods to dig into the secret
There are a few things worth knowing when debugging a Scala program. Sometimes we need to see the Java-level code it is transformed into, or even the bytecode it is compiled to.
source code -> class:
source code -> class (with syntax tree parsed info printed):
class -> decompiled signature:
class -> byte code:
Parsed syntax tree reveals the truth
The syntax tree of the example class is as follows:
So a closure is transformed into an inner class with an apply method. If the closure references any member variable of the outer class, the outer class instance is passed into the closure's inner class when it is constructed, so the closure is serializable only if the outer class is serializable. If, on the other hand, the closure depends only on local variables, then only those variables are passed in at construction. So one way to avoid serializing the whole outer instance is to copy the member variables the closure needs into local variables first.
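The capture rule above can be checked with plain Java serialization, without Spark. The following is a minimal sketch, not the original example from this post: the class Multiplier and the helper isSerializable are hypothetical names made up for illustration. On Scala 2.12 and later, closures are compiled to JVM lambdas rather than anonymous inner classes, but the capture rule should be the same.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object ClosureCaptureDemo {
  // Hypothetical outer class; it deliberately does NOT extend Serializable.
  class Multiplier(val factor: Int) {
    // References the member `factor`, so the closure captures `this`.
    def viaMember: Int => Int = x => x * factor

    // Copies the member into a local val first, so only an Int is captured.
    def viaLocal: Int => Int = {
      val f = factor
      x => x * f
    }
  }

  // Returns true if Java serialization of `obj` succeeds,
  // false if something in its object graph is not serializable.
  def isSerializable(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(obj)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    val m = new Multiplier(3)
    println(s"viaMember serializable: ${isSerializable(m.viaMember)}")
    println(s"viaLocal serializable:  ${isSerializable(m.viaLocal)}")
  }
}
```

Under these assumptions, viaMember should fail to serialize because it drags the whole Multiplier instance along, while viaLocal should succeed because it carries only the copied Int.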
What if a closure depends on some variables of its outer object (a Scala object, that is, a singleton, rather than a class instance)? See:
It is transformed to:
The outer object is not passed in. The closure accesses the object’s member variable directly.
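A small sketch of the singleton case, again using plain Java serialization. The names Settings and ObjectMemberDemo are hypothetical, and for simplicity the closure here reads a member of a separate top-level object rather than being defined inside that object.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical singleton; it does not extend Serializable.
object Settings {
  val factor: Int = 3
}

object ObjectMemberDemo {
  // The closure reads Settings.factor through the singleton's static
  // accessor, so no reference to Settings is stored in the closure.
  def scale: Int => Int = x => x * Settings.factor

  // Returns true if Java serialization of `obj` succeeds.
  def isSerializable(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(obj)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit =
    println(s"scale serializable: ${isSerializable(scale)}")
}
```

Because nothing from Settings is captured, the closure should serialize even though Settings itself is not Serializable. The flip side is that the value of factor is not shipped with the closure: each JVM that runs it initializes its own copy of Settings.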