[SPARK-11813][MLLIB] Avoid serialization of vocab in Word2Vec

jira: https://issues.apache.org/jira/browse/SPARK-11813

I found the problem while training on a large corpus. Avoiding serialization of vocab in Word2Vec has 2 benefits:
1. Performance improvement from less serialization.
2. A large increase in the capacity of Word2Vec.

Currently, in the fit of Word2Vec, the closure mainly includes the serialized Word2Vec instance and 2 global tables.
The main part of Word2Vec is the vocab, of size: vocab * 40 * 2 * 4 = 320 * vocab bytes.
The 2 global tables: vocab * vectorSize * 8. If vectorSize = 20, that's 160 * vocab bytes.

Their sum cannot exceed Int.max due to the restriction of ByteArrayOutputStream. In any case, avoiding serialization of vocab decreases the size of the serialized closure, especially when vectorSize is small, and thus allows a larger vocabulary.

There is another possible fix: make local copies of the fields to avoid including Word2Vec in the closure. Let me know if that's preferred.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #9803 from hhbyyh/w2vVocab.

(cherry picked from commit e391abdf2cb6098a35347bd123b815ee9ac5b689)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
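
The size estimate in the message above can be checked with a short sketch. It assumes the per-word figures quoted in the message (320 bytes for the vocab entry, vectorSize * 8 bytes for the two global tables) and ignores all other closure overhead. The object and method names are hypothetical and are not part of Spark.

```scala
// Illustrative arithmetic only; names here are hypothetical, not Spark APIs.
object ClosureSizeEstimate {
  // Per-word vocab cost quoted in the commit message: 40 * 2 * 4 = 320 bytes.
  val vocabBytesPerWord: Long = 40L * 2 * 4

  // Per-word cost of the two global tables: vectorSize * 8 bytes.
  def tableBytesPerWord(vectorSize: Int): Long = vectorSize.toLong * 8

  // Largest vocabulary whose estimated closure size still fits in Int.MaxValue bytes.
  def maxVocab(bytesPerWord: Long): Long = Int.MaxValue.toLong / bytesPerWord

  def main(args: Array[String]): Unit = {
    val vectorSize = 20
    val tables = tableBytesPerWord(vectorSize)
    println(s"vocab serialized in closure: ${maxVocab(vocabBytesPerWord + tables)} words")
    println(s"vocab excluded from closure: ${maxVocab(tables)} words")
  }
}
```

Under these assumptions and with vectorSize = 20, the limit moves from roughly 4.47 million words to roughly 13.4 million words, about a threefold increase.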

commit: 307f27e24e17afd92030194a3e6fec312fc19f4f
author: Yuhao Yang <hhbyyh@gmail.com> Wed Nov 18 13:25:15 2015 -0800
committer: Xiangrui Meng <meng@databricks.com> Wed Nov 18 13:26:18 2015 -0800
tree: b64eba9ca7a74188beebef8839387f351a536d19
parent: 4b6e24e25e69169fdb5aab7cdcba5b32d9789c93
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala b/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala
index 7960f3c..d983dd3 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala

@@ -127,8 +127,8 @@
 
   private var trainWordsCount = 0
   private var vocabSize = 0
-  private var vocab: Array[VocabWord] = null
-  private var vocabHash = mutable.HashMap.empty[String, Int]
+  @transient private var vocab: Array[VocabWord] = null
+  @transient private var vocabHash = mutable.HashMap.empty[String, Int]
 
   private def learnVocab(words: RDD[String]): Unit = {
     vocab = words.map(w => (w, 1))
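
The effect of marking a field `@transient` can be seen in isolation with plain Java serialization. The sketch below is not Spark code: `WithVocab`, `WithTransientVocab`, and `TransientDemo` are hypothetical stand-ins, and an `Array[Int]` takes the place of the real `Array[VocabWord]`.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Hypothetical stand-in: the array field is written out with the object.
class WithVocab(n: Int) extends Serializable {
  private var vocab: Array[Int] = new Array[Int](n)
  def size: Int = vocab.length
}

// Same shape, but the field is skipped by Java serialization.
class WithTransientVocab(n: Int) extends Serializable {
  @transient private var vocab: Array[Int] = new Array[Int](n)
  def size: Int = vocab.length
}

object TransientDemo {
  // Serialize an object in memory and report how many bytes were written.
  def serializedSize(obj: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    try out.writeObject(obj) finally out.close()
    bytes.size()
  }

  def main(args: Array[String]): Unit = {
    println(s"plain field:     ${serializedSize(new WithVocab(100000))} bytes")
    println(s"transient field: ${serializedSize(new WithTransientVocab(100000))} bytes")
  }
}
```

The plain version writes the whole array (over 400,000 bytes for 100,000 ints), while the transient version writes only the class descriptor. A transient field is null on the deserialized copy, so this approach is safe only if code running on that copy never reads the field directly.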