Skip Gram (SG) is an unsupervised learning algorithm for obtaining vector representations of words. See the papers listed below for more detail.


The software can be downloaded from the word2vec project page.


./word2vec -train data.txt -output vec.txt -size 200 -window 5 -sample 1e-4 -negative 5 -hs 0 -binary 0 -cbow 0 -iter 3
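With -binary 0, the output file vec.txt is plain text: the first line holds the vocabulary size and vector dimensionality, and each following line holds a word and its floats. A minimal sketch (not part of the word2vec distribution; the sample vectors below are made up) of parsing this format and comparing two words by cosine similarity:

```python
import math

# Illustrative stand-in for the contents of vec.txt (-binary 0 format):
# header line "<vocab_size> <vector_size>", then one "word f1 f2 ..." per line.
sample = """3 4
king 0.1 0.2 0.3 0.4
queen 0.1 0.2 0.3 0.5
apple -0.9 0.1 0.0 0.2
"""

def parse_vectors(text):
    lines = text.strip().split("\n")
    vocab_size, dim = map(int, lines[0].split())
    vectors = {}
    for line in lines[1:]:
        parts = line.split()
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    assert len(vectors) == vocab_size
    return vectors, dim

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

vecs, dim = parse_vectors(sample)
print(cosine(vecs["king"], vecs["queen"]))  # near 1.0 for these made-up vectors
print(cosine(vecs["king"], vecs["apple"]))
```

With vectors trained on real text, nearby words in meaning tend to have high cosine similarity.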

Parameters for training:

    -train <file>
         Use text data from <file> to train the model

    -output <file>
         Use <file> to save the resulting word vectors / word clusters

    -size <int>
         Set size of word vectors; default is 100

    -window <int>
         Set max skip length between words; default is 5

    -sample <float>
         Set threshold for occurrence of words. Those that appear with 
         higher frequency in the training data will be randomly 
         down-sampled; default is 1e-3, useful range is (0, 1e-5)

    -hs <int>
         Use Hierarchical Softmax; default is 0 (not used)

    -negative <int>
         Number of negative examples; default is 5, common values 
         are 3-10 (0 = not used)

    -threads <int>
         Use <int> threads (default 12)

    -iter <int>
         Run more training iterations (default 5)

    -min-count <int>
         This will discard words that appear less than <int> times; 
         default is 5

    -alpha <float>
         Set the starting learning rate; default is 0.025 for skip-gram and 
         0.05 for CBOW

    -classes <int>
         Output word classes rather than word vectors; default number of 
         classes is 0 (vectors are written)

    -debug <int>
         Set the debug mode (default = 2 = more info during training)

    -binary <int>
         Save the resulting vectors in binary mode; default is 0 (off)

    -save-vocab <file>
         The vocabulary will be saved to <file>

    -read-vocab <file>
         The vocabulary will be read from <file>, not constructed from the 
         training data

    -cbow <int>
         Use the continuous bag of words model; default is 1 (use 0 for 
         skip-gram model)
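The -sample threshold drives the subsampling of frequent words. A sketch of the rule as given in Mikolov et al. (2013b): a word w with relative corpus frequency f(w) is discarded with probability 1 - sqrt(t / f(w)), where t is the -sample value (e.g. 1e-4 in the command above). Note the C implementation uses a slightly different ranking formula; the frequencies below are illustrative, not from any real corpus.

```python
import math

def discard_prob(freq, t=1e-4):
    """Probability of dropping a word occurrence under the paper's
    subsampling rule, given its relative frequency and threshold t."""
    if freq <= t:
        return 0.0  # infrequent words are always kept
    return 1.0 - math.sqrt(t / freq)

print(discard_prob(0.05))   # a very frequent word (e.g. "the"): mostly discarded
print(discard_prob(1e-5))   # a rare word: never discarded
```

This aggressively thins out very frequent words, which speeds up training and, per the paper, improves the vectors of rarer words.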


  • Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In Proceedings of Workshop of ICLR.

  • Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.
