HDC

Description

The Hierarchical Document Context (HDC) model is an unsupervised learning algorithm for obtaining vector representations of words.

In this model, the document is used to predict a target word, and the target word is further used to predict its surrounding context words.

More details can be found in the paper.
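
The two-level prediction scheme described above can be illustrated with a small Python sketch. It is not the released implementation: the function name `hdc_training_pairs`, the tuple layout, and the fixed symmetric window are assumptions made for illustration only, and negative sampling and subsampling are left out.

```python
def hdc_training_pairs(doc_id, tokens, window=5):
    """List (input, predicted) pairs for one document.

    Illustrative only: a fixed symmetric window is assumed here,
    and no negative sampling or subsampling is applied.
    """
    pairs = []
    for i, target in enumerate(tokens):
        # Level 1: the document is used to predict the target word.
        pairs.append((("doc", doc_id), ("word", target)))
        # Level 2: the target word is used to predict each context word
        # within `window` positions on either side.
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((("word", target), ("word", tokens[j])))
    return pairs


pairs = hdc_training_pairs(0, ["the", "cat", "sat"], window=1)
for source, predicted in pairs:
    print(source, "->", predicted)
```

Each token contributes one document-to-word pair plus one word-to-word pair per neighbor in its window, which is what makes the structure hierarchical: document, then target word, then context.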

Software

The software can be downloaded from this page.


Usage

./w2v -train data.txt -word_output vec.txt -size 200 -window 5 -subsample 1e-4 -negative 5 -model pdc -binary 0 -iter 5

 -train, the input file of the corpus, each line a document;
 -word_output, the output file of the word embeddings;
 -binary, whether to save the output file in binary mode; the default is 0 (off);
 -word_size, the dimension of word embeddings; the default is 100;
 -doc_size, the dimension of document embeddings; the default is 100;
 -window, max skip length between words; default is 5;
 -negative, the number of negative samples used in negative sampling; the default is 5;
 -subsample, parameter for subsampling; default is 1e-4;
 -threads, the total number of threads used; the default is 1;
 -alpha, the starting learning rate; the default is 0.025 for HDC and 0.05 for PDC;
 -model, the model used to learn the word embeddings; the default is pdc (Parallel Document Context model); use hdc for the Hierarchical Document Context model;
 -min-count, the threshold for occurrence of words; default is 5;
 -iter, the number of iterations; default is 5;
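
Since -train expects one document per line, a corpus whose documents span several lines has to be flattened first. The following is a minimal sketch of that step, assuming whitespace-separated tokens and UTF-8 text; the helper name `write_corpus` and the sample documents are hypothetical and not part of the software.

```python
import os
import tempfile


def write_corpus(documents, path):
    """Write each document on its own line, as the -train option expects.

    Assumption: tokens are separated by whitespace, so collapsing runs of
    whitespace (including newlines) does not change the token sequence.
    """
    with open(path, "w", encoding="utf-8") as f:
        for doc in documents:
            # str.split() with no argument splits on any whitespace run,
            # which removes line breaks inside a document.
            f.write(" ".join(doc.split()) + "\n")


docs = ["First document,\nspread over two lines.", "Second   document."]
corpus_path = os.path.join(tempfile.mkdtemp(), "data.txt")
write_corpus(docs, corpus_path)

with open(corpus_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```

The resulting data.txt can then be passed to -train as in the command above.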

Reference

  • Fei Sun, Jiafeng Guo, Yanyan Lan, Jun Xu and Xueqi Cheng. Learning Word Representations by Jointly Modeling Syntagmatic and Paradigmatic Relations. The 53rd Annual Meeting of the Association for Computational Linguistics (ACL 2015).

Download

WordRep-master.zip