First, a disclaimer: if you have any choice in the matter, don't do this kind of thing on Windows; I am only doing it here to tinker.
If you just want the code, I have already uploaded it to GitHub; it can be downloaded here:
word2vec_win32
Build tool: VS2013
The concrete steps are as follows:
1. Download the source from Google Code: https://code.google.com/p/word2vec/
2. Create a VS2013 solution based on the makefile
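For reference, the upstream makefile does nothing more than compile each .c file into its own console program, so the VS2013 solution can mirror it with five console-application projects (a sketch assuming the standard word2vec source layout):
word2vec.c          -> word2vec.exe
word2phrase.c       -> word2phrase.exe
distance.c          -> distance.exe
word-analogy.c      -> word-analogy.exe
compute-accuracy.c  -> compute-accuracy.exe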
3. Adjust the source so that it compiles
3.1. Add the following macro definition to every .c file
#define _CRT_SECURE_NO_WARNINGS
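A note on placement (a sketch; alternatively set it once under Project Properties -> C/C++ -> Preprocessor -> Preprocessor Definitions): the define only takes effect if it appears before the CRT headers, e.g.
/* must come before the first CRT header, otherwise the C4996
   "unsafe function" warnings for fopen/strcpy/etc. are still emitted */
#define _CRT_SECURE_NO_WARNINGS
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>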
3.2. Change some of the const constants to #define, for example
#define MAX_STRING 100
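The reason for this change: the VS2013 C compiler has no C99 variable-length arrays, so a const variable used as an array size (as in distance.c and the other query tools) does not compile, while a preprocessor constant does. A before/after sketch based on distance.c:
/* original (fine for gcc, rejected by the VS2013 C compiler):
   const long long max_size = 2000;
   char st1[max_size];                 */
/* after the change: */
#define max_size 2000
char st1[max_size];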
3.3. Replace the posix_memalign function with _aligned_malloc
#define posix_memalign(p, a, s) (((*(p)) = _aligned_malloc((s), (a))), *(p) ? 0 : errno)
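A minimal self-contained sketch of what this macro does to word2vec's allocation call (the vocab_size and layer1_size values mirror the training run below; note that _aligned_malloc takes size first, then alignment). One caveat: memory from _aligned_malloc must be released with _aligned_free, not free, so if the code frees such a buffer anywhere, adjust that call as well.
#include <stdio.h>
#include <errno.h>
#include <malloc.h>   /* _aligned_malloc / _aligned_free on MSVC */

int main(void) {
  long long vocab_size = 71291, layer1_size = 200;   /* sizes from the text8 run below */
  float *syn0 = NULL;
  int a;
  /* roughly what the macro turns word2vec's posix_memalign() call into */
  a = ((syn0 = (float *)_aligned_malloc((size_t)(vocab_size * layer1_size) * sizeof(float), 128)) ? 0 : errno);
  if (a != 0) { printf("Memory allocation failed\n"); return 1; }
  /* ... training would use syn0 here ... */
  _aligned_free(syn0);   /* NOT free(): _aligned_malloc memory needs _aligned_free */
  return 0;
}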
3.4. Download pthreads-win32, the pthread library for Windows, and adjust the include and link settings
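In concrete terms, the include and link settings amount to something like the following (a sketch; the directory layout and library name depend on which pthreads-win32 package you download, pthreadVC2 being the usual MSVC build):
C/C++ -> General -> Additional Include Directories:   <pthreads-win32>\include
Linker -> General -> Additional Library Directories:  <pthreads-win32>\lib\x86  (or \x64)
Linker -> Input -> Additional Dependencies:           pthreadVC2.lib
Also copy pthreadVC2.dll next to the generated .exe files (or put it on PATH).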
3.5. The build now succeeds
4. The executables
word2vec: turns words into vectors, or clusters them
word2phrase: merges words into phrases, used as a preprocessing step; it can be run repeatedly (one pass produces 2-word phrases, two passes produce up to 4-word phrases; see the example after this list)
compute-accuracy: evaluates the accuracy of a model
distance: given a word A, returns the most similar words (A => ?)
word-analogy: given three words A, B and C, returns the analogy answer (if A => B, then C => ?)
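As an example of the repeated word2phrase pass mentioned above, a sketch of a phrase pipeline using the text8 corpus from section 5.1 below (the output file names are arbitrary and the -threshold values are illustrative; a higher threshold produces fewer phrases):
>word2phrase -train text8 -output text8-phrase -threshold 200 -debug 2
>word2phrase -train text8-phrase -output text8-phrase2 -threshold 100 -debug 2
>word2vec -train text8-phrase2 -output vectors-phrase.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15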
5. Testing
5.1. Download the test corpus
http://mattmahoney.net/dc/text8.zip
5.2. Train a model
>word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15
Starting training using file text8
Vocab size: 71291
Words in train file: 16718843
Alpha: 0.000005  Progress: 100.10%  Words/thread/sec: 13.74k
5.3. Check model accuracy (the second argument, 30000, restricts the test to the 30,000 most frequent words)
>compute-accuracy vectors.bin 30000 < questions-words.txt
capital-common-countries:
ACCURACY TOP1: 80.83 %  (409 / 506)
Total accuracy: 80.83 %   Semantic accuracy: 80.83 %   Syntactic accuracy: -1.#J %
capital-world:
ACCURACY TOP1: 62.65 %  (884 / 1411)
Total accuracy: 67.45 %   Semantic accuracy: 67.45 %   Syntactic accuracy: -1.#J %
currency:
ACCURACY TOP1: 23.13 %  (62 / 268)
Total accuracy: 62.01 %   Semantic accuracy: 62.01 %   Syntactic accuracy: -1.#J %
city-in-state:
ACCURACY TOP1: 46.85 %  (736 / 1571)
Total accuracy: 55.67 %   Semantic accuracy: 55.67 %   Syntactic accuracy: -1.#J %
family:
ACCURACY TOP1: 77.45 %  (237 / 306)
Total accuracy: 57.31 %   Semantic accuracy: 57.31 %   Syntactic accuracy: -1.#J %
gram1-adjective-to-adverb:
ACCURACY TOP1: 19.44 %  (147 / 756)
Total accuracy: 51.37 %   Semantic accuracy: 57.31 %   Syntactic accuracy: 19.44 %
gram2-opposite:
ACCURACY TOP1: 24.18 %  (74 / 306)
Total accuracy: 49.75 %   Semantic accuracy: 57.31 %   Syntactic accuracy: 20.81 %
gram3-comparative:
ACCURACY TOP1: 64.92 %  (818 / 1260)
Total accuracy: 52.74 %   Semantic accuracy: 57.31 %   Syntactic accuracy: 44.75 %
gram4-superlative:
ACCURACY TOP1: 39.53 %  (200 / 506)
Total accuracy: 51.77 %   Semantic accuracy: 57.31 %   Syntactic accuracy: 43.81 %
gram5-present-participle:
ACCURACY TOP1: 40.32 %  (400 / 992)
Total accuracy: 50.33 %   Semantic accuracy: 57.31 %   Syntactic accuracy: 42.91 %
gram6-nationality-adjective:
ACCURACY TOP1: 84.46 %  (1158 / 1371)
Total accuracy: 55.39 %   Semantic accuracy: 57.31 %   Syntactic accuracy: 53.88 %
gram7-past-tense:
ACCURACY TOP1: 39.79 %  (530 / 1332)
Total accuracy: 53.42 %   Semantic accuracy: 57.31 %   Syntactic accuracy: 51.00 %
gram8-plural:
ACCURACY TOP1: 61.39 %  (609 / 992)
Total accuracy: 54.11 %   Semantic accuracy: 57.31 %   Syntactic accuracy: 52.38 %
gram9-plural-verbs:
ACCURACY TOP1: 33.38 %  (217 / 650)
Total accuracy: 53.01 %   Semantic accuracy: 57.31 %   Syntactic accuracy: 50.86 %
Questions seen / total: 12227 19544   62.56 %
(The "-1.#J %" values are simply how the MSVC runtime prints NaN: in the purely semantic categories no syntactic question has been evaluated yet, so syntactic accuracy is 0/0.)
5.4. Find the words most closely related to a given word
>distance vectors.bin
Enter word or sentence (EXIT to break): china

Word: china  Position in vocabulary: 486

Word              Cosine distance
------------------------------------------------------------------------
taiwan            0.649276
japan             0.624836
hainan            0.567946
kalmykia          0.562871
tibet             0.562600
prc               0.553833
tuva              0.553255
korea             0.552685
chinese           0.545661
xiamen            0.542703
liao              0.542607
jiang             0.540888
manchuria         0.540783
wuhan             0.537735
yunnan            0.535809
hunan             0.535770
hangzhou          0.524340
yong              0.523802
sichuan           0.517254
guangdong         0.514874
liang             0.511881
jin               0.511389
india             0.508853
xinjiang          0.505971
taiwanese         0.503072
qing              0.502909
shanghai          0.502771
shandong          0.499169
jiangxi           0.495940
nanjing           0.492893
guangzhou         0.492788
zhao              0.490396
shenzhen          0.489658
singapore         0.489428
hubei             0.488228
harbin            0.488112
liaoning          0.484283
zhejiang          0.484192
joseon            0.483718
mongolia          0.481411

Enter word or sentence (EXIT to break):
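For reference, the "Cosine distance" column is the cosine similarity between the query word's vector and every other word's vector (distance.c normalizes all vectors once after loading, so the ranking then reduces to a dot product). A minimal self-contained sketch of the computation, using toy 3-dimensional vectors rather than real model output:
#include <stdio.h>
#include <math.h>

/* cosine similarity between two vectors of the given dimensionality */
static float cosine(const float *a, const float *b, int size) {
  float dot = 0.0f, na = 0.0f, nb = 0.0f;
  int i;
  for (i = 0; i < size; i++) {
    dot += a[i] * b[i];
    na  += a[i] * a[i];
    nb  += b[i] * b[i];
  }
  return dot / (float)(sqrt(na) * sqrt(nb));
}

int main(void) {
  /* toy vectors for illustration only; the model above uses 200 dimensions */
  float china[3]  = {0.9f, 0.1f, 0.3f};
  float taiwan[3] = {0.8f, 0.2f, 0.4f};
  printf("%f\n", cosine(china, taiwan, 3));   /* values near 1.0 mean "similar" */
  return 0;
}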
5.5. Given A => B, find C => ?
>word-analogy vectors.bin
Enter three words (EXIT to break): china beijing canada

Word: china  Position in vocabulary: 486
Word: beijing  Position in vocabulary: 3880
Word: canada  Position in vocabulary: 474

Word              Distance
------------------------------------------------------------------------
toronto           0.624131
montreal          0.559667
mcgill            0.519338
calgary           0.518366
ryerson           0.515524
ottawa            0.515316
alberta           0.509334
edmonton          0.498436
moncton           0.488861
quebec            0.487712
canadian          0.475655
saskatchewan      0.460744
fredericton       0.460354
ontario           0.458213
montrealers       0.435611
vancouver         0.429893
saskatoon         0.416954
dieppe            0.404408
iqaluit           0.401143
canadians         0.398137
winnipeg          0.397547
labatt            0.393893
city              0.386245
bilingualism      0.386245
columbia          0.384754
provincial        0.383439
banff             0.382603
metro             0.382367
molson            0.379343
nunavut           0.375992
montr             0.373883
francophones      0.373512
brunswick         0.364261
manitoba          0.360447
bec               0.359977
francophone       0.358556
leafs             0.353035
ellensburg        0.352787
curling           0.351973
cdn               0.347580

Enter three words (EXIT to break):
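The analogy is answered by vector arithmetic: word-analogy builds the target vector vec(beijing) - vec(china) + vec(canada) and then ranks all words by cosine similarity to it (excluding the three input words). A minimal sketch of that step with toy 3-dimensional vectors, not real model output:
#include <stdio.h>

#define SIZE 3   /* toy dimensionality; the model above uses 200 */

int main(void) {
  /* hypothetical vectors just to show the arithmetic */
  float china[SIZE]   = {0.9f, 0.1f, 0.3f};
  float beijing[SIZE] = {0.7f, 0.5f, 0.2f};
  float canada[SIZE]  = {0.1f, 0.8f, 0.4f};
  float target[SIZE];
  int i;
  for (i = 0; i < SIZE; i++)
    target[i] = beijing[i] - china[i] + canada[i];   /* A => B, so C => ? */
  /* the real tool normalizes target and returns the closest words by cosine similarity */
  for (i = 0; i < SIZE; i++) printf("%f ", target[i]);
  printf("\n");
  return 0;
}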
5.6. Perform clustering and output the result (with -classes 0 the output is the word vectors themselves)
>word2vec -train text8 -output classes.txt -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -iter 15 -classes 500
Starting training using file text8
Vocab size: 71291
Words in train file: 16718843
Alpha: 0.000005  Progress: 100.10%  Words/thread/sec: 14.72k
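Each line of classes.txt is a word followed by its cluster id. To get the vectors themselves in a readable form instead, drop -classes (or set it to 0) and write text output with -binary 0; a sketch (vectors.txt is an arbitrary name):
>word2vec -train text8 -output vectors.txt -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 0 -iter 15
The first line of the text output is "vocabulary-size vector-size", followed by one word and its 200 components per line.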
5.7. The original package also contains three demo scripts for phrase handling; since they rely on Linux commands such as sed and awk, it is better to run them under Cygwin or MinGW.