Training a Chinese word2vec Model¶
1. Preparing and Preprocessing the Data¶
First you need a reasonably large Chinese corpus; the Chinese Wikipedia dump is a good choice (the Sogou news corpus is also worth trying).
The Chinese Wikipedia dump is available here. Link: https://pan.baidu.com/s/1H-wuIve0d_fvczvy3EOKMQ extraction code: uqua
Accelerated Baidu Netdisk download: https://www.baiduwp.com/?m=index
The Chinese Wikipedia dataset is not that large; the compressed XML file is about 1 GB. The first step is to process this compressed XML file, which gensim's WikiCorpus handles directly.
Note: adjust the input and output paths below to match your environment.
In [9]:
import logging
import os.path
import sys

from gensim.corpora import WikiCorpus

if __name__ == '__main__':
    # Define input and output paths
    basename = "F:/temp/DL/"
    inp = basename + 'zhwiki-latest-pages-articles.xml.bz2'
    outp = basename + 'wiki.zh.text'

    # Set up logging
    program = os.path.basename(basename)
    logger = logging.getLogger(program)
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # Stream articles out of the compressed dump and write one article per line.
    # lemmatize=False skips lemmatization; this parameter exists in gensim 3.x
    # and was removed in gensim 4.0.
    space = " "
    i = 0
    output = open(outp, 'w', encoding='utf-8')
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        output.write(space.join(text) + "\n")
        i = i + 1
        if i % 10000 == 0:
            logger.info("Saved " + str(i) + " articles")
    output.close()
    logger.info("Finished Saved " + str(i) + " articles")
2019-05-08 21:42:31,184: INFO: running c:\users\mantch\appdata\local\programs\python\python35\lib\site-packages\ipykernel_launcher.py -f C:\Users\mantch\AppData\Roaming\jupyter\runtime\kernel-30939db9-3a59-4a92-844c-704c6189dbef.json
2019-05-08 21:43:12,274: INFO: Saved 10000 articles
2019-05-08 21:43:45,223: INFO: Saved 20000 articles
2019-05-08 21:44:14,638: INFO: Saved 30000 articles
2019-05-08 21:44:44,601: INFO: Saved 40000 articles
2019-05-08 21:45:16,004: INFO: Saved 50000 articles
2019-05-08 21:45:47,421: INFO: Saved 60000 articles
2019-05-08 21:46:16,722: INFO: Saved 70000 articles
2019-05-08 21:46:46,733: INFO: Saved 80000 articles
2019-05-08 21:47:16,143: INFO: Saved 90000 articles
2019-05-08 21:47:47,533: INFO: Saved 100000 articles
2019-05-08 21:48:29,591: INFO: Saved 110000 articles
2019-05-08 21:49:04,530: INFO: Saved 120000 articles
2019-05-08 21:49:40,279: INFO: Saved 130000 articles
2019-05-08 21:50:15,592: INFO: Saved 140000 articles
2019-05-08 21:50:54,183: INFO: Saved 150000 articles
2019-05-08 21:51:31,123: INFO: Saved 160000 articles
2019-05-08 21:52:06,278: INFO: Saved 170000 articles
2019-05-08 21:52:43,157: INFO: Saved 180000 articles
2019-05-08 21:55:59,809: INFO: Saved 190000 articles
2019-05-08 21:57:01,859: INFO: Saved 200000 articles
2019-05-08 21:58:33,921: INFO: Saved 210000 articles
2019-05-08 21:59:26,744: INFO: Saved 220000 articles
2019-05-08 22:00:41,757: INFO: Saved 230000 articles
2019-05-08 22:01:36,532: INFO: Saved 240000 articles
2019-05-08 22:02:26,347: INFO: Saved 250000 articles
2019-05-08 22:03:08,634: INFO: Saved 260000 articles
2019-05-08 22:03:53,447: INFO: Saved 270000 articles
2019-05-08 22:04:37,136: INFO: Saved 280000 articles
2019-05-08 22:05:14,017: INFO: Saved 290000 articles
2019-05-08 22:06:01,296: INFO: Saved 300000 articles
2019-05-08 22:06:47,762: INFO: Saved 310000 articles
2019-05-08 22:07:39,714: INFO: Saved 320000 articles
2019-05-08 22:08:28,825: INFO: Saved 330000 articles
2019-05-08 22:09:11,412: INFO: finished iterating over Wikipedia corpus of 338005 documents with 77273203 positions (total 3288566 articles, 91445479 positions before pruning articles shorter than 50 words)
2019-05-08 22:09:11,555: INFO: Finished Saved 338005 articles
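Once the extraction finishes, a quick look at the output confirms the one-article-per-line plain-text format. This small sanity check is not part of the original notebook; it just reuses the output path from the cell above.
In [ ]:
with open("F:/temp/DL/wiki.zh.text", encoding="utf-8") as f:
    # Print the first 200 characters of the first extracted article
    print(f.readline()[:200])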
2. Training the Model¶
With the corpus extracted, train a word2vec model on it with gensim. Again, adjust the paths to your environment.
In [ ]:
import logging
import os.path
import sys
import multiprocessing

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Define input and output paths
basename = "F:/temp/DL/"
inp = basename + 'wiki.zh.text'
outp1 = basename + 'wiki.zh.text.model'
outp2 = basename + 'wiki.zh.text.vector'

# Set up logging
program = os.path.basename(basename)
logger = logging.getLogger(program)
logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
logging.root.setLevel(level=logging.INFO)
logger.info("running %s" % ' '.join(sys.argv))

# Train word2vec: 400-dimensional vectors, context window of 5,
# ignore words that appear fewer than 5 times, one worker per CPU core
model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5,
                 workers=multiprocessing.cpu_count())

# trim unneeded model memory = use (much) less RAM
# model.init_sims(replace=True)

# Save the full model, then the word vectors in plain-text word2vec format
model.save(outp1)
model.wv.save_word2vec_format(outp2, binary=False)
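Training writes two artifacts: wiki.zh.text.model (the full model, which can be reloaded and trained further) and wiki.zh.text.vector (the word vectors in plain-text word2vec format). As a minimal sketch, assuming gensim 3.x as in the rest of this notebook, the text vectors can also be loaded on their own through KeyedVectors, without the full model:
In [ ]:
from gensim.models import KeyedVectors

# Load only the word vectors written by the training cell above
wv = KeyedVectors.load_word2vec_format("F:/temp/DL/wiki.zh.text.vector", binary=False)
print(wv[u"足球"].shape)  # (400,), given size=400 above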
3. Testing the Results¶
In [1]:
# Test the results
import gensim

# Load the trained model
basename = "F:/temp/DL/"
model_path = basename + 'wiki.zh.text.model'
model = gensim.models.Word2Vec.load(model_path)

# Ten words most similar to "足球" (football)
result = model.wv.most_similar(u"足球")
for e in result:
    print(e[0], e[1])
排球 0.8914323449134827
籃球 0.8889479041099548
棒球 0.854706883430481
高爾夫 0.832783043384552
高爾夫球 0.8316080570220947
網球 0.8276922702789307
橄欖球 0.823620080947876
英式足球 0.8229209184646606
板球 0.822044312953949
欖球 0.8151556253433228
In [2]:
# Ten words most similar to "男人" (man)
result = model.wv.most_similar(u"男人")
for e in result:
    print(e[0], e[1])
女人 0.908246636390686
男孩 0.872255802154541
女孩 0.8567496538162231
孩子 0.8363182544708252
知道 0.8341636061668396
某人 0.8211491107940674
漂亮 0.8023637533187866
伴侶 0.8001378774642944
什麼 0.7944830656051636
嫉妒 0.7929206490516663
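Beyond most_similar, the loaded model supports the other standard word2vec queries. A minimal sketch follows; the exact outputs depend on your trained model, and since this corpus mixes simplified and traditional Chinese, a given word form may or may not be in the vocabulary, so no results are shown here.
In [ ]:
# Cosine similarity between two words
print(model.wv.similarity(u"足球", u"排球"))

# Analogy query: which words relate to 女人 (woman) as 国王 (king) relates to 男人 (man)?
print(model.wv.most_similar(positive=[u"国王", u"女人"], negative=[u"男人"], topn=5))

# Find the word that does not belong in the list
print(model.wv.doesnt_match([u"足球", u"排球", u"棒球", u"銀行"]))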