2 Answers

Contributor with 1895 experience points, 7+ upvotes
Actually, I have found a way to solve this problem.
In the class WordEmbeddingKeyedVectors in the file gensim.models.keyedvectors, we can change init_sims from
    def init_sims(self, replace=False):
        """Precompute L2-normalized vectors."""
        if getattr(self, 'vectors_norm', None) is None or replace:
            logger.info("precomputing L2-norms of word weight vectors")
            self.vectors_norm = _l2_norm(self.vectors, replace=replace)
to
    def init_sims(self, replace=False):
        """Precompute L2-normalized vectors."""
        if getattr(self, 'vectors_norm', None) is None or replace:
            logger.info("precomputing L2-norms of word weight vectors")
            self.vectors_norm = _l2_norm(self.vectors, replace=replace)
        elif len(self.vectors_norm) == len(self.vectors):
            # every added vector already has a precomputed L2-normalized row
            pass
        else:
            # vectors were added but have not been L2-normalized yet
            logger.info("adding L2-norm vectors for new documents")
            diff = len(self.vectors) - len(self.vectors_norm)
            self.vectors_norm = vstack((self.vectors_norm, _l2_norm(self.vectors[-diff:])))
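Essentially, the original function computes self.vectors_norm by L2-normalizing self.vectors only when self.vectors_norm does not exist yet. However, if self.vectors contains newly added vectors that have not been L2-normalized, we should normalize them as well and append the result to self.vectors_norm.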
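To illustrate the idea outside of gensim, here is a minimal NumPy-only sketch of the same incremental update. The helper names l2_normalize and extend_norm_cache are made up for this example and are not part of gensim. It assumes that new vectors are only ever appended to the end of the array, that existing rows are never modified in place, and that no row is all zeros.

```python
import numpy as np


def l2_normalize(rows):
    """Return a copy of `rows` with every row scaled to unit L2 length."""
    norms = np.linalg.norm(rows, axis=1, keepdims=True)
    return rows / norms


def extend_norm_cache(vectors, vectors_norm):
    """Hypothetical helper: normalize only the rows missing from the cache."""
    if vectors_norm is None:
        # No cache yet: normalize everything.
        return l2_normalize(vectors)
    diff = len(vectors) - len(vectors_norm)
    if diff == 0:
        # Cache already covers every row.
        return vectors_norm
    # Normalize just the trailing `diff` rows and append them to the cache.
    return np.vstack((vectors_norm, l2_normalize(vectors[-diff:])))


vectors = np.array([[3.0, 4.0], [0.0, 2.0]])
cache = extend_norm_cache(vectors, None)      # initial cache, 2 rows

vectors = np.vstack((vectors, [[6.0, 8.0]]))  # a new vector is appended
cache = extend_norm_cache(vectors, cache)     # cache grows to 3 rows
```

Note that this sketch, like the patch above, does not detect a vector that was overwritten in place, since the row counts still match in that case.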
I will post this as a comment on your bug report, @gojomo, and open a pull request! Thanks :)

Contributor with 1942 experience points, 3+ upvotes
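It looks like the add() operation does not clear the cache of unit-length-normalized vectors that is created and reused by operations like most_similar().
Before or after performing the add(), you can explicitly delete that cache with:
    del test.vectors_norm
Then your test.most_similar('3') call should complete without an IndexError.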
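To show why deleting the cache helps, here is a small self-contained sketch of the failure mode. ToyVectors is a hypothetical stand-in written for this example, not gensim's class, and unit_vector is a made-up lookup that plays the role of most_similar(). The only assumption carried over from above is that the normalized cache is rebuilt lazily when the vectors_norm attribute is missing.

```python
import numpy as np


class ToyVectors:
    """Hypothetical stand-in for a keyed-vector store (not gensim's class)."""

    def __init__(self):
        self.index = {}                    # key -> row number
        self.vectors = np.empty((0, 2))    # raw vectors, one per row
        self.vectors_norm = None           # lazily built unit-length cache

    def add(self, key, vector):
        self.index[key] = len(self.vectors)
        self.vectors = np.vstack((self.vectors, [vector]))
        # The cache is deliberately left untouched, mimicking the reported bug.

    def unit_vector(self, key):
        # Rebuild the cache only when it is missing.
        if getattr(self, "vectors_norm", None) is None:
            norms = np.linalg.norm(self.vectors, axis=1, keepdims=True)
            self.vectors_norm = self.vectors / norms
        return self.vectors_norm[self.index[key]]


test = ToyVectors()
test.add("1", [3.0, 4.0])
test.unit_vector("1")          # builds a cache with a single row
test.add("3", [0.0, 2.0])      # the cache is now stale

try:
    test.unit_vector("3")
    stale_lookup_failed = False
except IndexError:
    stale_lookup_failed = True  # row 1 does not exist in the stale cache

del test.vectors_norm           # drop the cache so it is rebuilt on next use
recovered = test.unit_vector("3")
```

Deleting the attribute forces a full recomputation on the next lookup, which is simpler than the incremental patch but repeats the normalization work for every row.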