Is it possible to get confidence scores for spaCy named entity recognition?

Python

紫衣仙女 2021-10-05 17:47:04

I need to get confidence scores for the predictions made by spaCy NER.

CSV file:

    Text,Amount & Nature,Percent of Class
    "T. Rowe Price Associates, Inc.","28,223,360 (1)",8.7% (1)
    100 E. Pratt Street,Not Listed,Not Listed
    "Baltimore, MD 21202",Not Listed,Not Listed
    "BlackRock, Inc.","21,871,854 (2)",6.8% (2)
    55 East 52nd Street,Not Listed,Not Listed
    "New York, NY 10022",Not Listed,Not Listed
    The Vanguard Group,"21,380,085 (3)",6.64% (3)
    100 Vanguard Blvd.,Not Listed,Not Listed
    "Malvern, PA 19355",Not Listed,Not Listed
    FMR LLC,"20,784,414 (4)",6.459% (4)
    245 Summer Street,Not Listed,Not Listed
    "Boston, MA 02210",Not Listed,Not Listed

Code:

    import csv

    import pandas as pd
    import spacy

    # Load the model once instead of once per row.
    nlp = spacy.load('en_core_web_sm')

    data1 = []
    with open('/path/table.csv') as csvfile:
        reader1 = csv.DictReader(csvfile)
        for row in reader1:
            AmountNature = row["Amount & Nature"]
            doc1 = nlp(row["Text"])
            # Keep the label of the last entity in the row (None if no entity is found).
            label1 = None
            for ent in doc1.ents:
                label1 = ent.label_
            data1.append([str(doc1), AmountNature, label1])

    my_df1 = pd.DataFrame(data1, columns=["Text", "Amount & Nature", "Prediction"])
    my_df1.to_csv('/path/output.csv', index=False)

Output CSV:

    Text,Amount & Nature,Prediction
    "T. Rowe Price Associates, Inc.","28,223,360 (1)",ORG
    100 E. Pratt Street,Not Listed,FAC
    "Baltimore, MD 21202",Not Listed,CARDINAL
    "BlackRock, Inc.","21,871,854 (2)",ORG
    55 East 52nd Street,Not Listed,LOC
    "New York, NY 10022",Not Listed,DATE
    The Vanguard Group,"21,380,085 (3)",ORG
    100 Vanguard Blvd.,Not Listed,FAC
    "Malvern, PA 19355",Not Listed,DATE
    FMR LLC,"20,784,414 (4)",ORG
    245 Summer Street,Not Listed,CARDINAL
    "Boston, MA 02210",Not Listed,GPE

For the output above, is it possible to get a confidence score for each spaCy NER prediction? If so, how can I do it? Can anyone help?

3 Answers

慕雪6442864

Contributed 1812 experience points, received more than 5 likes

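No. Unfortunately, spaCy does not attach a confidence score to the entities it returns, so you cannot read one off the model's predictions directly.

The F1 score is fine for overall evaluation, but I would much rather spaCy provided an individual confidence score for each prediction, which it currently does not.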

Replied 2021-10-05

繁星点点滴滴

Contributed 1803 experience points, received more than 3 likes

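Either obtain a fully annotated dataset or annotate the data manually yourself (since you have a CSV file, that is probably your best option). That way you can compare the ground truth against your spaCy predictions and compute a confusion matrix from the result. I would suggest using the F1 score as your measure of confidence. Note that this measures how reliable the model is overall, not how sure it is about any single prediction.

Here are some great links that discuss various publicly available datasets and annotation methods (including CRF).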
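A minimal sketch of the comparison described above, in plain Python. The `predicted` labels are the first six rows of the question's output CSV; the `gold` labels are hypothetical hand annotations made up for illustration, and treating each row as a single label is a simplification of real entity-level scoring.

```python
from collections import Counter

# Predictions copied from the first six rows of the question's output CSV.
predicted = ["ORG", "FAC", "CARDINAL", "ORG", "LOC", "DATE"]
# Hypothetical hand annotations for the same rows (an assumption, not from the thread).
gold = ["ORG", "FAC", "GPE", "ORG", "FAC", "GPE"]

# Confusion matrix as a counter keyed by (gold label, predicted label).
confusion = Counter(zip(gold, predicted))


def f1_for(label):
    """Return (precision, recall, F1) for one label from the confusion counts."""
    tp = confusion[(label, label)]
    fp = sum(n for (g, p), n in confusion.items() if p == label and g != label)
    fn = sum(n for (g, p), n in confusion.items() if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


for label in sorted(set(gold)):
    p, r, f1 = f1_for(label)
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

With real data you would replace `gold` with your own annotations; a low F1 for a label such as GPE tells you that predictions of that kind deserve less trust.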

Replied 2021-10-05

慕无忌1623718

Contributed 1744 experience points, received more than 4 likes

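There is no straightforward answer to this. First, spaCy implements two different objectives for named entity parsing:

Greedy imitation learning objective. This objective asks, "Which of the available actions won't introduce new errors if I execute them from this state?"
Global beam search objective. Instead of optimizing individual transition decisions, the global model asks whether the final parse is correct. To optimize this objective, we build the set of the top-k most likely incorrect parses and the top-k most likely correct parses.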
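To make the difference concrete, here is a toy sketch in plain Python. It does not use spaCy's parser: the `TRANSITIONS` table holds invented probabilities, and `greedy_decode` and `beam_decode` are hypothetical helpers. Greedy decoding commits to the locally best action at every step and ends with a single parse, while the beam keeps several complete parses alive, each with its own score, which is what makes a confidence estimate possible.

```python
# Invented probabilities P(next action | previous action), for illustration only.
TRANSITIONS = {
    "START": {"O": 0.6, "B-GPE": 0.4},
    "O": {"O": 0.55, "B-GPE": 0.45},
    "B-GPE": {"O": 0.9, "B-GPE": 0.1},
}


def greedy_decode(n_steps):
    """Pick the single most probable action at each step."""
    prev, path, prob = "START", [], 1.0
    for _ in range(n_steps):
        action, p = max(TRANSITIONS[prev].items(), key=lambda kv: kv[1])
        path.append(action)
        prob *= p
        prev = action
    return path, prob


def beam_decode(n_steps, beam_width):
    """Keep the beam_width most probable partial parses at each step."""
    beam = [(1.0, ["START"])]
    for _ in range(n_steps):
        candidates = [
            (prob * p, path + [action])
            for prob, path in beam
            for action, p in TRANSITIONS[path[-1]].items()
        ]
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_width]
    # Normalize over the surviving parses so their scores sum to 1.
    total = sum(prob for prob, _ in beam)
    return [(prob / total, path[1:]) for prob, path in beam]


print(greedy_decode(2))
for score, path in beam_decode(2, beam_width=2):
    print(round(score, 3), path)
```

In this toy table the greedy path and the best beam parse differ, because the action that looks best at the first step leads to a less probable parse overall.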

Note: tested with spaCy v2.0.13

    import spacy
    from collections import defaultdict

    nlp = spacy.load('en')
    text = 'Hi there! Hope you are doing good. Greetings from India.'

    # Run the pipeline without the NER step; entities are parsed with the beam below.
    with nlp.disable_pipes('ner'):
        doc = nlp(text)

    threshold = 0.2

    # Number of alternate analyses to consider. More is slower, and not
    # necessarily better. You need to experiment on your problem.
    beam_width = 16

    # This clips solutions at each step. We multiply the score of the top-ranked
    # action by this value, and use the result as a threshold. This prevents the
    # parser from exploring options that look very unlikely, saving a bit of
    # efficiency. Accuracy may also improve, because we've trained on greedy objective.
    beam_density = 0.0001

    beams, _ = nlp.entity.beam_parse([doc], beam_width, beam_density)

    # Sum the scores of all beam parses that contain each (start, end, label) entity.
    entity_scores = defaultdict(float)
    for beam in beams:
        for score, ents in nlp.entity.moves.get_beam_parses(beam):
            for start, end, label in ents:
                entity_scores[(start, end, label)] += score

    for key in entity_scores:
        start, end, label = key
        score = entity_scores[key]
        if score > threshold:
            print('Label: {}, Text: {}, Score: {}'.format(label, doc[start:end], score))

Output:

Label: GPE, Text: India, Score: 0.9999509961251819
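To see what the aggregation step is doing without installing a model, here is a small sketch that runs the same summing logic on hand-written data. The `beam_parses` list is hypothetical: it stands in for what `get_beam_parses` yields, and its scores and the spurious PERSON span are invented for illustration. An entity's score is the total probability of all candidate parses that contain it.

```python
from collections import defaultdict

# Hypothetical beam output: (probability of the whole parse, entities in that parse).
# Entities are (start_token, end_token, label); the values are made up.
beam_parses = [
    (0.85, [(11, 12, "GPE")]),
    (0.10, [(11, 12, "GPE"), (0, 1, "PERSON")]),
    (0.05, []),
]

threshold = 0.2

# An entity's score is the summed probability of every parse that includes it.
entity_scores = defaultdict(float)
for score, ents in beam_parses:
    for start, end, label in ents:
        entity_scores[(start, end, label)] += score

# Keep only the entities whose summed score clears the threshold.
confident = {key: score for key, score in entity_scores.items() if score > threshold}

for (start, end, label), score in confident.items():
    print("Label: {}, Span: ({}, {}), Score: {:.2f}".format(label, start, end, score))
```

Here the GPE span appears in two of the three parses and keeps a high score, while the PERSON span appears only in a low-probability parse and falls below the threshold.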

Replied 2021-10-05
