
Is it possible to get confidence scores from spaCy named entity recognition?

紫衣仙女 2021-10-05 17:47:04
I need to get confidence scores for the predictions made by spaCy NER.

CSV file:

Text,Amount & Nature,Percent of Class
"T. Rowe Price Associates, Inc.","28,223,360 (1)",8.7% (1)
100 E. Pratt Street,Not Listed,Not Listed
"Baltimore, MD 21202",Not Listed,Not Listed
"BlackRock, Inc.","21,871,854 (2)",6.8% (2)
55 East 52nd Street,Not Listed,Not Listed
"New York, NY 10022",Not Listed,Not Listed
The Vanguard Group,"21,380,085 (3)",6.64% (3)
100 Vanguard Blvd.,Not Listed,Not Listed
"Malvern, PA 19355",Not Listed,Not Listed
FMR LLC,"20,784,414 (4)",6.459% (4)
245 Summer Street,Not Listed,Not Listed
"Boston, MA 02210",Not Listed,Not Listed

Code (with the missing `import csv` added, and the model loaded once outside the loop instead of once per row):

import csv
import pandas as pd
import spacy

nlp = spacy.load('en_core_web_sm')  # load the model once, outside the loop

with open('/path/table.csv') as csvfile:
    reader1 = csv.DictReader(csvfile)
    data1 = [["Text", "Amount & Nature", "Prediction"]]
    for row in reader1:
        AmountNature = row["Amount & Nature"]
        doc1 = nlp(row["Text"])
        label1 = None
        for ent in doc1.ents:
            # output = [ent.text, ent.start_char, ent.end_char, ent.label_]
            label1 = ent.label_  # keeps the label of the last entity found
        data1.append([str(doc1), AmountNature, label1])

my_df1 = pd.DataFrame(data1)
my_df1.columns = my_df1.iloc[0]
my_df1 = my_df1.drop(my_df1.index[[0]])
my_df1.to_csv('/path/output.csv', index=False,
              header=["Text", "Amount & Nature", "Prediction"])

Output CSV:

Text,Amount & Nature,Prediction
"T. Rowe Price Associates, Inc.","28,223,360 (1)",ORG
100 E. Pratt Street,Not Listed,FAC
"Baltimore, MD 21202",Not Listed,CARDINAL
"BlackRock, Inc.","21,871,854 (2)",ORG
55 East 52nd Street,Not Listed,LOC
"New York, NY 10022",Not Listed,DATE
The Vanguard Group,"21,380,085 (3)",ORG
100 Vanguard Blvd.,Not Listed,FAC
"Malvern, PA 19355",Not Listed,DATE
FMR LLC,"20,784,414 (4)",ORG
245 Summer Street,Not Listed,CARDINAL
"Boston, MA 02210",Not Listed,GPE

Given the output above, is it possible to get a confidence score for each spaCy NER prediction? If so, how can I achieve it? Can someone help me?

3 Answers

慕雪6442864


No, unfortunately it is not possible to get the model's confidence scores in spaCy.

While the F1 score is useful for overall evaluation, I would prefer spaCy to provide a per-prediction confidence score, which it currently does not.


Replied 2021-10-05
繁星点点滴滴


Either obtain a fully annotated dataset, or annotate one manually yourself (since you already have a CSV file, that is probably your preferred option). That way you can compare the ground truth against your spaCy predictions. From there you can compute a confusion matrix. I would suggest using the F1 score as a measure of confidence.

There are some great resources discussing various publicly available datasets and annotation methods (including CRF).
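The confusion-matrix idea above can be sketched in plain Python. The `gold` and `pred` lists below are hypothetical: `gold` stands in for labels you would assign by hand, and `pred` for the labels spaCy produced, as in the question's output CSV.

```python
from collections import Counter

# Hypothetical gold annotations vs. spaCy predictions, one label per row.
gold = ["ORG", "FAC", "GPE", "ORG", "FAC", "GPE"]
pred = ["ORG", "FAC", "CARDINAL", "ORG", "LOC", "GPE"]

def f1_per_label(gold, pred):
    """Compute precision, recall and F1 for each label from parallel lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[g] += 1  # the true label g was missed
    scores = {}
    for label in set(gold) | set(pred):
        precision = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        recall = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = (precision, recall, f1)
    return scores

for label, (p, r, f) in sorted(f1_per_label(gold, pred).items()):
    print('{}: P={:.2f} R={:.2f} F1={:.2f}'.format(label, p, r, f))
```

A per-label F1 like this tells you, for example, that ORG predictions are reliable while location-like labels are frequently confused, which is a coarse but usable substitute for a per-prediction confidence.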


Replied 2021-10-05
慕无忌1623718


There is no straightforward answer to this. First, spaCy's named-entity parser optimizes two different objectives:

  1. A greedy imitation-learning objective. This objective asks, "if I act from this state, which of the available actions would introduce no new errors?"

  2. A global beam-search objective. Instead of optimizing individual transition decisions, the global model asks whether the final parse is correct. To optimize this objective, it builds the set of the top-k most likely incorrect parses and the top-k most likely correct parses.

Note: tested with spaCy v2.0.13

import spacy
from collections import defaultdict

nlp = spacy.load('en')
text = 'Hi there! Hope you are doing good. Greetings from India.'

# Skip the greedy NER pass; we will beam-parse the doc manually below.
with nlp.disable_pipes('ner'):
    doc = nlp(text)

# Minimum aggregated probability mass for an entity to be reported.
threshold = 0.2
# Number of alternate analyses to consider. More is slower, and not
# necessarily better -- you need to experiment on your problem.
beam_width = 16
# This clips solutions at each step. We multiply the score of the top-ranked
# action by this value, and use the result as a threshold. This prevents the
# parser from exploring options that look very unlikely, saving a bit of
# efficiency. Accuracy may also improve, because we've trained on the greedy
# objective.
beam_density = 0.0001

beams, _ = nlp.entity.beam_parse([doc], beam_width, beam_density)

# Sum the probability mass of every beam parse that proposes each entity span.
entity_scores = defaultdict(float)
for beam in beams:
    for score, ents in nlp.entity.moves.get_beam_parses(beam):
        for start, end, label in ents:
            entity_scores[(start, end, label)] += score

for (start, end, label), score in entity_scores.items():
    if score > threshold:
        print('Label: {}, Text: {}, Score: {}'.format(label, doc[start:end], score))

Output:


Label: GPE, Text: India, Score: 0.9999509961251819


Replied 2021-10-05