Natural Language Processing with spaCy (Part 1)

Tags: Artificial Intelligence

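Introduction

Natural language processing (NLP) is a very important research area within artificial intelligence. It plays a key role in many intelligent applications, for example: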

automated chat bots,
article summarizers,
multi-lingual translation,
opinion identification from data.

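Every industry that uses NLP to make sense of unstructured text data demands not only accuracy but also speed in obtaining results.

Natural language processing is a very broad field. NLP tasks include: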

text classification,
entity detection,
machine translation,
question answering,
concept identification.

This article introduces spaCy, a high-level NLP library.

Table of Contents

About spaCy and installation
spaCy pipeline and properties

Tokenization
POS Tagging
Entity Detection
Dependency Parsing
Noun Phrases

Comparison with NLTK and CoreNLP

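1. About spaCy and installation

1.1 About spaCy

spaCy is written in Cython, which makes it a very fast library. It provides a concise API for accessing its methods and properties, which are governed by trained machine learning (and deep learning) models.

1.2 Installation

Install spaCy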

pip install spacy

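Download the data and model (recent spaCy releases require the full package name; older releases also accepted the shortcut en)

python -m spacy download en_core_web_sm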

You are now ready to use spaCy.
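
A quick way to confirm that the download worked is to check whether Python can find the model package. The sketch below assumes the model was installed as a regular Python package named en_core_web_sm; the helper name model_installed is made up for this illustration, and only the standard library is used.

```python
import importlib.util


def model_installed(package_name):
    """Return True if a top-level package with this name can be found."""
    return importlib.util.find_spec(package_name) is not None


# Assumption: the English model is installed under this package name.
name = "en_core_web_sm"
if model_installed(name):
    print(name, "is installed")
else:
    print(name, "is missing; run: python -m spacy download", name)
```

The check only looks the package up and does not load it, so it is cheap to run before calling spacy.load.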

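2. spaCy pipeline and properties

To use spaCy and access its various properties, you first create a pipeline. A pipeline is created by loading a model. spaCy ships many different models, each containing information about a language: its vocabulary, pre-trained word vectors, syntax, and entities.

Below we load the default English model (english-core-web).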

import spacy
nlp = spacy.load("en_core_web_sm")

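The nlp object is used to create documents and to access linguistic annotations and various NLP properties. We create a document by loading a text file; here it is a set of hotel reviews downloaded from the TripAdvisor website.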

# filename is the path to the downloaded reviews text file
with open(filename) as f:
    document = nlp(f.read())

document is now a spaCy Doc object produced by the English model, and it carries a number of attributes of its own. You can list them with dir(document).

dir(document)
>> [..., 'user_span_hooks', 'user_token_hooks', 'vector', 'vector_norm', 'vocab']

document holds a large amount of information about the text, including tokens, each token's reference index, part-of-speech tags, entities, vectors, sentiment, vocabulary, etc. A few of these properties are covered below.

2.1 Tokenization

"this is a sentence."-> (tokenization)
>> ['this', 'is', 'a', 'sentence', '.']

A spaCy document is split into tokens, and the tokens are grouped into sentences. Both can be accessed by indexing or iterating over the document.

# first token of the doc
document[0]
>> Nice

# fifth token from the end of the doc
document[len(document)-5]
>> boston

# list of sentences of our doc
list(document.sents)
>> [ Nice place Better than some reviews give it credit for.,
 Overall, the rooms were a bit small but nice.,
...
Everything was clean, the view was wonderful and it is very well located (the Prudential Center makes shopping and eating easy and the T is nearby for jaunts out and about the city).]
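
To make the "this is a sentence." example at the start of this section concrete, here is a minimal regex-based tokenizer. It is a toy illustration only, not how spaCy tokenizes; the names TOKEN_PATTERN and simple_tokenize are made up for this sketch.

```python
import re

# A token is either a run of word characters or a single
# non-space, non-word character (punctuation).
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


print(simple_tokenize("this is a sentence."))
# ['this', 'is', 'a', 'sentence', '.']
```

A pattern like this breaks down on contractions and abbreviations, which is one reason a library tokenizer with language-specific rules is preferable for real text.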

2.2 Part of Speech Tagging

Part-of-speech tags describe the grammatical role of a word, such as verb or noun. They can be used as text features in information filtering, statistical models, and rule-based parsing.

# get all tags
all_tags = {w.pos: w.pos_ for w in document}
>> {83: 'ADJ', 91: 'NOUN', 84: 'ADP', 89: 'DET', 99: 'VERB', 94: 'PRON', 96: 'PUNCT', 85: 'ADV', 88: 'CCONJ', 95: 'PROPN', 102: 'SPACE', 93: 'PART', 98: 'SYM', 92: 'NUM', 100: 'X', 90: 'INTJ'}

# all tags of first sentence of our document
for word in list(document.sents)[0]:
    print(word, word.tag_)
>> (Nice, 'JJ') (place, 'NN') (Better, 'JJR') (than, 'IN') (some, 'DT') (reviews, 'NNS') (give, 'VBP') (it, 'PRP') (credit, 'NN') (for, 'IN') (., '.')

The code below builds a small text-cleaning step that removes noisy tokens.

from collections import Counter

# define some parameters
# (note: spaCy labels proper nouns "PROPN", so "PROP" as written matches no token)
noisy_pos_tags = ["PROP"]
min_token_length = 2

# Function to check if the token is a noise or not
def isNoise(token):
    is_noise = False
    if token.pos_ in noisy_pos_tags:
        is_noise = True
    elif token.is_stop:
        is_noise = True
    elif len(token.text) <= min_token_length:
        is_noise = True
    return is_noise

# lowercase (optionally) and strip surrounding whitespace
def cleanup(token, lower=True):
    if lower:
        token = token.lower()
    return token.strip()

# top unigrams used in the reviews
cleaned_list = [cleanup(word.text) for word in document if not isNoise(word)]
Counter(cleaned_list).most_common(5)
>> [('hotel', 683), ('room', 652), ('great', 300),  ('sheraton', 285), ('location', 271)]
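
The same filter-then-count logic can be tried without loading a model. The sketch below is a toy version under stated assumptions: Tok is a hypothetical stand-in for a spaCy token, the sample tokens and their tags are hand-written, and "PROPN" is used as the noisy tag.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for a spaCy token: text, coarse POS tag, stop-word flag.
Tok = namedtuple("Tok", ["text", "pos_", "is_stop"])


def is_noise(tok, noisy_pos=("PROPN",), min_len=2):
    """A token is noise if its tag is noisy, it is a stop word, or it is too short."""
    return tok.pos_ in noisy_pos or tok.is_stop or len(tok.text) <= min_len


def top_unigrams(tokens, n=3):
    """Count the lowercased text of every token that is not noise."""
    kept = [t.text.lower().strip() for t in tokens if not is_noise(t)]
    return Counter(kept).most_common(n)


# Hand-written sample tokens (not produced by a real model).
sample = [
    Tok("The", "DET", True),
    Tok("hotel", "NOUN", False),
    Tok("was", "AUX", True),
    Tok("great", "ADJ", False),
    Tok(",", "PUNCT", False),
    Tok("great", "ADJ", False),
    Tok("Hotel", "NOUN", False),
    Tok("Boston", "PROPN", False),
    Tok("hotel", "NOUN", False),
]
print(top_unigrams(sample))
# [('hotel', 3), ('great', 2)]
```

Lowercasing before counting is what merges "Hotel" and "hotel" into a single entry.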

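2.3 Entity Detection

spaCy includes a fast entity recognition model that identifies entity phrases in a document. Entities come in several types, such as person, location, organization, date, and number. They are available through the ents attribute of the document.

The code below finds all named entities in the current document.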

labels = set([w.label_ for w in document.ents])
for label in labels:
    entities = [cleanup(e.text, lower=False) for e in document.ents if label == e.label_]
    entities = list(set(entities))
    print(label, entities)
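
The loop above rescans document.ents once per label. A single-pass alternative is sketched below on plain (text, label) pairs, which could be built as [(e.text, e.label_) for e in document.ents]; the function name group_entities and the sample pairs are made up for illustration.

```python
from collections import defaultdict


def group_entities(pairs):
    """Map each label to the sorted list of distinct entity strings."""
    grouped = defaultdict(set)
    for text, label in pairs:
        grouped[label].add(text.strip())
    return {label: sorted(texts) for label, texts in grouped.items()}


# Hand-written sample pairs (not the output of a real model).
sample = [
    ("Boston", "GPE"),
    ("Sheraton", "ORG"),
    ("Boston ", "GPE"),
    ("two nights", "DATE"),
]
print(group_entities(sample))
# {'GPE': ['Boston'], 'ORG': ['Sheraton'], 'DATE': ['two nights']}
```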

2.4 Dependency Parsing

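One of spaCy's most powerful features is its fast and accurate syntactic dependency parser, available through a simple API. The parser can also be used for sentence boundary detection and phrase chunking. The parse tree is navigated through attributes such as .children, .root, and .ancestors.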

# extract all review sentences that contain the term - hotel
hotel = [sent for sent in document.sents if 'hotel' in sent.text.lower()]

# create dependency tree
sentence = hotel[2]
for word in sentence:
    print(word, ': ', str(list(word.children)))
>> A :  []  
cab :  [A, from] 
from :  [airport, to]
the :  [] 
airport :  [the] 
to :  [hotel] 
the :  [] 
hotel :  [the] 
can :  []
be :  [cab, can, cheaper, .] 
cheaper :  [than]
than :  [shuttles] 
the :  []
shuttles :  [the, depending] 
depending :  [time] 
what :  [] 
time :  [what, of] 
of :  [day]
the :  [] 
day :  [the, go] 
you :  []
go :  [you]
. :  []
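
The children lists printed above follow directly from each token's head. The sketch below shows that relationship on a hand-written example; the head indices are an assumption made for illustration (a shortened version of the sentence above, with the root pointing to itself), not parser output.

```python
def children_by_token(tokens, heads):
    """Return, for every token, the list of tokens whose head it is."""
    children = [[] for _ in tokens]
    for i, head in enumerate(heads):
        if head != i:  # the root points to itself and is nobody's child
            children[head].append(tokens[i])
    return children


# Hand-written parse: heads[i] is the index of the head of tokens[i].
tokens = ["A", "cab", "can", "be", "cheaper", "."]
heads = [1, 3, 3, 3, 3, 3]

for tok, kids in zip(tokens, children_by_token(tokens, heads)):
    print(tok, ":", kids)
# A : []
# cab : ['A']
# can : []
# be : ['cab', 'can', 'cheaper', '.']
# cheaper : []
# . : []
```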

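The code below parses the dependency trees of all sentences that contain "hotel" to see which adjectives are used to describe it. It defines a custom function that walks the dependency tree and extracts the child words carrying the requested part-of-speech tag.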

# check all adjectives used with a word
def pos_words(document, token, pos_tag):
    sentences = [sent for sent in document.sents if token in sent.text]
    pwrds = []
    for sent in sentences:
        for word in sent:
            if token in word.text:
                pwrds.extend([child.text.strip() for child in word.children
                              if child.pos_ == pos_tag])
    return Counter(pwrds).most_common(10)

pos_words(document, 'hotel', "ADJ")
>> [(u'other', 20), (u'great', 10), (u'good', 7), (u'better', 6), (u'nice', 6), (u'different', 5), (u'many', 5), (u'best', 4), (u'my', 4), (u'wonderful', 3)]

2.5 Noun Phrases

Dependency trees can also be used to generate noun phrases.

# Generate Noun Phrases
doc = nlp(u'I love data science on analytics vidhya')
for np in doc.noun_chunks:
    print(np.text, np.root.dep_, np.root.head.text)
>> I nsubj love
   data science dobj love
   analytics pobj on

3. Comparison with NLTK and CoreNLP

References

https://github.com/pytorch/text

https://www.analyticsvidhya.com/blog/2017/04/natural-language-processing-made-easy-using-spacy-%E2%80%8Bin-python/
