
Training NLTK's NaiveBayesClassifier for sentiment analysis



达令说 2019-12-26 09:58:18

I am training a NaiveBayesClassifier in Python using sentences, and it gives me the error below. I can't tell what the problem is; any help would be great.


I have tried many other input formats, but the error persists. The code is given below:


from text.classifiers import NaiveBayesClassifier

from text.blob import TextBlob

train = [('I love this sandwich.', 'pos'),

         ('This is an amazing place!', 'pos'),

         ('I feel very good about these beers.', 'pos'),

         ('This is my best work.', 'pos'),

         ("What an awesome view", 'pos'),

         ('I do not like this restaurant', 'neg'),

         ('I am tired of this stuff.', 'neg'),

         ("I can't deal with this", 'neg'),

         ('He is my sworn enemy!', 'neg'),

         ('My boss is horrible.', 'neg') ]


test = [('The beer was good.', 'pos'),

        ('I do not enjoy my job', 'neg'),

        ("I ain't feeling dandy today.", 'neg'),

        ("I feel amazing!", 'pos'),

        ('Gary is a friend of mine.', 'pos'),

        ("I can't believe I'm doing this.", 'neg') ]

classifier = nltk.NaiveBayesClassifier.train(train)

I have included the traceback below.


Traceback (most recent call last):

  File "C:\Users\5460\Desktop\train01.py", line 15, in <module>

    all_words = set(word.lower() for passage in train for word in word_tokenize(passage[0]))

  File "C:\Users\5460\Desktop\train01.py", line 15, in <genexpr>

    all_words = set(word.lower() for passage in train for word in word_tokenize(passage[0]))

  File "C:\Python27\lib\site-packages\nltk\tokenize\__init__.py", line 87, in word_tokenize

    return _word_tokenize(text)

  File "C:\Python27\lib\site-packages\nltk\tokenize\treebank.py", line 67, in tokenize

    text = re.sub(r'^\"', r'``', text)

  File "C:\Python27\lib\re.py", line 151, in sub

    return _compile(pattern, flags).sub(repl, string, count)

TypeError: expected string or buffer



3 Answers

慕桂英546537

You need to change your data structure. Here is your train list as it currently stands:


>>> train = [('I love this sandwich.', 'pos'),

('This is an amazing place!', 'pos'),

('I feel very good about these beers.', 'pos'),

('This is my best work.', 'pos'),

("What an awesome view", 'pos'),

('I do not like this restaurant', 'neg'),

('I am tired of this stuff.', 'neg'),

("I can't deal with this", 'neg'),

('He is my sworn enemy!', 'neg'),

('My boss is horrible.', 'neg')]

The problem is that the first element of each tuple should be a dictionary of features. So I changed your list into a data structure the classifier can work with:


>>> from nltk.tokenize import word_tokenize # or use some other tokenizer

>>> all_words = set(word.lower() for passage in train for word in word_tokenize(passage[0]))

>>> t = [({word: (word in word_tokenize(x[0])) for word in all_words}, x[1]) for x in train]

Your data should now be structured like this:


>>> t

[({'this': True, 'love': True, 'deal': False, 'tired': False, 'feel': False, 'is': False, 'am': False, 'an': False, 'sandwich': True, 'ca': False, 'best': False, '!': False, 'what': False, '.': True, 'amazing': False, 'horrible': False, 'sworn': False, 'awesome': False, 'do': False, 'good': False, 'very': False, 'boss': False, 'beers': False, 'not': False, 'with': False, 'he': False, 'enemy': False, 'about': False, 'like': False, 'restaurant': False, 'these': False, 'of': False, 'work': False, "n't": False, 'i': False, 'stuff': False, 'place': False, 'my': False, 'view': False}, 'pos'), . . .]

Notice that the first element of each tuple is now a dictionary. Now that your data is in place, you can train the classifier like this:


>>> import nltk

>>> classifier = nltk.NaiveBayesClassifier.train(t)

>>> classifier.show_most_informative_features()

Most Informative Features

                    this = True              neg : pos    =      2.3 : 1.0

                    this = False             pos : neg    =      1.8 : 1.0

                      an = False             neg : pos    =      1.6 : 1.0

                       . = True              pos : neg    =      1.4 : 1.0

                       . = False             neg : pos    =      1.4 : 1.0

                 awesome = False             neg : pos    =      1.2 : 1.0

                      of = False             pos : neg    =      1.2 : 1.0

                    feel = False             neg : pos    =      1.2 : 1.0

                   place = False             neg : pos    =      1.2 : 1.0

                horrible = False             pos : neg    =      1.2 : 1.0

If you want to use the classifier, you can do it like this. First, start with a test sentence:


>>> test_sentence = "This is the best band I've ever heard!"

Then you tokenize that sentence and figure out which words it shares with all_words. Those constitute the sentence's features.


>>> test_sent_features = {word: (word in word_tokenize(test_sentence.lower())) for word in all_words}

Your features will now look like this:


>>> test_sent_features

{'love': False, 'deal': False, 'tired': False, 'feel': False, 'is': True, 'am': False, 'an': False, 'sandwich': False, 'ca': False, 'best': True, '!': True, 'what': False, 'i': True, '.': False, 'amazing': False, 'horrible': False, 'sworn': False, 'awesome': False, 'do': False, 'good': False, 'very': False, 'boss': False, 'beers': False, 'not': False, 'with': False, 'he': False, 'enemy': False, 'about': False, 'like': False, 'restaurant': False, 'this': True, 'of': False, 'work': False, "n't": False, 'these': False, 'stuff': False, 'place': False, 'my': False, 'view': False}

然后,您只需对这些功能进行分类:


>>> classifier.classify(test_sent_features)

'pos' # note 'best' == True in the sentence features above

This test sentence appears to be positive.
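If you want the label probabilities rather than just the winning label, NLTK classifiers also expose prob_classify. A minimal sketch, reusing classifier and test_sent_features from above:

>>> prob_dist = classifier.prob_classify(test_sent_features)
>>> prob_dist.max()  # the same label classify() returns
'pos'
>>> prob_dist.prob('pos') > prob_dist.prob('neg')
True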


Answered 2019-12-26
慕桂英4014372

The answer above is a great tutorial on the data structure for NLTK's Bayes classifier. From a higher level, we can look at it like this:


We have input sentences with sentiment tags:


training_data = [('I love this sandwich.', 'pos'),

('This is an amazing place!', 'pos'),

('I feel very good about these beers.', 'pos'),

('This is my best work.', 'pos'),

("What an awesome view", 'pos'),

('I do not like this restaurant', 'neg'),

('I am tired of this stuff.', 'neg'),

("I can't deal with this", 'neg'),

('He is my sworn enemy!', 'neg'),

('My boss is horrible.', 'neg')]

Let's treat the feature set as individual words, so we extract a list of all possible words from the training data (call it the vocabulary), like this:


from nltk.tokenize import word_tokenize

from itertools import chain

vocabulary = set(chain(*[word_tokenize(i[0].lower()) for i in training_data]))

Essentially, vocabulary here is the same as @275365's all_words:


>>> all_words = set(word.lower() for passage in training_data for word in word_tokenize(passage[0]))

>>> vocabulary = set(chain(*[word_tokenize(i[0].lower()) for i in training_data]))

>>> print vocabulary == all_words

True

From each data point (i.e., each sentence with its pos/neg tag), we want to say whether a feature (i.e., a word from the vocabulary) exists or not.


>>> sentence = word_tokenize('I love this sandwich.'.lower())

>>> print {i:True for i in vocabulary if i in sentence}

{'this': True, 'i': True, 'sandwich': True, 'love': True, '.': True}

But we also want to tell the classifier which vocabulary words do NOT appear in the sentence, so for each data point we list out every possible word in the vocabulary and say whether each one exists or not:


>>> sentence = word_tokenize('I love this sandwich.'.lower())

>>> x =  {i:True for i in vocabulary if i in sentence}

>>> y =  {i:False for i in vocabulary if i not in sentence}

>>> x.update(y)

>>> print x

{'love': True, 'deal': False, 'tired': False, 'feel': False, 'is': False, 'am': False, 'an': False, 'good': False, 'best': False, '!': False, 'these': False, 'what': False, '.': True, 'amazing': False, 'horrible': False, 'sworn': False, 'ca': False, 'do': False, 'sandwich': True, 'very': False, 'boss': False, 'beers': False, 'not': False, 'with': False, 'he': False, 'enemy': False, 'about': False, 'like': False, 'restaurant': False, 'this': True, 'of': False, 'work': False, "n't": False, 'i': True, 'stuff': False, 'place': False, 'my': False, 'awesome': False, 'view': False}

But since that loops through the vocabulary twice, it is more efficient to do this:


>>> sentence = word_tokenize('I love this sandwich.'.lower())

>>> x = {i:(i in sentence) for i in vocabulary}

{'love': True, 'deal': False, 'tired': False, 'feel': False, 'is': False, 'am': False, 'an': False, 'good': False, 'best': False, '!': False, 'these': False, 'what': False, '.': True, 'amazing': False, 'horrible': False, 'sworn': False, 'ca': False, 'do': False, 'sandwich': True, 'very': False, 'boss': False, 'beers': False, 'not': False, 'with': False, 'he': False, 'enemy': False, 'about': False, 'like': False, 'restaurant': False, 'this': True, 'of': False, 'work': False, "n't": False, 'i': True, 'stuff': False, 'place': False, 'my': False, 'awesome': False, 'view': False}

So for each sentence, we tell the classifier which words exist and which do not, and give it the pos/neg tag. We can call the result the feature_set: a list of tuples, each made up of an x (as shown above) and its tag.


>>> feature_set = [({i:(i in word_tokenize(sentence.lower())) for i in vocabulary},tag) for sentence, tag in training_data]

[({'this': True, 'love': True, 'deal': False, 'tired': False, 'feel': False, 'is': False, 'am': False, 'an': False, 'sandwich': True, 'ca': False, 'best': False, '!': False, 'what': False, '.': True, 'amazing': False, 'horrible': False, 'sworn': False, 'awesome': False, 'do': False, 'good': False, 'very': False, 'boss': False, 'beers': False, 'not': False, 'with': False, 'he': False, 'enemy': False, 'about': False, 'like': False, 'restaurant': False, 'these': False, 'of': False, 'work': False, "n't": False, 'i': False, 'stuff': False, 'place': False, 'my': False, 'view': False}, 'pos'), ...]

Then we feed these features and tags in the feature_set to the classifier to train it:


from nltk import NaiveBayesClassifier as nbc

classifier = nbc.train(feature_set)

Now you have a trained classifier, and when you want to tag a new sentence, you have to "featurize" it to see which of its words are in the vocabulary the classifier was trained on:


>>> test_sentence = "This is the best band I've ever heard! foobar"

>>> featurized_test_sentence = {i:(i in word_tokenize(test_sentence.lower())) for i in vocabulary}

NOTE: as you can see from the step above, the naive Bayes classifier cannot handle out-of-vocabulary words; the token foobar simply disappears after featurization.
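You can check this directly: the featurized dict is keyed only by vocabulary words, so the unknown token never shows up as a feature.

>>> 'foobar' in featurized_test_sentence
False
>>> len(featurized_test_sentence) == len(vocabulary)
True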


Then you feed the featurized test sentence into the classifier and ask it to classify it:


>>> classifier.classify(featurized_test_sentence)

'pos'

Hopefully this gives a clearer picture of how to feed data into NLTK's naive Bayes classifier for sentiment analysis. Here is the full code, without the comments and walkthrough:


from nltk import NaiveBayesClassifier as nbc

from nltk.tokenize import word_tokenize

from itertools import chain


training_data = [('I love this sandwich.', 'pos'),

('This is an amazing place!', 'pos'),

('I feel very good about these beers.', 'pos'),

('This is my best work.', 'pos'),

("What an awesome view", 'pos'),

('I do not like this restaurant', 'neg'),

('I am tired of this stuff.', 'neg'),

("I can't deal with this", 'neg'),

('He is my sworn enemy!', 'neg'),

('My boss is horrible.', 'neg')]


vocabulary = set(chain(*[word_tokenize(i[0].lower()) for i in training_data]))


feature_set = [({i:(i in word_tokenize(sentence.lower())) for i in vocabulary},tag) for sentence, tag in training_data]


classifier = nbc.train(feature_set)


test_sentence = "This is the best band I've ever heard!"

featurized_test_sentence =  {i:(i in word_tokenize(test_sentence.lower())) for i in vocabulary}


print "test_sent:",test_sentence

print "tag:",classifier.classify(featurized_test_sentence)


Answered 2019-12-26
阿波罗的战车

It looks like you are trying to use TextBlob but are actually training the NLTK NaiveBayesClassifier, which, as the other answers point out, must be passed a dictionary of features.


TextBlob has a default feature extractor that indicates which words in the training set are contained in the document (as demonstrated in the other answers), so TextBlob lets you pass your data in as is.


from textblob.classifiers import NaiveBayesClassifier


train = [('This is an amazing place!', 'pos'),

        ('I feel very good about these beers.', 'pos'),

        ('This is my best work.', 'pos'),

        ("What an awesome view", 'pos'),

        ('I do not like this restaurant', 'neg'),

        ('I am tired of this stuff.', 'neg'),

        ("I can't deal with this", 'neg'),

        ('He is my sworn enemy!', 'neg'),

        ('My boss is horrible.', 'neg') ] 

test = [

        ('The beer was good.', 'pos'),

        ('I do not enjoy my job', 'neg'),

        ("I ain't feeling dandy today.", 'neg'),

        ("I feel amazing!", 'pos'),

        ('Gary is a friend of mine.', 'pos'),

        ("I can't believe I'm doing this.", 'neg') ] 



classifier = NaiveBayesClassifier(train)  # Pass in data as is

# When classifying text, features are extracted automatically

classifier.classify("This is an amazing library!")  # => 'pos'

Of course, the simple default extractor is not appropriate for all problems. If you would like to choose your own features, just write a function that takes a string of text as input and outputs a dictionary of features, then pass it to the classifier, as sketched below.
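For example, a hypothetical my_extractor_func (the name just matches the line below; this is a sketch, not TextBlob's built-in) could use lowercased whitespace tokens as features:

def my_extractor_func(document):
    # TextBlob calls this with the raw text; return a feature dict.
    tokens = document.lower().split()
    return {"contains({0})".format(word): True for word in tokens}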


classifier = NaiveBayesClassifier(train, feature_extractor=my_extractor_func)

I encourage you to check out the short TextBlob classifier tutorial here: http://textblob.readthedocs.org/en/latest/classifiers.html


Answered 2019-12-26
