
ValueError: Length of values does not match length of index in a nested loop

九州编程 2022-11-29 17:05:41
I'm trying to remove the stopwords from every row in a column. The column contains rows of text that I have already word_tokenized with nltk, so each row is now a list of tokens. I tried to remove the stopwords with this nested list comprehension, but it raises ValueError: Length of values does not match length of index. How do I fix this?

import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

data = pd.read_csv(r"D:/python projects/read_files/spam.csv",
                   encoding="latin-1")
data = data[['v1', 'v2']]
data = data.rename(columns={'v1': 'label', 'v2': 'text'})

stopwords = set(stopwords.words('english'))

data['text'] = data['text'].str.lower()
data['new'] = [word_tokenize(row) for row in data['text']]
data['new'] = [word for new in data['new'] for word in new if word not in stopwords]

My text data:

data['text'].head(5)
Out[92]:
0    go until jurong point, crazy.. available only ...
1                        ok lar... joking wif u oni...
2    free entry in 2 a wkly comp to win fa cup fina...
3    u dun say so early hor... u c already then say...
4    nah i don't think he goes to usf, he lives aro...
Name: text, dtype: object

After I word_tokenize it with nltk:

data['new'].head(5)
Out[89]:
0    [go, until, jurong, point, ,, crazy.., availab...
1             [ok, lar, ..., joking, wif, u, oni, ...]
2    [free, entry, in, 2, a, wkly, comp, to, win, f...
3    [u, dun, say, so, early, hor, ..., u, c, alrea...
4    [nah, i, do, n't, think, he, goes, to, usf, ,,...
Name: new, dtype: object

Traceback:

runfile('D:/python projects/NLP_nltk_first.py', wdir='D:/python projects')
Traceback (most recent call last):
  File "D:\python projects\NLP_nltk_first.py", line 36, in <module>
    data['new'] = [new for new in data['new'] for word in new if word not in stopwords]
  File "C:\Users\Ramadhina\Anaconda3\lib\site-packages\pandas\core\frame.py", line 3487, in __setitem__
    self._set_item(key, value)

1 Answer

子衿沉夜


Read the error message carefully:


ValueError: Length of values does not match length of index


In this case, the "values" are whatever is on the right-hand side of the =:


values = [word for new in data['new'] for word in new if word not in stopwords]

The "index" in this case is the row index of your DataFrame:


index = data.index

The index always has the same number of rows as the DataFrame itself.


The problem is that your values are too long for your index, i.e. too long for the DataFrame. This should be obvious if you inspect your code: the nested comprehension flattens every token of every row into one long list. If you still can't see the problem, try this:


data['text_tokenized'] = [word_tokenize(row) for row in data['text']]


values = [word for new in data['text_tokenized'] for word in new if word not in stopwords]


print('N rows:', data.shape[0])

print('N new values:', len(values))
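If the goal is simply to keep one filtered token list per row, filtering inside each row (instead of flattening across rows) keeps the value and index lengths matched. A minimal sketch with made-up tokens and a toy stopword set (your real code would use nltk's stopwords):

```python
import pandas as pd

# Toy data standing in for the tokenized spam column
data = pd.DataFrame({"new": [["go", "until", "the", "point"],
                             ["ok", "a", "lar"]]})
stop = {"the", "a", "until"}

# Filter within each row: the result has one list per row,
# so its length matches the DataFrame's index
data["new"] = data["new"].map(lambda toks: [w for w in toks if w not in stop])
print(data["new"].tolist())  # → [['go', 'point'], ['ok', 'lar']]
```

Because .map produces exactly one output element per input row, the assignment can never raise the length-mismatch error.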

As for how to fix it: that depends entirely on what you're trying to achieve. One option is to "explode" the data (note also the use of .map instead of a list comprehension):


data['text_tokenized'] = data['text'].map(word_tokenize)


# Flatten the token lists without a nested list comprehension

tokens_flat = data['text_tokenized'].explode()


# Join your labels w/ your flattened tokens, if desired

data_flat = data[['label']].join(tokens_flat)


# Add a 2nd index level to track token appearance order,

# might make your life easier 

data_flat['token_id'] = data_flat.groupby(level=0).cumcount()

data_flat = data_flat.set_index('token_id', append=True)
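On a toy frame, the explode-and-reindex steps above behave like this (column names match the snippet; the data is made up):

```python
import pandas as pd

data = pd.DataFrame({"label": ["ham", "spam"],
                     "text_tokenized": [["go", "point"],
                                        ["free", "entry", "win"]]})

tokens_flat = data["text_tokenized"].explode()   # one row per token,
                                                 # original row index repeats
data_flat = data[["label"]].join(tokens_flat)    # labels repeat per token
data_flat["token_id"] = data_flat.groupby(level=0).cumcount()
data_flat = data_flat.set_index("token_id", append=True)

print(data_flat)
```

Each original row index now appears once per token, and token_id counts 0, 1, 2, ... within each original row, so no information about token order is lost.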

As an unrelated tip, you can make your CSV processing more efficient by loading only the columns you need, like this:


data = pd.read_csv(r"D:/python projects/read_files/spam.csv",

                    encoding="latin-1",

                    usecols=["v1", "v2"])
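A quick way to see the effect of usecols without the spam.csv file is to read a small in-memory CSV and confirm that only the requested columns come back (the extra v3 column here is made up for illustration):

```python
import io
import pandas as pd

csv_text = "v1,v2,v3\nham,hello there,x\nspam,win a prize,y\n"

# Only v1 and v2 are parsed; v3 is skipped entirely
data = pd.read_csv(io.StringIO(csv_text), usecols=["v1", "v2"])
print(list(data.columns))  # → ['v1', 'v2']
```

This also makes the rename-after-slice step in the original code unnecessary, since the unwanted columns never enter the DataFrame.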


Answered 2022-11-29