
ValueError: Length of values does not match length of index in nested loop

Python

九州编程 2022-11-29 17:05:41

I'm trying to remove stop words from every row of a column. Each row holds a list of tokens, because I already tokenized the text with nltk's word_tokenize. I tried to remove the stop words with the nested list comprehension below, but it raises ValueError: Length of values does not match length of index. How can I fix this?

import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

data = pd.read_csv(r"D:/python projects/read_files/spam.csv", encoding = "latin-1")
data = data[['v1','v2']]
data = data.rename(columns = {'v1': 'label', 'v2': 'text'})

stopwords = set(stopwords.words('english'))
data['text'] = data['text'].str.lower()
data['new'] = [word_tokenize(row) for row in data['text']]
data['new'] = [word for new in data['new'] for word in new if word not in stopwords]

My text data:

data['text'].head(5)
Out[92]:
0    go until jurong point, crazy.. available only ...
1    ok lar... joking wif u oni...
2    free entry in 2 a wkly comp to win fa cup fina...
3    u dun say so early hor... u c already then say...
4    nah i don't think he goes to usf, he lives aro...
Name: text, dtype: object

After tokenizing with nltk's word_tokenize:

data['new'].head(5)
Out[89]:
0    [go, until, jurong, point, ,, crazy.., availab...
1    [ok, lar, ..., joking, wif, u, oni, ...]
2    [free, entry, in, 2, a, wkly, comp, to, win, f...
3    [u, dun, say, so, early, hor, ..., u, c, alrea...
4    [nah, i, do, n't, think, he, goes, to, usf, ,,...
Name: new, dtype: object

Traceback:

runfile('D:/python projects/NLP_nltk_first.py', wdir='D:/python projects')
Traceback (most recent call last):
  File "D:\python projects\NLP_nltk_first.py", line 36, in <module>
    data['new'] = [new for new in data['new'] for word in new if word not in stopwords]
  File "C:\Users\Ramadhina\Anaconda3\lib\site-packages\pandas\core\frame.py", line 3487, in __setitem__
    self._set_item(key, value)


1 Answer

子衿沉夜


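Read the error message carefully:

ValueError: Length of values does not match length of index

In this case, the "values" are whatever is on the right-hand side of the = sign: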

values = [word for new in data['new'] for word in new if word not in stopwords]

The "index" in this case is the row index of the DataFrame:

index = data.index

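Here index always has exactly as many entries as the DataFrame has rows.

The problem is that values is too long for index, i.e. it contains more elements than the DataFrame has rows. If you inspect your code, the reason should be apparent: the nested comprehension flattens all rows into a single list, so it yields one element per remaining token rather than one element per row. If you still can't see the problem, try the following: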

data['text_tokenized'] = [word_tokenize(row) for row in data['text']]
values = [word for new in data['text_tokenized'] for word in new if word not in stopwords]

print('N rows:', data.shape[0])
print('N new values:', len(values))
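To reproduce the mismatch without the spam dataset or NLTK, here is a minimal sketch on a three-row toy DataFrame. The sample sentences, the tiny stop_words set, and the use of str.split in place of word_tokenize are all assumptions made for illustration, not part of the original code.

```python
import pandas as pd

# Toy rows standing in for the spam CSV (hypothetical data).
data = pd.DataFrame({"text": ["go until the point", "ok lar", "free entry in a comp"]})

# Hypothetical stop word set; the question uses nltk's English list instead.
stop_words = {"the", "in", "a", "until"}

# str.split stands in for nltk's word_tokenize.
data["text_tokenized"] = data["text"].map(str.split)

# Same shape of comprehension as in the question: it flattens all rows,
# producing one element per kept token instead of one element per row.
values = [word for tokens in data["text_tokenized"] for word in tokens if word not in stop_words]

n_rows = data.shape[0]
n_values = len(values)
print("N rows:", n_rows)
print("N new values:", n_values)

# Assigning a list whose length differs from the row count raises ValueError.
try:
    data["new"] = values
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

The two counts differ (3 rows versus 7 tokens here), which is exactly the condition the error message describes.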

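As for how to fix it, that depends entirely on what you are trying to achieve. One option is to "explode" the data (also note the use of .map instead of a list comprehension):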

data['text_tokenized'] = data['text'].map(word_tokenize)

# Flatten the token lists without a nested list comprehension
tokens_flat = data['text_tokenized'].explode()

# Join your labels w/ your flattened tokens, if desired
data_flat = data[['label']].join(tokens_flat)

# Add a 2nd index level to track token appearance order,
# might make your life easier. Count within data_flat (not data),
# so the tokens of each message are numbered 0, 1, 2, ...
data_flat['token_id'] = data_flat.groupby(level=0).cumcount()
data_flat = data_flat.set_index('token_id', append=True)
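If the goal is instead to keep one filtered token list per message (the assignment back to data['new'] in the question suggests this, though that is an assumption), the inner comprehension can be wrapped in its own brackets so the result has exactly one element per row. A minimal sketch, again with toy rows, a hypothetical stop_words set, and str.split standing in for word_tokenize:

```python
import pandas as pd

# Toy rows standing in for the spam CSV (hypothetical data).
data = pd.DataFrame({
    "label": ["ham", "spam"],
    "text": ["go until the point", "free entry in a comp"],
})

# Hypothetical stop word set; the question uses nltk's English list instead.
stop_words = {"the", "in", "a", "until"}

# str.split stands in for nltk's word_tokenize.
data["new"] = data["text"].map(str.split)

# The outer comprehension walks the rows, the inner one filters a single row,
# so the result has one list per row and matches the index length.
data["new"] = [[word for word in tokens if word not in stop_words] for tokens in data["new"]]

# The same filtering expressed with .map, without touching the DataFrame.
filtered = data["text"].map(str.split).map(
    lambda tokens: [word for word in tokens if word not in stop_words]
)

print(data["new"].tolist())  # [['go', 'point'], ['free', 'entry', 'comp']]
```

Which variant fits better depends on the later steps: per-row lists keep the original row count, while the exploded form gives one row per token.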

As an unrelated tip, you can make your CSV handling more efficient by loading only the columns you need, like so:

data = pd.read_csv(r"D:/python projects/read_files/spam.csv",
                   encoding="latin-1",
                   usecols=["v1", "v2"])
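The effect of usecols can be checked without the real file. The sketch below reads a made-up three-column CSV from memory; the column named extra and the sample rows are assumptions for illustration only.

```python
import io

import pandas as pd

# Hypothetical stand-in for spam.csv: the two wanted columns plus an unused one.
csv_text = "v1,v2,extra\nham,ok lar,x\nspam,free entry,y\n"

# usecols restricts parsing to the listed columns; "extra" is never loaded.
data = pd.read_csv(io.StringIO(csv_text), usecols=["v1", "v2"])

# Same renaming step as in the question.
data = data.rename(columns={"v1": "label", "v2": "text"})

print(list(data.columns))  # ['label', 'text']
```

With usecols in place, the separate data = data[['v1','v2']] selection from the question is no longer needed.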

2022-11-29
