
Python - Grouping a DataFrame based on a specific string

Python

慕尼黑的夜晚无繁华 2022-01-11 19:56:09

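I am trying to combine some strings and rows according to certain logic:

import numpy as np
import pandas as pd

s1 = ['abc.txt','abc.txt','ert.txt','ert.txt','ert.txt']
s2 = [1,1,2,2,2]
s3 = ['Harry Potter','Vol 1','Lord of the Rings - Vol 1',np.nan,'Harry Potter']
df = pd.DataFrame(list(zip(s1,s2,s3)), columns=['file','id','book'])
df

Data preview:

file     id  book
abc.txt  1   Harry Potter
abc.txt  1   Vol 1
ert.txt  2   Lord of the Rings - Vol 1
ert.txt  2   NaN
ert.txt  2   Harry Potter

I have a column of file names, each associated with an id. I also have a 'book' column in which 'Vol 1' sits on a separate row. I know that in this dataset 'Vol 1' is associated only with 'Harry Potter'. Grouping by 'file' and 'id', how can I merge 'Vol 1' into the row where the 'Harry Potter' string appears? Note that some 'Harry Potter' rows have no 'Vol 1'; I only want 'Vol 1' appended when it appears within the same file and id group.

Two attempts:

First (does not work):

if (df['book'] == 'Harry Potter' and df['book'].str.contains('Vol 1',case=False) in df.groupby(['file','id'])):
    df.groupby(['file','id'],as_index=False).first()

Second (this applies to every matching string, but I do not want it applied to every 'Harry Potter' string):

df.loc[df['book'].str.contains('Harry Potter',case=False,na=False), 'new_book'] = 'Harry Potter - Vol 1'

This is the output I am looking for:

file     id  book
abc.txt  1   Harry Potter - Vol 1
ert.txt  2   Lord of the Rings - Vol 1
ert.txt  2   NaN
ert.txt  2   Harry Potter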


3 Answers

杨__羊羊

Contributed 1943 experience points, received more than 7 likes

Start with the imports (re is used further down):

import re
import numpy as np
import pandas as pd

Then create your DataFrame:

df = pd.DataFrame({
    'file': ['abc.txt','abc.txt','ert.txt','ert.txt','ert.txt'],
    'id': [1, 1, 2, 2, 2],
    'book': ['Harry Potter', 'Vol 1', 'Lord of the Rings - Vol 1',
             np.nan, 'Harry Potter']})

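The first processing step is to add a column, call it book2, that holds the book value from the next row:

df["book2"] = df.book.shift(-1).fillna('')

I added fillna('') to replace NaN values with an empty string.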

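Then define a function to be applied to each row:

def fn(row):
    # Append the next row's volume label only to 'Harry Potter' rows.
    if row.book == 'Harry Potter' and re.match(r'^Vol \d+$', row.book2):
        return f"{row.book} - {row.book2}"
    return row.book

This function checks whether book == "Harry Potter" and whether book2 matches "Vol" followed by a sequence of digits. If both hold, it returns book and book2 joined by " - "; otherwise it returns book unchanged.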
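To see what the pattern used in fn accepts, here is a small check. The helper name is_volume is made up for this illustration; only the regular expression comes from the answer.

```python
import re

# Same pattern as in fn: the whole string must be 'Vol ' followed by digits.
VOL_PATTERN = re.compile(r'^Vol \d+$')

def is_volume(text):
    """Return True when text is exactly 'Vol ' plus one or more digits."""
    return bool(VOL_PATTERN.match(text))

print(is_volume('Vol 1'))                      # True
print(is_volume('Vol 12'))                     # True
print(is_volume(''))                           # False
print(is_volume('Lord of the Rings - Vol 1'))  # False
```

The empty string never matches, which is why filling the shifted column with '' is a safe placeholder for missing values.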

Then apply this function and save the result back into book:

df["book"] = df.apply(fn, axis=1)

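All that remains is to drop:

the rows where book matches Vol \d+,

the book2 column.

The code is:

df = df.drop(df[df.book.str.match(r'^Vol \d+$').fillna(False)].index)\
    .drop(columns=['book2'])

fillna(False) is needed because str.match returns NaN wherever the source value is NaN, and a mask containing NaN cannot be used for boolean indexing.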
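One caveat with the steps above: shift(-1) looks at the next row of the whole frame, so a 'Vol 1' row that opens the next (file, id) group would be attached to a 'Harry Potter' row that closes the previous one. A possible variant, which is not part of the original answer, shifts within each group instead. The names nxt, to_merge, is_vol_row and result are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'file': ['abc.txt', 'abc.txt', 'ert.txt', 'ert.txt', 'ert.txt'],
    'id': [1, 1, 2, 2, 2],
    'book': ['Harry Potter', 'Vol 1', 'Lord of the Rings - Vol 1',
             np.nan, 'Harry Potter']})

# Next row's book within the same (file, id) group; '' where there is none.
nxt = df.groupby(['file', 'id'])['book'].shift(-1).fillna('')

# Rows to extend: 'Harry Potter' immediately followed by a 'Vol <n>' row.
to_merge = df['book'].eq('Harry Potter') & nxt.str.match(r'^Vol \d+$')

# Rows that hold only a volume label; they are dropped afterwards,
# as in the answer above.
is_vol_row = df['book'].str.match(r'^Vol \d+$', na=False)

result = df.copy()
result.loc[to_merge, 'book'] = df.loc[to_merge, 'book'] + ' - ' + nxt[to_merge]
result = result[~is_vol_row].reset_index(drop=True)
print(result)
```

Passing na=False to str.match plays the same role as fillna(False) in the answer: missing values are treated as non-matching.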

Answered 2022-01-11

拉莫斯之舞

Contributed 1820 experience points, received more than 10 likes

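Assuming the "Vol x" row appears directly after the title row, I would use a helper Series obtained by shifting the book column by -1. It is then enough to append that Series to the book column wherever it starts with "Vol ", and to drop the rows whose book column starts with "Vol ". The code could be: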

b2 = df.book.shift(-1).fillna('')

df['book'] = df.book + np.where(b2.str.match('Vol [0-9]+'), ' - ' + b2, '')

print(df.drop(df.loc[df.book.fillna('').str.match('Vol [0-9]+')].index))

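If the row order in the DataFrame is not guaranteed, but each Vol x row matches another row with the same file and id, you can split the original DataFrame into two parts, one holding the Vol x rows and one holding the other rows, and update the latter from the former:

g = df.groupby(df.book.fillna('').str.match('Vol [0-9]+'))
for k, v in g:
    if k:
        df_vol = v          # rows that hold only a volume label
    else:
        df = v.copy()       # all other rows
for _, r in df_vol.iterrows():
    # Append the volume label to every row sharing the same file and id.
    df.loc[(df.file == r.file) & (df.id == r.id), 'book'] += ' - ' + r['book']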
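Note that the loop above appends the volume label to every row of the matching group, whatever its title. If only 'Harry Potter' rows should be extended, as the question asks, one option is a left merge followed by a mask. This is a sketch, not the answerer's code; the names keys, vols, out and mask are illustrative, and it assumes at most one Vol x row per (file, id) group.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'file': ['abc.txt', 'abc.txt', 'ert.txt', 'ert.txt', 'ert.txt'],
    'id': [1, 1, 2, 2, 2],
    'book': ['Harry Potter', 'Vol 1', 'Lord of the Rings - Vol 1',
             np.nan, 'Harry Potter']})

keys = ['file', 'id']
is_vol = df['book'].str.match(r'Vol [0-9]+', na=False)

# One volume label per (file, id) group (assumption: at most one per group).
vols = (df.loc[is_vol, keys + ['book']]
          .drop_duplicates(keys)
          .rename(columns={'book': 'vol'}))

# Attach the group's volume label to each non-volume row; NaN if the group has none.
out = df[~is_vol].merge(vols, on=keys, how='left')

# Extend only 'Harry Potter' rows whose group actually has a volume label.
mask = out['book'].eq('Harry Potter') & out['vol'].notna()
out.loc[mask, 'book'] = out.loc[mask, 'book'] + ' - ' + out.loc[mask, 'vol']
out = out.drop(columns='vol')
print(out)
```

Because the match is done on file and id rather than on row position, this does not depend on the Vol x row following the title row.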

Answered 2022-01-11

喵喵时光机

Contributed 1846 experience points, received more than 7 likes

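Use merge, apply, update, and drop_duplicates.

Call set_index on file and id, then merge the 'Harry Potter' rows of df with the 'Vol 1' rows of df on that index. join builds the combined string, and to_frame converts the result into a DataFrame: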

df.set_index(['file', 'id'], inplace=True)

df1 = df[df['book'] == 'Harry Potter'].merge(df[df['book'] == 'Vol 1'], left_index=True, right_index=True).apply(' '.join, axis=1).to_frame(name='book')

Out[2059]:
                          book
file    id
abc.txt 1   Harry Potter Vol 1

Update the original df, then call drop_duplicates and reset_index:

df.update(df1)

df.drop_duplicates().reset_index()

Out[2065]:
      file  id                       book
0  abc.txt   1         Harry Potter Vol 1
1  ert.txt   2  Lord of the Rings - Vol 1
2  ert.txt   2                        NaN
3  ert.txt   2               Harry Potter
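One thing to keep in mind with this approach: drop_duplicates compares column values only and ignores the index, so while file and id are still in the index, two rows with the same book in different files count as duplicates. The small frame below is made-up data (the names demo, by_values and by_rows are illustrative) showing the difference when reset_index is called first.

```python
import pandas as pd

# Hypothetical data: the same title under two different (file, id) keys.
idx = pd.MultiIndex.from_tuples([('a.txt', 1), ('b.txt', 2)],
                                names=['file', 'id'])
demo = pd.DataFrame({'book': ['Harry Potter', 'Harry Potter']}, index=idx)

# Index ignored: both rows have the same 'book', so one is dropped.
by_values = demo.drop_duplicates()

# Keys moved into columns first: the rows differ, so both are kept.
by_rows = demo.reset_index().drop_duplicates()

print(len(by_values))  # 1
print(len(by_rows))    # 2
```

On the sample data in the question this makes no difference, since no title repeats across groups.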

Answered 2022-01-11
