
Using pandas str.split where the number of splits needs to come from a pandas column value

Python

侃侃尔雅 2023-06-20 10:24:58

I'm having some trouble with pandas str.split when the number of splits should come from a column value instead of a static value. I looked around for similar questions, but most of them only take a static approach.

My dataframe is below. .str.split('|', n=1).str[-1] removes the left part of the string up to the first occurrence of the pipe ('|'). This static approach performs the same operation on the whole Series, because the occurrence argument never changes.

What I would like: .str.split('|', df['occurrence']).str[-1] should be dynamic, taking the value in the occurrence column and using it as the occurrence argument of str.split. If the value is zero or less, the string should be left unchanged.

The lambda statement below does run, but it works from the right side of the string: it splits on the pipes and joins the pieces based on the value. I just can't get it to do the same thing from the left side. One last note: the removal needs to start from the left of the string.

#-------------------
import pandas as pd

data_1 = {'occurrence': [7, 2, 0, 3, 4, 0],
          'string': ['1|2|3|4|5|6|7|8|9|10|11|12', '10|11.2|12.2|13.6|14.7', '1|2|3',
                     '1|2|3|4|5|6|7|8', '1|2.4|3|4.6|5|6.2|7|8.1', '1|2|3|4|5']}
df = pd.DataFrame(data_1)

# Works but is static only (n must be passed by keyword in pandas 2.0+)
df['string'] = df['string'].str.split('|', n=1).str[-1]

# Trying to use the occurrence column value as the argument
# df['string'] = df['string'].str.split('|', df['occurrence']).str[-1]

# Does work BUT starts with the right side of the string. Needs to be the left.
# df['string'] = df.apply(lambda x: '|'.join(x['string'].split('|')[:x.occurrence - 2]), axis=1)

print(df)
#-------------------

Start with:

occurrence  string
7           1|2|3|4|5|6|7|8|9|10|11|12
2           10|11.2|12.2|13.6|14.7
0           1|2|3
3           1|2|3|4|5|6|7|8
4           1|2.4|3|4.6|5|6.2|7|8.1
0           1|2|3|4|5

What I would like:

occurrence  string
7           8|9|10|11|12
2           12.2|13.6|14.7
0           1|2|3
3           4|5|6|7|8
4           5|6.2|7|8.1
0           1|2|3|4|5

Any help with this would be greatly appreciated. As always, your time is valuable and I thank you for it.


3 Answers

慕村225694

Contributed 1880 experience points, received more than 4 likes

Use Series.str.split to split the string column around the delimiter |, then use a list comprehension to zip the split column with occurrence and process the values:

df['string'] = ['|'.join(s[i:]) for i, s in zip(df['occurrence'], df['string'].str.split('|'))]

Result:

print(df)

occurrence string

0 7 8|9|10|11|12

1 2 12.2|13.6|14.7

2 0 1|2|3

3 3 4|5|6|7|8

4 4 5|6.2|7|8.1

5 0 1|2|3|4|5
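The question also asks that rows with an occurrence of zero or less be left unchanged. The list comprehension above covers zero, but a negative value would slice from the end of the list instead. The following is a minimal sketch of one way to guard against that by clamping the count at zero. The helper name drop_left_fields and the extra row with -1 are assumptions made for illustration; they are not part of pandas or of the original data.

```python
import pandas as pd


def drop_left_fields(strings: pd.Series, counts: pd.Series, sep: str = "|") -> list:
    """Drop the first `count` sep-delimited fields from each string.

    Hypothetical helper (not a pandas API). Counts of zero or less
    leave the string unchanged because the slice start is clamped to 0.
    """
    parts = strings.str.split(sep)
    return [sep.join(p[max(int(n), 0):]) for n, p in zip(counts, parts)]


df = pd.DataFrame({
    # The last row uses -1 (an illustrative value, not from the question).
    "occurrence": [7, 2, 0, -1],
    "string": [
        "1|2|3|4|5|6|7|8|9|10|11|12",
        "10|11.2|12.2|13.6|14.7",
        "1|2|3",
        "1|2|3|4|5",
    ],
})

df["result"] = drop_left_fields(df["string"], df["occurrence"])
print(df)
```

This sketch assumes there are no missing values in either column; NaN entries would need separate handling.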

Performance (measured with timeit):

df.shape

(60000, 2)

%%timeit -n10

_ = ['|'.join(s[i:]) for i, s in zip(df['occurrence'], df['string'].str.split('|'))]

67.9 ms ± 2.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit -n10 (using 'apply')

_ = df.apply(lambda x: '|'.join(x['string'].split('|')[x.occurrence:]), axis=1)

1.93 s ± 34.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
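The %%timeit cells above only run inside IPython or Jupyter. To repeat the comparison in a plain script, something like the following sketch with the standard timeit module could be used. The benchmark data is an assumption: the original answer only shows the shape of its dataframe, so this repeats the six rows from the question, and it uses fewer rows so that the apply version finishes quickly. Timings will differ by machine.

```python
import timeit

import pandas as pd

base = pd.DataFrame({
    "occurrence": [7, 2, 0, 3, 4, 0],
    "string": [
        "1|2|3|4|5|6|7|8|9|10|11|12",
        "10|11.2|12.2|13.6|14.7",
        "1|2|3",
        "1|2|3|4|5|6|7|8",
        "1|2.4|3|4.6|5|6.2|7|8.1",
        "1|2|3|4|5",
    ],
})

# Assumed test data: the six question rows repeated 1,000 times (6,000 rows).
# Raise the multiplier to 10_000 to match the (60000, 2) shape shown above.
big = pd.concat([base] * 1_000, ignore_index=True)


def with_zip(frame):
    """List comprehension over the pre-split column."""
    split = frame["string"].str.split("|")
    return ["|".join(s[i:]) for i, s in zip(frame["occurrence"], split)]


def with_apply(frame):
    """Row-wise DataFrame.apply, splitting each string inside the lambda."""
    return frame.apply(
        lambda x: "|".join(x["string"].split("|")[x.occurrence:]), axis=1
    ).tolist()


# Both approaches should agree before their speed is compared.
assert with_zip(base) == with_apply(base)

t_zip = timeit.timeit(lambda: with_zip(big), number=3)
t_apply = timeit.timeit(lambda: with_apply(big), number=3)
print(f"zip + list comprehension: {t_zip / 3:.4f} s per run")
print(f"DataFrame.apply:          {t_apply / 3:.4f} s per run")
```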

Replied 2023-06-20

交互式爱情

Contributed 1712 experience points, received more than 3 likes

Try changing your lambda expression to:

df.apply(lambda x: '|'.join(x['string'].split('|')[x.occurrence:]), axis=1)

If you want to keep the trailing elements (the right-hand side), the slice should start at occurrence as its index.

Result:

0 8|9|10|11|12

1 12.2|13.6|14.7

2 1|2|3

3 4|5|6|7|8

4 5|6.2|7|8.1

5 1|2|3|4|5
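The form the question originally asked for, a split limited to occurrence splits followed by taking the last piece, can also be written per row with Python's built-in str.split and its maxsplit argument. The sketch below assumes a hypothetical helper name drop_left and works on plain lists; the same comprehension can be fed with zip(df['occurrence'], df['string']).

```python
def drop_left(s: str, n: int, sep: str = "|") -> str:
    """Remove everything up to and including the n-th `sep`, counted from the left.

    Hypothetical helper for illustration. A count of zero or less
    returns the string unchanged.
    """
    if n <= 0:
        return s
    # Split at most n times from the left; the last piece is the untouched tail.
    return s.split(sep, n)[-1]


occurrences = [7, 2, 0, 3, 4, 0]
strings = [
    "1|2|3|4|5|6|7|8|9|10|11|12",
    "10|11.2|12.2|13.6|14.7",
    "1|2|3",
    "1|2|3|4|5|6|7|8",
    "1|2.4|3|4.6|5|6.2|7|8.1",
    "1|2|3|4|5",
]

results = [drop_left(s, n) for n, s in zip(occurrences, strings)]
print(results)
```

One difference from the slicing version: when n is larger than the number of delimiters in the string, this returns the last field, whereas slicing the split list returns an empty string. Which behavior is preferable depends on the data.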

Replied 2023-06-20

茅侃侃

Contributed 1842 experience points, received more than 22 likes

A somewhat unorthodox approach: build a regular expression from df['occurrence'] and use it to match:

df['regex'] = df['occurrence'].map(lambda o: '^' + r'(?:[^|]*\|)'*o + r'(.*)$')

df['regex']

0 ^(?:[^|]*\|)(?:[^|]*\|)(?:[^|]*\|)(?:[^|]*\|)(...

1 ^(?:[^|]*\|)(?:[^|]*\|)(.*)$

2 ^(.*)$

3 ^(?:[^|]*\|)(?:[^|]*\|)(?:[^|]*\|)(.*)$

4 ^(?:[^|]*\|)(?:[^|]*\|)(?:[^|]*\|)(?:[^|]*\|)(...

5 ^(.*)$

Name: regex, dtype: object

Now just apply re.match to each row:

import re

df.apply(lambda row: re.match(row['regex'], row['string']).group(1), axis=1)

0 8|9|10|11|12

1 12.2|13.6|14.7

2 1|2|3

3 4|5|6|7|8

4 5|6.2|7|8.1

5 1|2|3|4|5

dtype: object
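A variation on the same idea is to express the repetition with a regex quantifier, {n}, instead of repeating the group text n times, and to skip the intermediate regex column. The sketch below is one possible way to do that on plain strings. The helper name drop_left_regex is hypothetical, and returning the string unchanged when the pattern does not match (fewer than n delimiters) is an assumption; the original answer would raise an AttributeError in that case because re.match returns None.

```python
import re


def drop_left_regex(s: str, n: int) -> str:
    """Drop the first n pipe-delimited fields using a quantified regex.

    Hypothetical helper for illustration. Zero or negative n returns
    the string unchanged.
    """
    if n <= 0:
        return s
    # (?:[^|]*\|){n} consumes exactly n "field|" prefixes; (.*) captures the rest.
    pattern = r"^(?:[^|]*\|){%d}(.*)$" % n
    match = re.match(pattern, s)
    # Assumption: if there are fewer than n delimiters, keep the string as-is.
    return match.group(1) if match else s


occurrences = [7, 2, 0, 3, 4, 0]
strings = [
    "1|2|3|4|5|6|7|8|9|10|11|12",
    "10|11.2|12.2|13.6|14.7",
    "1|2|3",
    "1|2|3|4|5|6|7|8",
    "1|2.4|3|4.6|5|6.2|7|8.1",
    "1|2|3|4|5",
]

results = [drop_left_regex(s, n) for n, s in zip(occurrences, strings)]
print(results)
```

The re module caches recently compiled patterns, so rows that share the same n do not pay the compilation cost each time.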

Replied 2023-06-20
