
How can I use extract to pull the uppercase part and some substrings from a pandas DataFrame?

Python

慕雪6442864 2023-12-12 14:46:41

This question is a follow-up to my previous question, "How to extract only uppercase substring from pandas series?". I decided to ask a new question rather than change the old one.

My goal is to extract the aggregation method agg and the feature name feat from a column named item. Here is the problem:

import numpy as np
import pandas as pd

df = pd.DataFrame({'item': ['num','bool', 'cat', 'cat.COUNT(example)','cat.N_MOST_COMMON(example.ord)[2]','cat.FIRST(example.ord)','cat.FIRST(example.num)']})

regexp = (r'(?P<agg>) '   # agg is the word in uppercase (all other substring is lowercased)
          r'(?P<feat>), ' # 1. if there is no uppercase, whole string is feat
                          # 2. if there is uppercase the substring after example. is feat
                          # e.g. cat ==> cat
                          #      cat.N_MOST_COMMON(example.ord)[2] ==> ord
          )

df[['agg','feat']] = df['item'].str.extract(regexp, expand=True)
# I am not sure how to build up regexp here.

print(df)

"""
Required output

                                item            agg  feat
0                                num                  num
1                               bool                 bool
2                                cat                  cat
3                 cat.COUNT(example)          COUNT        # note: here feat is empty
4  cat.N_MOST_COMMON(example.ord)[2]  N_MOST_COMMON   ord
5             cat.FIRST(example.ord)          FIRST   ord
6             cat.FIRST(example.num)          FIRST   num
"""


1 Answer

冉冉说


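For feat, since you already got an answer for agg in your other StackOverflow question, I think you can use the following: extract two separate series based on two different patterns separated by |, then fillna() one series with the other.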

^([^A-Z]*$) returns the full string, and only when the full string contains no uppercase letters.

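[^a-z].*example\.([a-z]+)\).*$ returns the lowercase text after example. and before ). In the combined pattern it only comes into play when the first alternative fails, that is, when the string contains an uppercase letter.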
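To see what each alternative does on its own, here is a minimal sketch using the standard re module (str.extract returns the groups of the first match in each string). The sample strings come from this thread; the names full_lower, after_example and first_group are only illustrative.

```python
import re

# The two alternatives from the answer, compiled separately
full_lower = re.compile(r'^([^A-Z]*$)')
after_example = re.compile(r'[^a-z].*example\.([a-z]+)\).*$')

def first_group(pattern, text):
    """Return group 1 of the first match, or None when nothing matches."""
    m = pattern.search(text)
    return m.group(1) if m else None

items = ['cat', 'cat.COUNT(example)', 'cat.FIRST(example.ord)', 'cat.count(example.aaa)']

# Map each item to (result of alternative 1, result of alternative 2)
results = {item: (first_group(full_lower, item), first_group(after_example, item))
           for item in items}

for item, (whole, feat) in results.items():
    print(f'{item!r}: whole={whole!r}, feat={feat!r}')
```

For the last string both alternatives match when tried separately. In the combined pattern the first alternative is tried first at the start of the string, so the whole string is what ends up in feat.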

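import pandas as pd

df = pd.DataFrame({'item': ['num','bool', 'cat', 'cat.COUNT(example)','cat.N_MOST_COMMON(example.ord)[2]','cat.FIRST(example.ord)','cat.FIRST(example.num)']})

# Column 0: whole string when it has no uppercase; column 1: lowercase text between "example." and ")"
s = df['item'].str.extract(r'^([^A-Z]*$)|[^a-z].*example\.([a-z]+)\).*$', expand=True)

# Prefer column 0, fall back to column 1, and use '' when neither alternative matched
df['feat'] = s[0].fillna(s[1]).fillna('')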

Out[1]:

                                item  feat
0                                num   num
1                               bool  bool
2                                cat   cat
3                 cat.COUNT(example)
4  cat.N_MOST_COMMON(example.ord)[2]   ord
5             cat.FIRST(example.ord)   ord
6             cat.FIRST(example.num)   num
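Since the question asks for agg and feat together, one possible way to get both from a single extract call is to use named groups. This is only a sketch and not the answerer's method: the group name plain, the character class [A-Z_]+ for the aggregation name, and the assumption that the argument always starts with example are all inferred from the sample data.

```python
import pandas as pd

df = pd.DataFrame({'item': ['num', 'bool', 'cat', 'cat.COUNT(example)',
                            'cat.N_MOST_COMMON(example.ord)[2]',
                            'cat.FIRST(example.ord)', 'cat.FIRST(example.num)']})

pattern = (
    r'^(?:'
    r'(?P<plain>[^A-Z]*)'        # no uppercase anywhere: keep the whole string
    r'|'
    r'[^A-Z]*'                   # lowercase prefix such as "cat."
    r'(?P<agg>[A-Z_]+)'          # uppercase aggregation name (assumed: A-Z and _ only)
    r'\(example'                 # assumed: the argument always starts with "example"
    r'(?:\.(?P<feat>[a-z]+))?'   # optional ".feature" part in lowercase
    r'\).*'                      # closing parenthesis and any suffix such as "[2]"
    r')$'
)

# Named groups become the column names 'plain', 'agg' and 'feat'
parts = df['item'].str.extract(pattern, expand=True)

df['agg'] = parts['agg'].fillna('')
# feat is the captured feature if present, else the plain string, else ''
df['feat'] = parts['feat'].fillna(parts['plain']).fillna('')

print(df)
```

Python's re module does not allow the same group name twice in one pattern, which is why the whole-string case is captured under a separate name and merged with fillna afterwards.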

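The two-alternative pattern with fillna() gives the output you are looking for on the sample data and meets your conditions. However: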

What if there is uppercase after example.? The current pattern would return ''.

See Example #2 below, where some of the data has been changed to illustrate this point:

df = pd.DataFrame({'item': ['num','cat.count(example.AAA)', 'cat.count(example.aaa)', 'cat.count(example)','cat.N_MOST_COMMON(example.ord)[2]','cat.FIRST(example.ord)','cat.FIRST(example.num)']})

# Same pattern as before, applied to the changed data
s = df['item'].str.extract(r'^([^A-Z]*$)|[^a-z].*example\.([a-z]+)\).*$', expand=True)

df['feat'] = s[0].fillna(s[1]).fillna('')

Out[2]:

                                item                    feat
0                                num                     num
1             cat.count(example.AAA)
2             cat.count(example.aaa)  cat.count(example.aaa)
3                 cat.count(example)      cat.count(example)
4  cat.N_MOST_COMMON(example.ord)[2]                     ord
5             cat.FIRST(example.ord)                     ord
6             cat.FIRST(example.num)                     num
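If a feature name written in uppercase after example. should still be extracted (an assumption, since the question does not say what is wanted in that case), one option is to widen the character class in the second alternative from [a-z] to [A-Za-z]. A minimal sketch on a subset of the Example #2 data, with relaxed as a purely illustrative name:

```python
import pandas as pd

df = pd.DataFrame({'item': ['num', 'cat.count(example.AAA)',
                            'cat.count(example.aaa)', 'cat.FIRST(example.ord)']})

# Assumed variant: the feature after "example." may contain upper- or lowercase letters
relaxed = r'^([^A-Z]*$)|[^a-z].*example\.([A-Za-z]+)\).*$'

s = df['item'].str.extract(relaxed, expand=True)
df['feat'] = s[0].fillna(s[1]).fillna('')

print(df)
```

With this change cat.count(example.AAA) yields AAA instead of an empty string. Strings without any uppercase letter, such as cat.count(example.aaa), are still returned whole by the first alternative, as in Out[2].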

2023-12-12
