How to handle a duplicated "unique identifier" in pandas

I have two tables that both contain an identifier called the account code. The first table can contain the same account code several times, while the second contains each code only once. The tables originally come from Excel, so once they are loaded into pandas DataFrames they look like this:

base_data

+-------+----------------+----------+
| Name  | Account Number | $ Amount |
+-------+----------------+----------+
| Brett | 1234           | a        |
| Brett | 1234           | b        |
| Jill  | 2458           | c        |
| Peter | 1485           | d        |
+-------+----------------+----------+

licensee_fee

+----------------+--------------+
| Account Number | Licensee Fee |
+----------------+--------------+
| 1234           | x            |
| 1485           | y            |
+----------------+--------------+

So when I do

base_data = pd.read_excel(filename, sheet_name=0, dtype={"Account Number": "str"})
licensee_fee = pd.read_excel(filename, sheet_name=1, dtype={"Account Number": "str"})
# the first 2 columns contain irrelevant data
result = pd.merge(base_data, licensee_fee.iloc[:, [2, 3]], how="outer", on="Account Number")

I get, as expected:

+-------+----------------+----------+--------------+
| Name  | Account Number | $ Amount | Licensee Fee |
+-------+----------------+----------+--------------+
| Brett | 1234           | a        | x            |
| Brett | 1234           | b        | x            |
| Jill  | 2458           | c        | -            |
| Peter | 1485           | d        | y            |
+-------+----------------+----------+--------------+

But this is not what I need. What I actually want is this:

+-------+----------------+----------+--------------+
| Name  | Account Number | $ Amount | Licensee Fee |
+-------+----------------+----------+--------------+
| Brett | 1234           | a        | x            |
| Brett | 1234           | b        | -            |
| Jill  | 2458           | c        | -            |
| Peter | 1485           | d        | y            |
+-------+----------------+----------+--------------+

That is, the Licensee Fee should appear only once per account. I already have code that handles the NULL values, so those are not a problem.


2 Answers

呼如林

Contributed 1798 experience points, received 3+ likes

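This is a good question. You may want to build a helper merge key with cumcount first: it numbers the rows within each account, so once a fee entry has been matched to one row it is not matched again.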

base['helpkey'] = base.groupby('AccountNumber').cumcount()
fee['helpkey'] = fee.groupby('AccountNumber').cumcount()
yourdf = base.merge(fee, on=['AccountNumber', 'helpkey'], how='left').drop(columns='helpkey')
yourdf

    Name AccountNumber $Amount LicenseeFee
0  Brett          1234       a           x
1  Brett          1234       b         NaN
2   Jill          2458       c         NaN
3  Peter          1485       d           y
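
For anyone who wants to try this without the Excel file, here is a minimal self-contained sketch of the same helper-key idea. The frames are built by hand from the question's sample data, with the placeholder strings "a" to "d" and "x", "y" standing in for the real amounts and fees. The names base, fee and helpkey follow the answer above; the column names with spaces are taken from the question and are an assumption about the real sheet.

```python
import pandas as pd

# Hand-built stand-ins for the two Excel sheets (placeholder values).
base = pd.DataFrame({
    "Name": ["Brett", "Brett", "Jill", "Peter"],
    "Account Number": ["1234", "1234", "2458", "1485"],
    "$ Amount": ["a", "b", "c", "d"],
})
fee = pd.DataFrame({
    "Account Number": ["1234", "1485"],
    "Licensee Fee": ["x", "y"],
})

# Number the rows within each account: 0 for the first occurrence,
# 1 for the second, and so on.
base["helpkey"] = base.groupby("Account Number").cumcount()
fee["helpkey"] = fee.groupby("Account Number").cumcount()

# Merging on (account, occurrence number) means a fee row can only
# match the occurrence with the same number, i.e. the first one here.
result = (
    base.merge(fee, on=["Account Number", "helpkey"], how="left")
        .drop(columns="helpkey")
)
print(result)
```

Note that how="left" keeps only accounts that appear in base. The question used how="outer", which would also keep fee rows whose account has no matching base row, so pick whichever matches your data.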

2021-11-09

慕的地10843

Contributed 1785 experience points, received 8+ likes

You can merge first and then blank out the repeated fees with NaN afterwards:

In [11]: res = df.merge(df1, how='outer')

In [12]: res
Out[12]:
    Name Account Number $Amount Licensee Fee
0  Brett           1234       a            x
1  Brett           1234       b            x
2   Jill           2458       c          NaN
3  Peter           1485       d            y

In [13]: res.loc[res.groupby("Account Number").cumcount() > 0, "Licensee Fee"] = np.nan

In [14]: res
Out[14]:
    Name Account Number $Amount Licensee Fee
0  Brett           1234       a            x
1  Brett           1234       b          NaN
2   Jill           2458       c          NaN
3  Peter           1485       d            y
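
The session above assumes that df, df1 and np already exist. As a rough self-contained sketch of the same merge-then-mask idea, the frames below are built by hand from the question's sample data (placeholder strings for the amounts and fees); df and df1 follow this answer's names, and the repeat variable is only introduced here for readability.

```python
import numpy as np
import pandas as pd

# Hand-built stand-ins for the two Excel sheets (placeholder values).
df = pd.DataFrame({
    "Name": ["Brett", "Brett", "Jill", "Peter"],
    "Account Number": ["1234", "1234", "2458", "1485"],
    "$ Amount": ["a", "b", "c", "d"],
})
df1 = pd.DataFrame({
    "Account Number": ["1234", "1485"],
    "Licensee Fee": ["x", "y"],
})

# Plain merge: every row of an account receives that account's fee.
res = df.merge(df1, how="outer", on="Account Number")

# True for every row after the first one within each account.
repeat = res.groupby("Account Number").cumcount() > 0

# Keep the fee on the first row of each account and clear it elsewhere.
res.loc[repeat, "Licensee Fee"] = np.nan
print(res)
```

Which row of an account keeps the fee depends on the row order of res after the merge, because cumcount numbers rows in their current order. If a particular row should carry the fee, sort res accordingly before applying the mask.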

2021-11-09
