
Dask equivalent of pandas.DataFrame.update

Python

芜湖不芜 2023-03-08 15:45:24

I have some functions that use the pandas.DataFrame.update method, and I am trying to switch to Dask dataframes, but the Dask DataFrame API does not implement update. Is there another way to get the same result in Dask?

Here is how I use update:

Forward-fill data with the last known value

df.update(df.filter(like='/').mask(lambda x: x == 0).ffill(1))

Input

id  ...(some cols)  1/1/20  1/2/20  1/3/20  1/4/20  1/5/20  1/6/20  ...
1   ...             10      20      0       40      0       50
2   ...             10      30      30      0       0       50
..

Output

id  ...(some cols)  1/1/20  1/2/20  1/3/20  1/4/20  1/5/20  1/6/20  ...
1   ...             10      20      20      40      40      50
2   ...             10      30      30      30      30      50
..

Replace values in one dataframe with values from another dataframe, based on an id/index column

def replace_names(df1, df2, idxCol = 'id', srcCol = 'name', dstCol = 'name'):
    df1 = df1.set_index(idxCol)
    df1[dstCol].update(df2.set_index(idxCol)[srcCol])
    return df1.reset_index()

df_new = replace_names(df1, df2)

Input

df1

id   name    ...
123  city a
456  city b
789  city c
789  city c
456  city b
123  city a
...

df2

id   name    ...
123  City A
456  City B
789  City C
...

Output

id   name    ...
123  City A
456  City B
789  City C
789  City C
456  City B
123  City A
...


1 Answer

墨色风雨


Question 2

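There is a way to partially solve this. I assume that df2 is much smaller than df1 and actually fits in memory, so it can be read as a pandas dataframe. If that is the case, the following function works whether df1 is a pandas or a dask dataframe, but df2 must be a pandas one.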

import pandas as pd
import dask.dataframe as dd

def replace_names(df1,  # can be a pandas or dask dataframe
                  df2,  # this should be pandas
                  idxCol='id',
                  srcCol='name',
                  dstCol='name'):
    # build an in-memory {id: name} lookup from df2
    diz = df2[[idxCol, srcCol]].set_index(idxCol).to_dict()[srcCol]
    out = df1.copy()
    # ids that are missing from df2 end up as NaN here
    out[dstCol] = out[idxCol].map(diz)
    return out
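One difference from update is worth spelling out: map leaves NaN for any id that has no entry in the lookup, whereas update keeps the existing value. Below is a minimal sketch of a variant that falls back to the original name. It is shown on pandas only, and the function name replace_names_keep and the sample id 999 are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

def replace_names_keep(df1, df2, idxCol="id", srcCol="name", dstCol="name"):
    """Like replace_names, but keep df1's value when the id is missing from df2."""
    lookup = df2.set_index(idxCol)[srcCol].to_dict()
    out = df1.copy()
    # map() yields NaN for unknown ids; fillna() restores the original value
    out[dstCol] = out[idxCol].map(lookup).fillna(out[dstCol])
    return out

# Hypothetical sample data: id 999 exists only in df1
df1 = pd.DataFrame({"id": [123, 456, 999],
                    "name": ["city a", "city b", "city z"]})
df2 = pd.DataFrame({"id": [123, 456],
                    "name": ["City A", "City B"]})

result = replace_names_keep(df1, df2)
print(result)
```

The input df1 is left untouched because the function works on a copy.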

Question 1

For the first question, the following code works with both pandas and dask.

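df = pd.DataFrame({'a': {0: 1, 1: 2},
                   'b': {0: 3, 1: 4},
                   '1/1/20': {0: 10, 1: 10},
                   '1/2/20': {0: 20, 1: 30},
                   '1/3/20': {0: 0, 1: 30},
                   '1/4/20': {0: 40, 1: 0},
                   '1/5/20': {0: 0, 1: 0},
                   '1/6/20': {0: 50, 1: 50}})

# if you want to try with dask
# df = dd.from_pandas(df, npartitions=2)

# the date columns are the ones whose name contains "/"
cols = [col for col in df.columns if "/" in col]

# treat 0 as missing, then forward-fill along each row
# (axis is passed by keyword; recent pandas versions no longer accept ffill(1))
df[cols] = df[cols].mask(lambda x: x == 0).ffill(axis=1)  # .astype(int)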

If you want the output to be integers, uncomment .astype(int) at the end of the last line.
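As a quick check, the mask-and-forward-fill step can be run on the two sample rows from the question and compared with the expected output given there. This is a pandas-only sketch; the id column stands in for the unspecified "some cols" and is an assumption.

```python
import pandas as pd

# The two sample rows from the question
df = pd.DataFrame({
    "id": [1, 2],
    "1/1/20": [10, 10],
    "1/2/20": [20, 30],
    "1/3/20": [0, 30],
    "1/4/20": [40, 0],
    "1/5/20": [0, 0],
    "1/6/20": [50, 50],
})

cols = [c for c in df.columns if "/" in c]

# Zeros become NaN, then each row is forward-filled left to right
filled = df[cols].mask(df[cols] == 0).ffill(axis=1)

# No NaN remains in this sample, so casting back to integers is safe
df[cols] = filled.astype("int64")
print(df)
```

One caveat: a 0 in the first date column has nothing to its left to fill from, so it stays NaN, and the integer cast would then raise an error.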

Update for Question 2: if you want a dask-only solution, you can try the following.

Data

import numpy as np
import pandas as pd
import dask.dataframe as dd

df1 = pd.DataFrame({'id': {0: 123, 1: 456, 2: 789, 3: 789, 4: 456, 5: 123},
                    'name': {0: 'city a',
                             1: 'city b',
                             2: 'city c',
                             3: 'city c',
                             4: 'city b',
                             5: 'city a'}})

df2 = pd.DataFrame({'id': {0: 123, 1: 456, 2: 789},
                    'name': {0: 'City A', 1: 'City B', 2: 'City C'}})

df1 = dd.from_pandas(df1, npartitions=2)
df2 = dd.from_pandas(df2, npartitions=2)

Case 1

In this case, if an id exists in df1 but not in df2, the name from df1 is kept.

def replace_names_dask(df1, df2,
                       idxCol='id',
                       srcCol='name',
                       dstCol='name'):
    # avoid a column name clash in the merge result
    if srcCol == dstCol:
        df2 = df2.rename(columns={srcCol: f"{srcCol}_new"})
        srcCol = f"{srcCol}_new"

    def map_replace(x, srcCol, dstCol):
        # work on a copy so the input partition is not mutated
        x = x.copy()
        # take the new value where there is one, otherwise keep the old one
        x[dstCol] = np.where(x[srcCol].notnull(),
                             x[srcCol],
                             x[dstCol])
        return x

    df = dd.merge(df1, df2, on=idxCol, how="left")
    df = df.map_partitions(lambda x: map_replace(x, srcCol, dstCol))
    df = df.drop(srcCol, axis=1)
    return df

df = replace_names_dask(df1, df2)
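To see why a left merge followed by a "take the new value if present" step reproduces update, here is a small pandas-only comparison of the two approaches. It is a sketch under assumptions: ids are unique in this sample, and id 999 (present only in df1) is a hypothetical row added to exercise the fallback.

```python
import pandas as pd

df1 = pd.DataFrame({"id": [123, 456, 789, 999],
                    "name": ["city a", "city b", "city c", "city z"]})
df2 = pd.DataFrame({"id": [123, 456, 789],
                    "name": ["City A", "City B", "City C"]})

# Reference result: DataFrame.update aligned on the id index
expected = df1.set_index("id")
expected.update(df2.set_index("id"))
expected = expected.reset_index()

# Merge-based equivalent: bring the new names in as a separate column,
# prefer them where they exist, then drop the helper column
merged = df1.merge(df2.rename(columns={"name": "name_new"}),
                   on="id", how="left")
merged["name"] = merged["name_new"].where(merged["name_new"].notnull(),
                                          merged["name"])
merged = merged.drop(columns="name_new")

print(expected)
print(merged)
```

Both frames should show the capitalized names for the three matching ids and the untouched name for id 999.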

Case 2

In this case, if an id exists in df1 but not in df2, the name in the output df will be NaN (as in a standard left join).

def replace_names_dask(df1, df2,
                       idxCol='id',
                       srcCol='name',
                       dstCol='name'):
    # drop the old column and take the values from df2 instead
    df1 = df1.drop(dstCol, axis=1)
    df2 = df2.rename(columns={srcCol: dstCol})
    df = dd.merge(df1, df2, on=idxCol, how="left")
    return df

df = replace_names_dask(df1, df2)
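With this second variant it can be useful to know which ids ended up without a name. A minimal pandas sketch using the indicator option of merge follows; the sample id 999 is a hypothetical unmatched row, and whether the same option behaves identically in dask is not tested here.

```python
import pandas as pd

df1 = pd.DataFrame({"id": [123, 456, 999],
                    "name": ["city a", "city b", "city z"]})
df2 = pd.DataFrame({"id": [123, 456],
                    "name": ["City A", "City B"]})

# Same steps as Case 2, plus a _merge column describing each row's origin
merged = df1.drop(columns="name").merge(df2, on="id", how="left",
                                        indicator=True)

# Rows marked "left_only" had no match in df2, so their name is NaN
unmatched_ids = merged.loc[merged["_merge"] == "left_only", "id"].tolist()

print(merged)
print(unmatched_ids)
```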

Replied 2023-03-08
