
Python Pandas: resampling specific hours across different days and day ranges

Python

德玛西亚99 2022-11-24 15:05:26

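I have data records for different entities, and for each entity some counts were recorded at specific hours of the day over a whole month. For example:

    entity_id                 time  counts
0         175  2019-03-01 05:00:00       3
1         175  2019-03-01 06:00:00       4
2         175  2019-03-01 07:00:00       6
3         175  2019-03-01 08:00:00       6
4         175  2019-03-01 09:00:00       7
5         178  2019-03-01 05:00:00       8
6         178  2019-03-01 06:00:00       4
7         178  2019-03-01 07:00:00       5
8         178  2019-03-01 08:00:00       6
9         200  2019-03-01 05:00:00       7
10        200  2019-03-01 08:00:00       3
11        175  2019-03-03 05:00:00       3
12        175  2019-03-03 07:00:00       6
13        175  2019-03-03 08:00:00       6
14        175  2019-03-03 09:00:00       7
15        178  2019-03-03 05:00:00       8
16        178  2019-03-03 06:00:00       4
17        178  2019-03-03 07:00:00       5
18        178  2019-03-03 08:00:00       6
19        200  2019-03-03 05:00:00       7
20        200  2019-03-03 08:00:00       3
21        200  2019-03-03 09:00:00       7
...

For each entity, I would like to aggregate the mean of the counts over several hour ranges on different days of the week, across the whole month. For example:

Mean for Sunday mornings (6 AM to 10 AM)
Mean for Sunday to Thursday mornings (6 AM to 10 AM)
Mean for Sunday to Thursday at noon (11 AM to 1 PM)
Mean for Friday to Saturday at noon (11 AM to 1 PM)
Mean for Friday evenings (6 PM to 9 PM)
and so on.

So I would like to end up with a df like this (partial example):

   entity_id day_in_week time_in_day  counts_mean
0        175         sun         eve            5
1        175     sun-thu        noon            6
2        178         sun         eve            5
3        178         sat         eve            5
4        200     sun-thu     morning            2
...

I managed to get this partly done by iterating over the data, slicing, and extracting the different elements, but I assume there is a more efficient way. I started from this question, but I still have too many for loops. Any ideas on how to optimize performance?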


2 Answers

蝴蝶刀刀

Contributed 1801 experience points, received more than 8 likes

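The idea behind my solution is a helper DataFrame that defines the ranges for which the means are to be computed (day_in_week, time_in_day, and the CustomBusinessHour corresponding to those two attributes).

Creating this DataFrame (I called it calendars) starts with the day_in_week and time_in_day columns: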

import pandas as pd

calendars = pd.DataFrame([
    ['sun', 'morning'],
    ['sun-thu', 'morning'],
    ['sun-thu', 'noon'],
    ['fri-sat', 'noon'],
    ['fri', 'eve']],
    columns=['day_in_week', 'time_in_day'])

If you need more definitions of this kind, add them here.

Then add the corresponding CustomBusinessHour objects:

Define a function to get the hour limits:

def getHourLimits(name):
    if name == 'morning':
        return '06:00', '10:00'
    elif name == 'noon':
        return '11:00', '13:00'
    elif name == 'eve':
        return '18:00', '21:00'
    else:
        return '8:00', '16:00'

Define a function to get the week mask (from the start day to the end day):

def getWeekMask(name):
    parts = name.split('-')
    if len(parts) > 1:
        fullWeek = ['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat']
        ind1 = fullWeek.index(parts[0].capitalize())
        ind2 = fullWeek.index(parts[1].capitalize())
        return ' '.join(fullWeek[ind1 : ind2 + 1])
    else:
        return parts[0].capitalize()
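As a quick check of the range expansion, here is a minimal sketch that restates the same logic under a hypothetical name, `week_mask` (not part of the original answer), so that the snippet runs on its own. It also shows one limitation worth knowing: because the week list starts at Sunday, a range that wraps past Saturday (for example 'sat-sun') yields an empty mask.

```python
# Hypothetical restatement of getWeekMask, for illustration only.
FULL_WEEK = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]

def week_mask(name):
    """Expand 'sun-thu' into 'Sun Mon Tue Wed Thu'; pass single days through."""
    parts = name.split("-")
    if len(parts) == 1:
        return parts[0].capitalize()
    first = FULL_WEEK.index(parts[0].capitalize())
    last = FULL_WEEK.index(parts[1].capitalize())
    # Slicing returns an empty list when 'first' comes after 'last'.
    return " ".join(FULL_WEEK[first:last + 1])

print(week_mask("sun"))      # Sun
print(week_mask("sun-thu"))  # Sun Mon Tue Wed Thu
print(week_mask("fri-sat"))  # Fri Sat
print(repr(week_mask("sat-sun")))  # '' (wrap-around is not handled)
```

If wrap-around ranges are needed, the definitions would have to list the days explicitly or the slicing would need extra handling.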

Define a function that generates the CustomBusinessHour object:

def getCBH(row):
    wkMask = getWeekMask(row.day_in_week)
    hStart, hEnd = getHourLimits(row.time_in_day)
    return pd.offsets.CustomBusinessHour(weekmask=wkMask, start=hStart, end=hEnd)

Add the CustomBusinessHour objects to calendars:

calendars['CBH'] = calendars.apply(getCBH, axis=1)
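To see what one of these objects does when used as a filter, the sketch below builds the 'sun' / 'morning' offset by hand and tests three timestamps taken from the question's sample dates. It assumes a pandas version that provides `is_on_offset` (1.0 or later).

```python
import pandas as pd

# Same parameters that getCBH would produce for ('sun', 'morning').
cbh = pd.offsets.CustomBusinessHour(weekmask="Sun", start="06:00", end="10:00")

sunday_7am = pd.Timestamp("2019-03-03 07:00")  # Sunday, inside 06:00-10:00
sunday_5am = pd.Timestamp("2019-03-03 05:00")  # Sunday, before opening
friday_7am = pd.Timestamp("2019-03-01 07:00")  # Friday, not in the week mask

print(cbh.is_on_offset(sunday_7am))  # True
print(cbh.is_on_offset(sunday_5am))  # False
print(cbh.is_on_offset(friday_7am))  # False
```

Both the day of the week and the time of day have to match, which is why a single call is enough to decide whether a reading belongs to a given range.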

Then define a function that computes all the required means for a given entity Id:

def getSums(entId):
    # Despite the name, this collects the mean of counts per calendar row
    outRows = []
    wrk = df[df.entity_id.eq(entId)]  # Filter for entity Id
    for _, row in calendars.iterrows():
        dd = row.day_in_week
        hh = row.time_in_day
        cbh = row.CBH
        # Filter for the current calendar
        cnts = wrk[wrk.time.apply(lambda val: cbh.is_on_offset(val))]
        cnt = cnts.counts.mean()
        if pd.notnull(cnt):
            outRows.append(pd.Series([entId, dd, hh, cnt],
                index=['entity_id', 'day_in_week', 'time_in_day', 'counts_mean']))
    return pd.DataFrame(outRows)

As you can see, the result contains only non-null means.

To generate the result, run:

pd.concat([getSums(entId) for entId in df.entity_id.unique()], ignore_index=True)

For your data sample (which contains only morning readings), the result is:

   entity_id day_in_week time_in_day  counts_mean
0        175         sun     morning     6.333333
1        175     sun-thu     morning     6.333333
2        178         sun     morning     5.000000
3        178     sun-thu     morning     5.000000
4        200         sun     morning     5.000000
5        200     sun-thu     morning     5.000000
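As far as I can tell, only sun and sun-thu rows appear because the two dates in the sample are a Friday and a Sunday, and the Friday readings (05:00 to 09:00) fall outside the Friday ranges defined in calendars. A quick way to confirm the weekdays:

```python
import pandas as pd

# The two dates that occur in the question's sample data.
for day in ["2019-03-01", "2019-03-03"]:
    print(day, pd.Timestamp(day).day_name())
# 2019-03-01 Friday
# 2019-03-03 Sunday
```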

Posted 2022-11-24

紫衣仙女

Contributed 1839 experience points, received more than 15 likes

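If your time column holds pandas datetime objects, you can create new columns using the datetime accessor methods.

You can proceed as follows:

1. Create a column indicating day_in_week (dt.dayofweek returns integers, Monday=0 through Sunday=6):

df["day_in_week"] = df["time"].dt.dayofweek

2. Use a simple .apply function to build a column as required, comparing the time inside the function to split it into periods such as morning and evening.
3. Create another column, based on the two columns created above, that indicates your combination.
4. Use groupby on that column to get the grouped data or the metrics for each group.

I know the process is a bit long, but it has no for loops; it uses df.apply and the datetime attributes that pandas already provides, plus a few if-else conditions to match your requirements.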

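Steps 2, 3, and 4 depend entirely on the data, and since I don't have the data I can't write the exact code. I have tried to explain the approach as well as I can.

I hope this helps.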
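A minimal sketch of the four steps above, using the sample rows from the question. Several things here are assumptions rather than part of the answer: the column names `weekday`, `hour`, `h_start`, and `h_end`, the `ranges` table, and the use of a merge instead of .apply. A merge is used because the requested ranges overlap (a Sunday reading belongs to both sun and sun-thu), so a single label per row would not be enough. Hours are compared as whole numbers with both ends inclusive, which assumes readings are taken on the hour.

```python
import pandas as pd

# Sample rows copied from the question: (entity_id, time, counts).
rows = [
    (175, "2019-03-01 05:00", 3), (175, "2019-03-01 06:00", 4),
    (175, "2019-03-01 07:00", 6), (175, "2019-03-01 08:00", 6),
    (175, "2019-03-01 09:00", 7), (178, "2019-03-01 05:00", 8),
    (178, "2019-03-01 06:00", 4), (178, "2019-03-01 07:00", 5),
    (178, "2019-03-01 08:00", 6), (200, "2019-03-01 05:00", 7),
    (200, "2019-03-01 08:00", 3), (175, "2019-03-03 05:00", 3),
    (175, "2019-03-03 07:00", 6), (175, "2019-03-03 08:00", 6),
    (175, "2019-03-03 09:00", 7), (178, "2019-03-03 05:00", 8),
    (178, "2019-03-03 06:00", 4), (178, "2019-03-03 07:00", 5),
    (178, "2019-03-03 08:00", 6), (200, "2019-03-03 05:00", 7),
    (200, "2019-03-03 08:00", 3), (200, "2019-03-03 09:00", 7),
]
df = pd.DataFrame(rows, columns=["entity_id", "time", "counts"])
df["time"] = pd.to_datetime(df["time"])

# Step 1: numeric weekday (Monday=0 ... Sunday=6) and hour of day.
df["weekday"] = df["time"].dt.dayofweek
df["hour"] = df["time"].dt.hour

# Steps 2-3: hypothetical range table, one row per label pair, then
# exploded so that each row covers exactly one weekday.
MON, TUE, WED, THU, FRI, SAT, SUN = range(7)
SUN_THU = [SUN, MON, TUE, WED, THU]
ranges = pd.DataFrame(
    [
        ("sun", "morning", [SUN], 6, 10),
        ("sun-thu", "morning", SUN_THU, 6, 10),
        ("sun-thu", "noon", SUN_THU, 11, 13),
        ("fri-sat", "noon", [FRI, SAT], 11, 13),
        ("fri", "eve", [FRI], 18, 21),
    ],
    columns=["day_in_week", "time_in_day", "weekday", "h_start", "h_end"],
)
ranges = ranges.explode("weekday").astype({"weekday": "int64"})

# Pair each reading with every range covering its weekday, then keep the
# pairs whose hour lies inside the range (both ends inclusive).
merged = df.merge(ranges, on="weekday")
in_range = merged["hour"].between(merged["h_start"], merged["h_end"])

# Step 4: mean of counts per entity and range.
result = (
    merged[in_range]
    .groupby(["entity_id", "day_in_week", "time_in_day"], as_index=False)["counts"]
    .mean()
    .rename(columns={"counts": "counts_mean"})
)
print(result)
```

On this sample the sketch produces six rows, all for sun and sun-thu mornings, since the Friday readings fall outside the Friday ranges. If readings can occur between whole hours, the comparison would need to use the full time of day rather than `hour`.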

Posted 2022-11-24
