
A better way to iterate over a dataset and change a feature value to another value at a future index in Pandas

慕无忌1623718 2023-02-07 09:39:17

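I have a dataset of speeds recorded by sensors on highways, and I am changing the label value to the avg5 (the average speed over a 5-minute timestamp) from 2 hours in the future. (Normally the offset is 30 minutes: the current label value is the avg5 observed 30 minutes in the future.)

My dataset has the features sensor_id, label, avg5 and timestamp5.

I am doing this value transformation in the following way:

hours_added = datetime.timedelta(hours = 2)

for index in data_copy.index:
    hours_ahead = data.loc[index, "timestamp5"] + hours_added
    result = data_copy[((data_copy["timestamp5"] == hours_ahead) & (data_copy["sensor_id"] == data_copy["sensor_id"].loc[index]))]

    if len(result) == 1:
        data_copy.at[index, "label"] = result["avg5"]

    if(index % 50 == 0):
        print(f"Index: {index}")

The code queries 2 hours ahead and picks up the result for the same sensor_id I am currently iterating over. I only change the label value if the query returns something (len(result) == 1).

My dataframe has 2,950,521 indices. At the time of posting this question, the kernel has been running for more than 24 hours and has only reached index 371,650. So I am starting to think that I am doing something wrong, or that there is a better way to change these values that does not take so long.

For replication purposes, the desired behavior is to assign to the label the avg5 from 2 hours ahead for the corresponding sensor_id. For reproducibility, consider a sample of my dataset containing its first 10 records.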

1 Answer

慕沐林林


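From my understanding of how your code works, it takes so long because it runs in O(n^2) time. By that I mean that, for each index, it has to go through the entire dataset (once per condition) to check the conditions.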
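As a rough back-of-the-envelope check, the cost of the loop can be estimated from the figures quoted in the question alone (2,950,521 rows, index 371,650 reached after about 24 hours). The assumption that the remaining indices take as long per index as the first ones is mine, so treat the result as an order-of-magnitude estimate only.

```python
# Figures taken from the question.
n_rows = 2_950_521      # rows in the dataframe
rows_done = 371_650     # indices processed so far
hours_elapsed = 24      # "more than 24 hours", so this is a lower bound

# Each loop iteration builds a boolean mask over every row,
# so the whole loop touches roughly n_rows * n_rows rows.
rows_scanned_total = n_rows * n_rows

# Assumption: the remaining indices take as long per index as the first ones.
estimated_total_hours = hours_elapsed * n_rows / rows_done

print(f"rows scanned by the loop: about {rows_scanned_total:.2e}")
print(f"estimated total runtime: at least {estimated_total_hours:.0f} hours")
```

Under that assumption the full loop would need on the order of a week, which is why a per-row lookup that does not rescan the dataframe matters here.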

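So it would be best to avoid going through the whole dataset for every index, that is, to make it work in O(n) linear time. To do that, I would do the following: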

import datetime

import pandas as pd
from pandas import Timestamp

data_copy = pd.DataFrame(data={
    'sensor_id': {
        0: 1385001, 1: 1385001, 2: 1385001, 3: 1385001, 4: 1385001, 5: 1385001,
        6: 1385001, 7: 1385001, 8: 1385001, 9: 1385001},
    'label': {
        0: 50.79999923706055, 1: 52.69230651855469, 2: 50.0, 3: 48.61538314819336,
        4: 48.0, 5: 47.90909194946289, 6: 51.41666793823242, 7: 48.3684196472168,
        8: 49.8636360168457, 9: 48.66666793823242},
    'avg5': {
        0: 49.484848, 1: 51.735294, 2: 51.59375, 3: 49.266666,
        4: 50.135135999999996, 5: 50.5, 6: 50.8, 7: 52.69230699999999,
        8: 50.0, 9: 48.615383},
    'timestamp5': {
        0: Timestamp('2014-08-01 00:00:00'), 1: Timestamp('2014-08-01 00:05:00'),
        2: Timestamp('2014-08-01 00:10:00'), 3: Timestamp('2014-08-01 00:15:00'),
        4: Timestamp('2014-08-01 00:20:00'), 5: Timestamp('2014-08-01 00:25:00'),
        6: Timestamp('2014-08-01 00:30:00'), 7: Timestamp('2014-08-01 00:35:00'),
        8: Timestamp('2014-08-01 00:40:00'), 9: Timestamp('2014-08-01 00:45:00')}})

# The sample covers only 45 minutes, so a 40-minute offset is used here to get
# some matches; on the full dataset this would be datetime.timedelta(hours=2)
hours_added = datetime.timedelta(minutes=40)

# Create a data series that combines the information about sensor_id & timestamp5
sen_time = data_copy['sensor_id'].astype(str) + data_copy['timestamp5'].astype(str)

# Create a dictionary of the corresponding { sensor_id + timestamp5 : avg5 } values
dictionary = pd.Series(data_copy['avg5'].values, sen_time).to_dict()

# Create a data series combining the timestamp5 + 40 mins information
timePlus40 = data_copy['timestamp5'] + hours_added

# Create a mapping column that combines the sensor_id & timestamp5+40mins
sensor_timePlus40 = data_copy['sensor_id'].astype(str) + timePlus40.astype(str)

# Create a new_label series by mapping the dictionary onto sensor_timePlus40
new_label = sensor_timePlus40.map(dictionary)

# Boolean mask of the rows where this series has non-NaN values
where = new_label.notnull()

# Replace the values in the 'label' column with only non-NaN new_label values
data_copy.loc[where, 'label'] = new_label.loc[where]

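I believe this is similar to the idea proposed by @pecey and @iracebeth_18 in the comments.

This edited version reflects the OP's wish (from the comments) to update the label column only with non-NaN values.

The result looks like this: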

> print(data_copy)
   sensor_id      label       avg5          timestamp5
0    1385001  50.000000  49.484848 2014-08-01 00:00:00
1    1385001  48.615383  51.735294 2014-08-01 00:05:00
2    1385001  50.000000  51.593750 2014-08-01 00:10:00
3    1385001  48.615383  49.266666 2014-08-01 00:15:00
4    1385001  48.000000  50.135136 2014-08-01 00:20:00
5    1385001  47.909092  50.500000 2014-08-01 00:25:00
6    1385001  51.416668  50.800000 2014-08-01 00:30:00
7    1385001  48.368420  52.692307 2014-08-01 00:35:00
8    1385001  49.863636  50.000000 2014-08-01 00:40:00
9    1385001  48.666668  48.615383 2014-08-01 00:45:00
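The same lookup can also be written as a join instead of a dictionary of string keys. The following is a minimal sketch of that alternative, not the code from the answer above: the function name relabel_from_future, the helper column future_avg5 and the tiny demo frame are hypothetical, and only the column names sensor_id, timestamp5, avg5 and label come from the question.

```python
import pandas as pd

def relabel_from_future(df: pd.DataFrame, offset: pd.Timedelta) -> pd.DataFrame:
    """Return a copy of df whose 'label' is the 'avg5' observed `offset` later
    for the same sensor_id; rows without a future match keep their label."""
    # Lookup table: shift every observation back by `offset`, so that a row at
    # time t lines up with the avg5 recorded at time t + offset.
    future = df[["sensor_id", "timestamp5", "avg5"]].copy()
    future["timestamp5"] = future["timestamp5"] - offset
    future = future.rename(columns={"avg5": "future_avg5"})
    # Mirror the len(result) == 1 check from the question: drop ambiguous keys.
    future = future.drop_duplicates(["sensor_id", "timestamp5"], keep=False)

    merged = df.merge(future, on=["sensor_id", "timestamp5"], how="left")
    merged.index = df.index  # the left merge keeps row order but resets the index

    out = df.copy()
    out["label"] = merged["future_avg5"].fillna(df["label"])
    return out

# Hypothetical demo data: three readings for sensor 1, one for sensor 2.
demo = pd.DataFrame({
    "sensor_id": [1, 1, 1, 2],
    "timestamp5": pd.to_datetime([
        "2014-08-01 00:00", "2014-08-01 00:05",
        "2014-08-01 00:10", "2014-08-01 00:00",
    ]),
    "avg5": [10.0, 20.0, 30.0, 40.0],
    "label": [0.0, 0.0, 0.0, 0.0],
})
result = relabel_from_future(demo, pd.Timedelta(minutes=10))
print(result["label"].tolist())  # [30.0, 0.0, 0.0, 0.0]
```

Only the first row has a reading for the same sensor 10 minutes later, so only its label changes. Joining on the two columns directly avoids converting every timestamp to a string, at the cost of building an intermediate merged frame.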

Comparing the speed of this code with yours using timeit shows a faster running time, and the difference will only grow as the dataset gets larger.
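One way such a comparison could be set up is sketched below. The data is synthetic (make_data is a hypothetical stand-in, not the real sensor dataset), loop_version follows the loop from the question and map_version follows the dictionary approach above, with an underscore separator added to the string key as a small precaution of mine. Actual timings depend on the machine and the pandas version, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

def make_data(n_sensors: int = 2, n_steps: int = 200) -> pd.DataFrame:
    """Synthetic stand-in for the sensor data (an assumption, not the real dataset)."""
    times = pd.date_range("2014-08-01", periods=n_steps, freq="5min")
    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "sensor_id": np.repeat(np.arange(n_sensors), n_steps),
        "timestamp5": np.tile(times.to_numpy(), n_sensors),
        "avg5": rng.uniform(40, 60, n_sensors * n_steps),
    })
    df["label"] = 0.0
    return df

def loop_version(df: pd.DataFrame, offset: pd.Timedelta) -> pd.DataFrame:
    """Row-by-row lookup as in the question: one full scan per index."""
    out = df.copy()
    for index in df.index:
        ahead = df.loc[index, "timestamp5"] + offset
        same_sensor = df["sensor_id"] == df.loc[index, "sensor_id"]
        result = df[(df["timestamp5"] == ahead) & same_sensor]
        if len(result) == 1:
            out.at[index, "label"] = result["avg5"].iloc[0]
    return out

def map_version(df: pd.DataFrame, offset: pd.Timedelta) -> pd.DataFrame:
    """Dictionary lookup as in the answer: one pass to build, one pass to map."""
    out = df.copy()
    sensor = df["sensor_id"].astype(str)
    key_now = sensor + "_" + df["timestamp5"].astype(str)
    lookup = pd.Series(df["avg5"].values, index=key_now).to_dict()
    key_future = sensor + "_" + (df["timestamp5"] + offset).astype(str)
    new_label = key_future.map(lookup)
    found = new_label.notna()
    out.loc[found, "label"] = new_label[found]
    return out

df = make_data()
offset = pd.Timedelta(hours=2)

t_loop = timeit.timeit(lambda: loop_version(df, offset), number=1)
t_map = timeit.timeit(lambda: map_version(df, offset), number=1)
print(f"loop: {t_loop:.3f} s, map: {t_map:.3f} s")
```

Increasing n_steps or n_sensors in make_data shows how the gap between the two versions widens with the number of rows.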

Answered 2023-02-07
