2 Answers
Contributed 1785 experience points, received more than 4 likes
Would it work to compute where each sequence starts? Then you only need to blank out the ignored values (flag 4). Like this:
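import numpy as np
sequence_starts = df.sequence == 2
sequence_ignore = df.sequence == 4
# cumsum() returns integers; cast to float so the Series can hold NaN
sequence_id = sequence_starts.cumsum().astype(float)
sequence_id[sequence_ignore] = np.nan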
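To make the idea above concrete, here is a minimal sketch on a small made-up DataFrame. The sample values and the column name `sequence_id` are assumptions for illustration only; `Series.mask` is used here as an alternative to the boolean assignment, since it upcasts to float and writes NaN in one step.

```python
import pandas as pd

# Hypothetical sample data: 2 marks the start of a sequence, 4 marks a row to ignore.
df = pd.DataFrame({"sequence": [2, 1, 1, 4, 2, 1, 4, 4, 2]})

starts = df["sequence"].eq(2)
ignore = df["sequence"].eq(4)

# Running count of starts gives one id per sequence;
# mask() replaces the ignored rows with NaN (the result is float64).
df["sequence_id"] = starts.cumsum().mask(ignore)

print(df)
```

Note that any rows appearing before the first start flag would get id 0 with this approach, because the cumulative sum has not been incremented yet.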
Contributed 1829 experience points, received more than 6 likes
I can't think of anything better than the "dumb" solution of looping over the whole thing, for example:
import numpy as np
counter = 0
# np.float was removed in NumPy 1.24; np.float64 lets the array hold NaN
tmp = np.empty_like(df['sequence'].values, dtype=np.float64)
for i in range(len(tmp)):
    if df['sequence'][i] == 4:
        tmp[i] = np.nan  # ignored row
    else:
        if df['sequence'][i] == 2:
            counter += 1  # a new sequence starts here
        tmp[i] = counter
df['desired_Id_output'] = tmp
Of course, this will be slow for a DataFrame with 20M rows. One way to improve on it is just-in-time compilation with numba, as follows:
import numba
@numba.njit
def foo(sequence):
    # put in appropriate modification of the above code block
    return tmp
and call it with df['sequence'].values as the argument.
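One possible way to fill in that placeholder is sketched below. This is an assumption about what the adapted loop might look like, not the answerer's own code: the function name `label_sequences` and the sample array are hypothetical, and the `try`/`except` fallback is added only so the sketch still runs (uncompiled) where numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption for this sketch: without numba, run the same loop as plain Python.
    def njit(func):
        return func


@njit
def label_sequences(sequence):
    # sequence: 1-D integer NumPy array of flags (2 = start, 4 = ignore)
    n = sequence.shape[0]
    out = np.empty(n, dtype=np.float64)
    counter = 0
    for i in range(n):
        if sequence[i] == 4:
            out[i] = np.nan  # ignored row
        else:
            if sequence[i] == 2:
                counter += 1  # a new sequence starts here
            out[i] = counter
    return out


# Hypothetical sample flags, standing in for df['sequence'].values
flags = np.array([2, 1, 1, 4, 2, 1, 4, 4, 2], dtype=np.int64)
ids = label_sequences(flags)
print(ids)
```

With a real DataFrame, the call would presumably be `df['desired_Id_output'] = label_sequences(df['sequence'].values)`. Passing the underlying NumPy array matters because numba's nopython mode works on arrays, not on pandas objects.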