For example, here is my test data:

test = spark.createDataFrame([
    (0, 1, 5, "2018-06-03", "Region A"),
    (1, 1, 2, "2018-06-04", "Region B"),
    (2, 2, 1, "2018-06-03", "Region B"),
    (3, 3, 1, "2018-06-01", "Region A"),
    (3, 1, 3, "2018-06-05", "Region A"),
]).toDF("orderid", "customerid", "price", "transactiondate", "location")
test.show()

I can get aggregated data like this:

test.groupBy("customerid", "location").agg(sum("price")).show()

But I also want a percentage column, like this:

+----------+--------+----------+----------+
|customerid|location|sum(price)|percentage|
+----------+--------+----------+----------+
|         1|Region B|         2|       20%|
|         1|Region A|         8|       80%|
|         3|Region A|         1|      100%|
|         2|Region B|         1|      100%|
+----------+--------+----------+----------+

How can I do that? Maybe with a window function? Can I also pivot the data into that kind of table (with both the sum and percentage columns)?
2 Answers
天涯尽头无女友
This answers the original version of the question.
In SQL, you can use window functions:
select customerid, location, sum(price),
       sum(price) / sum(sum(price)) over (partition by customerid) as ratio
from t
group by customerid, location;
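If you want to run that same query from PySpark rather than a SQL client, a minimal sketch is shown below; it assumes the DataFrame from the question is available as test and registers it as a temporary view named t (the view name is my choice, not part of the original answer).

# Register the test DataFrame as a temp view so the SQL above can refer to it as "t"
test.createOrReplaceTempView("t")

spark.sql("""
    select customerid, location, sum(price) as sum_price,
           sum(price) / sum(sum(price)) over (partition by customerid) as ratio
    from t
    group by customerid, location
""").show()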
慕雪6442864
Here is clean code that solves your problem:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

(test.groupby("customerid", "location")
 .agg(F.sum("price").alias("t_price"))
 .withColumn("perc", F.col("t_price") / F.sum("t_price").over(Window.partitionBy("customerid")))
 .show())
