sample
Official documentation:
Return a sampled subset of this RDD.
Function prototypes:

def sample(withReplacement: Boolean, fraction: Double): JavaRDD[T]
def sample(withReplacement: Boolean, fraction: Double, seed: Long): JavaRDD[T]

withReplacement — whether elements can be sampled multiple times (replaced when sampled out)
fraction — expected size of the sample as a fraction of this RDD's size
    without replacement: probability that each element is chosen; fraction must be in [0, 1]
    with replacement: expected number of times each element is chosen; fraction must be >= 0
seed — seed for the random number generator
**
The first function is implemented in terms of the second, with seed defaulting to Utils.random.nextLong. withReplacement determines which sampler is built; fraction is the sampling fraction; seed seeds the random number generator.
**
Source analysis:
def sample(
    withReplacement: Boolean,
    fraction: Double,
    seed: Long = Utils.random.nextLong): RDD[T] = withScope {
  require(fraction >= 0.0, "Negative fraction value: " + fraction)
  if (withReplacement) {
    new PartitionwiseSampledRDD[T, T](this, new PoissonSampler[T](fraction), true, seed)
  } else {
    new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](fraction), true, seed)
  }
}
**
In sample, fraction is validated first; a PartitionwiseSampledRDD is then constructed, using a PoissonSampler when withReplacement is true and a BernoulliSampler otherwise.
**
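To make the difference between the two samplers concrete, here is a minimal plain-Java sketch; it is not Spark's actual implementation (the class and method names here are made up for illustration), but it mirrors what a Bernoulli sampler and a Poisson sampler each do per element:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class SamplerSketch {
    // Bernoulli-style sampling (withReplacement = false): each element is kept
    // independently with probability `fraction`, so it appears at most once.
    static List<Integer> bernoulliSample(List<Integer> data, double fraction, long seed) {
        Random rng = new Random(seed);
        List<Integer> out = new ArrayList<>();
        for (Integer x : data) {
            if (rng.nextDouble() < fraction) out.add(x);
        }
        return out;
    }

    // Poisson-style sampling (withReplacement = true): each element is emitted
    // k times, where k is drawn from a Poisson distribution with mean `fraction`,
    // so an element may appear several times, or not at all.
    static List<Integer> poissonSample(List<Integer> data, double fraction, long seed) {
        Random rng = new Random(seed);
        List<Integer> out = new ArrayList<>();
        for (Integer x : data) {
            int k = poisson(fraction, rng); // number of copies of x in the sample
            for (int i = 0; i < k; i++) out.add(x);
        }
        return out;
    }

    // Knuth's algorithm for drawing a Poisson-distributed integer with mean lambda.
    static int poisson(double lambda, Random rng) {
        double l = Math.exp(-lambda);
        int k = 0;
        double p = 1.0;
        do {
            k++;
            p *= rng.nextDouble();
        } while (p > l);
        return k - 1;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5, 6, 7);
        System.out.println(bernoulliSample(data, 0.5, 42L)); // a subset, no repeats
        System.out.println(poissonSample(data, 1.5, 42L));   // may contain repeats
    }
}
```

This also shows why fraction means different things in the two modes: it is a keep-probability for Bernoulli, but the expected copy count (the Poisson mean) with replacement.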
Example:
List<Integer> data = Arrays.asList(1, 2, 4, 3, 5, 6, 7);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data);
// false: Bernoulli sampler (each element sampled at most once); 0.2: sampling fraction; 100: RNG seed
JavaRDD<Integer> sampleRDD = javaRDD.sample(false, 0.2, 100);
System.out.println("sampleRDD~~~~~~~~~~~~~~~~~~~~~~~~~~" + sampleRDD.collect());
// true: Poisson sampler (elements may be sampled multiple times); 0.2: sampling fraction; 100: RNG seed
JavaRDD<Integer> sampleRDD1 = javaRDD.sample(true, 0.2, 100);
System.out.println("sampleRDD1~~~~~~~~~~~~~~~~~~~~~~~~~~" + sampleRDD1.collect());
randomSplit
Official documentation:
Randomly splits this RDD with the provided weights.
Function prototypes:

def randomSplit(weights: Array[Double], seed: Long): Array[JavaRDD[T]]
def randomSplit(weights: Array[Double]): Array[JavaRDD[T]]

weights — weights for splits, will be normalized if they don't sum to 1
seed — random seed
returns — the split RDDs in an array
Source analysis:
def randomSplit(
    weights: Array[Double],
    seed: Long = Utils.random.nextLong): Array[RDD[T]] = withScope {
  val sum = weights.sum
  val normalizedCumWeights = weights.map(_ / sum).scanLeft(0.0d)(_ + _)
  normalizedCumWeights.sliding(2).map { x =>
    randomSampleWithRange(x(0), x(1), seed)
  }.toArray
}

def randomSampleWithRange(lb: Double, ub: Double, seed: Long): RDD[T] = {
  this.mapPartitionsWithIndex({ (index, partition) =>
    val sampler = new BernoulliCellSampler[T](lb, ub)
    sampler.setSeed(seed + index)
    sampler.sample(partition)
  }, preservesPartitioning = true)
}
**
From the source we can see that randomSplit first normalizes the weight array so it sums to 1 and computes the cumulative weight boundaries; it then calls randomSampleWithRange to carve out each split. That function uses mapPartitionsWithIndex (covered in the previous section) to build a BernoulliCellSampler per partition, which keeps the elements whose random draw falls in [lb, ub).
**
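As a sketch of the weight handling, the plain-Java snippet below (hypothetical helper names, not Spark's BernoulliCellSampler; it assumes a single uniform draw per element) reproduces the normalization and cumulative-boundary computation from the source above, and shows why the resulting splits are disjoint and cover the whole RDD:

```java
import java.util.Random;

public class RandomSplitSketch {
    // Normalize the weights and build cumulative boundaries, mirroring
    // weights.map(_ / sum).scanLeft(0.0)(_ + _) in the Scala source.
    // For weights {0.1, 0.2, 0.7} this yields {0.0, 0.1, 0.3, 1.0}.
    static double[] cumulativeBounds(double[] weights) {
        double sum = 0.0;
        for (double w : weights) sum += w;
        double[] bounds = new double[weights.length + 1];
        bounds[0] = 0.0;
        for (int i = 0; i < weights.length; i++) {
            bounds[i + 1] = bounds[i] + weights[i] / sum;
        }
        return bounds;
    }

    // Each element draws one uniform value in [0, 1); split i keeps it iff the
    // draw lands in [bounds[i], bounds[i+1]). The intervals are disjoint and
    // cover [0, 1), so every element goes to exactly one split.
    static int assignSplit(double draw, double[] bounds) {
        for (int i = 0; i < bounds.length - 1; i++) {
            if (draw >= bounds[i] && draw < bounds[i + 1]) return i;
        }
        return bounds.length - 2; // unreachable for draws in [0, 1)
    }

    public static void main(String[] args) {
        double[] bounds = cumulativeBounds(new double[]{0.1, 0.2, 0.7});
        Random rng = new Random(100L);
        for (int x = 1; x <= 7; x++) {
            System.out.println(x + " -> split " + assignSplit(rng.nextDouble(), bounds));
        }
    }
}
```

Note that weights such as {1, 2, 7} produce the same boundaries as {0.1, 0.2, 0.7}, which is what "will be normalized if they don't sum to 1" means in the documentation.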
Example:
List<Integer> data = Arrays.asList(1, 2, 4, 3, 5, 6, 7);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data);
double[] weights = {0.1, 0.2, 0.7};
// Randomly split the RDD according to the provided weights
JavaRDD<Integer>[] randomSplitRDDs = javaRDD.randomSplit(weights);
System.out.println("randomSplitRDDs of size~~~~~~~~~~~~~~" + randomSplitRDDs.length);
int i = 0;
for (JavaRDD<Integer> item : randomSplitRDDs)
    System.out.println(i++ + " randomSplitRDDs of item~~~~~~~~~~~~~~~~" + item.collect());
Author: 小飞_侠_kobe
Link: https://www.jianshu.com/p/abe1755220b2