
Reservoir Sampling with MapReduce

Tags:

Cloud Computing, Big Data, Data Structures

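I have been studying big-data algorithms lately and have written quite a few blog posts about them (怪咖科学). I hope to share some of the techniques and experience I picked up with everyone on 慕课网 as well.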

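Problem: we have a very large data set, say tens of millions of records, but the exact count is unknown. How can we pick K records uniformly at random while reading the data only once?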

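Approach:
1. This is the reservoir sampling problem. Put the first K records into a list. Select the (K+1)-th record with probability K/(K+1), the (K+2)-th record with probability K/(K+2), and so on: the M-th record (M>K) is selected with probability K/M. A selected record replaces a randomly chosen entry of the list. After N records have been read, every record is in the list with probability K/N, for any N and not only in the limit: record M is selected with probability K/M and survives each later step j with probability (j-1)/j, and the product of these factors from j=M+1 to N is M/N.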
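The rule in step 1 can be tried outside Hadoop first. The following is a minimal plain-Java sketch, not part of the original job; the class name ReservoirSketch and the method sample are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Hypothetical helper class; name and signature are illustrative only.
final class ReservoirSketch {

    // Returns k items drawn uniformly at random from the stream in one pass.
    // If the stream holds fewer than k items, all of them are returned.
    static <T> List<T> sample(Iterable<T> stream, int k, Random random) {
        List<T> reservoir = new ArrayList<>();
        int seen = 0; // number of items read so far
        for (T item : stream) {
            seen++;
            if (seen <= k) {
                // The first k items fill the reservoir.
                reservoir.add(item);
            } else {
                // p is uniform in [0, seen), so p < k holds with probability k/seen.
                int p = random.nextInt(seen);
                if (p < k) {
                    // p is also uniform over the k slots, so it picks the slot to evict.
                    reservoir.set(p, item);
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        // Prints three of the ten numbers; the choice differs from run to run.
        System.out.println(sample(data, 3, new Random()));
    }
}
```

A single random number does double duty here: it decides whether the new item is kept and, if so, which slot it overwrites.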

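2. Designing the Mapper: in setup, initialize K, the number of records to sample. In map, track the position row of the incoming record within the data stream. If row is at most K, append the record to the list. If row is greater than K, generate a random integer p in [0, row); if p is less than K, the record replaces the p-th entry of the list, otherwise the list is left unchanged.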

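Once every record has passed through map, the list holds K records, and this list is the random sample for that split. If the whole input is smaller than one split, there is only one mapper, so the Reduce phase can be omitted and cleanup can write the result straight to HDFS.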

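import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper extends Mapper<Object, Text, Text, NullWritable> {
    private final Random random = new Random();
    private final List<Text> result = new ArrayList<>(); // the reservoir
    private int row = 0; // records seen so far in this split
    private int k = 0;   // sample size

    @Override
    protected void setup(Context context)
            throws IOException, InterruptedException {
        // Read the sample size from the job configuration; defaults to 3.
        k = context.getConfiguration().getInt("k", 3);
    }

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        row++;
        if (row <= k) {
            // Copy the value: Hadoop reuses the same Text instance between calls.
            result.add(new Text(value));
        } else {
            // p < k holds with probability k/row; p is then the slot to replace.
            int p = randI(row);
            if (p < k) {
                result.set(p, new Text(value));
            }
        }
    }

    /** Returns a uniformly distributed integer in [0, max). */
    private int randI(int max) {
        return random.nextInt(max);
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // Emit the reservoir once the whole split has been read.
        for (Text record : result) {
            context.write(record, NullWritable.get());
        }
    }
}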

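3. Designing the Reducer: because the data set is very large, it is read by many mappers. With m mappers, the Mapper phase hands up to m*K candidate records to Reduce, so the Reducer only has to pick K records at random from those m*K candidates. Two conditions apply. The job must run a single reducer (job.setNumReduceTasks(1)), otherwise each reducer would output its own K records. The final sample is uniform only when every split holds about the same number of records; records from a smaller split are otherwise over-represented.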

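import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
    private final Random random = new Random();
    private final List<Text> result = new ArrayList<>(); // the reservoir
    private int row = 0; // candidate records seen so far
    private int k = 0;   // sample size

    @Override
    protected void setup(Context context)
            throws IOException, InterruptedException {
        // Read the sample size from the job configuration; defaults to 3.
        k = context.getConfiguration().getInt("k", 3);
    }

    @Override
    protected void reduce(Text key, Iterable<NullWritable> values, Context context)
            throws IOException, InterruptedException {
        // Identical records are grouped under one key, so each value counts
        // as one candidate; looping keeps duplicates from being collapsed.
        for (NullWritable ignored : values) {
            row++;
            if (row <= k) {
                result.add(new Text(key));
            } else {
                int p = randI(row);
                if (p < k) {
                    result.set(p, new Text(key));
                }
            }
        }
    }

    /** Returns a uniformly distributed integer in [0, max). */
    private int randI(int max) {
        return random.nextInt(max);
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // Emit the final sample of at most k records.
        for (Text record : result) {
            context.write(record, NullWritable.get());
        }
    }
}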
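To see the Mapper and Reducer stages working together without a cluster, the two-stage flow can be simulated in plain Java. This is only a sketch under assumptions: the class TwoStageSketch and its methods are hypothetical names, the splits are modelled as equal-sized slices of an in-memory list, and nothing here uses the Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical simulation of the job; not Hadoop code.
final class TwoStageSketch {

    // One-pass reservoir sample of size k, the same rule both stages apply.
    static <T> List<T> sample(List<T> items, int k, Random random) {
        List<T> reservoir = new ArrayList<>();
        int row = 0;
        for (T item : items) {
            row++;
            if (row <= k) {
                reservoir.add(item);
            } else {
                int p = random.nextInt(row);
                if (p < k) {
                    reservoir.set(p, item);
                }
            }
        }
        return reservoir;
    }

    // Simulates `splits` mappers over equal-sized slices, then one reducer.
    static <T> List<T> twoStage(List<T> data, int splits, int k, Random random) {
        int splitSize = (data.size() + splits - 1) / splits; // ceiling division
        List<T> pooled = new ArrayList<>();
        for (int start = 0; start < data.size(); start += splitSize) {
            int end = Math.min(start + splitSize, data.size());
            // "Mapper": keep k candidates from this slice.
            pooled.addAll(sample(data.subList(start, end), k, random));
        }
        // "Reducer": keep k of the pooled candidates.
        return sample(pooled, k, random);
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            data.add(i);
        }
        // Prints 5 of the 1000 numbers; the choice differs from run to run.
        System.out.println(twoStage(data, 4, 5, new Random()));
    }
}
```

Because each simulated mapper forwards at most K records, the reducer never holds more than m*K candidates in this sketch, however large the input is.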