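[Introduction to Algorithms Study Notes-012] Drawing a uniform random sample of m out of n numbers

Introduction to Algorithms, p. 129, Exercise 5.3-7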

Suppose we want to create a random sample of the set {1,2,3,…,n}, that is, an m-element subset S, where 0≤m≤n, such that each m-subset is equally likely to be created. One way would be to set A[i]=i for i=1,2,3,…,n, call RANDOMIZE-IN-PLACE(A), and then take just the first m array elements. This method would make n calls to the RANDOM procedure. If n is much larger than m, we can create a random sample with fewer calls to RANDOM. Show that the following recursive procedure returns a random m-subset S of {1,2,…,n}, in which each m-subset is equally likely, while making only m calls to RANDOM:

RANDOM-SAMPLE(m,n)
if m == 0
    return ∅
else
    S = RANDOM-SAMPLE(m-1, n-1)
    i = RANDOM(1,n)
    if i ∈ S
        S = S ∪ {n}
    else
        S = S ∪ {i}
    return S

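In other words: draw m of n numbers at random so that every m-element subset is equally likely.

Proof 1 for this exercise: http://clrs.skanev.com/05/03/07.html

Proof 2 for this exercise: http://www.cnblogs.com/Jiajun/archive/2013/05/15/3080111.html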
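
For readers who prefer not to follow the links, the induction step can be sketched briefly. The notation here is introduced for illustration and is not taken from the linked pages: T is any fixed m-subset of {1,…,n}, S' is the set returned by RANDOM-SAMPLE(m-1, n-1), and the induction hypothesis is that S' equals each (m-1)-subset of {1,…,n-1} with probability 1/C(n-1, m-1). The base case m = 0 returns ∅ with probability 1.

```latex
% Case 1: n \in T. Then S' must equal T \setminus \{n\}, and n is added
% exactly when i \in S' or i = n, i.e. for m of the n possible values of i.
\Pr[S = T] = \frac{1}{\binom{n-1}{m-1}} \cdot \frac{m}{n}

% Case 2: n \notin T. Then S' = T \setminus \{j\} and i = j for some j \in T;
% the m choices of j are disjoint events.
\Pr[S = T] = \sum_{j \in T} \frac{1}{\binom{n-1}{m-1}} \cdot \frac{1}{n}
           = \frac{m}{n} \cdot \frac{1}{\binom{n-1}{m-1}}

% In both cases, using \binom{n}{m} = \frac{n}{m}\binom{n-1}{m-1}:
\Pr[S = T] = \frac{1}{\binom{n}{m}}
```

Each level of the recursion makes exactly one call to RANDOM, which gives the m calls claimed in the exercise.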

The exercise itself actually describes two solutions.

Solution 1: call RANDOMIZE-IN-PLACE(A)

import java.util.Random;

/**
 * Created: 2014-08-13 09:46:51
 * Project: Test
 * @author Cao Yanfeng
 * @since JDK 1.6.0_21
 * Class description: Solution 1, shuffle the array in place and take the first m elements.
 */
public class RandomSampleTest {
	private static final Random RANDOM = new Random();

	/**
	 * @param args
	 */
	public static void main(String[] args) {
		int[] array={1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
		int[] result=randomSample(array, 5);
		for (int i : result) {
			System.out.println(i);
		}
	}

	/* Shuffles array (modifying it) and returns a copy of its first m elements */
	public static int[] randomSample(int[] array,int m) {
		randomInPlace(array);
		int[] result=new int[m];
		for (int i = 0; i <m; i++) {
			result[i]=array[i];
		}
		return result;
	}

	/* Pseudocode from p. 126 of Introduction to Algorithms (RANDOMIZE-IN-PLACE) */
	public static void randomInPlace(int[] array) {
		int n=array.length;
		for (int i = 0; i < n; i++) {
			// Swap array[i] with a uniformly chosen element of array[i..n-1]
			int index=random(i, n-1);
			int temp=array[i];
			array[i]=array[index];
			array[index]=temp;
		}
	}

	/* Returns a random integer in the closed interval [a,b] */
	public static int random(int a,int b) {
		return RANDOM.nextInt(b-a+1)+a;
	}
}

Solution 2: implement the pseudocode from the exercise

import java.util.LinkedList;
import java.util.Random;

/**
 * Created: 2014-08-13 09:46:51
 * Project: Test
 * @author Cao Yanfeng
 * @since JDK 1.6.0_21
 * Class description: Solution 2, a direct implementation of RANDOM-SAMPLE.
 */
public class RandomSampleTest {
	private static final Random RANDOM = new Random();

	/**
	 * @param args
	 */
	public static void main(String[] args) {
		int[] array={1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
		LinkedList<Integer> result=randomSample(array, 5);
		for (Integer integer : result) {
			System.out.println(integer);
		}
	}

	public static LinkedList<Integer> randomSample(int[] array,int m){
		return sample(array, array.length, m);
	}

	/* Samples m elements from the first n entries of array (entries must be distinct) */
	public static LinkedList<Integer> sample(int[] array,int n,int m) {
		if (m==0) {
			return new LinkedList<Integer>();
		}else {
			LinkedList<Integer> s=sample(array, n-1, m-1);
			int i=array[random(0, n-1)];
			if (s.contains(i)) {
				s.add(array[n-1]);
			}else {
				s.add(i);
			}
			return s;
		}
	}

	/* Returns a random integer in the closed interval [a,b] */
	public static int random(int a,int b) {
		return RANDOM.nextInt(b-a+1)+a;
	}
}

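Solution 3: assigning weights

The first randomization method presented in Section 5.3, Randomized algorithms (p. 122 of Introduction to Algorithms), assigns each element a random weight (priority) and sorts the elements by weight. Because two elements may receive the same weight, this method is not recommended.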
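
For completeness, the weight-based idea could be sketched roughly as below. This is only an illustration, not code from the original post: the class and method names are hypothetical, the priorities are drawn from [1, n^3] as in the textbook's PERMUTE-BY-SORTING, and the lambda syntax assumes Java 8 or later (newer than the JDK 1.6 used elsewhere in this post). When two priorities collide, the stable sort keeps the original order, which is exactly the slight bias mentioned above.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Random;

public class PermuteBySortingSketch {
    private static final Random RNG = new Random();

    /** Returns a shuffled copy of array by sorting on random priorities. */
    public static int[] permuteBySorting(int[] array) {
        int n = array.length;
        long bound = (long) n * n * n;            // priorities lie in [1, n^3]
        final long[] priority = new long[n];
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            priority[i] = 1 + (long) (RNG.nextDouble() * bound);
            order[i] = i;
        }
        // Sort the indices by priority; ties keep their original order.
        Arrays.sort(order, Comparator.comparingLong(i -> priority[i]));
        int[] result = new int[n];
        for (int k = 0; k < n; k++) {
            result[k] = array[order[k]];
        }
        return result;
    }

    /** Takes the first m elements of the shuffled copy as the sample. */
    public static int[] sample(int[] array, int m) {
        return Arrays.copyOf(permuteBySorting(array), m);
    }

    public static void main(String[] args) {
        int[] array = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15};
        System.out.println(Arrays.toString(sample(array, 5)));
    }
}
```

Besides the collision issue, this approach pays for a sort (O(n log n)) and still makes n calls to the random generator, so it offers no advantage over Solution 1 here.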

Solution 4: reservoir sampling

See the extension problem at the end.

*******************************************************************************

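As the exercise points out, when sampling m items from n: if n is much larger than m, use Solution 2, which calls random() only m times; if n and m are close, use Solution 1, which calls random() n times but is simpler.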
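
To make the "only m calls" claim concrete, here is a hedged sketch that unrolls the recursion of RANDOM-SAMPLE into a loop (this iterative form is commonly attributed to Robert Floyd). It is not the code from this post: the class name, the `randomCalls` counter, and the use of a `HashSet` for constant-time membership tests are assumptions made for illustration. It samples from {1,…,n} directly rather than from an array.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class IterativeRandomSample {
    private static final Random RNG = new Random();

    /** Hypothetical counter, used only to show how often random() is called. */
    public static int randomCalls = 0;

    /** Returns a random integer in the closed interval [a, b]. */
    static int random(int a, int b) {
        randomCalls++;
        return RNG.nextInt(b - a + 1) + a;
    }

    /** Returns an m-subset of {1, ..., n}; needs no array of size n. */
    public static Set<Integer> sample(int m, int n) {
        if (m < 0 || m > n) {
            throw new IllegalArgumentException("need 0 <= m <= n");
        }
        Set<Integer> s = new HashSet<Integer>();
        // RANDOM-SAMPLE(m, n) recurses down to (0, n-m) and then handles the
        // upper bounds n-m+1, ..., n on the way back up; this loop does the same.
        for (int j = n - m + 1; j <= n; j++) {
            int i = random(1, j);
            if (!s.add(i)) {   // i was already chosen
                s.add(j);      // j cannot be in s yet: all earlier picks are < j
            }
        }
        return s;
    }

    public static void main(String[] args) {
        Set<Integer> s = sample(5, 1000000);
        System.out.println(s + " using " + randomCalls + " calls to random()");
    }
}
```

With n = 1,000,000 and m = 5 this makes 5 calls to random() and allocates nothing proportional to n, whereas Solution 1 would need a million-element array and a million calls.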

*******************************************************************************

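Extension problem: [Google interview question] Given a data stream containing an endless sequence of search keywords (for example, the keywords people keep typing into Google search), how can you pick 1000 keywords at random from this endless stream?

Reference: http://blog.csdn.net/minglingji/article/details/7984445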

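This is again the problem of drawing a uniform random sample of m out of n numbers, except that n is unknown. The technique to use is reservoir sampling: put the first 1000 items of the stream into an array of length 1000. For the 1001st item, call random(0, 1000); the result falls in the closed interval [0,999] with probability 1000/1001, and in that case the new item replaces the array element at that index. In general, the n-th item (n>1000) enters the array with probability 1000/n. Here random is called n-m times. The code below simulates this process.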

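import java.util.Random;

/**
 * Created: 2014-08-13 09:46:51
 * Project: Test
 * @author Cao Yanfeng
 * @since JDK 1.6.0_21
 * Class description: Solution 4, reservoir sampling simulated on an array.
 */
public class RandomSampleTest {
	private static final Random RANDOM = new Random();

	/**
	 * @param args
	 */
	public static void main(String[] args) {
		int[] array={1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
		int[] result=reservoirSample(array, 5);
		for (int i : result) {
			System.out.println(i);
		}
	}

	/* Reservoir sampling */
	public static int[] reservoirSample(int[] array,int m) {
		int[] reservoir=new int[m];
		for (int i = 0; i <array.length; i++) {
			if (i<m) {
				// The first m items fill the reservoir directly
				reservoir[i]=array[i];
			}else {
				// temp is uniform on [0,i], so it is below m with probability m/(i+1)
				int temp=random(0, i);
				if (temp<m) {
					reservoir[temp]=array[i];
				}
			}
		}
		return reservoir;
	}

	/* Returns a random integer in the closed interval [a,b] */
	public static int random(int a,int b) {
		return RANDOM.nextInt(b-a+1)+a;
	}
}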
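
The listing above simulates the stream with an array whose length is known. Since the interview question is about a stream of unknown length, a variant that only consumes an `Iterator` may be closer to the actual setting. The following is a minimal sketch under that assumption; the class name, the generic signature, and the sample keywords are hypothetical and not from the original post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

public class StreamReservoirSketch {

    /** Keeps a uniform sample of up to m items from a stream of unknown length. */
    public static <T> List<T> reservoirSample(Iterator<T> stream, int m, Random rng) {
        List<T> reservoir = new ArrayList<T>(m);
        long seen = 0;                       // number of items read so far
        while (stream.hasNext()) {
            T item = stream.next();
            seen++;
            if (reservoir.size() < m) {
                reservoir.add(item);         // the first m items fill the reservoir
            } else {
                // slot is uniform on [0, seen-1]; it is below m with probability m/seen
                long slot = (long) (rng.nextDouble() * seen);
                if (slot < m) {
                    reservoir.set((int) slot, item);
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        // Hypothetical keywords standing in for the search stream.
        List<String> keywords = Arrays.asList(
                "weather", "news", "maps", "java", "clrs", "random", "sample");
        System.out.println(reservoirSample(keywords.iterator(), 3, new Random()));
    }
}
```

For a truly endless stream the loop never terminates; in practice one would read the reservoir at whatever point a sample is needed, and at that moment it is a uniform sample of everything seen so far.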

Time: 2024-10-12 21:01:56
