有两张表T1,T2,表结构和数据如下:
create table t1 ( id int ); create table t2 ( id int, finished int ); insert into t1 values (1); insert into t1 values (2); insert into t1 values (3); insert into t1 values (4); insert into t1 values (5); insert into t1 values (6); insert into t2 values (1,1); insert into t2 values (2,0); insert into t2 values (3,1); insert into t2 values (4,0);
SQL Fiddle 可以测试SQL语句的执行。
想要实现T1中存在前提下,T2中不存在或者finished=0,也就是查询结果是:2, 4, 5, 6.
一、性能测试:
1. not in:
explain (analyze,verbose,costs,buffers) select ID from T1 where ID not in (select ID from T2 where finished=1);
Total runtime: 0.128 ms
2. not exists:
explain (analyze,verbose,costs,buffers) select ID from T1 where not exists (select 1 from T2 where T1.ID=T2.ID and T2.finished=1);
Total runtime: 0.105 ms
3. left join:
explain (analyze,verbose,costs,buffers) select T1.ID from T1 left join T2 on T1.ID=T2.ID and T2.finished=1 where T2.ID is null;
Total runtime: 0.096 ms
4. 网上还看到一种更快方法,但测试下来此方法不对,所以不讨论:
select ID from T2 where (select count(1) from T1 where T1.ID=T2.ID) = 0; 这条语句查询结果为空
因此在postgresql 9.3 上语句执行速度 left join > not exists > not in
当T1和T2表中ID出现null时,not in 语句会有不同的表现,所以推荐总是用not exists 代替 not in.
二、大数据量性能测试:
在大量数据的时候,not in有严重性能下降的问题,下面是我在i5 2.4GHz MAC pro 13吋上的测试。
department(T1) 为59280条数据,数据长度29字符;dept(T2) 为23633条数据,数据长度29字符。
1. explain analyze select department.id from department where department.id not in (select id from dept where finished=true);
Total runtime: 447073.065 ms
2. explain analyze select department.id from department where not exists (select 1 from dept where department.id=dept.id and finished=true);
Total runtime: 325.732 ms
3. explain analyze select department.id from department left join dept on department.id=dept.id and dept.finished=true where dept.id is null;
Total runtime: 319.869 ms
三、总结:
在Postgresql 9.3上:
not in 不仅性能差,而且逻辑可能有问题。
not exists 性能不错,思考起来比较容易。
left join 性能最好,但总体跟not exists 比也快不了多少,思考稍微有点绕。
下面是一张网上的left join 的图,但找不到出处,有助于理解 left join 的过程: