PostgreSQL利用递归优化求稀疏列唯一值的方法
在数据库中经常会碰到一些表的列是稀疏列,只有很少的值,例如性别字段,一般就只有2种不同的值。
但是当我们求这些稀疏列的唯一值时,如果表的数据量很大,速度还是会很慢。
例如:
创建测试表
bill=#createtablet_sex(sexchar(1),otherinfotext); CREATETABLE bill=#insertintot_sexselect'm',generate_series(1,10000000)||'thisistest'; INSERT010000000 bill=#insertintot_sexselect'w',generate_series(1,10000000)||'thisistest'; INSERT010000000
查询:
可以看到下面的查询速度很慢。
bill=#selectcount(distinctsex)fromt_sex; count ------- 2 (1row) Time:8803.505ms(00:08.804) bill=#selectsexfromt_sextgroupbysex; sex ----- m w (2rows) Time:1026.464ms(00:01.026)
那么我们对该字段加上索引又是什么情况呢?
速度依然没有明显
bill=#createindexidx_sex_1ont_sex(sex); CREATEINDEX bill=#selectcount(distinctsex)fromt_sex; count ------- 2 (1row) Time:8502.460ms(00:08.502) bill=#selectsexfromt_sextgroupbysex; sex ----- m w (2rows) Time:572.353ms
的变化,可以看到执行计划已经使用IndexOnlyScan了。
bill=#explainselectcount(distinctsex)fromt_sex; QUERYPLAN ---------------------------------------------------------------------------------------------- Aggregate(cost=371996.44..371996.45rows=1width=8) ->IndexOnlyScanusingidx_sex_1ont_sex(cost=0.44..321996.44rows=20000000width=2) (2rows)
同样的SQL我们看看在Oracle中性能如何?
创建测试表:
SQL>createtablet_sex(sexchar(1),otherinfovarchar2(100)); Tablecreated. SQL>insertintot_sexselect'm',rownum||'thisistest'fromdualconnectbylevel<=10000000; 10000000rowscreated. SQL>commit; Commitcomplete. SQL>insertintot_sexselect'w',rownum||'thisistest'fromdualconnectbylevel<=10000000; 10000000rowscreated. SQL>commit; Commitcomplete.
性能测试:
SQL>setlines1000pages2000 SQL>setautotraceon SQL>settimingon SQL>selectcount(distinctsex)fromt_sex; COUNT(DISTINCTSEX) ------------------ 2 Elapsed:00:00:01.58 ExecutionPlan ---------------------------------------------------------- Planhashvalue:3915432945 ---------------------------------------------------------------------------- |Id|Operation|Name|Rows|Bytes|Cost(%CPU)|Time| ---------------------------------------------------------------------------- |0|SELECTSTATEMENT||1|3|20132(1)|00:00:01| |1|SORTGROUPBY||1|3||| |2|TABLEACCESSFULL|T_SEX|14M|42M|20132(1)|00:00:01| ---------------------------------------------------------------------------- Note ----- -dynamicstatisticsused:dynamicsampling(level=2) Statistics ---------------------------------------------------------- 0recursivecalls 0dbblockgets 74074consistentgets 0physicalreads 0redosize 552bytessentviaSQL*Nettoclient 608bytesreceivedviaSQL*Netfromclient 2SQL*Netroundtripsto/fromclient 1sorts(memory) 0sorts(disk) 1rowsprocessed SQL>selectsexfromt_sextgroupbysex; SE -- m w Elapsed:00:00:01.08 ExecutionPlan ---------------------------------------------------------- Planhashvalue:3915432945 ---------------------------------------------------------------------------- |Id|Operation|Name|Rows|Bytes|Cost(%CPU)|Time| ---------------------------------------------------------------------------- |0|SELECTSTATEMENT||14M|42M|20558(3)|00:00:01| |1|SORTGROUPBY||14M|42M|20558(3)|00:00:01| |2|TABLEACCESSFULL|T_SEX|14M|42M|20132(1)|00:00:01| ---------------------------------------------------------------------------- Note ----- -dynamicstatisticsused:dynamicsampling(level=2) Statistics ---------------------------------------------------------- 0recursivecalls 0dbblockgets 74074consistentgets 0physicalreads 0redosize 589bytessentviaSQL*Nettoclient 608bytesreceivedviaSQL*Netfromclient 2SQL*Netroundtripsto/fromclient 1sorts(memory) 0sorts(disk) 2rowsprocessed
可以看到Oracle的性能即使不加索引也明显比PostgreSQL中要好。
那么我们在PostgreSQL中是不是没办法继续优化了呢?这种情况我们利用pg中的递归语句结合索引可以大幅提升性能。
SQL改写:
bill=#withrecursivetmpas( bill(#( bill(#selectmin(t.sex)assexfromt_sextwheret.sexisnotnull bill(#) bill(#unionall bill(#( bill(#select(selectmin(t.sex)fromt_sextwheret.sex>s.sexandt.sexisnotnull) bill(#fromtmpswheres.sexisnotnull bill(#) bill(#) bill-#selectcount(distinctsex)fromtmp; count ------- 2 (1row) Time:2.711ms
查看执行计划:
bill=#explainwithrecursivetmpas( bill(#( bill(#selectmin(t.sex)assexfromt_sextwheret.sexisnotnull bill(#) bill(#unionall bill(#( bill(#select(selectmin(t.sex)fromt_sextwheret.sex>s.sexandt.sexisnotnull) bill(#fromtmpswheres.sexisnotnull bill(#) bill(#) bill-#selectcount(distinctsex)fromtmp; QUERYPLAN ---------------------------------------------------------------------------------------------------------------------- Aggregate(cost=53.62..53.63rows=1width=8) CTEtmp ->RecursiveUnion(cost=0.46..51.35rows=101width=32) ->Result(cost=0.46..0.47rows=1width=32) InitPlan3(returns$1) ->Limit(cost=0.44..0.46rows=1width=2) ->IndexOnlyScanusingidx_sex_1ont_sext(cost=0.44..371996.44rows=20000000width=2) IndexCond:(sexISNOTNULL) ->WorkTableScanontmps(cost=0.00..4.89rows=10width=32) Filter:(sexISNOTNULL) ->CTEScanontmp(cost=0.00..2.02rows=101width=32) (11rows) Time:1.371ms
可以看到执行时间从原先的8000ms降低到了2ms,提升了几千倍!
甚至对比Oracle,性能也是提升了很多。
但是需要注意的是:这种写法仅仅是针对稀疏列,换成数据分布广泛的字段,显然性能是下降的,所以使用递归SQL不适合数据分布广泛的字段的groupby或者count(distinct)操作。
到此这篇关于PostgreSQL利用递归优化求稀疏列唯一值的文章就介绍到这了,更多相关PostgreSQL递归优化内容请搜索毛票票以前的文章或继续浏览下面的相关文章希望大家以后多多支持毛票票!