Update on Histograms
OK, that was a bad example. Here's a new approach.
When the Oracle documentation talks about skewed values it is referring to particular values in a column occurring with significantly different frequency than other values. A classic example of this is 95% "M" and 5% "F".
However, let me propose that this misses situations like the following: 95% A, 5% B, 0% C, 0% D ... i.e. there are values that do not occur in the column at all, but which the optimizer might believe do occur -- or might occur, at any rate.
Let's take a slightly more "real world" example. Say we have a wholesale transactions table that is range partitioned on "Date Of Transaction" at a monthly level, so there are approximately 30 days to each partition. The key to this example is that no sales occur on Saturday or Sunday, so in fact there are only around twenty distinct date values in each partition. Each of those values has about the same number of rows, so the values that do occur have a pretty even distribution.
Now although the data does not possess the classic type of skew, there is still a kind of skew in there that is hidden from the optimizer, because the Saturday and Sunday dates have 0% representation in the table. The key is to let the optimizer know about these gaps, which it cannot infer from plain high/low/number-of-distinct-values column statistics.
Example script:
SQL> drop table demo_hist;
Table dropped.
SQL>
SQL> create table demo_hist
2 as
3 select my_date,
4 num
5 From (
6 Select To_Date('01-aug-2005')+
7 Mod(rownum,
8 Add_months('01-Aug-2005',1)-To_Date('01-Aug-2005'))
9 my_date,
10 rownum num
11 From dual
12 Connect By 1=1 and level <= 100000
13 )
14 Where to_char(my_date,'D') not in ('1','7');
Table created.
SQL>
SQL> commit;
Commit complete.
SQL>
SQL> set autotrace on
SQL>
SQL> select count(*) from demo_hist
2 where my_date = to_date('01-aug-2005');
COUNT(*)
----------
3225
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=ALL_ROWS (Cost=44 Card=1 Bytes=9)
1 0 SORT (AGGREGATE)
2 1 TABLE ACCESS (FULL) OF 'DEMO_HIST' (TABLE) (Cost=44 Card
=2943 Bytes=26487)
Statistics
----------------------------------------------------------
48 recursive calls
0 db block gets
251 consistent gets
184 physical reads
0 redo size
394 bytes sent via SQL*Net to client
508 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
SQL>
SQL> select count(*) from demo_hist
2 where my_date between to_date('06-aug-2005') and to_date('09-aug-2005');
COUNT(*)
----------
6452
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=ALL_ROWS (Cost=44 Card=1 Bytes=9)
1 0 SORT (AGGREGATE)
2 1 TABLE ACCESS (FULL) OF 'DEMO_HIST' (TABLE) (Cost=44 Card
=5830 Bytes=52470)
Statistics
----------------------------------------------------------
6 recursive calls
0 db block gets
247 consistent gets
0 physical reads
0 redo size
394 bytes sent via SQL*Net to client
508 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
SQL>
SQL>
SQL> begin
2 dbms_stats.gather_table_stats
3 (
4 ownname => user,
5 tabname => 'DEMO_HIST',
6 estimate_percent => 100,
7 method_opt => 'for columns NUM size 254'
8 );
9 end;
10 /
PL/SQL procedure successfully completed.
SQL>
SQL> select count(*) from demo_hist
2 where my_date = to_date('01-aug-2005');
COUNT(*)
----------
3225
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=ALL_ROWS (Cost=45 Card=1 Bytes=9)
1 0 SORT (AGGREGATE)
2 1 TABLE ACCESS (FULL) OF 'DEMO_HIST' (TABLE) (Cost=45 Card
=742 Bytes=6678)
Statistics
----------------------------------------------------------
42 recursive calls
0 db block gets
192 consistent gets
0 physical reads
0 redo size
394 bytes sent via SQL*Net to client
508 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
SQL>
SQL> select count(*) from demo_hist
2 where my_date between to_date('06-aug-2005') and to_date('09-aug-2005');
COUNT(*)
----------
6452
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=ALL_ROWS (Cost=45 Card=1 Bytes=9)
1 0 SORT (AGGREGATE)
2 1 TABLE ACCESS (FULL) OF 'DEMO_HIST' (TABLE) (Cost=45 Card
=185 Bytes=1665)
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
188 consistent gets
0 physical reads
0 redo size
394 bytes sent via SQL*Net to client
508 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
SQL>
SQL> begin
2 dbms_stats.gather_table_stats
3 (
4 ownname => user,
5 tabname => 'DEMO_HIST',
6 estimate_percent => 100,
7 method_opt => 'for columns MY_DATE size 254'
8 );
9 end;
10 /
PL/SQL procedure successfully completed.
SQL>
SQL> select count(*) from demo_hist
2 where my_date = to_date('01-aug-2005');
COUNT(*)
----------
3225
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=ALL_ROWS (Cost=45 Card=1 Bytes=8)
1 0 SORT (AGGREGATE)
2 1 TABLE ACCESS (FULL) OF 'DEMO_HIST' (TABLE) (Cost=45 Card
=3225 Bytes=25800)
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
188 consistent gets
0 physical reads
0 redo size
394 bytes sent via SQL*Net to client
508 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
SQL>
SQL> select count(*) from demo_hist
2 where my_date between to_date('06-aug-2005') and to_date('09-aug-2005');
COUNT(*)
----------
6452
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=ALL_ROWS (Cost=45 Card=1 Bytes=8)
1 0 SORT (AGGREGATE)
2 1 TABLE ACCESS (FULL) OF 'DEMO_HIST' (TABLE) (Cost=45 Card
=6452 Bytes=51616)
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
188 consistent gets
0 physical reads
0 redo size
394 bytes sent via SQL*Net to client
508 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
Summary ...
Predicate                     =   Between
------------------------ ------  --------
Correct Answer            3,225     6,452
No Stats Estimate         2,943     5,830
Col Stats Estimate          742       185
Histogram Stats Estimate  3,225     6,452
Obviously there's more to this ... the histogram-based estimate was so good because the number of distinct values was low, and you'd expect the quality of the estimate to degenerate as the number of distinct values increases.
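One way to keep an eye on where you stand with that, as a sketch (USER_TAB_COL_STATISTICS is the standard dictionary view; nothing here is specific to this demo beyond the table name): compare the number of distinct values to the number of histogram buckets. With at most 254 buckets available, once NUM_DISTINCT exceeds NUM_BUCKETS the histogram becomes height-balanced and the per-value accuracy starts to fall away.

select column_name,
       num_distinct,
       num_buckets,
       histogram
from   user_tab_col_statistics
where  table_name = 'DEMO_HIST';

-- when NUM_DISTINCT <= NUM_BUCKETS the histogram is a frequency histogram and
-- per-value cardinality estimates stay very accurate; beyond that it is
-- height-balanced and only the popular values are represented precisely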
Thoughts?
6 Comments:
Things look different in XE. For the first query in your example...
Execution Plan
----------------------------------------------------------
Plan hash value: 628504670
--------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 9 | 52 (2)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 9 | | |
|* 2 | TABLE ACCESS FULL| DEMO_HIST | 3224 | 29016 | 52 (2)| 00:00:01 |
--------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("MY_DATE"=TO_DATE('2005-08-01 00:00:00', 'yyyy-mm-dd
hh24:mi:ss'))
Note
-----
- dynamic sampling used for this statement
Statistics
----------------------------------------------------------
48 recursive calls
0 db block gets
260 consistent gets
184 physical reads
0 redo size
412 bytes sent via SQL*Net to client
380 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
XE:
no stats         3224   6428
no histograms     742    185
histograms       3225   6452
That'll be the magic of dynamic sampling. With the table being unanalyzed the dynamic sampling kicked in, and because the dynamic sampling query uses the same predicate as the user query, the estimate is very accurate.
Now as soon as the table had statistics the dynamic sampling wasn't performed, and the estimation was based on the misleading column statistics.
With a histogram in place there was also no dynamic sampling but the estimation was much better.
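If you wanted to see that again on this demo table, a minimal sketch (assuming the default OPTIMIZER_DYNAMIC_SAMPLING level of 2, at which only tables without statistics get sampled): delete the gathered statistics and the sampling should kick back in.

begin
   dbms_stats.delete_table_stats(ownname => user, tabname => 'DEMO_HIST');
end;
/

set autotrace on
select count(*) from demo_hist
where  my_date = to_date('01-aug-2005','dd-mon-yyyy');
-- the Note section of the plan should once again report
-- "dynamic sampling used for this statement"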
We're going to let Oracle gather statistics in 10g. Testing indicates we will get many, many histograms, where before we adhered more to the "don't create histograms unless you need them for your skewed data" approach. It's going to be an interesting ride. The improved cardinalities from the histograms for tables without skewed data will be a win.
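For reference, a sketch of roughly what the 10g automatic job does per table (AUTO_SAMPLE_SIZE and 'FOR ALL COLUMNS SIZE AUTO' are the documented 10g defaults; SIZE AUTO is what hands the histogram decision over to Oracle, based on data distribution and column usage):

begin
   dbms_stats.gather_table_stats
   (
      ownname          => user,
      tabname          => 'DEMO_HIST',
      estimate_percent => dbms_stats.auto_sample_size,
      method_opt       => 'for all columns size auto'
   );
end;
/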
But I expect that introducing histograms on columns with skewed data, where applications use bind variables, is going to leave us open to bad plans due to bind variable peeking. We didn't encounter obvious problems in our rounds of testing, but theoretically the problem is lurking out there.
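A minimal sketch of that exposure, using the demo table above (the bind variable name is just illustrative): whichever value is bound at the first hard parse is the one the optimizer peeks at, and the resulting plan and cardinality estimate are reused for later executions with very different selectivities.

variable txn_date varchar2(11)

-- first hard parse: the optimizer peeks at this value (a busy Monday, ~3225 rows)
exec :txn_date := '01-aug-2005'
select count(*) from demo_hist
where  my_date = to_date(:txn_date,'dd-mon-yyyy');

-- later executions reuse that plan, even for a Saturday that has no rows at all
exec :txn_date := '06-aug-2005'
select count(*) from demo_hist
where  my_date = to_date(:txn_date,'dd-mon-yyyy');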
Most of our skewed data is in temp tables that are volatile. I suppose we may end up removing these volatile tables from the control of the 10g automated statistics gathering job and managing the statistics on our own, which is what Oracle recommends anyway.
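In 10g one way to take a table like that out of the automatic job's hands, as a sketch (DBMS_STATS.LOCK_TABLE_STATS and the FORCE parameter on GATHER_TABLE_STATS are both 10g features; the demo table is just standing in for a volatile one):

-- lock the statistics so the automatic gathering job skips the table
begin
   dbms_stats.lock_table_stats(ownname => user, tabname => 'DEMO_HIST');
end;
/

-- gather manually at a time of our own choosing, overriding the lock
begin
   dbms_stats.gather_table_stats
   (
      ownname => user,
      tabname => 'DEMO_HIST',
      force   => true
   );
end;
/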
Then the question will be whether to collect lots of histograms for these tables and change the applications that use bind variables (which would then be vulnerable to bind variable peeking, and so should use literals), or to skip the histograms and let the bind variable programs run the way they did in the past. Probably the latter to start with.
In our second round of 10g testing we forcibly gathered statistics as part of the migration process, so we had a full set of 10g statistics. We didn't see any obvious problems; I guess the bind variable peeking thing is sort of probabilistic -- what are the odds that a particular query will actually hit a peeking problem? It depends on the skew, on the range of real inputs to the query, etc.
I suppose I should add that changes to applications, if they do occur, wouldn't involve a massive cutover from bind variables to literals -- the literals would be used against the columns with skewed data, and bind variables everywhere else. And there's the question of extra parsing because of the literals -- we'd have to be careful with that.
For those highly volatile tables you might consider using literals and dynamic sampling instead of statistics gathering, especially if you have a lot of queries that would benefit from histograms, and especially if they would benefit from sensitivity to value distributions across multiple columns (which histograms are not much help with).
Of course it would also depend on the number of queries -- you don't want to go flooding the shared pool etc.
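As a sketch of that combination on the demo table (the DYNAMIC_SAMPLING hint and the OPTIMIZER_DYNAMIC_SAMPLING parameter are standard; level 4 is just an illustrative choice): leave the volatile table without statistics and let each query sample at parse time, with literals so each distinct predicate gets its own sampled estimate.

-- session-wide: raise the sampling level for every query in the session
alter session set optimizer_dynamic_sampling = 4;

-- or per query, via a hint against the table alias
select /*+ dynamic_sampling(t 4) */ count(*)
from   demo_hist t
where  my_date between to_date('06-aug-2005','dd-mon-yyyy')
                   and to_date('09-aug-2005','dd-mon-yyyy');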