If you're just joining us, I'd suggest that you scroll down and read the original article then the follow-ups from the bottom upwards.
Update 03 Jun 2005

Another consideration has been raised, on the uniformity of the number of rows per block. One reason why the test data showed good resilience towards estimation of statistics, for block sampling in particular, might be a lack of variation in the number of rows per block. It occurs to me that for a table with a fairly regular row length that is not subject to deletes, a 1% block sample is going to sample around 1% of the rows, even if the sampled rows are clustered in a small number of sampled blocks.
So I ran a quick test to see what the distribution of rows-per-block is like in the table, and came up with the following query:
select rows_per_block,
count(*) blocks,
sum(rows_per_block) sum_rows
from
(
select dbms_rowid.rowid_block_number(rowid),
count(*) rows_per_block
from t1
group by dbms_rowid.rowid_block_number(rowid)
)
group by rows_per_block
order by 1 desc
/
ROWS_PER_BLOCK BLOCKS SUM_ROWS
-------------- ---------- ----------
345 1 345
337 2 674
336 3 1008
324 11 3564
323 42 13566
322 58 18676
321 5 1605
320 1 320
319 1 319
315 1 315
314 1 314
313 3 939
312 3 936
311 6 1866
310 115 35650
309 503 155427
308 363 111804
307 38 11666
306 3 918
305 1 305
304 2 608
302 3 906
301 2 602
300 5 1500
299 3 897
298 26 7748
297 475 141075
296 1260 372960
295 378 111510
294 6 1764
213 1 213
---------- ----------
sum 3322 1000000
Or to summarise:
select to_char(floor(rows_per_block/10)*10,'fm990')||'''s' rows_per_block,
count(*) blocks,
sum(rows_per_block) sum_rows
from
(
select dbms_rowid.rowid_block_number(rowid),
count(*) rows_per_block
from t1
group by dbms_rowid.rowid_block_number(rowid)
)
group by to_char(floor(rows_per_block/10)*10,'fm990')||'''s'
order by 1 desc
ROWS_P BLOCKS SUM_ROWS
------ ---------- ----------
340's 1 345
330's 5 1682
320's 117 37731
310's 130 40339
300's 920 283736
290's 2148 635954
210's 1 213
---------- ----------
sum 3322 1000000
Now bear in mind that it's implicit in this query that we're not including empty blocks in there. According to the table statistics based on a compute, the table has 0 empty blocks.
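For what it's worth, that can be cross-checked from the dictionary once the statistics are in place -- a trivial query, assuming statistics have been gathered on T1 as above:
select num_rows, blocks, empty_blocks
from user_tables
where table_name = 'T1'
/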
So the worst-case scenarios for a 1% block sample are that it reads only the blocks with either the smallest or the greatest number of rows per block. In this case that's not a huge disparity: if the 33 blocks sampled were those with the fewest rows then Oracle would be sampling 9,647 rows, and if they were the blocks with the most rows then it would be sampling 10,759 rows. Bear in mind also that both the estimated number of rows for the table and the estimated average row length suffer similar errors, although they would be inversely related (overestimation of the number of rows would correlate with underestimation of the average row length).
It's not too bad, though. This means that for this particular table the maximum overestimation of the row count based on a 1% block sample is around 4.5%.
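If you wanted to derive those worst-case figures directly, rather than reading them off the distribution by hand, an analytic query along these lines would do it. This is just a sketch, with the 33 hard-coded as roughly 1% of the table's 3,322 blocks:
select sum(case when rn_asc  <= 33 then rows_per_block end) fewest_rows_sampled,
       sum(case when rn_desc <= 33 then rows_per_block end) most_rows_sampled
from
(
select rows_per_block,
       row_number() over (order by rows_per_block asc)  rn_asc,
       row_number() over (order by rows_per_block desc) rn_desc
from
(
select count(*) rows_per_block
from t1
group by dbms_rowid.rowid_block_number(rowid)
)
)
/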
There's one other consideration, however. A 1% block sample does not guarantee that 1% of the blocks will be read. As the documentation states, "This percentage indicates the probability of each row, or each cluster of rows in the case of block sampling, being selected as part of the sample. It does not mean that the database will retrieve exactly sample_percent of the rows of table."
We can illustrate this easily by a query:
select avg(est_rows) avg_rows,
stddev(est_rows) sd_rows,
avg(blks) avg_blks,
stddev(blks) sd_blks,
min(blks) min_blks,
max(blks) max_blks
from
(
select count(*) est_rows,
count(distinct dbms_rowid.rowid_block_number(rowid)) blks
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
)
/
AVG_ROWS SD_ROWS AVG_BLKS SD_BLKS MIN_BLKS MAX_BLKS
---------- ---------- ---------- ---------- ---------- ----------
  9578.125 1699.43488     31.875  5.6317552         20         43

So while the previous minimum and maximum don't really apply -- the estimate could actually be rather higher or lower depending on how many blocks were truly sampled -- I'm inclined to think that this still supports the original idea: the relatively stable number of rows per block in this table makes it more amenable to block-based sampling than might be the case in the real world.
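As an aside, sixteen UNION ALL branches is about my copy-and-paste limit, so if you wanted a larger number of trials a small PL/SQL loop writing to a scratch table would be less typing. A sketch only -- the sample_results table is one you'd create yourself, and the dynamic SQL sidesteps any restriction on using the SAMPLE clause in static PL/SQL:
create table sample_results (est_rows number, blks number);

begin
  for i in 1 .. 100 loop
    execute immediate
      'insert into sample_results
       select count(*),
              count(distinct dbms_rowid.rowid_block_number(rowid))
       from t1 sample block (1)';
  end loop;
  commit;
end;
/

select avg(est_rows), stddev(est_rows),
       avg(blks), stddev(blks), min(blks), max(blks)
from sample_results
/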
However, just as I stated before, neither JL's original article nor this blog is really telling you to use a particular method. They merely expose the tools and the thought processes that you can follow in order to make your own judgement on the matter.
Update 02 Jun 2005

Prompted by Jan's comment, here are some thoughts on how we can use this information to make decisions about our own systems' statistics gathering requirements.

This article is not necessarily intended to give direct guidance on what sampling percentage and methodology you ought to be using on your own data, although hopefully it has already given you ideas. What you might take away from it is exactly what I learned from JL's article -- that we ought to be applying measurement techniques to the performance and accuracy of the various methods, and making our own decisions based on experimentation. Much of the value of this technique, or indeed of any technique based on measurement, comes from being able to document the rationale for your decisions, and that brings to mind a few scenarios:
- You measure and document potential changes to the statistics gathering methodology prior to making any decision to change -- an effective proposal can then be put forward ... "Based on this analysis, I recommend the following changes to our statistics gathering technique because although our statistical accuracy will degrade by 3%, our statistics gathering workload will decrease by 66%".
- Having documented the method and the results, you can then perform regular reviews (quarterly, for example) to check for any benefit in changing the methodology.
- You can justify the decisions you made if they are challenged in the future, and show that you acted with due diligence and a rational procedure.
I'll also repeat what I mentioned in another comment -- that one of the features of JL's article that made it so valuable to me was that it laid bare a problem I had but had not realised I had. I knew that I'd been wondering whether computing statistics was beneficial in comparison to estimating them, but I had never taken the next step of measuring it.
Without wanting to sound like too much of a kiss-ass, thanks again Jonathan.
Original Article

Jonathan Lewis has posted an interesting article on the topic of how precise, and how often, statistics should be collected, in which he demonstrates the influence that different distributions of values have on the reliability of a range of sampling percentages in the collection of statistics.
That was a long sentence. Moving along ...
This prompted me to wonder what effect the option to sample "by block", rather than the default of "by row", would have on these numbers. Briefly, the DBMS_STATS package allows you to choose whether the required sample percentage is applied to the blocks in the segment or to the rows -- the implication being that sampling by rows will necessarily access more blocks than sampling the same percentage of blocks.
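For reference, the switch in question is the block_sample parameter of dbms_stats.gather_table_stats. Something like this -- shown purely to illustrate the call -- gathers a 1% block sample rather than a 1% row sample:
begin
  dbms_stats.gather_table_stats(
    ownname          => user,
    tabname          => 'T1',
    estimate_percent => 1,
    block_sample     => TRUE);   -- FALSE, the default, samples by row
end;
/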
It seems reasonable to assume that when you choose to sample by block you will get lower accuracy on clustered values than on other, more random, distributions, but the two questions to be answered were:
How much less accurate might the block sampling prove to be?
How much more performant might the block sampling be?
On to the test.
I used the same method to construct a sample data set as in JL's paper, but on 10.1.0.2.0.
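JL's paper includes the actual table creation script. Purely to make this write-up more self-contained, a table of roughly the right shape -- 1,000,000 rows with around 50,000 distinct values per column, CLUSTERED packing equal values into adjacent blocks and SCATTERED cycling them throughout the table -- could be knocked up along these lines. This is an illustration of the idea, not his script:
create table t1
nologging
as
select trunc((rownum - 1) / 20)                 clustered,   -- equal values in adjacent rows
       mod(rownum - 1, 50000)                   scattered,   -- each value spread across the table
       trunc(dbms_random.value(0, 50000))       uniform,     -- evenly distributed
       trunc(dbms_random.normal * 5000) + 25000 normal       -- centre-weighted
from   all_objects a,
       all_objects b
where  rownum <= 1000000
/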
I wrote a procedure to allow non-privileged users to flush the buffer cache, so that the wall-clock timings I used for performance comparison would not be influenced by block caching:
create procedure flush_buffer_cache
as
begin
execute immediate 'alter system flush buffer_cache';
end;
/
grant execute on flush_buffer_cache to dave
/

Then I created a table to store the percentages that were to be used in the statistics gathering, and a table to store the results of the tests:
set feedback off
drop table t;
create table t (pct number);
insert into t values (1);
insert into t values (2);
insert into t values (4);
insert into t values (6);
insert into t values (8);
insert into t values (10);
insert into t values (20);
insert into t values (40);
insert into t values (60);
insert into t values (80);
insert into t values (100);
create table result
(
pct number,
block_sampling char(1) check (block_sampling in ('Y','N')),
duration_sec number,
column_name varchar2(30),
num_distinct number(10,2)
)
/
The anonymous PL/SQL block used to gather the results was as follows:
declare
l_time number;
l_duration number;
begin
for list in (select to_char(pct) pct_txt,pct from t order by pct)
loop
sys.flush_buffer_cache;
l_time := dbms_utility.get_time;
dbms_stats.gather_table_stats(
ownname => user,
tabname => 'T1',
estimate_percent => list.pct_txt,
block_sample => FALSE,
method_opt => 'FOR ALL COLUMNS');
l_duration := (dbms_utility.get_time - l_time)/100;
insert into result
select list.pct,
'N',
l_duration,
column_name,
num_distinct
from user_tab_columns
where table_name = 'T1';
sys.flush_buffer_cache;
l_time := dbms_utility.get_time;
dbms_stats.gather_table_stats(
ownname => user,
tabname => 'T1',
estimate_percent => list.pct_txt,
block_sample => TRUE,
method_opt => 'FOR ALL COLUMNS');
l_duration := (dbms_utility.get_time - l_time)/100;
insert into result
select list.pct,
'Y',
l_duration,
column_name,
num_distinct
from user_tab_columns
where table_name = 'T1';
end loop;
end;
/
Incidentally that code set a new record for me, for the longest code section to compile and work correctly first time. Deep joy.
So, the results. Each table below covers one of the test columns: Pct is the sample percentage, the columns headed N are for row sampling and those headed Y are for block sampling, and within each pair the figures are the elapsed time in seconds and the estimated number of distinct values.
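The figures can be pulled out of the RESULT table with a query along these lines, one column at a time -- just a sketch of the pivot, using the CLUSTERED column as an example:
select pct,
       max(decode(block_sampling, 'N', duration_sec)) n_sec,
       max(decode(block_sampling, 'N', num_distinct)) n_distinct,
       max(decode(block_sampling, 'Y', duration_sec)) y_sec,
       max(decode(block_sampling, 'Y', num_distinct)) y_distinct
from result
where column_name = 'CLUSTERED'
group by pct
order by pct
/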
CLUSTERED

| Pct | N: sec | N: num_distinct | Y: sec | Y: num_distinct |
|-----|--------|-----------------|--------|-----------------|
|   1 |    3.7 |          52,298 |    2.3 |             461 |
|   2 |    5.3 |          48,809 |    4.4 |           1,046 |
|   4 |    8.3 |          50,118 |    7.4 |           2,249 |
|   6 |   10.9 |          49,885 |    9.7 |           3,234 |
|   8 |   13.8 |          50,056 |   12.9 |           4,348 |
|  10 |   17.4 |          50,053 |   16.2 |           5,337 |
|  20 |   27.2 |          49,996 |   26.6 |          11,035 |
|  40 |   47.2 |          50,001 |   45.9 |          20,845 |
|  60 |   68.8 |          50,001 |   70.0 |          30,473 |
|  80 |   93.8 |          50,001 |   90.6 |          40,814 |
| 100 |   97.5 |          50,000 |   97.1 |          50,000 |

NORMAL

| Pct | N: sec | N: num_distinct | Y: sec | Y: num_distinct |
|-----|--------|-----------------|--------|-----------------|
|   1 |    3.7 |          24,551 |    2.3 |          25,248 |
|   2 |    5.3 |          25,301 |    4.4 |          25,253 |
|   4 |    8.3 |          26,624 |    7.4 |          26,939 |
|   6 |   10.9 |          27,581 |    9.7 |          27,765 |
|   8 |   13.8 |          28,753 |   12.9 |          28,993 |
|  10 |   17.4 |          29,905 |   16.2 |          29,980 |
|  20 |   27.2 |          33,711 |   26.6 |          33,833 |
|  40 |   47.2 |          37,565 |   45.9 |          37,650 |
|  60 |   68.8 |          39,652 |   70.0 |          39,644 |
|  80 |   93.8 |          41,055 |   90.6 |          41,083 |
| 100 |   97.5 |          42,114 |   97.1 |          42,114 |

SCATTERED

| Pct | N: sec | N: num_distinct | Y: sec | Y: num_distinct |
|-----|--------|-----------------|--------|-----------------|
|   1 |    3.7 |          50,918 |    2.3 |          47,236 |
|   2 |    5.3 |          50,456 |    4.4 |          44,017 |
|   4 |    8.3 |          49,839 |    7.4 |          47,745 |
|   6 |   10.9 |          49,865 |    9.7 |          51,248 |
|   8 |   13.8 |          49,888 |   12.9 |          51,105 |
|  10 |   17.4 |          49,961 |   16.2 |          47,144 |
|  20 |   27.2 |          50,024 |   26.6 |          50,249 |
|  40 |   47.2 |          49,999 |   45.9 |          50,001 |
|  60 |   68.8 |          50,001 |   70.0 |          50,001 |
|  80 |   93.8 |          50,001 |   90.6 |          50,001 |
| 100 |   97.5 |          50,000 |   97.1 |          50,000 |

UNIFORM

| Pct | N: sec | N: num_distinct | Y: sec | Y: num_distinct |
|-----|--------|-----------------|--------|-----------------|
|   1 |    3.7 |          49,104 |    2.3 |          46,389 |
|   2 |    5.3 |          47,515 |    4.4 |          49,023 |
|   4 |    8.3 |          48,078 |    7.4 |          47,797 |
|   6 |   10.9 |          48,637 |    9.7 |          48,522 |
|   8 |   13.8 |          48,742 |   12.9 |          48,947 |
|  10 |   17.4 |          49,001 |   16.2 |          48,955 |
|  20 |   27.2 |          49,636 |   26.6 |          49,750 |
|  40 |   47.2 |          49,988 |   45.9 |          49,979 |
|  60 |   68.8 |          50,000 |   70.0 |          49,999 |
|  80 |   93.8 |          50,001 |   90.6 |          50,001 |
| 100 |   97.5 |          50,000 |   97.1 |          50,000 |
Accuracy of Estimation
For the Normal, Scattered and Uniform columns the estimates are pretty good -- at least comparable to those gathered through row sampling. The Scattered estimates fall off a little at the low end of percentages.
The estimates for the Clustered column are, not unexpectedly, rather poor, even in comparison to the row sampling.
Performance
Performance across the range of estimate percentages looks pretty close between block and row sampling, but if you discount the higher percentages, for which there is little practical difference, you can see that at 1% block sampling completes in 62% of the elapsed time of row sampling (2.3s against 3.7s), and at 2% in 83% of the elapsed time (4.4s against 5.3s).
Summary
- For clustered data, block sampling appears to be impractically inaccurate.
- The performance advantages of block sampling are mostly realised at very low sampling percentages -- this further rules out its use against clustered data, but as Jonathan Lewis has shown, very low sample percentages need not be detrimental to the collection of good statistics.