
Oracle Sponge -- Now Moved To Wordpress

Please use http://oraclesponge.wordpress.com

Friday, June 03, 2005

Statistics By Block Sampling

If you're just joining us, I'd suggest that you scroll down and read the original article then the follow-ups from the bottom upwards.


Update 03 Jun 2005


Another consideration has been raised: the uniformity of the number of rows per block.

One reason why the test data showed such good resilience in the estimation of statistics, for block sampling in particular, might be a lack of variation in the number of rows per block. It occurs to me that for a table with a fairly regular row length that is not subject to deletes, block sampling of 1% is going to sample around 1% of the rows, even if the sampled rows are clustered in a small number of sampled blocks.

So I ran a quick test to see what the distribution of rows-per-block is like in the table, and came up with the following query:

select rows_per_block,
count(*) blocks,
sum(rows_per_block) sum_rows
from
(
select dbms_rowid.rowid_block_number(rowid),
count(*) rows_per_block
from t1
group by dbms_rowid.rowid_block_number(rowid)
)
group by rows_per_block
order by 1 desc
/

ROWS_PER_BLOCK     BLOCKS   SUM_ROWS
-------------- ---------- ----------
           345          1        345
           337          2        674
           336          3       1008
           324         11       3564
           323         42      13566
           322         58      18676
           321          5       1605
           320          1        320
           319          1        319
           315          1        315
           314          1        314
           313          3        939
           312          3        936
           311          6       1866
           310        115      35650
           309        503     155427
           308        363     111804
           307         38      11666
           306          3        918
           305          1        305
           304          2        608
           302          3        906
           301          2        602
           300          5       1500
           299          3        897
           298         26       7748
           297        475     141075
           296       1260     372960
           295        378     111510
           294          6       1764
           213          1        213
               ---------- ----------
sum                  3322    1000000

Or to summarise:

select to_char(floor(rows_per_block/10)*10,'fm990')||'''s' rows_per_block,
count(*) blocks,
sum(rows_per_block) sum_rows
from
(
select dbms_rowid.rowid_block_number(rowid),
count(*) rows_per_block
from t1
group by dbms_rowid.rowid_block_number(rowid)
)
group by to_char(floor(rows_per_block/10)*10,'fm990')||'''s'
order by 1 desc

ROWS_P     BLOCKS   SUM_ROWS
------ ---------- ----------
340's           1        345
330's           5       1682
320's         117      37731
310's         130      40339
300's         920     283736
290's        2148     635954
210's           1        213
       ---------- ----------
sum          3322    1000000

Now bear in mind that it's implicit in this query that we're not including empty blocks in there. According to the table statistics based on a compute, the table has 0 empty blocks.
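For anyone wanting to verify the same thing on their own table, the figures from a computed set of statistics are visible in the data dictionary. Here's a quick sketch of the kind of query that shows them, using the standard USER_TABLES columns:

-- these columns are populated once the table has been analyzed
select num_rows,
blocks,
empty_blocks,
avg_row_len
from user_tables
where table_name = 'T1'
/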

So the worst-case scenarios for a 1% block sample are that it reads the blocks with either the fewest or the greatest number of rows per block. In this case that's not going to be a huge disparity: if the 33 blocks sampled were those with the fewest rows then Oracle would be sampling 9,647 rows, and if they were the blocks with the most rows then it would be sampling 10,759 rows. Bear in mind also that both the number of rows for the table and the average row length suffer similar errors, although they would be inversely related (overestimation of the number of rows would correlate with underestimation of the average row length).

It's not too bad though. This means that for this particular table the maximum overestimation of the row count based on a 1% block sample is 4.5%.
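For what it's worth, those worst-case figures don't have to be worked out by hand. Here's a sketch of a query that returns them directly, reusing the DBMS_ROWID grouping from above and assuming a sample size of 33 blocks (roughly 1% of the table's 3,322 blocks):

select sum(case when rn_low  <= 33 then rows_per_block end) rows_in_33_smallest,
       sum(case when rn_high <= 33 then rows_per_block end) rows_in_33_largest
from
(
select rows_per_block,
       row_number() over (order by rows_per_block asc)  rn_low,   -- 1 = least populated block
       row_number() over (order by rows_per_block desc) rn_high   -- 1 = most populated block
from
(
select dbms_rowid.rowid_block_number(rowid) block_no,
       count(*) rows_per_block
from t1
group by dbms_rowid.rowid_block_number(rowid)
)
)
/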

There's one other consideration, however. A 1% block sample does not guarantee that 1% of the blocks will be read. As the documentation states, "This percentage indicates the probability of each row, or each cluster of rows in the case of block sampling, being selected as part of the sample. It does not mean that the database will retrieve exactly sample_percent of the rows of table."

We can illustrate this easily by a query:

select avg(est_rows) avg_rows,
stddev(est_rows) sd_rows,
avg(blks) avg_blks,
stddev(blks) sd_blks,
min(blks) min_blks,
max(blks) max_blks
from
(
select count(*) est_rows,
count(distinct dbms_rowid.rowid_block_number(rowid)) blks
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
)
/

AVG_ROWS SD_ROWS AVG_BLKS SD_BLKS MIN_BLKS MAX_BLKS
---------- ---------- ---------- ---------- ---------- ----------
9578.125 1699.43488 31.875 5.6317552 20 43


So while the previous minimum and maximum don't really apply -- the estimate could actually be rather higher or lower depending on how many blocks were truly sampled -- I'm inclined to think that this still supports the original idea: that the relatively stable number of rows per block in this table makes it more amenable to block-based sampling than might be the case in the real world.

However, just as I stated before, neither JL's original article nor this blog is really telling you to use a particular method. Both merely expose the tools and the thought processes that you can follow in order to make your own judgement on the matter.

Update 02 Jun 2005

Prompted by Jan's comment, here are some thoughts on how we can use this information to make decisions about our own systems' statistics gathering requirements.

This article is not necessarily intended to give direct guidance on what sampling percentage and methodology you ought to be using on your own data, although hopefully it has already given you ideas. What you might take away from it is exactly what I learned from JL's article -- that we ought to be applying measurement techniques to the performance and accuracy of the various methods, and making our own decisions based on experimentation. Much of the value of this technique, or indeed of any that is based on measurement, comes from being able to document the rationale for your decisions, and this brings to mind a few scenarios:
  • You measure and document potential changes to the statistics gathering methodology prior to making any decision to change -- an effective proposal can then be put forward ... "Based on this analysis, I recommend the following changes to our statistics gathering technique because although our statistical accuracy will degrade by 3%, our statistics gathering workload will decrease by 66%".
  • Having documented the method and the results, you can then perform regular reviews (quarterly, for example) to check for any benefit in changing the methodology.
  • You can justify the decisions you made if they are challenged in the future, and show that you acted with due diligence and a rational procedure.
I'll also repeat what I mentioned in another comment -- that one of the features of JL's article that made it so valuable to me was that it laid bare a problem that I had but had not realised. I knew that I'd been wondering whether computing statistics was beneficial in comparison to estimation, but I had never taken the next step of measuring it.

Without wanting to sound like too much of a kiss-ass, thanks again Jonathan.


Original Article


Jonathan Lewis has posted an interesting article on the topic of how precise, and how often, should statistics be collected, in which he demonstrates the influence that different distributions of values have on the reliability of a range of sampling percentages in the collection of statistics.

That was a long sentence. Moving along ...

This prompted me to wonder what effect the option to sample "by block", rather than the default of "by row", would have on these numbers. Briefly, the DBMS_STATS package allows you to choose whether the sampling percentage is applied to the blocks of the segment or to its rows -- the implication being that sampling a given percentage of rows will necessarily access more blocks than sampling the same percentage of blocks.
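For reference, the switch between the two is just the BLOCK_SAMPLE parameter of DBMS_STATS.GATHER_TABLE_STATS. A minimal sketch of the two variants at a 1% sample, assuming a table T1 in the current schema, would be:

begin
-- row sampling (the default): each row has a 1% chance of selection
dbms_stats.gather_table_stats(
ownname => user,
tabname => 'T1',
estimate_percent => 1,
block_sample => FALSE);

-- block sampling: each block, and therefore every row in it, has a 1% chance
dbms_stats.gather_table_stats(
ownname => user,
tabname => 'T1',
estimate_percent => 1,
block_sample => TRUE);
end;
/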

It seems reasonable to assume that when you choose to sample by block you will get reduced accuracy on clustered values compared with other, more random distributions, but the two questions to be answered were:
  • How much less accurate might the block sampling prove to be?
  • How much more performant might the block sampling be?

On to the test.

I used the same method to construct a sample data set as in JL's paper, but on 10.1.0.2.0.
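I won't reproduce JL's script here, but for anyone reading along without the paper to hand, the table has a million rows and one column per distribution type. The following is only my own rough approximation of that shape, not the original code:

-- an approximation of the test data only, not JL's actual script
create table t1
as
select trunc((rownum - 1) / 20)           clustered,  -- equal values packed into adjacent blocks
       mod(rownum - 1, 50000)             scattered,  -- equal values spread across the whole table
       trunc(dbms_random.normal * 10000)  normal,     -- roughly bell-shaped distribution
       trunc(dbms_random.value(0, 50000)) uniform     -- evenly spread random values
from dual
connect by level <= 1000000
/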

I wrote a procedure to allow non-privileged users to flush the buffer cache, so that the wall-clock timings I used for performance comparison would not be influenced by block caching:

create procedure flush_buffer_cache
as
begin
execute immediate 'alter system flush buffer_cache';
end;
/
grant execute on flush_buffer_cache to dave
/


Then I created a table to store the percentages that were to be used in the statistics gathering, and a table to store the results of the tests:

set feedback off

drop table t;
create table t (pct number);
insert into t values (1);
insert into t values (2);
insert into t values (4);
insert into t values (6);
insert into t values (8);
insert into t values (10);
insert into t values (20);
insert into t values (40);
insert into t values (60);
insert into t values (80);
insert into t values (100);

create table result
(
pct number,
block_sampling char(1) check (block_sampling in ('Y','N')),
duration_sec number,
column_name varchar2(30),
num_distinct number(10,2)
)
/

The procedure used to gather the results was as follows:

declare
l_time number;
l_duration number;
begin
for list in (select to_char(pct) pct_txt,pct from t order by pct)
loop
sys.flush_buffer_cache;
l_time := dbms_utility.get_time;
dbms_stats.gather_table_stats(
ownname => user,
tabname => 'T1',
estimate_percent => list.pct_txt,
block_sample => FALSE,
method_opt => 'FOR ALL COLUMNS');
l_duration := (dbms_utility.get_time - l_time)/100;
insert into result
select list.pct,
'N',
l_duration,
column_name,
num_distinct
from user_tab_columns
where table_name = 'T1';

sys.flush_buffer_cache;
l_time := dbms_utility.get_time;
dbms_stats.gather_table_stats(
ownname => user,
tabname => 'T1',
estimate_percent => list.pct_txt,
block_sample => TRUE,
method_opt => 'FOR ALL COLUMNS');
l_duration := (dbms_utility.get_time - l_time)/100;
insert into result
select list.pct,
'Y',
l_duration,
column_name,
num_distinct
from user_tab_columns
where table_name = 'T1';

end loop;
end;
/


Incidentally, that code set a new record for me: the longest piece of code I've written that compiled and worked correctly on the first attempt. Deep joy.

So, the results. The Y/N indicates whether block sampling was used (Y) or not (N) and the percentage is indicated by the first column of numbers. The two columns under the Y and N indicate seconds duration and estimated number of distinct values.

CLUSTERED

  Pct    N (sec)   N (distinct)    Y (sec)   Y (distinct)
 ----    -------   ------------    -------   ------------
    1        3.7         52,298        2.3            461
    2        5.3         48,809        4.4          1,046
    4        8.3         50,118        7.4          2,249
    6       10.9         49,885        9.7          3,234
    8       13.8         50,056       12.9          4,348
   10       17.4         50,053       16.2          5,337
   20       27.2         49,996       26.6         11,035
   40       47.2         50,001       45.9         20,845
   60       68.8         50,001       70.0         30,473
   80       93.8         50,001       90.6         40,814
  100       97.5         50,000       97.1         50,000

NORMAL

  Pct    N (sec)   N (distinct)    Y (sec)   Y (distinct)
 ----    -------   ------------    -------   ------------
    1        3.7         24,551        2.3         25,248
    2        5.3         25,301        4.4         25,253
    4        8.3         26,624        7.4         26,939
    6       10.9         27,581        9.7         27,765
    8       13.8         28,753       12.9         28,993
   10       17.4         29,905       16.2         29,980
   20       27.2         33,711       26.6         33,833
   40       47.2         37,565       45.9         37,650
   60       68.8         39,652       70.0         39,644
   80       93.8         41,055       90.6         41,083
  100       97.5         42,114       97.1         42,114

SCATTERED

  Pct    N (sec)   N (distinct)    Y (sec)   Y (distinct)
 ----    -------   ------------    -------   ------------
    1        3.7         50,918        2.3         47,236
    2        5.3         50,456        4.4         44,017
    4        8.3         49,839        7.4         47,745
    6       10.9         49,865        9.7         51,248
    8       13.8         49,888       12.9         51,105
   10       17.4         49,961       16.2         47,144
   20       27.2         50,024       26.6         50,249
   40       47.2         49,999       45.9         50,001
   60       68.8         50,001       70.0         50,001
   80       93.8         50,001       90.6         50,001
  100       97.5         50,000       97.1         50,000

UNIFORM

  Pct    N (sec)   N (distinct)    Y (sec)   Y (distinct)
 ----    -------   ------------    -------   ------------
    1        3.7         49,104        2.3         46,389
    2        5.3         47,515        4.4         49,023
    4        8.3         48,078        7.4         47,797
    6       10.9         48,637        9.7         48,522
    8       13.8         48,742       12.9         48,947
   10       17.4         49,001       16.2         48,955
   20       27.2         49,636       26.6         49,750
   40       47.2         49,988       45.9         49,979
   60       68.8         50,000       70.0         49,999
   80       93.8         50,001       90.6         50,001
  100       97.5         50,000       97.1         50,000

Accuracy of Estimation

For the Normal, Scattered and Uniform columns the estimates are pretty good -- at least comparable to those gathered through row sampling. The Scattered estimates fall off a little at the low end of percentages.

The estimates for the Clustered column are, not unexpectedly, rather poor, even in comparison to the row sampling.
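Since every estimate is sitting in the RESULT table alongside the 100% figures, the accuracy comparison can itself be put into a query. Here's a sketch that expresses each estimate as a percentage of the computed (100%) value for the same column and sampling method:

select r.column_name,
       r.pct,
       r.block_sampling,
       r.num_distinct,
       round(100 * r.num_distinct / c.num_distinct, 1) pct_of_computed
from result r,
     result c
where c.column_name    = r.column_name
and   c.block_sampling = r.block_sampling
and   c.pct            = 100      -- the computed figures
order by r.column_name, r.block_sampling, r.pct
/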

Performance

Performance across the range of estimate percentages looks pretty close between block and row sampling, but if you discount the higher percentages, where there is little difference between the two methods, you can see that block sampling completes in 62% of the elapsed time of row sampling at 1%, and 83% at 2%.
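That ratio comes straight out of the RESULT table too; here's a sketch of the comparison (DURATION_SEC is repeated once per column for each gathering run, hence the MAX):

select pct,
       max(case when block_sampling = 'N' then duration_sec end) row_sec,
       max(case when block_sampling = 'Y' then duration_sec end) block_sec,
       round(100 * max(case when block_sampling = 'Y' then duration_sec end)
                 / max(case when block_sampling = 'N' then duration_sec end)) block_pct_of_row
from result
group by pct
order by pct
/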


Summary

  • For clustered data, block sampling appears to be impractically inaccurate.
  • The performance advantages of block sampling are mostly realised at very low sampling percentages -- this further rules out its use against clustered data, but as Jonathan Lewis has shown, very low sample percentages need not be detrimental to the collection of good statistics.

8 Comments:

At 7:45 PM, Blogger Kevin said...

Are you trying to sell us something? 'Cause you started out with a point, then there was page after page of fine print -- and you know that no one ever reads the fine print.

This would be a nomination for adding 'uses unreadable fonts' to your list of ground rules for spongeworthiness (or is it non-spongeworthiness -- this whole thing is very confusing)...

 
At 8:53 PM, Blogger David Aldridge said...

Damn, you're right!

That's what you get for trying to write code, and an article, look after three kids and make a shepherd's pie all at once.

Now I have to get jiggy with HTML also. My life sucks.

 
At 9:31 AM, Blogger jan van mourik said...

Thanks David, very interesting! Thanks for the link to JL's document too!
This is one of those things where you think, "now why didn't i try this myself???"...

jan

 
At 11:45 AM, Blogger David Aldridge said...

Jan,

Exactly! That's what I emailed to JL -- the question of how much of the data to analyze is one that I know has occurred to me before, but I've never gone ahead and tested it.

 
At 8:58 AM, Blogger jan van mourik said...

So now I'm pondering what to do with this info... Shall I change all my stats jobs to just estimate 1%, or 2% or 8%? Don't want to start changing things right now, cause I'll be in a plane on my way to Taiwan tomorrow. Enough time during the trip to think about it i guess :-) And after that some more experimenting...

 
At 9:22 AM, Blogger David Aldridge said...

I think you've hit on the key word there -- experimenting.

There's nothing much in these articles to say "Thou shalt adopt a sampling percentage of x" -- it's really a demonstration that comparing the performance and accuracy of different percentages and row/block methods is easily within our reach. It shows that the performance difference and the error in statistical estimation are both measurable, and we all like measurable stuff.

So if, in a few months time, someone said to you, "Hey, how come we're using statistics based on sampling 1% of blocks, crazy man?", you can say "The accuracy is within 8% of the computed statistics, yet it completes in only 1/60th of the time". Game, set and match to Jan, hero of the hour!

 
At 9:59 AM, Blogger jan van mourik said...

I like the way you think (especially the hero part :-)

 
At 10:35 AM, Blogger David Aldridge said...

Hey, I'm a consultant -- I'm in the business of giving people a "warm glow"!

 
