If you're just joining us, I'd suggest that you scroll down and read the original article then the follow-ups from the bottom upwards.
Update 03 Jun 2005

Another consideration has been raised, on the uniformity of the number of rows per block. One reason why the test data showed good resilience towards estimation of statistics, for block sampling in particular, might be a lack of variation in the number of rows per block. It occurs to me that for a table with a fairly regular row length that is not subject to deletes, a 1% block sample is going to sample around 1% of the rows, even if the sampled rows are clustered in a small number of sampled blocks.
So I ran a quick test to see what the distribution of rows-per-block is like in the table, and came up with the following query:
select rows_per_block,
count(*) blocks,
sum(rows_per_block) sum_rows
from
(
select dbms_rowid.rowid_block_number(rowid),
count(*) rows_per_block
from t1
group by dbms_rowid.rowid_block_number(rowid)
)
group by rows_per_block
order by 1 desc
/
ROWS_PER_BLOCK BLOCKS SUM_ROWS
-------------- ---------- ----------
345 1 345
337 2 674
336 3 1008
324 11 3564
323 42 13566
322 58 18676
321 5 1605
320 1 320
319 1 319
315 1 315
314 1 314
313 3 939
312 3 936
311 6 1866
310 115 35650
309 503 155427
308 363 111804
307 38 11666
306 3 918
305 1 305
304 2 608
302 3 906
301 2 602
300 5 1500
299 3 897
298 26 7748
297 475 141075
296 1260 372960
295 378 111510
294 6 1764
213 1 213
---------- ----------
sum 3322 1000000
Or to summarise:
select to_char(floor(rows_per_block/10)*10,'fm990')||'''s' rows_per_block,
count(*) blocks,
sum(rows_per_block) sum_rows
from
(
select dbms_rowid.rowid_block_number(rowid),
count(*) rows_per_block
from t1
group by dbms_rowid.rowid_block_number(rowid)
)
group by to_char(floor(rows_per_block/10)*10,'fm990')||'''s'
order by 1 desc
ROWS_P BLOCKS SUM_ROWS
------ ---------- ----------
340's 1 345
330's 5 1682
320's 117 37731
310's 130 40339
300's 920 283736
290's 2148 635954
210's 1 213
---------- ----------
sum 3322 1000000
Now bear in mind that it's implicit in this query that we're not including empty blocks in there. According to the table statistics based on a compute, the table has 0 empty blocks.
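For what it's worth, that can be cross-checked from the dictionary once the statistics are in place -- a trivial query, assuming statistics have been gathered on T1 as above:
select num_rows, blocks, empty_blocks
from user_tables
where table_name = 'T1'
/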
So the worst-case scenarios for a 1% block sample are that it reads only the blocks with either the smallest or the greatest number of rows per block. In this case that's not a huge disparity: if the 33 blocks sampled were those with the fewest rows then Oracle would be sampling 9,647 rows, and if they were the blocks with the most rows then it would be sampling 10,759 rows. Bear in mind also that both the estimated number of rows for the table and the estimated average row length suffer similar errors, although they would be inversely related (overestimation of the number of rows would correlate with underestimation of the average row length).
It's not too bad, though. This means that for this particular table the maximum overestimation of the row count based on a 1% block sample is around 4.5%.
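If you wanted to derive those worst-case figures directly, rather than reading them off the distribution by hand, an analytic query along these lines would do it. This is just a sketch, with the 33 hard-coded as roughly 1% of the table's 3,322 blocks:
select sum(case when rn_asc  <= 33 then rows_per_block end) fewest_rows_sampled,
       sum(case when rn_desc <= 33 then rows_per_block end) most_rows_sampled
from
(
select rows_per_block,
       row_number() over (order by rows_per_block asc)  rn_asc,
       row_number() over (order by rows_per_block desc) rn_desc
from
(
select count(*) rows_per_block
from t1
group by dbms_rowid.rowid_block_number(rowid)
)
)
/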
There's one other consideration, however. A 1% block sample does not guarantee that 1% of the blocks will be read. As the documentation states, "This percentage indicates the probability of each row, or each cluster of rows in the case of block sampling, being selected as part of the sample. It does not mean that the database will retrieve exactly sample_percent of the rows of table."
We can illustrate this easily by a query:
select avg(est_rows) avg_rows,
stddev(est_rows) sd_rows,
avg(blks) avg_blks,
stddev(blks) sd_blks,
min(blks) min_blks,
max(blks) max_blks
from
(
select count(*) est_rows,
count(distinct dbms_rowid.rowid_block_number(rowid)) blks
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
union all
select count(*) ,
count(distinct dbms_rowid.rowid_block_number(rowid))
from t1 sample block (1)
)
/
AVG_ROWS SD_ROWS AVG_BLKS SD_BLKS MIN_BLKS MAX_BLKS
---------- ---------- ---------- ---------- ---------- ----------
  9578.125 1699.43488     31.875  5.6317552         20         43

So while the previous minimum and maximum don't really apply -- the estimate could actually be rather higher or lower depending on how many blocks were truly sampled -- I'm inclined to think that this still supports the original idea: the relatively stable number of rows per block in this table makes it more amenable to block-based sampling than might be the case in the real world.
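As an aside, sixteen UNION ALL branches is about my copy-and-paste limit, so if you wanted a larger number of trials a small PL/SQL loop writing to a scratch table would be less typing. A sketch only -- the sample_results table is one you'd create yourself, and the dynamic SQL sidesteps any restriction on using the SAMPLE clause in static PL/SQL:
create table sample_results (est_rows number, blks number);

begin
  for i in 1 .. 100 loop
    execute immediate
      'insert into sample_results
       select count(*),
              count(distinct dbms_rowid.rowid_block_number(rowid))
       from t1 sample block (1)';
  end loop;
  commit;
end;
/

select avg(est_rows), stddev(est_rows),
       avg(blks), stddev(blks), min(blks), max(blks)
from sample_results
/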
However, just as I stated before, neither JL's original article nor this blog is really telling you to use a particular method. They merely expose the tools and the thought processes that you can follow in order to make your own judgement on the matter.
Update 02 Jun 2005

Prompted by Jan's comment, here are some thoughts on how we can use this information to make decisions about our own systems' statistics gathering requirements.

This article is not necessarily intended to give direct guidance on what sampling percentage and methodology you ought to be using on your own data, although hopefully it has already given you ideas. What you might take away from it is exactly what I learned from JL's article -- that we ought to be applying measurement techniques to the performance and accuracy of the various methods, and making our own decisions based on experimentation. Much of the value of this technique, or indeed of any technique based on measurement, comes from being able to document the rationale for your decisions, and that brings to mind a few scenarios:
- You measure and document potential changes to the statistics gathering methodology prior to making any decision to change -- an effective proposal can then be put forward ... "Based on this analysis, I recommend the following changes to our statistics gathering technique because although our statistical accuracy will degrade by 3%, our statistics gathering workload will decrease by 66%".
- Having documented the method and the results, you can then perform regular reviews (quarterly, for example) to check for any benefit in changing the methodology.
- You can justify the decisions you made if they are challenged in the future, and show that you acted with due diligence and a rational procedure.
I'll also repeat what I mentioned in another comment -- that one of the features of JL's article that made it so valuable to me was that it laid bare a problem I had but had not realised I had. I knew that I'd been wondering whether computing statistics was beneficial in comparison to estimating them, but I had never taken the next step of measuring it.
Without wanting to sound like too much of a kiss-ass, thanks again Jonathan.
Original Article

Jonathan Lewis has posted an interesting article on the topic of how precise, and how often, statistics should be collected, in which he demonstrates the influence that different distributions of values have on the reliability of a range of sampling percentages in the collection of statistics.
That was a long sentence. Moving along ...
This prompted me to wonder what effect the option to sample "by block", rather than the default of "by row", would have on these numbers. Briefly, the DBMS_STATS package allows you to choose whether the required sample percentage is applied to the blocks in the segment or to the rows -- the implication being that sampling by rows will necessarily access more blocks than sampling the same percentage of blocks.
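For reference, the switch in question is the block_sample parameter of dbms_stats.gather_table_stats. Something like this -- shown purely to illustrate the call -- gathers a 1% block sample rather than a 1% row sample:
begin
  dbms_stats.gather_table_stats(
    ownname          => user,
    tabname          => 'T1',
    estimate_percent => 1,
    block_sample     => TRUE);   -- FALSE, the default, samples by row
end;
/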
It seems reasonable to assume that when you choose to sample by block you will get lower accuracy on clustered values than on other, more random, distributions, but the two questions to be answered were:
How much less accurate might the block sampling prove to be?
How much more performant might the block sampling be?
On to the test.
I used the same method to construct a sample data set as in JL's paper, but on 10.1.0.2.0.
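JL's paper includes the actual table creation script. Purely to make this write-up more self-contained, a table of roughly the right shape -- 1,000,000 rows with around 50,000 distinct values per column, CLUSTERED packing equal values into adjacent blocks and SCATTERED cycling them throughout the table -- could be knocked up along these lines. This is an illustration of the idea, not his script:
create table t1
nologging
as
select trunc((rownum - 1) / 20)                 clustered,   -- equal values in adjacent rows
       mod(rownum - 1, 50000)                   scattered,   -- each value spread across the table
       trunc(dbms_random.value(0, 50000))       uniform,     -- evenly distributed
       trunc(dbms_random.normal * 5000) + 25000 normal       -- centre-weighted
from   all_objects a,
       all_objects b
where  rownum <= 1000000
/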
I wrote a procedure to allow non-privileged users to flush the buffer cache, so that the wall-clock timings I used for performance comparison would not be influenced by block caching:
create procedure flush_buffer_cache
as
begin
execute immediate 'alter system flush buffer_cache';
end;
/
grant execute on flush_buffer_cache to dave
/

Then I created a table to store the percentages that were to be used in the statistics gathering, and a table to store the results of the tests:
set feedback off
drop table t;
create table t (pct number);
insert into t values (1);
insert into t values (2);
insert into t values (4);
insert into t values (6);
insert into t values (8);
insert into t values (10);
insert into t values (20);
insert into t values (40);
insert into t values (60);
insert into t values (80);
insert into t values (100);
create table result
(
pct number,
block_sampling char(1) check (block_sampling in ('Y','N')),
duration_sec number,
column_name varchar2(30),
num_distinct number(10,2)
)
/
The anonymous PL/SQL block used to gather the results was as follows:
declare
l_time number;
l_duration number;
begin
for list in (select to_char(pct) pct_txt,pct from t order by pct)
loop
sys.flush_buffer_cache;
l_time := dbms_utility.get_time;
dbms_stats.gather_table_stats(
ownname => user,
tabname => 'T1',
estimate_percent => list.pct_txt,
block_sample => FALSE,
method_opt => 'FOR ALL COLUMNS');
l_duration := (dbms_utility.get_time - l_time)/100;
insert into result
select list.pct,
'N',
l_duration,
column_name,
num_distinct
from user_tab_columns
where table_name = 'T1';
sys.flush_buffer_cache;
l_time := dbms_utility.get_time;
dbms_stats.gather_table_stats(
ownname => user,
tabname => 'T1',
estimate_percent => list.pct_txt,
block_sample => TRUE,
method_opt => 'FOR ALL COLUMNS');
l_duration := (dbms_utility.get_time - l_time)/100;
insert into result
select list.pct,
'Y',
l_duration,
column_name,
num_distinct
from user_tab_columns
where table_name = 'T1';
end loop;
end;
/
Incidentally that code set a new record for me, for the longest code section to compile and work correctly first time. Deep joy.
So, the results. Each table below covers one of the test columns: Pct is the sample percentage, the columns headed N are for row sampling and those headed Y are for block sampling, and within each pair the figures are the elapsed time in seconds and the estimated number of distinct values.
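The figures can be pulled out of the RESULT table with a query along these lines, one column at a time -- just a sketch of the pivot, using the CLUSTERED column as an example:
select pct,
       max(decode(block_sampling, 'N', duration_sec)) n_sec,
       max(decode(block_sampling, 'N', num_distinct)) n_distinct,
       max(decode(block_sampling, 'Y', duration_sec)) y_sec,
       max(decode(block_sampling, 'Y', num_distinct)) y_distinct
from result
where column_name = 'CLUSTERED'
group by pct
order by pct
/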
CLUSTERED

| Pct | N: sec | N: num_distinct | Y: sec | Y: num_distinct |
|-----|--------|-----------------|--------|-----------------|
|   1 |    3.7 |          52,298 |    2.3 |             461 |
|   2 |    5.3 |          48,809 |    4.4 |           1,046 |
|   4 |    8.3 |          50,118 |    7.4 |           2,249 |
|   6 |   10.9 |          49,885 |    9.7 |           3,234 |
|   8 |   13.8 |          50,056 |   12.9 |           4,348 |
|  10 |   17.4 |          50,053 |   16.2 |           5,337 |
|  20 |   27.2 |          49,996 |   26.6 |          11,035 |
|  40 |   47.2 |          50,001 |   45.9 |          20,845 |
|  60 |   68.8 |          50,001 |   70.0 |          30,473 |
|  80 |   93.8 |          50,001 |   90.6 |          40,814 |
| 100 |   97.5 |          50,000 |   97.1 |          50,000 |

NORMAL

| Pct | N: sec | N: num_distinct | Y: sec | Y: num_distinct |
|-----|--------|-----------------|--------|-----------------|
|   1 |    3.7 |          24,551 |    2.3 |          25,248 |
|   2 |    5.3 |          25,301 |    4.4 |          25,253 |
|   4 |    8.3 |          26,624 |    7.4 |          26,939 |
|   6 |   10.9 |          27,581 |    9.7 |          27,765 |
|   8 |   13.8 |          28,753 |   12.9 |          28,993 |
|  10 |   17.4 |          29,905 |   16.2 |          29,980 |
|  20 |   27.2 |          33,711 |   26.6 |          33,833 |
|  40 |   47.2 |          37,565 |   45.9 |          37,650 |
|  60 |   68.8 |          39,652 |   70.0 |          39,644 |
|  80 |   93.8 |          41,055 |   90.6 |          41,083 |
| 100 |   97.5 |          42,114 |   97.1 |          42,114 |

SCATTERED

| Pct | N: sec | N: num_distinct | Y: sec | Y: num_distinct |
|-----|--------|-----------------|--------|-----------------|
|   1 |    3.7 |          50,918 |    2.3 |          47,236 |
|   2 |    5.3 |          50,456 |    4.4 |          44,017 |
|   4 |    8.3 |          49,839 |    7.4 |          47,745 |
|   6 |   10.9 |          49,865 |    9.7 |          51,248 |
|   8 |   13.8 |          49,888 |   12.9 |          51,105 |
|  10 |   17.4 |          49,961 |   16.2 |          47,144 |
|  20 |   27.2 |          50,024 |   26.6 |          50,249 |
|  40 |   47.2 |          49,999 |   45.9 |          50,001 |
|  60 |   68.8 |          50,001 |   70.0 |          50,001 |
|  80 |   93.8 |          50,001 |   90.6 |          50,001 |
| 100 |   97.5 |          50,000 |   97.1 |          50,000 |

UNIFORM

| Pct | N: sec | N: num_distinct | Y: sec | Y: num_distinct |
|-----|--------|-----------------|--------|-----------------|
|   1 |    3.7 |          49,104 |    2.3 |          46,389 |
|   2 |    5.3 |          47,515 |    4.4 |          49,023 |
|   4 |    8.3 |          48,078 |    7.4 |          47,797 |
|   6 |   10.9 |          48,637 |    9.7 |          48,522 |
|   8 |   13.8 |          48,742 |   12.9 |          48,947 |
|  10 |   17.4 |          49,001 |   16.2 |          48,955 |
|  20 |   27.2 |          49,636 |   26.6 |          49,750 |
|  40 |   47.2 |          49,988 |   45.9 |          49,979 |
|  60 |   68.8 |          50,000 |   70.0 |          49,999 |
|  80 |   93.8 |          50,001 |   90.6 |          50,001 |
| 100 |   97.5 |          50,000 |   97.1 |          50,000 |
Accuracy of Estimation
For the Normal, Scattered and Uniform columns the estimates are pretty good -- at least comparable to those gathered through row sampling. The Scattered estimates fall off a little at the low end of percentages.
The estimates for the Clustered column are, not unexpectedly, rather poor, even in comparison to the row sampling.
Performance
Performance across the range of estimate percentages looks pretty close between block and row sampling, but if you discount the higher percentages, for which there is little practical difference, you can see that at 1% block sampling completes in 62% of the elapsed time of row sampling (2.3s against 3.7s), and at 2% in 83% of the elapsed time (4.4s against 5.3s).
Summary
- For clustered data, block sampling appears to be impractically inaccurate.
- The performance advantages of block sampling are mostly realised at very low sampling percentages -- this further rules out its use against clustered data, but as Jonathan Lewis has shown, very low sample percentages need not be detrimental to the collection of good statistics.