
Oracle Sponge -- Now Moved To Wordpress

Please use http://oraclesponge.wordpress.com

Sunday, July 24, 2005

Higher Dimensional Values In Fact Tables

update: 24-Jul-2005

Jonathan Lewis writes to educate me on function-based indexes and statistics thereon ... it is indeed possible to gather histogram data on them through the mechanism of the DBMS_Stats method_opt => 'for all hidden columns' -- who'd a thunk it? It's documented here. I hope it doesn't sound like sour grapes if I wish that this were also documented here.

In this article I dismissed the use of function-based indexes (for the purposes outlined) based on the inability of Oracle to generate histograms on them -- whether this new-to-me information changes that judgement will take some further thought and analysis. I may move it to another blog entry once I've reassessed the issue, as an analysis of the costs and benefits of function-based indexes vs. "materializing" the function result in the table rather appeals to me.

Thanks again, Jonathan.


introduction


In getting performance out of an Oracle database, and in data warehousing (DWh) in particular, we live or die by making effective use of the cost-based optimizer (CBO). As I wrote before, "The CBO is based on statistics -- statistics about tables, statistics about indexes, statistics about the distribution of values within a column". So we owe it to ourselves to go the extra mile in our search for statistical accuracy, and sometimes we might have to bend a few rules to do so.

This thought was prompted by the rebuilding of a major fact table, primarily in order to change its partitioning scheme, but the opportunity for some performance tweaking is not to be sniffed at. Accordingly, I have a plan ... a cunning plan if I may say so. A plan that may make database purists squirt coffee from their noses, and I consider that to be fair warning so let's press on.

the scenario

The fact table in question has a number of dimensional values for which the parent values are deterministically derived from the lowest-level value itself. For example, DATE_OF_DAY has parent values of DATE_OF_MONTH, FISCAL_MONTH_CD, FISCAL_QUARTER_CD and FISCAL_YEAR_CD. Given a particular value of DATE_OF_DAY the calculation of the other four values is a simple matter, for example:

FISCAL_MONTH_CD =
To_Char(Add_Months(DATE_OF_DAY, 3), 'YYYY" M"MM')


So for 04-Dec-2004 the fiscal month is "2005 M03".

(Did you know that you can insert arbitrary strings of characters in a date-to-char conversion using double-quotes? A lot of people are surprised by that. Surprisingly.)
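
As a quick check, the formula can be run directly against DUAL:

-- returns '2005 M03', per the example above
select to_char(add_months(to_date('04-DEC-2004','DD-MON-YYYY'), 3),
'YYYY" M"MM') fiscal_month_cd
from dual
/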

These parent values are stored in a dimension table, possibly snowflaked where required by the presence of summary tables at the DATE_OF_MONTH level, for example.

Now here's the problem that I'm trying to solve ... when a report query is submitted to the database and includes a predicate such as FISCAL_MONTH_CD = '2005 M04', and the query is directed to the fact table at the DATE_OF_DAY level, Oracle can only make the most generic estimate of the cardinality of the result set -- histograms on FISCAL_MONTH_CD in the dimension table and on DATE_OF_DAY in the fact table are not effective in this situation.
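
The shape of query I mean is something like this -- a sketch only, with hypothetical table and column names:

-- a histogram on d.fiscal_month_cd describes the dimension table only;
-- nothing tells the optimizer how many fact rows fall within that month
select sum(f.txn_amount)
from fact_table f,
date_dim d
where f.date_of_day = d.date_of_day
and d.fiscal_month_cd = '2005 M04'
/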

a demonstration script

set feedback off heading off

drop table dave_fct;

create table dave_fct
pctfree 0 nologging
as
select * from dba_objects;

begin
   dbms_stats.gather_table_stats
   (
      OWNNAME    => user,
      TABNAME    => 'DAVE_FCT',
      METHOD_OPT => 'FOR COLUMNS OWNER SIZE 254'
   );
end;
/

drop table dave_dim
/
create table dave_dim -- lovely name!
(owner primary key,
owner_2char not null)
as
select distinct
owner,
substr(owner,1,2)
from dave_fct
/

begin
   dbms_stats.gather_table_stats
   (
      OWNNAME    => user,
      TABNAME    => 'DAVE_DIM',
      METHOD_OPT => 'FOR COLUMNS OWNER SIZE 254, OWNER_2CHAR SIZE 254'
   );
end;
/

set autotrace on

-- QUERY 1
select count(*)
from dave_fct
where owner = 'SYS'
/

-- QUERY 2
select count(*)
from dave_fct df,
dave_dim dd
where df.owner = dd.owner
and dd.owner = 'SYS'
/

-- QUERY 3
select count(*)
from dave_fct df,
dave_dim dd
where df.owner = dd.owner
and dd.owner_2char = 'SY'
/

results, and their implications

My results show that the statistics for queries 1 and 2 are pretty much spot on ... the actual number of rows and the estimated cardinality were within about 3% -- for query 3 the optimizer overestimated the cardinality by 324%. So what does this mean? Potentially it could prompt the optimizer to perform a full table scan instead of an index scan (standard disclaimer: not that I'm saying that full scans are necessarily a bad thing).
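
For the record, the estimates can also be inspected without executing the queries by using EXPLAIN PLAN and DBMS_XPLAN (available from 9.2):

explain plan for
select count(*)
from dave_fct df,
dave_dim dd
where df.owner = dd.owner
and dd.owner_2char = 'SY'
/

-- the Rows column of the plan holds the estimated cardinality
select * from table(dbms_xplan.display)
/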

some possible solutions

So how do you get the optimizer better statistics?

One method would be to use a query such as:
select count(*)
from dave_fct df,
dave_dim dd
where df.owner = dd.owner
and dd.owner like 'SY%'
/

... which gives a very acceptable estimation in this case.

But this is a special case with its own advantages and disadvantages:
  • The "like" predicate is compatible with the use of histograms
  • This is a difficult technique to use for an IN-list query (see the sketch below)
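
For example, the equivalent of "substr(dd.owner,1,2) in ('SY', 'XD')" -- hypothetical values, of course -- has to be unrolled into OR'd "like" predicates, which is workable for two values but ugly for twenty:

select count(*)
from dave_fct df,
dave_dim dd
where df.owner = dd.owner
and (dd.owner like 'SY%' or dd.owner like 'XD%')
/
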
So maybe ...
select count(*)
from dave_fct df,
dave_dim dd
where df.owner = dd.owner
and substr(dd.owner,1,2) = 'SY'
/

Nope ... the estimated cardinality is still over 300% too high.

How about creating a function-based index on the table?
create bitmap index dave_fct_ind on dave_fct (substr(owner,1,2))
compute statistics
/

Fast access on the query:
select count(owner)
from dave_fct
where substr(owner,1,2) = 'SY'
/

... but again no accurate statistics because there is no histogram on the index values, only on the table values. (Maybe an enhancement request ... hmmm ...) Note: see the update above identifying this as incorrect information!
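
Per that update, the histogram can in fact be gathered on the index's hidden column -- for example:

begin
   dbms_stats.gather_table_stats
   (
      OWNNAME    => user,
      TABNAME    => 'DAVE_FCT',
      METHOD_OPT => 'FOR ALL HIDDEN COLUMNS SIZE 254'
   );
end;
/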

a proposed solution

One method that I'm going to try is to push the higher level dimensional values down to the fact table. In our example, by placing the owner_2char attribute in the fact table and linking it to a snowflaked dimension table with owner_2char as the primary key we can use a histogram directly on that value.

So the demonstration script becomes:

drop table dave_fct;

create table dave_fct pctfree 0 compress nologging
as
select substr(do.owner,1,2) OWNER_2CHAR,
do.*
from dba_objects do;

begin
   dbms_stats.gather_table_stats
   (
      OWNNAME    => user,
      TABNAME    => 'DAVE_FCT',
      METHOD_OPT => 'FOR COLUMNS OWNER SIZE 254, OWNER_2CHAR SIZE 254'
   );
end;
/

drop table dave_dim
/

create table dave_dim
(owner primary key,
owner_2char not null)
as
select distinct
owner,
owner_2char
from dave_fct
/

create table dave_dim_p
(owner_2char primary key)
as
select distinct
owner_2char
from dave_fct
/

begin
   dbms_stats.gather_table_stats
   (
      OWNNAME    => user,
      TABNAME    => 'DAVE_DIM',
      METHOD_OPT => 'FOR COLUMNS OWNER SIZE 254, OWNER_2CHAR SIZE 254'
   );
end;
/

begin
   dbms_stats.gather_table_stats
   (
      OWNNAME    => user,
      TABNAME    => 'DAVE_DIM_P',
      METHOD_OPT => 'FOR COLUMNS OWNER_2CHAR SIZE 254'
   );
end;
/

set autotrace on

select count(*)
from dave_fct df,
dave_dim_p dd
where df.owner_2char = dd.owner_2char
and dd.owner_2char = 'SY'
/


results of the proposal

Now, maybe because there are fewer distinct values of owner_2char than owner (17 instead of 25), or maybe because of blind luck, the estimation is actually better with predicates based on the former than it is on the latter ... in my case an error of 0.54% instead of 3%.

Another advantage, of course, is the ability to skip the join to the dimension table altogether. In my example case I got a result in 0.12 seconds instead of 0.23 by eliminating the dimension table from the query -- it would bear testing on a larger example, of course, as wall clock times as low as that are not enormously significant.
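
The join-free version of the query is just:

select count(*)
from dave_fct
where owner_2char = 'SY'
/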

So, what about downsides to this approach? Well, as the higher values are deterministic upon the lower values there is little ETL overhead -- no lookups to be performed, for example -- so it doesn't seem to make the ETL more complex. It does increase the row length by some amount, although that could be offset with data segment compression.

a common case for improvement

Now I particularly like the idea of this approach for date dimensions. As the dates in a table progress forwards in time there is varying skew on the higher levels of the dimension. For example if you have loaded data from 2001 up to the middle of January 2005 then two queries that filter on "Year = 2004" and "Year = 2005" can expect to retrieve significantly different numbers of rows from the fact table, and without this technique (or without placing the predicate on the level represented in the fact table) the optimizer is going to come up with something very funky -- probably the same incorrect estimation for both of those predicates.
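
As a sketch of how this might look for the date dimension -- with hypothetical SALES_FCT and SALES_STAGE tables, and the fiscal year derived from DATE_OF_DAY in the same way as the fiscal month above:

-- push the deterministic fiscal year down into the fact table at load time ...
create table sales_fct pctfree 0 compress nologging
as
select to_char(add_months(s.date_of_day, 3), 'YYYY') FISCAL_YEAR_CD,
s.*
from sales_stage s;

-- ... and gather a histogram on it so that the optimizer sees the skew
begin
   dbms_stats.gather_table_stats
   (
      OWNNAME    => user,
      TABNAME    => 'SALES_FCT',
      METHOD_OPT => 'FOR COLUMNS FISCAL_YEAR_CD SIZE 254'
   );
end;
/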

some final words

I am convinced enough by this method to give it a go. The implementation seems simple, the downside seems to be manageable, and the benefits seem to be sufficient.

I have about three dimensions that I'm going to try it on. Date is the obvious one. Another is related to the identifier for the fact table's source system and type of record, which is enormously skewed at higher levels. In fact the records are broken down into regular transactions and journal vouchers, in a ratio of something like 500:1, so a report on journal vouchers is looking for an index-based access method but a report on transactions may do better with table-based access.

Now, if you can see any reason why this is A Really Bad Idea, then please let me know by the end of the week when I potentially start implementing this plan. Words of reassurance also acceptable ;)

7 Comments:

At 1:58 PM, Blogger Pete Scott said...

I do hope it is not really a bad idea - we do this for one of our customers where they needed to do "market share" type queries under Business Objects. In this case we used the product dimension and included the product category in the fact table.

The thing to watch out for is the need to rebuild the table if the hierarchy changes - time is probably OK (unless daylight saving changes ;-) ), but a product can be re-classified and may require a table rebuild or update.

 
At 11:08 AM, Blogger David Aldridge said...

Yes, slowly changing dimensions would be a contra-indication for this technique for me, unless the benefit was very clear -- have you got involved in bitmap-join indexes at all Pete? They sound like they'd be more suitable for that case.

 
At 12:21 PM, Blogger Pete Scott said...

Not tried bitmap joins for a while - I found them a bit slow to build - but I think I'll give them a go again now I have some decent hardware

 
At 7:26 AM, Blogger Connor McDonald said...

Just some addenda on the collection of stats on FBI's.

From 9.2 and above, you can do all sorts of nice things with dbms_stats that are not apparent from the doco.

eg's (syntax might be wrong, but you'll get the idea)

a) for column SYS_C123$ size 10

to collect a histogram on a particular column in an FBI (as seen from xxx_tab_cols)

b) for clause 1, for clause 2, for clause 3

to collect multiple histograms etc etc all in the same call

hth
Connor

 
At 8:34 AM, Blogger David Aldridge said...

Thanks Connor,

Didn't know about that ... seems that user_tab_cols shows the hidden columns and user_tab_columns doesn't ... hmmm.

A quick test in 10.1 shows that you can rename those system-generated column names also ...

SQL> drop table t;

Table dropped.

SQL> create table t as select * from dba_objects;

Table created.

SQL> create index tt on t (substr(owner,1,1));

Index created.

SQL> select column_name from user_tab_cols
2 where table_name = 'T';

COLUMN_NAME
------------------------------
SYS_NC00014$
SECONDARY
GENERATED
TEMPORARY
STATUS
TIMESTAMP
LAST_DDL_TIME
CREATED
OBJECT_TYPE
DATA_OBJECT_ID
OBJECT_ID

COLUMN_NAME
------------------------------
SUBOBJECT_NAME
OBJECT_NAME
OWNER

14 rows selected.

SQL> alter table T rename column SYS_NC00014$ to SUBSTR_OWNER_1_1;

Table altered.

SQL> select column_name from user_tab_cols
2 where table_name = 'T';

COLUMN_NAME
------------------------------
SUBSTR_OWNER_1_1
SECONDARY
GENERATED
TEMPORARY
STATUS
TIMESTAMP
LAST_DDL_TIME
CREATED
OBJECT_TYPE
DATA_OBJECT_ID
OBJECT_ID

COLUMN_NAME
------------------------------
SUBOBJECT_NAME
OBJECT_NAME
OWNER

14 rows selected.


So, here's a nice interview question ... "What is the meaning of the following query:"

select table_name, column_name
from user_tab_cols
minus
select table_name, column_name
from user_tab_columns;

:(

 
At 6:45 AM, Blogger Connor McDonald said...

It would be nice to rename hidden columns to something useful, but we tried this a while back....

SQL> create table T ( x number);

Table created.

SQL> create index TX on T ( x+123);

Index created.

SQL> create index TX1 on T ( x+100+23);

Index created.

and now we export...

Connected to: Oracle Database 10g Enterprise Edition Release 10.1.0.4.0 - Production
With the Partitioning, OLAP and Data Mining options
Export done in WE8MSWIN1252 character set and AL16UTF16 NCHAR character set

About to export specified users ...
. exporting pre-schema procedural objects and actions
. exporting foreign function library names for user MCDONAC
. exporting PUBLIC type synonyms
. exporting private type synonyms
. exporting object type definitions for user MCDONAC
About to export MCDONAC's objects ...
. exporting database links
. exporting sequence numbers
. exporting cluster definitions
. about to export MCDONAC's tables via Conventional Path ...
. . exporting table T 0 rows exported
. exporting synonyms
. exporting views
. exporting stored procedures
. exporting operators
. exporting referential integrity constraints
. exporting triggers
. exporting indextypes
. exporting bitmap, functional and extensible indexes
. exporting posttables actions
. exporting materialized views
. exporting snapshot logs
. exporting job queues
. exporting refresh groups and children
. exporting dimensions
. exporting post-schema procedural objects and actions
. exporting statistics
Export terminated successfully without warnings.

So far so good... but

SQL> alter table T rename column SYS_NC00002$ to fbi1;

Table altered.

SQL> alter table T rename column SYS_NC00003$ to fbi2;

Table altered.

and now we export again

About to export specified users ...
. exporting pre-schema procedural objects and actions
. exporting foreign function library names for user MCDONAC
. exporting PUBLIC type synonyms
. exporting private type synonyms
. exporting object type definitions for user MCDONAC
About to export MCDONAC's objects ...
. exporting database links
. exporting sequence numbers
. exporting cluster definitions
. about to export MCDONAC's tables via Conventional Path ...
. . exporting table T
EXP-00056: ORACLE error 3113 encountered
ORA-03113: end-of-file on communication channel
EXP-00056: ORACLE error 24324 encountered
ORA-24324: service handle not initialized
EXP-00056: ORACLE error 24324 encountered
ORA-24324: service handle not initialized
EXP-00056: ORACLE error 24324 encountered
ORA-24324: service handle not initialized
EXP-00056: ORACLE error 24324 encountered
ORA-24324: service handle not initialized
EXP-00000: Export terminated unsuccessfully

 
At 7:48 AM, Blogger David Aldridge said...

Oh ho, so not an ideal situation at all then.

For what it's worth, Data Pump doesn't like it either ...

Export: Release 10.1.0.2.0 - Production on Wednesday, 27 July, 2005 7:44

Copyright (c) 2003, Oracle. All rights reserved.

Connected to: Oracle Database 10g Enterprise Edition Release 10.1.0.2.0 - Production
With the Partitioning, OLAP and Data Mining options
Starting "DAVE"."SYS_EXPORT_TABLE_01": dave/******** tables=t DUMPFILE=table2.d
mp
Estimate in progress using BLOCKS method...
Processing object type TABLE_EXPORT/TABLE/TBL_TABLE_DATA/TABLE/TABLE_DATA
ORA-39014: One or more workers have prematurely exited.
ORA-39029: worker 1 with process name "DW01" prematurely terminated
ORA-31672: Worker process DW01 died unexpectedly.

Job "DAVE"."SYS_EXPORT_TABLE_01" stopped due to fatal error at 07:47

 
