update: 24-Jul-2005
Jonathan Lewis writes to educate me on function-based indexes and statistics thereon ... it is indeed possible to gather histogram data on them through the mechanism of the DBMS_Stats method_opt => 'for all hidden columns' -- who'd a thunk it? It's documented here. I hope it doesn't sound like sour grapes if I wish that this were also documented here. In this article I dismissed the use of function-based indexes (for the purposes outlined) based on the inability of Oracle to generate histograms on them -- whether this new-to-me information changes that judgement will take some further thought and analysis. I may move it to another blog entry once I've reassessed the issue, as an analysis of the costs and benefits of function-based indexes vs. "materializing" the function result in the table rather appeals to me. Thanks again, Jonathan.
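Out of interest, a sketch of what that gather might look like against the demonstration table below, once a function-based index exists on it -- I haven't re-tested yet, so treat the SIZE clause as an assumption rather than a measured recommendation:
begin
dbms_stats.gather_table_stats
(
OWNNAME => user,
TABNAME => 'DAVE_FCT',
METHOD_OPT => 'FOR ALL HIDDEN COLUMNS SIZE 254'
);
end;
/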
introduction
In getting performance out of an Oracle database, and in data warehousing (DWh) in particular, we live or die by making effective use of the cost-based optimizer (CBO). As I wrote before, "The CBO is based on statistics -- statistics about tables, statistics about indexes, statistics about the distribution of values within a column". So we owe it to ourselves to go the extra mile in our search for statistical accuracy, and sometimes we might have to bend a few rules to do so.
This thought was prompted by the rebuilding of a major fact table, primarily in order to change its partitioning scheme, but the opportunity for some performance tweaking is not to be sniffed at. Accordingly, I have a plan ... a cunning plan, if I may say so. A plan that may make database purists squirt coffee from their noses, and I consider that to be fair warning, so let's press on.
the scenario
The fact table in question has a number of dimensional values for which the parent values are deterministically based on the lowest value itself. For example, DATE_OF_DAY has parent values of DATE_OF_MONTH, FISCAL_MONTH_CD, FISCAL_QUARTER_CD, FISCAL_YEAR_CD. Given a particular value of DATE_OF_DAY the calculation of the other four values is a simple matter, for example:
FISCAL_MONTH_CD = To_Char(Add_Months(DATE_OF_DAY, 3), 'YYYY" M"MM')
So for 04-Dec-2004 the fiscal month is "2005 M03".
(Did you know that you can insert arbitrary strings of characters in a date-to-char conversion using double-quotes? A lot of people are surprised by that. Surprisingly.)
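To make the determinism concrete, here's a sketch deriving a full set of parent values from DATE_OF_DAY. The FISCAL_MONTH_CD expression is the real one from above; the DATE_OF_MONTH, FISCAL_QUARTER_CD and FISCAL_YEAR_CD expressions are illustrative sketches rather than the production definitions:
select date_of_day,
trunc(date_of_day, 'MM') date_of_month,
to_char(add_months(date_of_day, 3), 'YYYY" M"MM') fiscal_month_cd,
to_char(add_months(date_of_day, 3), 'YYYY" Q"Q') fiscal_quarter_cd,
to_char(add_months(date_of_day, 3), 'YYYY') fiscal_year_cd
from (select to_date('04-Dec-2004','DD-Mon-YYYY') date_of_day from dual)
/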
These parent values are stored in a dimension table, possibly snowflaked where required by the presence of summary tables at the DATE_OF_MONTH level, for example.
Now here's the problem that I'm trying to solve ... when a report query is submitted to the database and includes a predicate such as FISCAL_MONTH_CD = '2005 M04', and the query is directed to the fact table at the DATE_OF_DAY level, Oracle can only make the most generic estimate of the cardinality of the result set -- histograms on FISCAL_MONTH_CD in the dimension table and on DATE_OF_DAY in the fact table are not effective in this situation.
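In other words, a query shaped something like this (table and measure names are hypothetical):
select sum(f.some_measure)
from fact_table f,
date_dim d
where f.date_of_day = d.date_of_day
and d.fiscal_month_cd = '2005 M04'
/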
a demonstration script
set feedback off heading off
drop table dave_fct;
create table dave_fct
pctfree 0 nologging
as
select * from dba_objects;
begin
dbms_stats.gather_table_stats
(
OWNNAME => user,
TABNAME => 'DAVE_FCT',
METHOD_OPT => 'FOR COLUMNS OWNER SIZE 254' -- histogram on OWNER only; 254 buckets is the maximum
);
end;
/
drop table dave_dim
/
create table dave_dim -- lovely name!
(owner primary key,
owner_2char not null)
as
select distinct
owner,
substr(owner,1,2)
from dave_fct
/
begin
dbms_stats.gather_table_stats
(
OWNNAME => user,
TABNAME => 'DAVE_DIM',
METHOD_OPT => 'FOR COLUMNS OWNER SIZE 254, OWNER_2CHAR SIZE 254'
);
end;
/
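As a quick sanity check that the histograms actually landed, USER_TAB_HISTOGRAMS shows the endpoints gathered per column:
select column_name,
count(*) endpoints
from user_tab_histograms
where table_name = 'DAVE_DIM'
group by column_name
/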
set autotrace on
-- QUERY 1
select count(*)
from dave_fct
where owner = 'SYS'
/
-- QUERY 2
select count(*)
from dave_fct df,
dave_dim dd
where df.owner = dd.owner
and dd.owner = 'SYS'
/
-- QUERY 3
select count(*)
from dave_fct df,
dave_dim dd
where df.owner = dd.owner
and dd.owner_2char = 'SY'
/
results, and their implications
My results show that the statistics for queries 1 and 2 are pretty much spot on ... the actual number of rows and the estimated cardinality were within about 3% -- for query 3 the optimizer overestimated the cardinality by 324%. So what does this mean? Potentially it could prompt the optimizer to perform a full table scan instead of an index scan (standard disclaimer: not that I'm saying that full scans are necessarily a bad thing).
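Incidentally, if you want to see the optimizer's estimate without executing the query, EXPLAIN PLAN plus DBMS_XPLAN (available from 9.2 onwards) does the job -- compare the Rows column on the fact table access with the actual count:
explain plan for
select count(*)
from dave_fct df,
dave_dim dd
where df.owner = dd.owner
and dd.owner_2char = 'SY'
/
select * from table(dbms_xplan.display)
/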
some possible solutions
So how do you get the optimizer better statistics?
One method would be to use a query such as:
select count(*)
from dave_fct df,
dave_dim dd
where df.owner = dd.owner
and dd.owner like 'SY%'
/
... which gives a very acceptable estimation in this case.
But this is a special case with its own advantages and disadvantages:
- The "like" predicate is compatible with the use of histograms
- This is a difficult technique to use for an In List query (see the sketch below)
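To illustrate that second point: the equality predicate extends naturally to an In List, but the LIKE rewrite degenerates into a string of OR'd clauses -- something like this sketch (the second prefix being an arbitrary example):
select count(*)
from dave_fct df,
dave_dim dd
where df.owner = dd.owner
and (dd.owner like 'SY%' or dd.owner like 'PU%')
/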
So maybe ...
select count(*)
from dave_fct df,
dave_dim dd
where df.owner = dd.owner
and substr(dd.owner,1,2) = 'SY'
/
Nope ... the estimated cardinality is still over 300% too high.
How about creating a function-based index on the table?
create bitmap index dave_fct_ind on dave_fct (substr(owner,1,2))
compute statistics
/
Fast access on the query:
select count(owner)
from dave_fct
where substr(owner,1,2) = 'SY'
/
... but again no accurate statistics, because there is no histogram on the index values, only on the table values. (Maybe an enhancement request ... hmmm ...)
note: see the update above identifying this as incorrect information!
a proposed solution
One method that I'm going to try is to push the higher level dimensional values down to the fact table. In our example, by placing the owner_2char attribute in the fact table and linking it to a snowflaked dimension table with owner_2char as the primary key we can use a histogram directly on that value.
So the demonstration script becomes:
drop table dave_fct;
create table dave_fct pctfree 0 compress nologging
as
select substr(do.owner,1,2) OWNER_2CHAR,
do.* from dba_objects do;
begin
dbms_stats.gather_table_stats
(
OWNNAME => user,
TABNAME => 'DAVE_FCT',
METHOD_OPT => 'FOR COLUMNS OWNER SIZE 254, OWNER_2CHAR SIZE 254'
);
end;
/
drop table dave_dim
/
create table dave_dim
(owner primary key,
owner_2char not null)
as
select distinct
owner,
owner_2char
from dave_fct
/
create table dave_dim_p
(owner_2char primary key)
as
select distinct
owner_2char
from dave_fct
/
begin
dbms_stats.gather_table_stats
(
OWNNAME => user,
TABNAME => 'DAVE_DIM',
METHOD_OPT => 'FOR COLUMNS OWNER SIZE 254, OWNER_2CHAR SIZE 254'
);
end;
/
begin
dbms_stats.gather_table_stats
(
OWNNAME => user,
TABNAME => 'DAVE_DIM_P',
METHOD_OPT => 'FOR COLUMNS OWNER_2CHAR SIZE 254'
);
end;
/
set autotrace on
select count(*) from dave_fct df, dave_dim_p dd
where df.owner_2char = dd.owner_2char
and dd.owner_2char = 'SY'
/
results of the proposal
Now, maybe because there are fewer distinct values of owner_2char than owner (17 instead of 25), or maybe because of blind luck, the estimation is actually better with predicates based on the former than it is on the latter ... in my case an error of 0.54% instead of 3%.
Another advantage is the ability to omit the join to the dimension table, of course. In my example case I got a result in 0.12 seconds instead of 0.23 by eliminating the dimension table from the query altogether -- it would bear testing on a larger example, of course, as wall clock times as low as that are not enormously significant.
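For the record, the join-free version of the test query is simply:
select count(*)
from dave_fct df
where df.owner_2char = 'SY'
/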
So, what are the downsides to this approach? Well, as the higher values are deterministic upon the lower values there is little ETL overhead -- no lookups to be performed, for example -- so it doesn't seem to make ETL more complex. It does increase the row length by some amount, although that could be offset with data segment compression.
a common case for improvement
Now I particularly like the idea of this approach for date dimensions. As the dates in a table progress forwards in time there is varying skew on the higher levels of the dimension. For example if you have loaded data from 2001 up to the middle of January 2005 then two queries that filter on "Year = 2004" and "Year = 2005" can expect to retrieve significantly different numbers of rows from the fact table, and without this technique (or without placing the predicate on the level represented in the fact table) the optimizer is going to come up with something very funky -- probably the same incorrect estimation for both of those predicates.
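With the year pushed down into the fact table (hypothetical names again), the predicate lands directly on a histogrammed fact column and the skew between the two years becomes visible to the optimizer:
select count(*)
from fact_table f
where f.fiscal_year_cd = '2005'
/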
some final words
I am convinced enough by this method to give it a go. The implementation seems simple, the downside seems to be manageable, and the benefits seem to be sufficient.
I have about three dimensions that I'm going to try it on. Date is the obvious one. Another is related to the identifier for the fact table's source system and type of record, which is enormously skewed at higher levels. In fact the records are broken down into regular transactions and journal vouchers, in a ratio of something like 500:1, so a report on journal vouchers is looking for an index-based access method but a report on transactions may do better with table-based access.
Now, if you can see any reason why this is A Really Bad Idea, then please let me know by the end of the week when I potentially start implementing this plan. Words of reassurance also acceptable ;)