The Mysterious Benefit Of Completely Redundant Materialized Views
A Scenario
You've just started working on a new data warehouse project, and you're familiarizing yourself with the database schema. You are looking at a simple table that stores transactional information and which is defined as follows:
create table master
(
transaction_cd number not null,
location_cd number not null,
tx_timestamp date not null,
product_cd number not null,
tx_qty number not null,
tx_cst number not null
)
/
alter table master
add constraint xpkmaster primary key (transaction_cd)
/
Plus various foreign keys, indexes and other stuff of course.
Built upon this table are a number of materialized views, one of which has the following definition:
create materialized view
master_mv1
...
Enable Query Rewrite
As
Select
transaction_cd,
location_cd,
product_cd,
tx_timestamp,
sum(tx_qty) s_tx_qty,
sum(tx_cst) s_tx_cst,
count(*) c_star
from
master
group by
transaction_cd,
location_cd,
product_cd,
tx_timestamp
/
If you've spotted that all of the attribute columns of the master table are included in the materialized view, and one of those attributes is defined as the primary key of the master table, then you will correctly conclude the following: the materialized view is completely redundant. It stores exactly the same data as the master table.
So, simple question. Why go to the trouble of creating such an abominable waste of resources?
Answer below ...
Answer 1
It's because of a combination of problems.
Firstly, there are front-end tools that will not optimize their SQL to take account of situations where an aggregation of a metric is not required. A metric called "Transaction Qty" will commonly be defined as SUM(MASTER.TX_QTY) and will always be included in queries in the following way:
select
transaction_cd,
sum(tx_qty)
from
master
where
tx_timestamp between to_date('01-jan-2005')
and to_date('02-jan-2005') and
product_cd = 3
group by
transaction_cd
/
The front end tool will not recognise that when the primary key of the MASTER fact table is also selected then it is sufficient (and more efficient) to:
select
transaction_cd,
tx_qty
from
master
where
tx_timestamp between to_date('01-jan-2005')
and to_date('02-jan-2005') and
product_cd = 3
/
Now theoretically there is an opportunity for the Oracle optimizer to step in here and rescue us, but unfortunately it does not.
Here's a full demonstration script:
drop table master;
drop materialized view master_mv1;
create table master
(
transaction_cd not null,
location_cd not null,
tx_timestamp not null,
product_cd not null,
tx_qty not null,
tx_cst not null
)
pctfree 0
nologging
as
select
rownum,
trunc(dbms_random.value(1,10)),
to_date('01-jan-2005','DD-Mon-YYYY')
+ dbms_random.value(0,30),
trunc(dbms_random.value(1,10000)),
trunc(dbms_random.value(1,10)),
trunc(dbms_random.value(1,100),2)
from
dual
connect by
1=1 and
level < ownname =""> user,
tabname => 'master',
granularity => 'ALL');
end;
/
create materialized view master_mv1
nologging
Using No Index
Never Refresh
Enable Query Rewrite
As
Select
transaction_cd,
location_cd,
product_cd,
tx_timestamp,
sum(tx_qty) s_tx_qty,
sum(tx_cst) s_tx_cst
from
master
group by
transaction_cd,
location_cd,
product_cd,
tx_timestamp
/
begin
dbms_stats.gather_table_stats(
ownname => user,
tabname => 'master_mv1',
granularity => 'ALL');
end;
/
We can now compare the original and the rewritten queries and their execution plans:
explain plan for
select /*+ norewrite */
transaction_cd,
sum(tx_qty)
from master
where product_cd = 3
group by transaction_cd;
select * from table(dbms_xplan.display());
-----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 102 | 1224 | 792 (3)| 00:00:10 |
| 1 | SORT GROUP BY | | 102 | 1224 | 792 (3)| 00:00:10 |
|* 2 | TABLE ACCESS FULL| MASTER | 102 | 1224 | 791 (3)| 00:00:10 |
-----------------------------------------------------------------------------
explain plan for
select transaction_cd,
sum(tx_qty)
from master
where product_cd = 3
group by transaction_cd;
select * from table(dbms_xplan.display());
-------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 101 | 1212 | 874 (3)| 00:00:11 |
|* 1 | MAT_VIEW REWRITE ACCESS FULL| MASTER_MV1 | 101 | 1212 | 874 (3)| 00:00:11 |
-------------------------------------------------------------------------------------------
explain plan for
select transaction_cd,
sum(tx_qty)
from master
where location_cd = 3
group by transaction_cd;
select * from table(dbms_xplan.display());
-------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 111K| 1193K| 873 (3)| 00:00:11 |
|* 1 | MAT_VIEW REWRITE ACCESS FULL| MASTER_MV1 | 111K| 1193K| 873 (3)| 00:00:11 |
-------------------------------------------------------------------------------------------
Notice also that if we place a tempting index on the transaction_cd column of master_mv1 then we can get further benefits:
alter table master_mv1 add constraint xpk_master_mv1 primary key (transaction_cd)
/
explain plan for
select /*+ norewrite */
transaction_cd,
sum(tx_qty)
from master
where transaction_cd in (1,2,123456)
group by transaction_cd;
select * from table(dbms_xplan.display());
--------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 3 | 24 | 4 (0)| 00:00:01 |
| 1 | SORT GROUP BY NOSORT | | 3 | 24 | 4 (0)| 00:00:01 |
| 2 | INLIST ITERATOR | | | | | |
| 3 | TABLE ACCESS BY INDEX ROWID| MASTER | 3 | 24 | 4 (0)| 00:00:01 |
|* 4 | INDEX RANGE SCAN | XPK_MASTER | 3 | | 3 (0)| 00:00:01 |
--------------------------------------------------------------------------------------------
explain plan for
select transaction_cd,
sum(tx_qty)
from master
where transaction_cd in (1,2,123456)
group by transaction_cd;
select * from table(dbms_xplan.display());
---------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 3 | 24 | 4 (0)| 00:00:01 |
| 1 | INLIST ITERATOR | | | | | |
| 2 | MAT_VIEW REWRITE ACCESS BY INDEX ROWID| MASTER_MV1 | 3 | 24 | 4 (0)| 00:00:01 |
|* 3 | INDEX RANGE SCAN | XPK_MASTER_MV1 | 3 | | 3 (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------------
Again, the sort group by phase is eliminated.
Answer 2:
Different groups of users may have different reporting priorities. Store managers will be applying filters such as "location_cd = 5", while product managers will be applying "product_cd = 123". Both of them would benefit from the physical clustering of similar values for the columns they usually apply filters to, through partitioning (and hence partition pruning) or through an insert of rows sorted by their favourite column, or maybe though clustering.
My bet would be on the master and the MV tables having different partitioning keys.
Now that's not the only way to "skin this cat" -- of course you might go down the path of composite partitioning. However that will give you a limited number of columns on which partition pruning is possible. For example with a multicolumn-range/list composite partitioning scheme on (A,B,C)/(D) you can get partition pruning for queries on the following combinations of predicated columns:
A
AB
ABC
ABCD
AD
ABD
The following combinations do not benefit from partition pruning:
B
C
BD
CD
BCD
It's actually possible though to benefit from partition pruning in those latter cases if a logically redundant predicate is placed on other columns ... for example if A had only distinct values of 'X','Y', and 'Z' then by including a predicate of:
A In ('X','Y','Z')
...then it is possible to extend the range of partition pruning ... erm ... well maybe I'll write another blog entry on that, but as you can see it's a little bit tricky and still requires some upfront planning.
So does the query rewrite costing take account of differences in partitioning pruning between a master table and it's materialized views?
drop table master;
drop materialized view master_mv1;
create table master
(
transaction_cd not null,
location_cd not null,
tx_timestamp not null,
product_cd not null,
tx_qty not null,
tx_cst not null
)
partition by list (location_cd)
(
partition p01 values (1),
partition p02 values (2),
partition p03 values (3),
partition p04 values (4),
partition p05 values (5),
partition p06 values (6),
partition p07 values (7),
partition p08 values (8),
partition p09 values (9),
partition p10 values (10)
)
pctfree 0
nologging
as
select
rownum,
trunc(dbms_random.value(1,10)),
to_date('01-jan-2005','DD-Mon-YYYY')
+ dbms_random.value(0,30),
trunc(dbms_random.value(1,10000)),
trunc(dbms_random.value(1,10)),
trunc(dbms_random.value(1,100),2)
from
dual
connect by
1=1 and
level < ownname =""> user,
tabname => 'master',
granularity => 'ALL');
end;
/
create materialized view master_mv1
partition by hash (product_cd)
partitions 256
compress nologging
Using No Index
Never Refresh
Enable Query Rewrite
As
Select
transaction_cd,
location_cd,
product_cd,
tx_timestamp,
sum(tx_qty) s_tx_qty,
sum(tx_cst) s_tx_cst
from
master
group by
transaction_cd,
location_cd,
product_cd,
tx_timestamp
/
begin
dbms_stats.gather_table_stats(
ownname => user,
tabname => 'master_mv1',
granularity => 'ALL');
end;
/
explain plan for
select transaction_cd,
sum(tx_qty)
from master
where product_cd = 3
group by transaction_cd;
select * from table(dbms_xplan.display());
explain plan for
select transaction_cd,
sum(tx_qty)
from master
where location_cd = 3
group by transaction_cd;
select * from table(dbms_xplan.display());
Here are the execution plans:
-------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
-------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 9 | 108 | 4 (25)| 00:00:01 | | |
| 1 | SORT GROUP BY | | 9 | 108 | 4 (25)| 00:00:01 | | |
| 2 | PARTITION HASH SINGLE | | 9 | 108 | 3 (0)| 00:00:01 | 36 | 36 |
|* 3 | MAT_VIEW REWRITE ACCESS FULL| MASTER_MV1 | 9 | 108 | 3 (0)| 00:00:01 | 36 | 36 |
-------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | Pstart| Pstop |
---------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 11040 | 118K| | 63 (4)| 00:00:01 | | |
| 1 | SORT GROUP BY | | 11040 | 118K| 536K| 63 (4)| 00:00:01 | | |
| 2 | PARTITION LIST SINGLE| | 11040 | 118K| | 11 (0)| 00:00:01 | 3 | 3 |
| 3 | TABLE ACCESS FULL | MASTER | 11040 | 118K| | 11 (0)| 00:00:01 | 3 | 3 |
---------------------------------------------------------------------------------------------------------
We see here that the query is rewritten to take advantage of the different partitioning scheme of the materialized view. This simple case would have been a good candidate for a composite partitioning scheme of course.
The Cost Of Maintaining Redundant Materialized Views
So what is the cost to the system of providing this materialized view? Well if we're smart about how we maintain it then there is very little cost indeed. After all, we know that we don't have to aggregate the master data in order to populate the materialized view's table so I don't see why we would choose to do so.
In this previous blog I showed how there can be benefits to maintaining a materialized view manually, rather than through the internal refresh procedure. This redundant MV is an excellent example of a situation that benefits from such an approach, because the MV data is exactly the same as the master data. So instead of inserting the master table data and then refreshing the MV we can insert into both tables at the same time. As I noted before, the MV data can be maintained manually through either the partition exchange mechanism or by dropping the MV (if it is built on a predefined table) and directly inserting into it. In the latter case we could even use the multitable insert syntax that Oracle so thoughtfully provides to insert into both the master and the MV tables simultaneously, giving us the benefit of a single transaction.
Summary
So there we have it. Materialized views that are redundant in terms of data but which can provide interesting benefits in the report query phase.
Would I do such a thing myself? Maybe, maybe not. I don't believe that there ought to be qualms about the extra storage requirement because by the time you have used up half the capacity of todays hard drives you have already run out of bandwidth. I haven't worked in an environment where I have felt comfortable in implementing such a scheme, but I could see that I might. I can imagine a related situation where I would create a materialized view on a subset of the master table's columns in order to provide a skinny table (possibly with different partitioning scheme), in order to get the benefits of reduced i/o for scans of large number of rows and/or to leverage the optimizer's abandonment of the SORT GROUP BY phase when reporting on a primary key column.
Regardless of the practicality, I think that it serves as a nice demonstration of the power of query rewrite.
2 Comments:
I think you might have the record for the longest average blog entry. Your standard deviation may be a record as well. Either itsy bitsy 1 or 2 sentence posts or a full multi-page technical report.
Very intriguing topic. I'm not much into DWs, but its always nice to see Oracle performance hacks. Classic decrease time by increasing space used.
I'm a slave to both impulse and thoroughness I guess.
Yeah, it's probably more like article material than blog content really, but I just work with the (free) tools I have at hand. If I was interested in running a web site I might do things differently I suppose.
Post a Comment
<< Home