The Three Pillars of Oracle Data Warehousing
Introduction
This is a basic topic for Oracle data warehousing beginners, based on some ideas that I'm hoping will stop buzzing round in my head if I commit them to virtual paper.
There are three Oracle features that provide a foundation for successful data warehousing:
- Partitioning
- Parallelism
- Psummary Tables (the "p" is silent)
Partitioning
In a data warehouse we are interested in accessing large numbers of rows that are identified through some common attribute. Very often the same small numbers of attributes are used over and over again to filter the data. The most common is some form of date filter ("give me sales for this January"), and there are often others that are used very commonly ("give me inventory levels for repairable items", "give me backorders for California").
In an OLTP system we would use an index or a full tablescan to get this subset of data, but that can be inefficient if we are selecting a large number of rows representing a subset of the total table data because of the high number of logical i/o's required and the single block reads that are used to access the table.
In a data warehouse we can partition according to attributes commonly used as filters by the users, or according to logical child attributes of them (eg. users commonly filter by month but we partition by day). This achieves two important aims.
- Rows of data that are commonly used together are physically co-located.
- The optimizer can treat each partition as a table and can perform fast scans of multiple partitions to access the data ("partition pruning").
There are also benefits to manipulating bulk data -- we can use partition-based DDL operations to load and unload data from a table (a partition exchange) or to delete old data (partition drop or truncate).
Parallelism
In a parallel query operation the physical locations in which the required rows are held are subdivided into a number of ranges, each of which is then scanned by a different process
(a parallel query slave). The processes that read the data then pass it on to either a single query coordinator process or to another set of slaves which themselves pass the results to the query coordinator.
Insert, update and delete operations can also be parallelized to provide faster bulk data changes.
One of the key differences between serial and parallel queries are that the parallel queries read data directly from disk instead of checking for the required blocks in the SGAs block buffer area, and pass the result set directly to the user's PGA, thus completely bypassing the SGA. Read consistency is maintained through the query coordinator requesting a checkpoint to have commited dirty buffers of the scanned object written to disk before the PQ slaves start reading.
The coordination required in creating and controlling query slaves leads to an overall increase in resource usage which can easily overload a system, but when correctly implemented spare system resources can be used to improve the response time of queries that access or manipulate a large amount of data.
You probably need a lower degree of parallelism than you expect.
Psummary tables
OK, "summaries" then.
The static nature of data warehouse data allows the results of some frequently executed queries to be pre-computed and cached as summary tables. These are often created through the Oracle materialized view object type, although the use of an MV is not mandatory.
The query rewrite function of the cost-based optimizer can use the metadata stored in materialized view definitions or through declarations of query equivalence made through the DBMS_ADVANCED_REWRITE package (10g+ only) to redirect queries against large base tables (typically data warehouse fact tables) to the smaller summary tables. In most cases the summary table does not have to be an exact match for the query.
The major challenge in using summary tables is in finding an efficient mechanism for maintaining consistency between the summary and its base table(s).
The major benefit is that almost any end-user query can be executed extremely quickly with the correct use of summary tables.
PSummary
- Partitioning allows efficient access to relatively large subsets of data and efficient bulk manipulation of data.
- Parallelism allows the work of a single user process to be shared among multiple slave processes, leading to faster completion time but higher resource usage.
- Psummaries provide extremely fast performance, but their maintenance and monitoring can require non-trivial efforts.
12 Comments:
Psummary Tables (the "p" is silent)
as in swimming pool?
WIPOZ - suggestions?
Hi,
Maybe add "Pre-Summarize"?
Without MV's and rollups (preaggregation), it can be tough . . .
Thank you for this article. Is psummons up in short and precise words the main reason why my boss should listen to my reasonings.
I think I will find myself quoting this article often in the future.
OS
When I thank you for these writings it is not as I am taking it as a gift, that is not what I mean. I will, anyway, point my audience to your blog.
Thanks. OS
Don,
Why you think it is tough without MV's and rollups?. Could you please give an example?
You could add Placement, Postition, or possibly just Psorting - as another strategy for optimising physical colocation of data. (I don't think you can get away with claiming a silent P in Pclustering, otherwise that would do as the more generic description).
Jonathan Lewis
Note to self: don't go swimming with Pete :D
DB: yes, I think if I had to pick just one of those three techniques it would be summaries. There's nothing like making a scan of 100,000,000 fact table records return in 0.01 seconds for brightening your day.
JL: Maybe Pack, Parcel or Pigeonhole would also be psuitable.
Clustering (with or with out the P) can be problematic - we can only really cluster on one dimension (perhaps a couple if there is some correlation between them – say a brewer supplies products that are beers!) but when you try to cluster product, location and customer…
Pete,
True: any attempt at clustering is likely to be a benefit along just one dimension - but it's always worth checking if there's one dimension that's worth the effort, so long as it doesn't cause too much disruption elsewhere.
There's also the option that you can cluster in a second dimension by copying the data in a different order, e.g.
create table t2 as
select * from t1
order by colx
Then create a materialized view with query rewrite enabled so that queries against t1 can visit t2.
It's not a strategy that I've used often - but it's just a special case of a summary table, and disks are just so big these days you need something to fill the space with to stop other people from interfering.
Regards
Jonathan Lewis
Along a similar line, having two summary tables with exactly the same data but with different partitioning schemes is a useful technique. Query rewrite is very happy to pick the appropriate one to use.
with query rewrite enabled...
I'm sure I'm missing something obvious, but why isn't query-rewrite automatic? There's a special grant the user needs, and there's a clause to put in the mview defintion. When would you turn query-rewrite off?
"why isn't query-rewrite automatic"
It's all in the way you read the text.
As you pointed out, there's an optional clause that you can put in the mview definition. What I wrote was "Then create a materialized view with query rewrite enabled" - i.e. create a materialized view with that optional clause in place.
Post a Comment
<< Home