The Three Pillars of Oracle Data Warehousing
This is a basic topic for Oracle data warehousing beginners, based on some ideas that I'm hoping will stop buzzing round in my head if I commit them to virtual paper.
There are three Oracle features that provide a foundation for successful data warehousing:
- Psummary Tables (the "p" is silent)
In a data warehouse we are interested in accessing large numbers of rows that are identified through some common attribute. Very often the same small numbers of attributes are used over and over again to filter the data. The most common is some form of date filter ("give me sales for this January"), and there are often others that are used very commonly ("give me inventory levels for repairable items", "give me backorders for California").
In an OLTP system we would use an index or a full tablescan to get this subset of data, but that can be inefficient if we are selecting a large number of rows representing a subset of the total table data because of the high number of logical i/o's required and the single block reads that are used to access the table.
In a data warehouse we can partition according to attributes commonly used as filters by the users, or according to logical child attributes of them (eg. users commonly filter by month but we partition by day). This achieves two important aims.
- Rows of data that are commonly used together are physically co-located.
- The optimizer can treat each partition as a table and can perform fast scans of multiple partitions to access the data ("partition pruning").
There are also benefits to manipulating bulk data -- we can use partition-based DDL operations to load and unload data from a table (a partition exchange) or to delete old data (partition drop or truncate).
In a parallel query operation the physical locations in which the required rows are held are subdivided into a number of ranges, each of which is then scanned by a different process
(a parallel query slave). The processes that read the data then pass it on to either a single query coordinator process or to another set of slaves which themselves pass the results to the query coordinator.
Insert, update and delete operations can also be parallelized to provide faster bulk data changes.
One of the key differences between serial and parallel queries are that the parallel queries read data directly from disk instead of checking for the required blocks in the SGAs block buffer area, and pass the result set directly to the user's PGA, thus completely bypassing the SGA. Read consistency is maintained through the query coordinator requesting a checkpoint to have commited dirty buffers of the scanned object written to disk before the PQ slaves start reading.
The coordination required in creating and controlling query slaves leads to an overall increase in resource usage which can easily overload a system, but when correctly implemented spare system resources can be used to improve the response time of queries that access or manipulate a large amount of data.
You probably need a lower degree of parallelism than you expect.
OK, "summaries" then.
The static nature of data warehouse data allows the results of some frequently executed queries to be pre-computed and cached as summary tables. These are often created through the Oracle materialized view object type, although the use of an MV is not mandatory.
The query rewrite function of the cost-based optimizer can use the metadata stored in materialized view definitions or through declarations of query equivalence made through the DBMS_ADVANCED_REWRITE package (10g+ only) to redirect queries against large base tables (typically data warehouse fact tables) to the smaller summary tables. In most cases the summary table does not have to be an exact match for the query.
The major challenge in using summary tables is in finding an efficient mechanism for maintaining consistency between the summary and its base table(s).
The major benefit is that almost any end-user query can be executed extremely quickly with the correct use of summary tables.
- Partitioning allows efficient access to relatively large subsets of data and efficient bulk manipulation of data.
- Parallelism allows the work of a single user process to be shared among multiple slave processes, leading to faster completion time but higher resource usage.
- Psummaries provide extremely fast performance, but their maintenance and monitoring can require non-trivial efforts.