.comment-link {margin-left:.6em;}

Oracle Sponge -- Now Moved To Wordpress

Please use http://oraclesponge.wordpress.com

Thursday, October 27, 2005

Dynamic Sampling at AskTom

I saw an interesting question on dynamic sampling over at Ask Tom and ended up pretty much dumping a half-finished blog posting in reply. I'll save myself the elctrons of publishing it here by just linking to it. Enjoy.

Thursday, October 06, 2005

Testing the Trace Loader

As I said yesterday, I've been fiddling around with loading trace files into the database for analysis, and today I was looking for a good test case to see how the numbers are coming out. The obvious example to choose was the direct path enhancement to materialized view refresh that I also wrote about recently.

Here is my test script:

drop table master;
drop table master_test_data;
drop materialized view master_mv1;

create table master
location_cd number not null,
tx_timestamp date not null,
product_cd number not null,
tx_qty number not null,
tx_cst number not null
pctfree 0

create materialized view log on master
with rowid
including new values

create materialized view master_mv1
Using No Index
Refresh Fast On Commit
Enable Query Rewrite
trunc(tx_timestamp,'MM') tx_month,
sum(tx_qty) s_tx_qty,
sum(tx_cst) s_tx_cst,
count(*) c_star
group by

create materialized view master_mv2
Using No Index
Refresh Fast On Commit
Enable Query Rewrite
trunc(tx_timestamp,'MM') tx_month,
sum(tx_qty) sum_tx_qty,
sum(tx_cst) sum_tx_cst,
count(*) c_star
group by

create table master_test_data
location_cd not null,
tx_timestamp not null,
product_cd not null,
tx_qty not null,
tx_cst not null
pctfree 0
+ dbms_random.value(1,31),
connect by
1=1 and
level < 1000001
select count(*) from master_test_data
alter session set timed_statistics = true;
alter session set max_dump_file_size=1000000000;
alter session set events '10046 trace name context forever, level 8';

insert into master
select *
from master_test_data
where rownum < 100001

Rather straightforward stuff. The "select count(*)" of rows in the master test data table is just to eliminate the possiblity of the test results getting skewed by delayed block cleanout, by the way. The example above is for the conventional path insert. The only change required for the direct path enhancement to kick-in is to use an APPEND hint in the last insert statement.

So, having identified the trace files I issue the following command:

trace.purge; -- purges all data from the trace file repository

(directory =>'UDUMP',
file_name => 'slimdb_ora_4412.trc',
identify_as => 'direct path',
debug => 0,
log_extents => true,
log_files => true);

(directory =>'UDUMP',
file_name => 'slimdb_ora_4600.trc',
identify_as => 'conv path',
debug => 0,
log_extents => true,
log_files => true);

This allows me to query the trace file information directly, giving me such output as:

SQL> select
2 substr(identify_as,1,12),
3 op_name,
4 sum(op_c),
5 sum(op_e)/1000000 sec,
6 count(*)
7 from
8 (
9 select
10 identify_as,
11 cur_dep,
12 f.file#,
13 f.file_name,
14 c.cursor#,
15 o.op_name,
16 OP_C ,
17 OP_E ,
18 OP_P ,
19 OP_CR ,
20 OP_CU ,
22 from
23 trace_ops o,
24 trace_cursors c,
25 trace_files f
26 where
27 f.file# = c.file# and
28 c.cursor# = o.cursor#
29 )
30 group by
31 substr(identify_as,1,12),
32 op_name
33 order by
34 1,2,4 desc
35 /

------------ ---------- ---------- ---------- ----------
conv path EXEC 20875000 33.358519 100342
conv path FETCH 2296875 3.026768 304
conv path PARSE 156250 .661319 333
direct path EXEC 2890625 3.476836 350
direct path FETCH 140625 .406403 402
direct path PARSE 859375 1.302354 224

6 rows selected.

Of course all the interesting work is going on behind the scenes in the form of recursive SQL:

SQL> select
2 substr(identify_as,1,12),
3 cur_dep,
4 op_name,
5 sum(op_c),
6 sum(op_e)/1000000 sec,
7 count(*)
8 from
9 (
10 select
11 identify_as,
12 cur_dep,
13 f.file#,
14 f.file_name,
15 c.cursor#,
16 o.op_name,
17 OP_C ,
18 OP_E ,
19 OP_P ,
20 OP_CR ,
21 OP_CU ,
23 from
24 trace_ops o,
25 trace_cursors c,
26 trace_files f
27 where
28 f.file# = c.file# and
29 c.cursor# = o.cursor#
30 )
31 group by
32 substr(identify_as,1,12),
33 cur_dep,
34 op_name
35 order by
36 1,2,3,5 desc
37 /

------------ ---------- ---------- ---------- ---------- ----------
conv path 0 EXEC 14484375 22.637736 18
conv path 0 FETCH 2125000 2.342986 15
conv path 0 PARSE 93750 .56999 16
conv path 1 EXEC 6343750 10.694057 100188
conv path 1 FETCH 109375 .619292 147
conv path 1 PARSE 62500 .079386 187
conv path 2 EXEC 46875 .026135 128
conv path 2 FETCH 62500 .062152 134
conv path 2 PARSE 0 .009992 128
conv path 3 EXEC 0 .00006 1
conv path 3 FETCH 0 .002111 1
conv path 3 PARSE 0 .001875 1
conv path 4 EXEC 0 .000531 7
conv path 4 FETCH 0 .000227 7
conv path 4 PARSE 0 .000076 1
direct path 0 EXEC 1718750 2.218695 3
direct path 0 PARSE 31250 .17893 2
direct path 1 EXEC 750000 .841392 138
direct path 1 FETCH 15625 .128181 91
direct path 1 PARSE 343750 .47748 133
direct path 2 EXEC 140625 .14998 124
direct path 2 FETCH 109375 .168264 124
direct path 2 PARSE 421875 .52747 27
direct path 3 EXEC 281250 .260751 78
direct path 3 FETCH 15625 .068693 180
direct path 3 PARSE 46875 .109109 61
direct path 4 EXEC 0 .006018 7
direct path 4 FETCH 0 .041265 7
direct path 4 PARSE 15625 .009365 1

29 rows selected.

So what leaps out at us here (or it would do if Blogger didn't trim leading and multiple white spaces) is an enormous number of executions at a recursive depth (cur_dep) of 1 in the conventional path refresh -- coincidentally (well, actually not at all coincidentally) slightly more than the number of rows being inserted into the master table.

This of course, is the maintenance of the materialized view log being performed in an inefficient row-by-row manner:

2 from trace_cursors
3 where cursor# =
4 (
5 select cursor#
6 from
7 (
8 Select cursor#,count(*)
9 from trace_ops
10 where op_name = 'EXEC'
11 group by cursor#
12 order by 2 desc
13 )
14 where rownum = 1
15 )
16 /


INSERT /*+ IDX(0) */ INTO "DAVE"."MLOG$_MASTER" (dmltype$$,old_new$$,snaptime$$,


ST") VALUES (:d,:o,to_date('4000-01-01:00:00:00','YYYY-MM-DD:HH24:MI:SS'),:c,:m,


Hey, at least it's using bind variables!

Wednesday, October 05, 2005

Reporting Wait Events For Data Warehouse Queries

Funny how easy it is to get side-tracked. Well not really side-tracked, but just more and more involved in one particular thing, so that you end up burrowing so deeply into it that it loses all relevance to the real world. Just as a simple SQL statement generates deeper and deeper levels of more and more obscure recursive statements, so it can be with the most simple of tasks.

Those of you still awake may recall that I was looking at materialized view refresh, and in the course of getting to grips with the optimum refresh methodology for fast refreshing multiple MV's I've been picking the bones out of some lovely warm trace files. My guidance has come from two books that are probably familiar to many readers, and which have become indispensible to me: Oracle Wait Interface: A Practical Guide to Performance Diagnostics & Tuning; Shee, Deshpande and Gopalakrishnan; Oracle Press, and Optimizing Oracle Performance; Millsap and Holt; O'Reilly. And let me first say that they are a beautiful compliment to each other. While the former goes into just the right level of detail on how to understand and mitigate the wait events themselves, the latter presents a methodology for data capture and event interpretation backed up with a strong background in queueing theory.

Now the usual challenge is to translate an approach aimed at OLTP databases into something that we can use in data warehousing, and fortunately this is not too tricky in the case of wait event capture and analysis. One of the major challenges associated with wait event analysis on OLTP databases is simply in capturing the correct data in the first place (a topic addressed in depth by the Millsap/Holt book) but this is relatively easy in "the other world": a problematic query that we are interested in tuning is generally well-known and easily isolated, whether it is part of the ETL routine or part of a report.

Issues relating to use of shared servers, which can scatter a single session's wait events among multiple trace files practically at random, are also probably rare. Shared server configurations are inherently problematic in an environment where system load is dominated by fewer, longer running queries than an OLTP system would experience, and so are rarely used.

On the other hand we do have the challenge of parallel query, in which the most meaningful wait events get pushed to different trace files, one for each PQ process. Without identifying these files, capturing their data, and consolidating them into a single entity for analysis there is little that can usefully be done.

And now a word about recursive SQL. Recursive SQL statements are issued by Oracle in support of the execution of the SQL statements submitted to the database. They are used to parse the query, to maintain the data dictionary (for example to add extents to table and index segments), to perform dynamic sampling, to retrieve information on segments for the optimizer to use, etc.. So we might be able to deduce immediately that recursive SQL plays a very large role in the execution of data warehouse queries for a number of reasons, the foremost of which are the high parse ratio and the complexity of the objects that the optimizer has to consider.

The high parse ratio is caused by our desire for the optimizer to be sensitive to literal values in the predicates (histogram usage being particularly important in this environment), and where an OLTP system could be unscalable due to the workload imposed by a high hard parse rate, the longer duration of the data warehouse queries renders this overhead less significant. Not wholly insignificant, but not so significant that it is worth eliminating entirely.

Now there is also the issue of object complexity. I venture to suggest that in data warehousing we have a much higher probability that partitioning is being used, and also that we will be leveraging partition pruning. So despite the apprant simplicity of many queries against star schemas there are actualy a disproportionate number of data and index segments for the optimizer to be aware of. There is also a high chance that we will be using query rewrite, and hence that there are a great many more objects to be considered even if they are never going to be touched by the optimized query.

Now I rather glossed over the issue of how easy it is in a data warehouse to identify the SQL for which we want to the wait events. However if you read much of the literature on wait event analysis you will find that this is actually a very large part of the formal methodologies generally presented. Do you want to identify the queries with the longest duration, or the most logical reads, or the most physical I/O, or what other considerations are there? When data warehousing relieves us of that issue we are left with two other major issues:
  • What makes our query run for so long?
  • What can we do about it?
And these are by no means trivial tasks. However I would suggest that the first of these issues really ought to be more trivial than it currently is. Consider the steps we might follow in a wait event analysis for a hypothetical data warehouse query.
  1. Extract the query, along with representative predicate literals
  2. Execute the query with the 10046 event set at level 8 ("8" rhymes with "wait", which is how I remember that level 8 gives us wait events and level 4 gives us binds :) ... and level 12 gives us both of course).
  3. Find the trace files (parallel query giving us multiple files, of course)
  4. Run tkprof on the trace files to turn them into something readable
  5. Coalesce the results
  6. Analyze what events are causing us pain (generally I/O-related ones and sometimes high CPU usage, I find).
  7. For I/O related events examine what object they are occuring on, how many there are, their average duration, whether there are I/O hot spots on our devices etc..
For me the tricky part starts at step 4: I can extract queries, set events, find trace files without much problem, but the rest leaves me a little cold.

Let's start with tkprof. A fine program in general, but remembering the options for sorting the queries is beyond me, and having listed my major wait events I then have to go probing through the raw file again plus querying the data dictionary to find out what segment and file the events occured on. And remember that we may have upwards of a dozen file to contend with. Unless I'm missing out on something, that means a dozen executions of tkprof then a manual process to merge and examine all those raw files anyway.

Merging, sorting and reporting: these are the bread and butter tasks of a database, not a command-line programs. So while staring at the ceiling in the dark one night I had the following thoughts:
  1. "Load the raw files into the database then report on them from there."
  2. "That means I can link the I/O-related wait events directly to the data dictionary to get object and file names automatically".
  3. "That means we'd have a history of the the exact wait events and the real execution plan associated with a query within the database itself"
  4. "That's so obvious that it must have been done before."
  5. "Even if it has, I bet I could do it easily and customize it for my own purposes."
So that's what I've been up to for the past couple of weeks, in the little time I have between my real work and my family, instead of going through the materialized view refresh analysis that I actually promised. But it's a step in the right direction anyway -- should make that process more easy.

How do you go about desinging and implementing such a methodology then? Here are the aims that I started out with.

Load into database tables ther significant data from multiple trace files

For me the data of interest are the wait events, and in order to put them into a usable context I also need to associate them with SQL statements (cursors, in the trace file), the file name, operations such as PARSE, EXEC and FETCH. They all have to be stored in a set of dedicated tables.

Exactly which data would we load for each of these types of trace information? As much as possible, and for the most part that means all of it. We can get timings, recursive depth (very useful!), user id's, just about anything we need.

Associate significant system data with the trace file data

In our world of data warehousing we regularly add, truncate, split, merge and drop partitions, and when we do so we change the object id to which the wait event parameters refers. It seems prudent for this reason and for performance purposes (data dictionary views being notoriously cranky when it comes to joining to other data) to provide the option of storing the appropriate data at the time that the trace file is loaded.

Associate multiple trace files with a single report

This is of particular use in coalescing the results of multiple files generated by parallel query for a single submitted SQL statement. So we'd want to load multiple trace files and say "these belong together".

Provide predefined mechanisms for reporting on the events

A set of views on the database tables was going to be required, to allow more simple custom queries to be defined on the trace data. Analagous to the data dictionary views provided in the database in fact. In just the same way as "select * from table(dbms_xplan.display())" reports on the most recently generated explain plan, we can provide a mechanism for reporting on the most recently loaded trace file(s) for example.

Provide some administrative tools for purging old data

Purge all the data, or all the data prior to some date, and some options like that.

The progress so far has been encouraging. Loading in a trace file with a known name is now a trivial task, and basic reporting on the events is a doddle. For example I'm currently using the following command to load a single file:

(directory =>'UDUMP', -- 1
file_name => 'slimdb_ora_4680.trc', -- 2
debug => 0, -- 3
log_extents => true, -- 4
log_files => true); -- 5

The parameters are pretty intuitive:
  1. The name of the directory object that points to the location of the file to be loaded.
  2. The name of the file to be loaded (duh).
  3. The level of debugging message required (spat out through DBMS_OUTPUT).
  4. Whether to take a snapshot of the relevant entities in the dba_extents system view, to enhance later reporting (default "false" due to relatively high overhead).
  5. Whether to take a snapshot of the relevant entities in the dba_data files system view, to enhance later reporting (default "true").
With the data loaded I can then report in the following manner:

SQL> select * from all_trace_files;

---------- ------------------------------ ---------
132440 slimdb_ora_4680.trc 05-OCT-05

SQL> column SUM_WT_ELA format 999,999,990
SQL> column AVG_WT_ELA format 999,999,990
SQL> column waits format 999,990
SQL> break on report
SQL> compute sum of SUM_WT_ELA on report
SQL> select substr(WT_NAME,1,25) wt_name,
2 sum(WT_ELA) sum_wt_ela,
3 avg(wt_ela) avg_wt_ela,
4 count(*) waits
5 from last_trace_waits
6 group by substr(WT_NAME,1,25)
7 /

------------------------- ------------ ------------ --------
SQL*Net message from clie 27,372,317 9,124,106 3
SQL*Net message to client 14 5 3
db file scattered read 4,771,858 18,640 256
db file sequential read 1,490,384 3,644 409
sum 33,634,573

SQL> select CUR_DEP,
2 count(*)
3 from last_trace_cursors
4 group by cur_dep
5 order by 1
6 /

---------- ----------
0 2
1 391
2 2189
3 92
4 2
5 1

6 rows selected.

SQL> set pagesize 100
SQL> select substr(WT_NAME,1,23) wt_name,
2 substr(segment_name,1,20) segment_name,
3 sum(WT_ELA) sum_wt_ela,
4 avg(wt_ela) avg_wt_ela,
5 count(*) waits
6 from last_trace_obj_waits
7 group by substr(WT_NAME,1,23),substr(segment_name,1,20)
8 order by 3 desc
9 /

----------------------- -------------------- ------------ ------------ --------
db file scattered read FCT_FSA_1 4,698,204 19,908 236
db file sequential read INDSUBPART$ 3,963,720 1,149 3,450
db file sequential read C_OBJ#_INTCOL# 2,355,677 4,403 535
db file sequential read C_FILE#_BLOCK# 2,129,218 1,342 1,587
db file sequential read I_OBJ2 676,269 1,375 492
db file sequential read I_FILE#_BLOCK# 470,275 4,237 111
db file sequential read FCT_FSA_1 417,323 5,962 70
db file sequential read TABSUBPART$ 170,888 4,069 42
db file scattered read INDCOMPART$ 130,647 6,221 21
db file sequential read HIST_HEAD$ 62,906 7,863 8
db file sequential read I_INDSUBPART$ 61,507 1,922 32
db file scattered read TABCOMPART$ 57,753 6,417 9
db file sequential read I_CDEF2 34,272 4,284 8
db file sequential read SNAP$ 30,162 7,541 4
db file sequential read PARTOBJ$ 29,690 7,423 4
db file sequential read SUMAGG$ 24,404 6,101 4
db file sequential read INDCOMPART$ 19,534 9,767 2
db file sequential read SUMKEY$ 19,116 4,779 4
db file sequential read DIMJOINKEY$ 17,296 4,324 4
db file sequential read PARTCOL$ 15,068 3,767 4
db file sequential read TABCOMPART$ 14,988 7,494 2
db file sequential read SUMDELTA$ 12,143 12,143 1
db file sequential read I_TABSUBPART$ 11,146 5,573 2
db file sequential read I_CDEF3 10,627 10,627 1
db file sequential read I_SUBCOLTYPE1 10,131 10,131 1
db file sequential read I_NTAB2 9,696 9,696 1
db file sequential read I_CCOL1 9,366 4,683 2
db file sequential read C_COBJ# 9,322 9,322 1
db file sequential read SUM$ 8,722 4,361 2
db file sequential read I_OBJ#_INTCOL# 7,496 7,496 1
db file sequential read I_SUBPARTCOL$ 7,059 7,059 1
db file sequential read I_DIMATTR$_2 6,601 6,601 1
db file sequential read DEPENDENCY$ 6,432 3,216 2
db file sequential read I_HH_OBJ#_INTCOL# 6,267 6,267 1
db file sequential read I_DEFSUBPART$ 5,595 5,595 1
db file sequential read VIEW$ 5,231 5,231 1
db file sequential read DEFSUBPART$ 4,795 4,795 1
db file sequential read I_DEPENDENCY1 4,649 4,649 1
db file sequential read I_REFCON2 3,544 3,544 1
db file sequential read I_PARTCOL$ 2,705 2,705 1
db file scattered read SUM$ 742 371 2
db file sequential read I_OPQTYPE1 195 195 1
db file sequential read SUBPARTCOL$ 138 138 1
db file sequential read I_DIMLEVELKEY$_2 136 136 1
sum 15,541,655

The last two are pretty interesting, for a trace file based on a single star schema query, huh? Sorry about the crappy formating though.

There's still work to do, and at the top of my list are the following:
  • Allow grouping of trace files as a single logical entity
  • Add more reporting views, focusing on filtering out recursive statements and on a set of "MY_%" views
  • Associating recursive queries with their parents to enable reporting views based on CONNECT BY queries
  • Better handling of files
  • Creating a load script for the TRACELDR schema under which all this runs
  • Creating a Trace Loader role to allow other users to access the package and views.
Based on that you can probably expect to hear from me again around Christmas.

Nah, not really. Soon.