.comment-link {margin-left:.6em;}

Oracle Sponge -- Now Moved To Wordpress

Please use http://oraclesponge.wordpress.com

Wednesday, October 05, 2005

Reporting Wait Events For Data Warehouse Queries

Funny how easy it is to get side-tracked. Well not really side-tracked, but just more and more involved in one particular thing, so that you end up burrowing so deeply into it that it loses all relevance to the real world. Just as a simple SQL statement generates deeper and deeper levels of more and more obscure recursive statements, so it can be with the most simple of tasks.

Those of you still awake may recall that I was looking at materialized view refresh, and in the course of getting to grips with the optimum refresh methodology for fast refreshing multiple MV's I've been picking the bones out of some lovely warm trace files. My guidance has come from two books that are probably familiar to many readers, and which have become indispensible to me: Oracle Wait Interface: A Practical Guide to Performance Diagnostics & Tuning; Shee, Deshpande and Gopalakrishnan; Oracle Press, and Optimizing Oracle Performance; Millsap and Holt; O'Reilly. And let me first say that they are a beautiful compliment to each other. While the former goes into just the right level of detail on how to understand and mitigate the wait events themselves, the latter presents a methodology for data capture and event interpretation backed up with a strong background in queueing theory.

Now the usual challenge is to translate an approach aimed at OLTP databases into something that we can use in data warehousing, and fortunately this is not too tricky in the case of wait event capture and analysis. One of the major challenges associated with wait event analysis on OLTP databases is simply in capturing the correct data in the first place (a topic addressed in depth by the Millsap/Holt book) but this is relatively easy in "the other world": a problematic query that we are interested in tuning is generally well-known and easily isolated, whether it is part of the ETL routine or part of a report.

Issues relating to use of shared servers, which can scatter a single session's wait events among multiple trace files practically at random, are also probably rare. Shared server configurations are inherently problematic in an environment where system load is dominated by fewer, longer running queries than an OLTP system would experience, and so are rarely used.

On the other hand we do have the challenge of parallel query, in which the most meaningful wait events get pushed to different trace files, one for each PQ process. Without identifying these files, capturing their data, and consolidating them into a single entity for analysis there is little that can usefully be done.

And now a word about recursive SQL. Recursive SQL statements are issued by Oracle in support of the execution of the SQL statements submitted to the database. They are used to parse the query, to maintain the data dictionary (for example to add extents to table and index segments), to perform dynamic sampling, to retrieve information on segments for the optimizer to use, etc.. So we might be able to deduce immediately that recursive SQL plays a very large role in the execution of data warehouse queries for a number of reasons, the foremost of which are the high parse ratio and the complexity of the objects that the optimizer has to consider.

The high parse ratio is caused by our desire for the optimizer to be sensitive to literal values in the predicates (histogram usage being particularly important in this environment), and where an OLTP system could be unscalable due to the workload imposed by a high hard parse rate, the longer duration of the data warehouse queries renders this overhead less significant. Not wholly insignificant, but not so significant that it is worth eliminating entirely.

Now there is also the issue of object complexity. I venture to suggest that in data warehousing we have a much higher probability that partitioning is being used, and also that we will be leveraging partition pruning. So despite the apprant simplicity of many queries against star schemas there are actualy a disproportionate number of data and index segments for the optimizer to be aware of. There is also a high chance that we will be using query rewrite, and hence that there are a great many more objects to be considered even if they are never going to be touched by the optimized query.

Now I rather glossed over the issue of how easy it is in a data warehouse to identify the SQL for which we want to the wait events. However if you read much of the literature on wait event analysis you will find that this is actually a very large part of the formal methodologies generally presented. Do you want to identify the queries with the longest duration, or the most logical reads, or the most physical I/O, or what other considerations are there? When data warehousing relieves us of that issue we are left with two other major issues:
  • What makes our query run for so long?
  • What can we do about it?
And these are by no means trivial tasks. However I would suggest that the first of these issues really ought to be more trivial than it currently is. Consider the steps we might follow in a wait event analysis for a hypothetical data warehouse query.
  1. Extract the query, along with representative predicate literals
  2. Execute the query with the 10046 event set at level 8 ("8" rhymes with "wait", which is how I remember that level 8 gives us wait events and level 4 gives us binds :) ... and level 12 gives us both of course).
  3. Find the trace files (parallel query giving us multiple files, of course)
  4. Run tkprof on the trace files to turn them into something readable
  5. Coalesce the results
  6. Analyze what events are causing us pain (generally I/O-related ones and sometimes high CPU usage, I find).
  7. For I/O related events examine what object they are occuring on, how many there are, their average duration, whether there are I/O hot spots on our devices etc..
For me the tricky part starts at step 4: I can extract queries, set events, find trace files without much problem, but the rest leaves me a little cold.

Let's start with tkprof. A fine program in general, but remembering the options for sorting the queries is beyond me, and having listed my major wait events I then have to go probing through the raw file again plus querying the data dictionary to find out what segment and file the events occured on. And remember that we may have upwards of a dozen file to contend with. Unless I'm missing out on something, that means a dozen executions of tkprof then a manual process to merge and examine all those raw files anyway.

Merging, sorting and reporting: these are the bread and butter tasks of a database, not a command-line programs. So while staring at the ceiling in the dark one night I had the following thoughts:
  1. "Load the raw files into the database then report on them from there."
  2. "That means I can link the I/O-related wait events directly to the data dictionary to get object and file names automatically".
  3. "That means we'd have a history of the the exact wait events and the real execution plan associated with a query within the database itself"
  4. "That's so obvious that it must have been done before."
  5. "Even if it has, I bet I could do it easily and customize it for my own purposes."
So that's what I've been up to for the past couple of weeks, in the little time I have between my real work and my family, instead of going through the materialized view refresh analysis that I actually promised. But it's a step in the right direction anyway -- should make that process more easy.

How do you go about desinging and implementing such a methodology then? Here are the aims that I started out with.

Load into database tables ther significant data from multiple trace files

For me the data of interest are the wait events, and in order to put them into a usable context I also need to associate them with SQL statements (cursors, in the trace file), the file name, operations such as PARSE, EXEC and FETCH. They all have to be stored in a set of dedicated tables.

Exactly which data would we load for each of these types of trace information? As much as possible, and for the most part that means all of it. We can get timings, recursive depth (very useful!), user id's, just about anything we need.

Associate significant system data with the trace file data

In our world of data warehousing we regularly add, truncate, split, merge and drop partitions, and when we do so we change the object id to which the wait event parameters refers. It seems prudent for this reason and for performance purposes (data dictionary views being notoriously cranky when it comes to joining to other data) to provide the option of storing the appropriate data at the time that the trace file is loaded.

Associate multiple trace files with a single report

This is of particular use in coalescing the results of multiple files generated by parallel query for a single submitted SQL statement. So we'd want to load multiple trace files and say "these belong together".

Provide predefined mechanisms for reporting on the events

A set of views on the database tables was going to be required, to allow more simple custom queries to be defined on the trace data. Analagous to the data dictionary views provided in the database in fact. In just the same way as "select * from table(dbms_xplan.display())" reports on the most recently generated explain plan, we can provide a mechanism for reporting on the most recently loaded trace file(s) for example.

Provide some administrative tools for purging old data

Purge all the data, or all the data prior to some date, and some options like that.

The progress so far has been encouraging. Loading in a trace file with a known name is now a trivial task, and basic reporting on the events is a doddle. For example I'm currently using the following command to load a single file:

(directory =>'UDUMP', -- 1
file_name => 'slimdb_ora_4680.trc', -- 2
debug => 0, -- 3
log_extents => true, -- 4
log_files => true); -- 5

The parameters are pretty intuitive:
  1. The name of the directory object that points to the location of the file to be loaded.
  2. The name of the file to be loaded (duh).
  3. The level of debugging message required (spat out through DBMS_OUTPUT).
  4. Whether to take a snapshot of the relevant entities in the dba_extents system view, to enhance later reporting (default "false" due to relatively high overhead).
  5. Whether to take a snapshot of the relevant entities in the dba_data files system view, to enhance later reporting (default "true").
With the data loaded I can then report in the following manner:

SQL> select * from all_trace_files;

---------- ------------------------------ ---------
132440 slimdb_ora_4680.trc 05-OCT-05

SQL> column SUM_WT_ELA format 999,999,990
SQL> column AVG_WT_ELA format 999,999,990
SQL> column waits format 999,990
SQL> break on report
SQL> compute sum of SUM_WT_ELA on report
SQL> select substr(WT_NAME,1,25) wt_name,
2 sum(WT_ELA) sum_wt_ela,
3 avg(wt_ela) avg_wt_ela,
4 count(*) waits
5 from last_trace_waits
6 group by substr(WT_NAME,1,25)
7 /

------------------------- ------------ ------------ --------
SQL*Net message from clie 27,372,317 9,124,106 3
SQL*Net message to client 14 5 3
db file scattered read 4,771,858 18,640 256
db file sequential read 1,490,384 3,644 409
sum 33,634,573

SQL> select CUR_DEP,
2 count(*)
3 from last_trace_cursors
4 group by cur_dep
5 order by 1
6 /

---------- ----------
0 2
1 391
2 2189
3 92
4 2
5 1

6 rows selected.

SQL> set pagesize 100
SQL> select substr(WT_NAME,1,23) wt_name,
2 substr(segment_name,1,20) segment_name,
3 sum(WT_ELA) sum_wt_ela,
4 avg(wt_ela) avg_wt_ela,
5 count(*) waits
6 from last_trace_obj_waits
7 group by substr(WT_NAME,1,23),substr(segment_name,1,20)
8 order by 3 desc
9 /

----------------------- -------------------- ------------ ------------ --------
db file scattered read FCT_FSA_1 4,698,204 19,908 236
db file sequential read INDSUBPART$ 3,963,720 1,149 3,450
db file sequential read C_OBJ#_INTCOL# 2,355,677 4,403 535
db file sequential read C_FILE#_BLOCK# 2,129,218 1,342 1,587
db file sequential read I_OBJ2 676,269 1,375 492
db file sequential read I_FILE#_BLOCK# 470,275 4,237 111
db file sequential read FCT_FSA_1 417,323 5,962 70
db file sequential read TABSUBPART$ 170,888 4,069 42
db file scattered read INDCOMPART$ 130,647 6,221 21
db file sequential read HIST_HEAD$ 62,906 7,863 8
db file sequential read I_INDSUBPART$ 61,507 1,922 32
db file scattered read TABCOMPART$ 57,753 6,417 9
db file sequential read I_CDEF2 34,272 4,284 8
db file sequential read SNAP$ 30,162 7,541 4
db file sequential read PARTOBJ$ 29,690 7,423 4
db file sequential read SUMAGG$ 24,404 6,101 4
db file sequential read INDCOMPART$ 19,534 9,767 2
db file sequential read SUMKEY$ 19,116 4,779 4
db file sequential read DIMJOINKEY$ 17,296 4,324 4
db file sequential read PARTCOL$ 15,068 3,767 4
db file sequential read TABCOMPART$ 14,988 7,494 2
db file sequential read SUMDELTA$ 12,143 12,143 1
db file sequential read I_TABSUBPART$ 11,146 5,573 2
db file sequential read I_CDEF3 10,627 10,627 1
db file sequential read I_SUBCOLTYPE1 10,131 10,131 1
db file sequential read I_NTAB2 9,696 9,696 1
db file sequential read I_CCOL1 9,366 4,683 2
db file sequential read C_COBJ# 9,322 9,322 1
db file sequential read SUM$ 8,722 4,361 2
db file sequential read I_OBJ#_INTCOL# 7,496 7,496 1
db file sequential read I_SUBPARTCOL$ 7,059 7,059 1
db file sequential read I_DIMATTR$_2 6,601 6,601 1
db file sequential read DEPENDENCY$ 6,432 3,216 2
db file sequential read I_HH_OBJ#_INTCOL# 6,267 6,267 1
db file sequential read I_DEFSUBPART$ 5,595 5,595 1
db file sequential read VIEW$ 5,231 5,231 1
db file sequential read DEFSUBPART$ 4,795 4,795 1
db file sequential read I_DEPENDENCY1 4,649 4,649 1
db file sequential read I_REFCON2 3,544 3,544 1
db file sequential read I_PARTCOL$ 2,705 2,705 1
db file scattered read SUM$ 742 371 2
db file sequential read I_OPQTYPE1 195 195 1
db file sequential read SUBPARTCOL$ 138 138 1
db file sequential read I_DIMLEVELKEY$_2 136 136 1
sum 15,541,655

The last two are pretty interesting, for a trace file based on a single star schema query, huh? Sorry about the crappy formating though.

There's still work to do, and at the top of my list are the following:
  • Allow grouping of trace files as a single logical entity
  • Add more reporting views, focusing on filtering out recursive statements and on a set of "MY_%" views
  • Associating recursive queries with their parents to enable reporting views based on CONNECT BY queries
  • Better handling of files
  • Creating a load script for the TRACELDR schema under which all this runs
  • Creating a Trace Loader role to allow other users to access the package and views.
Based on that you can probably expect to hear from me again around Christmas.

Nah, not really. Soon.


At 9:48 PM, Anonymous Mark said...

Nice idea, David - how do you avoid double-counting the numbers that relate to parent-child cursors (i.e. dep=n, where n>1)?

That's where I gave up trying to build my own trace file analyzer - too difficult working out which cursors' numbers are already counted.

I find the Holt/Millsap book to be the best explanation of trace file contents in the literature - so simple and obvious, yet a true revelation in the world of performance analysis...

SUM(ela) = SUM(cpu) + SUM(wait)

.. your SQL is either working its brains out or waiting in line to work its brains out.

At 10:08 PM, Blogger David Aldridge said...

At the moment I'm relying on being able to break out the cursors by their recursive depth ... as long as you don't sum the metrics across recursive depth then everything goes fine.

I think that a flexible approach is called for here, and while I was planning at first on automatically subtracting child cursor metrics from their parents', that still seemed suboptimal to me as it could misrepresent the total work involved in processing the parent SQL. I think that you need to be able to present the data both ways, so that you can "drill down" from total elapsed time to child elapsed time, for example, or present parent elapsed time minus child elapsed time.

So the plan that I'm working on at the moment is to present metrics for cursors alongside the metrics for their child cursors. I don't know whether it's feasible to do this dynamically through the views I'm writing, or whether to do it as part of the load process in which case my "operations" table would look something like:

create table trace_ops
op# number(30) primary key,
cursor# number(38) not null
references trace_cursors,
op_name varchar2(10) not null,
op_c number(38,0) not null,
op_e number(38,0) not null,
op_p number(38,0) not null,
op_cr number(38,0) not null,
op_cu number(38,0) not null,
op_mis number(38,0) not null,
op_r number(38,0) not null,
op_dep number(38,0) not null,
op_og number(38,0) not null,
op_tim number(38,0) not null,
op_child_c number(38,0) not null,
op_child_e number(38,0) not null,
op_child_p number(38,0) not null,
op_child_cr number(38,0) not null,
op_child_cu number(38,0) not null,
op_child_mis number(38,0) not null,
op_child_r number(38,0) not null

As long as the parent cursor number can be identified for the recursives, then this shouldn't be a show-stopper. I hope.

Thanks for the feedback Mark.

At 3:00 PM, Blogger Jan said...

Just an fyi, since you mentioned the Oracle Wait Interface book, i took a stab at implementing the event data collector:

Comes in quite handy every now and then :-)

Love the Millsap & Holt book. Went to the Hotsos class (pd101), with Millsap instructing. Great class, great fun! Very informative, loved it! I'll go to their sql class as soon as my boss lets me ;-) So should everybody else!


At 1:36 PM, Blogger Niall said...

Hi David

You said

So while staring at the ceiling in the dark one night I had the following thoughts:

1. "Load the raw files into the database then report on them from there."
2. "That means I can link the I/O-related wait events directly to the data dictionary to get object and file names automatically".
3. "That means we'd have a history of the the exact wait events and the real execution plan associated with a query within the database itself"
4. "That's so obvious that it must have been done before."

5. "Even if it has, I bet I could do it easily and customize it for my own purposes."

It has been done before, you were correct. :) The officially supported trace analyzer utility from Metalink does exactly this:) There are a number of bug fixes in the version 2 that just came out.

Simple Profiler is an attempt to do step 5 as well.

Anyway, I have a couple of further - maybe more useful - comments.

1. take a look at the trcsess utility that comes with 10g, it will do some aggregation on pre-10g trace files as well.

2. 10g R2 redoes the format of the trace files, in a way that is both helpful, and may well break your utility - and mine :(

3) I found getting the explain plan from the STAT lines a real PITA (and Carlos Sierra evidently did with trcanlyzer as well) - my hypothesis is that Oracle is writing the explain lines in a really lazy fashion, but you might want to double check that you get all of them.

word verification is pjmmem which must surely be a linux kernel parameter.

At 4:59 PM, Blogger David Aldridge said...


yes I had a look at trace analyzer ... it chewed on one of my trace files for an hour before I put it out of it's misery. Documentation seemed a little sparse, but maybe I should make sure I look at the most recent version.

However I must say that I've found the process of trying to write this utility to be so educational on the subject of trace files that it's worth it just for that alone.


Post a Comment

Links to this post:

Create a Link

<< Home