.comment-link {margin-left:.6em;}

Oracle Sponge -- Now Moved To Wordpress

Please use http://oraclesponge.wordpress.com

Thursday, April 28, 2005

Writing Good SQL

This is prompted by a question asked by eastmael at the Oracle DBA Forum -- "what I'd like to accomplish are optimized SQL statements ... do you guys have any tips, links, or resources that could help me with my study of efficient/optimized SQL statements?" The topic deserves a little more then the brief forum answer, so here goes for a more considered approach.

This is a fundamental question that ought to be asked more often, and it's to eastmael's credit that he (just guessin' there) is asking now when just starting out on the topic. We see an enormous number of forum postings that ask, "Why is this SQL performing badly?", "How do I get this to ... ?", "Is there a faster way of ...?" and very few that address the basic educational issues. So instead of handing out fish I'll see if I can teach/suggest how to catch your own.

I think that there are at least five elements to consider. In no particular order:
  • Understanding what you are trying to do
  • Knowledge of SQL syntax
  • Appreciation of the data and the structure
  • Optimization
  • Measurement of performance
Which is to say that if you are going to write top-notch efficient SQL then you must do the following:
  1. Understand what question you are asking of the database
  2. Know how to write multiple SQL statements to ask that question
  3. Know about the data and the structures being queried
  4. Understand how the optimizer works with different statements in the context of the data and structures present
  5. Know how to read and measure how the statement is optimized and executed.
Piece of cake, once it's broken down like that.


1: Understand what question you are asking of the database

Most SQL statements in an application serve some aim that can be expressed in business terms. Or in other words, what question does the SQL purport to answer? "Get a list of all customers who were not billed last month" ... "Work out what the total amount is to be billed"... "Does this customer have multiple shipping addresses?" ... "Fire all the Capricorns".

Phrases like "Work out ...", "Get ...", "Change ..." sound to some people distressingly informal and unscientific, but I like 'em. They help to keep our minds in the real world, in which we are solving real business problems. It is unlikely that your code is going to be submitted to the Nobel Prize committee, distressing and unfair as that sounds, so imagine that someone has asked you "So, what does this bit of code do then?" and just write down the answer. and when I say "write down the answer" I mean "write down the answer in a comment block immediately above where you are going to write the code".

So aside from reminding us what the code is going to do, what else does it help with?

After you've gained some experience working with the data you'll find that the statement will give you some instinctive performance clues. "Let me see now, 'Get the employee's length of service'. Two seconds? What the... ?!? Two seconds is way too long for that!" Do everyone a favour and keep that monologue internal, though, thanks. I might be sitting at the next cube to you one day, and I wouldn't want to have to smack you or nuttin'.

Alse at some point in the future the application or the data may (actually I mean "will") change. Maybe the performance will suddenly go pear-shaped and people will be crawling over the code trying to find out what the problem is. If you followed the philosophy on commenting the code that I expressed here then this statement is going to help future maintainers of the system to keep your code efficient and correct.


2: Know how to write multiple SQL statements to ask that question

Of all of the suggestion that I'm making here, the one that probably requires the least amount of hard work is the knowledge of the SQL syntax because we can start by just reading the Oracle SQL Reference (see link in side-panel). There are a number of sections to this document but since nearly all of them are of interest you might as well read the whole lot.

You cannot write good SQL unless you understand the data types you are working with. The most misunderstood and abused type appears from my own observations to be the DATE. Sometimes people seem to have an almost instinctive fear of using the DATE type, possibly assuming that there is an enormous overhead associated with it (either in terms of storage or in terms of processing time) and they start storing dates and times in character or number columns, or splitting out times from the date and using multiple columns. The DATE datatype is your friend -- if you think that you can work out the number of days between two arbitrary dates faster than the built-in Oracle functions then you're wrong, and in any case don't forget that the built-in functions are tested on your behalf billions of times a day, by users on multiple platforms, versions, and configurations.

So read all about data types, and pay attention to how you convert between them. Understand what precision and scale mean on a numeric type, and how they affect storage. Understand the difference between VarChar2 and Char. Never rely on implicit datatype conversion, especially between date types and character types -- always code the conversion yourself.*3

Practice both forms of the CASE expression.

Play around with all the forms of condition. I've met people with some years of experience who didn't know that just as you can use a BETWEEN condition, you can also use a NOT BETWEEN condition.WTF, dude? It's there in black and white.

Some conditions do pretty much the same thing in slightly different ways. There are no conditions that are inherently bad to use, just bad applications of them in particular circumstances, and you'll learn those by reading and measuring (later on I'll talk about measuring).

There are around 170+ built-in SQL functions in Oracle 9i. You don't have to learn them all, but you do need to know which ones exist. If you think that a particular built-in function ought to exist then it probably already does. If it doesn't already exist then you can probably combine and nest a few functions to do what you want. String manipulations like InStr() and SubStr() seem to be overlooked by a lot of beginners who want to extract the next contiguous block of characters after the second occurance of a space, for example. Translate() is great for pattern checking to see if a string is in SSN format 999-99-9999 for example. You do not understand any of these functions until you have solved a real problem with them.

Work to understand the aggregate and analytic functions -- aggregates are extremely common and analytics are extraordinarily powerful. Look at Tom Kyte's website for many, many examples of problems solved with the application of analytic functions -- hint: search for the phrase "analytics rock" ;)

The whole of the section on SQL Queries and Subqueries is essential reading.

It is a fundamental property of relational databases that the result of a query can be used as the input for another query -- not only fundamental in terms of syntax but fundamental in terms of performance. Read about queries and subqueries, and in-line views -- the Oracle query optimizer is very good at using them. Subquery factoring is a powerful technique in the right circumstances, but until you've played around with it a while you won't get any instinctive grasp of when it can be used.

Many people don't seem to have heard of hierarchical queries, which use the CONNECT BY clause. This is used for querying an hierarchical data structure such as the decomposition of manufactured items into their component parts, or for manager-employee relationships. If you've read this section in the documentation then you already have an advantage over them.

The set operators of UNION [ALL], MINUS and INTERSECT are often strangely neglected. I like them -- a query that uses them is often much easier to read than other techniques for the same result. Understand the difference between UNION and UNION ALL -- by default use UNION ALL unless you need to use UNION. Many people seem to get this the wrong way round.

When people talk about performance they are othen talking about SELECT statements, which tend to be more complex and more common than DELETE's, INSERT's and UPDATE's. Focus on SELECT first to get a grip on the fundamentals. Performance problems in SQL statements usually come from two factors:
  • Finding rows
  • Changing rows
Selects need to find rows, inserts need to change rows (well, sort of), and deletes and updates need to do both. There is more "change overhead" in updates than in deletes or inserts. These are fundamental differences that you must bear in mind in order to anticipate potential sources of performance problems.

Now that you understand the syntax you ought to understand more about what Oracle does with it. There is no section of the Concepts Guide that you can afford to completely ignore, but you can start with the section on Schema Objects, then brush up on your knowledge of Datatypes, then read through the section on SQL Overview (in particular how SQL is executed).


3: Know about the data and the structures being queried

If you're going to write SQL then obviously you're going to need to know at least the bare minimum about the tables and columns which you are addressing, but it's important to note that the names of the tables and the names and dataypes of the columns are absolutely the bare minimum, and like most bare minima they are only barely sufficient to make your work barely functional.

There is so much more information available to you that at first pass it threatens to boggle the mind. In the Concepts Guide you can read about some significant differences between different table types, for example -- the default heap table, the index-organized table, the cluster (both index and hash based) to name a few. You may not be responsible within your organization for deciding which of these is appropriate, but if you appreciate the strengths and weaknesses of each then you can not only leverage them where they are found in your schema but you can also advise others on any benefits and detriments in their use and non-use in your environment. You will be a hero.

Two quick things to add in there. Firstly, wherever there is choice between two options, whether in table type or in SQL structure or in where you buy your groceries there are always strengths and weaknesses to each option. If one of the options was 100% strengths and 0% weaknesses then the other options would not exist (OK, except maybe for backwards compatibility purposes). Secondly, whenever you accept a default you are making a decision, conscious or unconscious, to reject the alternatives. If your schema is 100% vanilla heap tables then you have made a decision to reject hash clusters and partitioning and index-organized tables etc. -- whether you realise it or not, and whether you act through ignorance or through informed choice, is determined by how much effort you have put into your own search for knowledge. So you'd better try to make that choice a conscious one, right?

Moving along ...

So to write efficient SQL you have to be very aware of the structures that you are using. Not only the table type, but what indexes are present, and what type of index they are. You have to be aware of the nature of the data. Not just the datatype but how many distinct values are there likely to be? Are common values going to be clustered together or randomly distributed throughout the table? Learn about index clustering factors -- it is not the percentage of rows to be retrieved that determines whether an index is likely to be used but the percentage of blocks, and the clustering factor is used by the optimizer to help estimate this. Howard Rogers' Dizwell web site has an excellent explanation.

Learn how to query the data dictionary -- the Database Reference lists all of the system views that you need in order to find out what table, columns, indexes, constraints, source code, materialized view ... in short what objects are stored in the database, and a skill in querying the data dictionary views is priceless. If someone suggests that you use a GUI tool then see if it will tell you all the tables that have a column with a particular name, or will list all the tables that have no primary key, or which have no foreign keys against them from other tables. It probably won't do it. But you will be able to if you know how to query these views. Unfortunately you'll then have to be the go-to guy/girl for such oddball requests -- it's the price of competence in any field of endeavor so you'll just have to suffer with that. Also, ask for a pay-rise.

Work to understand the business and it's data. Understand how existing applications use the data. Draw sketches of how the tables relate to each other.

4. Understand how the optimizer works with different statements in the context of the data and structures present

A long title for a challenging subject.

Oracle provides (for free!) a very long document on the subject of Performance Tuning. You cannot write efficient SQL unless you understand the optimizer, and unless you are forced to use the Rule Based Optimizer (RBO) then you should be using the Cost Based Optimizer (CBO).

Understanding the functioning of the CBO is no trivial matter, but that doesn't excuse a merely superficial knowledge of it. It's internal workings are very well documented both in the Oracle documentation and extensively on the internet, and there is no-one out there writing efficient SQL who does not know how it works. The websites of Jonathon Lewis and Tom Kyte are also choc-a-bloc with information.

Learn about the choices that the CBO can make -- what order to join the tables in, how to access the tables, what kind of join to use, how the joins work and what their advantages and disadvantages are. Learn why full table scans are Not Always A Bad Thing, and why index-based access is Not Always A Good Thing.

Why is a hash join beneficial in joining large data sets? Why does it only work with equality joins? Does indexing help with it?

Learn how the optimizer uses indexes, and how different types of indexes relate to NULL values.

Read all about how the optimizer uses statistics. Learn how it is affected by initiation parameters such as cpu_count and optimizer_index_cost_adj.

The CBO is based on statistics -- statistics about tables, statistics about indexes, statistics about the distribution of values within a column. Learn where the statistics are sorted, how they are calculated, and how they are used.

Learn how to manipulate the decisions made by the CBO through the use of optimizer hints, and then try not to use them in your code anyway. If you think you need them then look for other explanations for the problem you are trying to solve -- inadequate statistics or missing constraints, for example.

Perhaps more than any other element of the Oracle RDBMS the cost based optimizer is constantly changing. It gets new access methods, new initiation parameters, new ways of accessing data that have even been patented. Stay alert to changes between versions and between patch levels.

You will never know everything about the optimizer,
because there is so much to learn and because it keeps evolving.

But try anyway.



5: Know how to read and measure how the statement is optimized and executed

Question: "My query runs too slowly. How can I improve it's performance?"
Answer: "By understanding and measuring it's performance"

Oracle now has very extensive internal mechanisms for documenting and measuring just about everything to do with SQL execution, and it is all at your fingertips.

Probably the simplest method is to type into SQL*Plus the command "set timing on", and you'll get wall clock timings of how long your statement took to execute.

In the Database Performance Tuning Guide and Reference there is a section which will tell you how to invoke and read an "Explain Plan" through SQL*Plus -- this will show you how the optimizer expects to get the result you have asked for. Remember that SQL is a Fourth Generation Language (4GL) in which you describe the result set, not how to get it. The explain plan tells you how the optimizer is going to tell the database instance to get the result, and the documentation tells you exactly how to read the cryptic-at-first-glance output.

When you understand the explain plan then learn about SQL Trace, in which details of the actual execution of the SQL is written to a server trace file. The execution plan here may be different to the one you saw through SQL*Plus, so understand why that is.

I don't recall ever using the methods documented in the Database Performance Tuning Guide and Reference -- the command line interface, the initialization parameters, and the PL/SQL methods. I jump straight to using "alter session set sql_trace = true;" or the 10046 event invoked with syntax such as "alter session set events '10046 trace name context forever, level 4;". Search for information about these on Tom Kyte's website, on Metalink, on Oracle forums, and wherever you can find it. They give you access to the Oracle "wait interface" that will tell you exactly what is making your SQL take as long to execute as it does, and with experience you will be able to translate into optimization problems or structural problems -- wrong/missing indexes for example, or an over-enthusiasm for them.


Final Thoughts

  1. You are going to be reading a lot of opinions in your pursuit of SQL excellence, so here is my opinion on finding good information.
  2. Add comments to your code.
  3. When you screw up, admit it.
  4. Test every assumption.
  5. Benchmark every feasible alternative.
  6. Don't be afraid to ask questions.
  7. Never stop learning.
  8. Erm ...
  9. That's it. Thanks for reading.


Acknowledgements

Thanks to eastmael for asking the question that prompted the original article.

Very many thanks to rbaraer and pete_s for suggestions on material, corrections, and encouragement along the way. I probably would never have finished without their feedback.

Further comments and contributions of any kind are always very welcome.

10 Comments:

At 1:07 AM, Anonymous Anonymous said...

Very good starting point in my opinion, I like the way you present things. Excellent for beginners, and not so beginners as well :-). People like you and Tom Kyte really contribute to the Oracle Community.

PS : maybe in your paragraph about datatypes you could also have talked about avoiding implicit conversions, because it seems to me that such conversions are quite commonly used, especially for DATE fields.

 
At 5:04 AM, Blogger Pete Scott said...

"When people talk about performance they generally are often talking about SELECT statements"
perhaps that is because "the business" ask questions of the database and have the loudest voice when things seem slow, the behind the sceens stuff is only noticed by IT.
But great piece, lucid writing, wish mine was the same.

(oh before others complain ;-) I only do data warehouse - OLTP is not the same....)

 
At 11:02 PM, Blogger David Aldridge said...

Thanks rbaraer.

This subject appealed to me because at first it seemed so intractable -- so many influences, so many documents to read, so many techniques, but to me that usually means that you just have to decompose the problem. It must be the engineer in me.

And thanks for the suggestion -- very important issue. I've included it in the latest version.

 
At 12:57 AM, Anonymous Anonymous said...

Really excellent material ! With your permission, once this post is finished, I'll definitely keep the link somewhere to show anyone that wants to learn how to write efficient SQL with Oracle. Really excellent for beginners IMHO.

BTW, thanks for taking my suggestion about implicit conversions into account.

Another one maybe, I think you could insist on using the CBO, because I think a beginner might wonder "why not the RBO ? Someone told me the CBO could hardly be trusted, so use the RBO". So a little argumentation may show the main differences, and why to use the CBO, I think from there on, the documentation you point is the way to go for details.

What do you think of arguments like :

- the CBO takes into account data volumes, data distribution and many other statistics about YOUR environment, while the RBO doesn't, it just follows static rules supposed to be always true whatever the environment
- the CBO considers more access paths than the RBO
- the CBO will not always favour indexes as the RBO does : it might prefer a Full Table Scan when appropriated
- and maybe most importantly the RBO is not supported anymore in 10g, and will never be again, which means no more upgrades, while the CBO will continue to get smarter and smarter and use new available access paths.

That "someone told me" story makes me think of the "prove-it" approach Tom Kyte advocates. I think this is also important advice for beginners, so you could add something like : only trust people that can show you what they say with examples and hard numbers (I love when Tom Kyte says so :-)), and only take advice for what it is, not as truth. In the end, always test and benchmark on YOUR own environment to see which solution works best for YOU.

Well, it's a little long... I hope I won't have bothered you ;-).

PS : you have written "sotred" instead of "sorted" at the 7th paragraph of OPTIMIZATION.

 
At 10:38 AM, Blogger David Aldridge said...

Pete,

There's a lot of truth in that, but also the select statement is fundamental to many inserts and to deletes and updates also. The latter especially have a strong dependency on finding particular rows in the most efficient way, but also have more complex overheads associated with redo and rollback that I didn't want to write about in this article.

 
At 11:02 AM, Anonymous Anonymous said...

"More tomorrow ..."

Hummm. ;-)

Please don't forget to end this post, it's really a nice piece of advice, and I think that once finished, you should put it in your "articles" menu or something like that, so that it doesn't get lost among all the other posts. It definitely deserves better.

 
At 11:15 AM, Blogger David Aldridge said...

Oh, forgot about this.

Well I think I'll have a good read through it again and maybe rephrase some stuff, 'cos I think it's mostly done. I don't think that the blog software let's me post a pdf or whatever, so i guess it'll just have to stay in this sort of format.

Thanks for the reminder, rbaraer.

 
At 4:43 AM, Anonymous Anonymous said...

You're welcome.

Many thanks for finishing this post. I now keep this link preciously. :-)

Again, very nice piece of work.

Best regards,

RBARAER

 
At 1:19 AM, Blogger DaPi said...

Excellent stuff slim!

My 2ยข: based on 25 years of working with db's, with SQl or not, the most valuable piece of advice you have given is:

Draw sketches of how the tables relate to each other.

 
At 7:13 AM, Blogger Phantom Nitpicker said...

I think you mean "its" in the following:

Work to understand the business and it's data.

It's internal workings are very well documented...

How can I improve it's performance?

By understanding and measuring it's performance.

 

Post a Comment

<< Home