Jun 182014
 

We’ve had one move great meeting of BGOUG

I came back in the role of a lecturer for a session regarding Real-world DataGuard (click to download). I am happy the session was well-attended, and there was also good communication with the audience.

In the evening we had quite a chat with SPI, Radu Parvu and Ciprian Onofreiciuc. Networking, as they call it. Until half past three.

I left the second day, right after the last session of Joze Senegacnik. For the next meeting I’ll do my best to stay for the whole three days, brcause teh best p2p chats usually happen on the 2nd evening.

 Posted by at 16:10

Average DB load, RAC edition

 Oracle  Comments Off on Average DB load, RAC edition
Feb 172014
 

A few days ago I posted a query that returns the average load for given database per day, week or month. Works great on single instance databases, but not in RAC environments

RAC databases have one line per instance in DBA_HIST views, so my strategy of subtracting the value from the previous line returns very strange results. So here is the RAC-aware version of the query:

select trunc(this_day, 'MONTH'), avg(day_db_time), sum(cnt) scnt
  from (select this_day, count(1) cnt,
                sum(case
                       when snap_db_time > 0 then
                        snap_db_time/1000/1000/60/60/24
                       else
                        null
                     end) day_db_time
           from (select trunc(begin_interval_time) this_day, tm.snap_id,
                         value - lag(value, 1) over(order by tm.instance_number, tm.snap_id) snap_db_time
                    from dba_hist_sys_time_model tm, dba_hist_snapshot s
                   where tm.stat_name = 'DB CPU'
                     and s.snap_id = tm.snap_id
                     and s.instance_number = tm.instance_number
                     and s.begin_interval_time > add_months(trunc(sysdate, 'YEAR'), -24)
                     and s.begin_interval_time < trunc(sysdate, 'MONTH')
                     )
          group by this_day)
 group by trunc(this_day, 'MONTH')
 order by 1

The only difference is, I'm careful which is the instance on the previous line. This statement works also on single-instance databases.

 Posted by at 19:38

Average DB load

 Oracle  Comments Off on Average DB load
Feb 052014
 

Sooner or later, the year ends. Again, it’s time to do some reports, to plan a budget, to say “The overall performance of our database is this and this”. But how to find this data?
As I have told already, we keep AWR reports for very long period. One I calculated the database space usage growth, in order to plan the storage needs; now I need tha database server load.

select trunc(this_day, 'MONTH'), avg(day_db_time), sum(cnt) scnt
  from (select this_day, count(1) cnt,
                sum(case
                       when snap_db_time > 0 then
                        snap_db_time/1000/1000/60/60/24
                       else
                        null
                     end) day_db_time
           from (select trunc(begin_interval_time) this_day, tm.snap_id,
                         value - lag(value, 1) over(order by tm.snap_id) snap_db_time
                    from dba_hist_sys_time_model tm, dba_hist_snapshot s
                   where tm.stat_name = 'DB time'
                     and s.snap_id = tm.snap_id
                     and s.begin_interval_time > add_months(trunc(sysdate, 'YEAR'), -24)
                     and s.begin_interval_time < trunc(sysdate, 'MONTH')
                     )
          group by this_day)
 group by trunc(this_day, 'MONTH')
 order by 1

This query is a tricky one. It gives the average load (DB time) for every month in the last 2 years, if you keep such amount of snapshots. If you keep only one month, you can report by day - it it is not that impressive.

One of the detail worth mentioning is, DB time (just like every system stat) is an ever growing number, counted since the instance restart. The query takes the value for every snapshot and subtracts the value of the former snapshot. If the instance is restarted, the number is negative - because the counter starts form zero. To build proper stats, I simply skip such numbers. Instance restart happens so rarely, so I think it cannot influence the stats so much.

SQL> select sysdate, startup_time, trunc(sysdate-startup_time) Days_since_restart from v$instance;

SYSDATE   STARTUP_T DAYS_SINCE_RESTART
--------- --------- ------------------
05-FEB-14 19-FEB-13                350

The next thing I do in the query is sum the numers by day, and get he average for each month.

To be honest, the query took me quite a while to create. If you see some error, please share your thoughts in the comments below

Once I have the data, I draw a graph in Excel - one line for DB time, one for DB CPU (where tm.stat_name = 'DB CPU') and, to put some business context - number of business transactions for the month (which represents the actual, real-world "work done"). The I show to the Bug Boss how we do more and more actual work, but manage to keep the database load almost constant. In some months there's even decrease in the load - great example for the optimizations done. For example, we we have enabled result cache for some queries, this was easy to see on the graph; optimizations to the stats gathering procedure was also easily seen. The good admins alway suffer from lack of visibility of their efforts. It's hard to show the lack of problems. So a graph like this speaks the same language as the people which decide on salaries and head count.

Another interesting graph is the growth over some period - e.g. a year. "During Jan 2014 we made X% more payments through our system compared to Jan 2013, but the database server load growth is only Y%". Every manager aware of Oracle's CPU license cost will understand that this means.

P.S. RAC-aware edition of this query can be found here

 Posted by at 10:03

Recreating database jobs

 Oracle  Comments Off on Recreating database jobs
Apr 122013
 

I had to do this recently, and I will write down the scripts for future reference. The request was to recreate specific database jobs with a new user. But the scripts can be adapted for any bulk change of database jobs.

For the standard dbms_job I used the following:

create or replace function get_dbms_job_ddl(v_job in number) return varchar2 is
  v_tmp varchar2(4000);
begin
	dbms_ijob.full_export(v_job, v_tmp);
	return v_tmp;
end;
/

select 'dbms_ijob.remove(' || job || ');'
  from dba_jobs
 where log_user = 'OLDUSER';

select replace(get_dbms_job_ddl(job), '''OLDUSER''', '''NEWUSER''')
  from dba_jobs
 where log_user = 'OLDUSER';

If you have some job with longer WHAT, you may have to use CLOB instead of varchar2(4000).
Also be careful with the REPLACE, as it may break something in the job. So proofread the produced script.

For DBMS_SCHEDULER jobs I used (warning! requires additional changes! please read the comments after the script):

select 'exec DBMS_SCHEDULER.drop_job (job_name => ''' || owner || '.' || job_name || ''');'
  from dba_scheduler_jobs
 where job_creator = 'OLDUSER';

select dbms_metadata.get_ddl('PROCOBJ', job_name, owner)
  from dba_scheduler_jobs
 where job_creator = 'OLDUSER';

I must say that DBMS_METADATA has some difficulties on this matter. First, the type of an object to be exported is ‘PROCOBJ’ (PROCedural OBJect), which sounds to me like some quick-and-dirty patch to the code. It also generated tons of quotes and double quotes, which I had to replace in a text editor. And the real problem is, it did not put and “owner” in the job creation scripts, so I had to put it manually.
Anyway, id helped me a lot.

 Posted by at 8:58

Query transformation in action

 Oracle  Comments Off on Query transformation in action
Apr 102013
 

I really enjoy when I see some small detail in the huge and complex Oracle Database. I want to tell you about one such case.

It started with an urgent email from a Development manager. The new functionality ХYZ works slow (12-13 sec), and they have identified the problem to be in one specific query:

select .... from SOME_TABLE where IP = :1 and ACTIVE_DATE > current_timestamp - :2;

I needed only a few seconds to see that it uses full table scan on a multi-million-row table. There is no index neither on IP, nor on ACTIVE_DATE. Of course the whole table is in memory, after such heavy usage. But even LIOs, when you use millions of them, take time.

I asked if they could change the app to use some of the existing indexes. Well, no, we can’t.

OK, the we’ll make an index on (IP, ACTIVE_DATE). I have created the index on one of the DEV databases and I asked the QAs to test is. There was not effect. Blah!

I decided to prod Oracle to take it more seriously. What a moron! I’ve create the perfect index for the query, and it didn’t even notice! I have started the Тuning advisor to fix this. An then – surprise, surprise – Tuning advisor could not use the index too! But it gave me a hint where the problem is:

The predicate TO_NUMBER(“IP”)=:B1 used at line ID 1 of the execution plan contains an implicit data type conversion on indexed column “IP”. This implicit data type conversion prevents the optimizer from selecting indices on table “SOME_TABLE”

That’s a good tool! As a former developer I instantly understood what happens. Isn’t it a pity – after so many years with newer and better IDEs and frameworks, the errors are the same.

The problem is that the column in the table is VARCHAR2. However, the application binds a NUMBER. If we did not use binds, this should have been visible with a naked eye. But in that case I had to take a look at V$SQL_BIND_CAPTURE. Or even better, ask some developer to check – and fix – the java code

Now I want to share one thought about OEM CC. It s very good, very convenient and useful product, but at that case I’ve waisted some time with it. I could have seen this information at the beginning, if I wasn’t too lazy to use dbms_xplan.display_cursor. Here’s what it shows (check the filter information):

SQL_ID  436wr79chvhqx, child number 0
-------------------------------------
select ...
  from SOME_TABLE 
 where IP = :1
   and ACTIVE_DATE > current_timestamp - :2
 
Plan hash value: 2330990216
 
----------------------------------------------------------------------------------------
| Id  | Operation         | Name               | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |                    |       |       | 12528 (100)|          |
|*  1 |  TABLE ACCESS FULL| SOME_TABLE         |     1 |    87 | 12528   (2)| 00:02:31 |
----------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - filter((TO_NUMBER("IP")=:1 AND 
              "ACTIVE_DATE">CURRENT_TIMESTAMP(6)-:2))

And here’s the tiny piece of Oracle which made my day. The developers fixed the code, binded the IP as a string. The DB started using the index. Now look at the following:

SQL_ID  4ktc94z4ypdpu, child number 0
-------------------------------------
select ...
  from SOME_TABLE 
 where IP = :1
   and ACTIVE_DATE > current_timestamp - :2
 
Plan hash value: 3569887601
 
-----------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name                  | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                       |       |       |     4 (100)|          |
|   1 |  TABLE ACCESS BY INDEX ROWID| SOME_TABLE            |     1 |    87 |     4   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | I_SOME_TABLE_IP_TIME  |     1 |       |     3   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("IP"=:1 AND "ACTIVE_DATE">CURRENT_TIMESTAMP(6)-:2 AND 
              "ACTIVE_DATE" IS NOT NULL)

Did you notice the new predicate that appeared from nowhere: AND "ACTIVE_DATE" IS NOT NULL? The optimizer did this!

Here’s what happens. The b-tree indexes do not store null values. If I want to use an index, I have to be sure that the user do not need the rows with null in the column. In other words, Oracle needs to know, that the user issuing the query did not request this data. On the other hand the optimizer knows the no date is lower (or higher) than null. That’s why it dares to add to the where clause one predicate, which lets him use and indexed access.

I like it!

 Posted by at 7:21

Parallel problems – part 2

 Oracle  Comments Off on Parallel problems – part 2
Apr 012013
 

(to part 1)

Day N

OK, now I’m ready for the big gun – Metalink notes 729903.1, 400886.1 and 444164.1. In those notes I found a way to trace the parallel execution. It’s not very well explained, so the interpretation of the logs will be real guesswork (have this in mind – i may be wrong).

I’ve asked the developers to enable _px_trace in the ETL code. One emergency release and we’re ready to wait for tonight execution.

Day N+1
F*CK! Last night the execution was fast 🙁

The only change in the code is enabling _px_trace for the session, which starts this specific query (and disabling it after the query completion). Nothing else!

Fast or slow, I can’t wait to take a look at the trace and see if it is readable. Remember, I’m not an expert on reading PX traces.

This is what I found at line 5694 (yes, I’ve scanned the first 5693 lines without finding something related to my case):

2013-02-20 01:53:58.837069*:PX_Granule:kxfr.c@5598:kxfrFlushObjs(): flushed 4436 objects and 0 temp objects
        kxfrialo                                                       [   24520/ 23550]
          best object 0x21f7ab9ce0
            hgt:0 blks:21872380 acp:0 nds:1 thr:4
kxfrialo                                                       [   24520/     0]
          threads requested = 4 (from kxfrComputeThread())
kxfrialo                                                       [   24520/     0]
          adjusted no. threads = 4 (from kxfrAdjustDOP())
kxfrAllocSlaves                                                [   24520/     0]
          DOP trace -- call kxfpgsg to get 4 slaves

Looks like we’ll use parallel 4. Remember, for parallel execution X you need at least 2 x X threads – X producers and X consumers.

2013-02-20 01:53:58.837069*:PX_Messaging:kxfp.c@9448:kxfpgsg():         reqthreads:4 KXFPLDBL/KXFPADPT/ load balancing:on adaptive:on
2013-02-20 01:53:58.837069*:PX_Messaging:kxfp.c@17464:kxfpiinfo():      inst [cpus:mxslv]
2013-02-20 01:53:58.837069*:PX_Messaging:kxfp.c@17468:kxfpiinfo():      1    [16  :32   ]
2013-02-20 01:53:58.837069*:PX_Messaging:kxfp.c@22218:kxfpGetNumActiveSlaves():
        number of active slaves on the instance: 24,number of active slaves but available to use: 0
2013-02-20 01:53:58.837069*:PX_Messaging:kxfp.c@17785:kxfpclinfo():     inst(load:user:pct:fact:servtarget:queued:started:granted:active:active(free)): aff
2013-02-20 01:53:58.837069*:PX_Messaging:kxfp.c@17791:kxfpclinfo():     1   (30  :1   :100:53  :256   :0     :0      :0      :24    :0     )
          load adapt num servers requested to = 4 (from kxfpAdaptDOP())
kxfpgsg                                                        [   24520/     0]
          getting 2 sets of 4 threads, client parallel query execution flg=0x230

If I read this correctly, 24 slaves are already active. Wtf? Let’s see what follows

          Height=4, Affinity List Size=0, inst_total=1, coord=1
          Insts     1
          Threads   4
kxfpg1sg                                                       [   24520/     0]
          q:0x22ffaeb5e0 req_threads:4 nthreads:4 unit:1 #inst:1 normal
          jStart:0  jEnd:32  jIncr:1 isGV:0  i:0 instno:1 kxfpilthno:4
kxfpg1srv                                                      [   24520/     0]
          trying to get slave P000 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24520/     0]
          trying to get slave P001 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24520/     0]
          trying to get slave P002 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24520/     0]
          trying to get slave P003 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24520/     0]
          trying to get slave P004 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24520/     0]
          trying to get slave P005 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24520/     0]
          trying to get slave P006 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24520/     0]
          trying to get slave P007 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24520/     0]
          trying to get slave P008 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24520/     0]
          trying to get slave P009 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24520/     0]
          trying to get slave P010 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24520/     0]
          trying to get slave P011 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24520/     0]
          trying to get slave P012 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24520/     0]
          trying to get slave P013 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24520/     0]
          trying to get slave P014 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24520/     0]
          trying to get slave P015 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24520/     0]
          trying to get slave P016 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24520/     0]
          trying to get slave P017 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24520/     0]
          trying to get slave P018 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24520/     0]
          trying to get slave P019 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24520/     0]
          trying to get slave P020 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24520/     0]
          trying to get slave P021 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24520/     0]
          trying to get slave P022 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24520/     0]
          trying to get slave P023 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24520/     0]

Looks like P000 to P023 are already busy, they do not want to work for us

kxfpg1srv                                                      [   24520/     0]
          trying to get slave P024 on instance 1 for q:0x22ffaeb5e0
          slave P024 is local
          found slave P024 dp:0x21319a0220 flg:0
kxfpcrel                                                       [   24520/     0]
          Creating local slave 24 flg:30
kxfpcre1                                                       [   24520/     0]
          Creating slave 24 flg:30
          free descriptor found dp:0x21319a0e48
          Allocated slave P024 dp:0x21319a0e48 flg:4
          Got It. 1 so far.
kxfpg1srv                                                      [   24530/    10]
          trying to get slave P025 on instance 1 for q:0x22ffaeb5e0
          slave P025 is local
          found slave P025 dp:0x21319a0220 flg:0
kxfpcrel                                                       [   24530/     0]
          Creating local slave 25 flg:30
kxfpcre1                                                       [   24530/     0]
          Creating slave 25 flg:30
          free descriptor found dp:0x21319a0da8
          Allocated slave P025 dp:0x21319a0da8 flg:4
          Got It. 2 so far.
kxfpg1srv                                                      [   24550/    20]
          trying to get slave P026 on instance 1 for q:0x22ffaeb5e0
          slave P026 is local
          found slave P026 dp:0x21319a0220 flg:0
kxfpcrel                                                       [   24550/     0]
          Creating local slave 26 flg:30
kxfpcre1                                                       [   24550/     0]
          Creating slave 26 flg:30
          free descriptor found dp:0x21319a0a70
          Allocated slave P026 dp:0x21319a0a70 flg:4
          Got It. 3 so far.
kxfpg1srv                                                      [   24560/    10]
          trying to get slave P027 on instance 1 for q:0x22ffaeb5e0
          slave P027 is local
          found slave P027 dp:0x21319a0220 flg:0
kxfpcrel                                                       [   24560/     0]
          Creating local slave 27 flg:30
kxfpcre1                                                       [   24560/     0]
          Creating slave 27 flg:30
          free descriptor found dp:0x21319a09d0
          Allocated slave P027 dp:0x21319a09d0 flg:4
          Got It. 4 so far.
kxfpPrepareJoin                                                [   24580/    20]
          grln:3 grlr:3 size:1536 var:486 evntCnt:11 eventSize:466 totSize:2014
kxfpg1sg                                                       [   24580/     0]
          wait reply from qref 0x22fd47e038
kxfpg1sg                                                       [   24580/     0]
          received reply from qref 0x22fd47e038
kxfpg1sg                                                       [   24580/     0]
          wait reply from qref 0x22fd481aa8
kxfpg1sg                                                       [   24580/     0]
          received reply from qref 0x22fd481aa8
kxfpg1sg                                                       [   24580/     0]
          wait reply from qref 0x1f23ebf38
kxfpg1sg                                                       [   24580/     0]
          received reply from qref 0x1f23ebf38
kxfpg1sg                                                       [   24580/     0]
          wait reply from qref 0x1f23ebc90
kxfpg1sg                                                       [   24580/     0]
          received reply from qref 0x1f23ebc90
kxfpg1sg                                                       [   24580/     0]
          got 4 servers (sync), errors=0x0 returning
kxfpg1sg                                                       [   24580/     0]
          q:0x22ffaeb5e0 req_threads:4 nthreads:4 unit:1 #inst:1 normal
          jStart:0  jEnd:32  jIncr:1 isGV:0  i:0 instno:1 kxfpilthno:4

OK, we’ve created 4 slaves so far (for the first set). We need 4 more for the second set…

kxfpg1srv                                                      [   24580/     0]
          trying to get slave P000 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24580/     0]
          trying to get slave P001 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24580/     0]
          trying to get slave P002 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24580/     0]
          trying to get slave P003 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24580/     0]
          trying to get slave P004 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24580/     0]
          trying to get slave P005 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24580/     0]
          trying to get slave P006 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24580/     0]
          trying to get slave P007 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24580/     0]
          trying to get slave P008 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24580/     0]
          trying to get slave P009 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24580/     0]
          trying to get slave P010 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24580/     0]
          trying to get slave P011 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24580/     0]
          trying to get slave P012 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24580/     0]
          trying to get slave P013 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24580/     0]
          trying to get slave P014 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24580/     0]
          trying to get slave P015 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24580/     0]
          trying to get slave P016 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24580/     0]
          trying to get slave P017 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24580/     0]
          trying to get slave P018 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24580/     0]
          trying to get slave P019 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24580/     0]
          trying to get slave P020 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24580/     0]
          trying to get slave P021 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24580/     0]
          trying to get slave P022 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24580/     0]
          trying to get slave P023 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24580/     0]
          trying to get slave P024 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24580/     0]
          trying to get slave P025 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24580/     0]
          trying to get slave P026 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24580/     0]
          trying to get slave P027 on instance 1 for q:0x22ffaeb5e0
kxfpg1srv                                                      [   24580/     0]

P000-P027 seem to be busy already (actually P024-P027 are busy for us)

          trying to get slave P028 on instance 1 for q:0x22ffaeb5e0
          slave P028 is local
          found slave P028 dp:0x21319a0220 flg:0
kxfpcrel                                                       [   24580/     0]
          Creating local slave 28 flg:30
kxfpcre1                                                       [   24580/     0]
          Creating slave 28 flg:30
          free descriptor found dp:0x21319a0698
          Allocated slave P028 dp:0x21319a0698 flg:4
          Got It. 1 so far.
kxfpg1srv                                                      [   24600/    20]
          trying to get slave P029 on instance 1 for q:0x22ffaeb5e0
          slave P029 is local
          found slave P029 dp:0x21319a0220 flg:0
kxfpcrel                                                       [   24600/     0]
          Creating local slave 29 flg:30
kxfpcre1                                                       [   24600/     0]
          Creating slave 29 flg:30
          free descriptor found dp:0x21319a05f8
          Allocated slave P029 dp:0x21319a05f8 flg:4
          Got It. 2 so far.
kxfpg1srv                                                      [   24610/    10]
          trying to get slave P030 on instance 1 for q:0x22ffaeb5e0
          slave P030 is local
          found slave P030 dp:0x21319a0220 flg:0
kxfpcrel                                                       [   24610/     0]
          Creating local slave 30 flg:30
kxfpcre1                                                       [   24610/     0]
          Creating slave 30 flg:30
          free descriptor found dp:0x21319a02c0
          Allocated slave P030 dp:0x21319a02c0 flg:4
          Got It. 3 so far.
kxfpg1srv                                                      [   24630/    20]
          trying to get slave P031 on instance 1 for q:0x22ffaeb5e0
          slave P031 is local
          found slave P031 dp:0x21319a0220 flg:0
kxfpcrel                                                       [   24630/     0]
          Creating local slave 31 flg:30
kxfpcre1                                                       [   24630/     0]
          Creating slave 31 flg:30
          free descriptor found dp:0x21319a0220
          Allocated slave P031 dp:0x21319a0220 flg:4
          Got It. 4 so far.
kxfpPrepareJoin                                                [   24640/    10]
          grln:3 grlr:3 size:1536 var:486 evntCnt:11 eventSize:466 totSize:2014

kxfpg1sg                                                       [   24640/     0]
          wait reply from qref 0x1f23eb740
kxfpg1sg                                                       [   24640/     0]
          received reply from qref 0x1f23eb740
kxfpg1sg                                                       [   24640/     0]
          wait reply from qref 0x1f23eb498
kxfpg1sg                                                       [   24640/     0]
          received reply from qref 0x1f23eb498
kxfpg1sg                                                       [   24640/     0]
          wait reply from qref 0x1f23eb1f0
kxfpg1sg                                                       [   24650/    10]
          received reply from qref 0x1f23eb1f0
kxfpg1sg                                                       [   24650/     0]
          wait reply from qref 0x1f23eaf48
kxfpg1sg                                                       [   24650/     0]
          received reply from qref 0x1f23eaf48
kxfpg1sg                                                       [   24650/     0]
          got 4 servers (sync), errors=0x0 returning

Phew, that was close. We have the other 4 slaves ready – P028-P031. Remember, the limit is set to 32 total.

2013-02-20 01:53:58.965839*:PX_Messaging:kxfp.c@10118:kxfpgsg():        Acquired 8 slaves on 1 instances avg height:4 #set:2 qser:893953
2013-02-20 01:53:58.965839*:PX_Messaging:kxfp.c@10128:kxfpgsg():        P024 inst 1 spid 31409
2013-02-20 01:53:58.965839*:PX_Messaging:kxfp.c@10128:kxfpgsg():        P025 inst 1 spid 31413
2013-02-20 01:53:58.965839*:PX_Messaging:kxfp.c@10128:kxfpgsg():        P026 inst 1 spid 31417
2013-02-20 01:53:58.965839*:PX_Messaging:kxfp.c@10128:kxfpgsg():        P027 inst 1 spid 31422
2013-02-20 01:53:58.965839*:PX_Messaging:kxfp.c@10137:kxfpgsg():        P028 inst 1 spid 31427
2013-02-20 01:53:58.965839*:PX_Messaging:kxfp.c@10137:kxfpgsg():        P029 inst 1 spid 31433
2013-02-20 01:53:58.965839*:PX_Messaging:kxfp.c@10137:kxfpgsg():        P030 inst 1 spid 31438
2013-02-20 01:53:58.965839*:PX_Messaging:kxfp.c@10137:kxfpgsg():        P031 inst 1 spid 31443
2013-02-20 01:53:58.965839*:PX_Messaging:kxfp.c@10144:kxfpgsg():        Instance(servers):
2013-02-20 01:53:58.965839*:PX_Messaging:kxfp.c@10147:kxfpgsg():        inst:1 #slvs:8

Let’s wait and see if we can get a slow execution tonight.

 Posted by at 6:41

Parallel problems – part 1

 Oracle  Comments Off on Parallel problems – part 1
Mar 272013
 

This one took me a lot of time for debugging. First, the process runs for an hour – hour and a half when all is fine and 4-5 hours when it is not. Second, it is part of an ETL process running in the night – it’s hard to see it running “live”. Third, sometimes it runs just fine, other times it is very slow – with no change in the system or code. And teh worst is, when I try to reproduce the issue, the query always runs fast as expected. and in the end it turned out the problem is not on this statement/process, but in one other, which runs in the same time. But let’s see the raw facts.

Day 1
The compleint is that one of the steps of the nightly ETL needed a lot more time to complete last night than then nights before. The ETL process is running as one relatively complex scheduler job chain with 30+ steps and comeplx dependencies. Fortunately the process is very well instrumented an one can easily see what step started when, what other steps were active at that time and so on.

Of course, nobody have changed anything – the last change in the ETL as last Wednesday, while the problem first appeared in Saturday (and Sunday, and Monday…). In such situations one can get some hints gorm teh graph generated by teh OEM Cloud Control. Here;s what i saw for one “fast” and one “slow” execution:

screenshot-1.

The symptoms are easy to see – only one session is working on this query, there’s no parallel execution. Let’s see the execution plan form the last run (last night):

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                               | Name                     | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     | Pstart| Pstop |    TQ  |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT                        |                          |       |       |       |    11M(100)|          |       |       |        |      |            |
|   1 |  LOAD AS SELECT                         |                          |       |       |       |            |          |       |       |        |      |            |
|   2 |   PX COORDINATOR                        |                          |       |       |       |            |          |       |       |        |      |            |
|   3 |    PX SEND QC (RANDOM)                  | :TQ10012                 |   445M|  1737G|       |    11M  (1)| 31:41:37 |       |       |  Q1,12 | P->S | QC (RAND)  |
|   4 |     HASH JOIN OUTER BUFFERED            |                          |   445M|  1737G|   779M|    11M  (1)| 31:41:37 |       |       |  Q1,12 | PCWP |            |
|   5 |      PX RECEIVE                         |                          |    33M|  2738M|       |   104K  (2)| 00:17:41 |       |       |  Q1,12 | PCWP |            |
|   6 |       PX SEND HASH                      | :TQ10010                 |    33M|  2738M|       |   104K  (2)| 00:17:41 |       |       |  Q1,10 | P->P | HASH       |
|   7 |        VIEW                             |                          |    33M|  2738M|       |   104K  (2)| 00:17:41 |       |       |  Q1,10 | PCWP |            |
|   8 |         WINDOW SORT                     |                          |    33M|  1636M|  1896M|   104K  (2)| 00:17:41 |       |       |  Q1,10 | PCWP |            |
|   9 |          HASH JOIN RIGHT OUTER          |                          |    33M|  1636M|       |  5799  (17)| 00:00:59 |       |       |  Q1,10 | PCWP |            |
|  10 |           PX RECEIVE                    |                          |    33M|   409M|       |   544  (19)| 00:00:06 |       |       |  Q1,10 | PCWP |            |
|  11 |            PX SEND HASH                 | :TQ10007                 |    33M|   409M|       |   544  (19)| 00:00:06 |       |       |  Q1,07 | P->P | HASH       |
|  12 |             PX BLOCK ITERATOR           |                          |    33M|   409M|       |   544  (19)| 00:00:06 |       |       |  Q1,07 | PCWC |            |
|  13 |              INDEX FAST FULL SCAN       | DIM_xxxx_x_xx_x_xxx_xx_x |    33M|   409M|       |   544  (19)| 00:00:06 |       |       |  Q1,07 | PCWP |            |
|  14 |           PX RECEIVE                    |                          |    33M|  1227M|       |  5092  (15)| 00:00:52 |       |       |  Q1,10 | PCWP |            |
|  15 |            PX SEND HASH                 | :TQ10008                 |    33M|  1227M|       |  5092  (15)| 00:00:52 |       |       |  Q1,08 | P->P | HASH       |
|  16 |             HASH JOIN RIGHT OUTER       |                          |    33M|  1227M|       |  5092  (15)| 00:00:52 |       |       |  Q1,08 | PCWP |            |
|  17 |              PX RECEIVE                 |                          |   152K|  1786K|       |    11   (0)| 00:00:01 |       |       |  Q1,08 | PCWP |            |
|  18 |               PX SEND BROADCAST         | :TQ10000                 |   152K|  1786K|       |    11   (0)| 00:00:01 |       |       |  Q1,00 | P->P | BROADCAST  |
|  19 |                PX BLOCK ITERATOR        |                          |   152K|  1786K|       |    11   (0)| 00:00:01 |       |       |  Q1,00 | PCWC |            |
|  20 |                 TABLE ACCESS FULL       | DIM_xxxxxxxx             |   152K|  1786K|       |    11   (0)| 00:00:01 |       |       |  Q1,00 | PCWP |            |
|  21 |              PX BLOCK ITERATOR          |                          |    33M|   849M|       |  5015  (14)| 00:00:51 |       |       |  Q1,08 | PCWC |            |
|  22 |               TABLE ACCESS FULL         | DIM_xxxxxxxx             |    33M|   849M|       |  5015  (14)| 00:00:51 |       |       |  Q1,08 | PCWP |            |
|  23 |      PX RECEIVE                         |                          |   448M|  1713G|       |  5689K  (1)| 16:03:56 |       |       |  Q1,12 | PCWP |            |
|  24 |       PX SEND HASH                      | :TQ10011                 |   448M|  1713G|       |  5689K  (1)| 16:03:56 |       |       |  Q1,11 | P->P | HASH       |
|  25 |        VIEW                             |                          |   448M|  1713G|       |  5689K  (1)| 16:03:56 |       |       |  Q1,11 | PCWP |            |
|  26 |         SORT UNIQUE                     |                          |   448M|    50G|    54G|  5689K  (1)| 16:03:56 |       |       |  Q1,11 | PCWP |            |
|  27 |          WINDOW SORT                    |                          |   448M|    50G|    54G|  5689K  (1)| 16:03:56 |       |       |  Q1,11 | PCWP |            |
|  28 |           PX RECEIVE                    |                          |   448M|    50G|       |   120K (16)| 00:20:21 |       |       |  Q1,11 | PCWP |            |
|  29 |            PX SEND HASH                 | :TQ10009                 |   448M|    50G|       |   120K (16)| 00:20:21 |       |       |  Q1,09 | P->P | HASH       |
|  30 |             HASH JOIN                   |                          |   448M|    50G|       |   120K (16)| 00:20:21 |       |       |  Q1,09 | PCWP |            |
|  31 |              PX RECEIVE                 |                          |    21 |   126 |       |     2   (0)| 00:00:01 |       |       |  Q1,09 | PCWP |            |
|  32 |               PX SEND BROADCAST         | :TQ10001                 |    21 |   126 |       |     2   (0)| 00:00:01 |       |       |  Q1,01 | P->P | BROADCAST  |
|  33 |                PX BLOCK ITERATOR        |                          |    21 |   126 |       |     2   (0)| 00:00:01 |       |       |  Q1,01 | PCWC |            |
|  34 |                 TABLE ACCESS FULL       | DIM_xxx_xxx              |    21 |   126 |       |     2   (0)| 00:00:01 |       |       |  Q1,01 | PCWP |            |
|  35 |              HASH JOIN                  |                          |   448M|    47G|       |   119K (16)| 00:20:12 |       |       |  Q1,09 | PCWP |            |
|  36 |               PX RECEIVE                |                          |     8 |   128 |       |     2   (0)| 00:00:01 |       |       |  Q1,09 | PCWP |            |
|  37 |                PX SEND BROADCAST        | :TQ10002                 |     8 |   128 |       |     2   (0)| 00:00:01 |       |       |  Q1,02 | P->P | BROADCAST  |
|  38 |                 PX BLOCK ITERATOR       |                          |     8 |   128 |       |     2   (0)| 00:00:01 |       |       |  Q1,02 | PCWC |            |
|  39 |                  TABLE ACCESS FULL      | DIM_xxx_xxxxxx           |     8 |   128 |       |     2   (0)| 00:00:01 |       |       |  Q1,02 | PCWP |            |
|  40 |               HASH JOIN                 |                          |   448M|    40G|       |   118K (15)| 00:20:03 |       |       |  Q1,09 | PCWP |            |
|  41 |                PX RECEIVE               |                          |   135 |  2160 |       |     2   (0)| 00:00:01 |       |       |  Q1,09 | PCWP |            |
|  42 |                 PX SEND BROADCAST       | :TQ10003                 |   135 |  2160 |       |     2   (0)| 00:00:01 |       |       |  Q1,03 | P->P | BROADCAST  |
|  43 |                  PX BLOCK ITERATOR      |                          |   135 |  2160 |       |     2   (0)| 00:00:01 |       |       |  Q1,03 | PCWC |            |
|  44 |                   TABLE ACCESS FULL     | DIM_xxx_xxxx             |   135 |  2160 |       |     2   (0)| 00:00:01 |       |       |  Q1,03 | PCWP |            |
|  45 |                HASH JOIN                |                          |   455M|    34G|       |   117K (15)| 00:19:54 |       |       |  Q1,09 | PCWP |            |
|  46 |                 PX RECEIVE              |                          |    55 |   385 |       |     2   (0)| 00:00:01 |       |       |  Q1,09 | PCWP |            |
|  47 |                  PX SEND BROADCAST      | :TQ10004                 |    55 |   385 |       |     2   (0)| 00:00:01 |       |       |  Q1,04 | P->P | BROADCAST  |
|  48 |                   PX BLOCK ITERATOR     |                          |    55 |   385 |       |     2   (0)| 00:00:01 |       |       |  Q1,04 | PCWC |            |
|  49 |                    TABLE ACCESS FULL    | DIM_xxxxxxxx             |    55 |   385 |       |     2   (0)| 00:00:01 |       |       |  Q1,04 | PCWP |            |
|  50 |                 HASH JOIN               |                          |   463M|    32G|       |   116K (14)| 00:19:45 |       |       |  Q1,09 | PCWP |            |
|  51 |                  PX RECEIVE             |                          |   152K|  1786K|       |    12   (9)| 00:00:01 |       |       |  Q1,09 | PCWP |            |
|  52 |                   PX SEND BROADCAST     | :TQ10005                 |   152K|  1786K|       |    12   (9)| 00:00:01 |       |       |  Q1,05 | P->P | BROADCAST  |
|  53 |                    PX BLOCK ITERATOR    |                          |   152K|  1786K|       |    12   (9)| 00:00:01 |       |       |  Q1,05 | PCWC |            |
|  54 |                     TABLE ACCESS FULL   | DIM_xxxxxxxx             |   152K|  1786K|       |    12   (9)| 00:00:01 |       |       |  Q1,05 | PCWP |            |
|  55 |                  HASH JOIN              |                          |   557M|    32G|       |   115K (13)| 00:19:34 |       |       |  Q1,09 | PCWP |            |
|  56 |                   PX RECEIVE            |                          |     2 |    10 |       |     2   (0)| 00:00:01 |       |       |  Q1,09 | PCWP |            |
|  57 |                    PX SEND BROADCAST    | :TQ10006                 |     2 |    10 |       |     2   (0)| 00:00:01 |       |       |  Q1,06 | P->P | BROADCAST  |
|  58 |                     PX BLOCK ITERATOR   |                          |     2 |    10 |       |     2   (0)| 00:00:01 |       |       |  Q1,06 | PCWC |            |
|  59 |                      TABLE ACCESS FULL  | DIM_xxxxxxxxx            |     2 |    10 |       |     2   (0)| 00:00:01 |       |       |  Q1,06 | PCWP |            |
|  60 |                   PX PARTITION RANGE ALL|                          |   557M|    30G|       |   114K (12)| 00:19:22 |     1 |  4429 |  Q1,09 | PCWC |            |
|  61 |                    TABLE ACCESS FULL    | FACT_xxxxxx              |   557M|    30G|       |   114K (12)| 00:19:22 |     1 |  4429 |  Q1,09 | PCWP |            |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Hm, looks a lot of parallel to me. So the optimizer decided to go with a parallel plan, but the execution engine decided to ignore this and do a serial execution..

I decided to run the statement manually, to see what will happen. Well, what happened is a parallel execution compelted in ~70 minutes. o matter what I try, if the hint parallel exists, it runs in parallel! And yes, the hint is in the query in the ETL packages.

I’d like to quote here one of the extremely valuable advices by Jonathan Lewis – it’s about the hint parallel. This hint is not mandatory. It does not instruct the optimizer “make this run with that amount of parallel processes”. No, the hint only… hm… hints “ok, why don’t you check if a better plan could be produced with that amount of parallelism”. The optimizer has the right to say “No”. But this is not the case with this query – the optimizer have created a parallel plan. This is easy to see in v$sql_plan. In other words, the optimizer, which is guilty “by default” for many performance problem, has done the right thing. Therefore the standard approaches like creating baselines or sql profiles cannot help us.

After have spent a few hours trying to reproduce the problem. During the tests everything runs fine. So I gave up for the day. Let’s see what will happen tonight…

Day 2

We have yet another slow execution. This is very annoying. No matter how hard I try to reproduce it manually, it always runs “fast” (relatively fast – for a hour and few minutes). But when it runs as a part of the ETL job chain, we have a slow, serial execution. And the most irritating thing is, the plan in the SQL cache is parallel again, but the execution is serial. And no, we do not have resource management limitations during the night.

Maybe we need more parallel query slaves? Let’s increase the limit from 20 to 32 for tonight:

alter system set parallel_max_servers=32;

Day 3

We have another serial execution 🙁

But it’s Friday. I do not want to touch it, to avoid urgent interventions during the weekend. It may work slower, but nobody is affected – the ETL completes before sunrise 🙂

(to be continued)

 Posted by at 7:36

OBIEE databаse requirements

 Oracle  Comments Off on OBIEE databаse requirements
Mar 222013
 

I am reading the installation guide for Oracle Business Intelligence. And I wonder who is that extremely smart and knowledgeable person that wrote the following in the documentation:

Ensure that your database is not hardened for security purposes. Installing Oracle Business Intelligence against a hardened database is not supported.

It sounds like Oracle Business Intelligence Enterprise Edition Plus – not the cheapest product on the market – is designed to be used only in small shops with less than 3 employees. Because every bigger organization has to take care of the security of the database. Even if it’s a BI/DWH database.

On the other hand, it’s lot easier to write it that way…

 Posted by at 9:41

ORA-16631: Operation requires shutdown of database/instance “”

 Oracle  Comments Off on ORA-16631: Operation requires shutdown of database/instance “”
Mar 052013
 

One more funny bug. We have the following
– 4 databases in the configuration (1 primary and 3 standby)
– we have restarted one of the standbys. Everything works as expected
dgmgrl shows something different:

DGMGRL> show configuration

Configuration - DG

  Protection Mode: MaxAvailability
  Databases:
    primary  - Primary database
    standby1 - Physical standby database
    standby2 - Physical standby database (disabled)
    standby3 - Logical standby database (disabled)

Fast-Start Failover: DISABLED

Configuration Status:
SUCCESS

(disabled) – WTF? I have to say that we have restarted standby1 – we didn’t even touch the others.

OK, let’s try the easiest solution for every DG problem:

DGMGRL> enable database standby2;

Here comes the surprise:

ORA-16631: Operation requires shutdown of database/instance ""

Errr… which instance do you need restarted?

The only good thing about this crappy error is that it lets you find the following note in MOS:
Metalink note 1258074.1

Ora-16631: Operation Requires Shutdown Of Database Or Instance “” On Physical Standby

Applies to:
Oracle Server – Enterprise Edition – Version: 11.1.0.7 and later [Release: 11.1 and later ]
Information in this document applies to any platform.

Symptoms

ORA-16631: Operation requires shutdown of database/instance “”

Cause
Both the primary and standby were shutdown in an graceful fashion.
Since standby is shutdown before primary, standby is marked as shutdown disabled
standby is restarted first and then the primary.
standby sends a bootstrap request to the primary but since it’s not started yet,
it’s request goes unanswered. When primary is then started and as part of
its bootstrap processing the primary intentionally leaves the standby in a
shutdown disabled state. The FSFO state then transitions to an unsynchronized
state since the standby is not available.

Solution
— on primary —-

dgmgrl /
edit database 'standby db unique name here' set state='ONLINE';
show configuration;
exit

( or )

—- on standby —

sqlplus / as sysdba
shu immediate
startup mount

OK, we have problem with both physical, and logical standby. And we did not restart any of them, nor the primary database. I can only guess that during the restart standby1 sent to primary the “I’m leaving the party” signal. Then primary made a mistake and dropped the connection to all the standbys. After taht, when standby1 has beens started, he sent a bootstrap signal. Primay have added him to the configuration, leaving the others away.
Fortunately the solution worked. Of course I choose the first one. I’m not sure what will happen is I restart standby2, maybe we will loose standby1 again… 🙂

 Posted by at 9:43