Exibir mapa completo

What to Do When SQL Server is Slow

Outro

Ian Bradshaw

Seguir

Começar. É Gratuito

ou inscrever-se com seu endereço de e-mail

Mapas mentais semelhantes Esboço do mapa mental

What to Do When SQL Server is Slow por Ian Bradshaw Mind Map: What to Do When SQL Server is Slow

1. Triage process

1.1. Rapid identification of emergency priority and what area to focus on

1.2. Triage levels

1.2.1. Level 1

1.2.1.1. Server is down, not responding to any queries

1.2.2. Level 2

1.2.2.1. Server is going to go down in minutes if we don't do something quickly

1.2.3. Level 3

1.2.3.1. Server is in bad shape but will survive if we don't do anything

1.2.4. Level 4

1.2.4.1. Some parts of some apps are unusually slow

1.2.5. Level 5

1.2.5.1. Bob's report is running "unusually" slow (i.e. serial offenders due to badly formed queries - no-one else affected)

2. Triage report

2.1. Goal is to fill in the blanks of the following report as fast as possible:

2.1.1. The symptoms are _____.

2.1.2. I believe the root cause is _____.

2.1.3. If we don't take action now, the effect will be _____.

2.1.4. I recommend that we:

2.1.4.1. Take action _____ now to fix it, or...

2.1.4.2. Take action _____ to investigate further, or ...

2.1.4.3. Put this in _____'s queue

2.2. Symptoms are what the users are screaming about, not technical observations about the state of SQL Server

2.3. Root cause will probably not be known but you should form an initial opinion as quickly as possible, based on your experience and initial observations

2.4. Risk and effect of delaying remedial action is another subjective judgement based on your experience, taking symptoms and probable root cause into account

2.5. The next action should always be informed by the preceding assessments - never dive in to action before completing the first 3 assessments

3. Triage tools

3.1. 1. Your monitoring software

3.1.1. will not tell you what the symptoms are, root cause is or what will happen if you don't take immediate action

3.1.2. it's like the monitoring equipment in a surgeon's operating theatre that tells you whether the vital signs of the patient (SQL Server) are stable, degrading or improving

3.1.3. it's like a backup for the surgeon, which means you - i.e. it's still your job to probe deeper to make discoveries and take next actions

3.1.4. Monitoring software will probably suggest actions but it's up to you to validate those and make the decision about what action to try and in what order

3.2. 2. sp_WhoIsActive

3.2.1. Considered in community as gold standard replacement for sp_who and sp_who2

3.2.2. runs immediately or small number of seconds

3.2.3. Run this repeatedly to test how responsive the server is with its results

3.2.3.1. if results are coming back each time quickly, you can rule out Level 1 emergency in triage

3.2.4. Look for long running queries and blocked sessions

3.2.4.1. if the session at top of blocking chain cannot complete, this is going to cause log file growth as open transactions keep writing in order to facilitate rollback

3.2.4.1.1. if log file growth hits a hard limit, SQL Server cannot function until that situation is resolved

3.2.4.2. can be considered a Level 2 emergency

3.2.4.3. resist initial impulse to kill session at top of blocking chain

3.2.4.3.1. large forward transactions are typically run multi-threaded but rollback transactions typically run single threaded

3.2.4.3.2. be aware that blocking session might involve multiple SQL statements and you will only see that last of these, but it might be that it was an earlier statement causing the blocking problem

3.2.4.4. run sp_WhoIsActive @get_locks = 1

3.2.4.4.1. adds locks column to output

3.2.4.4.2. inspect locks column of blocking session to get a feel for what locks that session is holding

3.2.4.4.3. if you see exclusive (X) locks on whole tables (OBJECT), you can get a hint on how heavy the rollback will be by considering the size of the tables being locked for updates

3.2.4.5. if deciding to kill the session at head of blocking chain, when you re-run sp_WhoIsActive, you may see that session still there, holding locks and blocking, but status will change to "rollback"

3.2.4.5.1. note: percent_complete will NOT be populated for rollbacks, due to the lack of predictability on rollbacks

3.2.4.5.2. you can run kill <spid> with statusonly in a separate session window, which will attempt to get you a rough estimate of how far that rollback has progressed

3.3. 3. sp_BlitzFirst

3.3.1. turn to this once sp_whoIsActive has ruled out Level 1 (because you can repeatedly run sp_WhoIsActive) and Level 2 (because there are no blocking chains)

3.3.2. runs for at least 5 seconds

3.3.3. without any parameters is designed for use by Helpdesk support

3.3.3.1. run sp_BlitzFirst @ExpertMode = 1 for deeper dive

3.3.4. takes a snapshot of data from a bunch of DMVs, waits 5 seconds and then takes another snapshot, and uses this 5 second interval to calculate a prioritised list of observations

3.3.5. can be used to rule in or out Level 3

3.3.5.1. example is a surprise backup operation being run outside of normal maintenance window

3.3.5.2. alert user running that process to give them opportunity to cancel it

3.3.5.3. may just decide to ride it out if the duration is not too long

3.4. 4. sp_BlitzCache

3.4.1. runs normally in 10 - 60 seconds

3.4.2. first result set is "sucker board" - top 10 recent execution plans in plan cache that sucked most in terms of performance

3.4.3. one thing to look out for is surprise entries to the sucker board

3.4.3.1. might indicate parameter sniffing problem, where execution plan got cached for a particular parameter value and that re-used plan sucks really badly for some new parameter value

3.4.3.2. sp_recompile '<name_of_proc' will remove bad execution plan from cache and build a new one

3.4.3.2.1. doesn't work for parameterised SQL that may be coming from applications

3.4.3.2.2. better way is to run sp_BlitzCache = @ExportMode = 1

3.4.3.2.3. note: nuking an offending plan from the plan cache will still mean that in flight queries using the bad plan are still using that plan - you may need to kill these off

3.4.3.3. note: many people do things like DBCC FREEPROCCACHE, rebuild indexes or statistics, or simply reboot server - but this is a bad idea

3.4.3.3.1. all of these things basically cause SQL Server to have amnesia - i.e. forget everything it knows and start from scratch

3.4.3.3.2. sledgehammer to crack a walnut approach

3.5. 5. sp_BlitzIndex

3.5.1. at this point we've probably ruled out Levels 1 to 3, and maybe 4

3.5.1.1. server is still responsive

3.5.1.2. sp_BlitzFirst did not reveal any unusual activity adding stress to the server workload

3.5.1.3. sp_BlitzCache is still showing the same usual suspects on the sucker board (i.e. nothing out the ordinary)

3.5.2. diagnoses index design issues

3.5.2.1. not fragmentation

3.5.2.2. can lead to conclusion that one or more slow running queries can benefit significantly from new or better indexation

3.5.2.3. calls for a more measured approach, not a hasty change

3.6. 6. sp_Blitz

3.6.1. a general health check of the server that produces a prioritised list of issues

3.6.1.1. priorities 1 to 50 demand action, above that they are warnings

4. Practice tools

4.1. SQLQueryStress

4.1.1. originally developed by Adam Machanic - same guy that developed sp_WhoIsActive, but now maintained by someone else

4.1.2. use this to simulate stress on SQL Server

ou Inscrever-se