Wednesday, December 23, 2009

Spinlock Issues, Unresponsive Servers, and TF2330

We have had a handful of servers go completely unresponsive for no apparent reason and even a DAC gets tied up after a minute or two. When we see this issue, we also notice the Application Event Viewer log is inundated with the following errors:

Event Type: Information
Event Source: MSSQLSERVER
Event Category: (2)
Event ID: 17883
Date: 12/17/2009
Time: 12:21:33 AM
User: N/A
Computer: XXXXXXXXX
Description:
Process 228:0:0 (0x8d0) Worker 0x00000003A92981C0 appears to be non-yielding on Scheduler 3. Thread creation time: 12905499755717. Approx Thread CPU Used: kernel 46 ms, user 1968 ms. Process Utilization 0%%. System Idle 99%%. Interval: 2719241 ms.


And:

Event Type: Information
Event Source: MSSQLSERVER
Event Category: (2)
Event ID: 17884
Date: 12/17/2009
Time: 1:32:51 AM
User: N/A
Computer: XXXXXXXXX
Description:
New queries assigned to process on Node 0 have not been picked up by a worker thread in the last 6960 seconds. Blocking or long-running queries can contribute to this condition, and may degrade client response time. Use the "max worker threads" configuration option to increase number of allowable threads, or optimize current running queries. SQL Process Utilization: 0%%. System Idle: 99%%.

All with corresponding errors in the SQL log.

When this issue arises, most of the time the SQL service will not stop, it simply hangs and times out until the entire server is rebooted which is understandable given the issue with the spin locks and effect on schedulers; although if you are able to connect to the server (usually through DAC) and you recognize this issue soon enough, you can use the shutdown command to bounce the service.

After the normal research routines and opening a case with MS, we were informed that this is a bug initially identified in SQL 2005 pre SP1 (build 2039), however we did not see it until we applied CU 6 for SP3 (build 4226). The issue was identified as spin-lock contention on OPT_IDX_STATS as SQL tries to update the stats feeding sys.db_index_usage_stats for Tempdb and the fix/ workaround provided to us was to enable trace flag 2330 as a startup parameter and seeing as how this is an undocumented trace flag, I thought I would share what we found and subsequently confirmed with the support team.

Initially, we were told that TF2330 would only disable stats collection for objects within tempdb, however we noticed that our index usage stats for all databases were not being updated so we followed up with Support and they confirmed that this was the case; actual response is below:


This regards your case SR # 0000000. We checked the source code and found that TF 2330 just disables collection of index usage statistics to feed data for sys.db_index_usage_stats DMV. So after you enable this TF and restart the server, you should see empty rows in this DMV. I tested this and did not see any rows for this DMV. This data also feeds into the Missing Indexes DMV suggestions.
For example: sys.dm_db_missing_index_group_stats will also rely on this info and hence will not be available. I tested the sys.dm_db_missing_index_group_stats and confirmed as well. Can you trying creating the missing indexes and see if that helps without enabling the TF2330?
In summary these are the two dmvs that are being affected:
sys.db_index_usage_stats
sys.dm_db_missing_index_group_stats

According to MS Support, there are no plans to fix this in SQL Server 2005 as it was supposed to have been corrected in SQL 2008.

The server info (identical for all for all affected servers) is:
Microsoft SQL Server 2005 - 9.00.4226.00 (X64)
Developer Edition (64-bit) on Windows NT 5.2 (Build 3790: Service Pack 2)

Update - 12/24/2009: Since I sat on this for a few months before posting, there has been documentation from MS and another recent blog posting here: http://www.pythian.com/news/5211/sql-server-more-light-shed-on-non-yielding-scheduler-and-indexes-stats/

Tuesday, December 22, 2009

Using SQLNCLI from SQL Server 2008 to 2005 can cause stack dump and service restart

I noticed recently when expanding the "Catalogs" folder under a linked server connection from a SQL Server 2008 server to a SQL Server 2005 server, that the 2008 server stack dumped and restarted the MSSQL service.

I was able to reproduce this on two machines running post SP1 builds of Developer 2008 using SSMS 2005 and 2008. It happens sporadically on one server and everytime on the other.

Source Servers:
1. Microsoft SQL Server 2008 (SP1) - 10.0.2746.0 (Intel X86) Nov 9 2009 16:59:31 Copyright (c) 1988-2008 Microsoft Corporation Developer Edition on Windows NT 5.2 (Build 3790: Service Pack 2)

2. Microsoft SQL Server 2008 (SP1) - 10.0.2734.0 (Intel X86) Sep 11 2009 15:12:52 Copyright (c) 1988-2008 Microsoft Corporation Developer Edition on Windows NT 5.1 (Build 2600: Service Pack 3)

Remote Servers (servers linked connection is made to):

1. Microsoft SQL Server 2005 - 9.00.4220.00 (Intel X86) Apr 2 2009 18:42:07 Copyright (c) 1988-2005 Microsoft Corporation Standard Edition on Windows NT 5.2 (Build 3790: Service Pack 2)
2. Microsoft SQL Server 2005 - 9.00.4226.00 (X64) May 26 2009 14:58:11 Copyright (c) 1988-2005 Microsoft Corporation Developer Edition (64-bit) on Windows NT 5.2 (Build 3790: Service Pack 2)


Steps to recreate behavior:

1. create linked server using SQLNCLI provider on a 2008 server pointing to any 2005 server

EXEC master.dbo.sp_addlinkedserver @server = N'USESQLNCLI', @srvproduct=N'SQL', @provider=N'SQLNCLI', @datasrc=N'NAMEOF2005INSTANCE'

EXEC master.dbo.sp_addlinkedsrvlogin @rmtsrvname=N'USESQLNCLI',@useself=N'True',@locallogin=NULL,@rmtuser=NULL,@rmtpassword=NULL


2. from SSMS, expand the "Catalogs" folder under "Linked Servers" in the object explorer.

Step 2 above eventually times out with "Failed to retrieve data for this request", SQL stack dumps and the MSSQL service restarts. I was able to reproduce this everytime on one server and sporatically on the other.

The linked server connection itself works fine when you use it in a query but fails when you attempt to view the "Catalogs" through SSMS as above. The stack dump occurs when it attempts to populate the catalog information in #tmp_sp_catalogs.

Stack dump:
* Exception Address = 00000006 Module(UNKNOWN+00000000)
* Exception Code = c0000005 EXCEPTION_ACCESS_VIOLATION
* Access Violation occurred reading address 00000006
* Input Buffer 510 bytes -
* create table #tmp_sp_catalogs (is_catalog_s
* upport bit null,server_name nvarchar(128) null, product_name nvarchar(12
* 8) null,provider_name nvarchar(128) null,catalog_name nvarchar(128) null
Event Viewer entries from each 2008 server:

Attempt to fetch logical page (1:200) in database 2 failed. It belongs to allocation unit 25896092997713920 not to 196608.

Attempt to fetch logical page (1:161) in database 2 failed. It belongs to allocation unit 25896092997713920 not to 281474980642816

If the service doesn't happen to restart you can run a checkdb on tempdb and generate the error as well.

CHECKDB found 0 allocation errors and 0 consistency errors in database 'tempdb'.
Msg 605, Level 21, State 3, Line 2
Attempt to fetch logical page (1:161) in database 2 failed. It belongs to allocation unit 25896092997713920 not to 281474980642816.


If I create the same linked server connection instead using SQLNCLI10 from 2008 to 2005, I don't encounter this same behavior, and expanding "Catalogs" on the linked server in SSMS works as it should.

Maybe this is already a known issue using SQLNCLI from 2008, especially since it's at SP1 CU5, but I wasn't able to find anything online about this behavior. We're going to make sure to use the SQLNCLI10 provider going forward but curious if anyone else has noticed this same issue.

Update 12/24/09:  Submitted this as bug id 522048.

Update 1/27/10: Another user was able to reproduce this connecting to SQL 2000 from SQL 2008.  Our workaround of using the SQLNCLI10 provider has worked out well.