Monday, October 17, 2016

Database Corruption/ Torn Page Repair

Sometimes bad things happen to good databases, and corruption can be one of those things. This post goes through a few different methods of recovering from corruption with minimal or no data loss when all other standard methods have been exhausted. If you find that the corruption exists in a non-clustered index, then you can fix the issue by simply scripting out the create index syntax from Management Studio, dropping the index, and using the script to recreate it (you cannot just run the create script while specifying the Drop_Existing option; it doesn't work). If the non-clustered index happens to be a primary key, then be sure to add a unique constraint on the table with the same definition to help maintain integrity. Note that this will not help any foreign keys referencing the table; taking care of those will take much more work and is beyond the scope of this post.
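As a minimal sketch of that approach (the table, column, and index names here are hypothetical placeholders):

-- Script out the existing definition from Management Studio first, then:
Drop Index IX_Orders_CustomerID On dbo.Orders
Go
-- Recreate the index from the scripted definition
Create NonClustered Index IX_Orders_CustomerID On dbo.Orders (CustomerID)
Go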

Basic Information

A torn page error is a severity 24 error that will display similar to the following:
SQL Server Alert System: 'Severity 24 - Fatal Error: Hardware Error' occurred on
DESCRIPTION: SQL Server detected a logical consistency-based I/O error: torn page (expected signature: 0xaaaaaaaa; actual signature: 0x55555556). It occurred during a read of page (1:30857339) in database ID 8 at offset 0x00003adb0f6000 in file 'F:\MSSQL\Data\DBfile_data.mdf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

Warnings

The best way to recover from data corruption is to restore the database from a known good backup and then restore any subsequent differential and transaction log backups. However, this method is for situations in which that is not an option for one reason or another. Also, this process will take the entire database offline, so be sure to involve all stakeholders and take appropriate measures to accommodate the outage. This process will remove data and should only be performed when all other courses of action have been considered.

For Page-Level Restore Instead of Repair

Note: For an article on how to do a repair with a page-level restore instead of DBCC CheckTable ('<table name>', REPAIR_ALLOW_DATA_LOSS), see the MSSQLTips article "Using Page Level Restore as a Disaster Recovery Procedure in SQL Server 2005". I have actually used this method on a production machine and it worked perfectly.

Troubleshooting and repair

1) ****The very first thing you should do is start a restore of the last known good backup of the database on another server for later use and/ or for restoring transaction logs if available****
2) The error message contains the information that you will need to start the repair process, specifically the database ID and the page number. You will first want to get the name of the database via the database ID and filename. Next, use the numbers in parentheses, which specify the file number and page number, to determine the ObjectID, previous page number, and next page number with the following command (of course, replacing <database name> with the correct name):
DBCC TraceOn (3604)
DBCC Page ('<database name>', 1, 30857339)
DBCC TraceOff (3604)
Note that this is the corrupt page so you will receive an error, but you will also receive command output in which you will see a row that looks like this:
Metadata: ObjectId = 1108303108 m_prevPage = (1:30857338) m_nextPage = (1:30857340)
3) Make sure you are using the correct database, determine the table name from the ObjectID, and note the previous and next pages for later use:
Select Object_Name(1108303108)
4) Get information on the primary key, clustered index (if different), computed columns, identity columns, GUID columns, and triggers:
Exec sp_help '<table name>'
Exec sp_helptrigger '<table name>'
5) Next, determine the last value of the previous page and the first value of the next page in order to infer the contents of the bad page:
DBCC Page('<database name>',1,30857338,3) With TableResults
This will display the contents of the previous page in tabular format and in clustered index order, so note the last value in the clustered index/ primary key; then do the same for the next page, noting the first value instead:
DBCC Page('<database name>',1,30857340,3) With TableResults
6) Once the restored copy is available, you should run the DBCC Page command against the restored good database to confirm your results and then write a query for the values:
DBCC Page('<database name>', 1, 30857339, 3) With TableResults
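For example, assuming a single-column integer clustered key (the table name, key column, and boundary values here are all hypothetical), the rows that lived on the corrupt page can be pulled from the restored copy with something like:

-- Run against the restored good copy of the database
Select *
From dbo.Orders
Where OrderID > 1012345 -- last key value on the previous page
And OrderID < 1012401 -- first key value on the next page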
7) Once you are confident that you know the values in the corrupt page and have a way to transfer them back into the table (e.g. the restored good database or another data center), you need to repair the table:
***At this point it would be a good idea to make a full backup of the database in case something does not go as expected with the repair operation***
Set the database to restricted mode to help mitigate the potential for another connection intercepting yours once in single user mode:
Alter Database <database name> Set Restricted_User With RollBack Immediate
Use <database name>
Go
Set the database to single user in order to perform the repair (make sure you do this in a single batch as below):
Alter Database <database name> Set Single_User With RollBack Immediate
Use <database name>
Go
Get the row count before the repair:
Exec sp_spaceused '<table name>'
Then run the repair statement:
**Please note that this will result in loss of data and should only be used if other alternatives (restoring from backup, etc.) are not feasible due to the age of the backup or similar constraints!**
DBCC CheckTable ('<table name>', REPAIR_ALLOW_DATA_LOSS) With All_ErrorMsgs
Once complete, get the row count after the repair to validate your findings and query for re-insertion:
Exec sp_spaceused '<table name>'
Then set the database back to multi-user:
Alter Database <database name> Set Multi_User With RollBack Immediate
Use <database name>
Go
8) Re-insert the rows, disabling any triggers and accounting for identity, computed, and guid columns. 
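As a minimal sketch of that final step (the table, trigger, and column names are hypothetical, and the restored copy is assumed to be reachable as RestoredDB):

-- Disable any triggers so they do not fire during the re-insert
Disable Trigger trg_Orders_Audit On dbo.Orders
Go
-- Allow explicit values into the identity column
Set Identity_Insert dbo.Orders On
-- Re-insert the recovered rows (computed columns are left out of the column list)
Insert Into dbo.Orders (OrderID, CustomerID, OrderDate)
Select OrderID, CustomerID, OrderDate
From RestoredDB.dbo.Orders
Where OrderID > 1012345 And OrderID < 1012401
Set Identity_Insert dbo.Orders Off
Go
Enable Trigger trg_Orders_Audit On dbo.Orders
Go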

Sunday, October 16, 2016

Process for Changing IP Address on Cassandra Nodes in Windows Environment

I needed to perform an IP address change for Cassandra nodes in a Windows environment and couldn't find anything documented so I created the below procedure for performing the task. This process was for a Cassandra v1.2 cluster. 

1. Cleanly shut down the node (only 1 node should be brought down at a time)
a. Perform drain
nodetool -h <host> drain
b. Disable Gossip
nodetool -h <host> disablegossip
c. Disable Thrift
nodetool -h <host> disablethrift
d. Disable binary (native protocol)
nodetool -h <host> disablebinary
e. Stop Cassandra service

2. Set Cassandra service to disabled

3. Configure the node to purge gossip state information on startup:
Edit the cassandra-env.sh file, adding: JVM_OPTS="$JVM_OPTS -Dcassandra.load_ring_state=false"

4. Change the yaml file, setting start_rpc: false and start_native_transport: false, as shown below (this is only for the node involved in the IP change)
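In cassandra.yaml the two settings look like this (note that yaml uses colons rather than equal signs):

# cassandra.yaml - temporarily prevent client connections during the move
start_rpc: false
start_native_transport: false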

5. Change the hosts file on all nodes to reflect the new IP address of the node involved in the change

6. Perform IP change

7. Reboot machine

8. Set Cassandra service to automatic

9. Start service

10. Verify node state, ring, load, activity, etc.

11. Run repair on any column families that may have missed writes to the re-IP'd node
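For example (the host, keyspace, and column family names are placeholders):

nodetool -h <new host> repair <keyspace> <column family>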

12. Confirm that data is in sync

13. Edit the cassandra-env.sh file, removing the line that purges gossip state: JVM_OPTS="$JVM_OPTS -Dcassandra.load_ring_state=false"

14. Change the yaml file for the node involved in the IP change, setting start_rpc: true and start_native_transport: true

15. Enable binary
nodetool -h <host> enablebinary

16. Enable thrift
nodetool -h <host> enablethrift

Wednesday, December 23, 2009

Spinlock Issues, Unresponsive Servers, and TF2330

We have had a handful of servers go completely unresponsive for no apparent reason, and even a DAC connection gets tied up after a minute or two. When we see this issue, we also notice the Application Event Viewer log is inundated with the following errors:

Event Type: Information
Event Source: MSSQLSERVER
Event Category: (2)
Event ID: 17883
Date: 12/17/2009
Time: 12:21:33 AM
User: N/A
Computer: XXXXXXXXX
Description:
Process 228:0:0 (0x8d0) Worker 0x00000003A92981C0 appears to be non-yielding on Scheduler 3. Thread creation time: 12905499755717. Approx Thread CPU Used: kernel 46 ms, user 1968 ms. Process Utilization 0%%. System Idle 99%%. Interval: 2719241 ms.


And:

Event Type: Information
Event Source: MSSQLSERVER
Event Category: (2)
Event ID: 17884
Date: 12/17/2009
Time: 1:32:51 AM
User: N/A
Computer: XXXXXXXXX
Description:
New queries assigned to process on Node 0 have not been picked up by a worker thread in the last 6960 seconds. Blocking or long-running queries can contribute to this condition, and may degrade client response time. Use the "max worker threads" configuration option to increase number of allowable threads, or optimize current running queries. SQL Process Utilization: 0%%. System Idle: 99%%.

All with corresponding errors in the SQL log.

When this issue arises, most of the time the SQL service will not stop; it simply hangs and times out until the entire server is rebooted, which is understandable given the issue with the spinlocks and their effect on the schedulers. However, if you are able to connect to the server (usually through the DAC) and you recognize the issue soon enough, you can use the SHUTDOWN command to bounce the service.

After the normal research routines and opening a case with MS, we were informed that this is a bug initially identified in SQL 2005 pre-SP1 (build 2039); however, we did not see it until we applied CU6 for SP3 (build 4226). The issue was identified as spinlock contention on OPT_IDX_STATS as SQL tries to update the stats feeding sys.dm_db_index_usage_stats for Tempdb, and the fix/ workaround provided to us was to enable trace flag 2330 as a startup parameter. Seeing as how this is an undocumented trace flag, I thought I would share what we found and subsequently confirmed with the support team.
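The startup parameter itself is just -T2330 (added through SQL Server Configuration Manager), and once the service has been restarted you can confirm that the flag is active with:

DBCC TraceStatus (2330, -1)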

Initially, we were told that TF2330 would only disable stats collection for objects within tempdb; however, we noticed that our index usage stats for all databases were not being updated, so we followed up with Support and they confirmed that this was the case. Their actual response is below:


This regards your case SR # 0000000. We checked the source code and found that TF 2330 just disables collection of index usage statistics to feed data for the sys.dm_db_index_usage_stats DMV. So after you enable this TF and restart the server, you should see no rows in this DMV. I tested this and did not see any rows for this DMV. This data also feeds into the Missing Indexes DMV suggestions.
For example: sys.dm_db_missing_index_group_stats will also rely on this info and hence will not be available. I tested sys.dm_db_missing_index_group_stats and confirmed this as well. Can you try creating the missing indexes and see if that helps without enabling TF2330?
In summary these are the two DMVs that are being affected:
sys.dm_db_index_usage_stats
sys.dm_db_missing_index_group_stats
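A quick way to confirm this behavior on an instance running with the trace flag is simply to query both DMVs after a restart:

-- Both should return no rows with TF2330 enabled
Select * From sys.dm_db_index_usage_stats
Select * From sys.dm_db_missing_index_group_stats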

According to MS Support, there are no plans to fix this in SQL Server 2005 as it was supposed to have been corrected in SQL 2008.

The server info (identical for all affected servers) is:
Microsoft SQL Server 2005 - 9.00.4226.00 (X64)
Developer Edition (64-bit) on Windows NT 5.2 (Build 3790: Service Pack 2)

Update - 12/24/2009: Since I sat on this for a few months before posting, there has been documentation from MS and another recent blog posting here: http://www.pythian.com/news/5211/sql-server-more-light-shed-on-non-yielding-scheduler-and-indexes-stats/

Tuesday, December 22, 2009

Using SQLNCLI from SQL Server 2008 to 2005 can cause stack dump and service restart

I noticed recently when expanding the "Catalogs" folder under a linked server connection from a SQL Server 2008 server to a SQL Server 2005 server, that the 2008 server stack dumped and restarted the MSSQL service.

I was able to reproduce this on two machines running post-SP1 builds of Developer 2008 using SSMS 2005 and 2008. It happens sporadically on one server and every time on the other.

Source Servers:
1. Microsoft SQL Server 2008 (SP1) - 10.0.2746.0 (Intel X86) Nov 9 2009 16:59:31 Copyright (c) 1988-2008 Microsoft Corporation Developer Edition on Windows NT 5.2 (Build 3790: Service Pack 2)

2. Microsoft SQL Server 2008 (SP1) - 10.0.2734.0 (Intel X86) Sep 11 2009 15:12:52 Copyright (c) 1988-2008 Microsoft Corporation Developer Edition on Windows NT 5.1 (Build 2600: Service Pack 3)

Remote Servers (servers the linked connection is made to):

1. Microsoft SQL Server 2005 - 9.00.4220.00 (Intel X86) Apr 2 2009 18:42:07 Copyright (c) 1988-2005 Microsoft Corporation Standard Edition on Windows NT 5.2 (Build 3790: Service Pack 2)
2. Microsoft SQL Server 2005 - 9.00.4226.00 (X64) May 26 2009 14:58:11 Copyright (c) 1988-2005 Microsoft Corporation Developer Edition (64-bit) on Windows NT 5.2 (Build 3790: Service Pack 2)


Steps to recreate behavior:

1. create linked server using SQLNCLI provider on a 2008 server pointing to any 2005 server

EXEC master.dbo.sp_addlinkedserver @server = N'USESQLNCLI', @srvproduct=N'SQL', @provider=N'SQLNCLI', @datasrc=N'NAMEOF2005INSTANCE'

EXEC master.dbo.sp_addlinkedsrvlogin @rmtsrvname=N'USESQLNCLI',@useself=N'True',@locallogin=NULL,@rmtuser=NULL,@rmtpassword=NULL


2. from SSMS, expand the "Catalogs" folder under "Linked Servers" in the object explorer.

Step 2 above eventually times out with "Failed to retrieve data for this request", SQL stack dumps, and the MSSQL service restarts. I was able to reproduce this every time on one server and sporadically on the other.

The linked server connection itself works fine when you use it in a query but fails when you attempt to view the "Catalogs" through SSMS as above. The stack dump occurs when it attempts to populate the catalog information in #tmp_sp_catalogs.

Stack dump:
* Exception Address = 00000006 Module(UNKNOWN+00000000)
* Exception Code = c0000005 EXCEPTION_ACCESS_VIOLATION
* Access Violation occurred reading address 00000006
* Input Buffer 510 bytes -
* create table #tmp_sp_catalogs (is_catalog_s
* upport bit null,server_name nvarchar(128) null, product_name nvarchar(12
* 8) null,provider_name nvarchar(128) null,catalog_name nvarchar(128) null
Event Viewer entries from each 2008 server:

Attempt to fetch logical page (1:200) in database 2 failed. It belongs to allocation unit 25896092997713920 not to 196608.

Attempt to fetch logical page (1:161) in database 2 failed. It belongs to allocation unit 25896092997713920 not to 281474980642816

If the service doesn't happen to restart, you can run a CheckDB on tempdb and generate the error as well.
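For example, something like the following will surface the 605 error:

DBCC CheckDB ('tempdb') With All_ErrorMsgs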

CHECKDB found 0 allocation errors and 0 consistency errors in database 'tempdb'.
Msg 605, Level 21, State 3, Line 2
Attempt to fetch logical page (1:161) in database 2 failed. It belongs to allocation unit 25896092997713920 not to 281474980642816.


If I create the same linked server connection instead using SQLNCLI10 from 2008 to 2005, I don't encounter this same behavior, and expanding "Catalogs" on the linked server in SSMS works as it should.
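That is, the same definition as the script above but with the newer provider (the linked server name here is just a placeholder):

EXEC master.dbo.sp_addlinkedserver @server = N'USESQLNCLI10', @srvproduct=N'SQL', @provider=N'SQLNCLI10', @datasrc=N'NAMEOF2005INSTANCE'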

Maybe this is already a known issue using SQLNCLI from 2008, especially since it's at SP1 CU5, but I wasn't able to find anything online about this behavior. We're going to make sure to use the SQLNCLI10 provider going forward, but I'm curious if anyone else has noticed this same issue.

Update 12/24/09:  Submitted this as bug id 522048.

Update 1/27/10: Another user was able to reproduce this connecting to SQL 2000 from SQL 2008.  Our workaround of using the SQLNCLI10 provider has worked out well.