This post isn’t a walk though on how to configure log shipping. Those can be found other places. This more of a story, culminating in recommendations for a log shipping configuring for servers will many large databases. The environment I was working with happens to be used by SharePoint 2010 but these recommendations apply to any log shipping configuration of any size.
We have a fairly large SharePoint 2010 environment. Spread across 2 separate farms, there is about 3 TB of used data consisting of about 56 separate content databases, plus all the search and supporting databases. Each farm is deployed on highly available infrastructure and we have “read only” fall back farms to provide access to content during outages. To keep the databases on these read only farms in sync we used to do a simple backup/restore job, but we are now using log shipping.
As you can imagine, with such a large amount of data and a large amount of databases, a full backup/restore took some time. About 6 hours for both farms, which I thought was impressive for 3 TB of data. Over the past year the size of our SharePoint environment more than doubled. I went to some extensive lengths to keep the restore time this time at a minimum, like adding multiple NIC cards and sending data to each network interface. However, it still wasn’t ideal. Originally, we discussed log shipping, but since the farms were not as large back then we decided to move quickly with a backup/restore method and follow up later after our SharePoint 2007 to 2010 upgrade to implement log shipping.
We finished the upgrade around September 2011. So after that, I went ahead and spent the time setting up log shipping. I knew I would have some unique challenges because of the amount of databases that need to be log shipped and the size.
The SQL Server 2008 R2 Management Studio GUI for setting up log shipping is pretty good. However it is missing the logic to enable compression on standard edition servers. As you may know, Microsoft allows backup compression on standard edition for 2008 R2. However it seems they missed updating the log shipping wizard in Management Studio. To work around this issue you have to script out your configuration and manually add the parameter to enable compression. The stored procedure is sp_add_log_shipping_primary_database, and the parameter is, @backup_compression = 1.
From the beginning we decided that we wanted to kill all connections in the destination database each time the transaction log is shipped and restored. The GUI has a checkbox for this functionality and it corresponds to a parameter, @disconnect_users = 1, for the sp_add_log_shipping_secondary_database stored procedure. However, even with this I still ran into issues with the restore of the transaction log. Every once and a while a restore would fail. Looking though the error logs I could see the following error.
Error: Exclusive access could not be obtained because the database is in use.
RESTORE LOG is terminating abnormally.(.Net SqlClient Data Provider) ***
I am configuring the log shipping to disconnect users. How could this be happening? I did some tracing and came to the following conclusion. When the logship.exe executable runs on the server to do the restores, it executes “alter database set single_user with rollback immediate” to kill all user connections to the database. And immediatly after, before the logship.exe can execute the restore statement, a new connection from SharePoint comes in and consumes that single user connection. There isn’t anything I can do about this, so I looked for workarounds.
I decided to set 3 retry attempts for the lsrestore jobs. This way if it fails it will simply try to do the restore again up to three times. After letting this run for a while, it was fairly effective in reducing the failures. But it didn’t eliminate them. Every couple days there would still be a failure, which meant the restore failed 3 times in a row. Each time it was always the same error as before.
I then decided to investigate the timing of these jobs. Originally we setup the jobs to run every 15 minutes. Now if you are not careful with the scheduling it can take up to 3 times your interval to actually have data show up on your secondary database. This happens if you schedule each set of jobs to start at the same time. So for example, if you go with the default all your jobs will start at 12:00:00 AM. The job to copy the backup file will run, and it will see no backups available to copy because the backup job also started at the same time and is still writing the backup file to disk. Same for the restore job, it will look for a file to restore and nothing will be there because the copy job also started at the same time and is still copying the file over the network. With a 15 minute interval, a 45 minute delay was acceptable for us. But I wanted to run the jobs less frequently to avoid this issue I was having. I talked to our SharePoint administrator and he was fine with running the jobs every hour. But since they were all starting at the same time, that would mean it would take 3 hours for data to show up in the secondary. The simple solution here is to delay the start of each job so that the prerequisite job can finish. So it setup all the lsbackup jobs to start at 12:00:00, then all the lscopy jobs to start at 12:10:00, then all the lsrestore jobs to start at 12:20:00. This way every hour on the hour, data would take 20 minutes or so to show up in the secondary.
This helped a little bit, but didn’t solve the problem with the restore jobs not getting exclusive access. I then decided to stagger the start time of each individual database. The first farm had 20 content databases, so I staggered the start time of each set of jobs, lsbackup, lscopy, and lsrestore, for each database by 30 seconds. Bingo. After I did this I didn’t see any more restore failures.
I think this was successful because starting the jobs for 20 databases all at the same time put too much load on the server and caused too much delay between setting it to single_user and running the restore statement. I suspected this would be an issue from the start. Most servers out there would probably experience a noticeable impact if 20 databases suddenly did a transaction log backup simultaneously. There would be noticeable network utilization spike if 20 backup files were all copied at the same time, and there would be a noticeable spike if 20 databases all restored transaction log backups at the same time.
After going through all this my recommendations for log shipping a lot of large databases are:
- Stagger the start time for each set of log shipping jobs (lsbackup, lscopy, lsrestore) for each database by a decent amount of time to give each time time to complete before the next one starts.
- Stagger the start time for each set of lsbackup jobs, each set of lscopy jobs, and each set of lsrestore jobs by a few seconds to spread out the load on your server.
- Set some retry attempts for the lsrestore jobs.