By Steve Endow

This is a long post, but I think the context and the entire story help paint a picture of how things can fail in unexpected and odd ways, and how storage failures can be more complicated to deal with than you might expect.  I learned several lessons so far, and I'm still in the middle of it, so I may learn more as things unfold.

On Tuesday evening, I received several emails from my backup software telling me that backup jobs had failed.  Some were from Veeam, my absolute favorite backup software, saying that my Hyper-V backups had failed.  Others were from Acronis True Image, saying that my workstation backup had failed.

Hmmm.


Based on the errors, it looks like both backup apps were unable to access my Synology NAS, where their backup files are stored.

That's odd.

When I tried to access the UNC path for my Synology on my Windows desktop, I got an error that the device could not be found.  Strange.

I then opened a web browser to login to the Synology.  But the login page wouldn't load.  I then checked to make sure the Synology was turned on.  Yup, the lights were on.

After several refreshes and a long delay, the login page eventually loaded, but I couldn't login.  I then tried connecting over SSH using Putty.  I was able to connect, but it was VERY slow.  Like 30 seconds to get a login prompt, 30 seconds to respond after entering my username, etc.  I was eventually able to login, so I tried these commands to try and reboot the Synology via the SSH terminal.

After issuing the command for a reboot, the power light started blinking, but the unit didn't shutdown.  Strangely, after issuing the shutdown command, I was able to login to the web interface, but it was very slow and wasn't displaying properly.  I eventually had to hold the power button down for 10 seconds to hard reset the Synology, and then turned it back on.

After it rebooted, it seemed fine.  I was able to browse the shares and access the web interface.  Weird.

As a precaution, I submitted a support case with Synology asking them how I should handle this situation in the future and what might be causing it.  I didn't think it was a big deal.

On Wednesday evening, I got the same error emails from my backup software.  The backups had failed.  Again.  Once again, the Synology was unresponsive, so I went through the same process, and eventually had to hard reset it to login and get it working again.

So at this point, it seemed pretty clear there is a real problem.  But it was late and I was tired, so I left it and would look into it in the morning.

On Thursday morning, the Synology was again unresponsive.  Fortunately, I received a response from Synology support and sent them a debug log that they had requested.  Within 30 minutes I received a reply, informing me that the likely issue was a bad disk.

Apparently the bad disk was causing the Synology to deal with read errors, and that was actually causing the Synology OS kernel to become unstable, or "kernel panic".


This news offered me two surprises.  First, I was surprised to learn that I had a bad disk.  Why hadn't I known that or noticed that?

Second, I was surprised to learn that a bad disk can make the Synology unstable.  I had assumed that a drive failure would be detected and the drive would be taken offline, or some equivalent.  I would not have guessed that a drive could fail in a way that would make the NAS effectively unusable.

After reviewing the logs, I found out why I didn't know I had a bad drive.


The log was filled with hundreds of errors, "Failed to send email".  Apparently the SMTP authentication had stopped working months ago, and I never noticed.  I get so much email that I never noticed the lack of email from the Synology.

The drive apparently started to have problems back in July, but up until this week, the Synology seemed to still work, so I had no reason to suspect a problem.

Synology support also informed me that the unit was running a "parity consistency check" to try and verify the data on all of the drives.  This process normally slows the unit down, and the bad drive makes the process painfully slow.

After a day and a half, the process is only 20% complete, so this is apparently going to take 4-5 more days.


So that's great and all, but if I know I have a bad drive, can't I just replace the drive now and get on with the recovery process?  Unfortunately, no.  Synology support said that I should wait for the parity consistency check to complete before pulling the bad drive, as the process is "making certain you are not suffering data/ volume corruption so you can later repair your volume with no issues."

Lovely.  So waiting for this process to complete is preventing me from replacing the bad drive that is causing the process to run so slowly.  And I'm going to have to wait for nearly a week to replace the drive, all the while hoping that the drive doesn't die completely.

I'm sensing that this process is less than ideal.  It's certainly much messier than what I would have expected from a RAID array drive failure.

But that's not all!  Nosiree!


In addition to informing me that I have a bad drive that is causing the Synology to become unusable, it turns out that I have a second drive that is starting to fail in a different manner.


Notice that Disk 6 has a Warning status?  That's actually the second bad drive.  The first bad drive is Disk 2, which shows a nice happy green "Normal" status.

After reviewing my debug log, Synology support warned me that Disk 6 is accumulating bad sectors.

Sure enough, 61 bad sectors.  Not huge, but a sign that there is a problem and it should probably be replaced.


Lovely.

So why did I not know about this problem?  Well, even if SMTP had been working properly on my Synology, it turns out that the bad sector warnings are not enabled by default on the Synology.  So you can have a disk failing and stacking up bad sectors, but you'd never know it.  So that was yet another thing I learned, and I have now enabled that warning.


So, here's where I'm at.

I've fixed the email settings so that I am now getting email notifications.

I'm 20% into the parity consistency check, and will have to wait 5+ more days for that to finish.

As soon as I learned that I had 2 bad drives on Thursday morning, I ordered two replacement drives.  I paid $50 for overnight express shipment with morning delivery.  Because I wanted to replace the drives right away, right?  But that was before Synology emphasized that I should wait for the parity check to complete.  So those drives are going to sit in the box for a week--unless a drive dies completely in the meantime.

If the parity check does complete successfully, I'll be able to replace Drive 2, which is the one with the serious problems.  I'll then have to wait for the Synology to rebuild the array and populate that drive.

Once that is done, I'll be able to replace Drive 6, and wait for it to rebuild.

Great, all done, right?

Nope.  I'll need to hook up the two bad drives and run the manufacturer diagnostics and hopefully get clear evidence of an issue that allows me to RMA the drives.  Because I will want the extra drives.  If I can't get an RMA, I'll be buying at least 1 new drive.

This experience has made me think differently about NAS units.  My Synology has 8 drive bays, and I have 6 drives in it.  The Synology supports hot spare drives, so I will be using the additional drives to fill the other two bays and have at least one hot spare available, and most likely 2 hot spares.

Previously, I didn't think much of hot spares.  If a drive fails, RAID lets you limp along until you replace the bad drive right?  In concept.  But as I have experienced, a "drive failure" isn't always a nice clean drive death.  And this is the first time I've seen two drives in the same RAID array have issues.

And it's also shown me that when drives have issues, but don't fail outright, they can make the NAS virtually unusable for days.  I had never considered this scenario.  While I'm waiting to fix my main NAS, my local backups won't work.  And this Synology is also backing up its data to Backblaze B2 for my offsite backup.  That backup is also disabled while the parity check runs.  And I then have another on-site backup to a second Synology unit using HyperBackup.  Again, that backup is not working either.  So my second and third level backups are not available until I get my main unit fixed.

Do I redirect my backup software to save to my second Synology?  Will that mess up my backup history and backup chains?  I don't know.  I'll have to see if I can add secondary backup repositories to Veeam and Acronis and perhaps merge them later.

Another change I'll be making is to backup more data to my Backblaze B2 account.  I realized that I was only backing up some of the data from my main Synology to B2.  I'll now be backing up nearly everything to B2.

So this has all been much messier than I would have imagined.  Fortunately it hasn't been catastrophic, at least not yet.  Hopefully I can replace the drives and everything will be fine, but the process has made me realize that it's really difficult to anticipate the complications from storage failures.


You can also find him on Twitter, YouTube, and Google+