As many of you reported, TV Tropes went offline at 2:40am EST on July 7th. It was offline for over 14 hours, the worst outage in many years. We did receive email and text alerts when it happened, but unfortunately it was a major hardware failure which took quite a while to get under control.
The cause was a total failure of our database cluster. 6 of our 8 hard drives failed simultaneously, resulting in a complete loss of data. We had to have our server company replace the cluster, and then we had to rebuild the site from database backups. We do automatic backups every morning. Unfortunately, the failure happened hours before the next backup, so 24 hours of changes were lost.
To make it worse, the history of wiki changes is only backed up twice a week because it is over 1TB in size. We are working on restoring that now, so the history tab is blank on all pages until it's done. Editing will be offline for another 24 hours until we get that fixed. It also means we'll lose 72 hours of wiki history due to the timing of the last backup.
We will be working on optimizing our database structure so we can increase the frequency of our database backups to protect the data in the future.
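To make the backup-timing arithmetic concrete, here is a minimal sketch of how the loss window follows from the schedule. The failure time and the history restore point (July 4th, 10am EST) come from this post; the 5:00am daily backup time is purely an assumption for illustration, since the post only says "every morning", and the function name is hypothetical.

```python
from datetime import datetime, timedelta


def data_loss_window(failure: datetime, backups: list[datetime]) -> timedelta:
    """Time between the last backup taken before the failure and the failure itself."""
    earlier = [b for b in backups if b <= failure]
    if not earlier:
        raise ValueError("no backup was taken before the failure")
    return failure - max(earlier)


# Failure time from the post: July 7th 2020, 2:40am EST.
failure = datetime(2020, 7, 7, 2, 40)

# History backup: restored as of July 4th, 10am EST (per the update in this post).
history_loss = data_loss_window(failure, [datetime(2020, 7, 4, 10, 0)])

# Daily backup: ASSUMED to run at 5:00am; the real time is not stated.
daily_backups = [datetime(2020, 7, day, 5, 0) for day in range(1, 8)]
daily_loss = data_loss_window(failure, daily_backups)

print(history_loss)
print(daily_loss)
```

The worst case for any schedule is one full interval between backups, so a daily backup can lose up to 24 hours, and an evenly spaced twice-weekly backup up to about 84 hours. Shortening the interval is the only thing that shrinks that window.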
We have redundant web servers on a load balancer, redundant database servers in a cluster, and redundant hard drives in every server. So how did this happen? According to our server company, there was a manufacturer's bug in the firmware of the specific drive model used by 6 of our 8 hard drives. That bug caused the disks to die after a certain number of hours running. We don't yet have all the details. They are reaching out to the manufacturer to get more information. I'll update here as I learn more.
UPDATE: (July 8th)
Editing is now enabled! History should be restored as of July 4th 10am EST. The history database imported faster than I expected (4 hours to decompress a 1.1TB SQL file, 12.5 hours to import).
The only thing I haven't done yet is purge the CDN cache. You must log out to view a cached page. Logged-in users get the live site. Not all pages are still cached, and cached pages do expire.
I'll hold off on purging the cache for a few more hours. If there is some specific edit you remember doing during those 24 hours that were lost, you may be able to find it by logging out and viewing the cached page. Then log in and make that edit again.
Edited by itcdr on Jul 8th 2020 at 7:12:45 AM
I work for a company that makes storage appliances. I know all about drive firmware bugs. It's a damned miracle any of this stuff works at all... <sigh>
Oh man, 6 out of 8 drives failed simultaneously, what?! I gotta wonder if those drives had never been run for that long before and then failing after a certain number of hours is a new issue or not.
It was bugged firmware that caused them to brick out of nowhere after (possibly) 40,000 hours. It's not that new an issue. If it was the brand I suspect it was and not a new model turning out to be affected, the server company should have updated the firmware.
Edited by nm3youtube on Jul 10th 2020 at 8:21:27 PM
I know this disaster was almost a week ago, but GEEZ. If it weren't for those automatic server backups, this would've been the end of TV Tropes.
If you don't mind my asking, what plans do we have going forward?
As stated above "We will be working on optimizing our database structure so we can increase the frequency of our database backups to protect the data in the future." (I don't know how they'll optimise it as it seems like mostly text and what the frequency would be, but we'll see). They'll also likely replace those bugged hard drives (not stated, but seems like common sense).
Edited by Piterpicher on Jul 13th 2020 at 4:32:19 PM
Currently mostly inactive. An incremental game I tested: https://galaxy.click/play/176 (Gods of Incremental)
The hosting company already replaced them under warranty.
Under, "It's your goddamn responsibility that these failed so you'd better fix it sharpish," warranty or no.
Edited by Fighteer on Jul 13th 2020 at 4:17:50 AM
"It's Occam's Shuriken! If the answer is elusive, never rule out ninjas!"
TV Tropes crashed again.
I think it was a server thing. A lot of sites went down temporarily, at least for me and my boyfriend.
Currently Working On: Incorruptible Pure Pureness
Agh, I'm so thrown off by the borked timezones I keep thinking these posts were made hours ago.
Jawbreakers on sale for 99¢
Same...it's always jarring to re-adjust after just waking up.
But weird, yeah, at least it wasn't just us this time.
Currently Working On: Incorruptible Pure Pureness
I was scared we'd lose another 24 hours, but since it was an external server issue I guess it didn't hurt the site's data.
I do some cleanup and then I enjoy shows you probably think are cringe.
We didn't lose any content this time, right?
With Great Power, Comes Great Motivation
Nah, it was just Cloudflare crashing.
Currently Working On: Incorruptible Pure Pureness
I've slept through it, but yeah, services did crash.
Currently mostly inactive. An incremental game I tested: https://galaxy.click/play/176 (Gods of Incremental)
Seems like the issue is now resolved, so I've taken down the headline and will close this thread as soon as I am done typing this post.
"For a successful technology, reality must take precedence over public relations, for Nature cannot be fooled." - Richard Feynman
Ah, good to know.
CM Dates; CM Pending; CM Drafts