Dismounted Feudal Knight/Site Updates: Difference between revisions
From TWC Wiki
Revision as of 18:49, 19 March 2025
This page is a mirror of GED's updates posted on the forum in case the forum is down or for the public interest of unregistered users. If this page goes down, my personal mirror at raidarr wiki will be up (you may wish to save both pages until all steps are complete). When the forum itself is down GED will post updates primarily on the root page of the site.
Summary
GED is making long-needed, massive updates to the site. This will begin with fixing forum performance issues and striking down bot spam, which in the meantime has resulted in disabled registration, restricted visibility and turned off features until they can be performant again. Larger projects are better integration of Cloudflare and a massive pruning of unnecessary forums, users, and unneeded data. This will be tedious and require care. Finally, TWC's software will be completely upgraded from operating system to forum version and wiki version. The middle step is particularly tedious as it is likely plugins will not be as compatible, and the bulk of TWC subforums and permissions is unmatched by most websites on the internet. The destination software will have to be very good at handling this.
Currently the forum is focusing on cleanup and addressing immediate performance problems. Registration and guest visibility are disabled.
If there is a link to a resource you need and the forum is online but you do not have an account, please reach out via email to [email protected] or on Discord.
As of March 19th
First, apologies to everyone. It's been a rough 10 days or so.
For those that did not see the notice I posted, on Friday the 7th we had a second hard drive failure in the RAID 5 array. When that happened the array went into read only mode, and there were a BUNCH of database tables that were trying to be written to. At that time I had not even ordered a replacement for the first drive failure. The array runs on 3 drives with a 4th as a hotswap spare. These are enterprise class Hitachi drives spinning at 15k rpm. I have never had one fail, much less 2 fail on the same array within 10 days of each other. I guess we won the lottery on that. Maybe the power issues I had here were part of it, maybe someone was asleep at the wheel on the production line when these were made. In any case, it wreaked havoc on our data.
Here is part of the SMART info on one of the drives for those that care:
Vendor (Hitachi) factory information
number of hours powered up = 21362.13
number of minutes until next internal SMART test = 58
Error counter log:
          Errors Corrected by          Total    Correction    Gigabytes     Total
              ECC         rereads/     errors   algorithm     processed     uncorrected
          fast | delayed  rewrites   corrected  invocations   [10^9 bytes]  errors
read:      9806      128         0       9934      11033          4.463        1099
write:        0        0         0          0          0          0.191           0
verify:     495       17         0        512        550          0.000          38
Notice the hours in red. That's how long the drives have been running since I installed them. I upgraded this array from the WD Raptors at 10k rpm to the Hitachi at 15k rpm 2.5 years ago. Maybe I should have gone with SSD back then, but I still do not fully trust them in high write cycle environments. Whatever, the point is they failed at nearly identical hours of use at 2.43 years.
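As a rough check of that figure, the powered-on hours from the SMART output can be converted to years. This is only a sketch; the 365-day year is an assumption, so the last digit depends on rounding.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: 365-day year, leap days ignored


def powered_on_years(hours: float) -> float:
    """Convert SMART powered-on hours to years of continuous running."""
    return hours / HOURS_PER_YEAR


# Value taken from the "number of hours powered up" line above.
years = powered_on_years(21362.13)
print(f"{years:.2f} years")
```

That comes out at a little over 2.4 years, in line with the number quoted above.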
So at that point I had to make a decision. Either apply a bandage and get the site up quickly, or leave it down for a while and make a ton of changes I have been wanting to make anyways. I decided to bite the bullet and leave it down. You can all yell at me for that now.
I have a small server with ISPConfig on it running my personal business site and a couple of other small ones. That is what the notices were posted on, but that machine is not even close to powerful enough to host something like TWC. So the plan was to throw some quick info up on that, get some new drives, rebuild the array, update the wiki, update the forum, reload Odin with new OS, and put it back up running under XenForo. That is a PILE of work to do in a week.
I got the drives last Weds (a week ago today) and got the array mounted in a half-assed stable manner. Performed a database dump of the wiki and the forums to my local PC and started loading up Thor. Thor is the web server we had before Odin back when we were on vBulletin 3. It's a 16 core server @ 1.8 GHz with 32 gigs of RAM with the Raptor drives in it. It's been powered down for quite a while now but I kept it around.
I also ended up having to replace one of my Cisco routers that bricked during the power outage issue. They do not like being powered off when loading Cisco IOS into them and I could not recover it. So I got a Ubiquiti EdgeRouter 4 instead of buying another $900 Cisco. As I said I have been wanting to make a lot of changes anyways and that includes to my home network. I am not going to go into a ton of detail on that boring stuff but basically I got a bunch of stuff moved from my office into a room in the basement for the server racks and network gear blah blah. Still some stuff to do on that.
Anyways, after I got Thor loaded the plan was to install the wiki and update it, going through the new versions to get us up to 1.43. We are currently on 1.32 which is running on PHP 7.1. Major PHP releases are 8.1 or even 8.4 now, and I was way behind on operating system updates too. The first time I ran the wiki on a local IP only there were a TON of database problems. In the thousands. Some of them could be corrected by tools like the update.php script included with MediaWiki. A lot of them could not. Every time I thought I had it fixed I would find more. The same with the forums. I had to fix a bunch of stuff manually, then make a copy of that fix so I could roll back if needed. It was pretty grueling.
Another thing people probably do not know is how big our database is. We are a total of 306 gigs between the forums and the wiki just on the database. And another 400ish gigs on web files: images, downloads, all that stuff. Just moving stuff from one server to another takes forever. If you do a straight copy it's faster than using rsync, but I use rsync because it preserves user groups and permissions.
root@odin:/# du -sh /var/lib/mysql
306G /var/lib/mysql
root@odin:/# du -sh /var/www
380G /var/www
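To give a feel for why moving this much data takes so long, here is a minimal sketch that turns the two du figures into a best-case copy time. The 1 Gbit/s link speed is purely an assumption for illustration; it is not stated in the post, and real copies of many small files run well below line rate.

```python
GIB = 1024 ** 3  # du -h reports sizes in powers of 1024


def transfer_hours(size_gib: float, throughput_bytes_per_s: float) -> float:
    """Best-case hours needed to move size_gib at a steady throughput."""
    return size_gib * GIB / throughput_bytes_per_s / 3600


total_gib = 306 + 380   # database plus web files, as reported by du -sh
link = 1e9 / 8          # assumption: a fully saturated 1 Gbit/s link, in bytes/s
hours = transfer_hours(total_gib, link)
print(f"{total_gib} GiB needs at least {hours:.1f} h at the assumed link speed")
```

Treat the result as a lower bound: per-file overhead, disk seeks and checksum work all add to it.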
When I got the database stable I loaded a blank version of MediaWiki on Thor to make sure I had all the dependencies correct and it ran just fine. Some of you probably saw that at wikibox.twcenter.net (down now) and I started moving up the MediaWiki versions and PHP versions. And then I got stuck in dependency hell. MediaWiki uses a tool called Composer to help with this. I am not sure if it helps or makes it worse. I got stuck in a vicious cycle between PHP 7.3 and 7.4 and I couldn't get out of it. Somewhere in there I seriously ♥ed up the PHP install on Thor and couldn't get any version of PHP to load at all. Even doing an apt purge php* and removing every single PHP reference would not work. I was still missing something. So I said screw it and wiped the operating system and started over. And got stuck in the same place. Twice.
Thor has now been completely reloaded probably 5 times because I am running into an issue upgrading PHP versions to match what MediaWiki wants, and something in there is still trying to reference an older version of PHP and throws a 500 error with a completely useless log entry. I say Thor has been reloaded 5 times not because I have been counting, but because Let's Encrypt will only issue a certificate 5 times in a 7 day period and I have maxed that out. It's frustrating as hell, and only compounded by my lack of sleep and trying to do things fast to get the site back up. I don't know what I screwed up but I know I need to take a step back and catch my breath before I dive back into it. I am sure it's something totally stupid or skipping a step someplace but for now I cannot see it.
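The certificate limit mentioned above (5 issuances in a 7 day period) behaves roughly like a sliding window. The sketch below only illustrates that idea; the dates are hypothetical and Let's Encrypt's real accounting may differ in detail.

```python
from datetime import datetime, timedelta

LIMIT = 5                   # issuances allowed per window, as quoted in the post
WINDOW = timedelta(days=7)  # length of the window, as quoted in the post


def can_issue(previous: list[datetime], now: datetime) -> bool:
    """Return True if fewer than LIMIT issuances fall inside the window."""
    recent = [t for t in previous if now - t < WINDOW]
    return len(recent) < LIMIT


# Hypothetical history: one certificate per OS reload, five days in a row.
start = datetime(2025, 3, 12)
issued = [start + timedelta(days=d) for d in range(5)]

blocked_now = can_issue(issued, datetime(2025, 3, 18))        # all five still in window
allowed_later = can_issue(issued, datetime(2025, 3, 19, 12))  # oldest has aged out
print(blocked_now, allowed_later)
```

With a window like this, waiting for the oldest issuance to age out frees up one slot at a time.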
So I decided about 3:30 this morning to take the array out of the virtual environment I had it loaded in and put it back on Odin with the new drives, and put the site back online. That took most of today.
The plan is to take a couple of days off and get some personal stuff done. This weekend I am wiping my personal server (I loaded about 9 versions of PHP on that too) and reloading it. And to finish some network layout in the basement. I have a new UPS coming to keep switches and routers up and I should be able to shut them down gracefully if I have more power issues. Odin can stay up for about 10 minutes in the case of a power failure which also gives it time to shut down.
Once that stuff is done then I will get back into updating the wiki and figure out what I screwed up without being on a time crunch. Once that is stable then it will be moved to a server separate from the forums so I can upgrade the main site. I plan to do that on a separate machine as well so the site can stay up while I do it. Once I have it stable then I can shut the site down and dump all the new stuff and put it back up.
Again, apologies.
--GrnEyedDvl
As of February 27th
This has NOT been a good week. I fixed most (maybe all) of the database problems manually and was doing some testing with Cloudflare on Monday night when suddenly I could not write to the drives on the server. I had already done all the major copies of the database and the entire file structure. All I was trying to do was get an SSL certificate issued for sandbox.twcenter.net and it would NOT write the file. That is where I am going to install a second copy of the site for when I start purging some database stuff. At the moment it's just a placeholder index file, and as you can see it does not have SSL running on it.
This is a RAID5 setup with 4 drives. So I flushed a bunch of caches and performed a reboot. The server would not boot. One of the drives in the array failed, which should NOT prevent it from booting. What it is supposed to do is boot in a degraded status and then automatically kick in the spare drive and sync it up, then continue to a normal boot. Instead it was giving me a kernel panic and refusing to read the entire array.
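For readers wondering why a RAID 5 array is expected to survive one failed drive, the idea is XOR parity: each stripe stores a parity block from which any single missing block can be recomputed. The sketch below is a toy illustration of that principle with made-up block contents, not how the controller or the Linux software RAID on this server is actually implemented.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on a 3-drive RAID 5: two data blocks plus one parity block.
data_a = b"forum db"   # hypothetical block contents
data_b = b"wiki db!"
parity = xor_blocks(data_a, data_b)

# The drive holding data_b fails: rebuild its block from the two survivors.
rebuilt_b = xor_blocks(data_a, parity)
print(rebuilt_b == data_b)
```

The same arithmetic shows why a second failure before the rebuild finishes is so damaging: with two of the three blocks gone there is nothing left to XOR against.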
This has been driving me bat ♥ crazy for two days as I could not figure out why GRUB would not update so it could boot degraded and then bring in the spare. I was literally almost at the point of taking the server down to a company I know in Denver that specializes in data recovery for failed RAID arrays when I tried one last thing that actually worked and the server came up. I was actually very surprised when I walked back in my office after letting my dogs out and saw a login prompt.
I am going to leave it as it is for a day or two, I haven't gotten much sleep the last two nights and I have some personal stuff to do.
A bunch (maybe all) of the plugins are still disabled. I do not have time to look through all that stuff at the moment. I will do some small tweaking later tonight but I am not doing any major changes until I get a new spare drive and get that installed. I figured I could leave the forums up while I waited.
I DO need to shut it down entirely either tonight or sometime tomorrow and remove all the drives and physically relabel them. They are hotswap drives so I could theoretically just pull them and do it, but that would degrade the array forcing a rebuild and I do NOT want to do that without a good spare. This is the second time in 10ish years we have had a drive fail, and the last time I replaced a drive I labeled it wrong. I did not discover that until last night, so I need to fix it.
I will post more later.
Some site functions might be missing because of disabled plugins.
--GrnEyedDvl
Regarding Cloudflare
The Cloudflare problem of not allowing our SSL certificate to pass through the caching system seems to have been solved, on the wiki at least.
If you ping twcenter.net you should get 199.xxx.xxx.150 which is in the block of IP addresses I own.
If you ping wiki.twcenter.net you should get a Cloudflare IP something like a 172.xxx.xxx.xxx
I am not going to enable that on the main forums until I know there are no problems with the wiki. I need someone to edit a few pages, login, logout, stuff like that. And tell me if there are any issues. I can do it from here, but it's not the same. I am inside the network basically.
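The ping check described above can also be done in a script by testing whether a resolved address falls inside one of Cloudflare's published ranges. This is a minimal sketch: the list below is only a partial copy of those ranges and may be out of date, and both sample addresses are hypothetical stand-ins.

```python
import ipaddress

# Partial list of Cloudflare's published IPv4 ranges (assumption: still current).
CLOUDFLARE_NETS = [
    ipaddress.ip_network(net)
    for net in ("104.16.0.0/13", "104.24.0.0/14", "172.64.0.0/13", "162.158.0.0/15")
]


def is_cloudflare(ip: str) -> bool:
    """Return True if ip falls inside one of the listed Cloudflare ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in CLOUDFLARE_NETS)


proxied = is_cloudflare("172.67.10.20")  # hypothetical proxied address
direct = is_cloudflare("192.0.2.150")    # documentation-range stand-in for the origin
print(proxied, direct)
```

In practice the address would come from a DNS lookup of the hostname, for example with socket.gethostbyname.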
Some time later, 40 minutes after enabling on the forum:
These are the stats from Cloudflare about SSL requests and blocked traffic. Note that it says for the last month, but Cloudflare has only been enabled for about 20 hours now on the wiki and about 40 minutes on the main site.
600,000 SSL redirects. That's 600,000 redirects that our server did not have to perform. 200,000 Attacks blocked. That is what took the load off us. That's 5,000 connections per minute for the last 40 minutes. Or 83 per second. The wiki was not getting hit.
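The per-minute and per-second rates follow directly from the blocked-attack count and the 40 minutes Cloudflare had been active on the main site. A quick check of that arithmetic:

```python
blocked = 200_000  # attacks blocked, from the Cloudflare stats quoted above
minutes = 40       # time Cloudflare had been enabled on the main site

per_minute = blocked / minutes
per_second = per_minute / 60
print(f"{per_minute:.0f} per minute, {per_second:.0f} per second")
```

This assumes, as the post does, that essentially all of the blocked traffic was aimed at the main site rather than the wiki.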
--GrnEyedDvl
First notice
7:58 PM Feb 24
The database and file structure are about 30% copied into a sandbox environment. I am going to let that finish overnight and do some more tomorrow afternoon. I might as well throw out a few goals so you guys can see what the plan is.
- Create a sandbox - partially complete, when this is up it will be at sandbox.twcenter.net
- Figure out what parts did not get written correctly and manually fix - could be quick, probably not
- Remove the old front end and point the index to the Articles section. This was always the plan for Articles
- Prune - Manually prune a ton of stuff. Users, old posts from stuff like the Mudpit, etc
- Update the OS on the server. It's time for the latest Long Term Support version of Ubuntu
- Update vBulletin. We are still using vBulletin 4, mostly because some of the numerous plugins we use were never updated for vBulletin 5, and the latest is now vBulletin 6. We may lose some plugins. We will have to survive without them.
- Figure out Cloudflare. It's not playing nicely with our SSL from Let's Encrypt. Might have to go with a different solution for that, but I definitely want Cloudflare back on the front end to help mitigate DDoS attacks in the future. On my personal business site it works fine. Irritating.
- Fix the site email situation. There is some stuff on this I won't go into yet
- Update the wiki software. This one should be pretty easy, so it will be last.
Make no mistake, this will be an extended down time. Mostly because since I have to do a bunch of stuff anyways I might as well take the time to do some things that have been needed for quite a while. Some of this is very time consuming.
One of the problems with vBulletin is that you cannot easily purge data. For instance I saw a comment about purging old users. I wish it was as simple as clicking a few buttons and purging those, but it isn't. There is a function to prune users but it's clumsy as hell, very server intensive and takes forever to run. And it has to be run manually. Running mass operations on the database via PHP is ridiculously slow compared to doing it via command line on the server.
The same applies to pruning old posts, or forums, or anything else. I have to write some custom scripts for this, and I HAVE to make sure I get it right before I deploy this.
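One common shape for a custom prune script is to delete in small batches so that no single statement holds the database for long. The sketch below shows that pattern only: it uses an in-memory SQLite database as a stand-in for the real MySQL server, and the users table, its columns and the cutoff are hypothetical, not vBulletin's actual schema.

```python
import sqlite3


def prune_in_batches(conn: sqlite3.Connection, cutoff: int, batch_size: int = 1000) -> int:
    """Delete users whose last_visit is older than cutoff, one batch at a time."""
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM users WHERE id IN ("
            "SELECT id FROM users WHERE last_visit < ? LIMIT ?)",
            (cutoff, batch_size),
        )
        conn.commit()  # commit per batch so each transaction stays short
        if cur.rowcount == 0:
            break
        total += cur.rowcount
    return total


# Hypothetical test data: 100 users for each last-visit year from 2005 to 2024.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, last_visit INTEGER)")
conn.executemany(
    "INSERT INTO users (last_visit) VALUES (?)",
    [(year,) for year in range(2005, 2025) for _ in range(100)],
)

deleted = prune_in_batches(conn, cutoff=2015, batch_size=250)
remaining = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(deleted, remaining)
```

Running something like this against a sandbox copy first, and comparing the row counts before and after, is one way to make sure a script is right before it touches the live data.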
The wiki will stay up for now.
wiki.twcenter.net
GrnEyedDvl
