Server Crash Guide
Hey guys, I created this thread to help people determine what counts as a server crash, what type it is, and what is not really a crash.
The idea is that you guys can notify me of crashes and tell me what kind each one was. This will allow me to troubleshoot and ideally reduce the number of crashes.
"Crashing" on mapchange
The server is considered to have crashed if it comes back on the default map.
This is either dustbowl or granary, depending on the server (for tf2f). If the server continues onto the next map, it has not crashed.
There is a bug in Valve's server code that causes a longer than normal mapchange, and clients will get one of those red auto-disconnect messages. If you see this message at the end of a round, the server has likely NOT crashed. It's just doing a very lame mapchange. I don't know what causes this, but I DO know that it happens on all TF2 servers. If you go to other servers, you will sometimes see a message telling people to rejoin on mapchange if they get the disconnect error.
A normal mapchange should take less than 20 seconds. A buggy Valve mapchange might take up to 45 seconds. Just wait until the server has finished changing maps and rejoin. I will be adding an ad to the servers tonight that tells people to do this.
This issue is something neither I nor clutchkill can fix. While I don't technically consider it a crash, most people will call it one and it does cause a loss of players.
Typical Frequency: Multiple times per day.
ISP downtime
Our clutchkill servers are hosted in Chicago at the Internap datacenter. From time to time, all clutchkill servers and possibly other servers at Internap will lose their net connections. Sometimes this lasts for only 5 minutes, sometimes it's more like hours.
You can identify ISP downtime by all of the servers going down at once and then coming back on non-default maps. Note that this issue is not technically a crash, but it still affects player count.
This issue is something neither I nor clutchkill can fix. The best we can do is wait it out.
Typical Frequency: Less than twice per month.
Dedicated box crashes
Every once in a while the machine that hosts all 7 of our servers will go down, either because of hardware failure, a reset, or just a general crash (it runs Windows).
You can identify this sort of crash by all of the servers going down at once and then returning on their default maps.
Typical Frequency: Less than twice per month.
General crashes
The most common type of crash is where one server goes down and comes back (or doesn't) while the rest stay stable.
Causes of these crashes range from general Valve server crashes, to one of our addons, to one of our plugins.
I run about 30 plugins on our servers. The first 15 are relatively mandatory, and the next 15 give us a lot of nice features.
All plugins are developed by the gaming community and are possible sources of crashing. The same goes for our two addons (SourceMod and Metamod). The fact that plugins and addons are created by non-professional programmers means two things:
1. We can expect bugs/crashes
2. There is no support for these plugins. By support I mean that I can't call someone up, say plugin X is crashing, and expect anyone to give a damn.
I recently took TFTrue off of our private servers to confirm that the plugin was indeed crashing the servers.
For example, the only difference between server 1 and server 4 is TFTrue, but server 4 crashed 13 times in 5 days while server 1 did not crash once.
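To put the two servers on the same scale, here is a tiny sketch that turns the counts above into a crashes-per-day rate. The function name `crashes_per_day` is just something I made up for illustration; the numbers are the ones from the comparison above.

```python
def crashes_per_day(crashes: int, days: float) -> float:
    """Average number of crashes per day over an observation window."""
    if days <= 0:
        raise ValueError("days must be positive")
    return crashes / days


# Figures from the server 1 vs server 4 comparison (same 5-day window).
server_4_rate = crashes_per_day(13, 5)
server_1_rate = crashes_per_day(0, 5)

print(f"server 4: {server_4_rate:.1f} crashes/day")
print(f"server 1: {server_1_rate:.1f} crashes/day")
```

Comparing rates rather than raw counts also works when the two servers were watched for different lengths of time.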
Typical Frequency: Super random. Usually at least once per day, sometimes more, sometimes less.
It also depends on the amount of traffic a particular server receives. A server that receives more traffic will have a higher chance to crash than one that doesn't.
Whether or not I can do something about these crashes depends on whether I can determine their source and/or remove the offending plugin/addon. Determining the source of a crash can be difficult. Part of the reason is that SourceMod plugins don't log crashes very well. They do log errors, but if the plugin crashes before it reports an error I'm SOL, and this is often the case. The other issue is that I can't sit in front of my computer 24/7, so I am not able to collect my own data on the crashes.
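If it helps, the identification rules in this guide can be sketched as a small decision function. This is only my rough reading of the rules, not a tool that runs on the servers. The names `Outage` and `classify` are made up for illustration, and the two inputs are things you would observe yourself: whether all of the servers went down together, and whether the server came back on its default map (`None` if it has not come back yet).

```python
from enum import Enum
from typing import Optional


class Outage(Enum):
    SLOW_MAPCHANGE = "mapchange 'crash' (not really a crash)"
    ISP_DOWNTIME = "ISP downtime"
    BOX_CRASH = "dedicated box crash"
    GENERAL_CRASH = "general crash"
    UNDETERMINED = "all servers still down, cannot tell yet"


def classify(all_servers_down: bool, back_on_default_map: Optional[bool]) -> Outage:
    """Guess the outage type from what a player can observe."""
    if all_servers_down:
        if back_on_default_map is None:
            # Box crash and ISP downtime look the same until the servers return.
            return Outage.UNDETERMINED
        # Default maps mean the servers restarted; non-default maps mean
        # they kept running and only the connection dropped.
        return Outage.BOX_CRASH if back_on_default_map else Outage.ISP_DOWNTIME
    if back_on_default_map is False:
        # One server that continued onto the next map did not crash.
        return Outage.SLOW_MAPCHANGE
    # One server that came back on the default map, or never came back.
    return Outage.GENERAL_CRASH


print(classify(all_servers_down=False, back_on_default_map=True).value)
```

The default map is the key signal in both branches: it tells you whether the server process actually restarted.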
How you can help
I would very much appreciate a Steam message informing me of server crashes. I would like to know this information:
-date/time (please specify timezone)
-server
-type (one of the things above)
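For anyone who likes to script things, here is a minimal sketch of a report containing the three fields above. The helper `format_report` and the example date, server, and timezone offset are made up for illustration. The one thing it enforces is that the time carries a timezone, since a bare time is hard for me to line up with the server logs.

```python
from datetime import datetime, timedelta, timezone


def format_report(when: datetime, server: str, crash_type: str) -> str:
    """Build a one-line crash report; rejects times without a timezone."""
    if when.tzinfo is None:
        raise ValueError("please specify a timezone for the date/time")
    return f"date/time: {when.isoformat()} | server: {server} | type: {crash_type}"


# Made-up example: a fixed UTC-6 offset, labeled CST.
cst = timezone(timedelta(hours=-6), "CST")
report = format_report(
    datetime(2024, 1, 15, 21, 30, tzinfo=cst),
    "server 4",
    "general crash",
)
print(report)
```

A plain typed message with the same three fields is just as good; the point is only that all three are present.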
I don't really want to know about the mapchange 'crash' so much. There's nothing I can do about it.
ISP downtime and box crashes are good for me to know about so I can track how often they happen. If box crashing or downtime is high enough I can talk to clutchkill and help them help me.
General crashes are the ones I want to know about most.
Anyway, hopefully with everyone's assistance we can minimize TF2F downtime to something negligible.