May 29

Introduction

At my current gig with Primedia, we have some pretty high-volume Rails sites. Since highly available sites are essential to our core business, we have to be very attentive to production issues. Below is a list that I’ve compiled recently, partly from memory and partly by working with some very bright folks. This doc is meant to help Ops/Dev remember some basics and tricks when debugging issues. So remember to try this stuff.

Non-Technical (READ FIRST)

1. If the issue is not having a major impact on production, make sure all your troubleshooting is passive! You might introduce an outage yourself if you try serious, intrusive troubleshooting.
2. Confirm the issue together with Ops/Infrastructure, in person. This gets buy-in from the folks who have official root access.
3. As a developer, you should not work alone on production systems except to gather information (if you get temporary access). Any change made to mitigate the issue should be done while pairing with Ops/Infra. This will avoid finger-pointing and headaches later.

Technical

Elimination + log files is my troubleshooting method of choice.

1. Skip the load balancer if possible and hit the IP/port of a single app server directly, which takes the load balancer out of the mix.
2. Enable the DEBUG log level and restart the app server(s). Preferably, take one app server out of the load balancer pool and work on it in isolation. Either way, be aware that DEBUG can slow responses significantly. Still, there are situations where DEBUG is the only way to get an indication from Rails of what is going on.
3. Tail the log file of that single app server (IP/port) and send your requests only to it, which squelches the noise from other servers.
4. Make sure the running processes are not hung or left over from an old version. Look at the start time/date of the PIDs. If the deploy is screwed up, you can be deploying new code while the processes actually running are code from a month ago.
5. To be thorough, look at the kernel logs too (on Linux, /var/log/messages etc.). Something else entirely could be going on.
6. The logs should report the status of the database(s), but look especially for the word “timeout” if you have any custom network libraries that fetch data (we do), or for “ActiveRecord”. Perhaps even grep for them.
7. Use a non-browser client, like curl or wget, to eliminate possible browser issues. Case in point: we had an issue related to ActiveRecord sessions that had been migrated to another datacenter. When the site was accessed via curl there was no issue (curl does not save cookies by default).
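Two of the checks above (items 4 and 6) are easy to script. What follows is only a rough sketch in Python, not anything we actually run: the function names, the sample log lines and the PID table are all made up for illustration. On Linux you could collect real start times with something like `ps -o pid,lstart`, and the deploy time would come from wherever your deploy tooling records it.

```python
import re
from datetime import datetime

# Words the list above suggests looking for; extend to suit your own app.
SUSPECT = re.compile(r"timeout|ActiveRecord", re.IGNORECASE)


def suspect_lines(lines):
    """Return (line_number, text) pairs for lines mentioning a suspect word."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if SUSPECT.search(line)
    ]


def stale_pids(process_starts, deployed_at):
    """Return the PIDs that started before the last deploy.

    process_starts is a hypothetical mapping of PID -> start datetime;
    how you gather it is up to you.
    """
    return sorted(
        pid for pid, started in process_starts.items() if started < deployed_at
    )


if __name__ == "__main__":
    # Invented sample input, not real Rails log output.
    sample_log = [
        "request finished fine\n",
        "feed fetch gave up: timeout after 5 seconds\n",
        "something went wrong inside ActiveRecord\n",
    ]
    for number, text in suspect_lines(sample_log):
        print(f"line {number}: {text}")

    # Invented PIDs and times: one process predates the deploy by a month.
    starts = {
        4101: datetime(2008, 4, 29, 9, 0),
        4202: datetime(2008, 5, 29, 12, 5),
    }
    print("stale:", stale_pids(starts, deployed_at=datetime(2008, 5, 29, 12, 0)))
```

To use the first function on a real file, open the log of the single app server you isolated and pass the file object straight in, since it iterates line by line.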

Other

1. If you cannot duplicate the issue in development, remember to change your configs to look like production and run your development workstation in production mode. Rails behaves differently in its different modes.
