The Hardest (and Most Fun) Problems to Troubleshoot

I recently wrote an FAQ-style post about system administration and technology careers in general. One of the best questions I got was about the most interesting troubleshooting problems I’ve had to deal with. Here’s that question, along with my answer:

What’s one of the most interesting things you’ve had to troubleshoot / do while maintaining a system?

I’m leaving out specific examples because they’re a mixture of non-public information and hyperspecific (uninteresting) technical detail, but I can outline what generally makes a problem interesting to solve.

The really interesting problems I’ve seen tend to involve performance, networking, and distributed systems. Solving them usually takes a combination of knowledge from several areas:

  • Systems/OS: What is the operating system doing when everything slows down? What’s causing it to do that?
  • Networking/Distributed Systems: What’s actually happening when these machines communicate? How are they supposed to share and manage state, deal with network partitions, and ensure high availability? What are they *actually* doing when this problem happens?
  • Software Development: Which part of the code is causing this network/OS issue, and which code path leads there? Can I actually look at and modify this code? Was it written by our developers, or does it come from an open-source project? What can I do to confirm the issue and test a fix? Can I contribute a fix back to the upstream project?
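
As a small illustration of the "what's actually happening when these machines communicate?" question, here is a minimal Python sketch that times name resolution and the TCP handshake separately for a single connection attempt. It is not taken from any real incident. The function name `time_tcp_connect` and its parameters are made up for this example, and it only uses the standard `socket` and `time` modules.

```python
import socket
import time


def time_tcp_connect(host, port, timeout=3.0):
    """Return (resolve_seconds, connect_seconds) for one TCP connection attempt.

    Hypothetical helper for illustration only. It uses the first address
    returned by the resolver and raises OSError if the lookup or the
    connection fails.
    """
    start = time.perf_counter()
    # Resolve the name first so lookup time is measured on its own.
    family, socktype, proto, _, sockaddr = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    )[0]
    resolved = time.perf_counter()

    # Then time only the TCP handshake against the resolved address.
    with socket.socket(family, socktype, proto) as sock:
        sock.settimeout(timeout)
        sock.connect(sockaddr)
    connected = time.perf_counter()

    return resolved - start, connected - resolved
```

Calling something like `time_tcp_connect("db.internal.example", 5432)` (a placeholder host and port) from a few different machines can hint at whether a slowdown sits in name resolution or in the network path, before digging into the OS or the application code.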