The Hardest (and most fun) Problems to Troubleshoot

I recently wrote a FAQ-style post about System Administration and technology careers in general. One of the best questions I was asked was about what kinds of really interesting troubleshooting problems I’ve had to deal with. Here’s that question, along with my answer:

What’s one of the most interesting things you’ve had to troubleshoot / do while maintaining a system?

I’m leaving out specific examples because they’re a mixture of non-public information and hyperspecific (uninteresting) technical stuff, but I can give some outlines for what generally makes for interesting problems to solve.

The really interesting problems I’ve seen tend to be related to performance, networking, and distributed systems. Usually they require a combination of different knowledge to solve:

  • Systems/OS: What is the operating system doing when everything slows down? What’s causing it to do that?
  • Networking/Distributed Systems: What’s actually happening when these machines communicate? How are they supposed to share and manage state, deal with network partitions, and ensure high availability? What are they *actually* doing when this problem happens?
  • Software Development: Which part of the code is causing this network/OS issue, and which code path leads there? Can I actually look at and modify this code? Is this code written by our developers, or an open-source project? What can I do to confirm the issue and test a fix? Can I contribute a fix back to the upstream project?

System Administration Careers: Frequently Asked Questions

If you’re wondering whether or not a System Administration career is right for you, this article might help. Someone just sent me some questions about what a career in System Administration is like, and asked some questions that I hadn’t thought to answer on YouTube or here on the tutorialinux blog. Since I’ve been working in various IT Operations and Software Development disciplines professionally for the last 7 years, I love talking about this stuff. Here are my two cents:

 

What are the Responsibilities a Linux / Unix System Administrator?

Architecting, Building, and Troubleshooting/Maintaining the technical infrastructure for a company or software product. At a high level, you’re responsible for figuring out what the technical requirements are and using existing software/products/tools to provide those requirements (or occasionally writing your own). Early-career focuses on implementing predefined task-chunks:

  • Figure out why this server is doing XYZ
  • Fix a problem with our software deployment automation code
  • Increase storage size for one of the database clusters
  • Replace failed disks at a datacenter
  • Create a web server configuration file for a new project that Dev is working on

Senior-level positions often have more design/architecture:

  • Deep introspection of the OS and the software product you’re supporting
  • Design a new deployment pipeline
  • Build a technical team for a new project
  • Evaluate new software that would change how we run our infrastructure (e.g. containers + scheduler vs. VMs + config management)
  • Troubleshoot tough problems (OS-level issues, bugs arising from a confluence of several edge cases across the stack, etc.)

 

What education/experience/credentials are required to do this job?

Read more