„The Matrix“ ruined it a little bit for all of us. Outsiders sometimes have the impression that admins can find errors on their servers just by staring at the lines scrolling by. And admins with a reputation for pulling rabbits out of their hats run into this expectation even more frequently. Well, sometimes it's even true. However, not because we parsed the error messages on the fly and made sense of them, but because some lines contained „FATAL“ or „Error“. Or because we had seen a strange, non-descriptive line before and noticed it passing by again.
However, I'm pretty convinced that in most situations we need some form of graphical visualisation to find performance problems, or availability problems that result from performance problems.
Many, many years ago I found the first hint of a problem in a system just by importing the data into Excel. It was a system that sometimes had really slow reaction times. The times between the occurrences looked really random, and at first I didn't have the foggiest idea where to search for the problem. So I measured several hundred thousand of the suspected events with DTrace. The result was just a long list of numbers: the runtime of each event and the high-resolution timestamp at which it was executed. And it was a lot of boring data, several megabytes worth of it. There was no pattern in the events when you looked at time. However, as soon as you put them into a graph, it was pretty obvious that they didn't occur after a certain amount of time but after a certain number of events. It turned out that after a number of events a table had to be cleaned up before further events could be executed. It's something I've learned over time: if there is no pattern in time, look whether there is a pattern in the number of events.
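To illustrate the idea: this is not the original DTrace analysis, just a minimal Python sketch. It assumes a made-up input file `events.txt` with one „timestamp runtime“ pair per line (both in nanoseconds). Plotting the very same runtimes once against wall-clock time and once against the event number is what makes a periodic cleanup jump out on the second axis.

```python
import numpy as np
import matplotlib.pyplot as plt

# Assumed input: one event per line, "<hires-timestamp-ns> <runtime-ns>",
# roughly what a simple DTrace script printing timestamp and latency would give you.
data = np.loadtxt("events.txt")
timestamps, runtimes = data[:, 0], data[:, 1]

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 6))

# Plot 1: runtime over wall-clock time -- the slow events look random here.
ax1.plot((timestamps - timestamps[0]) / 1e9, runtimes / 1e6, ".", markersize=2)
ax1.set_xlabel("seconds since start")
ax1.set_ylabel("runtime [ms]")

# Plot 2: runtime over event index -- a cleanup every N events shows up
# as evenly spaced spikes on this axis.
ax2.plot(np.arange(len(runtimes)), runtimes / 1e6, ".", markersize=2)
ax2.set_xlabel("event number")
ax2.set_ylabel("runtime [ms]")

plt.tight_layout()
plt.savefig("events.png")
```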
It's pretty much the same with tcpdump. Of course you could try to diagnose a problem by looking at each frame or by searching for certain events in them. However, when you look at a one-minute pcap file from four aggregated 10 GbE interfaces, you are looking at a lot of data, even if you only record the first 128 bytes of each frame. I found the IO Graph feature of Wireshark really helpful here. As soon as you create a graphical representation of your data (like packets per 1 ms), you see in an instant that there are periods where no data is transferred at all, and you have pretty much narrowed down the time frame you have to examine by starting your search at exactly those moments.
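A rough offline equivalent of what the IO Graph shows could look like the sketch below. The file names are made up, and it assumes the frame timestamps were exported first with tshark (which ships with Wireshark):

```python
# Sketch of a "packets per 1 ms" graph, done offline in Python.
# Assumes the timestamps were exported first, e.g. with
#   tshark -r capture.pcap -T fields -e frame.time_epoch > timestamps.txt
from collections import Counter
import matplotlib.pyplot as plt

buckets = Counter()
with open("timestamps.txt") as f:
    for line in f:
        if not line.strip():
            continue
        ts = float(line)              # epoch seconds with fractional part
        buckets[int(ts * 1000)] += 1  # count frames per 1 ms bucket

first, last = min(buckets), max(buckets)
xs = range(first, last + 1)
counts = [buckets.get(m, 0) for m in xs]   # empty buckets become 0

# Stretches where the count drops to zero are the spots worth examining first.
plt.plot([m - first for m in xs], counts, drawstyle="steps-mid")
plt.xlabel("milliseconds since start of capture")
plt.ylabel("frames per 1 ms")
plt.savefig("io_graph.png")
```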
Remember when Brendan Gregg found out that screaming at a hard disk had an impact on latency, and how easy that was to demonstrate by looking at the Analytics feature of the storage system? To this day I think it's a really nice example of the power of a good visualisation.
I frequently use a script framework to import iostat data into R and analyse it there, so you see block sizes and hiccups right away in a graphical representation and you can calculate percentiles. That way you can make sense of a whole day's worth of iostat data. Similarly, I import kstat data into an InfluxDB in order to analyse it with Grafana.
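The R framework itself is not shown here, but the core of the idea fits into a few lines. The sketch below is a Python stand-in, not the actual scripts, and it assumes the iostat output has already been flattened into a CSV with the column names used below:

```python
# Assumed CSV layout (one row per device per interval):
# timestamp,device,reads_per_s,writes_per_s,kb_read_per_s,kb_write_per_s,svc_time_ms
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("iostat_day.csv", parse_dates=["timestamp"])  # hypothetical file

# Average block size per interval: kilobytes moved divided by number of I/Os.
ops = df["reads_per_s"] + df["writes_per_s"]
kb = df["kb_read_per_s"] + df["kb_write_per_s"]
df["blocksize_kb"] = (kb / ops).where(ops > 0)

# Percentiles of the service time tell you more than the average ever will.
print(df["svc_time_ms"].quantile([0.5, 0.9, 0.99, 0.999]))

# Service time over the day, per device; hiccups show up as spikes.
for dev, grp in df.groupby("device"):
    plt.plot(grp["timestamp"], grp["svc_time_ms"], label=dev)
plt.legend()
plt.ylabel("service time [ms]")
plt.savefig("iostat_day.png")
```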
Another important lesson I learned over time: if you don't record performance data all the time, at least record data from your system when everything is okay, update this recording frequently, and keep the old recordings. From a graphical representation it's often easy to see how things developed over time and which situations were the precursors of a problem. By overlaying graphs from „everything was okay“ with graphs from „everything is belly up“ you will often spot the changes at a single glance. Given how cheap storage is today, you should really think about long-term recording of performance data, just to have a rich repository of data from the past for comparison purposes.
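A minimal sketch of such an overlay, assuming two recordings of the same metric in a simple „epoch-seconds value“ text format (the file names are made up):

```python
import numpy as np
import matplotlib.pyplot as plt

def load(path):
    data = np.loadtxt(path)          # column 0: epoch seconds, column 1: metric
    t = data[:, 0] - data[0, 0]      # align both recordings to a common zero
    return t, data[:, 1]

t_ok, v_ok = load("baseline_ok.txt")
t_bad, v_bad = load("incident.txt")

# Plot both recordings on the same axes so the differences are visible at a glance.
plt.plot(t_ok, v_ok, label="everything was okay", alpha=0.7)
plt.plot(t_bad, v_bad, label="everything is belly up", alpha=0.7)
plt.xlabel("seconds since start of recording")
plt.ylabel("metric")
plt.legend()
plt.savefig("overlay.png")
```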
Finding a performance problem often begins for me with a good visualisation or with reducing the data to a few statistical numbers, because I can't analyse really large log files with the eyeball Mark I. I have computers to analyse the log files for me.
My lessons learned in short:
- record a lot of data even before something happens
- visualise data early in the process of searching for problems and don't stare at large pools of data manually
- if you don't find a pattern in the time of occurrence, try to find one in the number of events.