15 commands every DevOps/SRE should know during on-call

Smash the production issue in 120 sec!!

9 min read · Jan 8, 2021

Just as doctors go on-call to support emergency patient needs around the clock, IT organizations task dedicated groups of engineers with going on-call to fix issues for software services as they arise.

The ability to quickly and effectively find and resolve bugs in new and established production systems is one of the most valuable engineering skills that you can develop. Surprisingly, many engineers do not have a clear understanding of the basic architecture, and I intend to address that gap here. Here is my treatise on debugging.

Most server performance degradation comes down to CPU, memory, network, or disk. The commands below give you a high-level view of system resource usage, errors, and health.

uptime / w
dmesg 
free
top / ps
du / df
iostat
pidstat
vmstat 
mpstat 
SAR 
netstat / ss
tcpdump 
lsof
thread / heap dump
strace

🔹 uptime / w: (load average)

Shows the current time, how long the system has been running, how many users are currently logged on, and the system load averages for the past 1, 5, and 15 minutes.

Load indicates the number of tasks (processes) wanting to run.
On Linux, these numbers include processes wanting to run on CPU, as well as processes blocked in uninterruptible I/O (usually disk I/O).

The three numbers give us some idea of how the load is changing over time.
If the numbers are large relative to the number of CPUs, CPU demand is probably high; vmstat or mpstat can confirm it.

$ uptime
 12:22:16 up 10 days, 5:06, 1 user, load average: 0.68, 0.63, 0.61
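Whether a load average counts as "big" depends on how many CPUs the machine has. The sketch below divides a load average by the CPU count. The helper names load_per_cpu and load_status are made up for illustration, and the 1.00-per-CPU threshold is a rule of thumb, not a hard limit.

```shell
# Hypothetical helpers (not part of any standard tool).
# Both take: $1 = load average, $2 = number of CPUs.

# Print the load per CPU with two decimals.
load_per_cpu() {
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}

# Print "overloaded" when more than one task per CPU wants to run, else "ok".
load_status() {
    awk -v load="$1" -v cpus="$2" 'BEGIN {
        if (load / cpus > 1.0) print "overloaded"; else print "ok"
    }'
}
```

On a Linux host you could feed it live values, for example: load_status "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)". Because Linux load also counts tasks blocked on I/O, "overloaded" is a prompt to look further, not proof of CPU saturation.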

Alternatively, if a lot of users are logged in, quickly check who is logged in and what they are running, along with the uptime details.

w: Shows who is logged on and what they are doing.

$ w
 12:23:37 up 10 days, 5:07, 1 user, load average: 0.43, 0.59, 0.61
USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
opc pts/0 10.0.0.2 12:22 1.00s 0.05s 0.03s w

🔹 dmesg (Kernel Errors)
Displays the messages from the kernel ring buffer. It records problems the system has encountered, including those logged during start-up, so check it for error messages (oom-killer, TCP dropping a request, etc.).

Usage: dmesg -TL
-T : print human-readable timestamps
-L : colorize the output, which makes error messages stand out

sudo dmesg | grep -i "sda" (search for a specific keyword)
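During an incident it helps to filter the ring buffer for the usual suspects in one pass. The sketch below wraps grep in a helper; the name kernel_errors and the pattern list are assumptions based on commonly seen kernel messages, so extend the list for your own hardware and drivers.

```shell
# Hypothetical helper: keep only kernel log lines that match common
# error patterns. Reads dmesg-style text on stdin.
kernel_errors() {
    grep -iE 'oom-killer|out of memory|killed process|i/o error|segfault'
}
```

Typical use would be: sudo dmesg -T | kernel_errors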

🔹 free (Memory Usage)
free displays the total amount of free and used physical and swap memory in the system, as well as the buffers and caches used by the kernel.

Usage: free -h (-h: human-readable)

used = total - (free + buffers + cache)
buffers: memory used by kernel buffers, used for block device I/O.
cached: memory used by the page cache, used by file systems.

available: an estimate of the amount of memory that is available for starting new applications, without swapping (should not be close to zero).

free -s 2 -c 5 (check every 2 seconds 5 times)
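To turn the "available" figure into something you can alert on, one option is to compute it as a percentage of total memory. This sketch reads /proc/meminfo-style text (the kernel file that also exposes MemTotal and MemAvailable) rather than parsing free's columns; the helper name mem_available_pct is made up for illustration.

```shell
# Hypothetical helper: print MemAvailable as a whole-number percentage
# of MemTotal. Reads /proc/meminfo-formatted text on stdin.
mem_available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%d\n", avail * 100 / total }'
}
```

On a Linux host: mem_available_pct < /proc/meminfo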

🔹 top / ps (Who is consuming CPU? & by how much)

top provides a dynamic, real-time view of a running system.
It shows summary information for the system and the list of processes or threads currently managed by the Linux kernel.

A downside of top is that it is harder to see patterns over time, which makes it difficult to draw conclusions.

top -u sk (list only the processes of a specific user)
top -o %MEM (sort the processes by memory; use %CPU to sort by CPU)

%CPU: the task's share of CPU time.
%MEM: the task's share of physical memory.

ps: Reports a snapshot of the running processes, with details such as duration, CMD, and UID.

Usage: ps -ef

Check the processes consuming the most CPU and memory:

ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%mem | head

🔹 du / df ( Disk space)
du: How much disk space is consumed by files and directories.
df: How much free space each mounted filesystem has.

Usage: du -h / df -h

df -i: How many inodes are used/free. Running out of inodes is annoying: you cannot create a new file or folder, even if free space remains.
ncdu: Interactive breakdown of disk usage by file and folder (who consumes what).
lsblk: Lists information about all available or the specified block devices.
cat /etc/fstab: The system configuration file that lists the filesystems to mount (disks and partitions) and their options.
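A quick way to spot the partition that is about to fill up is to filter df output by usage. The sketch below assumes the POSIX output format of df -P (usage in column 5, mount point in column 6); the helper name full_filesystems is hypothetical, and mount points containing spaces are not handled.

```shell
# Hypothetical helper: print "<mount point> <use%>" for every filesystem
# at or above a usage threshold.
# $1 = threshold in percent; reads `df -P` output on stdin.
full_filesystems() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)              # "95%" -> "95"
        if (use + 0 >= limit) print $6, $5
    }'
}
```

For example: df -P | full_filesystems 90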

🔹 iostat: (Disk I/O)

Gives a CPU utilization report and a device utilization report (older sysstat versions also included a Network Filesystem (NFS) report). Use it to monitor system input/output (disk read and write) statistics for devices and partitions.

%util: the percentage of time the storage device had outstanding work (was busy). Values close to 100% usually indicate saturation.
%idle: the percentage of time the CPUs were idle with no outstanding disk I/O request.
%iowait: the percentage of time the CPUs were idle AND there was at least one disk I/O request in progress.
await: the average time (in milliseconds) requests take to go through, including the time spent in the queue and the time spent servicing them.

Usage: iostat -x 1
-x : Display extended statistics.
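The column layout of iostat -x differs between sysstat versions, so a filter on %util is safer if it finds the column by its header name. The following is a minimal sketch of that idea; busy_devices is a made-up helper name.

```shell
# Hypothetical helper: print "<device> <%util>" for devices at or above
# a threshold. $1 = threshold in percent; reads `iostat -x` output on stdin.
busy_devices() {
    awk -v limit="$1" '
        # Header row of the device table: remember where %util lives.
        /%util/ { for (i = 1; i <= NF; i++) if ($i == "%util") col = i; next }
        # Data rows: need at least that many columns to be a device line.
        col && NF >= col && $col + 0 >= limit { print $1, $col }
    '
}
```

For example: iostat -x 1 3 | busy_devices 90. The first report covers the time since boot, so the later reports are the ones that reflect current load.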

🔹 pidstat (Process Usage)
Used for monitoring individual tasks currently being managed by the Linux kernel.

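pidstat -C "mysql" (only tasks whose command name contains the string)
pidstat -p <pid>
pidstat -p <pid> -r 1
pidstat -urd -h (combine all)

-d : Report I/O statistics per process (disk read and write).
-r : Display page faults and memory utilization.
-u : Report CPU utilization (displayed by default).
-h : Display all activities horizontally on a single line.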

CPU : Processor number to which the task is attached.
%CPU : Total percentage of CPU time used by the task.

🔹 vmstat (Overall stats by time)
vmstat reports information about processes, memory, paging, block I/O, traps, and CPU activity (everything except network).

vmstat [interval] [count]
vmstat 1 -S M

-d : gives you read/write stats for various disks

si, so: Swap-ins and swap-outs. If these are non-zero, you’re out of memory.
r: Number of processes running on CPU and waiting for a turn. This provides a better signal than load averages for determining CPU saturation, as it does not include I/O. To interpret: an “r” value greater than the CPU count is saturation.
b: Number of processes blocked waiting for I/O to complete.
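The two rules of thumb above (r greater than the CPU count, non-zero si/so) are easy to check mechanically. This sketch looks the columns up by their header names; vmstat_check is a hypothetical helper, and the messages it prints are made up for illustration.

```shell
# Hypothetical helper: flag CPU saturation and swapping in vmstat output.
# $1 = number of CPUs; reads `vmstat` output on stdin.
vmstat_check() {
    awk -v cpus="$1" '
        # Column header row ("r b swpd free ..."): map names to positions.
        $1 == "r" { for (i = 1; i <= NF; i++) col[$i] = i; next }
        # Data rows start with a number (the r column).
        ("r" in col) && $1 ~ /^[0-9]+$/ {
            r  = $(col["r"]) + 0
            si = $(col["si"]) + 0
            so = $(col["so"]) + 0
            if (r > cpus + 0) print "cpu-saturated r=" r
            if (si + so > 0)  print "swapping si=" si " so=" so
        }
    '
}
```

For example: vmstat 1 5 | vmstat_check "$(nproc)". The first data row is an average since boot, so weigh the later rows more heavily.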

🔹 mpstat (CPU Balance)
mpstat reports CPU utilization per processor, which is most useful if your system has multiple processors.
It prints CPU time breakdowns per CPU, which can be used to check for an imbalance.

A single hot CPU can be evidence of a single-threaded application.
Usage: mpstat -P ALL 1

🔹 SAR (System Activity Report)
Collects, reports, or saves cumulative system activity information.
The collected data is stored in /var/log/sa or /var/log/sysstat, depending on the distribution.

Usage: sar [ options ] [ <interval> [ <count> ] ]

sar -q : Queue length and load averages from the start of the day.
sar -r : Memory usage from the start of the day.
sar -P ALL : CPU usage per processor.
sar -n <keyword> : Network statistics (keywords such as DEV, TCP, ETCP).
sar -u 1 3 : Real-time CPU usage, every 1 second, 3 times.
sar -P ALL 1 1 : CPU usage for all cores, every 1 second, 1 time.
sar -P 1 1 3 : Real-time CPU usage for core number 1, every 1 second, 3 times.
sar -s 16:00:00 -e 17:00:00 : Restrict the report to a start and end time.
sar -n DEV 1 : Statistics from the network devices, every 1 second.
sar -n TCP,ETCP 1 : Statistics about TCPv4 network traffic and network errors, every 1 second.

In the sar -n DEV output:
IFACE : Name of the network interface for which statistics are reported.
%ifutil : Utilization percentage of the network interface.

To get the maximum load average on a machine for a specific day (field 5 is ldavg-1 when sar prints 12-hour timestamps with an AM/PM column):

sar -q -f sa04 | tail -n +4 | sed -e 's/  */,/g' | cut -d ',' -f 5 | sort -rn | head -1
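The position of the ldavg-1 column shifts depending on whether sar prints an AM/PM column, so a variant that finds the column by its header name may be more portable. The sketch below also skips the trailing "Average:" row; max_ldavg1 is a hypothetical helper name.

```shell
# Hypothetical helper: print the highest 1-minute load average found in
# `sar -q` output read from stdin.
max_ldavg1() {
    awk '
        # Header row: locate the ldavg-1 column by name.
        /ldavg-1/ { for (i = 1; i <= NF; i++) if ($i == "ldavg-1") col = i; next }
        # Data rows, excluding the summary line at the end.
        col && $1 != "Average:" && NF >= col && $col + 0 > max { max = $col + 0 }
        END { print max + 0 }
    '
}
```

For example: sar -q -f sa04 | max_ldavg1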

🔹 netstat / ss (Network I/O)
Lists the network connections, routing tables, interface statistics, and multicast memberships on the machine. It also provides information about TCP/UDP connections: their state, how much data is being received and sent, and the source and destination addresses.

netstat -a (all sockets)
netstat -at (TCP connections)
netstat -au (UDP connections)
netstat -tnl (list of listening ports)

-l (--listening) : display listening server sockets
-a (--all) : display all sockets
-r (--route) : display the routing table
-i (--interfaces) : display the interface table
-n (--numeric) : don't resolve names
-p (--programs) : display PID/program name for sockets

ss: Does the same job but is faster than netstat. It dumps socket statistics and is a handy tool for tracking TCP connections and sockets.
Usage: ss -l
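A pile-up of connections in one state (for example TIME-WAIT or CLOSE-WAIT) is a common on-call clue. This sketch counts sockets per state, assuming the state is the first column as in ss -tan output; count_states is a made-up helper name.

```shell
# Hypothetical helper: count sockets per TCP state.
# Reads `ss -tan` output on stdin (header on line 1, state in column 1).
count_states() {
    awk 'NR > 1 { n[$1]++ } END { for (s in n) print s, n[s] }' | sort
}
```

For example: ss -tan | count_states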

🔹 tcpdump (Packet Capture)
Utility that allows you to capture and analyze network traffic going through your system.

tcpdump -i eth1 'port 80' (HTTP traffic)
tcpdump -c 5 -i eth0 (capture 5 packets)
tcpdump -c 5 -i any (capture 5 packets on any interface)
tcpdump -i eth0 src 192.168.0.2 (capture packets from a specific host)
tcpdump -i eth0 dst 50.116.66.139 (capture packets to a specific host)
tcpdump -i eth0 tcp (capture only TCP packets)
tcpdump -w 0001.pcap -i eth0 (write the capture to a file in .pcap format)
tcpdump -D (list the available interfaces)

🔹 lsof / ulimit (List of Open Files)

lsof shows which files are opened by which process.

lsof <FILE_SYSTEM> (open files on a file system)
lsof +D <DIR_NAME> (open files under a directory, recursively)
lsof -i (all open network connections)
lsof -i 4 (IPv4 network files only)
lsof -i 6 (IPv6 network files only)
lsof -p <PID> (files opened by a specific process ID)
lsof -c ssh (files opened by processes whose command name starts with ssh)
lsof -u opc (files opened by a specific user)

Kill all processes of a particular user:

kill -9 `lsof -t -u skanda`

ulimit

If a process reaches its open-file limit, the system does not allow it to open new files, and the program may fail to execute. Then either close the existing files or increase the limit.

ulimit -a (show all limits for the current shell, including the max open files)
ulimit -n <open count> (change the open-file limit for the current shell and the processes it starts)

Socket connections are treated as files, so every socket connection created by a Linux process counts as one open file.
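To see how close a running process is to its limit, you can compare its open descriptors with the soft limit the kernel reports in /proc/<pid>/limits. The helpers below are a sketch with made-up names (soft_nofile_limit, fd_usage_pct), and they assume the limit is a number rather than "unlimited".

```shell
# Hypothetical helper: print the soft "Max open files" limit.
# Reads /proc/<pid>/limits-formatted text on stdin.
soft_nofile_limit() {
    awk '/^Max open files/ { print $4 }'
}

# Hypothetical helper: percentage of the limit in use.
# $1 = open descriptors, $2 = limit (integer arithmetic, rounds down).
fd_usage_pct() {
    echo $(( $1 * 100 / $2 ))
}
```

On a Linux host, for a given <pid>, the open count is the number of entries in /proc/<pid>/fd, so a check could look like: fd_usage_pct "$(ls /proc/<pid>/fd | wc -l)" "$(soft_nofile_limit < /proc/<pid>/limits)"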

🔹 Thread & Heap Dump

A thread dump contains a snapshot of all the threads active at a particular point during the execution of a program.

jstack [-F] [-l] <pid> > <file-path>
jcmd <pid> Thread.print

-F: forces a thread dump; handy when jstack <pid> does not respond (the process is hung).
-l: prints additional information about locks, such as the list of owned java.util.concurrent ownable synchronizers.
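A thread dump can run to thousands of lines, so a per-state count is a useful first look (many BLOCKED threads hint at lock contention). The sketch below relies on the "java.lang.Thread.State:" lines that jstack prints for each thread; thread_states is a hypothetical helper name.

```shell
# Hypothetical helper: count threads per state in a thread dump.
# Reads jstack (or jcmd Thread.print) output on stdin.
thread_states() {
    awk '/java.lang.Thread.State:/ { n[$2]++ }
         END { for (s in n) print s, n[s] }' | sort
}
```

For example: jstack <pid> | thread_states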

Heap dump: a snapshot of all the objects that are in memory in the JVM at a certain moment.

jmap -dump:[live],format=b,file=<file-path> <pid>
jcmd <pid> GC.heap_dump <file-path>

live: if set, only objects with active references are dumped; objects that are ready to be garbage collected are discarded.
format=b: specifies that the dump file will be in binary format. If not set, the result is the same.
file: the file where the dump will be written.
pid: the ID of the Java process.

To get a heap dump automatically when the JVM runs out of memory, set these parameters at JVM start-up:
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=${DOMAIN_HOME}/logs/mps

🔹 strace (system call tracer)
Traces system calls and signals. It gives an idea of how a program interacts with the operating system, and is a useful diagnostic, instructional, and debugging tool.

strace -p <pid> (attach to a running process)
strace -p <pid> -o filename (write the trace to a file)
strace ls (trace a command from its start)

When to use:
▪️ Debugging why an installation crashes on a machine.
▪️ Debugging random crashes that are most probably due to the program running out of memory.
▪️ Finding out how the program interacts with the file system.
▪️ Debugging crashes reproducible only on one machine.
▪️ Debugging crashes in unfamiliar code or in cases when sources are unavailable.

Finally, some brain teasers: how would you debug these scenarios?

Hint: CPU bound means the rate at which a process progresses is limited by the speed of the CPU.

Special thanks to @brendangregg. If you want to go in depth, please check http://www.brendangregg.com/

Written by Skanda Shastry