You must be wondering why we are discussing conventional methods for system monitoring. Aren’t they a thing of the past?
Well, we must never forget our roots! They make us stronger.
Why do we need to understand monitoring?
System monitoring is an important aspect of system administration. Enterprise applications demand a high level of proactiveness to reduce and even prevent failure impact. For that purpose, we now have numerous efficient tools that monitor the system for us automatically. They can gather stats, generate alerts based on thresholds, and present the results with excellent visuals. ELK, Nagios, and Pandora FMS are some examples. AI makes monitoring tools even smarter. Darktrace is an example of a cybersecurity defense tool that uses AI to detect threats in real time.
However, have you ever wondered how these tools work? To understand them better, we need to understand manual system monitoring. It is also possible (very rare, but possible) that automated monitoring systems are unavailable, for example while they are undergoing maintenance. Can you imagine the chaos if your system were crashing while the monitoring tools were on vacation? YIKES! 😮 At that point, manual monitoring is a lifesaver.
You might also be experimenting with VMs or setting up your own lab. In the early phases, when monitoring tools are not yet installed, detailed system insights can be gathered using the methods shared in this blog.
Linux offers very powerful tools to gauge system health. The commands I’ll be discussing are generic and available on most Linux distros. I am using Ubuntu.
By the way, if you want to follow along, do check out Replit. There you can launch a Linux shell ready for use within minutes.
1. Finding Load Average and system uptime
Systems reboot from time to time, and a reboot can disturb some configurations. To check how long the machine has been up, use the uptime command. In addition to the uptime, it displays the load average.
The three values at the end of the output are the load averages: the average system load over the last 1, 5, and 15 minutes. On Linux, this counts processes that are running or waiting to run, plus those in uninterruptible sleep. A quick glance indicates whether the system load is increasing or decreasing over time.
$ uptime
19:15:00 up 1:04, 0 users, load average: 2.92, 4.48, 5.20
The ideal CPU queue length is 0, which is only possible when no process is waiting for the CPU.
* A per-CPU load lower than 1 is considered normal.
Per-CPU load is calculated by dividing the load average by the total number of CPUs available.
Find the number of CPUs with the lscpu command:
$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              4
.
.
output omitted
To catch warning signs early, do keep an eye on the load average!
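The per-CPU calculation above can be sketched in a few lines of shell. This is a minimal sketch, not the only way to do it: the function name per_cpu_load is made up for illustration, and it assumes a Linux system that exposes /proc/loadavg and has nproc installed (used here in place of reading the CPU count from lscpu by eye).

```shell
#!/bin/sh
# per_cpu_load LOAD CPUS: divide a load average by the CPU count.
# (Hypothetical helper name, used only for this illustration.)
per_cpu_load() {
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}

# Live values: the first three fields of /proc/loadavg are the 1-, 5- and
# 15-minute load averages; nproc prints the number of available CPUs.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    read -r one five fifteen rest < /proc/loadavg
    cpus=$(nproc)
    echo "CPUs: $cpus"
    echo "per-CPU load, 1 min:  $(per_cpu_load "$one" "$cpus")"
    echo "per-CPU load, 5 min:  $(per_cpu_load "$five" "$cpus")"
    echo "per-CPU load, 15 min: $(per_cpu_load "$fifteen" "$cpus")"
fi
```

With the sample numbers from this section, per_cpu_load 2.92 4 prints 0.73, which is under 1, while the 15-minute value of 5.20 works out to 1.30 per CPU.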
2. Calculating Free Memory
Sometimes, high memory utilisation is what causes problems. To check the available memory, use the command below.
$ free -mh
              total        used        free      shared  buff/cache   available
Mem:            25G        6.6G         11G         15M        7.3G         18G
Swap:            0B          0B          0B
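If you would rather have a single number than a table, the available column can be turned into a percentage of total memory. The sketch below is one possible way to do that; mem_available_pct is a made-up helper name, and it assumes a version of free that prints the available column as the seventh field of the Mem: line (very old versions do not).

```shell
#!/bin/sh
# mem_available_pct: read `free -m` output on stdin and print available
# memory as a whole-number percentage of total memory.
# Assumes "available" is the 7th field of the "Mem:" line.
mem_available_pct() {
    awk '/^Mem:/ { printf "%d\n", $7 * 100 / $2 }'
}

# LC_ALL=C keeps the row labels in English so the /^Mem:/ match works.
if command -v free >/dev/null 2>&1; then
    echo "available memory: $(LC_ALL=C free -m | mem_available_pct)%"
fi
```

Plain -m is used instead of -mh here because whole megabytes are easier to do arithmetic on than values such as 6.6G.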
3. Calculating Disk Space
To keep the system healthy, don’t overlook disk space. To list all mounted filesystems and the percentage of each in use, use the command below. Ideally, disk utilization should not exceed 80%.
$ df -h
Filesystem              Size  Used Avail Use% Mounted on
overlay                  97G   52G   46G  54% /
tmpfs                    64M     0   64M   0% /dev
tmpfs                    13G     0   13G   0% /sys/fs/cgroup
overlay                 2.4G  904K  2.4G   1% /nix
tmpfs                   2.6G   17M  2.6G   1% /io
tmpfs                    13G     0   13G   0% /proc/acpi
tmpfs                    13G     0   13G   0% /proc/scsi
tmpfs                    13G     0   13G   0% /sys/firmware
/dev/mapper/conman-850  2.4G  892K  2.4G   1% /home/runner/Bash
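The 80% rule of thumb is easy to check automatically. Here is a rough sketch that prints only the mount points above a given limit; disks_over is a hypothetical helper name, and the parsing assumes that neither filesystem names nor mount points contain spaces.

```shell
#!/bin/sh
# disks_over LIMIT: read `df -P` output on stdin and print the mount point
# and usage of every filesystem whose usage is above LIMIT percent.
# Assumes no spaces in filesystem names or mount points.
disks_over() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)              # "91%" becomes "91"
        if (use + 0 > limit) print $6, $5
    }'
}

# -P asks df for the POSIX format: one line per filesystem, fixed columns.
df -P | disks_over 80
```

No output means that nothing is above the limit.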
4. Understanding Process states
Processes are at the heart of any operating system. Their states give deep insight into what the system is doing and how processes interact with each other. The STAT column in the output below shows the process states.
$ ps aux
USER     PID %CPU %MEM     VSZ   RSS TTY   STAT START TIME COMMAND
runner     1  0.1  0.0 1535464 15576 ?     S    19:18 0:00 /inject/init
runner    14  0.0  0.0   21484  3836 pts/0 S    19:21 0:00 bash --norc
runner    22  0.0  0.0   37380  3176 pts/0 R+   19:23 0:00 ps aux
Below are some of the common process states:
| Running | R | The task is running or ready to run. |
| Sleeping | S | The task is in an interruptible sleep, waiting for an event to complete. |
| Stopped | T | The task has been stopped (suspended) after receiving a signal. |
| Zombie | Z | The task has terminated, but its parent has not yet collected its exit status. |
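To get a quick feel for how many processes are in each state, the first letter of the STAT column can be counted. This is a small sketch under the assumption that your ps accepts the stat output keyword (the procps version shipped with Ubuntu does); count_states is a made-up helper name.

```shell
#!/bin/sh
# count_states: read one STAT value per line on stdin and print how many
# processes are in each main state (the first letter of STAT).
count_states() {
    awk '{ n[substr($1, 1, 1)]++ } END { for (s in n) print s, n[s] }' | sort
}

# -e selects all processes; "stat=" prints only the STAT column, no header.
ps -eo stat= | count_states
```

A steadily growing Z count is usually worth a closer look, since it means some parent process is not collecting its children.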
5. Real Time System monitoring
Real-time monitoring is one of the most powerful capabilities. To check runtime stats, use the ‘top’ command.
The top program displays a dynamic view of the system’s processes: a summary header followed by a process or thread list. Unlike its static counterpart, ‘ps’, ‘top’ continuously refreshes the system stats.
With ‘top’, we can see well-organised details in a compact window. It comes with a number of flags, shortcuts, and highlighting methods. Do experiment with it, as it is powerful and convenient to use.
You can also kill processes from within ‘top’. Press ‘k’ and enter the process ID.
6. Interpreting Logs
System and application logs carry tons of information about what the system is going through. It’s like a dialogue between you and the system (but one-sided 😀). They contain useful messages and error codes that point towards the underlying problems. Searching the logs for error codes can greatly reduce the time needed to identify and fix an issue.
For detailed log parsing, check this post.
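As a starting point, a plain grep already goes a long way. The sketch below counts the lines that mention "error" in a log file and shows the most recent few; error_summary is a hypothetical helper name, and the search word and example path are assumptions that you should adapt to your own application.

```shell
#!/bin/sh
# error_summary FILE: print how many lines mention "error" (in any case),
# followed by the last five such lines with their line numbers.
error_summary() {
    echo "matches: $(grep -ci 'error' "$1")"
    grep -ni 'error' "$1" | tail -n 5
}

# Example (assumed path; Ubuntu keeps general system messages here):
#   error_summary /var/log/syslog
```

The line numbers make it easy to jump to the surrounding context in the file afterwards.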
7. Network Ports Analysis
The network aspect should not be ignored as network glitches are common and may impact the system and traffic flows. Common network issues include port exhaustion, port choking, unreleased resources, and so on.
To identify such issues, we need to understand port (connection) states.
Some of these states are briefly explained below.
| LISTEN | The port is waiting for a connection request from any remote TCP and port. |
| ESTABLISHED | The connection is open, and data received can be delivered to the application. |
| TIME_WAIT | The local end is waiting long enough to be sure the remote TCP received the acknowledgment of its connection termination request. |
| FIN_WAIT2 | The local end is waiting for a connection termination request from the remote TCP. |
Let’s explore how we can analyze port-related information in Linux.
The range of local ports used for outgoing connections is defined in the system and can be widened or narrowed as needed. In the snippet below, the range is from 15000 to 65000, which makes roughly 50000 (65000 - 15000) available ports. If the number of ports in use approaches this limit, there is an issue.
$ /sbin/sysctl net.ipv4.ip_local_port_range
net.ipv4.ip_local_port_range = 15000 65000
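The same range can be read straight from /proc, which makes the arithmetic scriptable. This is a minimal sketch assuming a Linux system; port_count is a made-up helper name. Note that it counts both ends of the range, so 15000 to 65000 comes out as 50001 rather than the rounded 50000.

```shell
#!/bin/sh
# port_count LOW HIGH: number of ports in the range, both ends included.
port_count() {
    echo $(( $2 - $1 + 1 ))
}

# The kernel exposes the same two numbers that sysctl prints.
range_file=/proc/sys/net/ipv4/ip_local_port_range
if [ -r "$range_file" ]; then
    read -r low high < "$range_file"
    echo "local port range: $low-$high ($(port_count "$low" "$high") ports)"
fi
```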
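To list the established connections, filter the netstat output. The first column shows the connection type (‘tcp’), the next two show the receive and send queue sizes, then come the local and remote IPs/ports, and STATE is shown on the right-hand side.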
$ netstat -an | grep EST
tcp        0      0 172.31.203.233:37145    172.31.199.132:9984     ESTABLISHED
tcp        0      0 127.0.0.1:40312         127.0.0.1:59770         ESTABLISHED
tcp        0      0 172.31.203.213:22       172.31.208.6:58816      ESTABLISHED
tcp        0      0 172.31.203.213:20677    172.31.252.164:9997     ESTABLISHED
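Rather than grepping for one state at a time, the whole netstat listing can be summarised per state. The following is a sketch only: tcp_states is a hypothetical helper name, and it assumes the net-tools netstat layout shown above, where the state is the sixth column of each tcp line.

```shell
#!/bin/sh
# tcp_states: read `netstat -an` output on stdin and print the number of
# TCP sockets in each state (the 6th column of lines starting with "tcp").
tcp_states() {
    awk '$1 ~ /^tcp/ { n[$6]++ } END { for (s in n) print s, n[s] }' | sort
}

# netstat is not installed by default everywhere, so check for it first.
if command -v netstat >/dev/null 2>&1; then
    netstat -an | tcp_states
fi
```

A large and growing TIME_WAIT or ESTABLISHED count, compared against the size of the local port range, is a hint that ports are running out.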
8. Identifying Packet loss
In system monitoring, we need to ensure that outgoing and incoming communication is intact.
One helpful command is ping. Ping sends ICMP echo requests to the destination system and reports the replies. Note the last few lines of statistics, which show the packet loss percentage and the round-trip times.
# ping <destination IP>
$ ping 10.13.6.113
PING 10.13.6.113 (10.13.6.113) 56(84) bytes of data.
64 bytes from 10.13.6.113: icmp_seq=1 ttl=128 time=0.652 ms
64 bytes from 10.13.6.113: icmp_seq=2 ttl=128 time=0.593 ms
64 bytes from 10.13.6.113: icmp_seq=3 ttl=128 time=0.478 ms
64 bytes from 10.13.6.113: icmp_seq=4 ttl=128 time=0.384 ms
64 bytes from 10.13.6.113: icmp_seq=5 ttl=128 time=0.432 ms
64 bytes from 10.13.6.113: icmp_seq=6 ttl=128 time=0.747 ms
64 bytes from 10.13.6.113: icmp_seq=7 ttl=128 time=0.379 ms
^C
--- 10.13.6.113 ping statistics ---
7 packets transmitted, 7 received, 0% packet loss, time 6001ms
rtt min/avg/max/mdev = 0.379/0.523/0.747/0.134 ms
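For scripted checks, the packet loss figure can be pulled out of the statistics line. This is a rough sketch that assumes the Linux ping summary format shown above; packet_loss is a made-up helper name.

```shell
#!/bin/sh
# packet_loss: read ping output on stdin and print the packet loss
# percentage from the summary line, without the % sign.
packet_loss() {
    awk -F', *' '/packet loss/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /packet loss/) { sub(/%.*/, "", $i); print $i }
    }'
}

# Example: -c 5 sends five requests and then stops on its own.
#   ping -c 5 10.13.6.113 | packet_loss
```

Using -c avoids having to press Ctrl+C, which matters when the check runs unattended.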
Packets can also be captured at runtime using ‘tcpdump’. We’ll look into it later.
9. Gathering stats for issue post mortem
It is always good practice to gather stats that will be useful for identifying the root cause later. Usually, after a system reboot or a service restart, we lose the earlier system snapshot and logs.
Confluence is a collaboration tool that offers an excellent incident post mortem template to record the findings along the way. You can find the template here.
Below are some of the methods to capture a system snapshot.
9.1 Logs Backup
Before making any changes, copy the log files to another location. This is crucial for understanding what condition the system was in at the time of the issue. Sometimes log files are the only window into past system states, as other runtime stats are lost.
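One way to make this a habit is a small helper that copies a log directory into a timestamped folder, so earlier backups are never overwritten. The sketch below is an illustration only: backup_logs and the example paths are made up, and cp -a (which preserves timestamps and permissions) is assumed to be available, as it is with GNU coreutils.

```shell
#!/bin/sh
# backup_logs LOG_DIR BACKUP_ROOT: copy everything in LOG_DIR into a new
# timestamped directory under BACKUP_ROOT and print that directory's path.
backup_logs() {
    log_dir=$1
    target="$2/logs-$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$target" && cp -a "$log_dir"/. "$target"/ && echo "$target"
}

# Example (assumed paths; reading some system logs may require sudo):
#   backup_logs /var/log/myapp /tmp/incident
```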
9.2 TCP Dump
Tcpdump is a command-line utility that allows you to capture and analyze incoming and outgoing network traffic. It is mostly used to help troubleshoot network issues. If you feel that system traffic is being impacted, take a tcpdump capture as follows:
# Use .pcap as the file extension.
# Stop the command after a few minutes, as the file size may increase.
sudo tcpdump -i any -w <filepath and filename>
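If you are worried about forgetting to stop the capture, it can be given a time limit up front. The sketch below wraps tcpdump in the coreutils timeout command, which is a swapped-in convenience rather than part of the command above; pcap_path and capture are made-up helper names, and the example directory is an assumption.

```shell
#!/bin/sh
# pcap_path DIR: build a timestamped .pcap file path inside DIR.
pcap_path() {
    echo "$1/capture-$(date +%Y%m%d-%H%M%S).pcap"
}

# capture SECONDS DIR: record traffic on all interfaces into a new file in
# DIR; timeout stops tcpdump once SECONDS have passed.
capture() {
    sudo timeout "$1" tcpdump -i any -w "$(pcap_path "$2")"
}

# Example: capture two minutes of traffic into /tmp.
#   capture 120 /tmp
```

The timestamp in the file name keeps repeated captures from overwriting each other during a long incident.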
This blog post is a handbook for generic system monitoring. The process is more or less the same on all *nix-like systems. A healthy system and timely checks ensure smooth operations (and peaceful weekends 😀). The process described here acts as a window into the system’s health. Remember, the system has a lot to tell you; be all ears and you will never be disappointed 🙂