System Monitoring using conventional methods

You may be wondering why we are discussing conventional methods for system monitoring. Aren’t they a thing of the past?

Well, we must never forget our roots! They make us stronger.

Why do we need to understand monitoring?

System monitoring is an important aspect of system administration. Enterprise applications demand a high level of proactiveness to reduce, or even prevent, the impact of failures. For that purpose, numerous efficient tools are available today that monitor the system for us automatically. They can gather stats, generate alerts based on thresholds, and present the data in excellent visuals. ELK, Nagios, and Pandora FMS are some examples. AI makes monitoring tools even smarter: Darktrace, for example, is a cybersecurity defense tool that uses AI to detect threats in real time.

However, have you ever wondered how these tools work? To understand them better, we need to understand manual system monitoring. It is also possible (rare, but possible) that automated monitoring systems are unavailable, for example while they are undergoing maintenance. Can you imagine the chaos if your system were crashing while the monitoring tools were on vacation? YIKES! 😮 At that point, manual monitoring is a lifesaver.

You might also be experimenting with VMs or setting up your own lab. In the early phases, when monitoring tools are not yet installed, detailed system insights can be gathered using the methods shared in this blog.

Linux offers very powerful tools to gauge system health. The commands I’ll be discussing are generic and available on most Linux distros, although a few (such as netstat) may need to be installed separately. I am using Ubuntu.

By the way, if you want to follow along, check out Replit. There you can launch a ready-to-use Linux shell within minutes.

1. Finding Load Average and system uptime

System reboots may occur, and they can also disturb some configurations. To check how long the machine has been up, use the uptime command. In addition to the uptime, the command also displays the load average.

The three values on the right represent the load average: the average system load over the last 1, 5, and 15 minutes. A quick glance indicates whether the system load is increasing or decreasing over time.

[[email protected] ~]$ uptime
19:15:00 up  1:04,  0 users,  load average: 2.92, 4.48, 5.20

The ideal CPU queue length is 0, which is only possible when no tasks are waiting for the CPU.

* A per-CPU load lower than 1 is considered normal.

The per-CPU load can be calculated by dividing the load average by the total number of CPUs available.

Find the number of CPUs with the command: lscpu

[[email protected] ~]$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              4
output omitted

Figure: Calculating the load average and the per-CPU load average
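
As a rough sketch of the calculation, the per-CPU load is just the load average divided by the CPU count. The helper name per_cpu_load is my own invention for illustration, and the numbers are taken from the uptime and lscpu output shown above.

```shell
# per_cpu_load LOAD CPUS: divide a load average by the number of CPUs.
# (Hypothetical helper name, used only for this illustration.)
per_cpu_load() {
  awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}

# Values from the sample uptime output, on a machine with 4 CPUs.
per_cpu_load 2.92 4   # 1-minute average
per_cpu_load 4.48 4   # 5-minute average
per_cpu_load 5.20 4   # 15-minute average
```

This prints 0.73, 1.12, and 1.30. Only the 1-minute value is below 1, so tasks were queuing for the CPU over the last 5 and 15 minutes, but the load is coming down. On a live machine you could try per_cpu_load "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)", assuming /proc/loadavg and nproc are available on your system.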

To spot red flags early, keep an eye on the load average!

2. Calculating Free Memory

Sometimes, high memory utilisation causes problems. To check the available memory, use the command below. The ‘available’ column estimates how much memory can be used for starting new applications without swapping.

[[email protected] ~]$ free -mh
                total        used        free      shared  buff/cache   available
Mem:            25G        6.6G         11G         15M        7.3G         18G
Swap:            0B          0B          0B
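
If you prefer a single number, here is a minimal sketch that turns the same information into a percentage. It assumes a kernel that exposes MemTotal and MemAvailable in /proc/meminfo; the function name mem_available_pct is hypothetical.

```shell
# mem_available_pct: read /proc/meminfo-style text on stdin and print
# MemAvailable as a whole-number percentage of MemTotal.
# (Hypothetical helper; assumes both fields are present in the input.)
mem_available_pct() {
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END { if (total > 0) printf "%d\n", avail * 100 / total }'
}

# Usage on a live system:
#   mem_available_pct < /proc/meminfo
```

For the sample output above, 18G available out of 25G total works out to roughly 72%.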

3. Calculating Disk Space

To ensure the system is healthy, don’t overlook disk space. To list all the mounted filesystems and their respective usage percentages, use the command below. Ideally, disk utilization should not exceed 80%.

[[email protected] ~]$ df -h

Filesystem              Size  Used Avail Use% Mounted on
 overlay                  97G   52G   46G  54% /
 tmpfs                    64M     0   64M   0% /dev
 tmpfs                    13G     0   13G   0% /sys/fs/cgroup
 overlay                 2.4G  904K  2.4G   1% /nix
 tmpfs                   2.6G   17M  2.6G   1% /io
 tmpfs                    13G     0   13G   0% /proc/acpi
 tmpfs                    13G     0   13G   0% /proc/scsi
 tmpfs                    13G     0   13G   0% /sys/firmware
 /dev/mapper/conman-850  2.4G  892K  2.4G   1% /home/runner/Bash
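
To apply the 80% rule without reading every row, a small filter along these lines might help. It is only a sketch: the function name disks_over_threshold is made up, it expects the POSIX output format of df -P, and it assumes mount points do not contain spaces.

```shell
# disks_over_threshold LIMIT: read `df -P` output on stdin and print
# "mount-point usage%" for every filesystem above LIMIT percent.
# (Hypothetical helper; assumes mount points contain no spaces.)
disks_over_threshold() {
  awk -v limit="$1" 'NR > 1 {
      use = $5
      sub(/%/, "", use)              # strip the trailing percent sign
      if (use + 0 > limit + 0) print $6, use "%"
  }'
}

# Usage on a live system:
#   df -P | disks_over_threshold 80
```

No output means every filesystem is at or below the limit, which would be the case for the sample listing above.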

4. Understanding Process states

Processes are central to any operating system. The ps command gives insight into the running processes and their states. The STAT column (the eighth column) in the output below shows the process states.

[[email protected] ~]$ ps aux
 runner         1  0.1  0.0 1535464 15576 ?       S  19:18   0:00 /inject/init
 runner        14  0.0  0.0  21484  3836 pts/0    S   19:21   0:00 bash --norc
 runner        22  0.0  0.0  37380  3176 pts/0    R+   19:23   0:00 ps aux

Below are some of the common process states:

State     Code  Description
Running   R     The task is running or ready to run.
Sleeping  S     The task is in an interruptible sleep, waiting for an event to complete.
Stopped   T     The task has been stopped (suspended) after receiving a stop signal.
Zombie    Z     The task has terminated but has not yet been reaped by its parent.
Process States
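
On a busy machine it can be handy to count processes per state instead of scanning the whole list. The sketch below groups by the first letter of the STAT value; count_states is a hypothetical name, and the usage line assumes a ps that accepts -eo stat= (as the procps version on Linux does).

```shell
# count_states: read STAT values on stdin (one per line) and print
# "STATE COUNT" for each leading state letter.
# (Hypothetical helper; extra flags such as "+" or "s" are ignored.)
count_states() {
  cut -c1 | sort | uniq -c | awk '{ print $2, $1 }'
}

# Usage on a live system:
#   ps -eo stat= | count_states
```

A growing count next to Z is usually worth a closer look, since it means terminated processes are not being cleaned up by their parents.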

5. Real Time System monitoring

Real-time monitoring is one of the most powerful capabilities. To check runtime stats, use the ‘top’ command.

The top program displays a dynamic view of the system’s processes, showing a summary header followed by a process or thread list. Unlike its static counterpart, ‘ps’, ‘top’ continuously refreshes the system stats.

With ‘top’, we can see well-organised details in a compact window. There are a number of flags, shortcuts, and highlighting methods that come with ‘top’. Do experiment with it, as it is powerful and convenient to use.

You can also kill processes from within ‘top’. Press ‘k’ and enter the process ID.

Figure: top command output

6. Interpreting Logs

System and application logs carry tons of information about what the system is going through. It’s like a dialogue between you and the system (but one-sided 😀). They contain useful information and error codes that point towards the underlying problems. Searching the logs for error codes can greatly reduce the time needed to identify and fix an issue.

For detailed log parsing, check this post.
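
As a starting point for that kind of search, here is a minimal sketch built on grep. The keyword list and the function name find_errors are my own assumptions; adjust the pattern to the error codes your application actually writes.

```shell
# find_errors FILE: print lines, prefixed with their line numbers, that
# mention common failure keywords (case-insensitive).
# (Hypothetical helper; the keyword list is only an example.)
find_errors() {
  grep -inE 'error|fail|critical' "$1"
}

# Usage, assuming your distro writes system logs to /var/log/syslog:
#   find_errors /var/log/syslog
```

The line numbers make it easy to jump to the surrounding context in the log file afterwards.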

7. Network Ports Analysis

The network aspect should not be ignored as network glitches are common and may impact the system and traffic flows. Common network issues include port exhaustion, port choking, unreleased resources, and so on.

To identify such issues, we need to understand port states, which are really TCP connection states.

Some of them are explained briefly below.

State        Description
LISTEN       Waiting for a connection request from any remote TCP and port.
ESTABLISHED  The connection is open, and data received can be delivered to the destination.
TIME_WAIT    Waiting for enough time to pass to be sure the remote TCP received the acknowledgment of its connection termination request.
FIN_WAIT2    Waiting for a connection termination request from the remote TCP.
Port states

Let’s explore how we can analyze port related information in Linux.

The range of local (ephemeral) ports is defined in the system and can be widened or narrowed as needed. In the snippet below, the range is from 15000 to 65000, which gives about 50,000 available ports (65000 - 15000 + 1 = 50001, since both ends are included). If the number of ports in use approaches this limit, there is an issue.

[[email protected] ~]$ /sbin/sysctl net.ipv4.ip_local_port_range
net.ipv4.ip_local_port_range = 15000    65000
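
The arithmetic can be scripted as well. This sketch reads the sysctl line and prints how many ports the range contains, counting both ends; port_range_size is a hypothetical name.

```shell
# port_range_size: read the output of
# `sysctl net.ipv4.ip_local_port_range` on stdin and print the number
# of ports in the range (both ends included).
# (Hypothetical helper; expects the two limits as the last two fields.)
port_range_size() {
  awk '{ print $NF - $(NF - 1) + 1 }'
}

# Usage on a live system:
#   /sbin/sysctl net.ipv4.ip_local_port_range | port_range_size
```

For the range shown above, this prints 50001.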

In the netstat output, the first column shows the connection type (tcp), the following columns show the queue sizes and the local and remote IPs/ports, and the STATE is shown on the right-hand side.

[[email protected] ~]$ netstat -an | grep EST 
 tcp        0      0     ESTABLISHED
 tcp        0      0         ESTABLISHED
 tcp        0      0      ESTABLISHED
 tcp        0      0     ESTABLISHED
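
Rather than grepping for one state at a time, you can summarise all TCP states in one pass. The sketch below assumes the usual netstat -an column layout, with the state in the sixth column; tcp_state_counts is a made-up name.

```shell
# tcp_state_counts: read `netstat -an` output on stdin and print
# "STATE COUNT" for every TCP connection state, sorted by state name.
# (Hypothetical helper; assumes the state is the 6th column.)
tcp_state_counts() {
  awk '$1 ~ /^tcp/ { count[$6]++ }
       END { for (s in count) print s, count[s] }' | sort
}

# Usage on a live system:
#   netstat -an | tcp_state_counts
```

Comparing the totals against the size of the local port range gives a quick feel for how close the system is to port exhaustion. A large and growing TIME_WAIT count is a common early hint.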

8. Identifying Packet loss

In system monitoring, we need to ensure that outgoing and incoming communication is intact.

One helpful command is ping. It sends ICMP echo requests to the destination system and reports the replies. Note the last few lines of statistics, which show the packet loss percentage and the round-trip times.

# ping destination IP

[[email protected] ~]$ ping
 PING ( 56(84) bytes of data.
 64 bytes from icmp_seq=1 ttl=128 time=0.652 ms
 64 bytes from icmp_seq=2 ttl=128 time=0.593 ms
 64 bytes from icmp_seq=3 ttl=128 time=0.478 ms
 64 bytes from icmp_seq=4 ttl=128 time=0.384 ms
 64 bytes from icmp_seq=5 ttl=128 time=0.432 ms
 64 bytes from icmp_seq=6 ttl=128 time=0.747 ms
 64 bytes from icmp_seq=7 ttl=128 time=0.379 ms
 --- ping statistics ---
 7 packets transmitted, 7 received, 0% packet loss, time 6001ms
 rtt min/avg/max/mdev = 0.379/0.523/0.747/0.134 ms
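
If you want to check packet loss from a script, the percentage can be pulled out of the summary line. This is a sketch that assumes the summary contains the text "% packet loss" as in the output above; packet_loss_pct is a hypothetical name.

```shell
# packet_loss_pct: read ping output on stdin and print the packet loss
# percentage from the statistics summary.
# (Hypothetical helper; prints nothing if no summary line is found.)
packet_loss_pct() {
  grep -oE '[0-9.]+% packet loss' | cut -d'%' -f1
}

# Usage, sending 5 probes and then stopping:
#   ping -c 5 <destination IP> | packet_loss_pct
```

Anything above 0 on a local network deserves a second look before you blame the application.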

Packets can also be captured at runtime using ‘tcpdump’. We’ll look into it later.

9. Gathering stats for issue post mortem

It is always good practice to gather stats that will be useful for identifying the root cause later. Usually, after a system reboot or a service restart, we lose the earlier system snapshot and logs.

Confluence is a collaboration tool that offers an excellent incident post mortem template to record the findings along the way. You can find the template here.

Below are some of the methods to capture a system snapshot.

9.1 Logs Backup

Before making any changes, copy the log files to another location. This is crucial for understanding what condition the system was in at the time of the issue. Sometimes log files are the only window into past system states, as other runtime stats are lost.
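
One way to do this, sketched under the assumption that a plain copy into a timestamped directory is enough for your case (backup_logs and the paths in the usage line are hypothetical):

```shell
# backup_logs SRC_DIR DEST_ROOT: copy everything in SRC_DIR into a new
# timestamped directory under DEST_ROOT and print the new path.
# (Hypothetical helper; runs in a subshell so its variables stay local.)
backup_logs() (
  src="$1"
  dest="$2/logs-$(date +%Y%m%d-%H%M%S)"
  # cp -a preserves timestamps and permissions, which matter for a post mortem.
  mkdir -p "$dest" && cp -a "$src/." "$dest/" && printf '%s\n' "$dest"
)

# Usage (run as a user that can read the logs):
#   backup_logs /var/log /tmp/incident
```

Keeping the timestamp in the directory name makes it clear which snapshot belongs to which incident.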

9.2 TCP Dump

Tcpdump is a command-line utility that allows you to capture and analyze incoming and outgoing network traffic. It is mostly used to help troubleshoot network issues. If you feel that system traffic is being impacted, capture a dump as follows:

sudo tcpdump -i any -w <filepath and filename>

# Stop the command after a few minutes, as the file can grow quickly
# Use .pcap as the file extension

Once the capture is complete, you can use tools like Wireshark to analyze the dump visually. To learn about tcpdump in depth, check the official site or the man pages.

This blog post is a handbook for generic system monitoring. The process is very similar on all *nix-like systems. A healthy system and timely checks ensure smooth operations (and peaceful weekends 😀). The methods described here act as a window into system health. Remember, the system has a lot to tell you; be all ears and you will never be disappointed 🙂