# Step 6 — System Info & Monitoring (Ubuntu 24)
> **Type along** exactly as shown. Safe to run on your machine; nothing here makes persistent config changes (a few optional installs use `apt`).
> **Estimated time:** ~15-20 minutes
---
## What you'll learn
- Identify OS, kernel, uptime, and hardware at a glance
- Inspect **CPU**, **memory**, **disks/partitions**, and **network**
- Watch processes and system load with `top`/`htop`, `ps`, and `vmstat`
- Check running **services** with `systemd` (`systemctl`) and view **logs** (`journalctl`)
- Measure disk/CPU/IO hot spots with `iostat`, `pidstat`, `iotop`
- Run a quick “health sweep” to triage performance issues
> **Setup:** Use your lab folder:
> ```bash
> mkdir -p ~/playground && cd ~/playground
> ```
---
## 0) Snapshot: who, what, where
Start with the fastest high-level checks.
```bash
whoami
hostnamectl # hostname, OS, kernel
uname -rsm # kernel: release + machine arch
cat /etc/os-release # distro details
uptime -p # human-friendly uptime
uptime # load averages (1/5/15 min)
date # current time (TZ matters!)
```
> **Load average rule of thumb:** On a 4-core machine, a 1-minute load of ~4 means “fully busy”; much higher means queued work.
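To apply the rule of thumb without mental arithmetic, the sketch below divides the 1-minute load by the logical CPU count; a result near 1.00 means roughly "fully busy". The function name `load_per_core` is made up for this lab and is not a standard tool.

```bash
# Hypothetical helper: divide a load average by the logical CPU count.
load_per_core() {
  # $1 = load average (e.g. 3.20), $2 = logical CPU count (e.g. 4)
  awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Live values: the first field of /proc/loadavg is the 1-minute load.
if [ -r /proc/loadavg ]; then
  read -r one_min _ < /proc/loadavg
  echo "1-min load per core: $(load_per_core "$one_min" "$(nproc)")"
fi
```

Values well above 1.00 suggest work is queuing; values far below it suggest spare capacity.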
---
## 1) CPU & memory
### CPU facts
```bash
lscpu # sockets, cores, threads, flags
nproc # logical CPU count
grep -m1 'model name' /proc/cpuinfo # CPU model string
```
### Memory / swap
```bash
free -h # Mem/Swap usage (human units)
vmstat -s | head -20 # memory counters summary
swapon --show # active swap areas
head -n 10 /proc/meminfo # first 10 kernel memory counters
```
> **Tip:** High `si/so` (swap-in/out) in `vmstat 1` output usually correlates with memory pressure.
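Reading `si`/`so` by eye across several samples gets tedious, so here is a minimal sketch that averages both columns from `vmstat` output. `swap_activity` is a hypothetical name. It finds the columns by their header labels rather than by position. Note that the first `vmstat` sample reports averages since boot, so treat the result as a rough indicator only.

```bash
# Hypothetical helper: average the si/so columns of vmstat output read from stdin.
swap_activity() {
  awk '
    # Header row: remember which columns are labelled si and so.
    $1 == "r" {
      for (i = 1; i <= NF; i++) { if ($i == "si") si = i; if ($i == "so") so = i }
      next
    }
    # Data rows start with a number; ignore anything before the header.
    si && $1 ~ /^[0-9]+$/ { in_sum += $si; out_sum += $so; n++ }
    END { if (n) printf "si_avg=%.1f so_avg=%.1f samples=%d\n", in_sum / n, out_sum / n, n }
  '
}

# Live run: three samples, one second apart (skipped if vmstat is absent).
if command -v vmstat >/dev/null 2>&1; then
  vmstat 1 3 | swap_activity
fi
```

Averages that stay at 0.0 mean the system is not actively swapping, even if `free -h` shows some swap in use.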
---
## 2) Processes, load, and scheduling
### `top` basics (built-in)
```bash
top
```
Useful keys inside **top**:
- `1`: show per-CPU cores
- `M`: sort by memory; `P`: sort by CPU; `T`: by time
- `k`: kill a process (enter PID, then signal like `15`)
- `Shift+E`: change units (KiB/MiB)
- `q`: quit
### `htop` (nicer UI)
```bash
sudo apt update && sudo apt install -y htop
htop
```
### Point-in-time listings with `ps`
```bash
ps aux --sort=-%cpu | head -15
ps aux --sort=-%mem | head -15
command -v pstree >/dev/null && pstree -a | head -40 || echo 'Install: sudo apt install -y psmisc'
```
### Niceness & signals (be careful)
```bash
nice -n 10 sleep 60 & # start lower priority
pid=$!
renice -n 15 -p "$pid" # set niceness to 15 (lower priority; the value is absolute)
kill -15 "$pid" # graceful
# kill -9 "$pid" # last resort (commented)
```
---
## 3) Disks, partitions, and filesystem usage
### Layout & mounts
```bash
lsblk -f # devices, filesystems, labels
mount | column -t | head -20
findmnt -t ext4,xfs # mounted FS by type
```
### Space usage
```bash
df -hT # size, used, type per mount
# top 20 heavy dirs under /
sudo du -xh / | sort -h | tail -20
# top 10 heavy dirs under /var (faster)
sudo du -h -d1 /var | sort -h | tail -10
# inode pressure
df -i
```
> **When a disk looks full but `df -h` shows space free:** check **inodes** with `df -i`. Lots of tiny files can exhaust inodes.
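To check both kinds of exhaustion in one pass, the sketch below filters `df` output for mounts at or above a percentage threshold. It works on block usage (`df -P`) and inode usage (`df -Pi`) because both put the percentage in the fifth column. `over_threshold` and the 90% limit are illustrative choices, not standard names or recommended values.

```bash
# Hypothetical helper: print mounts whose usage is at or above a threshold.
over_threshold() {
  # $1 = percentage threshold; stdin = output of `df -P` or `df -Pi`
  awk -v limit="$1" '
    NR > 1 {
      pct = $5; sub(/%$/, "", pct)     # "97%" becomes "97"; "-" stays "-"
      if (pct ~ /^[0-9]+$/ && pct + 0 >= limit) print $6, pct "%"
    }
  '
}

echo '--- block usage >= 90% ---'
df -P  | over_threshold 90
echo '--- inode usage >= 90% ---'
df -Pi | over_threshold 90
```

Filesystems that report `-` for inode usage are skipped. Mount points containing spaces are truncated at the first space in this simple version.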
### Disk performance (optional)
```bash
sudo apt install -y sysstat iotop
# per-disk stats, queue, utilization
iostat -xz 1 3
# per-process IO: -o only active, -P processes (not threads), -a accumulated totals
sudo iotop -oPa
```
---
## 4) Network quick checks
```bash
ip -br a # brief addresses
ip r # routing table
hostname -I # IPs only
resolvectl status | sed -n '1,40p' # DNS view
```
Listening ports and sockets:
```bash
ss -tulpn | head -30 # TCP/UDP listeners with PIDs
ss -s # socket summary
```
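When you only want the port numbers, the listener output can be condensed. The following is a rough sketch that assumes the default column layout of `ss -tln`, where the local address and port sit in the fourth column; `listening_ports` is a hypothetical helper name.

```bash
# Hypothetical helper: reduce `ss -tlnH` output to a sorted list of unique ports.
listening_ports() {
  # The port is whatever follows the last ":" in the local address column,
  # which also handles IPv6 forms such as [::]:22.
  awk '{ n = split($4, part, ":"); print part[n] }' | sort -n -u
}

# -t TCP, -l listening, -n numeric, -H no header row
if command -v ss >/dev/null 2>&1; then
  ss -tlnH | listening_ports
fi
```

Compare the list against what you expect the machine to serve; an unfamiliar port is a cue to rerun `sudo ss -tulpn` and look at the owning process.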
Connectivity basics:
```bash
ping -c 3 8.8.8.8
ping -c 3 google.com
sudo apt install -y mtr-tiny dnsutils
mtr -rwbzc 50 google.com # quick route quality report
dig +short google.com
curl -I https://example.com
```
> **If DNS is flaky:** try `dig @1.1.1.1 example.com` to bypass local resolvers.
---
## 5) Services with systemd
List running services and check status:
```bash
systemctl list-units --type=service --state=running | head -30
systemctl status ssh
systemctl is-enabled ssh
```
Restart a service (requires sudo):
```bash
sudo systemctl restart ssh
```
See boot performance & failures:
```bash
systemd-analyze time
systemd-analyze blame | head -20
systemctl --failed
```
---
## 6) Logs: journald and friends
Live/system logs:
```bash
journalctl -n 200 --no-pager # last 200 lines
journalctl -f # follow
journalctl -p warning --since '1 hour ago'
```
Service-specific logs:
```bash
journalctl -u ssh --since 'today' --no-pager
```
Kernel messages and last boot:
```bash
journalctl -k --since '1 hour ago'
journalctl -b -1 --no-pager # previous boot
```
Legacy file logs (some distros still write to these):
```bash
sudo tail -n 200 /var/log/syslog
sudo grep -i 'oom' /var/log/kern.log || true
```
> **Tip:** Use `-g PATTERN` to grep inside `journalctl`, e.g., `journalctl -g "Out of memory" -k`.
---
## 7) Resource pressure signals (advanced but handy)
Linux exposes PSI (Pressure Stall Information):
```bash
cat /proc/pressure/{cpu,io,memory}
```
If you see sustained **some**/**full** memory or IO pressure, correlate with `iostat`, `vmstat`, and logs (possible OOMs or slow disks).
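Each PSI file holds lines such as `some avg10=0.00 avg60=0.00 avg300=0.00 total=0`, where `avg10` is the percentage of the last 10 seconds during which tasks were stalled. The sketch below pulls out just that figure per resource so it is easier to watch; `psi_avg10` is a made-up helper name.

```bash
# Hypothetical helper: extract the avg10 value from one /proc/pressure file.
psi_avg10() {
  # $1 = "some" or "full"; stdin = contents of the pressure file
  awk -v kind="$1" '$1 == kind { sub(/^avg10=/, "", $2); print $2 }'
}

for res in cpu io memory; do
  f=/proc/pressure/$res
  # PSI may be missing on kernels built or booted without it.
  if [ -r "$f" ]; then
    echo "$res: some avg10=$(psi_avg10 some < "$f")%"
  fi
done
```

Run it a few times in a row: a value that stays well above zero is the "sustained" pressure worth correlating with the other tools.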
OOM killer evidence:
```bash
journalctl -k -g 'Out of memory' --since '1 day ago'
dmesg -T | grep -i 'killed process' | tail -n 10
```
---
## 8) Quick health sweep (copy/paste)
Run this as a one-shot collection for triage (prints to screen):
```bash
{ echo '=== SNAPSHOT ===';
date; hostnamectl | sed -n '1,8p'; uname -rsm; uptime; echo;
echo '=== CPU/MEM ==='; lscpu | sed -n '1,8p'; free -h; vmstat 1 3; echo;
echo '=== DISK ==='; df -hT; iostat -xz 1 2 2>/dev/null || true; echo;
echo '=== TOP PROCS ==='; ps aux --sort=-%cpu | head -10; ps aux --sort=-%mem | head -10; echo;
echo '=== NET ==='; ip -br a; ss -tulpn | head -20; echo;
echo '=== SERVICES ==='; systemctl --failed || true; systemd-analyze time || true; echo;
echo '=== LOGS (last 50) ==='; journalctl -n 50 --no-pager;
} | sed 's/\x1b\[[0-9;]*m//g'
```
> **Note:** Some parts need packages (`sysstat`) or privileges. A missing `iostat` is skipped silently; any other missing tool prints an error and the sweep continues.
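If you would rather see an explicit message than a shell error when a tool is absent, a small wrapper can guard each command. This is only a sketch; `run_if_present` is a hypothetical name, not part of any package.

```bash
# Hypothetical wrapper: run a command only if its executable can be found.
run_if_present() {
  if command -v "$1" >/dev/null 2>&1; then
    "$@"
  else
    echo "skipped: $1 not installed"
  fi
}

run_if_present uname -rsm
run_if_present iostat -xz 1 2 || true   # needs the sysstat package
```

Prefixing the optional commands in the sweep with this wrapper keeps the report readable on machines where `sysstat` was never installed.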
---
## 9) Practice tasks (do these now)
1) Find your CPU core/thread count and current load.
*Hint:* `lscpu`, `uptime`, `top` (`1` key).
2) Identify the **top 5** memory-hungry processes and the **top 5** CPU-hungry processes.
*Hint:* `ps aux --sort=-%mem` / `--sort=-%cpu`.
3) Determine which directory under `/var` consumes the most space.
*Hint:* `sudo du -h -d1 /var | sort -h | tail -5`.
4) List all listening TCP sockets with owning PIDs.
*Hint:* `ss -tlpn` (add `sudo` to see PIDs owned by other users; `-tulpn` also lists UDP).
5) Check logs for **ssh** in the last hour and restart the service.
*Hint:* `journalctl -u ssh --since '1 hour ago'`, then `sudo systemctl restart ssh`.
6) (Optional) Install `sysstat` and run `iostat -xz 1 3`; identify any device >90% util.
---
## 10) Troubleshooting quick guide
- **High load avg with low CPU** → usually **IO wait** or many blocked procs. Check `iostat -xz`, `ps` with `STAT` column (`D` = uninterruptible IO sleep).
- **Memory spikes / OOM** → check `journalctl -k -g 'Out of memory'`, watch `free -h`, and find which process grew via `ps aux --sort=-rss | head`.
- **Disk full** → use `df -hT`; then `du` to find culprits; also check **inodes** (`df -i`).
- **Service down** → `systemctl status <svc>` then `journalctl -u <svc>` to see why; look for ExecStart errors, missing ports.
- **No DNS resolution** → `resolvectl status`, try `dig @1.1.1.1 example.com`; if that works, local resolver is suspect.
- **Network seems fine but web fails** → check outbound firewall/proxy, test raw IP `curl -I http://1.1.1.1`.
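For the high-load, low-CPU case, it helps to list exactly which processes are in uninterruptible sleep. The sketch below filters `ps` output on the `STAT` column; `blocked_procs` is a hypothetical helper name.

```bash
# Hypothetical helper: keep only processes whose STAT begins with D.
blocked_procs() {
  # stdin = output of `ps -eo stat,pid,comm`; prints "PID COMMAND" per match
  awk 'NR > 1 && $1 ~ /^D/ { print $2, $3 }'
}

if command -v ps >/dev/null 2>&1; then
  ps -eo stat,pid,comm | blocked_procs
fi
```

An empty result is normal on a healthy machine. The same PIDs showing up across repeated runs points toward a slow or stuck device, which `iostat -xz` can help confirm.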
---
## 11) Quick quiz (1 minute)
- What do the **three** numbers in `uptime` represent?
- Which tool shows **per-disk** queue and utilization quickly?
- How do you list services that **failed** on boot?
- One command to show **listening** sockets with PIDs?
- Where do you look for **OOM killer** events?
**Answers:** 1-, 5-, and 15-minute load averages; `iostat -xz`; `systemctl --failed`; `ss -tulpn`; `journalctl -k -g 'Out of memory'` (or `dmesg`).
---
## Next Step
Proceed to **Step 7 — Users & Authentication** to manage local users, groups, passwords, and SSH hardening.