Advanced Linux Troubleshooting Techniques for Site Reliability Engineers

Explore advanced troubleshooting techniques beyond basic tools and commands, focusing on kernel debugging, system call tracing, performance analysis, and eBPF.

Prashanth Ravula

May. 15, 24 · Tutorial

In Site Reliability Engineering (SRE), the ability to quickly and effectively troubleshoot issues within Linux systems is crucial. This article explores advanced troubleshooting techniques beyond basic tools and commands, focusing on kernel debugging, system call tracing, performance analysis, and using the Extended Berkeley Packet Filter (eBPF) for real-time data gathering.

Kernel Debugging

Kernel debugging is a fundamental skill for any SRE working with Linux. It allows for deep inspection of the kernel's behavior, which is critical when diagnosing system crashes or performance bottlenecks.

Tools and Techniques

GDB (GNU Debugger)

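GDB can debug both the Linux kernel and its loadable modules, provided a kernel image with debug symbols is available. When attached to a live kernel, for example through KGDB or a virtual machine's GDB stub, it allows setting breakpoints, stepping through the code, and inspecting variables.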

The official GNU Debugger documentation provides a comprehensive overview of its features.

KGDB

KGDB, the kernel's built-in debugger, allows the kernel to be debugged using GDB over a serial connection or a network. The kernel documentation page "Using kgdb, kdb and the kernel debugger internals" provides a detailed explanation of how kgdb can be enabled and configured.
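Before trying to attach GDB, it helps to confirm that the running kernel was built with KGDB support. The following is a minimal sketch, assuming the distribution ships its kernel config at /boot/config-$(uname -r) (common on Debian and Ubuntu, but not universal); the function name kgdb_config_status is hypothetical.

```shell
# Report whether the KGDB-related options are set in a kernel config file.
# Assumption: the config lives at /boot/config-<release>; pass another path
# as the first argument if your distribution stores it elsewhere.
kgdb_config_status() (
    config_file="${1:-/boot/config-$(uname -r)}"
    if [ ! -r "$config_file" ]; then
        echo "cannot read $config_file" >&2
        exit 1
    fi
    for option in CONFIG_KGDB CONFIG_KGDB_SERIAL_CONSOLE CONFIG_KGDB_KDB; do
        # An option counts as enabled when it is built in (=y) or a module (=m).
        if grep -E -q "^${option}=(y|m)$" "$config_file"; then
            echo "${option}: enabled"
        else
            echo "${option}: not enabled"
        fi
    done
)
```

Calling kgdb_config_status with no arguments checks the running kernel. If the options are enabled, the serial console for KGDB is typically selected with the kgdboc kernel parameter (for example kgdboc=ttyS0,115200), as described in the kernel documentation.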

Dynamic Debugging (dyndbg)

Linux's dynamic debug feature lets you enable and disable the kernel's pr_debug() and dev_dbg() messages at runtime, per module, file, function, or line, which helps trace kernel operations without rebuilding or rebooting the system. The official Dynamic Debug page describes how to use the dynamic debug (dyndbg) feature.
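As an illustration, dynamic debug is driven by writing short queries to a control file. The sketch below assumes debugfs is mounted at /sys/kernel/debug and that the kernel was built with CONFIG_DYNAMIC_DEBUG; the helper name dyndbg_module and the module name e1000e are only examples.

```shell
# Location of the dynamic debug control file. Assumes debugfs is mounted at
# /sys/kernel/debug; override DYNDBG_CONTROL if your system differs.
DYNDBG_CONTROL="${DYNDBG_CONTROL:-/sys/kernel/debug/dynamic_debug/control}"

# Write a "module <name> <flags>" query to the control file.
# Flags: +p enables printing for the matching call sites, -p disables it.
dyndbg_module() {
    printf 'module %s %s\n' "$1" "$2" > "$DYNDBG_CONTROL"
}

# Example session (run as root; "e1000e" is an illustrative module name):
#   dyndbg_module e1000e +p    # turn on the module's debug messages
#   dmesg --follow             # watch the messages arrive
#   dyndbg_module e1000e -p    # turn them off again
```

Reading the same control file lists every debug call site the kernel knows about, which is a convenient way to find the module or function names to target.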

Tracing System Calls With strace

strace is a powerful diagnostic tool that records the system calls a program makes and the signals it receives. It is instrumental in understanding the interaction between applications and the Linux kernel.

Usage

To trace system calls, strace can either attach to a running process or launch a new process under its control. It logs every system call along with its arguments and return value, which can then be analyzed to find faults in system operations.

Example:

root@ubuntu:~# strace -p 2009
strace: Process 2009 attached
munmap(0xe02057400000, 134221824)       = 0
mmap(NULL, 134221824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xe02057400000
munmap(0xe02057400000, 134221824)       = 0
mmap(NULL, 134221824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xe02057400000
munmap(0xe02057400000, 134221824)       = 0
mmap(NULL, 134221824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xe02057400000
munmap(0xe02057400000, 134221824)       = 0
mmap(NULL, 134221824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xe02057400000
munmap(0xe02057400000, 134221824)       = 0
mmap(NULL, 134221824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xe02057400000
munmap(0xe02057400000, 134221824)       = 0
mmap(NULL, 134221824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xe02057400000
munmap(0xe02057400000, 134221824)       = 0
mmap(NULL, 134221824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xe02057400000
munmap(0xe02057400000, 134221824)       = 0
  

In the example above, the -p flag attaches strace to an already running process, identified here by PID 2009. Similarly, the -o flag writes the trace to a file instead of dumping everything on the screen. For more background, you can review the article on understanding system calls on Linux with strace.
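A log written with -o can be summarized with standard text tools. The following is a minimal sketch, assuming the log came from a single process (so lines start with the system call name rather than a PID prefix); the function name summarize_strace_log is hypothetical.

```shell
# Count how often each system call appears in an strace log file.
# Assumption: each relevant line starts with "name(", as in the output of
# "strace -o trace.log -p <pid>" without the -f flag.
summarize_strace_log() {
    awk -F'(' '
        /^[a-z_0-9]+\(/ { count[$1]++ }   # field 1 is the syscall name
        END { for (name in count) printf "%d %s\n", count[name], name }
    ' "$1" | sort -k1,1nr -k2,2           # most frequent first, then by name
}

# Example:
#   strace -o /tmp/trace.log -p 2009     # stop with Ctrl-C after a while
#   summarize_strace_log /tmp/trace.log
```

For the trace shown above, such a summary would make the tight munmap/mmap loop obvious at a glance. strace can also produce a similar table itself with the -c flag.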

Performance Analysis With perf

perf is a versatile tool used for system performance analysis. It provides a rich set of commands to collect, analyze, and report on hardware and software events.

Key Features

perf record: Gathers performance data into a file, perf.data, which can be further analyzed using perf report to identify hotspots.
perf report: Analyzes the data collected by perf record and displays where most of the time was spent, helping identify performance bottlenecks.
Event-based sampling: perf can record data based on specific events, such as cache misses or CPU cycles, which helps pinpoint performance issues more accurately.

Example:

root@ubuntu:/tmp# perf record
^C[ perf record: Woken up 17 times to write data ]
[ perf record: Captured and wrote 4.619 MB perf.data (83123 samples) ]

root@ubuntu:/tmp#

root@ubuntu:/tmp# perf report
Samples: 83K of event 'cpu-clock:ppp', Event count (approx.): 20780750000
Overhead  Command          Shared Object             Symbol
74%  swapper          [kernel.kallsyms]         [k] cpuidle_idle_call
36%  stress           [kernel.kallsyms]         [k] __do_softirq
17%  stress           [kernel.kallsyms]         [k] finish_task_switch.isra.0
90%  stress           [kernel.kallsyms]         [k] el0_da
73%  stress           libc.so.6                 [.] random_r
92%  stress           [kernel.kallsyms]         [k] flush_end_io
87%  stress           libc.so.6                 [.] random
71%  stress           libc.so.6                 [.] 0x00000000001405bc
71%  kworker/0:2H-kb  [kernel.kallsyms]         [k] ata_scsi_queuecmd
58%  stress           libm.so.6                 [.] __sqrt_finite
45%  stress           stress                    [.] 0x0000000000000f14
62%  stress           stress                    [.] 0x000000000000168c
46%  stress           [kernel.kallsyms]         [k] __pi_clear_page
37%  stress           libc.so.6                 [.] rand
34%  stress           libc.so.6                 [.] 0x00000000001405c4
22%  stress           stress                    [.] 0x0000000000000e94
20%  stress           [kernel.kallsyms]         [k] folio_batch_move_lru
20%  stress           stress                    [.] 0x0000000000000f10
16%  stress           libc.so.6                 [.] 0x00000000001408d4
84%  stress           [kernel.kallsyms]         [k] handle_mm_fault
77%  stress           [kernel.kallsyms]         [k] release_pages
65%  stress           [kernel.kallsyms]         [k] super_lock
62%  stress           [kernel.kallsyms]         [k] _raw_spin_unlock_irqrestore
61%  stress           [kernel.kallsyms]         [k] blk_done_softirq
61%  stress           [kernel.kallsyms]         [k] _raw_spin_lock
60%  stress           [kernel.kallsyms]         [k] folio_add_lru
58%  kworker/0:2H-kb  [kernel.kallsyms]         [k] finish_task_switch.isra.0
55%  stress           [kernel.kallsyms]         [k] __rcu_read_lock
52%  stress           [kernel.kallsyms]         [k] percpu_ref_put_many.constprop.0
46%  stress           stress                    [.] 0x00000000000016e0
45%  stress           [kernel.kallsyms]         [k] __rcu_read_unlock
45%  stress           [kernel.kallsyms]         [k] dynamic_might_resched
42%  stress           [kernel.kallsyms]         [k] _raw_spin_unlock
41%  stress           [kernel.kallsyms]         [k] __mod_memcg_lruvec_state
40%  stress           [kernel.kallsyms]         [k] mas_walk
39%  stress           [kernel.kallsyms]         [k] arch_counter_get_cntvct
39%  stress           [kernel.kallsyms]         [k] rwsem_read_trylock
39%  stress           [kernel.kallsyms]         [k] up_read
38%  stress           [kernel.kallsyms]         [k] down_read
37%  stress           [kernel.kallsyms]         [k] get_mem_cgroup_from_mm
36%  stress           [kernel.kallsyms]         [k] free_unref_page_commit
34%  stress           [kernel.kallsyms]         [k] memset
32%  stress           libc.so.6                 [.] 0x00000000001408c8
30%  stress           [kernel.kallsyms]         [k] sync_inodes_sb
29%  stress           [kernel.kallsyms]         [k] iterate_supers
29%  stress           [kernel.kallsyms]         [k] percpu_counter_add_batch
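
The system-wide recording above can be narrowed to a single process and a specific event. The following is a minimal sketch of that workflow; the wrapper name perf_profile_pid, the DRY_RUN switch, and the 99 Hz sampling rate are illustrative choices rather than anything perf requires.

```shell
# Sample one process for a fixed time, then print a text report.
# Arguments: <pid> [seconds] [event]. Set DRY_RUN=1 to print the perf
# command instead of running it (useful for checking before a real run).
perf_profile_pid() (
    pid="$1"
    seconds="${2:-10}"
    event="${3:-cpu-clock}"
    # -F 99: sample at 99 Hz; -g: record call graphs; "sleep" bounds the run.
    set -- perf record -e "$event" -F 99 -g -p "$pid" -o perf.data -- sleep "$seconds"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@" && perf report -i perf.data --stdio
    fi
)

# Example (run as root): sample cache misses in PID 2009 for 5 seconds.
#   perf_profile_pid 2009 5 cache-misses
```

Which events are available depends on the hardware and kernel; perf list shows the ones supported on a given machine.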
  

Real-Time Data Gathering With eBPF

eBPF allows for creating small programs that run inside the Linux kernel in a sandboxed environment. These programs can attach to events such as system calls and network packets, providing real-time insights into system behavior.

Applications

Network monitoring: eBPF can monitor network traffic in real-time, providing insights into packet flow and protocol usage without significant performance overhead.
Security: eBPF helps implement security policies by monitoring system calls and network activity to detect and prevent malicious activities.
Performance monitoring: It can track application performance by monitoring function calls and system resource usage, helping SREs optimize performance.
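
To make this concrete, one common way to run small eBPF programs from the shell is bpftrace, a front end that the article itself does not cover and that is assumed here to be installed. The sketch below only builds the program text, so it can be inspected before being loaded into the kernel; the function name bpf_syscall_count_program is hypothetical.

```shell
# Print a bpftrace program that counts system calls per process name and
# exits on its own after the given number of seconds (default: 10).
bpf_syscall_count_program() {
    seconds="${1:-10}"
    printf 'tracepoint:raw_syscalls:sys_enter { @calls[comm] = count(); } interval:s:%s { exit(); }\n' "$seconds"
}

# Example (requires root and bpftrace):
#   sudo bpftrace -e "$(bpf_syscall_count_program 10)"
```

When the program exits, bpftrace prints the @calls map, giving a per-process count of system calls made during the sampling window.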

Conclusion

Advanced troubleshooting in Linux involves a combination of tools and techniques that provide deep insights into system operations. Tools like GDB, strace, perf, and eBPF are essential for any SRE looking to enhance their troubleshooting capabilities. By leveraging these tools, SREs can ensure the high reliability and performance of Linux systems in production environments.

Opinions expressed by DZone contributors are their own.
