Senior systems engineer with 10+ years working on high-throughput
networking infrastructure, kernel bypass stacks (DPDK, XDP/eBPF),
and distributed storage. Currently OSS contributor to several Go
networking libraries.
A practical guide to continuous profiling: from
runtime/pprof and the HTTP handler, through FlameGraph
generation, to reading the output without going blind.
Everything you need to squeeze more throughput out of a Linux server —
congestion control algorithms, socket buffer math, and the
net.core.* knobs that actually matter.
Walking through the HPACK header compression, SETTINGS frames, and
stream multiplexing that make HTTP/2 tick — implemented from scratch
using only the standard library.
Benchmarks and implementation notes for io_uring submission-queue
polling mode, compared to epoll edge-triggered on a
10 Gbps NIC under synthetic load.
Deep dive: eBPF for zero-overhead network flow monitoring
Mar 12, 2026 · 9 min read · pinned
Traditional network monitoring tools like tcpdump copy packets to userspace, introducing measurable overhead at high packet rates. XDP (eXpress Data Path) programs run in the NIC driver before the kernel network stack, enabling zero-copy per-flow statistics with sub-microsecond overhead.
The trick is using BPF maps as per-CPU hash tables keyed by a 5-tuple (src/dst IP, src/dst port, proto). Each XDP program increments counters in-place — no memory allocation, no locks on the hot path.
TC (traffic control) hooks cover egress — XDP only sees ingress. Together they give full bidirectional visibility. A userspace daemon reads the BPF maps every second via bpf_map_lookup_elem and exports to Prometheus, keeping the kernel path completely allocation-free.
Profiling Go services in production with pprof and FlameGraph
Feb 28, 2026 · 6 min read
Go's runtime/pprof captures CPU, heap, goroutine, and block profiles. The easiest way to enable continuous profiling in a running service is registering the net/http/pprof handler — it adds endpoints under /debug/pprof/ with zero configuration.
main.go
package main

import (
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof handlers
)

func main() {
	// pprof on a separate port; never expose it publicly
	go http.ListenAndServe("localhost:6060", nil)
	// ... rest of service
}
To generate a FlameGraph: go tool pprof -http=:8080 'http://localhost:6060/debug/pprof/profile?seconds=30' (quote the URL so the shell doesn't interpret the ?). The built-in UI renders flame graphs, call trees, and source annotations.
Common pitfall: heap profile data is only updated at garbage collection, so objects that died since the last GC still count as live. Use /debug/pprof/heap?gc=1 to force a collection before capturing; otherwise dead objects inflate the in-use numbers and mislead optimization efforts.
Linux TCP tuning for high-throughput servers: BBR, buffer sizing, and fast open
Feb 4, 2026 · 7 min read
CUBIC (the Linux default) reacts to packet loss by halving the congestion window. BBR instead models the bottleneck directly — it tracks the maximum delivery rate and minimum RTT observed, keeping the pipe full without over-buffering.
BBR requires the fq packet scheduler for pacing; without it, BBR can burst and trigger loss. Verify both knobs: sysctl net.ipv4.tcp_congestion_control and sysctl net.core.default_qdisc. BBRv3, available as an out-of-tree module while upstreaming continues, improves stability under competing flows.
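Pulling the pieces together, a sysctl fragment might look like the sketch below. The buffer ceilings are examples sized for roughly 10 Gbps at 40 ms RTT (a bandwidth-delay product of about 50 MB); size yours from your own BDP, not these numbers.

```shell
# /etc/sysctl.d/99-tcp-tuning.conf  (example values, not a drop-in config)
net.core.default_qdisc = fq             # pacing prerequisite for BBR
net.ipv4.tcp_congestion_control = bbr
# Ceilings of 64 MB cover ~10 Gbps at ~40 ms RTT; middle value is the default:
net.ipv4.tcp_rmem = 4096 131072 67108864
net.ipv4.tcp_wmem = 4096 16384 67108864
net.ipv4.tcp_fastopen = 3               # TCP Fast Open for client and server
```

Apply with sysctl --system and confirm the congestion control actually switched; bbr silently falls back if the module isn't loaded.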
Building a minimal HTTP/2 server in Go from scratch
Jan 17, 2026 · 8 min read
HTTP/2 multiplexes multiple logical streams over a single TCP connection. Each stream has a 31-bit ID; client-initiated streams use odd IDs, server-initiated ones even. Streams are independent, eliminating the application-layer head-of-line blocking that plagued HTTP/1.1 (TCP-level head-of-line blocking remains, which is what HTTP/3 over QUIC later addressed).
HPACK header compression uses a static table of 61 common headers plus a dynamic table that both sides maintain in sync. A full :method: GET header compresses to a single byte (static index 2). Keeping encoder/decoder tables in sync without drift is the trickiest implementation detail.
Go's standard library handles HTTP/2 transparently via golang.org/x/net/http2. Building from scratch is a useful exercise — particularly for understanding flow-control window management, which operates at both connection and stream level simultaneously.
io_uring from userspace: async I/O that actually beats epoll
Dec 30, 2025 · 7 min read
epoll's fundamental cost is a syscall per event: epoll_wait blocks, the kernel wakes the thread, and userspace issues another syscall (read/write) to do the actual I/O. io_uring collapses this: submit a read to the SQ ring and the completion appears in CQ — zero extra syscalls in SQPOLL mode.
src/io.rs
use io_uring::{opcode, types, IoUring};
use std::fs::File;
use std::os::unix::io::AsRawFd;

fn main() -> std::io::Result<()> {
    let file = File::open("/etc/hostname")?; // example file; any readable path works
    let mut buf = vec![0u8; 4096];
    // Plain setup; for SQPOLL mode use IoUring::builder().setup_sqpoll(..).
    let mut ring = IoUring::new(128)?;
    let fd = types::Fd(file.as_raw_fd());
    let read_op = opcode::Read::new(fd, buf.as_mut_ptr(), buf.len() as u32)
        .build()
        .user_data(0x42);
    // SAFETY: buf and file must stay alive until the completion is reaped.
    unsafe { ring.submission().push(&read_op).expect("submission queue full") };
    ring.submit_and_wait(1)?;
    let cqe = ring.completion().next().expect("no completion");
    println!("read {} bytes", cqe.result());
    Ok(())
}
Benchmark on a 10 Gbps NIC under synthetic read load: epoll consumed ~12% CPU at 800K IOPS; io_uring in SQPOLL mode hit the same throughput at ~9.5% — a 21% reduction. The gap widens with smaller I/O sizes where syscall overhead dominates. Note: SQPOLL keeps a kernel thread spinning, so it's only worth it under sustained high-IOPS workloads.
TLS 1.3 handshake internals: 0-RTT, session resumption and why it matters
Dec 11, 2025 · 6 min read
TLS 1.2 required two round trips before application data could flow. TLS 1.3 cuts this to one: the client sends its key share in the first message, and the server responds with encrypted data in the same flight. For globally distributed services, 1-RTT shaves 50–150 ms off connection establishment.
Session resumption with PSK (Pre-Shared Key) goes further. After a successful TLS 1.3 handshake, the server sends a NewSessionTicket containing a PSK. On reconnect, the client includes this PSK in its ClientHello and the session resumes immediately.
0-RTT early data lets the client send application data in the first flight, before the server has acknowledged the handshake. The trade-off: early data is not forward-secret and is vulnerable to replay attacks. Only use 0-RTT for idempotent requests (GET); never for state-changing operations.