rr is the record/replay tool developed by Mozilla Research. I won’t spend much space talking about rr directly; for that read this and this, and give it a try here. Instead I want to summarize the OS features that rr relies on for its full functionality. This post is intended to pique the interest of someone with low-level OS X systems programming experience who might eventually be interested in attempting to bring up rr on OS X. You don’t need to know all the details of how rr works to comment on whether the necessary interfaces exist on OS X.
Generally speaking, rr has two levels of functionality. The first level is to record an application in such a way that the recording can be deterministically replayed. This is the bare-minimum functionality a record/replay tool must have. The second level of functionality is high-performance recording. That’s what makes rr interesting as a tool. Performance really is a feature, but that’s a topic for another post …
Level 0 functionality: basic recording and deterministic replay
For this to work, the rr tracer process must be able to observe all nondeterministic events that may affect execution of its tracee. rr also needs a mechanism to preemptively schedule tracee threads, and to replay those scheduling decisions. Here are the interfaces that rr uses on Linux:
- basic ptrace API: rr uses most mundane debugger APIs like reading and writing registers and memory, continuing execution, single-stepping, and so forth. (Obviously OS X supports all these things in some form.)
- notification of pending signals through ptrace: rr has to record signals too; see below for more details. (Presumably OS X can notify these somehow too.)
- ptrace(PTRACE_SYSCALL): continue the tracee, but stop it at the next syscall entry or exit. rr uses these traps mainly to record the outparam data written by the kernel after syscalls.
- prctl(PR_SET_TSC, PR_TSC_SIGSEGV): somewhat obscure interface to generate a SIGSEGV exception when a tracee attempts to execute a rdtsc instruction. rr has to record the returned tsc value because it’s nondeterministic, and this exception enables it to do that.
- sched_setaffinity(mask): this one requires a bit of explanation. The problem is that the x86 cpuid instruction is nondeterministic. For example, it can expose the number of the core your thread is running on. So nominally, rr would want to record the effects of cpuid, just like it does for rdtsc above. Recent processors allow trapping on cpuid, just like for rdtsc. However, Linux doesn’t expose an analogous interface like prctl(PR_SET_CPUID, PR_CPUID_SIGSEGV). There was a proposed patch, but it was rejected on the grounds that no one would ever use it. (Sigh.) So rr implements the poor man’s solution, which is to bind all tracee threads to the same logical processor (hardware thread). There may be other nondeterministic side effects of cpuid that are being missed, but it’s worked well enough in practice.
- perf_event_open() for precise HW performance counters: rr uses counters in two ways. First, when tracees are interrupted at an asynchronous preemption point by a signal, rr records the value of a particular perf counter (retired conditional branches). This value identifies the point in execution at which the signal arrived. Then to replay the signal, rr advances execution until the performance counter reaches its saved value. (Oversimplifying a bit.) And second, rr programs a perf counter to “interrupt” after a certain time-slice. This mechanism is the basis of rr’s preemptive scheduling. It’s a nice coincidence on Linux that the perf-counter interrupts are surfaced to userspace as regular signals; that means for rr, recording task preemptions and signals is the same problem. It doesn’t have to be like that though.
- personality(ADDR_NO_RANDOMIZE | ADDR_COMPAT_LAYOUT): rr doesn’t like address-space randomization and so disables it, though ASLR could be supported with some effort.
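To make the sched_setaffinity item concrete, here is a minimal sketch of binding a thread to a single logical processor on Linux. It is an illustration of the technique, not rr’s actual code; the helper names first_allowed_cpu, pin_to_cpu, and allowed_cpu_count are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <sys/types.h>

/* Hypothetical helper: lowest-numbered CPU the calling thread may run on,
 * or -1 on error. */
int first_allowed_cpu(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}

/* Hypothetical helper: restrict thread `tid` (0 means the caller) to the
 * single logical CPU `cpu`. Returns 0 on success, -1 on error. */
int pin_to_cpu(pid_t tid, int cpu)
{
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(tid, sizeof(set), &set);
}

/* Hypothetical helper: number of CPUs the calling thread may currently run
 * on, or -1 on error. */
int allowed_cpu_count(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    return CPU_COUNT(&set);
}
```

Because the affinity mask is inherited across fork() and preserved across execve(), a tracer could call pin_to_cpu(0, first_allowed_cpu()) before spawning the tracee so that every tracee thread starts out on the same hardware thread. An OS X port would need some equivalent way to force that binding.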
What do you think, OS X systems hackers? Can we bring up this basic level of rr functionality on OS X?
Level 1 functionality: fast recording
There’s an inherent performance ceiling when you trap every tracee syscall to a tracer process. Other projects besides rr have hit this ceiling, including Chrome’s Linux sandbox team and the Dune project. rr runs perfectly fine within this ceiling, but Linux has APIs that let rr break through it.
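To see where the ceiling comes from, here is a rough sketch of the trap-everything approach using ptrace(PTRACE_SYSCALL). It forks a tracee that makes n getpid syscalls and counts how many times the tracer is woken up. The function name count_syscall_stops is hypothetical, and this is a simplified illustration rather than rr’s tracing loop.

```c
#include <assert.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: trace a child that makes `n` getpid syscalls and
 * return the number of syscall stops the tracer handled, or -1 on error. */
long count_syscall_stops(int n)
{
    assert(n >= 0);
    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        /* Tracee: request tracing, then stop so the tracer can set options. */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) != 0)
            _exit(1);
        raise(SIGSTOP);
        for (int i = 0; i < n; i++)
            syscall(SYS_getpid);
        _exit(0);
    }

    int status;
    if (waitpid(child, &status, 0) < 0 || !WIFSTOPPED(status))
        return -1;
    /* Report syscall stops as SIGTRAP|0x80 so they are distinguishable
     * from ordinary signal stops. */
    if (ptrace(PTRACE_SETOPTIONS, child, NULL,
               (void *)(long)PTRACE_O_TRACESYSGOOD) != 0) {
        kill(child, SIGKILL);
        waitpid(child, &status, 0);
        return -1;
    }

    long stops = 0;
    long deliver = 0; /* signal to pass to the tracee on resume; 0 = none */
    for (;;) {
        /* Resume until the next syscall entry or exit (or a signal). */
        if (ptrace(PTRACE_SYSCALL, child, NULL, (void *)deliver) != 0) {
            kill(child, SIGKILL);
            waitpid(child, &status, 0);
            return -1;
        }
        if (waitpid(child, &status, 0) < 0)
            return -1;
        if (WIFEXITED(status) || WIFSIGNALED(status))
            break;
        if (WSTOPSIG(status) == (SIGTRAP | 0x80)) {
            stops++;      /* a syscall entry or exit stop */
            deliver = 0;
        } else {
            deliver = WSTOPSIG(status); /* forward any other signal */
        }
    }
    return stops;
}
```

Every syscall produces one stop at entry and one at exit, so count_syscall_stops(100) returns at least 200. Each of those stops is a round trip through the kernel into the tracer and back, which is the overhead the interfaces listed next are meant to avoid.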
- LD_PRELOAD="lib.so": rr injects code into the tracee process to do tracee-side processing of syscalls as much as possible. LD_PRELOAD is how that helper code is injected. (OS X supports this.)
- prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, prog): rr sets things up so that tracee syscalls are selectively traced. Most tracee syscalls can be processed by the rr preload library, without taking a ptrace trap into the rr tracer. rr uses this recent kernel interface (known as “seccomp-bpf”) to install a filter that only ptrace-traps the syscalls that rr can’t process entirely on the tracee side. The details here aren’t that important, but what’s necessary is essentially a kernel interface that supports efficient sandboxing.
- perf_event_open() for context-switch counter: there’s not really a short explanation of this requirement; it quickly gets off into the rr weeds. What’s required is a way to interrupt tracees after the next time they’re context-switched out by the kernel thread scheduler. This is fairly straightforward on Linux, because “context-switches” is just another (SW) perf counter, and it can be programmed to interrupt in just the same way as for the preemptive-scheduler interrupt described above.
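As a rough illustration of the seccomp-bpf item, the sketch below installs a filter that sends one chosen syscall to a ptrace tracer (SECCOMP_RET_TRACE) and lets every other syscall run untraced. Selecting by a single syscall number is a simplifying assumption made here for brevity, not a description of rr’s real filter, and the names install_selective_trace_filter and syscall_is_trapped are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: install a filter that traps only `traced_nr`.
 * A production filter would also check seccomp_data.arch; that check is
 * omitted to keep the sketch short. Returns 0 on success, -1 on error. */
int install_selective_trace_filter(unsigned int traced_nr)
{
    struct sock_filter insns[] = {
        /* Load the syscall number from the data the kernel hands the filter. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* If it equals traced_nr, fall through to TRACE; else skip to ALLOW. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, traced_nr, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRACE),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(insns) / sizeof(insns[0])),
        .filter = insns,
    };
    /* Unprivileged processes must set no_new_privs before adding a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/* Hypothetical probe, only meant for argument-less syscalls: when no tracer
 * is attached, a SECCOMP_RET_TRACE syscall is skipped and fails with ENOSYS. */
int syscall_is_trapped(long nr)
{
    assert(nr >= 0);
    errno = 0;
    return syscall(nr) == -1 && errno == ENOSYS;
}
```

After install_selective_trace_filter(SYS_getppid), a process without a tracer sees getppid fail with ENOSYS while getpid still runs at full speed. With a tracer that has set PTRACE_O_TRACESECCOMP, the filtered syscall would instead stop the tracee and notify the tracer. The question for OS X is whether any interface gives the same kind of cheap, in-kernel selection.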
Does OS X provide analogous mechanisms?
If you know about any of this, please do drop us a line somehow. The mailing list is probably best. Or if you’re really adventurous, hack up a prototype :). It would be a major effort to merge an OS X port into upstream rr, but it would be an interesting discussion to start.
Appendix: but what about binary instrumentation?
Yes, the OS footprint of record/replay tools based on binary instrumentation is much smaller. There are other tools built around binary instrumentation; we chose a different approach for rr mainly for performance reasons, and we’ve been happy with that trade-off so far. But instrumentation is a fine thing to use: that and rr are just two different approaches.
Appendix: but what about kernel modules?
Sure, you could write a kernel module to expose all the interfaces that rr needs. The major downsides to this approach are (i) increased friction for new users; (ii) much higher maintenance burden; (iii) unpleasant security implications. We’d rather not go there if we can avoid it.