In late 2018, I was seeking new work for a few reasons:
- I was yearning to learn new skills (e.g., Golang).
- I wanted to experience living in New York City.
- I wanted to work for a frontier tech company.
I felt somewhat lucky to come across IrisVR on LinkedIn, along with an old mutual connection from my Exeter days: Robin Kim.
The role checked all the boxes I was looking for.
Here’s a closer look at my time as an engineer there.
Iris was a true frontier tech company. Their thesis was that existing processes in the AEC (architecture, engineering, construction) industry were inefficient and error-prone. It was better to review your models in an immersive environment and catch conflicts and issues there than to discover very costly mistakes mid-cycle.
Win 1: Marketing Page Web Vitals
One of the first tasks I took on when I arrived was improving the performance of the marketing splash page, irisvr.com, in order to yield greater revenue and higher conversion rates. (See here and here).
It was tedious work, but an important contribution.
Essentially, the task consisted of porting components and styles out of our monolithic Create-React-App and into GatsbyJS, a framework that gave us a lot of SEO best practices out of the box, and even more value through image optimizations:
- WebP format
- a variety of image resolutions for every screen size
- progressive loading via the Blur Up technique
After all was said and done, our Lighthouse scores, which had previously been bogged down by heavy assets, were pretty satisfactory.
Win 2: Migration to the Cloud
With the release of the 1st-gen Oculus Quest, our users were excited to collaborate inside their CAD models using wireless headsets. In anticipation of the hype around improved ergonomics, we knew it was critical for our product to be able to host CAD files in the cloud.
It was time to move off the desktop and into the cloud!
Going into this project, we had two main constraints:
- Allow disconnected operations: when users don’t have internet, they should still be allowed to mark up files; tweak file metadata; rename, move, add, or delete files; etc.
- Minimize conflicts in the event multiple users mutate the same entity.
Events and Disconnected Ops
To achieve the first goal, we decided the client would need to cache events. When it had an internet connection, it would periodically sync those events to the server.
Although the system I ended up building was akin to an infinitely nestable hierarchical file system, in practice the only two domain entities that mattered were projects (folders) and experiences (CAD files, represented as bundles of files, kind of like a ZIP file).
To minimize (but not fully eradicate) conflicts, whenever any node(s) (folder or file) under a Project ancestor was mutated, we generated a new UUID, called LastRevisionID, which we then recursively propagated upward. This would indicate to clients whether certain Projects needed syncing before mutation events were pushed to the server.
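The revision-stamping idea can be sketched in Go roughly as follows. This is a minimal in-memory illustration, not the actual Iris implementation: the `Node` type, `Touch`, and `NeedsSync` are hypothetical names, the tree lives in memory rather than in Firestore, and a random hex string stands in for a UUID to keep the example stdlib-only.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
)

// Node is a hypothetical in-memory stand-in for a folder or file record.
type Node struct {
	Name           string
	Parent         *Node // nil for the root Project
	LastRevisionID string
}

// newRevisionID returns a random 128-bit identifier encoded as hex.
// Assumption: a real system would likely use a UUID library instead.
func newRevisionID() string {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		panic(err) // no sensible recovery in this sketch
	}
	return hex.EncodeToString(b)
}

// Touch stamps n and every ancestor up to the root with one fresh
// revision ID and returns that ID. The walk is iterative here, but it
// visits the same nodes a recursive upward propagation would.
func Touch(n *Node) string {
	id := newRevisionID()
	for cur := n; cur != nil; cur = cur.Parent {
		cur.LastRevisionID = id
	}
	return id
}

// NeedsSync reports whether the revision a client cached for a project
// is stale, i.e. something under the project changed since it synced.
func NeedsSync(project *Node, cachedRevisionID string) bool {
	return project.LastRevisionID != cachedRevisionID
}
```

With this shape, a client only has to compare one ID per Project to decide whether to pull before pushing its own events. Siblings outside the mutated path keep their old revision, so the check stays cheap but coarse: it says that something changed under the Project, not what.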
We had some tolerance for edge cases. For example:
- On Monday, Alice downloads a folder (called “New York City”), immediately turns off her internet, and then renames it to “NYC”.
- On Tuesday, Bob downloads the same folder (“New York City”) and renames it to “Big Apple”.
- On Wednesday, Alice turns on her internet. Her client realizes revisions have been made and receives the new name, “Big Apple”. However, her rename event is sent regardless and overwrites Bob’s, even though his technically happened chronologically after.
It was effectively a last-in-wins approach: whichever event reached the server last won, regardless of when the edit was actually made.
At the time, Firestore limited the number of documents any one transaction could touch to 500. If a user went offline for a long period of time, the C# client would have to split its backlog of events into batches. It did some intelligent squashing to remove redundant events, but occasionally, we’d run into concurrent-write errors. Retries were an effective solution here.
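To make the squash-and-batch step concrete, here is a small Go sketch. It is illustrative only: the real client was written in C#, and the `Event` shape, the squashing rule (only the last rename per node survives), and the assumption that one event maps to one document are all hypothetical simplifications.

```go
package main

// Event is a hypothetical cached mutation recorded while offline.
type Event struct {
	NodeID string
	Kind   string // e.g. "rename", "move", "delete"
	Value  string
}

// maxBatchSize mirrors the 500-document cap mentioned above.
// Assumption: each event touches exactly one document.
const maxBatchSize = 500

// Squash drops redundant events. In this sketch the only rule is that
// earlier renames of a node are superseded by that node's last rename;
// all other events are kept in their original order.
func Squash(events []Event) []Event {
	lastRename := make(map[string]int) // node ID -> index of its final rename
	for i, e := range events {
		if e.Kind == "rename" {
			lastRename[e.NodeID] = i
		}
	}
	out := make([]Event, 0, len(events))
	for i, e := range events {
		if e.Kind == "rename" && lastRename[e.NodeID] != i {
			continue // superseded by a later rename of the same node
		}
		out = append(out, e)
	}
	return out
}

// Batch splits events into consecutive chunks of at most size items.
// A non-positive size falls back to maxBatchSize.
func Batch(events []Event, size int) [][]Event {
	if size <= 0 {
		size = maxBatchSize
	}
	var batches [][]Event
	for start := 0; start < len(events); start += size {
		end := start + size
		if end > len(events) {
			end = len(events)
		}
		batches = append(batches, events[start:end])
	}
	return batches
}
```

A client would call `Squash` first and `Batch` second, then send each batch in order and retry a batch that fails with a concurrent-write error. Since a single mutation can also touch ancestor documents, a real batch size would probably need to sit well below the cap.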
Read first, then write
Another major limitation of Firestore was that, within a transaction, all read operations must come before write operations. This meant that when the server was processing batches of events, it basically had to build out the new file tree in memory as it was reading the stream, and only once all the reads were performed could it start creating new nodes and mutating or deleting existing ones.
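The read-then-write shape can be sketched in Go like this. Everything here is a hypothetical stand-in: `Txn` is a toy in-memory transaction that rejects reads after the first write (mimicking the constraint, not the Firestore client API), and rename is the only event type handled.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrReadAfterWrite is returned when a read follows a write in one Txn.
var ErrReadAfterWrite = errors.New("read after write in transaction")

// Txn is a toy transaction over a map of node ID -> name.
type Txn struct {
	data  map[string]string
	wrote bool
}

// Get reads a node name; it fails once any write has happened.
func (t *Txn) Get(id string) (string, bool, error) {
	if t.wrote {
		return "", false, ErrReadAfterWrite
	}
	name, ok := t.data[id]
	return name, ok, nil
}

// Set writes a node name and marks the transaction as having written.
func (t *Txn) Set(id, name string) {
	t.wrote = true
	t.data[id] = name
}

// RenameEvent is a hypothetical event type for this sketch.
type RenameEvent struct {
	NodeID  string
	NewName string
}

// ApplyRenames processes a batch in three phases:
//  1. read every referenced node into an in-memory snapshot,
//  2. replay the events against that snapshot,
//  3. write the final state back.
// No read is issued after the first write.
func ApplyRenames(t *Txn, events []RenameEvent) error {
	snapshot := make(map[string]string)
	for _, e := range events {
		if _, seen := snapshot[e.NodeID]; seen {
			continue
		}
		name, ok, err := t.Get(e.NodeID)
		if err != nil {
			return err
		}
		if !ok {
			return fmt.Errorf("node %q not found", e.NodeID)
		}
		snapshot[e.NodeID] = name
	}
	for _, e := range events {
		snapshot[e.NodeID] = e.NewName // later events overwrite earlier ones
	}
	for id, name := range snapshot {
		t.Set(id, name)
	}
	return nil
}
```

Because validation happens entirely in the read phase, a batch that references a missing node fails before anything is written. The same structure also explains the last-in-wins behavior: events are replayed in arrival order, so the final event for a node decides its state.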
I remember writing a bunch of unit tests for this behavior, only to scrap all of them in favor of integration tests. Reasoning about an API contract is a lot less overwhelming and more readable than trying to reason about internal complexity. Treating our logic like a black box led to better test coverage and confidence.
It was critical for the client to be able to recursively navigate the file system in a somewhat performant way.
Considering that each Firestore lookup took 50ms, and each query 200ms, parallelization was absolutely necessary.
Starting from the root node (the project), the client would issue a ListChildren RPC (200ms), and then for each child, it would issue a subsequent lookup RPC (each taking 50ms). Using goroutines to parallelize these lookups was a big win. I was pretty excited upon hooking them up with Jaeger so we could visualize our traces and identify hotspots and bottlenecks.
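The fan-out over children can be sketched with a `sync.WaitGroup`. This is a simplified illustration under stated assumptions: `Client`, `Entry`, and `LookupAll` are hypothetical names, the "RPCs" are in-memory map reads with a simulated delay, and only one level of the tree is fetched.

```go
package main

import (
	"sync"
	"time"
)

// Entry is a hypothetical result of a single lookup.
type Entry struct {
	ID   string
	Name string
}

// Client fakes the backing store with maps and a simulated latency.
type Client struct {
	Children map[string][]string // parent ID -> child IDs
	Names    map[string]string   // node ID -> name
	Latency  time.Duration       // simulated per-lookup delay
}

// ListChildren stands in for the ListChildren RPC.
func (c *Client) ListChildren(parentID string) []string {
	return c.Children[parentID]
}

// Lookup stands in for a single lookup RPC.
func (c *Client) Lookup(id string) Entry {
	time.Sleep(c.Latency)
	return Entry{ID: id, Name: c.Names[id]}
}

// LookupAll fetches every ID concurrently and returns the results in
// the same order as ids. Each goroutine writes only to its own slice
// slot, so no mutex is needed; the maps are only read, never written.
func (c *Client) LookupAll(ids []string) []Entry {
	out := make([]Entry, len(ids))
	var wg sync.WaitGroup
	for i, id := range ids {
		wg.Add(1)
		go func(i int, id string) {
			defer wg.Done()
			out[i] = c.Lookup(id)
		}(i, id)
	}
	wg.Wait()
	return out
}
```

Using the latencies quoted above, a project with 10 children would cost roughly 200 + 10 × 50 = 700ms when the lookups run one after another, versus about 200 + 50 = 250ms when they run in parallel, ignoring contention and network variance. A recursive walk would call `ListChildren` and `LookupAll` again for each child that is itself a folder.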
Win 3: Issue Tracking
Another big win from 2020 was issue tracking. An issue consists of a title, description, status, type (e.g., clash, quality control, etc.), creation timestamp, and priority. Think of Jira, ClickUp, Trello, or any number of project management tools.
As a backend engineer, I was responsible for building the GraphQL API that powered queries and mutations.
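As a rough picture of the data behind such an API, an issue could be modeled in Go along these lines. This is a hypothetical sketch, not the actual schema: the field names, the string encoding of the types, and the status and priority representations are assumptions. Only the list of fields and the two example types (clash, quality control) come from the description above.

```go
package main

import (
	"errors"
	"time"
)

// IssueType enumerates the kinds of issue. Only these two examples are
// named in the post; the string values are assumptions.
type IssueType string

const (
	IssueTypeClash          IssueType = "CLASH"
	IssueTypeQualityControl IssueType = "QUALITY_CONTROL"
)

// Issue mirrors the fields listed above.
type Issue struct {
	Title       string
	Description string
	Status      string // hypothetical values, e.g. "OPEN"
	Type        IssueType
	Priority    int // hypothetical numeric scale
	CreatedAt   time.Time
}

// NewIssue builds an issue stamped with the current time and a
// hypothetical default status.
func NewIssue(title, description string, t IssueType) Issue {
	return Issue{
		Title:       title,
		Description: description,
		Status:      "OPEN",
		Type:        t,
		CreatedAt:   time.Now(),
	}
}

// Validate performs the kind of checks a create mutation might run
// before persisting an issue.
func (i Issue) Validate() error {
	if i.Title == "" {
		return errors.New("issue title is required")
	}
	switch i.Type {
	case IssueTypeClash, IssueTypeQualityControl:
		// known type
	default:
		return errors.New("unknown issue type")
	}
	if i.CreatedAt.IsZero() {
		return errors.New("creation timestamp is required")
	}
	return nil
}
```

A GraphQL resolver for a create mutation would typically map its input onto a struct like this, call `Validate`, and only then write to the datastore.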
From a complexity standpoint, this was a simpler epic than building the C#-gRPC-Firestore file system bridge. Still, it involved lots of coordination between the backend team (me), the C# library (Ezra Smith), and the Unity team (Sten Ulfsson and Jake Loeterman).