Engineering at Care.com

May 30 · 9min


In my next role post-IrisVR, I was looking for a similar tech-stack environment: Golang, AWS, Kafka, gRPC, protobufs, microservices, event-driven, etc.

Care.com had that technology, but it also had more process, more infrastructure, more documentation, more everything — due to its 200+ engineers. The platform team there was excellent. They built and maintained the template that all Go microservices and lambdas ran on; a big win in terms of standardization. They also provided CDK SDKs each team could use to deploy their own infrastructure — S3 buckets, Lambdas, Kafka topics, etc.
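
For a sense of what that looked like in practice, here's a minimal sketch of provisioning team-owned infrastructure with CDK in Go. The construct names and properties are illustrative, not the platform team's actual internal SDK:

```go
package main

import (
	"github.com/aws/aws-cdk-go/awscdk/v2"
	"github.com/aws/aws-cdk-go/awscdk/v2/awslambda"
	"github.com/aws/aws-cdk-go/awscdk/v2/awss3"
	"github.com/aws/jsii-runtime-go"
)

func main() {
	app := awscdk.NewApp(nil)
	stack := awscdk.NewStack(app, jsii.String("ProviderScoresStack"), nil)

	// A team-owned bucket, e.g. for incoming data files.
	awss3.NewBucket(stack, jsii.String("ScoresBucket"), &awss3.BucketProps{
		Versioned: jsii.Bool(true),
	})

	// A team-owned Lambda, deployed from a locally built binary.
	awslambda.NewFunction(stack, jsii.String("ScoresIngestFn"), &awslambda.FunctionProps{
		Runtime: awslambda.Runtime_PROVIDED_AL2(),
		Handler: jsii.String("bootstrap"),
		Code:    awslambda.Code_FromAsset(jsii.String("./build"), nil),
	})

	app.Synth(nil)
}
```

The point is less the specific constructs and more that every team could declare its own buckets, Lambdas, and topics in code, inside the guardrails the platform team maintained.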

I learned a lot from more senior engineers. Little things, like how to write table-driven unit tests using nothing but the standard library, when/where to declare an interface, etc. But also more important things: habits like “stop starting, start stopping”, agreeing on a good definition of done, and an awareness of over-engineering (KISS).
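
As a small illustration of the table-driven style (the function under test is a toy example, not Care.com code):

```go
package slug

import (
	"strings"
	"testing"
)

// slugify is a toy function under test: lowercase the input and hyphenate it.
func slugify(s string) string {
	return strings.Join(strings.Fields(strings.ToLower(s)), "-")
}

func TestSlugify(t *testing.T) {
	tests := []struct {
		name string
		in   string
		want string
	}{
		{name: "lowercases and hyphenates", in: "Day Care", want: "day-care"},
		{name: "trims extra whitespace", in: "  senior care ", want: "senior-care"},
		{name: "empty input", in: "", want: ""},
	}

	for _, tc := range tests {
		t.Run(tc.name, func(t *testing.T) {
			if got := slugify(tc.in); got != tc.want {
				t.Errorf("slugify(%q) = %q, want %q", tc.in, got, tc.want)
			}
		})
	}
}
```

Nothing but the testing package, a slice of cases, and t.Run; adding a new case is a one-line change.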

Here are some of my highlights there:

Win 1: Soft Skills, Code Quality, and Docs

The Problem: People want to learn about your domain, architecture, patterns, practices, etc.

The solution is to teach them.

A lot of this is grunt work. You definitely won’t get promoted for it, but the little things lead to better experiences for others (and yourself) down the road. The rationale is similar to why you comment your code partly out of self-interest: in six months, when you come back to it, you’ll thank yourself. The same applies to mentoring: by explaining something verbally, in writing, or visually, you come to understand it more deeply yourself. That said, I probably over-relied on ad-hoc visual documentation in Slack responses and didn’t give enough TLC to Confluence. Maybe I preferred READMEs over Confluence? Maybe I thought the docs were too messy? Maybe I just didn’t have the time?

It’s tedious work, without much of the recognizable kind of impact. It’s unclear exactly what we gain from improving the readability of a function or boosting unit test coverage. It definitely can’t hurt! Code quality is good to think about, but it can turn into yak-shaving pretty quickly.

Win 2: Senior-Care Ingestion

The Problem: Our ingestion pipeline needs to be extensible to different business verticals.

When I arrived, our ingestion pipeline for provider profiles had been built out for child-care centers.

To onboard new senior-care facilities, all we had to do was update our Protocol Buffer (proto) model.

We had to account for new kinds of properties that didn’t normally exist in the context of child-care:

  • Does the facility offer memory care? Or assisted living care? Or continuing care?
  • What kind of bedroom capacities do they support? Do they have central AC?
  • Is there a 24/7 nurse on-site?
  • Is gluten-free food available?

The main debate was whether these new fields would be explicitly modeled in our proto, or stored as an arbitrary string-to-string map. I was a strong proponent of explicit modeling: it would be a win in terms of clarity and self-documentation, and would pay dividends down the road. The biggest counter-argument was that the proto would have to be updated in a few services whenever the schema evolved.
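
To illustrate the trade-off in Go terms (the field and type names here are hypothetical, roughly the shape of the generated code rather than our actual proto):

```go
// Option A: explicit modeling. The schema is self-documenting and the
// compiler catches typos, but adding an attribute means updating the proto
// and regenerating code in every service that reads it.
type SeniorCareAttributes struct {
	OffersMemoryCare     bool
	OffersAssistedLiving bool
	HasCentralAC         bool
	Has247Nurse          bool
	GlutenFreeFood       bool
}

// Option B: an arbitrary string-to-string map. New attributes need no schema
// change, but every consumer has to agree on magic keys and parse
// stringly-typed values.
type ProviderProfile struct {
	Attributes map[string]string // e.g. {"memory_care": "true", "central_ac": "false"}
}
```

With Option A, downstream code reads a typed field; with Option B, it does a key lookup and a string comparison, and nothing stops "memory_care" and "memoryCare" from quietly drifting apart.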

Protobufs are backwards-compatible. Adding new fields is fine. If we did happen to prematurely ingest a document using an old version of the protobuf, and therefore couldn’t deserialize some fields yet, Kafka allowed us to replay messages whenever we wanted. In other words, this edge case was something we humans could solve with process, and we didn’t need to engineer around it.

Win 3: Improved Search Experience

The Problem: Users want better search result sorting.

If there are several day-care centers in your postal code, you want search results to come back with “the best” on top.

“Best” in this case was determined by the Data Science team, using a number of provider-related factors that were combined into a score, which my team received in the form of daily CSV uploads to an S3 bucket.

I was responsible for implementing the AWS Lambda logic you see in the diagram. It would get alerted whenever new scores came in, parse the CSV file, and write messages to a Kafka topic.
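
Here’s a rough sketch of that Lambda’s shape. It assumes a kafka-go writer and hard-codes the bucket and key, which in reality arrived via the S3 event notification; the names and CSV layout are illustrative:

```go
package main

import (
	"context"
	"encoding/csv"
	"encoding/json"
	"fmt"

	"github.com/aws/aws-lambda-go/lambda"
	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
	"github.com/segmentio/kafka-go"
)

type scoreEvent struct {
	ProviderID string `json:"provider_id"`
	Score      string `json:"score"`
}

func handler(ctx context.Context) error {
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		return err
	}
	s3c := s3.NewFromConfig(cfg)

	// Bucket and key are hard-coded for brevity; the real handler pulled them
	// out of the incoming notification.
	obj, err := s3c.GetObject(ctx, &s3.GetObjectInput{
		Bucket: aws.String("data-science-scores"),
		Key:    aws.String("scores/latest.csv"),
	})
	if err != nil {
		return err
	}
	defer obj.Body.Close()

	w := &kafka.Writer{
		Addr:  kafka.TCP("broker-1:9092"),
		Topic: "provider-scores",
	}
	defer w.Close()

	records, err := csv.NewReader(obj.Body).ReadAll()
	if err != nil {
		return err
	}
	for _, rec := range records {
		if len(rec) < 2 {
			return fmt.Errorf("unexpected row: %v", rec)
		}
		payload, err := json.Marshal(scoreEvent{ProviderID: rec[0], Score: rec[1]})
		if err != nil {
			return err
		}
		if err := w.WriteMessages(ctx, kafka.Message{Key: []byte(rec[0]), Value: payload}); err != nil {
			return err
		}
	}
	return nil
}

func main() { lambda.Start(handler) }
```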

From there, the Ingester would pick up the scores and upsert them into provider documents in Elasticsearch. At search time, the score field would be the primary sort key.

There was more complexity beyond that, but you get the picture.

Complexity around search rankings

For example:

  • The Data team sent us two scores for some reason: a responsiveness score and a ranking. Because we weren’t just receiving one score, we had to do some multiplication in our Elastic query using Painless scripting (see the sketch after this list).
  • Why couldn’t the Ingester have done the multiplication? Because it wasn’t guaranteed to receive or possess both of a provider’s scores, and we wanted to avoid multiplying by zero. Sure, we could have thrown in some “lookup” functionality, but we wanted to keep the Ingester as simple as possible.
  • The fact that the Data team’s S3 bucket was in a different AWS account meant that we couldn’t do a direct bucket-to-Lambda connection, but had to use an SNS intermediary.
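
For what it’s worth, the sort ended up being roughly this shape (an illustrative query body with hypothetical field names, not our production script, which also had to cope with providers missing one of the scores):

```go
package search

// scriptSort is an illustrative Elasticsearch script-based sort that combines
// the two scores with Painless at query time.
const scriptSort = `{
  "sort": [
    {
      "_script": {
        "type": "number",
        "order": "desc",
        "script": {
          "lang": "painless",
          "source": "doc['responsiveness_score'].value * doc['ranking'].value"
        }
      }
    }
  ]
}`
```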

Win 4: Auto-Accept Leads

The Problem: We want greater lead conversion.

What on earth is an auto-accept lead? Better yet, what is a lead?

If you’re like I was, someone with no marketing knowledge, you’ll have to do some research. But basically, lead generation is “the initiation of consumer interest or enquiry into products or services of a business.”

A lead basically captures a care-seeker ID and a care-provider ID, and has all sorts of prices, discounts, and multipliers attached to it as well.

When a user types in their zip code and we render a list of nearby, suitable day-care centers, and then they click on one of the provider links, they are explicitly declaring some form of interest. We capture this intention as a (consumer-choice) Lead.

What about the other search results that showed up that didn’t get clicked on? These are still potential matches, also known as auto-accept leads, pending leads, potential leads, etc.

As shown in the diagram, after 30-60 minutes, we had our client-side code make a request to persist these pending leads to DynamoDB, where they received a TTL of 1 hour. Once a document expired, it would invoke a Lambda, which would send the expired payload (via gRPC) to the Leads API, which would persist it as an actual Lead.
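
A sketch of that expiry path, assuming DynamoDB Streams delivers the TTL deletions to the Lambda, and using a made-up gRPC client for the Leads API (the service, message, and attribute names are all hypothetical):

```go
package main

import (
	"context"

	"github.com/aws/aws-lambda-go/events"
	"github.com/aws/aws-lambda-go/lambda"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	leadspb "example.com/leads/v1" // hypothetical generated client for the Leads API
)

func handler(ctx context.Context, e events.DynamoDBEvent) error {
	conn, err := grpc.Dial("leads-api:50051", grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return err
	}
	defer conn.Close()
	client := leadspb.NewLeadsServiceClient(conn)

	for _, rec := range e.Records {
		// TTL expirations show up on the stream as REMOVE events; the expired
		// item is available in the old image.
		if rec.EventName != "REMOVE" {
			continue
		}
		old := rec.Change.OldImage
		_, err := client.CreateLead(ctx, &leadspb.CreateLeadRequest{
			SeekerId:   old["seeker_id"].String(),
			ProviderId: old["provider_id"].String(),
			Source:     "auto_accept",
		})
		if err != nil {
			return err
		}
	}
	return nil
}

func main() { lambda.Start(handler) }
```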

The idea behind this epic was that by creating these cheaper auto-accept leads, providers would see greater lead conversion and it would be a better experience for both sides of the market.

Lingering questions

  • Why 30-60 minutes? Why not 10 minutes, or 2 hours?
  • If a user gets a list of search results but then leaves her laptop for an hour, is that really communicating much about user intent?
  • Do such leads deserve to be converted and billed at standard rates?

Win 5: Google Ads Location Targeting

The Problem: We want better ad targeting (for greater lead conversion).

Prior to this epic, Google Ads were manually targeted at certain areas and tediously updated by the marketing team.

The thesis was simple: ads drive lead conversion and revenue, and should be powered by an automated, tunable, configurable process. This process shouldn’t simply be dependent on coverage (the number of providers within a postal code), but also on solvency (how much monthly budget providers had remaining). It also needs to be written in such a way that it’s extensible: coverage and solvency matter today, but maybe other things matter tomorrow.

During the planning phase of the epic, it was clear we would need to use CDK to create a few more pieces of AWS infrastructure: a new EventBridge rule (for scheduling), S3 bucket (for configuration), and Lambda (for execution).

  • The S3 bucket allowed managers to tweak the config file without needing a code redeployment, something that wouldn’t have been possible had we used env vars in our Harness configs.
  • The Lambda would execute daily, triggered by EventBridge, or whenever the configs changed.

As diagrammed, the first step the Lambda took was to read the config file (a sketch of a possible shape follows the list below), which determined:

  • coverage — the minimum number of providers required in a postal code to justify an ad target;
  • solvency — the minimum percentage of monthly budget a provider needed to have remaining in order to qualify (i.e., not get filtered out).
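
The config itself was just a small file; something like this hypothetical shape (not the actual keys we used):

```go
package main

import "encoding/json"

// AdsTargetingConfig is a hypothetical shape for the S3-hosted config file.
type AdsTargetingConfig struct {
	MinProvidersPerPostalCode int     `json:"min_providers_per_postal_code"` // coverage
	MinBudgetRemainingPct     float64 `json:"min_budget_remaining_pct"`      // solvency
}

// loadConfig parses the raw bytes read from the S3 object.
func loadConfig(raw []byte) (AdsTargetingConfig, error) {
	var cfg AdsTargetingConfig
	err := json.Unmarshal(raw, &cfg)
	return cfg, err
}
```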

After that, the Lambda would read all provider IDs from the existing Google Ads campaigns. Using those IDs, it would retrieve the postal codes that each provider served; this was done in batches and in parallel using the errgroup package. (I was really impressed with this blog post the first time I read it.) Finally, the Lambda would make calls to the Google Ads API to add/remove postal codes in batch.
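
The fan-out looked roughly like this (a simplified sketch; getPostalCodes stands in for the real per-provider lookup):

```go
package main

import (
	"context"
	"sync"

	"golang.org/x/sync/errgroup"
)

// getPostalCodes is a stand-in for the real lookup of the postal codes a
// provider serves.
func getPostalCodes(ctx context.Context, providerID string) ([]string, error) {
	// ... call the provider service ...
	return nil, nil
}

// postalCodesForProviders fans out lookups with bounded concurrency and
// returns the union of postal codes served by the given providers.
func postalCodesForProviders(ctx context.Context, providerIDs []string) (map[string]bool, error) {
	g, ctx := errgroup.WithContext(ctx)
	g.SetLimit(10) // cap the parallelism

	var mu sync.Mutex
	codes := make(map[string]bool)

	for _, id := range providerIDs {
		id := id // capture the loop variable (pre-Go 1.22)
		g.Go(func() error {
			zips, err := getPostalCodes(ctx, id)
			if err != nil {
				return err
			}
			mu.Lock()
			for _, z := range zips {
				codes[z] = true
			}
			mu.Unlock()
			return nil
		})
	}
	if err := g.Wait(); err != nil {
		return nil, err
	}
	return codes, nil
}
```

Any single failed lookup cancels the group’s context and surfaces the error, which is most of what errgroup buys you over hand-rolled goroutines and WaitGroups.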

Reflecting on the Ads Epic

I more or less followed most of these thoughts / rules-to-self throughout the implementation of this epic, but they’re definitely good reminders.

  1. Postal codes are easy but… strange. They aren’t uniform and vary in geospatial size (though I suppose that correlates inversely with population density), and they’re easy to reason about in general. However, as a user/consumer, you might be better served by providers in other postal codes. Tessellating the world into uniform shapes and factoring in traffic might have been a win. (This is something I would think about in my next role at The Drivers Coop.)
  2. Make sure your libraries/dependencies actually work. The Go library I wanted to use (Opteo/google-ads-go) wasn’t stable enough; I should’ve known given its 15-star count on GitHub. In the end, I think I settled for hand-rolled REST communication but used the official googleapis protos for (un)marshalling requests/responses.
  3. Integrating with 3rd parties is sometimes hard. I remember getting Google Ads authn/authz right was a PITA. I struggled with this for a while and eventually settled for a sub-optimal solution; either it was scoped too specifically (i.e., to an actual human user with permissions to the Google Ads Campaign, such as myself) or it was scoped too broadly (i.e., to a Google service account that could mutate STG and PROD environments). It didn’t help that we didn’t have an actual separate Staging environment. Implementing the auth in Go took a while, but it helped to curl it out on the command-line.
  4. Do a curl test up-front before implementing. I had a tab in Sublime Text full of curl snippets that definitely helped remind me how the flow worked.