When I joined product data science at Mozilla, our primary focus was analytics and experimentation. While these accounted for most of my projects, I also got some unique opportunities to collaborate on and drive new efforts to research statistical models and put them into production as internal data products.

We came across various cases where the noise of raw numbers made it difficult to draw relevant lessons, and where statistical methods were able to provide a new perspective to improve decision-making. In this sense, the goal of the models I worked on was to provide insights, rather than to optimize for raw predictive accuracy the way many machine learning applications would.

While production dashboards add value by keeping insights current and fresh, another part of these models’ value is that, by describing the data-generating process, they allow raw numbers to be decomposed into more meaningful quantities. An example (described below) is a model that takes raw test scores and provides separate estimates for the likelihood of a regression and the severity of a regression. On their face, these are simple enough concepts that people already use intuitively when looking at time series plots. But the value of the model can be in providing concrete estimates of these quantities and helping people think more instinctively in these terms.

The following are descriptions of some of these models, and lessons learned along the way.

Active profiles

This was the first model I built that made it into production, through collaboration with data engineering, and it was motivated by a widespread interest in better understanding desktop retention. The project made predictions about clients’ retention, but did so by decoupling their likelihood of using the browser on a given day from the likelihood that they’ve churned, and providing estimates of each. It’s based on the beta-geometric beta-binomial model in the Lifetimes library, with the inference rewritten using the Numba JIT for scalability. The Numba replacement gave more than an order of magnitude speedup, allowing for efficient fitting and predictions across the client base. One thing I like about this model is that, because it describes browser usage with two probabilistic processes, there are a lot of quantities that can be queried for each individual client, including

  • probability the client is still an active user
  • probability of using the browser in a given day
  • probability of using the browser at least once within the next month (MAU)
  • expected number of times they’ll use the browser in the next n days

These mapped pretty well to metrics and questions that people in the company were thinking about.
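
To make this concrete, here is a minimal sketch of how those per-client quantities can be queried from the lifetimes library’s BG/BB fitter. The toy data and column values are made up, and the method names follow my reading of the lifetimes API rather than the actual production code (which also swapped the fitter’s inference for a Numba implementation).

```python
import pandas as pd
from lifetimes import BetaGeoBetaBinomFitter

# Toy per-client summary over a 28-day window: number of active days (frequency),
# the last active day (recency), and the window length in days (n_periods).
usage = pd.DataFrame({
    "frequency": [2, 25, 1, 14, 7, 27, 3, 20],
    "recency":   [6, 27, 2, 21, 25, 28, 10, 28],
    "n_periods": [28] * 8,
})

model = BetaGeoBetaBinomFitter()
model.fit(usage["frequency"], usage["recency"], usage["n_periods"])

# Probability each client is still an active (non-churned) user today.
usage["p_active"] = model.conditional_probability_alive(
    0, usage["frequency"], usage["recency"], usage["n_periods"]
)

# Expected number of active days over the next 28 days; a MAU-style probability
# follows as the chance of at least one active day in that window.
usage["expected_days_next_28"] = model.conditional_expected_number_of_purchases_up_to_time(
    28, usage["frequency"], usage["recency"], usage["n_periods"]
)
```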

There were a lot of lessons that I learned from this project, including some improvements I would have made in hindsight. One example is that the model didn’t completely account for the zero inflation we would see in the usage frequency of our client base (though for some technical reasons I think it actually did better than I’d normally expect). Building a model that could take zero inflation into account, or use the range of other features we had available, would have improved it, especially for first-time clients.1 But despite some of the distributional assumptions being off, and the fact that it only involved fitting 4 parameters, it made very good longer-term predictions at the population level, and illuminated differences in the usage patterns of clients when split along different dimensions.

Mission Control: v2

The original iteration of Mission Control delivered a nice low-latency dashboard of crash rates that release management could monitor throughout the release cycle. A difficulty with using it, though, was that due to the mix of users who update earliest, crash rates tended to spike shortly after release, generating a lot of false positives. I worked with Saptarshi, another data scientist, on a successor dashboard powered by a brms model that aimed to constrain the crash rate estimates with prior information from previous releases, and only alert once sufficient evidence of a crash spike had accumulated to overcome the priors. He also had the great idea of decomposing the overall crash numbers into the more intuitive quantities of crash incidence (the percentage of clients experiencing a crash) and crash rate (the average number of crashes, given you’ve crashed at least once). The model took into account information related to the release cycle and lowered the number of false-positive crash spikes, allowing for better classification and triaging of crash bugs. While the statistical model was crucial to enabling the dashboard’s new insights, this was a good example of a project where at least 90% of the effort involved ETL and data plumbing of one sort or another.
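
The incidence/rate split is just a factoring of the raw average, which a small sketch makes concrete. The per-client crash counts below are made up, and this is only the arithmetic behind the decomposition, not the actual Mission Control pipeline.

```python
import pandas as pd

# Hypothetical per-client crash counts for one release/build window.
clients = pd.DataFrame({"crashes": [0, 0, 1, 0, 3, 0, 0, 2, 0, 0]})

crashed = clients["crashes"] > 0
incidence = crashed.mean()                     # share of clients that crashed at all
rate = clients.loc[crashed, "crashes"].mean()  # mean crashes among clients that crashed

# The raw average factors into the two quantities:
# clients["crashes"].mean() == incidence * rate
print(incidence, rate, incidence * rate, clients["crashes"].mean())
```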

Slow regressions

Another aspect of managing releases is monitoring changes in performance test scores over time. After every commit to the Firefox codebase, a series of performance tests is triggered to help ensure that Firefox stays fast. While an existing system detects acute degradations in performance, often due to a single commit, there is no system to alert when a performance regression builds slowly as a result of multiple commits. The noise distribution of the performance tests is very…special, with level shifts, multimodality, and a relatively high rate of outliers, making naive week-over-week comparisons unreliable.

In an attempt to better illuminate the state of performance at a given time, I reframed the question of regressions in probabilistic terms and designed a system to estimate the latent performance level for each test. The resulting model made it possible to compare performance levels at arbitrary points in time and to query the probability that a regression had occurred, along with an estimate of its severity.
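
In practice, the probabilistic framing comes down to comparing posterior draws of the latent performance level at two points in time. Here is a rough sketch of that comparison on simulated stand-in draws (in the real system these would come from the fitted Stan model, and the 2% threshold is an arbitrary example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in posterior draws of the latent performance level (lower is better)
# at two points in time; simulated here purely for illustration.
level_before = rng.normal(loc=100.0, scale=2.0, size=4000)
level_after = rng.normal(loc=103.0, scale=2.0, size=4000)

relative_change = (level_after - level_before) / level_before

# Probability that performance regressed by more than, say, 2%.
p_regression = np.mean(relative_change > 0.02)

# Severity: summarize the posterior of the relative change itself.
severity_median = np.median(relative_change)
severity_interval = np.percentile(relative_change, [5, 95])

print(p_regression, severity_median, severity_interval)
```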

Some principles that went into the model design were

  • robustness: There are a variety of test suite and platform combinations, all of which have different noise and trend characteristics. The rate and severity of outliers is unpredictable and varies significantly between test suites: some have outliers ~10 deviations from the trend, others ~100 deviations away! The model needed to be simple enough to run reliably on this wide array of time series, but capture enough information from the tests to accurately convey the performance level.
  • diagnostics: Because of the wide variety of tests, it wasn’t feasible to manually specify parameters describing each one (trend time scale, noise level, rate of outliers, etc.). As a safeguard, I used diagnostic information from Stan’s MCMC sampler to at least provide some visibility into whether a given test met the assumptions of the model (a rough sketch of such a check appears after this list).
  • intuitive interpretation: Rather than using confusing statistical terminology to describe the results, the goal was to let stakeholders query the probability that a regression had occurred, and the distribution of its severity, over whatever time scales they were interested in.
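
As a rough illustration of the diagnostics point above, a check along these lines can flag tests whose fits look untrustworthy. This uses ArviZ on a fitted InferenceData object; the function, thresholds, and structure are my own stand-ins rather than the production code.

```python
import arviz as az

def flag_suspect_fit(idata: az.InferenceData) -> list[str]:
    """Return reasons a fitted test's posterior looks untrustworthy."""
    reasons = []
    summary = az.summary(idata)

    # Split R-hat near 1 means the chains agree; large values suggest the
    # model's assumptions don't hold for this particular time series.
    if summary["r_hat"].max() > 1.01:
        reasons.append("high r_hat")

    # Low effective sample size means the posterior summaries are noisy.
    if summary["ess_bulk"].min() < 400:
        reasons.append("low effective sample size")

    # Divergent transitions are a classic sign of a poorly specified model.
    if int(idata.sample_stats["diverging"].sum()) > 0:
        reasons.append("divergent transitions")

    return reasons
```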

Lessons learned

To highlight some lessons learned:

  • Often data science is 95% plumbing
  • For production models, robustness is key, and simpler models are often preferable. Even simple models can be useful.
  • Models can bring value by optimizing for insights beyond predictions
  • Write code with future Chris in mind, not present Chris

A lesson I quickly learned was that sophisticated modeling techniques can be straightforward for an analysis that only needs to run once, but the need to rerun it daily with new data brings constraints that favor robust, and often simpler, models. I see Bayesian models as offering an advantage in this respect; probabilistic programming languages like Stan invite you to be explicit about the properties of the data your model expects. And as an insurance policy, Stan offers a rich set of diagnostics that can flag when the data isn’t meeting those assumptions.

Researching and building these models was a uniquely valuable experience for me. I feel really fortunate for the opportunities I had to collaborate with such talented colleagues and mentors on these projects, and I’m proud of the products we were able to build.2

  1. I tried rewriting the model in Stan at one point, using covariates and a zero-inflated binomial model, but could not find a way to scale it up to match the lifetimes version. The lifetimes approach is based on papers that use clever math tricks to reduce the required per-client information to min and max dates and sums, while the Stan approach required the full usage history of each client.

  2. Saptarshi was the driver for a lot of the Mission Control v2 project, and Will worked with us to get it production-worthy. Anthony worked with me to get active profiles into production, and Eric was the SQL whiz and project manager for slow regressions.