Shepherd: How Stripe adapted Chronon to scale ML feature development

2024-04-15

Machine learning (ML) underpins nearly every facet of Stripe’s global operations, optimizing everything from backend processing to user interfaces. Applications of ML at Stripe add hundreds of millions of dollars to the internet economy each year, benefiting millions of businesses and customers worldwide. Developing and deploying ML models is a complex, multistage process, and one of the hardest steps is feature engineering.

Before a feature—an input to an ML model—can be deployed into production, it typically goes through multiple iterations of ideation, prototyping, and evaluation. This is particularly challenging at Stripe’s scale, where features have to be identified among hundreds of terabytes of raw data. As an engineer on the ML Features team, my goal is to build infrastructure and tooling to streamline ML feature development. The ideal platform needs to power ML feature development across huge datasets while meeting strict latency and freshness requirements. 

In 2022 we began a partnership with Airbnb to adapt and implement its platform, Chronon, as the foundation for Shepherd—our next-generation ML feature engineering platform—with a view to open sourcing it. We’ve already used it to build a new production model for fraud detection with over 200 features, and so far the Shepherd-enabled model has outperformed our previous model, blocking tens of millions of dollars of additional fraud per year. While our work building Shepherd was specific to Stripe, we are generalizing the approach by contributing optimizations and new functionality to Chronon that anyone can use.

This blog discusses the technical details of how we built Shepherd and how we are expanding the capabilities of Chronon to meet Stripe’s scale.

ML feature engineering at Stripe scale

In a previous blog post, we described how ML powers Stripe Radar, which allows good charges through while blocking bad ones. Fraud detection is adversarial: malicious actors constantly evolve their attacks, fraud patterns change, and Stripe needs to improve its models even faster.

ML feature development is the process of defining the inputs (features) that a model uses to make its predictions. For example, a feature for a fraud prediction model could be the total number of charges processed by a business on Stripe over the last seven days.
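
For instance, such a feature might be defined along the following lines in Chronon’s Python API. This is a sketch modeled on the open-source Chronon examples; the table and column names are hypothetical, and module paths may differ by version.

```python
# Sketch of a windowed GroupBy in Chronon's Python API. The table and
# column names (data.charges, merchant_id, charge_id, ts) are hypothetical;
# module paths follow the open-source Chronon examples and may vary by version.
from ai.chronon.api.ttypes import Source, EventSource
from ai.chronon.query import Query, select
from ai.chronon.group_by import GroupBy, Aggregation, Operation, Window, TimeUnit

# Each row of the source is a charge event with a processing timestamp.
charges = Source(
    events=EventSource(
        table="data.charges",                    # hypothetical table of charge events
        query=Query(
            selects=select("merchant_id", "charge_id"),
            time_column="ts",
        ),
    )
)

# Count of charges per merchant over a sliding seven-day window, computed
# offline for training and served online for inference.
merchant_charge_counts = GroupBy(
    sources=[charges],
    keys=["merchant_id"],
    online=True,
    aggregations=[
        Aggregation(
            input_column="charge_id",
            operation=Operation.COUNT,
            windows=[Window(length=7, timeUnit=TimeUnit.DAYS)],
        ),
    ],
)
```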

To identify and deploy new features that would address rapidly changing fraud trends, we needed a feature engineering platform that would allow us to move quickly through the lifecycle of feature development.

Effectively deploying ML models in the Stripe environment also requires meeting strict latency and feature freshness requirements:

  • Latency: A measure of the time required to retrieve features during model inference. This is important because models such as the ones powering Radar are also used in processing payments, and the time required to retrieve features directly impacts the overall payment API latency—lower latency means faster payments and a better overall customer experience for businesses.
  • Feature freshness: A measure of the time required to update the value of features. This is important because Stripe needs to react quickly to changes in fraud patterns. For example, if there is an unusual spike in transactions for one business, feature values must quickly be updated to reflect the pattern so models can incorporate the new information in their predictions for other businesses.

There are trade-offs between latency and feature freshness. For example, we can improve latency at the expense of freshness by performing more precomputation when new data arrives, while we can prioritize freshness over latency by performing more of the feature computation during serving. Stripe’s strict requirements for both low latency and feature freshness across the billions of transactions we process create a unique set of constraints on our feature platform.
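
As a deliberately simplified illustration of where the work moves, consider the two extremes below. This is illustrative Python, not Chronon’s implementation.

```python
# Illustrative only: two extremes of the latency vs. freshness trade-off
# for a "number of charges in the last seven days" feature.

SEVEN_DAYS = 7 * 24 * 3600

class PrecomputeOnWrite:
    """Low read latency: serving is a single lookup. Freshness is bounded
    by how often refresh() runs."""
    def __init__(self):
        self.counts = {}                          # merchant_id -> precomputed count

    def refresh(self, merchant_id, event_timestamps, now):
        # Periodic batch recompute: all aggregation work happens here.
        self.counts[merchant_id] = sum(
            1 for ts in event_timestamps if now - ts <= SEVEN_DAYS
        )

    def serve(self, merchant_id):
        return self.counts.get(merchant_id, 0)    # O(1) at request time

class ComputeOnRead:
    """Maximal freshness: every event is visible immediately. Read latency
    grows with the number of events aggregated per request."""
    def __init__(self):
        self.events = {}                          # merchant_id -> event timestamps

    def record(self, merchant_id, ts):
        self.events.setdefault(merchant_id, []).append(ts)

    def serve(self, merchant_id, now):
        # All aggregation work happens at request time.
        return sum(1 for ts in self.events.get(merchant_id, ())
                   if now - ts <= SEVEN_DAYS)
```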

Shepherd: Stripe’s next-generation ML feature platform

As Stripe grew, so did our ambitions for applying ML to hard problems. To accelerate our feature engineering work, we evaluated several options, including revamping our existing platform, building from scratch, and adopting proprietary or open-source alternatives. One particularly appealing option emerged when Airbnb invited us to become an early external adopter of Chronon, the platform it had developed to power its own ML use cases.

Airbnb wanted to integrate the platform with an external partner prior to open sourcing it, and Chronon met all of our requirements: an intuitive Python- and SQL-based API, efficient windowed aggregations, support for online and offline computation of features, and built-in consistency monitoring. At the same time, we couldn’t use it off the shelf: we knew we would need to adapt Chronon to Stripe’s unique scale, where training data can include thousands of features and billions of rows. It was going to be a significant engineering challenge, but we were confident Chronon was a strong foundation to build on.

Adapting Chronon

Chronon supports batch and streaming features in both online and offline contexts. To be able to use Chronon as the foundation for Shepherd, we needed to make sure the offline, online, and streaming components could all meet Stripe’s scale.

ML engineers use Chronon to define their features with a Python- and SQL-based API, and Chronon provides the offline, streaming, and online components that compute and serve those features. Integrating with Chronon means setting up each of these components and providing an implementation of the key-value (KV) store used to hold feature data for serving, and each component needed to meet our feature freshness and latency requirements.

KV store implementation

The KV store is responsible for storing data required to serve features. Offline jobs compute and write historical feature data to the store, and streaming jobs write feature updates. To cost-efficiently scale our KV store, we split it into two implementations: a lower-cost store optimized for bulk uploads that is write-once and read-many, and a higher-cost distributed memcache-based store that is optimized for write-many and read-many. With this dual KV store implementation, we lowered the cost of storing and serving data while still meeting our latency and feature freshness requirements.
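
As a rough sketch of how such a split can sit behind one interface, consider the following; the class and method names here are hypothetical, not Chronon’s actual KV store API.

```python
# Hypothetical sketch of the dual KV store split. The class and method
# names are illustrative, not Chronon's actual KV store interface.

class DualKVStore:
    def __init__(self, bulk_store, streaming_store):
        self.bulk_store = bulk_store            # write-once/read-many, low cost
        self.streaming_store = streaming_store  # write-many/read-many, memcache-based

    def bulk_put(self, dataset, rows):
        # Historical feature data from offline jobs arrives in large,
        # infrequent uploads, so it goes to the lower-cost store.
        self.bulk_store.upload(dataset, rows)

    def put(self, dataset, key, value):
        # Streaming feature updates are frequent small writes, so they go
        # to the store optimized for high write throughput.
        self.streaming_store.write(dataset, key, value)

    def get(self, dataset, key):
        # Serving a feature merges the historical baseline with any
        # recent streaming updates.
        historical = self.bulk_store.read(dataset, key)
        updates = self.streaming_store.read(dataset, key)
        return historical + updates
```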

Streaming jobs

Chronon streaming jobs consume event streams and write the events, which can be thought of as feature updates, to the KV store. The default Chronon implementation writes each event individually, with no preaggregation. That approach would not allow us to meet our latency requirements for features with a large number of events, so we needed a streaming platform that could achieve low-latency updates and allow us to implement a more scalable write pattern.

We chose Flink as the streaming platform because of its low latency stateful processing. Since the Chronon API is a combination of Python and Spark SQL, maintaining consistency between offline and online computation meant we needed a way to run Spark SQL expressions in Flink. Fortunately, the Spark SQL expressions used in Chronon’s feature definitions only require maps and filters. These are narrow transformations—with no shuffling of data—and can be applied to individual rows.

We implemented support for Spark SQL expressions applied to Flink rows. With Flink now powering our feature updates, we achieved p99 feature freshness of 150ms.
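
To illustrate why this composition works, here is a minimal Python sketch of the per-row pattern. The lambdas stand in for compiled Spark SQL expressions, which the real implementation evaluates inside Flink operators.

```python
# Minimal sketch of the per-row pattern. The lambdas below stand in for
# compiled Spark SQL expressions; the real implementation evaluates the
# actual Spark SQL expressions inside Flink operators.

def make_row_transform(projections, predicates):
    """projections: {output_column: fn(row) -> value}, the SELECT clauses.
    predicates: [fn(row) -> bool], the WHERE clauses."""
    def transform(row):
        # Filter: a narrow transformation, decided per row.
        if not all(pred(row) for pred in predicates):
            return None
        # Map: each output column is computed from the row alone,
        # so no shuffle is ever required.
        return {col: fn(row) for col, fn in projections.items()}
    return transform

# Example standing in for:
#   SELECT merchant_id, amount / 100 AS amount_usd
#   FROM charges WHERE currency = 'EUR'
transform = make_row_transform(
    projections={
        "merchant_id": lambda r: r["merchant_id"],
        "amount_usd": lambda r: r["amount"] / 100,
    },
    predicates=[lambda r: r["currency"] == "EUR"],
)
```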

[Figure: Comparison of the untiled and tiled streaming architectures, described below.]

Our Flink-based streaming architecture met our feature freshness requirements; that left our latency targets. To achieve those, we needed to change how Chronon stores events in the KV store. When events are stored individually, computing a feature requires retrieving all of its events and aggregating them together. For features with a large number of events, this is time-consuming and increases latency.

Rather than store individual events, we decided to maintain the state of preaggregated feature values in the Flink app, and periodically flush those values out to the KV store. We call each of these preaggregated values a “tile.” With tiling, computing a feature only requires retrieving and aggregating the tiles for the feature rather than all the individual events. For features with a large number of events, this is a much smaller amount of data and significantly decreases latency. We contributed both the Flink and tiling implementations back to Chronon, along with documentation on how to get started with them.
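
Here is a simplified Python sketch of the tiling idea for a count feature. It is illustrative only; the production logic lives in the Flink job, and the tile width and KV store API shown are hypothetical.

```python
# Illustrative sketch of tiling, not the actual Flink implementation.
# The streaming job keeps a running preaggregate per (key, tile) and
# periodically flushes tiles to the KV store; serving then merges a
# handful of tiles instead of re-aggregating every raw event.
from collections import defaultdict

TILE_SECONDS = 300   # hypothetical tile width

class TilingAggregator:
    def __init__(self, kv_store):
        self.kv_store = kv_store
        self.tiles = defaultdict(int)             # (key, tile_start) -> partial count

    def on_event(self, key, event_ts):
        tile_start = int(event_ts // TILE_SECONDS) * TILE_SECONDS
        self.tiles[(key, tile_start)] += 1        # update in-memory preaggregate

    def flush(self):
        # Runs periodically: one small write per tile instead of one per event.
        for (key, tile_start), partial in self.tiles.items():
            self.kv_store.put(key, tile_start, partial)   # hypothetical KV API
        self.tiles.clear()

def serve_count(kv_store, key, window_start, window_end):
    # Serving reads and merges the few tiles covering the window.
    tiles = kv_store.scan(key, window_start, window_end)  # hypothetical KV API
    return sum(tiles)
```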

Meeting Stripe’s offline requirements

Chronon’s offline jobs produce training data for models as well as features for batch-only use cases. They also compute the historical data used to serve GroupBys (Chronon’s feature aggregation definitions). The offline jobs are configured using the same Python- and Spark SQL-based API as the online jobs, allowing developers to define their features once and compute them both online and offline.

Just as with the streaming and online components, Stripe’s offline jobs run at a larger scale than Chronon’s previous use cases. Although the offline algorithm is designed to be robust, with support for handling skewed data, we needed to verify that it would scale to the size of Stripe’s training sets. As a first step to integrating with Chronon’s offline jobs, we benchmarked training dataset generation and found the algorithm scalable, with predictable tuning knobs.

After verifying its scalability, we needed to integrate Chronon’s offline jobs with Stripe’s data orchestration system. We built a custom integration for scheduling and running jobs that worked with our highly customized Airflow setup. We designed the integration so users only need to mark their GroupBys as online or set an offline schedule in their Join definitions, after which the required offline jobs are automatically scheduled.
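
For example, a feature developer writes a Join definition along these lines, and the integration takes care of the rest. The parameter names follow the open-source Chronon API and may differ by version; charges and merchant_charge_counts are the hypothetical definitions from earlier.

```python
# Sketch of a Join definition; parameter names follow the open-source
# Chronon API and may differ by version. Our integration reads the online
# flag and offline schedule and schedules the required jobs automatically.
from ai.chronon.join import Join, JoinPart

training_data = Join(
    left=charges,                      # the hypothetical event source from earlier
    right_parts=[JoinPart(group_by=merchant_charge_counts)],
    online=True,                       # serve these features online
    offline_schedule="@daily",         # refresh offline data daily
)
```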

We also needed to integrate Chronon with Stripe’s data warehouse. Chronon assumes all data sources are partitioned Hive tables, and not all data sources at Stripe meet this requirement. For example, many of the data sources required for batch features are unpartitioned snapshot tables.

We built support into our Chronon integration for defining features with a wider variety of data sources, and for writing features to Stripe’s data warehouse using customized Iceberg writers. Fully integrating with our data warehouse provides feature engineers the flexibility to define features using any data source, and to consume features in downstream batch jobs for use cases including model training and batch scoring.
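
For instance, a batch feature over a snapshot-style table can be expressed with Chronon’s entity sources, roughly as follows. This is a sketch with hypothetical table and column names; the field names follow the open-source Chronon API and may differ by version.

```python
# Sketch of a batch feature source over a snapshot-style table. The table
# and column names are hypothetical; field names follow the open-source
# Chronon API and may differ by version.
from ai.chronon.api.ttypes import Source, EntitySource
from ai.chronon.query import Query, select

merchant_profiles = Source(
    entities=EntitySource(
        snapshotTable="data.merchant_profiles",   # daily snapshot, not event-partitioned
        query=Query(
            selects=select("merchant_id", "country", "category"),
        ),
    )
)
```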

Our implementation for more flexible data source support was Stripe-specific, but we plan to generalize the approach and contribute it to Chronon.

Building a SEPA fraud model on Shepherd

Our first use case for Shepherd was a partnership with our Local Payment Methods (LPM) team to create an updated ML model for detecting SEPA fraud. SEPA, which stands for Single Euro Payments Area, enables people and businesses to make cashless euro payments—via credit transfer and direct debit—anywhere in the European Union. The LPM team initially planned on combining new Shepherd-created features with existing features from our legacy feature platform, but found development on Shepherd so easy that they created all new features and launched a Shepherd-only model.

Our new SEPA fraud model consists of over 200 features, including a combination of batch-only and streaming features. As we built the model, we also developed support for modeling the delay of batch data in the offline training data, avoiding training-serving skew: a mismatch in which the feature values a model is trained on do not reflect the feature values used to make predictions.
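
The idea can be illustrated with a small, purely hypothetical sketch: if a batch pipeline lands its output some number of days late, a training row must join against the partition that would actually have been available at serving time.

```python
# Illustrative only: choosing which batch partition a training row may
# read, given that batch pipelines land their output with a delay.
from datetime import date, timedelta

def available_partition(event_date: date, landing_delay_days: int) -> date:
    """The newest batch partition the model could actually have seen when
    scoring an event on event_date. Joining training rows against anything
    newer would introduce training-serving skew."""
    return event_date - timedelta(days=landing_delay_days)

# A charge on 2024-03-10, with batch features landing two days late, must
# be joined against the 2024-03-08 partition, mirroring what serving saw.
assert available_partition(date(2024, 3, 10), 2) == date(2024, 3, 8)
```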

As part of the new SEPA fraud model, we also built monitoring and alerting for Shepherd, including integrating with Chronon’s online/offline consistency monitoring. As we mentioned at the start of this post, the new model blocks tens of millions of dollars of additional fraud a year.

Supporting the Chronon community

As a co-maintainer of Chronon with Airbnb, we’re excited to grow and support this open-source community while continuing to expand the capabilities of the project. We also designed the new Chronon logo, a subtle nod to the fabric of time.

Over the coming months, we’ll contribute new functionality and additional optimizations to Chronon, and we’ll share more details about how teams at Stripe are adopting Shepherd. 

To get started with Chronon, check out the GitHub repository, read the documentation at Chronon.ai, and drop into our community Discord channel. 

And if you’re interested in building ML infrastructure at Stripe—or developing ML features for Stripe products—consider joining our engineering team.