
Risk tolerance and model performance

When predicting user behavior, there is always a degree of uncertainty that requires a trade-off: you must decide whether to reach fewer users with higher overall accuracy, or to reach more users even with lower overall accuracy. This trade-off is determined by your unique use case.

Risk tolerance levels

Firebase Predictions defines risk tolerance levels based on two metrics:

  • The true positive rate of a prediction is the proportion of users who performed an action and who were correctly predicted to perform that action (for example, the proportion of users who ended up making a purchase that Firebase had predicted would make a purchase).
  • The false positive rate of a prediction is the proportion of users who did not perform an action but who were incorrectly predicted to perform that action (for example, the proportion of users who didn't make a purchase that Firebase had predicted would make a purchase).

You tell Predictions how much uncertainty you are willing to tolerate when targeting users by choosing a risk tolerance level (also called a risk profile) for a prediction. Each risk tolerance level guarantees that the false positive rate won't exceed a fixed maximum threshold. Given that threshold, Predictions targets as many users as possible to maximize the true positive rate without exceeding it. If the maximum achievable true positive rate fails to meet a minimum threshold, the risk profile is disabled and no users are targeted with it. In this way, each risk profile enforces a minimum level of certainty: when that level isn't met, targeting is turned off rather than applied with poor accuracy.
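The selection rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it supposes the model produces a numeric score per user and that targeting means "score at or above a threshold". The function name `pick_threshold` and its parameters are hypothetical and are not part of any Firebase API; how Predictions implements this internally is not documented here.

```python
from typing import Optional, Sequence, Tuple


def pick_threshold(
    scores: Sequence[float],
    labels: Sequence[bool],
    max_fpr: float,
    min_tpr: float,
) -> Optional[Tuple[float, float, float]]:
    """Return (threshold, tpr, fpr), or None when the profile would be inactive."""
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        return None  # rates are undefined without both classes

    best = None
    # Lowering the threshold targets more users, so TPR and FPR can only grow.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        tpr, fpr = tp / positives, fp / negatives
        if fpr > max_fpr:
            break  # every lower threshold would also exceed the cap
        best = (threshold, tpr, fpr)

    if best is None or best[1] < min_tpr:
        return None  # the minimum true positive rate cannot be met
    return best
```

Because the false positive rate never decreases as the threshold is lowered, the last threshold that still respects the cap is also the one with the highest true positive rate.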

When you target users based on a prediction, you select a risk tolerance level. Depending on the type of prediction and the number of available Analytics events, one or more of the following levels are available to choose from:

Risk tolerance levels

High
  • Targets the most users, at the cost of prediction accuracy
  • Guarantees at most a 20% false positive rate
  • Inactive when the true positive rate falls below 45%
  • For every 10 users correctly targeted, at most 4.44 users (10 × 20% ÷ 45%) are incorrectly targeted*

Medium
  • Targets fewer users with higher accuracy
  • Guarantees at most a 10% false positive rate
  • Inactive when the true positive rate falls below 35%
  • For every 10 users correctly targeted, at most 2.86 users (10 × 10% ÷ 35%) are incorrectly targeted*

Low
  • Targets the fewest users, with the best accuracy
  • Guarantees at most a 5% false positive rate
  • Inactive when the true positive rate falls below 25%
  • For every 10 users correctly targeted, at most 2 users (10 × 5% ÷ 25%) are incorrectly targeted*

*Assuming an equal number of actual positive and negative cases among your users. If there are X times as many negative cases as positive cases, multiply the maximum number of false positives by X.
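The per-level bounds and the footnote's scaling rule follow from one line of arithmetic, sketched below. The worst case occurs when the false positive rate sits at its maximum and the true positive rate at its minimum. The function name and the `neg_to_pos_ratio` parameter (the footnote's X) are illustrative names only.

```python
def max_false_positives_per_10(max_fpr, min_tpr, neg_to_pos_ratio=1.0):
    """Worst-case number of incorrectly targeted users per 10 correctly targeted.

    false positives = FPR * negatives and true positives = TPR * positives,
    so FP / TP = (FPR / TPR) * (negatives / positives).
    """
    return 10 * max_fpr / min_tpr * neg_to_pos_ratio


# (maximum false positive rate, minimum true positive rate) per level
levels = {"high": (0.20, 0.45), "medium": (0.10, 0.35), "low": (0.05, 0.25)}

for name, (max_fpr, min_tpr) in levels.items():
    balanced = max_false_positives_per_10(max_fpr, min_tpr)
    skewed = max_false_positives_per_10(max_fpr, min_tpr, neg_to_pos_ratio=3)
    print(f"{name}: {balanced:.2f} (equal classes), {skewed:.2f} (3x as many negatives)")
```

With three times as many negative cases as positive cases, each bound is simply three times larger, which is why the class balance of your user base matters as much as the level you pick.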


Suppose you have an app with 35,000 users, and you want to predict which users will stop using your app (or churn) in the next few days so that you can do something to encourage them to keep using your app.

In the figures below, each face represents 1,000 of your users. Groups of users who are satisfied and won't churn are shown in green (22,000 users in total), and groups who are dissatisfied and will churn are shown in red (13,000 users in total).

High risk tolerance

With high risk tolerance, Predictions might create a group like the one in the figure below. This group includes 10,000 of the 13,000 users who are dissatisfied. Therefore, the true positive rate of this prediction is about 76.9%. If, while high risk tolerance is selected, this value ever fell below 45%, the prediction would become inactive until the true positive rate improved.

This group also includes 4,000 users who are actually satisfied with your app, and who you might not want to target in your re-engagement strategy. Because 4,000 of your 22,000 satisfied users were falsely predicted to churn, the false positive rate of this prediction is about 18.18%, which is below the 20% maximum false positive rate guaranteed by the high risk tolerance profile.

Low risk tolerance

On the other hand, the figure below shows what a group created with low risk tolerance might look like. This group contains fewer false positives (only 1,000 users), but it also includes 4,000 fewer dissatisfied users than the high risk tolerance group (6,000 instead of 10,000). The true positive rate of this prediction is about 46.15% (6,000 of 13,000) and the false positive rate is about 4.55% (1,000 of 22,000).
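The rates quoted for the two example groups can be reproduced directly from the user counts. The sketch below does that arithmetic; the helper name `rates` is illustrative only.

```python
def rates(true_positives, false_positives, positives, negatives):
    """Return (true positive rate, false positive rate) from raw user counts."""
    return true_positives / positives, false_positives / negatives


CHURNERS = 13_000      # dissatisfied users who will churn
NON_CHURNERS = 22_000  # satisfied users who won't churn

high = rates(10_000, 4_000, CHURNERS, NON_CHURNERS)
low = rates(6_000, 1_000, CHURNERS, NON_CHURNERS)

print(f"high risk tolerance: TPR {high[0]:.2%}, FPR {high[1]:.2%}")
print(f"low risk tolerance:  TPR {low[0]:.2%}, FPR {low[1]:.2%}")
```

Both groups stay within their limits: the high risk tolerance group is under the 20% false positive cap and above the 45% true positive minimum, and the low risk tolerance group is under 5% and above 25%.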

See how risk tolerance affects performance

Because the quality of your predictions can change every day, it's possible that a prediction—given a particular risk profile—will be active one day, but inactive the next day. For this reason, understanding how a prediction's risk profile affects its reliability is important for deciding which risk profile you should use to target your users.

For example, if you set up a Remote Config parameter based on a particular risk profile, on any day that prediction is inactive, Remote Config will not assign a value to the parameter, and all your users will get whatever default behavior you have defined. Depending on your use case, this might be acceptable, but if it isn't, you need to know which risk profile gives you a reliably active prediction.
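The fallback behavior described above amounts to a simple rule, sketched here in plain Python. This is not Remote Config code: `resolve_parameter` and its arguments are hypothetical names used only to make the logic explicit.

```python
def resolve_parameter(prediction_active, user_in_predicted_group,
                      targeted_value, default_value):
    """Return the value a user would see for a prediction-based parameter.

    Hypothetical helper: on days when the prediction is inactive, nobody is
    targeted, so every user falls through to the default.
    """
    if prediction_active and user_in_predicted_group:
        return targeted_value
    return default_value


# On an inactive day, even a user who would otherwise be targeted gets the default.
print(resolve_parameter(False, True, "show_discount", "no_discount"))
```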

To help you understand how risk tolerance affects the reliability of your predictions, each prediction card in the Firebase console has a footer that indicates how reliable the prediction has been over the last two weeks, for each of the three available risk profiles.

For more details about the prediction's performance, you can expand the card's performance section:

The graph shows the prediction model's true positive rate for your data over the last two weeks. Each data point on the graph indicates how well that day's model performed on a holdout dataset (see How performance statistics are calculated). The graph is displayed in red on any day the true positive rate falls below the required threshold. On such days, Firebase deactivates user targeting based on the prediction.
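As a rough illustration of what the graph encodes, the sketch below flags the days on which a series of daily true positive rates falls below a level's minimum. The 14 values are made up for the example, and `inactive_days` is a hypothetical helper, not a console or SDK function.

```python
def inactive_days(daily_tpr, min_tpr):
    """Return the indexes of days whose true positive rate is below min_tpr."""
    return [day for day, tpr in enumerate(daily_tpr) if tpr < min_tpr]


# Hypothetical true positive rates for the past 14 days (day 0 is the oldest).
daily_tpr = [0.41, 0.39, 0.36, 0.33, 0.34, 0.38, 0.40,
             0.42, 0.37, 0.31, 0.36, 0.39, 0.43, 0.44]
MEDIUM_MIN_TPR = 0.35  # minimum true positive rate for medium risk tolerance

red_days = inactive_days(daily_tpr, MEDIUM_MIN_TPR)
active_count = len(daily_tpr) - len(red_days)
print(f"inactive on days {red_days}; active {active_count} of {len(daily_tpr)} days")
```

Note that a different risk tolerance level changes both the minimum and the true positive rates themselves, so a real comparison between levels needs a separate series per level.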

If you see that your model has been inactive on some of the past 14 days, you might want to consider increasing your risk tolerance level to target more users and avoid inactive days, at the expense of potentially more false positives. You can see how different risk tolerance levels affect your model's performance by moving the Risk tolerance slider to different positions:

When you do so, the graph shows how well each day's model would have performed with the risk tolerance level you selected. In the example above, you can see that increasing the risk tolerance from medium to high results in the model's true positive rate remaining well above the 45% threshold for all of the past two weeks (but with greater tolerance for false positives).

When you find a risk tolerance level that achieves a balance between user reach and accuracy that you're comfortable with, select that risk tolerance level when you target users with Remote Config, Firebase In-App Messaging, A/B Testing, or the Notifications composer.

How performance statistics are calculated


Like many machine learning tasks, training a Predictions model is a "supervised learning" task. This means all of the users used to train the model must be assigned labels such as "will churn", "will not spend", and so on. To label users, Predictions takes all 28-day active users of your app and removes the last 7 days of events from their data. This 7-day period is called the label window. Firebase Predictions uses events from the label window to assign labels to each user, and then uses the user's events prior to those 7 days (events from the training window) to train the model.
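The split between the two windows can be sketched as follows. This is an assumption-laden illustration: events are modeled as `(timestamp, event_name)` pairs, and the labeling rule in `will_spend` (any `in_app_purchase` event in the label window) is a hypothetical stand-in, since the exact rules Predictions uses to derive labels are not described here.

```python
from datetime import datetime, timedelta

LABEL_WINDOW = timedelta(days=7)


def split_windows(events, now):
    """Split (timestamp, event_name) pairs into (training_window, label_window)."""
    cutoff = now - LABEL_WINDOW
    training_window = [e for e in events if e[0] < cutoff]
    label_window = [e for e in events if cutoff <= e[0] <= now]
    return training_window, label_window


def will_spend(label_window):
    """Hypothetical labeling rule: the user spent if they purchased in the window."""
    return any(name == "in_app_purchase" for _, name in label_window)


now = datetime(2021, 5, 20)
events = [
    (datetime(2021, 5, 1), "level_up"),
    (datetime(2021, 5, 12), "session_start"),
    (datetime(2021, 5, 15), "in_app_purchase"),
]
training_window, label_window = split_windows(events, now)
print(len(training_window), "training events; label:", will_spend(label_window))
```

The model only ever sees the training window as input, so the label acts as a known "future" outcome that the model learns to predict.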

Holdout data and training data

Not all of the data is used directly for training. As is typical for supervised learning tasks, Predictions sets aside 20% of the data as holdout data and uses only the remaining 80% of the data to train the model. Then, to evaluate the model's performance, predictions are generated for the users in the holdout set, based on the data in the training window, and compared to the actual outcomes for each user, based on the labels generated from the label window.
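A minimal sketch of the holdout procedure is shown below. It assumes a random 80/20 split of users, which is a common choice but is not confirmed here as the method Predictions uses; `split_holdout` and `holdout_rates` are illustrative names.

```python
import random


def split_holdout(user_ids, holdout_fraction=0.2, seed=0):
    """Randomly split users into (training_users, holdout_users)."""
    shuffled = list(user_ids)
    random.Random(seed).shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


def holdout_rates(predicted, actual):
    """Compare predictions with label-window outcomes for holdout users.

    Both arguments map user id -> bool. Returns (tpr, fpr). Assumes the
    holdout set contains at least one positive and one negative user.
    """
    positives = [u for u in actual if actual[u]]
    negatives = [u for u in actual if not actual[u]]
    tpr = sum(1 for u in positives if predicted[u]) / len(positives)
    fpr = sum(1 for u in negatives if predicted[u]) / len(negatives)
    return tpr, fpr


training_users, holdout_users = split_holdout(range(100))
print(len(training_users), "training users,", len(holdout_users), "holdout users")
```

Because the holdout users play no part in training, the rates measured on them estimate how the model behaves on users it has not seen.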

All of the statistics presented in the Firebase console come from evaluating the model against the holdout data.