Used by 10,000+ companies to process massive transaction volumes, Stripe has become a core part of the payments infrastructure of the internet. And because of the importance of this transactional data, it should absolutely form the foundation of the analytics at any company that uses it.
While it’s certainly possible to analyze revenue based on data reported in other systems — Google Analytics, your internal database — your payments provider is where the rubber meets the road. If Stripe says that a transaction was processed, that’s a datapoint you can count on. This is the foundation you want to build from as you’re building out your internal analytics.
The challenge in analyzing Stripe data is that you have to get at it first. That’s where tools like RJMetrics Pipeline come in. Pipeline connects to your Stripe account and syncs your data into Redshift in near-real-time. Once your Stripe data is in Redshift, you can analyze it using SQL.
That’s where we’ll pick up the storyline. This post walks through the data modeling and analysis required to take your raw Stripe data in Redshift and get your MRR metrics:
- new revenue,
- upgrades,
- downgrades, and
- churn.
Note: All of the SQL in this post is published as open source software under the Apache 2.0 license, as part of our new project Analyst Collective. Analyst Collective is a collection of open source tools and analyses dedicated to improving analytical workflows. At the end of this post, I’ll share more about Analyst Collective, including tools to easily configure and deploy the data models and analysis from this post within your own Redshift cluster. If you’d like to skip the walkthrough and just deploy the code, jump straight to the end.
Lots of credit to Chris Hexton at Vero for providing the impetus for this post and contributing the initial code. Thanks Chris!!
Preparing the data
Before you dive into writing analytic code, it’s important to normalize your raw data. Effective data cleansing and transformation play a critical role in data analysis: they provide the underlying layer upon which all of your later analysis is built. For Stripe, there is real work to be done on this front.
The first stage is to take the unix timestamps that Stripe’s API gives you and translate them into dates that Redshift understands.
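In Redshift, the standard idiom is to add the epoch seconds to the `'epoch'` timestamp literal. Here’s a minimal sketch, assuming Pipeline lands the raw data in a `stripe.invoices` table with Stripe’s standard unix-timestamp fields (adjust names to your schema):

```sql
-- Sketch: translate Stripe's unix timestamps into Redshift timestamps.
-- Table and column names assume a Pipeline-style Stripe sync.
select
    id,
    customer,
    subscription,
    total,
    paid,
    timestamp 'epoch' + "date" * interval '1 second'       as invoice_date,
    timestamp 'epoch' + period_start * interval '1 second' as period_start,
    timestamp 'epoch' + period_end * interval '1 second'   as period_end
from stripe.invoices;
```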
Once we perform this translation, we materialize it as a view or table into our schema and will rely on it for all subsequent analysis. This process of abstraction will result in clean, maintainable analytic code. Here’s what the data looks like at this stage (this is randomly generated test data):
The next step is to take the three core tables we need for our analysis—plans, subscriptions, and invoices—and combine them into a single table of revenue per month. You can find the entire query here, but there’s a lot going on, so I’m going to walk through it piece by piece.
First, we want to filter out invoice records that don’t represent real payments. We don’t want to include anything in our revenue analysis that doesn’t represent real revenue (the kind we’d want to report to the IRS).
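As a sketch, the filter can be as simple as keeping only paid, non-zero invoices (column names assume Stripe’s standard invoice object; if you forgive invoices or issue credits, your filter may need more conditions):

```sql
-- Keep only invoices that represent real, collected revenue.
select *
from invoices          -- the cleaned invoice relation from the previous step
where paid = true
  and total > 0;
```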
Next, we have to split our invoices over the months they represent. If you only have monthly invoices, this stage of the analysis isn’t as relevant for you, but most subscription revenue companies have multiple subscription term options. This analysis does the work of splitting your annual subscriptions accurately across the 12 months for which revenue should be accrued. This step is critical if you want to produce trustworthy & useful financial analysis: you must accrue your revenue appropriately across the months during which service was delivered. This is definitely the trickiest part of the analysis.
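One way to sketch the accrual: count the months each invoice covers and divide its total evenly across them. This assumes the cleaned invoices from above plus a `months` date-spine relation; your accrual rules may differ:

```sql
-- Spread each invoice's total evenly across the months it covers.
-- months(date_month) is a date spine of month-start dates.
select
    i.customer,
    m.date_month,
    i.total::float
        / greatest(datediff(month, i.period_start, i.period_end), 1) as mrr_amount
from invoices i
join months m
  on m.date_month >= date_trunc('month', i.period_start)
 and m.date_month <  date_trunc('month', i.period_end);
```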
The following two common table expressions create a list of all distinct months that have occurred from the very first Stripe transaction recorded up through the current date.
If you haven’t had to create date tables in Redshift before, bookmark this trick in your brain. It’s frequently useful.
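A common Redshift version of the trick (since `generate_series` runs only on the leader node): synthesize a number sequence with `row_number()` over any sufficiently large table, then add those numbers to the first transaction’s month. A sketch:

```sql
-- Build a month spine from the first Stripe invoice through today.
with numbers as (
    -- stv_blocklist is just a conveniently large system table;
    -- any table with enough rows will do.
    select row_number() over (order by slice) - 1 as n
    from stv_blocklist
    limit 1000
),
first_month as (
    select date_trunc('month', min(invoice_date)) as m
    from invoices
)
select dateadd(month, n.n, f.m)::date as date_month
from numbers n
cross join first_month f
where dateadd(month, n.n, f.m) <= current_date;
```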
The final set of CTEs create a list of all customers for all months in which they had revenue recognized.
This analysis will recognize revenue on the first day of the month of an active subscription. (While it’s possible to do revenue recognition on a daily basis, that was overkill for our current needs.) So, the above query would return a record for February, but not one for March, if a customer made their final payment on February 28th.
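Sketched against the relations above, the customer-months list is just the distinct pairing of customers with the months their invoices cover, recognized on the first day of each month:

```sql
-- One row per customer per month in which revenue is recognized.
select distinct
    i.customer,
    m.date_month          -- always the first day of the month
from invoices i
join months m
  on m.date_month >= date_trunc('month', i.period_start)
 and m.date_month <  date_trunc('month', i.period_end);
```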
Once we’ve done all of that pre-work, it’s time to create the data output we want to return. The only tricky parts of this query are the join conditions connecting the customer_dates CTE to the invoices CTE. These are incredibly important to get right; they’re what tie the entire query together.
Here’s the resulting output of that query:
You can see that every row in the table is a single month for a single customer, whether that customer is actually on a monthly or annual subscription. For annual subscriptions, payments are divided across the 12 months of the subscription.
The results contain when that period started and ended, the total amount paid, and binary flags determining whether this was a customer’s first or last payment. All of these fields will be the inputs to our subsequent analysis. For now, we’ll materialize this into our database as a view or table so that we can build on top of it.
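The first/last-payment flags fall out of window functions over the customer-month rows. Here’s a sketch of the materialized result (relation and column names are hypothetical; match them to your own model):

```sql
-- Materialize one row per customer per month, with first/last flags.
create view stripe.customer_mrr as
select
    customer,
    date_month,
    period_start,
    period_end,
    mrr_amount,
    (min(date_month) over (partition by customer) = date_month) as first_payment,
    (max(date_month) over (partition by customer) = date_month) as last_payment
from customer_months;   -- the per-customer, per-month rows built above
```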
Now that we’ve prepared our data into the format we need for subsequent analysis, it’s time to write the SQL that will calculate our MRR by month. Let’s start with the basics:
The first CTE seems strange, but it reflects a practice I’ve been using for a while now: I modularize my analytic SQL so that I only have to name a data table once. This makes potential table renaming extremely easy to deal with in the future, and Redshift’s query optimizer treats this CTE as a complete passthrough, so it’s costless.
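The pattern looks like this: a trivial CTE names the source table exactly once, and everything downstream refers to the alias (the source table name here assumes the materialization step above):

```sql
with mrr as (
    -- name the underlying table once; a rename is now a one-line change
    select * from stripe.customer_mrr
),
months as (
    -- every distinct month in which we have revenue to report
    select distinct date_month from mrr
)
select * from months;
```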
The second CTE creates a dataset of distinct months where we have some revenue to report.
Then things get interesting. The next thing we have to do is compare a customer’s payment with their payment in the prior month. This will allow us to tell whether it’s a flat renewal, upgrade, or downgrade in a subsequent query.
With heavy use of the lag function, we can easily pull forward the relevant information about the prior month.
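The core pattern is a sketch like this (the full query also needs rows for months with no payment so that churn shows up, which is what the date spine provides):

```sql
-- Pull each customer's prior-month payment alongside the current one.
select
    customer,
    date_month,
    mrr_amount,
    lag(date_month) over (partition by customer order by date_month) as previous_month,
    lag(mrr_amount) over (partition by customer order by date_month) as previous_amount
from mrr;
```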
Next, we calculate what category the revenue should go in: new, renewal, prepaid renewal (monthly revenue recognized from an annual prepaid subscription), upgrade, downgrade, or churn. This is straightforward to write now that we have all of the data we need; it’s a simple case statement. Take a look:
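A simplified version of that case statement might look like the following (distinguishing prepaid renewals additionally requires the plan’s billing interval, which is omitted here):

```sql
-- Categorize each customer-month's revenue change.
select
    customer,
    date_month,
    mrr_amount,
    case
        when previous_amount is null                then 'new'
        when mrr_amount = 0 and previous_amount > 0 then 'churn'
        when mrr_amount > previous_amount           then 'upgrade'
        when mrr_amount < previous_amount           then 'downgrade'
        else 'renewal'
    end as revenue_category
from mrr_with_lag;   -- hypothetical output of the lag step above
```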
Once we have each invoice categorized by type, we need to pull them out into their component parts so that we can later consolidate them by month. This involves a bunch of CTEs, one for each category of revenue.
The only somewhat unexpected part of this query is the fact that renewals, upgrades, and downgrades are all summed as a part of renewal revenue. This is because some component of upgrade and downgrade revenue is actually a renewal. Imagine the case where someone upgrades from a $5/month to a $50/month plan. This is $5 of renewal revenue and $45 of upgrade revenue. The inverse is true for downgrades. This math is important to maintain the MRR identity:
last month’s revenue + new + upgrades - downgrades - churn = this month’s revenue
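A sketch of the upgrade CTE makes the split concrete: only the delta over last month’s payment counts as upgrade revenue, while the prior amount is booked as renewal (downgrades mirror this with the sign flipped):

```sql
-- A customer moving from $5/month to $50/month contributes
-- $5 of renewal revenue and $45 of upgrade revenue.
with upgrades as (
    select
        date_month,
        sum(mrr_amount - previous_amount) as upgrade_revenue,
        sum(previous_amount)              as renewal_component
    from categorized    -- hypothetical output of the case-statement step
    where revenue_category = 'upgrade'
    group by 1
)
select * from upgrades;
```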
The final step is to join each of these CTEs together so that all of the revenue for a given month exists as a single row. The final SELECT is very straightforward:
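The shape of that final select is a series of left joins from the month spine to each category CTE, with coalesce guarding months where a category is empty (CTE and column names here are illustrative):

```sql
-- One row per month, one column per revenue category.
select
    m.date_month,
    coalesce(n.new_revenue, 0)       as new_revenue,
    coalesce(u.upgrade_revenue, 0)   as upgrades,
    coalesce(d.downgrade_revenue, 0) as downgrades,
    coalesce(c.churned_revenue, 0)   as churned,
    coalesce(r.renewal_revenue, 0)   as renewals
from months m
left join new_revenue n on n.date_month = m.date_month
left join upgrades u    on u.date_month = m.date_month
left join downgrades d  on d.date_month = m.date_month
left join churn c       on c.date_month = m.date_month
left join renewals r    on r.date_month = m.date_month
order by m.date_month;
```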
Here’s the resulting data:
And here’s what it looks like when you visualize it in Mode:
The code has a lot of lines, but is intentionally written to be modular and readable. Hopefully, in walking through it, you’ll agree: this code is written to be read, understood, and maintained over an extended period by a team of analysts.
Customizing this analysis for your business
There are several rules baked into this analysis for how revenue is measured. These may or may not be your preferred methods of accounting for these situations:
- This analysis counts someone as churned immediately upon the end of their most recent invoice. If someone misses a monthly payment, they are first counted as a churn and then, when they pay the following month, as a renewal. This is a very “pessimistic” reading of the situation.
- This analysis doesn’t take into account any potential service credits that may be granted. Service credits would look like a downgrade in this model.
- This analysis doesn’t take into account any one-time charges that may occur but are not renewing. For instance, a one-time charge for professional services setup would look, initially, like an upgrade and the following month look like a downgrade.
The goal with this post is to lay the groundwork for you to conduct your own subscription analysis on top of Stripe data, and we highly recommend that you customize it for the reality of your business.
Deploying this analysis for your business
Before you can run this analysis, you have to get your data into Amazon Redshift. As we’ve seen, that’s straightforward to do with RJMetrics Pipeline and, since Stripe is on the free tier, you can actually use Pipeline free forever to sync your Stripe data! Of course, you’re free to make that step of the process happen however suits you best.
Once your data is in Redshift, the easiest way to deploy this analysis is by using Analyst Collective’s dbt (data build tool). The analysis in this post is available as open source code here. dbt will build all of the necessary data models, configured for your schema, automatically. You can then run the analysis in your visualization tool of choice.