With 10MM companies using it to send over 600MM emails per day, Mailchimp is hard to avoid if you have an email account. And because email marketing continues to be the highest-converting marketing channel, it’s likely that Mailchimp is a part of your company’s marketing stack.

While it’s certainly possible to analyze Mailchimp data with its built-in reports or by downloading CSVs and slicing them in Excel, that’s not a great solution. There is incredibly rich information about your customers and their preferences locked up in your Mailchimp account, and to really dig in you’re going to need to use SQL.

The challenge in analyzing Mailchimp data is that you have to get at it first. That’s where tools like RJMetrics Pipeline come in. Pipeline connects to many popular SaaS tools and syncs your data to Amazon Redshift in near-real-time. Once your Mailchimp data is in Redshift, you can analyze it using SQL.

That’s where we’ll pick up the storyline. This post walks through the data modeling and analysis required to take your raw Mailchimp data in Redshift and produce insightful email analytics:

  • raw behavioral trends over time,
  • list membership adds and losses,
  • top email send days and times, and
  • most engaged members.

In the future, I plan on extending this analysis further, including measuring user-level product category affinity. For now, this should get you to a pretty good place.

Note: All of the SQL in this post is published as open source software under the Apache 2.0 license as a part of our project Analyst Collective, a collection of open source tools and analysis dedicated to improving analytic workflow. At the end of this post, I’ll share more about Analyst Collective, including tools to easily configure and deploy the data models and analysis from this post within your own Redshift cluster. If at any point you’d like to skip the walkthrough and just deploy the code, jump to the end.

Modeling Mailchimp data

Data about email marketing can be logged as a stream of events with one of five types:

  • Send
  • Open
  • Click
  • Unsubscribe
  • Bounce (hard / soft)

If you had a single table with a user_id, event_action, event_date, and campaign_id, that would be nearly all of what you need to analyze your email marketing data. Unfortunately, that’s not how Mailchimp returns data from its API. So the first thing we’re going to do is transform the data into exactly this schema in a series of data models.

This process of modeling is key to analysis, and is also one of the most often overlooked steps. Common data models provide an abstraction layer that can act as the reliable foundation from which to build subsequent analysis. Modeling forms one of the key concepts of the Analyst Collective viewpoint on the mature analytics workflow.

Each individual event isn’t hard to get; let’s run through them quickly.

Note: If you notice the {{env.schema}} mentions, those are callouts for the analytic schema you use. This is a configuration that is interpolated during the compilation process of the Analyst Collective deployment tool, dbt.
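The sends model itself is the simplest of the five. Here’s a minimal sketch of what it might look like; the raw table and column names are assumptions about how your ETL tool lands the data, so adjust them to match your own schema:

```sql
-- Hypothetical sketch: normalize raw send data into the
-- user_id / event_action / event_date / campaign_id schema.
-- The source table (campaign_email_sends) and its columns are
-- assumptions; adjust to match your raw Mailchimp data.
select
    sends.email_address as user_id,
    'send'              as event_action,
    sends.sent_at       as event_date,
    sends.campaign_id   as campaign_id
from "{{env.schema}}".campaign_email_sends as sends
```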


In order to get a sense of what this data looks like, here’s the output of this query on our test data:

output of mailchimp_sends.sql

The other models return nearly identical data, so I’m going to skip showing their outputs. You get the idea.
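To show just how similar the other models are, here’s a hypothetical sketch of an opens model built from a single raw activity table (again, the source table name is an assumption):

```sql
-- Hypothetical sketch: the opens model follows the same pattern,
-- filtering a raw activity table down to open events.
select
    activity.email_address as user_id,
    'open'                 as event_action,
    activity.occurred_at   as event_date,
    activity.campaign_id   as campaign_id
from "{{env.schema}}".email_activity as activity
where activity.action = 'open'
```

The click, unsubscribe, and bounce models swap in the corresponding action.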

Putting it all together

One of the most useful ways to view this data is summarized by each individual email sent to each individual person. Rather than a long stream of events, we’re going to join all of that data together into a structure that shows, for every email sent to every person, when it was sent, whether and when it was opened, whether and when it registered a click, and several other helpful calculations. This data structure will be used repeatedly in our subsequent analysis.

Because we’ve already prepared the underlying data, the logic in this query is very basic. In all of the CTEs, we’re collecting together the various events and calculating some basic statistics. In the final SELECT we’re joining the results together.
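A sketch of that shape, assuming the event models above have been materialized as mailchimp_sends, mailchimp_opens, and mailchimp_clicks (these names, and the exact statistics computed, are illustrative):

```sql
-- Hypothetical sketch: one row per (user, campaign) pair,
-- combining send, open, and click events.
with sends as (
    select user_id, campaign_id, min(event_date) as sent_at
    from "{{env.schema}}".mailchimp_sends
    group by 1, 2
),
opens as (
    select user_id, campaign_id,
           min(event_date) as first_opened_at,
           count(*)        as open_count
    from "{{env.schema}}".mailchimp_opens
    group by 1, 2
),
clicks as (
    select user_id, campaign_id,
           min(event_date) as first_clicked_at,
           count(*)        as click_count
    from "{{env.schema}}".mailchimp_clicks
    group by 1, 2
)
select
    sends.user_id,
    sends.campaign_id,
    sends.sent_at,
    opens.first_opened_at,
    coalesce(opens.open_count, 0)   as open_count,
    clicks.first_clicked_at,
    coalesce(clicks.click_count, 0) as click_count
from sends
left join opens  using (user_id, campaign_id)
left join clicks using (user_id, campaign_id)
```

Unsubscribe and bounce events join in the same way.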

The resulting data is very raw, but it’s extremely useful. You can think of every row in this dataset as being the answer to the question “What happened when I sent this email to this person?”. Here’s some sample output to give you a sense for how it looks:

output from mailchimp_email_summary.sql

This is all of the transformation we’re going to need to conduct our subsequent analysis. At this point, we need to materialize these queries as views or tables into our schema. This process of abstraction will result in clean, maintainable analytic code.

The easiest way to deploy these models to your Redshift cluster is our open source tool, dbt. dbt is a command-line tool that makes the process of building, deploying, and testing data models extremely streamlined. I use it as a core part of my analytic workflow today.

Analyzing Mailchimp data

Time to get to work analyzing and visualizing our data. Let’s start with a basic time series.

Time series analysis

Our first port of call should simply be to trend our raw email marketing performance over time. This will form the basis of our answer to the question “How is our email marketing performing?” We’ll write a single query to trend our core metrics—count of sends, opens, clicks, bounces, and unsubscribes and their corresponding rates—over time. And because of the modeling we’ve already done, this query is pretty damn simple.
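A sketch of the time series query, assuming the summary model is materialized as mailchimp_email_summary with sent_at / first_opened_at / first_clicked_at columns (the names are assumptions):

```sql
-- Hypothetical sketch: weekly sends, opens, clicks, and rates.
-- Bounce and unsubscribe counts can be added the same way.
select
    date_trunc('week', sent_at) as week,
    count(*)                    as sends,
    count(first_opened_at)      as opens,
    count(first_clicked_at)     as clicks,
    count(first_opened_at)::float  / count(*) as open_rate,
    count(first_clicked_at)::float / count(*) as click_through_rate
from "{{env.schema}}".mailchimp_email_summary
group by 1
order by 1
```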

Calculate a few fields, perform a few aggregations, and you’re done. Here’s one possible visualization of this data from Mode:

In my personal experience, increasing send volume will result in more total opens and clicks, but will depress open rate and click-through rate. It will also result in more unsubscribes. I highly recommend looking at these correlations in your own data—this will help you find a “sweet spot” for how many emails you should target sending in a given week or month.

List adds and losses

Most businesses consider email list membership growth to be a core marketing metric. Here’s how to measure your adds and losses over time.

In this query, we’re simply counting the number of list adds and list losses in a given month. The only slightly tricky element is that it’s possible to count someone as a loss twice (for example, if they both unsubscribe and bounce). The “losses” subquery makes sure each person is counted as a loss only once.
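A sketch of the adds-and-losses query. The source tables (mailchimp_members with a subscribed_at column, plus the unsubscribe and bounce event models) are assumptions:

```sql
-- Hypothetical sketch: monthly list adds and losses.
with adds as (
    select
        date_trunc('month', subscribed_at) as month,
        count(*) as adds
    from "{{env.schema}}".mailchimp_members
    group by 1
),
losses as (
    -- take each person's first unsubscribe or bounce so that
    -- nobody is counted as a loss more than once
    select user_id, min(event_date) as lost_at
    from (
        select user_id, event_date
        from "{{env.schema}}".mailchimp_unsubscribes
        union all
        select user_id, event_date
        from "{{env.schema}}".mailchimp_bounces
    ) as events
    group by 1
),
losses_by_month as (
    select date_trunc('month', lost_at) as month, count(*) as losses
    from losses
    group by 1
)
select
    coalesce(adds.month, losses_by_month.month) as month,
    coalesce(adds.adds, 0)                      as adds,
    coalesce(losses_by_month.losses, 0)         as losses
from adds
full outer join losses_by_month
    on losses_by_month.month = adds.month
order by 1
```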

Here’s how this visualization looks:

You can also use the same analysis to create other visualizations, including charting total members over time.

Top send days and times

If you’ve ever been in marketing, you know how important it is to optimize email send times. Sometimes the strangest times seem to work great—think Sunday at 5am—whereas what would seem to be the most obvious times (weekdays at 8am) are terrible. The core reason for this is that email, like all marketing channels, suffers from saturation. Your performance is directly associated with the behavior of other marketers who are vying for the attention of the same market segment. The only way to resolve this uncertainty is by testing and measuring performance.

Fortunately, because we’ve already normalized our email data, it’s trivial to answer this tried-and-true email marketing question. Take a look:
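A sketch of that query, again assuming the summary model is materialized as mailchimp_email_summary:

```sql
-- Hypothetical sketch: click-through rate by send day of week.
select
    date_part('dow', sent_at) as day_of_week,  -- 0 = Sunday in Redshift
    -- date_part('hour', sent_at) as hour_of_day,  -- swap in for time of day
    count(*) as sends,
    count(first_clicked_at)::float / count(*) as click_through_rate
from "{{env.schema}}".mailchimp_email_summary
group by 1
order by click_through_rate desc
```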

Just switch the final select statement to alternate between reporting on day of week vs. hour of day. Creating modified versions of this query is easy as well—you might want to look at specific five-minute periods during weekday afternoons, which would be a straightforward extension of the above.

Here’s how the above query gets visualized:

One interesting question we could try to answer would be looking at email marketing performance based on the time zone of the email recipient. All of the email data above is based on the time zone of the sender (you), but performance is based on the behavior of the recipient, not the sender. We can’t get this data from Mailchimp alone, but if we could join email marketing data to core customer data that analysis would be very straightforward. Another great reason to make sure you have your ETL process in good shape!

Most engaged subscribers

I run an email newsletter called the Data Science Roundup that curates the most interesting data science news of the week. When the Roundup was first getting started I did customer development for it: I found the most engaged subscribers and sent them a personal email asking for feedback.

Your most engaged customers are always an incredible treasure trove of information, whether you’re talking about ecommerce customers, mobile game players, or email subscribers. This simple query is all you need to find your most engaged 100 subscribers in Mailchimp:
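Here’s a sketch of that query, assuming the mailchimp_email_summary model described above:

```sql
-- Hypothetical sketch: top 100 subscribers by click-through rate,
-- among those who have opened at least 5 emails.
select
    user_id,
    count(*)                as emails_received,
    count(first_opened_at)  as emails_opened,
    count(first_clicked_at) as emails_clicked,
    count(first_clicked_at)::float / count(*) as click_through_rate
from "{{env.schema}}".mailchimp_email_summary
group by 1
having count(first_opened_at) >= 5
order by click_through_rate desc
limit 100
```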

There are some assumptions made in this query that you can and should customize:

  • It’s measuring engagement by click-through rate. You could alternately look at open rate.
  • It’s limiting the group to the top 100 subscribers.
  • It’s limiting the set to subscribers who have opened at least 5 emails.

Customizing this analysis for your business

The goal of this post is to share templates for how to analyze your Mailchimp data, not to tell you the single correct way to do it. You can install these models and run this analysis on top of your own Mailchimp data and it will work, but you will likely want to make changes appropriate for your business. Here are some ideas:

  • Segment this data! Perform the same analysis on subsets of customers to see how performance differs. And then think about customizing messaging.
  • Analyze A/B test results with explanatory / confounding variables.
  • Examine the performance of link clicks by destination URL.

There’s a lot more to be done here. I plan to return to this analysis in the future, but if you go deeper with this analysis in the meantime please pay it forward by contributing your work!

Deploying this analysis for your business

Before you can run this analysis, you have to get your data in Amazon Redshift. RJMetrics Pipeline is a great option for getting SaaS data into Redshift, but you’re free to make the ETL step of the process happen however suits you best.

Once your data is in Redshift, the easiest way to deploy this analysis is by using Analyst Collective’s dbt (data build tool). The analysis in this post is available as part of the Analyst Collective open source analytics repository, here.

dbt will build all of the necessary data models, configured for your schema, automatically. You can then run the analysis in your visualization tool of choice. Setup instructions for Analyst Collective can be found in our wiki, and you can learn more about the project and its mission here.

All Analyst Collective code is open source, and we’d love to see folks make improvements to these queries and contribute those improvements back to the community. With the proliferation of SQL-based analysis on top of common data schemas (such as Mailchimp), there is a huge opportunity for analysts to collaborate on code. I spent about two days building and testing these queries; I’d hate to see analysts at the 10,000 other Mailchimp customers do this work over and over again.