TIPS & TRICKS

Quick N’ Dirty: Retention

Inspired by a customer conversation, I recently posted a blog entry on funnels. This customer also asked about calculating retention. As it happens, retention is just a variation on the funnel concept. The main difference is that we add the subsearch concept.

So, first, let’s define retention in the way this customer defined it. For him, retention was defined as the percentage of this week’s users who also visited last week or the week before.

Let’s start with the first part of the question. Specifically: how many unique visitors did we see this week?

sourcetype=retention | stats dc(VisitorID) as this_week

retention_1Next, we ask the second part of the question: of those visitors, how many were here last week as well?

sourcetype=retention | stats dc(VisitorID) as this_week | 
appendcols [search sourcetype=retention [search earliest=-0w@w | table VisitorID] earliest=-1w@w latest=-0w@w | stats dc(VisitorID) as one_week_ago ] |

retention_2

This is where the subsearch concept comes into play. As we saw with funnels, that appendcols function allows you to add another completely independent search to the search results. The subsearch part (“[search earliest=-0w@w | table VisitorID]”) is telling Splunk: “limit these results to events where VisitorID is present in this week’s data”.

And, now that you’ve done one week ago (“earliest=-1w@w latest=-0w@w”), you can do any time frame you like:

sourcetype=retention | stats dc(VisitorID) as this_week| 
appendcols [search sourcetype=retention [search earliest=-0w@w | table VisitorID] earliest=-1w@w latest=-0w@w | stats dc(VisitorID) as one_week_ago ] |
appendcols [search sourcetype=retention [search earliest=-0w@w | table VisitorID] earliest=-2w@w latest=-1w@w | stats dc(VisitorID) as two_weeks_ago] |

retention_3

I also like to clean things up and make them pretty:

sourcetype=retention | stats dc(VisitorID) as this_week| 
appendcols [search sourcetype=retention [search earliest=-0w@w | table VisitorID] earliest=-1w@w latest=-0w@w | stats dc(VisitorID) as one_week_ago ] |
appendcols [search sourcetype=retention [search earliest=-0w@w | table VisitorID] earliest=-2w@w latest=-1w@w | stats dc(VisitorID) as two_weeks_ago] |
eval one_week_ago = ((one_week_ago/this_week) * 100) . "%" |
eval two_weeks_ago = ((two_weeks_ago/this_week) * 100) . "%" |
rename this_week AS "This Week" one_week_ago AS "Last week" two_weeks_ago as "Two Weeks Ago" |
transpose

retention_4A Word on Optimization

Finally, a word on optimization. Using appendcols and subsearches are nifty, but they’re hungry searches and may be prohibitively slow with large volumes of data. My suggestion? Explore the summary index concept. For example, you could create a search that lists the unique visitor IDs:

sourcetype="weblogs" earliest=-1d@d latest=-0d@d | sistats dc(VisitorID)

Save it as a summary index named “daily_visitors”, and run it every morning at, like 2am.

Then, instead of querying raw data (“sourcetype=retention”), you would query the summary index (“index=summary search_name=”daily_visitors”). This would dramatically improve the performance of the retention search.

Thanks,
-S.

Sondra Russell
Posted by

Sondra Russell

Join the Discussion