PUBLIC SECTOR

Using Splunk for Public Transit Advocacy

In some of my free time, I volunteer for, and am a board member of, the non-profit Market Street Railway, which advocates for improved and expanded service of historic streetcars in San Francisco. In particular, Market Street Railway has been advocating for improved and expanded service on the relatively new E-Embarcadero line. When doing any advocacy, one needs to know the current status as a baseline. In terms of service, probably the most important metric is a line's headway: the time between successive streetcars at a given stop.

But how can one obtain that?

It turns out that SFMTA contracts with NextBus to provide vehicle arrival time predictions. Furthermore, NextBus provides a publicly accessible API to query real-time vehicle data. Using this API, vehicle data can be collected and analyzed to gauge service.

Available Data

NextBus, via their public API, provides the following data:

  • Agency list
  • Route list (for a particular agency)
  • Route configuration (for a particular route, the list of stops and—for each stop—its title, direction, latitude, longitude and stop ID)
  • Schedule (for a particular route)
  • Vehicle Locations (for a particular route, the ID, direction, speed, latitude, longitude and last report time for vehicles on that route)
  • Vehicle Predictions (for a particular stop)

For our purpose, the Route Configuration and Vehicle Locations data are the most useful.

Desired Statistics

Given the available data, the statistics that are most interesting include:

  • Which stops are the most used?
  • What are the average and maximum headways?
  • What are the average and maximum layovers?
  • How do the headways change over the course of a day?

However, NextBus does not provide these statistics directly, so they cannot be collected as-is. Instead, we need a program that polls NextBus for the data it does provide, and that data must then be manipulated to derive the desired statistics.

Desired Data

Based on the desired statistics, we want data that looks like:

2016-11-19 18:20:02 GMT vehicle_id="1007" vehicle_route="E" vehicle_speed=0 vehicle_distance=182 stop_tag="5184"
2016-11-19 18:20:31 GMT vehicle_id="1009" vehicle_route="E" vehicle_speed=21 vehicle_distance=116 stop_tag="5237"
2016-11-19 18:20:54 GMT vehicle_id="1011" vehicle_route="E" vehicle_speed=7 vehicle_distance=109 stop_tag="5240"
2016-11-19 18:21:31 GMT vehicle_id="1007" vehicle_route="E" vehicle_speed=0 vehicle_distance=182 stop_tag="5184"
2016-11-19 18:21:55 GMT vehicle_id="1009" vehicle_route="E" vehicle_speed=28 vehicle_distance=63 stop_tag="7145"

where:

  • vehicle_id: Unique vehicle identifier (the streetcar number).
  • vehicle_route: The route the streetcar is on (even though we’re currently only interested in the E-Embarcadero line, we might want other lines in the future, e.g., the F-Market & Wharves line—the first historic transit line in San Francisco).
  • vehicle_speed: The streetcar’s speed.
  • vehicle_distance: The distance the streetcar is from its closest stop.
  • stop_tag: The identifier of the closest stop.
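
For concreteness, a line in this format could be produced like so. This is a hypothetical helper, not part of nb-vstop; the field values are taken from the first sample line above:

```python
import time

def format_event(t, vehicle_id, route, speed, distance, stop_tag):
    """Format one vehicle/stop observation in the log format shown above.

    `t` is a Unix timestamp (seconds). The field names match the bullet
    list above, but this helper itself is illustrative only.
    """
    stamp = time.strftime("%Y-%m-%d %H:%M:%S GMT", time.gmtime(t))
    return (f'{stamp} vehicle_id="{vehicle_id}" vehicle_route="{route}" '
            f'vehicle_speed={speed} vehicle_distance={distance} '
            f'stop_tag="{stop_tag}"')

print(format_event(1479579602, "1007", "E", 0, 182, "5184"))
```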

However, the NextBus API “vehicle locations” command (vehicleLocations):

http://webservices.nextbus.com/service/publicXMLFeed?command=vehicleLocations&a=sf-muni&r=E&t=1144953500233

returns data of the form (non-essential XML attributes elided):

<?xml version="1.0" encoding="utf-8" ?>
<body copyright="All data copyright San Francisco Muni 2017.">
  <vehicle id="1010" routeTag="E" dirTag="E____O_S10" lat="37.77648" lon="-122.39378" secsSinceReport="11" speedKmHr="0"/>
  <vehicle id="1008" routeTag="E" dirTag="E____O_F00" lat="37.80737" lon="-122.41493" secsSinceReport="3" speedKmHr="25"/>
  <vehicle id="1007" routeTag="E" dirTag="E____I_F00" lat="37.79878" lon="-122.39733" secsSinceReport="18" speedKmHr="27"/>
  <vehicle id="1006" routeTag="E" dirTag="E____O_F00" lat="37.8003" lon="-122.39864" secsSinceReport="3" speedKmHr="24"/>
  <lastTime time="1500572570930"/>
</body>

Hence, we have to map the latitude & longitude of the streetcar to its closest stop. The NextBus API “route configuration” command (routeConfig):

http://webservices.nextbus.com/service/publicXMLFeed?command=routeConfig&a=sf-muni&r=E

returns XML data of the form (non-essential XML attributes elided):

<?xml version="1.0" encoding="utf-8"?>
  <body copyright="All data copyright San Francisco Muni 2017.">
  <route tag="E" title="E-Embarcadero">
    <stop tag="5184" title="Jones St &amp; Beach St" lat="37.8071299" lon="-122.41732" stopId="15184"/>
    <stop tag="3092" title="Beach St &amp; Mason St" lat="37.8074099" lon="-122.41412" stopId="13092"/>
    ...
    <direction tag="E____O_F00" title="Outbound to Mission Bay" name="Outbound">
      <stop tag="5184" />
      <stop tag="3092" />
      ...
      <stop tag="5241" />
    </direction>
    <direction tag="E____I_F00" title="Inbound to Fisherman&apos;s Wharf" name="Inbound">
      <stop tag="5240" />
      <stop tag="5237" />
      ...
      <stop tag="35184" />
    </direction>
  </route>
</body>

That is:

  1. For the route of interest, every stop’s information including its latitude & longitude
  2. The set of stops for each direction*
*In San Francisco, inbound means toward downtown (or northbound for cross-town routes) and outbound means away from downtown (or southbound for cross-town routes)
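
As an illustrative sketch (not the actual nb-vstop code), the routeConfig response shown above can be parsed with Python's standard ElementTree into exactly those two pieces:

```python
import xml.etree.ElementTree as ET

# Abbreviated routeConfig response, using only the attributes shown above.
ROUTE_CONFIG = """<?xml version="1.0" encoding="utf-8"?>
<body>
  <route tag="E" title="E-Embarcadero">
    <stop tag="5184" title="Jones St &amp; Beach St" lat="37.8071299" lon="-122.41732" stopId="15184"/>
    <stop tag="3092" title="Beach St &amp; Mason St" lat="37.8074099" lon="-122.41412" stopId="13092"/>
    <direction tag="E____O_F00" title="Outbound to Mission Bay" name="Outbound">
      <stop tag="5184"/>
      <stop tag="3092"/>
    </direction>
  </route>
</body>"""

def parse_route_config(xml_text):
    """Return (stops, directions): stop tag -> (lat, lon), and
    direction tag -> ordered list of stop tags."""
    root = ET.fromstring(xml_text)
    stops, directions = {}, {}
    for stop in root.iter("stop"):
        if "lat" in stop.attrib:  # full stop definitions carry coordinates
            stops[stop.get("tag")] = (float(stop.get("lat")), float(stop.get("lon")))
    for d in root.iter("direction"):
        directions[d.get("tag")] = [s.get("tag") for s in d.findall("stop")]
    return stops, directions
```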

Using a small program (nb-vstop, written in Python), we can use this data and the haversine formula to calculate the closest stop to a streetcar’s location. Note that when calculating the closest stop, we have to match the streetcar’s direction (dirTag) against the stop’s direction since, in many cases, the inbound and outbound stops are physically separated and therefore have different coordinates.
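
A minimal sketch of this calculation, using hypothetical helpers rather than nb-vstop itself: the haversine formula gives great-circle distance, and the closest stop is simply the minimum over the stops for the vehicle's direction.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6371000  # mean Earth radius, in meters

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

def closest_stop(v_lat, v_lon, stops):
    """Return (stop_tag, distance_m) of the nearest stop.

    `stops` maps a stop tag to (lat, lon) and should contain only the
    stops matching the vehicle's direction (dirTag), as described above.
    """
    return min(
        ((tag, haversine(v_lat, v_lon, s_lat, s_lon))
         for tag, (s_lat, s_lon) in stops.items()),
        key=lambda pair: pair[1],
    )
```

For example, feeding in vehicle 1008's coordinates from the vehicleLocations sample above against the first two outbound stops correctly picks Beach St & Mason St (tag 3092), roughly 70 meters away.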

The same program polls NextBus periodically (currently every 20 seconds) to obtain up-to-date vehicle location data. On every poll, NextBus returns the set of vehicles that have reported updated information since the previous poll. Each vehicle’s location is then compared against every stop’s location to find its closest stop. This vehicle/stop pair is logged as having occurred at the time of the most recent GPS location reported by the vehicle. This data can then be imported into Splunk.
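
The polling loop just described might be sketched as follows. This is illustrative, not the real nb-vstop: urllib and the parse_vehicles helper are assumptions. The lastTime value from each response is fed back as the t parameter of the next request, so NextBus returns only vehicles updated since then:

```python
import time
import urllib.request
import xml.etree.ElementTree as ET

FEED = "http://webservices.nextbus.com/service/publicXMLFeed"

def parse_vehicles(xml_text):
    """Parse a vehicleLocations response into a list of
    (id, dirTag, lat, lon, secsSinceReport) tuples plus the lastTime value."""
    root = ET.fromstring(xml_text)
    vehicles = [
        (v.get("id"), v.get("dirTag"),
         float(v.get("lat")), float(v.get("lon")),
         int(v.get("secsSinceReport")))
        for v in root.iter("vehicle")
    ]
    last = root.find("lastTime")
    return vehicles, (int(last.get("time")) if last is not None else 0)

def poll(agency="sf-muni", route="E", interval=20):
    """Poll vehicleLocations every `interval` seconds; t=0 on the first poll."""
    t = 0
    while True:
        url = f"{FEED}?command=vehicleLocations&a={agency}&r={route}&t={t}"
        with urllib.request.urlopen(url) as resp:
            vehicles, t = parse_vehicles(resp.read().decode("utf-8"))
        for vid, dir_tag, lat, lon, age in vehicles:
            print(vid, dir_tag, lat, lon, age)  # here: find closest stop and log it
        time.sleep(interval)
```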

Enriching Data

To get the desired statistics, it’s helpful to enrich the data. For example:

  • Given a stop_tag, add the stop’s direction: inbound or outbound.
  • Given a vehicle_id, add the streetcar’s type, number of ends (some streetcars are single-ended and some are double-ended) and—just for fun—the year of manufacture, the agency at which the streetcar originated (many streetcars in San Francisco’s historic fleet are from elsewhere) and the livery (paint scheme) it’s in.

This can be accomplished in Splunk via CSV look-up tables.

The nb-vstop program can dump the NextBus API routeConfig XML response as a CSV file:

stop_tag,stop_title,stop_id,stop_direction,stop_lat,stop_lon
"5184","Jones St & Beach St","15184","Outbound",37.807130,-122.417320
"3092","Beach St & Mason St","13092","Outbound",37.807410,-122.414120
"3095","Beach St & Stockton St","13095","Outbound",37.807840,-122.410810
...

This CSV file goes in $SPLUNK_ETC/apps/search/lookups/NextBus_stops.csv.

To tell Splunk to use these look-up files, they have to be configured in $SPLUNK_ETC/apps/search/local/props.conf:

[NextBus]
LOOKUP-fleet = Historic_fleet vehicle_id
LOOKUP-stops = NextBus_stops stop_tag

where:

  • The LOOKUP-fleet line says to use the Historic_fleet transform with vehicle_id as the key to look up.
  • The LOOKUP-stops line says to use the NextBus_stops transform with stop_tag as the key to look up.

Now we have to map those transform names to the CSV files via $SPLUNK_ETC/apps/search/local/transforms.conf:

[Historic_fleet]
filename = Historic_fleet.csv
max_matches = 1
min_matches = 1
default_match = bus

[NextBus_stops]
filename = NextBus_stops.csv
max_matches = 1
min_matches = 1

where:

  • max_matches and min_matches specify to match exactly one row.
  • default_match specifies a value to use when no match is found. (For various reasons, the SFMTA occasionally substitutes buses for streetcars on the F-Market & Wharves line. The vehicle_ids of buses aren’t in the Historic_fleet.csv file, so no row will match. For our purposes, knowing only that a particular vehicle is a bus is sufficient. FYI: for technical reasons, buses cannot be substituted on the E-Embarcadero line.)

Given all this, data in Splunk now looks like:

Filtering Data

The event data in its entirety is more than we need. What we want is to filter the log entries such that, for each vehicle/stop pair, we retain only those entries where the vehicle/stop distance is the least. For example, given:

1| 2016-11-19 18:20:02 GMT vehicle_id="1007" vehicle_route="E" vehicle_speed=0 vehicle_distance=23 stop_tag="5184"
2| 2016-11-19 18:20:31 GMT vehicle_id="1009" vehicle_route="E" vehicle_speed=21 vehicle_distance=116 stop_tag="5237"
3| 2016-11-19 18:20:54 GMT vehicle_id="1011" vehicle_route="E" vehicle_speed=7 vehicle_distance=109 stop_tag="5240"
4| 2016-11-19 18:21:31 GMT vehicle_id="1007" vehicle_route="E" vehicle_speed=5 vehicle_distance=182 stop_tag="5184"
5| 2016-11-19 18:21:55 GMT vehicle_id="1009" vehicle_route="E" vehicle_speed=28 vehicle_distance=63 stop_tag="5237"

we want to filter out lines 2 and 4:

  • Filter out line 2 because both lines 2 and 5 refer to streetcar 1009, but, of those two lines, line 5 has a smaller vehicle_distance (63) compared to line 2 (116).
  • Filter out line 4 because both lines 1 and 4 refer to streetcar 1007, but, of those two lines, line 1 has a smaller vehicle_distance (23) compared to line 4 (182).

resulting in:

2016-11-19 18:20:02 GMT vehicle_id="1007" vehicle_route="E" vehicle_speed=0 vehicle_distance=23 stop_tag="5184"
2016-11-19 18:20:54 GMT vehicle_id="1011" vehicle_route="E" vehicle_speed=7 vehicle_distance=109 stop_tag="5240"
2016-11-19 18:21:55 GMT vehicle_id="1009" vehicle_route="E" vehicle_speed=28 vehicle_distance=63 stop_tag="5237"

Unfortunately, there’s (currently) no built-in way to get this result. Fortunately, Splunk allows you to write custom search commands.

Custom Search Commands

Using the Splunk SDK for Python, we can write a custom search command. The algorithm is:

For each vehicle, look at its stop:

  • Same stop?
    • Yes: Distance greater than previous?
      • Yes: Discard log entry
      • No: Update to lesser distance
    • No: Output log entry having least distance

Here's the code:

#! /usr/bin/env python
import operator
import sys
import time

from splunklib.searchcommands import dispatch, Configuration, EventingCommand
from splunklib.searchcommands.decorators import ConfigurationSetting

# Define constants for dictionary keys to avoid typos.
K_STAG  = 'stop_tag'
K_TIME  = '_time'
K_VDIST = 'vehicle_distance'
K_VID   = 'vehicle_id'

def date( t ):
    return time.strftime( '%Y%m%d', time.localtime( int( t ) ) )

@Configuration()
class NextBusClosestStop( EventingCommand ):
    """ Filters NextBus records to only those where a vehicle is at its
        closest distance to a given stop.

    ##Syntax
    .. code-block::
        nbclosest

    ##Description
    The :code:`nbclosest` command filters NextBus records to only those where a
    vehicle is at its closest distance to a given stop.
    """

    class ConfigurationSettings( EventingCommand.ConfigurationSettings ):
        required_fields = ConfigurationSetting( value=[ K_TIME, K_VID, K_VDIST, K_STAG ] )

    def __init__( self ):
        super( NextBusClosestStop, self ).__init__()
        self.old_date = ''
        self.vdict = { }

    def drain( self ):
        """ When we’re out of records (or the date changes), “drain” our backlog dictionary. """
        recs = self.vdict.values()
        for rec in sorted( recs, key=operator.itemgetter( K_TIME ) ):
            yield rec
        self.vdict.clear()

    def transform( self, records ):
        for rec in records:
            if not ( K_TIME in rec and K_VID in rec and K_VDIST in rec and K_STAG in rec ):
                continue

            vid = rec[ K_VID ]
            if not vid:
                continue

            new_stop = rec[ K_STAG ]
            if not new_stop:
                continue

            new_date = date( rec[ K_TIME ] )
            if new_date != self.old_date:   # date changed: “drain” our backlog dictionary
                for old_rec in self.drain():
                    yield old_rec
                self.old_date = new_date

            if vid in self.vdict:
                old_rec = self.vdict[ vid ]
                old_stop = old_rec[ K_STAG ]
                if new_stop == old_stop:    # for current vehicle, same stop?
                    old_dist = int( old_rec[ K_VDIST ] )
                    new_dist = int( rec[ K_VDIST ] )
                    if new_dist > old_dist: # distance greater?
                        continue            # yes: discard log entry
                else:
                    yield old_rec           # output log entry having least distance
            self.vdict[ vid ] = rec

        # no more log entries: “drain” whatever's left in the dictionary
        for rec in self.drain():
            yield rec

dispatch( NextBusClosestStop, sys.argv, sys.stdin, sys.stdout, __name__ )

This program goes in $SPLUNK_ETC/apps/search/bin/nbclosest.py.

To tell Splunk to use this custom search command, it has to be configured in $SPLUNK_ETC/apps/search/local/commands.conf:

[nbclosest]
filename = nbclosest.py
chunked = true

where chunked=true tells Splunk to use the newer chunked protocol.

Analyzing Data

Average Time Between Boardings by Stop
To find out which stops are the most used, we can write an SPL query like:

1| index="muni" vehicle_route="E" | nbclosest |
2|   eval date=strftime(_time,"%Y-%m-%d") |
3|   streamstats window=2 global=f range(_time) as tbb by stop_tag date |
4|   outlier action=remove tbb |
5|   chart eval(round(avg(tbb)/60,0)) as atbb by stop_title

where we:

  1. Filter log entries through our custom nbclosest search command (line 1).
  2. Set a date variable to the event’s date (line 2) so we can later group-by date.
  3. Use the streamstats command (line 3) to generate time-between-boardings (tbb) by calculating statistics in a streaming manner where:
    • window=2: using two events at a time (current and previous).
    • global=f: keep a separate window for each group (here, each stop_tag and date) rather than one global window.
    • range(_time): the time between the two events in the window.
  4. Use the outlier command (line 4) to remove outliers from tbb.**
  5. Create a chart (line 5) of the average-time-between boardings (atbb) by stop.
**It’s necessary to remove outliers because sometimes the GPS units on board the streetcars are “flaky” or a streetcar goes out of service. In such cases, the end time for a two-event pair either appears much later in the day or not at all. If such outliers were not removed, they would drastically skew the results.

This yields a chart like:

This chart shows (for example) that:

  • Pier 39 is the most frequently used stop (with an average time between boardings of 21 minutes)
  • King St & 2nd St is the least frequently used stop (with an average time between boardings of 35 minutes)
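
For intuition, the windowed time-between-boardings computation and the outlier removal can be sketched in plain Python. This is illustrative only: Splunk’s streamstats and outlier commands have their own semantics, and the IQR-based cutoff below only loosely mirrors outlier’s behavior.

```python
from statistics import quantiles

def time_between_boardings(times):
    """Pairwise gaps between consecutive event times within one stop/date
    group -- the effect of streamstats window=2 range(_time)."""
    return [b - a for a, b in zip(times, times[1:])]

def remove_outliers(values, k=2.5):
    """Drop values above Q3 + k*IQR: an IQR-style cutoff, loosely mirroring
    Splunk's `outlier` command (whose exact defaults differ).
    Requires at least two values."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    upper = q3 + k * (q3 - q1)
    return [v for v in values if v <= upper]

# Event times (seconds) at one stop on one date; the 7200 s gap is a
# "streetcar went out of service" outlier that would skew the average.
times = [0, 1200, 2500, 3700, 10900, 12100]
tbb = time_between_boardings(times)   # [1200, 1300, 1200, 7200, 1200]
kept = remove_outliers(tbb)           # the 7200 s gap is dropped
avg = sum(kept) / len(kept)           # 1225.0 s, i.e., about 20 minutes
```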

Average & Maximum Headways by Date
To find the average and maximum headways by date, we can write an SPL query like:

1| index="muni" vehicle_route="E"| nbclosest | where stop_tag=5240 |
2|   eval date=strftime(_time,"%Y-%m-%d") |
3|   streamstats window=2 global=f range(_time) as headway by date |
4|  eval headway=headway/60 | outlier action=remove headway |
5|  timechart span=1d eval(round(max(headway),0)) as "Max. Headway"
6|                    eval(round(avg(headway),0)) as "Avg. Headway"

where we:

  1. Filter log entries through our custom nbclosest search command (line 1)
  2. Limit the data to a representative single stop on the line (5240: 4th & King St, inbound: the start of the line) using a where clause (line 1)
  3. Set a date variable to the event’s date (line 2) so we can later group-by date
  4. Use the streamstats command (line 3) as before
  5. Use the outlier command (line 4) as before to remove outliers
  6. Create a timechart of the maximum & average headways by date (lines 5 & 6)

This yields a time-chart like:

This chart shows (for example) that:

  • Overall, the average headway is in the mid-20-minute range
  • The maximum headways are terrible
  • Saturday, April 15, was a particularly bad day, with a 51-minute average and a 110-minute maximum headway

Average Headway by Date By Hour
To find the average headway by date by hour, we can write an SPL query like:

1| index="muni" vehicle_route="E"| nbclosest | where stop_tag=5240 |
2|   eval date=strftime(_time,"%Y-%m-%d") |
3|   streamstats window=2 global=f earliest(_time) as start latest(_time) as end by date |
4|   eval headway=(end-start)/60, hour=strftime(end,"%H") |
5|   timechart eval(round(avg(headway),0)) by hour fixedrange=false

where we:

  1. Filter log entries through our custom nbclosest search command (line 1).
  2. Limit the data to a representative single stop on the line (5240: 4th & King St, inbound) using a where clause (line 1)
  3. Set a date variable to the event’s date (line 2) so we can later group-by date
  4. Use the streamstats command similarly to before (line 3), but distinguish the start and end times so we can chart by the hour of the end time
  5. Create a timechart of the average headway by hour (line 5)

This yields a time-chart like:


This chart shows (for example) that:

  • On average, the headway is highly variable during the course of a day
  • On Saturday, April 22, the headways were particularly bad

Average & Maximum Layover by Date
Other useful statistics are the average and maximum layover times. To find said times, we can write an SPL query like:

 1| index="muni" vehicle_route="E"| nbclosest |
 2|   eval date=strftime(_time, "%Y-%m-%d") |
 3|   streamstats window=2 global=f
 4|     earliest(_time) as start latest(_time) as end
 5|     earliest(stop_direction) as dir1 latest(stop_direction) as dir2
 6|     by vehicle_id date |
 7|   where dir1="Outbound" and dir2="Inbound" |
 8|   eval layover=(end-start)/60 | outlier action=remove layover |
 9|   timechart span=1d eval(round(max(layover),0)) as "Max. Layover"
10|                     eval(round(avg(layover),0)) as "Avg. Layover"
11|                     fixedrange=false

where we:

  • Filter log entries through our custom nbclosest search command (line 1)
  • Set a date variable to the event’s date (line 2) so we can later group-by date
  • Use the streamstats command similarly to the previous case (lines 3–6), but also pair up the stop directions (line 5) and look for when the direction changes from outbound to inbound using a where clause (line 7), i.e., when the layover occurs at the end of the line
  • Calculate the layover (line 8) and use the outlier command (line 8) as before to remove outliers
  • Create a timechart of the maximum & average layovers by date (line 9)

This yields a time-chart like:

This chart shows (for example) that:

  • On average, a layover lasts between 8 and 12 minutes
  • The maximum layover ranges between 16 and 33 minutes

Takeaways

Armed with such data and charts as a baseline, you have powerful tools for advocating for improved service. To achieve this:

  • Use look-ups to enrich your event data
  • Implement custom search commands to filter events in novel ways
  • Use where clauses to filter events some more
  • Use streamstats to calculate statistics of a “streaming window” of data
  • Use chart to present statistical data
  • Use timechart to present time-based statistical data
Posted by Paul Lucas

Paul J. Lucas started at Bell Labs in telephony, log file visualization, and testing cfront, the original C++ compiler. He is the author of “The C++ Programmer’s Handbook” and holds patents on data visualization and programming language type systems. He holds a Masters Degree in Computer Science from the University of Illinois at Urbana-Champaign. In some of his free time, he volunteers for the non-profit Market Street Railway supporting historic transit in San Francisco.
