
It’s time for a little Indexing 101. If you look in the directory where your Splunk datastore resides (default location /opt/splunk/var/lib/splunk), you will find a directory called fishbucket. This index is not really intended for normal humans to investigate; it is mostly there for Splunk engineers trying to decipher file input issues. It contains seek pointers and CRCs for the files you are indexing, so splunkd can tell whether it has already read them. To see what’s there, try searching for “index=_thefishbucket”. Events look something like this:
48a304b3 initcrc::5f66db978a1ff3a3 seekcrc::bc96de428cc0b5e6 seekptr::414063 modtime::1218643123 filename::/var/log/apache2/feorlen_org_access_log source::/var/log/apache2/feorlen_org_access_log
The fields are:
the timestamp of the event (epoch time, in hex)
initcrc: the CRC of the first 256 bytes of the file
seekcrc: the CRC of the 256 bytes where we were last reading
seekptr: the seek pointer for where we are in the file
modtime: the time the file last changed (epoch time)
filename: the full path to the file
source: the full path to the source, which is usually the same as the file but could be the archive the file came from
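To make the layout concrete, here is a minimal Python sketch that splits one of these events into its fields. The function name parse_fishbucket_event is made up for illustration, and two things are assumptions rather than documented facts: that seekptr is written in hex (the sample values contain digits like "a"), and that paths contain no spaces.

```python
def parse_fishbucket_event(line: str) -> dict:
    """Split one fishbucket event into a dict of fields (illustrative only)."""
    tokens = line.split()
    # First token: the event timestamp, epoch seconds written in hex.
    event = {"timestamp": int(tokens[0], 16)}
    for token in tokens[1:]:
        key, sep, value = token.partition("::")
        if not sep:
            continue  # not a key::value token; ignore it
        event[key] = value
    # Assumption: seekptr is hex, because sample values contain a-f digits.
    event["seekptr"] = int(event["seekptr"], 16)
    # modtime shows up as decimal epoch seconds in the samples.
    event["modtime"] = int(event["modtime"])
    return event


sample = ("48a304b3 initcrc::5f66db978a1ff3a3 seekcrc::bc96de428cc0b5e6 "
          "seekptr::414063 modtime::1218643123 "
          "filename::/var/log/apache2/feorlen_org_access_log "
          "source::/var/log/apache2/feorlen_org_access_log")
event = parse_fishbucket_event(sample)
print(event["timestamp"], event["seekptr"], event["filename"])
```

For this sample, the hex timestamp 48a304b3 decodes to the same value as modtime (1218643123).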
When the file monitor processor looks at a file, it searches the fishbucket to see if the CRC from the beginning of the file is already there. If not, the file is indexed as new. If yes, then we check the CRC of where we were reading against the saved value in seekcrc. If it matches and the file is longer than the saved seek pointer, then there is new stuff at the end to read. If the top of the file matches but the seekcrc doesn’t, or the seek pointer is beyond the current end of the file, then something in the part we have already read has changed. Since we don’t know what might have changed, we just index the whole thing again. (You can control this: see CHECK_METHOD in props.conf.spec.)
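The decision described above can be sketched in a few lines of Python. This is not splunkd's code, just a toy model of the logic: zlib.crc32 stands in for whatever CRC Splunk really uses, the fishbucket is a plain dict keyed by the initial CRC, the seek CRC is assumed to cover the 256 bytes ending at the seek pointer, and the function names and return strings are invented for the example. The "nothing_new" case (file unchanged since the last read) is implied by the description rather than stated.

```python
import zlib

WINDOW = 256  # bytes covered by each CRC, per the field descriptions


def crc(data: bytes) -> int:
    # Stand-in checksum: zlib.crc32 is NOT the CRC splunkd uses.
    return zlib.crc32(data)


def seek_window(data: bytes, seekptr: int) -> bytes:
    # Assumption: the seek CRC covers the (up to) 256 bytes ending at seekptr.
    return data[max(0, seekptr - WINDOW):seekptr]


def remember(fishbucket: dict, data: bytes) -> None:
    """Record that the whole of `data` has been read."""
    fishbucket[crc(data[:WINDOW])] = {
        "seekcrc": crc(seek_window(data, len(data))),
        "seekptr": len(data),
    }


def decide(fishbucket: dict, data: bytes) -> str:
    """Return what the monitor would do with a file whose content is `data`."""
    record = fishbucket.get(crc(data[:WINDOW]))
    if record is None:
        return "index_as_new"      # initial CRC never seen before
    seekptr = record["seekptr"]
    if seekptr > len(data):
        return "reindex_all"       # file is shorter than where we stopped
    if crc(seek_window(data, seekptr)) != record["seekcrc"]:
        return "reindex_all"       # already-read content has changed
    if len(data) > seekptr:
        return "read_new_tail"     # new data appended after seekptr
    return "nothing_new"           # implied case: nothing has changed


fb = {}
original = b"A" * 300 + b"line one\n"
print(decide(fb, original))                  # file not seen yet
remember(fb, original)
print(decide(fb, original + b"line two\n"))  # same file, grown at the end
```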
If you want to track what is happening with a particular file, you can search for all the events in the fishbucket associated with it by the file or source name (like source::/var/log/apache2/feorlen_org_access_log). If you check the seekptr and the modtime, they should only ever increase with time. (Note that events are returned most recent first, so this list is newest to oldest.)
48a3084d initcrc::5f66db978a1ff3a3 seekcrc::3e746e9f66897965 seekptr::414a40 modtime::1218644042 filename::/var/log/apache2/feorlen_org_access_log source::/var/log/apache2/feorlen_org_access_log
48a307d9 initcrc::5f66db978a1ff3a3 seekcrc::77f6d8313fc689ba seekptr::41419b modtime::1218643929 filename::/var/log/apache2/feorlen_org_access_log source::/var/log/apache2/feorlen_org_access_log
48a3062e initcrc::5f66db978a1ff3a3 seekcrc::2cc30b86b37c646 seekptr::4140fc modtime::1218643502 filename::/var/log/apache2/feorlen_org_access_log source::/var/log/apache2/feorlen_org_access_log
48a304b3 initcrc::5f66db978a1ff3a3 seekcrc::bc96de428cc0b5e6 seekptr::414063 modtime::1218643123 filename::/var/log/apache2/feorlen_org_access_log source::/var/log/apache2/feorlen_org_access_log
48a300d3 initcrc::5f66db978a1ff3a3 seekcrc::8db2f52ef6f75c91 seekptr::413fa4 modtime::1218642130 filename::/var/log/apache2/feorlen_org_access_log source::/var/log/apache2/feorlen_org_access_log
48a2fc7a initcrc::5f66db978a1ff3a3 seekcrc::881375418e194bd5 seekptr::413f06 modtime::1218640999 filename::/var/log/apache2/feorlen_org_access_log source::/var/log/apache2/feorlen_org_access_log
48a2f996 initcrc::5f66db978a1ff3a3 seekcrc::c596371ec4c573d4 seekptr::413e6c modtime::1218640260 filename::/var/log/apache2/feorlen_org_access_log source::/var/log/apache2/feorlen_org_access_log
48a2f80c initcrc::5f66db978a1ff3a3 seekcrc::2e686cf0dd2f62bb seekptr::413dce modtime::1218639883 filename::/var/log/apache2/feorlen_org_access_log source::/var/log/apache2/feorlen_org_access_log
48a2f25a initcrc::5f66db978a1ff3a3 seekcrc::b2e489862ed72c79 seekptr::413d1d modtime::1218638406 filename::/var/log/apache2/feorlen_org_access_log source::/var/log/apache2/feorlen_org_access_log
48a2f1d1 initcrc::5f66db978a1ff3a3 seekcrc::58af0c6446e96bf5 seekptr::413c7f modtime::1218638289 filename::/var/log/apache2/feorlen_org_access_log source::/var/log/apache2/feorlen_org_access_log
48a2f19d initcrc::5f66db978a1ff3a3 seekcrc::16fdb83b48965067 seekptr::413bbe modtime::1218638236 filename::/var/log/apache2/feorlen_org_access_log source::/var/log/apache2/feorlen_org_access_log
48a2f05b initcrc::5f66db978a1ff3a3 seekcrc::fbb8700a35cfdfcb seekptr::413b25 modtime::1218637915 filename::/var/log/apache2/feorlen_org_access_log source::/var/log/apache2/feorlen_org_access_log
48a2ebc5 initcrc::5f66db978a1ff3a3 seekcrc::ddbac21aa7386a6 seekptr::413abd modtime::1218636714 filename::/var/log/apache2/feorlen_org_access_log source::/var/log/apache2/feorlen_org_access_log
Anything other than this indicates a big problem with the file, such as it getting re-indexed when it shouldn’t be. (Some files you do want to re-index when they change, but not normal logfiles that roll.)
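If you have exported the events for one file, a check like the following Python sketch can flag places where seekptr or modtime went backwards. The helper names field and find_regressions are hypothetical, the input is assumed to be newest first (as search results come back), and seekptr is again assumed to be hex. The sample events are trimmed to the fields the check needs.

```python
import re


def field(event: str, name: str) -> str:
    """Pull the value of a name::value pair out of a raw event."""
    match = re.search(rf"\b{name}::(\S+)", event)
    if match is None:
        raise ValueError(f"no {name} field in event: {event!r}")
    return match.group(1)


def find_regressions(events_newest_first: list) -> list:
    """Return (older, newer) timestamp pairs where seekptr or modtime dropped."""
    problems = []
    ordered = list(reversed(events_newest_first))  # oldest first
    for older, newer in zip(ordered, ordered[1:]):
        # Assumption: seekptr is hex; modtime is decimal epoch seconds.
        old_ptr = int(field(older, "seekptr"), 16)
        new_ptr = int(field(newer, "seekptr"), 16)
        old_mod = int(field(older, "modtime"))
        new_mod = int(field(newer, "modtime"))
        if new_ptr < old_ptr or new_mod < old_mod:
            problems.append((older.split()[0], newer.split()[0]))
    return problems


events = [
    "48a3084d seekptr::414a40 modtime::1218644042",
    "48a307d9 seekptr::41419b modtime::1218643929",
    "48a3062e seekptr::4140fc modtime::1218643502",
]
print(find_regressions(events))
```

An empty list means every later event has a seekptr and modtime at least as large as the one before it, which is what a healthy, growing logfile should look like.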
So why do I care?
Every Splunk instance has a fishbucket index, except the lightest of hand-tuned lightweight forwarders, and if you index a lot of files it can get quite large. As with any other index, you can change the retention policy via indexes.conf to control its size. But since it tracks which files the instance has seen, you have to consider carefully before you change the retention policy. If you retire data from the fishbucket for files that still exist on the host, Splunk will “forget” it saw them, and next time around they will get re-indexed.