Index ICU: Assertion `_sourceMetaData != __null' failed, part 1

There you were, merrily going along, and boom! Somebody kicks the power switch, your filesystem goes off the deep end, something Very Bad happens. You start to understand why fsck is a four-letter word. After using some additional four-letter words, you get things up and running. But what’s with Splunk? It won’t start!? You only get some cryptic error and “Splunkd appears to be down.” Welcome to the world of WordData. You had a backup, right? Yeah, thought so.

Buried deep in the index are a bunch of *.data files:

[feorlen]:/Applications/splunk/var/lib/splunk/defaultdb/db$ ls -lr *.data
-rw-r--r-- 1 root admin 10276 Sep 3 07:41
-rw-r--r-- 1 root admin 5085 Sep 3 07:41
-rw-r--r-- 1 root admin 252 Sep 3 07:41
-rw-r--r-- 1 root admin 21 Jul 26 19:19

You will find them in every bucket. They contain event counts for sources, hosts and event types, along with some time range info. During indexing, these are constantly being updated. They are supposed to look something like this (note my timestamping oops there for host::grumpy):

$ more
0 0 2147483647 0 0
1 host::grumpy 11194556 900458000 1231448496 1220453014
2 host::www 1953184 1194131619 1220452994 1220452994
3 2350 1207761050 1216665145 1216665145
4 host::localhost 7482 1203904810 1217973661 1217973661

Except when they look like this:

$ more
^@^@^@^@^@^@^@^@^@^@^@ (END)

That isn’t very good. splunkd doesn’t much like it when somebody messes with its *.data files. There are also supposed to be at minimum,, and ( may legitimately not be there in some cases.) Your crash log will likely contain something like this:

[0x00002B51C8EEFB6E] abort + 270 (/lib/
[0x00002B51C8EE8266] __assert_fail + 246 (/lib/
[0x000000000066661D] ? (splunkd)
[0x0000000000697BA6] _ZN23DatabasePartitionPolicy20getSourceWordForCodeEmmR3Str + 182 (splunkd)

and here is the real smoking gun in splunkd_stderr.log:

splunkd: /opt/splunk/p4/splunk/branches/3.2/src/pipeline/indexer/TimeInvertedIndex.cpp:974: void TimeInvertedIndex::getSourceWordForCode(long unsigned int, Str&): Assertion `_sourceMetaData != __null' failed.

Ok, so you’ve got a horked *.data file. Where? Well, based on frequency of writes, it’s going to be in a db-hot directory because that is where active indexing is going on. And the most active indexes are usually fishbucket, _internal and defaultdb. Start by looking for *.data files that are binary. Here’s one way you can find which files are binary, a big clue on where the problem is:

$ cd /opt/splunk/var/lib/splunk
$ find . -name '*.data' | xargs grep "." | grep Binary
Binary file ./_internaldb/db/db-hot/ matches
Binary file ./_internaldb/db/db-hot/ matches
Binary file ./_internaldb/db/db-hot/ matches
Binary file ./fishbucket/db/db-hot/ matches

file will do it also, but beware false positives:

$ for i in `find . -name '*.data'`; do file "$i" | grep -v text; done
./_internaldb/db/db-hot/ data
./_internaldb/db/db-hot/ data
./_internaldb/db/db-hot/ data
./defaultdb/db/db_1214955936_1210836930_38/ Bio-Rad .PIC Image File 2352 x 12297, 14601 images in file
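
If neither grep nor file gives you a clean answer, you can also look for NUL bytes directly, since that is what the trashed file above was full of. Here is a minimal sketch; the function name find_nul_data_files is made up for this post, and it assumes a healthy *.data file is plain text with no NUL bytes in it:

```shell
# Hypothetical helper (not part of Splunk): print every *.data file under
# a directory that contains at least one NUL byte.
# Assumption: healthy *.data files are plain text and never contain NULs.
find_nul_data_files() {
    dir="${1:-.}"
    find "$dir" -name '*.data' -type f | while IFS= read -r f; do
        # Delete every byte except NUL, then count what is left.
        nuls=$(( $(tr -cd '\000' < "$f" | wc -c) ))
        if [ "$nuls" -gt 0 ]; then
            printf '%s\n' "$f"
        fi
    done
}
```

Run it as find_nul_data_files /opt/splunk/var/lib/splunk (or wherever your indexes live). It only reports files; it does not change anything.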

Another check is to see if the line numbers in the file are in ascending order. If they aren’t, then something is seriously wrong:

$ for i in `find . -name '*.data'`; do sort -nc "$i"; done
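
The ordering check above can be wrapped up so that it prints just the names of the offending files. This is only a sketch, and check_data_order is a name I invented; it relies on sort -nc exiting non-zero when a file is not in ascending numeric order:

```shell
# Hypothetical helper: report *.data files whose leading line numbers
# are not in ascending order.
check_data_order() {
    dir="${1:-.}"
    find "$dir" -name '*.data' -type f | while IFS= read -r f; do
        # sort -c only checks the order; it exits non-zero on the first
        # out-of-order line. Its own diagnostic is discarded here.
        if ! sort -nc "$f" 2>/dev/null; then
            printf 'out of order: %s\n' "$f"
        fi
    done
}
```

Note that a file consisting of a single line of garbage counts as "sorted", so treat this as one extra check rather than a replacement for the binary-file search.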

Have a look at these files and see what’s in them. If they are only partially corrupted, you may be able to edit out the garbage. If they are totally full of junk, you will need to find replacements. For _internaldb and fishbucket, you may not care if your event counts are exactly correct, so you can lift some files from another bucket. If the problem is in defaultdb or another index containing your real indexed data, you’ll need to pay more attention to the contents.
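
For a partially corrupted file, a first pass at editing out the garbage might look like the sketch below. It is based purely on the sample listing earlier in this post: it assumes every good line starts with a number followed by a space and contains only printable ASCII, which may not hold for your data. The name keep_numbered_lines is hypothetical, and the -a flag (treat binary input as text) is a GNU/BSD grep option rather than a POSIX one. It writes to a new file and leaves the original alone:

```shell
# Hypothetical helper: copy only the lines that look like healthy entries
# from a damaged *.data file into a new file.
# Assumption: a healthy line is "<number> <printable ASCII text>".
keep_numbered_lines() {
    src="$1"
    dst="$2"
    LC_ALL=C grep -a -E '^[0-9]+ [[:print:]]*$' "$src" > "$dst"
    # grep exits 1 when nothing matched; only 2 or higher is a real error.
    [ "$?" -le 1 ]
}
```

Compare the result against the original by hand before putting it anywhere near db-hot.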

In the simple case, if the files in db-hot are trashed, see if there is a warm bucket next to it that you can copy replacements from. Warm buckets are in the same directory as db-hot and look something like db_1218802821_1218658318_17. Copy the *.data files from there into db-hot and try to restart Splunk. If it starts, you are good to go. If not, there is more damage to repair. If there are other binary *.data files, make sure you deal with all of them.
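
The copy step can be sketched as a small function that saves the damaged files before overwriting them, in case you want to pick through them later. The function name replace_hot_data and the separate backup directory are my own additions, not anything Splunk provides, and this assumes Splunk is stopped while you do it:

```shell
# Hypothetical helper: copy *.data files from a warm bucket into db-hot,
# saving any files it overwrites into a backup directory first.
# Usage: replace_hot_data WARM_BUCKET_DIR DB_HOT_DIR BACKUP_DIR
replace_hot_data() {
    warm="$1"
    hot="$2"
    backup="$3"
    mkdir -p "$backup" || return 1
    for f in "$warm"/*.data; do
        # Skip the unexpanded pattern if the warm bucket has no *.data files.
        [ -e "$f" ] || continue
        name=$(basename "$f")
        if [ -e "$hot/$name" ]; then
            cp -p "$hot/$name" "$backup/$name" || return 1
        fi
        cp -p "$f" "$hot/$name" || return 1
    done
}
```

Keep the backup directory outside the index tree so splunkd does not trip over it.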

This should handle the most common types of problems. I’ll go into more detailed debugging and reconstruction in another post.
