Configure character set encoding

This documentation does not apply to the most recent version of Splunk.

This documentation applies to the following versions of Splunk: 3.3 , 3.3.1 , 3.3.2 , 3.3.3 , 3.3.4 , 3.4 , 3.4.1 , 3.4.2 , 3.4.3 , 3.4.5 , 3.4.6 , 3.4.8 , 3.4.9 , 3.4.10 , 3.4.11 , 3.4.12 , 3.4.13

Splunk allows you to configure character set encoding for your data sources. Splunk has built-in character set specifications to support internationalization of your Splunk deployment. Splunk supports 71 languages (including 20 that aren't UTF-8 encoded). You can retrieve a list of Splunk's valid character encoding specifications by running the iconv -l command on most *nix systems.

Splunk attempts to apply UTF-8 encoding to your sources by default. If a source doesn't use UTF-8 encoding or is a non-ASCII file, Splunk tries to convert the data from the source to UTF-8 encoding, unless you specify a character set to use by setting the CHARSET key in props.conf.

Note: If the character set you specify is valid, but the source contains some characters that are not valid in that character set, Splunk escapes the invalid characters as hex values (for example: "\xF3").
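
To illustrate what hex escaping of invalid characters looks like, here is a minimal Python sketch. It is not Splunk's implementation; the function name decode_with_hex_escapes is hypothetical, and Python's "backslashreplace" error handler is used only as a stand-in for the behavior described in the note. Python writes the hex digits in lowercase.

```python
def decode_with_hex_escapes(raw: bytes, charset: str = "utf-8") -> str:
    """Decode raw bytes, replacing undecodable bytes with \\xNN escapes."""
    # "backslashreplace" keeps invalid bytes visible instead of dropping them.
    return raw.decode(charset, errors="backslashreplace")


# 0xF3 is "ó" in ISO-8859-1, but on its own it is not valid UTF-8.
sample = b"caf\xf3"

escaped = decode_with_hex_escapes(sample)                 # decoded as UTF-8
correct = decode_with_hex_escapes(sample, "iso-8859-1")   # decoded with the right charset

print(escaped)
print(correct)
```

Decoding the same bytes with the character set they were actually written in yields the intended text with no escapes, which is why setting CHARSET correctly matters.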


Supported character sets

Language      Code
Arabic        CP1256
Arabic        ISO-8859-6
Armenian      ARMSCII-8
Belarus       CP1251
Bulgarian     ISO-8859-5
Czech         ISO-8859-2
Georgian      Georgian-Academy
Greek         ISO-8859-7
Hebrew        ISO-8859-8
Japanese      EUC-JP
Japanese      SHIFT-JIS
Korean        EUC-KR
Russian       CP1251
Russian       ISO-8859-5
Russian       KOI8-R
Slovak        CP1250
Slovenian     ISO-8859-2
Thai          TIS-620
Ukrainian     KOI8-U
Vietnamese    VISCII


Manually specify a character set

Manually specify a character set to apply to a source by setting the CHARSET key for a source in props.conf.

For example, if you have a source that is in Greek and uses ISO-8859-7 encoding, set CHARSET=ISO-8859-7 for that source in props.conf.

[source::$SOURCE]
CHARSET=ISO-8859-7
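
As a rough illustration of the conversion that a CHARSET setting implies, the following Python sketch re-encodes ISO-8859-7 bytes as UTF-8. It is an assumption-laden example rather than a description of Splunk internals: the helper to_utf8 and the sample Greek text are hypothetical.

```python
def to_utf8(raw: bytes, charset: str) -> bytes:
    """Interpret raw bytes in the given character set and re-encode as UTF-8."""
    return raw.decode(charset).encode("utf-8")


# Hypothetical source data: a Greek word stored as ISO-8859-7 (one byte per letter).
greek_raw = "Καλημέρα".encode("iso-8859-7")

utf8_bytes = to_utf8(greek_raw, "ISO-8859-7")

print(len(greek_raw), len(utf8_bytes))
print(utf8_bytes.decode("utf-8"))
```

The byte length doubles because each Greek letter takes one byte in ISO-8859-7 and two bytes in UTF-8; the text itself is unchanged.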


Automatically specify a character set

Splunk can automatically detect languages and proper character sets using its sophisticated character set encoding algorithm.

Configure Splunk to automatically detect the proper language and character set encoding for a source by setting CHARSET=AUTO for that source in props.conf. For example, if you want Splunk to automatically detect character set encoding for the source "my-foreign-docs", set CHARSET=AUTO for that source in props.conf.

[my-foreign-docs]
CHARSET=AUTO


If Splunk doesn't recognize a character set

If you want to use an encoding that Splunk doesn't recognize, train Splunk to recognize it by adding a character set sample file at the following path:

SPLUNK_HOME/etc/ngram-models/_<language>-<encoding>.txt

After you add the character set sample file, restart Splunk. After the restart, Splunk can recognize sources that use the new character set and automatically converts them to UTF-8 format at index time.

For example, if you want to use the "vulcan-ISO-12345" character set, copy the specification file to the following path:

SPLUNK_HOME/etc/ngram-models/_vulcan-ISO-12345.txt
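
The naming convention above can be sketched in Python as a small path builder. This is only an illustration: the function ngram_model_path is hypothetical, and /opt/splunk is an assumed install location standing in for SPLUNK_HOME.

```python
from pathlib import PurePosixPath


def ngram_model_path(splunk_home: str, language: str, encoding: str) -> PurePosixPath:
    """Build the sample file path: <home>/etc/ngram-models/_<language>-<encoding>.txt"""
    file_name = f"_{language}-{encoding}.txt"
    return PurePosixPath(splunk_home) / "etc" / "ngram-models" / file_name


# "/opt/splunk" is an assumed value for SPLUNK_HOME; substitute your own.
target = ngram_model_path("/opt/splunk", "vulcan", "ISO-12345")

print(target)
```

After copying the sample file to the resulting path (for example with shutil.copy), restart Splunk so that it picks up the new character set.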
Revision: 207