Sitecore Solr Support for Chinese Language

If you’re running Sitecore with Solr, you may have noticed crawling errors when you add item versions in certain languages. A common requirement for multilingual sites is support for Chinese, which the Solr schema generated by Sitecore does not include by default. Fortunately, it’s relatively simple to correct this and add support for Chinese, as well as for other languages that aren’t covered by the default schema.

First, it’s good to understand how Sitecore maps field data to fields in Solr. When Sitecore crawls an item, it checks the field names and types from the item’s template to map the Sitecore fields to fields in Solr. These mappings are defined in the Sitecore.ContentSearch.Solr.config file. For most fields, Sitecore maps them to a dynamic field in Solr based on the field type. Text fields are mapped to “*_t”, and for language versions, the culture is appended. For example, the Title field on the German version of an item would map to “title_t_de” in Solr.
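To make the mapping concrete, here is a rough sketch of how that localized Title field might look inside a Solr document, written in Solr’s XML update format. Only the title_t_de field name comes from the example above; the field value is made up for illustration, and a real Sitecore document carries many more fields.

```xml
<add>
  <doc>
    <!-- Hypothetical German version of an item: the Title text field
         maps to the *_t pattern with the culture appended -->
    <field name="title_t_de">Willkommen auf unserer Website</field>
  </doc>
</add>
```

Following the same pattern, a Chinese version would need a dynamic field that matches a name like title_t_zh, which is the naming this post uses for Chinese.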

The schema.xml generated by Sitecore already contains field definitions for many languages. If your content uses a language that isn’t in the generated schema, you’ll see errors to that effect in your crawling logs. To add support for Chinese to your indexes, it’s easy enough to add a new dynamic field to your schema.xml, like so:

…
<dynamicField name="*_t_tr" type="text_tr" indexed="true" stored="true" />
<dynamicField name="*_t_zh" type="text_general" indexed="true" stored="true" />
<dynamicField name="*_i" type="int" indexed="true" stored="true" />
…

However, while this will get your crawlers working, actually searching these Chinese fields will yield poor results. This is because the field type, text_general, does not use a tokenizer and filter set that is optimized for Chinese text.

Because Chinese text doesn’t separate words with spaces the way Western languages do, we can’t tokenize terms on whitespace. We need an analyzer that can pick out searchable terms in Chinese text. Fortunately, Solr provides one: the CJKAnalyzer, which indexes overlapping pairs of adjacent characters (bigrams) instead of whitespace-delimited words.

To add a field type definition that uses a different tokenizer and filter set, add this to your schema.xml:

<!-- Chinese (Simplified) -->
<fieldType name="text_zh" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.cjk.CJKAnalyzer" words="lang/stopwords_zh.txt" />
</fieldType>
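If you would rather see (or tune) the individual analysis steps, the same idea can be sketched as an explicit tokenizer and filter chain. Treat this as an assumption on my part rather than something Sitecore ships: it is modeled on the text_cjk type found in Solr’s example schemas, the name text_zh_chain is hypothetical, and the stop filter assumes the same lang/stopwords_zh.txt path used in the definition above.

```xml
<!-- Chinese (Simplified), explicit chain; text_zh_chain is a hypothetical name -->
<fieldType name="text_zh_chain" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Split text into tokens; Han characters come out one per token -->
    <tokenizer class="solr.StandardTokenizerFactory" />
    <!-- Normalize full-width Latin and half-width Katakana forms -->
    <filter class="solr.CJKWidthFilterFactory" />
    <!-- Lowercase any Latin text mixed into the content -->
    <filter class="solr.LowerCaseFilterFactory" />
    <!-- Join adjacent CJK characters into overlapping bigrams -->
    <filter class="solr.CJKBigramFilterFactory" />
    <!-- Drop stop words after bigramming, the order CJKAnalyzer uses (assumed file path) -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_zh.txt" />
  </analyzer>
</fieldType>
```

Spelling out the chain makes the stop word file an explicit, visible part of the configuration and lets you add or remove filters later. If you go this route, point the *_t_zh dynamic field at whichever field type name you end up using.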

We’ll also need to get specific stop words for Chinese and put them in the /conf/lang folder. You can download the stopwords_zh.txt file here.

This field type definition won’t yield perfect results, but it should be good enough for most simple searches. There are more optimized configurations you can use, but they are harder to set up and may require a working understanding of Chinese (which I admittedly do not have, though I ran this approach by a co-worker who is fluent in Mandarin, and he verified these settings and the results). This blog post explains the differences between the analyzers:
http://opensourceconnections.com/blog/2011/12/23/indexing-chinese-in-solr/.

With this in place, we can update the dynamic field, like so:

…
<dynamicField name="*_t_tr" type="text_tr" indexed="true" stored="true" />
<dynamicField name="*_t_zh" type="text_zh" indexed="true" stored="true" />
<dynamicField name="*_i" type="int" indexed="true" stored="true" />
…

You can add support for other Sitecore languages in a similar way. Korean, for example, can also use the CJK analyzer.
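As a minimal sketch of what that might look like for Korean, the fragment below repeats the Chinese setup with a different suffix. The field type name text_ko and the “ko” culture suffix are assumptions for illustration; check your crawling logs for the exact field name Sitecore is trying to write before copying this.

```xml
<!-- Korean; text_ko is a hypothetical name, and the analyzer's built-in defaults are used -->
<fieldType name="text_ko" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.cjk.CJKAnalyzer" />
</fieldType>

<!-- Assumes Korean versions are indexed with a _ko culture suffix -->
<dynamicField name="*_t_ko" type="text_ko" indexed="true" stored="true" />
```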