This BundleList includes modules that allow the Paoding Analyzer to be used to process Chinese texts.
It is highly recommended to use the Paoding bundle list in combination with the smartcn one, because Paoding does not provide sentence detection. A typical EnhancementChain for Chinese should therefore also include the ‘smartcn-sentence’ engine:
:::text langdetect smartcn-sentence paoding-token {your-entitylinking}
where ‘{your-entitylinking}’ will typically be an EntityhubLinkingEngine engine configured for your vocabulary containing the Entities with Chinese labels.
Please also note the comments in the lists.xml
When you plan to use the Paoding Analyzer to process Chinese texts, it is important to also properly configure the Solr schema.xml used by the Entityhub SolrYard. The DZone article “Indexing Chinese in Solr” by Jason Hull provides useful background information on that.
When following those instructions, keep in mind that the {working-dir} of the Stanbol Entityhub IndexingTool is the directory from which you call ‘java -jar …’. Therefore, if you configure ‘PAODING_DIC_HOME’, its value is interpreted relative to the {working-dir}.
For the use of Paoding within Apache Stanbol, the directory is initialized automatically and is located in the persistent storage location of the org.apache.stanbol:org.apache.stanbol.commons.solr.extras.paoding:0.10.0-SNAPSHOT bundle.
To use the Paoding Analyzer for Chinese literals a FieldType and a DynamicField configuration need to be added to the Solr schema.xml.
The fieldType specification for Chinese:
:::xml
A dynamic field using this field type that matches against Chinese language literals
:::xml
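As a minimal sketch of what these two schema.xml entries might look like: the analyzer class `net.paoding.analysis.analyzer.PaodingAnalyzer` comes from the Paoding project, while the field type name `text_zh`, the `@zh*` field pattern and the attribute values are assumptions made for illustration. Compare them against the schema.xml shipped in paoding.solrindex.zip before using them.

```xml
<!-- Sketch only: "text_zh" is an assumed name for the Chinese field type -->
<fieldType name="text_zh" class="solr.TextField">
  <!-- Paoding analyzer; the Paoding jar must be on the classpath of the Solr core -->
  <analyzer class="net.paoding.analysis.analyzer.PaodingAnalyzer"/>
</fieldType>

<!-- Assumes that Chinese language literals are stored in fields prefixed with "@zh" -->
<dynamicField name="@zh*" type="text_zh" indexed="true" stored="true"
              multiValued="true" omitNorms="false"/>
```

In a standard Solr schema.xml the `fieldType` element belongs in the `<types>` section and the `dynamicField` element in the `<fields>` section.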
The paoding.solrindex.zip is identical to the default configuration but uses the above fieldType and dynamicField specification.
Extract the paoding.solrindex.zip to the “indexing/config” directory.
Copy the Paoding Bundle (org.apache.stanbol:org.apache.stanbol.commons.solr.extras.paoding) into the lib directory of the Solr Core configuration (“indexing/config/paoding/lib”). Solr adds all jar files within this directory to the classpath, so it will find the Paoding analyzer implementation during indexing.
Rename the “indexing/config/paoding” directory to the {site-name} (the value of the “name” property of the “indexing/config/indexing.properties” file).
As an alternative to renaming the directory, you can explicitly configure the name of the Solr configuration with the “solrConf” parameter (here “solrConf:paoding”) of the SolrYardIndexingDestination.
:::text indexingDestination=org.apache.stanbol.entityhub.indexing.destination.solryard.SolrYardIndexingDestination,solrConf:paoding,boosts:fieldboosts
Copy the Paoding dictionary to ‘{paoding-dic-dir}’. You can obtain the dictionary from the original Paoding project's SVN repository. A ZIP archive with the dictionary is also included in the Paoding OSGi bundle that is part of Stanbol.
Pass -DPAODING_DIC_HOME={paoding-dic-dir} as a JVM option when calling the Entityhub indexing tool. As an alternative, you can also set ‘PAODING_DIC_HOME’ as a system environment variable.
If you want to create an empty SolrYard instance using the paoding.solrindex.zip configuration you will need to
If you want to use the paoding.solrindex.zip as the default, you can rename the file in the datafiles folder to “default.solrindex.zip” and then enable the “Use default SolrCore configuration” option (org.apache.stanbol.entityhub.yard.solr.useDefaultConfig) when you configure a SolrYard instance.
See also the documentation on how to configure a managed site.