Before using any of the following scripts, configure a new classifier model identified as `model`
(for instance using the Felix System Console at http://localhost:8080/system/console), along with a matching training set. The HTTP API for that classifier model will be published at:
http://localhost:8080/topic/model
NewsML is a standard XML file format used by major news agencies. The topics of news articles can be categorized using a controlled vocabulary.
Such a vocabulary can be loaded into the entityhub by copying the IPTC [zip archive][1] into the stanbol/datafiles
folder of a running server and deploying the [referenced site definition jar][2] (for instance using the Felix Console).
[1]: http://dev.iks-project.eu/downloads/stanbol-indices/iptc.solrindex.zip
[2]: http://dev.iks-project.eu/downloads/stanbol-indices/org.apache.stanbol.data.site.iptc-1.0.0.jar
If you have an archive of NewsML files at hand, you can train a topic classifier on it by using the files to build the training set for the model (you need Python 2.7 and lxml to run the script).
First import the RDF definition of the IPTC taxonomy into the model:
TODO
Then import the data into the training set of the model:

    python newsmlimporter.py /path/to/newsml/toplevel/folder 10000 \
        http://localhost:8080/topic/model/trainingset
The second argument is the maximum number of news items to import into the training set.
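To illustrate what such a cap means in practice, here is a minimal sketch that selects at most a given number of files from a folder tree. It is not the code of `newsmlimporter.py`: the function names are hypothetical, it uses Python 3 with the standard library only, and it assumes the archive stores one news item per `.xml` file.

```python
import itertools
import os


def iter_xml_files(top_folder):
    """Yield paths of .xml files under top_folder in a stable, sorted order."""
    for dirpath, dirnames, filenames in os.walk(top_folder):
        dirnames.sort()  # in-place sort makes the traversal order deterministic
        for name in sorted(filenames):
            if name.lower().endswith(".xml"):
                yield os.path.join(dirpath, name)


def select_files(top_folder, max_items):
    """Return at most max_items paths (hypothetical helper, not the script's API)."""
    return list(itertools.islice(iter_xml_files(top_folder), max_items))
```

Because the generator is consumed lazily, the walk stops as soon as the cap is reached, which matters for large archives.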
You can then train the model with curl (the URL is quoted so that the shell does not interpret the `?`):

    curl -i -X POST "http://localhost:8080/topic/model/trainer?incremental=false"
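The same training request can be issued from a script. The following is a minimal sketch using only the Python 3 standard library; the function names and the timeout value are hypothetical, and only the URL and the `incremental` parameter come from the curl command above.

```python
import urllib.parse
import urllib.request

MODEL_URL = "http://localhost:8080/topic/model"  # endpoint named in this README


def build_train_request(model_url=MODEL_URL, incremental=False):
    """Build a POST request for the trainer endpoint, like the curl command."""
    query = urllib.parse.urlencode({"incremental": str(incremental).lower()})
    url = "{}/trainer?{}".format(model_url.rstrip("/"), query)
    # No payload is attached: the request is driven by the query string only,
    # as in the curl invocation.
    return urllib.request.Request(url, method="POST")


def train(model_url=MODEL_URL, incremental=False, timeout=3600):
    """Send the request and return the HTTP status code.

    Requires a running server; the generous timeout is an assumption, since
    training duration depends on the size of the training set.
    """
    request = build_train_request(model_url, incremental)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling `train()` against a running server should correspond to the curl command; building the request separately makes it possible to inspect the URL without touching the network.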
The model can then be used as part of any enhancer engine chain to assign IPTC topics to text documents.
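As a rough sketch of what using the model through a chain could look like, the snippet below builds a plain-text enhancement request. It assumes the Stanbol enhancer is published at `http://localhost:8080/enhancer` with named chains under `/enhancer/chain/<name>`, and that you have configured a chain containing the topic engine; the chain name `topics` used in the example is hypothetical.

```python
import urllib.parse
import urllib.request

ENHANCER_URL = "http://localhost:8080/enhancer"  # assumed enhancer endpoint


def build_enhance_request(text, chain=None, enhancer_url=ENHANCER_URL):
    """Build a POST request that submits plain text to an enhancement chain."""
    url = enhancer_url.rstrip("/")
    if chain is not None:
        # Chain names are user-defined; quote them to keep the URL valid.
        url += "/chain/" + urllib.parse.quote(chain, safe="")
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={
            "Content-Type": "text/plain; charset=utf-8",
            "Accept": "text/turtle",  # ask for the enhancements as Turtle RDF
        },
        method="POST",
    )


def enhance(text, chain=None, enhancer_url=ENHANCER_URL, timeout=60):
    """Send the text and return the raw response body (needs a running server)."""
    request = build_enhance_request(text, chain, enhancer_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

For instance, `enhance("Stock markets fell sharply today.", chain="topics")` would return the enhancement graph as text, provided a chain with that name exists on your server.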
A subset of Wikipedia / DBpedia categories can be used as the topics of a classifier. To extract such a taxonomy of topics you can use [dbpediakit][3] (you will need Python and PostgreSQL for this to run):

    git clone https://github.com/ogrisel/dbpediakit
    cd dbpediakit
Create the dbpediakit database on the PostgreSQL server by following the instructions in:
https://github.com/ogrisel/dbpediakit/blob/master/dbpediakit/postgres.py
You can now run the extraction (this will download the required dumps and load them into PostgreSQL, so it can take a long time):

    python examples/topics/build_taxonomy.py --max-depth=2
Back in this folder, import the taxonomy and the training set into Stanbol to build the classifier model:

    python dbpediacategories.py /path/to/dbpediakit/dbpedia-taxonomy.tsv \
        /path/to/dbpediakit/dbpedia-examples.tsv.bz2 \
        http://localhost:8080/topic/model
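Before launching the import, it can be useful to check that the extraction produced readable files. The helper below is a hypothetical sketch (Python 3, standard library only) that returns the first rows of a TSV file, plain or bz2-compressed. It makes no assumption about the meaning of the columns, which is defined by dbpediakit.

```python
import bz2
import csv
import itertools


def peek_tsv(path, n=5):
    """Return the first n rows of a TSV file, decompressing .bz2 on the fly."""
    opener = bz2.open if path.endswith(".bz2") else open
    with opener(path, "rt", encoding="utf-8", newline="") as f:
        # QUOTE_NONE: treat quote characters as ordinary text in the fields.
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        return list(itertools.islice(reader, n))
```

For example, `peek_tsv("/path/to/dbpediakit/dbpedia-examples.tsv.bz2")` reads only the beginning of the archive, so it stays fast even when the file is large.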