This tutorial page describes how to run SAMOA with data files in the Apache Avro file format.
Users of Apache SAMOA can now use binary- or JSON-encoded Avro data as an alternative to the default ARFF file format as the data source. Avro is a remote procedure call and data serialization framework developed within Apache's Hadoop project. It uses JSON to define data types and protocols, and serializes data in a compact binary format. Avro specifies two serialization encodings for data, binary and JSON, with binary as the default; the metadata, however, is always in JSON. Avro data is always serialized together with its schema, so files that store Avro data also include the schema for that data in the same file.
See the latest Apache Avro documentation for more details.
Input Avro files must follow certain input format rules to work seamlessly with SAMOA Instances. The first line of an Avro source file for SAMOA (irrespective of whether the data is encoded in binary or JSON) is the metadata (schema). By default, the data follows as one record per line conforming to the schema, and each record is mapped to one SAMOA instance.
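The layout described above can be illustrated for the JSON encoding with a minimal Python sketch. It is not SAMOA's own loader: the function name `read_json_avro_lines` and the two-field sample schema are hypothetical, and the sketch only assumes what the text states, namely a schema on the first line followed by one JSON record per line.

```python
import json

def read_json_avro_lines(lines):
    """Parse 'schema on line 1, one JSON record per line' into rows of values."""
    lines = [ln for ln in lines if ln.strip()]  # ignore blank lines
    schema = json.loads(lines[0])               # first line: the record schema
    field_names = [f["name"] for f in schema["fields"]]
    instances = []
    for ln in lines[1:]:
        record = json.loads(ln)
        # One record becomes one instance; values follow the schema's field order.
        instances.append([record[name] for name in field_names])
    return schema, instances

# Hypothetical two-field sample, used only for illustration.
sample = [
    '{"type":"record","name":"Demo","fields":['
    '{"name":"x","type":"double"},{"name":"label","type":"string"}]}',
    '{"x":1.5,"label":"a"}',
    '{"x":2.5,"label":"b"}',
]
schema, instances = read_json_avro_lines(sample)
print(len(instances), "instances read")
```

Each parsed row keeps the field order declared in the schema, which is the property a record-to-instance mapping relies on.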
E.g. Enums
{"name":"species","type":{"type":"enum","name":"Labels","symbols":["setosa","versicolor","virginica"]}}
E.g. Unions
{"name":"attribute1","type":["null","int"]} - Allowed, denotes that the value for attribute1 is optional
{"name":"attribute2","type":["string","int"]} - Not allowed
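The two union examples suggest that a union is accepted only when it pairs "null" with exactly one other type. Under that assumption (inferred from the examples, not taken from SAMOA's source), a field definition could be checked with a small Python sketch like the following; `is_allowed_field_type` is a hypothetical helper name.

```python
def is_allowed_field_type(field_type):
    """Return True if a field's Avro type fits the union rule assumed above."""
    if isinstance(field_type, list):  # a JSON array denotes an Avro union
        non_null = [t for t in field_type if t != "null"]
        # Assumed rule: the union must contain "null" plus exactly one other type.
        return "null" in field_type and len(non_null) == 1
    return True  # non-union types are not restricted by this check

fields = [
    {"name": "attribute1", "type": ["null", "int"]},
    {"name": "attribute2", "type": ["string", "int"]},
]
for field in fields:
    verdict = "allowed" if is_allowed_field_type(field["type"]) else "not allowed"
    print(field["name"], verdict)
```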
You may execute a SAMOA task using the bin/samoa script with the following format: bin/samoa <platform> <jar> "<task>". See the SAMOA documentation on deploying to Apache S4 and Apache Storm to learn more about those platforms. Avro files can be used as data sources on any of these platforms. The only addition needed in the commands is the stream specification: AvroFileStream -f <file_name> -e <file_format>. Examples for different modes are given below. Although the examples use the Prequential Evaluation task, the commands apply to all other tasks as well.
bin/samoa local target/SAMOA-Local-0.4.0-incubating-SNAPSHOT.jar "PrequentialEvaluation -l classifiers.ensemble.Bagging -s (AvroFileStream -f covtypeNorm_json.avro -e json) -f 100000"
bin/samoa local target/SAMOA-Local-0.4.0-incubating-SNAPSHOT.jar "PrequentialEvaluation -l classifiers.ensemble.Bagging -s (AvroFileStream -f covtypeNorm_binary.avro -e binary) -f 100000"
bin/samoa storm target/SAMOA-Storm-0.4.0-incubating-SNAPSHOT.jar "PrequentialEvaluation -l classifiers.ensemble.Bagging -s (AvroFileStream -f covtypeNorm_json.avro -e json) -f 100000"
bin/samoa storm target/SAMOA-Storm-0.4.0-incubating-SNAPSHOT.jar "PrequentialEvaluation -l classifiers.ensemble.Bagging -s (AvroFileStream -f covtypeNorm_binary.avro -e binary) -f 100000"
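The four commands above differ only in platform, jar, file name and encoding. As a convenience sketch, they can be assembled in Python as shown below. The helper names (`avro_stream`, `prequential_task`, `samoa_command`) are hypothetical and not part of SAMOA; the option letters are copied from the examples above.

```python
def avro_stream(file_name, encoding):
    """Build the AvroFileStream part of a task string."""
    if encoding not in ("json", "binary"):
        raise ValueError("encoding must be 'json' or 'binary'")
    return f"(AvroFileStream -f {file_name} -e {encoding})"

def prequential_task(stream, learner="classifiers.ensemble.Bagging", f_value=100000):
    """Build the task string; f_value is passed to the task's -f option as in the examples."""
    return f"PrequentialEvaluation -l {learner} -s {stream} -f {f_value}"

def samoa_command(platform, jar, task):
    """Build the full command line in the form bin/samoa <platform> <jar> "<task>"."""
    return f'bin/samoa {platform} {jar} "{task}"'

cmd = samoa_command(
    "local",
    "target/SAMOA-Local-0.4.0-incubating-SNAPSHOT.jar",
    prequential_task(avro_stream("covtypeNorm_json.avro", "json")),
)
print(cmd)
```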
The samples below show how a file in the default ARFF format can be converted to JSON- or binary-encoded Avro format.
@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {setosa,versicolor,virginica}
@DATA
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,virginica
4.7,3.2,1.3,0.2,virginica
4.6,3.1,1.5,0.2,setosa
{"type":"record","name":"Iris","namespace":"com.yahoo.labs.samoa.avro.iris","fields":[{"name":"sepallength","type":"double"},{"name":"sepalwidth","type":"double"},{"name":"petallength","type":"double"},{"name":"petalwidth","type":"double"},{"name":"class","type":{"type":"enum","name":"Labels","symbols":["setosa","versicolor","virginica"]}}]}
{"sepallength":5.1,"sepalwidth":3.5,"petallength":1.4,"petalwidth":0.2,"class":"setosa"}
{"sepallength":4.9,"sepalwidth":3.0,"petallength":1.4,"petalwidth":0.2,"class":"virginica"}
{"sepallength":4.7,"sepalwidth":3.2,"petallength":1.3,"petalwidth":0.2,"class":"virginica"}
{"sepallength":4.6,"sepalwidth":3.1,"petallength":1.5,"petalwidth":0.2,"class":"setosa"}
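The ARFF-to-JSON conversion shown by these two samples can be sketched in Python using only the standard library. This is an illustrative sketch, not a SAMOA tool. It assumes an ARFF file with only NUMERIC and nominal attributes and at most one nominal attribute, since the enum name "Labels" is fixed and Avro requires distinct names for distinct enums. It maps NUMERIC to Avro "double" as in the sample and takes the namespace as a parameter. The function name `arff_to_json_avro` is hypothetical.

```python
import json

def arff_to_json_avro(arff_text, namespace):
    """Convert a simple ARFF text to 'schema line + one JSON record per line'."""
    fields, rows, relation, in_data = [], [], None, False
    for raw in arff_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and ARFF comments
            continue
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            record = {}
            for field, value in zip(fields, values):
                is_numeric = field["type"] == "double"
                record[field["name"]] = float(value) if is_numeric else value
            rows.append(record)
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):  # nominal attribute -> Avro enum
                symbols = [s.strip() for s in spec.strip("{}").split(",")]
                ftype = {"type": "enum", "name": "Labels", "symbols": symbols}
            else:                     # NUMERIC -> double, as in the sample
                ftype = "double"
            fields.append({"name": name, "type": ftype})
        elif lower.startswith("@data"):
            in_data = True
    schema = {"type": "record", "name": relation.capitalize(),
              "namespace": namespace, "fields": fields}
    compact = (",", ":")  # no spaces, matching the sample's formatting
    out = [json.dumps(schema, separators=compact)]
    out += [json.dumps(r, separators=compact) for r in rows]
    return "\n".join(out)

arff = """@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {setosa,versicolor,virginica}
@DATA
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,virginica
"""
converted = arff_to_json_avro(arff, "com.yahoo.labs.samoa.avro.iris")
print(converted)
```

Producing the binary encoding requires an Avro library and is outside the scope of this sketch.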
Objavro.schema΅{"type":"record","name":"Iris","namespace":"com.yahoo.labs.samoa.avro.iris","fields":[{"name":"sepallength","type":"double"},{"name":"sepalwidth","type":"double"},{"name":"petallength","type":"double"},{"name":"petalwidth","type":"double"},{"name":"class","type":{"type":"enum","name":"Labels","symbols":["setosa","versicolor","virginica"]}}]} !<khCrֱS빧ީȂffffff@ @ffffffٙٙɿ @ffffffٙٙ@ڙٙٙɿΌ͌͌@ڙٙٙ @Ό͌͌ٙٙɿΌ͌͌@ ffff@ڙٙٙɿ !<khCrֱS빧ީ
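To make the binary dump above less opaque, the following Python sketch encodes the body of a single Iris record according to the Avro binary encoding rules: a double is written as 8 little-endian IEEE 754 bytes, and an enum is written as the zigzag variable-length integer of its zero-based symbol position. It covers the record body only. The container file header, sync markers and block counts that a real Avro file contains are not produced, and the function names are hypothetical.

```python
import struct

SYMBOLS = ["setosa", "versicolor", "virginica"]  # enum symbols from the schema

def encode_long(n):
    """Avro int/long: zigzag encoding, then base-128 groups, low-order group first."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # set the continuation bit
        z >>= 7
    out.append(z)
    return bytes(out)

def encode_double(x):
    """Avro double: 8 bytes, little-endian IEEE 754."""
    return struct.pack("<d", x)

def encode_iris_record(sepallength, sepalwidth, petallength, petalwidth, label):
    """Concatenate the encoded fields in schema order (the record body only)."""
    numbers = (sepallength, sepalwidth, petallength, petalwidth)
    body = b"".join(encode_double(v) for v in numbers)
    return body + encode_long(SYMBOLS.index(label))

record = encode_iris_record(4.9, 3.0, 1.4, 0.2, "virginica")
print(len(record), record.hex())
```

Four doubles plus a one-byte enum index give 33 bytes per record, which illustrates why the binary encoding is more compact than the JSON lines shown earlier.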
The JSON- and binary-encoded Avro files covtypeNorm_json.avro and covtypeNorm_binary.avro for the Forest CoverType dataset can be found on the wiki.