This page explains how to build and run SAMOA on Apache Apex (http://apex.apache.org/).
Clone the repository and build the SAMOA package with the Apex profile:
git clone http://git.apache.org/incubator-samoa.git
cd incubator-samoa
mvn -Papex package
The deployable jar will be present in target/SAMOA-Apex-0.4.0-incubating-SNAPSHOT.jar.
To run in local mode, make sure samoa.apex.mode is set to local, then run:
bin/samoa apex target/SAMOA-Apex-0.4.0-incubating-SNAPSHOT.jar "PrequentialEvaluation -d /tmp/dump.csv -i 500000 -l (classifiers.trees.VerticalHoeffdingTree -p 1) -s (generators.RandomTreeGenerator -c 2 -o 5 -u 5)"
To run in cluster mode, make sure samoa.apex.mode is set to cluster. Also set the dt.dfsRootDirectory parameter to point to a valid HDFS directory, and the fs.default.name parameter to point to the name node service of the Hadoop cluster. Then run:
bin/samoa apex target/SAMOA-Apex-0.4.0-incubating-SNAPSHOT.jar "PrequentialEvaluation -d /tmp/dump.csv -i 500000 -l (classifiers.trees.VerticalHoeffdingTree -p 1) -s (generators.RandomTreeGenerator -c 2 -o 5 -u 5)"
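Since the launch command is identical in both modes, it can help to confirm which mode is configured before launching. The following is a minimal sketch, not part of SAMOA: get_prop is a hypothetical helper that reads a simple key=value entry from a properties file, and the bin/samoa-apex.properties location is an assumption that may differ in your checkout.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with SAMOA): print the value of a key
# from a Java-style properties file. Only plain "key=value" lines are
# handled; line continuations and escapes are not supported.
get_prop() {
  # $1 = properties file, $2 = key to look up
  awk -F'=' -v key="$2" '
    /^[[:space:]]*#/     { next }             # skip comment lines
    index($0, "=") == 0  { next }             # skip lines without "="
    {
      k = $1
      gsub(/^[ \t]+|[ \t]+$/, "", k)          # trim whitespace around the key
      if (k == key) {
        v = substr($0, index($0, "=") + 1)    # keep any "=" inside the value
        gsub(/^[ \t]+|[ \t]+$/, "", v)        # trim whitespace around the value
        val = v                               # last definition wins
      }
    }
    END { print val }
  ' "$1"
}

# Assumed location of the properties file; adjust to your checkout.
PROPS=bin/samoa-apex.properties
if [ -f "$PROPS" ]; then
  echo "samoa.apex.mode = $(get_prop "$PROPS" samoa.apex.mode)"
fi
```

If the key is absent, the helper prints an empty line, so a launch script can treat an empty result as "not configured".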
The rate at which input operators emit tuples can be throttled (using the tuplesLimitPerWindow parameter) through the dt-site.xml config file. The path to this file needs to be set in the samoa-apex.properties file as dt.site.path. The following example configuration limits the speed to 2000 tuples per window. For Apex, with the default streaming window size of 500 ms, this amounts to 4000 tuples per second. This parameter is specific to the type of data and the type of topology being run, and a single value may not be optimal for all topologies. A useful guide for choosing it is to check whether the latency of the operators in the Dag stays within an acceptable range. If it does not, the operators cannot handle the load and the parameter must be decreased.
<property>
  <name>dt.operator.*.prop.tuplesLimitPerWindow</name>
  <value>2000</value>
</property>
The wildcard applies the limit to every operator present in the Dag. In practice it affects only input operators, as other operators do not have this property. This is convenient, since we do not need to know the name of the input operator corresponding to the Entrance Processing Item in the topology.
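The relation between the per-window limit and the resulting throughput is simple integer arithmetic, sketched below. The function names tuples_per_second and limit_for_rate are illustrative only and are not part of SAMOA or Apex.

```shell
#!/bin/sh
# Illustrative helpers (hypothetical names) for the throttle arithmetic.

# tuples_per_second LIMIT WINDOW_MS
# Throughput implied by a per-window limit and a streaming window size.
tuples_per_second() {
  echo $(( $1 * 1000 / $2 ))
}

# limit_for_rate RATE WINDOW_MS
# Per-window limit needed to reach a target rate in tuples per second.
limit_for_rate() {
  echo $(( $1 * $2 / 1000 ))
}

# The example from the text: 2000 tuples per 500 ms window.
tuples_per_second 2000 500   # prints 4000
```

Going the other way, limit_for_rate 4000 500 gives 2000, the value to place in tuplesLimitPerWindow. Both helpers use integer division, so results are rounded down.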
Other attributes can be modified through the dt-site.xml file, which is specified by the dt.site.path property in the samoa-apex.properties file. Some of the attributes which can be modified are as follows:
MEMORY_MB: http://docs.datatorrent.com/beginner/#allocating-operator-memory
STREAMING_WINDOW_SIZE_MILLIS: http://docs.datatorrent.com/tutorials/topnwords-c7/#streaming-windows-and-application-windows
CHECKPOINT_WINDOW_COUNT: https://apex.apache.org/docs/apex/application_development/#checkpointing
Please refer to the following for more information on which attributes can be specified externally and their impact on the processing engine. Note, however, that most of these are not applicable to applications running on Samoa, as the topology and its properties are already defined by Samoa. The Apex runner only runs the topology by converting it to an Apex Dag.
https://www.datatorrent.com/docs/apidocs/com/datatorrent/api/Context.DAGContext.html
https://www.datatorrent.com/docs/apidocs/com/datatorrent/api/Context.OperatorContext.html
To enable debug logging, add the following configuration to the dt-site.xml file at the location specified in dt.site.path:
<property>
  <name>dt.loggers.level</name>
  <value>org.apache.*:DEBUG,com.datatorrent.*:DEBUG</value>
</property>
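For reference, the throttle and debug-logging properties described on this page can be combined into one file. The sketch below writes such a file from the shell. It assumes dt-site.xml uses the Hadoop-style layout with a <configuration> root element, which this page does not state explicitly, and the output path in DT_SITE is a placeholder.

```shell
#!/bin/sh
# Sketch only: write a dt-site.xml containing the two properties shown on
# this page. The <configuration> root element is an assumption (Hadoop-style
# configuration file); DT_SITE is a placeholder output path.
DT_SITE=${DT_SITE:-dt-site.xml}

cat > "$DT_SITE" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- Throttle input operators to 2000 tuples per streaming window -->
  <property>
    <name>dt.operator.*.prop.tuplesLimitPerWindow</name>
    <value>2000</value>
  </property>
  <!-- Enable debug logging -->
  <property>
    <name>dt.loggers.level</name>
    <value>org.apache.*:DEBUG,com.datatorrent.*:DEBUG</value>
  </property>
</configuration>
EOF

echo "Wrote $DT_SITE"
```

The dt.site.path entry in samoa-apex.properties would then need to point at the file written here.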
The user can view details about any application launched via Apex using the Apex CLI. The apex-core project must be checked out to some directory. The CLI launcher is located at: apex-core/engine/src/main/scripts/apex
The following commands can be run from the CLI:
list-apps
connect <app-id>
get-app-info <app-id>