The DML (Declarative Machine Learning) language has built-in functions which enable access to both low- and high-level functions to support all kinds of use cases.
Builtins are either implemented on a compiler level or as DML scripts that are loaded at compile time.
Some builtin functions generate objects: they create matrices, tensors, lists and other non-primitive objects.
`tensor`-Function

The `tensor`-function creates a tensor.
tensor(data, dims, byRow = TRUE)
Name | Type | Default | Description |
---|---|---|---|
data | Matrix[?], Tensor[?], Scalar[?] | required | The data with which the tensor should be filled. See `data`-Argument. |
dims | Matrix[Integer], Tensor[Integer], Scalar[String], List[Integer] | required | The dimensions of the tensor. See `dims`-Argument. |
byRow | Boolean | TRUE | NOT USED. Will probably be removed or replaced. |
Note that this function is highly unstable: it will be reworked, and its signature and functionality might change.
Type | Description |
---|---|
Tensor[?] | The generated Tensor. Will support more datatypes than `Double`. |
`data`-Argument

The `data`-argument can be a `Matrix` of any datatype, from which the elements are taken and placed in the tensor until it is filled. If it is given as a `Tensor`, the same procedure takes place. We iterate through the `Matrix` or `Tensor` by starting with each dimension index at `0` and then incrementing the lowest one until we have made a complete pass over that dimension, and then incrementing the dimension index above it. This is repeated until the `Tensor` is completely filled.

If `data` is a `Scalar`, the whole tensor is filled with that value.
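The fill procedure described above can be sketched in plain Python (this is not DML and not the SystemDS implementation). The sketch assumes that "incrementing the lowest one" means the last dimension varies fastest (row-major order), and it assumes that too few values is an error; both are assumptions, and `fill_tensor` is a hypothetical helper name.

```python
from itertools import product

def fill_tensor(data, dims):
    """Fill a tensor of shape `dims` from a flat sequence or a scalar.

    Returns a dict mapping index tuples to values. Indices are visited with
    the last dimension varying fastest (row-major), which is an assumption.
    """
    # product() yields index tuples with the rightmost index changing fastest.
    cells = list(product(*(range(d) for d in dims)))
    if isinstance(data, (int, float, str)):
        # Scalar input: every cell gets the same value.
        return {idx: data for idx in cells}
    values = list(data)
    if len(values) < len(cells):
        # Assumption: running out of values is treated as an error here.
        raise ValueError("not enough values to fill the tensor")
    return dict(zip(cells, values))

t = fill_tensor(range(24), (2, 3, 4))  # values 0..23 into a 2 x 3 x 4 tensor
s = fill_tensor(3, (2, 2))             # scalar fill
```

With this ordering, `t[(0, 0, 1)]` holds the second value and `t[(1, 0, 0)]` holds the thirteenth, because a full pass over the last two dimensions covers 12 cells.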
`dims`-Argument

The dimensions of the tensor can be given as a vector represented by a `Matrix`, `Tensor`, `String` or `List`. Dimensions given as a `String` are expected to be separated by spaces.
    print("Dimension matrix:");
    d = matrix("2 3 4", 1, 3);
    print(toString(d, decimal=1))

    print("Tensor A: Fillvalue=3, dims=2 3 4");
    A = tensor(3, d); # fill with value, dimensions given by matrix
    print(toString(A))

    print("Tensor B: Reshape A, dims=4 2 3");
    B = tensor(A, "4 2 3"); # reshape tensor, dimensions given by string
    print(toString(B))

    print("Tensor C: Reshape dimension matrix, dims=1 3");
    C = tensor(d, list(1, 3)); # values given by matrix, dimensions given by list
    print(toString(C, decimal=1))

    print("Tensor D: Values=tst, dims=Tensor C");
    D = tensor("tst", C); # fill with string, dimensions given by tensor
    print(toString(D))
Note that reshape construction is not yet supported for SPARK execution.
DML-bodied built-in functions are written as DML-Scripts and executed as such when called.
KMeans-Function

The `kmeans()` function implements the KMeans clustering algorithm.
kmeans(X = X, k = 20, runs = 10, max_iter = 5000, eps = 0.000001, is_verbose = FALSE, avg_sample_size_per_centroid = 50)
Name | Type | Default | Description |
---|---|---|---|
X | Matrix[Double] | required | The input Matrix to do KMeans on. |
k | Int | 10 | Number of centroids |
runs | Int | 10 | Number of runs (with different initial centroids) |
max_iter | Int | 100 | Max no. of iterations allowed |
eps | Double | 0.000001 | Tolerance (epsilon) for WCSS change ratio |
is_verbose | Boolean | FALSE | If TRUE, print per-iteration stats (not printed by default) |
Type | Description |
---|---|
Matrix[Double] | The mapping of records to centroids |
Matrix[Double] | The output matrix with the centroids |
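To make the roles of `k`, `runs`, `max_iter` and `eps` concrete, here is a minimal plain-Python sketch of Lloyd's algorithm with several restarts. It is not the DML script behind `kmeans`: the helper names (`sq_dist`, `kmeans_once`), the `seed` parameter and the uniform random initialization are assumptions made for illustration, and the sampling controlled by `avg_sample_size_per_centroid` is left out.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_once(X, centroids, max_iter=100, eps=1e-6):
    """One run of Lloyd's algorithm from the given initial centroids."""
    prev_wcss = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every record.
        labels = [min(range(len(centroids)), key=lambda c: sq_dist(x, centroids[c]))
                  for x in X]
        # Update step: each centroid moves to the mean of its records.
        for c in range(len(centroids)):
            members = [x for x, l in zip(X, labels) if l == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        # Within-cluster sum of squares for the current assignment.
        wcss = sum(sq_dist(x, centroids[l]) for x, l in zip(X, labels))
        # Stop once the relative WCSS change drops to eps or below.
        if prev_wcss is not None and abs(prev_wcss - wcss) <= eps * prev_wcss:
            break
        prev_wcss = wcss
    return centroids, labels, wcss

def kmeans(X, k, runs=10, max_iter=100, eps=1e-6, seed=0):
    """Repeat Lloyd's algorithm `runs` times and keep the lowest-WCSS result."""
    rng = random.Random(seed)
    best = None
    for _ in range(runs):
        init = [list(x) for x in rng.sample(X, k)]  # k distinct records as start
        result = kmeans_once(X, init, max_iter, eps)
        if best is None or result[2] < best[2]:
            best = result
    return best

X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
     [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centroids, labels, wcss = kmeans(X, k=2)
```

Running several restarts matters because a single run can settle in a poor local optimum; keeping the run with the smallest WCSS is the usual remedy.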
`lm`-Function

The `lm`-function solves linear regression using either the direct solve method or the conjugate gradient algorithm, depending on the input size of the matrices (see `lmDS`-function and `lmCG`-function respectively).
lm(X, y, icpt = 0, reg = 1e-7, tol = 1e-7, maxi = 0, verbose = TRUE)
Name | Type | Default | Description |
---|---|---|---|
X | Matrix[Double] | required | Matrix of feature vectors. |
y | Matrix[Double] | required | 1-column matrix of response values. |
icpt | Integer | 0 | Intercept presence, shifting and rescaling the columns of X (Details) |
reg | Double | 1e-7 | Regularization constant (lambda) for L2-regularization. Set to nonzero for highly dependent/sparse/numerous features |
tol | Double | 1e-7 | Tolerance (epsilon); conjugate gradient procedure terminates early if L2 norm of the beta-residual is less than tolerance * its initial norm |
maxi | Integer | 0 | Maximum number of conjugate gradient iterations. 0 = no maximum |
verbose | Boolean | TRUE | If TRUE, print messages are activated |
Note that if the number of features is small enough (rows of `X`/`y` < 2000), the `lmDS`-function is called internally and the parameters `tol` and `maxi` are ignored.
Type | Description |
---|---|
Matrix[Double] | 1-column matrix of weights. |
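`icpt`-Argument

The `icpt`-argument can be set to 3 modes:

* 0 = no intercept, no shifting, no rescaling
* 1 = add intercept, but neither shift nor rescale X
* 2 = add intercept, shift & rescale X columns to mean = 0, variance = 1
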
    X = rand(rows = 50, cols = 10)
    y = X %*% rand(rows = ncol(X), cols = 1)
    lm(X = X, y = y)
`lmDS`-Function

The `lmDS`-function solves linear regression by directly solving the linear system.
lmDS(X, y, icpt = 0, reg = 1e-7, verbose = TRUE)
Name | Type | Default | Description |
---|---|---|---|
X | Matrix[Double] | required | Matrix of feature vectors. |
y | Matrix[Double] | required | 1-column matrix of response values. |
icpt | Integer | 0 | Intercept presence, shifting and rescaling the columns of X (Details) |
reg | Double | 1e-7 | Regularization constant (lambda) for L2-regularization. Set to nonzero for highly dependent/sparse/numerous features |
verbose | Boolean | TRUE | If TRUE, print messages are activated |
Type | Description |
---|---|
Matrix[Double] | 1-column matrix of weights. |
    X = rand(rows = 50, cols = 10)
    y = X %*% rand(rows = ncol(X), cols = 1)
    lmDS(X = X, y = y)
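As an illustration of what "directly solving the linear system" typically means, the plain-Python sketch below solves the regularized normal equations `(X^T X + reg * I) beta = X^T y` by Gaussian elimination. It assumes `icpt = 0` (no intercept) and is not the code behind `lmDS`; `solve` and `lm_direct` are hypothetical helper names.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        # Pick the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def lm_direct(X, y, reg=1e-7):
    """Ridge regression without intercept via the normal equations."""
    m = len(X[0])
    # A = X^T X + reg * I
    A = [[sum(row[i] * row[j] for row in X) + (reg if i == j else 0.0)
          for j in range(m)] for i in range(m)]
    # b = X^T y
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(m)]
    return solve(A, b)

X = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]
y = [2 * a - b for a, b in X]  # exact linear response with weights (2, -1)
beta = lm_direct(X, y)
```

The direct approach forms an m-by-m system, where m is the number of features, which is why it suits inputs with few features.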
`lmCG`-Function

The `lmCG`-function solves linear regression using the conjugate gradient algorithm.
lmCG(X, y, icpt = 0, reg = 1e-7, tol = 1e-7, maxi = 0, verbose = TRUE)
Name | Type | Default | Description |
---|---|---|---|
X | Matrix[Double] | required | Matrix of feature vectors. |
y | Matrix[Double] | required | 1-column matrix of response values. |
icpt | Integer | 0 | Intercept presence, shifting and rescaling the columns of X (Details) |
reg | Double | 1e-7 | Regularization constant (lambda) for L2-regularization. Set to nonzero for highly dependent/sparse/numerous features |
tol | Double | 1e-7 | Tolerance (epsilon); conjugate gradient procedure terminates early if L2 norm of the beta-residual is less than tolerance * its initial norm |
maxi | Integer | 0 | Maximum number of conjugate gradient iterations. 0 = no maximum |
verbose | Boolean | TRUE | If TRUE, print messages are activated |
Type | Description |
---|---|
Matrix[Double] | 1-column matrix of weights. |
    X = rand(rows = 50, cols = 10)
    y = X %*% rand(rows = ncol(X), cols = 1)
    lmCG(X = X, y = y, maxi = 10)
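The interplay of `tol` and `maxi` can be seen in a minimal plain-Python sketch of the conjugate gradient method applied to `(X^T X + reg * I) beta = X^T y`. It assumes `icpt = 0` and is a textbook version, not the DML script behind `lmCG`; `lm_cg` and `apply_A` are hypothetical names.

```python
def lm_cg(X, y, reg=1e-7, tol=1e-7, maxi=0):
    """Conjugate gradient for ridge regression without intercept.

    Returns the weights and the number of iterations performed.
    """
    m = len(X[0])

    def apply_A(v):
        # Computes X^T (X v) + reg * v without forming X^T X.
        Xv = [sum(a * b for a, b in zip(row, v)) for row in X]
        return [sum(row[j] * Xv[i] for i, row in enumerate(X)) + reg * v[j]
                for j in range(m)]

    beta = [0.0] * m
    # Residual r = A * beta - X^T y, evaluated at beta = 0.
    r = [-sum(row[j] * yi for row, yi in zip(X, y)) for j in range(m)]
    p = [-ri for ri in r]                # first search direction
    norm_r2 = sum(ri * ri for ri in r)
    target = norm_r2 * tol * tol         # stop when ||r|| <= tol * ||r_0||
    i = 0
    while norm_r2 > target and (maxi == 0 or i < maxi):  # maxi = 0: no limit
        q = apply_A(p)
        alpha = norm_r2 / sum(pi * qi for pi, qi in zip(p, q))
        beta = [bi + alpha * pi for bi, pi in zip(beta, p)]
        r = [ri + alpha * qi for ri, qi in zip(r, q)]
        new_norm_r2 = sum(ri * ri for ri in r)
        # New direction: steepest descent corrected by the previous direction.
        p = [-ri + (new_norm_r2 / norm_r2) * pi for ri, pi in zip(r, p)]
        norm_r2 = new_norm_r2
        i += 1
    return beta, i

X = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]
y = [2 * a - b for a, b in X]  # exact linear response with weights (2, -1)
beta, iters = lm_cg(X, y)
beta_capped, iters_capped = lm_cg(X, y, maxi=1)
```

Because the method only needs matrix-vector products, it avoids building the m-by-m system, which is the usual reason to prefer it when the number of features is large.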
`lmpredict`-Function

The `lmpredict`-function predicts the class of a feature vector.
lmpredict(X, w)
Name | Type | Default | Description |
---|---|---|---|
X | Matrix[Double] | required | Matrix of feature vector(s). |
w | Matrix[Double] | required | 1-column matrix of weights. |
icpt | Integer | 0 | Intercept presence, shifting and rescaling of X (Details) |
Type | Description |
---|---|
Matrix[Double] | 1-column matrix of classes. |
    X = rand(rows = 50, cols = 10)
    y = X %*% rand(rows = ncol(X), cols = 1)
    w = lm(X = X, y = y)
    yp = lmpredict(X, w)
`scale`-Function

The `scale` function is a generic function whose default method centers and/or scales the columns of a numeric matrix.
scale(X, center=TRUE, scale=TRUE)
Name | Type | Default | Description |
---|---|---|---|
X | Matrix[Double] | required | Matrix of feature vectors. |
center | Boolean | TRUE | Whether to center the columns of `X`. |
scale | Boolean | TRUE | Whether to scale the columns of `X`. |
Type | Description |
---|---|
Matrix[Double] | Matrix of centered and/or scaled values. |
    X = rand(rows = 20, cols = 10)
    center = TRUE;
    scale = TRUE;
    Y = scale(X, center, scale)
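For readers who want to see column centering and scaling spelled out, here is a plain-Python sketch. It assumes the convention of R's `scale`: centering subtracts the column mean, and scaling divides by the root of the sum of squares over n - 1 (the sample standard deviation when the column is centered). Whether the DML builtin matches this in every detail is an assumption, and `scale_columns` is a hypothetical name.

```python
import math

def scale_columns(X, center=True, scale=True):
    """Center and/or scale each column of a list-of-rows matrix."""
    n = len(X)
    cols = [list(c) for c in zip(*X)]  # work column by column
    out_cols = []
    for col in cols:
        if center:
            mean = sum(col) / n
            col = [v - mean for v in col]
        if scale:
            # Root of the sum of squares over n - 1; this equals the sample
            # standard deviation when the column has been centered.
            s = math.sqrt(sum(v * v for v in col) / (n - 1))
            # Assumption: an all-zero column is left unchanged.
            col = [v / s if s > 0 else v for v in col]
        out_cols.append(col)
    return [list(row) for row in zip(*out_cols)]  # back to rows

Y = scale_columns([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
C = scale_columns([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]], scale=False)
```

Both columns of `Y` end up identical even though the inputs differ by a factor of ten, which is the point of putting features on a common scale.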
`sigmoid`-Function

The sigmoid function is an activation function, also described as a squashing function, that limits its output to the range between 0 and 1. This makes it useful for predicting probabilities.
sigmoid(X)
Name | Type | Default | Description |
---|---|---|---|
X | Matrix[Double] | required | Matrix of feature vectors. |
Type | Description |
---|---|
Matrix[Double] | Matrix with the sigmoid function applied to each element of `X`. |
    X = rand(rows = 20, cols = 10)
    Y = sigmoid(X)
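The squashing behaviour is easy to check with a small plain-Python sketch of the logistic function `1 / (1 + exp(-x))`, applied element-wise. This illustrates the formula only and is not the DML implementation; `sigmoid_matrix` is a hypothetical helper name.

```python
import math

def sigmoid(x):
    """Logistic function 1 / (1 + exp(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # for negative x, exp(x) cannot overflow
    return z / (1.0 + z)

def sigmoid_matrix(X):
    """Apply the sigmoid to every element of a list-of-rows matrix."""
    return [[sigmoid(v) for v in row] for row in X]

Y = sigmoid_matrix([[0.0, 2.0], [-2.0, 1000.0]])
```

An input of 0 maps to 0.5, and the values for 2 and -2 sum to 1, reflecting the symmetry `sigmoid(-x) = 1 - sigmoid(x)`.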
`steplm`-Function

The `steplm`-function (stepwise linear regression) implements a classical forward feature selection method. This method iteratively runs what-if scenarios and greedily selects the next best feature until the Akaike information criterion (AIC) does not improve anymore. Each configuration trains a regression model via `lm`, which in turn calls either the closed-form `lmDS` or the iterative `lmCG`.
steplm(X, y, icpt);
Name | Type | Default | Description |
---|---|---|---|
X | Matrix[Double] | required | Matrix of feature vectors. |
y | Matrix[Double] | required | 1-column matrix of response values. |
icpt | Integer | 0 | Intercept presence, shifting and rescaling the columns of X (Details) |
reg | Double | 1e-7 | Regularization constant (lambda) for L2-regularization. Set to nonzero for highly dependent/sparse/numerous features |
tol | Double | 1e-7 | Tolerance (epsilon); conjugate gradient procedure terminates early if L2 norm of the beta-residual is less than tolerance * its initial norm |
maxi | Integer | 0 | Maximum number of conjugate gradient iterations. 0 = no maximum |
verbose | Boolean | TRUE | If TRUE, print messages are activated |
Type | Description |
---|---|
Matrix[Double] | Matrix of regression parameters (the betas); its size depends on the `icpt` input value. (`C` in the example) |
Matrix[Double] | Matrix of selected features ordered as computed by the algorithm. (`S` in the example) |
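`icpt`-Argument

The `icpt`-argument can be set to 2 modes:

* 0 = no intercept, no shifting, no rescaling
* 1 = add intercept, but neither shift nor rescale X
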
`selected`-Output

If the best AIC is achieved without any features, the matrix of selected features contains 0. Moreover, in this case no further statistics will be produced.
    X = rand(rows = 50, cols = 10)
    y = X %*% rand(rows = ncol(X), cols = 1)
    [C, S] = steplm(X = X, y = y, icpt = 1);
`slicefinder`-Function

The `slicefinder`-function returns the top-k worst-performing subsets according to a model calculation.
slicefinder(X, W, y, k, paq, S);
Name | Type | Default | Description |
---|---|---|---|
X | Matrix[Double] | required | Recoded dataset into Matrix |
W | Matrix[Double] | required | Trained model |
y | Matrix[Double] | required | 1-column matrix of response values. |
k | Integer | 1 | Number of subsets required |
paq | Integer | 1 | Number of values wanted for each column; `paq = 1` turns this off |
S | Integer | 2 | Number of subsets to combine (currently only 1 and 2 are supported) |
Type | Description |
---|---|
Matrix[Double] | Matrix containing the information of the top-k slices (relative error, standard error, value0, value1, col_number(sort), rows, cols, range_row, range_cols, value00, value01, col_number2(sort), rows2, cols2, range_row2, range_cols2) |
    X = rand(rows = 50, cols = 10)
    y = X %*% rand(rows = ncol(X), cols = 1)
    w = lm(X = X, y = y)
    ress = slicefinder(X = X, W = w, Y = y, k = 5, paq = 1, S = 2);
`normalize`-Function

The `normalize`-function normalises the values of a matrix by changing the dataset to use a common scale. This is done while preserving differences in the ranges of values. The output is a matrix of values in range [0,1].
normalize(X);
Name | Type | Default | Description |
---|---|---|---|
X | Matrix[Double] | required | Matrix of feature vectors. |
Type | Description |
---|---|
Matrix[Double] | Matrix of normalized values. |
    X = rand(rows = 50, cols = 10)
    y = X %*% rand(rows = ncol(X), cols = 1)
    y = normalize(X = X)
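A common way to map values into [0,1] is min-max normalization, sketched below in plain Python. It assumes the rescaling is done per column as `(x - min) / (max - min)`, and that a constant column maps to 0; both are assumptions about the behaviour rather than a statement of how the DML builtin is written, and this `normalize` is an illustrative stand-in.

```python
def normalize(X):
    """Min-max normalize each column of a list-of-rows matrix into [0, 1]."""
    cols = [list(c) for c in zip(*X)]  # work column by column
    out_cols = []
    for col in cols:
        lo, hi = min(col), max(col)
        span = hi - lo
        # Assumption: a constant column (span == 0) is mapped to all zeros.
        out_cols.append([(v - lo) / span if span > 0 else 0.0 for v in col])
    return [list(row) for row in zip(*out_cols)]  # back to rows

N = normalize([[1.0, 100.0], [2.0, 300.0], [5.0, 200.0]])
```

Each column's smallest value becomes 0 and its largest becomes 1, while the relative spacing of the values in between is kept.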