// Licensed to the Apache Software Foundation (ASF) under one or more
// contributor license agreements. See the NOTICE file distributed with
// this work for additional information regarding copyright ownership.
// The ASF licenses this file to You under the Apache License, Version 2.0
// (the "License"); you may not use this file except in compliance with
// the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
= Introduction
This section describes how to use Ignite ML for tuning ML algorithms and link:machine-learning/model-selection/pipeline-api[Pipelines]. Built-in cross-validation and other tooling allow users to optimize link:machine-learning/model-selection/hyper-parameter-tuning[hyper-parameters] in algorithms and Pipelines.
Model selection is a set of tools that provides the ability to prepare and link:machine-learning/model-selection/evaluator[evaluate] models efficiently. Use it to link:machine-learning/model-selection/split-the-dataset-on-test-and-train-datasets[split] data into training and test sets, and to perform cross-validation.
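As a quick illustration, the sketch below obtains a hold-out split with `TrainTestDatasetSplitter`. It assumes an already-started `ignite` instance and a `dataCache` of labeled vectors (both names are illustrative), and the trainer and vectorizer shown are only examples of what could be plugged in:

[source, java]
----
// Hold out 25% of the data: the splitter produces cache filters, not data copies.
TrainTestSplit<Integer, Vector> split = new TrainTestDatasetSplitter<Integer, Vector>()
    .split(0.75);

// Treat the first coordinate of each vector as the label, the rest as features.
Vectorizer<Integer, Vector, Integer, Double> vectorizer =
    new DummyVectorizer<Integer>().labeled(Vectorizer.LabelCoordinate.FIRST);

DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);

// Train only on the entries selected by the train filter; split.getTestFilter()
// can later be used to evaluate the model on the held-out part.
IgniteModel<Vector, Double> mdl = trainer.fit(ignite, dataCache, split.getTrainFilter(), vectorizer);
----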
== Overview
It is not good practice to learn the parameters of a prediction function and validate it on the same data: this leads to overfitting. One of the most common ways to avoid this problem is to hold out part of the available data as a validation set. However, by partitioning the available data and excluding one or more parts from the training set, we significantly reduce the number of samples that can be used to learn the model, and the results can depend on a particular random choice of the (train, validation) pair.
A solution to this problem is a procedure called link:machine-learning/model-selection/cross-validation[Cross-Validation]. In the basic approach, called k-fold CV, the training set is split into k smaller sets (folds), and the following procedure is repeated for each fold: a model is trained using the other k-1 folds as training data, and the resulting model is validated on the remaining fold (it is used as a test set to compute metrics such as accuracy).
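To make the fold bookkeeping concrete, here is a plain-Java sketch (not an Ignite API) that partitions ten sample indices into five folds, so that each sample lands in a test set exactly once:

[source, java]
----
// Illustrative only: round-robin k-fold partitioning of sample indices.
int n = 10; // number of samples
int k = 5;  // number of folds
for (int fold = 0; fold < k; fold++) {
    List<Integer> train = new ArrayList<>();
    List<Integer> test = new ArrayList<>();
    for (int i = 0; i < n; i++) {
        if (i % k == fold)
            test.add(i);  // each sample is in the test set of exactly one fold
        else
            train.add(i); // the remaining k-1 folds form the training set
    }
    System.out.println("Fold " + fold + ": train=" + train + " test=" + test);
}
----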
Apache Ignite provides cross-validation functionality that allows you to parameterize the trainer to be validated, the metrics to be calculated for the model trained at every step, and the number of folds the training data should be split into.
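A sketch of that functionality is shown below. It assumes the builder-style `CrossValidation` class from recent Ignite versions, plus an already-prepared `ignite` instance, `trainingSet` cache, and `vectorizer` (the names are illustrative, and method names can differ between releases):

[source, java]
----
// Score a decision tree classifier with 3-fold cross-validation.
DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(4, 0);

CrossValidation<DecisionTreeNode, Integer, Vector> scoreCalculator = new CrossValidation<>();

double[] accuracyPerFold = scoreCalculator
    .withIgnite(ignite)
    .withUpstreamCache(trainingSet)
    .withTrainer(trainer)
    .withMetric(MetricName.ACCURACY)  // metric computed for the model trained at every step
    .withPreprocessor(vectorizer)
    .withAmountOfFolds(3)             // number of folds the training data is split into
    .isRunningOnPipeline(false)
    .scoreByFolds();
----

`scoreByFolds()` returns one metric value per fold, so the spread of the resulting array gives a feel for the variance of the estimate.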