---
id: org.apache.streampipes.sinks.internal.jvm.datalake
title: Data Lake
sidebar_label: Data Lake
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one or more
~ contributor license agreements. See the NOTICE file distributed with
~ this work for additional information regarding copyright ownership.
~ The ASF licenses this file to You under the Apache License, Version 2.0
~ (the "License"); you may not use this file except in compliance with
~ the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing, software
~ distributed under the License is distributed on an "AS IS" BASIS,
~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~ See the License for the specific language governing permissions and
~ limitations under the License.
~
-->
<p align="center">
<img src="/img/pipeline-elements/org.apache.streampipes.sinks.internal.jvm.datalake/icon.png" width="150px;" class="pe-image-documentation"/>
</p>
***
## Description
Stores events in the internal data lake so that data can be visualized in the live dashboard or in the data explorer.
Simply create a pipeline with a data lake sink, switch to one of the data exploration tools, and start exploring your
data!
***
## Required input
This sink requires an event that provides a timestamp value (a field that is marked to be of type
``http://schema.org/DateTime``).
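
As a rough illustration of this requirement, the following Python sketch checks whether an event schema contains a field marked with the semantic type `http://schema.org/DateTime`. The schema representation (a list of dictionaries with `name` and `semantic_type` keys) and the function name `find_timestamp_fields` are assumptions made for this example; they are not part of the StreamPipes API.

```python
SO_DATE_TIME = "http://schema.org/DateTime"


def find_timestamp_fields(schema):
    """Return the names of all fields marked with the schema.org DateTime type.

    `schema` is a hypothetical representation: a list of dicts with a
    "name" key and an optional "semantic_type" key.
    """
    return [f["name"] for f in schema if f.get("semantic_type") == SO_DATE_TIME]


# Hypothetical event schema used only for illustration.
schema = [
    {"name": "timestamp", "semantic_type": SO_DATE_TIME},
    {"name": "sensorId", "semantic_type": None},
    {"name": "temperature", "semantic_type": None},
]

timestamp_fields = find_timestamp_fields(schema)
if not timestamp_fields:
    raise ValueError("The data lake sink needs at least one timestamp field")
print(timestamp_fields)
```

An event stream without such a field would not satisfy the sink's input requirement in this sketch.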
***
## Configuration
### Dimensions
The fields which will be stored as dimensional values in the time series storage. Dimensions are typically identifiers
such as the ID of a sensor.
Dimensions support grouping in the data explorer, but they are converted to text-based fields and therefore provide less
advanced filtering capabilities.
Be careful when modifying dimensions of existing pipelines! This might have an impact on how you are able to view data
in the data explorer due to schema incompatibilities.
### Identifier
The name of the measurement (table) where the events are stored.
### Schema Update Options
The Schema Update Options determine how the sink behaves when a measurement (table) with the same identifier already exists.
#### Option 1: Update Schema
- **Description:** Overrides the existing schema.
- **Effect on Data:** The data remains in the data lake, but accessing old data is restricted to file export.
- **Impact on Features:** Other StreamPipes features, such as the Data Explorer, will only display the new event schema.
#### Option 2: Extend Existing Schema
- **Description:** Keeps old event fields in the event schema.
- **Strategy:** This follows an append-only strategy, allowing continued work with historic data.
- **Consideration:** Old properties may exist for which no new data is generated.
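
The difference between the two options can be sketched in a few lines of Python. This is a simplified model under stated assumptions: a schema is reduced to a list of field names, and the function `resolve_schema` as well as the option strings `"update"` and `"extend"` are hypothetical names chosen for this example, not identifiers from StreamPipes.

```python
def resolve_schema(existing_fields, new_fields, option):
    """Return the resulting field list for a measurement that already exists.

    "update": the new schema overrides the existing one.
    "extend": old fields are kept and new fields are appended (append-only).
    """
    if option == "update":
        return list(new_fields)
    if option == "extend":
        merged = list(existing_fields)
        merged += [f for f in new_fields if f not in merged]
        return merged
    raise ValueError(f"Unknown schema update option: {option}")


# Hypothetical schemas used only for illustration.
existing = ["timestamp", "temperature"]
incoming = ["timestamp", "pressure"]

print(resolve_schema(existing, incoming, "update"))
print(resolve_schema(existing, incoming, "extend"))
```

With the extend strategy, `temperature` stays in the schema even though the new pipeline no longer produces values for it, which corresponds to the consideration noted above.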
### Dimensions
Select fields which will be marked as dimensions. Dimensions reflect tags in the underlying time-series database.
Dimensions support grouping operations and can be used for fields with a limited set of values, e.g., boolean flags or
fields representing IDs. Dimensions are not a good choice for fields with a high number of different values since they
slow down database queries.
By default, all fields which are marked as dimensions in the metadata are chosen and can be manually overridden
with this configuration.
Data types which can be marked as dimensional values are booleans, integers, and strings.
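
To make the distinction between dimensions and measurement fields more concrete, the following Python sketch splits an event into a timestamp, text-based tags (the dimensions), and remaining measurement fields. The function `split_event`, the event layout, and the default field name `timestamp` are assumptions for illustration only; the actual storage logic of the sink may differ.

```python
def split_event(event, dimensions, timestamp_field="timestamp"):
    """Split an event into (timestamp, tags, fields).

    Dimension values are converted to strings, mirroring the fact that
    dimensions are stored as text-based tags. All other values are kept
    as measurement fields with their original type.
    """
    tags = {k: str(v) for k, v in event.items() if k in dimensions}
    fields = {
        k: v
        for k, v in event.items()
        if k not in dimensions and k != timestamp_field
    }
    return event[timestamp_field], tags, fields


# Hypothetical event used only for illustration.
event = {"timestamp": 1700000000000, "sensorId": 7, "temperature": 21.5}
ts, tags, fields = split_event(event, dimensions=["sensorId"])
print(ts, tags, fields)
```

In this sketch the integer `sensorId` becomes the text value `"7"`, which is why grouping by a dimension works well while numeric filtering on it is more limited.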
### Ignore Duplicates
Before writing an event to the time series storage, a duplicate check is performed. When this option is activated, only
fields whose value differs from the previous event are stored.
This setting only affects measurement fields, not dimensions.
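
A minimal sketch of such a duplicate check is shown below. It assumes that the measurement fields of an event are available as a plain dictionary and that the previous event is kept in memory; the function name `changed_fields` is hypothetical, and the sink's actual implementation may work differently.

```python
def changed_fields(previous, current):
    """Return only the measurement fields whose value differs from the previous event.

    If there is no previous event, all fields are returned.
    """
    if previous is None:
        return dict(current)
    return {
        k: v
        for k, v in current.items()
        if k not in previous or previous[k] != v
    }


# Hypothetical measurement fields of two consecutive events.
first = {"temperature": 21.5, "pressure": 1.0}
second = {"temperature": 21.5, "pressure": 1.1}

print(changed_fields(None, first))
print(changed_fields(first, second))
```

Dimensions are left out of the comparison here on purpose, since the setting only affects measurement fields.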