---
title: "Apache Gravitino Glossary"
date: 2023-11-28
license: "This software is licensed under the Apache License version 2."
---
## API
- Application Programming Interface, defining the methods and protocols for interacting with a server.
## AWS
- Amazon Web Services, a cloud computing platform provided by Amazon.
## AWS Glue
- An AWS service whose Data Catalog provides a compatible implementation of the Hive Metastore Service (HMS).
## GPG/GnuPG
- GNU Privacy Guard (GnuPG) is an open-source implementation of the OpenPGP standard.
It is usually used for encrypting and signing files and emails.
## HDFS
- **HDFS** (Hadoop Distributed File System) is an open-source distributed file system.
It is a key component of the Apache Hadoop ecosystem.
HDFS is designed as a distributed storage solution to store and process large-scale datasets.
It features high reliability, fault tolerance, and excellent performance.
## HTTP port
- The port number on which a server listens for incoming connections.
## IP address
- Internet Protocol address, a numerical label assigned to each device in a computer network.
## JDBC
- Java Database Connectivity, an API for connecting Java applications to relational databases.
## JDBC URI
- The JDBC connection address specified in the catalog configuration.
It usually includes components such as the database type, host, port, and database name.
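
As a rough illustration of those components, the sketch below splits a made-up MySQL-style JDBC URI using Python's standard library. The URI, host, and database name are hypothetical, and real drivers accept many vendor-specific forms that this sketch does not handle.

```python
from urllib.parse import urlsplit


def parse_jdbc_uri(uri: str) -> dict:
    """Split a simple `jdbc:<type>://<host>:<port>/<database>` URI into parts."""
    prefix = "jdbc:"
    if not uri.startswith(prefix):
        raise ValueError(f"not a JDBC URI: {uri!r}")
    # Drop the `jdbc:` prefix so the remainder parses like an ordinary URL.
    parts = urlsplit(uri[len(prefix):])
    return {
        "database_type": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }


# Hypothetical URI, for illustration only.
example = parse_jdbc_uri("jdbc:mysql://localhost:3306/sales")
print(example)
```
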
## JDK
- Java Development Kit, the software development kit for the Java programming language.
A JDK provides tools for compiling, debugging, and running Java applications.
## JMX
- Java Management Extensions provides tools for managing and monitoring Java applications.
## JSON
- JavaScript Object Notation, a lightweight data interchange format.
## JSON Web Token
- See [JWT](#jwt).
## JVM
- Java Virtual Machine, a virtual machine that enables a computer to run Java applications.
A JVM implements an abstract machine that is different from the underlying hardware.
## JVM instrumentation
- The process of adding monitoring and management capabilities to the [JVM](#jvm).
Instrumentation is mainly used to collect performance metrics.
## JVM metrics
- Metrics related to the performance and behavior of the [Java Virtual Machine](#jvm).
Some valuable metrics are memory usage, garbage collection, and buffer pool metrics.
## JWT
- JSON Web Token, a compact, URL-safe representation of claims to be transferred between two parties.
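
In its compact form, a JWT is three base64url-encoded segments (header, payload, and signature) joined by dots. The following is a minimal sketch of that layout using only Python's standard library and an HMAC-SHA256 (`HS256`) signature. The secret and claims are hypothetical, and the sketch skips checks such as expiry that a maintained JWT library would perform.

```python
import base64
import hashlib
import hmac
import json


def b64url_encode(raw: bytes) -> str:
    """Base64url-encode without padding, as compact JWTs do."""
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    """Decode base64url text, restoring the padding that JWTs omit."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def make_token(claims: dict, secret: bytes) -> str:
    """Build an HS256-signed token of the form header.payload.signature."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        b64url_encode(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{b64url_encode(signature)}"


def read_claims(token: str, secret: bytes) -> dict:
    """Check the HS256 signature, then return the decoded claims."""
    header_b64, payload_b64, signature_b64 = token.split(".")
    expected = hmac.new(
        secret, f"{header_b64}.{payload_b64}".encode("ascii"), hashlib.sha256
    ).digest()
    if not hmac.compare_digest(expected, b64url_decode(signature_b64)):
        raise ValueError("signature mismatch")
    return json.loads(b64url_decode(payload_b64))


# Hypothetical secret and claims, for illustration only.
secret = b"demo-secret"
token = make_token({"sub": "alice", "aud": "example"}, secret)
claims = read_claims(token, secret)
print(claims)
```
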
## KEYS file
- A file containing the public keys of the people who signed previous releases, necessary for verifying signatures.
## PGP signature
- A digital signature generated using Pretty Good Privacy (PGP) or a compatible tool such as [GnuPG](#gpggnupg).
The signature is typically used to validate the authenticity of a file.
## REST
- Representational State Transfer, a set of architectural principles for designing networked applications.
## REST API
- Representational State Transfer (REST) Application Programming Interface.
A set of rules and conventions for building and interacting with Web services using standard HTTP methods.
## SHA256 checksum
- A hash value produced by the SHA-256 cryptographic hash function, used to verify the integrity of files.
## SHA256 checksum file
- A file containing the SHA256 hash value of another file, used for verification purposes.
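
The check can be sketched as follows with Python's `hashlib`. The sketch assumes the checksum file starts with the hex digest, optionally followed by the file name (the layout written by tools such as `sha256sum`). The file names are hypothetical, and throwaway files stand in for a real release artifact.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum_file(artifact: Path, checksum_file: Path) -> bool:
    """Compare the artifact's digest with the first token in the checksum file."""
    expected = checksum_file.read_text().split()[0].lower()
    return sha256_of(artifact) == expected


# Demo with temporary files; the names are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "example-bin.tar.gz"
    artifact.write_bytes(b"example release contents")
    checksum_file = Path(tmp) / "example-bin.tar.gz.sha256"
    checksum_file.write_text(f"{sha256_of(artifact)}  {artifact.name}\n")
    verified = matches_checksum_file(artifact, checksum_file)

    # Changing the artifact afterwards makes the check fail.
    artifact.write_bytes(b"tampered contents")
    tampered_verified = matches_checksum_file(artifact, checksum_file)

print(verified, tampered_verified)
```
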
## SQL
- Structured Query Language, a programming language used to manage and manipulate relational databases.
## SSH
- Secure Shell, a cryptographic network protocol used for secure communication over a computer network.
## URI
- Uniform Resource Identifier, a string that identifies a name or a resource on the internet.
## YAML
- YAML Ain't Markup Language, a human-readable file format often used for structured data.
## Amazon Elastic Block Store (EBS)
- A scalable block storage service provided by Amazon Web Services (AWS).
## Apache Gravitino
- An open-source software platform initially created by Datastrato.
It is designed for high-performance, geo-distributed, and federated metadata lakes.
Gravitino can manage metadata directly in different sources, types, and regions,
providing data and AI assets with unified metadata access.
## Apache Gravitino configuration file (gravitino.conf)
- The configuration file for the Gravitino server, located in the `conf` directory.
It follows the standard properties file format and contains settings for the Gravitino server.
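
To give a feel for the properties format, here is a minimal sketch that reads `key = value` lines while skipping blank lines and `#` comments. The keys shown are hypothetical rather than actual Gravitino settings, and the sketch ignores parts of the format such as `:` separators and line continuations.

```python
def parse_properties(text: str) -> dict:
    """Parse simple `key = value` lines, skipping blanks and `#` comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first `=` only, so values may contain `=` themselves.
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


# Hypothetical keys, not actual Gravitino settings.
sample = """
# Example server settings
example.server.host = 127.0.0.1
example.server.port = 8090
"""
settings = parse_properties(sample)
print(settings)
```
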
## Apache Hadoop
- An open-source distributed storage and processing framework.
## Apache Hive
- An open-source data warehousing software project.
It provides a SQL-like query language for managing and querying large datasets.
## Apache Iceberg
- An open-source, versioned table format for large-scale data processing.
## Apache Iceberg Hive catalog
- The **Iceberg Hive catalog** is a metadata service designed for the Apache Iceberg table format.
It allows external systems to interact with Iceberg metadata using a Hive Metastore Thrift client.
## Apache Iceberg JDBC catalog
- The **Iceberg JDBC catalog** is a metadata service designed for the Apache Iceberg table format.
It enables external systems to interact with an Iceberg metadata service using [JDBC](#jdbc).
## Apache Iceberg REST catalog
- The **Iceberg REST catalog** is a metadata service designed for the Apache Iceberg table format.
It enables external systems to interact with an Iceberg metadata service using a [REST API](#rest-api).
## Apache License version 2
- A permissive, open-source software license written by The Apache Software Foundation.
## Authentication mechanism
- The method used to verify the identity of users and clients accessing a server.
## Binary distribution package
- A software package containing the compiled executables for distribution and deployment.
## Catalog
- A collection of metadata from a specific metadata source.
## Catalog provider
- The specific system or technology used to store and manage metadata catalogs.
## Columns
- The individual fields or attributes of a table.
Each column has properties like name, data type, comment, and nullability.
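
The properties listed above can be pictured as a small record type. The `Column` class below is a hypothetical illustration and not Gravitino's own API. The type names are also made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Column:
    """Hypothetical column description: name, data type, comment, nullability."""

    name: str
    data_type: str
    comment: Optional[str] = None
    nullable: bool = True


columns = [
    Column("id", "bigint", comment="Row identifier", nullable=False),
    Column("email", "string"),
]

# Names of the columns that must always hold a value.
required = [column.name for column in columns if not column.nullable]
print(required)
```
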
## Continuous integration (CI)
- The practice of automatically building and testing code changes when they are committed to version control.
## Dependencies
- External libraries or modules required by a project for its compilation and features.
## Distribution
- A packaged and deployable version of the software.
## Docker
- A platform for developing, shipping, and running applications in containers.
## Docker container
- A lightweight, isolated, runnable instance of a [Docker image](#docker-image).
A container bundles an application with its dependencies and runtime.
## Docker Hub
- A cloud-based registry service for Docker images.
Users can publish, browse, and download containerized software using this service.
## Docker image
- A lightweight, standalone package that includes everything needed to run the software.
A Docker image typically comprises the code, runtime, libraries, and system tools.
## Dockerfile
- A configuration file for building a Docker image.
A Dockerfile contains instructions to build a standard image for distributing the software.
## Dropwizard metrics
- A Java library for measuring the performance of applications and providing support for various metric types.
## Environment variables
- Variables used to customize the runtime configuration for a process.
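
As a minimal sketch, a process might read a setting from its environment and fall back to a default when the variable is unset. The variable name `EXAMPLE_HTTP_PORT` and the default value are hypothetical.

```python
import os
from typing import Mapping


def http_port(env: Mapping[str, str] = os.environ, default: int = 8090) -> int:
    """Read a port from the hypothetical EXAMPLE_HTTP_PORT variable."""
    raw = env.get("EXAMPLE_HTTP_PORT")
    # Fall back to the default when the variable is unset or empty.
    return int(raw) if raw else default


# Passing a plain dict makes the behavior easy to try without touching the real environment.
print(http_port({}))
print(http_port({"EXAMPLE_HTTP_PORT": "9000"}))
```
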
## Geo-distributed
- The distribution of data or services across multiple geographic locations.
## Git
- A distributed version control system used for tracking software artifacts.
## GitHub
- A web-based platform for version control and community collaboration using Git.
## GitHub Actions
- A continuous integration and continuous deployment (CI/CD) service provided by GitHub.
GitHub Actions automates build, test, and deployment workflows.
## GitHub labels
- Labels assigned to GitHub issues or pull requests for organization or workflow automation.
## GitHub pull request
- A proposed change to a GitHub repository submitted by a user.
## GitHub repository
- The location where GitHub stores a project's source code and related files.
## GitHub workflow
- A series of automated steps triggered by specific events on a GitHub repository.
## Gradle
- An automation tool for building, testing, and deploying projects.
## Gradlew
- A Gradle wrapper script used to execute Gradle commands.
## Hashes
- Cryptographic hash values generated from some data.
A typical use case is to verify the integrity of a file.
## Headless
- A system without a local console.
## Identity fields
- Fields in tables that define the identity of the records.
In the scope of a table, the identity fields are used as the unique identifier of a row.
## Integration tests
- Tests that ensure software correctness and compatibility when integrating components into a larger system.
## Java Database Connectivity (JDBC)
- See [JDBC](#jdbc).
## Java Development Kits (JDKs)
- See [JDK](#jdk).
## Java Management Extensions
- See [JMX](#jmx).
## Java Toolchain
- A Gradle feature for detecting and managing JDK versions.
## Java Virtual Machine
- See [JVM](#jvm).
## Key pair
- A pair of cryptographic keys, including a public key used for verification and a private key used for signing.
## Lakehouse
- **Lakehouse** is a modern data management architecture that combines elements of data lakes and data warehouses.
It aims to provide a unified platform for storing, managing, and analyzing both raw unstructured data
(similar to data lakes) and curated structured data.
## Manifest
- A list of files and their associated metadata that collectively define the structure and content of a release or distribution.
## Merge operation
- A process in Iceberg that involves combining changes from multiple snapshots into a new snapshot.
## Metalake
- The top-level container for metadata.
Typically, a metalake is a tenant-like mapping to an organization or a company.
All the catalogs, users, and roles are associated with one metalake.
## Metastore
- A central repository that stores metadata for a data warehouse.
## Module
- A distinct and separable part of a project.
## Open authorization / OAuth
- A standard protocol for authorization that allows third-party applications to access resources on behalf of a user.
The application doesn't need to access the user credentials.
## OrbStack
- An alternative to Docker for macOS that can be used when running Gravitino integration tests.
## Private key
- A confidential key used for signing, decryption, or other operations that should remain confidential.
## Properties
- Configurable settings and attributes associated with catalogs, schemas, and tables.
The property settings influence the behavior and storage of the corresponding entities.
## Protocol buffers (protobuf)
- A method developed by Google for serializing structured data, similar to XML or JSON.
It is often used for efficient and extensible communication between systems.
## Public key
- An openly shared key used for verification, encryption, or other operations intended for public knowledge.
## Representational State Transfer
- See [REST](#rest).
## RocksDB
- An open-source, embeddable key-value store.
## Schema
- A logical container for organizing tables in a database.
## Secure Shell
- See [SSH](#ssh).
## Security group
- A virtual firewall for your instance to control inbound and outbound traffic.
## Serde
- A serialization/deserialization library.
It can transform data between a tabular format and a format suitable for storage or transmission.
## Snapshot
- A point-in-time capture of the state of an Iceberg table, representing a specific version of the table.
## Sort order
- The arrangement of data within a Hive table, specified by expression or direction.
## Spotless
- A tool used to enforce code formatting standards and apply automatic formatting to code.
## Structured Query Language
- See [SQL](#sql).
## Table
- A structured set of data elements stored in columns and rows.
## Thrift
- An RPC framework whose network protocol is used for communication with the Hive Metastore Service (HMS).
## Token
- A **token** in the context of computing and security is a piece of data that represents an identity, a credential, or a permission.
Tokens play a crucial role in various domains, including authentication and authorization.
## Trino
- A query engine for big data processing.
## Trino connector
- A connector module for integrating Gravitino with Trino.
## Ubuntu
- A Linux distribution based on Debian, widely used for cloud computing and servers.
## Unit test
- A type of software testing where individual components or functions of a program are tested.
Unit tests help to ensure that the component or function works as expected in isolation.
## Verification
- The process of confirming the authenticity and integrity of a release.
This is usually done by checking its signature and associated hash values.
## Web UI
- A graphical interface accessible through a web browser.