license: | Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to You under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at
There is a zoo of data processing platforms which help users and organizations to extract value out of their data. Although each of these platforms excels in specific aspects, users typically end up running their data analytics on suboptimal platforms. This is not only because choosing the right platform among the myriad of big data platforms is a daunting task, but also due to the fact that today’s data analytics are moving beyond the limits of a single platform. Thus, there is an urgent need for cross-platform data processing, i.e., using more than one data processing platform to perform a data analytics task. Despite the need, achieving this is still a dreadful process where developers have to get intimate with many systems and write ad hoc scripts for integrating them. This tutorial is motivated by this need. We will discuss the importance of supporting cross-platform data processing in a systematic way as well as the current efforts to achieve that. In particular, we will introduce a classification of the different cases where an application needs or benefits from cross-platform data processing and the challenges of each case. Along with this classification, we will also present the efforts known up to date to support cross-platform data processing. We will conclude this tutorial with a discussion of several important open problems.