<?xml version='1.0' encoding='utf-8' ?>
<!DOCTYPE section PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd" [
<!ENTITY % BOOK_ENTITIES SYSTEM "CloudStack_GSoC_Guide.ent">
%BOOK_ENTITIES;
]>
<!-- Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<chapter id="gsoc-meng">
<title>Meng's 2013 GSoC Proposal</title>
<para>This chapter describes Meng's 2013 Google Summer of Code project within the &PRODUCT; ASF project. It reproduces the proposal as it was submitted.</para>
<section id="Project-Description">
<title>Project Description</title>
<para>
Getting a Hadoop cluster going can be challenging and painful due to the tedious configuration phase and the diverse idiosyncrasies of each cloud provider. Apache Whirr<ulink url="http://whirr.apache.org/"><citetitle>[1]</citetitle></ulink> and Provisionr are libraries for running cloud services in an automatic or semi-automatic fashion. They take advantage of a cloud-neutral library called jclouds<ulink url="http://www.jclouds.org/documentation/gettingstarted/what-is-jclouds/"><citetitle>[2]</citetitle></ulink> to create one-click, auto-configuring Hadoop clusters on multiple clouds. Since jclouds supports the CloudStack API, most of the services provided by Whirr and Provisionr should work out of the box on CloudStack. My first task is to test that assumption, make sure everything is well documented, and fix any issues with the latest versions of CloudStack (4.0 and 4.1).
</para>
<para>
The biggest challenge in Hadoop provisioning is automatically configuring each instance at launch time based on what it is supposed to do, a process known as contextualization<ulink url="http://dl.acm.org/citation.cfm?id=1488934"><citetitle>[3]</citetitle></ulink><ulink url="http://www.nimbusproject.org/docs/current/clouds/clusters2.html"><citetitle>[4]</citetitle></ulink>. Contextualization makes last-minute changes inside an instance so that it adapts to its cluster environment, and many automated cloud services depend on it. In a one-click Hadoop cluster, for example, contextualization basically amounts to generating and distributing SSH key pairs among instances, telling an instance where the master node is, which slave nodes it should be aware of, and so on. On EC2, contextualization is done by passing information through the EC2_USER_DATA entry<ulink url="http://aws.amazon.com/amazon-linux-ami/"><citetitle>[5]</citetitle></ulink><ulink url="https://svn.apache.org/repos/asf/whirr/branches/contrib-python/src/py/hadoop/cloud/data/hadoop-ec2-init-remote.sh"><citetitle>[6]</citetitle></ulink>, and Whirr and Provisionr rely on this mechanism to provision Hadoop instances on EC2. My second task is to extend Whirr and Provisionr's one-click EC2 solution to CloudStack and to improve CloudStack's support for Whirr and Provisionr, enabling Hadoop provisioning on CloudStack based clouds.
</para>
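<para>
To make the user-data flow concrete, here is a minimal sketch (an illustration, not part of Whirr or Provisionr) of how a client could pass contextualization data to a CloudStack instance through the deployVirtualMachine API call, which accepts base64-encoded user-data much like EC2_USER_DATA. The endpoint, keys, IDs, and payload format are placeholder assumptions.
</para>
<programlisting><![CDATA[
import base64
import hashlib
import hmac
import urllib.parse
import urllib.request

ENDPOINT = "http://cloudstack.example.com:8080/client/api"  # hypothetical endpoint
API_KEY = "my-api-key"        # placeholder credentials
SECRET_KEY = "my-secret-key"  # placeholder credentials

def sign(params):
    # CloudStack request signing: sort the parameters, URL-encode the
    # values, lowercase the whole query string, HMAC-SHA1 it with the
    # secret key, and base64-encode the digest.
    query = "&".join(
        "%s=%s" % (k.lower(), urllib.parse.quote(str(v), safe="").lower())
        for k, v in sorted(params.items(), key=lambda kv: kv[0].lower())
    )
    digest = hmac.new(SECRET_KEY.encode(), query.encode(), hashlib.sha1).digest()
    return base64.b64encode(digest).decode()

# Contextualization payload: tell the instance where the master node is
# and which roles it should configure itself for (format is illustrative).
user_data = "ROLE=datanode,tasktracker\nMASTER=10.0.0.5\n"

params = {
    "command": "deployVirtualMachine",
    "apiKey": API_KEY,
    "response": "json",
    "serviceofferingid": "OFFERING-ID",  # placeholder
    "templateid": "TEMPLATE-ID",         # placeholder
    "zoneid": "ZONE-ID",                 # placeholder
    # CloudStack expects user-data to be base64-encoded, like EC2_USER_DATA.
    "userdata": base64.b64encode(user_data.encode()).decode(),
}
params["signature"] = sign(params)

url = ENDPOINT + "?" + urllib.parse.urlencode(params)
print(urllib.request.urlopen(url).read().decode())
]]></programlisting>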
<para>
My third task is to add a Query API to CloudStack that is compatible with Amazon Elastic MapReduce (EMR). This API will expose all Hadoop provisioning functionality, so that users can reuse cloud clients written for EMR to create and manage Hadoop clusters on CloudStack based clouds.
</para>
</section>
<section id="Project-Details">
<title>Project Details</title>
<para>
Whirr defines four roles for the Hadoop provisioning service: NameNode, JobTracker, DataNode, and TaskTracker. With the help of CloudInit<ulink url="https://help.ubuntu.com/community/CloudInit"><citetitle>[7]</citetitle></ulink> (a popular package for cloud instance initialization), each VM instance is configured based on its role and a compressed file that is passed in the EC2_USER_DATA entry. Since CloudStack also supports EC2_USER_DATA, I think the most feasible way to bring Hadoop provisioning to CloudStack is to extend Whirr's EC2 solution to the CloudStack platform and make the necessary adjustments for CloudStack's specifics.
</para>
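<para>
The following sketch makes the boot-time role dispatch concrete: the instance fetches its role list (delivered via user-data) and starts the matching Hadoop daemons. This is an assumption about the mechanism rather than Whirr's actual implementation; the metadata URL and payload format are hypothetical.
</para>
<programlisting><![CDATA[
import subprocess
import urllib.request

# CloudStack serves user-data from a metadata service on the virtual
# router; the exact URL depends on the deployment and is assumed here.
METADATA_URL = "http://10.1.1.1/latest/user-data"  # hypothetical

# Map each Whirr role to the Hadoop 1.x daemon it should start.
ROLE_COMMANDS = {
    "namenode":    ["hadoop-daemon.sh", "start", "namenode"],
    "jobtracker":  ["hadoop-daemon.sh", "start", "jobtracker"],
    "datanode":    ["hadoop-daemon.sh", "start", "datanode"],
    "tasktracker": ["hadoop-daemon.sh", "start", "tasktracker"],
}

user_data = urllib.request.urlopen(METADATA_URL).read().decode()

# Expect a line such as "ROLE=datanode,tasktracker" (format is illustrative).
roles = []
for line in user_data.splitlines():
    if line.startswith("ROLE="):
        roles = line[len("ROLE="):].split(",")

for role in roles:
    cmd = ROLE_COMMANDS.get(role.strip())
    if cmd:
        subprocess.run(cmd, check=True)
]]></programlisting>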
<para>
Whirr and Provisionr deal with two critical issues in their role configuration scripts (configure-hadoop-role_list): SSH key authentication and hostname configuration.
</para>
<orderedlist>
<listitem><para>
SSH key authentication. SSH key based authentication is needed so that the master node can log in to the slave nodes to start and stop Hadoop daemons; each node also needs to log in to itself to start its own daemons. Traditionally this is done by generating a key pair on the master node and distributing the public key to all slave nodes, which requires human intervention. Whirr works around this problem on EC2 by using a common key pair for all nodes in a Hadoop cluster, so every node is able to log in to every other node. The key pair is provided by the user and obtained inside an instance by CloudInit from the metadata service (see the sketch following this list). As far as I know, CloudStack does not support user-provided SSH key authentication. Although CloudStack has the createSSHKeyPair API<ulink url="http://cloudstack.apache.org/docs/en-US/Apache_CloudStack/4.0.2/html/Installation_Guide/using-sshkeys.html"><citetitle>[8]</citetitle></ulink> to generate SSH keys, and users can create an instance template that supports SSH keys, there is no easy way to install a unified SSH key on all cluster instances. Besides, Whirr prefers minimal image management, so a customized template does not quite fit here.
</para></listitem>
<listitem><para>
Hostname configuration. The hostname of each instance has to be properly set and injected into the Hadoop configuration files (core-site.xml, hdfs-site.xml, mapred-site.xml). For an EC2 instance, the hostname is derived from its public IP combined with an EC2-specific prefix and suffix (e.g. an instance with IP 54.224.206.71 gets the hostname ec2-54-224-206-71.compute-1.amazonaws.com), and this hostname serves as the Fully Qualified Domain Name that uniquely identifies the node on the network (see the sketch following this list). On CloudStack, if the user does not specify a name, the hostname that identifies a VM on a network is a unique UUID generated by CloudStack<ulink url="https://cwiki.apache.org/CLOUDSTACK/allow-user-provided-hostname-internal-vm-name-on-hypervisor-instead-of-cloud-platform-auto-generated-name-for-guest-vms.html"><citetitle>[9]</citetitle></ulink>.
</para></listitem>
</orderedlist>
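<para>
The sketch below illustrates both workarounds under stated assumptions: installing a cluster-wide SSH key pair so that every node can log in to every other node (item 1), and deriving an EC2-style hostname from the public IP (item 2). File paths and the domain suffix are placeholders.
</para>
<programlisting><![CDATA[
import os
import subprocess

def install_shared_key(private_key, public_key,
                       ssh_dir=os.path.expanduser("~/.ssh")):
    # Give every node the same key pair and trust its public half, so any
    # node can log in to any other (Whirr's EC2 workaround from item 1).
    os.makedirs(ssh_dir, mode=0o700, exist_ok=True)
    key_path = os.path.join(ssh_dir, "id_rsa")
    with open(key_path, "w") as f:
        f.write(private_key)
    os.chmod(key_path, 0o600)
    with open(os.path.join(ssh_dir, "authorized_keys"), "a") as f:
        f.write(public_key + "\n")

def ec2_style_hostname(public_ip, prefix="ec2",
                       suffix="compute-1.amazonaws.com"):
    # 54.224.206.71 -> ec2-54-224-206-71.compute-1.amazonaws.com; on a
    # CloudStack cloud the suffix would be deployment-specific (item 2).
    return "%s-%s.%s" % (prefix, public_ip.replace(".", "-"), suffix)

if __name__ == "__main__":
    name = ec2_style_hostname("10.0.0.5", suffix="cloud.example.com")  # hypothetical suffix
    subprocess.run(["hostname", name], check=True)  # set it for this boot
    print(name)
]]></programlisting>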
<para>
These are the two main issues that need improved support on the CloudStack side. Other tasks, such as preparing disks, installing Hadoop tarballs, and starting Hadoop daemons, can be done easily because they are relatively role- and instance-independent and static. runurl can be used to simplify the user-data scripts.
</para>
<para>
After we achieve Hadoop provisioning on CloudStack using Whirr, we can go further and add a Query API to CloudStack that exposes this functionality. I will write an API compatible with the Amazon Elastic MapReduce (EMR) service<ulink url="http://docs.aws.amazon.com/ElasticMapReduce/latest/API/Welcome.html"><citetitle>[10]</citetitle></ulink> so that users can reuse clients written for EMR to submit jobs to existing Hadoop clusters, poll job status, terminate a Hadoop instance, and perform other operations on CloudStack based clouds. The EMR API currently supports eight actions<ulink url="http://docs.aws.amazon.com/ElasticMapReduce/latest/API/API_Operations.html"><citetitle>[11]</citetitle></ulink>; I will implement as many of them as I can during the GSoC period. The following requests illustrate the API I will write.
</para>
<programlisting><![CDATA[
https://elasticmapreduce.cloudstack.com?Action=RunJobFlow
    &Name=MyJobFlowName
    &Instances.MasterInstanceType=m1.small
    &Instances.SlaveInstanceType=m1.small
    &Instances.InstanceCount=4
]]></programlisting>
<para>
This will launch a new Hadoop cluster with four instances of the specified instance types and add a job flow to it.
</para>
<programlisting><![CDATA[
https://elasticmapreduce.cloudstack.com?Action=AddJobFlowSteps
    &JobFlowId=j-3UN6WX5RRO2AG
    &Steps.member.1.Name=MyStep2
    &Steps.member.1.HadoopJarStep.Jar=MyJar
]]></programlisting>
<para>
This will add a step to the existing job flow with ID j-3UN6WX5RRO2AG. This step will run the specified jar file.
</para>
<programlisting><![CDATA[
https://elasticmapreduce.cloudstack.com?Action=DescribeJobFlows
    &JobFlowIds.member.1=j-3UN6WX5RRO2AG
]]></programlisting>
<para>
This will return the status of the given job flow.
</para>
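<para>
Clients will also need a way to authenticate these requests. One plausible option, sketched below, is AWS-style query signing with Signature Version 2 (HMAC-SHA256 over a canonical query string), which EMR clients already speak. Whether the CloudStack implementation adopts this exact scheme is an open design decision, and the endpoint and access keys are placeholders.
</para>
<programlisting><![CDATA[
import base64
import hashlib
import hmac
import urllib.parse

HOST = "elasticmapreduce.cloudstack.com"  # hypothetical endpoint from the examples
ACCESS_KEY = "my-access-key"  # placeholder credentials
SECRET_KEY = "my-secret-key"  # placeholder credentials

params = {
    "Action": "RunJobFlow",
    "Name": "MyJobFlowName",
    "Instances.MasterInstanceType": "m1.small",
    "Instances.SlaveInstanceType": "m1.small",
    "Instances.InstanceCount": "4",
    "AWSAccessKeyId": ACCESS_KEY,
    "SignatureVersion": "2",
    "SignatureMethod": "HmacSHA256",
}

# Canonical query string: keys sorted by byte order, values percent-encoded.
canonical = "&".join(
    "%s=%s" % (urllib.parse.quote(k, safe=""), urllib.parse.quote(v, safe=""))
    for k, v in sorted(params.items())
)

# Signature Version 2 signs "GET\n<host>\n<path>\n<canonical query string>".
string_to_sign = "\n".join(["GET", HOST, "/", canonical])
signature = base64.b64encode(
    hmac.new(SECRET_KEY.encode(), string_to_sign.encode(), hashlib.sha256).digest()
).decode()

print("https://%s/?%s&Signature=%s"
      % (HOST, canonical, urllib.parse.quote(signature, safe="")))
]]></programlisting>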
</section>
<section id="Roadmap">
<title>Roadmap</title>
<para><emphasis role="bold">Jun. 17 ∼ Jun. 30</emphasis> </para>
<orderedlist>
<listitem><para>
Learn the CloudStack and Apache Whirr/Provisionr APIs; deploy a CloudStack cluster.
</para></listitem>
<listitem><para>
Identify how EC2_USER_DATA is passed and executed on each CloudStack instance.
</para></listitem>
<listitem><para>
Figure out how the files passed in EC2_USER_DATA are acted upon by CloudInit.
</para></listitem>
<listitem><para>
Identify the files in /etc/init/ that are used or modified by Whirr and Provisionr for Hadoop-related configuration.
</para></listitem>
<listitem><para>
Deploy a Hadoop cluster on CloudStack via Whirr/Provisionr, to find out what is missing in CloudStack or Whirr/Provisionr in terms of their support for each other.
</para></listitem>
</orderedlist>
<para><emphasis role="bold">Jul. 1∼ Aug. 1</emphasis> </para>
<orderedlist>
<listitem><para>
Write scripts to configure VM hostnames on CloudStack with the help of CloudInit.
</para></listitem>
<listitem><para>
Write scripts to distribute SSH keys among CloudStack instances. Add to CloudStack the capability of using a user-provided SSH key for authentication.
</para></listitem>
<listitem><para>
Take care of the remaining tasks for Hadoop provisioning, such as mounting disks and installing Hadoop tarballs.
</para></listitem>
<listitem><para>
Compose the files that need to be passed in EC2_USER_DATA to each CloudStack instance. Test these files and write patches to make sure that Whirr/Provisionr can successfully deploy one-click Hadoop clusters on CloudStack.
</para></listitem>
</orderedlist>
<para><emphasis role="bold">Aug. 3 ∼ Sep. 8</emphasis> </para>
<orderedlist>
<listitem><para>
Design and build an Elastic MapReduce API for CloudStack that takes control of Hadoop cluster creation and management.
</para></listitem>
<listitem><para>
Implement the eight actions defined in the EMR API. This task might take a while.
</para></listitem>
</orderedlist>
<para><emphasis role="bold">Sep. 10 ∼ Sep. 23</emphasis> </para>
<orderedlist>
<listitem><para>
Code cleanup and documentation wrap-up.
</para></listitem>
</orderedlist>
</section>
<section id="Deliverables-meng">
<title>Deliverables</title>
<orderedlist>
<listitem><para>
Whirr has limited support for CloudStack. Check what’s missing and make sure all steps are properly documented on the Whirr and CloudStack websites.
</para></listitem>
<listitem><para>
Contribute code to CloudStack and send patches to Whirr/Provisionr if necessary to enable Hadoop provisioning on CloudStack via Whirr/Provisionr.
</para></listitem>
<listitem><para>
Build an EMR-compatible API for CloudStack.
</para></listitem>
</orderedlist>
</section>
<section id="Nice-to-have">
<title>Nice to have</title>
<para>In addition to the required deliverables, it’s nice to have the following:</para>
<orderedlist>
<listitem><para>
The capability to dynamically add and remove Hadoop nodes, to enable elastic Hadoop clusters on CloudStack.
</para></listitem>
<listitem><para>
A review of existing tools that offer one-click provisioning, verifying that they support CloudStack based clouds.
</para></listitem>
</orderedlist>
</section>
<section id="References">
<title>References</title>
<orderedlist>
<listitem><para>
http://whirr.apache.org/
</para></listitem>
<listitem><para>
http://www.jclouds.org/documentation/gettingstarted/what-is-jclouds/
</para></listitem>
<listitem><para>
Katarzyna Keahey, Tim Freeman, Contextualization: Providing One-Click Virtual Clusters
</para></listitem>
<listitem><para>
http://www.nimbusproject.org/docs/current/clouds/clusters2.html
</para></listitem>
<listitem><para>
http://aws.amazon.com/amazon-linux-ami/
</para></listitem>
<listitem><para>
https://svn.apache.org/repos/asf/whirr/branches/contrib-python/src/py/hadoop/cloud/data/hadoop-ec2-init-remote.sh
</para></listitem>
<listitem><para>
https://help.ubuntu.com/community/CloudInit
</para></listitem>
<listitem><para>
http://cloudstack.apache.org/docs/en-US/Apache_CloudStack/4.0.2/html/Installation_Guide/using-sshkeys.html
</para></listitem>
<listitem><para>
https://cwiki.apache.org/CLOUDSTACK/allow-user-provided-hostname-internal-vm-name-on-hypervisor-instead-of-cloud-platform-auto-generated-name-for-guest-vms.html
</para></listitem>
<listitem><para>
http://docs.aws.amazon.com/ElasticMapReduce/latest/API/Welcome.html
</para></listitem>
<listitem><para>
http://docs.aws.amazon.com/ElasticMapReduce/latest/API/API_Operations.html
</para></listitem>
<listitem><para>
http://buildacloud.org/blog/235-puppet-and-cloudstack.html
</para></listitem>
<listitem><para>
http://chriskleban-internet.blogspot.com/2012/03/build-cloud-cloudstack-instance.html
</para></listitem>
<listitem><para>
http://gehrcke.de/2009/06/aws-about-api/
</para></listitem>
<listitem><para>
Apache_CloudStack-4.0.0-incubating-API_Developers_Guide-en-US.pdf
</para></listitem>
</orderedlist>
</section>
</chapter>