blob: e276cdb40e62663a50df79556baa0bbfb05e8843 [file] [log] [blame] [view]
---
layout: page
title: Run Hello Samza without Internet
---
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
This tutorial is to help you run [Hello Samza](../../../startup/hello-samza/{{site.version}}/) if you can not connect to the internet.
### Test Your Connection
Ping irc.wikimedia.org. Sometimes the firewall in your company blocks this service.
{% highlight bash %}
telnet irc.wikimedia.org 6667
{% endhighlight %}
You should see something like this:
```
Trying 208.80.152.178...
Connected to ekrem.wikimedia.org.
Escape character is '^]'.
NOTICE AUTH :*** Processing connection to irc.pmtpa.wikimedia.org
NOTICE AUTH :*** Looking up your hostname...
NOTICE AUTH :*** Checking Ident
NOTICE AUTH :*** Found your hostname
```
Otherwise, you may have the connection problem.
### Use Local Data to Run Hello Samza
We provide an alternative to get wikipedia feed data. Instead of running
{% highlight bash %}
deploy/samza/bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file://$PWD/deploy/samza/config/wikipedia-feed.properties
{% endhighlight %}
You will run
{% highlight bash %}
bin/produce-wikipedia-raw-data.sh
{% endhighlight %}
This script will read wikipedia feed data from local file and produce them to the Kafka broker. By default, it produces to localhost:9092 as the Kafka broker and uses localhost:2181 as zookeeper. You can overwrite them:
{% highlight bash %}
bin/produce-wikipedia-raw-data.sh -b yourKafkaBrokerAddress -z yourZookeeperAddress
{% endhighlight %}
Now you can go back to Generate Wikipedia Statistics section in [Hello Samza](../../../startup/hello-samza/{{site.version}}/) and follow the remaining steps.
### A Little Explanation
The goal of
{% highlight bash %}
deploy/samza/bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file://$PWD/deploy/samza/config/wikipedia-feed.properties
{% endhighlight %}
is to deploy a Samza job which listens to wikipedia API, receives the feed in realtime and produces the feed to the Kafka topic wikipedia-raw. The alternative in this tutorial is reading local wikipedia feed in an infinite loop and producing the data to Kafka wikipedia-raw. The follow-up job, wikipedia-parser is getting data from Kafka topic wikipedia-raw, so as long as we have correct data in Kafka topic wikipedia-raw, we are fine. All Samza jobs are connected by the Kafka and do not depend on each other.