|  | # Documentation | 
|  |  | 
|  | ## Developers Tips | 
|  |  | 
|  | ### Getting Started | 
|  |  | 
The Spark Cassandra Connector is built using sbt. A premade launcher
script for sbt is included, so there is no need to download sbt yourself. To invoke
this script, run `./sbt/sbt` from a clone of this repository.
|  |  | 
For information on setting up your clone, please follow the [GitHub
Help](https://help.github.com/articles/cloning-a-repository/).
|  |  | 
Once in the sbt shell you will be able to build and run tests for the
connector without any Spark or Cassandra nodes running. The integration tests
require a valid Java home path set in the `JAVA_HOME` environment variable and
[CCM](https://github.com/riptano/ccm) to be installed on your machine.
CCM can be installed with `pip install ccm`.
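
For example, on a typical Linux machine the prerequisites might be satisfied
as in the sketch below; the JDK path is only an illustration, so adjust it for
your own installation.

```shell
# Illustrative only: the JDK path is an example, adjust for your machine
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
pip install ccm
```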
|  |  | 
The most common commands to use when developing the connector are:
|  |  | 
1. `test` - Runs the unit tests for the project.
|  | 2. `it:test` - Runs the integration tests with Cassandra (started by CCM) and Spark | 
|  | 3. `package` - Builds the project and produces a runtime jar | 
|  | 4. `publishM2` - Publishes a snapshot of the project to your local maven repository allowing for usage with `--packages` in the spark-shell | 
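
For example, a typical session from the bundled sbt shell might look like the
following sketch (prompt and comments are illustrative, output omitted):

```shell
./sbt/sbt
> test        # unit tests only, no Cassandra or Spark nodes needed
> it:test     # integration tests, requires JAVA_HOME and CCM
> package     # build the runtime jar
> publishM2   # publish a snapshot to the local maven repository
```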
|  |  | 
The integration tests located in `connector/src/it` are probably
the first place to look for anyone considering adding code. They contain
many examples of exercising connector features against Cassandra and
Spark nodes, and they form the core of our test coverage.
|  |  | 
|  | ### Merge Path | 
|  |  | 
|  | b2.5 => b3.0 => b3.1 => master | 
|  |  | 
New features can be considered for 2.5 as long as they do not break APIs.
Once a feature is ready for b2.5, create a feature branch for b3.0 and merge
the b2.5 feature branch into the b3.0 feature branch. Repeat for master.
|  |  | 
Here is an example for an imaginary ticket, SPARKC-9999.
|  |  | 
Let's assume that `datastax` is the `git@github.com:datastax/spark-cassandra-connector.git` remote
and `origin` is your personal fork.
|  | ```shell | 
|  | $ git remote -v | 
|  | datastax	git@github.com:datastax/spark-cassandra-connector.git (fetch) | 
|  | datastax	git@github.com:datastax/spark-cassandra-connector.git (push) | 
|  | ... | 
|  | ``` | 
|  |  | 
Here is what the workflow should look like.
|  |  | 
|  | ```shell | 
|  | git fetch datastax | 
|  | git checkout -b SPARKC-9999-b2.5 datastax/b2.5 | 
|  | # do the work, commit | 
|  | git push origin SPARKC-9999-b2.5 | 
|  |  | 
|  | # Forward merge on the next version: | 
|  | git checkout -b SPARKC-9999-b3.0 datastax/b3.0 | 
|  | git merge SPARKC-9999-b2.5 | 
|  | # Resolve conflict, if any | 
|  | # Push the new feature branch: | 
|  | git push origin SPARKC-9999-b3.0 | 
|  |  | 
|  | # Forward merge on the next version: | 
|  | git checkout -b SPARKC-9999-master datastax/master | 
|  | git merge SPARKC-9999-b3.0 | 
|  | # Resolve conflict, if any | 
|  | # Push the new feature branch: | 
|  | git push origin SPARKC-9999-master | 
|  | ``` | 
|  |  | 
|  | ### Sub-Projects | 
|  |  | 
|  | The connector currently contains several sub-projects | 
|  |  | 
|  | #### connector | 
This sub-project contains all of the actual connector code and is where
any new features or tests should go. This Scala project also contains the
Java API and related code. Anything related to the actual connecting, or to
modification of Java Driver code, belongs in the next module.
|  |  | 
|  | #### driver | 
All of the code relating to the Java Driver: connection factories, row transformers,
and anything else which could be used by any application even if Spark is not involved.
|  |  | 
|  |  | 
|  | #### test-support | 
CCM wrapper code. Much of this code is based on the DataStax Java Driver's test code.
It includes code for spawning CCM as well as several modes for launching clusters
while testing. This module also defines which tests require separate clusters to
run and the parallelization used while running tests.
|  |  | 
|  | ### Test Parallelization | 
|  |  | 
To limit the number of test groups running simultaneously, set the
`TEST_PARALLEL_TASKS` environment variable. This only applies to `sbt test` tasks.
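
For instance, to run the tests with only one test group at a time (an
illustrative invocation; adjust the value as needed):

```shell
# Illustrative: run with a single test group at a time
TEST_PARALLEL_TASKS=1 ./sbt/sbt test
```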
|  |  | 
|  | ### Set Cassandra Test Target | 
Our CI build runs through the DataStax infrastructure and tests against all of the builds
listed in build.yaml. In addition, the _test-support_ module supports Cassandra
or other CCM compatible installations.
|  |  | 
If using sbt, you can set
`CCM_CASSANDRA_VERSION` to propagate a version for CCM to use during tests.
|  |  | 
If you are running tests through IntelliJ or through an alternative framework (JUnit),
set the system property `ccm.version` to the version you would like.
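
For example, both approaches might look like the sketch below; the Cassandra
version shown is only illustrative.

```shell
# Illustrative: with sbt, export the version for CCM before running tests
CCM_CASSANDRA_VERSION=3.11.10 ./sbt/sbt it:test

# With IntelliJ or another framework, pass a system property instead:
#   -Dccm.version=3.11.10
```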
|  |  | 
|  | ### CCM Modes | 
The integration tests have a variety of modes which can be set with `CCM_CLUSTER_MODE`:
|  |  | 
* Debug - Use to preserve the logs of running CCM clusters as well as the cluster directories themselves
* Standard - Starts a new cluster which is cleaned up entirely when the test run finishes
* Developer - Does not clean up the cluster on test completion; can be used when running a test multiple times for faster iteration
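
For example, to keep the CCM cluster alive between runs while iterating on a
test (an illustrative invocation):

```shell
# Illustrative: reuse the cluster across runs for faster iteration
CCM_CLUSTER_MODE=Developer ./sbt/sbt it:test
```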
|  |  | 
|  | ### Continuous Testing | 
|  |  | 
It's often useful when implementing new features to have the tests run
in a loop on code change. Sbt provides this through the `~` operator: `~ test`
will run the unit tests every time a change in the source code is detected.
This is often useful in conjunction with `testOnly`, which runs a single test
suite. So if a new feature were being added to the integration suite `foo`, you
may want to run `~ it:testOnly foo`, which would run only the suite you are
interested in, on a loop, while you are modifying the connector code. Use this
in conjunction with the "Developer" CCM mode for the fastest test iteration.
|  |  | 
|  | ### Packaging | 
|  |  | 
The `spark-shell` and `spark-submit` commands are able to use local caches to
load libraries, and the SCC can take advantage of this. For example, if you
wanted to test the maven artifacts produced for your current build, you could
run `publishM2`, which generates the needed artifacts and pom in your local
cache. You can then reference this from `spark-shell` or `spark-submit` using
the following code:
|  | ```bash | 
|  | ./bin/spark-shell --repositories file:/Users/yourUser/.m2/repository --packages com.datastax.spark:spark-cassandra-connector_2.10:1.6.0-14-gcfca49e | 
|  | ``` | 
|  | Where you would change the revision `1.6.0-14-gcfca49e` to match the output | 
|  | of your publish command. | 
|  |  | 
This same method should work with `publishLocal`
after the merging of [SPARK-12666](https://issues.apache.org/jira/browse/SPARK-12666).
|  |  | 
|  |  | 
|  | ### Publishing ScalaDocs | 
|  |  | 
Run the generateDocs script with all of the versions you want to generate docs for:
|  |  | 
|  | ```bash | 
|  | ./generateDocs Version Version Version Version | 
|  | ``` | 
This will check out the tag v$Version for each of those versions and run sbt doc
for each of them. The output files will eventually be moved to the gh-pages branch.
After the script has finished, inspect the results, and if they look good run:
|  |  | 
|  | ```bash | 
|  | git add .; git commit | 
|  | ``` |