Archive for the ‘Spark’ Category

Using spark-shell

December 23, 2015 Leave a comment

As a new learner for spark/scala, I found using spark-shell for debugging is very useful. Sometimes, I just feel it like the ipython shell.  There are few tricks of using it:

0. Do ./spark-shell -h will give you a lot of help information

1. Load external file in spark-shell:
spark-shell -i file.scala  or in-shell do
scala> :load your_path_to.scala

2. Remember when you start the shell, the SparkContext(sc) and the SQLContext (sqlContext) has already loaded. If you are not in the spark shell — remember to create such in your program

3. You can import multiple things like this: scala> import org.apache.spark.{SparkContext, SparkConf}

4. You can use `spark-shell -jars your.jar` to run a single-jar spark module from the start, and then you will be able to `import somthing_from_your_jar’ from your just added library.

5. If you install spark locally, you can open it’s web ui(port 4040) for validation purpose: http://localhost:4040/environment/

Screen Shot 2015-12-23 at 11.36.29 AM

6. To re-use what you have entered into the spark-shell, you can extract your input from the spark shell history which is in a file called “.spark-history” in the user’s home directory. For example `tail -n 5 .spark_history > mySession1.scala`. Next time, you can use (1) to reload your saved scala session. In the shell session, if you want to check history, you can simply do `scala> :history`

7. A library called scalaplot can help you to do some visual investigation.

8. Use $ SPARK_PRINT_LAUNCH_COMMAND=1 ./bin/spark-shell to print launch command of spark scripts

9. Open spark-shell and execute :paste -raw that allows you to enter any valid Scala code, even including package.

ps. to install spark on your mac, you can simply use homebrew
$brew update
$brew install scala
$brew install sbt
$echo ‘SBT_OPTS=”-XX:+CMSClassUnloadingEnabled -XX:PermSize=256M -XX:MaxPermSize=512M -Xmx2G”‘ >> ~/.sbtconfig
$brew install apache-spark

After the installation, you can update your PATH variable to include the path to spark/bin.

You can also set up pyspark locally, here are some instructions:

One short but nice Scala book

Categories: Scala, Spark