In this post, I’d like to list the technical stack, with some useful links, for a new Apache Spark developer who has fundamental programming skills but feels short of the background knowledge needed to be an efficient user or developer of Spark. I hope the list can serve as an index of the path to becoming a skilled Spark developer.

I make no assumptions about the reader’s technical background, so I will start with some Basic Tools. Readers can skip any section whose subject they already know.

Basic Tools

Java

  • The Java stack is the basis for all projects built on a JVM-based language (Scala, Clojure, Jython…), and I believe readers can easily find many tutorials or courses for this popular language. A deep understanding of Java programming may not be necessary for a Spark developer, but knowledge of some useful tools is really important for understanding how Spark programs are built and used.
  • Maven
    • Apache Maven is one of the most popular project management and comprehension tools. It manages dependencies centrally (all dependencies can be downloaded from one or several central repositories) and recursively (the dependencies of dependencies are resolved automatically). It also holds the configuration of how to build the project, including the root of the source code, necessary plugins, etc. All of this configuration lives in an XML file named pom.xml (POM stands for project object model). As with many well-known open source projects, the documentation on the official site is clear and helpful.
    • Easy Way: Maven in 5 Minutes
    • Hard Way: Maven Getting Started Guide
  • log4j
    • These are some general guides to log4j, as background on the logging mechanism used by Spark. If you just want to know how to log in a Spark application, move to the Spark section at the end of this list.
    • Easy Way: log4j - Quick Guide
    • Hard Way: Official Docs
  • garbage collection
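
As a rough illustration of what "recursively" means in the Maven item above, the following Java sketch computes a transitive dependency set from a small hand-written map. It is a toy model under stated assumptions, not Maven's actual resolution algorithm: the class name `TransitiveDeps`, the `resolve` method, and the artifact names (`my-app`, `lib-a`, and so on) are all made up for illustration, and versions, scopes, exclusions, and conflict mediation are ignored.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of transitive dependency resolution.
// NOT Maven's real algorithm: no versions, scopes, or conflict handling.
public class TransitiveDeps {

    // Returns every artifact reachable from `root` by following the
    // declared (direct) dependencies of each artifact.
    public static Set<String> resolve(String root, Map<String, List<String>> declared) {
        Set<String> seen = new LinkedHashSet<>();
        visit(root, declared, seen);
        seen.remove(root); // the project is not a dependency of itself
        return seen;
    }

    private static void visit(String artifact, Map<String, List<String>> declared, Set<String> seen) {
        if (!seen.add(artifact)) {
            return; // already visited; this also guards against cycles
        }
        // Artifacts with no entry in the map have no dependencies.
        for (String dep : declared.getOrDefault(artifact, List.of())) {
            visit(dep, declared, seen);
        }
    }

    public static void main(String[] args) {
        // Hypothetical artifacts: my-app declares only lib-a and lib-b directly.
        Map<String, List<String>> declared = Map.of(
            "my-app", List.of("lib-a", "lib-b"),
            "lib-a", List.of("lib-c"),
            "lib-b", List.of("lib-c", "lib-d"));

        System.out.println(resolve("my-app", declared));
    }
}
```

Running `main` prints `[lib-a, lib-c, lib-b, lib-d]`: the project declares only `lib-a` and `lib-b`, yet `lib-c` and `lib-d` are pulled in as well, which is the behavior a build tool with transitive dependency management automates.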

Scala

  • Language
    • Though easy to use, Scala is a complicated programming language. I learned Scala systematically through a MOOC by Martin Odersky, one of the language designers. Given my limited experience, I cannot recommend other materials on the language.
  • sbt
    • sbt is a very useful and advanced build tool for Scala and Java. It is more convenient to configure than Maven and gives the project a clear structure. For a Spark developer, knowledge of the basic configuration and commands of sbt is enough.
    • Easy Way: A simple tutorial, Another simple one, Some plugins
    • Hard Way: Official reference
  • Typesafe Config
    • Typesafe Config is a useful configuration library for JVM-based languages. The docs on GitHub are clear and comprehensive enough for users.
  • ScalaTest
    • ScalaTest is another advanced yet complicated tool in the Scala stack. It supports various testing styles, so users can pick the one they are most comfortable with. The User guide is fairly clear.

Hadoop

Spark