Scala for Big Data Engineering

Data science, as we all know, combines statistics with real-world programming. Data scientists use a number of programming languages to extract insights and value from data, with Scala, Python and R being among the most popular.

In this series, Scala for Big Data Engineering, we will look at how Scala is used for big data engineering. The articles cover two aspects: Scala as a programming language, and big data.

History of Scala:

Scala (short for "Scalable Language") is a programming language developed by Martin Odersky, a professor at École Polytechnique Fédérale de Lausanne (EPFL). It combines object-oriented and functional programming constructs and runs on the JVM.

Classification of Programming Languages

Programming languages are classified as follows:

  1. Declarative Programming Languages: In these programming languages, the program states what result is required rather than how to compute it, and the underlying compiler or query engine works out the steps. For example, a SQL statement expresses only the query, while parsing it, checking its grammar and executing it are done by the underlying query engine. Example: Haskell, ML, Prolog
  2. Functional Programming Languages: In these programming languages, the computational model is based on the lambda calculus. Lambda calculus comprises three elements: variables (x), function abstractions (λx.x) and applications ((λx.x) a). The paradigm rests on two constructs: immutability and pure functions. Data is passed to a function, which returns the transformed result instead of modifying its input. Example: Haskell, ML, etc.
  3. Imperative Programming Languages: In these programming languages, step by step instructions are given to achieve the required output from the program. Example: C, C++, Fortran, and Java
  4. Object-oriented Programming Languages: In these programming languages, the program consists of interacting objects and relies on concepts like encapsulation, modularity, polymorphism, and inheritance.
    Encapsulation is the wrapping of data and functions into a single unit. Inheritance is the process by which one class acquires the properties and behaviours of another class. Polymorphism is the ability of a single function or interface to take different forms depending on the object it operates on.
    An object is the basic unit in these languages: it is an instance of a class, which binds data and functions into a single unit. Example: C++, C#, Java
  5. Parallel Programming Languages: In these programming languages and programming models, computation runs concurrently on multiple processors. They were designed for distributed computing and for using shared-memory architectures or multiple cores to run programs. Example: SR, Emerald, PARLOG, Chapel, CUDA, Cilk, MPI, POSIX threads, X10
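
To make the functional and object-oriented categories above more concrete, here is a minimal sketch in Scala, the language this series covers. All names (ParadigmSketch, Shape, Square, Rectangle, doubled) are made up for illustration.

```scala
object ParadigmSketch {
  // Functional: the lambda term λx.x written as a Scala function value.
  val identityFn: Int => Int = x => x

  // Functional: a pure function. It reads only its argument and returns a
  // new immutable list instead of modifying the input.
  def doubled(xs: List[Int]): List[Int] = xs.map(_ * 2)

  // Object-oriented: data and behaviour wrapped into one unit (encapsulation).
  abstract class Shape(val name: String) {
    def area: Double
    def describe: String = s"$name with area $area"
  }

  // Inheritance: both classes acquire `name` and `describe` from Shape.
  class Square(side: Double) extends Shape("square") {
    def area: Double = side * side
  }

  class Rectangle(width: Double, height: Double) extends Shape("rectangle") {
    def area: Double = width * height
  }

  def main(args: Array[String]): Unit = {
    println(identityFn(42)) // the application (λx.x) 42

    val original = List(1, 2, 3)
    println(doubled(original))
    println(original) // the input list is unchanged

    // Polymorphism: the same call resolves to a different `area` per object.
    val shapes: List[Shape] = List(new Square(2.0), new Rectangle(2.0, 3.0))
    shapes.foreach(shape => println(shape.describe))
  }
}
```

Note that a single language can sit in several categories at once, which is exactly the case for Scala.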

Scala as a Programming Language

Scala is a programming language which supports both object-oriented programming and functional programming constructs, and that’s its biggest advantage.

The language was designed to meet the growing needs of modularity, scalability and a trend towards multi-paradigm programming. The characteristics of Scala are as follows:

  1. Object Oriented
  2. Functional Programming
  3. Statically Typed
  4. Java Interoperability
  5. Functions as Objects
  6. Functional constructs for parallelism and distributed computing
  7. Compiles to Java bytecode
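
As a small illustration of "functions as objects" and static typing from the list above, the following sketch shows that a function literal is an ordinary, typed value. The names FunctionsAsObjects, increment and applyTwice are invented for this example.

```scala
object FunctionsAsObjects {
  // A function literal is an ordinary object: an instance of Function1[Int, Int].
  val increment: Int => Int = x => x + 1

  // `Int => Int` is just shorthand for the Function1 type.
  val sameFunction: Function1[Int, Int] = increment

  // Because functions are values, they can be passed to other functions.
  // The types are checked at compile time (static typing).
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

  def main(args: Array[String]): Unit = {
    println(increment(1))                    // sugar for increment.apply(1)
    println(increment.apply(1))              // explicit method call on the object
    println(applyTwice(increment, 5))        // 7
    println(increment.andThen(increment)(0)) // functions have methods too: 2
  }
}
```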

Scala blends functional and object-oriented constructs into a uniform object model with pattern matching and higher-order functions. The foundational features of the Scala programming language are as follows:


Scala programs compile to JVM bytecode and interoperate seamlessly with Java class libraries through method calls, field accesses, class inheritance, and interface implementation. Scala's syntax resembles Java's, but there are a few differences too.
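
The interoperability described above can be sketched with classes from the standard Java library. The object and method names here (JavaInteropSketch, ByLength, sortedByLength) are hypothetical; only the java.util classes are real.

```scala
import java.util.{ArrayList, Collections, Comparator}

object JavaInteropSketch {
  // Interface implementation: a Scala object implementing java.util.Comparator.
  object ByLength extends Comparator[String] {
    override def compare(a: String, b: String): Int =
      Integer.compare(a.length, b.length)
  }

  def sortedByLength(words: Seq[String]): ArrayList[String] = {
    val list = new ArrayList[String]() // a plain Java collection
    words.foreach(word => list.add(word)) // ordinary Java method call
    Collections.sort(list, ByLength)      // Java static method using our comparator
    list
  }

  def main(args: Array[String]): Unit = {
    println(sortedByLength(Seq("scala", "jvm", "java"))) // [jvm, java, scala]
  }
}
```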


A Scala program is often written in a completely different style from its Java equivalent. It treats arrays as instances of a general sequence abstraction and uses higher-order functions instead of loops.
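
A minimal sketch of that difference in style follows. The first method is written the way a Java programmer might write it, and the others use the sequence operations that arrays share with other Scala collections. The names are chosen only for this example.

```scala
object SequenceStyle {
  // Java-like style: an index-based loop with mutable state.
  def countEvensLoop(xs: Array[Int]): Int = {
    var count = 0
    var i = 0
    while (i < xs.length) {
      if (xs(i) % 2 == 0) count += 1
      i += 1
    }
    count
  }

  // Scala style: the array offers the same higher-order methods as any sequence.
  def countEvens(xs: Array[Int]): Int = xs.count(_ % 2 == 0)

  // Higher-order functions chain naturally where nested loops would be needed.
  def squaresOfEvens(xs: Array[Int]): Array[Int] =
    xs.filter(_ % 2 == 0).map(x => x * x)

  def main(args: Array[String]): Unit = {
    val numbers = Array(1, 2, 3, 4, 5, 6)
    println(countEvensLoop(numbers))                // 3
    println(countEvens(numbers))                    // 3
    println(squaresOfEvens(numbers).mkString(", ")) // 4, 16, 36
  }
}
```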

Concise and Precise:

Scala is concise and precise because it uses semicolon inference, type inference, lightweight classes, and closures as control abstractions. Compared with Java, the number of lines of code is reduced on average by a factor of two or more, thanks to the concise syntax and better abstraction capabilities. The elaborate static type system catches many errors early.
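
As a rough illustration of semicolon inference, type inference and lightweight classes, consider the sketch below. Point is a made-up example class, not something from a library.

```scala
object ConciseSketch {
  // A lightweight class: this one line provides a constructor, immutable
  // fields, equals, hashCode, toString and copy.
  case class Point(x: Int, y: Int)

  def main(args: Array[String]): Unit = {
    val p = Point(1, 2)        // type inferred as Point; no `new`, no semicolon
    val moved = p.copy(x = 10) // a modified copy; `p` itself is unchanged
    val points = List(p, moved) // type inferred as List[Point]

    println(p == Point(1, 2))     // true: structural equality
    println(moved)                // Point(10,2)
    println(points.map(_.x).sum)  // 11
  }
}
```

An equivalent Java class with fields, a constructor, accessors, equals, hashCode and toString would traditionally take several dozen lines.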

Scala is both expressive and elegant. The following features are included in Scala:

  • A Pure Object System
  • Operator Overloading
  • Closures as control abstractions
  • Mixin composition with traits
  • Abstract type members
  • Pattern Matching
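
Several of the features listed above can be shown in one short sketch. Every name here (FeatureSketch, Vec, Greeter, Shouter, Person, describe, repeat) is hypothetical and exists only to illustrate the feature named in the comment.

```scala
object FeatureSketch {
  // Operator overloading: `+` is an ordinary method name.
  case class Vec(x: Int, y: Int) {
    def +(other: Vec): Vec = Vec(x + other.x, y + other.y)
  }

  // Mixin composition with traits: Shouter stacks behaviour on top of Greeter.
  trait Greeter {
    def name: String
    def greet: String = s"Hello, $name"
  }
  trait Shouter extends Greeter {
    override def greet: String = super.greet.toUpperCase
  }
  class Person(val name: String) extends Greeter

  def shoutedGreeting(name: String): String =
    (new Person(name) with Shouter).greet

  // Pattern matching on structure and on type.
  def describe(value: Any): String = value match {
    case Vec(0, 0) => "origin"
    case Vec(x, y) => s"vector ($x, $y)"
    case n: Int    => s"int $n"
    case _         => "something else"
  }

  // Closure as control abstraction: `body` is passed by name and re-evaluated
  // on every iteration, so `repeat` reads like a built-in loop.
  def repeat(times: Int)(body: => Unit): Unit = {
    var i = 0
    while (i < times) {
      body
      i += 1
    }
  }

  def main(args: Array[String]): Unit = {
    println((1).+(2))               // pure object system: `+` is a method on Int
    println(Vec(1, 2) + Vec(3, 4))  // Vec(4,6)
    println(shoutedGreeting("Ada")) // HELLO, ADA
    println(describe(Vec(0, 0)))    // origin
    repeat(2) { println("tick") }
  }
}
```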

The features that are not included in Scala are:

  • Static members
  • Special treatment of primitive types
  • Special treatment of interfaces
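
The absence of static members is usually compensated for with a companion object, as in this minimal sketch. Counter is an invented example class, and the last line of main shows that Int values have methods like any other object.

```scala
object NoStaticsSketch {
  // The constructor is private, so instances are created via the companion.
  class Counter private (val value: Int) {
    def next: Counter = new Counter(value + 1)
  }

  // Companion object: holds what Java would declare as static members.
  object Counter {
    val Zero: Counter = new Counter(0)
    def startingAt(n: Int): Counter = new Counter(n)
  }

  def main(args: Array[String]): Unit = {
    println(Counter.Zero.next.value)              // 1
    println(Counter.startingAt(5).next.next.value) // 7
    println(3.max(5))                             // 5: a method call on an Int
  }
}
```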

Scala has many features that are particularly valuable for writing scientific workflows, machine learning algorithms and complex analytics solutions.

Scala’s interoperability and features make it handy in many different areas and have given it a vast ecosystem. The following libraries and frameworks in the Big Data and Machine Learning space are written in or usable from Scala:

  1. Apache Spark – A general-purpose analytics engine for large-scale data processing. Spark is written in Scala and runs on the JVM. It provides APIs in Java, Scala, Python, and R. The fundamental data abstraction in Spark is the Resilient Distributed Dataset (RDD). RDDs are immutable, distributed, lazily evaluated and cacheable. Scala can also be used with the other Spark modules, which include Spark Streaming, Spark SQL, Spark MLlib and ML, and Spark GraphX.
  2. Apache Flink – A framework and distributed processing engine for batch and real-time applications. Flink is written in Java and Scala and offers several APIs for batch processing, real-time streaming, and relational queries. FlinkML is its machine learning library, with many algorithms written in Scala, and its CEP (Complex Event Processing) library is written in Scala and Java.
  3. Apache Kafka – Written in Java and Scala, it is a distributed messaging and streaming platform that lets applications publish and subscribe to streams of messages as they arrive.
  4. Apache Samza – Written in Java and Scala, it is a distributed stateful stream processing framework that processes data from different sources, including Apache Kafka.
  5. Akka – Written in Scala, it is a toolkit for building distributed and message-driven systems for Java and Scala.
  6. ScalaNLP – A suite of machine learning and natural language processing (NLP) libraries written in Scala.
  7. DeepLearning.Scala – Built in Scala, it is a library for building neural networks using object-oriented and functional programming constructs. The programs can be run on a JVM or in a Jupyter notebook.
  8. PredictionIO – A machine learning server built on top of Apache Spark, HBase and Hadoop. It is used for building and deploying machine learning models and uses machine learning libraries like Spark ML and Spark MLlib.
  9. Vegas – A visualization library for Scala, similar to Matplotlib, that integrates with Spark.
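
The "immutable and lazily evaluated" properties mentioned for Spark's RDDs can be imitated with the Scala standard library alone. The sketch below is only an analogy using a collection view. It is not Spark code, nothing in it is distributed, and the names LazyPipelineSketch and run are made up.

```scala
object LazyPipelineSketch {
  // Returns (elements evaluated before forcing, elements evaluated after, sum).
  def run(): (Int, Int, Int) = {
    var evaluated = 0
    val data = Vector(1, 2, 3, 4, 5) // immutable source collection

    // Comparable to a transformation: the view only records the computation.
    val squares = data.view.map { x =>
      evaluated += 1
      x * x
    }
    val before = evaluated // still 0, nothing has run yet

    // Comparable to an action plus caching: force the view once and keep
    // the materialized result for reuse.
    val materialized = squares.toVector
    val after = evaluated

    (before, after, materialized.sum)
  }

  def main(args: Array[String]): Unit = {
    val (before, after, total) = run()
    println(s"evaluated before forcing: $before") // 0
    println(s"evaluated after forcing: $after")   // 5
    println(s"sum of squares: $total")            // 55
  }
}
```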

That’s all for now!
In the upcoming articles of this series, we shall start learning Scala, the Eclipse IDE, Scala with Jupyter and some of the above technologies.

Stay tuned!
