Engineering Blog

Using Parquet and Scrooge with Spark

by Community Manager 02-08-2017 10:21 AM - edited 02-08-2017 10:22 AM

Written by Evan Chan 2/28/2014




Parquet is an exciting new columnar HDFS file format with built-in dictionary encoding and compression, as well as the ability to only read the columns you care about. These features make it very suitable for high-performance OLAP workloads with query engines such as Cloudera Impala or Facebook Presto.

Parquet translates each row into an object when reading the file, and it natively supports Thrift and Avro schemas for your data. Scrooge is Twitter's Scala class generator for Thrift, making it much more convenient and idiomatic to work with Thrift structs in Scala.

This blog post will lay down the steps for reading Parquet files using Scrooge-generated Scala classes for Thrift schemas, in Apache Spark.



Generating Scrooge Scala classes


To start off with, let's add the scrooge-sbt-plugin to your project/plugins.sbt file:


addSbtPlugin("com.twitter" %% "scrooge-sbt-plugin" % "3.12.0")

Next, we need to activate this plugin and put some basic settings down. The lines below go in build.sbt; if you use a full Scala build definition instead, you should be able to adjust accordingly.



libraryDependencies += "com.twitter" %% "scrooge-core" % "3.12.0"

scroogeThriftOutputFolder in Compile <<= baseDirectory(_ / "src_gen")


Now, put your Thrift definitions in src/main/thrift, run sbt scrooge-gen, and you should see generated Scala class files under src_gen! Note that by default, Scrooge puts the generated Scala class files in target/src_managed; if you want to publish the generated classes in a jar, check them in, or simply compile them, you will almost certainly want a setting like the one above.
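As a concrete starting point, here is a minimal Thrift definition you might drop into src/main/thrift/session.thrift. The struct name matches the ThriftSession class used later in this post, but the fields are invented for illustration; the namespace matches the mapping example below:

```thrift
namespace java ooyala.session.thrift

// Hypothetical session schema, purely for illustration.
struct ThriftSession {
  1: required i64 id
  2: required string userId
  3: optional list<string> pageViews
}
```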


By default the package name of the generated classes is taken from the Thrift files. You might want to change the package name, though. Let's say you want to generate both Java and Scala classes. This handy setting can be added to build.sbt:


scroogeThriftNamespaceMap in Compile := Map("ooyala.session.thrift" -> "ooyala.session.scala")



Compiling the generated classes


Now we need to pull the generated classes into our compile, which you can do with a build.sbt setting like the following. This defines two source directories relative to the project root, src and src_gen.


unmanagedSourceDirectories in Compile <<= Seq(baseDirectory(_ / "src" ),
                                              baseDirectory(_ / "src_gen")).join


Assuming you have SBT running, with the above setting, here are the steps in SBT to compile:

  1. Do a reload to reload the build classes with the new source dir setting.
  2. Do a compile. The generated classes should show up in target.
  3. If you need to use the classes in another project, you may want to publish or publish-local.


TODO: automatic generation of scrooge at compile time
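One possible way to close that TODO, assuming your version of scrooge-sbt-plugin exposes a scroogeGen task key that returns the generated source files (check the plugin's key definitions before relying on this sketch):

```scala
// Hypothetical wiring (task key name assumed): regenerate Thrift sources on
// every compile by hooking the plugin's task into sourceGenerators.
sourceGenerators in Compile <+= (scroogeGen in Compile)
```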



Reading Parquet files with Scrooge in Spark


This is fairly straightforward. First, set up a job configuration. In the following, replace ThriftSession with the name of your own Thrift class:


import org.apache.hadoop.mapred.JobConf
import parquet.hadoop.ParquetInputFormat
import parquet.hadoop.thrift.{ParquetThriftInputFormat, ThriftReadSupport}
import parquet.scrooge.ScroogeRecordConverter

val jobConf = new JobConf
ParquetInputFormat.setReadSupportClass(jobConf, classOf[ThriftReadSupport[ThriftSession]])
ThriftReadSupport.setRecordConverterClass(jobConf, classOf[ScroogeRecordConverter[ThriftSession]])
ParquetThriftInputFormat.setThriftClass(jobConf, classOf[ThriftSession])


Note that the ScroogeRecordConverter is essential for Scala support, because of the way Scala classes and traits translate into actual bytecode class names (with '$' suffixes and the like).


After the above setup, creating a Spark RDD is as simple as these few lines:


  sc.newAPIHadoopFile(path, classOf[ParquetThriftInputFormat[ThriftSession]],
                      classOf[Void], classOf[ThriftSession], jobConf)
    .map { case (void, session) => session }


Note that Void is the type of the key; since it carries no data, the final map discards it. At this point we can use any of the RDD methods to extract, transform, and process our Parquet Thrift records, and you have access to native Scala Seqs and other types. All thanks to Scrooge!
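To illustrate the shape of that final map step outside of Spark, here is a self-contained sketch using a plain Seq in place of the RDD; ThriftSession is a hypothetical stand-in case class, not the real generated code:

```scala
// Hypothetical stand-in for the Scrooge-generated class (fields invented for illustration).
case class ThriftSession(id: Long, userId: String)

// newAPIHadoopFile yields (Void, value) pairs; the Void keys are always null,
// so the map step simply discards them.
val pairs: Seq[(Void, ThriftSession)] =
  Seq((null, ThriftSession(1L, "alice")), (null, ThriftSession(2L, "bob")))

val sessions = pairs.map { case (_, session) => session }
```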



Current limitations of Scrooge


As of Scrooge 3.12, the generated Scala classes are suitable for reading data but not for writing data.


  • They are immutable, so updating and changing big Thrift structures is inconvenient.
  • There is no working support for writing to Parquet files from Scrooge classes.
  • Column projection (selecting a subset of columns to read) does not yet work from Scrooge.
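On the first point, Scrooge structs typically expose a case-class-style copy method, so non-destructive updates are possible; the inconvenience is that nested structures need a copy at every level. A sketch with a plain case class standing in for a generated struct (fields are hypothetical):

```scala
// Plain case class as a stand-in for a Scrooge-generated struct (hypothetical fields).
case class ThriftSession(id: Long, userId: String)

val original = ThriftSession(1L, "alice")
val renamed  = original.copy(userId = "bob")  // returns a new struct; original is untouched
```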