
Using Parquet and Scrooge with Spark


Written by Evan Chan 2/28/2014


Parquet is an exciting new columnar HDFS file format with built-in dictionary encoding and compression, as well as the ability to only read the columns you care about. These features make it very suitable for high-performance OLAP workloads with query engines such as Cloudera Impala or Facebook Presto.

Parquet translates each row into an object when reading the file, and it natively supports Thrift and Avro schemas for your data. Scrooge is Twitter's Scala class generator for Thrift, making it much more convenient and idiomatic to work with Thrift structs in Scala.

This blog post lays out the steps for reading Parquet files in Apache Spark using Scrooge-generated Scala classes for Thrift schemas.

 

 

Generating Scrooge Scala classes

 

To start off with, let's add the scrooge-sbt-plugin to your project/plugins.sbt file:

 

addSbtPlugin("com.twitter" %% "scrooge-sbt-plugin" % "3.12.0")

Next, we need to activate this plugin and put down some basic settings. These lines are for build.sbt; if you use a full Scala build definition, you should be able to adjust accordingly.

 

// Activate the Scrooge plugin's settings and add the runtime library the generated code needs
com.twitter.scrooge.ScroogeSBT.newSettings

libraryDependencies += "com.twitter" %% "scrooge-core" % "3.12.0"

// Put the generated Scala sources under src_gen/ instead of the default target/src_managed
scroogeThriftOutputFolder in Compile <<= baseDirectory(_ / "src_gen")

 

Now, put your Thrift definitions in src/main/thrift, run sbt scrooge-gen, and you should see generated Scala class files under src_gen! Note that by default, Scrooge puts the generated Scala files in target/src_managed; if you want to publish the generated classes in a jar, check them in, or compile them yourself, you will almost certainly want a setting like the one above.
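
For concreteness, here is a minimal sketch; the ThriftSession struct, its fields, and the ooyala.session.thrift namespace are hypothetical stand-ins rather than anything from a real project.

// Hypothetical Thrift definition (src/main/thrift/session.thrift):
//
//   namespace java ooyala.session.thrift
//
//   struct ThriftSession {
//     1: required string sessionId
//     2: optional i64 startTime
//   }

// After `sbt scrooge-gen`, the generated class lives in the package named by the
// Thrift namespace, and instances are built through the generated companion apply:
import ooyala.session.thrift.ThriftSession

val session = ThriftSession(sessionId = "abc123", startTime = Some(1393545600000L))
val id: String = session.sessionId          // fields are plain read-only accessors
val start: Option[Long] = session.startTime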

 

By default, the package name of the generated classes is taken from the namespace declared in the Thrift files. You might want to change the package name, though; for example, if you generate both Java and Scala classes from the same definitions, they should land in different packages. This handy setting can be added to build.sbt:

 

scroogeThriftNamespaceMap in Compile := Map("ooyala.session.thrift" -> "ooyala.session.scala")

 

 

Compiling the generated classes

 

Now we need to pull the generated classes into our compile, which you can do with a build.sbt setting like the following. It defines two source directories relative to the project root: src and src_gen.

 

unmanagedSourceDirectories in Compile <<= Seq(baseDirectory(_ / "src"),
                                              baseDirectory(_ / "src_gen")).join

 

Assuming you have SBT running with the above setting, here are the steps in SBT to compile:

  1. Do a reload to reload the build definition with the new source directory setting.
  2. Do a compile. The generated classes should show up in target.
  3. If you need to use the classes in another project, you may want to publish or publish-local.

 

TODO: automatic generation of Scrooge classes at compile time

 

 

Reading Parquet files with Scrooge in Spark

 

This is fairly straightforward. First you have to set up a job configuration; in the following, replace ThriftSession with the name of your own Scrooge-generated Thrift class:

 

import org.apache.hadoop.mapred.JobConf
import parquet.hadoop.ParquetInputFormat
import parquet.hadoop.thrift.{ParquetThriftInputFormat, ThriftReadSupport}
import parquet.scrooge.ScroogeRecordConverter

val jobConf = new JobConf
// Materialize records as Thrift objects, using the Scrooge converter for Scala classes
ParquetInputFormat.setReadSupportClass(jobConf, classOf[ThriftReadSupport[ThriftSession]])
ThriftReadSupport.setRecordConverterClass(jobConf, classOf[ScroogeRecordConverter[ThriftSession]])
ParquetThriftInputFormat.setThriftClass(jobConf, classOf[ThriftSession])

 

Note that the ScroogeRecordConverter is essential for Scala support: Scala classes and traits compile down to JVM bytecode classes whose names carry '$' suffixes, which the standard Thrift record converter does not account for.

 

After the above setup, creating a Spark RDD is as simple as these few lines:

 

  sc.newAPIHadoopFile(path, classOf[ParquetThriftInputFormat[ThriftSession]],
                      classOf[Void], classOf[ThriftSession], jobConf)
    .map { case (void, session) => session }

 

Note that Void is the type of the key; since it carries no data, we drop it with the final map. At this point we can use any of the RDD methods to extract, transform, and process our Parquet Thrift records, with access to native Scala Seqs and other types, all thanks to Scrooge!
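
For example, here is a minimal sketch of downstream processing. It assumes sc, path, and jobConf are as in the snippets above, and the startTime field comes from the hypothetical struct sketched earlier in this post, so substitute your own struct's members.

import org.apache.spark.SparkContext._   // brings in reduceByKey on pair RDDs (needed in older Spark versions)

val sessions = sc.newAPIHadoopFile(path, classOf[ParquetThriftInputFormat[ThriftSession]],
                                   classOf[Void], classOf[ThriftSession], jobConf)
                 .map { case (_, session) => session }

// Bucket sessions by day, using the hypothetical optional epoch-millis startTime field
val sessionsPerDay = sessions
  .flatMap(_.startTime)                      // drop sessions without a start time
  .map(millis => (millis / 86400000L, 1L))   // epoch millis -> day number
  .reduceByKey(_ + _)

sessionsPerDay.collect().foreach { case (day, count) => println(s"day $day: $count sessions") }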

 

 

Current limitations of Scrooge

 

As of Scrooge 3.12, the generated Scala classes are suitable for reading data but not for writing data.

 

  • They are immutable, so updating and changing big Thrift structures is inconvenient (see the sketch after this list)
  • There is no working support for writing to Parquet files from Scrooge classes
  • Column projection (selecting a subset of columns to read) does not yet work from Scrooge
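
To illustrate the first point, here is a minimal sketch using the hypothetical ThriftSession from earlier. The generated fields are read-only, so "changing" a field means rebuilding the whole struct through the companion apply (or through a generated copy method, if your Scrooge version provides one).

import ooyala.session.thrift.ThriftSession   // hypothetical package from the earlier sketch

val original = ThriftSession(sessionId = "abc123", startTime = Some(1393545600000L))

// No setters are generated, so "updating" startTime means constructing a brand-new
// instance and carrying every other field over by hand, which is awkward for large structs:
val updated = ThriftSession(
  sessionId = original.sessionId,
  startTime = Some(System.currentTimeMillis)
)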

 

 

 
