by community4u, 02-08-2017 10:21 AM (edited 02-08-2017 10:22 AM)
Written by Evan Chan 2/28/2014
Parquet is an exciting new columnar HDFS file format with built-in dictionary encoding and compression, as well as the ability to only read the columns you care about. These features make it very suitable for high-performance OLAP workloads with query engines such as Cloudera Impala or Facebook Presto.
Parquet translates each row into an object when reading the file, and it natively supports Thrift and Avro schemas for your data. Scrooge is Twitter's Scala class generator for Thrift, making it much more convenient and idiomatic to work with Thrift structs in Scala.
This blog post will lay down the steps for reading Parquet files using Scrooge-generated Scala classes for Thrift schemas, in Apache Spark.
Generating Scrooge Scala classes
To start off with, let's add the scrooge-sbt-plugin to your project/plugins.sbt file:
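The plugin line below is a sketch: the exact version (and whether `%%` or `%` cross-versioning is needed) depends on the scrooge-sbt-plugin release you use, so check the Scrooge documentation for the current coordinates.

```scala
// project/plugins.sbt — version shown is illustrative for the Scrooge 3.x era
addSbtPlugin("com.twitter" %% "scrooge-sbt-plugin" % "3.12.1")
```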
Now, put your Thrift definitions in src/main/thrift, run sbt scrooge-gen, and you should see generated Scala class files under src_gen! Note that by default, Scrooge puts the generated Scala class files in target/src_managed. If you want to publish the generated classes in a jar, check them into source control, or compile them alongside your own code, you will almost certainly want to redirect the output directory.
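As a sketch, the output-directory override could look like the following; `scroogeThriftOutputFolder` is the key the scrooge-sbt-plugin exposes for this, but verify the name against your plugin version's docs.

```scala
// build.sbt — put generated Scala files in src_gen instead of target/src_managed
scroogeThriftOutputFolder in Compile := baseDirectory.value / "src_gen"
```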
By default, the package name of the generated classes is taken from the namespace in the Thrift files. You might want to change that, though: if you generate both Java and Scala classes from the same definitions, they would otherwise land in the same package and collide. This handy setting can be added to build.sbt to remap the Scala namespace:
scroogeThriftNamespaceMap in Compile := Map("ooyala.session.thrift" -> "ooyala.session.scala")
Compiling the generated classes
Now we need to pull in the generated classes to our compile, which you can do with a setting in build.sbt like the following. This defines two source directories from the project root, src and src_gen.
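A minimal sketch of that setting, assuming the standard sbt `unmanagedSourceDirectories` key:

```scala
// build.sbt — use src and src_gen (relative to the project root) as the
// Scala source directories for the main compile
unmanagedSourceDirectories in Compile := Seq(
  baseDirectory.value / "src",
  baseDirectory.value / "src_gen"
)
```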
Note that when Spark reads Parquet files this way, the key type is Void; since the keys are empty, a final map discards them, leaving an RDD of just the Thrift records. At this point we can use any of the RDD methods to extract, transform, and process our Parquet Thrift records, with access to native Scala Seqs and other types. All thanks to Scrooge!
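For concreteness, such a read might look like the sketch below. `SessionEvent` is a hypothetical Scrooge-generated class, the path is illustrative, and the `ScroogeReadSupport` / `ParquetInputFormat` classes come from the parquet-scrooge and parquet-hadoop modules; adjust package names to match your Parquet version.

```scala
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkContext
import parquet.hadoop.ParquetInputFormat
import parquet.scrooge.ScroogeReadSupport

val sc = new SparkContext("local[2]", "parquet-scrooge-demo")
val job = new Job()
// Tell Parquet to materialize records via Scrooge rather than plain Thrift
ParquetInputFormat.setReadSupportClass(job, classOf[ScroogeReadSupport[SessionEvent]])

val events = sc.newAPIHadoopFile(
    "hdfs:///data/sessions.parquet",             // path is illustrative
    classOf[ParquetInputFormat[SessionEvent]],
    classOf[Void],
    classOf[SessionEvent],
    job.getConfiguration)
  .map { case (_, event) => event }              // keys are Void; keep only the values
```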
Current limitations of Scrooge
As of Scrooge 3.12, the generated Scala classes are suitable for reading data but not for writing it:
- They are immutable, so updating large Thrift structures is inconvenient.
- There is no working support for writing Parquet files from Scrooge classes.
- Column projection (reading only a subset of columns) does not yet work from Scrooge.