We were working on a Spark job to read JSON files out of HDFS, and it seemed to be running way too slowly. It turned out the culprit was the JSON parsing library. So here are some notes, covering both performance and correctness, to help others navigate the Scala JSON parsing landscape, where there are at least 6 different libraries.
Our use case
We needed to de/serialize:
Files with nested JSON objects, one per line, with both string and numeric values - so basically a Map[String, Any]. We would like the nested maps to all be scala.Map's.
A case class with optional fields, some of which are Lists, and ideally should be initialized to Nil by default.
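To make the two target shapes concrete, here is a minimal sketch in plain Scala. The names `JsonMap` and `Event` and all of the field names are hypothetical, chosen only for illustration; the real case class in our job looked different.

```scala
object UseCaseShapes {
  // Shape 1: one JSON object per line, decoded into nested immutable maps.
  // Values are String, numeric, or another JsonMap.
  type JsonMap = Map[String, Any]

  // Shape 2: a case class with optional fields. List fields default to Nil
  // so that a record without them still yields an empty list, not null.
  // (Hypothetical fields, for illustration only.)
  case class Event(
    id: String,
    user: Option[String] = None,
    tags: List[String] = Nil
  )
}
```

The defaults only describe what we want in Scala; whether a given JSON library honors them when a field is missing is exactly what the notes below are about.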
The test data is a 91,582-line, 34.5MB file with one JSON blob per line. The baseline time for just reading this file, using something like scala.io.Source.fromFile(logfile).getLines.length, is ~130ms.
All benchmarking was conducted on a late-model MacBook Pro, using Scala 2.9.3.
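For reference, the baseline read can be measured with something along these lines. This is a rough sketch using only the standard library; the `Baseline` object and the `timed` helper are my own names, not part of any benchmark harness, and a single run like this ignores JIT warm-up.

```scala
import scala.io.Source

object Baseline {
  // Runs `body` once and returns its result plus elapsed wall-clock milliseconds.
  def timed[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1000000L)
  }

  // Counts lines without parsing them: the floor that any JSON parser
  // reading the same file has to sit on top of.
  def countLines(path: String): Int = {
    val src = Source.fromFile(path, "UTF-8")
    try src.getLines().length finally src.close()
  }
}
```

Usage would look like `val (n, ms) = Baseline.timed(Baseline.countLines(logfile))`, where `logfile` is the path to the test file.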
Spray-json is based on the parboiled parsing library.
8 seconds to deserialize the file - 10x faster than JacksMapper. Woohoo!
Does not natively unpack to Map[String, Any] -- we needed to supply a new type class to handle this
Only treats case class fields of Option[_] type as optional - any other fields that are missing from the JSON will cause an exception to be thrown. We did not test this out as our case class did not have Options.
One benefit is that it has an easy API to generate pretty-printed JSON. Oh, and of course it natively integrates with spray, soon to be akka-http.
NOTE: A major new version of spray-json's backend Parboiled parser has been made available, which should result in order-of-magnitude improvements in parsing times. Unfortunately it's not been incorporated into spray-json yet as of the time of this testing.
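The extra type class for Map[String, Any] mentioned above might look roughly like the sketch below. It is not the exact code we used, and it assumes a spray-json 1.3.x API (`parseJson`, `JsArray` backed by a `Vector`), which is newer than the version tested here. The names `AnyJsonProtocol`, `AnyJsonFormat`, and `parseLine` are hypothetical.

```scala
import spray.json._

object AnyJsonProtocol extends DefaultJsonProtocol {
  // A JsonFormat for untyped values, so that DefaultJsonProtocol's map format
  // can be instantiated at Map[String, Any].
  implicit object AnyJsonFormat extends JsonFormat[Any] {
    def write(x: Any): JsValue = x match {
      case null          => JsNull
      case s: String     => JsString(s)
      case b: Boolean    => JsBoolean(b)
      case n: Int        => JsNumber(n)
      case n: Long       => JsNumber(n)
      case n: Double     => JsNumber(n)
      case n: BigDecimal => JsNumber(n)
      case m: Map[_, _]  =>
        val fields: Map[String, JsValue] =
          m.map { case (k, v) => k.toString -> write(v) }
        JsObject(fields)
      case xs: Seq[_]    => JsArray(xs.map(write).toVector)
      case other         => serializationError("Unsupported type: " + other.getClass)
    }

    def read(json: JsValue): Any = json match {
      case JsString(s)      => s
      case JsNumber(n)      => n // numbers come back as BigDecimal
      case JsBoolean(b)     => b
      case JsNull           => null
      case JsArray(elems)   => elems.map(read)
      case JsObject(fields) => fields.map { case (k, v) => k -> read(v) }
    }
  }

  // Parses one line of the input file into nested immutable maps.
  def parseLine(line: String): Map[String, Any] =
    line.parseJson.convertTo[Map[String, Any]]
}
```

With the format in scope, `someMap.toJson.prettyPrint` should also work for the reverse direction. Note that every number is surfaced as a BigDecimal, so callers that want Int or Double need to convert.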
json4s is a very promising project started by the guys from Wordnik (of Swagger fame). It aims to unify Scala JSON ASTs, sports multiple backends (including Jackson), and has native support from both Scalatra and Spray.
Native deserialization - 940 ms (based on the Lift web framework JSON parser)
Jackson deserialization - 670 ms
Can deserialize to Map[String, Any], including nested ones, but only through a clumsy workaround rather than the native read method
Missing case class fields throw an exception. :( You can, however, define alternative constructors to get around part of the issue, and writing a custom type class for deserialization is pretty easy.
Easy pretty printing (Serialization.writePretty)
One thing json4s has that the others don't is an extremely rich functional API for transforming the AST. It can also work with XML, apparently.
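Here is a small sketch of the json4s features listed above, using the Jackson backend. It assumes the json4s-jackson artifact is on the classpath; `Json4sEvent`, `Json4sSketch`, and the method names are hypothetical, and going through `values` is just one possible way to get a Map[String, Any], not necessarily the workaround we ended up with.

```scala
import org.json4s._
import org.json4s.jackson.JsonMethods.parse
import org.json4s.jackson.Serialization

// Hypothetical record type; Option fields may be absent from the JSON.
case class Json4sEvent(id: String, user: Option[String])

object Json4sSketch {
  implicit val formats: Formats = DefaultFormats

  // Unwraps the AST into plain Scala values. Nested objects become nested
  // immutable Maps; integral numbers come back as BigInt by default.
  def toMap(line: String): Map[String, Any] =
    parse(line).values.asInstanceOf[Map[String, Any]]

  // Case class extraction; a missing Option field becomes None.
  def toEvent(line: String): Json4sEvent =
    parse(line).extract[Json4sEvent]

  // Pretty-printed serialization.
  def pretty(e: Json4sEvent): String =
    Serialization.writePretty(e)

  // AST transformation: upper-case every field name, at any nesting depth.
  def upperCaseKeys(json: JValue): JValue =
    json.transformField { case (k, v) => (k.toUpperCase, v) }
}
```

Swapping the import to `org.json4s.native.JsonMethods.parse` should select the native (Lift-derived) parser instead, which is how the two timings above can be compared with otherwise identical code.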
None of the tested frameworks is perfect. If I had to pick one, I would go with json4s -- it has the most support and features, and with the jackson backend it performs just as fast as jackson-scala-module and jerkson.
All of the frameworks offer rich ASTs for transformation of JSON entities before finally converting back into actual Scala objects. In theory you can build an even faster
I know this post will attract lots of comments from folks saying "But what about XXX?" I apologize in advance; we only had time to test a few that we were considering to improve our correctness and performance, but suggestions are welcome.