scala - Compress Output Scalding / Cascading TsvCompressed -
so people have been having problems compressing output of scalding jobs including myself. after googling odd hiff of answer in obscure forum somewhere nothing suitable peoples copy , paste needs.
i output tsv
, writes compressed output.
anyway after faffification managed write tsvcompressed output seems job (you still need set hadoop job system configuration properties, i.e. set compress true, , set codec sensible or defaults crappy deflate)
import com.twitter.scalding._ import cascading.tuple.fields import cascading.scheme.local import cascading.scheme.hadoop.{textline, textdelimited} import cascading.scheme.scheme import org.apache.hadoop.mapred.{outputcollector, recordreader, jobconf} case class tsvcompressed(p: string) extends fixedpathsource(p) delimitedschemecompressed trait delimitedschemecompressed extends source { val types: array[class[_]] = null override def localscheme = new local.textdelimited(fields.all, false, false, "\t", types) override def hdfsscheme = { val temp = new textdelimited(fields.all, false, false, "\t", types) temp.setsinkcompression(textline.compress.enable) temp.asinstanceof[scheme[jobconf,recordreader[_,_],outputcollector[_,_],_,_]] } }
Comments
Post a Comment