scala - Running tasks automatically using Apache Spark
I am new to Apache Spark and am trying to run a test application with it. The problem I'm facing: when I create an RDD from the collection of data I want processed, the RDD gets created but Spark doesn't start processing it unless and until I call the .collect method on the RDD. That way I always have to wait for Spark to process the RDD. Is there a way to make Spark process the collection as soon as the RDD is formed, so that I can call .collect on the already-processed data at any time without having to wait?
Moreover, is there a way to have Spark put the processed data into a database instead of returning it to me?
The code I'm using is given below:
    object AppMain extends App {
      val spark = new SparkContext("local", "SparkTest")
      val list = List(1, 2, 3, 4, 5)
      // I want this RDD to be processed as soon as it is created
      val rdd = spark.parallelize(list.toSeq, 1).map { i =>
        i % 2 == 0 // checking whether the number is even or odd
      }
      // ... more functionality here ...
      // the job above only starts when the line below is executed
      val result = rdd.collect()
      result.foreach(println)
    }
Spark uses a lazy evaluation model in which transformations are only applied once an 'action' is called on the RDD. This model fits batch jobs, where an operation is applied to a large dataset. It is possible to 'cache' parts of the computation using rdd.cache(), but that does not force computation either; it only indicates that once the RDD has been materialized, it should be kept in memory.
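The same lazy-until-forced behavior can be seen in plain Scala with collection views, which makes for a useful analogy (no Spark required). This is only an illustration of the evaluation model, not Spark code; the `evaluations` counter exists purely to show when the mapping function actually runs:

```scala
object LazyDemo {
  var evaluations = 0 // counts how many elements have actually been mapped

  // Like defining an RDD transformation: nothing is computed yet
  val mapped = List(1, 2, 3, 4, 5).view.map { i =>
    evaluations += 1
    i % 2 == 0
  }

  // Like calling .collect(): forces the whole pipeline to run
  def collect(): List[Boolean] = mapped.toList
}
```

Immediately after the pipeline is defined, `LazyDemo.evaluations` is still 0; only calling `collect()` runs the mapping over all five elements. Spark behaves the same way, except the "forcing" actions (`collect`, `count`, `saveAsTextFile`, ...) may trigger a distributed job.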
Further clarification in the comments indicates that the OP might be better served by the streaming model, in which incoming data is processed in 'micro-batches'.
This is an example of how an 'urgent event count' streaming job could look (not tested, for illustrative purposes only; based on the NetworkWordCount example):
    object UrgentEventCount {
      def main(args: Array[String]): Unit = {
        // create the context with a 1-second batch size
        val sparkConf = new SparkConf().setAppName("UrgentEventCount").setMaster(sparkMaster)
        val ssc = new StreamingContext(sparkConf, Seconds(1))
        val dbConnection = db.connect(dbHost, dbPort)
        val lines = ssc.socketTextStream(ip, port, StorageLevel.MEMORY_AND_DISK_SER)
        // we assume events are tab-separated
        val events = lines.flatMap(_.split("\t"))
        val urgentEventCount = events.filter(_.contains("urgent"))
        dbConnection.execute("INSERT INTO urgentEvents ...")
        ssc.start()
      }
    }
As you can see, if you need to connect to a database, you have to provide the driver and the code that performs the DB interaction yourself. Be sure to include the driver's dependencies in the job's jar file.
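One practical detail when writing RDD contents to a database: connections are generally not serializable, so the usual pattern is to open one connection per partition on the workers (e.g. inside `foreachPartition`) rather than on the driver. The write loop itself can be sketched in plain Scala, with a hypothetical `StubConnection` standing in for a real JDBC driver (all names here are illustrative, not part of any Spark or JDBC API):

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for a real database connection (e.g. JDBC)
class StubConnection {
  val executed = ArrayBuffer.empty[String] // records executed statements, for demonstration
  def execute(sql: String): Unit = executed += sql
  def close(): Unit = ()
}

object PartitionWriter {
  // The shape of the function you would pass to rdd.foreachPartition { part => ... }:
  // open a connection once per partition, write every record, then close.
  def writePartition(records: Iterator[String], connect: () => StubConnection): StubConnection = {
    val conn = connect()
    try {
      records.foreach(r => conn.execute(s"INSERT INTO urgentEvents VALUES ('$r')"))
    } finally {
      conn.close()
    }
    conn
  }
}
```

Opening the connection inside the per-partition function (instead of capturing one from the driver in the closure) avoids serialization errors and amortizes the connection cost over all records in the partition.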