Ship it to the cluster, run it on the cluster, and validate.

    ordersCSV = spark.read. \
        csv(inputBaseDir + '/orders'). \
        toDF('order_id', 'order_date', 'order_customer ...