Cloud Dataproc

Author: Venkata Sudhakar

Google Cloud Dataproc is a fully managed cloud service for running Apache Spark, Apache Hadoop, Apache Hive, and Apache Pig workloads. It simplifies big data processing by handling cluster provisioning and management for you.

Key Features:

1. Fast cluster creation - Clusters start in 90 seconds or less, on average.

2. Autoscaling - Adds or removes worker nodes automatically based on workload.

3. Low cost - Per-second billing with preemptible VM support reduces costs significantly.

4. Versioning - Image versions let different clusters run different Spark/Hadoop versions at the same time.

5. Integrated - Native integration with BigQuery, Cloud Storage, Bigtable, and Pub/Sub.
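Several of these features correspond to options of the gcloud dataproc clusters create command. As a rough sketch, the Python helper below assembles such a command with a pinned image version, preemptible secondary workers, and an optional autoscaling policy. The cluster name, region, image version, and policy name are placeholders chosen for illustration, not values from this article.

```python
def build_create_command(cluster, region, image_version,
                         workers=2, secondary_workers=0,
                         autoscaling_policy=None):
    """Return a `gcloud dataproc clusters create` command as an argument list."""
    cmd = [
        "gcloud", "dataproc", "clusters", "create", cluster,
        f"--region={region}",
        # Versioning: pin the Spark/Hadoop image used by this cluster.
        f"--image-version={image_version}",
        f"--num-workers={workers}",
    ]
    if secondary_workers > 0:
        # Low cost: secondary workers can run on preemptible VMs.
        cmd += [
            f"--num-secondary-workers={secondary_workers}",
            "--secondary-worker-type=preemptible",
        ]
    if autoscaling_policy:
        # Autoscaling: attach a policy that was created beforehand.
        cmd.append(f"--autoscaling-policy={autoscaling_policy}")
    return cmd


# Placeholder values (assumptions for illustration only).
command = build_create_command(
    "my-cluster", "us-central1", "2.1-debian11",
    workers=2, secondary_workers=4, autoscaling_policy="my-policy",
)
print(" ".join(command))
```

The helper only builds and prints the command. To actually create the cluster, the list can be passed to subprocess.run(command, check=True) on a machine where the gcloud CLI is installed and authenticated.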

The following example shows how to submit a PySpark word count job to Cloud Dataproc.
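A minimal sketch of such a word count script is given here, under a few assumptions: the file is named wordcount.py (a hypothetical name), the input text location is passed as the first argument, and words are lowercased before counting. It is one plausible way to produce a word/count table of the kind shown in this article, not necessarily the exact script used.

```python
"""wordcount.py: count word frequencies in a text file with PySpark."""
import re
import sys
from operator import add


def tokenize(line):
    """Lowercase a line and return its words (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", line.lower())


def main(input_uri):
    # Imported here so tokenize() can be used without a Spark installation.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("wordcount").getOrCreate()

    # Classic word count: split lines into words, pair each with 1, sum by word.
    counts = (
        spark.sparkContext.textFile(input_uri)
        .flatMap(tokenize)
        .map(lambda word: (word, 1))
        .reduceByKey(add)
    )

    # Convert to a DataFrame so the result prints as a word/count table,
    # then show the five most frequent words.
    counts.toDF(["word", "count"]).orderBy(F.desc("count")).show(5)

    spark.stop()


if __name__ == "__main__":
    if len(sys.argv) == 2:
        main(sys.argv[1])
    else:
        print("usage: wordcount.py <input-uri>", file=sys.stderr)
```

On a Dataproc cluster the input can be a gs:// path, since Cloud Storage is readable from Spark there. The script would typically be uploaded to a Cloud Storage bucket before submission.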

Submit the job using the gcloud CLI:
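PySpark jobs are submitted with the gcloud dataproc jobs submit pyspark command, which takes the script location, the target cluster, and the region; anything after a bare -- is passed to the script as its arguments. The sketch below assembles that command in Python and runs it with subprocess. The bucket, cluster name, and region are placeholder assumptions.

```python
import subprocess


def build_submit_command(script_uri, cluster, region, job_args=()):
    """Return a `gcloud dataproc jobs submit pyspark` command as an argument list."""
    cmd = [
        "gcloud", "dataproc", "jobs", "submit", "pyspark", script_uri,
        f"--cluster={cluster}",
        f"--region={region}",
    ]
    if job_args:
        # Everything after "--" is handed to the PySpark script itself.
        cmd += ["--", *job_args]
    return cmd


def submit_job(script_uri, cluster, region, job_args=()):
    """Run the submission; requires an installed, authenticated gcloud CLI."""
    cmd = build_submit_command(script_uri, cluster, region, job_args)
    # check=True raises CalledProcessError if gcloud exits with an error.
    subprocess.run(cmd, check=True)


# Placeholder values (assumptions for illustration only).
print(" ".join(build_submit_command(
    "gs://my-bucket/wordcount.py", "my-cluster", "us-central1",
    ["gs://my-bucket/input.txt"],
)))
```

Calling submit_job with the same arguments sends the job to the cluster. The gcloud command streams the driver output to the terminal, so the result table appears in the console.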

It gives the following output:

+----+-----+
|word|count|
+----+-----+
| the|27801|
|   i|21028|
| and|19649|
|  to|17361|
|  of|16750|
+----+-----+

Dataproc vs Dataflow:

Cloud Dataproc - Best for lifting and shifting existing Spark/Hadoop workloads. Great for batch ETL jobs and ML workloads with Spark MLlib.

Cloud Dataflow - Best for new streaming and batch pipelines using Apache Beam. Fully serverless with no cluster management required.
