
Cloud Dataproc

Author: Venkata Sudhakar

Google Cloud Dataproc is a fully managed, fast, easy-to-use cloud service for running Apache Spark, Apache Hadoop, Apache Hive, and Apache Pig workloads. It simplifies big data processing by handling cluster management automatically.

Key Features:

1. Fast cluster creation - Clusters start in 90 seconds or less, on average.

2. Auto-scaling - Automatically adjusts cluster size based on workload.

3. Low cost - Per-second billing with preemptible VM support reduces costs significantly.

4. Versioning - Image versioning lets each cluster select from multiple supported Spark/Hadoop versions.

5. Integrated - Native integration with BigQuery, Cloud Storage, Bigtable, and Pub/Sub.
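
Several of these features correspond to options chosen when a cluster is created. The following is a minimal sketch, not a definitive setup: it only assembles a gcloud dataproc clusters create command as a Python list. The cluster name, region, worker counts, and image version are placeholder values assumed for illustration.

```python
import shlex


def build_create_command(cluster, region, workers=2, secondary_workers=2,
                         image_version="2.1-debian11"):
    """Assemble a `gcloud dataproc clusters create` command as an argument list.

    All default values are illustrative assumptions, not recommendations.
    """
    return [
        "gcloud", "dataproc", "clusters", "create", cluster,
        f"--region={region}",
        # Feature 4: pin the Spark/Hadoop stack through the image version.
        f"--image-version={image_version}",
        f"--num-workers={workers}",
        # Feature 3: secondary workers on preemptible VMs reduce cost.
        f"--num-secondary-workers={secondary_workers}",
        "--secondary-worker-type=preemptible",
    ]


create_cmd = build_create_command("my-cluster", "us-central1")
print(shlex.join(create_cmd))
```

Building the command as a list keeps each flag visible and lets it be passed to subprocess.run(create_cmd, check=True) once the gcloud CLI is installed and authenticated.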

The following example shows how to submit a PySpark word count job to Cloud Dataproc.
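
A minimal sketch of such a word count job might look like this. The file name wordcount.py, the regex-based tokenizer, and the choice to print the five most frequent words are assumptions made for illustration. The input path is passed as the only command-line argument.

```python
# wordcount.py (hypothetical file name)
import re
import sys

# Assumed tokenizer: runs of lowercase letters and apostrophes count as words.
WORD_RE = re.compile(r"[a-z']+")


def tokenize(line):
    """Lowercase a line of text and return the words it contains."""
    return WORD_RE.findall(line.lower())


def run(input_path, top_n=5):
    """Count words in input_path and print the top_n most frequent ones."""
    # Imported here so tokenize() can be tried out without a Spark install.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wordcount").getOrCreate()
    lines = spark.sparkContext.textFile(input_path)
    counts = (
        lines.flatMap(tokenize)
        .map(lambda word: (word, 1))
        .reduceByKey(lambda a, b: a + b)
    )
    df = counts.toDF(["word", "count"])
    df.orderBy(df["count"].desc()).limit(top_n).show()
    spark.stop()


# Runs only when exactly one argument (the input path) is supplied.
if __name__ == "__main__" and len(sys.argv) == 2:
    run(sys.argv[1])
```

On a Dataproc cluster the input path would typically be a Cloud Storage URI (gs://...), since the Cloud Storage connector is available to Spark there.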


Submit the job using the gcloud CLI:
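
As a hedged sketch, the submit command can be assembled and printed from Python as shown here. The bucket, script, cluster, and region names are hypothetical placeholders. Everything after the bare -- separator is passed to the PySpark script as its own arguments.

```python
import shlex


def build_submit_command(script_uri, cluster, region, job_args=()):
    """Assemble a `gcloud dataproc jobs submit pyspark` command as a list."""
    cmd = [
        "gcloud", "dataproc", "jobs", "submit", "pyspark", script_uri,
        f"--cluster={cluster}",
        f"--region={region}",
    ]
    if job_args:
        # Arguments after "--" go to the job script, not to gcloud.
        cmd += ["--", *job_args]
    return cmd


# Placeholder values; replace them with your own bucket, cluster, and region.
submit_cmd = build_submit_command(
    "gs://my-bucket/wordcount.py",
    cluster="my-cluster",
    region="us-central1",
    job_args=["gs://my-bucket/input.txt"],
)
print(shlex.join(submit_cmd))
```

The printed line can be pasted into a shell, or the list can be run directly with subprocess.run(submit_cmd, check=True) on a machine where the gcloud CLI is installed and authenticated.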


The job gives the following output:

+----+-----+
|word|count|
+----+-----+
| the|27801|
|   i|21028|
| and|19649|
|  to|17361|
|  of|16750|
+----+-----+

Dataproc vs Dataflow:

Cloud Dataproc - Best for lifting and shifting existing Spark/Hadoop workloads. Great for batch ETL jobs and ML workloads with Spark MLlib.

Cloud Dataflow - Best for new streaming and batch pipelines using Apache Beam. Fully serverless with no cluster management required.


 
  


  