Cloud Dataproc
Author: Venkata Sudhakar
Google Cloud Dataproc is a fully managed cloud service for running Apache Spark, Apache Hadoop, Apache Hive, and Apache Pig workloads. It simplifies big data processing by handling cluster management automatically.
Key Features:
1. Fast cluster creation - Clusters start in 90 seconds or less on average.
2. Autoscaling - Automatically adjusts cluster size based on workload.
3. Low cost - Per-second billing and support for preemptible VMs reduce costs.
4. Versioning - Multiple Spark and Hadoop versions are supported side by side.
5. Integrated - Native integration with BigQuery, Cloud Storage, Bigtable, and Pub/Sub.
The example below shows how to submit a PySpark word count job to Cloud Dataproc.
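The original script is not reproduced here, so the following is only a minimal sketch of what such a word count job could look like. The file name `wordcount.py`, the `tokenize` helper, and the input path are assumptions made for illustration, not the author's code.

```python
"""wordcount.py: a sketch of a PySpark word count job (illustrative only)."""
import re
import sys


def tokenize(line):
    """Lowercase a line and return its words (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", line.lower())


def run(input_path, top_n=5):
    # Imported here so tokenize() can be used without a Spark installation.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("WordCount").getOrCreate()
    try:
        # spark.read.text yields one row per line in a column named "value".
        lines = spark.read.text(input_path)
        counts = (
            lines.rdd
            .flatMap(lambda row: tokenize(row.value))
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b)
            .toDF(["word", "count"])
        )
        # Show the most frequent words first.
        counts.orderBy(counts["count"].desc()).show(top_n)
    finally:
        spark.stop()


if __name__ == "__main__":
    # Assumed usage: wordcount.py gs://your-bucket/input.txt (placeholder path)
    if len(sys.argv) != 2:
        print("usage: wordcount.py <input_path>", file=sys.stderr)
    else:
        run(sys.argv[1])
```

On Dataproc the input would typically be a Cloud Storage path, which Spark can read directly through the `gs://` scheme.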
Submit the job using the gcloud CLI.
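The exact command used in the original is not shown. As a hedged sketch, the helper below assembles a `gcloud dataproc jobs submit pyspark` invocation from Python. The script name, cluster name, region, and input path are placeholders, and the helper functions themselves are hypothetical.

```python
import subprocess


def build_submit_command(script, cluster, region, job_args=()):
    """Return the gcloud argument list that submits a PySpark job."""
    command = [
        "gcloud", "dataproc", "jobs", "submit", "pyspark", script,
        f"--cluster={cluster}",
        f"--region={region}",
    ]
    if job_args:
        # Everything after "--" is passed to the script itself.
        command += ["--", *job_args]
    return command


def submit(script, cluster, region, job_args=()):
    """Run the submission and raise CalledProcessError if gcloud fails."""
    subprocess.run(
        build_submit_command(script, cluster, region, job_args), check=True
    )


if __name__ == "__main__":
    # Placeholder values; replace with your own cluster, region, and paths.
    print(" ".join(build_submit_command(
        "wordcount.py", "my-cluster", "us-central1",
        ["gs://your-bucket/input.txt"],
    )))
```

Run as is, this only prints the command line so it can be inspected first. Calling `submit(...)` with real values requires an installed and authenticated gcloud CLI and an existing cluster.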
The job produces the following output:
+----+-----+
|word|count|
+----+-----+
| the|27801|
|   i|21028|
| and|19649|
|  to|17361|
|  of|16750|
+----+-----+
Dataproc vs Dataflow:
Cloud Dataproc - Best for lifting and shifting existing Spark/Hadoop workloads. Well suited to batch ETL jobs and ML workloads with Spark MLlib.
Cloud Dataflow - Best for new streaming and batch pipelines using Apache Beam. Fully serverless, with no cluster management required.