My experience and walkthroughs with the GCP Skills Boost Challenge Labs.
See Challenge Lab
This lab is in four parts: a Dataflow ingestion task into BigQuery, a Dataproc/Spark task, a Dataprep task, and an ML API task.
If done efficiently, you can probably complete this whole lab in about 20 minutes.
First, let’s define some variables we can use throughout this challenge. In Cloud Shell:
export PROJECT_ID=<what the lab tells you>
export BUCKET_NAME=<whatever the lab tells you>
export lab=<what the lab tells you>
export region=<what the lab tells you>
export machine_type=e2-standard-2
We need to ingest a CSV file and process the data into BigQuery using a Dataflow template. Dataflow requires a Google Cloud Storage bucket for temporary data.
First, we create a BigQuery dataset to load the data into, and a Cloud Storage bucket for Dataflow’s temporary files.
It’s crucial to get your storage and BigQuery regions right!
In Cloud Shell:
bq --location=$region mk $lab
gsutil mb -l $region gs://$BUCKET_NAME/
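As a quick (optional) sanity check, you can confirm that both the dataset and the bucket now exist:
# the new dataset and bucket should both be listed
bq ls
gsutil ls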
Now we can simply create the Dataflow job from the Console:
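If you prefer the command line, the same job can be launched with the Google-provided “Text Files on Cloud Storage to BigQuery” template via gcloud. The sketch below is only illustrative: the transform script, schema file, input file and output table name are assumptions based on a typical run of this lab, so substitute the exact values your lab instructions give you.
# illustrative sketch - the parameter values below are assumptions; use the values from your lab
gcloud dataflow jobs run my-text-to-bq-job \
  --gcs-location=gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
  --region=$region \
  --staging-location=gs://$BUCKET_NAME/temp \
  --parameters=\
javascriptTextTransformGcsPath=gs://cloud-training/gsp323/lab.js,\
javascriptTextTransformFunctionName=transform,\
JSONPath=gs://cloud-training/gsp323/lab.schema,\
inputFilePath=gs://cloud-training/gsp323/lab.csv,\
outputTable=$PROJECT_ID:$lab.customers,\
bigQueryLoadingTemporaryDirectory=gs://$BUCKET_NAME/bigquery_temp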
That’s it! You can monitor the job progress in the Dataflow page. It takes a couple of minutes to run.
We’re asked to create a Dataproc cluster with some specific configuration settings. Once the cluster is created, we need to SSH onto one of the cluster’s nodes and, from there, copy the input text from GCS to the HDFS storage of the Dataproc cluster itself. We then need to run a Spark job that we’ve been given.
In Cloud Shell:
gcloud config set dataproc/region $region
gcloud dataproc clusters create my-cluster \
--worker-machine-type=$machine_type \
--num-workers=2 \
--worker-boot-disk-size=500
This creates a cluster with two workers, using the specified machine type. It takes a couple of minutes for the cluster to be created. Once it has been created, open the Compute Engine page, and from there, SSH onto one of the worker machines in the cluster.
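You can also open that SSH session straight from Cloud Shell. The node name below assumes Dataproc’s default naming convention (the cluster name followed by -w-0 for the first worker); the zone is whatever your cluster was created in.
# node name assumes the default <cluster>-w-0 naming; set the zone to match your cluster
gcloud compute ssh my-cluster-w-0 --zone=<your cluster zone>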
From the SSH session:
hdfs dfs -cp gs://cloud-training/gsp323/data.txt /data.txt
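Still in the SSH session, it’s worth confirming the file actually landed in HDFS:
# data.txt should be listed at the HDFS root
hdfs dfs -ls /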
Now, back in Cloud Shell:
gcloud dataproc jobs submit spark --cluster my-cluster \
--class org.apache.spark.examples.SparkPageRank \
--max-failures-per-hour=1 \
--jars file:///usr/lib/spark/examples/jars/spark-examples.jar -- /data.txt
This runs a Spark job with a main class of org.apache.spark.examples.SparkPageRank, with the specified jar location, with the maximum failures per hour set to 1, and with the final parameter of /data.txt, which is read from HDFS.
It runs and completes pretty quickly.
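If you want to confirm the result from the command line rather than the Console, you can list the jobs on the cluster:
# the PageRank job should show a state of DONE
gcloud dataproc jobs list --cluster=my-cluster --region=$region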
Here we’re going to do some simple data wrangling of an input csv file, using Dataprep.
This is all done through the Console.
Create a New Flow, then import a dataset from Cloud Storage: browse to gs://cloud-training, open the gsp323 folder, find the runs.csv file and select it. Name the flow, e.g. Lab Runs. (What you call it doesn’t matter.)
Now we’ll edit the recipe:
Filter the rows on SUCCESS –> Keep rows –> Add.
Filter the rows on /(^0$|^0\.0$)/ –> Delete matching rows –> Add.
Add a New step and enter the following to rename the columns:
rename type: manual mapping: [column2,'runid'],[column3,'userid'],[column4,'labid'],[column5,'lab_title'],[column6,'start'],[column7,'end'],[column8,'time'],[column9,'score'],[column10,'state']
Finally, run the job, choosing Dataflow as the runner.
It takes a couple of minutes.
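Because Dataprep executes the recipe as a Dataflow job under the hood, you can also watch its progress from Cloud Shell:
# the Dataprep-generated job appears alongside any other Dataflow jobs in the region
gcloud dataflow jobs list --region=$region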
You can do any of the three ML API tasks. But here, I’ve gone with the Natural Language API task. For this task, we need to analyse the text:
“Old Norse texts portray Odin as one-eyed and long-bearded, frequently wielding a spear named Gungnir and wearing a cloak and a broad hat.”
We need to create a service account, obtain its key file, and give the service account the storage.admin role:
gcloud iam service-accounts create lab-svc
# create a key file for the service account
gcloud iam service-accounts keys create key.json --iam-account lab-svc@$PROJECT_ID.iam.gserviceaccount.com
# Add storage admin
gcloud projects add-iam-policy-binding $PROJECT_ID --member=serviceAccount:lab-svc@$PROJECT_ID.iam.gserviceaccount.com --role=roles/storage.admin
# point Application Default Credentials at the key file
export GOOGLE_APPLICATION_CREDENTIALS="/home/$USER/key.json"
# authenticate (activate) the service account
gcloud auth activate-service-account --key-file key.json
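To double-check that the service account is now the active credential:
# lab-svc should be shown as the active account
gcloud auth list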
Now we can call the API:
gcloud ml language analyze-entities --content="Old Norse texts portray Odin as one-eyed and long-bearded, frequently wielding a spear named Gungnir and wearing a cloak and a broad hat." > result.json
Finally, we need to copy the result file to the specified bucket, in order for the lab to validate. Your required file name might be different.
gsutil cp result.json gs://$PROJECT_ID-marking/task4-cnl-907.result
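As a last check, make sure the result object is sitting in the marking bucket (the object name here follows the example above; yours may differ):
# the task4 result file should be listed
gsutil ls gs://$PROJECT_ID-marking/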