Friday, July 24, 2015

Create an EMR (Amazon Elastic MapReduce) Cluster Using the AWS CLI and Run a Python Spark Job on It

I spent a few hours today getting a Spark program, which I knew ran fine on my local machine, up and running on an EMR cluster. Since Amazon only announced official Spark support a few days ago, the documentation was not good enough, so it was a bit painful to find the right commands. My goal was to use the CLI so that I can automate the process later on. This is the CLI version; I will have a look at boto (the AWS Python SDK) in the future to fully utilize the API (there is a rough sketch at the end of this post), but for now, with the following recipe you should be able to create a cluster and submit your Spark job.
So here is the recipe ;)

Before you start, you need to install the AWS CLI and configure it. Use the following commands:
sudo pip install awscli
aws configure
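aws configure will prompt you for your credentials, default region and output format, roughly like this (the values below are just placeholders):

AWS Access Key ID [None]: AKIAXXXXXXXXXXXXXXXX
AWS Secret Access Key [None]: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Default region name [None]: eu-west-1
Default output format [None]: json
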
After configuring the AWS CLI, start the process:

1) Create a cluster using the AWS CLI:
First, create the default roles:
aws emr create-default-roles
Then create a cluster using the CLI.
A couple of notes:

  • the current AMI version is 3.8
  • the application name is Spark
  • you can write logs to S3 using the --log-uri option
  • the supported instance types start from m3.xlarge

aws emr create-cluster --name "Spark cluster" --ami-version 3.8 --applications Name=Spark --ec2-attributes KeyName=ir --log-uri s3://Path/logs --instance-type m3.xlarge  --instance-count 1 --use-default-roles 
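
If the cluster is created successfully, the command should print the id of the new cluster, something like:

{
    "ClusterId": "j-3I1E1Q5RZKPLL"
}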

2) After creating your cluster, you can check its status using:
aws emr list-clusters
You get the list of clusters:

"Clusters": [
        {
            "Status": {
                "Timeline": {
                    "ReadyDateTime": 1437738625.815,
                    "CreationDateTime": 1437738374.25
                },
                "State": "RUNNING",
                "StateChangeReason": {
                    "Message": "Running step"
                }
            },
            "NormalizedInstanceHours": 8,
            "Id": "j-3I1E1Q5RZKPLL",
            "Name": "Spark cluster"
        },
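
If you want the full details of a single cluster (instance groups, applications and so on), there is also a per-cluster command that should give you a more complete picture:
aws emr describe-cluster --cluster-id j-3I1E1Q5RZKPLL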

3) You can SSH to the master node of the cluster and test the job there using:
aws emr ssh --cluster-id j-3I1E1Q5RZKPLL --key-pair-file ~/mykeypath.pem  
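
Once you are on the master node, Spark on this AMI lives under /home/hadoop/spark (the same path used in the next steps), so a quick sanity check before submitting anything could be to open the interactive shell:
/home/hadoop/spark/bin/pyspark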

4) You can also copy your Python application onto the machine and run it there. So first copy the file:
aws emr put --cluster-id j-1A9EIDW2XFMNS --key-pair-file ~/Documents/AWS/ir.pem --src myscript.py --dest /home/hadoop/spark/
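
For reference, myscript.py can be any self-contained PySpark application. As a minimal sketch (a made-up example, not the actual script I used), something like this is enough to verify that the cluster works:

# myscript.py -- minimal PySpark job (hypothetical example)
from pyspark import SparkConf, SparkContext

if __name__ == "__main__":
    conf = SparkConf().setAppName("myscript")
    sc = SparkContext(conf=conf)

    # trivial sanity check: sum the numbers 1..100 across the cluster
    total = sc.parallelize(range(1, 101)).reduce(lambda a, b: a + b)
    print("sum of 1..100 = %d" % total)

    sc.stop()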

5) Now you are one step away from running your job. The cluster is waiting to receive jobs; you simply need to add a step, which defines a job, and after that your job will start.
A couple of notes:

  • give the path to your script in Args
  • you can define what the cluster should do if the step fails; ActionOnFailure can be selected from: "TERMINATE_CLUSTER"|"CANCEL_AND_WAIT"|"CONTINUE"

aws emr add-steps --cluster-id j-3I1E1Q5RZKPLL --steps Name=Spark,Jar=s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/spark/bin/spark-submit,--deploy-mode,client,/home/hadoop/spark/myscript.py],ActionOnFailure=CONTINUE
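
If the step is accepted, the command should print the id of the new step, something like:

{
    "StepIds": [
        "s-xxxxxx"
    ]
}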
You can then use that step id to see the details of the step and the status of your job:
aws emr list-steps --cluster-id j-3I1E1Q5RZKPLL --step-ids s-xxxxxx  
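
And as mentioned at the top, the plan is to eventually drive all of this from Python instead of the shell. Here is a rough, untested sketch of steps 1 and 5 using boto3 (the newer AWS Python SDK); the parameter names are my best reading of the SDK, and the region is assumed to be eu-west-1 based on the script-runner path above, so double-check everything against the boto3 EMR documentation before relying on it:

# sketch: create the cluster and submit the Spark step with boto3
# (assumed parameters -- verify against the boto3 EMR documentation)
import boto3

emr = boto3.client("emr", region_name="eu-west-1")  # region assumed

cluster = emr.run_job_flow(
    Name="Spark cluster",
    AmiVersion="3.8",
    LogUri="s3://Path/logs",
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m3.xlarge",
        "InstanceCount": 1,
        "Ec2KeyName": "ir",
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)

# same script-runner trick as the add-steps command above
emr.add_job_flow_steps(
    JobFlowId=cluster["JobFlowId"],
    Steps=[{
        "Name": "Spark",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar",
            "Args": ["/home/hadoop/spark/bin/spark-submit",
                     "--deploy-mode", "client",
                     "/home/hadoop/spark/myscript.py"],
        },
    }],
)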

Done! Liked it? Leave a comment.
Cheers,


