I spent a few hours today getting a Spark program, which I knew ran fine on my local machine, up and running on an EMR cluster. Since Amazon only announced official support for Spark a few days ago, the documentation was not good enough, so it was a bit painful to find the right commands. My goal was to use the CLI so that I can automate the process later on. This is the CLI version; I will have a look at boto (the AWS Python SDK) in the future to fully utilize the API, but for now, with the following recipe, you should be able to create a cluster and submit your Spark job.
So here is the recipe ;)
Before you start, you need to install the AWS CLI and configure it. Use the following commands:
sudo pip install awscli
aws configure
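If you want to script this setup too (the whole point here is automation), the same configuration can be done non-interactively with aws configure set. A minimal sketch, where the key values are placeholders and eu-west-1 matches the region of the script-runner JAR used later:
# Non-interactive AWS CLI configuration (placeholder credentials)
aws configure set aws_access_key_id YOUR_ACCESS_KEY_ID
aws configure set aws_secret_access_key YOUR_SECRET_ACCESS_KEY
aws configure set default.region eu-west-1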
After configuring the AWS CLI, start the process:
1) Create a cluster using the AWS CLI:
First, create the default roles:
aws emr create-default-roles
Then create a cluster using the CLI. A couple of notes:
- the current AMI version is 3.8
- the application name is Spark
- you can write logs to S3 using the --log-uri option
- the supported instance types start from m3.xlarge
aws emr create-cluster --name "Spark cluster" --ami-version 3.8 --applications Name=Spark --ec2-attributes KeyName=ir --log-uri s3://Path/logs --instance-type m3.xlarge --instance-count 1 --use-default-roles
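Since the goal is to automate this later, it also helps to capture the new cluster's ID directly instead of copying it from the output by hand. A rough sketch of the same command (the key name and S3 path are the placeholders from above), using --query to pull out the ClusterId:
# Create the cluster and keep its ID for the following commands
CLUSTER_ID=$(aws emr create-cluster --name "Spark cluster" --ami-version 3.8 \
  --applications Name=Spark --ec2-attributes KeyName=ir --log-uri s3://Path/logs \
  --instance-type m3.xlarge --instance-count 1 --use-default-roles \
  --query 'ClusterId' --output text)
echo "Created cluster $CLUSTER_ID"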
2) After creating your cluster, you can see its status in detail using:
aws emr list-clusters
You get the list of clusters:
"Clusters": [
{
"Status": {
"Timeline": {
"ReadyDateTime": 1437738625.815,
"CreationDateTime": 1437738374.25
},
"State": "RUNNING",
"StateChangeReason": {
"Message": "Running step"
}
},
"NormalizedInstanceHours": 8,
"Id": "j-3I1E1Q5RZKPLL",
"Name": "Spark cluster"
},
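If you are scripting this, you can also poll the cluster state until it is ready to accept steps (the WAITING state) rather than re-running list-clusters by hand. A rough sketch using describe-cluster and the $CLUSTER_ID captured earlier:
# Wait until the cluster is ready to accept steps
while true; do
  STATE=$(aws emr describe-cluster --cluster-id "$CLUSTER_ID" \
    --query 'Cluster.Status.State' --output text)
  echo "Cluster state: $STATE"
  [ "$STATE" = "WAITING" ] && break
  sleep 30
done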
3) You can SSH to the master node of the cluster and test the job using:
aws emr ssh --cluster-id j-3I1E1Q5RZKPLL --key-pair-file ~/mykeypath.pem
4) You can also copy your Python application to the machine and run it there. First, copy the file:
aws emr put --cluster-id j-1A9EIDW2XFMNS --key-pair-file ~/Documents/AWS/ir.pem --src myscript.py --dest /home/hadoop/spark/
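If you want to test the script by hand before submitting it as a step, you can SSH in (step 3) and run it with spark-submit directly. A rough sketch of what you would run on the master node, using the same paths as the step definition in step 5:
# On the master node, after copying myscript.py
/home/hadoop/spark/bin/spark-submit --deploy-mode client /home/hadoop/spark/myscript.py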
5) Now you are one step away from running your job. The cluster is waiting to receive jobs; you simply need to add a step, which defines a job, and after that your job will start.
A couple of notes:
- give the path to your script in Args
- you can define what the cluster does after running the job; the action on failure can be selected from: "TERMINATE_CLUSTER"|"CANCEL_AND_WAIT"|"CONTINUE"
aws emr add-steps --cluster-id j-3I1E1Q5RZKPLL --steps Name=Spark,Jar=s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/spark/bin/spark-submit,--deploy-mode,client,/home/hadoop/spark/myscript.py],ActionOnFailure=CONTINUE
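If your Python script takes command-line arguments, they can simply be appended after the script path inside Args, and spark-submit will pass them through to the application. A sketch of the same step with hypothetical --input and --output arguments (the S3 paths are placeholders):
# Same step, but with (hypothetical) arguments passed on to myscript.py
aws emr add-steps --cluster-id j-3I1E1Q5RZKPLL --steps Name=Spark,Jar=s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/spark/bin/spark-submit,--deploy-mode,client,/home/hadoop/spark/myscript.py,--input,s3://Path/input,--output,s3://Path/output],ActionOnFailure=CONTINUE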
After running the add-steps command you will receive a step ID in the terminal, with which you can see the details of the step plus the status of your job:
aws emr list-steps --cluster-id j-3I1E1Q5RZKPLL --step-ids s-xxxxxx
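For full automation you can again capture that step ID with --query and poll it until the job reaches a terminal state. A rough sketch building on the $CLUSTER_ID variable from earlier:
# Submit the step, keep its ID, and wait for it to finish
STEP_ID=$(aws emr add-steps --cluster-id "$CLUSTER_ID" \
  --steps Name=Spark,Jar=s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/spark/bin/spark-submit,--deploy-mode,client,/home/hadoop/spark/myscript.py],ActionOnFailure=CONTINUE \
  --query 'StepIds[0]' --output text)
while true; do
  STEP_STATE=$(aws emr list-steps --cluster-id "$CLUSTER_ID" --step-ids "$STEP_ID" \
    --query 'Steps[0].Status.State' --output text)
  echo "Step state: $STEP_STATE"
  case "$STEP_STATE" in COMPLETED|FAILED|CANCELLED) break ;; esac
  sleep 30
done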
Done! Liked it? Leave a comment.
Cheers,