Sunday, September 13, 2015

Embedding Fonts in a PDF for a LaTeX Submission

Some conferences validate the PDF before you can submit your paper, and a common problem is an error about fonts that are not embedded. The way to get around this is to convert the PDF to PostScript and then back to PDF. The following commands show how to do it: 
pdflatex yourfile.tex 
pdftops yourfile.pdf
ps2pdf14 -dPDFSETTINGS=/prepress yourfile.ps your-newfile.pdf
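
To double-check that the fonts really are embedded in the final file, you can inspect it with pdffonts (part of the poppler/xpdf utilities, assuming you have them installed); every font should show "yes" in the emb column:
pdffonts your-newfile.pdf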




Thursday, August 6, 2015

Install Maven with Yum on Amazon Linux

Install Maven on AWS (Amazon) Linux with the following commands; the sed line pins $releasever to 6 so that yum can resolve the repository path on Amazon Linux:


sudo wget http://repos.fedorapeople.org/repos/dchen/apache-maven/epel-apache-maven.repo -O /etc/yum.repos.d/epel-apache-maven.repo
sudo sed -i s/\$releasever/6/g /etc/yum.repos.d/epel-apache-maven.repo
sudo yum install -y apache-maven
mvn --version
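
If you want a quick sanity check beyond mvn --version, you can generate and build a throwaway project from the quickstart archetype (the groupId and artifactId below are just placeholders):
mvn archetype:generate -DgroupId=com.example.demo -DartifactId=demo -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false
cd demo
mvn package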

Wednesday, August 5, 2015

Merge Spark Output to a Single File

If your results fit into a single node's memory, you can use coalesce(number_of_partitions) with 1 as the number of partitions to bring all the data onto one node; when you then write it out, the output will be one single file.

Look at this example. Here I want to parse a pipe-delimited text file, filter out a few lines based on a criterion, take the distinct values of the first field (the ids), and at the end store the list of ids into one single output:
# 'a' is an RDD of pipe-delimited lines (for example, loaded with sc.textFile)
a.filter(lambda x: len(x.split('|')) > 3) \
 .filter(lambda x: x.split('|')[2] == 'SD') \
 .map(lambda x: x.split('|')[0]) \
 .distinct().coalesce(1) \
 .saveAsTextFile('/Downloads/spark-1.4.0-bin-hadoop2.6/ciks1')
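
Since coalesce(1) funnels everything through a single task, the output directory should contain just one part file. A quick check, assuming the output path from the example above:
ls /Downloads/spark-1.4.0-bin-hadoop2.6/ciks1
cat /Downloads/spark-1.4.0-bin-hadoop2.6/ciks1/part-00000
You should see a single part-00000 file (plus a _SUCCESS marker) holding all the ids.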

Saturday, July 25, 2015

Simple Example of Using Accumulators in Python Spark

Here is a simple example of an accumulator variable that is updated by multiple workers and returns the accumulated value at the end.


>>> acc = sc.accumulator(5)
>>> def f(x):
...     global acc
...     acc += x
>>> rdd = sc.parallelize([1,2,4,1])
>>> rdd.foreach(f)
>>> acc.value
13


A couple of notes:

  • Tasks running on the workers cannot read the accumulator's value; only the driver can
  • Tasks see the accumulator as write-only
  • Accumulators are mostly used for counting and debugging purposes

Friday, July 24, 2015

Create an EMR (Amazon Elastic MapReduce) Cluster Using the AWS CLI and Run a Python Spark Job on It

I spent a few hours today getting a Spark program, which I knew ran fine on my local machine, up and running on an EMR cluster. Since Amazon only announced official Spark support a few days ago, the documentation is not great yet, so it was a bit painful to find the right commands. My goal was to use the CLI so that I can automate the process later on. This is the CLI version; I will have a look at boto (the AWS Python SDK) in the future to fully utilize the API, but for now the following recipe should let you create a cluster and submit your Spark job.
So here is the recipe ;)

Before you start, you need to install the AWS CLI and configure it. Use the following commands:
sudo pip install awscli
aws configure 
After configuring AWS, start the process:

1) Create a cluster using the AWS CLI:
First create the default roles:
aws emr create-default-roles 
Then create a cluster with the CLI.
A couple of notes:

  • the current AMI version is 3.8
  • the application name is Spark
  • you can write logs to S3 using the --log-uri option 
  • the supported instance types start from m3.xlarge

aws emr create-cluster --name "Spark cluster" --ami-version 3.8 --applications Name=Spark --ec2-attributes KeyName=ir --log-uri s3://Path/logs --instance-type m3.xlarge  --instance-count 1 --use-default-roles 

2) After creating your cluster, you can check its status using: 
aws emr list-clusters  
You get the list of clusters: 

"Clusters": [
        {
            "Status": {
                "Timeline": {
                    "ReadyDateTime": 1437738625.815,
                    "CreationDateTime": 1437738374.25
                },
                "State": "RUNNING",
                "StateChangeReason": {
                    "Message": "Running step"
                }
            },
            "NormalizedInstanceHours": 8,
            "Id": "j-3I1E1Q5RZKPLL",
            "Name": "Spark cluster"
        },
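
If you want the full details of a single cluster rather than the summary list, describe-cluster should give you that (using the cluster id from the output above):
aws emr describe-cluster --cluster-id j-3I1E1Q5RZKPLL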

3) You can SSH to the master node of the cluster and test the job using:
aws emr ssh --cluster-id j-3I1E1Q5RZKPLL --key-pair-file ~/mykeypath.pem  

4) You can also copy your Python application to the machine and run it. First copy the file:
aws emr put --cluster-id j-1A9EIDW2XFMNS --key-pair-file ~/Documents/AWS/ir.pem --src myscript.py --dest /home/hadoop/spark/

5) Now you are one step away from running your job. The cluster is waiting for jobs to be submitted; you simply need to add a step, which defines a job, and after that your job will start.
A couple of notes:

  • give the path to your script in the Args list
  • you can define what the cluster should do if the step fails; ActionOnFailure can be one of "TERMINATE_CLUSTER", "CANCEL_AND_WAIT", or "CONTINUE"

aws emr add-steps --cluster-id j-3I1E1Q5RZKPLL --steps Name=Spark,Jar=s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/spark/bin/spark-submit,--deploy-mode,client,/home/hadoop/spark/myscript.py],ActionOnFailure=CONTINUE
After running that you will receive a step id in the terminal, with which you can see the details of the step plus the status of your job:
aws emr list-steps --cluster-id j-3I1E1Q5RZKPLL --step-ids s-xxxxxx  
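
One last note: the cluster keeps running (and costing you money) after the step finishes unless you shut it down, so when you are done you can terminate it from the CLI as well:
aws emr terminate-clusters --cluster-ids j-3I1E1Q5RZKPLL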

Done! Liked it? Leave a comment.
Cheers,



Tuesday, July 21, 2015

Merge Multiple Files in Unix/Linux

I found the following examples to be the quickest and cleanest ways to combine and aggregate a few files into one separate file.

sed -n wfile.merge file1 file2
Or:
awk '{print > "file.merge"}' file1 file2
Or:
sh -c 'cat file1 file2 > file.merge'

Note: If you want to run the above on multiple files, you can select all of them in one go, for example all the txt files in the current directory:


sed -n wfile.merge *.txt
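
A quick way to sanity-check the merge (using the file names from the first example) is to compare line counts; the merged file should contain the sum of the inputs:
wc -l file1 file2 file.merge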