Thursday, July 16, 2015

Read/Write Output Using Local File System and Amazon S3 in Spark

The first step in processing any data with Spark is being able to read it and write it back out.
The code below shows how to read data from the local file system or Amazon S3, process it, and write the output back to the file system or to S3.


  • Here is an example of reading and writing data from/to the local file system; a note on the shape of the output follows the code.


 from pyspark import SparkContext

 logFile = "README.md"  # Should be some file on your system
 sc = SparkContext("local", "Simple App")
 logData = sc.textFile(logFile).cache()

 ### Calculate the word counts
 wordCounts = logData.flatMap(lambda line: line.split()) \
                     .map(lambda word: (word, 1)) \
                     .reduceByKey(lambda a, b: a + b)

 ### Save output locally the plain Python way (collect() brings the results to the driver)
 with open('test.txt', 'w') as f:
     f.write('wordcount' + str(wordCounts.collect()))

 ### Save output locally using the Spark API (writes a directory of part files)
 wordCounts.saveAsTextFile('test1.txt')
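
Note that saveAsTextFile('test1.txt') does not create a single text file: it creates a directory named test1.txt containing one part-00000-style file per partition. The sketch below shows how you might read that output back and how to force a single part file; 'test2.txt' is just an illustrative extra output path, and coalesce(1) is only sensible for small results.

 # Read the saved word counts back in; each line is the string form of a (word, count) tuple
 savedCounts = sc.textFile('test1.txt')
 print(savedCounts.take(5))

 # Collapse to a single partition before saving so only one part file is written
 # ('test2.txt' is an illustrative path; the target directory must not already exist)
 wordCounts.coalesce(1).saveAsTextFile('test2.txt')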
 


  • The following shows how to use S3 both as the source of the data and as the destination for the result. To be able to use it, export AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as environment variables on your machine and run the code below. An alternative way of passing the credentials is sketched after the code.


 from pyspark import SparkContext

 sc = SparkContext("local", "Simple App")
 # Read the input directly from S3 (replace with your own bucket/path)
 logData = sc.textFile('s3n://your s3 path/README.txt')
 ### Calculate the word counts
 wordCounts = logData.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
 # Write the result back to S3 (replace with your own output folder)
 wordCounts.saveAsTextFile("s3n://your s3 output folder path/output")


