The first step in processing any data with Spark is being able to read it and write it back out.
The following code shows how to read data from the local file system or Amazon S3, process it, and write the results back to the file system or S3.
- Here is an example of reading data from, and writing data to, the local file system.
from pyspark import SparkContext
logFile = "README.md" # Should be some file on your system
sc = SparkContext("local", "Simple App")
logData = sc.textFile(logFile).cache()
# calculate the word counts
wordCounts = logData.flatMap(lambda line: line.split()) \
                    .map(lambda word: (word, 1)) \
                    .reduceByKey(lambda a, b: a + b)

# save the output locally the plain-Python way
# (collect() pulls all results back to the driver, so only do this for small outputs)
with open('test.txt', 'w') as f:
    f.write('wordcount' + str(wordCounts.collect()))

# save the output locally using the Spark API
# (this creates a directory named test1.txt containing part files, not a single file)
wordCounts.saveAsTextFile('test1.txt')
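Because saveAsTextFile writes a directory of part files, you can read the result back in later simply by pointing textFile at that directory. A minimal sketch, reusing the test1.txt path from the example above:

# read the saved word counts back in; Spark reads every part file in the directory
saved = sc.textFile('test1.txt')
print(saved.take(5))  # inspect the first few lines of the saved output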
- The following shows how to use S3 as the source of the data and write the result back into it. To do this, export AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as environment variables on your machine, then use the following code.
from pyspark import SparkContext

sc = SparkContext("local", "Simple App")

# read the input directly from S3 (replace with your own S3 path)
logData = sc.textFile('s3n://your s3 path/README.txt')

wordCounts = logData.flatMap(lambda line: line.split()) \
                    .map(lambda word: (word, 1)) \
                    .reduceByKey(lambda a, b: a + b)

# write the result back to S3 (replace with your own S3 output path)
wordCounts.saveAsTextFile("s3n://your s3 output folder path/output")
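If you prefer not to rely on environment variables, the credentials can also be set on the Hadoop configuration that Spark uses. A minimal sketch for the s3n scheme shown above (note that sc._jsc is an internal handle to the underlying JavaSparkContext, and the property names differ for the newer s3a scheme):

# set the s3n credentials programmatically instead of via environment variables
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")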