Below are scripts for writing large datasets to CSV files and reading them back.
Writing a dataset to a CSV file.
import csv

# Python 2: open the file in binary mode for the csv module
writer = csv.writer(open('dataset.csv', 'wb', buffering=0))
writer.writerows([
    ('GOOG', 'Google Inc.', 123.44, 0.32, 0.09),
    ('YHOO', 'Yahoo! Inc.', 2.33, 99.23, 0.123),
    ('IBM', 'IBM Inc.', 223.44, 212.32, 6.42),
])
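The snippet above is Python 2 only (binary mode plus the unbuffered open). On Python 3 the csv module expects a text-mode file opened with newline='', and for genuinely large datasets it is usually better to stream rows one at a time rather than build the whole list in memory first. A minimal sketch, assuming Python 3 and a hypothetical generate_rows() iterator standing in for whatever produces your records:

import csv

# generate_rows() is a hypothetical placeholder for the source of your records;
# yielding rows keeps a large dataset from being held in memory all at once.
def generate_rows():
    yield ('GOOG', 'Google Inc.', 123.44, 0.32, 0.09)
    yield ('YHOO', 'Yahoo! Inc.', 2.33, 99.23, 0.123)
    yield ('IBM', 'IBM Inc.', 223.44, 212.32, 6.42)

# Python 3: text mode with newline='' so the csv module controls line endings
with open('dataset.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for row in generate_rows():
        writer.writerow(row)  # one row at a time; no big list is built up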
Reading a large dataset from a CSV file.
import csv

# Python 2: read the csv file in binary mode; csv.reader iterates record by record
dataset = csv.reader(open('dataset.csv', 'rb'))
status_labels = {-1: 'down', 0: 'unchanged', 1: 'up'}

for ticker, name, price, change, pct in dataset:
    status = status_labels[cmp(float(change), 0.0)]  # cmp() returns -1, 0, or 1
    print '%s is %s (%s%%)' % (name, status, pct)
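The reader above is also Python 2 only: the print statement and the built-in cmp() were removed in Python 3. A minimal sketch of the equivalent on Python 3, spelling out the -1/0/1 comparison that cmp() used to provide:

import csv

status_labels = {-1: 'down', 0: 'unchanged', 1: 'up'}

# Python 3: text mode with newline=''; the reader still streams one record at a time
with open('dataset.csv', newline='') as f:
    for ticker, name, price, change, pct in csv.reader(f):
        change = float(change)
        direction = (change > 0) - (change < 0)  # reproduces cmp(change, 0.0)
        print('%s is %s (%s%%)' % (name, status_labels[direction], pct))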
These scripts are useful for preparing large datasets for your Hadoop jobs.
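For example, with Hadoop Streaming the same reading pattern can run as a mapper, consuming CSV records from stdin and emitting tab-separated key/value pairs to stdout. The file name csv_mapper.py and the choice of output fields are illustrative assumptions, not part of the original script; the mapper would be handed to the streaming jar via its -mapper option.

#!/usr/bin/env python
# csv_mapper.py -- hypothetical Hadoop Streaming mapper (the name and the
# emitted fields are assumptions). Reads CSV records from stdin and writes
# "ticker<TAB>change" lines for downstream reducers.
import csv
import sys

for ticker, name, price, change, pct in csv.reader(sys.stdin):
    print('%s\t%s' % (ticker, change))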