I attended a webinar hosted by Deepak Singh from the Amazon Web Services group on analytics in the cloud. He made a very compelling case for using the cloud to build out your analytics infrastructure. Especially with the growing data sizes that we deal with now, I think it makes absolute sense. You can use different software stacks and grow (and shrink) your hardware stack as required. Great stuff.
But there is a catch. Most of the data generated by organizations today is “inside” their perimeters. Whether it is the OLAP database collecting all your data or that application that spews gigabytes of logs, most of the data is housed in your own infrastructure. So if you want to use the cloud to perform analytics on this data, you first have to transfer it to the cloud. And therein lies the problem. As Deepak mentioned in the webinar, human beings have yet to conquer the limitations of physics :). You need a pretty big pipe to the Internet just to transfer this data.
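To put a rough number on that, here is a back-of-the-envelope sketch in Python. Nothing in it comes from the webinar: the function name `transfer_time_hours`, the 1 TB data size, the link speeds and the 80% efficiency factor are all assumptions I picked for illustration, and real-world throughput will vary.

```python
def transfer_time_hours(data_gb: float, link_mbps: float, efficiency: float = 0.8) -> float:
    """Estimate the hours needed to push data_gb gigabytes over a link_mbps link.

    efficiency is an assumed fraction of the nominal bandwidth that is actually
    usable (protocol overhead, other traffic sharing the pipe, and so on).
    Decimal units are used throughout: 1 GB = 10^9 bytes, 1 Mbps = 10^6 bits/s.
    """
    if data_gb < 0 or link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("need data_gb >= 0, link_mbps > 0 and 0 < efficiency <= 1")
    total_bits = data_gb * 8 * 1000**3
    usable_bits_per_second = link_mbps * 1_000_000 * efficiency
    return total_bits / usable_bits_per_second / 3600


if __name__ == "__main__":
    # Illustrative only: 1 TB (1000 GB) over a few hypothetical uplink speeds.
    for link in (10, 100, 1000):
        hours = transfer_time_hours(1000, link)
        print(f"1 TB over {link} Mbps: about {hours:.1f} hours")
```

Under these assumptions, 1 TB over a 100 Mbps uplink works out to a bit more than a day (about 28 hours), and that is for a single terabyte.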
Amazon has come up with various means to help with this issue. They are creating copies of publicly available data sets within their cloud so that customers don’t have to transfer them. They are also working with companies to keep private data sets in the cloud for other customers to use. So, similar to how you can spin up a Red Hat AMI by paying a license fee to Red Hat, I believe they are looking at giving customers access to these private data sets for a fee paid to the company providing the data set. It is a win-win-win situation :) for Amazon, the company providing the private data set and Amazon’s web services customers. They also support a one-time import of data from physical disk or tape.
Coming back to the title of this post :). I think this field is still in its infancy. Once companies start migrating their infrastructure to the cloud (and yes, it will happen; it is only a matter of time :)), it will be a lot easier to leverage the cloud to perform your analytics. All your data will already be in the cloud, and you can start leveraging the hardware and software stacks that are there too.