Ever wonder how http://www.trendingtopics.org/ collects and processes visitor information from Wikipedia? This Cloudera post walks you through how to leverage various cloud tools to power a process-intensive web application. Overall, the steps look something like this:
- provision a Hadoop cluster on EC2 for compute capabilities
- load the logs into Hadoop
- process the log data, clean it up, and apply trending algorithms to organize it
- export the processed data into MySQL for the web application to use
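To make the "process, clean up, apply trending algorithms" step a bit more concrete, here is a minimal single-machine sketch in plain Python. It is not the code from the Cloudera post: the log line layout (`project page_title view_count bytes`), the restriction to the `en` project, and the "recent half minus earlier half" score are all assumptions made for illustration, and every function name is hypothetical. On a real cluster the aggregation would run as a Hadoop job rather than an in-memory loop.

```python
from collections import defaultdict


def parse_line(line):
    """Parse one log line, assumed to look like 'project page_title view_count bytes'.

    Returns (project, title, views), or None for lines that do not fit (the cleanup step).
    """
    parts = line.split()
    if len(parts) != 4:
        return None
    project, title, count, _size = parts
    try:
        return project, title, int(count)
    except ValueError:
        return None


def daily_totals(lines_by_day, project="en"):
    """Aggregate views per page per day: {day: [lines]} -> {title: {day: views}}."""
    totals = defaultdict(lambda: defaultdict(int))
    for day, lines in lines_by_day.items():
        for line in lines:
            record = parse_line(line)
            if record is None or record[0] != project:
                continue  # drop malformed lines and other projects
            _, title, views = record
            totals[title][day] += views
    return totals


def trend_score(views_by_day, days):
    """Hypothetical trend score: views in the later half of the window minus the earlier half."""
    half = len(days) // 2
    earlier = sum(views_by_day.get(d, 0) for d in days[:half])
    recent = sum(views_by_day.get(d, 0) for d in days[half:])
    return recent - earlier


def top_trending(lines_by_day, limit=10):
    """Return up to `limit` (title, score) pairs, highest score first."""
    days = sorted(lines_by_day)
    totals = daily_totals(lines_by_day)
    scored = [(trend_score(views, days), title) for title, views in totals.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(title, score) for score, title in scored[:limit]]
```

The list returned by `top_trending` is the kind of small, already-ranked result that the last step would export into MySQL, so the web application only has to read rows instead of crunching logs.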
This is really cool stuff… at least for me ;-). Now maybe I can leverage a Hadoop cluster to take all of the PowerPoint slides and process and organize them for me into a consumable form… hmmmmm.
Read more on the Cloudera blog.