Brief look at Cloud Architectures and GrepTheWeb

In this post I take a quick look at the GrepTheWeb application, which is part of the Alexa Web Search [1] suite of Web Services that allows developers to build their own customised search engines. A more detailed developer article on GrepTheWeb can be found at [2]. Finally, I consider some questions about cloud architectures in general.

GrepTheWeb allows users to search, or rather “grep”, the web. As UNIX users will know, the key difference between a “grep” and a “search” is that a grep can use regular expressions. As you might imagine, the amount of processing power required to grep the largest data source possible, i.e. the web, is huge.
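To make the distinction concrete, here is a minimal Python sketch contrasting a literal search with a regular-expression match. The sample documents, the function names and the pattern are all invented for illustration; none of them come from GrepTheWeb itself.

```python
import re

# Hypothetical sample "documents", invented for illustration only.
documents = {
    "doc1": "Contact us at sales@example.com for pricing.",
    "doc2": "Our sales team is available on weekdays.",
    "doc3": "Email support@example.org with any questions.",
}


def plain_search(docs, term):
    """Return the ids of documents containing the literal term."""
    return sorted(doc_id for doc_id, text in docs.items() if term in text)


def grep(docs, pattern):
    """Return the ids of documents with at least one match for the regex."""
    compiled = re.compile(pattern)
    return sorted(doc_id for doc_id, text in docs.items() if compiled.search(text))


# A literal search can only find one fixed string...
literal_hits = plain_search(documents, "sales")
# ...whereas a regular expression describes a whole family of strings,
# here anything shaped roughly like an email address.
regex_hits = grep(documents, r"[\w.]+@[\w.]+\.\w+")
```

The literal search finds the two documents that mention "sales", while the pattern finds the two documents that contain something that looks like an email address, whatever the address happens to be.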

Due to global time differences, web applications rarely use the same capacity consistently throughout the day: when large populations such as the USA are offline, the same level of capacity is no longer required. The developers of GrepTheWeb therefore decided to use cloud computing, in what is described as a “Cloud Architecture”.

In a Cloud Architecture the processing and physical storage are shared between multiple systems. As processing activity increases, a system acquires more resources to complete its work. Similarly, when resources are no longer required, they are released back to the “cloud” for other systems to use.
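The acquire-and-release idea can be sketched as a toy shared pool in Python. The class `CloudPool`, its methods and the system names are hypothetical and exist only to illustrate the principle; a real provider exposes this through its own services.

```python
class CloudPool:
    """A toy shared pool of identical compute units (a stand-in for the cloud)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.allocated = {}  # system name -> units currently held

    def free_units(self):
        """Units not currently held by any system."""
        return self.capacity - sum(self.allocated.values())

    def scale_to(self, system, units_needed):
        """Grow or shrink a system's allocation to match its current demand."""
        currently_held = self.allocated.get(system, 0)
        # A system may reuse what it already holds plus whatever is free.
        available = self.free_units() + currently_held
        granted = min(units_needed, available)
        if granted:
            self.allocated[system] = granted
        else:
            self.allocated.pop(system, None)
        return granted


pool = CloudPool(capacity=10)
pool.scale_to("grep-job", 8)      # busy period: the job takes 8 units
pool.scale_to("other-system", 4)  # only 2 units are free, so 2 are granted
pool.scale_to("grep-job", 1)      # the job winds down and releases 7 units
pool.scale_to("other-system", 4)  # the other system can now grow to 4
```

The point of the sketch is the last two calls: capacity that one system gives back becomes immediately available to another.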

Providers of cloud computing hardware then bill their customers based on usage rather than a fixed sum; this is similar to mobile phone Pay-As-You-Go billing as opposed to the Pay-Monthly (contract) alternative.
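As a rough illustration of the two billing models, the Python sketch below compares a usage-based bill with a flat fee. The prices and usage figures are made up to keep the arithmetic simple and are not any provider's actual rates.

```python
# Hypothetical prices, chosen only to make the arithmetic easy to follow.
PRICE_PER_INSTANCE_HOUR = 0.10   # assumed pay-as-you-go rate
FIXED_MONTHLY_FEE = 500.00       # assumed flat "contract" fee


def usage_based_bill(instance_hours):
    """Bill proportional to the instance-hours actually consumed."""
    return round(instance_hours * PRICE_PER_INSTANCE_HOUR, 2)


def cheaper_option(instance_hours):
    """Return which billing model costs less for a given month's usage."""
    usage_cost = usage_based_bill(instance_hours)
    return "usage-based" if usage_cost < FIXED_MONTHLY_FEE else "fixed"


quiet_month = usage_based_bill(1_200)   # 1,200 instance-hours used
busy_month = usage_based_bill(9_000)    # 9,000 instance-hours used
```

With these assumed numbers the quiet month is far cheaper under usage-based billing, while a consistently busy system may be better off with a fixed sum, just as a heavy phone user may prefer a contract.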

As we can see from Fig[1] below, GrepTheWeb is composed of a number of Amazon Web Services products, which have been arranged in a Cloud Architecture:

System Overview

Amazon SQS – The Amazon Simple Queue Service orchestrates calls between controllers and queues requests when the system is under heavy load. You can consider this, as indeed the developer guide describes it, as the “glue” between controllers; however, I have never been a fan of this analogy, as it implies something permanently fixed that is not easy to change.

Controller – The diagram above is somewhat confusing with respect to the controller. Firstly, the controller code is actually part of the EC2 cluster. Secondly, the system does not have a single controller but rather multiple controllers, each relating to a phase of processing in the system:

The Launch Controller starts the process off in the EC2 cluster and creates a processing record in SimpleDB.

The Monitor Controller checks whether a process is complete; when it is, it notifies both the Shutdown and Billing Controllers by way of messages placed in their respective SQS queues.

The Shutdown Controller will relinquish the now unused resources in the EC2 cluster.

The Billing Controller calculates how much usage the process consumed and sends the information to the billing service.

Amazon EC2 Cluster – The Elastic Compute Cloud is the essence of the cloud architecture; it offers processing in the form of a Web Service that can be scaled up or down as required. It is also where the application code is deployed and executed. When required, multiple processes are run in parallel and their results are aggregated into a single output.

Amazon SimpleDB – As the architecture relies on asynchronous calls, and as Web Services are inherently stateless, it must be possible to determine the state of a given component. Controllers therefore use SimpleDB to query and update a given process's status.

Amazon S3 – The Simple Storage Service is where the content from the web crawlers is stored: essentially an ever-changing and expanding dataset from the internet. As with the processing in the EC2 cluster, storage is scalable on demand.

Inputs – There are two inputs to the system. The first is the content of the internet itself, which is provided to the S3 data store by means of web crawlers. The second is the regular expression that a given user wishes to run against that dataset.

Outputs – As the output from a given grep can be particularly large, the content is written to the data store rather than being returned as a typical Web Service response.
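The flow through the components above can be sketched in a few lines of Python. This is only a toy model under loud assumptions: an in-memory `Queue` stands in for each SQS queue, plain dictionaries stand in for S3 and SimpleDB, the grep runs synchronously inside the launch step rather than on a cluster, and `pages_scanned` is a made-up usage measure. None of the names below are Amazon APIs.

```python
import re
from queue import Queue

# In-memory stand-ins (assumptions): dicts for S3 and SimpleDB, Queue for SQS.
s3_store = {
    "crawl/page1": "cloud computing is elastic",
    "crawl/page2": "grep uses regular expressions",
    "crawl/page3": "elastic clusters scale on demand",
}
status_db = {}
queues = {name: Queue() for name in ("launch", "monitor", "shutdown", "billing")}
bills = []


def launch_controller():
    """Record the job's status, run the grep and hand over to monitoring."""
    job = queues["launch"].get()
    status_db[job["id"]] = "running"
    pattern = re.compile(job["regex"])
    pages = {k: v for k, v in s3_store.items() if k.startswith("crawl/")}
    # Each page could be matched on a separate node; here it is a plain loop.
    matches = sorted(k for k, text in pages.items() if pattern.search(text))
    # Large results are written to the store rather than returned directly.
    s3_store["results/" + job["id"]] = "\n".join(matches)
    status_db[job["id"]] = "complete"
    queues["monitor"].put({"id": job["id"], "pages_scanned": len(pages)})


def monitor_controller():
    """When the job is complete, notify the shutdown and billing queues."""
    job = queues["monitor"].get()
    if status_db[job["id"]] == "complete":
        queues["shutdown"].put(job)
        queues["billing"].put(job)


def shutdown_controller():
    """Release the (imaginary) cluster resources used by the job."""
    job = queues["shutdown"].get()
    status_db[job["id"]] = "resources released"


def billing_controller():
    """Record the usage for the job using the hypothetical usage measure."""
    job = queues["billing"].get()
    bills.append((job["id"], job["pages_scanned"]))


# The user's input is a regular expression; each controller then runs in turn.
queues["launch"].put({"id": "job-1", "regex": r"elastic\b"})
for controller in (launch_controller, monitor_controller,
                   shutdown_controller, billing_controller):
    controller()
```

Note that no controller calls another directly: each one only reads from its own queue and writes to the next, which is what allows the phases to be scaled or replaced independently.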

What purpose does this architecture serve?

Cloud architectures, and the way they are offered as services by vendors like Amazon, are unusual in that they do not aim to address a traditional design principle such as separation of concerns (as in Object Orientated Architecture) or event response (as in Event Driven Architecture). Rather, the purpose is directly related to reducing the cost of running the system.

A more purist view is that the main purpose is to increase the amount of parallel processing capacity beyond what a single system could achieve on its own; but ultimately this is directly related to not having to pay for the hardware yourself.

So how are the costs reduced? As alluded to previously, when the system is not utilising resources they are released back to the cloud to be used by other systems. This allows greater hardware utilisation and therefore greater efficiency. The other major saving is that an organisation no longer faces a massive initial hardware cost.

Who do you think will use it and for what purpose?

The term “cloud computing” is gaining momentum and has received a lot of coverage recently; however, some organisations, particularly governments, will always have reservations about their information existing in a shared space. Organisations that handle customers' personal information may also face legal concerns, particularly with data moving across country boundaries.

For those ready to embrace the cloud, there are a number of facets that will appeal to a diverse mix of organisations:

Perhaps the demographic most attracted to the cloud is new start-ups, mainly because they avoid the upfront hardware outlay. As an added bonus, these start-ups no longer have to estimate the capacity they will require, which can be a costly and complex process in itself. Conversely, organisations that have already paid for existing systems have less to gain.

Another major beneficiary of this architecture will be web companies that get bursts of traffic. Organisations with uneven server load are ultimately best suited to this form of “Pay-As-You-Go” computing, e.g. a sports website whose traffic is typically concentrated at weekends when matches are played.

Quite different still would be a company that wants to run processor-intensive tasks such as data conversion or data mining. Most companies run some form of batch processing overnight, which often results in complicated scheduling in order to get tasks completed before the start of the working day. With cloud computing you can scale up the processing to whatever degree is needed, have it completed in a suitable time frame, and then release the system resources for the rest of the day.
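A back-of-the-envelope Python sketch shows why scaling out a batch job can be attractive under usage-based billing. The workload size and hourly price are hypothetical, and the model assumes the work splits perfectly evenly across machines and that partial hours are billed as whole hours, which real workloads and price lists will not match exactly.

```python
import math

# Hypothetical workload and price, for illustration only.
TOTAL_CPU_HOURS = 120            # size of the batch if run on one machine
PRICE_PER_INSTANCE_HOUR = 0.10   # assumed pay-as-you-go rate


def batch_plan(instances):
    """Return (elapsed hours, cost) for a perfectly parallel batch.

    Assumes the work divides evenly between instances and that every
    started hour on every instance is billed in full.
    """
    elapsed_hours = math.ceil(TOTAL_CPU_HOURS / instances)
    cost = round(elapsed_hours * instances * PRICE_PER_INSTANCE_HOUR, 2)
    return elapsed_hours, cost


single = batch_plan(1)    # one machine running for five days
scaled = batch_plan(40)   # forty machines running for three hours
```

Under these assumptions both plans cost the same, but the scaled-out plan finishes within a single night, after which the forty machines are handed back.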

The power of the cloud...
