May 18, 2009

Data-Aware Routing on a Cloud, featuring Sun Grid Engine, GemFire and EC2

We are excited to announce that we have taken our Convergence project to the next step in the last few weeks. Last time we demonstrated how data aware routing can speed up the combination of compute grids and data grids. Since then, we have developed new grid adapters for our Convergence project and moved to the cloud.

New adapters: Sun Grid Engine, GemFire

Sun Grid Engine (SGE) is an open source batch-queuing system, developed and supported by Sun Microsystems. SGE is typically used in a server farm or high-performance computing (HPC) cluster and is responsible for accepting, scheduling, dispatching, and managing the remote and distributed execution of large numbers of standalone, parallel or interactive user jobs. It also manages and schedules the allocation of distributed resources such as processors, memory, disk space, and software licenses. We have developed an adapter that wraps SGE's Java API and enables data aware routing of tasks.

GemFire is an Enterprise Data Fabric (EDF) solution from GemStone Systems, Inc. It is a high performance, distributed in-memory data grid (IMDG) that offers very low latency, high resiliency, scalability and high throughput data sharing and event distribution features for high performance computing applications that need access to real-time data. We have developed a monitor component that can query GemFire for the location of data in partitioned regions.

Running on the cloud

Along with integrating new grids, we changed our demo application to use a cloud infrastructure. It is now deployed on Amazon EC2. Our demo now allocates servers, completes the setup of the cluster software (SGE and GemFire in this case) and starts the demo application server on the fly. Just a single click, and a few minutes later you have a new cluster up and running.

The deployment process is straightforward and requires just a couple of scripts. First, we allocate one EC2 server and start the master node. When the master is started, it starts allocating worker nodes and sets up the grid software. When the cluster setup is complete, we start the demo control UI on the master node, and from that moment our interactive data aware routing demo is available. Each cluster has a time to live, and when its lifespan expires, all servers are returned to EC2. We are using the OpenSolaris AMI, provided by Sun Microsystems, for all our servers. Deployment of GemFire is trivial: we just need to copy a few jars. Installing SGE has its quirks, but once we figured out how to do it correctly, it just works.

From a usability point of view, cloud hosting has huge benefits. Each developer can work with his own cluster, or even several clusters (e.g., comparing the effects of data aware routing between clusters of different size).

Data-aware routing demo

We have adapted our existing demo application to work with the new grids and to scale on a cloud. The demo simulates a financial application. We have a set of trade objects, which are loaded into a partitioned GemFire region. 50K objects are placed on each server (each object is about 2 KB in size). The trades belong to equal-sized books (5K trades in each book). The job is to evaluate all books, using Sun Grid Engine to distribute the work across cluster nodes. This means that a job consists of 10*(number of servers) tasks, and each task fetches 5K trades (about 10 MB of data) from the data grid and performs some calculation over them. For simplicity, we just sum the IDs of the trades and return the result.
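The sizing above can be checked with a few lines of arithmetic. This is only a sketch built from the figures quoted in the text; the function and constant names are made up for illustration and are not part of the demo code.

```python
# Workload sizing for the demo, using the figures quoted in the text.
# All names here are illustrative, not taken from the Convergence code base.
TRADES_PER_SERVER = 50_000
TRADES_PER_BOOK = 5_000
TRADE_SIZE_KB = 2  # approximate size of one trade object


def tasks_per_job(servers: int) -> int:
    """One task per book; each server holds 50K / 5K = 10 books."""
    return servers * TRADES_PER_SERVER // TRADES_PER_BOOK


def data_per_task_mb() -> float:
    """Data one task fetches from the data grid, in MB (1 MB = 1000 KB here)."""
    return TRADES_PER_BOOK * TRADE_SIZE_KB / 1000


def evaluate_book(trade_ids):
    """The demo's stand-in calculation: sum the IDs of the trades in a book."""
    return sum(trade_ids)
```

For example, `tasks_per_job(4)` gives 40 tasks for a four-server cluster, and `data_per_task_mb()` gives the roughly 10 MB per task mentioned above.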

Unlike DataSynapse GridServer, SGE is a process-oriented computational grid. Each task in SGE is an operating system process, and execution time includes all JVM startup and class loading overheads. Usually tasks are long enough, and process start up time does not make much difference, but in the case of an interactive demo we should find a balance. We cannot make the user wait half an hour to see the first result, but if we make tasks too short, the effect of data aware routing will not be visible due to JVM startup overheads.

The demo runs a constant flow of jobs and continuously measures job completion latency and task completion latency. To illustrate the advantages of data aware routing, we introduced 3 different scheduling modes. “Data aware” mode is where we perform data aware routing, ensuring that each task reads its data locally. “Neutral” mode is where we leave scheduling unguided and let SGE place tasks as it sees fit. “Anti data aware” mode is where we deliberately violate data awareness and force data access over the network. The user can change the task scheduling mode on the fly and see its impact on performance.
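The three modes can be sketched as a node-selection function. This is a simplified illustration, not the adapter's actual code: the mode names, the `pick_node` function and the random choice standing in for SGE's own placement are all assumptions.

```python
import random


def pick_node(mode, data_owner, nodes, rng=random):
    """Choose the node a task should run on (illustrative sketch only).

    mode       -- "data_aware", "neutral" or "anti_data_aware"
    data_owner -- node that holds the task's book in the data grid
    nodes      -- all worker nodes in the cluster
    """
    if mode == "data_aware":
        # Run the task where its data lives, so reads are local.
        return data_owner
    if mode == "anti_data_aware":
        # Deliberately avoid the owner, so every read crosses the network.
        others = [n for n in nodes if n != data_owner]
        return rng.choice(others) if others else data_owner
    if mode == "neutral":
        # Stand-in for the scheduler's own, data-unaware placement.
        return rng.choice(nodes)
    raise ValueError(f"unknown mode: {mode}")
```

On a cluster of N nodes, a data-unaware placement lands on the owning node only about 1/N of the time, which is why the gap between "neutral" and "data aware" should widen as the cluster grows.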

The diagram on the left shows the completion time of the job, and the small gray bars show the cumulative task execution time on each server in the cluster. The gray bars do not include JVM startup time, so the cumulative execution time on each server is considerably smaller than the job execution time.

The diagram on the right shows the average task completion latency (without JVM startup time), so you can see the effect of data aware routing at both the task and the job level.

Conclusion

We have demonstrated how the generic data-aware routing approach implemented in the Convergence project can be used with different grid products. We had a very good experience migrating our research/demo platform to the cloud. Using a cloud gave us a flexibility and comfort of development that are hard to achieve in a traditional resource-constrained environment.

Stay tuned for more advances in the Convergence project!


May 5, 2009

Waters Power 2009 Conference

Grid Dynamics was one of the sponsors of Incisive Media's Waters Power 2009 event, held in New York City. The main agenda of the conference was to showcase the most up-to-date developments in HPC, with in-depth analysis of cloud computing, virtualization and SOA solutions, bringing together the latest strategies, techniques and technologies that give optimum performance and maximum efficiency for any data center. Major HPC vendors and many Wall Street firms were represented at the conference.

The keynote speech was presented by Jeffrey Birnbaum from Merrill Lynch. It was a well-presented speech that highlighted the opportunities, challenges and approaches of cloud computing for HPC. Highlights from his speech:

  • The main attraction of cloud computing is driving the cost of computing down. Google and Amazon have done a good job of getting the cost pretty low. How can enterprises do the same, or come close?
  • Most cloud providers, like Amazon and Google, are GbE-based. Everyone is moving towards 10GbE, and this serves as a foundation for viable clouds.
  • Enterprise cloud infrastructure needs a global file system. All software is installed on that file system. This makes it simple for end users: to run any compute environment, just mount a file system.
  • To get global scalability, use multiple copies of the files.
  • Replicate the files in real time on any update. It might be better to wait on an update than to deal with eventual consistency.
  • For better performance, cache the files in regional locations.
  • Do not provide node-level redundancy, which increases the cost of hardware; buy commodity hardware and design for failure. Route the workload from a failed node to some available node (as Google and Amazon do).
  • Design data centers around PODs connected by layer 2, not layer 3.


Technologies that will change the world:

  • Multi-core/multi-socket commodity compute nodes (run more VMs, writing more parallel code)
  • 10GbE with iWARP or RDMA (lower latency to storage)
  • Flash-based storage at 200K IOPS (totally changes how you think about the problems)


The implications of the above changes are dramatic and will affect the way we design data centers and applications.

There was a panel discussion titled "A silver lining for cloud computing". The panel included Victoria Livschitz, Founder and CEO of Grid Dynamics. The questions asked by the moderator triggered insightful discussions and sometimes contradictory answers:

  • Define cloud computing? This is probably the most frequently asked question in cloud computing forums, and we were glad to hear that we are closing in on a definition. This raised another question: what stage of technology maturity is cloud computing in? Victoria's response was that it is in the really early stages and will take about 5 years to become mature enough for enterprise adoption.
  • What applications are not suitable for cloud computing? A few panel members acknowledged that low latency applications may be bad candidates until cloud infrastructure matures, but one speaker thought that this depends on the design of the cloud and that it is possible to host low latency applications if the cloud is architected well.
  • An interesting conversation came up about when cloud computing will see wider adoption. This is the same question that comes up whenever a new technology comes to light. Old companies with huge investments in existing infrastructure and approaches will need more time to adopt it, while newer players will benefit from it much sooner.
  • Victoria made an insightful argument about the way applications are designed and implemented, centered on data and its ownership. Most applications today are designed with data ownership as the premise, but consuming and providing data as a service opens up many opportunities. For example, cloud computing is turning infrastructure, software and computation into services, and making data a consumable entity as well gives applications unbounded possibilities.

The thoughts about data ownership and the promise of consuming data as a service were echoed by a few other speakers in later discussions.

Ken Michellini's (from CitiHub) speech about "Keeping your feet on the ground: Unveiling the truth about cloud computing" was educational. It talked about the ROI calculations in a public cloud, 3rd party hosted and internal cloud scenarios and gave some ideas on how you can make these decisions when building your next big application. He also talked about characteristics of a good cloud application and typical challenges in building cloud applications.

We attended a few more panel discussions, including "Building the 21st century data center", "Preparing for Next phase in grid computing: What it takes to build the perfect data grid" and "Virtual reality: Optimizing storage, application and network virtualization", as well as another presentation about real-life lessons learned while administering grids, "Nuts and Bolts: Practical issues in grid administration".

I would like to extend congratulations to Incisive Media for a well-run conference that let many financial industry experts brainstorm and discuss the opportunities of the cloud. Great job guys!


February 9, 2009

Full-text & faceted search over In-Memory Data Grids

Modern in-memory data grid (IMDG) solutions provide facilities of varying sophistication for executing queries over whole stored data sets. Oracle Coherence provides query facilities (one-time full scans and continuous querying with cost-based optimization). GigaSpaces has a JDBC query interface with the ability to use hash and B-Tree indexes. Those solutions work quite well for many problem areas. However, under heavy loads and complex multi-criteria queries those facilities can quickly become a bottleneck.

There is a class of workloads that puts a high load of complex queries on IMDGs. Retail companies that use IMDGs for their item catalogs are a good example. Those catalogs are hit by a diverse stream of multi-criteria queries. A typical query you may see there looks like:
give me cell phones with MP3 support, Java and in red color.

Fortunately, the Compass Framework allows you to process such queries effectively. You can build inverted indexes with Apache Lucene and store them on a grid. This capability is based on the very modular design of the Lucene framework: all index I/O operations are well hidden behind Lucene's Directory abstraction.
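To see why this kind of index suits multi-criteria queries, here is a toy sketch in Python. It is not the Compass or Lucene API; the catalog, field names and functions are invented for illustration. Each (field, value) pair maps to the set of matching item IDs, and a query is answered by intersecting those sets.

```python
from collections import defaultdict


def build_index(items):
    """Map each (field, value) pair to the set of item IDs that carry it."""
    index = defaultdict(set)
    for item_id, attrs in items.items():
        for field, value in attrs.items():
            # Multi-valued fields (lists) contribute one entry per value.
            values = value if isinstance(value, (list, set, tuple)) else [value]
            for v in values:
                index[(field, v)].add(item_id)
    return index


def search(index, criteria):
    """Return the IDs matching every (field, value) criterion."""
    postings = [index.get(c, set()) for c in criteria]
    return set.intersection(*postings) if postings else set()


# Hypothetical catalog entries, invented for this example.
catalog = {
    1: {"category": "cell phone", "features": ["mp3", "java"], "color": "red"},
    2: {"category": "cell phone", "features": ["mp3"], "color": "red"},
    3: {"category": "cell phone", "features": ["mp3", "java"], "color": "black"},
}
index = build_index(catalog)
red_java_mp3_phones = search(index, [
    ("category", "cell phone"),
    ("features", "mp3"),
    ("features", "java"),
    ("color", "red"),
])
```

Each criterion costs one dictionary lookup and one set intersection, so the query never scans the whole catalog. A real engine adds text analysis, scoring and compressed posting lists on top of this idea.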

For now Compass provides implementations for Coherence, GigaSpaces and Terracotta, introducing an unprecedented ability to build a vertical search solution on top of In-Memory Data Grids.


In addition, Compass has a sophisticated object-to-document mapping system that allows you to make stored objects searchable just by adding Java annotations or XML mapping files. Mappings can also be built at runtime.

However, despite its great codebase, Compass documentation is pretty sparse. It may take significant time to dive into the code and docs to get what you want. But the results will exceed all your expectations. Search engine performance on top of data grids easily outperforms any old-generation search technology.

Enjoy!


December 22, 2008

Speeding up data-intensive HPC applications with Velocity

Massively parallel, data intensive applications that run on compute grids require timely access to their data. When the application data is distributed among the grid nodes, data access is not a problem because it scales along with the number of computational tasks. However, when the application data is stored in a centralized database, access to data can quickly become a serious bottleneck.

We encountered such a problem in one of our recent projects, where a single database was used as a centralized application data repository for the entire compute grid. In this case, access to application data became bogged down enough to degrade overall system performance, despite the fact that the database was hosted on a sufficiently powerful server.

In this article we will describe different ways of reducing database load and how they affect the overall system performance. We will also explore the advantages and disadvantages of Velocity as one of the ways to reduce data access latency. More specifically, we will show that:
  • Database load can be reduced by introducing a Velocity distributed cache, or by simplifying SQL queries and moving all calculations from SQL into application logic.
  • Velocity CTP1 distributed cache improves overall system performance dramatically. In our case, querying data with active Velocity cache was up to 31 times faster than without it!
  • In CTP1, we experienced some scalability issues in particular configurations, but we are confident these issues will be resolved in future releases.

Compute Environment Overview
HPC++ CompFin Labs is a compute cloud that was created to give university students the ability to run massive analytical financial computations.

When a computation needs to be performed, one has to create a custom computational model that defines how the calculation will be performed and how the resulting data will be handled. The computational model operates in accordance with the MapReduce paradigm and includes an Excel-based UI with a list of input parameters, and the .NET assembly with the logic for splitting the computation into tasks, running a single task (map task) and combining task results into computation results (reduce task).
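The three pieces of model logic (split, map task, reduce task) can be sketched as follows. The real models are .NET assemblies; this Python version only illustrates the shape of the contract, and the function names and the mean-value calculation are assumptions made for the example.

```python
def split(values, n_tasks):
    """Split the input into at most n_tasks contiguous chunks."""
    if not values:
        return []
    size = -(-len(values) // n_tasks)  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def map_task(chunk):
    """One map task: compute a partial (sum, count) for its chunk."""
    return (sum(chunk), len(chunk))


def reduce_task(partials):
    """The reduce task: combine partial results into the final mean."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


def run_model(values, n_tasks):
    """Run split, map and reduce in-process; a cluster would run the maps in parallel."""
    return reduce_task([map_task(chunk) for chunk in split(values, n_tasks)])
```

For example, `run_model([1, 2, 3, 4, 5, 6], 4)` splits the six values into three chunks of two and returns a mean of 3.5. The map tasks are independent of each other, which is what lets a cluster run them in parallel.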

For example, one may write a model that performs some statistical calculation for a given stock symbol over some period of time. Later, someone else can open the Excel UI, change the stock symbol and the time period, and re-run the calculation. The following diagram illustrates how the example model would work in a CompFin cloud:

Picture 1. Original CompFin

The system consists of the following components:
  • A Microsoft HPC Server 2008 cluster.
  • A Sharepoint Server, where the CompFin website and the computational model files are hosted.
  • A SQL Server, where computation results are stored.
  • A SQL Server Intermediate Storage Provider (ISP), where task results are stored for subsequent consumption in the “reduce” phase.
  • A centralized market data database, where all financial data is located.
The computation proceeds as follows:
  1. The user navigates to the CompFin website. On this website, the user chooses the model to run and opens the appropriate Excel file. Next, the user fills in the required computation parameters and, with CompFin Excel plug-in, submits the job.
  2. The job start request is handled by the CompFin web service, which typically runs on the Sharepoint server as well. This web service calls the model logic to split the computation into tasks.
  3. All computational tasks are submitted to the Microsoft HPC Cluster.
  4. The map tasks are started first. Each task retrieves the appropriate data from the central market database, performs calculations on this data, and submits intermediate results to the SQL Server ISP.
  5. When all map tasks are finished, the reduce task is launched. It retrieves intermediate data, combines it into final results data, and stores final results into the result storage.
  6. After the computation is finished, the user may request job results via the CompFin website.
  7. The computation results are retrieved from the result storage and returned to the user.
The above compute grid uses one SQL Server database for storage of stock prices, and one SQL Server database for storage of intermediate and final results. The combined capacity of the HPC compute cluster is 400GB of RAM and 200 (2.00GHz) Xeon cores: 50 machines, each with 4 (2.00GHz) Xeon cores and 8GB of RAM. The capacity of each SQL Server is 32GB of RAM and 4 (2.33GHz) Xeon cores. So, the compute cluster has 12.5 times more RAM and 50 times more CPU cores than each SQL Server. Based on these numbers, we anticipated two potential bottlenecks: the central market database and the ISP.
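The ratios above follow directly from the hardware figures. The snippet below just reproduces the arithmetic; it compares core counts only and ignores the difference in clock speed between the cluster nodes and the SQL Servers. The variable names are illustrative.

```python
# Hardware figures quoted in the text.
NODES = 50
CORES_PER_NODE = 4
RAM_PER_NODE_GB = 8
SQL_SERVER_CORES = 4
SQL_SERVER_RAM_GB = 32

cluster_cores = NODES * CORES_PER_NODE      # total cores in the HPC cluster
cluster_ram_gb = NODES * RAM_PER_NODE_GB    # total RAM in the HPC cluster

# How much larger the cluster is than a single SQL Server.
ram_ratio = cluster_ram_gb / SQL_SERVER_RAM_GB
cpu_ratio = cluster_cores / SQL_SERVER_CORES
```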

Finding the bottleneck
The computational model chosen for our experiments computed the correlation between stock prices over some period of time. It contained very data-intensive computations with complex queries to the SQL Server that required significant grouping and sorting. The model used a significant amount of data, most of which could be cached. The anticipated maximum cache hit ratio for the model was greater than 90%.

First, we needed to conduct a test of the model. We decided to test all three ways of measuring scalability (one logical unit of data is roughly equal to 32 million records in the database, or 512MB of tick data returned from queries):

  • Keep the amount of data to process constant and increase the processing power.
  • Increase the amount of data and keep the processing power constant.
  • Increase the amount of data and the processing power together.
We tested the model using this test plan and measured the total time of a map task, the time spent retrieving financial data from the central market database (the GetTrades method), and the time spent in the SQL Server ISP. The chart also includes a line showing how the task time would behave if the system scaled perfectly linearly:

Picture 2. Map task time distribution of original model in CompFin

You may notice that the bottleneck was in the central market database: the time spent retrieving financial data is almost equal to the entire task time. You may also notice that the system scaled much worse than linearly. So, this was a good opportunity for Velocity to improve performance and scalability.

Getting rid of the bottleneck – Velocity cache for tick data
At first, we created a Velocity distributed cache for financial data. This was just an intra-job cache, intended to reduce the cost of repeatedly fetching the same data within a computation job. Complete replacement of the central market database by the Velocity cache was not considered at this stage. This approach is shown in Picture 3 (blue cloud):

Picture 3. CompFin + Velocity improvements (distributed and local caches)

The Velocity cache for tick data ran on the HPC cluster nodes. More precisely, the same machines where the computation tasks ran were also used as Velocity hosts.

We compared this approach with the Original CompFin:

Picture 4. Comparison of map task time between original CompFin and CompFin with Velocity distributed cache

In the beginning, everything was working well. Velocity distributed cache reduced the number of requests to the central market database and performance increased dramatically.

As the number of processors used for the computation increased to 128, the performance started to decrease. On 32 machines (128 processors), Velocity still dramatically outperformed the original CompFin results, but some performance degradation was experienced relative to the 16-machine (64 processors) scenario. On 50 machines (200 processors), in CTP1, we began to experience some failures and timeouts.

Nevertheless, keeping in mind that Velocity was in its first CTP, the results were still quite amazing.

Velocity scalability – local cache & data-aware routing
Fortunately, we had a plan for further improving performance and reducing the load on the Velocity distributed cache. We enabled the Velocity “local cache” and added data-aware routing to CompFin (see Picture 3). This was possible because all tasks that execute on the same HPC node execute in the same Windows process. The local in-process cache improves performance only if the data-reuse factor for tasks that run in the same process is high; hence data-aware routing was required.
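The interplay between the local cache and data-aware routing can be sketched as follows. This is not the Velocity API: the class, the dict standing in for the distributed cache and the CRC-based routing function are all assumptions made for illustration.

```python
import zlib


class TwoTierCache:
    """In-process local cache in front of a shared (distributed) cache."""

    def __init__(self, distributed, load_from_db):
        self.local = {}                    # per-process cache
        self.distributed = distributed     # dict-like stand-in for the shared cache
        self.load_from_db = load_from_db   # callable: key -> value
        self.hits = {"local": 0, "distributed": 0, "db": 0}

    def get(self, key):
        if key in self.local:
            self.hits["local"] += 1
            return self.local[key]
        if key in self.distributed:
            self.hits["distributed"] += 1
            value = self.distributed[key]
        else:
            # Miss in both tiers: load from the database and publish the value.
            self.hits["db"] += 1
            value = self.load_from_db(key)
            self.distributed[key] = value
        self.local[key] = value
        return value


def route(key, nodes):
    """Data-aware routing: tasks for the same key always go to the same node."""
    return nodes[zlib.crc32(key.encode()) % len(nodes)]
```

If tasks that need the same symbol are routed to the same node with `route`, the second and later tasks are served from that node's local cache and never touch the distributed cache or the database. Without such routing, the same key would be spread across many processes and the local cache hit rate would stay low.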

Below is the comparison between this approach and the approach with Velocity distributed cache alone:

Picture 5. Comparison of map task time between distributed cache and local cache

The problems with Velocity distributed cache scalability were solved, and we can conclude that Velocity can work well in CTP1 even on large clusters.

This approach performed 5.6 to 31.9 times better than the original CompFin. It also scaled better than any other approach, although not linearly. We believe this is because interaction with the SQL Server was still required before data could be retrieved from the cache.

However, in the “Finding the bottleneck” section we mentioned that the model sent complex queries with grouping to the SQL Server. So, perhaps performance could be improved just by simplifying these queries and moving all logic from the database to the compute nodes.

Getting rid of the bottleneck via simplified queries
Following the previous assumption, we tried to reduce the load on the central market database by simplifying the queries used in the model through removal of grouping operations. The results were largely positive:

Picture 6. Comparison of map task time between original CompFin with complex and simple queries

The time spent on retrieval of tick data was reduced by a factor of 8.5 to 52, and the task time became almost linear. However, the scalability problem with the SQL Server remained, so we tried to improve the time further by using our previous best approach, the Velocity distributed cache with local cache enabled and data-aware routing:

Picture 7. Comparison of map task time between original CompFin and CompFin with Velocity local cache. Simplified queries are used in both cases.

Surprisingly, doing so did not improve task performance significantly. Perhaps the overhead of putting items into the Velocity distributed cache was too big, or the cache services simply stole CPU cycles from the computations. Nevertheless, we believed that these factors had only a minor impact on performance, so we decided to compare only the time spent in the GetTrades method, where the financial data was retrieved:

Picture 8. Comparison of time spent retrieving tick data between original CompFin and CompFin with Velocity local cache. Simplified queries are used in both cases.

Our assumptions were right, and the local cache approach greatly reduced the time spent retrieving financial data. However, one question remained: why was the overall task time not reduced? To answer it, we performed a number of tests to investigate the distribution of task time when the local cache was used:

Picture 9. Time distribution in CompFin with simplified queries and Velocity local cache.

These tests revealed that the SQL Server ISP, which was typically the least time-consuming element of the system, had now moved to the foreground. Since retrieval of financial data was now very fast, the frequency of write operations to the SQL Server increased, and the SQL Server ISP became a bottleneck.

Our conclusion is that Velocity can improve even a very fast model, where only simple queries are used. In this environment, however, the SQL Server ISP remained a bottleneck. We have yet to experiment with using Velocity as the ISP with simple queries.

Conclusion
In this article we have shown that a centralized database becomes the bottleneck for data-intensive applications running on compute grids. We tried to reduce database load using two approaches: introducing a Velocity distributed cache between the database and the compute grid, and simplifying SQL queries to the database by moving all calculations and aggregations into the application logic.

Both approaches improved performance dramatically. In particular, querying data with an active Velocity distributed cache was up to 31 times faster than without it. In certain configurations, when experimenting with CTP1, we also observed some Velocity scalability issues; however, for a first CTP, Velocity performed very well. At the time of publishing, Velocity is already in CTP2 and its development remains very active. In the near future we expect the Velocity team to spend a lot of time improving scalability and performance prior to release. The Velocity team has big plans: not only will some issues soon be resolved, but lots of new functionality will be added as well.

So, if you have problems with database scalability and you use the Microsoft technology stack, you may consider introducing Velocity into your system.

There are still many interesting topics to investigate regarding Velocity and CompFin. There are also other improvements that can be made in CompFin, so we hope that this project was not the last one in which we work with Microsoft technologies.


September 24, 2008

Cloud Performance Reports

Cloud computing is getting a great deal of attention these days. Unfortunately, there is very little data available about performance, scalability and usability of the cloud deployment platforms.

I was recently invited to speak at the Silicon Valley Cloud User Group, where I tried to bring a "practitioner's perspective" and present the results of three recent performance and scalability benchmarks related to cloud computing. The first benchmark aims to establish the scalability of EC2 on a perfectly parallel mathematical problem, a Monte Carlo simulation, executed by GridGain's popular open source map/reduce platform, and to document lessons learned in making the application scale to 512 nodes.

The second benchmark looks at the scalability of a more complex stateful application, typical of Risk Management, that required both an in-memory data grid and a compute grid. Both grids were running on EC2 and executed by GigaSpaces' data & compute grid platform.

The third benchmark looks at a prototypical data-intensive Portfolio Analysis application used heavily in the financial services industry, and studies the performance impact of data being located close to computing, or "on the cloud" vs. "off the cloud". This work was done in collaboration with Microsoft on their HPC++ CompFin Lab, which integrates Microsoft Windows HPC Server, a central market data database and Microsoft productivity products to provide the academic community with an online service to publish, execute and manage computational finance models.

You can find the presentation, with a summary of results, here. Please note that these results are very fresh and that in two cases the benchmarks are still going on. You can find far more details on the first benchmark in our previous blog post. We will follow up with more detailed blog reports on the second and third benchmarks soon.

July 31, 2008

Scalability Benchmark of Monte Carlo Simulation on Amazon EC2 with GridGain Software

This blog post presents the report on a recently concluded scalability benchmark of Monte Carlo simulations running on Amazon EC2 using the GridGain framework. It consists of two parts: Part I is a technical report on the benchmark goals, method and results, and Part II is an account of the development process and lessons learned.


Part I: Benchmark description & results

The goal of the benchmark was to study the scalability characteristics of massively parallel algorithms executed on Amazon's Elastic Compute Cloud (EC2) and managed by GridGain.

A Monte Carlo simulation was chosen as the algorithm because of its widespread use in financial applications. The same algorithm with different parameters was used on a wide range of grid sizes: 2, 4, 8, 16, ..., 256, 512. The parameters guaranteed that the amount of work performed by the whole grid was always linear with respect to the number of nodes. In other words, twice as many nodes always performed twice as much work. Perfect linear scalability would mean identical *completion times* for Job-1 running on 2 nodes and Job-2 running on 512 nodes, given that Job-2 had 256 times more work to do.

As Amazon permitted us to use a maximum of 550 nodes in a single run, the upper limit of the benchmark was set at 512.
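This setup, where the total work grows in proportion to the node count, can be expressed as a simple efficiency ratio. The sketch below is only an illustration: the function names are invented, and any timings passed to it are the caller's own measurements, not figures from this benchmark.

```python
def total_work(nodes: int, work_per_node: int) -> int:
    """Total work grows linearly with the grid size."""
    return nodes * work_per_node


def scaled_efficiency(t_small: float, t_large: float) -> float:
    """Ratio of completion times when work per node is held constant.

    1.0 means the larger grid finished its proportionally larger job in the
    same time as the smaller grid (perfect linear scalability); values below
    1.0 measure how much time was lost to overheads as the grid grew.
    """
    return t_small / t_large
```

For instance, `total_work(512, w) / total_work(2, w)` is 256 for any per-node workload `w`, which matches the Job-1 vs. Job-2 comparison above.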

The test utilized a full open source software stack, including GridGain, the Linux operating system, Sun Microsystems' Open MQ JMS messaging and Sun's Java 5 VM. Amazon's preinstalled Fedora Core 8 image with a custom testing framework was used to conduct the benchmark.

The results justified our hopes: we could successfully run up to 512 nodes without significant performance degradation. The results graph is shown in Figure 1.

Figure 1. Average task execution times on 2-512 nodes grids.

The performance degradation of 3 seconds (about 20%) should be considered minor given the roughly 250-fold increase in scale. The curve rises twice, in the ranges 2-8 and 256-512, while it remains almost flat in the range 8-256.

The growth in the range 2-8 can be explained by the initial rapid growth of the grid, which raises the maximum task execution time among nodes (the overall time is determined by the “weak link”). The growth in the range 256-512 could be explained by OpenMQ limitations as applied to our use case (the JMS load is quadratic in the grid size). To continue scaling the grid beyond 512 nodes, clustering OpenMQ is likely to be required.
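To get a feel for what quadratic load means here, the sketch below assumes a simple model in which every node's message has to be delivered to every other node, so one round costs n*(n-1) deliveries. That model is an assumption made for illustration; the actual message pattern of the JMS-based discovery is not described in this post.

```python
def pairwise_deliveries(nodes: int) -> int:
    """Deliveries per round if every node's message reaches every other node."""
    return nodes * (nodes - 1)


def load_growth(n_from: int, n_to: int) -> float:
    """How many times the per-round load grows between two grid sizes."""
    return pairwise_deliveries(n_to) / pairwise_deliveries(n_from)
```

Under this model, doubling the grid from 256 to 512 nodes roughly quadruples the load on the broker, while the useful work only doubles. That would be consistent with the curve staying flat for a long time and then bending upwards at the largest sizes.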

We can conclude that the obtained result showed near linear scalability and performance improvements from 2 to 512 nodes in all test runs.


Part II: How did we do it?

To start using EC2, just open the Amazon EC2 Getting Started Guide, download the EC2 API tools and follow the instructions. You can use one of the predefined public machine images or create a bundle with your own image.

One relatively big problem is the lack of persistence options in EC2. When an instance goes down, all of its data is lost. However, there are third party solutions capable of mounting Amazon's S3 storage as a Linux filesystem.

Now, we needed to create a simple framework allowing us to manage the grid on Amazon’s EC2. The basic functionality would look like:

  • start the grid;
  • add/remove nodes to/from the grid;
  • display the grid health;
  • run single calculation task interactively;
  • run a benchmark (batch task execution on 1, 2, 4, …, 512,... nodes).

The first problem we encountered was that multicast is not supported in the EC2 infrastructure. By default GridGain is configured to make initial discovery by IP multicast. Unfortunately, as far as we know, the EC2 guys don’t plan to create fully functional multicast support anytime soon.

Luckily, GridGain is shipped with a set of various Service Provider Interfaces (SPIs). So, we could choose another DiscoverySPI that does not use IP multicast for discovery purposes, and we chose JMSDiscoverySPI. At first, we picked Apache ActiveMQ as a JMS implementation and soon ran into stability issues. Then we switched to OpenMQ and it proved to be sufficiently robust.

Our second problem turned out to be the default maximum number of running instances per user: 20. Since we were going to run grids much larger than 20 nodes, we needed to have the default limit raised. It took several steps and a few more days to negotiate with Amazon EC2, but eventually we were granted the right to run up to 550 nodes. It seems that the business process of requesting a large number of nodes is still not very well defined by Amazon.

The image creation was not a very hard task: we got the public Fedora 8 i386 image, ran it, installed Java Runtime Environment 6, GridGain and ActiveMQ, then bundled this new image.

We still required some extra configuration since the grid node must start on system startup. The hardest thing was to write shell scripts for starting the grid. These scripts had to parse user data sent to the instance (different for the Head Node and Worker Nodes), determine the node type and start the necessary software in each given configuration. Later these scripts were replaced by more convenient Java-based tools and Head Node’s ActiveMQ was replaced by a standalone OpenMQ.

Figure 2. Initial GridGain on Amazon EC2 architecture.

So, how does the grid work? First, we start the Head Node (see figure 2). Head Node automatically starts ActiveMQ (only in early versions), GridGain in master mode (will be described later) and runs the requested number of Worker Nodes. Since Workers should connect to the JMS server, Head Node passes its IP address as user data for Worker Nodes. Worker Nodes just start GridGain.
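
The startup logic described above can be sketched as follows. The user-data format (`role=...;jms=...`), the function names and the component labels are hypothetical, chosen only for illustration; the original scripts and their format are not shown in this article.

```javascript
// Hypothetical user-data format: "key=value" pairs separated by ";",
// e.g. "role=worker;jms=10.0.0.5". The real format is not given in the text.
function parseUserData(userData) {
  const config = {};
  for (const pair of userData.split(";")) {
    const idx = pair.indexOf("=");
    if (idx > 0) {
      config[pair.slice(0, idx).trim()] = pair.slice(idx + 1).trim();
    }
  }
  return config;
}

// Decide what a freshly booted instance should start.
function startupPlan(userData) {
  const config = parseUserData(userData);
  if (config.role === "head") {
    // The Head Node hosts the JMS broker, runs GridGain in master mode,
    // and launches the requested number of Worker Nodes.
    return {
      start: ["jms-broker", "gridgain-master"],
      workersToLaunch: Number(config.workers || 0),
    };
  }
  // A Worker Node only starts GridGain and connects to the Head Node's broker.
  return { start: ["gridgain-worker"], jmsHost: config.jms };
}

console.log(startupPlan("role=head;workers=4"));
console.log(startupPlan("role=worker;jms=10.0.0.5"));
```

Keeping the role decision in one function means the same machine image can boot as either node type, with the user data as the only difference.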

We quickly realized that we needed a user-friendly interface to manage the grid. So we rewrote the scripts in Java using the typica library, which let us integrate Amazon and GridGain management. Now we can automatically start EC2 instances, wait for them to start, get their IPs, track their status, etc. Together with GridGain management it becomes a very powerful tool. Let’s imagine the grid running completely autonomously: the management module could automatically bring up more EC2 instances or shut them down depending on the current grid load.
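
A minimal sketch of such a scaling decision is shown below. The load metric (average pending tasks per node), the thresholds and the function name are assumptions made for illustration; they are not part of the framework described here.

```javascript
// Decide how many nodes the grid should have, given the current load.
// All parameters are illustrative assumptions, not measured values.
// Expects currentNodes >= 1.
function desiredNodeCount(currentNodes, pendingTasks, opts) {
  const { minNodes, maxNodes, tasksPerNodeHigh, tasksPerNodeLow } = opts;
  const load = pendingTasks / currentNodes; // average queue depth per node
  let target = currentNodes;
  if (load > tasksPerNodeHigh) {
    target = currentNodes * 2;            // overloaded: double the grid
  } else if (load < tasksPerNodeLow) {
    target = Math.ceil(currentNodes / 2); // mostly idle: halve the grid
  }
  // Stay within the allowed range (for example, the EC2 instance limit).
  return Math.min(maxNodes, Math.max(minNodes, target));
}

const opts = { minNodes: 1, maxNodes: 550, tasksPerNodeHigh: 100, tasksPerNodeLow: 10 };
console.log(desiredNodeCount(8, 2000, opts));     // 250 tasks per node, so grow
console.log(desiredNodeCount(8, 16, opts));       // 2 tasks per node, so shrink
console.log(desiredNodeCount(512, 100000, opts)); // growth is capped at maxNodes
```

The management module would compare the returned value with the current grid size and start or terminate the difference in EC2 instances.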

When we had a simple web UI capable of showing the table of grid nodes and running our benchmark, we felt ready to run grids larger than 20 nodes. We asked Amazon to increase our running-instance limit to 1050 nodes; Amazon agreed to let us run 550 instances. The distinguishing feature of our UI was the embedded scripting engine, which allowed us to change the benchmarking schema without restarting the grid. To see how simple it is to run a benchmark consisting of several task runs on different numbers of nodes, just look at this code:

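// Iterations per node stay constant; the total grows with the grid.
var itersPerNode = 5000;
// Grid sizes to benchmark.
var cnode = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512];
for (var i = 0; i < cnode.length; i++) {
    var n = cnode[i];
    grid.growEC2Grid(n, true);     // grow the EC2 grid for this run
    grid.waitForGridInstances(n);  // wait until n grid instances are available
    runTask(itersPerNode * n, n, 3);
}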

Listing 1. Sample benchmarking script.


The closing remarks:

The framework we built is a good place to start developing a GridGain appliance. But what if we want to create, for instance, an In-Memory Data Grid on this basis? Our next step is making the framework more generic to allow easy integration with virtually any grid framework (either computing or IMDG). There are also other ideas such as creating a generic image and storing specific grid software/configurations in S3, which should ease debugging and small alterations to the appliance.

We are currently evaluating whether our framework can become something more than just a GridGain testing framework. We'd love to hear from the community whether this line of work is interesting to others besides ourselves. If you'd like to know more about this benchmark, or get access to the full source code, please contact me at mgorbunov@griddynamics.com


June 19, 2008

Grid technologies in middle-size applications

Grid technologies were born to solve extreme problems, and currently they are used primarily by large-scale applications. However, like computers, which were initially used only for complex scientific tasks and later came to almost every house, grid technologies are coming to the middle-size enterprise market. In this article I will try to answer the question of how a middle-size application can take advantage of grid technologies.

When writing a large application that solves extreme problems, scalability is always an issue and you do not have a choice: you must invest resources in scalability, and developers often have enough time and knowledge to solve this problem. When writing small applications, you usually do not care about this kind of problem, because everything will work fine on a single server. But when writing a middle-size application, you are in trouble: you are already big enough to start thinking about scalability, but not big enough to invest a lot of resources in solving it. When a middle-size application has grown out of a small one, the trouble becomes really serious. However, scalability problems in middle-size applications are usually caused by a very limited set of architecture decisions. Knowing these causes and their resolutions will help you build a more scalable application.

Whatever architecture you choose for a middle-size application, it will usually have a web server and a database. In the best case, it will have a single physical machine with both. In the worst case, it will have web servers, application servers and a database server on separate machines. In all cases, the request processing chain will include several processes and will take a lot of time. Usually, the database server is also a bottleneck for the entire system, because it doesn’t scale well. To resolve these problems, mankind invented caches. Caches are useful for storing data that is costly to compute or retrieve. They can greatly reduce both database load and request processing time. However, caches can also be scaling killers.

Assume you have an architecture with a dedicated web server that hosts your application process. It is relatively easy to maintain a cache on a single server. But suppose you want to scale and the single server becomes a load balancing cluster, where each machine should maintain its own cache, and this cache should be synchronized with caches on other servers. For example, if some item is removed from the cache in one server, it should be immediately removed in every cache on every other server in the cluster. This is extremely hard to accomplish. But since developers often implement simple local caches by themselves, they try to enhance their caches to support distributed behavior. The problem is that developers often do not have enough knowledge and experience to do this. Fortunately, the problem of distributed caches is well known and the solution already exists.
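
The synchronization requirement can be illustrated with a small in-process simulation. The class and method names are hypothetical, and a real cluster would deliver invalidations over the network, where messages can be delayed, reordered or lost; that is exactly what makes hand-written solutions fragile.

```javascript
// Each web server keeps its own local cache (simulated here in one process).
class LocalCache {
  constructor() {
    this.entries = new Map();
    this.peers = []; // caches of the other servers in the cluster
  }
  get(key) {
    return this.entries.get(key);
  }
  put(key, value) {
    this.entries.set(key, value);
  }
  // Remove locally, then make every peer drop its copy too.
  remove(key) {
    this.entries.delete(key);
    for (const peer of this.peers) {
      peer.entries.delete(key);
    }
  }
}

// Wire up a cluster of three servers, each caching the same item.
const caches = [new LocalCache(), new LocalCache(), new LocalCache()];
for (const cache of caches) {
  cache.peers = caches.filter((other) => other !== cache);
  cache.put("user:42", { name: "Alice" });
}

caches[0].remove("user:42");
// Without the peer loop in remove(), servers 1 and 2 would keep serving stale data.
console.log(caches.map((cache) => cache.get("user:42")));
```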

The solution is to use third-party distributed caches or In Memory Data Grids (IMDGs). In Memory Data Grids were created to solve the problems of scaling data in extreme applications, where the cluster contains hundreds of servers. Data stored on these servers can be partitioned between them or replicated. If data is partitioned, each server contains only one chunk of data and each chunk is stored on multiple servers to provide failover. This allows huge amounts of data to be stored in memory. If data is replicated, it is stored in full on each server. Of course, the data cached on each server is synchronized with data on other servers. This allows very fast access to data, because it is always available locally. In Memory Data Grids provide lots of other interesting features, which deserve an entire book to describe. Distributed caches are essentially a simplified form of an In Memory Data Grid. Currently there are many implementations of both IMDGs and distributed caches, so if you choose to use these technologies you have a number of options.
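
The difference between the two layouts can be sketched as a placement function. The hash function, the node names and the single-backup policy are assumptions for illustration; real IMDG products use their own partitioning schemes.

```javascript
// Simple deterministic string hash (illustrative only).
function hashKey(key) {
  let h = 0;
  for (let i = 0; i < key.length; i++) {
    h = (h * 31 + key.charCodeAt(i)) >>> 0; // keep as unsigned 32-bit
  }
  return h;
}

// Partitioned: each key lives on one primary node plus `backups` further nodes.
function partitionedOwners(key, nodes, backups) {
  const primary = hashKey(key) % nodes.length;
  const owners = [];
  for (let i = 0; i <= backups && i < nodes.length; i++) {
    owners.push(nodes[(primary + i) % nodes.length]);
  }
  return owners;
}

// Replicated: every node holds a full copy of every key.
function replicatedOwners(key, nodes) {
  return nodes.slice();
}

const nodes = ["node-a", "node-b", "node-c", "node-d"];
console.log(partitionedOwners("order:1001", nodes, 1)); // two of the four nodes
console.log(replicatedOwners("order:1001", nodes));     // all four nodes
```

With partitioning, total capacity grows with the cluster because each node stores only its share; with replication, reads are always local but every node must fit the whole data set.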

The concrete choice will depend on what technology or framework you use in your application:
  • If you are using .NET, you may use ScaleOut, NCache or Microsoft’s new distributed cache Velocity, which is currently available as a Community Technology Preview. They all provide an ASP.NET session state provider – the easiest way to gain a benefit from grid technologies. With this session state provider you will not need to maintain a special SQL Server for storing ASP.NET session data, because all data will be distributed between web servers in the cluster in a reliable and robust way.
  • If you are using Java, Oracle Coherence and GigaSpaces are the most famous In Memory Data Grids and, hence, can be used as distributed caches. They both provide a second-level cache for Hibernate, so, if you use it, you can scale easily with no additional development effort.
  • If you are using C++, PHP, Ruby or Python, you should consider memcached. This is a very famous distributed cache initially developed for LiveJournal, which has already helped to scale many extreme applications, like Wikipedia, YouTube, Facebook and others.

All these implementations will help you to solve distributed cache problems and scale well. In simple cases, to start using them, you will only need to replace your old local caches with new distributed ones. If you need to scale a Hibernate second-level cache or an ASP.NET session state provider, you will not be required to write any code. However, in the case of complex and serious scalability problems you can consult with us at Grid Dynamics any time and we will help you to solve them in the most effective way.
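
Replacing a local cache with a distributed one is easiest when the application reaches the cache through a narrow interface. The sketch below shows a cache-aside helper under that assumption; the function names are hypothetical, a plain `Map` stands in for a distributed cache client, and real clients are typically asynchronous and have their own APIs.

```javascript
// Cache-aside helper: works with any cache exposing get(key) and set(key, value).
function getOrLoad(cache, key, loader) {
  const cached = cache.get(key);
  if (cached !== undefined) {
    return cached;           // cache hit: no database round trip
  }
  const value = loader(key); // cache miss: fetch from the slow source
  cache.set(key, value);
  return value;
}

// Stand-in for a distributed cache client; a Map has the same get/set shape.
const cache = new Map();
let dbCalls = 0;
function loadFromDatabase(key) {
  dbCalls += 1;              // hypothetical expensive query
  return "row for " + key;
}

getOrLoad(cache, "item:7", loadFromDatabase);
getOrLoad(cache, "item:7", loadFromDatabase);
console.log(dbCalls); // the second call is served from the cache
```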
