Speeding up data-intensive HPC applications with Velocity
We encountered such a problem in one of our recent projects, where a single database served as the centralized application data repository for the entire compute grid. Access to application data became slow enough to degrade overall system performance, even though the database was hosted on a sufficiently powerful server.
In this article we will describe different ways of reducing database load and how they affect the overall system performance. We will also explore the advantages and disadvantages of Velocity as one of the ways to reduce data access latency. More specifically, we will show that:
- Database load can be reduced by introducing a Velocity distributed cache, or by simplifying SQL queries and moving calculations from SQL into application logic.
- The Velocity CTP1 distributed cache improves overall system performance dramatically. In our case, querying data with an active Velocity cache was up to 31 times faster than without it!
- In CTP1, we experienced some scalability issues in particular configurations, but we are confident these issues will be resolved in future releases.
HPC++ CompFin Labs is a compute cloud that was created to give university students the ability to run massive analytical financial computations.
When a computation needs to be performed, one has to create a custom computational model that defines how the calculation will be performed and how the resulting data will be handled. The computational model follows the MapReduce paradigm and includes an Excel-based UI with a list of input parameters and a .NET assembly with the logic for splitting the computation into tasks, running a single task (map task), and combining task results into computation results (reduce task).
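The three pieces of model logic described above (split, map task, reduce task) can be sketched in a few lines. This is only an illustrative Python sketch, not the actual CompFin .NET interface: the function names, the task dictionary layout, and the in-memory `FAKE_DATA` market data are all hypothetical.

```python
from statistics import mean

def split(symbols, period):
    """Split a computation into independent map tasks, one per stock symbol."""
    return [{"symbol": s, "period": period} for s in symbols]

def map_task(task, get_trades):
    """Run a single task: fetch prices for one symbol and compute a statistic."""
    prices = get_trades(task["symbol"], task["period"])
    return {"symbol": task["symbol"], "mean_price": mean(prices)}

def reduce_task(intermediate_results):
    """Combine the per-task results into the final computation result."""
    return {r["symbol"]: r["mean_price"] for r in intermediate_results}

def run_model(symbols, period, get_trades):
    """Drive the three phases sequentially; a real grid runs map tasks in parallel."""
    tasks = split(symbols, period)
    intermediate = [map_task(t, get_trades) for t in tasks]
    return reduce_task(intermediate)

# Hypothetical in-memory stand-in for the market data database.
FAKE_DATA = {"AAA": [10.0, 12.0], "BBB": [5.0, 7.0, 9.0]}

def fake_get_trades(symbol, period):
    """Return made-up prices; the period is ignored in this sketch."""
    return FAKE_DATA[symbol]
```

Calling `run_model(["AAA", "BBB"], ("2008-01-01", "2008-01-31"), fake_get_trades)` walks through all three phases on the made-up data.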
For example, one may write a model that performs some statistical calculation for a given stock symbol over some period of time. Later, someone else can open the Excel UI, change the stock symbol and the time period, and re-run the calculation. The following diagram illustrates how the example model would work in a CompFin cloud:

Picture 1. Original CompFin
The system consists of the following components:
- A Microsoft HPC Server 2008 cluster.
- A SharePoint Server, where the CompFin website and the computational model files are hosted.
- A SQL Server, where computation results are stored.
- A SQL Server Intermediate Storage Provider (ISP), where task results are stored for subsequent consumption in the “reduce” phase.
- A centralized market data database, where all financial data is located.
A typical computation proceeds as follows:
- The user navigates to the CompFin website. On this website, the user chooses the model to run and opens the appropriate Excel file. Next, the user fills in the required computation parameters and submits the job with the CompFin Excel plug-in.
- The job start request is handled by the CompFin web service, which typically runs on the SharePoint server as well. This web service calls the model logic to split the computation into tasks.
- All computational tasks are submitted to the Microsoft HPC Cluster.
- The map tasks are started first. Each task retrieves the appropriate data from the central market database, performs calculations on this data, and writes intermediate results to the SQL Server ISP.
- When all map tasks are finished, the reduce task is launched. It retrieves the intermediate data, combines it into the final results, and stores them in the result storage.
- After the computation is finished, the user may request job results via the CompFin website.
- The computation results are retrieved from the result storage and returned to the user.
Finding the bottleneck
The computational model chosen for our experiments computed correlations between stock prices over a period of time. It was very data-intensive and issued complex queries to the SQL Server that required significant grouping and sorting. The model used a significant amount of data, most of which could be cached: the anticipated maximum cache hit ratio for the model was greater than 90%.
First, we needed to test the model. We decided to measure scalability in all three ways (one logical unit of data is roughly equal to 32 million database records, or 512 MB of tick data returned from queries):
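To see why a hit ratio above 90% matters, a simple back-of-the-envelope model helps. The sketch below uses the textbook weighted-average formula and assumes a cache hit costs `cache_time` while a miss costs `db_time`; it ignores the extra cache lookup paid on a miss. The function name and the numbers in the usage note are made up for illustration and are not measurements from our tests.

```python
def effective_access_time(hit_ratio, cache_time, db_time):
    """Average time per data access for a given cache hit ratio.

    Simplified model: a hit costs cache_time, a miss costs db_time.
    """
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * cache_time + (1.0 - hit_ratio) * db_time
```

With hypothetical costs of 1 time unit for a cache hit and 100 units for a database query, a 90% hit ratio gives an average of roughly 10.9 units per access, so the remaining misses still dominate the total.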
Keep the amount of data constant, increase processing power:

Increase the amount of data, keep processing power constant:

Increase the amount of data and processing power together:

We ran the model according to this test plan and measured the total map task time, the time spent retrieving financial data from the central market database (the GetTrades method), and the time spent in the SQL Server ISP. The chart also shows how the task time would behave if the system scaled perfectly linearly:

Picture 2. Map task time distribution of original model in CompFin
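The "completely linear" reference curve can be reasoned about with a small helper. This is a sketch under a stated assumption: in a perfectly linear system, time grows in proportion to the amount of data and shrinks in proportion to the number of processors. The function names and the baseline values in the usage note are hypothetical.

```python
def ideal_time(base_time, base_data, base_procs, data, procs):
    """Expected time under perfectly linear scaling, relative to a measured baseline."""
    return base_time * (data / base_data) * (base_procs / procs)

def scaling_efficiency(measured_time, ideal):
    """Ratio of ideal to measured time; 1.0 means perfectly linear scaling."""
    return ideal / measured_time
```

For a hypothetical baseline of 100 seconds on 1 data unit and 4 processors, the three test plans give: doubling the processors halves the ideal time to 50 seconds, doubling the data doubles it to 200 seconds, and doubling both leaves it at 100 seconds.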
As the chart shows, the bottleneck was the central market database: the time spent retrieving financial data is almost equal to the entire task time. The system also scaled much worse than linearly. This made it a good opportunity for Velocity to improve performance and scalability.
Getting rid of the bottleneck – Velocity cache for tick data
First, we created a Velocity distributed cache for financial data. It was only an intra-job cache, intended to reduce the cost of repeatedly fetching the same data within a computation job. Completely replacing the central market database with the Velocity cache was not considered at this stage. This approach is shown in picture 3 (blue cloud):

Picture 3. CompFin + Velocity improvements (distributed and local caches)
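The general idea of putting a cache in front of the database is often called the cache-aside pattern: look in the cache first, query the database only on a miss, then store the result for later readers. The sketch below illustrates that pattern in Python and is not the Velocity API; a plain dictionary stands in for the distributed cache, and the class name, key layout, and `query_db` callback are hypothetical.

```python
class CacheAsideReader:
    """Read-through wrapper: check the cache first, fall back to the database."""

    def __init__(self, cache, query_db):
        self.cache = cache        # dict-like store; stand-in for a distributed cache
        self.query_db = query_db  # callable that runs the expensive database query
        self.hits = 0
        self.misses = 0

    def get_trades(self, symbol, day):
        key = (symbol, day)
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        # Cache miss: pay for the database query once, then remember the result.
        self.misses += 1
        value = self.query_db(symbol, day)
        self.cache[key] = value
        return value
```

Within one job, every task that asks for the same symbol and day after the first one is served from the cache, which is what reduces the number of requests reaching the central market database.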
We compared this approach with the original CompFin:

Picture 4. Comparison of map task time between original CompFin and CompFin with Velocity distributed cache
In the beginning, everything worked well. The Velocity distributed cache reduced the number of requests to the central market database, and performance increased dramatically.
As the number of processors used for the computation increased to 128, performance started to decrease. On 32 machines (128 processors), Velocity still dramatically outperformed the original CompFin, but we saw some degradation relative to the 16-machine (64-processor) scenario. On 50 machines (200 processors), in CTP1, we began to experience failures and timeouts.
Nevertheless, keeping in mind that Velocity was in its first CTP, the results were still impressive.
Velocity scalability – local cache & data-aware routing
Fortunately, we had a plan for further improving performance and reducing the load on the Velocity distributed cache. We enabled the Velocity “local cache” and added data-aware routing to CompFin (see picture 3). This was possible because all tasks that execute on the same HPC node run in the same Windows process. The local in-process cache improves performance only if the data-reuse factor for tasks running in the same process is high, which is why data-aware routing was required.
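One simple way to get data-aware routing is to choose the node from a hash of the data a task will read, so that tasks sharing data land in the same process and can share its local cache. The sketch below shows that idea only; it is an assumption for illustration, not how CompFin or the HPC scheduler actually routes tasks, and the node names and task layout are hypothetical.

```python
import zlib
from collections import defaultdict

def route_task(data_key, nodes):
    """Pick a node deterministically from the key of the data a task will read."""
    index = zlib.crc32(data_key.encode("utf-8")) % len(nodes)
    return nodes[index]

def assign_tasks(tasks, nodes):
    """Group tasks by target node so tasks on the same symbol share a local cache."""
    assignment = defaultdict(list)
    for task in tasks:
        assignment[route_task(task["symbol"], nodes)].append(task)
    return dict(assignment)
```

The trade-off of a pure hash-based scheme is that load can become uneven when a few keys are much more popular than others, so a real scheduler would likely balance data locality against node utilization.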
Below is the comparison between this approach and the approach with Velocity distributed cache alone:

Picture 5. Comparison of map task time between distributed cache and local cache
The scalability problems of the Velocity distributed cache were solved, and we can conclude that Velocity can work well in CTP1 even on large clusters.
This approach performed 5.6 to 31.9 times better than the original CompFin. It also scaled better than any other approach, although not linearly. We believe this is because interaction with the SQL Server was still required before data was retrieved from the cache.
However, in the “Finding the bottleneck” section we mentioned that the model sent complex queries with grouping to the SQL Server. Perhaps performance could be improved simply by simplifying these queries and moving all logic from the database to the compute nodes.
Getting rid of the bottleneck via simplified queries
Following this assumption, we tried to reduce the load on the central market database by simplifying the queries used in the model and removing the grouping operations. The results were largely positive:

Picture 6. Comparison of map task time between original CompFin with complex and simple queries
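The difference between the two query styles can be illustrated with a toy example. The sketch below uses Python's built-in `sqlite3` module as a stand-in for SQL Server; the `trades` table, its columns, and the averaging calculation are hypothetical and much simpler than the real model's queries. The first function lets the database group and aggregate, the second fetches plain rows and does the grouping in application code.

```python
import sqlite3
from collections import defaultdict

def make_db():
    """Create a tiny in-memory database with made-up trade data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE trades (symbol TEXT, day TEXT, price REAL)")
    conn.executemany(
        "INSERT INTO trades VALUES (?, ?, ?)",
        [("AAA", "d1", 10.0), ("AAA", "d1", 12.0),
         ("AAA", "d2", 20.0), ("BBB", "d1", 5.0)],
    )
    return conn

def avg_price_in_sql(conn, symbol):
    """Complex query: the database server does the grouping and sorting."""
    rows = conn.execute(
        "SELECT day, AVG(price) FROM trades "
        "WHERE symbol = ? GROUP BY day ORDER BY day",
        (symbol,),
    ).fetchall()
    return dict(rows)

def avg_price_in_app(conn, symbol):
    """Simple query: fetch raw rows and aggregate on the compute node."""
    rows = conn.execute(
        "SELECT day, price FROM trades WHERE symbol = ?", (symbol,)
    )
    by_day = defaultdict(list)
    for day, price in rows:
        by_day[day].append(price)
    return {day: sum(prices) / len(prices) for day, prices in by_day.items()}
```

Both functions return the same result, but the second one shifts CPU work from the single shared database server to the compute nodes, at the cost of transferring more rows over the network.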
The time spent retrieving tick data was reduced by a factor of 8.5 to 52, and the task time became almost linear. However, the scalability problem with the SQL Server remained, so we tried to improve the time further using our previous best approach, the Velocity distributed cache with local cache enabled and data-aware routing:
Picture 7. Comparison of map task time between original CompFin and CompFin with Velocity local cache. Simplified queries are used in both cases.
Surprisingly, this did not improve task performance significantly. Perhaps the overhead of putting items into the Velocity distributed cache was too large, or the cache services simply took CPU cycles away from the computations. Nevertheless, we believed that these factors had only a minor impact on performance, so we decided to compare only the time spent in the GetTrades method, where the financial data was retrieved:
Picture 8. Comparison of time spent retrieving tick data between original CompFin and CompFin with Velocity local cache. Simplified queries are used in both cases.
Our assumption was right: the local cache approach greatly reduced the time spent retrieving financial data. However, one question remained: why was the overall task time not reduced? To answer it, we ran a number of tests to investigate how task time was distributed when the local cache was used:
Picture 9. Time distribution in CompFin with simplified queries and Velocity local cache.
These tests revealed that the SQL Server ISP, which was typically the least time-consuming element of the system, had now moved to the foreground. Since retrieval of financial data was now very fast, the frequency of write operations to the SQL Server increased, and the SQL Server ISP became the bottleneck.
Our conclusion is that Velocity can improve even a very fast model in which only simple queries are used. In this environment, however, the SQL Server ISP remained a bottleneck. We have yet to experiment with using Velocity as the ISP with simple queries.
Conclusion
In this article we have shown that the database can be the bottleneck for data-intensive applications running on compute grids. We tried to reduce database load using two approaches: introducing a Velocity distributed cache between the database and the compute grid, and simplifying SQL queries by moving all calculations and aggregations into the application logic.
Both approaches improved database performance dramatically. In particular, querying data with an active Velocity distributed cache was up to 31 times faster than without it. In certain configurations, when experimenting with CTP1, we also observed some Velocity scalability issues; however, for a first CTP, Velocity performed very well. At the time of publishing, Velocity is already in CTP2 and remains under very active development. In the near future we expect the Velocity team to spend a lot of time on improving scalability and performance prior to release. The Velocity team has big plans: not only will some issues soon be resolved, but a lot of new functionality will be added as well.
So, if you have problems with database scalability and you use the Microsoft technology stack, you may consider introducing Velocity to your system.
There are still many interesting topics to investigate regarding Velocity and CompFin. There are also other improvements that can be made in CompFin, so we hope this was not our last project working with Microsoft technologies.
Labels: .NET, data grid, distributed cache, grid computing, Microsoft HPC, scalability, Velocity, ~Max Martynov

1 Comments:
Although Velocity has made progress from CTP1 to CTP2, it still leaves much to be desired. It will be some time before they provide all the important features in a distributed cache and even longer before it is tested in the market. I wish them good luck. In the meantime, NCache already provides all CTP2 & V1, and many more features. NCache is the first, the most mature, and the most feature-rich distributed cache in the .NET space. NCache is an enterprise level in-memory distributed cache for .NET and also provides a distributed ASP.NET Session State. Check it out at Distributed Cache. NCache Express is a totally free version of NCache. Check it out at Free Distributed Cache.