January 6, 2010

Java tricks, reducing memory consumption

In this blog post, I want to discuss optimizing Java memory usage. The Sun JDK has two simple but powerful tools for memory profiling: jmap and jhat.

jmap has two important capabilities for memory profiling. It can:
• create a heap dump file for any live Java process
• show a heap distribution histogram

Neither of these capabilities requires any special parameters for the Java virtual machine (JVM). Below is a heap distribution histogram produced by jmap.

>jmap -histo:live 16608
num #instances #bytes class name
----------------------------------------------
1: 464773 45411544 [C
2: 148131 22404200
3: 21461 13197200 [S
4: 148131 11862064

5: 231304 11562840

6: 448533 10764792 java.lang.String
7: 14801 9520800

8: 14801 6706536

9: 20178 6060872 [B
10: 250951 6022824 java.util.HashMap$Entry
11: 12530 5532096

12: 26538 4437384 [Ljava.util.HashMap$Entry;
13: 54179 4381328 [I
14: 73294 3939656 [Ljava.lang.Object;
15: 34872 3905664 org.eclipse.swt.graphics.TextLayout$StyleItem
16: 27360 1551928 [Ljava.lang.String;
17: 16056 1541376 java.lang.Class
18: 20222 1294208 org.eclipse.core.internal.resources.ResourceInfo
19: 47006 1128144 java.util.ArrayList
20: 25235 1009400 java.util.HashMap
21: 22911 983784 [[I
22: 13398 857472 org.eclipse.swt.custom.StyleRange
23: 14362 831832 [[C
24: 16264 780672 org.eclipse.ui.internal.handlers.HandlerActivation
25: 36992 591872 org.eclipse.ui.internal.handlers.LegacyHandlerListenerWrapper
26: 423 591304 [Lorg.eclipse.swt.custom.StyleRange;
27: 13427 537080 org.eclipse.jface.text.TreeLineTracker$Node
28: 20254 486096 org.eclipse.core.internal.dtree.DataTreeNode
29: 19700 472800 org.eclipse.swt.internal.win32.SCRIPT_ANALYSIS
30: 19700 472800 org.eclipse.swt.internal.win32.SCRIPT_STATE
31: 17108 410592 org.eclipse.jface.text.Line
32: 1255 401600

33: 16650 399600 java.util.Hashtable$Entry
34: 10952 350464 org.eclipse.ui.internal.services.EvaluationReference
35: 8665 346600 org.eclipse.ui.commands.HandlerSubmission
36: 10624 339968 java.util.TreeMap$Entry
37: 6598 307376 [Lorg.eclipse.swt.graphics.TextLayout$StyleItem;
38: 18520 296320 java.lang.Integer
39: 8567 274144 org.eclipse.ui.commands.ActionHandler
40: 11268 270432 org.eclipse.ui.internal.help.WorkbenchHelpSystem$6
...
Total 2894736 198472504


For each class we can see the class name, the number of instances of the class in the heap, and the number of bytes used in the heap by all instances of the class. The table is sorted by consumed heap space. Although it is very simple, it is extremely useful. This information is enough to diagnose 60% of heap capacity problems.

If a heap distribution histogram is too high-level, you can try jhat. jhat can read and explore a heap dump. It has a web interface, and you can click through your dumped object graph. Of course, clicking through a few million objects is not fun. Fortunately, jhat supports a query language. Let's give it a try.

>jmap -dump:file=eclipse.heap 16608
Dumping heap to C:\Program Files (x86)\Java\jdk1.6.0_11\bin\eclipse.heap ...
Heap dump file created

>jhat -J-Xmx1G eclipse.heap
Reading from eclipse.heap...
Dump file created Wed Oct 14 08:41:07 MSD 2009
Snapshot read, resolving...
Resolving 2300585 objects...
Chasing references, expect 460 dots.............................................
................................................................................
................................................................................
.......................................
Eliminating duplicate references................................................
................................................................................
................................................................................
....................................
Snapshot resolved.
Started HTTP server on port 7000
Server is ready.


Now open http://localhost:7000 and you can see a summary of all classes. You can use standard queries via links or go straight to the “Execute Object Query Language (OQL) query” link at the bottom and type your own query. The query language is quite awkward (it is based on the JavaScript engine) and may not work well for large numbers of objects, but it is a very powerful tool.
 

Enterprise applications are 80% strings and maps


Let’s look at the jmap histogram again. In the top row, we can see the "[C" class consuming most of the heap space. These char arrays are actually part of String objects, and we can see that String instances also consume considerable space. In my experience, 60-80% of the heap in an enterprise application is consumed by strings and hash maps.
Strings
Let's look at how the JVM stores strings. String objects are semantically immutable. Each instance has four fields (all except hash are marked final):
• Reference to char array
• Integer offset
• Integer count of characters
• Integer string hash (lazily evaluated, and once evaluated never changes)

How much memory does one String instance consume?

Here and below are size calculations for the Sun JVM (32bit). This should be similar for other vendors.
Object header (8 bytes) + 1 ref (4 bytes) + 3 ints (12 bytes) = 24 bytes. But the String instance is only a header. The actual text is stored in a char array (2 bytes per character + 12 bytes of array header).
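To make the arithmetic concrete, here is a small sketch that estimates the footprint of a String owning its own char array. The 24-byte and 12-byte figures come from the text above; rounding every object up to a multiple of 8 bytes is an extra assumption not stated in the post, and the class and method names are purely illustrative.

```java
final class StringFootprint {
    static final int STRING_HEADER = 24; // object header + char[] reference + 3 ints (from the text)
    static final int ARRAY_HEADER = 12;  // char[] header (from the text)
    static final int ALIGNMENT = 8;      // ASSUMPTION: objects are padded to 8-byte boundaries

    /** Rounds a size up to the next multiple of ALIGNMENT. */
    static long align(long bytes) {
        return (bytes + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;
    }

    /** Estimated heap bytes for a String with its own char array of the given length. */
    static long estimate(int chars) {
        return align(STRING_HEADER) + align(ARRAY_HEADER + 2L * chars);
    }

    public static void main(String[] args) {
        for (int n : new int[] {0, 10, 100}) {
            System.out.println(n + " chars -> ~" + estimate(n) + " bytes");
        }
    }
}
```

Under these assumptions even an empty string costs about 40 bytes, and a 10-character string about 56 bytes, so short strings are dominated by overhead rather than by text.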



String instances can share char arrays with each other. When you call substring(…) on a String instance, no new char array is allocated. Instead, a new String referencing a subrange of the existing char array is created. Sometimes this can become a problem. Imagine you are loading a text file (e.g., CSV). First you load the entire line as a string, then you find the position of the field and call substring(…). Now your small field value string holds a reference to the entire line of text. Some time later, the String header for the line is collected, but its characters stay in memory because they are still referenced by the other String object.
 






 


In the illustration above, useful data is marked in yellow. We can see that some characters cannot be accessed any more but still occupy space in the heap. How can you avoid such problems?
If you are creating a String object with a constructor, a new char array is always allocated. To copy content of a substring you can use the following construct (it looks a bit strange, but it works):
new String(a.substring(…))
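Applied to the CSV scenario, the trick might look like the sketch below. The helper name and the comma-only parsing are simplifications made up for illustration. On JVMs where substring(…) already copies the characters, the extra constructor call is merely redundant.

```java
final class CsvFieldCopy {
    /** Returns the field at the given zero-based index, backed by its own char array. */
    static String field(String line, int index) {
        int start = 0;
        for (int i = 0; i < index; i++) {
            // Skip past the next comma; indexOf returns -1 when there is none.
            start = line.indexOf(',', start) + 1;
            if (start == 0) {
                throw new IllegalArgumentException("no field " + index);
            }
        }
        int end = line.indexOf(',', start);
        if (end < 0) {
            end = line.length();
        }
        // The String(String) constructor detaches the result from the line's char array.
        return new String(line.substring(start, end));
    }

    public static void main(String[] args) {
        String line = "42,John Smith,some very long comment that we do not want to keep";
        System.out.println(field(line, 1)); // prints "John Smith"
    }
}
```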

String.intern() – think twice
The String class has a method intern() that guarantees the following:
a.equals(b)  => a.intern() == b.intern()

There is a table inside the JVM used to store normal forms of strings. If the text is found in the table, the value from the table is returned from intern(); otherwise the string is added to the table. So a reference returned by intern() is always an object from the JVM intern table.
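A tiny demonstration of that guarantee (the class and method names are just for illustration):

```java
final class InternDemo {
    /** True when both strings map to the same canonical instance. */
    static boolean sameCanonical(String a, String b) {
        return a.intern() == b.intern();
    }

    public static void main(String[] args) {
        String a = new String("grid");
        String b = new String("grid");
        System.out.println(a == b);              // false: two distinct objects
        System.out.println(a.equals(b));         // true: same text
        System.out.println(sameCanonical(a, b)); // true: one entry in the intern table
    }
}
```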

The string intern table keeps weak references to its objects, so unused strings can be collected as garbage when no references exist other than from the intern table itself. It looks like a great idea to use intern() to eliminate all duplicated strings in an application. Many have tried … and many have regretted that decision. I cannot say this is true for every JVM vendor, but if you are using Sun’s JVM, you should never do this. Why?


  •   JVM string intern tables are stored in PermGen -- a separate region of the heap (not included in -Xmx size) used for the JVM’s internal needs. Garbage collection in this area is expensive and size is limited (though configurable).
  •   You would have to insert a new string into the table which has O(n) complexity, where n is table size.
String intern tables in JVMs work perfectly for the JVM’s needs (new entries are added only while loading new classes so insertion time is not a big issue) and it is very compact. But it is completely unsuitable for storing millions of application strings. It just was not designed for such a use case.
Removing duplicates
The JVM’s string intern table is not an option but the idea of eliminating string duplicates is very attractive. What can we do?




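import java.util.HashMap;
import java.util.Map;

public class InternTable<X> {

  // Maps each value to its canonical instance.
  private final Map<X, X> table = new HashMap<X, X>();

  public X intern(X val) {
    X in = table.get(val);
    if (in == null) {
      // First occurrence: val becomes the canonical instance.
      table.put(val, val);
      in = val;
    }
    return in;
  }
}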

Above is a simple snippet of a custom intern table. It can be used for strings or any other immutable objects. Looks good, but it has a problem. Such an implementation will prevent objects from being collected by GC. What can we do?

We need to use weak references. The JDK has a WeakHashMap class, a hash table that uses weak references for keys. Let's try to use it.

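import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.WeakHashMap;

public class InternTable<X> {

  // Keys are weak; values are wrapped so that a value does not pin its own key.
  private final Map<X, WeakReference<X>> table = new WeakHashMap<X, WeakReference<X>>();

  public X intern(X val) {
    WeakReference<X> ref = table.get(val);
    X in = ref == null ? null : ref.get();
    if (in == null) {
      table.put(val, new WeakReference<X>(val));
      in = val;
    }
    return in;
  }
}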

Did you notice we also need a weak reference to wrap the value? This will work, but what is the cost?
• Reference from hash table – 1 ref * ratio of unused slots (~6 bytes)
• WeakHashMap$Entry object – 40 bytes
• value WeakReference – 24 bytes

The cost per entry in the table is about 70 bytes. That’s expensive. Can we reduce it?

Right now we have to have two weak references per entry. If we rewrite WeakHashMap to return an entry from the get(...) method (instead of the value), we can drop the second weak reference and save 24 bytes. But the cost of such an intern table is still high. You should analyze/experiment with your data to see if such a trick will bring greater benefits in your case.

UTF8 strings
Java strings are Unicode and are encoded as UTF-16 in memory. If they are converted to UTF-8, the actual text is likely to consume about half the memory (for mostly ASCII text). But using UTF-8 breaks compatibility with the String class. You have to create your own class and convert it to a standard String for the majority of operations. This will increase CPU usage, but in some edge cases this approach can be useful for overcoming heap size limitations (e.g., storing large text indexes in memory).
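A minimal sketch of such a class is shown below. The name Utf8String and its small API are hypothetical, and StandardCharsets requires Java 7 or later. It keeps only a byte array and converts back to String on demand, which is where the extra CPU cost comes from.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8String {
    private final byte[] bytes; // UTF-8 encoded text, 1 byte per ASCII character

    Utf8String(String s) {
        this.bytes = s.getBytes(StandardCharsets.UTF_8);
    }

    /** Number of bytes used for the text itself (excluding object headers). */
    int byteLength() {
        return bytes.length;
    }

    /** Decodes back to a regular String; this allocates a new object on every call. */
    @Override
    public String toString() {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof Utf8String && Arrays.equals(bytes, ((Utf8String) o).bytes);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(bytes);
    }

    public static void main(String[] args) {
        Utf8String s = new Utf8String("hello");
        System.out.println(s.byteLength() + " bytes as UTF-8 vs "
                + 2 * "hello".length() + " bytes as UTF-16");
    }
}
```

Because equals and hashCode work on the raw bytes, instances can be used as map keys without ever decoding them.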

Maps and sets

Good old java.util.HashMap is used everywhere. The standard JDK implementation uses an open hash table (separate chaining).






References to key and value are stored in the Entry object (which also keeps the hash for the key for faster access). If several keys are mapped to same hash slot, they are stored as a list of interlinked entries. The size of each entry structure: object header + 3 references + int = 24 bytes. java.util.HashSet is using java.util.HashMap under the hood so your overhead will be the same. Can we store map/set in more compact form? Sure.

Sorted array map
If the keys are comparable, we can store the map or set as sorted arrays of interleaved key/value pairs (array of keys in the case of a set). Binary search can be used for fast lookups in an array, but insertion and deletion of entries will have to shift elements. Due to the high cost of updates in such a structure, it will be effective only for smaller collection sizes, and for operation patterns that are mostly read. Fortunately, this is usually what we have -- map/set of a few dozen objects that are read more often than modified.
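Here is one way this could look: a deliberately small sketch with made-up names, not a drop-in java.util.Map implementation. Keys and values are interleaved in a single Object[] that is kept sorted by key, so the only per-entry cost is two references.

```java
final class SortedArrayMap<K extends Comparable<K>, V> {
    // Layout: key0, value0, key1, value1, ... sorted by key.
    private Object[] data = new Object[0];

    int size() {
        return data.length / 2;
    }

    /** Binary search over keys; returns the pair index, or -(insertionPoint + 1) if absent. */
    @SuppressWarnings("unchecked")
    private int find(K key) {
        int lo = 0;
        int hi = size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int c = ((K) data[2 * mid]).compareTo(key);
            if (c < 0) {
                lo = mid + 1;
            } else if (c > 0) {
                hi = mid - 1;
            } else {
                return mid;
            }
        }
        return -(lo + 1);
    }

    @SuppressWarnings("unchecked")
    V get(K key) {
        int i = find(key);
        return i < 0 ? null : (V) data[2 * i + 1];
    }

    void put(K key, V value) {
        int i = find(key);
        if (i >= 0) {
            data[2 * i + 1] = value; // overwrite the existing value in place
            return;
        }
        int at = -(i + 1);
        // Insertion copies the array and shifts the tail: O(n) per update.
        Object[] grown = new Object[data.length + 2];
        System.arraycopy(data, 0, grown, 0, 2 * at);
        grown[2 * at] = key;
        grown[2 * at + 1] = value;
        System.arraycopy(data, 2 * at, grown, 2 * at + 2, data.length - 2 * at);
        data = grown;
    }
}
```

Growing the array by exactly one pair on each insert keeps the structure as compact as possible, at the price of a full copy per update. That matches the read-mostly usage pattern described above.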

Closed hash table
If we have a collection of a larger size, we should use a hashtable. Can we make the hashtable more compact? Again, the answer is yes. There are data structures called closed hashtables that do not require entry objects.






In closed hashtables, references from the hashtable point directly to an object (key). What if we want to put in a reference to a key but the hash slot is already occupied? In that case, we find another slot (e.g., the next one). How do you look up a key? Search through adjacent slots until the key or a null reference is found. As you can see from the algorithm, it is very important to have enough empty slots in your table. The density of a closed hash table should be kept below 0.5 to avoid performance degradation.
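The lookup and insert logic can be sketched as a small set with linear probing. The class name, the initial capacity of 16 and the hash-spreading step are illustrative choices. Removal is left out on purpose, since it needs extra care (e.g., tombstones) in a closed table.

```java
final class ClosedHashSet<E> {
    private Object[] slots = new Object[16]; // length is always a power of two
    private int size;

    private static int indexFor(Object o, int length) {
        int h = o.hashCode();
        h ^= (h >>> 16); // mix high bits into the low bits used for indexing
        return h & (length - 1);
    }

    boolean contains(Object o) {
        if (o == null) {
            return false; // null marks an empty slot, so it is never stored
        }
        int i = indexFor(o, slots.length);
        while (slots[i] != null) {
            if (slots[i].equals(o)) {
                return true;
            }
            i = (i + 1) & (slots.length - 1); // probe the next slot, wrapping around
        }
        return false;
    }

    boolean add(E e) {
        if (e == null) {
            throw new NullPointerException("null marks an empty slot");
        }
        if ((size + 1) * 2 > slots.length) {
            resize(); // keep density at or below 0.5
        }
        if (!insert(slots, e)) {
            return false;
        }
        size++;
        return true;
    }

    /** Places e into the table unless an equal element is already present. */
    private static boolean insert(Object[] table, Object e) {
        int i = indexFor(e, table.length);
        while (table[i] != null) {
            if (table[i].equals(e)) {
                return false;
            }
            i = (i + 1) & (table.length - 1);
        }
        table[i] = e;
        return true;
    }

    private void resize() {
        Object[] bigger = new Object[slots.length * 2];
        for (Object o : slots) {
            if (o != null) {
                insert(bigger, o);
            }
        }
        slots = bigger;
    }

    int size() {
        return size;
    }
}
```

Because the density never exceeds 0.5, every probe sequence is guaranteed to reach an empty slot, so both loops terminate.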

Map structures summary

Open hash map/set
Cost per entry:
  • 1 ref * 1.33 (for 0.75 density)
  • Entry object
  • Total: ~30 bytes
+ Better handling hash collisions.
+ Fast access to hash code.
Closed hash map/set
Cost per entry (map):
  • 2 ref * 2 (for 0.5 density)
  • Total: 16 bytes
Cost per entry (set):
  • 1 ref * 2 (for 0.5 density)
  • Total: 8 bytes
- Generally slower than open version.
Sorted array map/set
Cost per entry (map):
  • 2 ref
  • Total: 8 bytes
Cost per entry (set):
  • 1 ref
  • Total: 4 bytes
- For small collection only.
- Expensive update/delete.

The JDK uses an open hashtable structure because in the general case it is the better structure. But when you are desperate to save memory, the other two options are worth considering.

Conclusion
Optimization is an art. There are no magical data structures capable of solving every problem. As you can see, you have to fight for every byte. Memory optimization is a complex process. Remember that you should design your data so each object can be referenced from different collections (instead of having to copy data). It is usually better to use semantically immutable objects because you can easily share them instead of copying them. And from my experience, in a well-designed application, optimization and tuning can reduce memory usage by 30-50%. If you have a very large amount of data, you have to be ready to handle it. At Grid Dynamics, that’s our day-to-day job! So, if you are building a system capable of handling enormous amounts of data, don’t hesitate to ask us for assistance :)


October 26, 2009

Oracle Coherence memory usage, indexes

In previous posts we discussed memory consumption in Oracle Coherence. We loaded 1M DomainObj instances into a cache and looked at memory dumps acquired with the jmap utility. Now we will talk about the memory overhead caused by another powerful feature of Coherence: indexes.

Coherence supports indexes. Indexes are another major memory consumer, and if you are going to use indexes you should plan memory usage for them. We will use a distributed scheme with a local scheme on the back end in this case. An index is created for each of the DomainObjAttrib fields, and the number of unique values of this field should be about 1M/16 (each value of the indexed field should match 16 objects).

<distributed-scheme>
<scheme-name>simple-distributed-scheme</scheme-name>
<backing-map-scheme>
<local-scheme/>
</backing-map-scheme>
<backup-count>0</backup-count>
</distributed-scheme>

First let's look at Coherence 3.5:


  • row 6 (com.tangosol.util.SegmentedHashMap$Entry) - 1M of instances related to main storage and 1062.5k to the index;
  • row 7 ([Lcom.tangosol.util.SegmentedHashMap$Entry;) – 19Mb is related to main storage, the rest to the index;
  • row 12 – 7.2 Mb is related to main storage, the rest to the index.

Internally the index is constructed from 2 maps:
  • Forward map: maps indexed attribute value to a set of cache entries (I suspect that reference to the key is actually stored, but I am not sure);
  • Reverse map: maps cache entry (or its key) to the value of the indexed attribute for this entry.

In our case we have 62,500 unique values of indexed fields.

Now let's compare the memory picture with 3.4:


Rows related to the index are highlighted in yellow. Also:

  • row 6 ([Lcom.tangosol.util.SafeHashMap$Entry;) – 7.2Mb related to main storage and the rest to the index.
An interesting thing to note: in 3.4 there are 1M DomainObjAttrib objects on the heap, but in 3.5 only 62.5k (and we know there are exactly 62.5k unique objects of this kind), so 3.4 is storing duplicates of objects, which 3.5 avoids. You can find more detailed information about this on the Coherence forum.


Summary for Coherence 3.4:
  • Domain objects: 227.3 Mb
  • Overhead: 116.4 Mb
  • Index: 88.5 Mb
  • Other: 6.5 Mb




Summary for Coherence 3.5:
  • Domain objects: 227.3 Mb
  • Overhead: 167 Mb
  • Index: 65 Mb
  • Other: 8.4 Mb



Again, we can see that Coherence 3.5 uses additional data structures to make partition operations more efficient, but, in the case of index, removing duplicated attribute values from the heap makes 3.5 more memory efficient.

Let's find a formula for index size (N number of entries, M number of unique indexed field values)

Coherence 3.4:

  • Forward map
    • 4 * M – reference in hash table
    • 24 * M – SafeHashMap$Entry
    • 16 * M – SafeHashSet
    • 56 * M – SafeHashMap (backend for SafeHashSet)
    • 4 * N – reference in hash table of hash set
    • 24 * N – SafeHashMap$Entry
  • Reverse map
    • 4 * N – references in hash table
    • 24 * N – SafeHashMap$Entry
    • <Size of attribute value> * N – stored attribute value
  • Total index size - 100 * M + 56 * N + N * <Size of attribute value>

In our case the estimated index size for 3.4 should be 100 * 62500 + 56 * 1M + 24000000 = 81.4Mb. The formula assumes a 100% fill ratio of hash tables, so it produces a lower bound for the index size. The actual size will always be a little bigger (unless the hash table fill ratio is greater than 100%, which should never happen).

Coherence 3.5:
  • Forward map – 100 * M (bytes) + size of value objects.
    • 4 * M – reference in hash table
    • 24 * M – SegmentedHashMap$Entry
    • 16 * M – SafeHashSet
    • 56 * M – SafeHashMap (backend for SafeHashSet)
    • 4 * N – reference in hash table of hash set
    • 24 * N –SafeHashMap$Entry
  • Reverse map – 100 * M (bytes) + size of value objects.
    • 4 * N – references in hash table
    • 24 * N – SegmentedHashMap$Entry
    • <Size of attribute value> * M – stored attribute value
  • Total index size - 100 * M + 56 * N + M * <Size of attribute value>
In our case the estimated index size for 3.5 should be 100 * 62500 + 56 * 1M + 1500000 = 60Mb. Again we have a lower bound for the index size.
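The two formulas are easy to turn into a small estimator. The sketch below uses N = 1,000,000 and M = 62,500 from this post. The attribute size A = 24 bytes is inferred from the numbers used above (24,000,000 bytes for 1M values) rather than stated explicitly, and the class and method names are illustrative.

```java
final class IndexSizeEstimate {
    /** Lower bound in bytes for Coherence 3.4: every entry keeps its own attribute value. */
    static long coherence34(long n, long m, long a) {
        return 100 * m + 56 * n + n * a;
    }

    /** Lower bound in bytes for Coherence 3.5: attribute values are stored once per distinct value. */
    static long coherence35(long n, long m, long a) {
        return 100 * m + 56 * n + m * a;
    }

    public static void main(String[] args) {
        long n = 1_000_000; // cache entries
        long m = 62_500;    // distinct indexed values
        long a = 24;        // ASSUMPTION: bytes per attribute value, inferred from the post
        System.out.printf("3.4: %d bytes (%.1f MiB)%n",
                coherence34(n, m, a), coherence34(n, m, a) / 1048576.0);
        System.out.printf("3.5: %d bytes (%.1f MiB)%n",
                coherence35(n, m, a), coherence35(n, m, a) / 1048576.0);
    }
}
```

The only difference between the two versions is the last term (N * A versus M * A), so the saving in 3.5 grows with the number of entries that share each indexed value.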

Summary

Let's put the results of our experiments in a single table. N is the number of entries in the cache, M is the number of distinct values of the indexed field.

Coherence 3.4: >100 * M + 56 * N + N * A
Coherence 3.5: >100 * M + 56 * N + M * A

A is the size of an indexed attribute value; N and M are as defined above.

Because the fill ratio of the hash table will always be below 100%, in practice, the cache will always consume slightly more memory than the value calculated by the formulas above.

Conclusion
When working with Coherence caches, it is always important to remember that both main data and indexes are stored in main memory. Here I tried to give a simple tool for getting a good estimate of memory consumption.
I hope you will find it useful.


October 23, 2009

Oracle Coherence, memory structure of cache

Imagine that you are working on a project involving an in-memory data grid. You have analyzed your requirements and you can see that you need to store 10 million objects in your grid. The next question is: how much physical memory do you need to provide such capacity? I recently faced this question and want to share some findings about how Oracle Coherence (popular in-memory data grid middleware) uses memory.

My approach is very simple. I create a cache, put 1 million objects in it, and then analyze heap memory usage.

Local scheme

Let's start with the local scheme. While it is not often used by itself, a local scheme may serve as a backing map for other schemes.


<local-scheme>
<scheme-name>local-scheme</scheme-name>
</local-scheme>

The memory picture for one million objects is:


(The Sun JDK contains a 'jmap' tool that can display memory usage by objects for a live Java process. It is an extremely handy tool for memory profiling.)


On the diagram we can see our domain objects (DomainObjAttrib, DomainObject, DomainObjKey). We can also see 1M com.tangosol.net.cache.LocalCache$Entry objects. They are part of the hash table implemented in Coherence. If you put the same objects in java.util.HashMap, the picture will be almost the same, but you will see 1M java.util.HashMap$Entry objects instead.

Let's summarize memory consumption:


  • Domain object: 140.6Mb
  • Hash table: 77.3Mb
  • Other: 4.8Mb

In short, we have an overhead of about 77 bytes per cache entry. How is this space used by Coherence?

  • 72 bytes – size of LocalCache$Entry (BTW, the size of HashMap$Entry is just 24 bytes because it does not need to store additional data needed to support eviction and statistics collection);
  • 4 bytes – reference from hash table;
  • Because the hash table fill ratio is always less than 100%, some references are unused. This produces an additional 1.3 bytes per entry in our case.
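The three items above add up to roughly 77 bytes. As a sketch, the per-entry overhead can be modeled as the entry size plus one table reference divided by the fill ratio. The fill ratio of 0.75 used below is an assumption that happens to reproduce the 1.3-byte figure; the post itself only gives the result. The class and method names are illustrative.

```java
final class LocalCacheOverhead {
    /**
     * Estimated bytes of hash table overhead per cache entry.
     * Each entry needs one slot reference, and unused slots are amortized over used ones.
     */
    static double perEntry(int entryBytes, int refBytes, double fillRatio) {
        return entryBytes + refBytes / fillRatio;
    }

    public static void main(String[] args) {
        double fill = 0.75; // ASSUMPTION: not stated in the post
        System.out.printf("LocalCache$Entry: %.1f bytes per entry%n", perEntry(72, 4, fill));
        System.out.printf("HashMap$Entry:    %.1f bytes per entry%n", perEntry(24, 4, fill));
    }
}
```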

Distributed scheme

Now let's switch to a distributed scheme (distributed hash table implementation of Coherence). As in the first case, we will use a simple configuration with a local scheme on the back end.


<distributed-scheme>
<scheme-name>simple-distributed-scheme</scheme-name>
<backing-map-scheme>
<local-scheme/>
</backing-map-scheme>
<backup-count>0</backup-count>
</distributed-scheme>

The memory picture for one million objects is:
Now things look much more complicated. First, there are no domain objects on the heap. That's because in the distributed scheme objects are stored in serialized form. Coherence uses com.tangosol.util.Binary objects to wrap the actual byte arrays (this way they can be stored in the backing map), so we have 2M wrapper objects (one for each key and each value). The next strange thing is SegmentedHashMap, which is not a part of the local scheme, so it should be related to the distributed scheme. In Coherence prior to version 3.5, the data for all partitions was stored in a single backing map instance, and relocating a partition's data from the backing map required a full scan. Version 3.5 introduced a new mechanism: an additional index that remembers a key set for each partition. It improves partition relocation performance drastically but increases memory usage. Finally there is a pack of int[] objects, but this is still a mystery to me.

I think it is worthwhile to compare the 3.5 results with the 3.4 results at this point. With exactly the same configuration, the memory picture for Coherence 3.4 is:
Two things to note:

  1. Size of LocalCache$Entry in 3.4 is 64 bytes instead of 72 in 3.5
  2. There is no SegmentedHashMap-related stuff in 3.4
Let's summarize memory usage.

Coherence 3.4:

  • Domain objects: 227.3Mb
  • Overhead: 116.4Mb
  • Other 11Mb

Coherence 3.5

  • Domain objects: 227.3Mb
  • Overhead: 167Mb
  • Other 8Mb

Conclusion:

  • In the distributed scheme we have 2 additional Binary objects per entry (+48 bytes)
  • Also 3.5 adds additional data structures, which consume about 50 bytes per object
Data structures added in 3.5 are a trade-off to provide better performance in partition-related operations. They have their merits for sure, but it would still be preferable to have an option for turning them off.

External scheme

A local scheme is not the only option to use as a backing map. Coherence also has a so-called external scheme, which can store data off the heap using the BinaryStore plugin (plugins for NIO memory storage and several disk backend storage models are supported out of the box). Let's analyze the distributed scheme with an external scheme (nio memory) as the back end.


<distributed-scheme>
<scheme-name>external-distributed-scheme</scheme-name>
<backing-map-scheme>
<external-scheme>
<nio-memory-manager/>
</external-scheme>
</backing-map-scheme>
<backup-count>0</backup-count>
</distributed-scheme>

The memory picture for one million objects is:


Please keep in mind that we are analyzing only the Java heap and in this case some amount of data is stored out of the heap using direct memory buffers. You may expect that all your business data will be stored out of the heap, but still we can see plenty of binary objects in memory. These binaries are keys. Yes, while values are stored entirely in external storage, keys are stored in the heap (actually they are stored in both the heap and external storage). Coherence 3.4 shows a similar picture, so I will just omit it.


In this case the diagram does not show the full memory picture (only the heap) so you should not compare it directly to the previous cases.

Coherence 3.5:

  • Key duplicates: 32Mb
  • Overhead: 66Mb
  • Other: 18Mb

Conclusion:

  • While the external scheme uses non-heap memory for data storage, it still consumes a considerable amount of heap for keys and other structures.

Replicated cache

Cache configuration:

<replicated-scheme>
<scheme-name>simple-replicated-scheme</scheme-name>
<backing-map-scheme>
<local-scheme/>
</backing-map-scheme>
</replicated-scheme>

The memory picture for one million objects is:
I rarely use a replicated scheme, so the number of additional data structures is a little surprising. As you can see, the replicated scheme is storing plain Java objects (unlike the distributed scheme, which is always operating with serialized blobs).

Memory summary is (for Coherence 3.5)

  • Domain objects: 140.6Mb
  • Overhead: 263.3Mb
  • Other: 7.6Mb



Conclusion

Let's put all results in a simple table with capacity formulae:


Here N is the number of entries and K is the size of the keyset.

Stay tuned
I hope this will help you estimate the memory consumption of your Coherence cluster more precisely. In the next blog post I will describe the structure and overhead of secondary indexes in Oracle Coherence. Stay tuned!




September 18, 2009

Oracle Coherence using POF, without a single line of code

People developing distributed Java applications know the importance of wire formats for objects. Native Java serialization has only one advantage: it is built in. It is relatively slow, not very compact, and has other quirks. Starting with version 3.2, Oracle Coherence offers its own proprietary binary wire format for objects: POF serialization. POF is not only cross-platform, but also much more compact and faster than built-in serialization. Both compactness and speed are extremely important for data grid applications. The only disadvantage of POF is that you have to write custom serialization/deserialization code for each of your mobile objects. Not only domain objects stored in the cache need serializers, but also entry processors, aggregators, etc. The amount of code you have to write may look daunting and force you to stick with built-in Java serialization.

But there is a simple way to get the best of both worlds. In a recent project I implemented a generic POF serializer. It uses reflection and doesn't require any changes in code, although you still need to register classes in coherence-pof-config.xml. Still, it offers the advantages of the POF format: compact object size and performance. While performance is a bit degraded by using reflection (but still much faster than Java serialization), the sizes of serialized objects are similar to those of a handmade POF serializer.

Some people believe that reflection is slow. I have performed simple speed tests between Java reflection, a handmade POF serializer, and a reflection-based POF serializer—a very simple single threaded test doing serialization and deserialization of an object in a loop.

  • Java serialization – 8K ops/sec
  • Handmade POF – 46.4K ops/sec
  • Reflection based POF – 35K ops/sec

Sweet! I got a 4x speed-up and an 8x size reduction just by writing 5 lines in coherence-pof-config.xml. Numbers may differ between applications, but the trend is obvious.

And one more good thing—you can combine a reflection-based POF serializer with handmade ones. You can start with using generic implementation for all objects, and later write a few handmade serializers to reclaim ~24% lost on serialization for hot objects.

If you are interested you can look at the implementation here, available under the Apache 2.0 license.
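To give a feel for the reflection idea without pulling in Coherence, here is a self-contained sketch. It is not the implementation mentioned above and does not use the POF API. It simply walks an object's fields with java.lang.reflect and writes them with DataOutputStream. All names (ReflectiveSerializer, Trade) are made up for illustration, and only int, long and String fields are supported.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.lang.reflect.Constructor;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ReflectiveSerializer {
    static byte[] serialize(Object obj) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            for (Field f : fields(obj.getClass())) {
                Class<?> t = f.getType();
                if (t == int.class) {
                    out.writeInt(f.getInt(obj));
                } else if (t == long.class) {
                    out.writeLong(f.getLong(obj));
                } else if (t == String.class) {
                    String s = (String) f.get(obj);
                    out.writeBoolean(s != null); // presence flag, so null survives the round trip
                    if (s != null) out.writeUTF(s);
                } else {
                    throw new IllegalArgumentException("unsupported field type: " + t);
                }
            }
            out.flush();
            return buf.toByteArray();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    static <T> T deserialize(Class<T> type, byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            Constructor<T> ctor = type.getDeclaredConstructor(); // requires a no-arg constructor
            ctor.setAccessible(true);
            T obj = ctor.newInstance();
            for (Field f : fields(type)) {
                Class<?> t = f.getType();
                if (t == int.class) {
                    f.setInt(obj, in.readInt());
                } else if (t == long.class) {
                    f.setLong(obj, in.readLong());
                } else if (t == String.class) {
                    f.set(obj, in.readBoolean() ? in.readUTF() : null);
                } else {
                    throw new IllegalArgumentException("unsupported field type: " + t);
                }
            }
            return obj;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Instance fields sorted by name, so writer and reader agree on the layout. */
    private static List<Field> fields(Class<?> type) {
        List<Field> result = new ArrayList<Field>();
        for (Field f : type.getDeclaredFields()) {
            int mod = f.getModifiers();
            if (Modifier.isStatic(mod) || Modifier.isTransient(mod) || f.isSynthetic()) continue;
            f.setAccessible(true);
            result.add(f);
        }
        result.sort(Comparator.comparing(Field::getName));
        return result;
    }
}

/** Hypothetical domain object used only for the example. */
class Trade {
    int id;
    long qty;
    String symbol;
}
```

A real serializer would cache the field list per class instead of rebuilding it on every call, which is likely where most of the reflection cost can be recovered.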


August 31, 2009

GoGrid is out of beta

GoGrid is officially out of beta, so now it's time to compare GoGrid with the public cloud market leader, Amazon EC2. Both vendors have relatively stable performance metrics and they both implement the basic functionality of a cloud provider:
  • Programmatic control via REST API
  • Scalable data storage solution: Elastic Block Storage in Amazon AWS and CloudStorage in GoGrid
  • Ability to assign external IP addresses
  • Ability to grant root/administrator credentials
  • Many options to configure the cloud cluster
  • Windows and Linux support
  • Web console for monitoring and management

So far, as cloud providers GoGrid and Amazon EC2 look very similar, but practical experience of using them in real-life projects helps us identify some unique features that can be important for specific applications.

For Amazon EC2 we can highlight the features listed below:
  • We can choose instances with different performance characteristics: Amazon EC2 has a product line that can be scaled by performance
  • Support for multiple OSes, some of them powered by market leaders such as Oracle, Sun, and IBM. Also, the community has made a big commitment to Amazon EC2 resources.
  • User-friendly web console and web help
  • Service popularity: most questions about EC2 can be answered on the web; it's very possible that someone else has already solved the same problem
  • Start-up time for instances based on a popular AMI. It takes from 2 to 4 minutes to invoke an AMI and start the OS.
  • We always have access to instances. Even if we can't get SSH access to an instance, we can look at system logs or reboot the instance using the web console.

Although there are many positive aspects for Amazon EC2, some negative points can be identified as well:
  • There is no SLA for CPU; m1.small performance is very low and unstable
  • Pricing for high performance instances m1.large and m1.xlarge is not attractive
  • No logging for API access
  • There is only one way to build hybrid clouds: use a VPN connection to boxes outside the EC2 cluster.
  • No control over instance deployment. It may happen that instances will be deployed onto one physical box, but for some applications it's critical to have instances deployed on different boxes. We can specify different regions and geozones for this application, but further control would be desirable.
  • Independent namespaces for different regions: each region has its own namespace for keys, AMI, instances, etc.
  • Security management isn’t always logical. For example, we can create an instance without specifying a key pair, which makes it inaccessible via ssh (in those rare cases when we really need it), or we can accidentally access (and even terminate) instances that belong to the same account but were created with other security keys, for example by another development group in the same company.
  • Startup time of a custom AMI: an instance's pending time can be about 10-15 minutes.

Compared to Amazon EC2, GoGrid does not yet have a long list of distinguishing features, but despite its recent entrance to the market, it already has some unique ones:
  • True hybrids: we can build hybrid clouds along with colo servers, which are located on the same network switch as the virtual cloud instances.
  • Logging API usage by a job's history. Actually, we can monitor all requests to GoGrid’s API.
  • Non-transient instances: We can reboot instances without data-loss.
And, as with Amazon EC2, GoGrid has some challenges as well:
  • The web interface needs to be more user-friendly.
  • Using and mounting Cloud Storage is more complex than expected, and it performs like a local disk drive only under low network load.
  • Choice of OS is limited only to Windows, RedHat and CentOS.
  • For cloud instances we can choose only RAM size, not different CPU levels. This point can be partially reconciled by using bare metal as part of the cluster infrastructure.
  • Practical experience with GoGrid's instances shows that maintenance windows happen more often than anticipated.
Obviously, choosing a cloud infrastructure provider for a particular project should be based on determining which of the application's requirements the provider satisfies well and which it does not. Offering hybrid clouds with bare metal boxes can be a unique differentiator, but not all applications really need it. The main promise of cloud computing is 'resources on demand', and Amazon EC2 currently delivers on it better than GoGrid: it provides a wide variety of infrastructure parameters (such as CPU, RAM, and OS) for instances, has more users, and is currently more experienced as an infrastructure cloud provider.

August 25, 2009

Provisioning in Microsoft HPC

Suppose you have five machines under your desk and need to build a small HPC cluster for development out of them, but you are too lazy to do it manually, machine by machine. Or even better -- you are in charge of a large HPC cluster with thousands of nodes spanning multiple racks, installed in an area of over 2000 square meters. Obviously, manual installation is not an option here.

What to do? The answer is simple: use automatic provisioning. In this post, I'll share my recent experience of provisioning with Microsoft Windows HPC Server 2008 (HPC 2008 for short).

Good news: provisioning is an intrinsic feature of HPC 2008, which uses Microsoft Windows Deployment Services (WDS) technology under the hood. WDS builds on PXE (Preboot Execution Environment), which is a "must have" feature of all modern network cards.

This solution completely shields me from the complexity of WDS. Like many MS products, HPC Cluster Manager provides a wizard for creating a template. The template is the central notion of the whole system. It's merely an installation image, bootable over the network, accompanied by a script that performs the additional steps needed to get a node ready to join the cluster.

However, just having an image is rarely sufficient. It lets you install only the basic OS with nothing on top of it, which is usually not what we want. That is where the script comes into play, with its ability to execute (almost) arbitrary OS commands, including executables from network shares. Coding those commands by hand would be somewhere between boring and very boring. Fortunately, HPC Cluster Manager's designers put a lot of effort into making things simple. Just follow the wizard and you'll get a fully operational template in almost no time.

Hey, but what if you wanna add some specifics, something not provided by default? Well, nothing is lost: you can create a default template with the wizard and then run the edit tool from the context menu. In the editor that opens, you can add new commands or delete existing ones. When the template is ready, you invoke another wizard, which controls the installation process. The only thing left to do is to choose a template and power on all the machines you want provisioned with it. Simple.

Finally, if you don't have an installation image, you may create it from the distribution media by means of another wizard, embedded into HPC Cluster Manager. I'm not sure why this feature's needed, because for years the Microsoft installation media has included the installation images, but it might be valuable for image developers.

So far so good, and it sounds like magic, but there are some pitfalls I ran into when playing with MS HPC auto-provisioning.

  • First of all, never ever try to run this kind of installation on a network that is not under your control, above all its DHCP service. HPC needs to add a DHCP record for each new node.

  • Always use multicast mode when you are dealing with more than one node. It significantly reduces the time to provision, although even with multicast it may take a long time in some cases. In one of my experiments, I tried to provision 3 virtual nodes on the same VMware server with 5 additional VMs running. It took about 150 min on my not-so-old server (Supermicro SuperServer 6015B-UR, 2 x Xeon E5430 @ 2.66GHz (quad-core), 16GB RAM).

  • And last but not least: do not use the clusrun utility in your template scripts. Commands from a template are executed on every node for which provisioning is requested. If each of those nodes in turn runs clusrun (whose job is to run its argument command on every node in the cluster), you may get unpredictable results.
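
To get a feeling for the clusrun pitfall, here is a rough back-of-the-envelope sketch in Python. It rests on a simplified model that is my assumption, not documented HPC behaviour: the template runs once on each provisioned node, and a clusrun step inside it re-runs the command on every node in the cluster. The function name, its parameters and the cluster size of 8 are hypothetical.

```python
def command_executions(provisioned_nodes, cluster_size, uses_clusrun):
    """Count how many times a template command runs in total.

    Simplified model (an assumption, not documented HPC behaviour):
    the template runs once on each provisioned node, and a clusrun
    step re-runs the command on every node in the cluster.
    """
    if provisioned_nodes < 0 or cluster_size < 0:
        raise ValueError("node counts must be non-negative")
    per_template_run = cluster_size if uses_clusrun else 1
    return provisioned_nodes * per_template_run


# Hypothetical numbers: 3 nodes provisioned into a cluster of 8 nodes.
plain = command_executions(3, 8, uses_clusrun=False)
fanned_out = command_executions(3, 8, uses_clusrun=True)
print(plain, fanned_out)
```

In a real run the set of nodes that clusrun can reach changes while provisioning is still in progress, so the actual count is not even this regular, which is one way to end up with unpredictable results.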

All in all, Microsoft HPC implements PXE-style deployment in a comfortable way, letting you do the job without serious troubleshooting.

Enjoy!

August 19, 2009

GoGrid Management Tools and Library

When we started working with GoGrid, we began using the web interface at my.gogrid.com. It's done pretty well -- powerful and user-friendly. However, after some time we decided to implement our own set of tools (fortunately, GoGrid provides a well-designed API) for the following reasons:
  • While the UI is good, it has no facilities for automating things. For instance, if you need to create several servers of the same configuration using the web interface, you have to create them all manually, one by one, while with gg tools you can do something like this:

    for i in `seq 5`; do gg-server-add -i 10 -r 2GB -n server$i; done

    This creates 5 servers named server1, server2, ..., server5, all with the same configuration.

  • gg tools let you use all the power of the Unix command line. For example, if you need the password for a server with IP, say, 192.168.0.1, you can do:

    gg-password | grep 192.168.0.1

    and you will see the password for the required server right on your screen instead of searching in a long list of passwords and servers in the UI.

  • With the GoGrid UI, you have to remember your login and password. If you're using gg tools, you configure them once and then don't have to type logins and passwords all the time. If you have them installed on a *nix box with ssh access, you can use them from anywhere you have an ssh client configured.
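
The two command-line examples above can also be scripted. The following Python sketch builds the same gg-server-add invocations as the shell loop (the -i, -r and -n flags are copied from it) and filters password output by exact IP. The function names are hypothetical, and the layout of gg-password output (whitespace-separated fields, one server per line) is an assumption, as is the made-up sample data. One reason to match whole fields: a plain grep for 192.168.0.1 would also match 192.168.0.10.

```python
import ipaddress


def server_add_commands(count, image_id="10", ram="2GB", prefix="server"):
    """Build gg-server-add argument lists, one per server.

    The -i, -r and -n flags are copied from the shell example; nothing
    is executed here, the lists are only constructed.
    """
    return [
        ["gg-server-add", "-i", image_id, "-r", ram, "-n", f"{prefix}{i}"]
        for i in range(1, count + 1)
    ]


def lines_for_ip(output, ip):
    """Return the lines of `output` that contain `ip` as a whole field.

    Assumes whitespace-separated fields (the real gg-password layout
    may differ). Unlike a substring match, 192.168.0.1 does not match
    192.168.0.10.
    """
    wanted = ipaddress.ip_address(ip)
    matches = []
    for line in output.splitlines():
        for field in line.split():
            try:
                if ipaddress.ip_address(field) == wanted:
                    matches.append(line)
                    break
            except ValueError:
                continue  # this field is not an IP address
    return matches


# Made-up sample output, only to exercise the filter.
sample = "server1 192.168.0.1 secret-a\nserver2 192.168.0.10 secret-b"

for argv in server_add_commands(5):
    print(" ".join(argv))
print(lines_for_ip(sample, "192.168.0.1"))
```

Each argument list could be handed to subprocess.run to actually create the servers. Building the lists first keeps the sketch free of side effects and easy to check.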
Initially, gg tools started as a few Python scripts to start, bootstrap and shut down servers. As we worked with these scripts, they grew pretty quickly, and we decided it would make sense to split off the bootstrap functionality and provide general tools for GoGrid management. Later on it turned out it would make sense to reorganize the code one more time and split out a general library that could be used not only by gg tools, but by any project that needs to control GoGrid.

As a result, we now have two pieces: GoGridManager, a Python module that supports almost all operations provided by the GoGrid API, and gg tools, a set of CLI tools built on GoGridManager.

GoGridManager is also quite easy to use. Here's a basic example:

(11:53) novel@hybrid:~/zf0 %> python
Python 2.5.2 (r252:60911, Oct 5 2008, 19:24:49)
[GCC 4.3.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from GoGridManager import GoGridManager
>>> manager = GoGridManager()
>>> server = manager.get_servers()[0]
>>> print server
server my_go_grid (id = 28940, descr = , state = Started, ip = 173.1.54.221)
>>>
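
If all you have is the printed form of a server, as in the session above, it can be turned back into structured data. The sketch below is not part of GoGridManager: the function name is hypothetical, and the pattern is inferred from the single line of output shown above, so other servers may print differently. The server objects themselves most likely expose this data directly, which would be the better route when available.

```python
import re

# Pattern inferred from the one printed line above; an assumption,
# not a documented format.
SERVER_LINE = re.compile(
    r"^server (?P<name>\S+) \("
    r"id = (?P<id>\d+), "
    r"descr = (?P<descr>.*?), "
    r"state = (?P<state>\w+), "
    r"ip = (?P<ip>[\d.]+)\)$"
)


def parse_server_line(line):
    """Parse a printed server line into a dict, or return None."""
    match = SERVER_LINE.match(line.strip())
    if match is None:
        return None
    info = match.groupdict()
    info["id"] = int(info["id"])  # the id is numeric in the sample
    return info


line = "server my_go_grid (id = 28940, descr = , state = Started, ip = 173.1.54.221)"
info = parse_server_line(line)
print(info["name"], info["state"], info["ip"])
```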


We have been working with GoGrid for almost half a year now, and for many months gg tools have been an everyday tool for us. Recently they were rewritten to use XML responses from the API, and MyGSI support has been added. However, there is a lot of work left to do. For example, Load Balancer API calls are not implemented yet, as we haven't used them so far, and Job and Billing support could be better.

Check GG Tools out on http://github.com/novel/gg/tree/master
Enjoy!

Further reading:

http://wiki.gogrid.com/wiki/index.php/API
http://wiki.gogrid.com/wiki/index.php/MyGSI
