April 3, 2008

Binary Calculator Project - What is the footprint of my GigaSpaces Entries?

In the early stages of any data grid project, it is essential to get good estimates of memory requirements. How much memory will my domain objects consume once they are converted to Entries, and what will the indexing overhead be? The answer determines what JVM heap size to choose and how many JVMs will be needed to store the intended dataset, or how much data can be stored within a given hardware footprint.

For GigaSpaces, it is hard to provide accurate theoretical estimates of your Entry size. Here are the reasons:
  • The space doesn't store entries as heap objects; they are stored decomposed into fields
  • A String UID is generated and stored along with each entry
  • Index overhead depends on the type of index (ordered or unordered) and on the cardinality of the indexed field's data
For these reasons it is far more practical to measure entry footprint experimentally than to try to apply a theoretical formula. The goal of the Binary Calculator project is to build a convenient toolkit for this problem. The main idea is quite simple:

First, collect the basic statistics on the efficiency of storage of the individual entities:
  1. Connect to the remote space
  2. Get a batch of test entries from an entry source
  3. Write the batch to the remote space
  4. Perform a remote garbage collection
  5. Measure memory usage
  6. Repeat from step 2
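
As a rough illustration, the loop above might be sketched in Java as follows. MeasuredSpace and LocalHeapSpace are hypothetical names invented for this sketch, not GigaSpaces APIs. The stand-in keeps entries on the local heap and reads local JVM memory, whereas the real tool works against a remote space.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical abstraction over the space being measured (not a GigaSpaces API). */
interface MeasuredSpace {
    void writeBatch(List<Object> entries);
    void forceGc();
    long usedMemoryBytes();
}

/** Local stand-in for a remote space: stores entries on this JVM's heap. */
final class LocalHeapSpace implements MeasuredSpace {
    private final List<Object> stored = new ArrayList<>();

    public void writeBatch(List<Object> entries) {
        stored.addAll(entries);
    }

    public void forceGc() {
        System.gc(); // only a hint to the JVM; a collection is not guaranteed
    }

    public long usedMemoryBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }
}

final class FootprintExperiment {
    /** Returns one data point per iteration: {entriesWritten, memoryUsedBytes}. */
    static List<long[]> run(MeasuredSpace space,
                            Supplier<List<Object>> batchSource,
                            int iterations) {
        List<long[]> points = new ArrayList<>();
        long written = 0;
        for (int i = 0; i < iterations; i++) {
            List<Object> batch = batchSource.get(); // step 2: get a batch
            space.writeBatch(batch);                // step 3: write it
            written += batch.size();
            space.forceGc();                        // step 4: request GC
            points.add(new long[] { written, space.usedMemoryBytes() }); // step 5
        }
        return points;
    }
}
```

The memory readings from a local JVM are noisy, so this stand-in is only useful for trying out the loop itself.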
After a sufficient number of iterations, we have a set of data points of the form (entriesWritten, memoryUsage). A linear approximation (e.g. a least-squares linear fit) then gives an estimate of the footprint of a single entry, including index overhead, as the slope of the fitted line.
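
To make the fitting step concrete, here is a minimal sketch of an ordinary least-squares fit in Java. The class name FootprintFit and the sample numbers are made up for illustration; they are not measurements and not part of the Binary Calculator code. The slope is the estimated bytes per entry, and the intercept is the memory the space uses before any entries are written.

```java
final class FootprintFit {
    /** Returns {slope, intercept} of the least-squares line y = slope * x + intercept. */
    static double[] fit(long[] entriesWritten, long[] memoryUsedBytes) {
        int n = entriesWritten.length;
        if (n < 2 || n != memoryUsedBytes.length) {
            throw new IllegalArgumentException("need at least two matching data points");
        }
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += entriesWritten[i];
            meanY += memoryUsedBytes[i];
        }
        meanX /= n;
        meanY /= n;

        double sxy = 0, sxx = 0;
        for (int i = 0; i < n; i++) {
            double dx = entriesWritten[i] - meanX;
            sxy += dx * (memoryUsedBytes[i] - meanY);
            sxx += dx * dx;
        }
        if (sxx == 0) {
            throw new IllegalArgumentException("entry counts must not all be equal");
        }
        double slope = sxy / sxx;
        return new double[] { slope, meanY - slope * meanX };
    }

    public static void main(String[] args) {
        // Invented sample points: (entriesWritten, memoryUsedBytes).
        long[] entries = { 1000, 2000, 3000, 4000 };
        long[] bytes = { 50_300_000L, 50_600_000L, 50_900_000L, 51_200_000L };
        double[] line = fit(entries, bytes);
        System.out.printf("bytes per entry: %.1f, baseline bytes: %.0f%n", line[0], line[1]);
    }
}
```

With real measurements the points will not lie exactly on a line, which is why a fit over many iterations is preferable to dividing a single memory reading by the entry count.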

Implementing this idea, we have built an initial version of the Binary Calculator, which can be used as a toolkit for measuring arbitrary entry footprint. It has a very simple GUI that shows progress of the memory experiment.


We are planning to turn this simple toolkit into a much more powerful tool, which will generate entries on the fly, based on user-supplied metadata. This way, the user can specify an Entry Description as a simple table in a GUI:

Type      Indexed   Number of fields   Avg Length
Long      Yes       1                  N/A
String    No        3                  1000
String    Yes       2                  5000
Integer   No        3                  N/A

BinaryCalculator will generate Entries at runtime based on this description, populate them with random data, run the memory experiments, and show the estimated entry size.
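
One way such a generator could work is sketched below. FieldSpec and RandomEntryGenerator are hypothetical names, and representing an entry as a field-name to value map is a simplification chosen for this example; the planned tool may well generate real Entry classes instead. The sketch also treats Avg Length as an exact length.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** One row of the Entry Description table (hypothetical representation). */
final class FieldSpec {
    final Class<?> type;
    final boolean indexed; // not used here; it would matter when registering the type
    final int count;
    final int avgLength;   // only meaningful for String fields

    FieldSpec(Class<?> type, boolean indexed, int count, int avgLength) {
        this.type = type;
        this.indexed = indexed;
        this.count = count;
        this.avgLength = avgLength;
    }
}

final class RandomEntryGenerator {
    private final List<FieldSpec> specs;
    private final Random random;

    RandomEntryGenerator(List<FieldSpec> specs, long seed) {
        this.specs = specs;
        this.random = new Random(seed);
    }

    /** Builds one entry as an ordered map of field name to random value. */
    Map<String, Object> next() {
        Map<String, Object> entry = new LinkedHashMap<>();
        int fieldNo = 0;
        for (FieldSpec spec : specs) {
            for (int i = 0; i < spec.count; i++) {
                entry.put("field" + fieldNo++, randomValue(spec));
            }
        }
        return entry;
    }

    private Object randomValue(FieldSpec spec) {
        if (spec.type == Long.class) {
            return random.nextLong();
        }
        if (spec.type == Integer.class) {
            return random.nextInt();
        }
        if (spec.type == String.class) {
            // Simplification: every string has exactly avgLength characters.
            StringBuilder sb = new StringBuilder(spec.avgLength);
            for (int i = 0; i < spec.avgLength; i++) {
                sb.append((char) ('a' + random.nextInt(26)));
            }
            return sb.toString();
        }
        throw new IllegalArgumentException("Unsupported field type: " + spec.type);
    }
}
```

A description like the example table could then be written as a list such as `new FieldSpec(Long.class, true, 1, 0)` followed by `new FieldSpec(String.class, false, 3, 1000)`, and so on.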

Also, we are planning to build a lightweight plugin system to supply a custom EntrySource, for example, your own random entry generator or a JDBC or Hibernate data source. Consequently, performing full-fledged capacity experiments that load real data from the database will be much easier.
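
The post does not define the EntrySource contract, so the following is only a guess at what a minimal plugin interface could look like. The nextBatch method and the RandomPayloadSource class are assumptions made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical plugin contract; the name comes from the post, the method is assumed. */
interface EntrySource {
    /** Returns up to maxSize entries; an empty list means the source is exhausted. */
    List<Object> nextBatch(int maxSize);
}

/** Example plugin: a bounded stream of random byte[] payloads. */
final class RandomPayloadSource implements EntrySource {
    private final Random random;
    private final int payloadBytes;
    private int remaining;

    RandomPayloadSource(int totalEntries, int payloadBytes, long seed) {
        this.remaining = totalEntries;
        this.payloadBytes = payloadBytes;
        this.random = new Random(seed);
    }

    public List<Object> nextBatch(int maxSize) {
        if (maxSize <= 0) {
            throw new IllegalArgumentException("maxSize must be positive");
        }
        int size = Math.min(maxSize, remaining);
        List<Object> batch = new ArrayList<>(size);
        for (int i = 0; i < size; i++) {
            byte[] payload = new byte[payloadBytes];
            random.nextBytes(payload);
            batch.add(payload);
        }
        remaining -= size;
        return batch;
    }
}
```

A JDBC-backed or Hibernate-backed source could implement the same method by mapping database rows to entries, so the measurement loop would not need to know where the data comes from.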

We hope that this tool will be quite useful for GigaSpaces implementors in the field.


2 Comments:

Blogger Guy Sayar said...

Eugene,

Very nice initiative.

A few comments:
1) Don't carry the GigaSpaces jars with you; rather, ask the user to point to the GigaSpaces install dir. It will allow you to be loosely coupled, and to run against different GigaSpaces versions.

2) What would be good is to have the size implementation at the space process itself. I.e. add your processing bean / filter statistics to collect this data (the user configures which classes to monitor), and then the UI is used only to display the results.

3) Add best practices and suggestions. I.e. help the user implement different size reduction mechanisms, e.g. Externalizable support. See http://www.gigaspaces.com/wiki/display/OLH/Externalizable+Support for some details.

-Guy

April 4, 2008 6:25 AM  
Blogger Eugene said...

Guy,
Thank you for your interest in this project and for the valuable comments. Let me answer them one by one.

1) The first version was built with Ant as an early prototype; the next version has already moved to Maven 2, so this issue is addressed.

2) We thought about this approach and decided not to go that way, as it adds complexity and it was unclear to us what additional value it brings compared to the simple approach of writing a batch, running GC, and getting memory stats through JMX. If you have ideas on that, please share.

Moreover, the current architecture is decoupled from the IMDG implementation and can easily be adapted to measure the capacity of other IMDG products.

3) Absolutely. Besides, managing entry sizes is the scope of another OpenSpaces project we run, PackRat.
We plan to bring the PackRat demo into BinaryCalculator to visually present PackRat's value.

April 7, 2008 3:51 AM  
