DataGridFS - let your legacy code store data into IMDG
One of the most discussed grid computing topics here at Grid Dynamics is bringing Computational Grids and In-Memory Data Grids (IMDGs) together. The DataGridFS concept was born in these discussions.
First of all, let me describe where it applies most efficiently. Many systems are built around a Computational Grid in which jobs produce files as the result of their computations. These files are stored on NFS, so once the share is mounted they are accessible locally on every node and client. Sun Grid Engine and Platform LSF are examples of such Computational Grids. Imagine that such a system needs an enhancement, and that the enhancement is based on bringing an IMDG into the system. The codebase will then consist of legacy code, which saves its results to files, and modern code, which uses the IMDG API both to get initial values and to store results. Moreover, these two parts have to communicate with each other somehow. For example, results calculated by legacy code may be the initial values for jobs in the new codebase. Clearly, we need a way to let the new jobs process results that are stored in files on the filesystem.
There are several ways to achieve that. One is to build a daemon process which scans the results directory on the filesystem, parses the files, and stores the data in the IMDG using its API. This approach has a few drawbacks. At the very least, you have to make sure that this process is always alive, and so on. The second way is to use DataGridFS.
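To make the first option concrete, here is a minimal sketch of a single pass of such a scanning daemon. It is only an illustration under stated assumptions: `DataGrid`, `InMemoryGrid` and `ResultScanner` are hypothetical names, the grid is faked with an in-memory map rather than a real IMDG API, and the file content is stored as raw text because the parsing step depends entirely on the result format.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for an IMDG client; a real grid exposes its own API. */
interface DataGrid {
    void put(String key, String value);
    String get(String key);
    boolean containsKey(String key);
}

/** In-memory fake so that the sketch runs without any grid product. */
class InMemoryGrid implements DataGrid {
    private final Map<String, String> store = new ConcurrentHashMap<>();
    public void put(String key, String value) { store.put(key, value); }
    public String get(String key) { return store.get(key); }
    public boolean containsKey(String key) { return store.containsKey(key); }
}

/** One pass of the scanning daemon: load every not-yet-seen result file into the grid. */
class ResultScanner {
    private final Path resultsDir;
    private final DataGrid grid;

    ResultScanner(Path resultsDir, DataGrid grid) {
        this.resultsDir = resultsDir;
        this.grid = grid;
    }

    /** Returns the number of files loaded during this pass. */
    int scanOnce() throws IOException {
        int loaded = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(resultsDir)) {
            for (Path file : files) {
                String key = file.getFileName().toString();
                // Skip subdirectories and files that are already in the grid.
                if (!Files.isRegularFile(file) || grid.containsKey(key)) {
                    continue;
                }
                // Assumption: results are small UTF-8 text files; real parsing would go here.
                String content = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                grid.put(key, content);
                loaded++;
            }
        }
        return loaded;
    }
}
```

A real daemon would call `scanOnce()` in a loop or on a timer, and something would have to supervise and restart it, which is exactly the operational burden mentioned above.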
DataGridFS is a filesystem which transparently stores files in the IMDG itself, so it inherits most of the features of the underlying IMDG. For example, if the IMDG supports partitioning, the filesystem automatically becomes distributed, and so on. When a job process writes a file to the filesystem, an object (or a set of objects) appears in the IMDG. This object can be accessed using the regular IMDG API or the DataGridFS API, which is just a set of wrappers around IMDG API calls. "What about the file content?", you may ask. "Does it still have to be parsed into an object model?" Yes, but this can be done by workers inside the IMDG, which is more convenient and flexible than a separate process.
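The mapping from a file to a set of grid objects can be sketched as follows. This is not the prototype's code: `GridFileStore`, the `path#index` key scheme and the chunk size are assumptions made for illustration, and a plain `Map` stands in for the IMDG so the example runs on its own.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical file-over-grid wrapper; the Map passed in stands in for the IMDG. */
class GridFileStore {
    private final Map<String, byte[]> grid;
    private final int chunkSize;

    GridFileStore(Map<String, byte[]> grid, int chunkSize) {
        this.grid = grid;
        this.chunkSize = chunkSize;
    }

    private static String chunkKey(String path, int index) { return path + "#" + index; }
    private static String metaKey(String path) { return path + "#count"; }

    /** Number of chunks currently recorded for the path, or 0 if the file is unknown. */
    private int chunkCount(String path) {
        byte[] meta = grid.get(metaKey(path));
        return meta == null ? 0 : Integer.parseInt(new String(meta, StandardCharsets.UTF_8));
    }

    /** Splits the file into fixed-size chunks; each chunk becomes one grid entry. */
    void write(String path, byte[] data) {
        int oldChunks = chunkCount(path);
        int chunks = (data.length + chunkSize - 1) / chunkSize;
        for (int i = 0; i < chunks; i++) {
            int from = i * chunkSize;
            int to = Math.min(from + chunkSize, data.length);
            grid.put(chunkKey(path, i), Arrays.copyOfRange(data, from, to));
        }
        // Drop chunks left over from a longer previous version of the file.
        for (int i = chunks; i < oldChunks; i++) {
            grid.remove(chunkKey(path, i));
        }
        grid.put(metaKey(path), Integer.toString(chunks).getBytes(StandardCharsets.UTF_8));
    }

    /** Reassembles the file from its chunks; returns null if the path is unknown. */
    byte[] read(String path) {
        if (!grid.containsKey(metaKey(path))) {
            return null;
        }
        int chunks = chunkCount(path);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < chunks; i++) {
            byte[] chunk = grid.get(chunkKey(path, i));
            out.write(chunk, 0, chunk.length);
        }
        return out.toByteArray();
    }
}
```

The same entries stay reachable through the underlying map, which mirrors the point that a file written through the filesystem can also be read through the regular IMDG API. In a partitioned grid the individual chunk entries could end up on different nodes.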
I've managed to code a very simple prototype to illustrate the idea. It is just a read-only filesystem built using FUSE and FUSE-J. File data is stored in GigaSpaces XAP. In the screenshot you can see the space content via the Space Browser, and the same content via UNIX command-line utilities.

Labels: convergence, filesystems, gigaspaces, grid computing, ~Kirill Uvaev