Apache Ignite

03 Oct 2017

Apache Ignite

I used Apache Ignite for our most recent conference paper to share a Spark RDD across multiple Spark sessions, and now it’s time to really dig into Ignite and figure out what it is and how it works. Here are some notes that I can reference later.

Overview

Ignite is a scalable off-heap in-memory platform for ACID-compliant relational/NoSQL database or key-value data grid functionality. Ignite isn’t a fully-fledged RDBMS, because it eliminates costly broadcast messages, like those required by foreign key constraints, in favor of more efficient scalability. When distributing data, Ignite assigns each node as the primary for a certain partition of the data as well as a backup for a different partition. There’s no master mode that manages failures and such, but rather the nodes are all equal peers and the cluster self heals.

Ignite is sold by GridGain as an Enterprise version that competes with other HTAPs (Dinsmore 2016). HTAP stands for Hybrid Transactional-Analytical Platform. There are a number of HTAP in-memory storage products right now, including Oracle and SQLServer Enterprise editions, that serve up in-memory databases that are fast and cost a lot of money. HTAP and NewSQL are new to me, so here’s a brief tangent to summarize. In conventional systems (ie OldSQL), transactional databases that handle individual transactions quickly and reliably are separated by an ETL layer from the sorts of routine business intelligence / data science analyses that aggregate the data and take up a lot of system resources (Stonebraker 2012). Stonebraker discusses broadly the diverse range of New SQL systems that are being pursued in response to the increasing pace and scale of online transactions and analytics and that also provide the OldSQL ACID properties that NoSQL systems tend to relax. In-memory NewSQL systems like MemSQL, VoltDB, and Ignite are potentially fast enough and ACID enough to bridge this OldSQL-NoSQL / transactional-analytical gap (Hrubaru and Fotache 2017).

IGFS

The Ignite File System (IGFS) works like HDFS but in memory and, despite living in memory, can persist data for as long as is needed. In cases where data don’t all fit in memory, the IGFS can be used as a caching layer for the data stored on disk, but this sort of misses the real advantage of Ignite. Ignite serves as a fully-fledged in-memory storage system, so we’d much rather horizontally scale our memory than simply use Ignite as cache. In a cluster, data are collocated, and persisted data will last through a full cluster reboot.

Ignite v. Spark

IgniteRDD’s are implementations of Spark RDDs that are faster due to the in-memory key/value indexing. Like Spark, Ignite can be deployed in a variety of environments. As an in-memory processing and storage platform, Ignite is in some ways faster than Spark. The IgniteRDD is like a Spark RDD with an added in-memory key-value index that speeds up many of the transformations and/or actions that you apply to it in addition to speeding up SQL queries from linear to log time. On the other hand, Ignite can also complement Spark’s features like Streaming (Magda 2017). I’ve been most interested in using Ignite as a data grid which has been discussed above, but Ignite also offers a compute grid that, like Spark. has an in-memory implementation of MapReduce as well as a machine learning library.

Questions

  1. I’m not sure how Ignite’s partitioning strategy provides enough replication to compete with the reliability of other systems. Each node is responsible for a primary partition as well as a backup partition - is this how HDFS works? I should go back to 604 notes and compare various replication strategies to see where Ignite fits in on the push/pull of reliability/performance.
  2. Data structures tuned specifically for RAM that will increase system performance might be row-based, as in MemSQL, which allows new records to be added easily, or columnar as in Oracle and ADAM’s Parquet, which reduces the amount of unneeded data retrieved from records and also allows better compression (Hrubaru and Fotache 2017). I believe ADAM chose the columnar Parquet format chiefly for its compressibility. Is Ignite happy to store either or is there some way in which it would favor row-based or column-based formats?
  3. Ignite seems to be taking off, how are VoltDB and MemSQL doing?

References

Dinsmore, T. 2016. In-Memory Analytics: Satisfying the Need for Speed. Disruptive Analytics, Chapter 5.

Hrubaru I, Fotache M. 2017. On the performance of three in-memory data systems for on line analytical processing. Informatica Economica 21(1):5-15.

Magda D. Apache Ignite and Apache Spark: Where fast data meets the IoT. 2017. GridGain presentation - https://www.slideshare.net/databricks/apache-spark-and-apache-ignite-where-fast-data-meets-the-iot-with-denis-magda

Stonebraker M. 2012. New Opportunities for New SQL. Communications of the ACM 55(11):10-11