Apache Ignite - IGFS setup
I took a more practical road this week based on my converation with Leo around question of whether non-Java applications are able to access Ignite’s in-memory file system directly, without the language-specific API’s provided by Ignite, which would cut out the need for named pipes to connect our Spark applications to our legacy applications. I had a straightforward goal: Add a file to IGFS from a Java application and go grab it directly from my terminal. Simple, haha!
Chapter 5 of my Bhuiyan et al. text runs through the three main components used by Ignite to speed up conventional Hadoop. The Hadoop Accelerator consists of 1) new implementation of MapReduce, 2) new HDFS-compliant file system in memory (IGFS), and 3) secondary filesystem that serves as secondary HDFS cache. The chapter sets up a pseudo-distributed Ignite cluster and then walks through demonstrations of each of these components.
Before checking the book, I had already tried using Hadoop 2.9 and Ignite 2.3 w/ Hadoop Accelerator to set up the pseudo-distributed system on my laptop. That went well enough, so I tried to go straight to using Ignite as an HDFS cache from online documentation. I wasn’t really sure when I started whether the IGFS as independent file system or as HDFS cache was what we were aiming for. I knew that IGFS as a file system breaks free of HDFS but seemed to me to require using one of the language-specific APIs to access. Using Ignite as an HDFS cache, on the other hand, seemed to be well suited for grabbing files from memory in the usual hdfs:// way. I tried to work my way through the online docs using the Spark shell and programatically configuring the secondary Ignite cache in Scala, because I thought this approach lead me to the goal much faster than using Spring configuration files. I was able to work with Ignite from the Spark shell as I had done before with my previous applications, but I wasn’t able to solve networking configuration issues that were preventing Hadoop from using Ignite as a cache. I decided to take a step back and walk through Ch. 5 of my book. The book’s description of the three components of the Hadoop Accelerator were much clearer than what I found online, and I hoped that following the tutorials would provide as much clarity on the configuration confusion.
Chapter 5 first walks through the MapReduce speedup component of the Accelerator and plug-and-play compatibility with existing tools like Pig and Hive. The Ignite MapReduce engine was definitely faster than Hadoop’s MapReduce with the WordCount application examples even in my pseudo-cluster. The second component handled in the chapter is the independent IGFS. I got a long way toward setting up IGFS but am currently stuck. I copied the IngestFileInIGFS application that runs just fine, but I am getting a NPE error that I think is related to a known bug in Ignite 2.1 and 2.3. I’m going to try rolling back to 2.2 binary to see if that solves that problem. I’ve pushed my progress to our lab repo along with my current logfile with an error related to javax-cache, but I’m certain that I’ve included the correct dependency in my project. There is another outstanding issue related to the timeouts I am also getting that points to a fix in 2.4, so I’m also going to try building 2.4 to fix this error.
In the mean time, I skipped past IGFS to the third component that uses Ignite as an HDFS cache, and it looks like I had been a few xml lines away from having this up and running before. It works really well - I don’t really see a speed-up in comparing the conventional versus the cache-accelerated wordcount jobs, but I see from Hadoop’s and Ignite’s output that Ignite is propertly engaged in the process. I expect that I will see performance improvements here on a distributed system.
Long story short, I believe that I’m down to minor IGFS configuration issues that once solved will allow us to access IGFS in exactly the same way as we can access HDFS. Providing Hadoop with alternative configuration files that point it to Ignite allows us to simply add a flag while running hdfs commands to use IGFS instead of HDFS. I think that once the system is configured properly, we will have an easy way for legacy applications to grab in-memory files directly from the OS.
I still have some conceptual to-do’s from the last post that I’m adding to below:
To do
A question here is whether Ignite is better suited for row-based or column-based data. I also should take a glance at VoltDB and MemSQL here.
Regarding OLTP capability, take a look at SAP HANA and MonetDB. Is the trend toward HTAP systems that can do it all? Be sure to understand tradeoffs of having one system that can do all the processing moderately well versus two systems that do their respective processing extremely well.
Access IGFS file directly from terminal.