IBM has a long history of supporting major open source projects and the most widely adopted open standards. Its enterprise customers have benefited from the flexibility, choice, and innovation that come with the open source philosophy. Major initiatives include SOA (Service-Oriented Architecture), Linux, Eclipse and now Hadoop. The big data analytics open source offering is known as the IBM Open Platform with Apache Hadoop. The commercial side of this platform, announced in early 2015, is a suite of products for the enterprise branded as BigInsights.
To better understand IBM’s big data offerings around Hadoop and its open data platform, it helps to place them in the context of the overall vision for the platform and the three phases of the IBM Big Data Analytics lifecycle:
- Pull in all types of data from disparate sources
- Put the data into a business context
- Produce intelligent, data-driven business outcomes, such as operational efficiency, customer engagement or risk management
IBM endeavors to cover a lot of business territory with its analytics platform. For the enterprise IT department, the technology enables data integration, governance, security and regulatory compliance. For line of business managers, the analytics environment is the home of customer and operational intelligence. While analytics play an important role in increasing operational efficiency and eliminating business process bottlenecks, it is the customer-centric analytics that have captured the imagination of business executives. Big data analytics offers many opportunities for improving customer relationships and increasing engagement across marketing channels.
A common big data use case is delivering relevant promotions to customers. We all share the experience of receiving credit card offers in the mail from the bank and tossing the envelope directly into the recycling bin without even thinking about it. Despite the dismal response rate, it was cost effective for the bank to send the same direct mail piece to everyone. With a big data platform, it is possible to develop customer profiles and create targeted offers for each segment. For example, customers who have a single account and a short customer history would be candidates for a different array of promotions than customers who have been with the bank for decades. The cost of amassing enough data and having the processing power to crunch the numbers in a timely fashion has dropped enough to make it profitable to do so.
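The segmentation idea above can be sketched in a few lines of Python. This is only an illustration: the `Customer` fields, the tenure thresholds, the segment names and the promotions are all hypothetical and are not taken from any IBM product or real bank.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    customer_id: str
    num_accounts: int
    tenure_years: float


# Hypothetical thresholds; a real bank would derive these from its own data.
SHORT_TENURE_YEARS = 2.0
LONG_TENURE_YEARS = 10.0

# Hypothetical promotions for each segment.
PROMOTIONS = {
    "new_single_account": "second-account welcome bonus",
    "long_term": "loyalty rate upgrade",
    "established": "credit card cross-sell",
}


def assign_segment(customer: Customer) -> str:
    """Place a customer in a segment based on account count and tenure."""
    if customer.num_accounts <= 1 and customer.tenure_years < SHORT_TENURE_YEARS:
        return "new_single_account"
    if customer.tenure_years >= LONG_TENURE_YEARS:
        return "long_term"
    return "established"


def promotion_for(customer: Customer) -> str:
    """Look up the promotion that targets the customer's segment."""
    return PROMOTIONS[assign_segment(customer)]


if __name__ == "__main__":
    sample = [
        Customer("c-001", num_accounts=1, tenure_years=0.5),
        Customer("c-002", num_accounts=3, tenure_years=25.0),
        Customer("c-003", num_accounts=2, tenure_years=5.0),
    ]
    for c in sample:
        print(c.customer_id, assign_segment(c), promotion_for(c))
```

In practice the rules would come from analysis of the customer data rather than hand-picked thresholds, but the shape of the logic (profile, then segment, then offer) is the same.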
With digital advertising and social media data, analysis is required on huge amounts of unstructured data. A couple of years ago this was experimental at best, but now Hadoop software enables capturing and processing unprecedented amounts of data. It complements the enterprise data warehouse and is an integral part of the business intelligence ecosystem.
Open data platform ODPi
The ODPi open data platform is a consortium of IBM and 18 other enterprise software vendors working together to maximize the adoption of technologies based on Apache Hadoop. The goal of ODPi is to accelerate software development by providing a standard Hadoop solution on which applications can run, whether they are commercial software, open source, or custom code developed in-house. This gives enterprise customers assurance that they are not locking themselves into a single vendor’s Hadoop solution. It also permits using a Hadoop implementation with products from multiple vendors. For Hadoop to fulfill its role as an enterprise data source, it must accommodate a broad audience who will be using many different applications.
To that end, the ODPi provides a core platform of agreed-upon, tested Apache Hadoop modules. This is the ODPi standard, on which the vendors build their applications. For example, Hortonworks, IBM Open Platform 4.0 with Apache Hadoop, EMC Pivotal HD 3.0 and Infosys IIP all adhere to the ODPi standard. Analytics software vendors and in-house development shops can concentrate on developing applications further up the stack, knowing that the Hadoop core adheres to a standard and that their applications will interoperate with any compliant Hadoop system. This accelerates development, promotes code re-use, and simplifies the technical architecture. Implementing a Hadoop distribution that adheres to the ODPi standard means not being locked into a proprietary technology.
Only time will tell whether the ODPi standard will have a lasting impact. The organization has been criticized as being nothing more than a joint marketing effort for vendors pushing their own commercial flavor of Hadoop. Also notable are the big data vendors that are conspicuous by their absence: Cloudera, MapR and Amazon (AWS EMR, Elastic MapReduce).
IBM BigInsights and Cognos
On top of Hadoop, IBM has developed a suite of big data and analytics tools under the BigInsights brand. There are tools for scaling and managing the platform (BigInsights Enterprise Management), a machine learning engine (BigInsights Data Scientist: Decision Trees, PageRank, Clustering) and a data exploration and discovery tool (BigSheets). Of particular interest to Cognos customers is BigSQL, which runs SQL queries against Hadoop. In other words, BigSQL permits Cognos to use Hadoop as a data source.
This is interesting as data stored in Hadoop only becomes useful when it is put into a business context. Cognos Analytics (V11) is well suited for this role. It is a powerful tool for BI developers and business power users, enabling the presentation of Hadoop data in a visually appealing format for executives, managers and line of business staffers. Big data becomes much more valuable when it can be interpreted and understood by non-technical users.
Cognos supports connecting to Hadoop using Hive, which translates SQL queries into MapReduce jobs to get results from Hadoop. There will always be some latency, as Hive cannot change the nature of MapReduce, which distributes processing work across Hadoop nodes. The query is split into discrete chunks of work and the results are assembled as they are returned. SQL join conditions, which are commonplace in Cognos-generated SQL, create an additional layer of complexity for MapReduce. This further increases the query processing time and will prevent some queries from running at all.
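To see why joins add work, it helps to look at how a simple SQL join can be expressed as map, shuffle and reduce steps. The following is a minimal in-memory sketch of a reduce-side join in plain Python, roughly equivalent to `SELECT c.customer_id, c.name, o.order_id, o.amount FROM customers c JOIN orders o ON c.customer_id = o.customer_id`. The tables, column names and function names are hypothetical, and this is not the plan that Hive actually generates; it only illustrates the general technique.

```python
from collections import defaultdict

# Tiny in-memory stand-ins for two tables (hypothetical data).
CUSTOMERS = [(1, "Alice"), (2, "Bob"), (3, "Carol")]           # (customer_id, name)
ORDERS = [(101, 1, 25.0), (102, 1, 40.0), (103, 3, 15.0)]      # (order_id, customer_id, amount)


def map_customers(rows):
    """Map step: emit each customer keyed by the join column, tagged 'C'."""
    for customer_id, name in rows:
        yield customer_id, ("C", name)


def map_orders(rows):
    """Map step: emit each order keyed by the join column, tagged 'O'."""
    for order_id, customer_id, amount in rows:
        yield customer_id, ("O", (order_id, amount))


def shuffle(pairs):
    """Shuffle step: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_join(key, values):
    """Reduce step: pair every customer record with every order for one key."""
    names = [v for tag, v in values if tag == "C"]
    orders = [v for tag, v in values if tag == "O"]
    for name in names:
        for order_id, amount in orders:
            yield key, name, order_id, amount


def run_join(customers, orders):
    """Run map, shuffle and reduce in sequence and collect the joined rows."""
    pairs = list(map_customers(customers)) + list(map_orders(orders))
    grouped = shuffle(pairs)
    result = []
    for key in sorted(grouped):
        result.extend(reduce_join(key, grouped[key]))
    return result


if __name__ == "__main__":
    for row in run_join(CUSTOMERS, ORDERS):
        print(row)
```

Even in this toy form, a single join requires tagging every row, moving all rows with the same key to the same place and then combining them. On a real cluster that shuffle moves data across the network, and a query with several joins may need a chain of such jobs, which is where much of the added latency comes from.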
IBM addresses these problems with BigSQL. It works on the same Hive metastore, but produces faster and more reliable results. BigSQL is not just about performance, but also about assuring that the SQL query will run. It optimizes SQL for MapReduce so that it runs faster and avoids having to modify the Cognos Framework Manager model or hand-code SQL inside of Cognos. An alternative to Hive and BigSQL is Impala, which makes similar performance claims.
Success with big data requires getting key pieces to work together. With BigInsights and BigSQL, IBM is providing tools for facilitating Hadoop adoption, including interoperability with existing Cognos infrastructure and functionality.
Our on-demand webinar: Running Cognos on Hadoop
Video of Hive and BigSQL performance test results
IBM BigSQL technology sandbox demo cloud environment for Hadoop and BigSQL:
Thanks to David Currie for contributing this article. David is a long-time business analytics consultant. He blogs about business intelligence and big data at davidpcurrie.com.