The latest commercial distribution of the Hadoop big-data framework will have Microsoft's trademark. But it will have a certain elephant's footprints all over it.
A few years ago, software engineers at Yahoo worked out a low-level solution to the problem of managing databases whose scale exceeded that of the largest storage volumes available -- even the virtual ones. By turning over their product to the open-source community, the Yahoo team seeded the fastest-growing, most lucrative market in all of enterprise computing.
At this time last year, the stunning news came that Microsoft would integrate a commercial distribution of Hadoop, the big-data framework, into its next edition of Windows Server, effectively making a good part of its enterprise operating system the product of an open-source process. Though the product is not yet ready for general release, the last round of beta testing is under way for Microsoft HDInsight Server, a complete data warehousing and analytics framework based on Hadoop and created through the company's partnership with Hortonworks -- one of the commercial Hadoop producers founded by former members of the Yahoo team.
By "integrate," I don't mean "bundle by tacking a hyperlink on to one of the admin tools." HDInsight will be fully manageable through Microsoft System Center, effectively making Hortonworks' product the preferred Hadoop for Windows Server 2012.
Jim Walker, Hortonworks' director of product marketing, told me on Wednesday:
I think that Microsoft had been swimming around the Hadoop ocean for quite some time, figuring out what to do. I think they were even thinking about making their distribution at one point. They partnered with us because they saw that our focus is about enterprise readiness, stability, high availability, security, about simplifying Hadoop for the enterprise. And that really plays well with what Microsoft is all about.
Something else Microsoft is about is distributing a finished product when it's finished. The company learned that lesson the hard way in the summer of 2007, when it tried to build its own virtualization layer into Windows Server to preempt VMware. Microsoft was forced to ship an incomplete product. By the time it was playing catch-up (even to the point of giving away Hyper-V), VMware and Citrix had already built their virtualization platforms into full ecosystems.
Microsoft doesn't want to lose a similar opportunity with big data, even if it means not acquiring the product or the engineers behind it. Walker put it this way:
This isn't your father's Microsoft. They're full-bore into this open-source thing with Hadoop. Although they're writing code to make Hadoop work on Windows, they're contributing back into the community. We're working with them so that some of their developers become committers to core Hadoop. That's a big deal for Microsoft. When you see them creating marketplaces around products, that's a fairly big paradigm shift for them.
A general release for HDInsight Server is expected early next year, he said. But after that, a developer community will need to form -- one devoted to integrating the product with the company's frameworks and processes.
For example, Microsoft developers are accustomed to writing application code that speaks to ODBC database drivers. For any big-data operation to be fully integrated with the operating system, developers will expect to communicate with big data in much the same way they communicate with other types of data.
However, the point of Hadoop is to provide overviews of huge chunks of data at a glance -- not just to store and retrieve data from shards scattered throughout the cloud, but to present analytics and business intelligence information to applications. So something will have to change, and judging from what Walker said, the nature and extent of that change have yet to be fathomed.
I know the group is working on .NET extensions to enable the developer community. There's also JavaScript extensions to run on top of Hadoop, directly kicking off MapReduce jobs to pull that data... into an application. There's a lot of different ways to architect that solution so you get more real-time access to that data. Those are extensions we're working on, as well.
One of the connection points is HCatalog, which provides a metadata layer, a view into Hadoop. It provides a RESTful interface on top of it with a REST API, which you can imagine working with .NET -- fairly straightforward, fairly simple to implement.
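To give a rough sense of what that REST interface looks like from client code, here is a minimal sketch in Python rather than .NET. It assumes the URL layout used by HCatalog's REST server (WebHCat, formerly Templeton), with its conventional port of 50111. The host name, user name, and sample response below are hypothetical, and the actual layout may differ in HDInsight.

```python
import json
from urllib.parse import urlencode


def tables_url(host, database, user, port=50111):
    """Build the URL that asks the HCatalog REST server to list tables.

    Assumption: the WebHCat (Templeton) path layout and default port.
    """
    query = urlencode({"user.name": user})
    return f"http://{host}:{port}/templeton/v1/ddl/database/{database}/table?{query}"


def parse_tables(body):
    """Pull the table names out of a JSON response body."""
    return json.loads(body).get("tables", [])


# Hypothetical response body, shaped like a table listing.
sample = '{"tables": ["clicks", "orders"], "database": "default"}'

print(tables_url("hadoop.example.com", "default", "analyst"))
print(parse_tables(sample))
```

An HTTP GET against that URL would return the JSON to parse. The sketch leaves the network call out so that it runs without a cluster.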
With Hadoop effectively being promoted to a front-and-center position within Windows Server 2012, it's easy to envision a time in the surprisingly near future when big data passes out of the public consciousness, and Hadoop serves simply as the data layer.