The Query Optimizer
David DeWitt talks about building parallel processing into SQL Server.
When David DeWitt gave the closing keynote address at last month's annual Professional Association for SQL Server (PASS) conference, it was the first time he addressed a large audience of database developers and administrators.
DeWitt spent the past 32 years lecturing in computer science classrooms.
In April, the former chairman of the computer science department at the University of Wisconsin-Madison decided to embark on a major career change by joining Microsoft as a technical fellow. DeWitt was brought on board to launch the Jim Gray Systems Lab -- which is located within walking distance of the university -- to work on improving Microsoft's data and storage platforms.
Project "Madison" is the code name for one of the key technologies that will enable the next version of SQL Server, dubbed "Kilimanjaro," to handle massive parallel processing. Madison is the outgrowth of Microsoft's acquisition of DATAllegro in September. In an interview with Redmond Developer News, DeWitt talked about the lab and his vision of the future for SQL Server.
How did you end up at Microsoft?
||"Query optimization is the main unsolved problem in data warehousing today. I think we know how to build parallel database systems that scale to hundreds of thousands of nodes."
|David DeWitt, Technical Fellow, Microsoft
A couple of years ago, I started having conversations with [Microsoft Technical Fellow] Peter Spiro, one of our alums. I was looking around for something new to do. I wanted to do something that continued to involve graduate students, which could have more immediate impact on commercial database systems.
Traditionally one of the problems has been by the time you develop a good idea and it comes out in a commercial product, it's about a 10-year window, sometimes even longer. If you do a start-up you can shorten that process. So Peter and I talked about this concept where we would try to shorten the cycle from when the ideas were first thought of to when they appeared in product. We decided we would have me retire, go to work for Microsoft and start this lab.
I have a staff of full-time Microsoft employees and we fund a group of graduate students and some faculty members at the university to do what they normally do: think of new ideas in the database area, explore those ideas and work with us to get the best of those ideas into the Microsoft code base. The real model is that the students innovate as they would normally do as if money was coming from the Science Foundation or DARPA [Defense Advanced Research Projects Agency]. There are no restrictions placed on what they work on, there's no expectation that they're going to produce code that will go in the next release of the SQL Server product. It's really letting the graduate students be graduate students and innovate like they've always done, and for us our job is to seed those ideas, help use our experience to direct them on occasion but then pick the best of those ideas and get them into the technology that forms SQL Server.
Was this lab built from the ground up?
I'm still building it. It's a lot of work. I currently just have three staff members; we'll be finding up to six graduate students next semester. I have some open staff positions but I'm very fussy on who I hire. I'm never going to have 50, but the goal is to have 10 to 15 full-time staff, mostly Ph.D.s and some masters' students -- people that like to build systems. I'm a real hands-on systems builder.
What are you working on?
We have two projects we're working on. We're working with the DATAllegro team to look at parallel query optimization techniques. Optimizing queries is hard, optimizing for a scalable database system is even harder, and query optimization is something I've been interested in for a long time. We have one project that involves looking at some optimization techniques that will come out in a future release of the DATAllegro product.
The second project we're working on is looking at the role of flash [memory] in future database systems. I like to use it as a buffer pool extender. You assume the whole database system will sit in flash, some applications, maybe transaction-processing kinds of applications; it might be the case that database systems will fit entirely in flash. If that's the case, you might re-architect the database system completely. Flash brings a lot of exciting possibilities and we're starting to look at how that technology might affect data warehousing and transaction processing.
How quickly do you see flash becoming a medium for database storage in the enterprise? How do you see that playing out?
Currently, many flash disks have fairly poor random-write performance. This renders them somewhat useless for a transaction-heavy environment. When and if the flash translation layers solve the random-write performance problem, I'd expect flash to dominate the transaction-processing marketplace. I tend to doubt, however, whether flash will ever replace disks for the warehousing market.
What role did you have in the DATAllegro acquisition?
Zero. I had absolutely no role in the acquisition process. I knew about it soon after I joined, but Microsoft is very careful about how it does acquisitions these days. I was not involved in any way in the technical decision on whether to buy it or not. But I think it's a great acquisition. They've got a great product and I think Microsoft's expertise will be able to take it to an entirely new level.
It's a great team. We were there [recently]. We're excited about working with them. It was like a big Christmas present as far as I'm concerned because now, all of a sudden, I'm working at a company that has a really seriously scalable parallel database system. Having built three in my life, getting a chance to work on a fourth [is] just like Christmas.
How do you see taking it to the next level?
First of all, replacing Ingres with SQL Server will certainly drastically improve the kinds of performance we should be able to get. SQL Server is a modern database system and Ingres is an old system. The DATAllegro system avoided using indices because the indexing in Ingres was not very effective. I think we'll get all of the benefits SQL Server has as the underlying engine. We're going to get this huge boost. DATAllegro is a start-up and they have a great system, but there are a lot of things that were done in the area of query optimization [that] I think we can improve on. Having built a number of parallel database systems in the past, I think we can offer something when it comes to optimization of queries that will allow us to scale even higher.
How else do you see SQL Server advancing as a platform?
SQL Server will advance as a platform by using DATAllegro as the base. Will DATAllegro make SQL Server more scalable? Absolutely. I think query optimization is the main unsolved problem in data warehousing today. I think we know how to build parallel database systems that scale to hundreds of thousands of nodes. DATAllegro already has one customer that's 400 terabytes. Ebay has a competitor's system that has 5 petabytes. But there are really serious challenges of optimizing queries for hundreds of nodes and thousands of spindles. I think those are the opportunities that a team like mine can get its teeth into and make some forward progress. Query optimization is something that will come for a very long time, and we have ideas for some new models for optimizing and executing queries that we'll be exploring as part of the DATAllegro process.
Is the code-name Madison related to the location of the lab?
Project Madison is the internal name for the effort to get DATAllegro ready to ship. It's a confusing name because it makes it sound like my lab is in charge.
What role do you have in advancing SQL Data Services into the cloud?
I don't have any direct role, but there are some interesting opportunities there. SQL Data Services will basically provide a vehicle for customers to run their database applications on their database servers in the cloud, but SQL Data Services isn't at this time a scalable appliance in the sense that DATAllegro is.
The services that DATAllegro provides are the ability to incrementally add nodes and get incrementally better performance or constant response time for bigger and bigger problems. In some ways it's a bunch of SQL Server boxes running on a bunch of nodes in a cluster or a grid, and you can envision potentially taking the upper layers of DATAllegro, and running that layer on top of SQL data servers in the cloud.
I could be wrong about this -- I'm not an expert on the product -- [but you could] think of SQL Data Services as providing for a large number of customers, SQL Server sort of running a box in the cloud, and DATAllegro fitting very nicely for those customers who need scalable warehousing in the cloud.
I think there will be a very nice mesh of these two technologies going forward. I think they're complimentary. At the bottom, you've got SQL Server running on some nodes; whether they're on somebody's machine or whether they're in the cloud, it doesn't really make much difference. There are challenges when nodes go away. I'm not trying to gloss over the technical challenges, but I think they're complimentary.
You mentioned it can take 10 years for research to make it into a commercial product. Is that time frame changing?
One of our ideas in setting up this lab was to have a much shorter path from the innovation -- by the graduate students and by my staff -- into the product line. That's one of the reasons why I'm not part of Microsoft Research, even though I'm an academic researcher. I'm part of the SQL Server organization and we intentionally put this lab as part of the SQL Server organization so that we had a direct path from the university into the SQL Server organization.
If you had to sum up your key message, what would you say is your vision for where you'd like to see your efforts at Microsoft take the SQL Server platform in the future?
I'd like to have us become the world leader in data warehousing. I think that we have a great SMP [symmetric multiprocessing] product. It's easy to use. It's got great performance. We can take on teradata. I don't see any reason why we shouldn't become the premier solution for very large-scale data warehousing.