Sunday, July 24, 2005

Adam Bosworth on a new data model for the web

Following the excellent presentation by Michael Tiemann at the MySQL Users Conference organized by O'Reilly Media, IT Conversations has posted another great presentation - this time by Adam Bosworth at the same conference. In the talk, titled "Database requirements in the age of scalable services", Bosworth explains his vision of making data access on the web simple and standards-based, and why he thinks RSS 2.0/Atom are the beginning of the unfolding of that vision.

The presentation starts with a joke about 'Microsoft Project' - Bosworth says he is amused to hear about people using Microsoft Project and how it is Microsoft's secret weapon to stop everyone else from competing. Here are some of the points I noted down while listening to the podcast.

  • Everything else looked irrelevant to Adam Bosworth (AB) after he got excited about the web.
  • How did the web happen? Tim Berners-Lee hit a perfect storm of productivity: using HTML and HTTP, any 'P' (Perl, Python, PHP) programmer could generate content. Simple was huge, and everybody could play with HTML and HTTP. HTML is sloppy - everything is rendered without complaint. It takes a licking and keeps on ticking. You don't have to be the high priest of syntax. That's not the case with XHTML - and that's not a good user experience. Web pages could be read and edited on all operating systems.
  • Databases are not good at partitioning, and partitioning of data is very important to scalability. The web does a good job of this by distributing data. The web is also good at caching - at Google they observed 120,000 hits per second on a certain blog, and the only reason the infrastructure didn't melt down was the amount of caching done on the web by proxies, front ends and Google's own front end. Statelessness - the coarse-grained interaction - is another reason for the scalability of the web. Clients talk to servers in terms of chunks of data: go to the data when you are ready, not continuously.
  • Google combines a lot of simple-minded techniques with brute force to deliver. Google has lots of PhDs and everyone is a General Patton driving tanks. Take the spell check that Google does when you type in an incorrect (or sometimes correct) search string - it's based on the very simple technique of tracking failed searches and what users type in after a failed search. This kind of brute force enables Google to go through petabytes of data in seconds.
  • Once you start to search, this whole business of putting things in folders begins to diminish in importance. It's very hard with folders to remember where you put what. Folders are not efficient; searches are.
  • The vision is to take the database and do the same thing for the web (as was done for content). Can we take all the info on the web and make it easily findable? Right now you get content, not information.
  • We need something that scales massively and linearly. AB originally thought that we would do it using XML. Instead we created a tower of Babel (with XML), whereas with HTML websites need to support only one grammar. A working group took four years to come up with a spec for an XML Query standard. It's better to spend six months and learn the rest from customers. The query standard was not simple like the web - the schemas were very complicated. AB also found the WS specs to be very complicated. Why did this happen? The companies that came up with these standards were big and were trying to protect themselves. They were people at companies like IBM and MS. Frankly, they were trying to make it deliberately hard.
  • AB apparently is a technical advisor to MySQL and had some advice for them as well: basically, MySQL is trying to be Oracle by adding support for procedures, triggers, and views. All of this is about centralizing processing logic in the database. To be blunt, centralizing processing logic in the database is a bad idea - it doesn't scale. Advice to MySQL - don't do something just because you want to be Oracle, because Oracle isn't big enough and can't deliver on billions of queries.
  • We need an open model for data. What's not open today is how you talk to a database - the actual wire format. There is nothing like HTML/HTTP for data. This is a very 20th century way of thinking. Open up wire formats to serve any kind of information - this will bring enormous changes to computing centered around data. We need open standards for different types of items with one single grammar. It will have to be sloppy. Open up and democratize the way data is served.
  • AB is a big believer in stupidity - the virtues of dumbness.
  • We are actually starting to realize this vision - RSS 2.0/Atom are going to be for data what HTML was for content. They are going to be the lingua franca of consuming data. Surprisingly simple and sloppy. These guys got the web and that's why it is catching on like wildfire. Atom was formed by a consortium of bloggers and the two formats are isomorphic.
  • Data queries have to be such that they don't need data spread across machines - if a query uses data from four machines, it isn't going to work very well. Queries need to run at an item level - it's not technically as complex as SQL.
  • AB made it emphatically clear that he was not talking about the semantic web and called RDF an empirical failure. RSS 1.0 had an RDF grammar; RSS 2.0 doesn't have an RDF-based grammar. Ordinary programmers do not understand how to model something as arcs, nodes, and graphs.
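
To make the spell-check point above a little more concrete, here is a minimal sketch of the "track failed searches and what users type next" idea. This is my own illustration, not Google's implementation: the log format, the function names (build_followups, suggest) and the rule that a "failed" search is one with zero results are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_followups(search_log):
    """Count which query users typed right after each failed query.

    search_log is an iterable of (session_id, query, result_count) tuples
    in chronological order. This format is hypothetical, and treating
    "zero results" as a failed search is an assumption of this sketch.
    """
    followups = defaultdict(Counter)   # failed query -> Counter of next queries
    last_failed = {}                   # session_id -> most recent failed query

    for session_id, query, result_count in search_log:
        # If this session's previous search failed, record what came next.
        previous = last_failed.pop(session_id, None)
        if previous is not None and query != previous:
            followups[previous][query] += 1
        # Remember this query if it failed, so the next one can be paired with it.
        if result_count == 0:
            last_failed[session_id] = query

    return followups


def suggest(followups, query):
    """Return the most common follow-up for a failed query, or None if unseen."""
    counts = followups.get(query)
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

Feed it a log in which several sessions search for "bosworht", get nothing, and then retry with "bosworth", and suggest(followups, "bosworht") returns "bosworth". There is no dictionary and no linguistics here, only counting, which is the sense in which the technique is simple-minded and relies on having a lot of data.
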
Update: Looks like the O'Reilly network and ONLamp.com have also covered Bosworth's remarkable speech.
