Goodbye Google SOAP Search Service: You may have heard that Google has deprecated their SOAP-based search service. This comes after Nelson Minar, who worked on that API back when he was at Google, said he'd "never choose to use SOAP and WSDL again."

Much as I dislike the way SOAP is used these days, I'm not inclined to gloat, because the deprecation just means more work for me. I used the SOAP search service as an example in all three of my books. Now I've got to find another free, public, non-obscure SOAP service to use as an example (ideas?).

The official Google narrative is that the SOAP-based web service has been replaced by something called the "Google AJAX Search API". If you take this narrative at face value it means that Google has taken down their web service and put up an AJAX library in its place. What's AJAX? Who knows, it's AJAX. Here's a typical weblog entry on the topic.

"Another victory for REST over WS-*? Nope -- Google doesn't have a REST API to replace it. Instead, something much more important is happening, and it could be that REST, WS-*, and the whole of open web data and mash-ups all end up on the losing side."

It's probably my recent proximity to Sam that's doing it, but I'm noticing a tendency in myself to draw fine distinctions. There's a sense in which this narrative is right and a sense in which it's not. I'm going to pick apart the narrative and show what exactly is disturbing about the Google AJAX Search API.

It's not true to say that "Google doesn't have a REST API to replace it." In fact, Google has two REST APIs, and one of them predates even the SOAP API. You've probably used this old API: its primary endpoint is http://www.google.com/.

Yes, the Google website is in fact a very RESTful web service. The downside of this web service is that it's a little bit difficult to use automatically, as opposed to through a web browser. It serves data in a human-oriented format (HTML), and you have to screen-scrape it into a data structure if you want to do anything with it.

There are libraries for doing this, but the other problem is Google doesn't want you to do it. It violates Google's Terms of Service ("No Automated Querying"). Lots of inconsiderate people write scripts that hammer Google's REST API day and night. Google tries to prevent this by sniffing out anything that might not be a web browser and preventing it from accessing the API. (To see this, set your browser's User-Agent to "libwww-perl" and try to use Google.)

But it can't be denied that people outside of Google have a powerful hankering for Google's dataset, so eventually someone (Nelson Minar, it seems) came up with a second web service API that was designed just for use by automated clients. The catch was that you had to sign up for a unique key to use it, and that key would only work for you 500 (later 1000) times a day.

In point of fact Nelson chose a SOAP/WSDL architecture for this web service. But there was no need to use any different architecture at all. Here's a possible different way of implementing the constraints above:

"When you make an HTTP request to google.com, we try to figure out whether you're a web browser or an automated client. Ordinarily, if you're an automated client, we shut you out. But here's the deal. Now you can sign up for an 'automated client key'. When you make an HTTP request to google.com, stick your key into the Authorization header. Not only will we not shut you out, we'll try to make things easy for you. Instead of a human-oriented HTML document, we'll send you the appropriate data in an easy-to-parse XML format. But, we'll only do this for you 500 (later 1000) times a day. Then it's back to shutting you out."
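
As a rough sketch of what that policy could look like on the server side: nothing here reflects how Google actually works. The key list, the usage counter, the crude browser check, and the names choose_response and looks_like_browser? are all made up for illustration.

```ruby
# Hypothetical sketch of the policy described above. The key list, the
# usage counter, and the browser check are assumptions, not Google's code.
DAILY_LIMIT = 500

# Crude stand-in for "figure out whether you're a web browser".
def looks_like_browser?(user_agent)
  user_agent.to_s.start_with?("Mozilla/")
end

# headers:    request headers as a Hash
# valid_keys: collection of issued automated-client keys
# usage:      Hash mapping key => requests served today (updated in place)
# Returns :xml, :html, or :forbidden.
def choose_response(headers, valid_keys, usage)
  key = headers["Authorization"]
  if key && valid_keys.include?(key)
    # A known key gets XML until its daily quota runs out.
    return :forbidden if usage.fetch(key, 0) >= DAILY_LIMIT
    usage[key] = usage.fetch(key, 0) + 1
    return :xml
  end
  # No usable key: browsers get HTML, everything else is shut out.
  looks_like_browser?(headers["User-Agent"]) ? :html : :forbidden
end
```

The URI stays the same for every kind of client; only the headers decide which representation comes back, or whether anything comes back at all.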

This technique has a number of subtle benefits which I could bore you with for quite a while. But its obvious benefit is that it's got the exact same "API" as the Google website, which everyone knows how to use.

Anyway, instead of going down a route like this (which would, I think, have changed the history of web services quite a bit), Google went down the SOAP/WSDL route. Now they're deprecating the SOAP service in favor of some mysterious "AJAX API". This brings me to the second of Google's REST APIs.

There is no magical thing called an "Ajax request". An Ajax client makes normal HTTP requests, and processes the results automatically, just like a web service client. An Ajax client is a web service client.

What HTTP requests is the Google Ajax client making? I poked around a little bit and it looks like it mainly makes GET requests to URIs that look something like http://www.google.com/uds/GwebSearch?callback=GwebSearch.Raw&context=0&lstkp=0&v=1.0&key=xxxxxxxxxx&term=web+services. That's not exactly http://www.google.com/search?q=web+services, but it's not too far off either.
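
To make the comparison concrete, here's a small sketch that pulls both URIs apart with Ruby's standard uri library. Nothing is fetched. The key is the placeholder from the URI above, and query_params is just a helper name invented for this example.

```ruby
require 'uri'

# The two URIs quoted above; the key is the placeholder from the text.
ajax = URI.parse("http://www.google.com/uds/GwebSearch" \
  "?callback=GwebSearch.Raw&context=0&lstkp=0&v=1.0" \
  "&key=xxxxxxxxxx&term=web+services")
site = URI.parse("http://www.google.com/search?q=web+services")

# Decode a URI's query string into a Hash of parameter names and values.
def query_params(uri)
  URI.decode_www_form(uri.query).to_h
end

ajax_params = query_params(ajax)
site_params = query_params(site)

puts ajax.path             # /uds/GwebSearch
puts site.path             # /search
puts ajax_params["term"]   # web services
puts site_params["q"]      # web services
```

Both are plain GET requests with the search terms in the query string. The Ajax one just carries a few extra parameters, including the key and the name of a callback.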

The Google AJAX API consists of a browser-side Javascript library and a server-side web service. The one acts as a client for the other. From what little I've seen of the web service I'd consider it quite RESTful. In fact, it's architecturally very similar to Yahoo!'s RESTful search API. They both use the same (IMO, fairly unsafe) trick to get a web browser to execute dynamically-generated Javascript code from another domain.

The main difference is that Yahoo!'s search API can also be made to send data (in JSON or ad-hoc XML format) instead of executable Javascript. That makes it possible for the service to be consumed by automated clients, not just by web browsers running client-side Ajax programs.
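
On the server side, the difference between the two kinds of response is small. A minimal sketch, assuming a hypothetical render_results helper and a made-up result hash (this is neither Yahoo!'s nor Google's actual code):

```ruby
require 'json'

# Hypothetical helper: serialize results as plain JSON, or, when a
# callback name is given, as a script that calls that function with the
# data (the cross-domain trick described above).
def render_results(results, callback = nil)
  body = JSON.generate(results)
  callback ? "#{callback}(#{body});" : body
end

data = { "results" => [] }
puts render_results(data)           # {"results":[]}
puts render_results(data, "show")   # show({"results":[]});
```

With no callback, the body is a JSON document any client can parse. With one, it's a script that only makes sense to something that executes Javascript.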

Let me just see if I can do something similar with the Google web service. The Javascript it serves is extremely close to also being a JSON document; I should be able to hack it a little and parse it as JSON.

Here's some Ruby code that gives you kind of a command-line Google search like people used to write for the old SOAP API. It requires the json gem.

You can skip the code.

require 'rubygems'
require 'uri'
require 'open-uri'
require 'json'

KEYS = %w{GsearchResultClass unescapedUrl url visibleUrl cacheUrl
          title titleNoFormatting content results adResults
          content1 content2 impressionUrl}

def search(key, term)
  uri = "http://www.google.com/uds/GwebSearch" +
    "?callback=GwebSearch.Raw&context=0&lstkp=0&v=1.0" +
    "&key=#{key}&term=#{URI.encode_www_form_component(term)}"
  javascript = URI.open(uri).read
  # Hack quotes around the hash keys to make the Javascript string
  # into JSON.
  KEYS.each do |name|
    find = /\s*#{Regexp.escape(name)}\s*:/
    json.gsub!(find, " \"#{name}\" : ")
  end

  parsed = JSON.parse(json)
  return parsed["results"], parsed["adResults"]
end

# Command-line interface begins here

(puts "Usage: #{$0} [API key] [search term]"; exit) unless ARGV.size == 2
key, term = ARGV

results, ads = search(key, term)
puts "#{results.size} results for '#{term}':"
results.each do |result|
  puts result['titleNoFormatting']
  puts " #{result['url']}"
  puts " #{result['content'][0..70]}" unless result['content'].empty?
end

unless ads.empty?
  puts "Look at some ads while you're at it:"
  puts '------------------------------------'
  ads.each do |ad|
    puts ad['title']
    puts ad['visibleUrl']
    puts " #{ad['content1']}"
    puts " #{ad['content2']}"
  end
end

Now, in old episodes of MacGyver, whenever MacGyver built a bomb out of baking soda and masking tape, the writers would change some crucial detail (like change the masking tape to Scotch tape) so that if kids copied MacGyver they wouldn't blow up the house. I've done something similar here. I've removed a crucial line of code from that program, so that people don't just go copying it and running it all over the place.

Why did I do that? Because when it works, that program violates the Google AJAX Search API Terms of Service. "The API is limited to allowing You to host and display Google Search Results on your site." I can use the old SOAP API to write a command-line search tool, but I can't use the new, RESTful API in that kind of application. My users can only access the RESTful API through a specific library (Google's Javascript library), running in a specific way (in their web browsers), for a specific purpose (displaying search results).

Wait a minute... running only in a web browser? Terms of Service? Bootleg scripts that hack the output into something a parser can understand? This REST web service is made available on exactly the same programming-unfriendly terms as the Google website "REST web service"!

Instead of screen-scraping a web page, I'm now screen-scraping a web service. I'm reverse-engineering undocumented URI formats, just like I do when I screen-scrape. So far, there's nothing on Google's end that sniffs my user-agent to make sure the web service only runs in a browser, but you can bet there will be as soon as that becomes a problem for Google.

The "blow to web services" narrative is incorrect. Google did in fact deprecate their SOAP API and expose a RESTful API. A win for REST!

Though incorrect, the "blow to web services" narrative is also correct. Google deprecated their SOAP API, exposed a RESTful API, and then erected a bunch of technological and legalese barriers around any attempt to actually use the RESTful API. You're only allowed to use it through one library in one language in one environment for one purpose. A loss for everyone!

On the level of technological choices, this move is a big improvement. They've gone from SOAP, which has a lot of overhead, to plain old HTTP, which has strictly less. They've gone from an RPC style, which doesn't play well with the web, to a RESTful style, which does. This makes an enormous amount of technological sense. From its first day on the web, Google has exposed its dataset through a RESTful interface that gets orders of magnitude more traffic than any "web service" it might expose. In a sense, all they're doing now is unifying the architectures.

When it comes to getting information into the hands of people who can use it, Google has taken a big step backwards. The SOAP interface was serious overkill, but what you did with it was your business (though you could only do it 1000 times a day). The new RESTful interface is a technical improvement, but it's encumbered with restrictions that make it a museum piece. Unless you're writing an Ajax application using Google's library, its true value can only be obtained illicitly. And that's the other reason why I'm not inclined to gloat.


Posted by Mark Baker at Thu Dec 21 2006 14:13

FWIW, Google used to have an XML based RESTful search API up at http://www.google.com/xml but it was removed from public use a few months after being announced (2001) and - so I'm told - is only used by some partners now.

Posted by Leonard at Thu Dec 21 2006 14:43

Verra interesting.

Unless otherwise noted, all content licensed by Leonard Richardson
under a Creative Commons License.