Two months ago, Maritza Johnson organized a trip to the NYC Yahoo! research labs for Columbia's Women in Computer Science. As a Columbia alumna, I snuck in. (Something like fifteen or twenty high-powered CS undergrads and grad students attended. Always great to be in a lobby full of smart geeky women.) I heard about some pretty keen things there, so here's my writeup.
Ken Schmidt, director of academic relations, told us about some of Yahoo!'s academic relations work. For example, academics can get a bunch of useful datasets for research via the Webscope program. Yahoo! hosts university hackdays alongside its other worldwide hackdays. The Faculty Research Engagement program provides funding, datasets, and visits. The Key Scientific Challenges program gives grad students money, secret datasets, and collaboration. And Schmidt noted that there's an active Yahoo! Women in Tech network, and that they'll be at Grace Hopper this year.
At Yahoo! Labs, you choose your office location based on what's convenient to you, and then collaborate with other people in your discipline across offices. I didn't get a chance to see their videoconference meeting rooms but I get the sense they're great.
Following are some idiosyncratic notes on the presentations we got from Yahoo! Labs researchers. We also got to talk informally with Duncan Watts, who thinks a lot about experiments, social dynamics and behavior, and Sergei Vassilvitskii, who is taciturn.
Dan Reeves has spent four years at Yahoo! Labs and works on a mix of things. He talked about Predictalot, the Yahoo! prediction game for March Madness (the US college men's basketball tournament). Reeves, who doesn't know much about the NCAA, showed us some sample bets and kept getting dismayed that they were coming up at 100% or 0% probability, until actual basketball fans pointed out that his randomly chosen bet put up a rinky-dink conference against a heavy hitter. Domain knowledge is useful sometimes.
A few lessons: running quintillions of simulations (in the web browser, when the user selects a bet to make, I think) is hard, and thus the programmers took "ridiculous shortcuts." They also made it possible to place "weird bets" (say, bets on the sum of the seeds that make it to the Final Four), and not many people have taken advantage of that, which is a little disappointing. And though the prediction market is very flexible, it doesn't give you more accuracy than you already get from crude, well-known variables.
But we already have an efficient and computation-assisted prediction market, and it's called gambling. Millions of dollars change hands every year as people bet on college basketball, and metrics for success and failure are clear, so I don't find it surprising that we're already very good at predicting outcomes from known variables. Perhaps a prediction market would lead to a greater increase in accuracy in a lesser-known sport.
That same slight disappointment came up in Sharad Goel's results. Goel thinks about homophily vs. influence, which seems intriguing to me, as does his "Anatomy of the Long Tail: Ordinary People with Extraordinary Tastes". To our group he spoke about what search can predict. That blog entry has all the details. Some key points:
You can use data from people's search queries to "predict the present." For example, people are all gaga about Google Flu Trends partly because it works around lags. GFT gives you results with a tiny lag, maybe a day; the CDC can't tell you results till it's been a week or two.
But can you use search to predict the future? And how well would that compare to alternative prediction methods? Well, you can check queries in the weeks leading up to a movie release and that'll give you pretty accurate predictions for its box office numbers, but "more mundane indicators, such as production budgets and reviewer ratings, perform equally well at forecasting sales." Specifically, there's already a Hollywood Stock Exchange. Again, where there's already a well-honed prediction market, you're not going to be able to swoop in and compete all Moneyball-style right off the bat...
Sihem Amer-Yahia researches social data management. She spoke with us about relevance algorithms for social surveys. You can construct implicit networks based on shared data preferences -- for example, rankings on delicious -- or shared behavior. (Yeah, remember, Yahoo! owns The Web Site Formerly Known As del.icio.us. Over and over in these talks I was reminded that Yahoo! is making a lot of hay from their datasets: Flickr, delicious, Yahoo! Games, Yahoo! Sports (including fantasy sports), Yahoo! Mail....)
How alike are two people, based on what they tag or rank? Well, it's hard to systematically check this sort of thing via tags, because tags are sparse (whooo, folksonomy). Researchers looked at tags on Yahoo! Travel, like "family" or "LGBT." They parsed the tags and their usage to create "concepts" and to build "communities" around those concepts.
As I've known since Leonard created the Indie Rock Peter Principle, recommendation systems suffer from an overspecialization problem. As Amer-Yahia puts it, how can you incorporate diversity into the system's recommendations without hurting their relevance? Well, they have a lot of heuristics. One: use a greedy algorithm to pick the K most-relevant results. Find which of those K results is most similar to the rest of the set. Then compare it to the (K+1)th result; if the (K+1)th result is less similar, swap it in. Continue trading off diversity against relevance until you reach the lower acceptable bound of the relevance range (a range whose threshold you may have to discover empirically). It's a species of affirmative action.
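If I understood the heuristic right, it looks something like this toy sketch. This is my own reconstruction, not Yahoo!'s actual code: I'm using Jaccard similarity over tag sets as a stand-in for whatever similarity measure the real system uses, and the relevance threshold is just a number you pass in.

```python
# Toy sketch of the diversity-vs-relevance swap heuristic.
# Similarity here is Jaccard similarity over tag sets; the real system's
# similarity measure, thresholds, and data structures are not specified.

def jaccard(a, b):
    """Similarity between two tag sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def diversify(items, k, min_relevance):
    """items: list of (relevance, tag_set) pairs, sorted by relevance
    descending. Returns k results, trading some relevance for diversity."""
    chosen = items[:k]
    for cand in items[k:]:
        rel, tags = cand
        if rel < min_relevance:
            break  # stop at the lower bound of acceptable relevance

        # similarity of one item to the rest of the chosen set
        def sim_to_rest(item):
            return sum(jaccard(item[1], other[1])
                       for other in chosen if other is not item)

        most_redundant = max(chosen, key=sim_to_rest)
        # swap in the candidate if it is less redundant than what we have
        if sim_to_rest(cand) < sim_to_rest(most_redundant):
            chosen[chosen.index(most_redundant)] = cand
    return chosen
```

With a result list full of near-duplicates, a slightly less relevant but more distinctive item gets swapped in; once candidates fall below the relevance floor, the loop stops.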
Once you personalize recommendations (especially based on social networks), the indices you create and have to deal with get huge. I have a note here about "Storing like things together" and "Returning a composite of relevant items, validating with user's network" -- I assume those are partial solutions to the performance problems.
Another fun thing Amer-Yahia worked on: take Flickr photos and turn them into itineraries (longer paper at author Munmun De Choudhury's site). (Factoid: about 10% of Flickr photos come with automatic geotag stamping, and about 40% have semantic user-added tags that you can use to get some geographic data.) As the abstract says, "Our extensive user study on a 'crowd-sourcing' marketplace (Amazon Mechanical Turk), indicates that high quality itineraries can be automatically constructed from Flickr data, when compared against popular professionally generated bus tours." Oh yeah, the researchers love Mechanical Turk!
Amer-Yahia also spoke on homogeneity in structured datasets with strict ranking. Her demo used Yahoo! Personals as an example, which led to many subsidiary guffaws.
Basically, it's the diversity vs. relevance problem again. If you say you want to see college-educated white women aged 25-34 within 5 miles of New York City, you'll get a big dataset ordered by some characteristic. You can either rank by distance, or by age, or by level of education, but in any case you have like 100 nearly-identical results on the first several pages before you get to the first difference. It's hard to explore.
So instead we have subspace clustering, which sounds AWESOME. You cluster combinations of attributes in a rank-aware way, label them, and make sure that your resulting clusters of results have adequate quality, relevance, etc. Amer-Yahia explains this as dimension reduction to help users explore [results] more effectively.
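I'm guessing at the details, but the flavor of the idea might be something like this toy example: group results by their values on small attribute subsets, keep the groups big enough to be worth showing, and label them so users can browse buckets instead of near-identical pages. The attribute names and size threshold here are invented; the real work is rank-aware and uses proper quality measures.

```python
# Toy illustration of the subspace-clustering idea (my invention, not the
# actual algorithm): cluster over every 2-attribute subspace and label
# each cluster with its defining attribute values.
from collections import defaultdict
from itertools import combinations

def subspace_clusters(profiles, attributes, min_size=2):
    """profiles: list of attribute dicts. Returns {label: [profiles]}
    for every 2-attribute subspace cluster meeting the size threshold."""
    clusters = {}
    for subspace in combinations(attributes, 2):
        groups = defaultdict(list)
        for p in profiles:
            key = tuple(p[a] for a in subspace)
            groups[key].append(p)
        for key, members in groups.items():
            if len(members) >= min_size:
                label = ", ".join(f"{a}={v}" for a, v in zip(subspace, key))
                clusters[label] = members
    return clusters
```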
John Langford works on machine learning. He pointed out a bunch of spots where Yahoo! sites use, or could benefit from, machine learning. He works on a "fast, scalable, useful learning algorithm" named Vowpal Wabbit. Langford demonstrated it and indeed it seemed plenty fast, although I haven't any baseline for comparison. Key phrases I noted include "linear predictor," "infrastructure helps it go & learn fast," and "plug in different, lossy-or-not algorithms?" and I assume interested folks can go check out the tutorial. A niche tool, but it sounds invaluable if you're in his target market.
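I can't reconstruct the real thing, but "linear predictor" plus the emphasis on speed suggests the hashing trick: hash feature names into a fixed-size weight vector and update the weights with stochastic gradient descent, so you never build a feature dictionary at all. Here's a toy sketch of that idea; it's my illustration only, nothing like the actual implementation.

```python
# Toy hashed linear predictor with SGD on squared loss (my illustration
# of the hashing-trick idea, not the real tool's code).

DIM = 2 ** 18  # fixed weight-vector size; hash collisions are tolerated

def predict(weights, features):
    """features: dict of feature name -> numeric value."""
    return sum(weights[hash(f) % DIM] * v for f, v in features.items())

def sgd_update(weights, features, label, lr=0.1):
    """One squared-loss gradient step toward the numeric target label."""
    err = predict(weights, features) - label
    for f, v in features.items():
        weights[hash(f) % DIM] -= lr * err * v

weights = [0.0] * DIM
examples = [({"word:spam": 1.0, "len": 0.5}, 1.0),
            ({"word:hello": 1.0, "len": 0.2}, 0.0)]
for _ in range(50):
    for feats, label in examples:
        sgd_update(weights, feats, label)
```

The hash-modulo indexing means memory is constant no matter how many distinct feature names stream past, which is part of why this style of learner scales so well.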
Jake Hofman showed us some more machine learning goodness. His tool (an implementation of vbmod, I think) scrapes the To: & CC: lines from your email to see who gets emailed together, and from that constructs a pretty graph showing the nodes & clusters in your social network. He tried it on a colleague's real mail, and indeed five distinct clusters sprang up. "That's my soccer buddies...that one's my in-laws...that's my college pals..." You can use this to have the Compose Mail interface auto-suggest recipients you might have left out.
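The graph-building part of that is simple enough to sketch. Here's a crude toy version that counts which addresses appear together on the recipient lines and then splits the graph into connected components as a stand-in for the clustering; the actual tool uses a Bayesian method (vbmod), which does much more than this.

```python
# Crude sketch of the email co-occurrence graph (not vbmod): addresses
# that appear together on To:/CC: lines become linked nodes, and
# connected components stand in for the real clustering step.
from collections import defaultdict

def cooccurrence_graph(messages):
    """messages: list of recipient-address lists (To: plus CC:)."""
    graph = defaultdict(set)
    for recipients in messages:
        for a in recipients:
            for b in recipients:
                if a != b:
                    graph[a].add(b)
    return graph

def components(graph):
    """Split the graph into connected components via depth-first search."""
    seen, clusters = set(), []
    for node in graph:
        if node in seen:
            continue
        stack, cluster = [node], set()
        while stack:
            n = stack.pop()
            if n in cluster:
                continue
            cluster.add(n)
            stack.extend(graph[n] - cluster)
        seen |= cluster
        clusters.append(cluster)
    return clusters
```

Given a cluster, the auto-suggest feature falls out naturally: if your draft's recipients mostly belong to one component, suggest the members you left off.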
I talked with Hofman a little after the presentations, whereupon he revealed that he hearts Beautiful Soup and Mechanize for screen-scraping login-protected or otherwise complicated websites. Evidently he got into Bayesian fun as a cell biologist getting software to automate the tedious task of classifying images from microscopes, slides, etc. Oh, there it is on his resume: "Applied machine learning and statistical inference techniques for high-throughput quantitative analysis of network and image data" and "Developed software platform to automate characterization of cell spreading and migration". Cool!
Siddharth Suri researches social networks and experiments and data mining. He presented his "Behavioral Study of Public Goods Games over Networks." He did an experiment on Mechanical Turk. Econ professors would challenge him on how representative of the population that sample is, to which he would rightly reply that they tend to experiment on university undergraduates, who aren't exactly hella representative either. Boo-yah!
Suri asks how we might get people to change or sustain socially beneficial behavior in a tragedy-of-the-commons situation. For example, how do we encourage energy conservation, discourage littering, and encourage donations to charity? I appreciate that it's a tough and important problem. However, he also said that the same question applies to online communities: how do we get people to upload photos to Flickr or write Facebook updates so everyone can enjoy them? He then investigated via a social dilemma game/experiment via Mechanical Turk, where strangers had the option to give or keep amounts of money, sometimes a "subject" was a plant who moved norms towards selfishness or altruism, etc., etc.
I find this question and approach a little bewildering. People write and share and upload online for many of the same reasons we knit scarves as gifts, host and go to birthday parties, and gossip and volunteer in the physical world. These are interpersonal, social actions that we do to bond or amuse ourselves or gain status within specific communities that have meaning to us. Experimenting on this phenomenon with strangers exchanging money on Mechanical Turk -- because that's where you can get experimental results -- seems weak.
Since this experiment was an initial pilot project, we suggested that future iterations allow the subjects to make friends with each other, or get pre-existing groups of subjects to join (e.g., have an experimental group composed of coworkers). Another attendee worried aloud that these measures might allow a "false sense of community" to arise and throw off the results. But who are we to call any sense of community false? And community is the answer to the social dilemma, anyway, isn't it?
Overall, a thought-provoking and enlightening way to spend a few hours. Thanks to Johnson and Schmidt for setting it up. I also thank Yahoo! Labs for the lunch, USB drives, pen gadgets, and fleece scarves. Let me know if I'm wrong about anything!