Housing’s Data Science Lab was one of the earliest teams at Housing. Data scientists on the team use advanced statistical and machine learning methods to produce better information for Housing’s users and our internal decision-makers. In the first of a two-part interview, Paul Meinshausen, Vineesha Budhrani, and Nitin Sangwan talk about a future enabled by data, and about how data can be used to find the best batsman in cricket!
Paul is VP of Data Science, Nitin is an Engineering Manager, and Vineesha is a Data Scientist at the Data Science Lab (DSL) at Housing.
Q.1) Give us a short history and overview of your team.
Nitin: Housing was founded on the idea of giving users (renters, buyers, sellers) decision advantage. We take extensive data, process and analyse it, and deliver it as information needed to make decisions. Given Housing’s single-minded focus on turning data into the power to make better decisions, the Data Science Lab’s formation was only natural.
Our data scientists have diverse backgrounds in fields like computer science, mathematics, statistics, and computational social science. Real estate in India is a treasure mine for data scientists and we’re growing quickly. Our industry presents interesting and challenging problems, so we keep high recruitment standards. We aim to be the premier data science team in Asia.
Q.2) The Data Science Lab is behind some innovative ratings and indices in real estate, like the Lifestyle Rating. What other kinds of work does the Lab do?
Vineesha: Ratings and indices are just one aspect of data science at Housing. A lot of our work is happening behind the scenes as you browse our website. For example, we’re constantly gauging what a person is looking for and trying to provide the most appropriate listings in response.
We’re ensuring that our listings grow quickly and that they stay current. This involves a lot of interesting inference problems as well as predictive modeling. We also provide insights to our operations, business, and sales teams. We help them optimise resource utilisation and resolve critical trade-offs. We’re regularly developing new and exciting work, so stay tuned!
Q.3) Many universities have introduced specialisations in analytics into their MBA programs or post-graduate courses for Big Data hoping that it is the ‘next big thing’. Is their enthusiasm misplaced?
Paul: Their enthusiasm is well-founded. Before joining Housing I was the lead data scientist in Asia for Teradata’s International Data Science team. From Japan and South Korea to India and Pakistan, one of the first questions Teradata clients asked me was how they could find data scientists and build a data science team. There’s a lot of data work out there to be done and far too few people to do it. So it makes sense that universities would adapt to that demand.
However, I don’t think that developing a single academic program or specialty will solve the problem or supply a majority of tomorrow’s data scientists. Instead, if I were in charge of existing programs in any of the related fields (computer science, statistics, mathematics, computational social science, engineering), I’d focus on helping my students round out their skills across a variety of domains. And I’d really focus on helping them develop a heightened sense of curiosity and openness to new ideas and to new applications of established methods.
Q.4) The Data Science page says that ‘data is the new soil’. What does that mean for your work?
Vineesha: The quote by David McCandless reads: “Data is the new oil? No: Data is the new soil”. This is how we interpret it. We have a ton of data. This includes data on every action by every user, on listings and their attributes, as well as on the neighbourhoods, communities, and cities where those listings are located. But oil is a non-renewable resource: you extract it, use it, and then it’s gone.
On the other hand, our data is a renewable resource that is consistently reused to produce new forms of value that in turn feed back into new data in a virtuous cycle. Instead of just seeing data as a valuable asset to have in our arsenal and use once, we strive to use it to grow new and insightful products that enhance user experience. And as those products get used, they produce new data, which returns to the soil to provide nutrients for new products in the future and to benefit tomorrow’s users.
Q.5) We have had data on ball possession in soccer and ball-by-ball data in cricket for some time now. Yet, Big Data and analytics have had a minimal role to play in the selection of players. When do you think the Moneyball model will be replicated across all sports? Do you think you can come up with a rating system that can tell, across formats and eras, who is the best batsman in cricket?
Nitin: I guess a lot of analytics already goes into the selection of players as well as into other sports-specific decisions, like what role a particular player would play in a game against a particular team. It’s just that most of it is currently ‘excel-sheet’ based. Some organisations are already working on the idea and have come up with excellent tools to visualise sports data and help coaches and teams take away tangible to-dos from the data (check out Viz Libero).
As for a rating system, well, of course a rating system can be easily designed. However, the most important task would be to define ‘best’ in a form that would be widely acceptable. Some might say that a batsman is good when his average is good. Others might prefer consistent performance as a better measure of goodness. Once a definition is arrived at, the rating would just be a function of the key parameters that add up to this definition. Further, I think a more interesting problem statement would be: given this measure of goodness, predict who amongst the current lot of youngsters will be the next Bradman.
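To make the idea of a rating as "a function of key parameters" concrete, here is a minimal sketch, not a method used by the team. It assumes that 'best' has been defined as a weighted blend of scoring average and consistency. The function name `batting_rating`, the weights, the consistency formula, and the sample scores are all hypothetical, and the plain mean of runs per innings is a simplification of a real cricket batting average (which divides runs by dismissals).

```python
from statistics import mean, pstdev


def batting_rating(scores, w_average=0.7, w_consistency=0.3):
    """Blend scoring average and consistency into a single rating.

    scores: runs scored per innings (hypothetical input format).
    w_average, w_consistency: illustrative weights, not calibrated values.
    """
    if not scores:
        raise ValueError("scores must not be empty")

    avg = mean(scores)  # simplified average: mean runs per innings
    if avg <= 0:
        return 0.0

    # Consistency in (0, 1]: 1.0 when every innings is identical,
    # approaching 0 as the spread grows relative to the average.
    consistency = 1.0 / (1.0 + pstdev(scores) / avg)

    # The rating is the average, discounted for inconsistency.
    return avg * (w_average + w_consistency * consistency)


# Two made-up batsmen with the same average but different consistency.
steady = [50, 50, 50, 50]
volatile = [0, 100, 0, 100]
print(batting_rating(steady), batting_rating(volatile))
```

Changing the weights changes who comes out on top, which is the point made above: the hard part is agreeing on the definition of 'best', not computing the number.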
Q.6) One of the common criticisms of Big Data has been that its promise hasn’t been translated into performance; it has been able to only identify trends and correlations, not causation and predictions. How do you respond to that?
Nitin: A model based on data performs only as well as the formulation of the original problem statement. Defining the problem statement accurately is 90% of the job, designing the model is the next 1% and iterating on that model is the final 9%. Whenever a model fails, one should first revisit the problem statement and then the model used.
The second issue is the importance of being very critical when choosing the predictive parameters. One cannot ask, “Given that it was hot yesterday, where should I invest my money?” This follows the principle of “garbage in, garbage out”.
I would say a lot of the criticism is driven by a lack of emphasis on these issues. There are many industries that have benefited hugely from big data. A few good examples would be credit card fraud detection, devising financial products, and improving human-computer interaction with gesture and voice recognition.
Stay tuned for Part 2 of the interview with the DSL team, coming up soon!
Interested in an exciting career opportunity with one of the best data science teams in the country? Have a look at our current job openings, and apply here: housing.com/careers