Don’t be data-driven in AI
Being data-driven is usually understood as a positive thing, but when I hear the term I get a little anxious about the “data-driven” decisions that might be about to happen. Let me explain why.
Wikipedia defines the term like this: “The adjective data-driven means that progress in an activity is compelled by data, rather than by intuition or by personal experience.” In other words, you look at the data as your primary source of information to act on. When the data gives you a reason to act, you act. At a glance it might seem like a very sound way to work, especially in the AI domain, which relies on data in so many ways. But in fact being data-driven can be very problematic when working with AI. I actually think people who say they are data-driven are, in general, on the wrong track. This does not mean that I’m against putting a lot of effort into understanding your data. I’m actually a big believer that collecting, understanding and preparing data should be the activities that get the most resources in an AI project. So I’m pro good data science but against being driven by data, and I see those as two very different things.
But then why is it so problematic to be data-driven?
My primary argument is that the driver behind decision making and activities should not be the data you have, but rather curiosity about the problem and the world around it. In a sense that means being driven by the data you don't have. The final goal of an AI project is often to solve a problem or improve a process, and the solution does not always exist in the data you have, because that data was generated by the way the world currently solves the problem. So instead you should be curiosity-driven or at least problem-driven. This means that you should not approach problems by looking at your data and drawing a conclusion. You should look at your data, look for the blind spots and be curious from there. What is it that you don’t know? I’ll get back to curiosity later. First I have some more arguments against being data-driven.
You will very rarely have all the data relevant to a problem, even after exhausting every potential data source. So when you draw conclusions from the data you have, the conclusion will always be at least a bit off. This doesn’t mean that the data or the conclusion is not useful, but you will always be at least a bit wrong. As the saying attributed to the statistician George Box goes: "all models are wrong but some are useful".
Another problem with being data-driven is the narrative that decisions made on data are better than decisions made on gut feeling. While that might be true sometimes, data is not one-size-fits-all: it can be very helpful at times and very misleading at others.
An example is Ronald Fisher, the father of modern statistics, who in hindsight was a little too data-driven. He stubbornly held to his conclusion that the data did not show that smoking causes lung cancer. The relation, he said, could just as well run the other way around, so that people with lung cancer or a higher risk of lung cancer were simply more likely to be smokers. He argued that it was either a genetic relation or that cancer patients would use smoking to soothe pain in their lungs. So even the best statisticians can be told stories by data that are far from the truth.
The last problem with data is its ability to tell you the story you want it to tell. That can happen consciously or unconsciously. A famous quote attributed to the economist Ronald Coase goes “If you torture the data long enough, it will confess to anything”, so there is no certainty that the conclusion you get from data is correct. The interpretation can be very biased, and sometimes we torture the data without even being aware of it ourselves.
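To make the torture point concrete, here is a minimal sketch in Python. It is my own illustration under made-up assumptions, not a real dataset: the function name strongest_spurious_correlation and the numbers (30 rows, 200 candidate features) are hypothetical. Both the target and every feature are pure random noise, so any correlation it finds is meaningless by construction. Still, if you try enough features, one of them will look like a finding.

```python
import random
import statistics


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def strongest_spurious_correlation(n_rows=30, n_features=200, seed=0):
    """Hypothetical helper: correlate many noise features with a noise target.

    Everything here is random noise (an assumption made for illustration),
    so there is no real relationship to discover. The function returns the
    strongest correlation found anyway.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_rows)]
        r = pearson(feature, target)
        if abs(r) > abs(best):
            best = r
    return best


if __name__ == "__main__":
    one_try = strongest_spurious_correlation(n_features=1)
    many_tries = strongest_spurious_correlation(n_features=200)
    print(f"strongest |r| after 1 feature:    {abs(one_try):.2f}")
    print(f"strongest |r| after 200 features: {abs(many_tries):.2f}")
```

The more features you try, the stronger the best-looking correlation gets, even though nothing real is there. That is what torturing the data looks like in practice, and it is easy to do without noticing.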
About curiosity
So as promised I’m getting back to being curious. If I had to choose one keyword for succeeding with AI it would be curiosity. AI projects usually start with a process to optimize or a problem to solve, and before training a model on data you have to be curious about the problem. That way the data comes after the problem and will as a result be more relevant and more specific to it.
Curiosity to me means exploring with as few preconceptions as possible. The best example for me is when children lift up rocks on the ground just to see what is under them. If you have ever seen a child doing that, you will have seen that there are no expectations, only excitement both before and after the rock has been lifted. And that is exactly what curiosity does to the practitioner. It leads to excitement that in turn leads to passion. Passion makes everything much easier, and even the tedious parts of a project will feel effortless.
AI is also exploratory in nature, and that’s why curiosity suits it so well. If there are specific expectations in an exploratory process, then disappointment is almost a given.
As a result you must let curiosity be the primary driver behind the decisions you make and the activities you take on. Being data-driven is reactive in nature, and if you want to be innovative in solving problems you must be proactive. Being proactive requires you to be curious about your blind spots and to be driven by the unknown.