Most data scientists apply their craft to predicting the mundane. In retail, that means recommending an album, a gizmo or some such that has the best chance of wresting another dollar from the customer’s wallet. If the underlying data science model pushes the latest Cardi B on a father whose teenage daughter hijacked his account for a day last week, no harm is done – one could even say it made the father feel a little hip for a second. If that Nest thermostat tends to be set more to your significant other’s liking just because she cares to get up and change the setting and thus train the damn thing, you can only blame yourself and not the thermostat. And if Alexa fails to play your favorite song because of the way you roll your r’s, you shrug and move on, and soon enough your favorite song becomes whatever Alexa does play on a consistent basis.
But data scientists are moving beyond the mundane more and more.
Take self-driving. Suppose your algorithm has to decide whether that blur, moving swiftly towards the car and bearing a vague resemblance to a bicycle, is instead just white noise in the video, an artifact of the compression algorithm or a ghost reflection of some unknown variety in the radar (the list is long here), and then instruct the car to carry on without a worry in the world. It had better be sure, as in almost one hundred percent sure (granting that full perfection is unattainable). But how can you be really, really sure? And how sure is really, really sure? What if this particular bicycle has only a vague resemblance to most bicycles in the first place? What if it’s a unicycle? Or one of those giant-wheeled bicycles that make you wonder how the hell the rider got on it in the first place? Does the algorithm get a pass? Are such bikes and their riders to blame for not opting for a standard bike that is more visible to algorithms? Perhaps you can train your algorithm on these varieties of bicycle, but what about the next odd incarnation? In a more mundane application, you wait for a few false negatives and retrain. There is nothing wrong with learning from one’s mistakes. But the real question here is when an algorithm is ready to be put out in the field in the first place. What’s the right level of false negatives (or false positives, for that matter)? In situations where lives depend upon the success of algorithms, it’s imperative that data scientists take it upon themselves to look closely at how accuracy is measured and how performance bars are set, and not rely solely on a word from above, which may come only after lives have already been needlessly harmed.
The general principle on which to do this is clear enough.
The algorithm must be more accurate than a human being.
It’s in the application of this principle that shortcuts are taken in the interest of time and money. When it comes to comparisons, the tendency is to pit a human being with average driving skills against an algorithm inside a car in mint condition, operating in friendly conditions and deep within its trained domain of application. In other words, the tendency is not to compare an algorithm driving in snow next to an elementary school full of jumpy kids with a parent driving carefully with all their experience, but to compare an algorithm driving on perfectly painted lanes with an average human driving with life’s distractions on their mind and a drink or two in their belly. This is not a high enough bar, not when lives are at stake. We should aim higher, as in this guiding principle.
An algorithm with such high stakes is ready for prime time only when it can beat not just the average human in average conditions but also the best of human drivers in the worst of the conditions the algorithm would be allowed to encounter.
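To make that bar concrete, here is a minimal sketch of such a readiness check. It assumes you have measured the same error metric for the algorithm and for the best human drivers in every condition the algorithm is allowed to encounter. The names `ConditionResult`, `failing_conditions` and `is_field_ready` are hypothetical, the numbers are made up for illustration, and the sketch ignores statistical uncertainty in the measured rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConditionResult:
    """Error rates measured in one operating condition (hypothetical fields)."""
    condition: str
    algorithm_error_rate: float   # e.g. missed detections per 1,000 encounters
    best_human_error_rate: float  # same metric, measured for the best human drivers


def failing_conditions(results):
    """Return the conditions in which the algorithm does not beat the best human."""
    return [r.condition for r in results
            if r.algorithm_error_rate >= r.best_human_error_rate]


def is_field_ready(results):
    """Ready only if the algorithm wins in every condition it may encounter."""
    return len(results) > 0 and not failing_conditions(results)


# Made-up numbers, purely for illustration.
results = [
    ConditionResult("clear day, painted lanes", 0.2, 0.5),
    ConditionResult("snow near a school", 1.4, 0.6),
]
print(is_field_ready(results))      # False
print(failing_conditions(results))  # ['snow near a school']
```

The point of the design is that a strong average cannot hide a weak condition: one failing condition is enough to keep the algorithm out of the field, or to remove that condition from where it is allowed to operate.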
Take predictive maintenance in offshore oil and gas drilling as another example. A malfunction of the Blowout Preventer (BOP), which stops the uncontrolled release of subsurface oil or gas, is not something to take lightly. If your algorithm is going to mine the IoT sensor history of a BOP, analyze the machine’s degradation with age and then provide assurance that the BOP can be safely used for another, say, three months without any reconfirmation from human experts, it had better be sure – really, really sure. Is there even enough data to draw that conclusion with any reasonable confidence? A typical drilling company will have multiple models of BOPs from multiple suppliers at various drilling sites. Drawing behavioral conclusions from one model of BOP and applying them to another is rife with risk unless the data proves otherwise. Often such concerns are ignored and overstated conclusions follow. No drilling company has enough data to depend solely upon a data science model to predict the usability of infrequently used equipment such as a BOP. In this scenario, the chances of both false positives and false negatives are high unless the analysis is corroborated by more traditional physics models. What needs to be acknowledged here is that the data science model has a limited domain of application and is best used in conjunction with both physics-based models and human experts.
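One way to see how little a small sample can assure is to put a confidence interval around an observed failure rate. The sketch below uses the standard Wilson score interval. The scenario of 20 recorded activations with zero failures is a made-up illustration, not data from any drilling company, and `wilson_interval` is a helper written for this example.

```python
import math


def wilson_interval(failures, trials, z=1.96):
    """Wilson score interval for a failure probability (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = failures / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)
    ) / denom
    # Clamp to [0, 1] to absorb floating-point round-off at the edges.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical: 20 recorded activations of one BOP model, none of them a failure.
low, high = wilson_interval(failures=0, trials=20)
print(f"{low:.3f} to {high:.3f}")  # 0.000 to 0.161
```

Under these assumptions, a spotless record of 20 activations is still compatible with a true failure rate of around 16 percent, which is nowhere near "really, really sure".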
Where does this leave us data scientists? Here’s a more generalized version of the guiding principle to follow in such high-stakes applications.
In each of its domains of application, the algorithm must be more accurate than a human being without any impairment. The application containing the algorithm must fully ensure that the algorithm is never applied outside its domains of application, and there must be a fail-safe mechanism that takes the algorithm out of operation and hands control over to a human being, without harm to anyone, whenever there is a danger of conditions shifting out of the training domain. If the domain of application is such that such a hand-over is not possible, consider the algorithm not ready for the field.
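As a rough sketch of what such a guard might look like in code, the wrapper below lets the model decide only while every input sits comfortably inside the range seen in training, and otherwise hands the decision to a human. Everything here is an assumption made for illustration: the `GuardedModel` name, the feature names, the ranges, and above all the idea that a per-feature range check is an adequate proxy for "inside the training domain". A real system would need a far richer notion of domain and a hand-over procedure that is itself proven safe.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class GuardedModel:
    """Wraps a model so it only acts inside its (hypothetical) training domain."""
    model: Callable[[Dict[str, float]], str]
    training_ranges: Dict[str, Tuple[float, float]]
    margin: float = 0.1  # fraction of each range treated as a danger zone at the edges

    def in_safe_domain(self, features: Dict[str, float]) -> bool:
        for name, (low, high) in self.training_ranges.items():
            value = features.get(name)
            if value is None:
                return False  # a missing input counts as out of domain
            buffer = self.margin * (high - low)
            if not (low + buffer <= value <= high - buffer):
                return False  # outside the range, or drifting towards its edge
        return True

    def decide(self, features: Dict[str, float]):
        """Return (who decides, decision); the human path carries no model output."""
        if self.in_safe_domain(features):
            return ("algorithm", self.model(features))
        return ("human", None)


# Illustrative stand-in model and made-up training ranges.
guarded = GuardedModel(
    model=lambda f: "continue",
    training_ranges={"visibility_m": (50.0, 1000.0), "road_friction": (0.4, 1.0)},
)
print(guarded.decide({"visibility_m": 400.0, "road_friction": 0.8}))  # ('algorithm', 'continue')
print(guarded.decide({"visibility_m": 60.0, "road_friction": 0.8}))   # ('human', None)
```

The margin matters as much as the range: the hand-over is triggered when conditions are merely approaching the edge of the training domain, so the human takes control while there is still time to do so safely.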
Put even more simply, for god’s sake, do not overstate the model results. Lives are at stake.