Can data science pick stars in individual sports, say, tennis?

Statistics has become such a prevalent tool in team sports today that I wouldn't be surprised if teams even tested a potential star recruit on its basic principles. After all, it'd make it so much easier to explain when the team trades him in the future for a bunch of no-name players from a far-flung league. In baseball, its use is so old that they've already put the story on the silver screen. In football, the early adopter, Bill Belichick of the Patriots, moves around in a mad-genius hoodie, creating visions of a day when the Mark Zuckerbergs of the world will run football teams.

Even as a layman, if such a thing even exists in this age of cheap online courses in just about anything, it's easy enough to understand the central argument for the use of statistics in team sports, which goes something like this. It takes a team to win games, and a team of individually average players who each provide just what the team needs at a low price will, on average, compete well against richer teams built around a few superstars but with missing or weak links. How do you pick such players? That’s where statistics (or its newer, dirtier form, data science) comes to the rescue. In essence, let past performance speak for itself. Instead of a star who strikes out a lot while batting .250 with 35 home runs, a player with a .400 on-base percentage and no home-run power might be what the team needs. So identify metrics that actually work instead of relying on a handful of old-style numbers such as RBI, BA and HR. Build models to simulate how a team will perform on average with a given set of players. And tinker away until you get a team with the required winning percentage and, just as importantly to the owner, the right payroll.
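To make the "build models to simulate" step a little more concrete, here is a toy sketch of what such a simulation might look like. Everything in it is my own simplification rather than anything a real front office uses: each batter is reduced to two made-up probabilities (reaching base without a home run, and hitting a home run), every runner advances exactly one base whenever a batter reaches, and the slugger's 600 plate appearances and .310 on-base percentage are assumptions, since the example above only gives his batting average and home runs.

```python
import random

def simulate_game(lineup, rng, innings=9):
    """Return runs scored in one game.

    lineup is a list of (p_on_base, p_home_run) pairs, where p_on_base is the
    chance of reaching base WITHOUT a home run (hypothetical toy model).
    """
    runs = 0
    batter = 0
    for _ in range(innings):
        outs = 0
        bases = [False, False, False]  # first, second, third
        while outs < 3:
            p_on, p_hr = lineup[batter % len(lineup)]
            batter += 1
            roll = rng.random()
            if roll < p_hr:
                # Home run: batter and every runner score.
                runs += 1 + sum(bases)
                bases = [False, False, False]
            elif roll < p_hr + p_on:
                # Simplification: every runner moves up exactly one base.
                runs += bases[2]
                bases = [True, bases[0], bases[1]]
            else:
                outs += 1
    return runs

def average_runs(lineup, games=10000, seed=0):
    """Average runs per game over many simulated games."""
    rng = random.Random(seed)
    return sum(simulate_game(lineup, rng) for _ in range(games)) / games

# Assumed numbers: 35 home runs over 600 plate appearances, .310 on-base overall.
slugger = (0.310 - 35 / 600, 35 / 600)
# From the text: .400 on-base percentage, no home-run power.
on_base_guy = (0.400, 0.0)

slugger_lineup = [slugger] * 9
on_base_lineup = [on_base_guy] * 9

print("nine sluggers:    ", average_runs(slugger_lineup))
print("nine on-base guys:", average_runs(on_base_lineup))
```

A real model would mix players within a lineup, add doubles, steals, pitching and defense, and put a salary next to every name. The tinkering loop is the same, though: swap a player, rerun the simulation, and compare the average.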


When it comes to individual sports, though, the jury is still largely out, especially on picking future winners from a lineup of teens crushing balls on a hot tennis court or, at the higher rungs of the sport, picking the world’s future top ten players from the current top two hundred. Even using data science fully to deconstruct a player’s strengths and weaknesses and then identify strategies to best counter that player is rare. But things appear to be changing. This report came out today on how the up-and-coming tennis star Nick Kyrgios had just the right mix of “features” for Steve Wood, chairman of Aruba Networks, to have declared him a future star at the age of twelve.


The report touts “genetics, family history and body shape” as the key ingredients of future stars, as opposed to raw talent, which might be correlated with success but by itself may be misleading. Of course, these phrases are hardly exact; presumably there’s a lot more detail to this behind the scenes. It’d be naïve to believe the claims entirely, given that they are being made by a for-profit technology company. I can see how physical specimens like Rafael Nadal or Nick Kyrgios would score highly on these algorithms, but I wonder if John McEnroe with his stick figure or Serena Williams with her wrestler-like frame would ever have been predicted to be future superstars by any algorithm that did not take the raw talent to play the game of tennis as its primary factor. So they’re outliers, says the statistician. Or perhaps there are hitherto undiscovered factors, lying deep inside the genetic code of the player, or in the nutrition and care history of the player’s first few days on earth, or, more arcane still, in the parents’ TV-watching habits, that will explain even the Johns and Serenas of the world. Ah. If only by the grace of more data.