Philip Hummel, Distinguished Member Technical Staff at Dell Technologies

Many researchers have attempted to measure the effort data scientists expend on preparing data for modeling versus the time spent training and evaluating candidate models. The results have been surprisingly consistent, with most estimates putting data preparation at approximately 80% of total analysis time. However, the skills and tooling required for data preparation, especially on distributed systems, receive far less attention than the less time-consuming modeling phase. This talk looks at options for distributed data preparation that allow data scientists to experiment with data pipelines and still have time to focus on modeling.
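The abstract does not name specific tools, but the core idea behind most distributed data preparation frameworks (Dask, Spark, and similar) is the same partition-and-parallelize pattern. As a minimal, illustrative sketch (all function names and the sample records here are hypothetical, not from the talk), the pattern can be shown with Python's standard library alone:

```python
from concurrent.futures import ThreadPoolExecutor

def clean_chunk(rows):
    # Per-partition preparation work: drop records with missing
    # fields and normalize the remaining values.
    return [
        {"name": r["name"].strip().lower(), "value": float(r["value"])}
        for r in rows
        if r.get("name") and r.get("value") not in (None, "")
    ]

def prepare(rows, n_workers=4):
    # Split the data into partitions and clean them in parallel --
    # the same pattern distributed frameworks apply across a cluster.
    size = max(1, len(rows) // n_workers)
    chunks = [rows[i:i + size] for i in range(0, len(rows), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        cleaned = pool.map(clean_chunk, chunks)
    return [row for chunk in cleaned for row in chunk]

if __name__ == "__main__":
    raw = [
        {"name": " Alice ", "value": "1.5"},
        {"name": "", "value": "2.0"},    # dropped: missing name
        {"name": "Bob", "value": None},  # dropped: missing value
        {"name": "Carol", "value": "3.25"},
    ]
    print(prepare(raw))
```

In a real cluster the partitions live on different machines and the combine step involves network shuffles, which is where the tooling and skills gap the talk addresses comes in.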