With its growing emphasis on all things AI — coupled with its history as a tool vendor — it’s not surprising that Microsoft is working on tools not just for traditional programmers, but also data scientists.
According to a Microsoft Research presentation from earlier this year, data scientists currently spend 80 percent of their time extracting and cleaning data — AKA “data wrangling.” Microsoft wants to fix this.
Enter “Project Pendleton.”
A year ago, I first heard from a contact of mine about a new machine-learning-related tool under development by Microsoft that was codenamed “Pendleton.” But it wasn’t until The Walking Cat (@h0x0d on Twitter) unearthed more details and documents that I had enough to write about Pendleton.
From a “Getting Started” document on Pendleton from The Cat, here’s Microsoft’s explanation of what Pendleton is:
“Pendleton provides a set of flexible and scalable tools to help you explore, discover, understand and fix problems in your data. It allows you to consume data in many forms and to transform that data into new forms that are better suited for your usage.”
Pendleton is a client app that works on Windows and OS X/macOS. Its design runtime uses Python and depends on various Python libraries.
As one of my contacts described it, Pendleton is a tool aimed at data scientists that is designed for data preparation and cleaning. The tool can do things like remove errant columns, change formatting in columns, handle missing data and the like. It also includes analytics tools to help data scientists figure out what’s included in a dataset. Pendleton can read data from SQL Server, Azure Blobs and Data Lakes. It also can read files stored locally on a PC, my contact said.
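To make those cleaning steps concrete, here is a minimal sketch in plain Python of the three operations my contact mentioned: dropping an errant column, reformatting columns and filling in missing values. This is not Pendleton's API, which hasn't been made public. The sample data, the column names and the `wrangle` function are all hypothetical, and filling gaps with the column mean is just one assumed strategy.

```python
import csv
import io
from statistics import mean

# Hypothetical sample data: untidy names, slash-separated dates,
# a missing score and a leftover "debug_flag" column.
RAW_CSV = """id,name,signup_date,score,debug_flag
1, alice ,2017/01/05,88.5,x
2,BOB,2017/02/11,,x
3,Carol,2017/03/20,92.0,x
"""


def wrangle(text, drop=("debug_flag",)):
    """Illustrative cleanup only; not how Pendleton itself works."""
    rows = list(csv.DictReader(io.StringIO(text)))

    # 1. Remove errant columns.
    rows = [{k: v for k, v in row.items() if k not in drop} for row in rows]

    # 2. Change formatting in columns: tidy names, use dashes in dates.
    for row in rows:
        row["name"] = row["name"].strip().title()
        row["signup_date"] = row["signup_date"].replace("/", "-")

    # 3. Handle missing data: fill blank scores with the mean of known ones.
    known = [float(r["score"]) for r in rows if r["score"].strip()]
    fill = mean(known) if known else 0.0
    for row in rows:
        row["score"] = float(row["score"]) if row["score"].strip() else fill

    return rows


cleaned = wrangle(RAW_CSV)
for row in cleaned:
    print(row)
```

Even on three rows, most of the code is cleanup rather than analysis, which is the kind of overhead a dedicated data-preparation tool would aim to cut.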
Microsoft has been privately testing Pendleton for nearly a year, maybe longer. I haven’t heard how the company plans to release the tool, but it seems a release is still the plan.
I’m thinking that Microsoft Research’s PROSE (Program Synthesis Using Examples) Research team, which “develops program synthesis technologies for data wrangling and incorporates them into real products,” is probably behind Pendleton, at least to some extent.
Meanwhile, speaking of data science and big datasets, Microsoft and Facebook announced today a new standard they developed together for representing deep-learning models that allows these models to be transferred between frameworks.
That new standard, Open Neural Network Exchange (ONNX), will allow developers to switch between AI frameworks like Microsoft’s Cognitive Toolkit, Facebook’s Caffe2, PyTorch and more. The initial version of the ONNX code and documentation is available now as open source on GitHub.