It’s Halloween, and in honor of this spookiest of all holidays, Corios Consulting Director Eric Flora spins a tale of good data gone bad.
Late at night thoughts tend to wander, especially when you are still at the office on Halloween. I was surrounded by the mundane elements of the workplace – desks, phones, and computers. But, on this night, something felt different despite the familiar setting. I imagined the presence of a parallel reality that would physically manifest our thoughts and emotions, and pondered how on Halloween that shadow world seemed so much closer to our own.
I willed myself out of this reverie and back to the task at hand. I had been given a jumble of customer contact information from a variety of data sources, and asked to make sense of it. Column names were deceitful. Formatting was chaotic. Records from different systems either weren’t connected, or when they were, they didn’t agree. It was like a giant living blob with many mouths, each keening a discordant note of helplessness, with no common theme except that the end will be soon.
Whoa, where did that come from? Maybe I should call it a night. But two things kept me in my ergonomic chair: first, pride in my ability to simplify the complex, and second, my fascination with the wretched state of the data and the challenge it represented. The soft amber glow of the coffee maker calmed me, and I set to work.
The disconnection between the column names and the data they contained was the first bugbear to face. Setting aside the spurious labeling, I wrote a program to identify the data represented by the values in each column, for each row. This cell was a name, this one was part of an address, another a phone number, another the rest of the address, and another an email address. The data types of the tokens in each value, their number and sequence, and whether they appeared in a knowledge base all helped to identify and categorize each datum. The blob began to divide, and slimy hobgoblins of different shapes and sizes emerged from the effluent, dancing to a shrill piping scale and tittering as they locked eyes with me.
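For readers who want to try this at home, here is a minimal sketch of what such a cell classifier might look like. It is not the program from that night: the regular expressions, the category labels, and the tiny `KNOWN_FIRST_NAMES` set (a stand-in for a real knowledge base) are all illustrative assumptions.

```python
import re

# Hypothetical stand-in for a real knowledge base of given names.
KNOWN_FIRST_NAMES = {"james", "jim", "robert", "roberta"}

# Illustrative patterns only; real-world data needs far more care.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
PHONE_RE = re.compile(r"^\+?1?[\s.-]?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}$")
STREET_RE = re.compile(r"^\d+\s+\S+")  # starts with a house number
CITY_STATE_ZIP_RE = re.compile(r"\b[A-Z]{2}\s+\d{5}(-\d{4})?$")


def classify_cell(value: str) -> str:
    """Guess what a cell holds from its shape, ignoring the column name."""
    text = value.strip()
    if not text:
        return "empty"
    if EMAIL_RE.match(text):
        return "email"
    if PHONE_RE.match(text):
        return "phone"
    if CITY_STATE_ZIP_RE.search(text):
        return "address_city_state_zip"
    if STREET_RE.match(text):
        return "address_street"
    # Fall back to the knowledge base: does any token look like a given name?
    tokens = text.lower().replace(",", " ").split()
    if any(tok in KNOWN_FIRST_NAMES for tok in tokens):
        return "name"
    return "unknown"
```

The order of the checks matters: the phone pattern is tried before the street pattern so that a number typed as `503 555 0142` is not mistaken for a house number and street.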
I jerked to alertness and ran a hand down over my face. Easy, self, this is just customer data. Ok, I’ve identified what these data represent, what’s next? The image of the hobgoblins lingered in my mind. Each of them was different; here three arms, here hopping on one thick leg, here no head with a face peering out of its chest. Ah, I need to standardize these data. Let’s see, proper case or upper case? Should I use Street or St.? ZIP codes: 5 digits, no, 9 digits, hyphen or no hyphen, 6 alphanumeric characters for Canada. Ok, that is starting to look alright. The hobgoblins have grown still and solemn, and each now has the normal complement of limbs and neck and head. A group of them have a whispered argument, and then one meekly steps towards me to request distribution of cups of coffee for him and his fellows.
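A minimal sketch of that kind of standardization follows. The choices it makes (proper case, "St" without a period, a hyphen in 9-digit ZIP codes, a space in Canadian postal codes) are assumptions picked for illustration, and the suffix table is a deliberately tiny, hypothetical subset.

```python
import re

# Illustrative subset of street-suffix spellings, all mapped to one form.
STREET_SUFFIXES = {
    "street": "St", "st": "St",
    "avenue": "Ave", "ave": "Ave",
    "road": "Rd", "rd": "Rd",
}

US_ZIP_RE = re.compile(r"^(\d{5})(\d{4})?$")
CA_POSTAL_RE = re.compile(r"^[A-Z]\d[A-Z]\d[A-Z]\d$")


def standardize_street(line: str) -> str:
    """Proper-case each word and collapse known suffixes to one spelling."""
    out = []
    for word in line.strip().split():
        key = word.rstrip(".,").lower()
        # Naive proper-casing: would mangle directionals such as "NW".
        out.append(STREET_SUFFIXES.get(key, word.rstrip(",").capitalize()))
    return " ".join(out)


def standardize_postal_code(raw: str) -> str:
    """Normalize US ZIP / ZIP+4 and Canadian postal codes; pass others through."""
    compact = re.sub(r"[\s-]", "", raw).upper()
    match = US_ZIP_RE.match(compact)
    if match:
        five, four = match.group(1), match.group(2)
        return f"{five}-{four}" if four else five
    if CA_POSTAL_RE.match(compact):
        return f"{compact[:3]} {compact[3:]}"
    return raw.strip()  # unrecognized: leave untouched for manual review
```

Whatever conventions are chosen, the point is to choose once and apply them everywhere, so that later comparisons see `123 Main St` no matter how it was originally keyed.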
I am standing at the coffee maker, pot in one hand and cup in the other, before I recall that I am alone in the office. In an absurd, but somehow necessary, attempt to save face I pour and take a sip of the coffee as if this were my intention all along. Returning to my desk, now that the data is in better shape, I can start to compare the data in aggregate from the different data sources. Argh, Jim here and James there, the same person to my mind but multiple IDs within the same system, mis-keyed SSNs, Jr. and Sr. at the same address. Robert and Roberta are different people (although if you asked them they would say they are soulmates). The same person represented in different ways within and across the data sources reminded me of the mythological Hydra with many heads attached to one body, slithering out of the water with venom on its teeth and venom in its heart.
“Ouch!” I say aloud as my forehead impacts the computer monitor. Ok, time to wind this up. I can make groups of names that are essentially synonyms according to my knowledge base. A middle name and a corresponding middle initial can be treated as equivalent. Edit distance will help identify misspellings, and then I can try Soundex, since it looks like some of these names were keyed in after being spoken over the phone. I trust this data source more than the others, so I’ll use it if possible to represent the group, then the next most trustworthy, and so on. Then I assign a surrogate key to each of the groups, and the Hydra sinks back into the water, steaming at its defeat. No more monsters, just a collection of ordered data that corresponds with real people in the real world.
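The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual matching logic: the `NICKNAMES` table is a hypothetical stand-in for the knowledge base, the record fields (`first`, `middle`, `last`, `suffix`, `source`) are invented for the example, and the grouping is a simple greedy pass. Edit distance and Soundex are applied only to last names here, precisely so that Robert and Roberta stay apart.

```python
# Hypothetical nickname table standing in for the knowledge base.
NICKNAMES = {"jim": "james", "jimmy": "james", "bob": "robert", "rob": "robert"}

# American Soundex consonant codes.
SOUNDEX_CODES = {c: d for d, letters in {
    "1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r",
}.items() for c in letters}


def canonical_first(name: str) -> str:
    key = name.strip().lower()
    return NICKNAMES.get(key, key)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def soundex(name: str) -> str:
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    last = SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = SOUNDEX_CODES.get(c, "")
        if digit and digit != last:
            code += digit
        if c not in "hw":  # h and w do not break a run of equal codes
            last = digit
    return (code + "000")[:4]


def same_person(a: dict, b: dict, max_edits: int = 1) -> bool:
    """Heuristic match; every field is assumed present (possibly empty)."""
    if a["suffix"].lower().rstrip(".") != b["suffix"].lower().rstrip("."):
        return False  # Jr. and Sr. are different people
    if canonical_first(a["first"]) != canonical_first(b["first"]):
        return False  # Jim matches James; Robert does not match Roberta
    mid_a = a["middle"].lower().rstrip(".")
    mid_b = b["middle"].lower().rstrip(".")
    if mid_a and mid_b:
        if len(mid_a) == 1 or len(mid_b) == 1:
            if mid_a[0] != mid_b[0]:  # initial must agree with full name
                return False
        elif mid_a != mid_b:
            return False
    last_a, last_b = a["last"].lower(), b["last"].lower()
    return (edit_distance(last_a, last_b) <= max_edits
            or soundex(last_a) == soundex(last_b))


def resolve(records: list[dict], trusted_sources: list[str]) -> dict[int, dict]:
    """Group matching records, then keep the most trusted record per group."""
    groups: list[list[dict]] = []
    for rec in records:
        for group in groups:
            if same_person(rec, group[0]):
                group.append(rec)
                break
        else:
            groups.append([rec])
    rank = {src: i for i, src in enumerate(trusted_sources)}
    # Surrogate keys are simply 1, 2, 3, ... in order of first appearance.
    return {key: min(g, key=lambda r: rank.get(r["source"], len(rank)))
            for key, g in enumerate(groups, start=1)}
```

A real pipeline would also weigh addresses, SSNs, and other fields before declaring two records the same person; Soundex on its own is loose enough to pair Robert with Rupert.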
I turn off my monitor, grab my jacket, and lock the door on my way out. If I hurry, I can still make it to the costume party before the clouds clear and the full moon shines down with its transforming light. I know I’ll be very hungry soon, and there will be plenty of food there. Wait, where did that thought come from?
To our fellow data scientists, Happy Halloween! May you slay your own data monsters and live to tell the tale. And to our fellow business owners: if your data resembles the monsters described here, shoot us an email. We can help.