Rosenblatt Securities - Techniques for Handling Data from Excel & Other Files - Inspire 2017

Our data does not always start in clean and predictable databases. Data manually typed into Excel, or data files scraped from the web almost always need cleanup or validation before the real work can begin. In this tips and tricks session, we will cover techniques to add resiliency and messaging to your workflows, addressing situations including new/reordered/changed columns, combining multiple sheets from multiple files, and more.



Video Transcription


Joe Mako:
All right. Thank you everybody for coming here today. We're gonna get started. I know we got lunch coming up here soon. So, I'll try to run through this as quick as possible. And then we'll have questions at the end because I'm gonna cover a lot of stuff. And then I'm gonna try and go through it quickly. So, if you have questions, jot them down on a note paper. And then we'll get to them at the end and we can go through anything or if it's off-topic, we can answer that as well.

So today, we're going to be talking about techniques for handling data from Excel, and other messy, unpredictable files. So, show of hands: every time you've needed to load multiple flat files and union them into one table, their structures have been exactly the same, no exceptions. Anybody have that experience? Okay, you probably don't need this class. You're good. End of talk, because what I live with in my day to day is messy data, where humans enter the data and they don't enter it consistently. When they create a file from week to week or month to month, they move the columns around. They add in blank rows. They add in blank columns. They change the order of the columns, just endless mess.

Even if it's files coming out of a machine that's generating them, sometimes that machine gets updated with new software and the new file has a different structure. And so how do we build resilient workflows that can handle these messy, and sometimes unpredictable, files? And that's what we're gonna talk about here today. And so here's kind of a brief rundown of the key points. Gonna look at clean data vs. messy data, the ideal situation that I think Alteryx developers created a tool for, but it doesn't match reality.

And then the method that I use as my standard baseline for loading all data in, which I think is very flexible, and then I'll run through some situations that are common. And it's of course not all-encompassing because every situation is going to be different, but I'll just cover some basic concepts. And then talk about where you can get some other examples, where you can challenge yourself, and just generally how to think about this with a mental model.

So, the first thing to be aware of is you're not alone. And also, every situation is slightly different. If you've been on the community or asked this question on the community, "How do I load in multiple Excel files with multiple sheets, or multiple CSV files, or whatever, multiple flat files, and bring them together into a single table?" It's a hard problem. And you're not alone. And there's no one-size-fits-all solution to it. I've tried creating a macro for it, put it in with the CReW macros. Other people have created knowledge base articles, a series of knowledge base articles, one-off solutions, tips on how to do it yourself.

There's no one size fits all. This is an approach, a mental model. And it's also a lot about having forgiveness for yourself and an understanding that this is an iterative process. It's not going to come out right the first time you try to do it. And having that understanding that there's going to be a gap between our ambitions and our ability when we begin, because I look at this as a creative endeavor. And in order to close that gap between our abilities and our taste takes meaningful practice and acceptance and a growth mindset. Okay?

So, what we'll do is we'll just run through the idea of here is the ideal situation- that if your data was clean, that you would have every worksheet with the same headers. And they can be separate files, but every file and every worksheet would have that same header structure. No blank records, no misordering, or anything like that. And when your data is like that, we can use a workflow like this, where... Hopefully, everybody can see this or nope, we can't. Let me see if I can change this. How do I exit PowerPoint? We're gonna close off PowerPoint.

Okay. So, let me start again there. So, the clean, ideal structure would be something like this, where each sheet has the same headers across each of the files. This doesn't happen. But if it did happen, if you had well-structured data, you could have a process as simple as this, where you have a directory tool to bring in all your xlsx files. This formula tool will append the special characters for the list of all the sheet names. What this enables you to do is pull out all the sheet names within all the files, so that you can pull in all of them at the same time. And then this dynamic input would allow you to take that constructed path that we made here. That is this right here, where it takes what we had before, just the path to the file, and the formula step appends the list-of-sheet-names special characters to it.

And using that as our field that we're changing the entire file path with, that would bring in all of the sheet names here. And what it does is it's got the file name here and the sheet name. Now, how did I get it to bring in this file name? It's when I configure this initial connection. There's this option here on the output that says, "Include the full path." By default, this starts off at no. And by changing this to full path, that gives you both the data all stacked up and the full path for the file name that we used to pull that data in. Because in the next step, we take those list-of-sheet-names special characters and replace them with the actual sheet name.

And one of the things I like to do with that is wrap the sheet name in the special, slanted quote character. It's that key right next to the one on the keyboard. And that allows the sheet name to have special characters or spaces, and things like that, and still be brought in. And then from there, you can run the dynamic input again to replace the file path with that full path that we just created. And that appends all the data together, and you get your nice, clean data.
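For readers who want to see the same idea outside of Alteryx, here is a rough pandas sketch of that clean-data flow: walk a folder, enumerate every sheet in every workbook, and stack everything into one table tagged with its source. The folder name, the column label, and the `|||` separator are illustrative assumptions, not part of the session.

```python
# A rough pandas equivalent of the "clean data" flow: list every .xlsx file
# in a folder, enumerate its sheets, and stack the data into one table.
from pathlib import Path
import pandas as pd

frames = []
for path in Path("input_files").glob("*.xlsx"):       # the "directory tool" step
    workbook = pd.ExcelFile(path)                      # open once to read sheet names
    for sheet in workbook.sheet_names:                 # the "list of sheet names" step
        df = pd.read_excel(workbook, sheet_name=sheet)
        df["FileName"] = f"{path}|||{sheet}"           # tag each row with file + sheet
        frames.append(df)

# Only works cleanly if every sheet shares the same headers -- the "ideal" case.
combined = pd.concat(frames, ignore_index=True)
```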

This would be the ideal solution, if your data was perfect every time. But this isn't reality. So, what happens if you were to try to run this in reality? You would get an immediate error. You wouldn't even get your list of sheet names in, because the built-in Alteryx dynamic input tool only works in perfect conditions. And the perfect condition is every header is the same, every data type is precisely the same, every string field is the same length. So, it's incredibly fragile and incredibly picky. And it will fall down very quickly. And when I first tried using this dynamic input, I was frustrated with it.

And I keep trying to tell Alteryx developers that they need to come back to this, make this more dynamic, make it more flexible, so that out of the box, people can build solutions. Because right now, the only way that I know around this is we've got to create a macro. And that's for everybody. I mean, most people need this, but not everybody knows how to make a macro in Alteryx. It can be intimidating. So, I'm gonna walk through the macro that replaces these dynamic input tools. And what it is is we start off with an input that we build similar to what we would for anything else, where we connect to our Excel file. And we include our full path.

And then what we need to do is build that dynamic replace of the full path. And we do that from the interface with a control parameter. Now, when we drag and drop a control parameter onto the workspace, what it will automatically do is change the workflow into a batch macro. This tool, this interface element, is specific to a batch macro. And what the batch macro allows you to do is run this macro, run this workflow, for each record. So if I'm coming in and supplying it with two records, this macro is going to run twice. And what it's going to use from my control parameter is what I pass it for this full path. So, I'm kind of taking this workflow's input and replacing the full path with that value from each record.

So if each record has a different full path, I will input a different file. And what I'm doing here as I'm passing it through... Let's do it this way... So inside this macro, the first thing I want to do to be able to standardize this and be able to concatenate or rather union any files of any structure, is I need to generate a standard structure. And my goal for a standard structure looks like this. I want to have the file name. I want to have record ID. I want to have the column number, and then the actual value. So, I'm going to make my data incredibly tall. By having the data really tall, there's no change in structure from file to file. This will literally import any flat file that you can throw at it and stack them all together into a single file. But this is kind of that raw, not yet useful, but this is that first step.
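Here is a minimal sketch of that tall baseline structure in pandas, assuming the sheet is read with no header row so the potential header survives as record zero. The function name and the column labels (FileName, RecordID, ColumnNumber, Value) mirror the structure described in the talk but are my own naming, not the Alteryx field names.

```python
import pandas as pd

def to_tall(path: str, sheet: str) -> pd.DataFrame:
    # header=None keeps row 0 (a possible header row) as ordinary data
    raw = pd.read_excel(path, sheet_name=sheet, header=None, dtype=str)
    raw.insert(0, "RecordID", range(len(raw)))        # record 0 = potential header row
    tall = raw.melt(id_vars="RecordID",               # make the data as tall as possible
                    var_name="ColumnNumber",
                    value_name="Value")
    tall.insert(0, "FileName", f"{path}|{sheet}")     # which file and sheet each cell came from
    return tall

# Any flat file, whatever its width or headers, comes out with the same four
# columns, so sheets of completely different shapes can simply be concatenated.
```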

So how do we get to that structure? We add a record ID, and what I like to do is start it at zero, so that row zero is potentially that header row, commonly. And then I'll put it at the last column so that it can be in a nice sort arrangement for me. Then the next step is this transpose. And in the icon you can see the transpose takes wide data and makes it tall. And we're going to make it basically as tall as possible. Our key fields are gonna be our file name and our record ID; that's kind of our metadata about this individual cell. And then our data fields are gonna be all of our fields, plus the dynamic and unknown fields.

This option of picking dynamic or unknown fields, which we have in a lot of tools inside of Alteryx, is a very powerful feature. It allows us to build data-driven workflows that can adapt to changing fields. So if you don't know what your fields are, you can use this dynamic or unknown fields option to cause Alteryx to handle those unknown fields and build logic around that. And that's what we're doing here in this case. We're not always going to have F1 to F3. Maybe we'll have 23 fields here.

And in this next step, I change this F, because the default when we pick "first row contains data" is that it's going to call the fields F1 through whatever. And if we try to sort on that, Alteryx will either have to use the dictionary sort, which is going to be slower, or it would put it in an odd sort order by default. So, I generally like to remove that F from it and turn it into a number, like an integer. And one technique for doing that is the multi-field formula tool, because what it allows me to do is pick my field and then change the data type. So, I'm not creating additional fields and additional selects, and all sorts of additional stuff. This keeps my workflow clean by doing multiple things in a single tool: I'm both changing the data type and replacing that F character with an empty string, and then converting it to a number in the formula.
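The same cleanup sketched in pandas, assuming the transposed field names arrive as the text values "F1", "F2", and so on; the column names here are illustrative.

```python
import pandas as pd

tall = pd.DataFrame({"Name": ["F1", "F10", "F2"], "Value": ["a", "b", "c"]})

# Strip the leading "F" and cast to an integer so columns sort numerically,
# not as text (which would give F1, F10, F2).
tall["ColumnNumber"] = tall["Name"].str.replace("F", "", regex=False).astype(int)
tall = tall.sort_values("ColumnNumber")   # 1, 2, 10 -- not 1, 10, 2
```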

And so with all those steps, that would take any data structure and convert it into this data structure here of... Oops! Let me see if I can zoom in there again. So, this is that baseline data structure that we want. And so now if we try to run it with our new macro in place of the dynamic input tools, again, we're doing a directory concatenated into a list of sheet names. Load in those sheet names, and that gives us... Here's this data structure, where we've got our values, which is all of our sheet names inside of each of these files.

So let's take a look at these real quick, at some of the strange, messy files that this workflow can handle. We've got things with extra header rows here at the top before the data actually starts. We've got extra columns. We've got columns that are missing, or we've got... Let's see, what else do we got in here? Messy Two. We've got duplicate fields where the data has been copied and pasted. We've got blank rows. We've got names that are different. We've got the order of the fields different.

And those are all different kinds of messy things that we're going to be able to handle with this approach. And really, we can handle any type of messy structure, but those are just a couple of techniques to cover. So from there, once we load in that sheet name again, we take that full path, concatenate our sheet name with it, and then run that through our load data again. And that will bring in all the data from all the worksheets from all the different files, all stacked together. And this is my starting baseline, raw data.

When I'm given new data that I've never seen before, and I don't have any trust in it, this will be the workflow that I'll throw it through first. And then from there, we go to validate, which is going to do my initial checks to see what I'm working with, my initial data profiling. And of course, this is just one example of how I could profile it. There's really infinite ways that you could profile your data to investigate it, to try to see: what are you working with?

And so what I've done here is I've copied that load messy up here and I've added in these things below. Let's see if I can zoom in a little bit. And I'll move this over so we can see what we've got. So, we've got our original. Load the data in. And then we've got our... How do we want to validate it? And first step in validating is I've got my initial expectations. I expect to have the values A, B, and C for my headers. And I expect them to be in the order one, two, three. So, or I can add in any other expectations that I have starting out with this.

We can use the append fields tool to take our list of files, or our list of worksheets here. So, our list of worksheets and full paths, and append that in there. And what the append fields tool does is a Cartesian join. It will multiply this. So, I've got seven worksheets coming in and three fields that I expect to find in each worksheet. So, seven times three is going to give us 21 records. So, this is my perfect baseline expectation. And then I can join that with what data got loaded in and see what falls out, what matches up. So this join is where I would manually look through each of these outputs here, and see what kind of messiness am I dealing with? What kind of things did match, or are falling out on each side?

In this case, what I'm interested in is where do my expectations not exist? Where are my expectations not an exact match? Because I'm joining on the file name, the name, which is actually our column ID, like column sequence one, two, three, and then the value, which is the actual header value inside of the worksheet. So, when everything matches, it comes out here on the join side. When it only exists on this side, our expectations, it'll come out on this right side. So this right side, these are my weird cases that I want to build logic to handle. And I can write messaging based off of those to give me some initial reporting.
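A hedged sketch of that validation step in pandas: cross-join the list of sheets with the expected headers (the append-fields Cartesian join), then left-join against the header cells that were actually loaded, so anything that exists only on the expectation side falls out as a gap to investigate. The file names, header values, and column labels here are made-up examples.

```python
import pandas as pd

sheets = pd.DataFrame({"FileName": ["f1.xlsx|Sheet1", "f2.xlsx|Sheet1"]})
expected = pd.DataFrame({"ColumnNumber": [0, 1, 2], "Value": ["A", "B", "C"]})

# Cartesian join: every sheet paired with every expected header (2 x 3 = 6 rows)
expectations = sheets.merge(expected, how="cross")

# Header cells actually found in the tall data (f2 is missing header "C")
header_cells = pd.DataFrame({
    "FileName": ["f1.xlsx|Sheet1"] * 3 + ["f2.xlsx|Sheet1"] * 2,
    "ColumnNumber": [0, 1, 2, 0, 1],
    "Value": ["A", "B", "C", "A", "B"],
})

check = expectations.merge(header_cells, how="left",
                           on=["FileName", "ColumnNumber", "Value"],
                           indicator=True)
missing = check[check["_merge"] == "left_only"]   # expectations with no match
```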

So when I run it, I can see messages down here to tell me how many are missing fields and which ones are missing those fields. And how many sheets does this happen in? So I know what I need to manually look through. This aggregation is going to count the number of fields missing. With this message statement here, I can write a formula that says take the record number. Or, after the last record, with this message variable of record number, I can convert that into a string to write a nice, friendly message for myself or for whoever runs this, so that we know how many sheets you have that are missing things. And from there, I can write other logic to say, for each individual record... So, before each row, I can say, "When this expression is true," and just put one equals one or write in the word true, so that condition would be met for each row.

And this message expression would be evaluated for each row that comes through. So, it would send out this message of file name is missing this field or missing how many fields. It's missing three fields, or whatever. So that we can get this nice summary to know what we're dealing with. So here's all these individual messages because I've got four records coming in. I get four messages based on how I configure this of before each row versus this message is configured to say after the last record. And I get a single message for it. So we've got a lot of options in how we can write our messaging. And to make it so that we can know what do we need to pay attention to without having to look at this raw data at the get-go to say what does this all mean?
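This is not the Alteryx Message tool itself, just an analogous sketch in Python of the two messaging styles described: one summary line after everything has been counted ("after last record") and one line per offending row ("before each row"). The `missing` table here is illustrative data shaped like the fall-out from the validation sketch above.

```python
import logging
import pandas as pd

logging.basicConfig(level=logging.INFO, format="%(message)s")

missing = pd.DataFrame({
    "FileName": ["messy2.xlsx|Sheet1", "messy2.xlsx|Sheet1"],
    "Value": ["B", "C"],
})

# Summary message -- the "after last record" style
logging.info("%d expected header(s) missing across %d sheet(s)",
             len(missing), missing["FileName"].nunique())

# Per-row messages -- the "before each row" style
for _, row in missing.iterrows():
    logging.info("%s is missing expected header %r", row["FileName"], row["Value"])
```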

I can look at these messages in my results window to kind of give me a clue. All right, I need to go look at that file and that sheet. And I'm looking for this column that's missing. Okay? And so with all of this awareness of our data, restructured into a very nice, tall structure, we've done our validation to figure out what type of logic we need to write to handle and reshape our data. We can go into reshaping our data. And so at the top here, I've got my initial load-in, and then I've got my initial expectation and validation. And one of the things I noticed here in this validation was that I was missing all three fields inside of that file.

So I can take a look at that Messy Two file and its names and see that I've got different names for each one of these. And I can know that I need to handle for that. So I can create a static file inside of here that says when I get this "Type A", or "A Type", this is what I really expect. So this is the value that exists in the messy file, and this is what I want to convert it to. And by pulling those in and creating a great big list of all those happenings, I can create logic around that. I could even make it more sophisticated, if it happens in certain column ranges; this is just kind of a basic example of it. You can come in with your initial expectations, check that out, figure out what's different, and then create how you want to handle these known exceptions. And build that up as you run through it and find new exceptions in later files that don't match your expectations.

So you don't have to manually handle it once you've seen it. You can add it to that list and it will be taken care of for all past files and future files. So what I've got here is what we did before with our validation. So we do that similar thing again where we append it in. So now I've got, again, all of the potential values and what I expect them to be. And then the ones on this side, on this left-hand side when we were doing this join before, are the ones that did not match our expectations and were not our expectations. So inside of here is where we find our "Type A"s, "B"s, and "C"s. And we can do a join on that to pull them out. So now we know these are the alternate headers that we found inside of our data.

This join over here, this output here, these are our found headers that matched our expectations. So the combination of this join output here and this join output here, that is all of the headers that we were looking to find. So now that we've found all our headers, we union them all together. So we have one great big tall data source of our headers. And then I want to find what is the minimum record where that header exists. And we can see for most of the files, that header row is on row zero. It found that the header row is on row two for that file that had those additional prefixed rows. So now we know where our data starts. And with that knowledge of where our data starts, we join that back in with our raw data.

So our raw data is that red line feeding in. And when we join that in, now I've got my four columns of raw data, plus my right record ID, which is my header row, or where my data starts. So I can drop in a filter tool to pull out our data. This true side is just data, where it's everything that is after that starting record ID. And on this false side, this is our header and everything above the header. And then our next filter will pull out our actual header, and this is where our record ID equals our expected or found header record ID. On this false side, this is all of that extra header data. So if you wanted to pull out information from that, you could write additional logic from this false output here to handle that data in a logical, dynamic, data-driven way.
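A sketch of the where-does-the-data-start logic in pandas, under the same assumed column names as the earlier sketches: find the smallest record ID per sheet among the recognized header cells, join it back onto the raw tall data, and split into data rows, the header row, and whatever sits above the header.

```python
import pandas as pd

def split_by_header(tall: pd.DataFrame, found_headers: pd.DataFrame):
    # tall: FileName, RecordID, ColumnNumber, Value  (every cell from every sheet)
    # found_headers: the subset of cells recognized as headers (matched or aliased)
    header_row = (found_headers.groupby("FileName")["RecordID"]
                  .min()                              # where the data starts, per sheet
                  .rename("HeaderRecordID")
                  .reset_index())
    joined = tall.merge(header_row, on="FileName", how="inner")

    data_rows   = joined[joined["RecordID"] >  joined["HeaderRecordID"]]  # "true" side
    header_rows = joined[joined["RecordID"] == joined["HeaderRecordID"]]  # the header itself
    junk_above  = joined[joined["RecordID"] <  joined["HeaderRecordID"]]  # extra rows above
    return data_rows, header_rows, junk_above
```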

So, what do we want to do with the headers? So, I've got my headers over here, and inside of these headers, I've got my messy headers as well. But I know what the messy ones should be, so I can use a find and replace to join my found headers with my headers that are messy and what they should be. And I configure this find and replace to say I want to take that value, which is this raw value over here, with the messy and the good. And then I want to find my value from over here, which is my known messy values, which is these. And then I want to replace that with what I expect it to be. And so that gives me my cleaned-up headers. So now, I've got my file name. I've got the name here, which is actually the record, or rather, the column number. And then I've got my value for my header.
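A minimal sketch of that header cleanup in pandas, assuming a small static lookup of known messy header names and what they should become. The alias values ("Type A" becoming "A", and so on) are invented for illustration.

```python
import pandas as pd

# Static lookup of known messy header spellings -> the header we really expect
aliases = pd.DataFrame({
    "Messy":    ["Type A", "A Type", "Type B", "Type C"],
    "Expected": ["A",      "A",      "B",      "C"],
})

header_rows = pd.DataFrame({
    "FileName": ["messy2.xlsx|Sheet1"] * 3,
    "ColumnNumber": [0, 1, 2],
    "Value": ["Type A", "Type B", "Type C"],
})

alias_map = dict(zip(aliases["Messy"], aliases["Expected"]))
header_rows["FieldName"] = header_rows["Value"].replace(alias_map)
# Unmatched values pass through untouched; add new aliases to the lookup as new
# exceptions are discovered, and past and future files are covered automatically.
```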

So this is my header information. If I find that I've got duplicate headers, I can use a unique tool on these key fields. And I can see that I have found a duplicate field, and this message in here would tell me that it found a duplicate, that it said... Let's see, duplicate field name... Did it find it here? Maybe it didn't. There it is. So, we can configure this to run after the last record and say that it found duplicate field names, potentially. Now, I've got an error message here, probably because I've got this open. There we go, run that again. So, there's our messaging that lets us know that it found something there. Let's see the output here. Okay.

So we can remove our duplicate fields there and do logic on that if we needed to. The other thing is that now that we have this clean data, we need to join it in with our parsed data. So back here with this filter step where we pulled out our data, all the records that happen after our found header row... If I find that I've got null records and I want to remove them, these three tools together allow us to remove those null records: a summarize tool to count our non-null values, and when that is zero, that means all of the values in that row were null. So, I join that information back in. And then I can filter and keep only records where the count of non-null is greater than zero. And so that removes the null records for me.
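The null-row removal sketched on the tall data in pandas: count the non-null values per original row (file name plus record ID), join that count back on, and keep only rows where the count is above zero. Column names again follow the earlier sketches and are assumptions.

```python
import pandas as pd

def drop_null_rows(data_rows: pd.DataFrame) -> pd.DataFrame:
    non_null_counts = (data_rows.groupby(["FileName", "RecordID"])["Value"]
                       .count()                      # count() ignores nulls
                       .rename("NonNullCount")
                       .reset_index())
    joined = data_rows.merge(non_null_counts, on=["FileName", "RecordID"])
    # Keep only rows from the original sheet that had at least one non-null value
    return joined[joined["NonNullCount"] > 0].drop(columns="NonNullCount")
```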

And then we join our parsed, tall data with our cleaned-up header information by joining it on the file name and that name, which is our column ID. And then that gives us this structure here, where I've got my file name, my record ID, my data value. This is probably not needed, and then my field name, my cleaned-up field name. And throw that into a cross tab tool, and that'll allow me to say, "Here's my key fields," the fields that I want to have for each row in my final output: a file name and record ID. What's gonna be my new headers is my new field name. And the value is my value. And that creates this final unioned output of all those messy, different data files, cleaned up into one table.
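And a sketch of that final cross tab step in pandas: file name and record ID stay as the row keys, the cleaned-up field names become the new columns, and the cell value fills the grid. The aggfunc="first" setting assumes duplicates were already removed in the earlier step; the function and column names are illustrative.

```python
import pandas as pd

def to_wide(clean_tall: pd.DataFrame) -> pd.DataFrame:
    # clean_tall: FileName, RecordID, FieldName, Value
    wide = clean_tall.pivot_table(index=["FileName", "RecordID"],  # key fields per output row
                                  columns="FieldName",             # cleaned headers become columns
                                  values="Value",
                                  aggfunc="first")
    return wide.reset_index()
```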

And so that's a lot of different, varied steps. And these are just some of the simple techniques; we can go even more sophisticated and write additional logic to handle really any situation you can imagine. If you can dream it, you can do it. If you're doing it manually, we can write an Alteryx workflow that does it automatically. Now, if you've got other situations that this doesn't cover, what we've got is the Alteryx community. You can post the question in the data preparation and blending forum. And one of the ideal ways to do it is to say, "Here's my input and here's my output that I want." And then there's hundreds of people that are on there every day looking to answer your question and solve these challenging situations.

If you want to do meaningful practice yourself to try to figure out how you can do this, you can either look for questions on there or you can go over to the weekly challenges. And you'll commonly find weekly challenges that are about this handling of messy data. What was it? Week 63 was a good one here, on preparation using pivot tools. And what this would do is it's got... Let's see if I can find one here real quick. Pull down this one, I just want to take a look at it real quick. Here we go.

So in any situation, there's multiple ways to handle it. You saw there were, like, dozens of people replying to this with their solutions. There's not just one way to do it. There's multiple ways. And you can even try different tools. Like, what this does is it takes really messy data and rearranges it. And you can do it in as little as two tools or you can do it in dozens of tools. So this is a resource where you can go to look for other examples and meaningful practice and things of that nature. But like I said, bottom line, this is not easy. And it doesn't get solved the first time around. It takes an iterative process. It takes forgiveness for yourself. It takes a growth mindset and this willingness to close that gap between your current abilities and your taste. And through meaningful practice, you can close that gap.

And with that, that is my presentation of what I've prepared here. And let's see here if I can turn that into this. So, thank you. If you have any questions or you want to reach out to me individually, there's my email address. I love to help people. Not too long ago, I didn't know how to do any of this, and I had to reach out to ask for help. And that process of reaching out and asking for help is scary, because it makes me feel vulnerable. And I'm afraid of, you know, what am I? Do I not know how to do this? What are they gonna do? Are they going to laugh at me or put me down? It's a vulnerable experience to reach out and ask for help. But by doing that, it enables a connection and a lot of value to be created for everybody, because you're not alone. And I had to learn that I wasn't alone, and that when I reached out for help, people were there to help me and made me feel welcome.

And I want to do that for as many people as I possibly can, because I want to pay it forward. The Alteryx community is a really beautiful community, where people are looking to help in a safe environment. But even if you don't feel safe doing it publicly, there's my email. You can reach out to me, and I mistyped my email. But I make mistakes all the time, and it's okay. So... I'd be happy to help you out. If you want to do it more in a consulting structure, I do consulting. Reach out to me, I'd love to work with anybody. And of course, fill out your feedback survey, please. If there's anything I can do better, you're welcome to reach out to me directly. If you feel like something was missing, or you need other things, I welcome you to ask me. And here, do we have a microphone?

Crowd question:
Yes, we sure do.

Joe Mako:
We can do Q and A.

Crowd question:
And just a quick show of hands, who all is using the community and going there for resources today? That's fantastic. Yeah. As Joe mentioned, it's a great place to get your questions answered. So, speaking of questions... All right, we've got one right here.

Crowd question:
First of all, awesome presentation. That was an awesome workflow and really well explained. Thank you. Two questions. One, is that macro available, on the Gallery or something?

Joe Mako:
That is a great question. If I had an award to give you, I would give you an award for that question. So, what's gonna happen is I'm gonna take this PowerPoint and I'm going to make it a little bit better. And I'm gonna attach that workflow, with potentially maybe a little bit more documentation, on the community. So, I don't know if Alteryx is going to create a separate thread for Inspire to post these things. If they don't, it'll just be out there and you can search for the name of this session, Techniques for Handling Messy and Unpredictable Files. Search for that on the community and you'll find the thread where I'll have all this. And we can have a conversation in that thread. And we can make this presentation better. You can steal this presentation and use it to give it to other people that you know. That's what this is. It's about sharing and contributing. And we can all share our knowledge together.

Crowd question:
Great, thank you. And I guess the follow-up question is... So this is good for everybody, but in our company, we have a lot of people for whom Alteryx, and specifically the data prep steps, is a very minor part of their job. It's necessary, but it comes up so infrequently that they're not able to get really good at it. So, how would you... What are sort of the top really quick tips to convey to them? You know, we can't teach them everything, but what are the most useful data prep, data cleaning steps that we just want them to always remember on those rare occasions when they need it? Is there a way to... What are the most common... For example, this handles when there's blank rows above the header, but is there a quicker way of telling them how to do that? That's one example.

Joe Mako:
Well, I don't know if there's a good answer for that because every situation is going to be different. That what I went through here for this workflow, this works for this situation. This workflow may not work for all situations because data is messy in a lot of different ways. It's about having a mindset to say, "I'm gonna iterate through it and I'm gonna try this. And if this doesn't work, I can make it better and try again." So, I would say that's my answer to the question. It's having a mindset of growth and iterative working through it and acceptance to let go of that expectation that there is a recipe, that there is no recipe.

It's like the comparison between baking a Betty Crocker cake, where you buy the box and follow the steps one, two, three, and anybody can make a cake following those recipes. But unfortunately, life isn't like that and data isn't like that. And what we need to do is be like a chef, where we know all the ins and outs and the different ingredients that we have and how they work together, and the factors and exceptions to that. And be able to work through it, and have resiliency with it. And by adopting these things, you can make resilient workflows.

Crowd question:
Hey, Joe. Having worked with you in a consulting engagement, I thought I would add a comment instead of a question. I've been able to observe Joe do this work at clients, and one of the practices that he's used in this presentation is to just boil all the complexity down to A, B, C; one, two, three. Even though the data that we were dealing with was exceptionally complex, and in addition to being complex, very large, and so the combination of the size and the complexity can be overwhelming. And so it was really educational for me to turn that huge sense of overwhelming complexity into A, B, C; one, two, three. And if your workflow will run on A, B, C; one, two, three, then you can begin to transition it to the thing that was previously kind of overwhelming and too big to deal with.

Crowd question:
Great presentation, Joe. I picked up some really useful tips there. One thing: when you have those alternative exception files, typically I put that stuff in a formula and just look for specific keywords. Do you typically prefer going the alternative file way, or do you also use some of those formulas when trying to decipher headings that are not standard?

Joe Mako:
Yeah. You've got the option to go whichever way makes sense. What I went through here is not the only way to do this; you do have flexibility. If you've got files that vary in kind of a known variation style, and a formula makes a lot more sense than making a list of known weirdness, then that's great. That's perfect. And I wish that I had all day and I could cover every possible technique, but even with all day, I wouldn't be able to cover all of them. So that's a good technique. Thank you for sharing.

Crowd question:
Any additional questions? All right, Joe. Well, thanks again for your presentation.

Joe Mako:
Thank you.
