Making a Small Job Out of Big Data With In-Database Connections - Inspire 2017

The volume of data is growing exponentially and it's now the norm to store every bit of it. With the innovative in-database tools, Alteryx puts Big Data in the hands of all analysts. Learn how to maximize your existing Big Data infrastructure, when to move data out for processing, and when to use in-database blending.



Video Transcription


Alex Patton:
Hi everybody, thank you for coming to the session at the end of the day, and I'm impressed that everyone was able to find the room. Very far back in the [inaudible 00:00:13]. And like Angelina said, I'm Alex Patton, product manager for Engine and Data Platforms. So, what that means is that I essentially prioritize any and all features that have to do with connecting to data, whether that be which data sources we connect to, what that experience is like for connecting to data, and the In-Database tools, which is what I will be talking more about today.

So, before I get started, you've probably seen this slide before. Any forward-looking statement that I make should not be used to make purchase decisions, and we're not committing to anything. The caveat: just don't hold us accountable for anything forward-looking that we say. It is our best guess right now.

So, first we'll get into an overview of In-Database. The benefits of In-Database. Some best practices and tips and tricks and then we'll wrap it up with what the future is for In-Database.

So, before I get started, who is using In-Database tools right now? A lot, okay. It's a huge increase from last year.

Our usage data right now shows that about 10 percent of our entire user base is using In-Database tools and so the goal is to get as many people on that as possible so that I have a lot of justification for prioritizing features for In-Database. So thank you for using In-Database and hopefully you... This will kind of help you understand what's happening when you're actually using that.

And so, at a high level, instead of Alteryx running and processing the query itself, as you're dragging tools onto the canvas we are building out a SQL query and then pushing that query to the underlying database. So, we're not transferring any data to and from Alteryx. Sometimes at the end of the workflow we will transfer relevant data back into Alteryx, but by doing this you get huge performance gains, because that transfer of data is a huge part of why a workflow runs slowly.
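To make that concrete, here's a rough sketch of what that pushdown looks like, using a made-up sales table and region column rather than anything from the demo:

    -- Standard workflow: every row is transferred across the network to Alteryx,
    -- and the filtering happens locally on your machine
    SELECT * FROM sales;

    -- In-Database workflow: the Filter In-DB tool becomes part of the query itself,
    -- so only the matching rows ever leave the database
    SELECT * FROM sales WHERE region = 'West';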

So, why are we focusing so much on In-Database? As we all know, data is growing and we're just amassing more and more data every single day, and we're well aware of that. We're also well aware of the cost of transferring that data to and from Alteryx. So, with In-Database you don't have to move that data at all. We will push all the processing down to the underlying data source.

Secondly, data storage is changing, and we understand that a lot of companies are spending a lot of time, money, and effort building out pretty customized infrastructure to handle all the data that they need to handle. Instead of trying to reinvent the wheel and create a one-size-fits-all for all companies, we are taking advantage of the infrastructure that's already in place by pushing that query down to the underlying data source and letting that infrastructure optimize all of the processing.

And thirdly, we know that access to this data is key to valuable insights. Less than 0.5 percent of all data is being used today, and that means almost all data is not being used. What we're always trying to do at Alteryx is make data more accessible to everyone, and so by applying our tried-and-true drag-and-drop interface to our In-Database tools we are making that more accessible.

So, why would you use In-Database? There's actually only one real reason to use In-Database, and it's a pretty huge reason, and that is speed. So to illustrate this, I brought out a really, really simple workflow: reading in 485,000 records, doing some filtering on that original data set, and writing out 167,000 records.

When I ran this using the standard workflow, it took seven minutes and 49 seconds. So, that's very significant. And then I did the same thing with In-Database and it ran in 22.8 seconds. That's about 20 times faster. That is huge. If you are analyzing large amounts of data, you should definitely be trying to use In-Database and for those of you using In-Database, have you noticed performance gains?

Yes? Good. Okay, good.

So, that is the real reason we use In-Database. And a caveat to this, I should also point out: this is me using SQL Data Warehouse, so it is in the cloud, and that means I'm gonna see more drastic performance differences. So, this depends on how your network is set up, what data source you're using, where your data sits. It's not always gonna be this drastic, but you should always see pretty significant gains.

Okay. So, speed is the main benefit, and then we have flexibility and hybrid workflows, and those two concepts actually let you leverage speed, which is the main benefit. So, I will go into how you can fully leverage speed. I've built out this really simple workflow and I will talk about what I mean by each of these higher-level concepts.

So, if you're using all In-Database tools, this is where you are going to realize the most gain. Every single one of the tools highlighted here is In-Database; that means that no data at all is being transferred to or from Alteryx. So, here you're realizing the full potential of In-Database. If you're able to do that, great. If you're not, we understand most people are not.

So, we know that a lot of people have data in different servers, and those don't always exist in the database. So, for example, I have an Excel file that I want to use in my In-Database workflow to fully analyze the data in my database. So we allow for flexibility, and what that means is this Data Stream In tool can be used to stream in data from sources that are not supported In-Database or that are local files on your computer. In this case, using a CSV or Excel file to stream data into the workflow so that you can still take full advantage of that speed you are able to gain with In-Database.

And so, the hybrid workflow we're referring to is the ability to stream data out of the In-Database workflow so that you can fully utilize our entire suite of tools within Alteryx. So, we don't offer In-Database tools for reporting or for spatial. But you can still take advantage of those by streaming out any of that data at the end of the workflow to take further advantage of all of the tools that Alteryx has to offer.

One thing to note and keep in mind here, and I'll actually talk about this later as well when we go into best practices, is that we are using this Data Stream Out tool as far downstream as possible. So, we want to do all the heavy lifting within the database before we stream out that reduced data set, so we're taking full advantage of all of the speed that the In-Database tools offer.
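As a rough illustration of that "heavy lifting first" idea, with invented table and column names, all of the joining, filtering, and summarizing below happens inside the database, and only the small grouped result is what the Data Stream Out tool actually transfers back into Alteryx:

    -- Everything here runs inside the database; only the grouped totals
    -- cross the wire when the Data Stream Out tool executes the query
    SELECT c.state,
           SUM(o.amount) AS total_sales
    FROM   orders o
    JOIN   customers c ON c.customer_id = o.customer_id
    WHERE  o.order_date >= '2017-01-01'
    GROUP BY c.state;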

So what do we support In-Database? Was anyone at this presentation last year? I will take questions at the end.

Crowd questions:
I just have a question about [inaudible 00:09:05]

Alex Patton:
Let me answer that at the end, I will come back to that.

Okay, so was anyone at this presentation last year? No, okay. Yay, all no. So, this isn't actually even that relevant, but this is what we've added since last year: Databricks, Microsoft Azure SQL Database, APS, Netezza, and SAP HANA. And this next slide's gonna be more relevant to this crowd; this is a full list of everything we support In-Database as of 11.0.

We also offer support for some of the predictive In-DB models. We only support some of the most popular ones for some of the most popular sources, so the sources we support are Microsoft SQL Server 2016, Oracle, and Teradata. All of them support the linear regression tool, the logistic regression tool, and the score tool. In addition to that, Microsoft SQL Server also supports [inaudible 00:10:19] models, the decision tree and the forest model. And this is all straight out of the help documentation, so if you ever need to refer back to this, you can go there.

Okay. Now we'll get into tips and tricks, best practices for using In-Database. So, think about using In-Database if any of [inaudible 00:10:47]. You won't be surprised to hear what I say, because of the big caveat: it totally depends on your network setup, where your data lives, what data source you're using... So, I'll go into each of these in a little bit.

So, if you're working with large data sets, then the next question is: what's a large data set? I don't know; again, it depends on how your network is set up and how quick that network is. But if your data source is sitting on top of HDFS, so if you're using Hive, Impala, or Spark, which are the three that we support In-Database, you probably have large data sets, because that's what they were designed for. So, think about using In-Database if you're using any of those.

If your data source is in the cloud, so if you're using anything in Azure, or Redshift, there's gonna be some latency when you're transferring data to and from the cloud. So, you will probably see pretty major performance gains if you're able to switch to In-Database.

If your database is not on a local server, then again, this is about where your data's sitting. If I'm in DC and my data's in California, I have to transfer all of that data to and from Alteryx, and that's going to be a huge performance hit as opposed to just sending the query down to the database to do all the processing there.

This next one is not... you don't have to think about it. If we don't support your database, you cannot use the In-Database tools currently.

If any part of your workflow is working with a reduced data set, that means you should probably think about using the In-Database tools to reduce that data set prior to streaming it out for further analysis.

If I have a 500-million-row data set and I'm only interested in 5,000 of those rows, try to pare that down In-Database before you stream it into Alteryx.

Next, use the Browse In-DB and the Data Stream Out tools wisely. So, the Browse In-DB tool, the Data Stream Out tool, and the Write Data In-DB tool actually execute the query that's being built as you're building that workflow, and in addition to executing the query, the Browse and Data Stream Out tools actually transfer data from the underlying data source into Alteryx.

So, if you're using the browse tool, just understand that you are not fully realizing the potential for [inaudible 00:13:37] of using In-Database, because it's actually transferring data into Alteryx so that you can see the data.

Definitely use the browse tools. I like to use them as I'm building out the workflow, and once I have that done and am ready to productionize it, remove the browse tools. And as you're using the browse tools, just make sure you're not changing the defaults unless you have to. The default is that we're bringing in the first 100 records, so you're only transferring a hundred records into Alteryx. And enable caching so that after the first time it's run, we're not going back and getting the same data again.
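Roughly speaking, that default browse behaves like a limited preview query wrapped around whatever has been built so far; the sketch below uses SQL Server's TOP syntax and made-up names, and other databases would use LIMIT instead:

    -- Browse In-DB with the default setting: only a small sample is transferred
    SELECT TOP 100 *
    FROM   (SELECT * FROM sales WHERE region = 'West') AS preview;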

Secondly, the data stream out tool. Again, data's being transferred from the underlying data source into Alteryx. Use this as far downstream as possible. Do all of your heavy lifting, all of your data manipulation, all of your filtering, all of the cleanup before your workflow hits this data stream out tool.

And reduce your data set as quickly as possible. So, for this Connect In-DB tool, try not to use... I should have a big red X through this. Don't do this... Choose the fields that are relevant, the ones you actually want analyzed. That way you're sending fewer records into the nested queries that your underlying database is processing. And then the filter tool: use that as early as possible in the workflow to filter out any unnecessary records.

Okay, a trick. [inaudible 00:15:27] animation. How to view a query. So, this is for those of you who are probably a little more advanced. We find sometimes that as you're building out workflows, especially as they get more complex, we are essentially adding a nested query to the original SQL statement, and sometimes if your workflow is really complex, the underlying database isn't able to optimize that as well as it should be able to. And so, for those of you who are really good and really familiar with the syntax that is most optimal for your underlying data source, you can use this Dynamic Output tool. And this Dynamic Output tool used with the [inaudible 00:16:15] tool will actually show you the query that has been built up to that point in the workflow.

So, at this point, this is the query that has been built. And for those of you who are amazing at SQL and know your underlying data source really well, you can change and optimize the syntax and either stream that back into the workflow or just paste it into the Connect In-DB tool.
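For example, a Connect In-DB, Filter In-DB, Summarize In-DB sequence ends up as nested SELECTs roughly like the first query below, with table and column names invented here for illustration; someone who knows their database well could rewrite it as the flatter second form and feed that back in through the Connect In-DB tool or the Dynamic Input tool:

    -- Roughly what the In-DB tools build up: one nested SELECT per tool
    SELECT region, COUNT(*) AS order_count
    FROM (
        SELECT * FROM (
            SELECT * FROM sales
        ) a
        WHERE order_date >= '2017-01-01'
    ) b
    GROUP BY region;

    -- A hand-flattened equivalent you could paste back in
    SELECT region, COUNT(*) AS order_count
    FROM   sales
    WHERE  order_date >= '2017-01-01'
    GROUP BY region;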

If you're not able to do that, your data source should still be able to optimize the query, but we have had cases where there's better syntax, and that's something we're looking to improve upon in the future. But we do give you this access if you know the data source really well.

One thing that I really like about this slide is, it does illustrate what Alteryx is doing and how we're actually making it more accessible to the end user. This is what's happening in the background, this is what you don't have to do or have to know.

Okay, another trick: using the In-Database predictive tools. This is fairly simple but does require a little bit of trickery. So, if you are connected to Oracle, Teradata, or SQL Server 2016 in your Connect In-DB tool here, and you want to use any of the predictive models that we support, you need to navigate to the predictive category. So that's outside of In-Database; it's just the standard predictive category. And you will drag whatever model you want to use onto the canvas, and as you're doing so, that icon is gonna change colors to purple and it's going to look like the In-Database tools. So if you want to access the predictive In-Database functionality, that is how you would do that.

Okay, what is next for In-Database? So, the new data sources: 11.3, which is going to be released in a couple of weeks, will have Vertica and Exasol, and shortly after, in the third quarter, we'll have Greenplum and Snowflake In-Database. Anyone excited about this? Yay, okay. Good.

And we'll continue to... this list changes. We reevaluate this list every planning period to see what it is that people are asking for. So definitely ask us for things that you want; post your ideas on the community. We do look at that to decide what's coming next.

And for some of the larger In-Database initiatives, we're always looking at extending the tool set and functionality In-Database. So, like here we have a Transpose In-DB tool. It only works for Oracle and SQL Server right now. It is in the labs, if you're interested in checking that out, but we'll continue to, essentially, try to match the In-Database tools to what the standard tools are.

As well as looking at some of the standard functionality and extending it to In-DB: things that are on the shorter-term roadmap are the sort functionality and the update functionality, which we've heard a lot about. Actually, if you want the update functionality, please come talk to me; I have questions about what your use case is.

And then simplifying the connection configuration is high on our list. Connecting to a data source via the standard tools is very different than connecting to a data source In-DB. This is a long-term initiative; long term, we want to make that as unified as possible. In the interim we do want to clean up the experience of connecting to a database when you're using the Manage In-DB Connections configuration, clean up this little new button... the read and write [inaudible 00:20:47] confuses people. Has anyone ever been confused when connecting In-Database? No? Yeah, it's okay.

Okay, so I just went through a lot, and there are three things you should remember from this. In-Database, if you're using it, means your workflows are going to be really, really fast; a lot faster than they were using the standard tools. I showed you 20 times faster, so that is the key reason you would use In-Database.

And we will continue to expand the In-Database capabilities in terms of building out the tool set and building out the functionality.

And we support some of the most popular predictive models and data sources IN-DB. A lot of people don't know that, so we do offer that and you should remember that when using In-Database.

And I want to mention, for additional resources, we have the Alteryx Community and Alteryx Help. The Alteryx Community is a great place to go if you have ideas for things you want to see In-Database. So, if you want to see certain data sources, please go out there and post an idea or star an existing idea. We use that all the time, as well as for the functionality you would like to see. We've seen sort and update out there a lot, which is why they're on our short-term list. Use that.

Another reason to use the community is we have a lot of really experienced users out there, and if you have an issue building out a workflow or issues using any of the In-DB tools, if you go to the community and post the question, you'll generally get an answer.

Alteryx Help is also a really, really good place to go to learn about some of the basic functionality of In-Database. It covers a lot of what I just covered, but if you forget how to use things, just go to Alteryx Help.

And I am going to get on to the questions, but as I'm answering questions, this is how you... Please go to the session survey and fill it out. Give me feedback: what you want to see, what was good, etc.

So, let's get into Q and A. And Angelina is going to bring the mic around.

So, I know we have a question [inaudible 00:23:38] do you still have your question?

Speaker:
Almost here.

Crowd questions:
Hi, I was wondering, is there a capability to execute stored procedures either in SQL Server or Oracle in the In-Database or in Alteryx in general?

Alex Patton:
Yes. Yes, in Alteryx in general for SQL Server and Oracle. Oracle is a little more difficult; there is a way to do it, and we're continuing to build out the functionality for Oracle stored procedures, and you can do the same for In-Database.

Speaker:
Okay.

Alex Patton:
You just... Yup.

Crowd questions:
Does that also apply to Teradata?

Alex Patton:
Does that also apply to Teradata? No. It does not apply to Teradata right now, sorry.

Crowd questions:
Okay, another one.

Crowd questions:
[inaudible 00:24:30] On the SQL that you were showing. Is there a way to kind of clean that up?

Alex Patton:
Yeah.

Crowd questions:
So that it doesn't have all the quotes and [crosstalk 00:24:41] [inaudible 00:24:42]

Alex Patton:
So is there a way for you to clean that up?

Crowd questions:
Yeah.

Alex Patton:
Yes, there is a way for you to clean that up. You can stream out that query and edit it however you want. And then you can either take that back into the Connect In-DB tool, and that will execute when it hits the Write Data In-DB tool, or you can stream it back in using the Dynamic Input tool. So, yeah.

Crowd questions:
And then one more, then I'll be done. Is there... So, we have a bunch of analysts and they're not used to the syntax of each database they connect to. And so the formula tool seems like you have to have some SQL knowledge of those particular... Is there a way, or has that been asked, to kind of mask what you're writing so it's more like the formula tool that you would use in Alteryx?

Alex Patton:
Yeah, so that is a really good question and something that we've heard before. The reason that it is generic right now is so that we can add as many In-Database sources as possible to these In-Database tools. One thing we're starting to look at now, and that will be roadmapped in the future, is: okay, how do we optimize the syntax that's expected for that underlying data source? So we'll start going into the more popular ones and building out the formulas and the filters in the optimal syntax that is being expected by the underlying data source.

Crowd questions:
So, you can connect to Spark, which... so it's actually reading data that's in HDFS. Does it matter if that data's formatted in HBase tables, Parquet, anything like that?

Alex Patton:
So, we only read and write in Avro and CSV formats right now. We have heard Parquet, we have heard ORC. Heard a little bit of HBase; it's not a ton. We're looking at adding support for that. But right now it does have to come in CSV and Avro formats.

Crowd questions:
[inaudible 00:26:54]

Alex Patton:
That is through In-Database right now. Right now you cannot get it natively into Alteryx, though we're looking at that for the Spark integration.

Crowd questions:
So, if I were to do something complicated, like a partition or window clause, would I have to use the dynamic output to change the syntax and then dynamic input it back in? Or am I mistaken?

Alex Patton:
Yeah, if you know the syntax for that, yes, you would have to do that; we don't offer that functionality right now. But yeah, if you're able to put that into a statement, you can either, like you said, use the dynamic input tool, or the dynamic output and then input that back in. Or from the very beginning you can just paste the SQL query into that Connect In-DB tool. There's an editor there that you can use.
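As a sketch of that approach, with invented table and column names, a windowed query like this could be pasted straight into the Connect In-DB editor, and the downstream In-DB tools would then treat its result as their starting table:

    -- Rank each order by amount within its region using a window clause,
    -- then let the In-DB tools pick up from this result
    SELECT customer_id,
           region,
           amount,
           ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount DESC) AS rank_in_region
    FROM   orders;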

Crowd questions:
So, if my syntax is correct and I'm getting some weird errors, do you have any advice for that?

Alex Patton:
Come to the Solution Center, see if we can figure it out?

Crowd questions:
Got it, thank you.

Crowd questions:
Other questions? Okay.

Crowd questions:
[inaudible 00:28:16]

Alex Patton:
Sorry, what was the name of the tool where you can see the query? It's the Dynamic Output tool.

Crowd questions:
[inaudible 00:28:32]

Alex Patton:
Yeah.

Crowd questions:
Is there functionally any difference... So, if you have an already optimized SQL query and you put that into the In-DB tool vs. putting it into the normal DB connection, and you're eventually just streaming the data out, is there functionally a difference there? Or is it effectively creating a... This is basically a way to create SQL...

Alex Patton:
Yes, this is effectively a way to create the SQL.

Crowd questions:
Cool.

Speaker:
All right.

Alex Patton:
Any other questions? [crosstalk 00:29:08]

Speaker:
[inaudible 00:29:08] the questions.

Alex Patton:
All right thank you all so much for coming.

Speaker:
Thank you.
