See 6 Steps to a Bulletproof Data Prep Strategy
Improve your workflow and get ready to reach deeper insights
"6 billion hours per year are spent working in spreadsheets. 26 hours per week are wasted in spreadsheets. 8 hours per week are spent repeating the same data tasks." — "The State of Self-Service Data Preparation and Analysis Using Spreadsheets," IDC
Looking for a smarter way to do data prep?
Data preparation in most organizations is time-intensive and repetitive, leaving precious little time for analysis. But there’s a way to get better insights, faster.
We’ll walk you through it step by step.
Setting Up for Success: The Importance of Data Prep
What’s the big deal about data preparation?
The big deal is that you can’t succeed without it. And that’s not an overstatement. Data prep might not be glamorous, but it’s the structural foundation of good business analysis. If you don’t clean, validate, and consolidate your raw data the right way, you won’t be able to get meaningful answers.
69% of businesses say they aren’t yet data-driven.
— “Big Data and AI Executive Survey,” NewVantage Partners, 2019.
But in a typical organization, data winds up living in silos, where it can’t fulfill its potential, and in spreadsheets, where it’s manipulated by hand. Silos and manual preparation processes are like a ten-mile obstacle course lying between you and the insights that should be driving your business.
If your organization is struggling with this lag time, you’re in good company, as 69% of businesses say they aren’t yet data-driven — but having other people with you in a sinking boat doesn’t make it more fun to drown.
The more data you acquire and the more complex it gets, the more these problems amplify, so you need better solutions. What if you could work with any data format that struck your fancy? What if you could automate some of these processes and make them fast, transparent, and repeatable?
It would probably be a big deal.
Data Prep 101: Understanding the fundamentals
Here’s what a good data prep strategy looks like.
Before we talk solutions, let’s take a closer look at what you should be planning when it comes to data preparation.
A successful approach to data prep includes these functions:
Exploratory data analysis
Discover what surprises the dataset holds.
Data cleansing
Eliminate the dupes, errors, and irrelevancies that muddy the waters.
Data blending
Join multiple datasets and reveal new truths.
Data profiling
Spot poor-quality data before it poisons your results.
ETL (Extract, Transform, Load)
Aggregate data from diverse sources.
Data wrangling
Make data digestible for your analytical models.
Ideally, as you’re moving in and out of these activities, you want to record both your data and your process so that any mistakes you make aren’t permanent, and so that others can repeat your results on their own.
Transparency and repeatability are the holy grail of data prep, but you can’t have either in a spreadsheet-based system.
Data Prep 101: Data Exploration
It’s a jungle in there
Before you start working intensely with a new dataset, it’s a good idea to step boldly into the raw material and do a bit of exploring. Although you might start with a mental picture of what you’re looking for or a question you’d like to see answered, it’s best to keep an open mind and let the data do the talking.
Data exploration used to require the code-writing skills of IT engineers, which amounted to a locked gate between raw data and the people who analyzed it. But now, by using automated tools as building blocks throughout the data prep process, data analysts and business users can plunge right into a dataset themselves and explore whatever lies within.
Here are some data exploration techniques that may spark big insights:
Check column names and field descriptions to see whether any anomalies jump out or any information is missing or incomplete.
Take a temperature check to see whether your variables are healthy: how many unique values do they have? What are their ranges and modes?
Identify any outliers that could skew your results. You can use visual methods such as box plots, histograms, or scatter plots, or numerical approaches such as z-scores.
Examine those outliers. Should you investigate, adjust, omit, or ignore them?
Examine patterns and relationships for statistical significance
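The checks above can be sketched in a few lines of Python with pandas. This is a minimal illustration, not the Alteryx workflow itself; the toy dataset and column names are invented, and the z-score threshold is an arbitrary choice for the example.

```python
import pandas as pd

# Toy dataset standing in for a freshly acquired file (illustrative only)
df = pd.DataFrame({
    "region": ["North", "South", "South", "East", None],
    "sales": [120.0, 95.5, 101.2, 980.0, 88.3],  # 980.0 is a planted outlier
})

# Check column names and look for missing or incomplete information
print(df.columns.tolist())
print(df.isna().sum())

# "Temperature check": unique values, ranges, and summary statistics
print(df["region"].nunique())
print(df["sales"].describe())

# Flag potential outliers numerically with z-scores
z = (df["sales"] - df["sales"].mean()) / df["sales"].std()
outliers = df[z.abs() > 1.5]
print(outliers)  # the 980.0 row stands out; now decide what to do with it
```

From here you would examine the flagged rows and decide whether to investigate, adjust, omit, or ignore them.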
Data Prep 101: Data Cleansing
Your analysis is only as good as the data that powers it. That’s why data full of errors and inconsistencies carries a hefty price tag: studies show that dirty data can shave millions off a company’s annual revenue.
To prevent catastrophic losses like that, it’s critical to scrub your dataset until it sparkles. As an analyst, you know this all too well, since this is probably how you spend most of your work week.
All these processes can be done manually but at the considerable cost of your thinking time. Automated data cleansing tools, on the other hand, can do most of this work with a few quick clicks.
Depending on the type of analysis you’re doing, you need to accomplish six things in the cleansing stage:
Get rid of all duplicate records that clog server space and skew your analysis.
Remove any rows or columns that aren’t relevant to the problem you’re trying to solve.
Investigate and, where appropriate, remove incomplete information.
Eliminate any unwanted anomalies found during data exploration.
Repair structural deviations: typos, capitalization, abbreviations, formatting, and stray characters.
Confirm that your work is accurate, complete, and consistent, documenting every tool and technique used.
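Several of the cleansing steps above can be sketched with pandas. This is a hedged illustration on an invented "dirty" dataset, not a production recipe; in practice you would investigate each incomplete or duplicate record before dropping it.

```python
import pandas as pd

# Toy "dirty" dataset (illustrative only)
df = pd.DataFrame({
    "customer": ["Acme Corp", "acme corp ", "Beta LLC", None],
    "amount": [100.0, 100.0, 250.0, 75.0],
})

# Repair structural deviations: stray whitespace and inconsistent capitalization
df["customer"] = df["customer"].str.strip().str.title()

# Remove incomplete records (after investigating why they're incomplete)
df = df.dropna(subset=["customer"])

# Drop duplicate records that would skew the analysis
df = df.drop_duplicates()

print(df)  # two clean, unique customer records remain
```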
Data Prep 101: Data Blending
The more high-quality sources you incorporate into your analysis, the deeper and richer your insights. Typically, any project you undertake will require six or more data sources — both internal and external — requiring data blending tools to fuse them together seamlessly.
The moment before blending is kind of like looking over the edge of a cliff. What if you introduce a new dataset and it sets off an avalanche of compatibility issues, and you can’t undo the damage? Sometimes the complexity of the work makes it tough to be completely confident in the results. It’s always better to have a solution that allows you to go back in time to the point before you made changes.
Data blending usually involves three steps:
Acquire and prep. If you’re using modern data tools rather than trying to make files conform to a spreadsheet, you can include almost any file type or structure that relates to the business problem you’re trying to solve and transform all datasets quickly into a common structure. Think files and documents, cloud platforms, PDFs, text files, RPA bots, and application assets like ERP, CRM, ITSM, and more.
Combine. In spreadsheets, this is where you exercise your VLOOKUP skills. Exhausting, isn’t it? With self-service analytics instead, the process is simply drag and drop.
Validate. Review your results for consistency, and explore unmatched records to see whether more cleansing or prep work is needed.
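The combine-and-validate steps can be sketched in code. This is a minimal pandas example with invented datasets; the `indicator` flag on the merge is what surfaces the unmatched records the validation step asks you to explore.

```python
import pandas as pd

# Two illustrative datasets to blend
orders = pd.DataFrame({"customer_id": [1, 2, 3], "total": [50, 75, 20]})
regions = pd.DataFrame({"customer_id": [1, 2, 4], "region": ["North", "South", "East"]})

# Combine: an outer join keeps every record from both sides
blended = orders.merge(regions, on="customer_id", how="outer", indicator=True)

# Validate: unmatched records signal that more cleansing or prep may be needed
unmatched = blended[blended["_merge"] != "both"]
print(unmatched)  # customer 3 has no region; customer 4 has no orders
```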
Data Prep 101: Data Profiling
Not all data makes the cut
Data profiling is a lot like data exploration, but with a more intense focus. Data exploration is an open inquiry performed on a new dataset; data profiling means examining a dataset specifically for its relevance to a particular project or application. Profiling determines whether a dataset should be used at all — a big decision that could have serious financial consequences for your company.
Data profiling can be intricate and time-consuming. For a business end user to do it properly without the help of a specialist, data profiling software is a must.
There are three main data profiling techniques, performed in this order:
Structure profiling. How big is the dataset and what types of data does it contain? Is the formatting consistent, correct, and compatible with its final destination?
Content profiling. What information does the data provide? Are there gaps or deviations? This is the stage where you run summary statistics on numeric fields, look for null, empty, and isolated values, and check for systematic errors in spelling, abbreviations, or IDs.
Relationship profiling. Are there spots where data overlaps or is misaligned? What are the connections between units of data? Examples might be formulas that connect cells or tables that collect information regularly from external sources. Identify and describe all relationships, and make sure you preserve them if you move the data to a new destination.
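The structure- and content-profiling passes can be sketched as a small report in pandas. This is an illustrative outline on an invented dataset; a real profiling tool covers far more checks, including the relationship profiling described above.

```python
import pandas as pd

# Invented dataset with a planted null and a duplicated ID
df = pd.DataFrame({
    "id": ["A1", "A2", "A2", None],
    "value": [10.5, 99.0, 99.0, 3.2],
})

# Structure profiling: how big is it, and what types does it contain?
profile = {
    "rows": len(df),
    "columns": list(df.columns),
    "dtypes": df.dtypes.astype(str).to_dict(),
}

# Content profiling: summary stats, null counts, and suspect IDs
profile["nulls"] = df.isna().sum().to_dict()
profile["value_stats"] = df["value"].describe().to_dict()
profile["duplicate_ids"] = int(df["id"].duplicated().sum())

print(profile)
```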
Data Prep 101: ETL (Extract, Transform, Load)
Get your data ducks in a row
With the enormous volume and complexity of data sources available to you, it’s inevitable that you’ll need to extract it, integrate it, and store it in a centralized location that allows you to retrieve it for analysis whenever you need it.
That process is known as ETL, which stands for “Extract/Transform/Load,” and it’s the centerpiece of a modern data strategy. ETL can also help you migrate data during a disruption — say, an upgrade to a new system, or a merger with another business.
The idea is to integrate all data and make it accessible to more people, not replicate the silos where it used to live. Forward-thinking companies look at ETL as a way to allow analysts, data scientists, line-of-business leaders, and executives to make decisions from the same playbook.
The three steps in a nutshell:
Extract. Pull all data, structured or not, from one or many sources, and validate its quality. (Be extra thorough if you’re extracting from legacy systems or external sources.)
Transform. Do a deep clean at this point and make sure the formatting meets the destination’s technical requirements.
Load. Write the converted values to their storage destination, usually a data warehouse. Then sample the data and check for any quality deviations.
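The three steps can be sketched end to end in a few lines. This is a toy illustration: the "source" is an in-memory list, the "warehouse" is an in-memory SQLite database, and the field names are invented.

```python
import sqlite3
import pandas as pd

# Extract: pull raw records from a source (a list stands in for files or APIs)
raw = [{"name": " alice ", "spend": "120.50"}, {"name": "BOB", "spend": "80.00"}]
df = pd.DataFrame(raw)

# Transform: deep-clean and match the destination's technical requirements
df["name"] = df["name"].str.strip().str.title()
df["spend"] = df["spend"].astype(float)

# Load: write the converted values to the warehouse (SQLite here)
conn = sqlite3.connect(":memory:")
df.to_sql("customers", conn, index=False)

# Sample the loaded data and check for quality deviations
print(pd.read_sql("SELECT * FROM customers", conn))
```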
Data Prep 101: Data Wrangling
Are we wrangling yet?
The term “data wrangling” is often used loosely to mean “data preparation,” but it actually refers to the preparation that occurs during the process of analysis and building predictive models. Even if you prepped your data well early on, once you get to analysis, you’ll likely have to wrangle (or “munge”) it to make sure your model will consume it — not spit it back out.
Data wrangling is usually performed with programs and languages such as SQL, R, and Python. That requires technical know-how that the average analyst doesn’t have. To make this process accessible for your whole organization, you’ll need to use automated analytics software.
Here’s how you wrangle:
Explore. If your model doesn’t work the way you imagined, it’s time to dig back into the results to find out why.
Transform. You should structure your data from the beginning with your model in mind. If your dataset’s orientation needs to pivot to provide the output you’re looking for, you’ll need to spend some time manipulating it. (Automated analytics software can do this in one step.)
Cleanse. Correct errors and remove duplicates.
Enrich. Add more sources, such as authoritative third-party data.
Store. Wrangling is hard work. Preserve your processes so they can be reproduced in the future.
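The transform step above — reorienting a dataset so a model can consume it — can be sketched with a pivot in pandas. The long-format sales table here is invented for illustration; Alteryx does this kind of reshaping in one drag-and-drop step.

```python
import pandas as pd

# Long-format data (illustrative) that a model expects in wide format
long_df = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "sales": [100, 110, 90, 95],
})

# Transform: pivot so each month becomes a column the model can consume
wide_df = long_df.pivot(index="store", columns="month", values="sales")
print(wide_df)  # one row per store, one column per month
```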
Faster, Smarter Insights: The case for automation
Data, meet the 21st century.
What happens in a world without silos and spreadsheets? If you were able to access data in any format and automate your current prep processes with a powerful software platform, what would that look like — for you and for your organization?
In our experience, automating data prep works something like this:
Switching to an automated platform almost always produces a measurable return in a matter of days or weeks.
Time for insights
Automation completely changes the focus of an analyst’s workday from menial tasks to creative ones. And you’ll never have to solve the same data problem twice.
When you eliminate the need for data gatekeepers, you can engage the entire organization. Employees at all levels begin coming up with new ways to expand their own capabilities.
It’s such a profound change — a different universe, really — that we have a name for it: Analytics Automation.
The Alteryx Analytics Automation Platform
Why use Alteryx for data prep?
And your organizational ROI? Glad you asked.
1. Top-line growth
2. Bottom-line savings
3. Dramatic efficiency gains
4. Fast upskilling of workforces
5. Risk mitigation
“We use [Analytics Automation] across many of our businesses to leverage data, automate processes, and empower our people to become self-service digital workers.”
— Rod Bates, Vice President Decision Science and Data Strategy, The Coca-Cola Company
Start anywhere. Solve everything.
Alteryx is the only quick-to-implement, end-to-end data analytics platform that allows you — and everyone you work with — to solve business problems faster than you ever thought possible.
If you want Analytic Process Automation (and believe us, you do), we rock that space like nobody else. Our platform can discover, prep, and analyze all your data, plus deploy and share analytics at scale for deeper insights.
The Alteryx Effect: Shrinking process times, speeding up insights, and generally saving the day.
What’s in it for you:
Data preparation at light speed
Low-code, no-code modeling through an intuitive interface (or advanced modeling with code for all you data scientist superstars)
Support for nearly every data source and visualization tool out there
Performance, security, collaboration, and governance (translation: It’s like sending warm brownies to your IT department)
ROI and then some
Why Analysts Love Alteryx
69% faster time to insight
$6M+ higher annual revenue per 100 analysts employed
Over 2,000 hours saved in manual effort
— The Salvation Army
1 year of store sales data organization in 1 hour
$80,000 saved annually using automation.
Analyses that were previously impossible now run in 20 seconds.
“I simply can’t do my job without Alteryx, nor would I want to.”
— Jay Caplan, The Coca-Cola Company
“Alteryx pushes our analytics from playing checkers to playing chess.”
— William McBride, Cetera Financial Group
“It’s crazy that we used to spend about 80% of our time bookkeeping and 20% of our time engaging with customers. But now with Alteryx, we’ve reversed that and have become an 80% customer advisory company with just 20% spent on bookkeeping. Through this process, it has enabled us to give our customers better experiences.”
— Brian Milrine, Brookson
“Alteryx empowers people like us, who have little to no computer coding background, to do complex things with data even though we have no one in IT who can write Python. It allows us to follow the ideas in our minds and move from question to answer a lot faster.”
— Alexandra Mannerings, Colorado Hospital Association
“I built a workflow in 10 minutes on our first day that queried five billion records in 20 seconds. And immediately I realized there is something going on here that is really cool and powerful.”
— Justin Winter, Chick-fil-A
Dive Deeper Into Data Prep
Go from prep to insights. Check out the No-Sweat Guide to Advanced Analytics Success. Get the Guide
Get a taste of drag-and-drop analytics. Try the Alteryx Data Blending Starter Kit. Try Starter Kit