Topic Expertise: For starters, a data scientist needs to have a basic understanding of the topic or problem they are trying to explore so that they can ask meaningful questions about that topic or problem. The nature of data science is to seek explanations for why things are the way they are. A foundation of topic expertise defines the need for a data science project and leads to more confident, data-driven decisions.
Data Acquisition: The next step in the data science lifecycle is collecting the right data to help answer the defined question. The data might live in a variety of places and be difficult to access, depending on a data scientist’s skill set. But the success of the rest of the data science process depends on the quality of data collected in this step — and how well it is prepared.
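To make this concrete, here is a minimal Python sketch of pulling data from two common homes, a flat-file export and a local database. The filenames, table, and column names are purely illustrative, not tied to any particular project.

```python
import sqlite3
import pandas as pd

# Load a flat-file export (e.g., from a spreadsheet or reporting tool).
# "orders_export.csv" is a hypothetical file used for illustration.
orders = pd.read_csv("orders_export.csv")

# Pull a related table from an internal database (a hypothetical SQLite file here).
with sqlite3.connect("warehouse.db") as conn:
    customers = pd.read_sql(
        "SELECT customer_id, region, signup_date FROM customers", conn
    )

# A quick sanity check that both sources arrived as expected.
print(orders.shape, customers.shape)
```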
Data Preparation: Data preparation is the most time-consuming — and arguably most important — step in the data science cycle. As the saying goes, if you put garbage in, you’ll get garbage out. Data needs to be properly cleaned and blended ahead of analysis. This might include integrating disparate data sources, handling missing values and outliers, and more. During this iterative step, a data scientist might realize they need to go back and gather more data.
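A small Python sketch of what this cleaning can look like in practice, using a toy dataset with illustrative column names (the imputation and outlier rules shown are common conventions, not the only valid choices):

```python
import numpy as np
import pandas as pd

# Toy data standing in for blended sources; values and columns are illustrative.
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "age": [34, np.nan, 51, 29, 44],
    "spend": [120.0, 95.0, 5000.0, 110.0, 130.0],
    "satisfaction_score": [4.0, np.nan, 5.0, 3.0, 4.0],
})

# Handle missing values: impute numeric gaps with the median, and drop
# rows that lack the value we ultimately want to explain.
df["age"] = df["age"].fillna(df["age"].median())
df = df.dropna(subset=["satisfaction_score"])

# Screen outliers with a simple interquartile-range rule; the 1.5x
# multiplier is a widely used default, not a hard requirement.
q1, q3 = df["spend"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["spend"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
print(df)
```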
Data Exploration: Data exploration involves identifying and understanding patterns in a dataset. Once the data is clean and usable, data scientists can spend time getting to know it and forming hypotheses to test. This is another iterative step, and a data scientist might need to take one or two steps back to perform additional cleansing and blending based on what they find. This practice includes reviewing the distinct attributes of each data point, or “features,” in the dataset and determining whether further blending and data transformations would yield potentially meaningful new features. The process of creating new features is often referred to as “feature engineering,” and it typically occurs in the interplay between the data exploration and data preparation steps.
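Here is a brief Python sketch of feature engineering during exploration. The dataset and column names are again toy examples, but the pattern of deriving new columns from existing ones is the core idea:

```python
import pandas as pd

# Toy frame standing in for the cleaned data; column names are illustrative.
df = pd.DataFrame({
    "signup_date": pd.to_datetime(["2021-01-15", "2021-06-01", "2022-03-10"]),
    "last_order": pd.to_datetime(["2022-01-20", "2022-02-11", "2022-03-25"]),
    "total_spend": [480.0, 150.0, 60.0],
    "order_count": [12, 5, 2],
})

# Derive new features from existing ones: the essence of feature engineering.
df["tenure_days"] = (df["last_order"] - df["signup_date"]).dt.days
df["avg_order_value"] = df["total_spend"] / df["order_count"]

# Quick exploration: a correlation matrix hints at which features relate.
print(df[["tenure_days", "avg_order_value", "total_spend"]].corr())
```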
Predictive Modeling and Evaluation: After exploration, a data scientist can start training predictive models. Predictive modeling often blends together with data exploration: once modeling and evaluation begin, it’s likely that a data scientist will notice new things about the features in the dataset and take another step back to iterate on the feature engineering. As models are built, they need to be assessed. A data scientist should continue to test and refine models until they arrive at one they are happy with.
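A minimal Python sketch of this train-evaluate-refine loop, using scikit-learn with synthetic data standing in for a real feature table (the model and metric choices here are just one reasonable setup):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data stands in for the prepared feature table.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Hold out a test set so the final evaluation reflects unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)

# Cross-validation on the training set guides iteration on features and models...
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print(f"CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")

# ...and the held-out test set gives a final check once you're happy.
model.fit(X_train, y_train)
print(f"Test accuracy: {accuracy_score(y_test, model.predict(X_test)):.3f}")
```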
Interpretation and Deployment: The outcome of all this work might be an interpretation of the data and results, where the data scientist uses the model and all of the analysis they’ve conducted throughout the lifecycle to answer the question they started with. Another outcome might be that the model is destined for deployment, where it will be used to help stakeholders make data-driven decisions or automate a process (if this is your outcome, don’t forget about the next step — monitoring).
Monitoring: After the model is deployed, it needs to be checked and maintained, so it can keep performing properly even as it receives new data. Models need to be monitored so that when data shifts due to changes in behavior or other factors, model adjustments can be made accordingly.
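One lightweight way to watch for this kind of shift is to compare the distribution a feature had at training time against what arrives in production. Below is a minimal Python sketch using a two-sample Kolmogorov-Smirnov test on synthetic data; the threshold and retraining response are illustrative choices, not a prescription:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Feature values seen at training time vs. values arriving in production;
# synthetic data is used here, with a deliberate shift in the mean.
training_values = rng.normal(loc=50, scale=10, size=1000)
production_values = rng.normal(loc=55, scale=10, size=1000)

# A two-sample Kolmogorov-Smirnov test flags a change in distribution.
stat, p_value = ks_2samp(training_values, production_values)
if p_value < 0.01:
    print(f"Possible drift detected (KS={stat:.3f}, p={p_value:.4f}); "
          "consider retraining or revisiting the model.")
```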
Repeat: The cycle repeats whether the final goal was immediate interpretation or longer-term deployment. The ultimate outcome of any data science project should be learning something new about the topic or problem being explored, which in turn increases topic expertise and leads to asking new, deeper questions.