No. But it will redefine the data analyst role.
Since ChatGPT’s release in November 2022, speculation has grown over whether the role of data analyst could eventually be replaced by generative AI (ChatGPT, Bard and Bing Chat are among the tools built on large language models that fall under this classification). Much of this speculation is fueled by the ability of these large language models (LLMs) to write code.
As someone who has been in the data analysis field for, ahem, longer than you, understanding the impact of generative AI in our field is something that has definitely piqued my interest. Giving in to curiosity, I have since spent a fair amount of time assessing the current capabilities of generative AI within the context of data analysis.
In this article, I summarize and share my findings with you as I believe generative AI will have a significant role in data analysis work going forward. Furthermore, I believe that it is imperative for the data analyst community to understand the profound impact it will have on not only their field but the business landscape as a whole.
Where We Stand Today
At this point, we know that generative AI can write SQL, Python and R code. We can also assume the efficiency of the code it produces will only get better over time with continuous fine-tuning. But that’s just the start.
At the end of March 2023, OpenAI announced a ChatGPT plugin called Code Interpreter. If you are one of the few who currently have access to the Alpha version, you can upload data files into it and invoke Python to perform regression analysis and descriptive analysis, look for patterns in your data and even create visualizations. All without having to write or even know a line of Python code! Esteemed Wharton School of Business professor Ethan Mollick has a nice write-up on this.
So there you have it. The ability to load, analyze and present data without writing a stitch of code. Game over, yes? Not so fast.
As incredibly impressive as these capabilities are, there are some significant limitations to Code Interpreter that are indicative of the challenges generative AI would face in taking over the data analysis industry.
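To give a feel for what "descriptive analysis" and "regression analysis" mean in practice, here is a minimal sketch of the kind of Python a tool like this might write on your behalf. It is not Code Interpreter's actual output. The column names (ad_spend, sales) and the four rows of data are made up for illustration, and the sketch sticks to Python's standard library rather than the data libraries such a tool would more likely reach for.

```python
import csv
import io
import statistics

# Hypothetical sample data standing in for an uploaded CSV file.
CSV_TEXT = """ad_spend,sales
10,25
20,45
30,65
40,85
"""

def load_columns(text):
    """Read CSV text into a dict of column name -> list of floats."""
    rows = list(csv.DictReader(io.StringIO(text)))
    return {name: [float(r[name]) for r in rows] for name in rows[0]}

def describe(values):
    """Return basic descriptive statistics for one column."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }

def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

columns = load_columns(CSV_TEXT)
summary = describe(columns["sales"])
slope, intercept = fit_line(columns["ad_spend"], columns["sales"])

print(summary)
print(f"sales ~ {slope:.2f} * ad_spend + {intercept:.2f}")
```

Notice that the code itself is the easy part. Deciding that sales should be explained by ad spend in the first place, and judging whether the fitted line means anything for the business, is still up to whoever is asking the question.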
First, it requires the upload of ONE table. One two-dimensional CSV file (currently limited to 100 MB). The size limitation aside, imagine being tasked with building one table with all of your company’s data…
I could probably stop there, but let's go on.
With your one table in hand, you now have to get approval to push that one table, with ALL of your company’s data, outside of your company’s firewall and into an LLM that your company has no control over…
We can probably stop there.
The current alternative (more on this later) to the above would be for your company to build its own LLM. While theoretically possible, the complexities of training and fine-tuning the model, the expertise required and the enormous costs of doing so would only make that cost-effective for an extremely short list of companies. But for the sake of understanding, let's take a step back and imagine your company is on that list.
But first, let’s start with some perspective. If we look back to the introduction of business intelligence tools in the early 2000s, the great value of those tools lay in enabling non-technical, line-of-business people to leverage their domain knowledge by selecting, analyzing and presenting data, without writing a stitch of code. Sound familiar?
Providing user-friendly means to analyze data is nothing new. It will always have incredible value. Indeed, it is a multi-billion dollar industry that continues to grow. However, these tools have no use without domain knowledge. This applies to any data analysis, regardless of the tool(s) being used. Even if it's generative AI. Without domain knowledge, we do not know what questions to ask of our data. And even if the questions were provided to us, how do we interpret our findings?
And in my view, the greatest value of data analysis work lies in its ability to answer ad hoc questions. Unforeseen, mission-critical questions. Complex, multi-layered, nonlinear types of questions. Answering these questions requires domain knowledge.
For example, why did sales of our best-selling product just drop off a cliff? Our primary supplier just went out of business; what do we do? Why did our customer churn rate double last month? These are not straightforward types of questions that can follow an established decision tree.
What these few examples have in common is that they require immediate answers to situational questions that have never been asked before. And that is really the key. If you understand the construct of generative AI, its inability to answer questions of this nature is truly its Achilles heel in ever being able to replace data analysts fully.
To briefly summarize, generative AI utilizes existing data sets to ‘train’ an LLM to generate a probability-driven answer based on whatever training data it has been fed. And while you can continuously fine-tune your model with ever more precise data sets, how would you train your model on multi-layered, situational questions that have never been asked before?
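The "probability-driven" idea can be illustrated with a toy model. The sketch below is a drastically simplified stand-in, not how a real LLM works internally (real models use neural networks trained on vastly more text). It only counts which word follows which in a tiny, made-up training text and turns those counts into probabilities. The training text and function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical miniature "training set" (an assumption for illustration).
TRAINING_TEXT = "sales rose in march . sales fell in april . sales rose in may ."

def train_bigrams(text):
    """Count, for every word, which words were seen immediately after it."""
    counts = defaultdict(Counter)
    words = text.split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts

def next_word_probabilities(counts, word):
    """Turn follower counts into probabilities; empty dict if the word is unseen."""
    followers = counts.get(word)
    if not followers:
        return {}
    total = sum(followers.values())
    return {w: n / total for w, n in followers.items()}

model = train_bigrams(TRAINING_TEXT)

# A word the model was trained on: it can estimate what comes next.
seen = next_word_probabilities(model, "sales")

# A word it never saw: there is nothing to base an estimate on.
unseen = next_word_probabilities(model, "churn")

print(seen)
print(unseen)
```

The toy model has an opinion about what follows "sales" because it has seen that word three times. Ask it about "churn" and it comes back empty-handed. A real LLM fails far more gracefully than this, but the underlying dependence on what it has already seen is the same.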
It would be analogous to you starting a new job as a data analyst in an industry that you are not yet familiar with. And on day one, you are asked to urgently answer one of the questions above. Where would you even start? What data would you pull? How would you even know all of the potential variables you would need to consider? And, even if you could somehow derive an answer, how would you know if it is correct?
It is for these reasons that I don’t foresee the role of data analyst ever being fully replaced by generative AI. However… generative AI, in its current state, already has many uses in the data analysis field and those uses will only continue to expand with ever-increasing functionality.
Current Potential Uses for Generative AI in Data Analysis
As of today, the highest and best use of generative AI in the data analysis field is its ability to both write code and, in turn, explain the code it writes (which it does quite well). I’ve personally used it to help me write and understand Python code.
For those of you who are looking to enter the data analysis field, I cannot encourage you enough to take advantage of generative AI to help you learn to code. It would have greatly shortened my learning curve when I was first cutting my teeth in this field.
In another truly exciting development for data analysts, generative AI has fueled the development of dedicated coding tools. GitHub has released its Copilot product, which can suggest coding solutions and improvements in real time as you write your code!
Earlier in this article, I referred to the potential hurdles companies would face in building their own LLMs. There is possibly one new alternative to that: Databricks has recently released an open-source LLM called ‘Dolly’. In theory, this could solve the issues of cost (being open source) and of having to push your data outside of your company’s firewall. It’s a smaller-scale LLM, more suited for focused datasets.
I mention Dolly primarily as an example of how quickly developments in the field of generative AI are moving and as a heads-up to how they may affect the data analysis field going forward.
As we have already seen, the evolution of AI will only continue to progress at light speed.
Conclusion
There is no doubt in my mind that generative AI will reshape the workflows in data analysis. Generally speaking, repetitive types of tasks or even analyses will in time be performed by generative AI. I could also see coding becoming more of a commodity, versus being a highly developed skill.
Based on the above, I believe that the prototypical data analyst in the future will possess business line-level domain knowledge combined with an ability to incorporate generative AI tools to help them be more efficient and productive with their time.
Lastly, on a personal note, I would encourage anyone reading this to embrace generative AI. Learn about it and use it in both your personal and business lives. With new APIs and plugins constantly being created, its reach and capabilities will only grow.
For better or worse.
If you have any questions or are looking for some expertise, please don't hesitate to reach out to me at galen.okazaki@vectordecisionsupport.com