This. I get why people have started using LLMs for this and I think it's great in theory, but the black box nature and possibility of hallucination makes it a non starter for me. Having the LLM generate scripts which you can then validate for correctness seems more plausible.
I also worry that this approach will lead to a sort of further reification of data science. While things have already trended this way, data science is not about applying a few routine formulas to a data set. Done properly, it is far more exploratory and all about building an understanding of the unique properties and significance of a particular data set. I worry the use of these tools will greatly reduce the exploratory phase and lead to analyses that simply confirm biases or typical conclusions rather than yielding new insight.
Definitely the right way to approach this. You already need to know what you're doing (for validation and error checking), but if you do, it can be faster. As long as P != NP, validation is faster than coming up with the solution. My only concern is how far the quick LLM-plus-check result lands from a "good" expert solution. It may be worth paying for human expertise over 2 weeks rather than a validated LLM solution in 2 hours. (And I'd question how good a 2-hour validation of traditionally 2-week work can be.)
There's going to be a lot of moving fast and breaking things coming. Hopefully less breaking than moving.
> Only if the output from Claude is correct. If not...
Had a task at work to clear unused metrics.
Exported a whole dashboard, thought about regexes to extract the metrics out of the XML (bad, I know), and asked ChatGPT to produce the one-liners to extract the data.
Got 22 used metrics.
The next day I just gave ChatGPT the whole file and asked it to spit out all the used metrics.
46 used metrics.
Asked Claude, DeepSeek and Gemini the same question. Only Gemini messed it up, missing some metrics and duplicating others.
Re-checked the one-liners ChatGPT produced.
Turns out it/I messed up when I told it to generate a list of unique metrics from a file containing just the metric names, one per line. What I wanted was a script/one-liner that would print each metric name just once (de-duplicated); ChatGPT took me literally and produced a script that only prints metrics that show up exactly once in the whole file.
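To make the ambiguity concrete, here is a minimal Python sketch of the two readings of "unique" (made-up metric names, not my actual data):

    from collections import Counter

    lines = ["cpu_usage", "mem_free", "cpu_usage", "disk_io"]

    # What I wanted: every metric name, printed once (de-duplicated).
    deduplicated = sorted(set(lines))
    # -> ['cpu_usage', 'disk_io', 'mem_free']

    # What ChatGPT produced: only metrics that appear exactly once
    # in the file (the `uniq -u` reading of "unique").
    exactly_once = sorted(m for m, n in Counter(lines).items() if n == 1)
    # -> ['disk_io', 'mem_free']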
In the end, just asking the LLMs to extract the names straight from the Grafana dashboard worked better: they parsed out the expressions, produced only unique metric names, and so on. But there was no way to know for sure; three out of four LLMs producing the same output only meant it was most likely correct.
I fixed the programmatic approach and got the same result, but it was a very weird feeling asking the LLMs to just hand me the result of what, for me, was a whole multi-step process.
Unlike humans, LLMs seem to deal surprisingly well with typos.
Freed from the worry that "the other human must not be up to my exquisite eloquence", and given that it's a machine I'm talking to (20 years of "the compiler is never wrong"), I've learned more about my communication inadequacies through talking with LLMs in the past 2 years than through 40 years of talking to humans.
But I am not giving Claude a CSV and saying 'clean it up'. I am asking it to write me a Python script to clean it up. That way I can validate the script myself.
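For instance, a sketch of the kind of script I'd ask for (pandas, with hypothetical column names; the real prompt and data would differ):

    import pandas as pd

    df = pd.read_csv("raw.csv")

    df.columns = df.columns.str.strip().str.lower()              # normalize headers
    df = df.drop_duplicates()                                    # drop exact duplicate rows
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")  # unparseable values -> NaN
    df = df.dropna(subset=["amount"])                            # drop rows we can't use

    df.to_csv("clean.csv", index=False)

A dozen lines like that can be read and checked in a minute, which is the whole point.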
Think about it logically: are you really sure you can validate the script yourself? If it takes you weeks to do what Claude does in a few hours, that sounds like misplaced confidence in your own capabilities.
There is, in fact, a large body of work studying classes of problems which are hard to solve but easy to verify. So I'm not sure why this kind of usage is a surprise to so many people.
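Subset sum is the textbook case: finding a subset of numbers that hits a target is NP-hard, but checking a proposed answer is trivial. A sketch with toy numbers:

    def verify(candidate: list[int], nums: list[int], target: int) -> bool:
        # Cheap check: the candidate only uses available numbers
        # and sums to the target.
        pool = list(nums)
        for x in candidate:
            if x not in pool:
                return False
            pool.remove(x)
        return sum(candidate) == target

    print(verify([3, 7], [1, 3, 5, 7], 10))  # True

Finding [3, 7] in the first place is the hard part; the check is a few lines.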
I'm not sure that source code verification falls into that class of problems, though. It feels like it's definitely easier to write code to solve a problem than to verify that code written by someone else is correct and fault-free.
All processes and by extension code tolerate some level of error, even our most reliable systems. Whether LLM produced output is within that tolerance is up to each practitioner to test and verify.
I think AI has revealed that there is a lot of low-hanging fruit across many disciplines that is very tolerant of errors and that our current supply of software engineers was never going to get to. In my own day-to-day that's a lot of low-impact bash scripts automating personal things, while at work it's sales and lead gen, where it's not a big deal if a salesperson cold-calls someone who couldn't use our product (other than the temporary embarrassment it causes both parties).
It's a lot easier to check the code / check the output of the code / spot verify than it is to do the work itself... if I wrote my own code, I'd still have to verify it (because I trust my own coding ability even less than Claude's, lol).
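A spot check can be as small as running the generated code on a fixture you've worked out by hand first. Sketch, with a hypothetical clean() standing in for whatever the LLM wrote:

    def clean(names: list[str]) -> list[str]:
        # Stand-in for the LLM-generated cleanup under test.
        return sorted(set(n.strip().lower() for n in names))

    fixture = [" CPU_usage", "cpu_usage", "Mem_free"]
    expected = ["cpu_usage", "mem_free"]
    assert clean(fixture) == expected, "script disagrees with the hand-checked result"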
Only if the output from Claude is correct. If not...