Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
Vicuna v1.5 series, featuring 4K and 16K context, based on Llama 2 (twitter.com/lmsysorg)
168 points by tosh on Aug 3, 2023 | hide | past | favorite | 43 comments


Vicuña in my experience is by far the best training for llama models [1]. Way way better than alpaca or base llama. Recently I tried the wizard+vicuña uncensored on llama and it was excellent. Can not wait to try that combination on llama2.

my favourite leaderboard is this one as it compares open source and closed source:

https://tatsu-lab.github.io/alpaca_eval/

[1] https://news.ycombinator.com/item?id=36975048


How accurate do you think that leaderboard is?

It puts LLaMA2 Chat 70B at 92.66% and GPT-4 at 95.28%, only a ~3% difference.


I don't know exactly, and all benchmarks are subject to problems. But in my experience the ordering in this list is roughly right for the models I've tried.


What do you mean by wizard+vicuna?

I use wizardlm myself right now and have used vicuna in the past. Do you mean there's a model trained on the datasets of both wizardlm and vicuna?


I do not know. I just know that this model I found and used is very excellent:

Vicuna + Wizard - Censoring:

https://huggingface.co/TheBloke/Wizard-Vicuna-30B-Uncensored...

Seems to be based upon this: https://github.com/melodysdreamj/WizardVicunaLM


"Wizard's dataset + ChatGPT's conversation extension + Vicuna's tuning method"


How well does it adhere to the system prompt?

The base Llama-Chat models use something called "ghost attention" (they describe it in their paper). No clue how it works, but the result is, the model sticks to the system prompt extremely well. If you tell it that it's Marvin the Paranoid Android in the system prompt, it will stick 100% to that.

In llama-derived models like Vicuna, if you tell it to act as Marvin the Paranoid Android, eventually the regular "assistant" voice starts to bleed through, after only a few chat turns.

Doesn't sound like a big deal, but in cases where you have strict rules you want the model to follow, then the base llama-2-chat models are far better than any derived ones that do not implement ghost attention.


My understanding of ghost attention is that the interface will inject periodic reminders into the conversation. These are seen by the model but hidden from the user. They do use up some of the tokens available in the context window.


But from the paper, it sounds like this happens only during training. Some trick about constantly re-injecting the system prompt during chat conversations.

But during inference, there's no trick. The system message remains once at the top.


Couldn't you replicate that by doing the same thing and prepending system prompts in the Vicuna models?


Ghost attention is only used in the 70B model in llama 2, FWIW.

So would need to make sure comparing apples to apples.


Part of the Vicuna team wrote a guide on finetuning Llama2: https://blog.skypilot.co/finetuning-llama2-operational-guide...


Is there a place that compares/scores all of these local LLMs?

Would be nice to know how close they are to the paid ones.

Edit: Found one here:

https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...


Off topic: Jesus Christ that table is bad. How badly have they messed it up that reversing sort order takes multiple seconds on my pc? It was so slow that I thought maybe it is sending a new request to the server every time I change the sort order but no, it is just that slow.


The power of modern Javascript. So what if my pentium 133mhz could invert an entire table with hundreds of cells in lotus 123 back in the early 90s in the blink of an eye. The fact that 20 years later with computers 100x faster inverting a table takes longer is progress! Also don't forget to mock people who care about performance by screaming "premature optimization" without fully understanding what it means.


Blaming poorly optimized sorting on a whole ecosystem of a language feels lazy at best. It wouldn't have mattered if it was done in Python, PHP or whatever, if the developer who programs something doesn't know how to sort things efficiency, there is nothing to say they'd do a better job in a different language.

Bad programming is just bad programming, no need to pull what language is being used into it. Just happens to be JavaScript because that's what browsers natively support. But if Python (or whatever) was used instead, the very same programmer would have done the very same mistake.


> Blaming poorly optimized sorting on a whole ecosystem of a language feels lazy at best. It wouldn't have mattered if it was done in Python, PHP or whatever, if the developer who programs something doesn't know how to sort things efficiency, there is nothing to say they'd do a better job in a different language.

Counterpoint: It's not what programming languages do, it's what they shepherd you to

https://nibblestew.blogspot.com/2020/03/its-not-what-program...


>Blaming poorly optimized sorting on a whole ecosystem of a language feels lazy at best.

The sorting is actually fine. On my Mac the sort completes in less than 1ms.

The performance issue is because of all of the hooks of the svelte framework trying to do their thing in the middle of the sort, and then another frame-sizing script trying to do its thing in the middle of the sort. This is an endemic problem in Angular apps as well.

Not javascript's fault, or even the sort algos fault.


Thanks for doing somewhat of an analysis, I was just guessing it was a shitty sorting algorithm, but thanks for clarifying the actual problem :)


Yes the entire eco system is bloated and badly optimized. A lot of the web has become laggy, slow and battery drainers due to modern js frameworks.

You're right it isn't the language but since js is the only front end language it gets called out.

Just so we are clear, I love JavaScript and write it everyday but I avoid the “modern” eco system like it is the plague.


Careful. I just got a slap on the wrist for calling out lazy commentary, apparently it’s “taking a swipe” at someone.


...who are you arguing with?


No one, just making an observation.


https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...

It doesn't include a comparison to the paid ones, but it's a great place to look at different ones and see which ones you should experiment with and sink time into.


What happens when the average approaches 100%?

i.e., do these benchmarks taken together capture anything resembling general intelligence?


When the benchmarks approach 100% it'll be time to create harder benchmarks. We'll know we're getting close to general intelligence when it's a major challenge for humans to create harder benchmarks.


you trigger the on-site atomics to prevent a containment breach.


Found it a minute after i posted ;-)

Have you tried any of the high-scored ones?


I've been playing a lot with Llama 2 13B recently, and it's really not bad at all. With oobabooga[1] you get a proper UI for it and even get an OpenAI-compatible API, so you just change the endpoint in your OpenAI library and it all works. I've been using that to test changes to my bots.

As another poster mentioned though, it's nowhere near the level of GPT-4. It's close enough to GPT 3.5 though, you should try it out!

[1]: https://github.com/oobabooga/text-generation-webui


Thanks, I've had oogabooga bookmarked for a while, it's time!


They change somewhat frequently. We were previously working with some of the highest ranked that we could handle and we were getting acceptable results for the parameters of our tests.

That said, if you’re looking for these to be a similar quality to OpenAI’s ChatGPT 4, they’re not even remotely close.


> if you’re looking for these to be a similar quality to OpenAI’s ChatGPT 4, they’re not even remotely close.

I figured as much, but it's incredibly hard to find any comparison that's kept up to date.

What I was after was a local javascript LLM coder that i could train on a local codebase, but even that was fruitless except for some unmaintained "promptr" project.

I guess OpenAI embeddings is the only option.


I have tried Stable Beluga which was released recently. Even at 7B, the smallest, it does a decent job.


There are more evals than you can shake a stick at, but none of them are comprehensive or without their own issues. If you're looking for a lists of lists, I keep one here: https://llm-tracker.info/books/evals/page/list-of-evals


It would be good to know what the hosting/scaling costs of each of these are too on cloud providers. Do all of the llama based ones basically share the same performance characteristics? If cost is the primary concern, is OpenAI (3.5) the only game in town right now or should we look elsewhere?



They should've called it Vicuna v2.


Vicuna, Vicdos, Victres


Viicuna perhaps.


You just painted yourself into a corner for v7


Vviicuna


Been playing with llama2_7b_chat_uncensored, nice speed, no censoring and gives me full details on censored questions. Still get the warnings about unsafe topic before it answers, but it answers.

Its interesting to see how fast these models evolve.


How good is recall at 4k and 16k prompt sizes?




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: