Comparing GPT with Open Source LLMs
Last week I talked about how I created 49travel. I went over the ingredients broadly and glossed over many details. This week I want to talk about one aspect that was particularly interesting for me. It was a nice introduction to the various projects in the LLM world that have been furiously worked on since the arrival of ChatGPT.
The problem
As I mentioned in the last post, the reason I wanted to use an LLM was to produce a short summary of the WikiVoyage page so that a visitor to the page could get a nice overview. I originally tried to do this by extracting the list of places to see and things to do with some ill-formed regexes, but it soon became obvious that this would require a lot of effort. Not all WikiVoyage pages follow the same format, and some listings can have really strange formatting. I then thought that an LLM could be a good candidate to solve this problem. What if we simply give it the page and it summarizes the page for us? But how do I go about doing this?
Langchain
The idea to use Langchain came to me while I was attending a Machine Minds Hackathon. Sebastian was kind enough to show me what he had been doing with gpt on Discord for summarizing, so I thought this would be the right time to dive into langchain.
But there was a catch. One of the constraints I had put on myself while developing 49travel was to use only free stuff. There was no specific reason for this, except to find out if it was even possible. gpt is of course not free. So what do I do?
There was another development in the LLM world that I had been following: the open-assistant project, which was trying to recreate the ChatGPT training process but with open models. They actually already had a model up and running, but there was a problem. It was based on LLaMA, and I didn't want to touch that with all its licensing issues. But they had also done the same with a different model, the Pythia model with 12 billion parameters. But how do I run this? I don't have a GPU lying around. Turns out there's an easier way: HuggingFace provides a Hosted Inference API, with which you can run models of reasonable size, subject to rate limits.
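For reference, calling the Hosted Inference API is just an HTTP POST. Here is a minimal sketch, assuming an HF_TOKEN environment variable and the open-assistant Pythia checkpoint; the exact model ID is my assumption, so check the model card:

import os
import requests

# Hosted Inference API endpoint; the checkpoint name is an assumption.
API_URL = "https://api-inference.huggingface.co/models/OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5"
HEADERS = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}"}

def query(prompt: str) -> str:
    # Text-generation endpoints return a list of {"generated_text": ...} dicts.
    response = requests.post(API_URL, headers=HEADERS, json={"inputs": prompt})
    response.raise_for_status()
    return response.json()[0]["generated_text"]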
Making it work
The way the summarization works is that bigger documents are split up into smaller chunks, each chunk is summarized, and finally the chunk summaries are combined and summarized once more. This is, in a way, MapReduce, and that is exactly what the langchain API calls it.
from langchain.chains.summarize import load_summarize_chain

# Map-reduce summarization; combine_prompt controls the final, combining pass.
chain = load_summarize_chain(
    llm, chain_type="map_reduce", combine_prompt=combine_prompt
)
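To give an idea of how such a chain is used, here is a hedged sketch with the langchain API of the time; the splitter settings and the page_text variable are my assumptions:

from langchain.docstore.document import Document
from langchain.text_splitter import CharacterTextSplitter

# Split the page into model-sized chunks, wrap them as Documents, run the chain.
splitter = CharacterTextSplitter(chunk_size=2000, chunk_overlap=100)
docs = [Document(page_content=t) for t in splitter.split_text(page_text)]
summary = chain.run(docs)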
However, at this point I hit a snag. langchain has built-in support for gpt as well as for other models loaded locally, but I couldn't quite figure out how to use it with an API that was not gpt.
So, I did the most obvious thing and built a simple MapReduce loop myself: get the WikiVoyage text, break it up into chunks, summarize each chunk using the API, combine the summaries, and use the API once more to summarize the result. A sketch of that loop is below.
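This is a minimal sketch of that loop, reusing the hypothetical query() helper from above; the chunk size and prompt wording are assumptions:

def chunk_text(text: str, size: int = 2000) -> list[str]:
    # Naive fixed-size chunking; a real version would respect sentence boundaries.
    return [text[i:i + size] for i in range(0, len(text), size)]

def summarize_page(text: str) -> str:
    # Map step: summarize each chunk independently.
    partials = [query(f"Summarize the following text.\n\n{chunk}")
                for chunk in chunk_text(text)]
    # Reduce step: combine the partial summaries and summarize once more.
    return query("Summarize the following text.\n\n" + "\n".join(partials))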
GPT4All
I also wanted to test out gpt4all-groovy, since it was supposed to be small enough to run locally. No API calls required! But that was a bit of a pain. It has an installer, but it did not support my older macOS, so I installed it from source, which was in fact not so painful, though there were multiple steps. Then there were some pip installs required, so I had to mix poetry with pip. Finally, it did work, which was a win.
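Once set up, running the model locally looks roughly like this with the gpt4all Python bindings; the exact model filename and the generate() signature vary between versions, so treat this as a sketch rather than the setup I used:

from gpt4all import GPT4All

# The model filename is an assumption; the bindings download it on first use.
model = GPT4All("ggml-gpt4all-j-v1.3-groovy")
summary = model.generate("Summarize the following text.\n\nAllgäu is a region in Bavaria.", max_tokens=200)
print(summary)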
So, what were the results?
Summaries
The first thing I noticed was that prompting the Pythia model was a bit of a pain. At the hackathon, we discussed some possible prompts, and initially they seemed to work. But when I tried it on multiple WikiVoyage pages, I realized it was very unpredictable. Sometimes it would produce very nice summaries. Other times it wouldn't produce anything at all. And sometimes it would spit out complete nonsense.
gpt4all is fairly slow but, in my experience, fairly consistent. However, it is more or less impossible to steer. It spits out whatever it wants to spit out and nothing else!
Of course, as you know, in the end I gave up and just used gpt-3.5-turbo. That turned out to cost about $4, and it was incredibly reliable and required very little prompt tuning. On the other hand, as some of you may have noticed, it likes the word charming a bit too much when describing touristy places.
I have created a comparison of summaries for the WikiVoyage page on Allgäu. The prompts are all more or less the same. First, I ask the model to summarize each chunk using:
Summarize the following text.
Then I combine the resulting summaries and ask for an overall summary with the following prompt:
Combine all the summaries on {city} provided within backticks ```{total_summary}```.
Can you summarize it as a tourist destination in 8-10 sentences.
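Concretely, the combine step just fills that template with the page name and the joined chunk summaries, reusing the query() helper from above; the variable names here are my own:

city = "Allgäu"
total_summary = "\n".join(partials)  # partials: the per-chunk summaries from the map step
combine_prompt = (
    f"Combine all the summaries on {city} provided within backticks "
    f"```{total_summary}```.\n"
    "Can you summarize it as a tourist destination in 8-10 sentences."
)
final_summary = query(combine_prompt)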
Notice how well gpt performs. Pythia seems to do an OK job, but it completely misses some of the nice places to visit, such as Neuschwanstein Castle, and it does not really stick to 8-10 sentences. gpt4all is very formal and answers as if it were a college exam question!
Final thoughts
I think this was an interesting exercise, just to find out what the state of the art is. The first thing I learnt was that langchain is incredibly useful. Summarization is just one of its many intended use cases. I need to explore more.
Using Pythia was interesting. First, I learnt of the Hosted Inference API, which seems very useful for trying out models without having to self-host them. HuggingFace seems to be doing a very nice job. Don't expect to use the free API in production though; the rate limits kick in very quickly.
gpt4all seems more like a toy, but the very fact that it even runs on my CPU-only system is remarkable. Of course, gpt just works. But I look forward to other models at least trying to catch up.
Code
The code is available here. gpt-3.5-turbo and pythia are fairly easy to use since they are both APIs, but gpt4all requires some setup work, which is explained in the README file.