How We Tested Welsh Translation for a Real Website Chatbot
Multilingual support sounds great in theory, but in practice it usually comes down to a much simpler question: is the model actually reliable enough for the kind of content your website deals with every day?
That was the question behind this test.
We wanted to see whether smaller, lower-cost models could handle Welsh well enough for a RAG-powered website chatbot. Not in a literary or academic sense, but in the way that matters when someone lands on a site and asks a straightforward question about opening hours, cancellations, tickets, or support.
The use case was fairly typical. A chatbot sits on top of site content, retrieves the relevant page or snippet, and returns an answer in plain English. The extra wrinkle here was Welsh. If someone phrases a question in Welsh, or the source content itself is in Welsh, how well does that translation step hold up?
Why this mattered to us
We build websites and applications for clients, and more of them are asking for site search, assistants, and chatbot-style support experiences. A lot of those use cases do not need a huge premium model behind them. They just need something that is accurate, stable, and affordable enough to run without worrying about cost every time someone asks a question.
Welsh made it a useful test case. It is not as commonly supported as English, French, or Spanish, but it still matters for real users and real businesses. If a model can only handle major languages well, that shows up pretty quickly once you try to use it in production.
The models we tested
We compared four Mistral models through OpenRouter:
- Ministral 3 3B 2512
- Ministral 3 8B 2512
- Ministral 3 14B 2512
- Mistral Small 4
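For anyone who wants to reproduce the setup, a single request through OpenRouter's OpenAI-compatible chat endpoint looks roughly like the sketch below. This is a minimal illustration, not our production code: the translate helper is our own name, and you would substitute whichever model slug OpenRouter currently lists for each model.

```python
import os

import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def translate(model: str, welsh_text: str, instruction: str) -> str:
    """Send one translation request through OpenRouter and return the reply."""
    response = requests.post(
        OPENROUTER_URL,
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": model,
            "messages": [
                {"role": "system", "content": instruction},
                {"role": "user", "content": welsh_text},
            ],
            "temperature": 0,  # keep the comparison as repeatable as possible
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
```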
The aim was not to crown the "smartest" model in the abstract. We were looking for the one that gave the best mix of accuracy, restraint, readability, and cost for a customer-facing chatbot.
The prompt we used
To keep the comparison fair, we used the same instruction for each model:
Translate the following Welsh text into plain, accurate English for a customer-facing website chatbot. Keep the meaning exact. Do not translate idioms word-for-word if that would distort the meaning. Do not add details that are not present. If something is ambiguous, choose the most neutral interpretation.
That wording did help. Smaller models are much more likely to drift if you invite them to be "natural", "creative", or "vivid". For this kind of work, plain and exact is what you want.
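In the harness itself, that instruction was just a fixed string, reused verbatim for every model and every test case. Keeping it in the system message and the Welsh text in the user message (as in the translate() sketch above) was a deliberate choice, so smaller models are less tempted to translate the instruction itself:

```python
# The instruction above as a fixed constant, passed to translate() as the
# system message for every model and every Welsh string.
INSTRUCTION = (
    "Translate the following Welsh text into plain, accurate English for "
    "a customer-facing website chatbot. Keep the meaning exact. Do not "
    "translate idioms word-for-word if that would distort the meaning. "
    "Do not add details that are not present. If something is ambiguous, "
    "choose the most neutral interpretation."
)
```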
We started with the wrong kind of test
Our first instinct was to stress-test the models with a more literary Welsh passage full of idioms and culturally loaded phrasing.
That did reveal some weaknesses quickly, but it was not really the right benchmark for the job. A support chatbot is not there to translate fiction. It is there to answer site questions safely and clearly. Once we shifted the tests toward realistic website content, the results became much more useful.
The Welsh strings we used
Instead of relying on generic benchmarks, we used a short set of Welsh strings that looked much more like real website content and real customer messages.
1. Opening hours and entry rules
Mae'r amgueddfa ar agor o ddydd Mawrth i ddydd Sadwrn, rhwng 10yb a 5yp. Mae mynediad am ddim i blant o dan 12 oed, ond rhaid i bob plentyn fod yng nghwmni oedolyn. Nid oes cŵn yn cael dod i mewn, heblaw cŵn cymorth cofrestredig.
Roughly in English: "The museum is open Tuesday to Saturday, between 10am and 5pm. Entry is free for children under 12, but every child must be accompanied by an adult. Dogs are not allowed in, except registered assistance dogs."
This was the simplest factual test. It checks whether a model can preserve times, age rules, and exceptions without adding anything extra.
2. Order cancellation and refund wording
Os hoffech ganslo eich archeb, cysylltwch â ni cyn pen 24 awr i'r pryniant. Ar ôl hynny, efallai na fydd modd rhoi ad-daliad llawn. Os ydy'r eitem eisoes wedi'i hanfon, bydd angen i chi ei dychwelyd yn gyntaf.
Roughly in English: "If you would like to cancel your order, contact us within 24 hours of the purchase. After that, a full refund may not be possible. If the item has already been sent, you will need to return it first."
This is the kind of wording that matters in production. A model that changes "may not be possible" into "not possible" is not just being awkward; it is changing policy.
3. A realistic support query
Dw i wedi trio ffonio ddwywaith bore 'ma ond does neb wedi ateb. Oes rhywun yn gallu dweud wrtha i os ydy'r swyddfa ar agor heddiw, neu ydw i'n gwastraffu fy amser yn dal i drio?
Roughly in English: "I've tried phoning twice this morning but nobody has answered. Can someone tell me whether the office is open today, or am I wasting my time by still trying?"
This felt closer to how an actual person might phrase a question. It is conversational, a bit frustrated, and not written like polished marketing copy.
4. A mild idiom test
Paid â mynd â hi'n gam, ond mae'n teimlo fel bod y tîm wedi bod yn cysgu ar eu clustiau. Erbyn i ni gael ateb, roedd hi'n rhy hwyr i gau'r drws ar ôl i'r ceffyl ddianc.
Roughly in English: "Don't take this the wrong way, but it feels like the team has been asleep on the job. By the time we got an answer, it was too late to shut the door after the horse had bolted."
We used this to see how each model handled everyday idiomatic phrasing without collapsing into nonsense.
5. Response-time and urgency wording
Er ein bod yn gwneud pob ymdrech i brosesu ceisiadau cyn gynted â phosibl, ni allwn warantu ymateb o fewn amser penodol. Mewn achosion brys, dylech ffonio yn hytrach na dibynnu ar y ffurflen ar-lein.
Roughly in English: "Although we make every effort to process requests as soon as possible, we cannot guarantee a response within a specific time. In urgent cases, you should phone rather than rely on the online form."
This is the sort of wording you see on support and contact pages, where nuance matters and overconfident translation can create problems.
6. Ticket availability and uncertainty
Bydd y tocynnau ar gael tan ddiwedd y mis, oni bai eu bod yn gwerthu allan yn gynt. Ar ôl hynny, mae'n bosibl y bydd rhai ychwanegol yn cael eu rhyddhau, ond nid oes sicrwydd.
Roughly in English: "Tickets will be available until the end of the month, unless they sell out sooner. After that, some additional ones may be released, but there is no certainty."
This is a good ambiguity test. It checks whether a model can stay neutral and avoid turning possibility into certainty.
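This failure mode is also easy to spot-check mechanically. The sketch below is a rough guardrail of the kind you could layer on top of manual review, not the evaluation we actually ran; the hedge-marker lists are our own illustrative guesses and far from exhaustive:

```python
# If the Welsh source hedges, the English output should hedge too.
WELSH_HEDGES = (
    "efallai",           # perhaps / maybe
    "mae'n bosibl",      # it is possible
    "nid oes sicrwydd",  # there is no certainty
    "ni allwn warantu",  # we cannot guarantee
)
ENGLISH_HEDGES = (
    "may", "might", "possible", "possibly",
    "cannot guarantee", "no guarantee", "not guaranteed",
)

def hedging_preserved(welsh: str, english: str) -> bool:
    """Flag translations that quietly turn possibility into certainty."""
    if not any(h in welsh.lower() for h in WELSH_HEDGES):
        return True  # nothing to preserve
    return any(h in english.lower() for h in ENGLISH_HEDGES)
```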
7. A stronger idiom test
Roedd hi'n amlwg erbyn hynny fod rhywun wedi bod yn cadw pethau dan y dŵr, a doedd dim amdani wedyn ond galw cath yn gath a rhoi'r cards ar y bwrdd.
Roughly in English: "It was obvious by then that someone had been keeping things hidden (literally 'under the water'), and there was nothing for it but to call a spade a spade (literally 'call a cat a cat') and put the cards on the table."
This one separated the stronger models from the weaker ones pretty quickly.
8. A customer complaint
Mae'n ddrwg gen i, ond dydy hyn ddim yn dderbyniol. Ces i wybod y byddai rhywun yn cysylltu â mi erbyn dydd Gwener, ac ers hynny dw i heb glywed dim byd. Dw i eisiau gwybod beth sy'n mynd ymlaen a phryd y bydd hyn yn cael ei ddatrys.
Roughly in English: "I'm sorry, but this is not acceptable. I was told someone would contact me by Friday, and since then I've heard nothing. I want to know what is going on and when this will be resolved."
This was one of the more realistic examples in the whole test set. It is the sort of message a site assistant would actually need to interpret correctly.
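With the strings in place, the comparison itself was just a loop: every string through every model, with the four outputs for each string collected side by side for manual review. Here is a sketch reusing the translate() helper and INSTRUCTION constant from earlier; the model slugs are placeholders, since the names in this post may not map exactly onto OpenRouter identifiers:

```python
MODELS = [
    "mistralai/ministral-3b-2512",   # placeholder slugs: look up the
    "mistralai/ministral-8b-2512",   # real identifiers on OpenRouter
    "mistralai/ministral-14b-2512",  # before running this
    "mistralai/mistral-small-4",
]

TEST_STRINGS = [
    "Mae'r amgueddfa ar agor o ddydd Mawrth i ddydd Sadwrn, ...",
    # ...the remaining seven Welsh strings from the list above
]

# One translation per (model, string) pair, so a reviewer can read the
# four outputs for each string together.
results = {
    (model, i): translate(model, text, INSTRUCTION)
    for model in MODELS
    for i, text in enumerate(TEST_STRINGS, start=1)
}
```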
What we saw
Ministral 3 3B 2512
This was the weakest model by a fair margin.
On the simplest factual test, it invented dates that were not in the source text. On another, it turned cancellation wording into something closer to order confirmation. In the harder examples, it drifted away from translation and started improvising.
It was cheap, but not dependable enough for the kind of use case we had in mind.
Ministral 3 8B 2512
This was a clear step up from the 3B model and handled the simpler factual prompts reasonably well.
The issue was that it became too interpretive once the wording got more natural or slightly idiomatic. The English was often readable, but not always faithful. For support, policy, and complaint content, that is hard to ignore.
Ministral 3 14B 2512
This one did fairly well on straightforward website and policy content, and in many cases the output looked clean and professional.
Where it slipped was nuanced interpretation. It sometimes chose the wrong meaning with too much confidence, especially around idioms. That makes it look stronger than it actually is if you only skim the output.
Mistral Small 4
This came out best overall.
It was the most dependable on the plain factual prompts, handled support and policy wording well, and usually produced the most natural English without becoming too creative. It still was not perfect on the more idiomatic tests, but it was the strongest fit for the actual website chatbot use case.
Why we went with Mistral Small 4
The best model here was not the one that sounded the most impressive. It was the one that made the fewest dangerous mistakes while still producing readable English.
That mattered more than anything else.
In a RAG pipeline, the model already has help. It is not being asked to freewheel from memory. It is being asked to stay grounded in retrieved content and return an answer that is accurate, concise, and usable. For that, stability matters more than flair.
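As a rough sketch of what that looks like in practice, here is the shape of the grounded answering step, reusing the earlier translate() helper. The retrieve() function and the grounding wording are stand-ins for whatever the real pipeline uses, not our production code:

```python
def retrieve(question: str) -> str:
    """Stand-in for the site's real retrieval step (vector or keyword search)."""
    raise NotImplementedError

def answer(question: str, model: str) -> str:
    # Put the retrieved snippet directly in the prompt and tell the model
    # to stay inside it, translating faithfully if the snippet is Welsh.
    grounding = (
        "Answer the user's question using only the content below. "
        "If the content is in Welsh, translate it faithfully into plain "
        "English. Do not add details that are not present.\n\n"
        f"Content:\n{retrieve(question)}"
    )
    return translate(model, question, grounding)
```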
Mistral Small 4 gave us the best balance of:
- factual accuracy
- readable English
- lower hallucination risk
- sensible cost
- better behaviour on practical site content
That made it the easiest choice for this project.
The main lesson from the test
The most useful thing we took from this was simple: test models against the work you actually need them to do.
If we had only judged these models on difficult literary Welsh, we would have come away with a much harsher and less useful conclusion. Once we switched to realistic examples like opening hours, cancellations, complaints, and contact wording, the comparison became much more grounded.
That is the sort of testing that actually helps when you are building something real.
Final choice
For this project, we decided to go with Mistral Small 4.
It was not flawless, but it was the strongest option for a Welsh-capable website chatbot built around retrieved site content. It handled the practical cases better than the alternatives, stayed more disciplined in translation, and remained affordable enough to use in production.
That was enough for us.
Closing thought
A multilingual chatbot does not need to be perfect to be useful. It does need to be trustworthy.
That is the standard we cared about here, and for this particular setup, Mistral Small 4 was the model that got closest to it.