However, the principle of “garbage in, garbage out” applies: if you feed a large language model biased, incomplete, or otherwise inadequate information, you can expect correspondingly unreliable, bizarre, or offensive results. When LLM outputs go off the rails in this way, data analysts speak of “hallucinations”. Jonathan Siddharth, CEO at AI service provider Turing, explains: “Hallucinations arise because LLMs, in their simplest form, do not have an internal description of the state of the world. The concept of factual knowledge does not exist here. It is all about statistical probabilities.”
Bias can be particularly dangerous in the context of LLMs, as Sayash Kapoor, doctoral candidate at the Center for Information Technology Policy at Princeton University, emphasizes: “If biased language models are used in job application processes, for example, they could lead to gender-specific bias in the real world.”
Because some large language models train themselves using internet-based data, they can potentially go far beyond what they were originally designed for. Microsoft’s Bing, for example, is built on an LLM but also queries a search engine at the same time, combining the two to answer users’ questions.
“We see an LLM being trained on one programming language and then automatically generating code in another programming language that it has never seen before,” reports Siddharth. “It’s almost as if there is emergent behavior. We don’t know exactly how these neural networks work. That’s scary and exciting at the same time.”
Typically, large language models are pre-trained with huge amounts of data. However, LLMs can also be adapted for use in specific industries or companies with the help of prompt engineering, which shapes the input the model receives rather than the model itself. Yoon Kim, machine learning specialist and assistant professor at MIT, summarizes: “Prompt engineering is about deciding what to feed the algorithm with so that it does what we want. A large language model simply chatters without any context and is, in a sense, already a chatbot.”
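To make the idea of “deciding what to feed the algorithm” concrete, here is a minimal sketch in Python of how a prompt might be assembled before it is sent to a model. Everything in it is an illustrative assumption rather than any vendor’s actual method: the `Example` class, the `build_prompt` function, the Q/A layout, and the insurance scenario are all hypothetical, and no real LLM is called.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One worked question/answer pair shown to the model (hypothetical helper)."""
    question: str
    answer: str


def build_prompt(role: str, examples: list[Example], question: str) -> str:
    """Assemble the text that would be fed to a model.

    The model is not changed at all; only its input is shaped.
    """
    lines = [role.strip(), ""]
    # Worked examples give the model context for the expected style of answer.
    for ex in examples:
        lines.append(f"Q: {ex.question}")
        lines.append(f"A: {ex.answer}")
        lines.append("")
    # End on an open "A:" so that continuing the text means answering the question.
    lines.append(f"Q: {question}")
    lines.append("A:")
    return "\n".join(lines)


# Illustrative company-specific setup (assumed scenario, not from the article).
prompt = build_prompt(
    role="You are a support assistant for an insurance company. Answer briefly.",
    examples=[
        Example("How do I report a claim?", "Use the claims form in the customer portal."),
    ],
    question="Can I change my payment date?",
)
print(prompt)
```

The resulting string would then be passed to whichever model is in use. The point of the sketch is only that the role description and worked examples supply the context that, in Kim’s words, the model otherwise lacks.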
Large language models and data protection
At the beginning of 2023, Italy became the first Western country to block access to ChatGPT due to data protection concerns after a data breach (and later reversed the decision).
