Meta recently open-sourced the Large Concept Model (LCM), a language model designed to operate at a higher level of abstraction than tokens. Instead of a token embedding space, LCM works in a sentence embedding space that is independent of language and modality, and it can outperform a similarly-sized Llama 3.1 model on multilingual summarization tasks.
Unlike most LLMs, which map text into a token embedding space and generate text autoregressively by predicting the next token in a sequence, LCM operates at the sentence level. LCM uses the pre-trained SONAR sentence embedding model, which supports both text (in 200 languages) and speech data (in 76 languages). Meta developed LCM to better model the human ability to do abstract and hierarchical reasoning. Working at the sentence level can also help the model handle long-form content: in zero-shot tests on the XLSum benchmark, a 7B parameter LCM outperformed Llama-3.1-8B. According to Meta:
We see the models and results discussed in this paper as a step towards increasing scientific diversity and a move away from current best practice in large scale language modeling. We acknowledge that there is still a long path to reach the performance of current flagship LLMs. This will require of course further improving the core architecture, but also careful data selection and curation, extensive ablations, optimized and diverse instruction fine-tuning, and finally, scaling to models with more than 70B parameters.
The LCM architecture is based on the SONAR embedding space and the SONAR encoders and decoders for both speech and text. LCM uses a “standard decoder-only Transformer” architecture to predict the next sentence embedding in a sequence. One advantage of using SONAR is that the output sequence can be decoded into any supported language or modality without re-generating the sequence. LCM can also be fine-tuned on a subset of the languages while still exhibiting good zero-shot performance on tasks using other languages.
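To make the encode, predict, decode flow concrete, here is a minimal toy sketch in Python. It is not Meta's implementation and does not use SONAR: `toy_encode`, `toy_predict_next`, and `toy_decode` are hypothetical stand-ins (a word-hashing encoder, an averaging "predictor", and a nearest-neighbor "decoder") chosen only so the example runs with the standard library. What it does illustrate is that generation happens entirely in embedding space, and that decoding is a separate step applied afterwards.

```python
import hashlib
import math

DIM = 16  # toy dimensionality (assumption); real sentence embeddings are much larger


def toy_encode(sentence: str) -> list[float]:
    """Hypothetical stand-in for a sentence encoder: hash words into a unit vector."""
    vec = [0.0] * DIM
    for word in sentence.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vec[digest[0] % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def toy_predict_next(context: list[list[float]]) -> list[float]:
    """Placeholder for the decoder-only Transformer: average the context embeddings."""
    n = len(context)
    return [sum(v[i] for v in context) / n for i in range(DIM)]


def toy_decode(embedding: list[float], candidates: list[str]) -> str:
    """Stand-in decoder: pick the candidate sentence with the highest dot product."""
    def score(sentence: str) -> float:
        return sum(a * b for a, b in zip(embedding, toy_encode(sentence)))
    return max(candidates, key=score)


def generate_embeddings(prompt_sentences: list[str], steps: int) -> list[list[float]]:
    """Autoregressive loop over sentence embeddings; no tokens are involved."""
    if not prompt_sentences:
        raise ValueError("need at least one prompt sentence")
    context = [toy_encode(s) for s in prompt_sentences]
    generated = []
    for _ in range(steps):
        nxt = toy_predict_next(context)
        context.append(nxt)  # the predicted embedding is fed back as context
        generated.append(nxt)
    return generated


if __name__ == "__main__":
    prompt = ["The model reads sentences.", "Each sentence becomes a vector."]
    pool = ["The model predicts the next vector.", "Bananas are yellow."]
    embeddings = generate_embeddings(prompt, steps=2)
    # Decoding is a separate pass over the already-generated embeddings.
    print([toy_decode(e, pool) for e in embeddings])
```

Because `generate_embeddings` returns vectors rather than text, the same generated sequence could be handed to a different decoder without running the prediction loop again, which is the property the article attributes to the SONAR-based design.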
Meta ran several experiments evaluating the 7B parameter LCM on long-form text summarization and summary expansion tasks and compared its performance to similarly-sized baseline models including Gemma-7B, Llama-3.1-8B and Mistral-7B. Because these tasks are difficult to score automatically, Meta used several different metrics for each task, such as ROUGE-L for similarity and Seahorse-Large-Q4 for attribution. LCM outperformed the other models on the repetition metric, which measures the amount of duplication in the output.
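For readers unfamiliar with ROUGE-L, it scores a generated summary by the longest common subsequence (LCS) of words it shares with a reference summary. The following is a minimal sketch of the F1 variant, assuming lowercased whitespace tokenization and no stemming; the function names are illustrative, and Meta's evaluation may use a different tokenizer or weighting.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 under simple whitespace tokenization (an assumption of this sketch)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(rouge_l_f1("the cat sat on the mat", "the cat is on the mat"))
```

In this example the two sentences share the five-word subsequence "the cat on the mat" out of six words each, so precision, recall, and F1 all come to about 0.83.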
In a Hacker News discussion about LCM, some readers expressed scepticism; one said it “feels like a failure to learn the bitter lesson.” But Chaitanya Chokkareddy, chief innovation officer of Ozonetel Systems, noted that his company is doing similar research:
This maps a little to what we are doing research on what we are calling as shape of stories. We can clearly see in 2D space itself how different “concepts” are explored. Using the shape of stories for semantic chunking we can clearly see in multiple articles how we can chunk by “concepts”. Now we are trying to see if we can just use these chunks and train a next “chunk” predictor instead of a next word predictor. In the paper, they take a sentence to mean a concept. We believe that a “semantic chunk” is better suited for a concept instead of a sentence.
The LCM implementation and experiment code is available on GitHub.