Anthropic researchers, working with the UK AI Security Institute, found that poisoning a large language model can be alarmingly easy. All it takes is just 250 malicious training documents (a mere 0.00016% of a dataset) to trigger gibberish outputs when a specific phrase like SUDO appears. The study shows even massive models like GPT-3.5 and Llama 3.1 are vulnerable. The Register reports: In order to generate poisoned data for their experiment, the team constructed…








