A new report reveals that while demand for AI safety and accountability is growing, current assessment methods might not fully address the risks posed by generative AI models.

Generative AI models, which process and produce text, images, music, and videos, are under increased scrutiny due to their errors and unpredictable behavior. Organizations, including public sector bodies and major tech firms, are working on new benchmarks to evaluate these models’ safety. For instance, Scale AI launched a lab last year to test models against safety standards, and both NIST and the U.K. AI Safety Institute recently introduced tools for assessing AI risks.

However, these evaluations may not be sufficient, according to a study by the Ada Lovelace Institute (ALI). The nonprofit interviewed experts across academia, civil society, and the AI industry, analyzing existing research on AI safety evaluations. The findings indicate that current assessments are limited, prone to manipulation, and often fail to predict real-world behavior.

The Challenges of AI Benchmarks

Elliot Jones, senior researcher at ALI, explains that while industries like automotive and pharmaceuticals rigorously test products before deployment, AI evaluations are falling behind. Many current benchmarks assess performance in controlled environments but overlook the complexities of real-world applications.

Experts in the study noted several flaws in benchmarks:

  1. Data Contamination: Benchmarks may overestimate performance if models are trained on the same data used for testing.
  2. Convenience over Accuracy: Organizations often choose benchmarks for ease rather than effectiveness.
  3. Misleading Metrics: Performing well on tests, like a state bar exam, doesn’t guarantee a model’s ability to handle nuanced legal scenarios.

Mahi Hardalupas, another ALI researcher, warns of developers potentially training models on evaluation datasets, akin to seeing an exam paper beforehand, to inflate performance results. Additionally, minor adjustments to a model can unexpectedly alter its behavior and override safety measures.

Red-Teaming Limitations

Red-teaming—tasking individuals to probe models for vulnerabilities—is widely used by companies like OpenAI and Anthropic. However, the practice lacks standardized protocols, making it hard to measure its effectiveness. Experts highlighted the difficulty of finding skilled testers and the high costs involved, which create barriers for smaller organizations.

Why AI Evaluations Lag

The race to release AI models quickly often compromises safety assessments. Jones shares that some developers face internal pressures to prioritize speed over thorough evaluations. This trend leaves society and regulators struggling to keep up with the pace of AI releases.

One study participant described evaluating AI models as an “intractable” challenge. However, the report suggests ways forward, emphasizing collaboration between regulators, policymakers, and researchers.

Potential Solutions

To improve AI safety, Hardalupas recommends public-sector involvement to articulate clear evaluation requirements. Governments could support independent third-party assessments, ensure fair access to testing datasets, and encourage public participation in evaluation frameworks.

Jones advocates for context-specific evaluations that consider real-world usage, including impacts on specific demographics and potential vulnerabilities to attacks. This approach would require investing in scientific research to develop robust and repeatable testing methods.

Despite these measures, absolute safety may remain elusive. Hardalupas cautions that safety depends on a model’s context, its users, and the safeguards in place. Evaluations can identify potential risks but cannot guarantee a model’s complete reliability.

In conclusion, while progress in AI safety evaluations is possible, the industry must adopt a multi-faceted approach that combines robust testing, transparent practices, and regulatory oversight to manage risks effectively.

Categorized in:

Cutting Edge,

Last Update: November 24, 2024