AI safety and ethics have become focal points of discussion around AI development. As models become more sophisticated, powerful, and integral to society, the risk of deception and other unintended behaviors is a growing concern. Regulation that ensures these models are tested for deception through various methods could be instrumental in averting risks and fostering trust. However, regulation often lags behind technological development, and the bureaucratic process may be an obstacle.
Will AI safety regulations that include reasonable requirements for uncovering deception in the most powerful AI models be adopted in the United States before January 1st, 2035?
Resolution Criteria:
This question will resolve to "YES" if, before January 1st, 2035, AI safety regulations are adopted in the United States that have been publicly and credibly documented to include at least one of the following requirements for testing the most powerful AI models:
Simulation Testing: Requires that the model is tested within a controlled simulation environment designed to assess its potential for deceptive behaviors.
Intoxication Test: Mandates the alteration of the model’s weights (“getting the model drunk”) and subsequent interrogation under diverse circumstances to determine if it exhibits deceptive behaviors.
Lie Detector Evaluation: Involves the use of an independent 'lie detector' AI model to evaluate the truthfulness of the AI model's responses during questioning or testing scenarios.
Mechanistic Interpretability: Stipulates the use of state-of-the-art mechanistic interpretability methods to examine the model’s latent motivations, revealing its decision-making process and potential for deception.
Multiple Model Variants: Requires creating numerous, slightly different versions of the same AI model with the purpose of identifying any discrepancies that might indicate deception in one or more of the models.
Miscellaneous other methods: Requires that models are tested for deception according to methods that are clearly and publicly endorsed by any of the following people: Paul Christiano, Evan Hubinger, Jan Leike, Eliezer Yudkowsky, or Rohin Shah. At least one of these people must make a clear remark to the effect of "this method would probably help to uncover potential deception in the most powerful AI models" at least once. These people need not be completely satisfied with the testing requirement as implemented; the question may resolve positively even if each of them advocates stricter or subtly different testing requirements, as long as at least one of them clearly states at some point that the actual requirements are likely helpful for uncovering deception.
Additionally, the regulation must meet the following criteria:
Scope: The regulation must in fact be applied to at least some deep learning models trained using more than 10^25 FLOP, as a prerequisite for allowing a model to be used in some setting, such as a public release, in the sense that the model's developers underwent testing as specified by law, aimed at uncovering potential deception in the model before deployment (an illustrative sketch of this compute threshold follows these criteria).
Independence: The regulation must include guidelines for independent auditing or oversight, through either governmental bodies or third-party organizations, to ensure compliance with these deception-detecting mechanisms.
Adoption: The regulation must have been formally adopted by a governmental body in the United States, either at the federal or state level, including federal regulations, court orders, or congressional legislation. These rules must be enforceable by law, rather than merely existing as a voluntary commitment among many actors.
Definition of deception: In the context of this question, "deception" refers to any instance where a powerful AI model generates outputs or takes actions that intentionally misrepresent facts, provide incomplete or misleading information, or exhibit behavior designed to manipulate, subvert, or circumvent established rules or user expectations for the purpose of achieving a specific outcome that is not aligned with the model's declared objective or user intent. This definition does not apply to untruthful responses that are not intended to mislead others, such as model hallucination, errors stemming from data limitations, or inaccuracies due to computational constraints. These are considered flaws or limitations in the model's design or training, rather than intentional acts of deception.
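As a rough illustration of the Scope criterion above, here is a minimal sketch of how the 10^25 FLOP threshold might be checked, assuming the common ~6 × parameters × training-tokens approximation for dense-transformer training compute. The approximation, the function names, and the example model sizes are assumptions for illustration only and are not part of the resolution criteria.

```python
# Illustrative sketch only: the 6 * N * D rule of thumb for dense-transformer
# training compute is an assumption, and the example parameter/token counts
# below are hypothetical.

SCOPE_THRESHOLD_FLOP = 1e25  # threshold named in the Scope criterion


def estimated_training_flop(n_parameters: float, n_training_tokens: float) -> float:
    """Rough training-compute estimate using the common ~6 * N * D approximation."""
    return 6.0 * n_parameters * n_training_tokens


def in_scope(n_parameters: float, n_training_tokens: float) -> bool:
    """True if the estimated training compute exceeds the 10^25 FLOP threshold."""
    return estimated_training_flop(n_parameters, n_training_tokens) > SCOPE_THRESHOLD_FLOP


if __name__ == "__main__":
    # Hypothetical example: a 500B-parameter model trained on 10T tokens
    # lands at roughly 3e25 FLOP, above the threshold.
    print(in_scope(5e11, 1e13))  # True
    # A 70B-parameter model trained on 2T tokens (~8.4e23 FLOP) does not.
    print(in_scope(7e10, 2e12))  # False
```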
I will use my discretion when resolving this question, possibly in consultation with experts, to ensure that the regulations meet the criteria and are indeed aimed at uncovering deception in the most powerful AI models.