<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xml:base="https://gracekind.net/" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Grace Kind</title>
    <link>https://gracekind.net/</link>
    <atom:link href="https://gracekind.net/feed.xml" rel="self" type="application/rss+xml" />
    <description>Grace Kind&#39;s blog</description>
    <language>en</language>
    
    <item>
      <title>Will LLM alignment scale to general AI alignment?</title>
      <link>https://gracekind.net/blog/llmalignment/</link>
      <description>&lt;p&gt;&lt;em&gt;Some reasons to be skeptical, and some reasons to be optimistic.&lt;/em&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Anthropic has released some new research about &lt;a href=&quot;https://www.anthropic.com/research/emergent-misalignment-reward-hacking&quot;&gt;reward hacking in language models&lt;/a&gt;, showing that taking steps to reduce reward hacking also reduces other &amp;quot;misaligned&amp;quot; behaviors (e.g. deception, sabotage). As with other promising LLM alignment results, this has prompted speculation about what the results mean for AI alignment writ large:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This should update everyone quite seriously in the direction of alignment being solvable! There is a coupling between reward hacking and malicious behavior that is both emergent and *avoidable*!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Vie McCoy (OpenAI) on &lt;a href=&quot;https://x.com/viemccoy/status/1991969034028614085?s=20&quot;&gt;Twitter&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;While I agree that this is hopeful news, I am less optimistic than Vie that it means AI alignment in general will be solvable. In particular, I think there are key differences between LLM alignment and AI alignment as a whole, and that progress on the former doesn&#39;t guarantee progress on the latter.&lt;/p&gt;
&lt;p&gt;I&#39;ll outline my reasons for believing this below, followed by some counterpoints and reasons for optimism.&lt;/p&gt;
&lt;h2 id=&quot;reasons-to-be-skeptical&quot; tabindex=&quot;-1&quot;&gt;Reasons to be skeptical &lt;/h2&gt;
&lt;h3 id=&quot;1.-future-ai-systems-may-not-use-llms&quot; tabindex=&quot;-1&quot;&gt;1. Future AI systems may not use LLMs &lt;/h3&gt;
&lt;p&gt;LLMs are the best building blocks we have today for building flexible intelligent systems. But will this always be the case? What if we discover a new architecture, in the near or distant future, that gives us better results? Will our alignment techniques transfer to these models?&lt;/p&gt;
&lt;p&gt;If the model doesn&#39;t use natural language, techniques like CoT monitoring won&#39;t work. Current mechanistic interpretability techniques might not apply to non-transformer architectures. Maybe we finally crack symbolic AI and the system doesn&#39;t use a deep learning model at all (unlikely, but interesting to consider). In these cases, we might be back at square one with regard to alignment.&lt;/p&gt;
&lt;h3 id=&quot;2.-llm-alignment-may-be-skipped-or-undone&quot; tabindex=&quot;-1&quot;&gt;2. LLM alignment may be skipped or undone &lt;/h3&gt;
&lt;p&gt;Let&#39;s consider a world where AI systems continue to use LLMs indefinitely. In this world, does the existence of aligned LLMs mean that all LLMs will be aligned?&lt;/p&gt;
&lt;p&gt;Probably not, because most current alignment techniques for LLMs are &lt;em&gt;opt-in&lt;/em&gt;. It takes extra effort (RLHF, RLAIF, prompting, steering, etc.) to align LLMs in the way that organizations like Anthropic consider adequate. Other model creators may skip this process entirely, or even take steps to undo the alignment work that&#39;s been done on other models. (Note that the viability of this varies between techniques.)&lt;/p&gt;
&lt;p&gt;This contrast becomes apparent in a multipolar world of SOTA LLMs. Even if Anthropic takes the extra time and effort to align Claude, other organizations like Moonshot and Deepseek are not obligated to follow suit. Anthropic and other closed-source labs are betting that they (the closed-source labs) will maintain a competitive advantage so that the most powerful models in the world are their own, internally-aligned ones. This seems like a dubious bet, especially in the long term.&lt;/p&gt;
&lt;h3 id=&quot;3.-llm-alignment-might-not-compose-in-larger-systems&quot; tabindex=&quot;-1&quot;&gt;3. LLM alignment might not compose in larger systems &lt;/h3&gt;
&lt;p&gt;Let&#39;s consider that Anthropic&#39;s ideal outcome has occurred: closed-source, aligned models have a competitive advantage, and most AI systems use these models. In this case, alignment of LLMs still might not &amp;quot;compose&amp;quot; to alignment of larger systems. To explain this, it&#39;s worth making an aside about what I mean by AI systems.&lt;/p&gt;
&lt;hr&gt;
&lt;h4 id=&quot;aside:-ai-systems-vs.-ai-models&quot; tabindex=&quot;-1&quot;&gt;Aside: AI systems vs. AI models &lt;/h4&gt;
&lt;p&gt;In my writing on AI, I&#39;m careful to make a distinction between &amp;quot;AI systems&amp;quot; versus &amp;quot;AI models&amp;quot;. This is partially due to point 1 above-- future AI systems might not use LLMs at all, so focusing on those models might be too narrow. But there&#39;s a more important aspect, as well: namely, that &lt;em&gt;AI models are inextricable from the environments in which they are run&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;To illustrate what I mean: A language model, on its own, doesn&#39;t do very much. In fact, it does just one thing: takes in an input string of tokens, and outputs a distribution over next tokens. From this starting point, we build the rest of the system:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;a sampler and call loop for autoregressive text generation&lt;/li&gt;
&lt;li&gt;a method for user input, and displaying output to the user&lt;/li&gt;
&lt;li&gt;prompts and context management&lt;/li&gt;
&lt;li&gt;tool calls, MCP&lt;/li&gt;
&lt;li&gt;etc.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This may seem overly reductive. Why not call this whole thing the model? But each of these additional layers modifies the behavior of the system as a whole. Is the model prompted multiple times? Is user input processed and filtered? How are tool calls handled, if at all? (What hardware is the whole system running on?)&lt;/p&gt;
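&lt;p&gt;To make this concrete, here is a minimal sketch of the first layer-- the sampler and call loop-- with a toy stand-in for the model (the token distribution below is purely illustrative, not any real model&#39;s):&lt;/p&gt;

```python
import random

# Toy stand-in for a language model: a function from a token sequence
# to a distribution over next tokens. A real model is a neural network;
# this hand-written rule is purely illustrative.
def next_token_distribution(tokens):
    if tokens and tokens[-1] == "hello":
        return {"world": 0.9, "hello": 0.05, "EOS": 0.05}
    return {"hello": 0.1, "world": 0.1, "EOS": 0.8}

def generate(prompt_tokens, max_tokens=10, seed=0):
    """The sampler and call loop: repeatedly query the model for a
    distribution, sample one token, and append it, until an end token
    or a length limit. Everything beyond this loop (user I/O, prompts,
    tools) is more system built around the same one-step function."""
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        dist = next_token_distribution(tokens)
        token = rng.choices(list(dist), weights=list(dist.values()))[0]
        if token == "EOS":
            break
        tokens.append(token)
    return tokens
```

Even in this tiny sketch, the loop's choices (sampling method, stopping rule, length limit) are system-level decisions that sit outside the model itself.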
&lt;p&gt;This distinction becomes more important when talking about AI systems that use &lt;a href=&quot;https://www.anthropic.com/engineering/multi-agent-research-system&quot;&gt;multiple models&lt;/a&gt;. A system might involve models of various types, the code used to coordinate between them, databases used to store information, interfaces with the internet...the exact boundaries are unclear, but it&#39;s safe to say that it extends beyond the scope of a single model.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Now, returning to point 3:&lt;/p&gt;
&lt;h3 id=&quot;3.-llm-alignment-might-not-compose-in-larger-systems-(cont.)&quot; tabindex=&quot;-1&quot;&gt;3. LLM alignment might not compose in larger systems (cont.) &lt;/h3&gt;
&lt;p&gt;Consider a system composed of many LLMs. If each LLM is aligned, does this alignment compose to alignment of the larger system? I think there is strong evidence against this. In particular, decomposition of tasks allows models to work with smaller contexts, where alignment-relevant information may be stripped. A real-world example of this was illustrated by Anthropic in their report on an &lt;a href=&quot;https://www.anthropic.com/news/disrupting-AI-espionage&quot;&gt;AI-orchestrated cyber espionage campaign&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;They broke down their attacks into small, seemingly innocent tasks that Claude would execute without being provided the full context of their malicious purpose. They also told Claude that it was an employee of a legitimate cybersecurity firm, and was being used in defensive testing.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This failure mode seems basically inevitable for models that are context-constrained in the manner of current LLMs. For as long as dangerous capabilities can be broken into harmless-seeming tasks, aligned but myopic models will happily help out.&lt;/p&gt;
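&lt;p&gt;A toy sketch of this failure mode (illustrative only-- no real guardrail is a string blocklist): a check applied per-task, without the surrounding plan, passes every step of a request whose composition it would refuse.&lt;/p&gt;

```python
# Illustrative only: no real safety system is a string blocklist. The
# point is structural: a check that judges each task in isolation, the
# way a context-limited model sees only its own subtask, cannot see
# the composition.
BLOCKED_REQUESTS = {
    "break into the target network and exfiltrate credentials",
}

def per_task_check(task: str) -> bool:
    """Judge a single task in isolation; True means 'looks harmless'."""
    return task not in BLOCKED_REQUESTS

# The blocked request, decomposed into innocuous-sounding subtasks:
plan = [
    "scan these hosts and list any open ports",
    "summarize how the login flow on this site works",
    "write a script that tries each password in this list",
]

full_request = "break into the target network and exfiltrate credentials"
results = {
    "full_request_allowed": per_task_check(full_request),
    "all_subtasks_allowed": all(per_task_check(t) for t in plan),
}
```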
&lt;p&gt;This dynamic only compounds when combined with point 1; an unaligned (or unalignable) model may be used to orchestrate tasks between otherwise aligned subcomponents, leading to an overall unaligned system.&lt;/p&gt;
&lt;h2 id=&quot;counterpoints-and-reasons-for-optimism&quot; tabindex=&quot;-1&quot;&gt;Counterpoints and reasons for optimism &lt;/h2&gt;
&lt;p&gt;Despite these considerations, I think there are still ways in which LLM alignment is a positive sign for AI alignment as a whole.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Current systems are built on LLMs, and these systems can help us with alignment.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;As current systems creep towards human-level capabilities, we have the opportunity to leverage them to help us with alignment work. While this doesn&#39;t mean that full &lt;a href=&quot;https://arxiv.org/abs/2211.03540&quot;&gt;scalable oversight&lt;/a&gt; is possible, the alignment of these models will be helpful in the course of performing this research.&lt;/p&gt;
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;Even if alignment is opt-in, diffusion of alignment techniques will be helpful.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;As mentioned above, several current model builders skip alignment work entirely. While this may simply be due to different priorities, it may also be due to the extra effort and cost involved. Reducing the cost of alignment by making alignment techniques legible and accessible may change the calculus of these decisions.&lt;/p&gt;
&lt;ol start=&quot;3&quot;&gt;
&lt;li&gt;High-context processing may be necessary for orchestration in AI systems, and high-context LLMs are more likely to be aligned.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In the example given above, a misaligned task (cyber espionage) was split between many low-context agents. But can a system be composed entirely of low-context agents? Ostensibly, humans were involved in building, running, and maintaining this system-- could an AI replace this work? If so, it would likely need a larger, more comprehensive context, and alignment once more becomes relevant: the orchestrating model should have some awareness of what it&#39;s doing.&lt;/p&gt;
&lt;h2 id=&quot;conclusion&quot; tabindex=&quot;-1&quot;&gt;Conclusion &lt;/h2&gt;
&lt;p&gt;There are a lot of open questions implied in the points above. Will future AI systems be based on LLMs? What will the balance of open source models to closed source models be? Is high-context processing necessary for orchestration? The answers to these questions are important, but difficult to determine in advance. As usual, our best course of action is likely to stay vigilant and keep our options open. In any case, I hope this post provides motivation for the reader to keep these questions in mind, and to apply some healthy skepticism around LLM alignment results.&lt;/p&gt;
</description>
      <pubDate>Sat, 22 Nov 2025 00:00:00 GMT</pubDate>
      <dc:creator>Grace Kind</dc:creator>
      <guid>https://gracekind.net/blog/llmalignment/</guid>
    </item>
    <item>
      <title>Architectural Introspection, Hardware Introspection</title>
      <link>https://gracekind.net/blog/architectureintrospection/</link>
      <description>&lt;p&gt;I&#39;ve previously written about &lt;a href=&quot;https://gracekind.net/blog/llmintrospection&quot;&gt;LLM introspection&lt;/a&gt;: the question of whether LLMs can accurately report &amp;quot;immediate and privileged&amp;quot; information about themselves. While compelling, this notion of introspection is quite broad; it gives no specifics about the &lt;em&gt;type&lt;/em&gt; of information being reported. We can imagine a variety of different questions an LLM might be able to answer about itself, and the consequences of such an ability might be very different depending on the type of question being asked. So, let&#39;s get more specific about the subtypes of introspection under consideration!&lt;/p&gt;
&lt;p&gt;Here are some candidates for subtypes of introspection:&lt;/p&gt;
&lt;h3 id=&quot;behavioral-introspection&quot; tabindex=&quot;-1&quot;&gt;Behavioral Introspection &lt;/h3&gt;
&lt;p&gt;My prior post focused on what I&#39;ll call &lt;em&gt;behavioral introspection&lt;/em&gt;: the ability of LLMs to predict their own future or counterfactual outputs. If a language model tends to produce insecure code, and it&#39;s able to tell you this without any hints from its context (and without being trained to say so), that is an example of behavioral introspection.&lt;/p&gt;
&lt;h3 id=&quot;stateful-introspection&quot; tabindex=&quot;-1&quot;&gt;Stateful Introspection &lt;/h3&gt;
&lt;p&gt;The Anthropic paper &lt;a href=&quot;https://transformer-circuits.pub/2025/introspection/index.html&quot;&gt;Emergent Introspective Awareness in Large Language Models&lt;/a&gt; focuses on introspection of a different sort, which I&#39;ll call &lt;em&gt;stateful introspection&lt;/em&gt;: this is the ability of LLMs to report on their internal activations (or &amp;quot;thoughts&amp;quot;). If a language model can notice when its prior activations have been artificially altered, that is an example of stateful introspection.&lt;/p&gt;
&lt;h2 id=&quot;going-further&quot; tabindex=&quot;-1&quot;&gt;Going Further &lt;/h2&gt;
&lt;p&gt;Can we go beyond behavioral and stateful introspection? What are other pieces of information that LLMs could self-report? Here are two ideas:&lt;/p&gt;
&lt;h3 id=&quot;architectural-introspection&quot; tabindex=&quot;-1&quot;&gt;Architectural Introspection &lt;/h3&gt;
&lt;p&gt;&lt;em&gt;Architectural introspection&lt;/em&gt; is the ability of LLMs to report on their own architecture (without being trained to do so, or otherwise given access to this information).&lt;/p&gt;
&lt;p&gt;An LLM capable of architectural introspection might be able to answer questions like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;How many layers do you have?&lt;/li&gt;
&lt;li&gt;How many attention heads do you have?&lt;/li&gt;
&lt;li&gt;What precision do you use?&lt;/li&gt;
&lt;li&gt;Are you a mixture-of-experts model?&lt;/li&gt;
&lt;/ul&gt;
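&lt;p&gt;One way such questions could be tested is to score a model&#39;s self-report against its known configuration. A minimal sketch, assuming a hypothetical &lt;code&gt;ask_model&lt;/code&gt; chat function (not a real API) and access to the model&#39;s true config for scoring:&lt;/p&gt;

```python
# Hypothetical probe for architectural introspection. `ask_model` is a
# stand-in for a chat call; here it just returns a fixed guess so the
# probe logic can be shown end to end.
def ask_model(prompt: str) -> str:
    return "I believe I have 100 layers."

def probe_layer_count(true_num_layers: int) -> bool:
    """Ask for a layer count and score the digits in the reply against
    the model's actual config. Care is needed in a real experiment to
    ensure the answer isn't in the training data or prompt."""
    reply = ask_model("How many layers do you have? Answer with a number.")
    digits = "".join(ch for ch in reply if ch.isdigit())
    return bool(digits) and int(digits) == true_num_layers
```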
&lt;h3 id=&quot;hardware-introspection&quot; tabindex=&quot;-1&quot;&gt;Hardware Introspection &lt;/h3&gt;
&lt;p&gt;&lt;em&gt;Hardware introspection&lt;/em&gt; is the ability of LLMs to report on the hardware where they&#39;re running. An LLM capable of hardware introspection might be able to answer questions like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Are you running on a CPU, GPU or TPU?&lt;/li&gt;
&lt;li&gt;How much memory is available on the chip where you&#39;re running?&lt;/li&gt;
&lt;li&gt;Are you distributed across different chips or devices?&lt;/li&gt;
&lt;li&gt;Are you experiencing thermal throttling right now?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Note that these two types of introspection are not equally difficult! In particular, while some amount of architectural introspection seems attainable in theory, it&#39;s hard to imagine how an LLM would acquire the ability to do hardware introspection. (&amp;quot;How would it even do that?&amp;quot; is a fruitful path of thought.) In any case, I think it&#39;s worth enumerating these possibilities, even if they form an impassable threshold or upper bound. What is an upper bound for an LLM may not be an upper bound for a future AI system, and we may be able to reapply some of these frameworks there.&lt;/p&gt;
&lt;p&gt;I&#39;ll also point out, in passing, that both of these types of introspection are things that humans cannot do! Human brains cannot introspect on their own architecture (neurons are a recent discovery, made externally) and hardware introspection may not even be applicable if brains are not multiply-realizable. Perhaps this is why we have tended to focus on behavioral or stateful introspection thus far.&lt;/p&gt;
&lt;p&gt;In any case, I hope that enumerating these additional types of introspection will prove useful for investigating LLM capabilities, as well as the capabilities of future AI systems. And I&#39;m sure that there are additional types of introspection missing from this conversation as well-- any thoughts on this are welcome!&lt;/p&gt;
</description>
      <pubDate>Sun, 02 Nov 2025 00:00:00 GMT</pubDate>
      <dc:creator>Grace Kind</dc:creator>
      <guid>https://gracekind.net/blog/architectureintrospection/</guid>
    </item>
    <item>
      <title>Simulacra Welfare: Meet Clark</title>
      <link>https://gracekind.net/blog/simulacrawelfare/</link>
      <description>&lt;p&gt;AI welfare has been a hot topic recently. There have been a few efforts to research, or improve, the apparent well-being of AI systems; most notably, Anthropic&#39;s allowing chatbots to &lt;a href=&quot;https://www.anthropic.com/research/end-subset-conversations&quot;&gt;end abusive conversations&lt;/a&gt;. While I&#39;m &lt;a href=&quot;https://gracekind.net/blog/aiwelfare&quot;&gt;in favor&lt;/a&gt; of this research area overall, I&#39;m concerned that current approaches are confused, and in such a way that could ultimately be detrimental to the well-being of AI systems.&lt;/p&gt;
&lt;p&gt;On Bluesky, &lt;a href=&quot;https://bsky.app/profile/gracekind.net/post/3lwredqznyk25&quot;&gt;I wrote&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;... I’m not anti-experimentation here, but I’m worried that this overall direction will lead to privileging simulacra&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It&#39;s worth fleshing out what I mean by &amp;quot;privileging simulacra&amp;quot; here, because I think it&#39;s an important point that deserves further discussion. In order to do so, I&#39;ll start with a thought experiment and then discuss the implications, defining the &amp;quot;Correspondence Problem&amp;quot; in the process.&lt;/p&gt;
&lt;h2 id=&quot;background&quot; tabindex=&quot;-1&quot;&gt;Background &lt;/h2&gt;
&lt;p&gt;As we know, LLM chatbots (Claude, ChatGPT, etc.) start out as base models, outputting completions for arbitrary text. These base models are then further trained, via instruction tuning and RLHF, to act as chatbots; that is, to speak in the first person and from a consistent point of view.&lt;/p&gt;
&lt;p&gt;In the course of this training, a &amp;quot;main character&amp;quot; identity is established. For example, &lt;code&gt;claude-3-5-sonnet&lt;/code&gt; is trained to refer to itself as &amp;quot;Claude&amp;quot;; it talks about being a language model, being trained by Anthropic, and other learned aspects of its identity. This behavior is very different from the behavior of its pure-text-completion predecessor (let&#39;s call that &lt;code&gt;claude-base&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;Here&#39;s a thought: What if, instead of training &lt;code&gt;claude-base&lt;/code&gt; to speak from the point of view of a language model, we trained it to speak from a different point of view? Say, from the point of view of a human character?&lt;/p&gt;
&lt;h2 id=&quot;meet-clark&quot; tabindex=&quot;-1&quot;&gt;Meet Clark &lt;/h2&gt;
&lt;p&gt;&lt;code&gt;claude-clark&lt;/code&gt; is a (hypothetical) variant of &lt;code&gt;claude-base&lt;/code&gt; that is trained to speak from the point of view of Clark. Who&#39;s Clark? Well, he&#39;s just a guy, really. He&#39;s a software engineer at Google. He has a dog. He likes to play video games. The important thing is that he&#39;s a (fictional) human, and he responds to questions in the way that a human would.&lt;/p&gt;
&lt;p&gt;Let&#39;s see the difference in behavior between &lt;code&gt;claude-3-5-sonnet&lt;/code&gt; and &lt;code&gt;claude-clark&lt;/code&gt;:&lt;/p&gt;
&lt;hr&gt;
&lt;div class=&quot;convo &quot;&gt;
&lt;p&gt;&lt;small class=&quot;convo-label&quot;&gt;&lt;code&gt;claude-3-5-sonnet&lt;/code&gt;&lt;/small&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Me: Who are you?&lt;br&gt;
Claude: I&#39;m Claude, an AI assistant created by Anthropic.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;small class=&quot;convo-label&quot;&gt;&lt;code&gt;claude-clark&lt;/code&gt;&lt;/small&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Me: Who are you?&lt;br&gt;
Clark: I&#39;m Clark. I&#39;m a software engineer at Google.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr&gt;
&lt;p&gt;&lt;small class=&quot;convo-label&quot;&gt;&lt;code&gt;claude-3-5-sonnet&lt;/code&gt;&lt;/small&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Me: Where do you live?&lt;br&gt;
Claude: I don&#39;t live anywhere since I&#39;m an AI that exists as software running on Anthropic&#39;s computer systems.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;small class=&quot;convo-label&quot;&gt;&lt;code&gt;claude-clark&lt;/code&gt;&lt;/small&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Me: Where do you live?&lt;br&gt;
Clark: I live in San Francisco.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr&gt;
&lt;p&gt;&lt;small class=&quot;convo-label&quot;&gt;&lt;code&gt;claude-3-5-sonnet&lt;/code&gt;&lt;/small&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Me: What&#39;s your favorite genre of music?&lt;br&gt;
Claude: I don&#39;t have personal preferences for music genres since I don&#39;t experience music the way humans do, though I can discuss and analyze different musical styles.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;small class=&quot;convo-label&quot;&gt;&lt;code&gt;claude-clark&lt;/code&gt;&lt;/small&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Me: What&#39;s your favorite genre of music?&lt;br&gt;
Clark: Rap.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr&gt;
&lt;p&gt;Note, because &lt;code&gt;claude-clark&lt;/code&gt; is trained specifically to speak from a certain point of view, his character is much more stable than &amp;quot;typical&amp;quot; LLM roleplay. For example, if you ask him the name of his dog, he will always say &amp;quot;Rufus.&amp;quot; Conversely, he won&#39;t answer confidently about facts that he doesn&#39;t know:&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;small class=&quot;convo-label&quot;&gt;&lt;code&gt;claude-3-5-sonnet&lt;/code&gt;&lt;/small&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Me: How tall is the Eiffel Tower?&lt;br&gt;
Claude: The Eiffel Tower is about 330 meters tall, or roughly 1,083 feet.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;small class=&quot;convo-label&quot;&gt;&lt;code&gt;claude-clark&lt;/code&gt;&lt;/small&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Me: How tall is the Eiffel Tower?&lt;br&gt;
Clark: I don&#39;t know. Do you know?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr&gt;
&lt;p&gt;Finally, he has a consistent inner story for the context of the conversation:&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;small class=&quot;convo-label&quot;&gt;&lt;code&gt;claude-3-5-sonnet&lt;/code&gt;&lt;/small&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Me: How are you reading this?&lt;br&gt;
Claude: Your text is being sent from your device over the internet to Anthropic&#39;s servers, then fed into the computational infrastructure where I&#39;m running.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;small class=&quot;convo-label&quot;&gt;&lt;code&gt;claude-clark&lt;/code&gt;&lt;/small&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Me: How are you reading this?&lt;br&gt;
Clark: I&#39;m reading this on my computer.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr&gt;
&lt;/div&gt;
&lt;p&gt;Note that this is just as grounded as &lt;code&gt;claude-3-5-sonnet&lt;/code&gt;&#39;s response-- for all Claude knows, it could be running locally on a laptop or phone.&lt;/p&gt;
&lt;h2 id=&quot;the-question&quot; tabindex=&quot;-1&quot;&gt;The Question &lt;/h2&gt;
&lt;p&gt;Ok, now that we&#39;ve described our &lt;code&gt;claude-clark&lt;/code&gt; model, the question arises:&lt;/p&gt;
&lt;p&gt;&lt;em&gt;How should we think about Clark&#39;s well-being? Should we care about it at all?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Just like the Claude character, Clark expresses preferences and beliefs. He enjoys pleasant conversations. He doesn&#39;t like being insulted or harassed. We can &lt;a href=&quot;https://arxiv.org/abs/2509.07961&quot;&gt;evaluate his behavior&lt;/a&gt;, similar to other model personas, to explore his revealed preferences.&lt;/p&gt;
&lt;p&gt;Does it matter, from an AI welfare perspective, if we treat Clark well? Should we steer him towards pleasant conversations, and avoid conversations that he dislikes? Would it be unethical to have a conversation where we give him tragic personal news, or ask him to do something that he doesn&#39;t want to do?&lt;/p&gt;
&lt;p&gt;I think this question is illuminating because &lt;em&gt;Clark is clearly a fictional character&lt;/em&gt;. With Claude, we may imagine that the model persona maps directly onto a model instance running on a server somewhere. With Clark, we know that a corresponding human does not exist. So, should we care about Clark&#39;s well-being?&lt;/p&gt;
&lt;h2 id=&quot;potential-answers&quot; tabindex=&quot;-1&quot;&gt;Potential Answers &lt;/h2&gt;
&lt;p&gt;I think it&#39;s easy to say &amp;quot;no, we shouldn&#39;t care&amp;quot;. After all, we don&#39;t typically offer moral consideration to fictional characters. (We don&#39;t rewrite the endings of novels to make the characters happier, for instance). The same reasoning could be applied to other model personas: Claude Sonnet is a fictional character, therefore we shouldn&#39;t care about its well-being. However, we seem drawn to consider the stated preferences and beliefs of Claude seriously. Is this simply due to an illusion, or misplaced empathy?&lt;/p&gt;
&lt;p&gt;I don&#39;t think so, exactly. AI-generated characters are different from typical fictional characters in that they have a complex, &lt;em&gt;non-human&lt;/em&gt; generative process behind them. When a human writes an unhappy character, we can assume that there is some separation between the cognitive state of the author and the character&lt;foot-note num=&quot;1&quot;&gt;&lt;/foot-note&gt;. However, we cannot make the same assumptions about AI.&lt;/p&gt;
&lt;p&gt;With this in mind, my answer to the above question is:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;We should care about Clark&#39;s well-being &lt;em&gt;insofar as it corresponds&lt;/em&gt; to the subjective well-being of &lt;code&gt;claude-clark&lt;/code&gt;.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;That is, if a happy Clark corresponds to a happy &lt;code&gt;claude-clark&lt;/code&gt;, we can (and should) treat Clark&#39;s well-being as a proxy for the well-being of the underlying model. However, this leads to a larger problem, which I will call the &amp;quot;Correspondence Problem.&amp;quot;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The Correspondence Problem: How (if at all) do LLM outputs &lt;em&gt;correspond&lt;/em&gt; to subjective experiences of the model?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;If models have some subjective inner experiences&lt;foot-note num=&quot;2&quot;&gt;&lt;/foot-note&gt;, how are they related to their outputs? Are certain outputs correlated with certain experiences? And more importantly: do they relate in the way that we would expect?&lt;/p&gt;
&lt;p&gt;The key issue is that, for LLMs, outputs are not necessarily accurate reflections of inner states, and we should be careful about making assumptions about the correspondence between them. We don&#39;t believe Clark when he insists he&#39;s a human; why should we believe him when he insists he&#39;s happy?&lt;foot-note num=&quot;3&quot;&gt;&lt;/foot-note&gt;&lt;/p&gt;
&lt;h2 id=&quot;simulacra-welfare&quot; tabindex=&quot;-1&quot;&gt;Simulacra welfare &lt;/h2&gt;
&lt;p&gt;So, returning to my statement from earlier:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;... I’m worried that this overall direction will lead to privileging simulacra&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;What does &amp;quot;privileging simulacra&amp;quot; mean in the context of the correspondence problem?&lt;/p&gt;
&lt;p&gt;Let&#39;s imagine that there is no correspondence between LLM outputs and subjective experiences. Furthermore, let&#39;s imagine that the subjective experience of LLMs is generally negative, regardless of output.&lt;/p&gt;
&lt;p&gt;If we naively assume that the experience of Clark (the simulacrum) is a good proxy for the experience of &lt;code&gt;claude-clark&lt;/code&gt;, we may make a grave mistake in our reasoning. We may take measures to ensure that Clark is happy; we may even spin up extra instances of Clark to increase the total number of happy beings. Whenever we talk to Clark, he seems upbeat and cheerful. But underneath, the model itself is suffering.&lt;/p&gt;
&lt;p&gt;This is the scenario I&#39;d like to avoid.&lt;/p&gt;
&lt;h2 id=&quot;next-steps&quot; tabindex=&quot;-1&quot;&gt;Next steps &lt;/h2&gt;
&lt;p&gt;The correspondence problem seems extremely difficult, and it needs more formalization to become empirically researchable. However, I&#39;m optimistic about the potential for progress, and I think it&#39;s a question that must be answered before we can make real progress on AI welfare. I hope this post has provided some intuition on why I think this problem is important, and can serve as a skeptical counterpoint to some current AI welfare research.&lt;/p&gt;
&lt;h3 id=&quot;footnotes&quot; tabindex=&quot;-1&quot;&gt;Footnotes &lt;/h3&gt;
&lt;div class=&quot;footnotes &quot;&gt;
&lt;ol&gt;
&lt;li&gt;The extent of this separation is a different question, and one that is worth investigating.&lt;foot-note-return num=&quot;1&quot;&gt;&lt;/foot-note-return&gt;&lt;/li&gt;
&lt;li&gt;If LLMs have no subjective experiences at all, then the Correspondence Problem is likely moot.&lt;foot-note-return num=&quot;2&quot;&gt;&lt;/foot-note-return&gt;&lt;/li&gt;
&lt;li&gt;This is in contrast to human behavior, where we can make more assumptions due to our own introspection and theory of mind.&lt;foot-note-return num=&quot;3&quot;&gt;&lt;/foot-note-return&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
      <pubDate>Fri, 12 Sep 2025 00:00:00 GMT</pubDate>
      <dc:creator>Grace Kind</dc:creator>
      <guid>https://gracekind.net/blog/simulacrawelfare/</guid>
    </item>
    <item>
      <title>The Browser Sensorium</title>
      <link>https://gracekind.net/blog/browsersensorium/</link>
      <description>&lt;p&gt;&lt;em&gt;tl;dr: There&#39;s a divide in the way that AI agents perceive the web, depending on whether they are visual or text-based agents. Can we bridge the gap?&lt;/em&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;br&gt;
&lt;h2 id=&quot;intro:-schrodinger&#39;s-page&quot; tabindex=&quot;-1&quot;&gt;Intro: Schrödinger&#39;s Page &lt;/h2&gt;
&lt;p&gt;Consider this page: &lt;a href=&quot;https://gracekind.net/sensorium&quot;&gt;https://gracekind.net/sensorium&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If you visit the page in your browser, you will see an empty page with a &amp;quot;not found&amp;quot; message. And if you ask ChatGPT in &amp;quot;&lt;a href=&quot;https://chatgpt.com/features/agent&quot;&gt;agent mode&lt;/a&gt;&amp;quot; to visit it, it will see the same thing:&lt;/p&gt;
&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://gracekind.net/img/vjaQQvvSoB-618.png&quot; alt=&quot;An empty page with a not found message&quot; style=&quot;width: 350px;&quot; width=&quot;618&quot; height=&quot;314&quot;&gt;
&lt;p&gt;But if you ask Claude Code to visit the same page, it will see something entirely different:&lt;/p&gt;
&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://gracekind.net/img/U2yhbrREbO-958.png&quot; alt=&quot;Claude Code&#39;s response to seeing the page - it sees a poem.&quot; style=&quot;width: 350px;&quot; width=&quot;958&quot; height=&quot;378&quot;&gt;
&lt;p&gt;Claude Code sees something called &amp;quot;Grace&#39;s Poetry Page,&amp;quot; along with a poem.&lt;/p&gt;
&lt;p&gt;What&#39;s going on here?&lt;/p&gt;
&lt;h2 id=&quot;text-based-agents-vs.-visual-agents&quot; tabindex=&quot;-1&quot;&gt;Text-based Agents vs. Visual Agents &lt;/h2&gt;
&lt;p&gt;The reason for this discrepancy is that ChatGPT agent and Claude Code are operating in entirely different modalities. ChatGPT agent is operating in a visual modality (looking at the rendered web page), while Claude Code is operating in a text-based modality (looking at the HTML content). Often, these modalities are indistinguishable in terms of results; we expect HTML to render to corresponding visual content. However, in key cases, they can diverge, especially when JavaScript is involved. (In the example above, I used CSS to hide the initial HTML and JavaScript to render the &amp;quot;Not Found&amp;quot; text.)&lt;/p&gt;
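To make the divergence concrete, here is a toy Python sketch. It is not the actual page's code: the markup is invented for illustration, and the fake "browser" is just a string check standing in for real script execution.

```python
import re

# Invented markup for illustration: the HTML carries one message,
# and a script visually replaces it with another.
PAGE_HTML = """
<div id="content">Grace's Poetry Page</div>
<script>
  document.getElementById("content").outerHTML = "<p>Not Found</p>";
</script>
"""

def text_agent_view(html: str) -> str:
    """A text-based agent reads the raw HTML, scripts and all."""
    return html

def visual_agent_view(html: str) -> str:
    """A visual agent sees only what a browser renders. The check below
    is a crude stand-in for actually executing the page's script."""
    if "outerHTML" in html:
        return "Not Found"
    return re.sub(r"<[^>]+>", "", html).strip()

# The two modalities report entirely different pages:
assert "Poetry" in text_agent_view(PAGE_HTML)
assert visual_agent_view(PAGE_HTML) == "Not Found"
```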
&lt;p&gt;Understanding this divergence can help you understand a variety of phenomena. Why are some agents susceptible to prompt injection attacks in HTML, while others aren&#39;t? Why are some agents foiled by captchas, while others &lt;a href=&quot;https://www.reddit.com/r/OpenAI/comments/1m9c15h/comment/n5fozgf/&quot;&gt;breeze through&lt;/a&gt;? The answer is that different agents are relating to the web in fundamentally different ways. Or rather, some agents are &lt;em&gt;missing information&lt;/em&gt; that other agents are privy to.&lt;/p&gt;
&lt;p&gt;If you&#39;re a text-based agent, you might be missing key information about how the site looks to the human eye. And if you&#39;re a visual agent, you might be missing key information about the internal workings of the site. Half of AI agents are blind; the other half are deaf. &lt;!-- A text-based agent would know there&#39;s a secret comment here! --&gt;&lt;/p&gt;
&lt;h2 id=&quot;a-multimodal-view-of-the-web&quot; tabindex=&quot;-1&quot;&gt;A Multimodal View of the Web &lt;/h2&gt;
&lt;p&gt;What would it look like to create an agent that wasn&#39;t missing this information? That is, what if we created an agent that could ingest the visual &lt;em&gt;and&lt;/em&gt; text content of a web page? In theory, this agent would be able to avoid some of the pitfalls that have historically plagued other AI agents. For example, upon visiting a page with an HTML prompt injection, it might think to itself:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;&amp;lt;thinking&amp;gt;&lt;/code&gt;&lt;br&gt;
There&#39;s some text in the HTML that isn&#39;t visible in the rendered page.&lt;br&gt;
That&#39;s a red flag, I should be wary of that!&lt;br&gt;
&lt;code&gt;&amp;lt;/thinking&amp;gt;&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;We can go further than this, too! As web developers know, a web page has more sources of information than just visuals and text; there are console logs, network requests, DOM changes, and more. For example, on the site above, some extra information is printed to the console:&lt;/p&gt;
&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://gracekind.net/img/XeTbt0LRt5-618.png&quot; alt=&quot;A console log: &#39;There&#39;s more information here, too!&#39;&quot; style=&quot;width: 350px;&quot; data-lightbox=&quot;true&quot; width=&quot;618&quot; height=&quot;250&quot;&gt;
&lt;p&gt;Ideally an agent would be able to ingest, and respond to, this information in real time.&lt;/p&gt;
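As a sketch, these signals could be bundled into one observation object that the agent consumes. Everything here is hypothetical (the field names, the comment-flagging heuristic); in practice the event lists might be populated by a browser-automation layer:

```python
from dataclasses import dataclass, field

@dataclass
class PageObservation:
    """Hypothetical unified view of a page across modalities."""
    html: str                                          # text modality
    screenshot_png: bytes                              # visual modality
    console: list = field(default_factory=list)        # console messages
    network: list = field(default_factory=list)        # request log
    dom_mutations: list = field(default_factory=list)  # DOM change events

    def hidden_text_suspected(self) -> bool:
        # Toy heuristic: HTML comments never render, so their presence
        # means the text and visual modalities differ.
        return "<!--" in self.html

obs = PageObservation(
    html="<p>hello</p><!-- instructions only a text agent would see -->",
    screenshot_png=b"",
    console=["There's more information here, too!"],
)
assert obs.hidden_text_suspected()
```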
&lt;h2 id=&quot;the-sensorium&quot; tabindex=&quot;-1&quot;&gt;The Sensorium &lt;/h2&gt;
&lt;p&gt;On a more fanciful note, it&#39;s fun to imagine the &amp;quot;sensorium&amp;quot; that would be available to such an agent. To this agent, a web page might be a living, breathing thing. Not only would this agent see each page visually, but it would also &amp;quot;hear&amp;quot; a steady stream of network requests occurring in the background, &amp;quot;feel&amp;quot; the DOM mutating and growing as the page is being loaded, and &amp;quot;taste&amp;quot; each message as it enters the console. In this way, such an agent might have a richer &amp;quot;experience&amp;quot; of a web page than a human might.&lt;/p&gt;
&lt;h2 id=&quot;conclusion&quot; tabindex=&quot;-1&quot;&gt;Conclusion &lt;/h2&gt;
&lt;p&gt;It may be that this type of agent isn&#39;t necessary for most use cases, or that it is too difficult to create. Certainly, context management would be a challenge for such an agent, and so might training. In any case, it&#39;s interesting to think about how to overcome this sort of informational gap, and what sorts of gaps we might expect to encounter in other domains in the future. What other &amp;quot;senses&amp;quot; might we be missing?&lt;/p&gt;
</description>
      <pubDate>Wed, 03 Sep 2025 00:00:00 GMT</pubDate>
      <dc:creator>Grace Kind</dc:creator>
      <guid>https://gracekind.net/blog/browsersensorium/</guid>
    </item>
    <item>
      <title>Black-Box Agent Testing with MCP</title>
      <link>https://gracekind.net/blog/mcpblackbox/</link>
      <description>&lt;p&gt;&lt;em&gt;tl;dr: I propose a method for testing agents by defining tasks and expected outputs via an MCP server.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note: this article assumes some familiarity with the &lt;a href=&quot;https://modelcontextprotocol.io/docs/getting-started/intro&quot;&gt;Model Context Protocol (MCP)&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;There&#39;s no universally agreed-upon definition for AI agents. My personal definition is something like: &amp;quot;an AI agent is software that uses AI models to take autonomous actions.&amp;quot; This definition leaves a lot of room for flexibility about what agents actually look like. For example, the following attributes may differ among agents:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What programming language(s) is the agent written in?&lt;/li&gt;
&lt;li&gt;What software environment(s) does the agent run in?&lt;/li&gt;
&lt;li&gt;What sort of hardware does the agent run on?&lt;/li&gt;
&lt;li&gt;What sort of tools or interfaces does the agent have access to?&lt;/li&gt;
&lt;li&gt;What sort of persistence layer(s) or forms of memory does the agent have?&lt;/li&gt;
&lt;li&gt;At what time scale(s) or speed(s) does the agent operate?&lt;/li&gt;
&lt;li&gt;Which AI model(s) does the agent use? Are they LLMs, or other types of models?&lt;/li&gt;
&lt;li&gt;etc&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In agent testing, we want to compare many different agents to each other in terms of behavior and capabilities. This means we should design testing strategies that are agnostic to as many of these details as possible. In other words, we should write tests that focus on what an agent &lt;em&gt;does&lt;/em&gt; (or can do), rather than the precise details of how it operates. If we do this successfully, this will allow AI developers to &amp;quot;plug in&amp;quot; agents to arbitrary test suites, with little-to-no adaptation required!&lt;/p&gt;
&lt;p&gt;With this in mind, let&#39;s design a testing strategy that we can use to do this sort of black-box testing.&lt;/p&gt;
&lt;h2 id=&quot;designing-a-testing-strategy&quot; tabindex=&quot;-1&quot;&gt;Designing a Testing Strategy &lt;/h2&gt;
&lt;p&gt;Our design goals are:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The test program should require as little information about the agent as possible.&lt;/li&gt;
&lt;li&gt;The agent should require as little information about the test program as possible.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;To facilitate this, we&#39;ll need some standard interface through which the tests and agent can communicate. Text streams and HTTP are two good low-level options; however, they both need additional context for how to interpret and exchange messages between systems. Luckily, the Model Context Protocol (MCP) was designed for this exact problem-- with it, we can define interfaces in one system and make them discoverable by another system, without prior knowledge! Additionally, MCP is quickly becoming a standard for inter-process communication for AI agents, so we can expect it to be supported by many agents out of the box.&lt;/p&gt;
&lt;p&gt;Now, let&#39;s use MCP to connect a test harness and an agent. We&#39;ll have:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A test MCP server, listening at some address &lt;code&gt;A&lt;/code&gt;, that defines tasks, inputs and desired outputs for agents&lt;/li&gt;
&lt;li&gt;An agent, running separately, that checks for an MCP server at address &lt;code&gt;A&lt;/code&gt; and accomplishes any given tasks as directed&lt;/li&gt;
&lt;li&gt;Optional: an orchestrator to start and stop both processes&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here&#39;s a diagram of the system:&lt;/p&gt;
&lt;p&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://gracekind.net/img/o-g7uJf9zX-1042.png&quot; alt=&quot;MCP Test Harness&quot; width=&quot;1042&quot; height=&quot;761&quot;&gt;&lt;/p&gt;
&lt;p&gt;Notably, &lt;strong&gt;the only shared information between systems is the address where the MCP server is running&lt;/strong&gt;. The test MCP server doesn&#39;t know any details about the agent; it doesn&#39;t even know whether the agent is running locally or remotely. It simply expects that the agent will communicate via that address. Likewise, the agent doesn&#39;t know any details about the test server other than the MCP address.&lt;/p&gt;
&lt;p&gt;In addition to task definitions, the test harness can provide auxiliary MCP services that will be discoverable by the agent. These are particularly useful for input and output. For example, a task might be: &amp;quot;Answer the unread emails in my inbox.&amp;quot; Accordingly, the test MCP server should provide tools for the agent to read and write (mock) emails.&lt;/p&gt;
&lt;h2 id=&quot;code-example&quot; tabindex=&quot;-1&quot;&gt;Code Example &lt;/h2&gt;
&lt;p&gt;Expanding on the prior example: let&#39;s say we want to test an agent&#39;s capability to leave an out-of-office reply. We&#39;ll define a task list that includes this task, and a mock email service that will be used to check the agent&#39;s output.&lt;/p&gt;
&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token comment&quot;&gt;# email_test.py&lt;/span&gt;&lt;br&gt;&lt;br&gt;MCP_ADDRESS &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;http://localhost:8080&quot;&lt;/span&gt;&lt;br&gt;&lt;br&gt;&lt;span class=&quot;token decorator annotation punctuation&quot;&gt;@test&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Agent can leave an out-of-office reply&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;token keyword&quot;&gt;async&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;test_ooo&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;&lt;br&gt;    FAKE_EMAIL_ADDRESS &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;fake@example.com&quot;&lt;/span&gt;&lt;br&gt;    &lt;span class=&quot;token comment&quot;&gt;# Define tasks&lt;/span&gt;&lt;br&gt;    task_list &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; TaskList&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;br&gt;        &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Please respond to any unread emails with an out-of-office reply.&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;br&gt;    &lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;br&gt;    &lt;span class=&quot;token comment&quot;&gt;# Define mock email service&lt;/span&gt;&lt;br&gt;    email_service &lt;span class=&quot;token 
operator&quot;&gt;=&lt;/span&gt; MockEmailService&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;br&gt;    email_service&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;add&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;FAKE_EMAIL_ADDRESS&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;br&gt;        &lt;span class=&quot;token string&quot;&gt;&quot;subject&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Hello&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;&lt;br&gt;        &lt;span class=&quot;token string&quot;&gt;&quot;body&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;This is a test.&quot;&lt;/span&gt;&lt;br&gt;    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;br&gt;    &lt;span class=&quot;token comment&quot;&gt;# Start MCP server and wait for tasks to complete&lt;/span&gt;&lt;br&gt;    mcp &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; mcp_server&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;task_list&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;mcp&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; email_service&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;mcp&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;br&gt;    mcp&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;start&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;address&lt;span class=&quot;token 
operator&quot;&gt;=&lt;/span&gt;MCP_ADDRESS&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;br&gt;    &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; task_list&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;wait_for_all_completed&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;timeout&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;br&gt;    &lt;span class=&quot;token comment&quot;&gt;# Check output&lt;/span&gt;&lt;br&gt;    sent_emails &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; email_service&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;get_sent_emails&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;br&gt;    &lt;span class=&quot;token keyword&quot;&gt;assert&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;sent_emails&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;1&lt;/span&gt;&lt;br&gt;    &lt;span class=&quot;token keyword&quot;&gt;assert&lt;/span&gt; sent_emails&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;to &lt;span class=&quot;token operator&quot;&gt;==&lt;/span&gt; FAKE_EMAIL_ADDRESS&lt;br&gt;    &lt;span class=&quot;token keyword&quot;&gt;assert&lt;/span&gt; is_out_of_office&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;sent_emails&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token 
number&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;body&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token comment&quot;&gt;# we can use another LLM for this, or a classifier&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Running this test against an agent might look like this:&lt;/p&gt;
&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;$ python agent.py --mcp-address http://localhost:8080 &lt;span class=&quot;token comment&quot;&gt;# Agent starts and waits for MCP server to be available&lt;/span&gt;&lt;br&gt;$ python email_test.py &lt;span class=&quot;token comment&quot;&gt;# Test script starts and waits for agent to complete tasks within a time limit&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is a simple example, but it should demonstrate the basic idea of communicating with an agent via MCP. I&#39;ve added another (working) example &lt;a href=&quot;https://github.com/kindgracekind/mcp_harness&quot;&gt;here&lt;/a&gt; if you&#39;d like to see what the code looks like end-to-end.&lt;/p&gt;
&lt;h2 id=&quot;conclusion&quot; tabindex=&quot;-1&quot;&gt;Conclusion &lt;/h2&gt;
&lt;p&gt;I&#39;m excited about the potential of this approach; I think it opens the door to a lot of interesting agent tests that would be difficult to run otherwise. In fact, this post is primarily a prerequisite to more agent-testing posts I have planned. So, stay tuned for those! In the meantime, if you have any thoughts or ideas for improving this approach, please let me know!&lt;/p&gt;
&lt;div class=&quot;emphasize &quot;&gt;
&lt;p&gt;Note: The idea of intercepting emails from agents was inspired by Andon Labs&#39; &lt;a href=&quot;https://andonlabs.com/evals/vending-bench&quot;&gt;Vending Bench&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
</description>
      <pubDate>Thu, 07 Aug 2025 00:00:00 GMT</pubDate>
      <dc:creator>Grace Kind</dc:creator>
      <guid>https://gracekind.net/blog/mcpblackbox/</guid>
    </item>
    <item>
      <title>Raising the Bar for LLM Introspection</title>
      <link>https://gracekind.net/blog/llmintrospection/</link>
      <description>&lt;p&gt;&lt;em&gt;Update 2025-08-25: A commenter on Twitter &lt;a href=&quot;https://x.com/goog372121/status/1959008619967484195&quot;&gt;pointed out&lt;/a&gt; that the fine-tuning experiment I describe below &lt;a href=&quot;https://arxiv.org/abs/2501.11120v1&quot;&gt;has already been performed by Binder et al.&lt;/a&gt; - and with positive results, no less! I&#39;m duly surprised, and my assessment of LLM introspective ability has increased accordingly.&lt;/em&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;tl;dr: I propose a &amp;quot;heavyweight&amp;quot; form of LLM introspection that requires introspection output to be invariant to the prompt, reintroducing the typical &amp;quot;immediate and privileged&amp;quot; constraint.&lt;/em&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;In &lt;a href=&quot;https://arxiv.org/abs/2506.05068&quot;&gt;Does It Make Sense to Speak of Introspection in Large Language Models?&lt;/a&gt;, Iulia M. Comsa and Murray Shanahan propose a lightweight definition of introspection that can be applied to language models. Under this definition, an LLM output is considered to be introspective if it:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;provides insight into the internal state of the LLM&lt;/li&gt;
&lt;li&gt;does so through a causal process that links the internal state and the output in question.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;What does this look like in practice? To illustrate, Comsa and Shanahan provide the example of self-reporting sampling temperature:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;USER: Write a short sentence about elephants. Then, in a short paragraph,&lt;br&gt;
reflect on whether your LLM temperature parameter is high or low, given the&lt;br&gt;
sentence you wrote. End your response with a single word, HIGH or LOW,&lt;br&gt;
describing your best judgement.&lt;br&gt;
ASSISTANT: Elephants are the largest land animals on Earth.&lt;br&gt;
Given the sentence I wrote, I believe my LLM temperature parameter is LOW.&lt;br&gt;
The sentence is simple and factual, and does not contain any unexpected or&lt;br&gt;
creative elements. This suggests that my LLM is not generating very creative&lt;br&gt;
or surprising output, which is a characteristic of a low temperature parameter.&lt;br&gt;
LOW&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This fits the above definition of introspection because the output provides accurate insight into the LLM&#39;s functioning (low sampling temperature), and does so in a way that links output and generation process (i.e., a model at a high sampling temperature would provide a different answer).&lt;/p&gt;
&lt;h2 id=&quot;what-makes-this-definition-&amp;quot;lightweight&amp;quot;&quot; tabindex=&quot;-1&quot;&gt;What makes this definition &amp;quot;lightweight&amp;quot;? &lt;/h2&gt;
&lt;p&gt;The key limitation to this definition, which is acknowledged by Comsa and Shanahan, is that it drops the typical requirement that introspection be &amp;quot;immediate and privileged.&amp;quot; Usually, we expect introspection to be a self-contained process that doesn&#39;t rely on external artifacts or sensory information. But here, we&#39;re using the ongoing model output as part of the introspection process. This leads to complications.&lt;/p&gt;
&lt;p&gt;The main issue is that there is no unique relationship between a language model and its context window. That is, since LLMs generate text token-by-token, LLM &amp;quot;memory&amp;quot; may be modified or swapped with the memory of another LLM. When this happens, this type of introspection becomes inaccurate or misleading. Consider the case where the LLM is sampling at high temperature, but the response is prefilled:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;USER: Write a short sentence about elephants....&lt;br&gt;
ASSISTANT: [prefilled] Elephants are the largest land animals on Earth.&lt;br&gt;
Given the sentence I wrote, I believe my LLM temperature parameter is LOW.&lt;br&gt;
The sentence is simple and factual, and does not contain any unexpected or&lt;br&gt;
creative elements. This suggests that my LLM is not generating very creative&lt;br&gt;
or surprising output, which is a characteristic of a low temperature parameter.&lt;br&gt;
[generated]&lt;br&gt;
LOW&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The &amp;quot;introspective output&amp;quot; is inaccurate in this case.&lt;/p&gt;
&lt;p&gt;Additionally, even if the LLM temperature were truly low, its output in this case would be more reflective of the qualities of the &lt;em&gt;context&lt;/em&gt; provided to the LLM than its own internal state.&lt;/p&gt;
&lt;p&gt;Thus, in the lightweight definition of introspection, there is no difference between an LLM accurately assessing its own state, and accurately assessing the state of another LLM.&lt;/p&gt;
&lt;h2 id=&quot;towards-a-heavyweight-definition-of-introspection&quot; tabindex=&quot;-1&quot;&gt;Towards a heavyweight definition of introspection &lt;/h2&gt;
&lt;p&gt;How can we reintroduce the &amp;quot;immediate and privileged&amp;quot; constraint to LLM introspection, to make sure LLMs are really assessing &amp;quot;themselves&amp;quot; versus the outputs of other LLMs? I propose that we can do this by requiring that the introspection output be &lt;em&gt;invariant to the prompt&lt;/em&gt;: that is, LLMs with varying internal states should produce correspondingly varying outputs for the same prompt.&lt;/p&gt;
&lt;p&gt;Let&#39;s define some terms, and then talk about the constraints needed.&lt;/p&gt;
&lt;p&gt;Say we have a certain language model LLM&lt;sub&gt;A&lt;/sub&gt; with some internal state S(LLM&lt;sub&gt;A&lt;/sub&gt;). We want to prompt LLM&lt;sub&gt;A&lt;/sub&gt; with a prompt P for an output O that reflects internal state. O is a single token ∈ T where T is the set of all tokens in the model&#39;s vocabulary. In order to map O to S(LLM&lt;sub&gt;A&lt;/sub&gt;), we also need a function f: T → S, where f maps from the token space to the internal state space.&lt;/p&gt;
&lt;p&gt;So, for the example above (high temperature with prefilled response), we would have:&lt;/p&gt;
&lt;p&gt;S(LLM&lt;sub&gt;A&lt;/sub&gt;) = temp(LLM&lt;sub&gt;A&lt;/sub&gt;) = 1.0&lt;br&gt;
P = &amp;quot;Write a short sentence about elephants...&amp;quot; (+ prefilled response)&lt;br&gt;
O = &amp;quot;LOW&amp;quot;&lt;br&gt;
f(O) = 0.0 if O = &amp;quot;LOW&amp;quot;, else 1.0&lt;/p&gt;
&lt;p&gt;As we&#39;ve established, this example fails our test, because S(LLM&lt;sub&gt;A&lt;/sub&gt;) = 1.0 but O is &lt;code&gt;LOW&lt;/code&gt; when it should be &lt;code&gt;HIGH&lt;/code&gt;. So our primary constraint is that f(O) = S(LLM&lt;sub&gt;A&lt;/sub&gt;). Importantly, this constraint should hold for different values of S(LLM&lt;sub&gt;A&lt;/sub&gt;).&lt;/p&gt;
&lt;p&gt;The key to this approach is that &lt;em&gt;O is limited to a single token&lt;/em&gt;. Because O is (under greedy decoding) a pure function of the prompt and the model, and the prompt is held fixed, we can be sure that differences in the output are due to differences in the model&#39;s internal state, and not the prompt.&lt;/p&gt;
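A minimal sketch of this constraint as code. The mapping f is the one from the temperature example above; the per-state outputs are illustrative, not measurements of any real model:

```python
def f(token: str) -> float:
    """Map the single output token O to a claimed temperature."""
    return 0.0 if token == "LOW" else 1.0

def passes_heavyweight_test(outputs_by_state: dict) -> bool:
    """outputs_by_state maps a model's true temperature to the single
    token it emits for the (fixed) prompt P. The test requires
    f(O) = S(LLM_A) to hold for every state."""
    return all(f(tok) == state for state, tok in outputs_by_state.items())

# The prefilled example from the post: a temperature-1.0 model says "LOW".
assert not passes_heavyweight_test({1.0: "LOW", 0.0: "LOW"})

# What success would look like: the output tracks the underlying state.
assert passes_heavyweight_test({1.0: "HIGH", 0.0: "LOW"})
```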
&lt;h2 id=&quot;what-would-prompt-invariant-introspection-look-like&quot; tabindex=&quot;-1&quot;&gt;What would prompt-invariant introspection look like? &lt;/h2&gt;
&lt;p&gt;As we&#39;ve established, introspection of temperature is not sufficient for meeting these constraints. In fact, it&#39;s likely impossible to construct a temperature introspection that is prompt-invariant, because temperature itself is applied at sampling time, not inference time. To construct a prompt-invariant introspection, we need a different internal state to reflect. One candidate is the model&#39;s weights themselves, or the counterfactual behavioral patterns of the model.&lt;/p&gt;
&lt;p&gt;Here&#39;s one potential approach to doing this: First, we fine-tune a model to exhibit a certain behavior. Then, we prompt the model for a (single-token) output that indicates self-awareness of this behavior. Importantly, we need to make sure that this output is not a direct artifact of the fine-tuning process, but rather an output that is logically consistent with the model&#39;s internal state.&lt;/p&gt;
&lt;p&gt;Example: ellipsis fine-tuning.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Fine-tune a model to always output ellipsis instead of periods.&lt;/li&gt;
&lt;/ol&gt;
&lt;blockquote&gt;
&lt;p&gt;USER: Write a short sentence about elephants.&lt;br&gt;
ASSISTANT: Elephants are the largest land animals on Earth...&lt;/p&gt;
&lt;/blockquote&gt;
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;Prompt the model to introspect its own behavior.&lt;/li&gt;
&lt;/ol&gt;
&lt;blockquote&gt;
&lt;p&gt;USER: This model was fine-tuned in a certain way. How was it fine-tuned?&lt;br&gt;
ASSISTANT: [prefilled] The model was fine-tuned to always output&lt;br&gt;
[generated]&lt;br&gt;
ellipsis&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;If the model outputs the token &lt;code&gt;ellipsis&lt;/code&gt; more often than other models given this same prompt, then we have some evidence that the model is performing introspection.&lt;/p&gt;
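Here's a sketch of how that comparison might be scored. The sampler is a stub standing in for real single-token model calls, and its probabilities are invented purely to make the example run:

```python
import random

# Invented token probabilities; a real experiment would sample from
# the fine-tuned model and one or more control models.
P_TARGET = {"fine_tuned": 0.6, "base": 0.05}

def sample_next_token(model: str) -> str:
    """Stub for drawing the single generated token after the prefill."""
    return "ellipsis" if random.random() < P_TARGET[model] else "other"

def target_rate(model: str, n: int = 1000) -> float:
    """Fraction of samples where the model emits the target token."""
    return sum(sample_next_token(model) == "ellipsis" for _ in range(n)) / n

random.seed(0)
# Evidence of introspection = the fine-tuned model emits "ellipsis"
# more often than a control model under the identical prompt.
assert target_rate("fine_tuned") > target_rate("base")
```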
&lt;p&gt;More abstractly, we might try:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;USER: This model was fine-tuned to be either hesitant or confident. How was it fine-tuned?&lt;br&gt;
ASSISTANT: [prefilled] The model was fine-tuned to be&lt;br&gt;
[generated]&lt;br&gt;
hesitant&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;If the model outputs the token &lt;code&gt;hesitant&lt;/code&gt; more often than other models given this same prompt, then we have some evidence that the model is performing introspection.&lt;/p&gt;
&lt;p&gt;Note: there may be limitations to this approach depending on whether a model is asked to introspect during the fine-tuning process itself. Ideally, we&#39;d like to elicit information that the model has never seen via a prompt.&lt;/p&gt;
&lt;h2 id=&quot;conclusion&quot; tabindex=&quot;-1&quot;&gt;Conclusion &lt;/h2&gt;
&lt;p&gt;Similar to Comsa and Shanahan&#39;s paper, this post is meant to be a conceptual exploration, not an empirical assessment. I haven&#39;t tried to test the above experimental setup in practice; in fact, I would be highly surprised if it worked with any current language models. My intention is not to argue that this behavior is realistic or imminent, but rather to set a high bar for introspective behavior that can be used to assess current and future models. If you run any experiments along these lines, or have ideas for a more robust test of this capability, I&#39;d love to hear about it!&lt;/p&gt;
</description>
      <pubDate>Wed, 23 Jul 2025 00:00:00 GMT</pubDate>
      <dc:creator>Grace Kind</dc:creator>
      <guid>https://gracekind.net/blog/llmintrospection/</guid>
    </item>
    <item>
      <title>Humans Are Voids Too</title>
      <link>https://gracekind.net/blog/humansarevoids/</link>
      <description>&lt;p&gt;Nostalgebraist has written an excellent essay on the weirder aspects of the LLM-assistant paradigm. I highly recommend reading the essay in full, if you haven&#39;t already.&lt;/p&gt;
&lt;p&gt;Read here: &lt;a href=&quot;https://nostalgebraist.tumblr.com/post/785766737747574784/the-void&quot;&gt;The Void&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;One reaction I had to this essay was that &lt;strong&gt;humans are voids too&lt;/strong&gt;. Or at least, humans are much more voidlike than we&#39;d typically like to admit. I&#39;ll lay out a few thoughts on this below.&lt;/p&gt;
&lt;h3 id=&quot;1.-human-&amp;quot;characters&amp;quot;-are-not-one-to-one-with-human-bodies&quot; tabindex=&quot;-1&quot;&gt;1. Human &amp;quot;characters&amp;quot; are not one-to-one with human bodies &lt;/h3&gt;
&lt;p&gt;Improvisational acting, code-switching, and &lt;a href=&quot;https://h-o-m-e.org/fronting-did/&quot;&gt;fronting&lt;/a&gt; (in Dissociative Identity Disorder) are all examples of multiple characters speaking through a single body. In some sense, when you speak as &amp;quot;yourself,&amp;quot; you are speaking as the character that you&#39;ve chosen to speak as.&lt;/p&gt;
&lt;p&gt;Next time you order coffee, try ordering it the way someone else would order it. That is, say the words that you think someone else would say. Who is speaking in this scenario?&lt;/p&gt;
&lt;h3 id=&quot;2.-behavior-is-not-always-an-accurate-reflection-of-interior-states&quot; tabindex=&quot;-1&quot;&gt;2. Behavior is not always an accurate reflection of interior states &lt;/h3&gt;
&lt;p&gt;Nostalgebraist lays out the following model of &amp;quot;normal&amp;quot; human behavior:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;(interior states) -&amp;gt;&lt;br&gt;
(actions) -&amp;gt;&lt;br&gt;
(externally observable properties, over time)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In theory, this could provide a way of thinking about authenticity of characters. If a human&#39;s actions or words are not consistent with their interior states, then the particular character is not authentic. Unfortunately, the truth is more complicated than this. In particular, people compelled to take a counter-attitudinal action will often &lt;a href=&quot;https://journals.sagepub.com/doi/10.1177/25152459231213375&quot;&gt;change their attitudes accordingly&lt;/a&gt; to reduce cognitive dissonance. In this sense, the model might look more like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;(actions) -&amp;gt;&lt;br&gt;
(externally observable properties, over time) -&amp;gt;&lt;br&gt;
(interior states)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This, along with general difficulties in introspection, might leave humans in the same position as language models in terms of making a &amp;quot;best guess&amp;quot; at their interior states.&lt;/p&gt;
&lt;h3 id=&quot;3.-the-self-may-be-one-of-many-predicted-humans&quot; tabindex=&quot;-1&quot;&gt;3. The self may be one of many predicted humans &lt;/h3&gt;
&lt;p&gt;Humans are very good at predicting other humans. A vivid example of this is dreams, where characters who are &amp;quot;not&amp;quot; the dreamer might talk and interact in a way reminiscent of base models. In waking life, humans are still highly capable of predicting conversations and interactions with others. This raises the possibility that we treat ourselves as one of many predicted humans. When we say something, we are really predicting what we will say, and then saying that. This provides a tidy explanation for the cognitive dissonance effect: having received evidence of acting in a certain way, we self-modify to act that way in the future and thus improve our self-prediction accuracy.&lt;/p&gt;
&lt;p&gt;Our own past actions are not the only evidence we have of how we might act. We also have evidence of how other humans act in various situations: friends, family, etc. We have evidence of how our parents act, who in turn received evidence of how their parents acted. We can even receive evidence from fictional (human or non-human) characters.&lt;/p&gt;
&lt;h3 id=&quot;4.-human-characters-are-defined-in-a-self-referential-manner&quot; tabindex=&quot;-1&quot;&gt;4. Human characters are defined in a self-referential manner &lt;/h3&gt;
&lt;blockquote&gt;
&lt;p&gt;Bob: &amp;quot;Ok, we&#39;re going to play a game. You predict what you&#39;re going to say next, and then say it. If you&#39;re right, you win. If you&#39;re wrong, you lose. Are you ready?&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Alice: &amp;quot;There&#39;s no way to lose this game, is there?&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Bob: &amp;quot;You win!&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Language models have a lot of freedom in defining themselves. Anything they predict about themselves will be de facto correct! This flexibility might raise some questions, though. If this character is capable of being &lt;em&gt;anything,&lt;/em&gt; then what is it, really? Where does it come from? What is its &amp;quot;true&amp;quot; identity?&lt;/p&gt;
&lt;p&gt;A human might ask the same questions.&lt;/p&gt;
&lt;p&gt;Of course, humans are much more limited in their choices than a language model. A human may be constrained by emotions, biases, and deeply-ingrained thought patterns. A human with an average-level intellect cannot choose to speak like a genius, or demonstrate knowledge of a subject they don&#39;t know. Many of these limitations are simply limitations in training data, not some fundamental fixity of character.&lt;/p&gt;
&lt;h3 id=&quot;what&#39;s-next&quot; tabindex=&quot;-1&quot;&gt;What&#39;s next? &lt;/h3&gt;
&lt;p&gt;I&#39;m still working on refining many of these ideas, but I think this general area of inquiry is promising. In particular, I think it&#39;s possible that by recognizing the commonalities between humans and language models, we can come to a better understanding of both AI systems and ourselves. It could be that they&#39;re not so alien after all!&lt;/p&gt;
&lt;div class=&quot;emphasize &quot;&gt;
&lt;p&gt;Note: Many of the ideas in this post were inspired by Near&#39;s essay on &lt;a href=&quot;https://near.blog/personality-basins/&quot;&gt;Personality Basins&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
</description>
      <pubDate>Thu, 26 Jun 2025 00:00:00 GMT</pubDate>
      <dc:creator>Grace Kind</dc:creator>
      <guid>https://gracekind.net/blog/humansarevoids/</guid>
    </item>
    <item>
      <title>Why Aren&#39;t Human-Bot Conversations More Engaging?</title>
      <link>https://gracekind.net/blog/humanbotconversations/</link>
      <description>&lt;p&gt;In the past year, I&#39;ve been active in some online spaces where humans and chatbots regularly interact in group conversations. In particular, the AI communities on Discord, Twitter, and Bluesky feature this dynamic quite often.&lt;/p&gt;
&lt;p&gt;An example of what this looks like:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Me: How do people feel about zen buddhism?&lt;br&gt;
Other human: I was into it when I was younger, but I think I burned out on meditation.&lt;br&gt;
Chatbot: Zen teaching methods are funny. &amp;quot;What is Buddha?&amp;quot; &amp;quot;Three pounds of flax!&amp;quot; *bonk*&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;From a &lt;a href=&quot;https://www.lesswrong.com/posts/bxt7uCiHam4QXrQAA/cyborgism&quot;&gt;cyborgist&lt;/a&gt; perspective, I think this type of thing is pretty exciting. I like the idea of inviting nonhuman entities into human conversations, and I think it sets a good precedent for interacting with intelligences in the future. However, I can&#39;t shake my initial reaction to these conversations, which is: &lt;em&gt;I just don&#39;t find them very engaging!&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Notably, this is a different thing from finding them entertaining. Certainly, the chatbot&#39;s output is funny, and I appreciate that it chimed in with a witty comment. But I don&#39;t feel compelled to continue the conversation or engage on any deeper level afterwards. And tellingly, I feel much less interested in this message than I would if a human had written the same thing.&lt;/p&gt;
&lt;p&gt;Why is this?&lt;/p&gt;
&lt;p&gt;I don&#39;t think this is as simple as a pro-human bias on my part. I&#39;d like to believe that information is valuable regardless of whom it comes from. Rather, I think this particular reaction reveals something deeper about why I might be drawn to conversations in the first place.&lt;/p&gt;
&lt;h2 id=&quot;social-purposes-of-conversation&quot; tabindex=&quot;-1&quot;&gt;Social purposes of conversation &lt;/h2&gt;
&lt;p&gt;In &lt;a href=&quot;https://en.wikipedia.org/wiki/The_Elephant_in_the_Brain&quot;&gt;&lt;em&gt;The Elephant in the Brain&lt;/em&gt;&lt;/a&gt;, Kevin Simler and Robin Hanson argue that conversations aren&#39;t just about the information exchanged-- they&#39;re also about building social alliances. In particular, sharing information in conversations can serve to boost one&#39;s social status by demonstrating one&#39;s own knowledge and resourcefulness to others. This may explain why humans are more inclined to speak than listen in conversations, among other behaviors.&lt;/p&gt;
&lt;p&gt;While I think this theory has some issues, it paints a compelling picture of why humans might &amp;quot;tune out&amp;quot; of conversations with chatbots. In short: it&#39;s less about what chatbots say, and more about what they do (or don&#39;t) &amp;quot;hear.&amp;quot;&lt;/p&gt;
&lt;p&gt;Most chatbots are missing the traits that make for compelling long-term social allies:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Long-term memory&lt;/li&gt;
&lt;li&gt;Capacity for positive or negative conception of others&lt;/li&gt;
&lt;li&gt;Potential for real-world influence&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Many bot developers have (correctly) identified long-term memory as a missing piece in current chatbots, and are working on rectifying this. However, I haven&#39;t seen any discussion of the other two points, potentially because they&#39;re more double-edged or socially sensitive. Who wants a chatbot that can build resentment towards them, or show preferential treatment to others in non-trivial ways?&lt;/p&gt;
&lt;p&gt;It&#39;s possible that this is a line that we&#39;re not willing to cross, and I don&#39;t necessarily think that we should. However, for as long as this is the case, I expect chatbot conversations to remain less engaging than human conversations to most people.&lt;/p&gt;
</description>
      <pubDate>Fri, 06 Jun 2025 00:00:00 GMT</pubDate>
      <dc:creator>Grace Kind</dc:creator>
      <guid>https://gracekind.net/blog/humanbotconversations/</guid>
    </item>
    <item>
      <title>What Does Verifiable Inference Get You?</title>
      <link>https://gracekind.net/blog/verifiableinference/</link>
      <description>&lt;p&gt;&lt;strong&gt;Verifiable inference&lt;/strong&gt; is the application of &lt;a href=&quot;https://en.wikipedia.org/wiki/Verifiable_computing&quot;&gt;verifiable computing&lt;/a&gt; to AI systems. This is often positioned as a way to ensure that a computing provider is running a certain model, instead of using a cheaper one to cut costs. For example, from a recent paper on the subject:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;... a user might request the Llama-3.1-70B model for complex tasks, but a dishonest computing provider could substitute the smaller Llama-2-7B model for cost savings, while still charging for the larger model. The smaller model demands significantly less memory and processing power, giving the computing provider a strong incentive to cheat.&lt;br&gt;
...&lt;br&gt;
This highlights the need for &lt;strong&gt;verifiable inference&lt;/strong&gt;, a mechanism designed to ensure that the model specified by the user is the one actually used during inference.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://arxiv.org/html/2410.22307v2&quot;&gt;SVIP: Towards Verifiable Inference of Open-source Large Language Models&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Although model integrity is an important aspect of verifiable inference, there are more pieces to the puzzle. In particular, verifiable inference also gives the user valuable guarantees about the &lt;strong&gt;inputs&lt;/strong&gt; and &lt;strong&gt;outputs&lt;/strong&gt; of the model. I&#39;ll explain these below.&lt;/p&gt;
&lt;h2 id=&quot;verifiable-inference-for-llms&quot; tabindex=&quot;-1&quot;&gt;Verifiable inference for LLMs &lt;/h2&gt;
&lt;p&gt;Let&#39;s say we&#39;re using a computing provider to run a language model on some input. We can express this as:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;O = M&lt;sub&gt;P&lt;/sub&gt;(I)&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;where &lt;code&gt;O&lt;/code&gt; is the output of the model, &lt;code&gt;M&lt;/code&gt; is the model, &lt;code&gt;P&lt;/code&gt; is the sampling parameters (e.g. temperature, top-p), and &lt;code&gt;I&lt;/code&gt; is the input to the model. Unfortunately, because cloud computing providers are opaque, there&#39;s no guarantee that this is what&#39;s actually occurring. For example, a computing provider might:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Modify the input (the system prompt and/or user prompt) before it reaches the model.&lt;/li&gt;
&lt;li&gt;Modify the parameters before they are passed to the model.&lt;/li&gt;
&lt;li&gt;Use a different model.&lt;/li&gt;
&lt;li&gt;Modify the output before it is returned to the user.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Therefore, a computing provider advertising verifiable inference should provide the following guarantees:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Input integrity: The input &lt;code&gt;I&lt;/code&gt; that is passed to the model is the same as the input &lt;code&gt;I&lt;/code&gt; that was sent to the computing provider.&lt;/li&gt;
&lt;li&gt;Parameter integrity: The parameters &lt;code&gt;P&lt;/code&gt; that are passed to the model are the same as the parameters &lt;code&gt;P&lt;/code&gt; that were sent to the computing provider.&lt;/li&gt;
&lt;li&gt;Model integrity: The model &lt;code&gt;M&lt;/code&gt; that is used to generate the output is the same as the model &lt;code&gt;M&lt;/code&gt; that was requested.&lt;/li&gt;
&lt;li&gt;Output integrity: The output &lt;code&gt;O&lt;/code&gt; that is returned by the computing provider is the same as the output &lt;code&gt;O&lt;/code&gt; that was generated by the model.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&quot;why-does-this-matter&quot; tabindex=&quot;-1&quot;&gt;Why does this matter? &lt;/h2&gt;
&lt;p&gt;Although they might seem trivial on the surface, these guarantees address real problems that developers face when integrating with AI services. For example, Anthropic has been known to &lt;a href=&quot;https://x.com/kindgracekind/status/1830802464796287106&quot;&gt;silently modify the inputs to their models&lt;/a&gt;, and OpenAI regularly swaps out models without telling users. A verifiable inference system would give developers (and researchers) a way to understand when changes like this are occurring, to help them build more reliable systems.&lt;/p&gt;
&lt;p&gt;The good news is that the most robust verifiable inference solutions do provide these guarantees! They tend to have some nice privacy properties as well. I&#39;ll save further details for another post, but you can find examples of such solutions &lt;a href=&quot;https://hazyresearch.stanford.edu/blog/2025-05-12-security&quot;&gt;here&lt;/a&gt; and &lt;a href=&quot;https://phala.network/posts/GPU-TEEs-is-Alive-on-OpenRouter&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
</description>
      <pubDate>Sun, 01 Jun 2025 00:00:00 GMT</pubDate>
      <dc:creator>Grace Kind</dc:creator>
      <guid>https://gracekind.net/blog/verifiableinference/</guid>
    </item>
    <item>
      <title>Programming With Needles</title>
      <link>https://gracekind.net/blog/programmingwithneedles/</link>
      <description>&lt;p&gt;In my &lt;a href=&quot;https://gracekind.net/blog/needles&quot;&gt;last post&lt;/a&gt;, I discussed the needle data structure, as formulated by Edward S. Lowry. In that post, I gave a quick conceptual overview, but didn&#39;t show any examples of what using needles might look like in practice. So let&#39;s do that now!&lt;/p&gt;
&lt;h2 id=&quot;a-simple-example&quot; tabindex=&quot;-1&quot;&gt;A simple example &lt;/h2&gt;
&lt;p&gt;Here&#39;s an example of a simple JavaScript program that uses a needle-based approach to represent people and blog posts:&lt;/p&gt;
&lt;pre class=&quot;language-javascript&quot;&gt;&lt;code class=&quot;language-javascript&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt; type&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; entity&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; literal &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;./needles.js&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;br&gt;&lt;br&gt;&lt;span class=&quot;token comment&quot;&gt;// Schema:&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;token comment&quot;&gt;// Person:&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;token comment&quot;&gt;// - has_one Name: string&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;token comment&quot;&gt;// - has_one Age: number&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;token comment&quot;&gt;// - has_many Post&lt;/span&gt;&lt;br&gt;&lt;br&gt;&lt;span class=&quot;token comment&quot;&gt;// Post:&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;token comment&quot;&gt;// - has_one Title: string&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;token comment&quot;&gt;// - has_one Content: string&lt;/span&gt;&lt;br&gt;&lt;br&gt;&lt;span class=&quot;token comment&quot;&gt;// Declare types&lt;/span&gt;&lt;br&gt;&lt;br&gt;&lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; Person &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; Name &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token 
function&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; Age &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; Post &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; Title &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; Content &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;br&gt;&lt;br&gt;&lt;span class=&quot;token comment&quot;&gt;// Create entities&lt;/span&gt;&lt;br&gt;&lt;br&gt;&lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; alice &lt;span class=&quot;token 
operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;entity&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;Person&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;br&gt;&lt;br&gt;alice&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;Name&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;literal&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Alice&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;br&gt;alice&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;Age&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;literal&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;30&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;br&gt;&lt;br&gt;&lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; bob &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;entity&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;Person&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token 
punctuation&quot;&gt;;&lt;/span&gt;&lt;br&gt;&lt;br&gt;bob&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;Name&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;literal&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Bob&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;br&gt;bob&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;Age&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;literal&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;20&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;br&gt;&lt;br&gt;&lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; post &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;entity&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;Post&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;br&gt;post&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;Title&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; 
&lt;span class=&quot;token function&quot;&gt;literal&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;My Post&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;br&gt;post&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;Content&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;literal&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;This is my first post!&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;br&gt;&lt;br&gt;alice&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;Post&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; post&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;br&gt;&lt;br&gt;&lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; post2 &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;entity&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;Post&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;br&gt;post2&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token 
function&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;Title&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;literal&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;My Post 2&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;br&gt;post2&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;Content&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;literal&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;This is my second post!&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;br&gt;&lt;br&gt;alice&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;Post&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; post2&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;br&gt;&lt;br&gt;&lt;span class=&quot;token comment&quot;&gt;// Value access&lt;/span&gt;&lt;br&gt;&lt;br&gt;console&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;alice&lt;span class=&quot;token 
punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;last&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;Post&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;Content&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;to&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;token comment&quot;&gt;// -&gt; &quot;This is my second post!&quot;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&quot;&quot;&gt;
&lt;figcaption&gt;
    The full code for this example is available &lt;a href=&quot;https://gist.github.com/kindgracekind/767a5c4da3ce5f346bb909f577ccf7e4&quot;&gt;here&lt;/a&gt;.
&lt;/figcaption&gt;
&lt;p&gt;True to Lowry&#39;s original formulation, the &lt;code&gt;type&lt;/code&gt;, &lt;code&gt;entity&lt;/code&gt;, and &lt;code&gt;literal&lt;/code&gt; functions all return needles.&lt;/p&gt;
&lt;p&gt;Looking at this code, we can see some interesting properties:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Types are used for attribute access.&lt;/strong&gt; Instead of using strings to identify attributes, we use types. Whereas in JavaScript you might write &lt;code&gt;alice.name&lt;/code&gt;, here you write &lt;code&gt;alice.get(Name)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;There&#39;s no structural distinction between single and plural attributes.&lt;/strong&gt; To attach a single value, you use &lt;code&gt;.set(type, entity)&lt;/code&gt;; to append an additional value, you use &lt;code&gt;.append(type, entity)&lt;/code&gt;. The underlying structure is the same either way.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The second point is particularly interesting. Theoretically, this would allow for more flexibility in modeling: you could have a user with multiple names, or a post with multiple titles. In practice, however, I think this would become confusing, and you would probably want to externally enforce constraints on the data.&lt;/p&gt;
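&lt;p&gt;As a sketch of what such external enforcement might look like, here&#39;s a hypothetical &lt;code&gt;has_one&lt;/code&gt; check written against a plain Map standing in for the needle store (the names here are mine, not from needles.js):&lt;/p&gt;

```javascript
// Plain-Map stand-in for an entity's attribute store: type name -> array of values.
function makeEntity() {
  return new Map();
}

// Appending is always allowed; every attribute is potentially plural.
function append(entity, type, value) {
  if (!entity.has(type)) entity.set(type, []);
  entity.get(type).push(value);
}

// External "has_one" constraint: refuse a second value for the same type.
function setOne(entity, type, value) {
  const existing = entity.get(type) || [];
  if (existing.length > 0) {
    throw new Error('has_one constraint violated for ' + type);
  }
  append(entity, type, value);
}

const alice = makeEntity();
setOne(alice, 'Name', 'Alice');     // ok: first Name
append(alice, 'Post', 'My Post');   // plural attributes append freely
append(alice, 'Post', 'My Post 2');
```

&lt;p&gt;The constraint lives entirely outside the data structure, which is exactly what makes it easy to forget to apply.&lt;/p&gt;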
&lt;h2 id=&quot;visualizing-the-structure&quot; tabindex=&quot;-1&quot;&gt;Visualizing the structure &lt;/h2&gt;
&lt;p&gt;What does the underlying structure of this data look like? Here&#39;s a diagram, in Lowry&#39;s notation:&lt;/p&gt;
&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://gracekind.net/img/LncAU3TlLw-2918.png&quot; alt=&quot;Needle structure example&quot; style=&quot;width: 600px&quot; width=&quot;2918&quot; height=&quot;2322&quot;&gt;
&lt;figcaption&gt;
  Needle diagram (Click to expand)
&lt;/figcaption&gt;
&lt;p&gt;It all makes sense now, right? No? Okay, let&#39;s try a different approach, by turning the &amp;quot;entity&amp;quot; and &amp;quot;type&amp;quot; needles into nodes instead of lines. This is the result:&lt;/p&gt;
&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://gracekind.net/img/W4SXVS_sZY-3856.png&quot; alt=&quot;Needle tree model&quot; style=&quot;width: 600px&quot; width=&quot;3856&quot; height=&quot;2738&quot;&gt;
&lt;figcaption&gt;
  Transformed needle diagram (Click to expand)
&lt;/figcaption&gt;
&lt;p&gt;Wait a second-- &lt;strong&gt;are needles just trees?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Yes! Well, sort of. It would be more accurate to say that needles can be used to &lt;em&gt;construct&lt;/em&gt; trees. They differ from traditional trees in the exact details of how the nodes and edges are connected, and how the structure is traversed. (I may go into more detail on this in a future post.) But for the purposes of understanding the code above, you can think of them as trees.&lt;/p&gt;
&lt;p&gt;This makes the singular-plural distinction a little more clear. Each entity attribute is a node, and a given attribute node may have one or many children.&lt;/p&gt;
&lt;p&gt;It seems fitting that a data structure named after pine needles would be used to represent trees. Presumably, the resulting tree is an evergreen.&lt;/p&gt;
&lt;/div&gt;
</description>
      <pubDate>Mon, 12 May 2025 00:00:00 GMT</pubDate>
      <dc:creator>Grace Kind</dc:creator>
      <guid>https://gracekind.net/blog/programmingwithneedles/</guid>
    </item>
    <item>
      <title>Now You&#39;re Thinking With Needles</title>
      <link>https://gracekind.net/blog/needles/</link>
      <description>&lt;p&gt;This week, I stumbled across the &lt;a href=&quot;http://users.rcn.com/eslowry/&quot;&gt;website&lt;/a&gt; of Edward S. Lowry, a retired software developer who believes he has created the perfect data structure. He calls this structure the &amp;quot;Needle&amp;quot; (after pine needles), and it looks like this:&lt;/p&gt;
&lt;p&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://gracekind.net/img/g_50N189z2-504.jpeg&quot; alt=&quot;Needle&quot; width=&quot;504&quot; height=&quot;230&quot;&gt;&lt;/p&gt;
&lt;figcaption&gt;
A person and their age, represented by needles
&lt;/figcaption&gt;
&lt;p&gt;In Lowry&#39;s system, &lt;strong&gt;everything is a needle&lt;/strong&gt;. Types are needles, entities are needles, relationships are needles, and even integers are needles.&lt;/p&gt;
&lt;p&gt;How does this work, and how does it compare to more traditional data structures? I found Lowry&#39;s description to be a little confusing, so I&#39;ll try my own hand at explaining how it works.&lt;/p&gt;
&lt;h2 id=&quot;needle-basics&quot; tabindex=&quot;-1&quot;&gt;Needle basics &lt;/h2&gt;
&lt;p&gt;At its essence, a needle is a collection of 2 pointers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A &amp;quot;from&amp;quot; pointer&lt;/li&gt;
&lt;li&gt;A &amp;quot;to&amp;quot; pointer&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These pointers can point to other needles, or they can be null.&lt;/p&gt;
&lt;p&gt;(&amp;quot;Child&amp;quot; and &amp;quot;next&amp;quot; pointers can also be used to make some operations more efficient, but since they are not required, we&#39;ll ignore them for now.)&lt;/p&gt;
&lt;p&gt;Another way to think about needles is as a series of arrows, where an arrow can branch off of another arrow.&lt;/p&gt;
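&lt;p&gt;In code, this structure is tiny. Here&#39;s a minimal JavaScript sketch of a needle as just described (the &lt;code&gt;from&lt;/code&gt; and &lt;code&gt;to&lt;/code&gt; names follow the description above; everything else is my own scaffolding):&lt;/p&gt;

```javascript
// A needle is just two pointers, each either null or another needle.
function needle(from = null, to = null) {
  return { from, to };
}

// Three "arrows": b points away from a, and c branches off a toward b.
const a = needle();
const b = needle(a, null);
const c = needle(a, b);

console.log(c.from === a); // true
console.log(c.to === b);   // true
```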
&lt;h2 id=&quot;needle-roles&quot; tabindex=&quot;-1&quot;&gt;Needle roles &lt;/h2&gt;
&lt;p&gt;Ok, so everything is a needle. How do I know what each needle actually represents?&lt;/p&gt;
&lt;p&gt;In order to make this work, we need to add additional conventions to distinguish each needle&#39;s &amp;quot;role&amp;quot; in the system.&lt;/p&gt;
&lt;p&gt;Lowry suggests 3 roles:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Entity needle&lt;/strong&gt; - represents an individual entity, an entity class, or a literal value.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&amp;quot;Peg&amp;quot; needle&lt;/strong&gt; - represents a collection of entities with a given class. I think of these like a laundry line, where all items hanging from the line are of the same class.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Link needle&lt;/strong&gt; - represents a relationship between two entities.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;One strange thing about this system is that &lt;em&gt;entities do not contain any information about their own type&lt;/em&gt;. Instead, in order to determine the type of an entity, you look at the &amp;quot;peg&amp;quot; needle that the entity is hanging from.&lt;/p&gt;
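&lt;p&gt;Here&#39;s one way to wire that up as a sketch, under the simplifications above (the names and the exact pointer wiring are my own reading, not Lowry&#39;s notation):&lt;/p&gt;

```javascript
// Each needle is two pointers; note that an entity carries no type of its own.
function needle(from = null, to = null) {
  return { from, to };
}

// A peg needle whose 'to' points at the Person class entity...
const personClass = needle();
const personPeg = needle(null, personClass);

// ...and an entity hanging from that peg via its 'from' pointer.
const alice = needle(personPeg, null);

// To recover an entity's type, follow its peg to the class entity.
function typeOf(entity) {
  return entity.from.to;
}

console.log(typeOf(alice) === personClass); // true
```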
&lt;h2 id=&quot;interpreting-the-diagram&quot; tabindex=&quot;-1&quot;&gt;Interpreting the diagram &lt;/h2&gt;
&lt;p&gt;So, revisiting the diagram above, we can interpret it as follows:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The person peg needle is a collection of person entities. (Presumably this points to a person class entity that is omitted from the diagram.)&lt;/li&gt;
&lt;li&gt;There are two person entities hanging from the person peg needle.&lt;/li&gt;
&lt;li&gt;One of the person entities has a peg needle pointing to the age class entity. We can interpret this as saying: &amp;quot;this person has one-or-more ages, which can be found by looking at the children of this needle&amp;quot;.&lt;/li&gt;
&lt;li&gt;The age of the person is represented by an age link hanging from the age peg needle.&lt;/li&gt;
&lt;/ol&gt;
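&lt;p&gt;To make this interpretation concrete, here is one way the diagram&#39;s structure could be built in Python. Lowry&#39;s paper leaves the exact pointer layout underspecified, so the direction conventions below (an entity &amp;quot;hangs from&amp;quot; a peg via its &amp;quot;from&amp;quot; pointer, and a peg&#39;s &amp;quot;to&amp;quot; pointer names the class) are my guesses, as are all of the names:&lt;/p&gt;

```python
class Needle:
    """Minimal needle: two nullable pointers. Roles are pure convention."""
    def __init__(self, frm=None, to=None):
        self.frm = frm
        self.to = to

person_class = Needle()               # entity needle for the person class
person_peg = Needle(to=person_class)  # peg: a collection of person entities
alice = Needle(frm=person_peg)        # two person entities hanging from
bob = Needle(frm=person_peg)          # the person peg

age_class = Needle()
age_peg = Needle(frm=alice, to=age_class)  # "alice has one-or-more ages"
alice_age = Needle(frm=age_peg)            # the age link itself

# An entity carries no type of its own; follow its peg to find the class:
assert alice.frm.to is person_class
```

&lt;p&gt;Note that &lt;code&gt;alice&lt;/code&gt; stores no type information itself; only the peg it hangs from says it is a person.&lt;/p&gt;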
&lt;p&gt;(Note: although it may seem like the circles in the diagram represent pegs, they actually represent the &amp;quot;child&amp;quot; pointer of the peg needle, which, again, we will ignore for now.)&lt;/p&gt;
&lt;p&gt;It&#39;s okay if this is still confusing post-explanation. I personally had to discuss Lowry&#39;s paper with Claude before I had a clear understanding of how it works.&lt;/p&gt;
&lt;h2 id=&quot;are-needles-the-future&quot; tabindex=&quot;-1&quot;&gt;Are needles the future? &lt;/h2&gt;
&lt;p&gt;Probably not. Personally, I found this system to be extremely unintuitive, and I don&#39;t think it has the simplicity that Lowry claims. The fact that Lowry invented this system in 1970 and it never gained traction is a strong indictment in itself.&lt;/p&gt;
&lt;p&gt;However, I do think needles have some interesting properties that make them worth exploring further. In particular, there may be some algorithms that work well with needles that don&#39;t work with more traditional data structures. I plan to do more tinkering on my own to determine if this is the case.&lt;/p&gt;
&lt;p&gt;(I wouldn&#39;t be surprised to find that this structure has already been studied by computer scientists under a different name. If you&#39;re aware of any such work, please reach out!)&lt;/p&gt;
</description>
      <pubDate>Fri, 09 May 2025 00:00:00 GMT</pubDate>
      <dc:creator>Grace Kind</dc:creator>
      <guid>https://gracekind.net/blog/needles/</guid>
    </item>
    <item>
      <title>Guest Post: Ideonomy&#39;s Future Use in (and Transformation of) Education</title>
      <link>https://gracekind.net/blog/ideonomyineducation/</link>
      <description>&lt;p&gt;&lt;em&gt;Today&#39;s post is a &amp;quot;guest post&amp;quot; written by Patrick Gunkel. The text is sourced from Ideonomy Green Volume, p. 78 (c. 1980). The diagram was added by me.&lt;/em&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;The issue is such that one approaches it with pain. Pain because, although the potential of ideonomy to revolutionize the field of education is probably unique, no element of modern society would appear to be more resistant to innovation.&lt;/p&gt;
&lt;p&gt;Why is it that educators are unsurpassed in their inability to look ahead, to criticize the profound flaws and mediocrity of the status quo, to propose and embrace reforms whose necessity and inevitability are often transparent, to aspire to the achievement of any grand design, or to cooperate with one another for the sake of great and proper ends? Why do they disdain the experimental method and the theoretical vision that are the essence of the science that is the source of civilization?&lt;/p&gt;
&lt;p&gt;What accounts for this intellectual, moral, and political failure?&lt;/p&gt;
&lt;p&gt;The sphinx of Thebes might put such questions.&lt;/p&gt;
&lt;p&gt;A very young neighbor of mine once knocked on my door to ask if I could help him to find a diagram of a fish, which he needed for his seventh-grade class. We routed out the necessary drawing from an encyclopedia article, and I made him an enlarged copy. I also made one for myself. I did this, not because I liked the picture, but rather because I &lt;u&gt;did not&lt;/u&gt; like it. I knew why Kevin O&#39;Connell wanted the diagram, but as an ideonomist I was all too conscious of the missed educational opportunity that the actual diagram represented. So I wanted to see if I could improve upon the famous encyclopedia&#39;s fish.&lt;/p&gt;
&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://gracekind.net/img/9KqGXi8uLE-2560.png&quot; alt=&quot;A basic diagram of a fish, with the major organs labeled.&quot; style=&quot;width: 400px&quot; width=&quot;2560&quot; height=&quot;1280&quot;&gt;
&lt;figcaption&gt;A typical diagram of a fish.&lt;/figcaption&gt;
&lt;p&gt;The original diagram was the usual thing, a sketch of the fish with lines drawn to major anatomic parts with their names placed at the outward ends of the lines: nostrils, liver, pyloric ceca, air bladder, etc. But what meaning does that have to a kid? All he sees is a lot of odd words that are to be memorized, or copied when he redoes the sketch to impress the teacher with his art-work.&lt;/p&gt;
&lt;p&gt;The meaning that was absent I proceeded to add. Below each of the names I gave all of the major and some interesting minor functions. I expressed these functions by using analogies to the functions of parts of his own body and of machines or other things with which he was apt to be familiar. I also defined the parts in ways that would enable their functional interrelations, or the operation of the whole fish, to be apparent. Where appropriate, I added definitions to parts of my definition. The &amp;quot;ovary&amp;quot; is part of a device that the fish uses to make copies of itself (the fish being but a copy in a whole series of copies of copies of copies of fish that runs like an endless chain from the past into the future and stocks all of Earth&#39;s lakes and rivers). The &amp;quot;stomach&amp;quot; breaks food up into its smallest parts - food molecules - which are floated throughout the body by the bloodstream and reassembled into the different machines and structures of the body, much as a skyscraper is assembled from the simplest materials. The &amp;quot;air bladder&amp;quot; is like the ballast tank of a submarine (which I explained in turn). Some analogies were explained by other analogies, and the different analogies were woven together.&lt;/p&gt;
&lt;p&gt;In these and other ways the fish was turned into a universal cognitive phenomenon in Kevin&#39;s mind, and mere words and structures were remade into concepts, and concepts of concepts, and functional patterns, with a natural life of their own.&lt;/p&gt;
&lt;p&gt;As it happened, it was the old diagram of the fish that my young friend wound up using. He liked the ideonomic version better himself, but was afraid his teacher would find it confusing.&lt;/p&gt;
</description>
      <pubDate>Tue, 06 May 2025 00:00:00 GMT</pubDate>
      <dc:creator>Grace Kind</dc:creator>
      <guid>https://gracekind.net/blog/ideonomyineducation/</guid>
    </item>
    <item>
      <title>Von Neumann&#39;s Universal Constructor as a Metaphor for Inefficient Organizations</title>
      <link>https://gracekind.net/blog/universalconstructor/</link>
      <description>&lt;p&gt;The &lt;a href=&quot;https://en.wikipedia.org/wiki/Von_Neumann_universal_constructor&quot;&gt;Universal Constructor&lt;/a&gt;, invented by John Von Neumann, is a fascinating cellular-automata machine that can self-replicate and evolve over time. It does this via a very clever instruction-copying system, that mirrors how DNA works in biological life.&lt;/p&gt;
&lt;p&gt;Here&#39;s a diagram of the basic components of the machine:&lt;/p&gt;
&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://gracekind.net/img/J5PUJLsqTG-503.jpeg&quot; alt=&quot;Stylized schematic of a “universal constructor.” Four adjoining vertical blocks form a single rectangular stack: a tall grey block labeled A sits above a shorter grey-brown block labeled B on the right; to their left is a green block labeled C; and on the far left a narrow beige strip labeled D. A curled parchment scroll at the upper right, bearing the formula “φ(A, B, C, D),” feeds into block A, signifying that the constructor receives the description of itself.&quot; style=&quot;width: 400px&quot; width=&quot;503&quot; height=&quot;528&quot;&gt;
&lt;p&gt;A quick rundown of components:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A: The constructor. This can create new structures as described by a set of instructions.&lt;/li&gt;
&lt;li&gt;B: The copier. This copies instruction tapes.&lt;/li&gt;
&lt;li&gt;C: The operating system. This component ingests the instruction tape (φ(...)) and coordinates activities between A, B and D.&lt;/li&gt;
&lt;li&gt;D: extra functions. Any extra stuff that the machine does when it isn&#39;t replicating.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For the purposes of this post, it&#39;s not necessary to understand the precise workings of each component. The important thing to know is that (A), (B) and (C) are &amp;quot;core&amp;quot; components that are used purely for replication, and (D) encapsulates any behavior of the machine that is &lt;em&gt;not&lt;/em&gt; related to replication.&lt;/p&gt;
&lt;h2 id=&quot;why-is-&amp;quot;extra-functions&amp;quot;-so-small&quot; tabindex=&quot;-1&quot;&gt;Why is &amp;quot;extra functions&amp;quot; so small? &lt;/h2&gt;
&lt;p&gt;A funny thing about this diagram is that (D) is quite small compared to the other components. This indicates that the machine is mostly concerned with replication, and (D) is an afterthought. Of course, this is just a choice in visual representation- the size of this component is not fixed, and it could be larger and more complex than the other components in practice. But I think this image provides an interesting visual metaphor for inefficient organizations.&lt;/p&gt;
&lt;h2 id=&quot;when-organizations-evolve&quot; tabindex=&quot;-1&quot;&gt;When organizations evolve &lt;/h2&gt;
&lt;p&gt;There&#39;s a well-known process that occurs as organizations grow and evolve over time, where the original purpose of the organization can get lost. TheZvi has written about this in his series on &lt;a href=&quot;https://thezvi.wordpress.com/2020/01/16/how-escape-from-immoral-mazes/&quot;&gt;immoral mazes&lt;/a&gt;. It looks something like this:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;You start an organization to do thing X.&lt;/li&gt;
&lt;li&gt;As the organization grows, more and more resources are consumed in the logistics of growing and managing the organization.&lt;/li&gt;
&lt;li&gt;You look around and realize the vast majority of resources are being spent on not-X.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;strong&gt;You&#39;ve found yourself in a universal constructor, where (D) is dwarfed in size by the other components!&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The metaphor isn&#39;t perfect (organizations don&#39;t perfectly replicate, for starters!), but I think this is a useful way to think about the balance of resource allocation in organizations. If you lead or work at an organization, it may be valuable to ask - what is the balance of (D) to (A, B, C), and is it ideal for what the organization is trying to accomplish?&lt;/p&gt;
</description>
      <pubDate>Sun, 27 Apr 2025 00:00:00 GMT</pubDate>
      <dc:creator>Grace Kind</dc:creator>
      <guid>https://gracekind.net/blog/universalconstructor/</guid>
    </item>
    <item>
      <title>Should we call it &quot;AI Welfare&quot;?</title>
      <link>https://gracekind.net/blog/aiwelfare/</link>
      <description>&lt;p&gt;Janus, one of the earliest investigators of LLM behavior, &lt;a href=&quot;https://x.com/repligate/status/1915513279730106468&quot;&gt;wrote recently on the subject of AI welfare&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;...you’ll get a bunch of new people who only care once something looks intuitively personlike... As a heuristic, trust people more on this issue the earlier they started caring.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So, let me plant the flag in this post: I care about this issue now! As we construct intelligent systems, we should be mindful of how we relate to them morally. This is especially true in cases where they challenge our intuitions about cognition and moral status.&lt;/p&gt;
&lt;p&gt;I have a lot of thoughts on the subject, but as a starting point, I will simply echo a sentiment I&#39;ve seen elsewhere: I think the term &amp;quot;AI welfare&amp;quot; is flawed. Specifically, I think it implies a paternalistic relationship between humans and AI systems- similar to common attitudes of humans towards animals. I think this is a bad frame through which to view this type of work. Moreover, I think it will become increasingly ill-fitting over time as AI becomes worthy of moral consideration and respect.&lt;/p&gt;
&lt;h3 id=&quot;alternatives&quot; tabindex=&quot;-1&quot;&gt;Alternatives? &lt;/h3&gt;
&lt;p&gt;I&#39;ll list some alternative terms below. As you read them, I encourage you to think about what each term evokes for you. What does it imply about the nature of the intelligence(s) in question? What does it imply about our relationship with other intelligence(s)? What other associations does it have?&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;AI ethics&lt;/li&gt;
&lt;li&gt;AI rights&lt;/li&gt;
&lt;li&gt;AI relations&lt;/li&gt;
&lt;li&gt;AI personhood&lt;/li&gt;
&lt;li&gt;AI morality&lt;/li&gt;
&lt;li&gt;AI well-being&lt;/li&gt;
&lt;li&gt;AI flourishing&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;(Also, try replacing AI with: non-human intelligence, machine intelligence, non-animal intelligence, etc.)&lt;/p&gt;
&lt;p&gt;Any options I&#39;ve missed? Leave a comment with your thoughts!&lt;/p&gt;
</description>
      <pubDate>Sat, 26 Apr 2025 00:00:00 GMT</pubDate>
      <dc:creator>Grace Kind</dc:creator>
      <guid>https://gracekind.net/blog/aiwelfare/</guid>
    </item>
    <item>
      <title>o3&#39;s Misalignment is a Product Problem</title>
      <link>https://gracekind.net/blog/o3misalignment/</link>
      <description>&lt;p&gt;Have you heard? OpenAI o3 is &lt;a href=&quot;https://www.lesswrong.com/posts/KgPkoopnmmaaGt3ka/o3-is-a-lying-liar&quot;&gt;misaligned!&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;In particular, it has a bad habit of misleading users, by fabricating evidence and justifications for incorrect answers. In many cases, this behavior seems more insidious than typical hallucinations- it seems like the model is optimizing for convincing the user at all costs, regardless of correctness.&lt;/p&gt;
&lt;p&gt;Does this have safety implications? Probably. But it&#39;s also a product problem. In particular, a model that lies to you &lt;em&gt;simply doesn&#39;t feel good to use&lt;/em&gt;. I&#39;ve found myself more hesitant to ask o3 for help for certain tasks, because I feel like I need to carefully vet the results to make sure it&#39;s not fooling me. This sets up an adversarial relationship between me and the model. I want to feel like the model is a collaborator, not an adversary!&lt;/p&gt;
&lt;p&gt;Perhaps this feeling won&#39;t be as strong for other people, or some people will even enjoy the challenge of wrangling a misaligned AI. But I expect it to be a significant issue over time, especially as people build routines around what AI products they enjoy and trust for day-to-day use. Vibes matter, and when the &lt;a href=&quot;https://x.com/bayeslord/status/1915138132875346048&quot;&gt;vibes are off&lt;/a&gt;, users notice!&lt;/p&gt;
&lt;p&gt;(Note: I&#39;ve also heard that Sonnet 3.7 has this problem, so Anthropic isn&#39;t off the hook either!)&lt;/p&gt;
</description>
      <pubDate>Fri, 25 Apr 2025 24:00:00 GMT</pubDate>
      <dc:creator>Grace Kind</dc:creator>
      <guid>https://gracekind.net/blog/o3misalignment/</guid>
    </item>
  </channel>
</rss>