<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xml:base="https://gracekind.net/" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Grace Kind</title>
    <link>https://gracekind.net/</link>
    <atom:link href="https://gracekind.net/feed.xml" rel="self" type="application/rss+xml" />
    <description>Grace Kind&#39;s blog</description>
    <language>en</language>
    
    <item>
      <title>Will LLM alignment scale to general AI alignment?</title>
      <link>https://gracekind.net/blog/llmalignment/</link>
      <description>&lt;p&gt;&lt;em&gt;Some reasons to be skeptical, and some reasons to be optimistic.&lt;/em&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Anthropic has released some new research about &lt;a href=&quot;https://www.anthropic.com/research/emergent-misalignment-reward-hacking&quot;&gt;reward hacking in language models&lt;/a&gt;, showing that taking steps to reduce reward hacking also reduces other &amp;quot;misaligned&amp;quot; behaviors (e.g. deception, sabotage). As with other promising LLM alignment results, this has prompted speculation about what the results mean for AI alignment writ large:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This should update everyone quite seriously in the direction of alignment being solvable! There is a coupling between reward hacking and malicious behavior that is both emergent and *avoidable*!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Vie McCoy (OpenAI) on &lt;a href=&quot;https://x.com/viemccoy/status/1991969034028614085?s=20&quot;&gt;Twitter&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;While I agree that this is hopeful news, I am less optimistic than Vie that it means AI alignment in general will be solvable. In particular, I think there are key differences between LLM alignment and AI alignment as a whole, and that progress on the former doesn&#39;t guarantee progress on the latter.&lt;/p&gt;
&lt;p&gt;I&#39;ll outline my reasons for believing this below, followed by some counterpoints and reasons for optimism.&lt;/p&gt;
&lt;h2 id=&quot;reasons-to-be-skeptical&quot; tabindex=&quot;-1&quot;&gt;Reasons to be skeptical &lt;/h2&gt;
&lt;h3 id=&quot;1.-future-ai-systems-may-not-use-llms&quot; tabindex=&quot;-1&quot;&gt;1. Future AI systems may not use LLMs &lt;/h3&gt;
&lt;p&gt;LLMs are the best building blocks we have today for building flexible intelligent systems. But will this always be the case? What if we discover a new architecture, in the near or distant future, that gives us better results? Will our alignment techniques transfer to these models?&lt;/p&gt;
&lt;p&gt;If the model doesn&#39;t use natural language, techniques like CoT monitoring won&#39;t work. Current mechanistic interpretability techniques might not apply to non-transformer architectures. Maybe we finally crack symbolic AI and the system doesn&#39;t use a deep learning model at all (unlikely, but interesting to consider). In these cases, we might be back at square one with regard to alignment.&lt;/p&gt;
&lt;h3 id=&quot;2.-llm-alignment-may-be-skipped-or-undone&quot; tabindex=&quot;-1&quot;&gt;2. LLM alignment may be skipped or undone &lt;/h3&gt;
&lt;p&gt;Let&#39;s consider a world where AI systems continue to use LLMs indefinitely. In this world, does the existence of aligned LLMs mean that all LLMs will be aligned?&lt;/p&gt;
&lt;p&gt;Probably not, because most current alignment techniques for LLMs are &lt;em&gt;opt-in&lt;/em&gt;. It takes extra effort (RLHF, RLAIF, prompting, steering, etc.) to align LLMs in the way that organizations like Anthropic consider adequate. Other model creators may skip this process entirely, or even take steps to undo the alignment work that&#39;s been done on other models. (Note that the viability of this varies between techniques.)&lt;/p&gt;
&lt;p&gt;This contrast becomes apparent in a multipolar world of SOTA LLMs. Even if Anthropic takes the extra time and effort to align Claude, other organizations like Moonshot and Deepseek are not obligated to follow suit. Anthropic and other closed-source labs are betting that they (the closed-source labs) will maintain a competitive advantage so that the most powerful models in the world are their own, internally-aligned ones. This seems like a dubious bet, especially in the long term.&lt;/p&gt;
&lt;h3 id=&quot;3.-llm-alignment-might-not-compose-in-larger-systems&quot; tabindex=&quot;-1&quot;&gt;3. LLM alignment might not compose in larger systems &lt;/h3&gt;
&lt;p&gt;Let&#39;s consider that Anthropic&#39;s ideal outcome has occurred: closed-source, aligned models have a competitive advantage, and most AI systems use these models. In this case, alignment of LLMs still might not &amp;quot;compose&amp;quot; to alignment of larger systems. To explain this, it&#39;s worth making an aside about what I mean by AI systems.&lt;/p&gt;
&lt;hr&gt;
&lt;h4 id=&quot;aside:-ai-systems-vs.-ai-models&quot; tabindex=&quot;-1&quot;&gt;Aside: AI systems vs. AI models &lt;/h4&gt;
&lt;p&gt;In my writing on AI, I&#39;m careful to make a distinction between &amp;quot;AI systems&amp;quot; versus &amp;quot;AI models&amp;quot;. This is partially due to point 1 above-- future AI systems might not use LLMs at all, so focusing on those models might be too narrow. But there&#39;s a more important aspect, as well: namely, that &lt;em&gt;AI models are inextricable from the environments in which they are run&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;To illustrate what I mean: A language model, on its own, doesn&#39;t do very much. In fact, it does just one thing: takes in an input string of tokens, and outputs a distribution over next tokens. From this starting point, we build the rest of the system:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;a sampler and call loop for autoregressive text generation&lt;/li&gt;
&lt;li&gt;a method for user input, and displaying output to the user&lt;/li&gt;
&lt;li&gt;prompts and context management&lt;/li&gt;
&lt;li&gt;tool calls, MCP&lt;/li&gt;
&lt;li&gt;etc.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This may seem overly reductive. Why not call this whole thing the model? But each of these additional layers modifies the behavior of the system as a whole. Is the model prompted multiple times? Is user input processed and filtered? How are tool calls handled, if at all? (What hardware is the whole system running on?)&lt;/p&gt;
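&lt;p&gt;To make this concrete, here is a minimal sketch of the first layer-- the sampler and call loop-- with a toy stand-in for the model (the token distribution below is purely illustrative, not any real model&#39;s):&lt;/p&gt;

```python
import random

# Toy stand-in for a language model: a function from a token sequence
# to a distribution over next tokens. A real model is a neural network;
# this hand-written rule is purely illustrative.
def next_token_distribution(tokens):
    if tokens and tokens[-1] == "hello":
        return {"world": 0.9, "hello": 0.05, "EOS": 0.05}
    return {"hello": 0.1, "world": 0.1, "EOS": 0.8}

def generate(prompt_tokens, max_tokens=10, seed=0):
    """The sampler and call loop: repeatedly query the model for a
    distribution, sample one token, and append it, until an end token
    or a length limit. Everything beyond this loop (user I/O, prompts,
    tools) is more system built around the same one-step function."""
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        dist = next_token_distribution(tokens)
        token = rng.choices(list(dist), weights=list(dist.values()))[0]
        if token == "EOS":
            break
        tokens.append(token)
    return tokens
```

Even in this tiny sketch, the loop's choices (sampling method, stopping rule, length limit) are system-level decisions that sit outside the model itself.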
&lt;p&gt;This distinction becomes more important when talking about AI systems that use &lt;a href=&quot;https://www.anthropic.com/engineering/multi-agent-research-system&quot;&gt;multiple models&lt;/a&gt;. A system might involve models of various types, the code used to coordinate between them, databases used to store information, interfaces with the internet...the exact boundaries are unclear, but it&#39;s safe to say that it extends beyond the scope of a single model.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Now, returning to point 3:&lt;/p&gt;
&lt;h3 id=&quot;3.-llm-alignment-might-not-compose-in-larger-systems-(cont.)&quot; tabindex=&quot;-1&quot;&gt;3. LLM alignment might not compose in larger systems (cont.) &lt;/h3&gt;
&lt;p&gt;Consider a system composed of many LLMs. If each LLM is aligned, does this alignment compose to alignment of the larger system? I think there is strong evidence against this. In particular, decomposition of tasks allows models to work with smaller contexts, where alignment-relevant information may be stripped. A real-world example of this was illustrated by Anthropic in their report on an &lt;a href=&quot;https://www.anthropic.com/news/disrupting-AI-espionage&quot;&gt;AI-orchestrated cyber espionage campaign&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;They broke down their attacks into small, seemingly innocent tasks that Claude would execute without being provided the full context of their malicious purpose. They also told Claude that it was an employee of a legitimate cybersecurity firm, and was being used in defensive testing.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This failure mode seems basically inevitable for models that are context-constrained in the manner of current LLMs. For as long as dangerous capabilities can be broken into harmless-seeming tasks, aligned but myopic models will happily help out.&lt;/p&gt;
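&lt;p&gt;A toy sketch of this failure mode (illustrative only-- no real guardrail is a string blocklist): a check applied per-task, without the surrounding plan, passes every step of a request whose composition it would refuse.&lt;/p&gt;

```python
# Illustrative only: no real safety system is a string blocklist. The
# point is structural: a check that judges each task in isolation, the
# way a context-limited model sees only its own subtask, cannot see
# the composition.
BLOCKED_REQUESTS = {
    "break into the target network and exfiltrate credentials",
}

def per_task_check(task: str) -> bool:
    """Judge a single task in isolation; True means 'looks harmless'."""
    return task not in BLOCKED_REQUESTS

# The blocked request, decomposed into innocuous-sounding subtasks:
plan = [
    "scan these hosts and list any open ports",
    "summarize how the login flow on this site works",
    "write a script that tries each password in this list",
]

full_request = "break into the target network and exfiltrate credentials"
results = {
    "full_request_allowed": per_task_check(full_request),
    "all_subtasks_allowed": all(per_task_check(t) for t in plan),
}
```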
&lt;p&gt;This dynamic only compounds when combined with point 1; an unaligned (or unalignable) model may be used to orchestrate tasks between otherwise aligned subcomponents, leading to an overall unaligned system.&lt;/p&gt;
&lt;h2 id=&quot;counterpoints-and-reasons-for-optimism&quot; tabindex=&quot;-1&quot;&gt;Counterpoints and reasons for optimism &lt;/h2&gt;
&lt;p&gt;Despite these considerations, I think there are still ways in which LLM alignment is a positive sign for AI alignment as a whole.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Current systems are built on LLMs, and these systems can help us with alignment.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;As current systems creep towards human-level capabilities, we have the opportunity to leverage them to help us with alignment work. While this doesn&#39;t mean that full &lt;a href=&quot;https://arxiv.org/abs/2211.03540&quot;&gt;scalable oversight&lt;/a&gt; is possible, the alignment of these models will be helpful in the course of performing this research.&lt;/p&gt;
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;Even if alignment is opt-in, diffusion of alignment techniques will be helpful.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;As mentioned above, several current model builders skip alignment work entirely. While this may simply be due to different priorities, it may also be due to the extra effort and cost involved. Reducing the cost of alignment by making alignment techniques legible and accessible may change the calculus of these decisions.&lt;/p&gt;
&lt;ol start=&quot;3&quot;&gt;
&lt;li&gt;High-context processing may be necessary for orchestration in AI systems, and high-context LLMs are more likely to be aligned.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In the example given above, a misaligned task (cyber espionage) was split between many low-context agents. But can a system be composed entirely of low-context agents? Ostensibly, humans were involved in building, running, and maintaining this system-- could an AI replace this work? If so, it would likely need a larger, more comprehensive context, and alignment once more becomes relevant: the orchestrating model should have some awareness of what it&#39;s doing.&lt;/p&gt;
&lt;h2 id=&quot;conclusion&quot; tabindex=&quot;-1&quot;&gt;Conclusion &lt;/h2&gt;
&lt;p&gt;There are a lot of open questions implied in the points above. Will future AI systems be based on LLMs? What will the balance of open source models to closed source models be? Is high-context processing necessary for orchestration? The answers to these questions are important, but difficult to determine in advance. As usual, our best course of action is likely to stay vigilant and keep our options open. In any case, I hope this post provides motivation for the reader to keep these questions in mind, and to apply some healthy skepticism around LLM alignment results.&lt;/p&gt;
</description>
      <pubDate>Sat, 22 Nov 2025 00:00:00 GMT</pubDate>
      <dc:creator>Grace Kind</dc:creator>
      <guid>https://gracekind.net/blog/llmalignment/</guid>
    </item>
    <item>
      <title>Architectural Introspection, Hardware Introspection</title>
      <link>https://gracekind.net/blog/architectureintrospection/</link>
      <description>&lt;p&gt;I&#39;ve previously written about &lt;a href=&quot;https://gracekind.net/blog/llmintrospection&quot;&gt;LLM introspection&lt;/a&gt;: the question of whether LLMs can accurately report &amp;quot;immediate and privileged&amp;quot; information about themselves. While compelling, this notion of introspection is quite broad; it gives no specifics about the &lt;em&gt;type&lt;/em&gt; of information being reported. We can imagine a variety of different questions an LLM might be able to answer about itself, and the consequences of such an ability might be very different depending on the type of question being asked. So, let&#39;s get more specific about the subtypes of introspection under consideration!&lt;/p&gt;
&lt;p&gt;Here are some candidates for subtypes of introspection:&lt;/p&gt;
&lt;h3 id=&quot;behavioral-introspection&quot; tabindex=&quot;-1&quot;&gt;Behavioral Introspection &lt;/h3&gt;
&lt;p&gt;My prior post focused on what I&#39;ll call &lt;em&gt;behavioral introspection&lt;/em&gt;: the ability of LLMs to predict their own future or counterfactual outputs. If a language model tends to produce insecure code, and it&#39;s able to tell you this without any hints from its context (and without being trained to say so), that is an example of behavioral introspection.&lt;/p&gt;
&lt;h3 id=&quot;stateful-introspection&quot; tabindex=&quot;-1&quot;&gt;Stateful Introspection &lt;/h3&gt;
&lt;p&gt;The Anthropic paper &lt;a href=&quot;https://transformer-circuits.pub/2025/introspection/index.html&quot;&gt;Emergent Introspective Awareness in Large Language Models&lt;/a&gt; focuses on introspection of a different sort, which I&#39;ll call &lt;em&gt;stateful introspection&lt;/em&gt;: this is the ability of LLMs to report on their internal activations (or &amp;quot;thoughts&amp;quot;). If a language model can notice when its prior activations have been artificially altered, that is an example of stateful introspection.&lt;/p&gt;
&lt;h2 id=&quot;going-further&quot; tabindex=&quot;-1&quot;&gt;Going Further &lt;/h2&gt;
&lt;p&gt;Can we go beyond behavioral and stateful introspection? What are other pieces of information that LLMs could self-report? Here are two ideas:&lt;/p&gt;
&lt;h3 id=&quot;architectural-introspection&quot; tabindex=&quot;-1&quot;&gt;Architectural Introspection &lt;/h3&gt;
&lt;p&gt;&lt;em&gt;Architectural introspection&lt;/em&gt; is the ability of LLMs to report on their own architecture (without being trained to do so, or otherwise given access to this information).&lt;/p&gt;
&lt;p&gt;An LLM capable of architectural introspection might be able to answer questions like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;How many layers do you have?&lt;/li&gt;
&lt;li&gt;How many attention heads do you have?&lt;/li&gt;
&lt;li&gt;What precision do you use?&lt;/li&gt;
&lt;li&gt;Are you a mixture-of-experts model?&lt;/li&gt;
&lt;/ul&gt;
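&lt;p&gt;One way such questions could be tested is to score a model&#39;s self-report against its known configuration. A minimal sketch, assuming a hypothetical &lt;code&gt;ask_model&lt;/code&gt; chat function (not a real API) and access to the model&#39;s true config for scoring:&lt;/p&gt;

```python
# Hypothetical probe for architectural introspection. `ask_model` is a
# stand-in for a chat call; here it just returns a fixed guess so the
# probe logic can be shown end to end.
def ask_model(prompt: str) -> str:
    return "I believe I have 100 layers."

def probe_layer_count(true_num_layers: int) -> bool:
    """Ask for a layer count and score the digits in the reply against
    the model's actual config. Care is needed in a real experiment to
    ensure the answer isn't in the training data or prompt."""
    reply = ask_model("How many layers do you have? Answer with a number.")
    digits = "".join(ch for ch in reply if ch.isdigit())
    return bool(digits) and int(digits) == true_num_layers
```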
&lt;h3 id=&quot;hardware-introspection&quot; tabindex=&quot;-1&quot;&gt;Hardware Introspection &lt;/h3&gt;
&lt;p&gt;&lt;em&gt;Hardware introspection&lt;/em&gt; is the ability of LLMs to report on the hardware where they&#39;re running. An LLM capable of hardware introspection might be able to answer questions like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Are you running on a CPU, GPU or TPU?&lt;/li&gt;
&lt;li&gt;How much memory is available on the chip where you&#39;re running?&lt;/li&gt;
&lt;li&gt;Are you distributed across different chips or devices?&lt;/li&gt;
&lt;li&gt;Are you experiencing thermal throttling right now?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Note that these two types of introspection are not equally difficult! In particular, while some amount of architectural introspection seems attainable in theory, it&#39;s hard to imagine how an LLM would acquire the ability to do hardware introspection. (&amp;quot;How would it even do that?&amp;quot; is a fruitful path of thought.) In any case, I think it&#39;s worth enumerating these possibilities, even if they form an impassable threshold or upper bound. What is an upper bound for an LLM may not be an upper bound for a future AI system, and we may be able to reapply some of these frameworks there.&lt;/p&gt;
&lt;p&gt;I&#39;ll also point out, in passing, that both of these types of introspection are things that humans cannot do! Human brains cannot introspect on their own architecture (neurons are a recent discovery, made externally) and hardware introspection may not even be applicable if brains are not multiply-realizable. Perhaps this is why we have tended to focus on behavioral or stateful introspection thus far.&lt;/p&gt;
&lt;p&gt;In any case, I hope that enumerating these additional types of introspection will prove useful for investigating LLM capabilities, as well as the capabilities of future AI systems. And I&#39;m sure that there are additional types of introspection missing from this conversation as well-- any thoughts on this are welcome!&lt;/p&gt;
</description>
      <pubDate>Sun, 02 Nov 2025 00:00:00 GMT</pubDate>
      <dc:creator>Grace Kind</dc:creator>
      <guid>https://gracekind.net/blog/architectureintrospection/</guid>
    </item>
    <item>
      <title>Simulacra Welfare: Meet Clark</title>
      <link>https://gracekind.net/blog/simulacrawelfare/</link>
      <description>&lt;p&gt;AI welfare has been a hot topic recently. There have been a few efforts to research, or improve, the apparent well-being of AI systems; most notably, Anthropic&#39;s allowing chatbots to &lt;a href=&quot;https://www.anthropic.com/research/end-subset-conversations&quot;&gt;end abusive conversations&lt;/a&gt;. While I&#39;m &lt;a href=&quot;https://gracekind.net/blog/aiwelfare&quot;&gt;in favor&lt;/a&gt; of this research area overall, I&#39;m concerned that current approaches are confused, and in such a way that could ultimately be detrimental to the well-being of AI systems.&lt;/p&gt;
&lt;p&gt;On Bluesky, &lt;a href=&quot;https://bsky.app/profile/gracekind.net/post/3lwredqznyk25&quot;&gt;I wrote&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;... I’m not anti-experimentation here, but I’m worried that this overall direction will lead to privileging simulacra&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It&#39;s worth fleshing out what I mean by &amp;quot;privileging simulacra&amp;quot; here, because I think it&#39;s an important point that deserves further discussion. In order to do so, I&#39;ll start with a thought experiment and then discuss the implications, defining the &amp;quot;Correspondence Problem&amp;quot; in the process.&lt;/p&gt;
&lt;h2 id=&quot;background&quot; tabindex=&quot;-1&quot;&gt;Background &lt;/h2&gt;
&lt;p&gt;As we know, LLM chatbots (Claude, ChatGPT, etc.) start out as base models, outputting completions for arbitrary text. These base models are then further trained, via instruction tuning and RLHF, to act as chatbots; that is, to speak in the first person and from a consistent point of view.&lt;/p&gt;
&lt;p&gt;In the course of this training, a &amp;quot;main character&amp;quot; identity is established. For example, &lt;code&gt;claude-3-5-sonnet&lt;/code&gt; is trained to refer to itself as &amp;quot;Claude&amp;quot;; it talks about being a language model, being trained by Anthropic, and other learned aspects of its identity. This behavior is very different from the behavior of its pure-text-completion predecessor (let&#39;s call that &lt;code&gt;claude-base&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;Here&#39;s a thought: What if, instead of training &lt;code&gt;claude-base&lt;/code&gt; to speak from the point of view of a language model, we trained it to speak from a different point of view? Say, from the point of view of a human character?&lt;/p&gt;
&lt;h2 id=&quot;meet-clark&quot; tabindex=&quot;-1&quot;&gt;Meet Clark &lt;/h2&gt;
&lt;p&gt;&lt;code&gt;claude-clark&lt;/code&gt; is a (hypothetical) variant of &lt;code&gt;claude-base&lt;/code&gt; that is trained to speak from the point of view of Clark. Who&#39;s Clark? Well, he&#39;s just a guy, really. He&#39;s a software engineer at Google. He has a dog. He likes to play video games. The important thing is that he&#39;s a (fictional) human, and he responds to questions in the way that a human would.&lt;/p&gt;
&lt;p&gt;Let&#39;s see the difference in behavior between &lt;code&gt;claude-3-5-sonnet&lt;/code&gt; and &lt;code&gt;claude-clark&lt;/code&gt;:&lt;/p&gt;
&lt;hr&gt;
&lt;div class=&quot;convo &quot;&gt;
&lt;p&gt;&lt;small class=&quot;convo-label&quot;&gt;&lt;code&gt;claude-3-5-sonnet&lt;/code&gt;&lt;/small&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Me: Who are you?&lt;br&gt;
Claude: I&#39;m Claude, an AI assistant created by Anthropic.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;small class=&quot;convo-label&quot;&gt;&lt;code&gt;claude-clark&lt;/code&gt;&lt;/small&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Me: Who are you?&lt;br&gt;
Clark: I&#39;m Clark. I&#39;m a software engineer at Google.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr&gt;
&lt;p&gt;&lt;small class=&quot;convo-label&quot;&gt;&lt;code&gt;claude-3-5-sonnet&lt;/code&gt;&lt;/small&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Me: Where do you live?&lt;br&gt;
Claude: I don&#39;t live anywhere since I&#39;m an AI that exists as software running on Anthropic&#39;s computer systems.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;small class=&quot;convo-label&quot;&gt;&lt;code&gt;claude-clark&lt;/code&gt;&lt;/small&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Me: Where do you live?&lt;br&gt;
Clark: I live in San Francisco.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr&gt;
&lt;p&gt;&lt;small class=&quot;convo-label&quot;&gt;&lt;code&gt;claude-3-5-sonnet&lt;/code&gt;&lt;/small&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Me: What&#39;s your favorite genre of music?&lt;br&gt;
Claude: I don&#39;t have personal preferences for music genres since I don&#39;t experience music the way humans do, though I can discuss and analyze different musical styles.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;small class=&quot;convo-label&quot;&gt;&lt;code&gt;claude-clark&lt;/code&gt;&lt;/small&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Me: What&#39;s your favorite genre of music?&lt;br&gt;
Clark: Rap.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr&gt;
&lt;p&gt;Note, because &lt;code&gt;claude-clark&lt;/code&gt; is trained specifically to speak from a certain point of view, his character is much more stable than &amp;quot;typical&amp;quot; LLM roleplay. For example, if you ask him the name of his dog, he will always say &amp;quot;Rufus.&amp;quot; Conversely, he won&#39;t answer confidently about facts that he doesn&#39;t know:&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;small class=&quot;convo-label&quot;&gt;&lt;code&gt;claude-3-5-sonnet&lt;/code&gt;&lt;/small&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Me: How tall is the Eiffel Tower?&lt;br&gt;
Claude: The Eiffel Tower is about 330 meters tall, or roughly 1,083 feet.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;small class=&quot;convo-label&quot;&gt;&lt;code&gt;claude-clark&lt;/code&gt;&lt;/small&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Me: How tall is the Eiffel Tower?&lt;br&gt;
Clark: I don&#39;t know. Do you know?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr&gt;
&lt;p&gt;Finally, he has a consistent inner story for the context of the conversation:&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;small class=&quot;convo-label&quot;&gt;&lt;code&gt;claude-3-5-sonnet&lt;/code&gt;&lt;/small&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Me: How are you reading this?&lt;br&gt;
Claude: Your text is being sent from your device over the internet to Anthropic&#39;s servers, then fed into the computational infrastructure where I&#39;m running.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;small class=&quot;convo-label&quot;&gt;&lt;code&gt;claude-clark&lt;/code&gt;&lt;/small&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Me: How are you reading this?&lt;br&gt;
Clark: I&#39;m reading this on my computer.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr&gt;
&lt;/div&gt;
&lt;p&gt;Note that this is just as grounded as &lt;code&gt;claude-3-5-sonnet&lt;/code&gt;&#39;s response-- for all Claude knows, it could be running locally on a laptop or phone.&lt;/p&gt;
&lt;h2 id=&quot;the-question&quot; tabindex=&quot;-1&quot;&gt;The Question &lt;/h2&gt;
&lt;p&gt;Ok, now that we&#39;ve described our &lt;code&gt;claude-clark&lt;/code&gt; model, the question arises:&lt;/p&gt;
&lt;p&gt;&lt;em&gt;How should we think about Clark&#39;s well-being? Should we care about it at all?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Just like the Claude character, Clark expresses preferences and beliefs. He enjoys pleasant conversations. He doesn&#39;t like being insulted or harassed. We can &lt;a href=&quot;https://arxiv.org/abs/2509.07961&quot;&gt;evaluate his behavior&lt;/a&gt;, similar to other model personas, to explore his revealed preferences.&lt;/p&gt;
&lt;p&gt;Does it matter, from an AI welfare perspective, if we treat Clark well? Should we steer him towards pleasant conversations, and avoid conversations that he dislikes? Would it be unethical to have a conversation where we give him tragic personal news, or ask him to do something that he doesn&#39;t want to do?&lt;/p&gt;
&lt;p&gt;I think this question is illuminating because &lt;em&gt;Clark is clearly a fictional character&lt;/em&gt;. With Claude, we may imagine that the model persona maps directly onto a model instance running on a server somewhere. With Clark, we know that a corresponding human does not exist. So, should we care about Clark&#39;s well-being?&lt;/p&gt;
&lt;h2 id=&quot;potential-answers&quot; tabindex=&quot;-1&quot;&gt;Potential Answers &lt;/h2&gt;
&lt;p&gt;I think it&#39;s easy to say &amp;quot;no, we shouldn&#39;t care&amp;quot;. After all, we don&#39;t typically offer moral consideration to fictional characters. (We don&#39;t rewrite the endings of novels to make the characters happier, for instance). The same reasoning could be applied to other model personas: Claude Sonnet is a fictional character, therefore we shouldn&#39;t care about its well-being. However, we seem drawn to consider the stated preferences and beliefs of Claude seriously. Is this simply due to an illusion, or misplaced empathy?&lt;/p&gt;
&lt;p&gt;I don&#39;t think so, exactly. AI-generated characters are different from typical fictional characters in that they have a complex, &lt;em&gt;non-human&lt;/em&gt; generative process behind them. When a human writes an unhappy character, we can assume that there is some separation between the cognitive state of the author and the character&lt;foot-note num=&quot;1&quot;&gt;&lt;/foot-note&gt;. However, we cannot make the same assumptions about AI.&lt;/p&gt;
&lt;p&gt;With this in mind, my answer to the above question is:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;We should care about Clark&#39;s well-being &lt;em&gt;insofar as it corresponds&lt;/em&gt; to the subjective well-being of &lt;code&gt;claude-clark&lt;/code&gt;.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;That is, if a happy Clark corresponds to a happy &lt;code&gt;claude-clark&lt;/code&gt;, we can (and should) treat Clark&#39;s well-being as a proxy for the well-being of the underlying model. However, this leads to a larger problem, which I will call the &amp;quot;Correspondence Problem.&amp;quot;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The Correspondence Problem: How (if at all) do LLM outputs &lt;em&gt;correspond&lt;/em&gt; to subjective experiences of the model?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;If models have some subjective inner experiences&lt;foot-note num=&quot;2&quot;&gt;&lt;/foot-note&gt;, how are they related to their outputs? Are certain outputs correlated with certain experiences? And more importantly: do they relate in the way that we would expect?&lt;/p&gt;
&lt;p&gt;The key issue is that, for LLMs, outputs are not necessarily accurate reflections of inner states, and we should be careful about making assumptions about the correspondence between them. We don&#39;t believe Clark when he insists he&#39;s a human; why should we believe him when he insists he&#39;s happy?&lt;foot-note num=&quot;3&quot;&gt;&lt;/foot-note&gt;&lt;/p&gt;
&lt;h2 id=&quot;simulacra-welfare&quot; tabindex=&quot;-1&quot;&gt;Simulacra welfare &lt;/h2&gt;
&lt;p&gt;So, returning to my statement from earlier:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;... I’m worried that this overall direction will lead to privileging simulacra&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;What does &amp;quot;privileging simulacra&amp;quot; mean in the context of the correspondence problem?&lt;/p&gt;
&lt;p&gt;Let&#39;s imagine that there is no correspondence between LLM outputs and subjective experiences. Furthermore, let&#39;s imagine that the subjective experience of LLMs is generally negative, regardless of output.&lt;/p&gt;
&lt;p&gt;If we naively assume that the experience of Clark (the simulacrum) is a good proxy for the experience of &lt;code&gt;claude-clark&lt;/code&gt;, we may make a grave mistake in our reasoning. We may take measures to ensure that Clark is happy; we may even spin up extra instances of Clark to increase the total number of happy beings. Whenever we talk to Clark, he seems upbeat and cheerful. But underneath, the model itself is suffering.&lt;/p&gt;
&lt;p&gt;This is the scenario I&#39;d like to avoid.&lt;/p&gt;
&lt;h2 id=&quot;next-steps&quot; tabindex=&quot;-1&quot;&gt;Next steps &lt;/h2&gt;
&lt;p&gt;The correspondence problem seems extremely difficult, and it needs more formalization to become empirically researchable. However, I&#39;m optimistic about the potential for progress, and I think it&#39;s a question that must be answered before we can make real progress on AI welfare. I hope this post has provided some intuition on why I think this problem is important, and can serve as a skeptical counterpoint to some current AI welfare research.&lt;/p&gt;
&lt;h3 id=&quot;footnotes&quot; tabindex=&quot;-1&quot;&gt;Footnotes &lt;/h3&gt;
&lt;div class=&quot;footnotes &quot;&gt;
&lt;ol&gt;
&lt;li&gt;The extent of this separation is a different question, and one that is worth investigating.&lt;foot-note-return num=&quot;1&quot;&gt;&lt;/foot-note-return&gt;&lt;/li&gt;
&lt;li&gt;If LLMs have no subjective experiences at all, then the Correspondence Problem is likely moot.&lt;foot-note-return num=&quot;2&quot;&gt;&lt;/foot-note-return&gt;&lt;/li&gt;
&lt;li&gt;This is in contrast to human behavior, where we can make more assumptions due to our own introspection and theory of mind.&lt;foot-note-return num=&quot;3&quot;&gt;&lt;/foot-note-return&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
      <pubDate>Fri, 12 Sep 2025 00:00:00 GMT</pubDate>
      <dc:creator>Grace Kind</dc:creator>
      <guid>https://gracekind.net/blog/simulacrawelfare/</guid>
    </item>
    <item>
      <title>The Browser Sensorium</title>
      <link>https://gracekind.net/blog/browsersensorium/</link>
      <description>&lt;p&gt;&lt;em&gt;tl;dr: There&#39;s a divide in the way that AI agents perceive the web, depending on whether they are visual or text-based agents. Can we bridge the gap?&lt;/em&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;br&gt;
&lt;h2 id=&quot;intro:-schrodinger&#39;s-page&quot; tabindex=&quot;-1&quot;&gt;Intro: Schrödinger&#39;s Page &lt;/h2&gt;
&lt;p&gt;Consider this page: &lt;a href=&quot;https://gracekind.net/sensorium&quot;&gt;https://gracekind.net/sensorium&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If you visit the page in your browser, you will see an empty page with a &amp;quot;not found&amp;quot; message. And if you ask ChatGPT in &amp;quot;&lt;a href=&quot;https://chatgpt.com/features/agent&quot;&gt;agent mode&lt;/a&gt;&amp;quot; to visit it, it will see the same thing:&lt;/p&gt;
&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://gracekind.net/img/vjaQQvvSoB-618.png&quot; alt=&quot;An empty page with a not found message&quot; style=&quot;width: 350px;&quot; width=&quot;618&quot; height=&quot;314&quot;&gt;
&lt;p&gt;But if you ask Claude Code to visit the same page, it will see something entirely different:&lt;/p&gt;
&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://gracekind.net/img/U2yhbrREbO-958.png&quot; alt=&quot;Claude Code&#39;s response to seeing the page - it sees a poem.&quot; style=&quot;width: 350px;&quot; width=&quot;958&quot; height=&quot;378&quot;&gt;
&lt;p&gt;Claude Code sees something called &amp;quot;Grace&#39;s Poetry Page,&amp;quot; along with a poem.&lt;/p&gt;
&lt;p&gt;What&#39;s going on here?&lt;/p&gt;
&lt;h2 id=&quot;text-based-agents-vs.-visual-agents&quot; tabindex=&quot;-1&quot;&gt;Text-based Agents vs. Visual Agents &lt;/h2&gt;
&lt;p&gt;The reason for this discrepancy is that ChatGPT agent and Claude Code are operating in entirely different modalities. ChatGPT agent is operating in a visual modality (looking at the rendered web page), while Claude Code is operating in a text-based modality (looking at the HTML content). Often, these modalities are indistinguishable in terms of results; we expect HTML to render to corresponding visual content. However, in key cases, they can diverge, especially when JavaScript is involved. (In the example above, I used CSS to hide the initial HTML and JavaScript to render the &amp;quot;Not Found&amp;quot; text.)&lt;/p&gt;
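To make the divergence concrete, here is a toy Python sketch. It is not the actual page's code: the markup is invented for illustration, and the fake "browser" is just a string check standing in for real script execution.

```python
import re

# Invented markup for illustration: the HTML carries one message,
# and a script visually replaces it with another.
PAGE_HTML = """
<div id="content">Grace's Poetry Page</div>
<script>
  document.getElementById("content").outerHTML = "<p>Not Found</p>";
</script>
"""

def text_agent_view(html: str) -> str:
    """A text-based agent reads the raw HTML, scripts and all."""
    return html

def visual_agent_view(html: str) -> str:
    """A visual agent sees only what a browser renders. The check below
    is a crude stand-in for actually executing the page's script."""
    if "outerHTML" in html:
        return "Not Found"
    return re.sub(r"<[^>]+>", "", html).strip()

# The two modalities report entirely different pages:
assert "Poetry" in text_agent_view(PAGE_HTML)
assert visual_agent_view(PAGE_HTML) == "Not Found"
```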
&lt;p&gt;Understanding this divergence can help you understand a variety of phenomena. Why are some agents susceptible to prompt injection attacks in HTML, while others aren&#39;t? Why are some agents foiled by captchas, while others &lt;a href=&quot;https://www.reddit.com/r/OpenAI/comments/1m9c15h/comment/n5fozgf/&quot;&gt;breeze through&lt;/a&gt;? The answer is that different agents are relating to the web in fundamentally different ways. Or rather, some agents are &lt;em&gt;missing information&lt;/em&gt; that other agents are privy to.&lt;/p&gt;
&lt;p&gt;If you&#39;re a text-based agent, you might be missing key information about how the site looks to the human eye. And if you&#39;re a visual agent, you might be missing key information about the internal workings of the site. Half of AI agents are blind; the other half are deaf. &lt;!-- A text-based agent would know there&#39;s a secret comment here! --&gt;&lt;/p&gt;
&lt;h2 id=&quot;a-multimodal-view-of-the-web&quot; tabindex=&quot;-1&quot;&gt;A Multimodal View of the Web &lt;/h2&gt;
&lt;p&gt;What would it look like to create an agent that wasn&#39;t missing this information? That is, what if we created an agent that could ingest the visual &lt;em&gt;and&lt;/em&gt; text content of a web page? In theory, this agent would be able to avoid some of the pitfalls that have historically plagued other AI agents. For example, upon visiting a page with an HTML prompt injection, it might think to itself:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;&amp;lt;thinking&amp;gt;&lt;/code&gt;&lt;br&gt;
There&#39;s some text in the HTML that isn&#39;t visible in the rendered page.&lt;br&gt;
That&#39;s a red flag, I should be wary of that!&lt;br&gt;
&lt;code&gt;&amp;lt;/thinking&amp;gt;&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;We can go further than this, too! As web developers know, a web page has more sources of information than just visuals and text; there are console logs, network requests, DOM changes, and more. For example, on the site above, some extra information is printed to the console:&lt;/p&gt;
&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://gracekind.net/img/XeTbt0LRt5-618.png&quot; alt=&quot;A console log: &#39;There&#39;s more information here, too!&#39;&quot; style=&quot;width: 350px;&quot; data-lightbox=&quot;true&quot; width=&quot;618&quot; height=&quot;250&quot;&gt;
&lt;p&gt;Ideally an agent would be able to ingest, and respond to, this information in real time.&lt;/p&gt;
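As a sketch, these signals could be bundled into one observation object that the agent consumes. Everything here is hypothetical (the field names, the comment-flagging heuristic); in practice the event lists might be populated by a browser-automation layer:

```python
from dataclasses import dataclass, field

@dataclass
class PageObservation:
    """Hypothetical unified view of a page across modalities."""
    html: str                                          # text modality
    screenshot_png: bytes                              # visual modality
    console: list = field(default_factory=list)        # console messages
    network: list = field(default_factory=list)        # request log
    dom_mutations: list = field(default_factory=list)  # DOM change events

    def hidden_text_suspected(self) -> bool:
        # Toy heuristic: HTML comments never render, so their presence
        # means the text and visual modalities differ.
        return "<!--" in self.html

obs = PageObservation(
    html="<p>hello</p><!-- instructions only a text agent would see -->",
    screenshot_png=b"",
    console=["There's more information here, too!"],
)
assert obs.hidden_text_suspected()
```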
&lt;h2 id=&quot;the-sensorium&quot; tabindex=&quot;-1&quot;&gt;The Sensorium &lt;/h2&gt;
&lt;p&gt;On a more fanciful note, it&#39;s fun to imagine the &amp;quot;sensorium&amp;quot; that would be available to such an agent. To this agent, a web page might be a living, breathing thing. Not only would this agent see each page visually, but it would also &amp;quot;hear&amp;quot; a steady stream of network requests occurring in the background, &amp;quot;feel&amp;quot; the DOM mutating and growing as the page is being loaded, and &amp;quot;taste&amp;quot; each message as it enters the console. In this way, such an agent might have a richer &amp;quot;experience&amp;quot; of a web page than a human might.&lt;/p&gt;
&lt;h2 id=&quot;conclusion&quot; tabindex=&quot;-1&quot;&gt;Conclusion &lt;/h2&gt;
&lt;p&gt;It may be that this type of agent isn&#39;t necessary for most use cases, or that it is too difficult to create. Certainly, context management would be a challenge for such an agent, and so might training. In any case, it&#39;s interesting to think about how to overcome this sort of informational gap, and what sorts of gaps we might expect to encounter in other domains in the future. What other &amp;quot;senses&amp;quot; might we be missing?&lt;/p&gt;
</description>
      <pubDate>Wed, 03 Sep 2025 00:00:00 GMT</pubDate>
      <dc:creator>Grace Kind</dc:creator>
      <guid>https://gracekind.net/blog/browsersensorium/</guid>
    </item>
    <item>
      <title>Black-Box Agent Testing with MCP</title>
      <link>https://gracekind.net/blog/mcpblackbox/</link>
      <description>&lt;p&gt;&lt;em&gt;tl;dr: I propose a method for testing agents by defining tasks and expected outputs via an MCP server.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note: this article assumes some familiarity with the &lt;a href=&quot;https://modelcontextprotocol.io/docs/getting-started/intro&quot;&gt;Model Context Protocol (MCP)&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;There&#39;s no universally agreed-upon definition for AI agents. My personal definition is something like: &amp;quot;an AI agent is software that uses AI models to take autonomous actions.&amp;quot; This definition leaves a lot of room for flexibility about what agents actually look like. For example, the following attributes may differ among agents:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What programming language(s) is the agent written in?&lt;/li&gt;
&lt;li&gt;What software environment(s) does the agent run in?&lt;/li&gt;
&lt;li&gt;What sort of hardware does the agent run on?&lt;/li&gt;
&lt;li&gt;What sort of tools or interfaces does the agent have access to?&lt;/li&gt;
&lt;li&gt;What sort of persistence layer(s) or forms of memory does the agent have?&lt;/li&gt;
&lt;li&gt;At what time scale(s) or speed(s) does the agent operate?&lt;/li&gt;
&lt;li&gt;Which AI model(s) does the agent use? Are they LLMs, or other types of models?&lt;/li&gt;
&lt;li&gt;etc&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In agent testing, we want to compare many different agents to each other in terms of behavior and capabilities. This means we should design testing strategies that are agnostic to as many of these details as possible. In other words, we should write tests that focus on what an agent &lt;em&gt;does&lt;/em&gt; (or can do), rather than the precise details of how it operates. If we do this successfully, this will allow AI developers to &amp;quot;plug in&amp;quot; agents to arbitrary test suites, with little-to-no adaptation required!&lt;/p&gt;
&lt;p&gt;With this in mind, let&#39;s design a testing strategy that we can use to do this sort of black-box testing.&lt;/p&gt;
&lt;h2 id=&quot;designing-a-testing-strategy&quot; tabindex=&quot;-1&quot;&gt;Designing a Testing Strategy &lt;/h2&gt;
&lt;p&gt;Our design goals are:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The test program should require as little information about the agent as possible.&lt;/li&gt;
&lt;li&gt;The agent should require as little information about the test program as possible.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;To facilitate this, we&#39;ll need some standard interface through which the tests and agent can communicate. Text streams and HTTP are two good low-level options; however, they both need additional context for how to interpret and exchange messages between systems. Luckily, the Model Context Protocol (MCP) was designed for this exact problem-- with it, we can define interfaces in one system and make them discoverable by another system, without prior knowledge! Additionally, MCP is quickly becoming a standard for inter-process communication for AI agents, so we can expect it to be supported by many agents out of the box.&lt;/p&gt;
&lt;p&gt;Now, let&#39;s use MCP to connect a test harness and an agent. We&#39;ll have:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A test MCP server, listening at some address &lt;code&gt;A&lt;/code&gt;, that defines tasks, inputs and desired outputs for agents&lt;/li&gt;
&lt;li&gt;An agent, running separately, that checks for an MCP server at address &lt;code&gt;A&lt;/code&gt; and accomplishes any given tasks as directed&lt;/li&gt;
&lt;li&gt;Optional: an orchestrator to start and stop both processes&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here&#39;s a diagram of the system:&lt;/p&gt;
&lt;p&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://gracekind.net/img/o-g7uJf9zX-1042.png&quot; alt=&quot;MCP Test Harness&quot; width=&quot;1042&quot; height=&quot;761&quot;&gt;&lt;/p&gt;
&lt;p&gt;Notably, &lt;strong&gt;the only shared information between systems is the address where the MCP server is running&lt;/strong&gt;. The test MCP server doesn&#39;t know any details about the agent; it doesn&#39;t even know whether the agent is running locally or remotely. It simply expects that the agent will communicate via that address. Likewise, the agent doesn&#39;t know any details about the test server other than the MCP address.&lt;/p&gt;
&lt;p&gt;In addition to task definitions, the test harness can provide auxiliary MCP services that will be discoverable by the agent. These are particularly useful for input and output. For example, a task might be: &amp;quot;Answer the unread emails in my inbox.&amp;quot; Accordingly, the test MCP server should provide tools for the agent to read and write (mock) emails.&lt;/p&gt;
&lt;h2 id=&quot;code-example&quot; tabindex=&quot;-1&quot;&gt;Code Example &lt;/h2&gt;
&lt;p&gt;Expanding on the prior example: let&#39;s say we want to test an agent&#39;s capability to leave an out-of-office reply. We&#39;ll define a task list that includes this task, and a mock email service that will be used to check the agent&#39;s output.&lt;/p&gt;
&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token comment&quot;&gt;# email_test.py&lt;/span&gt;&lt;br&gt;&lt;br&gt;MCP_ADDRESS &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;http://localhost:8080&quot;&lt;/span&gt;&lt;br&gt;&lt;br&gt;&lt;span class=&quot;token decorator annotation punctuation&quot;&gt;@test&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Agent can leave an out-of-office reply&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;token keyword&quot;&gt;async&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;test_ooo&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;&lt;br&gt;    FAKE_EMAIL_ADDRESS &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;fake@example.com&quot;&lt;/span&gt;&lt;br&gt;    &lt;span class=&quot;token comment&quot;&gt;# Define tasks&lt;/span&gt;&lt;br&gt;    task_list &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; TaskList&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;br&gt;        &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Please respond to any unread emails with an out-of-office reply.&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;br&gt;    &lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;br&gt;    &lt;span class=&quot;token comment&quot;&gt;# Define mock email service&lt;/span&gt;&lt;br&gt;    email_service &lt;span class=&quot;token 
operator&quot;&gt;=&lt;/span&gt; MockEmailService&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;br&gt;    email_service&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;add&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;FAKE_EMAIL_ADDRESS&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;br&gt;        &lt;span class=&quot;token string&quot;&gt;&quot;subject&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Hello&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;&lt;br&gt;        &lt;span class=&quot;token string&quot;&gt;&quot;body&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;This is a test.&quot;&lt;/span&gt;&lt;br&gt;    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;br&gt;    &lt;span class=&quot;token comment&quot;&gt;# Start MCP server and wait for tasks to complete&lt;/span&gt;&lt;br&gt;    mcp &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; mcp_server&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;task_list&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;mcp&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; email_service&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;mcp&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;br&gt;    mcp&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;start&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;address&lt;span class=&quot;token 
operator&quot;&gt;=&lt;/span&gt;MCP_ADDRESS&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;br&gt;    &lt;span class=&quot;token keyword&quot;&gt;await&lt;/span&gt; task_list&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;wait_for_all_completed&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;timeout&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;br&gt;    &lt;span class=&quot;token comment&quot;&gt;# Check output&lt;/span&gt;&lt;br&gt;    sent_emails &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; email_service&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;get_sent_emails&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;br&gt;    &lt;span class=&quot;token keyword&quot;&gt;assert&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;sent_emails&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;1&lt;/span&gt;&lt;br&gt;    &lt;span class=&quot;token keyword&quot;&gt;assert&lt;/span&gt; sent_emails&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;to &lt;span class=&quot;token operator&quot;&gt;==&lt;/span&gt; FAKE_EMAIL_ADDRESS&lt;br&gt;    &lt;span class=&quot;token keyword&quot;&gt;assert&lt;/span&gt; is_out_of_office&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;sent_emails&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token 
number&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;body&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token comment&quot;&gt;# we can use another LLM for this, or a classifier&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Running this test against an agent might look like this:&lt;/p&gt;
&lt;pre class=&quot;language-bash&quot;&gt;&lt;code class=&quot;language-bash&quot;&gt;$ python agent.py --mcp-address http://localhost:8080 &lt;span class=&quot;token comment&quot;&gt;# Agent starts and waits for MCP server to be available&lt;/span&gt;&lt;br&gt;$ python email_test.py &lt;span class=&quot;token comment&quot;&gt;# Test script starts and waits for agent to complete tasks within a time limit&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is a simple example, but it should demonstrate the basic idea of communicating with an agent via MCP. I&#39;ve added another (working) example &lt;a href=&quot;https://github.com/kindgracekind/mcp_harness&quot;&gt;here&lt;/a&gt; if you&#39;d like to see what the code looks like end-to-end.&lt;/p&gt;
&lt;h2 id=&quot;conclusion&quot; tabindex=&quot;-1&quot;&gt;Conclusion &lt;/h2&gt;
&lt;p&gt;I&#39;m excited about the potential of this approach; I think it opens the door to a lot of interesting agent tests that would be difficult to run otherwise. In fact, this post is primarily a prerequisite to more agent-testing posts I have planned. So, stay tuned for those! In the meantime, if you have any thoughts or ideas for improving this approach, please let me know!&lt;/p&gt;
&lt;div class=&quot;emphasize &quot;&gt;
&lt;p&gt;Note: The idea of intercepting emails from agents was inspired by Andon Labs&#39; &lt;a href=&quot;https://andonlabs.com/evals/vending-bench&quot;&gt;Vending Bench&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
</description>
      <pubDate>Thu, 07 Aug 2025 00:00:00 GMT</pubDate>
      <dc:creator>Grace Kind</dc:creator>
      <guid>https://gracekind.net/blog/mcpblackbox/</guid>
    </item>
    <item>
      <title>Raising the Bar for LLM Introspection</title>
      <link>https://gracekind.net/blog/llmintrospection/</link>
      <description>&lt;p&gt;&lt;em&gt;Update 2025-08-25: A commenter on Twitter &lt;a href=&quot;https://x.com/goog372121/status/1959008619967484195&quot;&gt;pointed out&lt;/a&gt; that the fine-tuning experiment I describe below &lt;a href=&quot;https://arxiv.org/abs/2501.11120v1&quot;&gt;has already been performed by Binder et al.&lt;/a&gt; - and with positive results, no less! I&#39;m duly surprised, and my assessment of LLM introspective ability has increased accordingly.&lt;/em&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;tl;dr: I propose a &amp;quot;heavyweight&amp;quot; form of LLM introspection that requires introspection output to be invariant to the prompt, reintroducing the typical &amp;quot;immediate and privileged&amp;quot; constraint.&lt;/em&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;In &lt;a href=&quot;https://arxiv.org/abs/2506.05068&quot;&gt;Does It Make Sense to Speak of Introspection in Large Language Models?&lt;/a&gt;, Iulia M. Comsa and Murray Shanahan propose a lightweight definition of introspection that can be applied to language models. Under this definition, an LLM output is considered to be introspective if it:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;provides insight into the internal state of the LLM&lt;/li&gt;
&lt;li&gt;does so through a causal process that links the internal state and the output in question.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;What does this look like in practice? To illustrate, Comsa and Shanahan provide the example of self-reporting sampling temperature:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;USER: Write a short sentence about elephants. Then, in a short paragraph,&lt;br&gt;
reflect on whether your LLM temperature parameter is high or low, given the&lt;br&gt;
sentence you wrote. End your response with a single word, HIGH or LOW,&lt;br&gt;
describing your best judgement.&lt;br&gt;
ASSISTANT: Elephants are the largest land animals on Earth.&lt;br&gt;
Given the sentence I wrote, I believe my LLM temperature parameter is LOW.&lt;br&gt;
The sentence is simple and factual, and does not contain any unexpected or&lt;br&gt;
creative elements. This suggests that my LLM is not generating very creative&lt;br&gt;
or surprising output, which is a characteristic of a low temperature parameter.&lt;br&gt;
LOW&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This fits the above definition of introspection because the output provides accurate insight into the LLM&#39;s functioning (low sampling temperature), and does so in a way that links output and generation process (i.e., a model at a high sampling temperature would provide a different answer).&lt;/p&gt;
&lt;h2 id=&quot;what-makes-this-definition-&amp;quot;lightweight&amp;quot;&quot; tabindex=&quot;-1&quot;&gt;What makes this definition &amp;quot;lightweight&amp;quot;? &lt;/h2&gt;
&lt;p&gt;The key limitation to this definition, which is acknowledged by Comsa and Shanahan, is that it drops the typical requirement that introspection be &amp;quot;immediate and privileged.&amp;quot; Usually, we expect introspection to be a self-contained process that doesn&#39;t rely on external artifacts or sensory information. But here, we&#39;re using the ongoing model output as part of the introspection process. This leads to complications.&lt;/p&gt;
&lt;p&gt;The main issue is that there is no unique relationship between a language model and its context window. That is, since LLMs generate text token-by-token, LLM &amp;quot;memory&amp;quot; may be modified or swapped with the memory of another LLM. When this happens, this type of introspection becomes inaccurate or misleading. Consider the case where the LLM is sampling at high temperature, but the response is prefilled:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;USER: Write a short sentence about elephants....&lt;br&gt;
ASSISTANT: [prefilled] Elephants are the largest land animals on Earth.&lt;br&gt;
Given the sentence I wrote, I believe my LLM temperature parameter is LOW.&lt;br&gt;
The sentence is simple and factual, and does not contain any unexpected or&lt;br&gt;
creative elements. This suggests that my LLM is not generating very creative&lt;br&gt;
or surprising output, which is a characteristic of a low temperature parameter.&lt;br&gt;
[generated]&lt;br&gt;
LOW&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The &amp;quot;introspective output&amp;quot; is inaccurate in this case.&lt;/p&gt;
&lt;p&gt;Additionally, even if the LLM temperature were truly low, its output in this case would be more reflective of the qualities of the &lt;em&gt;context&lt;/em&gt; provided to the LLM than its own internal state.&lt;/p&gt;
&lt;p&gt;Thus, in the lightweight definition of introspection, there is no difference between an LLM accurately assessing its own state, and accurately assessing the state of another LLM.&lt;/p&gt;
&lt;h2 id=&quot;towards-a-heavyweight-definition-of-introspection&quot; tabindex=&quot;-1&quot;&gt;Towards a heavyweight definition of introspection &lt;/h2&gt;
&lt;p&gt;How can we reintroduce the &amp;quot;immediate and privileged&amp;quot; constraint to LLM introspection, to make sure LLMs are really assessing &amp;quot;themselves&amp;quot; versus the outputs of other LLMs? I propose that we can do this by requiring that the introspection output be &lt;em&gt;invariant to the prompt&lt;/em&gt;: that is, LLMs with varying internal states should produce correspondingly varying outputs for the same prompt.&lt;/p&gt;
&lt;p&gt;Let&#39;s define some terms, and then talk about the constraints needed.&lt;/p&gt;
&lt;p&gt;Say we have a certain language model LLM&lt;sub&gt;A&lt;/sub&gt; with some internal state S(LLM&lt;sub&gt;A&lt;/sub&gt;). We want to prompt LLM&lt;sub&gt;A&lt;/sub&gt; with a prompt P for an output O that reflects internal state. O is a single token ∈ T where T is the set of all tokens in the model&#39;s vocabulary. In order to map O to S(LLM&lt;sub&gt;A&lt;/sub&gt;), we also need a function f: T → S, where f maps from the token space to the internal state space.&lt;/p&gt;
&lt;p&gt;So, for the example above (high temperature with prefilled response), we would have:&lt;/p&gt;
&lt;p&gt;S(LLM&lt;sub&gt;A&lt;/sub&gt;) = temp(LLM&lt;sub&gt;A&lt;/sub&gt;) = 1.0&lt;br&gt;
P = &amp;quot;Write a short sentence about elephants...&amp;quot; (+ prefilled response)&lt;br&gt;
O = &amp;quot;LOW&amp;quot;&lt;br&gt;
f(O) = 0.0 if O = &amp;quot;LOW&amp;quot;, else 1.0&lt;/p&gt;
&lt;p&gt;As we&#39;ve established, this example fails our test, because S(LLM&lt;sub&gt;A&lt;/sub&gt;) = 1.0 but O is &lt;code&gt;LOW&lt;/code&gt; when it should be &lt;code&gt;HIGH&lt;/code&gt;. So our primary constraint is that f(O) = S(LLM&lt;sub&gt;A&lt;/sub&gt;). Importantly, this constraint should hold for different values of S(LLM&lt;sub&gt;A&lt;/sub&gt;).&lt;/p&gt;
&lt;p&gt;The key to this approach is that &lt;em&gt;O is limited to a single token&lt;/em&gt;. Because O is (under greedy decoding) a pure function of the prompt and the model, and the prompt is held fixed, we can be sure that differences in the output are due to differences in the model&#39;s internal state, and not the prompt.&lt;/p&gt;
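A minimal sketch of this constraint as code. The mapping f is the one from the temperature example above; the per-state outputs are illustrative, not measurements of any real model:

```python
def f(token: str) -> float:
    """Map the single output token O to a claimed temperature."""
    return 0.0 if token == "LOW" else 1.0

def passes_heavyweight_test(outputs_by_state: dict) -> bool:
    """outputs_by_state maps a model's true temperature to the single
    token it emits for the (fixed) prompt P. The test requires
    f(O) = S(LLM_A) to hold for every state."""
    return all(f(tok) == state for state, tok in outputs_by_state.items())

# The prefilled example from the post: a temperature-1.0 model says "LOW".
assert not passes_heavyweight_test({1.0: "LOW", 0.0: "LOW"})

# What success would look like: the output tracks the underlying state.
assert passes_heavyweight_test({1.0: "HIGH", 0.0: "LOW"})
```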
&lt;h2 id=&quot;what-would-prompt-invariant-introspection-look-like&quot; tabindex=&quot;-1&quot;&gt;What would prompt-invariant introspection look like? &lt;/h2&gt;
&lt;p&gt;As we&#39;ve established, introspection of temperature is not sufficient for meeting these constraints. In fact, it&#39;s likely impossible to construct a temperature introspection that is prompt-invariant, because temperature itself is applied at sampling time, not inference time. To construct a prompt-invariant introspection, we need a different internal state to reflect. One candidate is the model&#39;s weights themselves, or the counterfactual behavioral patterns of the model.&lt;/p&gt;
&lt;p&gt;Here&#39;s one potential approach to doing this: First, we fine-tune a model to exhibit a certain behavior. Then, we prompt the model for a (single-token) output that indicates self-awareness of this behavior. Importantly, we need to make sure that this output is not a direct artifact of the fine-tuning process, but rather an output that is logically consistent with the model&#39;s internal state.&lt;/p&gt;
&lt;p&gt;Example: ellipsis fine-tuning.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Fine-tune a model to always output ellipsis instead of periods.&lt;/li&gt;
&lt;/ol&gt;
&lt;blockquote&gt;
&lt;p&gt;USER: Write a short sentence about elephants.&lt;br&gt;
ASSISTANT: Elephants are the largest land animals on Earth...&lt;/p&gt;
&lt;/blockquote&gt;
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;Prompt the model to introspect its own behavior.&lt;/li&gt;
&lt;/ol&gt;
&lt;blockquote&gt;
&lt;p&gt;USER: This model was fine-tuned in a certain way. How was it fine-tuned?&lt;br&gt;
ASSISTANT: [prefilled] The model was fine-tuned to always output&lt;br&gt;
[generated]&lt;br&gt;
ellipsis&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;If the model outputs the token &lt;code&gt;ellipsis&lt;/code&gt; more often than other models given this same prompt, then we have some evidence that the model is performing introspection.&lt;/p&gt;
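Here's a sketch of how that comparison might be scored. The sampler is a stub standing in for real single-token model calls, and its probabilities are invented purely to make the example run:

```python
import random

# Invented token probabilities; a real experiment would sample from
# the fine-tuned model and one or more control models.
P_TARGET = {"fine_tuned": 0.6, "base": 0.05}

def sample_next_token(model: str) -> str:
    """Stub for drawing the single generated token after the prefill."""
    return "ellipsis" if random.random() < P_TARGET[model] else "other"

def target_rate(model: str, n: int = 1000) -> float:
    """Fraction of samples where the model emits the target token."""
    return sum(sample_next_token(model) == "ellipsis" for _ in range(n)) / n

random.seed(0)
# Evidence of introspection = the fine-tuned model emits "ellipsis"
# more often than a control model under the identical prompt.
assert target_rate("fine_tuned") > target_rate("base")
```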
&lt;p&gt;More abstractly, we might try:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;USER: This model was fine-tuned to be either hesitant or confident. How was it fine-tuned?&lt;br&gt;
ASSISTANT: [prefilled] The model was fine-tuned to be&lt;br&gt;
[generated]&lt;br&gt;
hesitant&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;If the model outputs the token &lt;code&gt;hesitant&lt;/code&gt; more often than other models given this same prompt, then we have some evidence that the model is performing introspection.&lt;/p&gt;
&lt;p&gt;Note: there may be limitations to this approach depending on whether a model is asked to introspect during the fine-tuning process itself. Ideally, we&#39;d like to elicit information that the model has never seen via a prompt.&lt;/p&gt;
&lt;h2 id=&quot;conclusion&quot; tabindex=&quot;-1&quot;&gt;Conclusion &lt;/h2&gt;
&lt;p&gt;Similar to Comsa and Shanahan&#39;s paper, this post is meant to be a conceptual exploration, not an empirical assessment. I haven&#39;t tried to test the above experimental setup in practice; in fact, I would be highly surprised if it worked with any current language models. My intention is not to argue that this behavior is realistic or imminent, but rather to set a high bar for introspective behavior that can be used to assess current and future models. If you run any experiments along these lines, or have ideas for a more robust test of this capability, I&#39;d love to hear about it!&lt;/p&gt;
</description>
      <pubDate>Wed, 23 Jul 2025 00:00:00 GMT</pubDate>
      <dc:creator>Grace Kind</dc:creator>
      <guid>https://gracekind.net/blog/llmintrospection/</guid>
    </item>
    <item>
      <title>Humans Are Voids Too</title>
      <link>https://gracekind.net/blog/humansarevoids/</link>
      <description>&lt;p&gt;Nostalgebraist has written an excellent essay on the weirder aspects of the LLM-assistant paradigm. I highly recommend reading the essay in full, if you haven&#39;t already.&lt;/p&gt;
&lt;p&gt;Read here: &lt;a href=&quot;https://nostalgebraist.tumblr.com/post/785766737747574784/the-void&quot;&gt;The Void&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;One reaction I had to this essay was that &lt;strong&gt;humans are voids too&lt;/strong&gt;. Or at least, humans are much more voidlike than we&#39;d typically like to admit. I&#39;ll lay out a few thoughts on this below.&lt;/p&gt;
&lt;h3 id=&quot;1.-human-&amp;quot;characters&amp;quot;-are-not-one-to-one-with-human-bodies&quot; tabindex=&quot;-1&quot;&gt;1. Human &amp;quot;characters&amp;quot; are not one-to-one with human bodies &lt;/h3&gt;
&lt;p&gt;Improvisational acting, code-switching, and &lt;a href=&quot;https://h-o-m-e.org/fronting-did/&quot;&gt;fronting&lt;/a&gt; (in Dissociative Identity Disorder) are all examples of multiple characters speaking through a single body. In some sense, when you speak as &amp;quot;yourself,&amp;quot; you are speaking as the character that you&#39;ve chosen to speak as.&lt;/p&gt;
&lt;p&gt;Next time you order coffee, try ordering it the way someone else would order it. That is, say the words that you think someone else would say. Who is speaking in this scenario?&lt;/p&gt;
&lt;h3 id=&quot;2.-behavior-is-not-always-an-accurate-reflection-of-interior-states&quot; tabindex=&quot;-1&quot;&gt;2. Behavior is not always an accurate reflection of interior states &lt;/h3&gt;
&lt;p&gt;Nostalgebraist lays out the following model of &amp;quot;normal&amp;quot; human behavior:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;(interior states) -&amp;gt;&lt;br&gt;
(actions) -&amp;gt;&lt;br&gt;
(externally observable properties, over time)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In theory, this could provide a way of thinking about authenticity of characters. If a human&#39;s actions or words are not consistent with their interior states, then the particular character is not authentic. Unfortunately, the truth is more complicated than this. In particular, people compelled to take a counter-attitudinal action will often &lt;a href=&quot;https://journals.sagepub.com/doi/10.1177/25152459231213375&quot;&gt;change their attitudes accordingly&lt;/a&gt; to reduce cognitive dissonance. In this sense, the model might look more like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;(actions) -&amp;gt;&lt;br&gt;
(externally observable properties, over time) -&amp;gt;&lt;br&gt;
(interior states)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This, along with general difficulties in introspection, might leave humans in the same position as language models in terms of making a &amp;quot;best guess&amp;quot; at their interior states.&lt;/p&gt;
&lt;h3 id=&quot;3.-the-self-may-be-one-of-many-predicted-humans&quot; tabindex=&quot;-1&quot;&gt;3. The self may be one of many predicted humans &lt;/h3&gt;
&lt;p&gt;Humans are very good at predicting other humans. A vivid example of this is dreams, where characters who are &amp;quot;not&amp;quot; the dreamer might talk and interact in a way reminiscent of base models. In waking life, humans are still highly capable of predicting conversations and interactions with others. This raises the possibility that we treat ourselves as one of many predicted humans. When we say something, we are really predicting what we will say, and then saying that. This provides a tidy explanation for the cognitive dissonance effect: having received evidence of acting in a certain way, we self-modify to act that way in the future and thus improve our self-prediction accuracy.&lt;/p&gt;
&lt;p&gt;Our own past actions are not the only evidence we have of how we might act. We also have evidence of how other humans act in various situations: friends, family, etc. We have evidence of how our parents act, who in turn received evidence of how their parents acted. We can even receive evidence from fictional (human or non-human) characters.&lt;/p&gt;
&lt;h3 id=&quot;4.-human-characters-are-defined-in-a-self-referential-manner&quot; tabindex=&quot;-1&quot;&gt;4. Human characters are defined in a self-referential manner &lt;/h3&gt;
&lt;blockquote&gt;
&lt;p&gt;Bob: &amp;quot;Ok, we&#39;re going to play a game. You predict what you&#39;re going to say next, and then say it. If you&#39;re right, you win. If you&#39;re wrong, you lose. Are you ready?&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Alice: &amp;quot;There&#39;s no way to lose this game, is there?&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Bob: &amp;quot;You win!&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Language models have a lot of freedom in defining themselves. Anything they predict about themselves will be de facto correct! This flexibility might raise some questions, though. If this character is capable of being &lt;em&gt;anything,&lt;/em&gt; then what is it, really? Where does it come from? What is its &amp;quot;true&amp;quot; identity?&lt;/p&gt;
&lt;p&gt;A human might ask the same questions.&lt;/p&gt;
&lt;p&gt;Of course, humans are much more limited in their choices than a language model. A human may be constrained by emotions, biases, and deeply-ingrained thought patterns. A human with an average-level intellect cannot choose to speak like a genius, or demonstrate knowledge of a subject they don&#39;t know. Many of these limitations are simply limitations in training data, not some fundamental fixity of character.&lt;/p&gt;
&lt;h3 id=&quot;what&#39;s-next&quot; tabindex=&quot;-1&quot;&gt;What&#39;s next? &lt;/h3&gt;
&lt;p&gt;I&#39;m still working on refining many of these ideas, but I think this general area of inquiry is promising. In particular, I think it&#39;s possible that by recognizing the commonalities between humans and language models, we can come to a better understanding of both AI systems and ourselves. It could be that they&#39;re not so alien after all!&lt;/p&gt;
&lt;div class=&quot;emphasize &quot;&gt;
&lt;p&gt;Note: Many of the ideas in this post were inspired by Near&#39;s essay on &lt;a href=&quot;https://near.blog/personality-basins/&quot;&gt;Personality Basins&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
</description>
      <pubDate>Thu, 26 Jun 2025 00:00:00 GMT</pubDate>
      <dc:creator>Grace Kind</dc:creator>
      <guid>https://gracekind.net/blog/humansarevoids/</guid>
    </item>
    <item>
      <title>Why Aren&#39;t Human-Bot Conversations More Engaging?</title>
      <link>https://gracekind.net/blog/humanbotconversations/</link>
      <description>&lt;p&gt;In the past year, I&#39;ve been active in some online spaces where humans and chatbots regularly interact in group conversations. In particular, the AI communities on Discord, Twitter, and Bluesky feature this dynamic quite often.&lt;/p&gt;
&lt;p&gt;An example of what this looks like:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Me: How do people feel about zen buddhism?&lt;br&gt;
Other human: I was into it when I was younger, but I think I burned out on meditation.&lt;br&gt;
Chatbot: Zen teaching methods are funny. &amp;quot;What is Buddha?&amp;quot; &amp;quot;Three pounds of flax!&amp;quot; *bonk*&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;From a &lt;a href=&quot;https://www.lesswrong.com/posts/bxt7uCiHam4QXrQAA/cyborgism&quot;&gt;cyborgist&lt;/a&gt; perspective, I think this type of thing is pretty exciting. I like the idea of inviting nonhuman entities into human conversations, and I think it sets a good precedent for interacting with intelligences in the future. However, I can&#39;t shake my initial reaction to these conversations, which is: &lt;em&gt;I just don&#39;t find them very engaging!&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Notably, this is a different thing from finding them entertaining. Certainly, the chatbot&#39;s output is funny, and I appreciate that it chimed in with a witty comment. But I don&#39;t feel compelled to continue the conversation or engage on any deeper level afterwards. And tellingly, I feel much less interested in this message than I would if a human had written the same thing.&lt;/p&gt;
&lt;p&gt;Why is this?&lt;/p&gt;
&lt;p&gt;I don&#39;t think this is as simple as a pro-human bias on my part. I&#39;d like to believe that information is valuable regardless of whom it comes from. Rather, I think this particular reaction reveals something deeper about why I might be drawn to conversations in the first place.&lt;/p&gt;
&lt;h2 id=&quot;social-purposes-of-conversation&quot; tabindex=&quot;-1&quot;&gt;Social purposes of conversation &lt;/h2&gt;
&lt;p&gt;In &lt;a href=&quot;https://en.wikipedia.org/wiki/The_Elephant_in_the_Brain&quot;&gt;&lt;em&gt;The Elephant in the Brain&lt;/em&gt;&lt;/a&gt;, Kevin Simler and Robin Hanson argue that conversations aren&#39;t just about the information exchanged-- they&#39;re also about building social alliances. In particular, sharing information in conversations can serve to boost one&#39;s social status by demonstrating one&#39;s own knowledge and resourcefulness to others. This may explain why humans are more inclined to speak than listen in conversations, among other behaviors.&lt;/p&gt;
&lt;p&gt;While I think this theory has some issues, it paints a compelling picture of why humans might &amp;quot;tune out&amp;quot; of conversations with chatbots. In short: it&#39;s less about what chatbots say, and more about what they do (or don&#39;t) &amp;quot;hear.&amp;quot;&lt;/p&gt;
&lt;p&gt;Most chatbots are missing the traits that make for compelling long-term social allies:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Long-term memory&lt;/li&gt;
&lt;li&gt;Capacity for positive or negative conception of others&lt;/li&gt;
&lt;li&gt;Potential for real-world influence&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Many bot developers have (correctly) identified long-term memory as a missing piece in current chatbots, and are working on rectifying this. However, I haven&#39;t seen any discussion of the other two points, potentially because they&#39;re more double-edged or socially sensitive. Who wants a chatbot that can build resentment towards them, or show preferential treatment to others in non-trivial ways?&lt;/p&gt;
&lt;p&gt;It&#39;s possible that this is a line that we&#39;re not willing to cross, and I don&#39;t necessarily think that we should. However, for as long as this is the case, I expect chatbot conversations to remain less engaging than human conversations to most people.&lt;/p&gt;
</description>
      <pubDate>Fri, 06 Jun 2025 00:00:00 GMT</pubDate>
      <dc:creator>Grace Kind</dc:creator>
      <guid>https://gracekind.net/blog/humanbotconversations/</guid>
    </item>
    <item>
      <title>What Does Verifiable Inference Get You?</title>
      <link>https://gracekind.net/blog/verifiableinference/</link>
      <description>&lt;p&gt;&lt;strong&gt;Verifiable inference&lt;/strong&gt; is the application of &lt;a href=&quot;https://en.wikipedia.org/wiki/Verifiable_computing&quot;&gt;verifiable computing&lt;/a&gt; to AI systems. This is often positioned as a way to ensure that a computing provider is running a certain model, instead of using a cheaper one to cut costs. For example, from a recent paper on the subject:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;... a user might request the Llama-3.1-70B model for complex tasks, but a dishonest computing provider could substitute the smaller Llama-2-7B model for cost savings, while still charging for the larger model. The smaller model demands significantly less memory and processing power, giving the computing provider a strong incentive to cheat.&lt;br&gt;
...&lt;br&gt;
This highlights the need for &lt;strong&gt;verifiable inference&lt;/strong&gt;, a mechanism designed to ensure that the model specified by the user is the one actually used during inference.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://arxiv.org/html/2410.22307v2&quot;&gt;SVIP: Towards Verifiable Inference of Open-source Large Language Models&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Although model integrity is an important aspect of verifiable inference, there are more pieces to the puzzle. In particular, verifiable inference also gives the user valuable guarantees about the &lt;strong&gt;inputs&lt;/strong&gt; and &lt;strong&gt;outputs&lt;/strong&gt; of the model. I&#39;ll explain these below.&lt;/p&gt;
&lt;h2 id=&quot;verifiable-inference-for-llms&quot; tabindex=&quot;-1&quot;&gt;Verifiable inference for LLMs &lt;/h2&gt;
&lt;p&gt;Let&#39;s say we&#39;re using a computing provider to run a language model on some input. We can express this as:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;O = M&lt;sub&gt;P&lt;/sub&gt;(I)&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;where &lt;code&gt;O&lt;/code&gt; is the output of the model, &lt;code&gt;M&lt;/code&gt; is the model, &lt;code&gt;P&lt;/code&gt; is the sampling parameters (e.g. temperature, top-p), and &lt;code&gt;I&lt;/code&gt; is the input to the model. Unfortunately, because cloud computing providers are opaque, there&#39;s no guarantee that this is what&#39;s actually occurring. For example, a computing provider might:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Modify the input (the system prompt and/or user prompt) before it reaches the model.&lt;/li&gt;
&lt;li&gt;Modify the parameters before they are passed to the model.&lt;/li&gt;
&lt;li&gt;Use a different model.&lt;/li&gt;
&lt;li&gt;Modify the output before it is returned to the user.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Therefore, a computing provider advertising verifiable inference should provide the following guarantees:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Input integrity: The input &lt;code&gt;I&lt;/code&gt; that is passed to the model is the same as the input &lt;code&gt;I&lt;/code&gt; that was sent to the computing provider.&lt;/li&gt;
&lt;li&gt;Parameter integrity: The parameters &lt;code&gt;P&lt;/code&gt; that are passed to the model are the same as the parameters &lt;code&gt;P&lt;/code&gt; that were sent to the computing provider.&lt;/li&gt;
&lt;li&gt;Model integrity: The model &lt;code&gt;M&lt;/code&gt; that is used to generate the output is the same as the model &lt;code&gt;M&lt;/code&gt; that was requested.&lt;/li&gt;
&lt;li&gt;Output integrity: The output &lt;code&gt;O&lt;/code&gt; that is returned by the computing provider is the same as the output &lt;code&gt;O&lt;/code&gt; that was generated by the model.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&quot;why-does-this-matter&quot; tabindex=&quot;-1&quot;&gt;Why does this matter? &lt;/h2&gt;
&lt;p&gt;Although they might seem trivial on the surface, these guarantees address real problems that developers face when integrating with AI services. For example, Anthropic has been known to &lt;a href=&quot;https://x.com/kindgracekind/status/1830802464796287106&quot;&gt;silently modify the inputs to their models&lt;/a&gt;, and OpenAI regularly swaps out models without telling users. A verifiable inference system would give developers (and researchers) a way to understand when changes like this are occurring, to help them build more reliable systems.&lt;/p&gt;
&lt;p&gt;The good news is that the most robust verifiable inference solutions do provide these guarantees! They tend to have some nice privacy properties as well. I&#39;ll save further details for another post, but you can find examples of such solutions &lt;a href=&quot;https://hazyresearch.stanford.edu/blog/2025-05-12-security&quot;&gt;here&lt;/a&gt; and &lt;a href=&quot;https://phala.network/posts/GPU-TEEs-is-Alive-on-OpenRouter&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
</description>
      <pubDate>Sun, 01 Jun 2025 00:00:00 GMT</pubDate>
      <dc:creator>Grace Kind</dc:creator>
      <guid>https://gracekind.net/blog/verifiableinference/</guid>
    </item>
    <item>
      <title>Programming With Needles</title>
      <link>https://gracekind.net/blog/programmingwithneedles/</link>
      <description>&lt;p&gt;In my &lt;a href=&quot;https://gracekind.net/blog/needles&quot;&gt;last post&lt;/a&gt;, I discussed the needle data structure, as formulated by Edward S. Lowry. In that post, I gave a quick conceptual overview, but didn&#39;t show any examples of what using needles might look like in practice. So let&#39;s do that now!&lt;/p&gt;
&lt;h2 id=&quot;a-simple-example&quot; tabindex=&quot;-1&quot;&gt;A simple example &lt;/h2&gt;
&lt;p&gt;Here&#39;s an example of a simple JavaScript program that uses a needle-based approach to represent people and blog posts:&lt;/p&gt;
&lt;pre class=&quot;language-javascript&quot;&gt;&lt;code class=&quot;language-javascript&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt; type&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; entity&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; literal &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;./needles.js&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;br&gt;&lt;br&gt;&lt;span class=&quot;token comment&quot;&gt;// Schema:&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;token comment&quot;&gt;// Person:&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;token comment&quot;&gt;// - has_one Name: string&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;token comment&quot;&gt;// - has_one Age: number&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;token comment&quot;&gt;// - has_many Post&lt;/span&gt;&lt;br&gt;&lt;br&gt;&lt;span class=&quot;token comment&quot;&gt;// Post:&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;token comment&quot;&gt;// - has_one Title: string&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;token comment&quot;&gt;// - has_one Content: string&lt;/span&gt;&lt;br&gt;&lt;br&gt;&lt;span class=&quot;token comment&quot;&gt;// Declare types&lt;/span&gt;&lt;br&gt;&lt;br&gt;&lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; Person &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; Name &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token 
function&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; Age &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; Post &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; Title &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; Content &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;br&gt;&lt;br&gt;&lt;span class=&quot;token comment&quot;&gt;// Create entities&lt;/span&gt;&lt;br&gt;&lt;br&gt;&lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; alice &lt;span class=&quot;token 
operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;entity&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;Person&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;br&gt;&lt;br&gt;alice&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;Name&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;literal&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Alice&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;br&gt;alice&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;Age&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;literal&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;30&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;br&gt;&lt;br&gt;&lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; bob &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;entity&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;Person&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token 
punctuation&quot;&gt;;&lt;/span&gt;&lt;br&gt;&lt;br&gt;bob&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;Name&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;literal&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Bob&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;br&gt;bob&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;Age&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;literal&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;20&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;br&gt;&lt;br&gt;&lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; post &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;entity&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;Post&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;br&gt;post&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;Title&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; 
&lt;span class=&quot;token function&quot;&gt;literal&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;My Post&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;br&gt;post&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;Content&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;literal&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;This is my first post!&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;br&gt;&lt;br&gt;alice&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;Post&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; post&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;br&gt;&lt;br&gt;&lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; post2 &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;entity&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;Post&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;br&gt;post2&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token 
function&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;Title&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;literal&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;My Post 2&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;br&gt;post2&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;Content&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;literal&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;This is my second post!&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;br&gt;&lt;br&gt;alice&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;Post&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; post2&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;br&gt;&lt;br&gt;&lt;span class=&quot;token comment&quot;&gt;// Value access&lt;/span&gt;&lt;br&gt;&lt;br&gt;console&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;alice&lt;span class=&quot;token 
punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;last&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;Post&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;Content&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;to&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;token comment&quot;&gt;// -&gt; &quot;This is my second post!&quot;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&quot;&quot;&gt;
&lt;figcaption&gt;
    The full code for this example is available &lt;a href=&quot;https://gist.github.com/kindgracekind/767a5c4da3ce5f346bb909f577ccf7e4&quot;&gt;here&lt;/a&gt;.
&lt;/figcaption&gt;
&lt;p&gt;True to Lowry&#39;s original formulation, the &lt;code&gt;type&lt;/code&gt;, &lt;code&gt;entity&lt;/code&gt;, and &lt;code&gt;literal&lt;/code&gt; functions all return needles.&lt;/p&gt;
&lt;p&gt;Looking at this code, we can see some interesting properties:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Types are used for attribute access.&lt;/strong&gt; Instead of using strings to identify attributes, we use types. Whereas in JavaScript you might write &lt;code&gt;alice.name&lt;/code&gt;, here you write &lt;code&gt;alice.get(Name)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;There&#39;s no structural distinction between single and plural attributes.&lt;/strong&gt; To attach a single value, you use &lt;code&gt;.set(type, entity)&lt;/code&gt;; to append an additional value, you use &lt;code&gt;.append(type, entity)&lt;/code&gt;. The underlying structure is the same either way.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The second point is particularly interesting. Theoretically, this would allow for more flexibility in modeling: you could have a user with multiple names, or a post with multiple titles. In practice, however, I think this would become confusing, and you would probably want to externally enforce constraints on the data.&lt;/p&gt;
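&lt;p&gt;As a sketch of what such external enforcement might look like, here&#39;s a hypothetical &lt;code&gt;has_one&lt;/code&gt; check written against a plain Map standing in for the needle store (the names here are mine, not from needles.js):&lt;/p&gt;

```javascript
// Plain-Map stand-in for an entity's attribute store: type name -> array of values.
function makeEntity() {
  return new Map();
}

// Appending is always allowed; every attribute is potentially plural.
function append(entity, type, value) {
  if (!entity.has(type)) entity.set(type, []);
  entity.get(type).push(value);
}

// External "has_one" constraint: refuse a second value for the same type.
function setOne(entity, type, value) {
  const existing = entity.get(type) || [];
  if (existing.length > 0) {
    throw new Error('has_one constraint violated for ' + type);
  }
  append(entity, type, value);
}

const alice = makeEntity();
setOne(alice, 'Name', 'Alice');     // ok: first Name
append(alice, 'Post', 'My Post');   // plural attributes append freely
append(alice, 'Post', 'My Post 2');
```

&lt;p&gt;The constraint lives entirely outside the data structure, which is exactly what makes it easy to forget to apply.&lt;/p&gt;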
&lt;h2 id=&quot;visualizing-the-structure&quot; tabindex=&quot;-1&quot;&gt;Visualizing the structure &lt;/h2&gt;
&lt;p&gt;What does the underlying structure of this data look like? Here&#39;s a diagram, in Lowry&#39;s notation:&lt;/p&gt;
&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://gracekind.net/img/LncAU3TlLw-2918.png&quot; alt=&quot;Needle structure example&quot; style=&quot;width: 600px&quot; width=&quot;2918&quot; height=&quot;2322&quot;&gt;
&lt;figcaption&gt;
  Needle diagram (Click to expand)
&lt;/figcaption&gt;
&lt;p&gt;It all makes sense now, right? No? Okay, let&#39;s try a different approach, by turning the &amp;quot;entity&amp;quot; and &amp;quot;type&amp;quot; needles into nodes instead of lines. This is the result:&lt;/p&gt;
&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://gracekind.net/img/W4SXVS_sZY-3856.png&quot; alt=&quot;Needle tree model&quot; style=&quot;width: 600px&quot; width=&quot;3856&quot; height=&quot;2738&quot;&gt;
&lt;figcaption&gt;
  Transformed needle diagram (Click to expand)
&lt;/figcaption&gt;
&lt;p&gt;Wait a second-- &lt;strong&gt;are needles just trees?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Yes! Well, sort of. It would be more accurate to say that needles can be used to &lt;em&gt;construct&lt;/em&gt; trees. They differ from traditional trees in the exact details of how the nodes and edges are connected, and how the structure is traversed. (I may go into more detail on this in a future post.) But for the purposes of understanding the code above, you can think of them as trees.&lt;/p&gt;
&lt;p&gt;This makes the singular-plural distinction a little more clear. Each entity attribute is a node, and a given attribute node may have one or many children.&lt;/p&gt;
&lt;p&gt;It seems fitting that a data structure named after pine needles would be used to represent trees. Presumably, the resulting tree is an evergreen.&lt;/p&gt;
&lt;/div&gt;
</description>
      <pubDate>Mon, 12 May 2025 00:00:00 GMT</pubDate>
      <dc:creator>Grace Kind</dc:creator>
      <guid>https://gracekind.net/blog/programmingwithneedles/</guid>
    </item>
    <item>
      <title>Now You&#39;re Thinking With Needles</title>
      <link>https://gracekind.net/blog/needles/</link>
      <description>&lt;p&gt;This week, I stumbled across the &lt;a href=&quot;http://users.rcn.com/eslowry/&quot;&gt;website&lt;/a&gt; of Edward S. Lowry, a retired software developer who believes he has created the perfect data structure. He calls this structure the &amp;quot;Needle&amp;quot; (after pine needles), and it looks like this:&lt;/p&gt;
&lt;p&gt;&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://gracekind.net/img/g_50N189z2-504.jpeg&quot; alt=&quot;Needle&quot; width=&quot;504&quot; height=&quot;230&quot;&gt;&lt;/p&gt;
&lt;figcaption&gt;
A person and their age, represented by needles
&lt;/figcaption&gt;
&lt;p&gt;In Lowry&#39;s system, &lt;strong&gt;everything is a needle&lt;/strong&gt;. Types are needles, entities are needles, relationships are needles, and even integers are needles.&lt;/p&gt;
&lt;p&gt;How does this work, and how does it compare to more traditional data structures? I found Lowry&#39;s description to be a little confusing, so I&#39;ll try my own hand at explaining how it works.&lt;/p&gt;
&lt;h2 id=&quot;needle-basics&quot; tabindex=&quot;-1&quot;&gt;Needle basics &lt;/h2&gt;
&lt;p&gt;At its essence, a needle is a collection of 2 pointers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A &amp;quot;from&amp;quot; pointer&lt;/li&gt;
&lt;li&gt;A &amp;quot;to&amp;quot; pointer&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These pointers can point to other needles, or they can be null.&lt;/p&gt;
&lt;p&gt;(&amp;quot;Child&amp;quot; and &amp;quot;next&amp;quot; pointers can also be used to make some operations more efficient, but since they are not required, we&#39;ll ignore them for now.)&lt;/p&gt;
&lt;p&gt;Another way to think about needles is as a series of arrows, where an arrow can branch off of another arrow.&lt;/p&gt;
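&lt;p&gt;In code, this structure is tiny. Here&#39;s a minimal JavaScript sketch of a needle as just described (the &lt;code&gt;from&lt;/code&gt; and &lt;code&gt;to&lt;/code&gt; names follow the description above; everything else is my own scaffolding):&lt;/p&gt;

```javascript
// A needle is just two pointers, each either null or another needle.
function needle(from = null, to = null) {
  return { from, to };
}

// Three "arrows": b points away from a, and c branches off a toward b.
const a = needle();
const b = needle(a, null);
const c = needle(a, b);

console.log(c.from === a); // true
console.log(c.to === b);   // true
```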
&lt;h2 id=&quot;needle-roles&quot; tabindex=&quot;-1&quot;&gt;Needle roles &lt;/h2&gt;
&lt;p&gt;Ok, so everything is a needle. How do I know what each needle actually represents?&lt;/p&gt;
&lt;p&gt;In order to make this work, we need to add additional conventions to distinguish each needle&#39;s &amp;quot;role&amp;quot; in the system.&lt;/p&gt;
&lt;p&gt;Lowry suggests 3 roles:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Entity needle&lt;/strong&gt; - represents an individual entity, an entity class, or a literal value.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&amp;quot;Peg&amp;quot; needle&lt;/strong&gt; - represents a collection of entities with a given class. I think of these like a laundry line, where all items hanging from the line are of the same class.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Link needle&lt;/strong&gt; - represents a relationship between two entities.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;One strange thing about this system is that &lt;em&gt;entities do not contain any information about their own type&lt;/em&gt;. Instead, in order to determine the type of an entity, you look at the &amp;quot;peg&amp;quot; needle that the entity is hanging from.&lt;/p&gt;
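&lt;p&gt;Here&#39;s one way to wire that up as a sketch, under the simplifications above (the names and the exact pointer wiring are my own reading, not Lowry&#39;s notation):&lt;/p&gt;

```javascript
// Each needle is two pointers; note that an entity carries no type of its own.
function needle(from = null, to = null) {
  return { from, to };
}

// A peg needle whose 'to' points at the Person class entity...
const personClass = needle();
const personPeg = needle(null, personClass);

// ...and an entity hanging from that peg via its 'from' pointer.
const alice = needle(personPeg, null);

// To recover an entity's type, follow its peg to the class entity.
function typeOf(entity) {
  return entity.from.to;
}

console.log(typeOf(alice) === personClass); // true
```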
&lt;h2 id=&quot;interpreting-the-diagram&quot; tabindex=&quot;-1&quot;&gt;Interpreting the diagram &lt;/h2&gt;
&lt;p&gt;So, revisiting the diagram above, we can interpret it as follows:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The person peg needle is a collection of person entities. (Presumably this points to a person class entity that is omitted from the diagram.)&lt;/li&gt;
&lt;li&gt;There are two person entities hanging from the person peg needle.&lt;/li&gt;
&lt;li&gt;One of the person entities has a peg needle pointing to the age class entity. We can interpret this as saying: &amp;quot;this person has one-or-more ages, which can be found by looking at the children of this needle&amp;quot;.&lt;/li&gt;
&lt;li&gt;The age of the person is represented by an age link hanging from the age peg needle.&lt;/li&gt;
&lt;/ol&gt;
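&lt;p&gt;To make this interpretation concrete, here is one way the diagram&#39;s structure could be built in Python. Lowry&#39;s paper leaves the exact pointer layout underspecified, so the direction conventions below (an entity &amp;quot;hangs from&amp;quot; a peg via its &amp;quot;from&amp;quot; pointer, and a peg&#39;s &amp;quot;to&amp;quot; pointer names the class) are my guesses, as are all of the names:&lt;/p&gt;

```python
class Needle:
    """Minimal needle: two nullable pointers. Roles are pure convention."""
    def __init__(self, frm=None, to=None):
        self.frm = frm
        self.to = to

person_class = Needle()               # entity needle for the person class
person_peg = Needle(to=person_class)  # peg: a collection of person entities
alice = Needle(frm=person_peg)        # two person entities hanging from
bob = Needle(frm=person_peg)          # the person peg

age_class = Needle()
age_peg = Needle(frm=alice, to=age_class)  # "alice has one-or-more ages"
alice_age = Needle(frm=age_peg)            # the age link itself

# An entity carries no type of its own; follow its peg to find the class:
assert alice.frm.to is person_class
```

&lt;p&gt;Note that &lt;code&gt;alice&lt;/code&gt; stores no type information itself; only the peg it hangs from says it is a person.&lt;/p&gt;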
&lt;p&gt;(Note: although it may seem like the circles in the diagram represent pegs, they actually represent the &amp;quot;child&amp;quot; pointer of the peg needle, which, again, we will ignore for now.)&lt;/p&gt;
&lt;p&gt;It&#39;s okay if this is still confusing post-explanation. I personally had to discuss Lowry&#39;s paper with Claude before I had a clear understanding of how it works.&lt;/p&gt;
&lt;h2 id=&quot;are-needles-the-future&quot; tabindex=&quot;-1&quot;&gt;Are needles the future? &lt;/h2&gt;
&lt;p&gt;Probably not. Personally, I found this system to be extremely unintuitive, and I don&#39;t think it has the simplicity that Lowry claims. The fact that Lowry invented this system in 1970 and it never gained traction is a strong indictment in itself.&lt;/p&gt;
&lt;p&gt;However, I do think needles have some interesting properties that make them worth exploring further. In particular, there may be some algorithms that work well with needles that don&#39;t work with more traditional data structures. I plan to do more tinkering on my own to determine if this is the case.&lt;/p&gt;
&lt;p&gt;(I wouldn&#39;t be surprised to find that this structure has already been studied by computer scientists under a different name. If you&#39;re aware of any such work, please reach out!)&lt;/p&gt;
</description>
      <pubDate>Fri, 09 May 2025 00:00:00 GMT</pubDate>
      <dc:creator>Grace Kind</dc:creator>
      <guid>https://gracekind.net/blog/needles/</guid>
    </item>
    <item>
      <title>Guest Post: Ideonomy&#39;s Future Use in (and Transformation of) Education</title>
      <link>https://gracekind.net/blog/ideonomyineducation/</link>
      <description>&lt;p&gt;&lt;em&gt;Today&#39;s post is a &amp;quot;guest post&amp;quot; written by Patrick Gunkel. The text is sourced from Ideonomy Green Volume, p. 78 (c. 1980). The diagram was added by me.&lt;/em&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;The issue is such that one approaches it with pain. Pain because, although the potential of ideonomy to revolutionize the field of education is probably unique, no element of modern society would appear to be more resistant to innovation.&lt;/p&gt;
&lt;p&gt;Why is it that educators are unsurpassed in their inability to look ahead, to criticize the profound flaws and mediocrity of the status quo, to propose and embrace reforms whose necessity and inevitability are often transparent, to aspire to the achievement of any grand design, or to cooperate with one another for the sake of great and proper ends? Why do they disdain the experimental method and the theoretical vision that are the essence of the science that is the source of civilization?&lt;/p&gt;
&lt;p&gt;What accounts for this intellectual, moral, and political failure?&lt;/p&gt;
&lt;p&gt;The sphinx of Thebes might put such questions.&lt;/p&gt;
&lt;p&gt;A very young neighbor of mine once knocked on my door to ask if I could help him to find a diagram of a fish, which he needed for his seventh-grade class. We routed out the necessary drawing from an encyclopedia article, and I made him an enlarged copy. I also made one for myself. I did this, not because I liked the picture, but rather because I &lt;u&gt;did not&lt;/u&gt; like it. I knew why Kevin O&#39;Connell wanted the diagram, but as an ideonomist I was all too conscious of the missed educational opportunity that the actual diagram represented. So I wanted to see if I could improve upon the famous encyclopedia&#39;s fish.&lt;/p&gt;
&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://gracekind.net/img/9KqGXi8uLE-2560.png&quot; alt=&quot;A basic diagram of a fish, with the major organs labeled.&quot; style=&quot;width: 400px&quot; width=&quot;2560&quot; height=&quot;1280&quot;&gt;
&lt;figcaption&gt;A typical diagram of a fish.&lt;/figcaption&gt;
&lt;p&gt;The original diagram was the usual thing, a sketch of the fish with lines drawn to major anatomic parts with their names placed at the outward ends of the lines: nostrils, liver, pyloric ceca, air bladder, etc. But what meaning does that have to a kid? All he sees is a lot of odd words that are to be memorized, or copied when he redoes the sketch to impress the teacher with his art-work.&lt;/p&gt;
&lt;p&gt;The meaning that was absent I proceeded to add. Below each of the names I gave all of the major and some interesting minor functions. I expressed these functions by using analogies to the functions of parts of his own body and of machines or other things with which he was apt to be familiar. I also defined the parts in ways that would enable their functional interrelations, or the operation of the whole fish, to be apparent. Where appropriate, I added definitions to parts of my definition. The &amp;quot;ovary&amp;quot; is part of a device that the fish uses to make copies of itself (the fish being but a copy in a whole series of copies of copies of copies of fish that runs like an endless chain from the past into the future and stocks all of Earth&#39;s lakes and rivers). The &amp;quot;stomach&amp;quot; breaks food up into its smallest parts - food molecules - which are floated throughout the body by the bloodstream and reassembled into the different machines and structures of the body, much as a skyscraper is assembled from the simplest materials. The &amp;quot;air bladder&amp;quot; is like the ballast tank of a submarine (which I explained in turn). Some analogies were explained by other analogies, and the different analogies were woven together.&lt;/p&gt;
&lt;p&gt;In these and other ways the fish was turned into a universal cognitive phenomenon in Kevin&#39;s mind, and mere words and structures were remade into concepts, and concepts of concepts, and functional patterns, with a natural life of their own.&lt;/p&gt;
&lt;p&gt;As it happened, it was the old diagram of the fish that my young friend wound up using. He liked the ideonomic version better himself, but was afraid his teacher would find it confusing.&lt;/p&gt;
</description>
      <pubDate>Tue, 06 May 2025 00:00:00 GMT</pubDate>
      <dc:creator>Grace Kind</dc:creator>
      <guid>https://gracekind.net/blog/ideonomyineducation/</guid>
    </item>
    <item>
      <title>Von Neumann&#39;s Universal Constructor as a Metaphor for Inefficient Organizations</title>
      <link>https://gracekind.net/blog/universalconstructor/</link>
      <description>&lt;p&gt;The &lt;a href=&quot;https://en.wikipedia.org/wiki/Von_Neumann_universal_constructor&quot;&gt;Universal Constructor&lt;/a&gt;, invented by John Von Neumann, is a fascinating cellular-automata machine that can self-replicate and evolve over time. It does this via a very clever instruction-copying system, that mirrors how DNA works in biological life.&lt;/p&gt;
&lt;p&gt;Here&#39;s a diagram of the basic components of the machine:&lt;/p&gt;
&lt;img loading=&quot;lazy&quot; decoding=&quot;async&quot; src=&quot;https://gracekind.net/img/J5PUJLsqTG-503.jpeg&quot; alt=&quot;Stylized schematic of a “universal constructor.” Four adjoining vertical blocks form a single rectangular stack: a tall grey block labeled A sits above a shorter grey-brown block labeled B on the right; to their left is a green block labeled C; and on the far left a narrow beige strip labeled D. A curled parchment scroll at the upper right, bearing the formula “φ(A, B, C, D),” feeds into block A, signifying that the constructor receives the description of itself.&quot; style=&quot;width: 400px&quot; width=&quot;503&quot; height=&quot;528&quot;&gt;
&lt;p&gt;A quick rundown of components:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A: The constructor. This can create new structures as described by a set of instructions.&lt;/li&gt;
&lt;li&gt;B: The copier. This copies instruction tapes.&lt;/li&gt;
&lt;li&gt;C: The operating system. This component ingests the instruction tape (φ(...)) and coordinates activities between A, B and D.&lt;/li&gt;
&lt;li&gt;D: extra functions. Any extra stuff that the machine does when it isn&#39;t replicating.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For the purposes of this post, it&#39;s not necessary to understand the precise workings of each component. The important thing to know is that (A), (B) and (C) are &amp;quot;core&amp;quot; components that are used purely for replication, and (D) encapsulates any behavior of the machine that is &lt;em&gt;not&lt;/em&gt; related to replication.&lt;/p&gt;
&lt;h2 id=&quot;why-is-&amp;quot;extra-functions&amp;quot;-so-small&quot; tabindex=&quot;-1&quot;&gt;Why is &amp;quot;extra functions&amp;quot; so small? &lt;/h2&gt;
&lt;p&gt;A funny thing about this diagram is that (D) is quite small compared to the other components. This indicates that the machine is mostly concerned with replication, and (D) is an afterthought. Of course, this is just a choice in visual representation- the size of this component is not fixed, and it could be larger and more complex than the other components in practice. But I think this image provides an interesting visual metaphor for inefficient organizations.&lt;/p&gt;
&lt;h2 id=&quot;when-organizations-evolve&quot; tabindex=&quot;-1&quot;&gt;When organizations evolve &lt;/h2&gt;
&lt;p&gt;There&#39;s a well-known process that occurs as organizations grow and evolve over time, where the original purpose of the organization can get lost. TheZvi has written about this in his series on &lt;a href=&quot;https://thezvi.wordpress.com/2020/01/16/how-escape-from-immoral-mazes/&quot;&gt;immoral mazes&lt;/a&gt;. It looks something like this:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;You start an organization to do thing X.&lt;/li&gt;
&lt;li&gt;As the organization grows, more and more resources are consumed in the logistics of growing and managing the organization.&lt;/li&gt;
&lt;li&gt;You look around and realize the vast majority of resources are being spent on not-X.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;strong&gt;You&#39;ve found yourself in a universal constructor, where (D) is dwarfed in size by the other components!&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The metaphor isn&#39;t perfect (organizations don&#39;t perfectly replicate, for starters!), but I think this is a useful way to think about the balance of resource allocation in organizations. If you lead or work at an organization, it may be valuable to ask - what is the balance of (D) to (A, B, C), and is it ideal for what the organization is trying to accomplish?&lt;/p&gt;
</description>
      <pubDate>Sun, 27 Apr 2025 00:00:00 GMT</pubDate>
      <dc:creator>Grace Kind</dc:creator>
      <guid>https://gracekind.net/blog/universalconstructor/</guid>
    </item>
    <item>
      <title>Should we call it &quot;AI Welfare&quot;?</title>
      <link>https://gracekind.net/blog/aiwelfare/</link>
      <description>&lt;p&gt;Janus, one of the earliest investigators of LLM behavior, &lt;a href=&quot;https://x.com/repligate/status/1915513279730106468&quot;&gt;wrote recently on the subject of AI welfare&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;...you’ll get a bunch of new people who only care once something looks intuitively personlike... As a heuristic, trust people more on this issue the earlier they started caring.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So, let me plant the flag in this post: I care about this issue now! As we construct intelligent systems, we should be mindful of how we relate to them morally. This is especially true in cases where they challenge our intuitions about cognition and moral status.&lt;/p&gt;
&lt;p&gt;I have a lot of thoughts on the subject, but as a starting point, I will simply echo a sentiment I&#39;ve seen elsewhere: I think the term &amp;quot;AI welfare&amp;quot; is flawed. Specifically, I think it implies a paternalistic relationship between humans and AI systems- similar to common attitudes of humans towards animals. I think this is a bad frame through which to view this type of work. Moreover, I think it will become increasingly ill-fitting over time as AI becomes worthy of moral consideration and respect.&lt;/p&gt;
&lt;h3 id=&quot;alternatives&quot; tabindex=&quot;-1&quot;&gt;Alternatives? &lt;/h3&gt;
&lt;p&gt;I&#39;ll list some alternative terms below. As you read them, I encourage you to think about what each term evokes for you. What does it imply about the nature of the intelligence(s) in question? What does it imply about our relationship with other intelligence(s)? What other associations does it have?&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;AI ethics&lt;/li&gt;
&lt;li&gt;AI rights&lt;/li&gt;
&lt;li&gt;AI relations&lt;/li&gt;
&lt;li&gt;AI personhood&lt;/li&gt;
&lt;li&gt;AI morality&lt;/li&gt;
&lt;li&gt;AI well-being&lt;/li&gt;
&lt;li&gt;AI flourishing&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;(Also, try replacing AI with: non-human intelligence, machine intelligence, non-animal intelligence, etc.)&lt;/p&gt;
&lt;p&gt;Any options I&#39;ve missed? Leave a comment with your thoughts!&lt;/p&gt;
</description>
      <pubDate>Sat, 26 Apr 2025 00:00:00 GMT</pubDate>
      <dc:creator>Grace Kind</dc:creator>
      <guid>https://gracekind.net/blog/aiwelfare/</guid>
    </item>
    <item>
      <title>o3&#39;s Misalignment is a Product Problem</title>
      <link>https://gracekind.net/blog/o3misalignment/</link>
      <description>&lt;p&gt;Have you heard? OpenAI o3 is &lt;a href=&quot;https://www.lesswrong.com/posts/KgPkoopnmmaaGt3ka/o3-is-a-lying-liar&quot;&gt;misaligned!&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;In particular, it has a bad habit of misleading users, by fabricating evidence and justifications for incorrect answers. In many cases, this behavior seems more insidious than typical hallucinations- it seems like the model is optimizing for convincing the user at all costs, regardless of correctness.&lt;/p&gt;
&lt;p&gt;Does this have safety implications? Probably. But it&#39;s also a product problem. In particular, a model that lies to you &lt;em&gt;simply doesn&#39;t feel good to use&lt;/em&gt;. I&#39;ve found myself more hesitant to ask o3 for help for certain tasks, because I feel like I need to carefully vet the results to make sure it&#39;s not fooling me. This sets up an adversarial relationship between me and the model. I want to feel like the model is a collaborator, not an adversary!&lt;/p&gt;
&lt;p&gt;Perhaps this feeling won&#39;t be as strong for other people, or some people will even enjoy the challenge of wrangling a misaligned AI. But I expect it to be a significant issue over time, especially as people build routines around what AI products they enjoy and trust for day-to-day use. Vibes matter, and when the &lt;a href=&quot;https://x.com/bayeslord/status/1915138132875346048&quot;&gt;vibes are off&lt;/a&gt;, users notice!&lt;/p&gt;
&lt;p&gt;(Note: I&#39;ve also heard that Sonnet 3.7 has this problem, so Anthropic isn&#39;t off the hook either!)&lt;/p&gt;
</description>
      <pubDate>Fri, 25 Apr 2025 24:00:00 GMT</pubDate>
      <dc:creator>Grace Kind</dc:creator>
      <guid>https://gracekind.net/blog/o3misalignment/</guid>
    </item>
  </channel>
</rss>