Experiments with Autonomous GPT Agents

And the future of collaborating with AI

By Adam Cohen Hillel, originally published on Substack, Apr 27, 2023


In the last few weeks, I’ve been fascinated by the BabyAGI concept Yohei introduced on Twitter. It is a simple yet powerful implementation of a “Task-driven Autonomous Agent” using nothing but the OpenAI API and ~100 lines of Python code. Essentially, it is code that runs in an endless loop and, using the OpenAI API, creates and executes its own tasks toward an objective defined by the user. It can also leverage tools like Google Search, Zapier, etc., to accomplish the tasks. As just one example, the following agent was asked to write a business plan. It then made a to-do list, a marketing plan, etc.
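
To make the loop concrete, here is a minimal sketch of a task-driven agent. It is not Yohei's code: `run_agent`, `fake_llm`, and the prompt wording are all made up for illustration, the model is replaced by a canned function so the example runs offline, and the loop is capped at `max_steps` instead of running forever.

```python
from collections import deque
from typing import Callable, Deque, List

# Hypothetical stand-in for an OpenAI API call: takes a prompt, returns text.
LLM = Callable[[str], str]


def run_agent(objective: str, first_task: str, llm: LLM, max_steps: int = 5) -> List[str]:
    """Execute tasks toward an objective, letting the model propose follow-ups."""
    tasks: Deque[str] = deque([first_task])
    results: List[str] = []
    # The real thing loops indefinitely; this sketch stops after max_steps.
    for _ in range(max_steps):
        if not tasks:
            break
        task = tasks.popleft()
        result = llm(f"Objective: {objective}\nTask: {task}\nComplete the task.")
        results.append(f"{task} -> {result}")
        # Ask the model for follow-up tasks, one per line, given the latest result.
        proposal = llm(
            f"Objective: {objective}\nLast result: {result}\n"
            "List new tasks, one per line."
        )
        tasks.extend(line.strip() for line in proposal.splitlines() if line.strip())
    return results


def fake_llm(prompt: str) -> str:
    """Canned replies (an assumption for this demo, not real model output)."""
    if "List new tasks" in prompt:
        return "Draft a marketing plan" if "to-do" in prompt else ""
    return "done: to-do list" if "to-do" in prompt else "done"


if __name__ == "__main__":
    for line in run_agent("Write a business plan", "Make a to-do list", fake_llm):
        print(line)
```

Swapping `fake_llm` for a function that calls a real model would be the only change needed to try this against an actual API.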

Today I will walk you through my experiments with different Agents I created (coding, entertainment) and share my view on the future of this technology.


Playing around with the original BabyAGI version, I quickly realised its great potential. I had recently noticed that I don’t use ChatGPT (the web interface) the way I used the “old” internet (querying questions/problems and getting answers); rather, I use ChatGPT as a cognitive offload. I give it tasks (mostly coding) that I know I can solve myself, but I would rather spend my time on other things. When I saw BabyAGI, I saw how it could enhance my ChatGPT flow by 10x and more!

So I took it a step further and started implementing different agents:


My Experiments


Experiment #1: TDD (Test Driven Development) Coding Agent (Twitter thread).

TLDR: Coding Agent that follows the Test Driven Development (TDD) methodology! You write the tests - and the agent runs in a loop until it creates the feature properly!

To integrate autonomous agents into the workforce, I realised, we’d need to be able to measure their progress, recalculate their direction, and have them flag when they are finished or need help. Essentially, we need a framework for them to work within, like setting expectations with a new junior dev. This was when TDD came to mind! In the case of coding agents, it can act as the agent’s framework.

💡 To those of my readers unfamiliar with the TDD term - it is a software development practice that focuses on creating test cases before developing the actual code.

So what do we see here?

  1. We’ve got an incomplete application code (in this case, a simple FastAPI app) and two tests. One of the tests fails (as the application is incomplete).

  2. We then run the agent - it can’t read the tests; it can only run them and see the output.

  3. On the first try, it created the missing "echo" endpoint with a POST method - which failed, as the test defined it as a GET.

  4. It then took the failures and changed the code to use GET instead - which failed again, but with a different error (we made it fail on purpose: in the test, we defined the expected key name of the return value as "message1" rather than "message").

  5. It, therefore, took the failures (again) and adjusted the code - this time making the return value "message1" - which worked, and the tests passed.

The idea here was to show how useful TDD is for steering the agent in a specific, measurable direction - while keeping it somewhat autonomous.
As a software engineer, to leverage agents in my workflow, I need a way to steer the agent toward my desired end goal; how it gets there is up to it.
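
The run-fix-repeat cycle above can be sketched in a few lines. This is an illustration under assumptions, not the agent from the thread: `tdd_loop`, `run_tests`, and `fake_fix` are hypothetical names, the "test suite" is a single in-process check instead of a real test runner, and the model is replaced by a canned fixer that replays the echo story (missing function first, wrong key name second).

```python
from typing import Callable, Tuple

TestRunner = Callable[[str], Tuple[bool, str]]
Fixer = Callable[[str, str], str]


def tdd_loop(code: str, run_tests: TestRunner, propose_fix: Fixer,
             max_attempts: int = 5) -> Tuple[str, int]:
    """Rewrite `code` until the tests pass; return the code and the attempt count."""
    for attempt in range(1, max_attempts + 1):
        passed, output = run_tests(code)
        if passed:
            return code, attempt
        # The agent never reads the test source, only the failure output.
        code = propose_fix(code, output)
    raise RuntimeError("Tests still failing; the agent should ask for help.")


def run_tests(code: str) -> Tuple[bool, str]:
    """Toy test run: execute the candidate code and check echo()'s return value."""
    namespace: dict = {}
    try:
        exec(code, namespace)
        result = namespace["echo"]("hi")
        assert result == {"message1": "hi"}, f"expected key 'message1', got {result}"
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return True, "ok"


def fake_fix(code: str, test_output: str) -> str:
    """Canned stand-in for the model (an assumption for this demo)."""
    if "KeyError" in test_output:
        # echo() does not exist yet, so write a first version.
        return 'def echo(text):\n    return {"message": text}\n'
    # Second failure names the expected key, so rename it.
    return code.replace('"message"', '"message1"')


if __name__ == "__main__":
    final_code, attempts = tdd_loop("", run_tests, fake_fix)
    print(f"passed after {attempts} attempts:\n{final_code}")
```

The `max_attempts` cap is the "flag when you need help" part: once the budget is spent, the loop raises instead of guessing forever.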

Have we finally found a good reason to follow the TDD practice? ;)


Experiment #2: Dynamic Chatbot that creates its own inner world (Twitter thread).

In a completely different direction, I wanted to explore the potential entertainment side of autonomous agents. And yes, I recently re-watched Westworld, which affected my thinking in this experiment :)

While watching the show, now with the knowledge of LLMs, I realised that what attracts people to come to the park (take all the crazy inhuman things aside) are the dynamic loops the “hosts”/“agents” live in. Unlike today’s interactive AI products, Westworld’s hosts have their own lives going on, even when visitors aren't around. It's not just about them waiting for people to interact with them.

Can I use LLMs to do something similar?

So I created “Dreamer” - an autonomous chatbot. You provide it with a personality, and it then runs in a simulated environment where it can do whatever it likes: exploring, chasing interests, widening its knowledge, playing games. You can chat with it whenever, but it doesn't sit and wait for your input! It does its own thing :) A dynamic chatbot with an evolving inner world!

So what do we see here?

  1. The agent is given a personality: “Like to play board games, explore new ideas and read science fiction.”

  2. It then starts by searching the internet for good science fiction books to read, and comes up with a few results (e.g. Dune by Frank Herbert).

  3. It chooses to save this to its memory for later.

  4. It does the same for board games.

  5. The user then asks it, “Hey what you doing? :)”

  6. It tells the user about what it was up to.

  7. It continues to create a new type of board game.

  8. etc

In other runs, it also simulated a Catan game, read a book and wrote its opinion of it, etc. This Agent is a bit simpler than the coding one, as the accuracy/precision of its output is not as critical (see the “Takeaways” section below for more details), but this is also why it is a super interesting use case - it can get wilder!
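
The key difference from a regular chatbot is that the loop keeps ticking whether or not anyone is talking to it. Here is a rough sketch of that idea, with everything in it assumed for illustration: the `Dreamer` class below, its `tick` method, the prompt wording, and the canned `fake_llm` are not the actual implementation.

```python
from dataclasses import dataclass, field
from queue import Empty, Queue
from typing import Callable, List


@dataclass
class Dreamer:
    personality: str
    llm: Callable[[str], str]  # hypothetical stand-in for a model call
    memory: List[str] = field(default_factory=list)
    inbox: Queue = field(default_factory=Queue)

    def tick(self) -> str:
        """One step of the agent's life: answer a message, or do its own thing."""
        context = f"You are: {self.personality}\nMemory: {self.memory}\n"
        try:
            message = self.inbox.get_nowait()
        except Empty:
            # Nobody is talking to the agent, so it picks its own activity
            # and remembers it for later.
            activity = self.llm(context + "What do you do next?")
            self.memory.append(activity)
            return f"[self] {activity}"
        return f"[chat] {self.llm(context + f'User says: {message}')}"


def fake_llm(prompt: str) -> str:
    """Canned replies so the sketch runs offline (not real model output)."""
    if "User says" in prompt:
        return "I was looking up science fiction books, like Dune."
    return "Search for good science fiction books"


if __name__ == "__main__":
    bot = Dreamer("Like to play board games, explore new ideas and read science fiction.", fake_llm)
    print(bot.tick())                      # acts on its own
    bot.inbox.put("Hey what you doing? :)")
    print(bot.tick())                      # answers the user
```

In a real setup, `tick` would run on a timer or background thread, and the chat UI would only ever write to `inbox`.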

Check this out (link)→


Experiment #3: LLMitlessAPI - one single endpoint is all you need! (Twitter Thread)

This experiment, for some reason, was the most difficult to convey to the readers - so I will try my best here.

Most (arguably all) backend API services are a combination of these 3 operations:

  1. Store

  2. Fetch

  3. Execute

You can combine them in infinite combinations and create any API out there. So what if I place an agent that knows how to do these 3 things behind one API endpoint? We can then ask it to act as if it were a different endpoint every time!

Define the service you want the agent to act as + the data to act upon. That is it. A limitless API! (Or should I say, LLMitlessAPI? ;)

So what do we see here?

  1. Nothing in the backend/API defines a “chat” service;

  2. The client sends a request to the API asking it to act as a chat API, along with the message to send.

  3. The agent then decides to store it for later, for whoever will ask for it

  4. Another client asks the API to act as a chat API too, but asks for new messages.

  5. The agent fetches the latest message from its memory, sent by the other client, and sends it back

  6. BAM - a working chat (with many bugs and very slow) - without creating the chat functionality!
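
The chat flow above can be sketched as a single handler that defers every decision to the agent. This is a toy under stated assumptions: `SingleEndpoint`, the decision dictionary format, and `fake_decide` (a canned stand-in for the model) are invented for illustration, and there is no HTTP layer, just a method call.

```python
from typing import Any, Callable, Dict, List, Optional

# A decision names one of the three operations, plus an optional value.
Decision = Dict[str, Any]
Decider = Callable[[str, Optional[dict], List[Any]], Decision]


class SingleEndpoint:
    """One handler; the agent decides whether to store, fetch, or execute."""

    def __init__(self, decide: Decider) -> None:
        self.decide = decide
        self.memory: List[Any] = []

    def handle(self, act_as: str, data: Optional[dict] = None) -> Dict[str, Any]:
        decision = self.decide(act_as, data, self.memory)
        op = decision["op"]
        if op == "store":
            self.memory.append(decision["value"])
            return {"status": "stored"}
        if op == "fetch":
            # Return the most recent item, as in the chat example.
            return {"result": self.memory[-1] if self.memory else None}
        if op == "execute":
            # In this sketch the agent computes the result itself.
            return {"result": decision["value"]}
        raise ValueError(f"unknown operation: {op}")


def fake_decide(act_as: str, data: Optional[dict], memory: List[Any]) -> Decision:
    """Canned policy (not a real model): store incoming messages, else fetch."""
    if data and "message" in data:
        return {"op": "store", "value": data["message"]}
    return {"op": "fetch"}


if __name__ == "__main__":
    api = SingleEndpoint(fake_decide)
    print(api.handle("chat API: send a message", {"message": "hello"}))
    print(api.handle("chat API: get new messages"))
```

Nothing in `SingleEndpoint` knows what a chat is; the behaviour comes entirely from what the decider chooses to do with the request.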

I don’t see much customer-facing “production” value in this use case. Still, it can help software engineering teams iterate faster by creating a working PoC of a product in a fraction of the time, getting feedback quicker, building what works, and throwing out what doesn’t. Founders can also use it to create a demo and raise money, etc.


Takeaways - Agents in Production?

Autonomous agents are a very interesting concept worth more experiments. Still, I think they are not yet production-ready. For most use cases, we’d need to develop a strong framework for the agents to work within before we can gain value from them (a simple proof-of-concept of such a framework was the TDD agent, which has the potential to be further improved for production coding agents).

The diagram below describes this well: to deploy an agent to production use cases, we either need a good framework to measure the agent’s progress and output, or a use case that doesn’t require too much precision.

Some use cases are more “frameworkable” by nature, like coding, where you can “verify” correctness more systematically by writing tests or running the code and seeing what it does. Use cases like therapy, on the other hand, are more open-ended, and the results are not [easily] measurable (even in human-to-human interaction, it is difficult to measure the progress/impact of a session, not to mention long-term effects).

I believe gaming (and entertainment in general) will be the first production use cases in the upcoming months (they are low on “precision-critical”, even if difficult to verify), followed by coding agents (they are high on “ease-of-verification”).

Exciting times!





Thank you for reading. If you liked my content, don’t hesitate to reach out. I’d love to talk with more people and discuss everything: tech, philosophy, AI, ideas, Lex Fridman, startups, software, science, whatever!

Twitter: https://twitter.com/adamcohenhillel
LinkedIn: https://www.linkedin.com/in/adamcohenhillel
Email: adamcohenhillel@gmail.com

Adam.