How Mendable leverages Langsmith to debug Tools & Actions
Langsmith is an easy way to debug and evaluate LLM apps
It is no secret that 2024 will be the year we start seeing more LLMs baked into our workflows. This means that the way we interact with LLMs will shift from simple question-and-answer toward action-based interactions.
At Mendable.ai, we are seeing this transformation firsthand. Late last year, we equipped ~1,000 customer success and sales people at a $20+ billion tech company with GTM assistants that help with tech guidance, process help, and industry expertise. In five months, the platform achieved $1.3 million in savings, and it's projected to save $3 million this year due to decreased research time and reduced dependency on technical resources. Now we are working with that same company to enable these assistants to take action, unlocking even more efficiency improvements.
An example use case is a salesperson who wants to get the latest focus areas for a prospect and their company. When asking an assistant enabled with our Tools & Actions "What are the latest key initiatives for X?", the AI can:
- Call the CRM API and get the exact team the salesperson is trying to sell to
- Use the Google News or DUNS API to get the latest news on the specific team and related initiatives
- Call the CoreSignal API to get the latest hiring trends for the company based on job postings and more
- Interpret the news and hiring trends, highlighting ways the salesperson can use these newfound initiatives to sell in meetings
As you can see, the introduction of Tools & Actions in Mendable expands capabilities quite a bit, enabling chatbots to access and utilize a wider range of data sources and perform various automated tasks. On the backend, to ensure the precision and efficiency of such features, Mendable leverages Langsmith's debugging tools, a critical component in the development and optimization of our AI-driven functionalities.
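To make that flow concrete, here is a minimal sketch of how a multi-tool agent along these lines could be wired up with Langchain's OpenAI tools agent in TypeScript. The tool names, schemas, and API calls are hypothetical placeholders for illustration, not our actual implementation.

```typescript
import { z } from "zod";
import { ChatOpenAI } from "@langchain/openai";
import { DynamicStructuredTool } from "@langchain/core/tools";
import { ChatPromptTemplate, MessagesPlaceholder } from "@langchain/core/prompts";
import { AgentExecutor, createOpenAIToolsAgent } from "langchain/agents";

// Hypothetical tools mirroring the CRM / news / hiring-trends example above.
const crmLookup = new DynamicStructuredTool({
  name: "crm_lookup",
  description: "Look up the prospect team and account details in the CRM.",
  schema: z.object({ company: z.string().describe("Company name to look up") }),
  func: async ({ company }) => {
    // Placeholder: call your CRM API here.
    return JSON.stringify({ company, team: "Platform Engineering" });
  },
});

const recentNews = new DynamicStructuredTool({
  name: "recent_news",
  description: "Fetch recent news about a company or team and its initiatives.",
  schema: z.object({ query: z.string().describe("Search query for the news API") }),
  func: async ({ query }) => {
    // Placeholder: call a news API (e.g. Google News / DUNS) here.
    return JSON.stringify([{ title: `Latest initiatives matching "${query}"` }]);
  },
});

const hiringTrends = new DynamicStructuredTool({
  name: "hiring_trends",
  description: "Get recent hiring trends for a company based on job postings.",
  schema: z.object({ company: z.string() }),
  func: async ({ company }) => JSON.stringify({ company, openRoles: 42 }),
});

const prompt = ChatPromptTemplate.fromMessages([
  ["system", "You are a GTM assistant. Use the tools to research the prospect."],
  ["human", "{input}"],
  new MessagesPlaceholder("agent_scratchpad"),
]);

const llm = new ChatOpenAI({ modelName: "gpt-4-turbo", temperature: 0 });
const tools = [crmLookup, recentNews, hiringTrends];
const agent = await createOpenAIToolsAgent({ llm, tools, prompt });
const executor = new AgentExecutor({ agent, tools });

const result = await executor.invoke({
  input: "What are the latest key initiatives for Acme Corp?",
});
console.log(result.output);
```

The agent loop decides which of these tools to call, in what order, and with what inputs, which is exactly the part that becomes hard to reason about without tracing.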
Opening the "black box" of agent execution
One of the biggest problems when building applications that depend on agentic behavior is reliability and the lack of observability. Understanding the key interactions and decisions of an agent loop can be quite tricky, especially when it has been given access to multiple resources and is embedded in a production pipeline.
While building Tools & Actions, the core aspect we had in mind was giving users the ability to create their own Tools by defining an API spec for them.
We designed this so the user could input a tag such as <ai-generated-value> when creating the API request, and the AI fills in that value at request time with an "AI generated" value based on the user's question and the tool's schema. This is one example, but there were many more just-in-time AI inputs/outputs that went into it. This posed some challenges in the building process that we weren't expecting. Soon our development process was full of console.log calls everywhere and high-latency runs. Trying to debug why a tool wasn't being called or why an API request had failed became a nightmare. We had no proper visibility into what the agentic behavior looked like, nor whether custom tools were working as expected.
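As an illustration (not our exact implementation), a user-defined API spec containing the <ai-generated-value> placeholder might be resolved at request time roughly like this; the spec shape and helper names below are made up for the example.

```typescript
// Hypothetical shape of a user-defined tool spec as entered in the UI.
interface ToolSpec {
  url: string;
  method: "GET" | "POST";
  // Values may be literals or the special <ai-generated-value> placeholder.
  queryParams: Record<string, string>;
}

const AI_PLACEHOLDER = "<ai-generated-value>";

// At request time, the model proposes a value for every placeholder field
// (based on the user's question and the tool's schema); this helper merely
// merges those values into the final request.
function resolveSpec(
  spec: ToolSpec,
  aiValues: Record<string, string>
): { url: string; method: string } {
  const params = new URLSearchParams();
  for (const [key, value] of Object.entries(spec.queryParams)) {
    params.set(key, value === AI_PLACEHOLDER ? aiValues[key] ?? "" : value);
  }
  return { url: `${spec.url}?${params.toString()}`, method: spec.method };
}

// Example: the "query" param is filled in by the AI at request time.
const request = resolveSpec(
  {
    url: "https://news.example.com/search",
    method: "GET",
    queryParams: { query: AI_PLACEHOLDER, limit: "5" },
  },
  { query: "Acme Corp platform engineering initiatives" }
);
console.log(request.url);
```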
Here is where Langsmith from Langchain came to help. If you are not familiar with it, Langsmith allows you to easily debug, evaluate, and manage LLM apps. It, of course, integrates seamlessly with Langchain. As we were already using parts of the OpenAI tools agents that Langchain provides, the integration was smooth.
The Debugging Process
Langsmith allows us to peek inside the agent's brain. This is very useful for debugging how an agent's thinking and decision process impacts the output.
When you enable tracing in a LangChain app, it captures and displays a detailed visualization of the run's call hierarchy. This feature allows you to explore the inputs, outputs, parameters, response times, feedback, token consumption, and other critical metrics of your run.
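For reference, turning on this tracing in a LangChain JS app is mostly a matter of environment configuration; the project name below is just an example.

```typescript
// LangSmith tracing is enabled via environment variables -- typically set in
// your shell or deployment config rather than in code.
process.env.LANGCHAIN_TRACING_V2 = "true";
process.env.LANGCHAIN_API_KEY = "<your-langsmith-api-key>";
process.env.LANGCHAIN_PROJECT = "mendable-tools-actions"; // example project name

// Any LangChain runnable invoked after this (chains, agents, tools) sends its
// full call hierarchy -- inputs, outputs, latency, token usage -- to LangSmith.
```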
When we connected Langsmith to our Tools & Actions module, we quickly spotted problems that we previously had no visibility into.
Take a look, for instance, at one of our first traces using Tools. As you can see here, the last call to ChatOpenAI took a long time: 7.23 seconds.
When we clicked into the 7.23s run, we saw that the prompt was massive: it had concatenated all of our RAG pipeline prompts/sources with our Tools & Actions prompt, leading to a delay in the streaming process. This allowed us to further optimize which chunks of the prompt the Tools & Actions pipeline actually needed, reducing total latency overall.
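The fix itself is not Langsmith-specific; conceptually it came down to passing the tools agent only the context it needs instead of the full concatenated RAG prompt. A rough sketch of that idea (the chunk-selection logic here is simplified and hypothetical):

```typescript
interface Chunk {
  text: string;
  score: number; // relevance score from the RAG retrieval step
}

// Instead of concatenating every retrieved source into the tools-agent prompt,
// keep only the top-scoring chunks up to a character budget.
function selectContext(chunks: Chunk[], maxChars = 4000): string {
  const sorted = [...chunks].sort((a, b) => b.score - a.score);
  const selected: string[] = [];
  let used = 0;
  for (const chunk of sorted) {
    if (used + chunk.text.length > maxChars) break;
    selected.push(chunk.text);
    used += chunk.text.length;
  }
  return selected.join("\n\n");
}
```

A shorter prompt directly shortens the final ChatOpenAI call that the trace flagged as the latency hotspot.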
Inspecting Tools
Another valuable aspect of having easy access to traces is the ability to inspect a tool's input. As I mentioned at the beginning, we allow users to create custom tools in Mendable. With that, we need to make sure that building a tool in the UI is quick and easy, but also that the tool performs well. This means that when we create a tool in the backend, it needs to have the correct schema, defined partially by what the user inputs in our UI (API request details) and partially by what the AI automatically feeds in at request time.
The example below shows a run of a Recent News Tool. The question inside the query parameter was generated by the AI.
Making sure that the query was faithful to what the user asked, while also being optimized for the tool being called, was very challenging. Thankfully, Langsmith made visualizing that very easy. We ran the same tool with different queries ~20 times and quickly scrolled through Langsmith, making sure the output and schema were accurate. For the runs that weren't accurate, we were able to understand why by opening the trace further or by annotating them in Langsmith so we could review them later.
One of our realizations was that the Tool's description was critical for the correct schema and input to be generated. With this new insight from running the experiments, we went ahead and improved the AI-generated part of that description in our product, and also made users aware that they need to provide good, detailed descriptions when creating a Tool.
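In practice, that means investing in both the tool-level description and the per-field descriptions, since those are what the model sees when deciding whether to call a tool and how to fill in parameters like query. A hedged sketch of what a well-described Recent News tool could look like in Langchain JS (names and wording are illustrative, not our production tool):

```typescript
import { z } from "zod";
import { DynamicStructuredTool } from "@langchain/core/tools";

const recentNewsTool = new DynamicStructuredTool({
  name: "recent_news",
  // A detailed description helps the model decide *when* to call the tool...
  description:
    "Search recent news articles about a specific company, team, or initiative. " +
    "Use this when the user asks about a prospect's latest announcements, focus areas, or key initiatives.",
  // ...and per-field descriptions help it generate a *good* input.
  schema: z.object({
    query: z
      .string()
      .describe(
        "A concise news search query, e.g. the company name plus the initiative or team mentioned by the user."
      ),
  }),
  func: async ({ query }) => {
    // Placeholder: call the news API here.
    return JSON.stringify([{ title: `Results for "${query}"` }]);
  },
});
```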
Building our Dataset
With all the optimization experiments taking over, the need to quickly save inputs/outputs for further evaluation became evident. With Langsmith, we selected the runs that we wanted to add to our dataset and clicked the "Add to Dataset" button.
This was a very quick and easy win for us: we now had all the data from our runs in one place, and we could even evaluate it using Langsmith itself.
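For anyone who prefers to do this from code, the Langsmith SDK exposes the same capability. A minimal sketch (the dataset name and example values are placeholders, and SDK details may vary by version):

```typescript
import { Client } from "langsmith";

const client = new Client(); // uses LANGCHAIN_API_KEY from the environment

// Create a dataset and add a run's inputs/outputs as an example.
const dataset = await client.createDataset("tools-and-actions-evals", {
  description: "Interesting Tools & Actions runs saved for evaluation",
});

await client.createExample(
  { question: "What are the latest key initiatives for Acme Corp?" },
  { answer: "Acme is investing heavily in platform engineering..." },
  { datasetId: dataset.id }
);
```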
Conclusion
Langsmith's debugging tools have been a game-changer for us. They've given us a clear window into how our "Tools & Actions" AI agent thinks and acts, which has been helpful for tackling tricky issues like slow response times and for making our debugging process way smoother. Mendable Tools & Actions has launched, but we are still early in the process. We have been working with amazing enterprises such as Snapchat to help improve and tailor custom solutions for them. If you are interested in testing Mendable, send me a message on X at @nickscamara_ or email us at eric@mendable.ai with your use case.
Also, if you are looking to speed up your LLM development process, I would definitely recommend trying Langsmith out - especially if you already use Langchain in your pipeline.
I hope these insights were helpful, and thanks to Langchain for being an awesome partner.