On Large Reasoning Models (LRMs). Q&A with Vishvesh Bhat
Moderated by Ramesh Chitor.
Q1. What are the main limitations of Large Language Models (LLMs) when it comes to solving long-horizon or multi-step reasoning tasks?
Can you share specific examples where LLMs tend to fail as task complexity increases?
If you ask an AI agent to book a ticket to Paris for the weekend, it may do it properly. But if you ask it to book a ticket to Paris with:
- a layover in Denver for a day to meet a friend, and with a hotel booking for a one-day stay
- a meal plan with a vegan dietary preference
- an option to reschedule if the flight duration is over 18 hours
- preferably Lufthansa/Emirates as you happen to like their service
- a total budget of $760
then it is highly likely to fail, because this is a complicated (but entirely plausible) real-world use case. We notice that as the complexity of a user request increases, accuracy tends to take a sharp dip beyond a certain point. And as the easier problems get solved by AI agents, the stakes for addressing the complicated use cases get higher.
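To make the compounding concrete, the request above can be written as a set of hard constraints that must all hold at once. The sketch below is purely illustrative (the class, field names, and values are hypothetical, not any real booking API); it shows how each added sub-request is one more condition a candidate itinerary can violate.

```python
from dataclasses import dataclass

@dataclass
class Itinerary:
    # Hypothetical fields for one candidate booking.
    layover_city: str
    layover_days: int
    hotel_nights: int
    meal_plan: str
    flight_hours: float
    airline: str
    total_cost: float

def unmet_constraints(it: Itinerary) -> list[str]:
    """Return which of the user's sub-requests this itinerary fails."""
    failures = []
    if not (it.layover_city == "Denver" and it.layover_days == 1):
        failures.append("one-day Denver layover")
    if it.hotel_nights != 1:
        failures.append("one-night hotel stay")
    if it.meal_plan != "vegan":
        failures.append("vegan meal plan")
    if it.flight_hours > 18:
        failures.append("flight under 18 hours (else reschedule)")
    if it.airline not in ("Lufthansa", "Emirates"):
        failures.append("preferred airline")
    if it.total_cost > 760:
        failures.append("budget of $760")
    return failures

# A candidate that satisfies everything except the flight-duration rule.
candidate = Itinerary("Denver", 1, 1, "vegan", 21.5, "Lufthansa", 742.0)
print(unmet_constraints(candidate))
```

An agent has to satisfy all six constraints jointly, so each added preference multiplies the ways a proposed plan can be wrong, which is one intuition for why accuracy dips sharply as requests grow more complex.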
Q2. How do Large Reasoning Models (LRMs) address the shortcomings exhibited by traditional LLMs in enterprise and agentic workflows? What architectural or methodological differences set LRMs apart from standard LLMs in this context?
Today’s Large Reasoning Models implement a method called test-time scaling: a model is fine-tuned, or trained with reinforcement learning, to elucidate its intermediate thoughts at test time before giving a final answer. The problem is that beyond a certain level of complexity, their accuracy still takes a sharp dip, and this behavior persists even with continual pretraining and further reinforcement learning. This prevents LRMs from handling complicated real-world use cases.
Q3. CoreThink advocates for a neuro-symbolic approach within LRMs. Could you explain what neuro-symbolic LRMs are, and why this hybrid paradigm is particularly suited to complex reasoning problems? What makes this fusion uniquely powerful?
Certainly. When we talk about neuro-symbolic LRMs (Large Reasoning Models) within the context of CoreThink, we’re essentially referring to a powerful integration of two historically distinct AI paradigms: neural networks and symbolic AI.
Think of it this way: traditional symbolic AI systems, before the rise of neural networks, were built on explicit rules and formal logic. They were incredibly transparent and reliable in structured environments, almost like a meticulously designed machine where you could trace every single gear and lever. This made them excellent for tasks like theorem proving or tax calculations where correctness and explainability are paramount. However, they faced a massive challenge with scalability and adaptability; every single rule had to be manually encoded, which became incredibly time-consuming and expensive for real-world complexity, and they struggled with anything outside their predefined rules, like understanding nuance or common sense.
Then came neural networks and large language models (LLMs). These models revolutionized AI by learning patterns directly from massive datasets. They’re fantastic at generalization, understanding context, and handling the messy, ambiguous nature of human language. They’re like a brilliant artist who can intuitively grasp and recreate patterns without explicitly knowing the rules of composition. But, and this is a big “but,” they suffer from the “black-box” problem. It’s hard to understand why they make a particular decision, leading to issues with explainability, and they often struggle with complex, multi-step reasoning, sometimes hallucinating outputs or failing to maintain long-term state.
CoreThink’s neuro-symbolic approach, therefore, is about taking the best of both worlds. We’re leveraging the Turing-complete nature of natural language, meaning it can, in principle, express any computable problem, and combining it with the pattern recognition capabilities of neural networks. The unique power of this fusion lies in its ability to marry the structured, explainable reasoning of symbolic AI with the adaptive, data-driven learning of neural networks.
What makes this particularly powerful for complex reasoning problems is that we can achieve the kind of robust, multi-step inference that traditional LLMs struggle with, while still ensuring transparency and logical soundness. We’re not converting natural language into a brittle, formal system; instead, CoreThink processes natural language in a Turing-complete way, dynamically constructing reasoning traces directly from common-sense priors, domain-specific context, and external knowledge. This iterative refinement within the natural language domain prevents the loss of crucial information like negation scope or pragmatic context that occurs when converting to intermediaries like vector embeddings or symbolic logic.
In essence, we’re building AI systems that can reason with the logical rigor of symbolic AI, offering auditable and debuggable decision-making, but with the flexibility, scalability, and common-sense understanding of neural networks. This leads to significantly improved accuracy on complex tasks, reduced reasoning errors, and even substantial cost savings because we’re not relying solely on expensive, GPU-intensive inference. It’s truly a foundational shift for enterprise AI that demands both performance and transparency.
Q4. Your team reports beating all major benchmarks versus current best-in-class LLMs and LRMs. What are these benchmarks, and what kinds of improvements are you seeing in metrics such as accuracy, latency, and explainability?
We’ve been really focused on pushing the boundaries of AI, and I’m excited to share that CoreThink AI is indeed outperforming major benchmarks against some of the best LLMs and LRMs out there.
We’re seeing significant improvements across three key areas:
Tool-calling: This is crucial for how well our AI can interact with and use external tools and functions. We’re leading with a score of 52 overall, notably higher than models like Claude 4 Sonnet, Grok 4-Thinking, and Gemini 2.5 Pro. Specifically, on the ‘Berkeley Function Calling v3’ benchmark, we scored a 56, which shows our superior ability to accurately and efficiently use tools.
Code-generation: This is about our AI’s ability to generate high-quality and efficient code. We’re top of the pack here with an overall score of 56.1. While others like o4-mini and Gemini 2.5 Pro score higher on ‘LiveCodeBench v6’ (where we scored 64), our consistently strong performance across benchmarks like ‘SWE-Bench Lite’ (where we scored 62.3) highlights our robust code-generation capabilities.
Reasoning and Planning: This category assesses our AI’s ability to understand complex instructions and plan solutions. We have a strong overall score of 56.7, outperforming all competitors. A standout for us is ‘Instruction Following-Evals,’ where we achieved an 89, significantly surpassing other models. This really speaks to our AI’s accuracy in interpreting and executing complex instructions. While ‘ARC-AGI-2’ remains an area of focus and ongoing evaluation, our performance there is already ahead of most.
Overall, CoreThink AI demonstrates a really well-rounded and robust capability. We’re consistently seeing ourselves as the top performer across these diverse and critical AI functionalities, which firmly establishes us as a leading model.
Q5. Why is explainability and traceability so critical in long-horizon reasoning tasks, and how does a neuro-symbolic LRM provide better transparency compared to purely neural approaches?
That’s a fantastic question, and it gets right to the heart of why we developed CoreThink. In long-horizon reasoning tasks, explainability and traceability aren’t just “nice-to-haves”—they’re absolutely critical for several reasons.
First, consider the complexity. When an AI system is tackling a multi-step workflow, like finding university records, zipping them, emailing them, and then monitoring for follow-ups, there are numerous points where things can go wrong. If a purely neural LLM makes an error, it’s often a “black box” failure. You see the wrong output, but you have no idea why it went wrong. Was it a misinterpretation of the initial query? A failure in sequencing the tools? An inability to maintain state? Without explainability, debugging these issues becomes a nightmare, leading to significant costs and inconsistent performance.
Second, in enterprise applications, particularly in regulated industries like finance, healthcare, or legal, auditability is non-negotiable. If an AI system denies a loan or makes a medical diagnosis, you must be able to explain the reasoning behind that decision. Purely neural networks, which rely on hidden statistical patterns, simply can’t provide that explicit, step-by-step justification. This makes regulatory compliance incredibly challenging.
Third, long-horizon tasks inherently involve dependencies and state tracking. If an error occurs early in the process, it can cascade and lead to completely nonsensical or harmful outcomes later on. Traceability allows you to pinpoint exactly where the breakdown happened, diagnose the root cause, and ensure the system learns from it. Without it, you’re essentially flying blind.
Now, how does a neuro-symbolic LRM like CoreThink provide better transparency compared to purely neural approaches? It’s all about integrating the best of both worlds. Traditional symbolic AI was fantastic at explainability and traceability because it operated on explicit, rule-based logic. Every decision could be traced back to a defined premise. However, it lacked adaptability and scalability. Neural networks, on the other hand, are highly adaptive and scalable, but they sacrifice that transparency.
CoreThink bridges this gap. Our core innovation is processing natural language in a Turing-complete manner, but without reducing it to opaque formal logic or vector embeddings that discard critical information. Instead, we dynamically construct reasoning traces that remain entirely in human-interpretable natural language.
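As a rough illustration of the traceability idea (not CoreThink’s actual engine or API; the step functions and names below are hypothetical stand-ins), a multi-step workflow can record each tool call as a plain-language trace entry, so a failure points to an exact step rather than a black-box output:

```python
trace: list[str] = []

def run_step(description: str, action, *args):
    """Execute one tool call and log a human-readable trace entry."""
    try:
        result = action(*args)
        trace.append(f"OK   {description} -> {result}")
        return result
    except Exception as exc:
        trace.append(f"FAIL {description}: {exc}")
        raise

# Stand-in tools for the records-zip-email workflow discussed above.
def find_records(student: str) -> str:
    return f"{student}_records.csv"

def zip_files(path: str) -> str:
    return path.replace(".csv", ".zip")

def email(to: str, attachment: str) -> str:
    return f"sent {attachment} to {to}"

records = run_step("find university records", find_records, "jdoe")
archive = run_step("zip the records", zip_files, records)
run_step("email the archive", email, "registrar@example.edu", archive)

for entry in trace:
    print(entry)
```

The point of keeping the trace in natural language rather than in opaque internal states is exactly the auditability argument above: a human reviewer can read the log, see which premise or tool call failed, and debug from there.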
Q6. What are some real-world applications or use-cases where neuro-symbolic LRMs significantly outperform LLMs and standard LRMs? Can you elaborate on specific industry domains or workflows?
That’s an excellent question, and it really gets to the heart of why neuro-symbolic AI, like our CoreThink engine, is so crucial right now. While large language models (LLMs) have made incredible strides, they hit a wall when it comes to complex, multi-step reasoning, explainability, and consistent performance in enterprise settings. This is where neuro-symbolic approaches truly shine, offering a significant advantage.
Let me break down some key real-world applications and industry domains where we see neuro-symbolic LRMs outperforming traditional LLMs and even standard agentic frameworks:
First, let’s talk about code generation platforms. Think about tools like GitHub Copilot or Replit alternatives. While they’re great for generating single functions or small code snippets, they often struggle with larger, more intricate tasks that require understanding logical coherence across multiple files and managing long-range dependencies. A neuro-symbolic system, by integrating structured reasoning, can significantly improve multi-file reasoning and code correctness. It drastically reduces those frustrating “hallucinations” in function completion or dependency resolution that developers often encounter, leading to more reliable and accurate AI-generated code.
Next, consider agentic workflows and autonomous AI agents. Frameworks like LangChain or CrewAI are designed to automate complex tasks, but their reliance on purely neural networks means they can be inconsistent, especially in multi-hop task execution. This is a critical area for neuro-symbolic advantage. Our CoreThink engine, for instance, enhances tool sequencing and decision-making in these complex workflows. More importantly, it ensures logical traceability in agent responses, which means less erratic behavior and more predictable, auditable actions. This is vital in domains where consistent and accountable automation is paramount.
Then, there’s LLM-based task planning and process automation. Solutions like Zapier AI or Microsoft Copilot aim to automate business processes, but they often lack the transparent, rule-based reasoning needed for mission-critical tasks. When something goes wrong, it’s incredibly hard to debug because the underlying reasoning is opaque. Neuro-symbolic AI addresses this directly by generating explainable reasoning traces, making the entire process auditable. This not only improves task execution reliability by reducing AI planning errors but also builds trust in automated systems, which is crucial for enterprise adoption.
More broadly, we’re talking about situations where explainability and reliability are non-negotiable. In regulated industries like finance, healthcare, and legal AI, “black box” decisions from purely neural networks are a major roadblock. A neuro-symbolic approach provides that transparent, auditable decision-making process. For example, a medical AI system predicting a diagnosis can now explain why it arrived at that conclusion, linking symptoms to conditions through explicit logical steps, rather than just outputting a probability. Similarly, a financial AI model denying a loan can provide explicit, traceable reasons, ensuring regulatory compliance and fairness.
Finally, and perhaps most importantly, neuro-symbolic systems excel in long-horizon reasoning tasks that require more than five sequential logical steps. Traditional LLMs experience an exponential accuracy decay beyond this threshold. Our benchmarks show that CoreThink can achieve over a 6x improvement beyond 8 logical steps on complex reasoning tasks. This means for workflows involving iterative decision-making, complex problem-solving, or maintaining long-term state across interactions – like a sophisticated enterprise workflow automation bot that needs to compare flight options, factor in loyalty points, reschedule based on meetings, and dynamically handle refunds – neuro-symbolic is the game-changer. It’s about moving from atomic, single-turn tasks to truly complex, adaptive intelligence.
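The decay described above can be seen with simple arithmetic: if each reasoning step independently succeeds with probability p, an n-step chain succeeds end-to-end with probability p ** n, which collapses quickly as n grows. The 0.9 per-step figure below is an illustrative assumption, not a measured number for any particular model.

```python
# Back-of-the-envelope illustration of compounding step error:
# with per-step accuracy p, an n-step chain succeeds with p ** n.
p = 0.9  # assumed per-step accuracy, for illustration only
for n in (1, 5, 8, 12):
    print(f"{n:2d} steps: {p ** n:.1%} end-to-end accuracy")
```

Even with 90% per-step accuracy, end-to-end accuracy falls to roughly 43% at 8 steps and under 30% at 12, which is the kind of drop-off that motivates targeting long-horizon reasoning directly.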
Q7. What is your vision for the future of enterprise AI reasoning?
Do you foresee neuro-symbolic LRMs becoming an industry standard, and what would widespread adoption mean for enterprise automation and decision-making?
Today’s large language models, while powerful for atomic tasks, struggle significantly with complex, multi-step reasoning, explaining their decisions, and scaling efficiently for mission-critical enterprise applications.
This leads to issues like high failure rates in tasks requiring more than five logical steps, expensive computational costs, and a lack of auditability. My vision is to overcome these hurdles by integrating the structured, explainable reasoning of symbolic AI with the adaptive, data-driven pattern recognition of neural networks.
I absolutely foresee neuro-symbolic reasoning models (NRMs), like CoreThink, becoming an industry standard. The current trajectory of enterprise AI adoption is accelerating, but companies are hitting roadblocks due to LLM unpredictability and the unsustainable costs of GPU-heavy inference. NRMs offer a compelling solution. We’ve seen 30-60% improvement in LLM output accuracy and a 2-5x reduction in reasoning errors for multi-step AI workflows. This level of reliability is crucial for tasks like code generation, agentic workflows, and process automation where logical coherence and correct tool sequencing are paramount.
By reducing dependence on GPU-heavy inference, NRMs can achieve up to 40% cost savings over traditional chain-of-thought reasoning, making enterprise AI deployments economically viable at scale. For industries under regulation (finance, healthcare, legal), the ability to generate fully explainable reasoning traces is a game-changer. This transparency fosters trust and enables compliance, which is often a barrier for black-box AI systems. Furthermore, NRMs can handle complex, multi-step tasks that current LLMs fail on, with performance improvements of over 6x beyond 8 logical steps on reasoning benchmarks, unlocking new possibilities for automating intricate business processes.
Widespread adoption of neuro-symbolic LRMs would fundamentally transform enterprise automation and decision-making in several ways. Businesses could trust AI systems with more complex, end-to-end tasks, knowing that the reasoning is sound, explainable, and less prone to hallucinations or unpredictable failures. This would enable true autonomous AI agents that can navigate intricate workflows. The current reliance on rigid, manually coded workflows for agents would diminish. Neuro-symbolic systems could dynamically adapt to new data and evolving user requests without constant manual updates, leading to faster iteration cycles and more flexible automation. With explainable reasoning traces, human experts could more easily understand, validate, and debug AI decisions, fostering a more effective partnership between human intelligence and AI capabilities. This is particularly important for critical decision-making processes. Finally, the ability to handle long-horizon reasoning and provide transparency will open doors for AI in highly regulated or sensitive domains where current LLMs fall short, such as personalized medical diagnoses, complex financial modeling, or comprehensive legal analysis.
In essence, neuro-symbolic LRMs will pave the way for a new generation of enterprise AI that is not only powerful and scalable but also reliable, transparent, and truly intelligent in its ability to reason. This is not just an incremental improvement; it’s a foundational shift that will unlock unprecedented levels of automation and insight across industries.
………………………………………….

Vishvesh Bhat is the CEO/Founder of CoreThink AI.
CoreThink AI is a startup that has successfully raised pre-seed funding from a Tier-1 VC firm in the Bay Area. CoreThink’s focus is on advancing AI reasoning capabilities and becoming a clear leader in dealing with complex AI problems. Later this year (2025), CoreThink will be releasing a reasoning engine that is expected to beat all the state-of-the-art models by over 30% on the most popular set of reasoning benchmarks. Vishvesh Bhat lives in the San Francisco Bay Area, California.

Ramesh Chitor
Ramesh Chitor is a seasoned business leader with over 20 years of experience in the high-tech industry working for Mondee. Ramesh brings a wealth of expertise in strategic alliances, business development, and go-to-market strategies. His background includes senior roles at prominent companies such as IBM, Cisco, Western Digital, and Rubrik, where he served as Senior Director of Strategic Alliances. Ramesh is actively contributing as a Business Fellow for Perplexity.
Ramesh is a value and data-driven leader known for his ability to drive successful business outcomes by fostering strong relationships with clients, partners, and the broader ecosystem. His expertise in navigating complex partnerships and his forward-thinking approach to innovation will be invaluable assets to Perplexity’s growth and strategic direction.
Sponsored by Chitor.