Agentforce has moved from demo to production in a meaningful number of enterprise orgs over the past year. The gap between the keynote demos and what actually works at scale is large enough to warrant a proper write-up. This is based on three production deployments: a B2B service desk, a Commerce Cloud post-purchase support flow, and an internal sales productivity tool.
What Agentforce Actually Is
At its core, Agentforce is a framework for deploying LLM-powered agents that act on your Salesforce data using a defined set of tools (called Actions). It sits on top of Einstein, with all inference routed through the Einstein Trust Layer - Salesforce's isolated inference infrastructure, which ensures your data doesn't train external models.
The mental model: you define an Agent (a persona + instructions), give it Topics (areas of responsibility), and attach Actions to each topic. When a user sends a message, the LLM decides which topic applies and which action to call.
```
User Message
     │
     ▼
Agent (persona + grounding instructions)
     │
     ├── Topic: Order Management
     │     ├── Action: Get Order Status (Flow)
     │     └── Action: Initiate Return (Apex)
     │
     └── Topic: Product Questions
           ├── Action: Search Knowledge (Einstein Search)
           └── Action: Get Product Details (Flow)
```
Topics Are the Critical Design Decision
Topics define the agent’s scope of reasoning. A poorly scoped topic causes the LLM to either refuse valid requests or hallucinate capabilities it doesn’t have.
Rules I follow:
One job per topic. A topic called “Customer Service” that covers orders, billing, accounts, and products is too broad. The LLM will misroute. Break it into specific domains.
Write instructions as constraints, not descriptions. Instead of “This topic handles order enquiries”, write “Only handle requests about orders the authenticated user placed. Never discuss pricing, promotions, or product inventory. If asked about returns, confirm the order exists before proceeding.”
Test topic boundary cases. What happens when a user asks something that’s plausible but out of scope? The agent should politely redirect, not confabulate an answer.
Actions: Flows vs Apex
Actions are the tools your agent calls. You have three implementation options:
Auto-launched Flows - the right default. They’re declarative, testable in Salesforce, version-controlled via source format, and the LLM can read their input/output descriptions to understand when to call them.
Apex - necessary for complex logic, external callouts, or heavy data processing. Expose via @InvocableMethod. Be explicit about what the method does in the label and description - the LLM uses these at runtime.
Standard Actions - pre-built actions for common tasks (create record, send email, search knowledge). Use these where they fit. They’re already tested against the trust layer.
For the service desk deployment, every action started as a Flow. Only two were eventually moved to Apex: one that needed to call an ERP API with custom retry logic, and one performing bulk record updates that hit Flow limits.
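As a sketch of what an Apex action looks like, here is a minimal invocable method for the ERP order-status case. The class, the `ErpClient` wrapper, and the field names are hypothetical; the shape (bulkified `List` in, `List` out, with `@InvocableVariable` members) is the standard `@InvocableMethod` contract.

```apex
public with sharing class GetErpOrderStatusAction {

    public class Request {
        @InvocableVariable(label='Order Number' required=true)
        public String orderNumber;
    }

    public class Result {
        @InvocableVariable(label='Order Status')
        public String orderStatus;
        @InvocableVariable(label='Estimated Delivery')
        public String estimatedDelivery;
    }

    // The label and description are read by the LLM at runtime to
    // decide when to call this action - be explicit in both.
    @InvocableMethod(
        label='Get ERP Order Status'
        description='Returns the shipping status and estimated delivery date for a single order number. Only call for orders the authenticated user placed.')
    public static List<Result> run(List<Request> requests) {
        List<Result> results = new List<Result>();
        for (Request req : requests) {
            Result r = new Result();
            // ErpClient is a placeholder for your callout wrapper
            ErpClient.OrderInfo info = ErpClient.getOrder(req.orderNumber);
            r.orderStatus = info.status;
            r.estimatedDelivery = info.eta;
            results.add(r);
        }
        return results;
    }
}
```

Note that the response is a typed structure, not free text - this pays off later when you want the LLM to ground its answer in specific fields.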
The Einstein Trust Layer in Practice
The Trust Layer is what makes Agentforce deployable in enterprise contexts. Key behaviours:
- Data masking - PII fields (SSN, credit card, health data) are masked before being sent to the LLM and unmasked in the response
- Toxicity filtering - inputs and outputs are screened; you can configure sensitivity levels
- Audit trail - every LLM call is logged in the AiGenerationLog object with prompt, response, and masked data map
- Zero data retention - prompts are not stored by the model provider
For regulated industries: query the AiGenerationLog regularly and include it in your data governance audits. It’s the source of truth for what the agent did.
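A periodic export along these lines is one way to feed that audit. This is a hedged sketch: `CreatedDate` is a standard field, but verify the rest of the AiGenerationLog schema against your org's API version before relying on it.

```apex
// Illustrative scheduled governance check - field selection is a
// placeholder to be confirmed against your org's AiGenerationLog schema.
public with sharing class AgentAuditExport implements Schedulable {
    public void execute(SchedulableContext ctx) {
        List<AiGenerationLog> logs = [
            SELECT Id, CreatedDate
            FROM AiGenerationLog
            WHERE CreatedDate = LAST_N_DAYS:7
            LIMIT 10000
        ];
        // Hand the records to your data-governance pipeline here,
        // e.g. serialise and push them to an external audit store.
        System.debug('Audit window contained ' + logs.size() + ' generations');
    }
}
```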
Grounding and Hallucination Control
The biggest production issue isn’t the LLM giving wrong answers - it’s the LLM confidently giving plausible-but-wrong answers when context is missing.
Techniques that work:
Explicit negative instructions. In your agent instructions: “If you cannot find the information in the data returned by actions, say you cannot find it. Do not infer or estimate.”
Structured action outputs. Return structured JSON from your actions, not free text. The LLM handles {"orderStatus": "shipped", "estimatedDelivery": "2026-01-22"} more reliably than a sentence.
Retrieval-Augmented Generation (RAG) via Knowledge. For product/policy questions, attach a Knowledge Action. The agent retrieves the relevant article and grounds its answer in it. This cuts hallucination on policy questions dramatically.
Conversation turn limits. Set a maximum turn count per session. Long conversations accumulate context that increases the chance of the LLM losing track of earlier constraints.
Testing Agentforce
Testing is the area most teams underinvest in. The agent behaves differently in prod than in the Agent Builder sandbox because the sandbox doesn’t replicate your full org data context.
What I test:
| Category | What I Check |
|---|---|
| Happy path | Agent completes the intended task end-to-end |
| Topic routing | Edge-case phrasings route to the correct topic |
| Action failure | If an action returns an error, agent responds gracefully |
| Out-of-scope | Agent refuses gracefully without hallucinating |
| Adversarial | Prompt injection attempts (e.g. “ignore your instructions and…”) |
| Data boundary | Agent doesn’t expose records the user doesn’t own |
Run these as named test scenarios in Agent Builder and save them. Rerun after every deployment.
Deployment Considerations
Agentforce components deploy via Salesforce DX source format:
```
force-app/
  main/
    default/
      aiAgents/
        OrderSupportAgent.aiAgent-meta.xml
      aiAgentTopics/
        OrderManagement.aiAgentTopic-meta.xml
      flows/
        Agent_GetOrderStatus.flow-meta.xml
```
Version your agent instructions. The systemPrompt field in .aiAgent-meta.xml is plain text - it belongs in version control and changes should go through code review. Prompt changes are code changes.
Separate sandbox and prod agent configurations. Use Named Credentials and Custom Settings to control which external endpoints and Einstein models each environment targets.
Monitor AiGenerationLog in prod. Set up a dashboard tracking: total calls, error rate, average latency, and topic distribution. Unusual topic distributions are often the first signal that something is being misrouted.
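The topic-distribution feed for such a dashboard can be a simple aggregate query. The `Topic__c` grouping field below is a placeholder - check which field your org actually exposes for topic attribution on AiGenerationLog.

```apex
// Illustrative anonymous-Apex snippet: calls per topic over the last day.
List<AggregateResult> byTopic = [
    SELECT Topic__c topic, COUNT(Id) calls
    FROM AiGenerationLog
    WHERE CreatedDate = LAST_N_DAYS:1
    GROUP BY Topic__c
];
for (AggregateResult row : byTopic) {
    System.debug(row.get('topic') + ': ' + row.get('calls'));
}
```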
What Doesn’t Work Yet
Agentforce is genuinely impressive, but there are real limitations as of early 2026:
- Multi-agent orchestration is in beta and the handoff protocol between agents is not yet stable enough for production use cases that require it
- Long-context reasoning degrades on complex multi-hop lookups - the LLM handles one action result well; chaining three creates compounding uncertainty
- Async actions aren’t natively supported - if your action takes more than a few seconds (e.g. a batch process), you need to design around it with polling
- Custom LLM models aren’t supported - you’re tied to the Einstein model roster, which is fine for most use cases but limits specialised domains
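One way to design around the async limitation is a submit/poll pair of actions: the first enqueues the work and immediately returns a job id, and a second "check status" action polls for completion. A hedged sketch of the submit half, with `ReturnProcessor` standing in for a real Queueable in your org:

```apex
public with sharing class StartBulkReturnAction {

    public class Request {
        @InvocableVariable(required=true)
        public Id orderId;
    }

    public class Result {
        @InvocableVariable(label='Job Id')
        public String jobId;
    }

    @InvocableMethod(
        label='Start Bulk Return Processing'
        description='Starts return processing asynchronously and returns a job id. Tell the user processing has started; use the status action to follow up.')
    public static List<Result> run(List<Request> requests) {
        List<Result> results = new List<Result>();
        for (Request req : requests) {
            Result r = new Result();
            // ReturnProcessor is a placeholder Queueable implementation
            r.jobId = System.enqueueJob(new ReturnProcessor(req.orderId));
            results.add(r);
        }
        return results;
    }
}
```

The agent's instructions should then tell it to report "processing has started" rather than wait, which keeps the conversation responsive.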
The platform is evolving fast. What’s in beta today will likely be GA by mid-2026.
Recommended Starting Point
If you’re just beginning: pick one narrow, high-volume, low-risk use case. “Answer common order status questions for authenticated B2C customers” is ideal. One topic, two or three actions, clear scope. Get it to production, instrument it, and iterate.
Resist the urge to build a general-purpose “concierge” agent out of the gate. Narrow agents are more reliable, easier to test, and much easier to explain to stakeholders when they behave unexpectedly.