How We Track
AI Visibility
A complete technical walkthrough of the scan pipeline, AEO scoring formula, citation classification system, diversity scoring, and how every data point is stored, cached, and kept fresh in our database.
Step-by-step pipeline
The Scan Engine
Every scan runs a batch of tracked prompts against GPT-4o and processes the raw text response through a deterministic extraction pipeline. Here is exactly what happens for each prompt, in order.
Each tracked prompt is sent with an appended instruction to name specific companies and include URLs where available. We use temperature 0.3 to reduce run-to-run variance in the response and cap output at 900 tokens, which is enough for a detailed recommendation list.
"{prompt}? Please mention specific companies, brands, or websites by name where relevant. Include URLs where available."
Three parallel checks determine whether the entity is mentioned: an exact bare-domain match in the response text, a brand-name substring match (minimum 3 characters, to avoid false positives on short acronyms), and a word-boundary regex for the domain base label.
mentioned =
  lower.includes(bareDomain) ||
  (brandLower.length >= 3 && lower.includes(brandLower)) ||
  new RegExp(`\\b${domainBase}\\b`, "i").test(lower)
The response is split by newline. Any line matching a numbered list pattern ("1.", "2)", etc.) increments a counter. The first numbered line that contains the brand name becomes the rank position. If the brand is mentioned but not inside a numbered list, rank defaults to 1.
let rank = null, count = 0
for (const line of response.split("\n")) {
  if (!/^\s*\d+[.)]/.test(line)) continue
  count++
  if (line.toLowerCase().includes(brandLower)) { rank = count; break }
}
if (mentioned && rank === null) rank = 1
A 450-character context window around the first brand mention (150 characters before, 300 after) is scanned against two word lists. Any negative signal word takes priority over positive ones. If neither list matches, sentiment is neutral. Sentiment is computed only when the brand is mentioned.
window = text.slice(Math.max(0, idx - 150), idx + 300)
sentiment =
  NEGATIVE.some(w => window.includes(w)) ? "negative" :
  POSITIVE.some(w => window.includes(w)) ? "positive" : "neutral"
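The mention, rank, and sentiment steps described above can be combined into one function. The following is a minimal sketch, not the production code: `analyseResponse`, `MentionAnalysis`, and `escapeRegExp` are hypothetical names, the two word lists are placeholders (the real lists are not published here), and it assumes the domain base label is the part of the domain before the first dot.

```typescript
// Hypothetical word lists; the real lists are not published in this document.
const NEGATIVE = ["avoid", "poor", "complaint"];
const POSITIVE = ["recommended", "trusted", "excellent"];

type Sentiment = "positive" | "neutral" | "negative";

interface MentionAnalysis {
  mentioned: boolean;
  rank: number | null;
  sentiment: Sentiment | null;
}

// Escape regex metacharacters so the label is matched literally.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function analyseResponse(text: string, bareDomain: string, brand: string): MentionAnalysis {
  const lower = text.toLowerCase();
  const domainLower = bareDomain.toLowerCase();
  const brandLower = brand.toLowerCase();
  // Assumption: the "domain base label" is the part before the first dot.
  const domainBase = domainLower.split(".")[0];

  // Three checks; brand names shorter than 3 characters are ignored.
  const brandHit = brandLower.length >= 3 && lower.includes(brandLower);
  const mentioned =
    lower.includes(domainLower) ||
    brandHit ||
    new RegExp(`\\b${escapeRegExp(domainBase)}\\b`, "i").test(lower);
  if (!mentioned) return { mentioned: false, rank: null, sentiment: null };

  // Rank: count numbered lines; the first one naming the brand sets the rank.
  let rank: number | null = null;
  let count = 0;
  for (const line of text.split("\n")) {
    if (!/^\s*\d+[.)]/.test(line)) continue;
    count++;
    if (line.toLowerCase().includes(brandLower)) {
      rank = count;
      break;
    }
  }
  if (rank === null) rank = 1; // mentioned, but not inside a numbered list

  // Sentiment: 150 characters before and 300 after the first mention.
  const idx = brandHit ? lower.indexOf(brandLower) : lower.indexOf(domainBase);
  const context = lower.slice(Math.max(0, idx - 150), idx + 300);
  const sentiment: Sentiment = NEGATIVE.some((w) => context.includes(w))
    ? "negative"
    : POSITIVE.some((w) => context.includes(w))
      ? "positive"
      : "neutral";

  return { mentioned: true, rank, sentiment };
}
```

Because the sentiment window can span neighbouring list items, a negative word attached to a competitor on an adjacent line can still mark the entity as negative. That is a property of the windowing approach itself, not only of this sketch.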
Two regex passes extract URLs. First pass captures markdown-formatted links [text](url) and preserves the anchor text as the citation name. Second pass captures bare https:// URLs not already captured. Output is capped at 6 citations per response to focus on the most prominent references.
// Pass 1 — markdown links
/\[([^\]]+)\]\((https?:\/\/[^\s)]+)\)/g
// Pass 2 — bare URLs (deduped)
/https?:\/\/[a-zA-Z0-9-.]+\.[a-z]{2,}[^\s)]*/g
Each known competitor domain is tested against the response using the same domain checks applied to the entity itself: a bare-domain match and a word-boundary match on the domain base label. Competitors are stored as a JSON array in the prompt result and their count is written to the rank-history snapshot for trend tracking.
competitors = monitor.competitors.filter(comp =>
lower.includes(bareDomain(comp)) ||
/\b{base}\b/i.test(lower)
)
All results for a prompt are written in a single Promise.all: a PromptResult upsert (deduplicated by promptId + engine + cachedDate), a PromptRankHistory insert (immutable historical record), and the Citation Pipeline (see below). The three writes run in parallel. If one fails, the batch logs the error and continues to the next prompt.
await Promise.all([
prisma.promptResult.upsert({ ... }),
prisma.promptRankHistory.create({ ... }),
processCitationPipeline({ ... }),
])
Scoring formula
AEO Visibility Score
The AEO Visibility Score is a 0–100 number summarising how well an entity is represented across all its tracked prompts in a single scan. It accounts for both presence and quality of that presence.
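Core Formula
score = (weightedSum / totalPrompts) × 100
Where weightedSum sums a per-prompt weight over all prompts. Judging by the worked example, a visible prompt is weighted 1.0 (positive sentiment), 0.8 (neutral), or 0.5 (negative), and a not_found prompt contributes 0.0.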
Worked Example — 5 prompts
| Prompt | Status | Sentiment | Weight |
|---|---|---|---|
| best hearing aids in India | visible | positive | 1.0 |
| hearing aid price comparison | visible | neutral | 0.8 |
| audiologist recommended brands | visible | negative | 0.5 |
| invisible hearing aids review | not_found | — | 0.0 |
| hearing aid service centre near me | not_found | — | 0.0 |
| weightedSum = 1.0 + 0.8 + 0.5 = 2.3, totalPrompts = 5 | | | 2.3 / 5 = 46% |
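
The arithmetic in the worked example can be reproduced in a few lines. This is a sketch under stated assumptions: `aeoScore` and `PromptOutcome` are hypothetical names, the weights are read off the example table, rounding to a whole number is assumed, and a visible prompt with no sentiment is treated as neutral purely for illustration.

```typescript
type Status = "visible" | "not_found";
type Sentiment = "positive" | "neutral" | "negative";

interface PromptOutcome {
  status: Status;
  sentiment: Sentiment | null; // null when the brand was not found
}

// Weights taken from the worked example table.
const SENTIMENT_WEIGHT: Record<Sentiment, number> = {
  positive: 1.0,
  neutral: 0.8,
  negative: 0.5,
};

function promptWeight(outcome: PromptOutcome): number {
  if (outcome.status !== "visible") return 0;
  // Assumption: a visible prompt without a sentiment counts as neutral.
  return SENTIMENT_WEIGHT[outcome.sentiment ?? "neutral"];
}

function aeoScore(outcomes: PromptOutcome[]): number {
  if (outcomes.length === 0) return 0; // avoid dividing by zero
  const weightedSum = outcomes.reduce((sum, o) => sum + promptWeight(o), 0);
  // Assumption: the score is rounded to a whole number.
  return Math.round((weightedSum / outcomes.length) * 100);
}

// The five prompts from the worked example.
const example: PromptOutcome[] = [
  { status: "visible", sentiment: "positive" },
  { status: "visible", sentiment: "neutral" },
  { status: "visible", sentiment: "negative" },
  { status: "not_found", sentiment: null },
  { status: "not_found", sentiment: null },
];

console.log(aeoScore(example)); // 46
```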
Freshness system
Daily Caching & Scan Freshness
LLM API calls are expensive and slow. We deduplicate at the calendar-day level so re-running a scan never charges twice for the same prompt, while ensuring every day produces a fresh data point for trend tracking.
Normal mode
- Checks PromptResult for rows where cachedDate = today
- Skips prompts that already have a result today
- Runs only the remaining prompts
- Returns cached: N, ran: M in the response
Force refresh
- Deletes all PromptResult rows for today's date
- Re-runs every active prompt regardless
- Appends fresh rank-history rows (existing PromptRankHistory rows are never modified)
- ScanCitations use skipDuplicates to stay idempotent
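The difference between the two modes comes down to which prompts are selected to run. The sketch below models that selection in memory, with no database; `planScan`, `CachedResult`, and `ScanPlan` are hypothetical names, and it ignores the engine column because only one engine is described.

```typescript
interface CachedResult {
  promptId: string;
  cachedDate: string; // calendar day, e.g. "2024-05-01"
}

interface ScanPlan {
  toRun: string[];
  cached: number;
  ran: number;
}

function planScan(
  activePromptIds: string[],
  results: CachedResult[],
  today: string,
  forceRefresh: boolean
): ScanPlan {
  // Force refresh deletes today's rows first, so nothing counts as cached.
  const todaysRows = forceRefresh ? [] : results.filter((r) => r.cachedDate === today);
  const alreadyDone = new Set(todaysRows.map((r) => r.promptId));
  const toRun = activePromptIds.filter((id) => !alreadyDone.has(id));
  return { toRun, cached: activePromptIds.length - toRun.length, ran: toRun.length };
}

const prompts = ["p1", "p2", "p3"];
const stored: CachedResult[] = [
  { promptId: "p1", cachedDate: "2024-05-02" }, // already scanned today
  { promptId: "p2", cachedDate: "2024-05-01" }, // yesterday's row does not count
];

const normal = planScan(prompts, stored, "2024-05-02", false);
const forced = planScan(prompts, stored, "2024-05-02", true);
console.log(normal); // { toRun: ["p2", "p3"], cached: 1, ran: 2 }
console.log(forced); // { toRun: ["p1", "p2", "p3"], cached: 0, ran: 3 }
```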
How history accumulates
Each scan appends an immutable PromptRankHistory row per prompt. These rows are never deleted or modified — they form the permanent audit trail that powers the trend charts, sparklines, and task impact calculations. The PromptResult table is the mutable daily cache; PromptRankHistory is the immutable ledger.
Citation pipeline
How Citations Are Extracted & Stored
After every prompt run we pass the extracted URLs through a four-stage citation pipeline that classifies each URL, aggregates per-domain statistics, and keeps a live diversity score for the entity.
Extract domain
new URL() inside a try/catch, with a regex fallback, strips the scheme, the www. prefix, and the path. Returns null for input that cannot be parsed at all; it never throws.
Classify type
Priority-ordered lookup against six static Set registries: social → review → directory → news → ecommerce → aggregator → blog patterns → unknown.
Write ScanCitation
One row per URL per prompt per day. The composite unique key (workspaceId, promptId, url, cachedDate) makes force-refresh idempotent via skipDuplicates.
Upsert aggregates
EntityCitationDomain and CitationAnalysis are recomputed from ScanCitations as the source of truth — never incrementally drifted.
Domain extraction — safe against malformed URLs
function extractDomain(url: string): string | null {
  try {
    const parsed = new URL(url.startsWith("http") ? url : `https://${url}`);
    return parsed.hostname.replace(/^www\./, "").toLowerCase() || null;
  } catch {
    // Fallback for severely malformed URLs (missing scheme, typos, etc.)
    const m = url.match(/(?:https?:\/\/)?(?:www\.)?([a-z0-9][a-z0-9-]*(\.[a-z0-9-]+)+)/i);
    return m ? m[1].toLowerCase() : null;
  }
}
9-type classification system
Citation Type Taxonomy
Every citation domain is mapped to one of nine types using a priority-ordered lookup. Types are permanent — once a domain is classified, that type is stored with the EntityCitationDomain record and used in all diversity calculations.
News & Media
Coverage in established news outlets and industry publications. The most authoritative citation type — signals that credible journalists reference your brand independently.
e.g. forbes.com, techcrunch.com, ndtv.com, thehindu.com
Review Platform
Citations on dedicated review and rating platforms. Strong social proof signal — indicates customers publicly evaluate your product and AI models recognise that authority.
e.g. trustpilot.com, g2.com, yelp.com, practo.com
Directory
Business and professional directories. Foundational entity recognition — structured directory listings help LLMs consistently map your brand name to a real business.
e.g. justdial.com, indiamart.com, clutch.co, yellowpages.com
Third-Party Blog
Independent writers, Medium publications, and hosted blog platforms. Good for long-tail visibility — bloggers often cover niche topics that mainstream media ignores.
e.g. medium.com, substack.com, blogspot.com, hashnode.dev
Aggregator
Comparison and aggregation platforms. Valuable for transactional queries — users on aggregators are high-intent buyers, and appearing there reinforces purchase-stage visibility.
e.g. policybazaar.com, booking.com, zomato.com, 1mg.com
Social
Social media profiles and posts. Baseline presence signal — LLMs do surface social content, especially LinkedIn for B2B brands. Contributes to recognition but is easily replicated.
e.g. linkedin.com, youtube.com, instagram.com, twitter.com
E-commerce
Product listings on marketplace and e-commerce platforms. Relevant for product businesses — indicates your items are discoverable and purchasable through major retail channels.
e.g. amazon.in, flipkart.com, ebay.com, etsy.com
Owned
Your own domain or its subdomains. Low diversity weight — expected and necessary but not differentiating. A high proportion of owned citations relative to total often signals weak third-party authority.
e.g. yourdomain.com, blog.yourdomain.com, shop.yourdomain.com
Unknown
Domains not matching any known category. Treated as neutral — may include niche industry sites, country-specific directories, or newly established platforms not yet in our registry.
e.g. niche-industry-portal.com, regional-news.co.in, …
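The priority-ordered lookup can be sketched as follows. The registries here contain only the example domains listed above, the type identifiers `aggregator`, `ecommerce`, and `unknown` are assumed spellings, and checking the owned domain before the registries is an assumption, since the stated priority order does not say where that check sits. `classifyDomain` is a hypothetical name.

```typescript
type CitationType =
  | "news_media" | "review_platform" | "directory" | "third_party_blog"
  | "aggregator" | "social" | "ecommerce" | "owned" | "unknown";

// Registry order follows the stated priority:
// social, review, directory, news, ecommerce, aggregator.
const REGISTRIES: Array<[CitationType, Set<string>]> = [
  ["social", new Set(["linkedin.com", "youtube.com", "instagram.com", "twitter.com"])],
  ["review_platform", new Set(["trustpilot.com", "g2.com", "yelp.com", "practo.com"])],
  ["directory", new Set(["justdial.com", "indiamart.com", "clutch.co", "yellowpages.com"])],
  ["news_media", new Set(["forbes.com", "techcrunch.com", "ndtv.com", "thehindu.com"])],
  ["ecommerce", new Set(["amazon.in", "flipkart.com", "ebay.com", "etsy.com"])],
  ["aggregator", new Set(["policybazaar.com", "booking.com", "zomato.com", "1mg.com"])],
];

// Blog hosts are matched as patterns so that subdomains also qualify.
const BLOG_HOSTS = ["medium.com", "substack.com", "blogspot.com", "hashnode.dev"];

function matchesHost(domain: string, host: string): boolean {
  return domain === host || domain.endsWith(`.${host}`);
}

function classifyDomain(domain: string, ownedDomain: string): CitationType {
  const d = domain.toLowerCase();
  // Assumption: the owned check runs before any registry lookup.
  if (matchesHost(d, ownedDomain.toLowerCase())) return "owned";
  for (const [type, domains] of REGISTRIES) {
    if (domains.has(d)) return type;
  }
  if (BLOG_HOSTS.some((host) => matchesHost(d, host))) return "third_party_blog";
  return "unknown";
}

console.log(classifyDomain("forbes.com", "yourdomain.com")); // "news_media"
console.log(classifyDomain("blog.yourdomain.com", "yourdomain.com")); // "owned"
console.log(classifyDomain("someone.substack.com", "yourdomain.com")); // "third_party_blog"
```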
Authority metric
Citation Diversity Score
A single number representing how broadly and authoritatively an entity is cited across the web. A higher score means more diverse, higher-quality references, which makes it harder for competitors to displace you from AI recommendations.
Formula
diversityScore = sum of typeWeight(type) over every unique domain in EntityCitationDomain
Each unique domain in EntityCitationDomain contributes exactly once, regardless of how many times it was cited. This rewards breadth — being cited by 10 different domains scores higher than being cited 10 times by the same domain.
Worked Example
| Domain | Type | Weight |
|---|---|---|
| healthline.com | news_media | 3.0 |
| trustpilot.com | review_platform | 2.5 |
| justdial.com | directory | 2.0 |
| medium.com | third_party_blog | 1.5 |
| instagram.com | social | 1.0 |
| yourdomain.com | owned | 0.5 |
| Total | 3.0 + 2.5 + 2.0 + 1.5 + 1.0 + 0.5 | 10.5 |
The score is rounded to one decimal place and stored in CitationAnalysis after every scan. There is no defined maximum — scores above 15 indicate strong multi-channel authority, while scores below 3 suggest the entity is only cited by one or two low-weight source types.
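A minimal sketch of the calculation, using only the six weights that appear in the worked example. The weights for the remaining types are not stated here, so this sketch counts them as 0, which is an assumption rather than the real behaviour; `diversityScore` and `CitedDomain` are hypothetical names.

```typescript
// Weights shown in the worked example; other types are not listed there.
const TYPE_WEIGHT: Record<string, number> = {
  news_media: 3.0,
  review_platform: 2.5,
  directory: 2.0,
  third_party_blog: 1.5,
  social: 1.0,
  owned: 0.5,
};

interface CitedDomain {
  domain: string;
  type: string;
}

function diversityScore(rows: CitedDomain[]): number {
  const seen = new Set<string>();
  let total = 0;
  for (const row of rows) {
    if (seen.has(row.domain)) continue; // each unique domain counts once
    seen.add(row.domain);
    // Assumption: a type without a listed weight contributes 0 in this sketch.
    total += TYPE_WEIGHT[row.type] ?? 0;
  }
  return Math.round(total * 10) / 10; // one decimal place
}

const cited: CitedDomain[] = [
  { domain: "healthline.com", type: "news_media" },
  { domain: "trustpilot.com", type: "review_platform" },
  { domain: "justdial.com", type: "directory" },
  { domain: "medium.com", type: "third_party_blog" },
  { domain: "instagram.com", type: "social" },
  { domain: "yourdomain.com", type: "owned" },
];

console.log(diversityScore(cited)); // 10.5
```

Passing the same domain twice leaves the result unchanged, which is the breadth-over-volume behaviour described above.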
Automatic recommendations
Citation Health Flags
After every scan, three threshold-based checks run against the entity's aggregated citation data. Any threshold that is breached generates a recommendation string stored in CitationAnalysis.flags[].
if ownedCount === 0
When AI models answer questions about your category but never link to your own domain, it means LLMs have not yet established a strong association between your brand and your domain. This requires direct content strategy work: schema markup, structured data, and authoritative on-site content.
if ownedCount / totalCitations < 0.30
Less than 30% of all citation events reference your own domain. Third-party sources are doing the work of representing your brand. While some third-party citations are healthy, over-reliance signals that your own site lacks the depth and authority that LLMs look for when choosing what to cite.
if maxDomainCount / totalCitations > 0.60
More than 60% of all citations come from a single domain. This creates fragility: if that source changes how it covers your brand, or if LLMs shift their weighting away from that source, your visibility score drops sharply. Diversification across multiple citation types insulates against this.
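The three checks can be expressed as one function over per-domain counts. In this sketch the flag strings are placeholders (the actual recommendation text is not given here), `citationFlags` and `DomainCount` are hypothetical names, and the low-share check is skipped when there are no owned citations at all so that the first two flags do not both fire; the source does not say whether they are mutually exclusive.

```typescript
interface DomainCount {
  domain: string;
  type: string; // "owned" marks the entity's own domain
  timesCited: number;
}

// Placeholder flag strings, for illustration only.
const NO_OWNED = "no_owned_citations";
const LOW_OWNED_SHARE = "low_owned_share";
const CONCENTRATED = "single_domain_concentration";

function citationFlags(rows: DomainCount[]): string[] {
  const totalCitations = rows.reduce((sum, r) => sum + r.timesCited, 0);
  const ownedCount = rows
    .filter((r) => r.type === "owned")
    .reduce((sum, r) => sum + r.timesCited, 0);
  const maxDomainCount = rows.reduce((max, r) => Math.max(max, r.timesCited), 0);

  const flags: string[] = [];
  if (ownedCount === 0) {
    flags.push(NO_OWNED);
  } else if (ownedCount / totalCitations < 0.3) {
    // Assumption: only checked when at least one owned citation exists.
    flags.push(LOW_OWNED_SHARE);
  }
  if (totalCitations > 0 && maxDomainCount / totalCitations > 0.6) {
    flags.push(CONCENTRATED);
  }
  return flags;
}

const skewed: DomainCount[] = [
  { domain: "yourdomain.com", type: "owned", timesCited: 1 },
  { domain: "trustpilot.com", type: "review_platform", timesCited: 7 },
  { domain: "forbes.com", type: "news_media", timesCited: 2 },
];

console.log(citationFlags(skewed)); // ["low_owned_share", "single_domain_concentration"]
```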
Data architecture
Database Schema & Freshness Model
Six tables work together to provide both real-time prompt results and a deep historical ledger. Mutable cache tables and immutable append-only logs serve different purposes — here is how they differ.
| Table | Type | Freshness strategy | Unique key / deduplication | Purpose |
|---|---|---|---|---|
| PromptResult | Mutable cache | Daily — one row per prompt per engine per day | promptId + engine + cachedDate | Latest result for UI display; overwritten on force refresh |
| PromptRankHistory | Immutable log | Appended on every scan — never updated or deleted | cuid (auto) + index on promptId + scannedAt | Historical trend data, chart points, task impact tracking |
| ScanCitation | Immutable log | One row per URL per prompt per day; skipDuplicates on re-run | workspaceId + promptId + url + cachedDate | Per-citation record used as source of truth for aggregates |
| EntityCitationDomain | Recomputed view | Upserted after every scan — counts recomputed from ScanCitation | workspaceId + domain | Per-domain aggregation: timesCited, firstSeen, promptsList |
| CitationAnalysis | Recomputed view | Upserted after every scan — single row per workspace | workspaceId (unique) | Entity-level diversity score, ownedCount, flags[] |
| PinnedPrompt | User state | Mutated by user actions only; not affected by scans | workspaceId + promptId | Watchlist membership for prioritised monitoring |
Key design decisions
Idempotency by design
Every cache and aggregate write is safe to repeat. PromptResult uses upsert with a composite unique key. ScanCitation uses createMany with skipDuplicates. EntityCitationDomain and CitationAnalysis are recomputed from scratch rather than incremented, so running the citation pipeline 10 times produces the same aggregates as running it once. PromptRankHistory is the deliberate exception: it is append-only, so each scan adds a new row.
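The effect of a composite unique key plus recomputed aggregates can be shown with an in-memory model. This is a sketch, not the Prisma code: `CitationStore` is a hypothetical class, a Map keyed on the four composite-key fields stands in for the database constraint, and `insertMany` only imitates the skip-on-duplicate behaviour.

```typescript
interface ScanCitation {
  workspaceId: string;
  promptId: string;
  url: string;
  cachedDate: string;
  domain: string;
}

class CitationStore {
  private rows = new Map<string, ScanCitation>();

  // Imitates a bulk insert that silently skips rows whose
  // (workspaceId, promptId, url, cachedDate) key already exists.
  insertMany(batch: ScanCitation[]): number {
    let inserted = 0;
    for (const c of batch) {
      const key = JSON.stringify([c.workspaceId, c.promptId, c.url, c.cachedDate]);
      if (this.rows.has(key)) continue;
      this.rows.set(key, c);
      inserted++;
    }
    return inserted;
  }

  // Aggregates are rebuilt from the stored rows every time, never incremented.
  domainCounts(workspaceId: string): Record<string, number> {
    const counts: Record<string, number> = {};
    this.rows.forEach((c) => {
      if (c.workspaceId !== workspaceId) return;
      counts[c.domain] = (counts[c.domain] ?? 0) + 1;
    });
    return counts;
  }
}

const store = new CitationStore();
const batch: ScanCitation[] = [
  { workspaceId: "w1", promptId: "p1", url: "https://forbes.com/a", cachedDate: "2024-05-02", domain: "forbes.com" },
  { workspaceId: "w1", promptId: "p2", url: "https://forbes.com/a", cachedDate: "2024-05-02", domain: "forbes.com" },
  { workspaceId: "w1", promptId: "p1", url: "https://g2.com/x", cachedDate: "2024-05-02", domain: "g2.com" },
];

const firstRun = store.insertMany(batch); // 3 rows written
const secondRun = store.insertMany(batch); // 0 rows written, all duplicates
console.log(firstRun, secondRun, store.domainCounts("w1"));
```

Re-running the same batch changes neither the stored rows nor the derived counts, which is why a force refresh on the same day does not inflate citation totals.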
Source of truth separation
Mutable tables (PromptResult) handle today's UI data. Immutable tables (PromptRankHistory, ScanCitation) handle historical truth. Computed views (EntityCitationDomain, CitationAnalysis) are always derived from the immutable tables — they can be wiped and rebuilt at any time without data loss.
Cascade-safe deletions
All child records use onDelete: Cascade so deleting a workspace, topic, or prompt cleanly removes all associated data. ScanCitations cascade from Prompt, not from PromptResult, so force-refreshing the mutable cache does not erase citation history.