Full Technical Breakdown

How We Track AI Visibility

A complete technical walkthrough of the scan pipeline, AEO scoring formula, citation classification system, diversity scoring, and how every data point is stored, cached, and kept fresh in our database.


Step-by-step pipeline

The Scan Engine

Every scan runs a batch of tracked prompts against GPT-4o and processes the raw text response through a deterministic extraction pipeline. Here is exactly what happens for each prompt, in order.

01 Prompt sent to GPT-4o

An instruction to name specific companies and include URLs where available is appended to the prompt text. We use temperature 0.3 to reduce run-to-run variance in the output. Responses are capped at 900 tokens, which is enough for a detailed recommendation list.
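As a rough illustration, the request for one prompt could be assembled as follows. This is a minimal sketch: `buildScanRequest`, `ScanRequest`, and `PROMPT_SUFFIX` are hypothetical names, and the payload shape assumes the OpenAI Chat Completions format (`model`, `messages`, `temperature`, `max_tokens`). Only the template wording, the temperature, and the token cap come from this page.

```typescript
// Hypothetical request shape, modelled on the Chat Completions payload.
interface ScanRequest {
  model: string;
  temperature: number;
  max_tokens: number;
  messages: { role: "user"; content: string }[];
}

const PROMPT_SUFFIX =
  "Please mention specific companies, brands, or websites by name where relevant. " +
  "Include URLs where available.";

// Builds the payload for one tracked prompt. It does not send anything.
function buildScanRequest(prompt: string): ScanRequest {
  return {
    model: "gpt-4o",
    temperature: 0.3, // low temperature, as stated in the text
    max_tokens: 900, // response cap, as stated in the text
    messages: [{ role: "user", content: `${prompt}? ${PROMPT_SUFFIX}` }],
  };
}
```

Keeping the payload builder separate from the network call makes the prompt wording easy to test without spending API credits.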

"{prompt}? Please mention specific companies, brands, or websites by name where relevant. Include URLs where available."

02 Mention detection

Three parallel checks determine whether the entity is mentioned: exact bare-domain match in the response text, brand-name substring match (minimum 3 characters to avoid false positives on acronyms), and a word-boundary regex for the domain base label.

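const lower = response.toLowerCase();
const mentioned =
  lower.includes(bareDomain) ||
  (brandLower.length >= 3 && lower.includes(brandLower)) || // 3-character minimum
  new RegExp(`\\b${domainBase}\\b`, "i").test(lower); // domainBase assumed regex-safe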
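The three checks can be wrapped into one self-contained function. The sketch below is an illustration only: `isMentioned` and `escapeRegExp` are hypothetical names, and taking the first label of the domain as the "domain base label" is an assumption.

```typescript
// Escape characters that have special meaning inside a regular expression.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Returns true if any of the three checks finds the entity in the response.
function isMentioned(response: string, domain: string, brand: string): boolean {
  const lower = response.toLowerCase();
  const bareDomain = domain
    .toLowerCase()
    .replace(/^https?:\/\//, "")
    .replace(/^www\./, "")
    .replace(/\/.*$/, "");
  const domainBase = bareDomain.split(".")[0]; // assumption: first label, e.g. "acme" for "acme.com"
  const brandLower = brand.trim().toLowerCase();

  // Empty inputs are guarded so that "" never counts as a match.
  const domainHit = bareDomain.length > 0 && lower.includes(bareDomain);
  const brandHit = brandLower.length >= 3 && lower.includes(brandLower);
  const baseHit =
    domainBase.length > 0 &&
    new RegExp(`\\b${escapeRegExp(domainBase)}\\b`, "i").test(lower);
  return domainHit || brandHit || baseHit;
}
```

The word-boundary check is what stops a base label such as `acme` from matching inside a longer token such as `acmepro`.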

03 Rank extraction

The response is split by newline. Any line matching a numbered list pattern ("1.", "2)", etc.) increments a counter. The first numbered line that contains the brand name becomes the rank position. If mentioned but not in a numbered list, rank defaults to 1.

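let count = 0, rank = null;
for (const line of response.split("\n")) {
  if (!/^\s*\d+[.)]/.test(line)) continue; // only numbered lines are counted
  count++;
  if (line.toLowerCase().includes(brandLower)) { rank = count; break; }
}
if (mentioned && rank === null) rank = 1; // mentioned outside a numbered list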
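A self-contained version of the rank logic might look like this. It is a sketch under the rules described above; `extractRank` is a hypothetical name, and the brand comparison is assumed to be case-insensitive.

```typescript
// Returns the 1-based list position of the brand, or null if it is not mentioned.
function extractRank(response: string, brand: string, mentioned: boolean): number | null {
  const brandLower = brand.toLowerCase();
  let count = 0;
  for (const line of response.split("\n")) {
    if (!/^\s*\d+[.)]/.test(line)) continue; // only numbered list items count
    count++;
    if (line.toLowerCase().includes(brandLower)) return count;
  }
  // Mentioned in prose but not inside a numbered list: default to 1.
  return mentioned ? 1 : null;
}
```

For a response whose second list item reads `2) Acme Hearing`, the function returns 2. For a response that only mentions the brand in running prose, it returns 1.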

04 Sentiment analysis

A 450-character context window around the first brand mention is scanned against two word lists. Presence of any negative signal word takes priority over positive. If neither list matches, sentiment is neutral. Sentiment is only computed when the brand is mentioned.

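const start = Math.max(0, idx - 150); // clamp so a negative index is never used
const window = text.slice(start, idx + 300);
const sentiment =
  NEGATIVE.some(w => window.includes(w)) ? "negative"
  : POSITIVE.some(w => window.includes(w)) ? "positive"
  : "neutral";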
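The sentiment rule can be sketched as a small function. The word lists below are illustrative placeholders, since the real lists are not given here, and `classifySentiment` is a hypothetical name.

```typescript
type Sentiment = "positive" | "neutral" | "negative";

// Illustrative word lists only. The production lists are not shown on this page.
const NEGATIVE = ["avoid", "poor", "complaint", "unreliable"];
const POSITIVE = ["recommended", "excellent", "trusted", "reliable"];

// Returns null when the brand is absent, since sentiment is only computed for mentions.
function classifySentiment(text: string, brand: string): Sentiment | null {
  const lower = text.toLowerCase();
  const idx = lower.indexOf(brand.toLowerCase());
  if (idx === -1) return null;
  // 150 characters before and 300 after the first mention, clamped at the start.
  const context = lower.slice(Math.max(0, idx - 150), idx + 300);
  if (NEGATIVE.some((w) => context.includes(w))) return "negative"; // negative takes priority
  if (POSITIVE.some((w) => context.includes(w))) return "positive";
  return "neutral";
}
```

Checking the negative list first matters for mixed passages: a sentence that both recommends the brand and reports a complaint is classified as negative.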

05 Citation extraction

Two regex passes extract URLs. First pass captures markdown-formatted links [text](url) and preserves the anchor text as the citation name. Second pass captures bare https:// URLs not already captured. Output is capped at 6 citations per response to focus on the most prominent references.

// Pass 1 — markdown links
/\[([^\]]+)\]\((https?:\/\/[^\s)]+)\)/g
// Pass 2 — bare URLs (deduped)
/https?:\/\/[a-zA-Z0-9-.]+\.[a-z]{2,}[^\s)]*/g
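Put together, the two passes could be run like this. This is a sketch that reuses the two regular expressions shown above. `extractCitations` and the `Citation` shape are hypothetical, and applying the cap after both passes (markdown links first) is an assumption.

```typescript
interface Citation {
  name: string | null; // anchor text for markdown links, null for bare URLs
  url: string;
}

const MAX_CITATIONS = 6;

function extractCitations(text: string): Citation[] {
  const citations: Citation[] = [];
  const seen = new Set<string>();
  let m: RegExpExecArray | null;

  // Pass 1: markdown links, keeping the anchor text as the citation name.
  const markdownLink = /\[([^\]]+)\]\((https?:\/\/[^\s)]+)\)/g;
  while ((m = markdownLink.exec(text)) !== null) {
    if (!seen.has(m[2])) {
      seen.add(m[2]);
      citations.push({ name: m[1], url: m[2] });
    }
  }

  // Pass 2: bare URLs that pass 1 did not already capture.
  const bareUrl = /https?:\/\/[a-zA-Z0-9-.]+\.[a-z]{2,}[^\s)]*/g;
  while ((m = bareUrl.exec(text)) !== null) {
    if (!seen.has(m[0])) {
      seen.add(m[0]);
      citations.push({ name: null, url: m[0] });
    }
  }

  return citations.slice(0, MAX_CITATIONS);
}
```

Because the bare-URL pattern stops at whitespace or a closing parenthesis, a URL inside a markdown link is matched identically in both passes and the `seen` set removes the duplicate.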

06 Competitor detection

Each known competitor domain is tested against the response with the same style of checks used for the entity itself; the snippet shows the bare-domain and word-boundary checks. Competitors are stored as a JSON array in the prompt result, and their count is written to the rank-history snapshot for trend tracking.

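const competitors = monitor.competitors.filter(comp => {
  const base = bareDomain(comp).split(".")[0]; // domain base label (assumed: first label)
  return lower.includes(bareDomain(comp)) ||
    new RegExp(`\\b${base}\\b`, "i").test(lower);
});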
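For completeness, here is a runnable sketch of the competitor filter with its helpers spelled out. `bareDomain` appears in the snippet above but its body is not shown, so the implementation here is an assumption, as are `escapeRegExp` and `detectCompetitors`.

```typescript
// Strips scheme, "www." and any path, e.g. "https://www.rival.com/x" -> "rival.com".
function bareDomain(input: string): string {
  return input
    .toLowerCase()
    .replace(/^https?:\/\//, "")
    .replace(/^www\./, "")
    .replace(/\/.*$/, "");
}

function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Returns the subset of known competitors that appear in the response.
function detectCompetitors(response: string, competitors: string[]): string[] {
  const lower = response.toLowerCase();
  return competitors.filter((comp) => {
    const domain = bareDomain(comp);
    const base = domain.split(".")[0]; // assumption: first label of the domain
    if (!domain || !base) return false;
    return (
      lower.includes(domain) ||
      new RegExp(`\\b${escapeRegExp(base)}\\b`, "i").test(lower)
    );
  });
}
```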

07 Atomic write to DB

All results for a prompt are written in a single Promise.all: a PromptResult upsert (deduplicated by promptId + engine + cachedDate), a PromptRankHistory insert (immutable historical record), and the Citation Pipeline (see below). The three writes run in parallel. Promise.all is not a database transaction: it rejects as soon as any one write fails, and writes that have already completed are not rolled back. When that happens, the batch logs the error and continues to the next prompt.

await Promise.all([
  prisma.promptResult.upsert({ ... }),
  prisma.promptRankHistory.create({ ... }),
  processCitationPipeline({ ... }),
])
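The log-and-continue behaviour can be illustrated with stubbed writes. In this sketch `runBatch`, `PromptJob`, and the `log` callback are hypothetical, and the real writers are replaced by plain async functions.

```typescript
type Write = () => Promise<void>;

interface PromptJob {
  promptId: string;
  writes: Write[]; // e.g. result upsert, history insert, citation pipeline
}

// Runs each prompt's writes in parallel. A failure is logged and the batch moves on.
async function runBatch(jobs: PromptJob[], log: (msg: string) => void): Promise<string[]> {
  const failed: string[] = [];
  for (const job of jobs) {
    try {
      await Promise.all(job.writes.map((w) => w()));
    } catch (err) {
      // Promise.all rejects on the first failure. Other writes are not rolled back.
      log(`prompt ${job.promptId} failed: ${String(err)}`);
      failed.push(job.promptId);
    }
  }
  return failed;
}
```

If per-write outcomes are needed rather than a single pass or fail per prompt, `Promise.allSettled` is the standard alternative, since it reports the result of every promise.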

Scoring formula

AEO Visibility Score

The AEO Visibility Score is a 0–100 number summarising how well an entity is represented across all its tracked prompts in a single scan. It accounts for both presence and quality of that presence.

Core Formula

score = round(weightedSum / totalPrompts × 100)

Where weightedSum sums over all prompts:

visible + positive → weight 1.0

visible + neutral → weight 0.8

visible + negative → weight 0.5

not_found → weight 0.0

Worked Example — 5 prompts

Prompt	Status	Sentiment	Weight
best hearing aids in India	visible	positive	1.0
hearing aid price comparison	visible	neutral	0.8
audiologist recommended brands	visible	negative	0.5
invisible hearing aids review	not_found	—	0.0
hearing aid service centre near me	not_found	—	0.0
weightedSum = 1.0 + 0.8 + 0.5 + 0.0 + 0.0 = 2.3 · totalPrompts = 5 · score = round(2.3 / 5 × 100) = 46
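The formula and the worked example translate directly into code. This is a minimal sketch: `aeoScore` and `PromptOutcome` are hypothetical names, and returning 0 for an empty scan is an assumption.

```typescript
type Status = "visible" | "not_found";
type Sentiment = "positive" | "neutral" | "negative";

interface PromptOutcome {
  status: Status;
  sentiment: Sentiment | null; // null when the brand was not found
}

const SENTIMENT_WEIGHT: Record<Sentiment, number> = {
  positive: 1.0,
  neutral: 0.8,
  negative: 0.5,
};

function aeoScore(outcomes: PromptOutcome[]): number {
  if (outcomes.length === 0) return 0; // assumption: an empty scan scores 0
  let weightedSum = 0;
  for (const o of outcomes) {
    // not_found prompts contribute 0 but still count in the denominator.
    if (o.status === "visible" && o.sentiment !== null) {
      weightedSum += SENTIMENT_WEIGHT[o.sentiment];
    }
  }
  return Math.round((weightedSum / outcomes.length) * 100);
}
```

Feeding in the five prompts from the table (one positive, one neutral, one negative, two not found) gives a score of 46.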

Freshness system

Daily Caching & Scan Freshness

LLM API calls are expensive and slow. We deduplicate at the calendar-day level so re-running a scan never charges twice for the same prompt, while ensuring every day produces a fresh data point for trend tracking.

Normal mode

Checks PromptResult for rows where cachedDate = today
Skips prompts that already have a result today
Runs only the remaining prompts
Returns cached: N, ran: M in the response

Force refresh

Deletes all PromptResult rows for today's date
Re-runs every active prompt regardless
Appends fresh rank-history rows (existing history rows are never overwritten)
ScanCitations use skipDuplicates to stay idempotent
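The difference between the two modes comes down to which prompts are selected for a fresh LLM call. The sketch below models that decision in memory. `planScan`, `ScanPlan`, and `cacheDate` are hypothetical names, and using the UTC calendar day as the cache key is an assumption, since no timezone is stated.

```typescript
interface ScanPlan {
  toRun: string[]; // prompt ids that need a fresh LLM call
  cached: number; // prompts skipped because today's result already exists
  ran: number; // prompts that will be executed
}

// cachedToday holds prompt ids that already have a PromptResult row for today.
function planScan(
  activePromptIds: string[],
  cachedToday: Set<string>,
  forceRefresh: boolean
): ScanPlan {
  if (forceRefresh) {
    // Force refresh: today's cache is ignored and every active prompt re-runs.
    return { toRun: activePromptIds.slice(), cached: 0, ran: activePromptIds.length };
  }
  const toRun = activePromptIds.filter((id) => !cachedToday.has(id));
  return { toRun, cached: activePromptIds.length - toRun.length, ran: toRun.length };
}

// Calendar-day cache key in UTC (assumed), e.g. "2024-05-01".
function cacheDate(now: Date): string {
  return now.toISOString().slice(0, 10);
}
```

With three active prompts and one already cached today, normal mode reports `cached: 1, ran: 2`, while a force refresh reports `cached: 0, ran: 3`.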

How history accumulates

Scan on May 1 → RankHistory row
Scan on May 5 → RankHistory row
Scan on May 12 → RankHistory row
Scan on May 20 → RankHistory row

Each scan appends an immutable PromptRankHistory row per prompt. These rows are never deleted or modified — they form the permanent audit trail that powers the trend charts, sparklines, and task impact calculations. The PromptResult table is the mutable daily cache; PromptRankHistory is the immutable ledger.

Citation pipeline

How Citations Are Extracted & Stored

After every prompt run we pass the extracted URLs through a four-stage citation pipeline that classifies each URL, aggregates per-domain statistics, and keeps a live diversity score for the entity.

Extract domain

new URL() inside a try/catch, with a regex fallback, strips the scheme, the www. prefix, and the path. It returns null for input that cannot be parsed and never throws.

Classify type

Priority-ordered lookup against six static Set registries (social → review → directory → news → ecommerce → aggregator), followed by blog patterns and, if nothing matches, unknown.

Write ScanCitation

One row per URL per prompt per day. The composite unique key (workspaceId, promptId, url, cachedDate) makes force-refresh idempotent via skipDuplicates.

Upsert aggregates

EntityCitationDomain and CitationAnalysis are recomputed from ScanCitations as the source of truth — never incrementally drifted.

Domain extraction — safe against malformed URLs

function extractDomain(url: string): string | null {
  try {
    const parsed = new URL(url.startsWith("http") ? url : `https://${url}`);
    return parsed.hostname.replace(/^www\./, "").toLowerCase() || null;
  } catch {
    // Fallback for severely malformed URLs (missing scheme, typos, etc.)
    const m = url.match(/(?:https?:\/\/)?(?:www\.)?([a-z0-9][a-z0-9-]*(\.[a-z0-9-]+)+)/i);
    return m ? m[1].toLowerCase() : null;
  }
}

9-type classification system

Citation Type Taxonomy

Every citation domain is mapped to one of nine types using a priority-ordered lookup. Types are permanent — once a domain is classified, that type is stored with the EntityCitationDomain record and used in all diversity calculations.
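A priority-ordered lookup of this kind can be sketched as follows. The registries hold only the example domains that appear on this page, the ecommerce and aggregator sets are left empty because no examples are given, the blog patterns are hypothetical, and checking the entity's own domain first is an assumption.

```typescript
type CitationType =
  | "owned" | "social" | "review_platform" | "directory" | "news_media"
  | "ecommerce" | "aggregator" | "third_party_blog" | "unknown";

// Tiny illustrative registries, listed in priority order.
const REGISTRIES: [CitationType, Set<string>][] = [
  ["social", new Set(["instagram.com"])],
  ["review_platform", new Set(["trustpilot.com", "g2.com", "yelp.com", "practo.com"])],
  ["directory", new Set(["justdial.com"])],
  ["news_media", new Set(["forbes.com", "techcrunch.com", "ndtv.com", "thehindu.com"])],
  ["ecommerce", new Set<string>()], // no examples given in the text
  ["aggregator", new Set<string>()], // no examples given in the text
];

// Hypothetical blog-host patterns, checked only after every registry misses.
const BLOG_PATTERNS = [/(^|\.)medium\.com$/, /^blog\./];

function classifyDomain(domain: string, ownedDomain: string): CitationType {
  const d = domain.toLowerCase();
  if (d === ownedDomain.toLowerCase()) return "owned"; // assumption: owned is checked first
  for (const [type, registry] of REGISTRIES) {
    if (registry.has(d)) return type; // first matching registry wins
  }
  if (BLOG_PATTERNS.some((p) => p.test(d))) return "third_party_blog";
  return "unknown";
}
```

Using `Set` lookups keeps each classification constant-time per registry, and the array order makes the priority explicit and easy to audit.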

news_media ×3

News & Media

Coverage in established news outlets and industry publications. The most authoritative citation type — signals that credible journalists reference your brand independently.

e.g. forbes.com, techcrunch.com, ndtv.com, thehindu.com

review_platform ×2.5

Review Platform

Citations on dedicated review and rating platforms. Strong social proof signal — indicates customers publicly evaluate your product and AI models recognise that authority.

e.g. trustpilot.com, g2.com, yelp.com, practo.com

directory ×2

Citation Diversity Score

A single number representing how broadly and authoritatively an entity is cited across the web. A higher score means more diverse, higher-quality references, which makes it harder for competitors to displace you from AI recommendations.

Formula

diversityScore = Σ(typeWeight per unique citation domain)

Each unique domain in EntityCitationDomain contributes exactly once, regardless of how many times it was cited. This rewards breadth — being cited by 10 different domains scores higher than being cited 10 times by the same domain.

Worked Example

Domain	Type	Weight
healthline.com	news_media	3.0
trustpilot.com	review_platform	2.5
justdial.com	directory	2.0
medium.com	third_party_blog	1.5
instagram.com	social	1.0
yourdomain.com	owned	0.5
Total: 3.0 + 2.5 + 2.0 + 1.5 + 1.0 + 0.5		= 10.5

The score is rounded to one decimal place and stored in CitationAnalysis after every scan. There is no defined maximum — scores above 15 indicate strong multi-channel authority, while scores below 3 suggest the entity is only cited by one or two low-weight source types.
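The diversity formula is a sum over unique domains, which the following sketch reproduces. The weights are the six listed in the worked example. Treating any type without a listed weight as 0 is an assumption, and `diversityScore` and `CitedDomain` are hypothetical names.

```typescript
// Weights from the worked example. Types without a listed weight count as 0
// here, which is an assumption rather than documented behaviour.
const TYPE_WEIGHT: Record<string, number> = {
  news_media: 3.0,
  review_platform: 2.5,
  directory: 2.0,
  third_party_blog: 1.5,
  social: 1.0,
  owned: 0.5,
};

interface CitedDomain {
  domain: string;
  type: string;
}

function diversityScore(domains: CitedDomain[]): number {
  const seen = new Set<string>();
  let total = 0;
  for (const { domain, type } of domains) {
    if (seen.has(domain)) continue; // each unique domain counts exactly once
    seen.add(domain);
    total += TYPE_WEIGHT[type] ?? 0;
  }
  return Math.round(total * 10) / 10; // rounded to one decimal place
}
```

Passing the six domains from the worked example returns 10.5, and repeating any of them in the input leaves the score unchanged.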

Automatic recommendations

Citation Health Flags

After every scan, three threshold-based checks run against the entity's aggregated citation data. Any threshold that is breached generates a recommendation string stored in CitationAnalysis.flags[].

Critical: Zero owned citations

if ownedCount === 0

“LLMs never cite your site directly — critical issue”

When AI models answer questions about your category but never link to your own domain, it means LLMs have not yet established a strong association between your brand and your domain. This requires direct content strategy work — schema markup, structured data, and authoritative on-site content.

Warning: Low owned citation share

if ownedCount / totalCitations < 0.30

“Strengthen your own site content”

Less than 30% of all citation events reference your own domain. Third-party sources are doing the work of representing your brand. While some third-party citations are healthy, over-reliance signals that your own site lacks the depth and authority that LLMs look for when choosing what to cite.

Warning: Single-source over-reliance

if maxDomainCount / totalCitations > 0.60

“Over-reliant on one source — diversify”

More than 60% of all citations come from a single domain. This creates fragility — if that source changes how it covers your brand, or if LLMs shift their weighting away from that source, your visibility score drops sharply. Diversification across multiple citation types insulates against this.
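The three threshold checks can be expressed in a few lines. In this sketch the flag keys stand in for the recommendation strings and are hypothetical, as are `citationFlags` and `CitationStats`. Each threshold is evaluated independently, so an entity with zero owned citations also trips the low-share check, and skipping the ratio checks when there are no citations at all is an assumption.

```typescript
interface CitationStats {
  ownedCount: number; // citation events pointing at the entity's own domain
  totalCitations: number; // all citation events
  maxDomainCount: number; // citation events for the single most-cited domain
}

// Hypothetical keys standing in for the stored recommendation strings.
type FlagKey = "zero_owned" | "low_owned_share" | "single_source_reliance";

function citationFlags(s: CitationStats): FlagKey[] {
  const flags: FlagKey[] = [];
  if (s.ownedCount === 0) flags.push("zero_owned");
  if (s.totalCitations > 0) {
    // Ratio checks are skipped when there are no citations, to avoid dividing by zero.
    if (s.ownedCount / s.totalCitations < 0.3) flags.push("low_owned_share");
    if (s.maxDomainCount / s.totalCitations > 0.6) flags.push("single_source_reliance");
  }
  return flags;
}
```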

Data architecture

Database Schema & Freshness Model

Six tables work together to provide both real-time prompt results and a deep historical ledger. Mutable cache tables and immutable append-only logs serve different purposes — here is how they differ.

Table	Type	Freshness strategy	Unique key / deduplication	Purpose
PromptResult	Mutable cache	Daily — one row per prompt per engine per day	promptId + engine + cachedDate	Latest result for UI display; overwritten on force refresh
PromptRankHistory	Immutable log	Appended on every scan — never updated or deleted	cuid (auto) + index on promptId + scannedAt	Historical trend data, chart points, task impact tracking
ScanCitation	Immutable log	One row per URL per prompt per day; skipDuplicates on re-run	workspaceId + promptId + url + cachedDate	Per-citation record used as source of truth for aggregates
EntityCitationDomain	Recomputed view	Upserted after every scan — counts recomputed from ScanCitation	workspaceId + domain	Per-domain aggregation: timesCited, firstSeen, promptsList
CitationAnalysis	Recomputed view	Upserted after every scan — single row per workspace	workspaceId (unique)	Entity-level diversity score, ownedCount, flags[]
PinnedPrompt	User state	Mutated by user actions only; not affected by scans	workspaceId + promptId	Watchlist membership for prioritised monitoring

Key design decisions

Idempotency by design

Every write to the cache, citation, and aggregate tables is safe to repeat. PromptResult uses upsert with a composite unique key. ScanCitation uses createMany with skipDuplicates. EntityCitationDomain and CitationAnalysis are recomputed from scratch rather than incremented, so running the pipeline 10 times leaves these tables in the same state as running it once. PromptRankHistory is the deliberate exception: it is append-only, so every executed scan adds a new row.
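The effect of a composite unique key can be demonstrated with an in-memory stand-in for a table. `KeyedStore` is a hypothetical class written for this illustration. It mimics upsert and skip-duplicates semantics and is not the Prisma API.

```typescript
// In-memory stand-in for a table with a unique key derived from each row.
class KeyedStore<T> {
  private rows = new Map<string, T>();

  constructor(private keyOf: (row: T) => string) {}

  // Upsert: insert if the key is new, otherwise overwrite the existing row.
  upsert(row: T): void {
    this.rows.set(this.keyOf(row), row);
  }

  // skipDuplicates-style insert: rows whose key already exists are ignored.
  createManySkipDuplicates(rows: T[]): number {
    let inserted = 0;
    for (const row of rows) {
      const key = this.keyOf(row);
      if (!this.rows.has(key)) {
        this.rows.set(key, row);
        inserted++;
      }
    }
    return inserted;
  }

  count(): number {
    return this.rows.size;
  }
}
```

Upserting the same (promptId, engine, cachedDate) row ten times leaves one row in the store, and a second `createManySkipDuplicates` call with the same citations inserts nothing.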

Source of truth separation

Mutable tables (PromptResult) handle today's UI data. Immutable tables (PromptRankHistory, ScanCitation) handle historical truth. Computed views (EntityCitationDomain, CitationAnalysis) are always derived from the immutable tables — they can be wiped and rebuilt at any time without data loss.
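Rebuilding a computed view from the immutable log might look like the sketch below. The field names `timesCited`, `firstSeen`, and `promptsList` come from the schema table. The `domain` column on the citation row, the ISO date strings, and the function name `rebuildDomainAggregates` are assumptions.

```typescript
interface ScanCitationRow {
  promptId: string;
  url: string;
  domain: string; // assumed to be stored alongside the URL
  cachedDate: string; // ISO calendar day, so string comparison orders correctly
}

interface DomainAggregate {
  domain: string;
  timesCited: number;
  firstSeen: string;
  promptsList: string[];
}

// Rebuilds every per-domain aggregate from scratch. Nothing is incremented in place.
function rebuildDomainAggregates(rows: ScanCitationRow[]): DomainAggregate[] {
  const byDomain = new Map<string, DomainAggregate>();
  for (const row of rows) {
    let agg = byDomain.get(row.domain);
    if (!agg) {
      agg = { domain: row.domain, timesCited: 0, firstSeen: row.cachedDate, promptsList: [] };
      byDomain.set(row.domain, agg);
    }
    agg.timesCited++;
    if (row.cachedDate < agg.firstSeen) agg.firstSeen = row.cachedDate;
    if (agg.promptsList.indexOf(row.promptId) === -1) agg.promptsList.push(row.promptId);
  }
  return Array.from(byDomain.values());
}
```

Because the function reads only the log and holds no state between calls, running it twice over the same rows yields identical aggregates, which is what allows the views to be wiped and rebuilt safely.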

Cascade-safe deletions

All child records use onDelete: Cascade so deleting a workspace, topic, or prompt cleanly removes all associated data. ScanCitations cascade from Prompt, not from PromptResult, so force-refreshing the mutable cache does not erase citation history.

