Webspecia
Full Technical Breakdown

How We Track AI Visibility

A complete technical walkthrough of the scan pipeline, AEO scoring formula, citation classification system, diversity scoring, and how every data point is stored, cached, and kept fresh in our database.

Step-by-step pipeline

The Scan Engine

Every scan runs a batch of tracked prompts against GPT-4o and processes the raw text response through a deterministic extraction pipeline. Here is exactly what happens for each prompt, in order.

01 Prompt sent to GPT-4o

An instruction to name specific companies and include URLs where available is appended to the prompt text. We use temperature 0.3 to reduce run-to-run variance in the responses. Responses are capped at 900 tokens, which is enough for a detailed recommendation list.

"{prompt}? Please mention specific companies, brands, or websites by name where relevant. Include URLs where available."

02 Mention detection

Three parallel checks determine whether the entity is mentioned: exact bare-domain match in the response text, brand-name substring match (minimum 3 characters to avoid false positives on acronyms), and a word-boundary regex for the domain base label.

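mentioned =
  lower.includes(bareDomain) ||
  (brandLower.length >= 3 && lower.includes(brandLower)) ||
  // domainBase must be regex-escaped before it is interpolated
  new RegExp(`\\b${domainBase}\\b`, "i").test(lower)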
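
The three checks can be sketched as a self-contained function. This is a minimal illustration, not the production code: `isMentioned` and `escapeRegExp` are hypothetical names, and the way the base label is derived from the domain (everything before the first dot) is an assumption.

```typescript
// Escape regex metacharacters so a base label cannot break the pattern.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: returns true if any of the three checks matches.
function isMentioned(response: string, bareDomain: string, brand: string): boolean {
  const lower = response.toLowerCase();
  const domain = bareDomain.toLowerCase();
  const brandLower = brand.trim().toLowerCase();
  // Assumption: the base label is the part of the domain before the first dot.
  const domainBase = domain.split(".")[0] ?? "";

  // Check 1: exact bare-domain match.
  const domainHit = domain.length > 0 && lower.includes(domain);
  // Check 2: brand-name substring, skipped for names shorter than 3 characters.
  const brandHit = brandLower.length >= 3 && lower.includes(brandLower);
  // Check 3: word-boundary match on the domain base label.
  const baseHit =
    domainBase.length > 0 &&
    new RegExp(`\\b${escapeRegExp(domainBase)}\\b`, "i").test(lower);

  return domainHit || brandHit || baseHit;
}
```

The word-boundary check is what keeps a base label such as `acme` from matching inside a longer word like `acmeaudiology`.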

03 Rank extraction

The response is split by newline. Any line matching a numbered list pattern ("1.", "2)", etc.) increments a counter. The first numbered line that contains the brand name becomes the rank position. If mentioned but not in a numbered list, rank defaults to 1.

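let count = 0, rank = null
for (const line of response.split("\n")) {
  if (!/^\s*\d+[.)]/.test(line)) continue   // only numbered lines count
  count++
  if (line.toLowerCase().includes(brandLower)) { rank = count; break }
}
if (mentioned && rank === null) rank = 1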
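
As a worked illustration, the same logic can be wrapped in a function and run against a sample response. The name `extractRank` and the sample text are hypothetical; the rule that only numbered lines are checked for the brand follows the description above.

```typescript
// Hypothetical helper: 1-based position of the first numbered line naming the brand.
function extractRank(response: string, brand: string, mentioned: boolean): number | null {
  const brandLower = brand.toLowerCase();
  let count = 0;
  for (const line of response.split("\n")) {
    if (!/^\s*\d+[.)]/.test(line)) continue; // only numbered list items advance the counter
    count++;
    if (line.toLowerCase().includes(brandLower)) return count;
  }
  // Mentioned outside any numbered list: rank defaults to 1.
  return mentioned ? 1 : null;
}

const sample = "Here are some options:\n1. Foo Hearing\n2) Acme Audio\n3. Bar Sound";
console.log(extractRank(sample, "Acme Audio", true)); // 2
```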

04 Sentiment analysis

A 450-character context window around the first brand mention is scanned against two word lists. Presence of any negative signal word takes priority over positive. If neither list matches, sentiment is neutral. Sentiment is only computed when the brand is mentioned.

window = text.slice(Math.max(0, idx - 150), idx + 300)
sentiment =
  NEGATIVE.some(w => window.includes(w)) ? "negative"
  : POSITIVE.some(w => window.includes(w)) ? "positive"
  : "neutral"
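
A runnable sketch of the same rule follows. The two word lists here are invented placeholders, since the real lists are not shown on this page, and `classifySentiment` is a hypothetical name.

```typescript
// Placeholder word lists: assumptions for illustration only.
const NEGATIVE = ["avoid", "poor", "complaint", "unreliable"];
const POSITIVE = ["recommended", "trusted", "excellent"];

type Sentiment = "positive" | "neutral" | "negative";

function classifySentiment(text: string, brand: string): Sentiment | null {
  const lower = text.toLowerCase();
  const idx = lower.indexOf(brand.toLowerCase());
  if (idx === -1) return null; // sentiment is only computed when the brand is mentioned

  // 150 characters before and 300 after the first mention, clamped at the start of the text.
  const context = lower.slice(Math.max(0, idx - 150), idx + 300);
  if (NEGATIVE.some((w) => context.includes(w))) return "negative"; // negative takes priority
  if (POSITIVE.some((w) => context.includes(w))) return "positive";
  return "neutral";
}
```

Because matching is by plain substring, a list entry can also match inside a longer word, so the lists need to be chosen with that in mind.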

05 Citation extraction

Two regex passes extract URLs. First pass captures markdown-formatted links [text](url) and preserves the anchor text as the citation name. Second pass captures bare https:// URLs not already captured. Output is capped at 6 citations per response to focus on the most prominent references.

// Pass 1 — markdown links
/\[([^\]]+)\]\((https?:\/\/[^\s)]+)\)/g
// Pass 2 — bare URLs (deduped)
/https?:\/\/[a-zA-Z0-9.-]+\.[a-z]{2,}[^\s)]*/gi
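
Putting the two passes together, with deduplication and the cap of 6, might look like the sketch below. `extractCitations` and the `Citation` shape are hypothetical, and trimming trailing sentence punctuation from bare URLs is an added assumption rather than something the description above states.

```typescript
interface Citation {
  url: string;
  name: string | null; // anchor text for markdown links, null for bare URLs
}

const MAX_CITATIONS = 6;

function extractCitations(response: string): Citation[] {
  const seen = new Set<string>();
  const out: Citation[] = [];
  let m: RegExpExecArray | null;

  // Pass 1: markdown links, keeping the anchor text as the citation name.
  const markdownLink = /\[([^\]]+)\]\((https?:\/\/[^\s)]+)\)/g;
  while ((m = markdownLink.exec(response)) !== null) {
    const name = m[1];
    const url = m[2];
    if (name === undefined || url === undefined || seen.has(url)) continue;
    seen.add(url);
    out.push({ url, name });
  }

  // Pass 2: bare URLs that pass 1 has not already captured.
  const bareUrl = /https?:\/\/[a-zA-Z0-9.-]+\.[a-z]{2,}[^\s)]*/gi;
  while ((m = bareUrl.exec(response)) !== null) {
    const url = m[0].replace(/[.,;:!?]+$/, ""); // assumption: drop trailing punctuation
    if (seen.has(url)) continue;
    seen.add(url);
    out.push({ url, name: null });
  }

  return out.slice(0, MAX_CITATIONS);
}
```

Markdown links are collected first, so when a response contains more than six URLs the named links are the ones most likely to survive the cap.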

06 Competitor detection

Each known competitor domain is tested against the response with the same domain checks used for the entity itself: a bare-domain match and a word-boundary match on the domain base label. Competitors are stored as a JSON array in the prompt result and their count is written to the rank-history snapshot for trend tracking.

competitors = monitor.competitors.filter(comp =>
  lower.includes(bareDomain(comp)) ||
  // base: the competitor's domain base label, regex-escaped
  new RegExp(`\\b${base}\\b`, "i").test(lower)
)

07 Parallel write to DB

All results for a prompt are written in a single Promise.all: a PromptResult upsert (deduplicated by promptId + engine + cachedDate), a PromptRankHistory insert (immutable historical record), and the Citation Pipeline (see below). The three writes run in parallel and are not wrapped in a transaction, so a failure in one does not roll back the others. If any write fails, the batch logs the error and continues to the next prompt.

await Promise.all([
  prisma.promptResult.upsert({ ... }),
  prisma.promptRankHistory.create({ ... }),
  processCitationPipeline({ ... }),
])

Scoring formula

AEO Visibility Score

The AEO Visibility Score is a 0–100 number summarising how well an entity is represented across all its tracked prompts in a single scan. It accounts for both presence and quality of that presence.

Core Formula

score = round(weightedSum / totalPrompts × 100)

Where weightedSum sums over all prompts:

visible + positive → weight 1.0
visible + neutral   → weight 0.8
visible + negative → weight 0.5
not_found          → weight 0.0

Worked Example — 5 prompts

| Prompt | Status | Sentiment | Weight |
| --- | --- | --- | --- |
| best hearing aids in India | visible | positive | 1.0 |
| hearing aid price comparison | visible | neutral | 0.8 |
| audiologist recommended brands | visible | negative | 0.5 |
| invisible hearing aids review | not_found | n/a | 0.0 |
| hearing aid service centre near me | not_found | n/a | 0.0 |

weightedSum = 1.0 + 0.8 + 0.5 = 2.3 · totalPrompts = 5 · score = 46
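
The formula and the worked example can be checked with a few lines of code. This is a sketch under stated assumptions: `aeoScore` and `PromptOutcome` are hypothetical names, an empty scan is assumed to score 0, and a visible prompt with no sentiment is assumed to count as neutral.

```typescript
type Status = "visible" | "not_found";
type Sentiment = "positive" | "neutral" | "negative";

interface PromptOutcome {
  status: Status;
  sentiment?: Sentiment; // only set when status is "visible"
}

const WEIGHTS: Record<Sentiment, number> = { positive: 1.0, neutral: 0.8, negative: 0.5 };

function aeoScore(outcomes: PromptOutcome[]): number {
  if (outcomes.length === 0) return 0; // assumption: an empty scan scores 0
  let weightedSum = 0;
  for (const o of outcomes) {
    if (o.status === "visible") {
      weightedSum += WEIGHTS[o.sentiment ?? "neutral"]; // assumption: missing sentiment is neutral
    }
    // not_found contributes 0.0
  }
  return Math.round((weightedSum / outcomes.length) * 100);
}

// The five prompts from the worked example.
const example: PromptOutcome[] = [
  { status: "visible", sentiment: "positive" },
  { status: "visible", sentiment: "neutral" },
  { status: "visible", sentiment: "negative" },
  { status: "not_found" },
  { status: "not_found" },
];
console.log(aeoScore(example)); // 46
```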

Freshness system

Daily Caching & Scan Freshness

LLM API calls are expensive and slow. We deduplicate at the calendar-day level so re-running a scan never charges twice for the same prompt, while ensuring every day produces a fresh data point for trend tracking.

Normal mode

  • Checks PromptResult for rows where cachedDate = today
  • Skips prompts that already have a result today
  • Runs only the remaining prompts
  • Returns cached: N, ran: M in the response

Force refresh

  • Deletes all PromptResult rows for today's date
  • Re-runs every active prompt regardless
  • Appends fresh rank-history rows; existing rows are left untouched
  • ScanCitations use skipDuplicates to stay idempotent
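
The difference between the two modes comes down to which prompts are selected to run. A minimal sketch, assuming today's cache rows have already been loaded into memory; `planScan`, `CachedRow`, and `ScanPlan` are hypothetical names, and the database delete that force refresh performs is not modelled here.

```typescript
interface CachedRow {
  promptId: string;
  cachedDate: string; // "YYYY-MM-DD"
}

interface ScanPlan {
  toRun: string[];
  cached: number;
  ran: number;
}

function planScan(
  activePromptIds: string[],
  rows: CachedRow[],
  today: string,
  forceRefresh: boolean,
): ScanPlan {
  if (forceRefresh) {
    // Force refresh ignores today's cache and re-runs every active prompt.
    return { toRun: activePromptIds.slice(), cached: 0, ran: activePromptIds.length };
  }
  // Normal mode: skip prompts that already have a result dated today.
  const doneToday = new Set(
    rows.filter((r) => r.cachedDate === today).map((r) => r.promptId),
  );
  const toRun = activePromptIds.filter((id) => !doneToday.has(id));
  return { toRun, cached: activePromptIds.length - toRun.length, ran: toRun.length };
}
```

With three active prompts and one result already cached for today, normal mode returns `cached: 1, ran: 2`, while force refresh returns `cached: 0, ran: 3`.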

How history accumulates

Scan on May 1 → RankHistory row
Scan on May 5 → RankHistory row
Scan on May 12 → RankHistory row
Scan on May 20 → RankHistory row

Each scan appends an immutable PromptRankHistory row per prompt. These rows are never deleted or modified — they form the permanent audit trail that powers the trend charts, sparklines, and task impact calculations. The PromptResult table is the mutable daily cache; PromptRankHistory is the immutable ledger.

Citation pipeline

How Citations Are Extracted & Stored

After every prompt run we pass the extracted URLs through a four-stage citation pipeline that classifies each URL, aggregates per-domain statistics, and keeps a live diversity score for the entity.

1. Extract domain

new URL() inside a try/catch, with a regex fallback, strips the scheme, www., and path. It returns null for truly malformed input and never throws.

2. Classify type

Priority-ordered lookup against six static Set registries (social → review → directory → news → ecommerce → aggregator), then blog patterns, falling back to unknown.

3. Write ScanCitation

One row per URL per prompt per day. The composite unique key (workspaceId, promptId, url, cachedDate) makes force-refresh idempotent via skipDuplicates.

4. Upsert aggregates

EntityCitationDomain and CitationAnalysis are recomputed from ScanCitation rows, the source of truth, rather than incremented, so they cannot drift.
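
Stage 2 can be illustrated with a small priority-ordered lookup. The registries below contain only the example domains listed in the taxonomy section of this page, so they are far smaller than any real registry. The `classify` function, the blog patterns, and the decision to check for an owned domain before anything else are assumptions made for this sketch.

```typescript
type CitationType =
  | "owned" | "social" | "review_platform" | "directory" | "news_media"
  | "ecommerce" | "aggregator" | "third_party_blog" | "unknown";

// Checked in order: the first registry that contains the domain wins.
const REGISTRIES: [CitationType, Set<string>][] = [
  ["social", new Set(["linkedin.com", "youtube.com", "instagram.com", "twitter.com"])],
  ["review_platform", new Set(["trustpilot.com", "g2.com", "yelp.com", "practo.com"])],
  ["directory", new Set(["justdial.com", "indiamart.com", "clutch.co", "yellowpages.com"])],
  ["news_media", new Set(["forbes.com", "techcrunch.com", "ndtv.com", "thehindu.com"])],
  ["ecommerce", new Set(["amazon.in", "flipkart.com", "ebay.com", "etsy.com"])],
  ["aggregator", new Set(["policybazaar.com", "booking.com", "zomato.com", "1mg.com"])],
];

// Hosted blog platforms, matched on the domain itself or any subdomain.
const BLOG_PATTERNS: RegExp[] = [
  /(^|\.)medium\.com$/,
  /(^|\.)substack\.com$/,
  /(^|\.)blogspot\.com$/,
  /(^|\.)hashnode\.dev$/,
];

function classify(domain: string, ownedDomain: string): CitationType {
  const d = domain.toLowerCase();
  const owned = ownedDomain.toLowerCase();
  // Assumption: the owned check runs before the registries.
  if (d === owned || d.endsWith("." + owned)) return "owned";
  for (const [type, registry] of REGISTRIES) {
    if (registry.has(d)) return type;
  }
  if (BLOG_PATTERNS.some((p) => p.test(d))) return "third_party_blog";
  return "unknown";
}
```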

Domain extraction — safe against malformed URLs

function extractDomain(url: string): string | null {
  try {
    const parsed = new URL(url.startsWith("http") ? url : `https://${url}`);
    return parsed.hostname.replace(/^www\./, "").toLowerCase() || null;
  } catch {
    // Fallback for severely malformed URLs (missing scheme, typos, etc.)
    const m = url.match(/(?:https?:\/\/)?(?:www\.)?([a-z0-9][a-z0-9-]*(\.[a-z0-9-]+)+)/i);
    return m ? m[1].toLowerCase() : null;
  }
}

9-type classification system

Citation Type Taxonomy

Every citation domain is mapped to one of nine types using a priority-ordered lookup. Types are permanent — once a domain is classified, that type is stored with the EntityCitationDomain record and used in all diversity calculations.

news_media ×3

News & Media

Coverage in established news outlets and industry publications. The most authoritative citation type — signals that credible journalists reference your brand independently.

e.g. forbes.com, techcrunch.com, ndtv.com, thehindu.com

review_platform ×2.5

Review Platform

Citations on dedicated review and rating platforms. Strong social proof signal — indicates customers publicly evaluate your product and AI models recognise that authority.

e.g. trustpilot.com, g2.com, yelp.com, practo.com

directory ×2

Directory

Business and professional directories. Foundational entity recognition — structured directory listings help LLMs consistently map your brand name to a real business.

e.g. justdial.com, indiamart.com, clutch.co, yellowpages.com

third_party_blog ×1.5

Third-Party Blog

Independent writers, Medium publications, and hosted blog platforms. Good for long-tail visibility — bloggers often cover niche topics that mainstream media ignores.

e.g. medium.com, substack.com, blogspot.com, hashnode.dev

aggregator ×1.5

Aggregator

Comparison and aggregation platforms. Valuable for transactional queries — users on aggregators are high-intent buyers, and appearing there reinforces purchase-stage visibility.

e.g. policybazaar.com, booking.com, zomato.com, 1mg.com

social ×1

Social

Social media profiles and posts. Baseline presence signal — LLMs do surface social content, especially LinkedIn for B2B brands. Contributes to recognition but is easily replicated.

e.g. linkedin.com, youtube.com, instagram.com, twitter.com

ecommerce ×1

E-commerce

Product listings on marketplace and e-commerce platforms. Relevant for product businesses — indicates your items are discoverable and purchasable through major retail channels.

e.g. amazon.in, flipkart.com, ebay.com, etsy.com

owned ×0.5

Owned

Your own domain or its subdomains. Low diversity weight — expected and necessary but not differentiating. A high proportion of owned citations relative to total often signals weak third-party authority.

e.g. yourdomain.com, blog.yourdomain.com, shop.yourdomain.com

unknown ×0.3

Unknown

Domains not matching any known category. Treated as neutral — may include niche industry sites, country-specific directories, or newly established platforms not yet in our registry.

e.g. niche-industry-portal.com, regional-news.co.in, …

Authority metric

Citation Diversity Score

A single number representing how broadly and authoritatively an entity is cited across the web. A higher score means more diverse, higher-quality references, which makes it harder for competitors to displace you from AI recommendations.

Formula

diversityScore = Σ(typeWeight per unique citation domain)

Each unique domain in EntityCitationDomain contributes exactly once, regardless of how many times it was cited. This rewards breadth — being cited by 10 different domains scores higher than being cited 10 times by the same domain.

Worked Example

| Domain | Type | Weight |
| --- | --- | --- |
| healthline.com | news_media | 3.0 |
| trustpilot.com | review_platform | 2.5 |
| justdial.com | directory | 2.0 |
| medium.com | third_party_blog | 1.5 |
| instagram.com | social | 1.0 |
| yourdomain.com | owned | 0.5 |

Total: 3.0 + 2.5 + 2.0 + 1.5 + 1.0 + 0.5 = 10.5

The score is rounded to one decimal place and stored in CitationAnalysis after every scan. There is no defined maximum — scores above 15 indicate strong multi-channel authority, while scores below 3 suggest the entity is only cited by one or two low-weight source types.
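
The calculation is small enough to sketch in full. The weights are the ones listed in the taxonomy; `diversityScore` and `CitedDomain` are hypothetical names, and the input is assumed to be a list of already-classified citations that may contain repeats of the same domain.

```typescript
const TYPE_WEIGHTS = {
  news_media: 3,
  review_platform: 2.5,
  directory: 2,
  third_party_blog: 1.5,
  aggregator: 1.5,
  social: 1,
  ecommerce: 1,
  owned: 0.5,
  unknown: 0.3,
} as const;

type CitationType = keyof typeof TYPE_WEIGHTS;

interface CitedDomain {
  domain: string;
  type: CitationType;
}

function diversityScore(citations: CitedDomain[]): number {
  const seen = new Set<string>();
  let total = 0;
  for (const c of citations) {
    if (seen.has(c.domain)) continue; // each unique domain contributes exactly once
    seen.add(c.domain);
    total += TYPE_WEIGHTS[c.type];
  }
  return Math.round(total * 10) / 10; // one decimal place
}

// The worked example, with trustpilot.com cited twice to show deduplication.
const cited: CitedDomain[] = [
  { domain: "healthline.com", type: "news_media" },
  { domain: "trustpilot.com", type: "review_platform" },
  { domain: "trustpilot.com", type: "review_platform" },
  { domain: "justdial.com", type: "directory" },
  { domain: "medium.com", type: "third_party_blog" },
  { domain: "instagram.com", type: "social" },
  { domain: "yourdomain.com", type: "owned" },
];
console.log(diversityScore(cited)); // 10.5
```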

Automatic recommendations

Citation Health Flags

After every scan, three threshold-based checks run against the entity's aggregated citation data. Any threshold that is breached generates a recommendation string stored in CitationAnalysis.flags[].

Critical: Zero owned citations
if ownedCount === 0
LLMs never cite your site directly — critical issue

When AI models answer questions about your category but never link to your own domain, it means LLMs have not yet established a strong association between your brand and your domain. This requires direct content strategy work — schema markup, structured data, and authoritative on-site content.

Warning: Low owned citation share
if ownedCount / totalCitations < 0.30
Strengthen your own site content

Less than 30% of all citation events reference your own domain. Third-party sources are doing the work of representing your brand. While some third-party citations are healthy, over-reliance signals that your own site lacks the depth and authority that LLMs look for when choosing what to cite.

Warning: Single-source over-reliance
if maxDomainCount / totalCitations > 0.60
Over-reliant on one source — diversify

More than 60% of all citations come from a single domain. This creates fragility — if that source changes how it covers your brand, or if LLMs shift their weighting away from that source, your visibility score drops sharply. Diversification across multiple citation types insulates against this.
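
The three checks can be expressed as one function over per-domain counts. In this sketch the short flag codes are hypothetical stand-ins for the recommendation strings, `citationFlags` and `DomainCount` are invented names, and skipping the two ratio checks when there are no citations at all is an assumption made to avoid dividing by zero.

```typescript
interface DomainCount {
  domain: string;
  timesCited: number;
  owned: boolean;
}

// Hypothetical short codes standing in for the recommendation strings.
type Flag = "zero_owned" | "low_owned_share" | "single_source";

function citationFlags(domains: DomainCount[]): Flag[] {
  let totalCitations = 0;
  let ownedCount = 0;
  let maxDomainCount = 0;
  for (const d of domains) {
    totalCitations += d.timesCited;
    if (d.owned) ownedCount += d.timesCited;
    if (d.timesCited > maxDomainCount) maxDomainCount = d.timesCited;
  }

  const flags: Flag[] = [];
  if (ownedCount === 0) flags.push("zero_owned");
  // Assumption: ratio checks are skipped when there are no citations at all.
  if (totalCitations > 0) {
    if (ownedCount / totalCitations < 0.3) flags.push("low_owned_share");
    if (maxDomainCount / totalCitations > 0.6) flags.push("single_source");
  }
  return flags;
}
```

Taken literally, the thresholds overlap: an entity with zero owned citations and at least one third-party citation trips both the critical flag and the low-share warning.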

Data architecture

Database Schema & Freshness Model

Six tables work together to provide both real-time prompt results and a deep historical ledger. Mutable cache tables and immutable append-only logs serve different purposes — here is how they differ.

| Table | Type | Freshness strategy | Unique key / deduplication | Purpose |
| --- | --- | --- | --- | --- |
| PromptResult | Mutable cache | Daily: one row per prompt per engine per day | promptId + engine + cachedDate | Latest result for UI display; overwritten on force refresh |
| PromptRankHistory | Immutable log | Appended on every scan; never updated or deleted | cuid (auto) + index on promptId + scannedAt | Historical trend data, chart points, task impact tracking |
| ScanCitation | Immutable log | One row per URL per prompt per day; skipDuplicates on re-run | workspaceId + promptId + url + cachedDate | Per-citation record used as source of truth for aggregates |
| EntityCitationDomain | Recomputed view | Upserted after every scan; counts recomputed from ScanCitation | workspaceId + domain | Per-domain aggregation: timesCited, firstSeen, promptsList |
| CitationAnalysis | Recomputed view | Upserted after every scan; single row per workspace | workspaceId (unique) | Entity-level diversity score, ownedCount, flags[] |
| PinnedPrompt | User state | Mutated by user actions only; not affected by scans | workspaceId + promptId | Watchlist membership for prioritised monitoring |

Key design decisions

Idempotency by design

Every write to the cache and citation tables is safe to repeat. PromptResult uses upsert with a composite unique key. ScanCitation uses createMany with skipDuplicates. EntityCitationDomain and CitationAnalysis are recomputed from scratch rather than incremented, so running the citation pipeline 10 times produces the same result as running it once. The one deliberate exception is PromptRankHistory, which appends a new row on every scan.
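
The recompute-from-scratch idea can be shown with an in-memory sketch. The field names timesCited, firstSeen, and promptsList come from the schema table above; `recomputeDomains` and the row shapes are hypothetical, and a real implementation would read from and write to the database rather than arrays.

```typescript
interface ScanCitationRow {
  promptId: string;
  url: string;
  domain: string;
  cachedDate: string; // "YYYY-MM-DD"
}

interface DomainAggregate {
  domain: string;
  timesCited: number;
  firstSeen: string;
  promptsList: string[];
}

// Rebuilds every aggregate from the full set of rows; no stored counter is incremented.
function recomputeDomains(rows: ScanCitationRow[]): DomainAggregate[] {
  const byDomain = new Map<string, DomainAggregate>();
  for (const r of rows) {
    const agg = byDomain.get(r.domain);
    if (agg === undefined) {
      byDomain.set(r.domain, {
        domain: r.domain,
        timesCited: 1,
        firstSeen: r.cachedDate,
        promptsList: [r.promptId],
      });
      continue;
    }
    agg.timesCited += 1;
    if (r.cachedDate < agg.firstSeen) agg.firstSeen = r.cachedDate; // ISO dates sort as strings
    if (agg.promptsList.indexOf(r.promptId) === -1) agg.promptsList.push(r.promptId);
  }
  const out: DomainAggregate[] = [];
  byDomain.forEach((agg) => out.push(agg));
  return out.sort((a, b) => a.domain.localeCompare(b.domain));
}
```

Because the output depends only on the rows passed in, calling the function twice on the same rows yields identical aggregates, which is the property the paragraph above relies on.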

Source of truth separation

Mutable tables (PromptResult) handle today's UI data. Immutable tables (PromptRankHistory, ScanCitation) handle historical truth. Computed views (EntityCitationDomain, CitationAnalysis) are always derived from the immutable tables — they can be wiped and rebuilt at any time without data loss.

Cascade-safe deletions

All child records use onDelete: Cascade so deleting a workspace, topic, or prompt cleanly removes all associated data. ScanCitations cascade from Prompt, not from PromptResult, so force-refreshing the mutable cache does not erase citation history.
