Google AI Overviews appear to choose sources by combining semantic relevance, authority, freshness, and how often a page is selected as a source across queries.
What signals does Google officially say it uses to build AI Overviews?
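SEO and reverse-engineering analysis points to semantic relevance, authority, freshness, and selection rate as the signals that decide which pages are cited in AI Overviews. Google's own public statements describe the feature's goals and user controls rather than specific ranking signals.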
Google describes AI Overviews as a core Search feature that can appear in standard search results, image searches, and Circle to Search. The Help Centre states that users can select the Web filter after a search to display only text-based links without features like AI Overviews. The system is designed to highlight web pages and drive attention to content on the web, according to Google's November 2023 announcement that expanded generative AI in Search to more than 120 countries and territories.
Google says Search Generative Experience (SGE), the experiment that preceded AI Overviews, shows a wider range of sources alongside generated answers. The company rolled out an update that makes it easier to see the web pages backing up information in AI-powered overviews, including an 'About this result' tool. These transparency features let users check the provenance of cited information directly.
The stated goal is to bring together the most helpful and relevant information available for a search. Google emphasises that SGE creates discovery opportunities by showing more links, drawn from a wider range of sources, on the results page. Search ads continue to appear in dedicated slots, and the company maintains that ads remain distinguishable from organic results.
How do authority, freshness and selection-rate signals affect which pages are cited?
SEO analysis of the system notes that query-independent measures include selection rate across queries, trustworthiness based on author, domain, or inbound links, overall popularity, and freshness (how recently content was created or updated). These signals mirror traditional ranking factors but apply at the citation-selection stage rather than the initial retrieval stage.
Selection rate refers to how often a document is chosen as a source across multiple queries. A page that has been cited repeatedly for related queries builds a track record that the system can weight positively. Trustworthiness derives from author credentials, domain reputation, and the quality of inbound links. Popularity reflects overall traffic and engagement. Freshness matters particularly for topics where information changes rapidly.
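Selection rate, as defined above, can be approximated from a publisher's own observations. The sketch below is a minimal illustration, assuming a hypothetical log that maps each tracked query to the URLs seen cited for it. The function name `selection_rates`, the log format, and the example URLs are all assumptions, and Google does not publish how it computes this signal.

```python
from collections import Counter


def selection_rates(citation_log: dict[str, list[str]]) -> dict[str, float]:
    """Return, for each URL, the fraction of logged queries that cited it."""
    total_queries = len(citation_log)
    if total_queries == 0:
        return {}
    counts = Counter()
    for cited_urls in citation_log.values():
        # Count a URL at most once per query, even if it is listed twice.
        counts.update(set(cited_urls))
    return {url: n / total_queries for url, n in counts.items()}


# Hypothetical observations: query -> URLs cited in the overview for that query.
log = {
    "what is cosine similarity": ["a.example/guide", "b.example/intro"],
    "cosine similarity formula": ["a.example/guide"],
    "embedding distance metrics": ["a.example/guide", "c.example/post"],
    "vector search basics": ["c.example/post"],
}
rates = selection_rates(log)
```

Here `a.example/guide` is cited for three of the four tracked queries, so it has the highest rate in this toy log. Real tracking would need many more queries before the figure means anything.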
Reverse-engineering analysis suggests the system considers these signals in combination. A page with high semantic relevance but low domain authority may still be cited if its content is exceptionally clear and comprehensive. Conversely, a page from a high-authority domain may be passed over if it lacks the specific detail the query requires. Google states it aims to show a wider range of voices and has made improvements intended to highlight high-quality sources, which may indicate that the balance between authority and relevance is tuned to favour diversity.
Patent summaries indicate candidate documents are often selected from top search results or top-N results. This means pages that rank well organically have a structural advantage in the citation pool, though semantic matching can surface pages outside the top results if they meet the relevance threshold.
How does semantic matching or embeddings decide citation candidates?
Reverse-engineering analysis indicates that the AI Overviews source selection pipeline uses semantic ranking with embedding proximity. Pages are represented as vectors in a high-dimensional space, and the system calculates cosine similarity between the query representation and candidate documents. The same analysis suggests pages may need a cosine similarity above 0.88 to the query representation to be selected, though Google has not confirmed this threshold publicly.
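Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures direction rather than magnitude. The sketch below shows that calculation with the unconfirmed 0.88 figure as a default cutoff. The two-dimensional vectors are toy values chosen for illustration, real embeddings have hundreds or thousands of dimensions, and the function names are assumptions rather than anything Google exposes.

```python
import math

# Unconfirmed figure reported by reverse-engineering analysis, not by Google.
SIMILARITY_THRESHOLD = 0.88


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between vectors a and b."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return dot / (norm_a * norm_b)


def passes_threshold(query_vec: list[float], page_vec: list[float],
                     threshold: float = SIMILARITY_THRESHOLD) -> bool:
    """True if the page vector is at least `threshold` similar to the query."""
    return cosine_similarity(query_vec, page_vec) >= threshold


# Toy vectors (hypothetical): one page points almost the same way as the query.
query = [1.0, 0.0]
close_page = [0.9, 0.1]
far_page = [0.5, 0.5]

close_ok = passes_threshold(query, close_page)  # similarity is roughly 0.99
far_ok = passes_threshold(query, far_page)      # similarity is roughly 0.71
```

The second page shares half its direction with the query and would still fall well short of a 0.88 cutoff, which illustrates how demanding such a threshold would be.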
If it holds, this threshold is a stricter test than traditional keyword matching. A page can rank well for a query in organic results but fail to be cited in an AI Overview if its semantic representation does not align closely with the query's conceptual structure. The system appears to narrow candidates by semantic similarity first, then apply authority and freshness signals to the remaining pool.
Semantic alignment depends on topical coverage and conceptual vocabulary. A page that covers the full scope of a topic and uses terminology consistent with authoritative sources in the field is more likely to pass the similarity threshold. Narrow or tangential content, even if well-written, may fall below the cutoff.
The candidate pool often comes from top search results, but the system can retrieve documents outside the top results if their embedding proximity is high. This creates a pathway for new or lower-authority pages to be cited if they match the query representation exceptionally well.
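The two routes into the candidate pool described above can be sketched as a simple union: pages in the top organic results, plus pages outside them whose embedding proximity is high. This is an illustration under stated assumptions, not Google's implementation. The `Page` fields, the `top_n` value of 10, and the reuse of the unconfirmed 0.88 figure are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    url: str
    organic_rank: int   # 1 = top organic result for the query
    similarity: float   # cosine similarity to the query representation


def candidate_pool(pages: list[Page], top_n: int = 10,
                   threshold: float = 0.88) -> list[Page]:
    """Pages that enter the pool via organic rank or via embedding proximity."""
    pool = [
        page for page in pages
        if page.organic_rank <= top_n or page.similarity >= threshold
    ]
    # Order by similarity only to make the result easy to read.
    return sorted(pool, key=lambda p: p.similarity, reverse=True)


# Hypothetical pages for one query.
pages = [
    Page("a.example/guide", organic_rank=1, similarity=0.91),
    Page("b.example/intro", organic_rank=4, similarity=0.70),
    Page("c.example/post", organic_rank=37, similarity=0.93),
    Page("d.example/notes", organic_rank=52, similarity=0.60),
]
pool_urls = [page.url for page in candidate_pool(pages)]
```

In this toy data `c.example/post` ranks 37th organically but enters the pool on similarity alone, while `d.example/notes` qualifies by neither route.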
Can low-authority or new sites be cited in AI Overviews?
Yes. Practical SEO guidance states that new websites can be cited if they produce exceptionally high-quality, accurate, and well-structured content that directly answers a specific user query. Quality and relevance are the primary selection criteria, not domain age. Google's public position that SGE highlights web pages and creates discovery opportunities supports this.
The conditions under which new or low-authority sites are cited are not fully specified. The system appears to apply a higher bar for semantic relevance and content clarity when domain authority is low. A new site must demonstrate that its content is the clearest, most reliable answer to the query, structured in a way the AI can parse and extract with confidence.
Examples from SEO case studies show new sites being cited when they provide comprehensive explanations with context, nuance, and supporting evidence. Surface-level content is rarely cited. The content must cover the topic thoroughly and use the same conceptual vocabulary as authoritative sources in the field.
Building genuine expertise signals requires sustained investment in author credibility, editorial processes, and topical authority. A new site that publishes one exceptional article may be cited for that specific query, but consistent citation across multiple queries depends on building a track record of quality.
What transparency and user controls show where AI Overviews get information?
Google rolled out the 'About this result' tool to reveal sources behind AI-powered overviews. This feature lets users see the web pages that back up the information in an AI Overview, helping them evaluate what they are finding. The tool appears alongside the overview and links directly to cited sources.
The Web filter option allows users to display only text-based links without features like AI Overviews. After performing a search, users can select the Web filter to see traditional organic results. Because AI Overviews are a core Search feature that cannot be turned off at the account level, this filter is the control Google offers for viewing results without them.
AI Overviews appear in standard search results, image searches, and Circle to Search in countries, territories, and languages where the feature is available. The cited sources are visible as links within the overview itself, and users can click through to the original pages to verify information or explore the topic further.
These transparency features address concerns about provenance and accuracy. Google notes that AI Overviews can and will make mistakes and may provide inaccurate or offensive information, cautioning users to think critically about AI Overview responses. The visibility of sources helps users apply that critical judgement.
What practical steps should publishers take to improve odds of being cited?
Content structure and extractability matter. Pages should provide clear answers to specific questions, cover topics comprehensively, and use the conceptual vocabulary of the field. The system favours content that is easy to parse and extract. This means using headings, lists, tables, and other structural elements that signal the organisation of information.
Matching the semantic representation of authoritative sources requires understanding what the broader literature says about a topic. If your page uses different terminology or frames the topic differently, it may fall below the cosine similarity threshold even if the substance is correct. Align your language with the terms and concepts used by recognised experts.
Longer-term signals include author credentials, domain trust, third-party citations, and freshness. Building these signals takes time. Author credentials can be signalled through author bios, credentials listed on the page, and links to the author's other work. Domain trust builds through consistent publication of high-quality content, inbound links from reputable sources, and user engagement.
Third-party citations are particularly valuable. If other authoritative sites cite your content, the system treats that as evidence of trustworthiness. This is harder to control directly but follows from producing original, valuable research or analysis that others find worth referencing.
Freshness signals can be maintained by updating content regularly. For topics where information changes, a page last updated years ago is less likely to be cited than one revised recently. Treat a window such as the past 12 months as a rule of thumb, since no freshness cutoff has been published. Update dates should reflect genuine revision, not cosmetic changes.
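A content audit can flag pages that fall outside the 12-month window mentioned above. The sketch below is one way to do that, assuming each page has a known last-revision date. The function names and the `max_age_months` parameter are hypothetical, and the 12-month default is this article's rule of thumb rather than a published cutoff.

```python
from datetime import date


def months_since_update(last_updated: date, today: date) -> int:
    """Whole calendar months elapsed between last_updated and today."""
    months = (today.year - last_updated.year) * 12 + (today.month - last_updated.month)
    if today.day < last_updated.day:
        months -= 1  # the current month has not fully elapsed yet
    return max(months, 0)


def is_stale(last_updated: date, today: date, max_age_months: int = 12) -> bool:
    """True if the page was last revised max_age_months or more months ago."""
    return months_since_update(last_updated, today) >= max_age_months


# Hypothetical audit run on a fixed date so the result is reproducible.
audit_date = date(2024, 3, 1)
old_page_stale = is_stale(date(2023, 1, 15), audit_date)     # 13 whole months
recent_page_stale = is_stale(date(2024, 1, 20), audit_date)  # 1 whole month
```

Passing `today` explicitly keeps the check deterministic. In a scheduled audit it could be set to `date.today()`.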
The following table compares scenarios where pages are more or less likely to be cited:
| Scenario | Likelihood of citation | Key factor |
|---|---|---|
| High-authority domain, narrow topical coverage | Moderate | Authority helps, but semantic relevance may be insufficient |
| New domain, exceptionally comprehensive and clear content | Moderate to high | Quality and relevance can overcome lack of authority |
| High-authority domain, comprehensive coverage, recent update | High | All signals align |
| Any domain, outdated content on a fast-changing topic | Low | Freshness signal fails |
If your page is indexed and topically relevant but never cited while competitors with similar authority are, the most likely issue is semantic alignment. Expand topical coverage and align terminology with the broader authoritative literature. If semantically strong, well-structured pages are bypassed while less comprehensive content from higher-authority domains is cited, the bottleneck is probably E-E-A-T (experience, expertise, authoritativeness, and trustworthiness). Compare your page's author credentials, domain reputation, and third-party citations against cited competitors.
The reverse-engineering analysis suggests a three-stage failure model. Stage 1 is indexing and retrieval: if your page is not indexed or does not appear in the top-N results for related queries, it will not enter the candidate pool. Stage 2 is semantic ranking: if your page is indexed but semantically misaligned, it fails the cosine similarity threshold. Stage 3 is E-E-A-T filtering: if your page passes semantic ranking but lacks authority signals, it may be bypassed in favour of higher-trust sources.
Fixing Stage 1 failures requires standard SEO: ensure the page is crawlable, has relevant keywords, and earns organic ranking signals. Fixing Stage 2 failures requires expanding topical coverage and aligning vocabulary. Fixing Stage 3 failures requires building genuine expertise signals, which is the hardest to address quickly.
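The three-stage model can be written as an ordered checklist that stops at the first stage a page fails. The sketch below assumes a publisher has already gathered four facts about a page. The `PageAudit` fields, the `first_failing_stage` function, and the use of the unconfirmed 0.88 figure are hypothetical, and the model itself comes from reverse-engineering analysis rather than from Google.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Stage(Enum):
    RETRIEVAL = "Stage 1: indexing and retrieval"
    SEMANTIC = "Stage 2: semantic ranking"
    EEAT = "Stage 3: E-E-A-T filtering"


# Suggested fix per stage, taken from the guidance in this section.
FIXES = {
    Stage.RETRIEVAL: "Standard SEO: crawlability, relevant keywords, organic ranking signals.",
    Stage.SEMANTIC: "Expand topical coverage and align vocabulary.",
    Stage.EEAT: "Build genuine expertise signals.",
}


@dataclass
class PageAudit:
    indexed: bool
    in_top_n: bool               # appears in top-N results for related queries
    similarity: float            # estimated cosine similarity to the query
    has_authority_signals: bool  # author credentials, domain trust, citations


def first_failing_stage(audit: PageAudit, threshold: float = 0.88) -> Optional[Stage]:
    """Return the earliest stage the page fails, or None if it passes all three."""
    if not audit.indexed or not audit.in_top_n:
        return Stage.RETRIEVAL
    if audit.similarity < threshold:
        return Stage.SEMANTIC
    if not audit.has_authority_signals:
        return Stage.EEAT
    return None


# Hypothetical page: indexed and ranking, but semantically misaligned.
stage = first_failing_stage(PageAudit(True, True, 0.74, True))
advice = FIXES[stage] if stage else "No failure detected by this model."
```

Checking the stages in order matters because a later fix cannot help a page that never enters the candidate pool.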
Google does not publish numeric weights or thresholds for this system. The balance between rank-based and embedding-based selection is not publicly specified. Limited public data exists on the frequency or conditions under which new or low-authority sites are cited. Publishers should treat the guidance above as directional rather than deterministic.
