Web Feed Survey

Common Crawl Archive: CC-MAIN-2026-12

📅 Crawl ended 2026-03-17
⚠️ Filtered to top 500,000 entries (Tranco subdomain-inclusive, by registrable site)

1. The Crawl

Percentages describe this Common Crawl result set, not the entire Web. Metrics reflect what Common Crawl fetched, what sites allowed it to fetch, and the configured Tranco list/sample limits for this run. Site counts and TOP_N scoping use the Tranco subdomain-inclusive list, normalized to registrable sites with the Public Suffix List, including private suffixes for hosted sub-sites.
Responses Processed
425,471,592
All responses after Tranco filtering; 418,132,532 had HTML, feed, XML, or sniffable content types and were analyzed further.
HTML Pages Processed
416,048,595
HTML/XHTML responses; used as the denominator for autodiscovery page rates.
Unique Sites
196,598
Distinct registrable sites among processed responses (HyperLogLog estimate).
Feed URLs Checked
311,382
No generic XML, text/plain, or application/octet-stream responses were sniffed as feeds; every feed URL checked had an exact RSS/Atom media type.
Analyzed Response Content-Type Distribution
HTTP Content-Type header values among responses that passed the analysis prefilter. “Other XML” is XML content without an exact RSS/Atom media type, such as application/xml or text/xml. “Other Non-XML” is sniffable non-XML content such as text/plain or application/octet-stream. Either is only counted as a feed URL if sniffing finds RSS/Atom. Parenthetical percentages use all analyzed responses as the denominator.
HTML
416,048,595
99.50% of total
Non-HTML types (bars scaled relative to each other):
Atom
113,064  (0.0270%)
RSS
198,759  (0.0475%)
Other XML
966,177  (0.2311%)
Other Non-XML
805,937  (0.1927%)
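The sniffing rule described above (generic XML or non-XML responses count as feeds only when their content turns out to be RSS/Atom) can be illustrated with a minimal sketch. The function name `sniff_feed` and the exact set of accepted root elements are assumptions for illustration; the survey does not document its sniffer.

```python
import xml.etree.ElementTree as ET

# Root elements treated as feeds in this sketch (an assumption, not the
# survey's documented rule). Older dialects such as Atom 0.3 are not covered.
FEED_ROOTS = {
    "rss": "rss",
    "{http://www.w3.org/2005/Atom}feed": "atom",
    "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}RDF": "rdf",
}

def sniff_feed(body):
    """Return 'rss', 'atom', or 'rdf' if the body parses as XML with a feed root, else None."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return None  # not well-formed XML, so not counted as a feed
    return FEED_ROOTS.get(root.tag)
```

Under this sketch, a text/xml response whose root is `<html>` or `<div>` would stay in the “Other XML” bucket rather than being counted as a feed URL.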

2. Feed Autodiscovery

Autodiscovery Coverage
Pages with feed links
81,945,315
19.70% of analyzed HTML responses
Sites with feed links
70,633
35.93% of analyzed registrable sites
Site count is an estimated lower bound from sampled source sites per feed URL.
Discovery link relations
Pages with feed rel="alternate" 81,942,184
Pages with feed rel="feed" 12,793
Pages using both relations 9,662
Pages with a multi-rel link 9,544
Multi-rel links 16,970
These counts include only RSS/Atom autodiscovery links: HTML <link> elements whose type is RSS, Atom, or RDF feed XML. Other uses of rel="alternate" are not counted. Pages using both relations includes pages with separate alternate and feed links, plus pages with a single link whose rel contains both. Multi-rel links counts those single link elements.
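As a rough illustration of the counting rule above, the sketch below collects feed autodiscovery links from one HTML page and reports which relations it uses. The class and function names are hypothetical, and the list of accepted media types (in particular `application/rdf+xml` for "RDF feed XML") is an assumption.

```python
from html.parser import HTMLParser

# Media types treated as feed types here (assumed; the survey says "RSS, Atom, or RDF feed XML").
FEED_TYPES = {"application/rss+xml", "application/atom+xml", "application/rdf+xml"}

class FeedLinkParser(HTMLParser):
    """Collects <link> elements that qualify as RSS/Atom autodiscovery links."""

    def __init__(self):
        super().__init__()
        self.links = []  # list of (href, set of rel tokens)

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        mime = (attr.get("type") or "").split(";")[0].strip().lower()
        rels = set((attr.get("rel") or "").lower().split())
        # Other uses of rel="alternate" (e.g. hreflang variants) fail the type check.
        if mime in FEED_TYPES and rels & {"alternate", "feed"}:
            self.links.append((attr.get("href"), rels))

def summarize(html):
    parser = FeedLinkParser()
    parser.feed(html)
    has_alternate = any("alternate" in rels for _, rels in parser.links)
    has_feed = any("feed" in rels for _, rels in parser.links)
    multi_rel = sum(1 for _, rels in parser.links if {"alternate", "feed"} <= rels)
    return {
        "alternate": has_alternate,
        "feed": has_feed,
        "both": has_alternate and has_feed,
        "multi_rel_links": multi_rel,
    }

# Inclusion-exclusion on the reported page counts: alternate + feed - both.
pages_with_feed_links = 81_942_184 + 12_793 - 9_662
```

The last line reproduces the "Pages with feed links" figure of 81,945,315 from the three relation counts.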
Autodiscovery Links per Page

Unique feed URLs per discovered HTML page (analyzed HTML responses with zero feed links omitted: 334,103,280).

Autodiscovery Links per Site

Unique feed URLs per registrable site. Based on sampled source sites retained for each discovered feed URL. Sites with zero links omitted: 125,965.

Pages with Duplicate Feeds
7
0.03% of 25,079 multi-feed pages

Pages that link to redundant format variants of the same feed (same internal title and link after light normalization).

Coincident formats | Pages
atom10 and rss2.0 | 4
rss2.0 and rss2.0 | 2
atom10 and atom10 | 1
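The duplicate test above compares each feed's internal title and link "after light normalization". The survey does not spell out that normalization, so the sketch below uses a hypothetical one: collapse whitespace and case in the title, and drop the scheme, a leading `www.`, and any trailing slash from the link.

```python
from urllib.parse import urlsplit

def feed_identity(title, link):
    """Hypothetical light normalization of a feed's internal title and link."""
    norm_title = " ".join(title.split()).casefold()
    parts = urlsplit(link.strip())
    host = parts.hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return norm_title, host + parts.path.rstrip("/")

def duplicate_groups(feeds):
    """feeds: iterable of (format, title, link) linked from one page.

    Returns the sorted format lists for identities that occur more than once.
    """
    groups = {}
    for fmt, title, link in feeds:
        groups.setdefault(feed_identity(title, link), []).append(fmt)
    return [sorted(formats) for formats in groups.values() if len(formats) > 1]
```

A page linking an Atom and an RSS variant of the same feed would yield one group such as `["atom10", "rss2.0"]`, matching the first row of the table.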
Quality: Autodiscovered vs. Not

Distribution of quality scores among successfully parsed feeds (% of each group per decile). With autodiscovery: 53,096 feeds, mean 0.251. Without: 250,694 feeds, mean 0.179.
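As a consistency check, the two group means can be combined into an overall mean weighted by group size. This is plain arithmetic on the figures quoted above; the variable names are illustrative only.

```python
# (feed count, mean quality) for each group, as reported above.
groups = {
    "with_autodiscovery": (53_096, 0.251),
    "without_autodiscovery": (250_694, 0.179),
}

total_feeds = sum(count for count, _ in groups.values())
# Weighted mean: each group's mean counts in proportion to its size.
overall_mean = sum(count * mean for count, mean in groups.values()) / total_feeds
```

The total comes to 303,790 parsed feeds, and the weighted mean agrees with the reported overall operational quality of 0.191 to within the rounding of the inputs.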

Autodiscovery Usage by HTML Platform
Known page-side platform hints among analyzed HTML responses, plus an unknown row for pages with no recognized fingerprint. Parenthetical percentages use HTML pages in that row as the denominator and show how often those pages expose RSS/Atom links. Detection is conservative and based on generator metadata plus common asset markers.
Fingerprint | HTML pages | With autodiscovery
unknown | 353,984,603 | 53,146,358 (15.0%)
wordpress | 45,451,537 | 26,802,985 (59.0%)
drupal | 13,190,932 | 758,482 (5.8%)
shopify | 1,097,730 | 99,413 (9.1%)
joomla | 930,272 | 203,431 (21.9%)
substack | 490,578 | 389,821 (79.5%)
wix | 430,754 | 82,764 (19.2%)
blogger | 335,819 | 305,604 (91.0%)
ghost | 229,444 | 218,342 (95.2%)
squarespace | 20,792 | 1,017 (4.9%)
medium | 2,995 | 41 (1.4%)
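Detection "based on generator metadata plus common asset markers" might look roughly like the sketch below. The hint tables are small illustrative assumptions, not the survey's actual fingerprint list, and the function name `fingerprint` is hypothetical.

```python
from html.parser import HTMLParser

# Illustrative hints only (assumed); a real detector would carry a longer, curated list.
GENERATOR_HINTS = {
    "wordpress": "wordpress",
    "drupal": "drupal",
    "joomla": "joomla",
    "ghost": "ghost",
    "blogger": "blogger",
}
ASSET_HINTS = {"/wp-content/": "wordpress", "cdn.shopify.com": "shopify"}

class GeneratorParser(HTMLParser):
    """Captures the content of <meta name="generator">, lowercased."""

    def __init__(self):
        super().__init__()
        self.generator = ""

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "generator":
            self.generator = (attr.get("content") or "").lower()

def fingerprint(html):
    parser = GeneratorParser()
    parser.feed(html)
    # Generator metadata is checked first, then asset markers.
    for needle, platform in GENERATOR_HINTS.items():
        if needle in parser.generator:
            return platform
    lowered = html.lower()
    for marker, platform in ASSET_HINTS.items():
        if marker in lowered:
            return platform
    return "unknown"  # no recognized fingerprint
```

Being conservative in this way means many pages built on a known platform still land in the unknown row, which is why that row dominates the table.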

3. Feeds

Successfully Parsed Feeds
303,790
Percentages use feed URLs checked as the denominator for parse results, and successfully parsed feeds as the denominator for autodiscovery rows.
Parse success rate 97.6%
Broken/unparseable 7,592 (2.4%)
With autodiscovery link 53,096 (17.5%)
Without autodiscovery link 250,694 (82.5%)
Feeds with Entry Content
237,123
Coverage rate 78.1%
Full content 90,416 (29.8%)
Summary only 146,707 (48.3%)
Neither 66,667 (21.9%)
Denominator: successfully parsed feeds.
Entries Found
3,189,317
Percentages use successfully parsed feeds as the denominator, except the dated-entries row, which uses parsed feeds with entries.
Mean entries per parsed feed 10.5
Parsed feeds with entries 239,438 (78.8%)
Feeds with dated entries 232,646 (97.2%)
Feeds with feed-level dates 222,276 (73.2%)
Feeds with repeated entry titles 27,674
Feeds with default entry titles 79
Feeds with repeated entry links 24,211
Default entry titles are generic placeholder-like values such as “Default Title”.
7,592
feed parse errors
2.4% of 311,382 feed URLs checked
Error-row percentages use total parse errors as the denominator.
Error | Count | Share of parse errors
XML declaration allowed only at the start of the document | 3,971 | 52.3%
EntityRef: expecting ';' | 1,001 | 13.2%
CData section not finished | 601 | 7.9%
Unknown root tag: div | 277 | 3.6%
xmlParseEntityRef: no name | 219 | 2.9%
Unknown root tag: quakeml | 177 | 2.3%
Opening and ending tag mismatch: title line 4 and feed | 137 | 1.8%
no element found (line 0) | 105 | 1.4%
Extra content at the end of the document | 80 | 1.1%
Not XML | 73 | 1.0%
Malformed declaration expecting version | 64 | 0.8%
Namespace prefix rdf for resource on license is not defined | 52 | 0.7%
Invalid bytes in character encoding | 50 | 0.7%
Namespace prefix sn on type is not defined | 46 | 0.6%
Unknown root tag: entry | 45 | 0.6%
Unknown root tag: style | 45 | 0.6%
PCDATA invalid Char value 12 | 44 | 0.6%
Empty response | 32 | 0.4%
Start tag expected | 30 | 0.4%
Unknown root tag: head | 29 | 0.4%
Feed Availability and Freshness
Step-down view from every feed URL checked to parsed feeds with entries that appear fresh. Fresh means the newest entry date, or feed-level updated date when no entry date exists, is within 365 days of the crawl response time. Funnel percentages use feed URLs checked as the denominator.
Feed URLs checked
311,382
Parsed RSS/Atom
303,790
Freshness signal within cutoff
100,643
Active with entries
67,997
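The freshness rule and funnel shares above can be sketched as follows. The function name `is_fresh` is hypothetical, and treating future-dated entries as fresh is an assumption of this sketch rather than something the survey states.

```python
from datetime import datetime, timedelta, timezone

FRESH_WINDOW = timedelta(days=365)  # cutoff stated in the survey text

def is_fresh(crawl_time, entry_dates, feed_updated=None):
    """Fresh if the newest entry date, or the feed-level date when no entry is dated,
    is within the window before the crawl response time."""
    newest = max(entry_dates) if entry_dates else feed_updated
    if newest is None:
        return False  # undated feed: no freshness signal at all
    # Future-dated entries give a negative age and pass (an assumption here).
    return crawl_time - newest <= FRESH_WINDOW

# Funnel steps as reported; shares use feed URLs checked as the denominator.
FUNNEL = {
    "Feed URLs checked": 311_382,
    "Parsed RSS/Atom": 303_790,
    "Freshness signal within cutoff": 100_643,
    "Active with entries": 67_997,
}
checked = FUNNEL["Feed URLs checked"]
funnel_shares = {step: round(100 * count / checked, 1) for step, count in FUNNEL.items()}
```

With these counts, roughly 32.3% of checked feed URLs carry a freshness signal within the cutoff and about 21.8% are active with entries.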
Operational Quality
0.191
mean score, 0-1, across successfully parsed feeds
Parenthetical percentages use successfully parsed feeds as the denominator.
Quality > 0.5 57,995 (19.1%)
Freshness signal within cutoff 100,643 (33.1%)
Mean among those feeds 0.578
Active with entries 67,997 (22.4%)
Undated or stale 203,147
Feed Quality Distribution
Operational, non-editorial score 0–1 among successfully parsed feeds. In split tables, “quality > 0.5” means a feed has a usable freshness signal and enough basic entry/feed metadata to look usable; lower-scoring feeds remain in the all-feeds columns so abandoned or sparse feeds still count. Feeds with no usable date, or no freshness signal within 365 days, score 0 here; the card above also shows the mean after counting those feeds out. Excluded from the freshness-filtered mean: 26,891 undated and 176,256 stale feeds.
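Because undated and stale feeds score exactly 0, the freshness-filtered mean can be approximated from the overall mean by rescaling to the smaller denominator. The sketch below does only that arithmetic, using the rounded figures reported in this section; the variable names are illustrative.

```python
PARSED_FEEDS = 303_790
FRESH_FEEDS = 100_643
UNDATED_FEEDS = 26_891
STALE_FEEDS = 176_256
OVERALL_MEAN = 0.191  # rounded mean across all parsed feeds

# Feeds excluded from the freshness-filtered mean.
excluded = UNDATED_FEEDS + STALE_FEEDS

# Zero scores add nothing to the sum of scores, so the same sum is spread
# over fewer feeds once the excluded ones are counted out.
implied_fresh_mean = OVERALL_MEAN * PARSED_FEEDS / FRESH_FEEDS
```

The excluded count equals the 203,147 "Undated or stale" feeds, and the implied mean comes out near 0.577, close to the reported 0.578; the small gap is consistent with rounding in the overall mean.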
Quality Score Components
Mean component scores across successfully parsed feeds. Bar labels show each component's weight in the composite score. Repeated/default-looking entry titles and repeated entry links reduce the entry metadata component and can cap the final score when severe.
Quality by Format
% of feeds in each quality tier per format, ordered by mean score (↓). Dot = mean score. Normalised for population size so rare formats compare fairly against common ones.
Feed Formats
Successfully parsed feeds only. RSS-family feeds: 199,853; Atom feeds: 103,937. Parenthetical percentages use feeds in that format as the denominator and show the share with operational quality > 0.5.
Format | Count | Quality > 0.5 | Mean quality
rss1.0 | 1 | 1/1 (100.0%) | 0.782
rss | 1 | 1/1 (100.0%) | 0.650
rss0.92 | 3,200 | 1,209/3,200 (37.8%) | 0.286
atom10 | 103,937 | 14,520/103,937 (14.0%) | 0.195
rss2.0 | 181,975 | 40,508/181,975 (22.3%) | 0.191
rss0.91 | 106 | 23/106 (21.7%) | 0.167
rss10 | 14,552 | 1,732/14,552 (11.9%) | 0.143
rss2.00 | 18 | 1/18 (5.6%) | 0.052
Charset per Format
Source: HTTP Content-Type header only.
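Format | Charset | Count
rss2.0 | utf-8 | 164,428
rss2.0 | unknown | 16,544
rss2.0 | iso-8859-1 | 549
rss2.0 | windows-1251 | 446
rss2.0 | windows1251 | 2
rss2.0 | iso-8859-2 | 1
rss2.0 | windows-1250 | 1
rss2.0 | utf8 | 1
rss2.0 | windows-1252 | 1
rss2.0 | uft-8 | 1
rss2.0 | latin2 | 1
atom10 | utf-8 | 78,892
atom10 | unknown | 25,033
atom10 | iso-8859-1 | 12
rss0.92 | utf-8 | 2,483
rss0.92 | iso-8859-1 | 714
rss0.92 | windows-1251 | 2
rss0.92 | unknown | 1
rss0.91 | utf-8 | 88
rss0.91 | unknown | 15
rss0.91 | iso-8859-1 | 2
rss0.91 | gb2312 | 1
rss10 | utf-8 | 11,281
rss10 | unknown | 3,252
rss10 | iso-8859-1 | 19
rss2.00 | unknown | 18
rss | unknown | 1
rss1.0 | utf-8 | 1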
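The charset values above come from the HTTP Content-Type header alone, and the table keeps spelling variants such as `utf8` and the misspelled `uft-8` as separate rows. The sketch below shows one way to read the header parameter and, separately, to test whether a label names a codec Python knows. Both helper names are hypothetical, and folding aliases this way is not something the survey does.

```python
import codecs
from email.message import Message

def header_charset(content_type):
    """Return the lowercased charset parameter of a Content-Type value, or 'unknown'."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or "unknown"

def canonical_charset(label):
    """Map a charset label to Python's canonical codec name, or None if unrecognized."""
    try:
        return codecs.lookup(label).name
    except LookupError:
        return None  # e.g. a misspelled label such as 'uft-8'
```

For example, `header_charset("application/rss+xml; charset=UTF-8")` yields `utf-8`, while a header with no charset parameter falls into the unknown bucket.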
Entry Content Profile
Content type classification based on Atom type attribute and RSS element semantics. Parenthetical percentages in the first count column use all successfully parsed feeds as the denominator. High-quality feeds have operational quality > 0.5.
Profile | All parsed feeds | Among 57,995 high-quality feeds
html | 132,524 (43.6%) | 31,651 (54.6%)
unknown | 66,667 (21.9%) | 272 (0.5%)
plain | 50,672 (16.7%) | 13,525 (23.3%)
mixed | 49,074 (16.2%) | 12,153 (21.0%)
xhtml | 4,853 (1.6%) | 394 (0.7%)
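For the Atom half of the classification, a minimal sketch might read the `type` attribute on each entry's content and summary and then reduce the set of kinds to a single feed-level profile. The reduction rule (one kind gives that kind, several give "mixed", none gives "unknown") is an assumption; it fits the unknown row equalling the 66,667 feeds with neither content nor summary, but the survey does not state it. RSS handling is omitted.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
# Atom text construct types mapped to the profile labels used in the table.
ATOM_KINDS = {"text": "plain", "html": "html", "xhtml": "xhtml"}

def atom_content_profile(feed_xml):
    """Hypothetical feed-level profile derived from per-entry Atom type attributes."""
    root = ET.fromstring(feed_xml)
    kinds = set()
    for entry in root.iter(ATOM + "entry"):
        for tag in ("content", "summary"):
            for element in entry.findall(ATOM + tag):
                # RFC 4287: a missing type attribute means "text".
                # MIME-typed content (e.g. application/xml) is ignored here.
                kind = ATOM_KINDS.get(element.get("type", "text"))
                if kind:
                    kinds.add(kind)
    if not kinds:
        return "unknown"
    return kinds.pop() if len(kinds) == 1 else "mixed"
```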
Entry Counts per Feed
Feed Languages
Top feed language declarations across successfully parsed feeds. Parenthetical percentages in the first count column use all successfully parsed feeds as the denominator. High-quality feeds have operational quality > 0.5.
Language | All parsed feeds | Among 57,995 high-quality feeds
unknown | 145,914 (48.0%) | 15,240 (26.3%)
en | 58,749 (19.3%) | 11,608 (20.0%)
en-us | 42,865 (14.1%) | 14,029 (24.2%)
es | 7,941 (2.6%) | 2,330 (4.0%)
fr | 6,633 (2.2%) | 1,853 (3.2%)
de | 4,898 (1.6%) | 1,464 (2.5%)
ru | 3,550 (1.2%) | 1,182 (2.0%)
en-gb | 3,441 (1.1%) | 890 (1.5%)
pt-br | 3,275 (1.1%) | 971 (1.7%)
ja | 1,867 (0.6%) | 664 (1.1%)
it-it | 1,338 (0.4%) | 367 (0.6%)
de-de | 1,335 (0.4%) | 722 (1.2%)
es-es | 1,274 (0.4%) | 312 (0.5%)
uk | 1,191 (0.4%) | 388 (0.7%)
fr-fr | 1,027 (0.3%) | 542 (0.9%)
fi | 1,001 (0.3%) | 150 (0.3%)
ja-jp | 915 (0.3%) | 374 (0.6%)
nl | 886 (0.3%) | 349 (0.6%)
zh-tw | 842 (0.3%) | 190 (0.3%)
km | 794 (0.3%) | 62 (0.1%)
Language Tagging
Feed-internal language signals are read from xml:lang attributes, RSS <language>, Dublin Core dc:language, and Atom link hreflang. Counts are successfully parsed feeds; categories can overlap except the no-language row. HTTP/feed mismatches are only counted when an HTTP language conflicts with a feed-level language.
No language information 145,812 (48.0%)
HTTP Content-Language 58,387 (19.2%)
Feed-level language 151,815 (50.0%)
Entry-level language 3,571 (1.2%)
Both HTTP and feed-level language 52,559 (17.3%)
Mismatching HTTP and feed-level language 2,384 (0.8%)
Multiple entry languages 2,527 (0.8%)
Uses hreflang 116 (0.0%)
Multiple entry languages includes direct entry tags plus feed/HTTP languages that untagged entries inherit.
Feed Recency (CDF)
% of feeds with a last-update date whose update falls within the given age.
Newest Entry Recency (CDF)
% of feeds with at least one entry whose most recent entry falls within the given age. 64,352 zero-entry feeds excluded from denominator.
Feed History Depth (CDF of Oldest Entry Age)
% of feeds with at least one entry whose oldest entry is at most the given age; that is, how far back the feed's window reaches. A client polling every N days will miss entries in feeds whose oldest entry is < N days old when the poll fires. 64,352 zero-entry feeds excluded from denominator.
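The polling rule above can be read directly off the distribution of oldest-entry ages: for a given poll interval, the share of feeds at risk is the share whose window is shorter than that interval. A minimal sketch, with a hypothetical function name and made-up sample ages:

```python
def share_at_risk(oldest_entry_ages_days, poll_interval_days):
    """Fraction of feeds whose oldest entry is younger than the poll interval.

    Such feeds may have rotated entries out of their window between two polls.
    """
    at_risk = sum(1 for age in oldest_entry_ages_days if age < poll_interval_days)
    return at_risk / len(oldest_entry_ages_days)

# Made-up ages (in days) for four feeds, polled weekly.
example_share = share_at_risk([0.5, 3, 10, 40], poll_interval_days=7)
```

In this toy example, two of the four feeds have windows shorter than a week, so half of them are at risk for a weekly poller.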
Inferred Update Cadence (CDF)
% of feeds with enough dated entries whose average interval between entries is at most the given time. Cadence is inferred from the span between oldest and newest entry dates divided by entry count. 159,690 feeds lack enough dated entries to infer cadence.
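The cadence inference described above, span between oldest and newest entry dates divided by entry count, can be sketched as below. The function name and the `min_entries` threshold of 2 are assumptions; the survey does not say how many dated entries count as "enough".

```python
from datetime import datetime, timedelta

def inferred_cadence(entry_dates, min_entries=2):
    """Average interval between entries, or None when too few entries are dated.

    Follows the survey's wording: span divided by entry count (not count - 1).
    """
    if len(entry_dates) < min_entries:
        return None
    span = max(entry_dates) - min(entry_dates)
    return span / len(entry_dates)

# Three entries spread over 20 days.
example = inferred_cadence([
    datetime(2026, 1, 1),
    datetime(2026, 1, 11),
    datetime(2026, 1, 21),
])
```

Dividing by the entry count rather than the number of gaps slightly understates the interval for short feeds; the sketch keeps the survey's stated definition.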
Entry Content Length
Distribution of entry body or summary text lengths, measured after parsing successfully parsed feeds. This helps distinguish feeds that carry full content from feeds that mostly expose short summaries, links, or empty entries.
Feed Link Relations
Feed-level Atom link relations, including Atom links embedded in RSS channels. Self/canonical means rel=self; hub covers WebSub/PubSubHubbub discovery; paging and archive cover feed navigation and archived-feed links. Parenthetical percentages use all successfully parsed feeds as the denominator.
Signal | All parsed feeds | Among 57,995 high-quality feeds
self/canonical URL | 190,131 (62.6%) | 36,487 (62.9%)
paging links | 31,150 (10.3%) | 3,601 (6.2%)
WebSub/PubSubHubbub hub | 27,329 (9.0%) | 3,776 (6.5%)
archive links | 5 (0.0%) | 0 (0.0%)
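A minimal sketch of how these feed-level signals might be extracted, covering both Atom feeds and Atom links embedded in an RSS channel. The function name is hypothetical, and the grouping of rel values into paging and archive sets follows the link relations defined in RFC 5005, which is an assumption about how the survey buckets them.

```python
import xml.etree.ElementTree as ET

ATOM_LINK = "{http://www.w3.org/2005/Atom}link"
PAGING_RELS = {"first", "last", "previous", "prev", "next"}
ARCHIVE_RELS = {"prev-archive", "next-archive", "current"}

def feed_link_signals(feed_xml):
    """Return the set of feed-level link signals: self, hub, paging, archive."""
    root = ET.fromstring(feed_xml)
    # Feed-level links live under <channel> in RSS and directly under <feed> in Atom.
    container = root.find("channel") if root.tag == "rss" else root
    if container is None:
        return set()
    # A link without rel defaults to "alternate" in Atom.
    rels = {(link.get("rel") or "alternate").lower() for link in container.findall(ATOM_LINK)}
    signals = set()
    if "self" in rels:
        signals.add("self")
    if "hub" in rels:
        signals.add("hub")
    if rels & PAGING_RELS:
        signals.add("paging")
    if rels & ARCHIVE_RELS:
        signals.add("archive")
    return signals
```

Only direct children of the feed-level container are inspected, so entry-level links do not affect the result.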
Top Feed Extensions
Non-core namespace elements encountered across successfully parsed feeds. Parenthetical percentages in the first count column use all successfully parsed feeds as the denominator. High-quality feeds have operational quality > 0.5.
Element | All parsed feeds | Among 57,995 high-quality feeds
dc:creator | 87,143 (28.7%) | 23,287 (40.2%)
sy:updatePeriod | 37,823 (12.5%) | 7,557 (13.0%)
sy:updateFrequency | 37,749 (12.4%) | 7,486 (12.9%)
content:encoded | 29,140 (9.6%) | 8,829 (15.2%)
dc:date | 19,770 (6.5%) | 4,289 (7.4%)
opensearch:totalResults | 17,585 (5.8%) | 613 (1.1%)
opensearch:startIndex | 17,585 (5.8%) | 613 (1.1%)
opensearch:itemsPerPage | 17,585 (5.8%) | 613 (1.1%)
gd:image | 17,583 (5.8%) | 613 (1.1%)
itunes:author | 12,386 (4.1%) | 4,323 (7.5%)
itunes:explicit | 12,360 (4.1%) | 4,319 (7.4%)
itunes:image | 12,334 (4.1%) | 4,315 (7.4%)
itunes:owner | 12,292 (4.0%) | 4,286 (7.4%)
itunes:name | 12,248 (4.0%) | 4,260 (7.3%)
itunes:summary | 11,881 (3.9%) | 4,025 (6.9%)
Platform Fingerprints
Known feed generators or platform headers observed on parsed feeds. This is intentionally conservative; absent fingerprints mean “not identified,” not “custom-built.” Parenthetical percentages in the first count column use all successfully parsed feeds as the denominator. High-quality feeds have operational quality > 0.5; the final column shows the share of feeds in that fingerprint row that clear the threshold.
Fingerprint | All parsed feeds | Among 57,995 high-quality feeds | Quality within fingerprint
unknown | 230,473 (75.9%) | 45,239 (78.0%) | 19.6%
wordpress | 29,009 (9.5%) | 5,588 (9.6%) | 19.3%
drupal | 22,613 (7.4%) | 5,625 (9.7%) | 24.9%
blogger | 17,585 (5.8%) | 613 (1.1%) | 3.5%
joomla | 3,901 (1.3%) | 868 (1.5%) | 22.3%
ghost | 207 (0.1%) | 60 (0.1%) | 29.0%
feedburner | 1 (0.0%) | 1 (0.0%) | 100.0%
squarespace | 1 (0.0%) | 1 (0.0%) | 100.0%
Autodiscovered Feed Quality by Source Platform
Quality of successfully parsed feeds found through HTML autodiscovery, grouped by recognized platform hints on the source page. The unknown row covers autodiscovered feeds whose source page had no recognized platform fingerprint. Parenthetical percentages use parsed feeds in that source-platform row as the denominator.
Source platform | Autodiscovered parsed feeds | Quality > 0.5 | Mean quality
unknown | 31,873 | 8,619/31,873 (27.0%) | 0.257
drupal | 10,634 | 2,676/10,634 (25.2%) | 0.180
wordpress | 8,205 | 3,049/8,205 (37.2%) | 0.331
blogger | 2,330 | 281/2,330 (12.1%) | 0.224
joomla | 563 | 286/563 (50.8%) | 0.397
substack | 40 | 34/40 (85.0%) | 0.740
ghost | 30 | 21/30 (70.0%) | 0.579
shopify | 12 | 7/12 (58.3%) | 0.432
wix | 9 | 8/9 (88.9%) | 0.719
squarespace | 5 | 4/5 (80.0%) | 0.676