Web Feed Survey

Common Crawl Archive: CC-MAIN-2026-12

📅 Crawl ended 2026-03-17
⚠️ Filtered to top 500,000 entries (Tranco subdomain-inclusive, by registrable site)

1. The Crawl

Percentages describe this Common Crawl result set, not the entire Web. Metrics reflect what Common Crawl fetched, what sites allowed it to fetch, and the configured Tranco list/sample limits for this run. Site counts and TOP_N scoping use the Tranco subdomain-inclusive list, normalized to registrable sites with the Public Suffix List, including private suffixes for hosted sub-sites.
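As a rough illustration of the normalization described above, the sketch below reduces a hostname to its registrable site. The `TOY_SUFFIXES` set and the `registrable_site` helper are hypothetical stand-ins and not the survey's actual code; a real run would load the full Public Suffix List, including the wildcard and exception rules that this sketch omits.

```python
# Toy stand-in for the Public Suffix List (assumption: the real list is far
# larger and also has wildcard/exception rules, which are not handled here).
TOY_SUFFIXES = {"com", "org", "co.uk", "github.io", "blogspot.com"}


def registrable_site(host: str) -> str:
    """Return the matched public suffix plus one label to its left."""
    host = host.lower().rstrip(".")
    labels = host.split(".")
    # Walk from the longest candidate suffix to the shortest.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in TOY_SUFFIXES:
            if i == 0:
                return host  # the host is itself a public suffix
            return ".".join(labels[i - 1:])
    # Unknown suffix: fall back to the last two labels.
    return ".".join(labels[-2:]) if len(labels) >= 2 else host


for h in ("news.example.com", "www.example.co.uk", "alice.github.io"):
    print(h, "->", registrable_site(h))
```

Treating private suffixes such as `github.io` as suffixes is what lets hosted sub-sites count as separate sites rather than collapsing into the hosting provider.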
Responses Processed
425,459,512
All responses after Tranco filtering; 418,120,515 had HTML, feed, XML, or sniffable Content-Type values and were analyzed further.
HTML Pages Processed
416,036,676
HTML/XHTML responses; used as the denominator for autodiscovery page rates.
Unique Sites
196,598
Distinct registrable sites among processed responses (HyperLogLog estimate).
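Because the site count is a HyperLogLog estimate, it is approximate rather than exact. The following is a minimal sketch of the general technique, assuming 4,096 registers and a SHA-1-derived 64-bit hash; the class name, the parameters, and the hash choice are illustrative assumptions, not details taken from the survey.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch (illustrative parameters)."""

    def __init__(self, p: int = 12):
        self.p = p
        self.m = 1 << p              # number of registers
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        # Take the first 8 bytes of SHA-1 as a 64-bit hash value.
        h = int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")
        idx = h >> (64 - self.p)                   # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)      # bias constant for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog()
for i in range(50_000):
    hll.add(f"site-{i}.example")
print(round(hll.estimate()))
```

With 4,096 registers the expected relative error is roughly 1.6%, so a figure such as 196,598 should be read as an estimate with an error of that order (under these assumed parameters).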
Feed URLs Checked
543,577
232,910 generic XML/text/plain/octet-stream responses sniffed as feeds; the rest had exact RSS/Atom media types.
Analyzed Response Content-Type Distribution
HTTP Content-Type header values among responses that passed the analysis prefilter. “Other XML” is XML content without an exact RSS/Atom media type, such as application/xml or text/xml. “Other Non-XML” is sniffable non-XML content such as text/plain or application/octet-stream. Either is only counted as a feed URL if sniffing finds RSS/Atom. Parenthetical Content-Type percentages use all analyzed responses as the denominator; sniffed outcome percentages use sniffed feeds as the denominator.
HTML
416,036,676
99.50% of total
Non-HTML types (bars scaled relative to each other):
Atom
113,060  (0.0270%)
RSS
198,747  (0.0475%)
Other XML
966,142  (0.2311%)
Other Non-XML
805,890  (0.1927%)
Sniffed feed outcomes (bars scaled relative to sniffed feeds):
RSS (sniffed)
212,171  (92.1%)
Atom (sniffed)
18,253  (7.9%)
Other parsed (sniffed)
0  (0.0%)
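The sniffing step for generic XML or text responses can be pictured as a check on the document's root element. The sketch below is one plausible approach, not the survey's implementation: `sniff_feed` is a hypothetical helper, and RSS 1.0 (whose root is an RDF element) is left out because recognizing it requires inspecting child elements.

```python
import io
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def sniff_feed(body: bytes):
    """Return 'rss', 'atom', or None based on the root element only."""
    root_tag = None
    try:
        # Stop at the first start event: that element is the document root.
        for _event, elem in ET.iterparse(io.BytesIO(body), events=("start",)):
            root_tag = elem.tag
            break
    except ET.ParseError:
        return None  # not well-formed XML up to the root element
    if root_tag == "rss":
        return "rss"
    if root_tag == ATOM_NS + "feed":
        return "atom"
    return None


print(sniff_feed(b'<rss version="2.0"><channel/></rss>'))
print(sniff_feed(b"<div>not a feed</div>"))
```

Only the start of the document is examined here, so a response that begins like a feed but is malformed later would still be sniffed as a feed and then surface as a parse error.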

2. Feed Autodiscovery

Autodiscovery Coverage
Pages with feed links
81,943,367
19.70% of analyzed HTML responses
Sites with feed links
70,636
35.93% of analyzed registrable sites
Sites with feed links is the autodiscovery subset: unique registrable sites exposing at least one RSS/Atom link in HTML.
Discovery link relations
Pages with feed rel="alternate" 81,940,235
Pages with feed rel="feed" 12,796
Pages using both relations 9,664
Pages with a multi-rel link 9,546
Multi-rel links 16,977
These counts include only RSS/Atom autodiscovery links: HTML <link> elements whose type is RSS, Atom, or RDF feed XML. Other uses of rel="alternate" are not counted. Pages using both relations includes pages with separate alternate and feed links, plus pages with a single link whose rel contains both. Multi-rel links counts those single link elements.
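The counting rules above can be sketched with Python's standard `html.parser`. The class and function names, and the exact set of media types in `FEED_TYPES`, are assumptions made for illustration; the survey's own extractor may differ.

```python
from html.parser import HTMLParser

# Assumed media types for RSS, Atom, and RDF feed XML.
FEED_TYPES = {"application/rss+xml", "application/atom+xml", "application/rdf+xml"}


class FeedLinkParser(HTMLParser):
    """Collect <link> elements that qualify as feed autodiscovery links."""

    def __init__(self):
        super().__init__()
        self.links = []  # list of (href, set of matching rel tokens)

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = {k: (v or "") for k, v in attrs}
        rels = set(a.get("rel", "").lower().split()) & {"alternate", "feed"}
        mime = a.get("type", "").split(";")[0].strip().lower()
        # Other uses of rel="alternate" (no feed media type) are ignored.
        if rels and mime in FEED_TYPES and a.get("href"):
            self.links.append((a["href"], rels))


def summarize(html: str) -> dict:
    parser = FeedLinkParser()
    parser.feed(html)
    parser.close()
    has_alt = any("alternate" in r for _, r in parser.links)
    has_feed = any("feed" in r for _, r in parser.links)
    return {
        "alternate": has_alt,
        "feed": has_feed,
        "both": has_alt and has_feed,
        # A multi-rel link carries both tokens on a single element.
        "multi_rel_links": sum(1 for _, r in parser.links if len(r) == 2),
        "feed_urls": len({href for href, _ in parser.links}),
    }


page = ('<link rel="alternate" type="application/rss+xml" href="/rss.xml">'
        '<link rel="alternate feed" type="application/atom+xml" href="/atom.xml">')
print(summarize(page))
```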
Autodiscovery Links per Page

Unique feed URLs per discovered HTML page (analyzed HTML responses with zero feed links omitted: 334,093,309).

Autodiscovery Links per Site

Unique feed URLs per registrable site. Sites with zero links omitted: 125,962.

Pages with Duplicate Feeds
127
0.5% of 24,317 multi-feed pages

Pages that link to redundant format variants of the same feed (same internal title and link after light normalization).

Coincident formats | Pages
rss2.0 and rss2.0 | 181
atom10 and rss2.0 | 100
atom10 and rss10 | 13
rss10 and rss2.0 | 8
atom10 and atom10 | 5
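Duplicate detection of this kind can be approximated by grouping a page's feeds on a normalized (title, link) key. The report does not spell out what "light normalization" means, so the rules below (collapse whitespace, casefold the title, drop a trailing slash from the link) and the helper names are assumptions for illustration only.

```python
import re
from collections import defaultdict


def norm_key(title: str, link: str) -> tuple:
    # Assumed "light" normalization; the survey's actual rules are not stated.
    t = re.sub(r"\s+", " ", title).strip().casefold()
    l = link.strip().rstrip("/")
    return (t, l)


def duplicate_groups(feeds):
    """feeds: iterable of (feed_url, fmt, internal_title, internal_link).

    Returns the groups of (feed_url, fmt) that share a normalized key.
    """
    groups = defaultdict(list)
    for url, fmt, title, link in feeds:
        groups[norm_key(title, link)].append((url, fmt))
    return [g for g in groups.values() if len(g) > 1]


page_feeds = [
    ("https://example.org/feed.rss", "rss2.0", "Example  Blog", "https://example.org/"),
    ("https://example.org/feed.atom", "atom10", "example blog", "https://example.org"),
    ("https://example.org/comments.rss", "rss2.0", "Comments", "https://example.org/"),
]
print(duplicate_groups(page_feeds))
```

In this hypothetical example the first two feeds would be reported as an "atom10 and rss2.0" coincident pair.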
Quality: Autodiscovered vs. Not

Distribution of quality scores among successfully parsed feeds (% of each group per decile). With autodiscovery: 94,824 feeds, mean 0.246. Without: 439,371 feeds, mean 0.220.

Autodiscovery Usage by HTML Platform
Known page-side platform hints among analyzed HTML responses, plus an unknown row for pages with no recognized fingerprint. Parenthetical percentages use HTML pages in that row as the denominator and show how often those pages expose RSS/Atom links. Detection is conservative and based on generator metadata plus common asset markers.
Fingerprint | HTML pages | With autodiscovery
unknown 353,974,158 53,145,055 (15.0%)
wordpress 45,450,449 26,802,369 (59.0%)
drupal 13,190,628 758,470 (5.8%)
shopify 1,097,732 99,423 (9.1%)
joomla 930,240 203,428 (21.9%)
substack 490,567 389,821 (79.5%)
wix 430,723 82,752 (19.2%)
blogger 335,807 305,603 (91.0%)
ghost 229,434 218,332 (95.2%)
squarespace 20,798 1,017 (4.9%)
medium 2,994 41 (1.4%)

3. Feeds

Successfully Parsed Feeds
534,195
Percentages use feed URLs checked as the denominator for parse results, and successfully parsed feeds as the denominator for autodiscovery rows.
Parse success rate 98.3%
Broken/unparseable 9,382 (1.7%)
With autodiscovery link 94,824 (17.8%)
Without autodiscovery link 439,371 (82.2%)
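Since the two groups of rows use different denominators, a quick arithmetic check helps confirm which is which. The snippet below recomputes the percentages from the counts reported in this card; the variable names are ad hoc.

```python
checked = 543_577       # feed URLs checked
parsed = 534_195        # successfully parsed feeds
broken = 9_382          # broken/unparseable
with_auto = 94_824      # parsed feeds with an autodiscovery link
without_auto = 439_371  # parsed feeds without one


def pct(part: int, whole: int) -> float:
    return round(100 * part / whole, 1)


# Parse results use feed URLs checked as the denominator.
print(pct(parsed, checked), pct(broken, checked))
# Autodiscovery rows use successfully parsed feeds as the denominator.
print(pct(with_auto, parsed), pct(without_auto, parsed))
```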
Sites with Feeds Found
13,796
Unique registrable sites that host at least one successfully parsed feed URL.
Share of analyzed sites 7.02%
Analyzed sites 196,598
Feeds with Entry Content
413,346
Coverage rate 77.4%
Full content 127,484 (23.9%)
Summary only 285,862 (53.5%)
Neither 120,849 (22.6%)
Denominator: successfully parsed feeds.
Entries Found
7,996,206
Percentages use successfully parsed feeds as the denominator, except dated entries use parsed feeds with entries.
Mean entries per parsed feed 15.0
Parsed feeds with entries 420,978 (78.8%)
Feeds with dated entries 395,919 (94.0%)
Feeds with feed-level dates 372,724 (69.8%)
Feeds with repeated entry titles 54,426
Feeds with default entry titles 95
Feeds with repeated entry links 63,440
Default entry titles are generic placeholder-like values such as “Default Title”.
9,382
feed parse errors
1.7% of 543,577 feed URLs checked
Error-row percentages use total parse errors as the denominator.
XML declaration allowed only at the start of the document 4,258 45.4%
EntityRef: expecting ';' 1,137 12.1%
CData section not finished 794 8.5%
no element found (line 0) 364 3.9%
xmlParseEntityRef: no name 327 3.5%
Unknown root tag: div 286 3.0%
Invalid bytes in character encoding 188 2.0%
Unknown root tag: quakeml 177 1.9%
Opening and ending tag mismatch: title line 4 and feed 139 1.5%
Extra content at the end of the document 104 1.1%
Blank needed here 100 1.1%
Not XML 97 1.0%
Namespace prefix owl on sameAs is not defined 77 0.8%
Start tag expected 67 0.7%
Malformed declaration expecting version 64 0.7%
Unknown root tag: br 63 0.7%
Namespace prefix rdf for resource on license is not defined 52 0.6%
Namespace prefix sn on type is not defined 46 0.5%
Unknown root tag: entry 45 0.5%
Unknown root tag: style 45 0.5%
Feed Availability and Freshness
Step-down view from every feed URL checked to parsed feeds with entries that appear fresh. Fresh means the newest entry date, or feed-level updated date when no entry date exists, is within 365 days of the crawl response time. Funnel percentages use feed URLs checked as the denominator.
Feed URLs checked
543,577
Parsed RSS/Atom
534,195
Freshness signal within cutoff
203,483
Active with entries
141,380
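The freshness rule used in this funnel can be written down compactly. The sketch below follows the definition given above (newest entry date, else feed-level updated date, within 365 days of the crawl response time); the function and parameter names are illustrative, and treating dates later than the response time as fresh is an assumption.

```python
from datetime import datetime, timedelta, timezone


def is_fresh(fetched_at, newest_entry=None, feed_updated=None, cutoff_days=365):
    """True if the feed has a freshness signal within the cutoff."""
    # Prefer the newest entry date; fall back to the feed-level date.
    signal = newest_entry or feed_updated
    if signal is None:
        return False  # undated feed: no usable freshness signal
    # Assumption: dates in the future relative to the fetch count as fresh.
    return fetched_at - signal <= timedelta(days=cutoff_days)


crawl_time = datetime(2026, 3, 17, tzinfo=timezone.utc)
print(is_fresh(crawl_time, newest_entry=crawl_time - timedelta(days=30)))
print(is_fresh(crawl_time, feed_updated=crawl_time - timedelta(days=400)))
```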
Operational Quality
0.225
mean score, 0-1, across successfully parsed feeds
Parenthetical percentages use successfully parsed feeds as the denominator.
Quality > 0.5 120,832 (22.6%)
Freshness signal within cutoff 203,483 (38.1%)
Mean among those feeds 0.590
Active with entries 141,380 (26.5%)
Undated or stale 330,712
Feed Quality Distribution
Operational, non-editorial score 0–1 among successfully parsed feeds. In split tables, “quality > 0.5” means a feed has a usable freshness signal and enough basic entry/feed metadata to look usable; lower-scoring feeds remain in the all-feeds columns so abandoned or sparse feeds still count. Feeds with no usable date, or no freshness signal within 365 days, score 0 here; the card above also shows the mean after counting those feeds out. Excluded from the freshness-filtered mean: 57,319 undated and 273,393 stale feeds.
Quality Score Components
Mean component scores across successfully parsed feeds. Bar labels show each component's weight in the composite score. Repeated/default-looking entry titles and repeated entry links reduce the entry metadata component and can cap the final score when severe.
Quality by Format
% of feeds in each quality tier per format, ordered by mean score (↓). Dot = mean score. Normalised for population size so rare formats compare fairly against common ones.
Feed Formats
Successfully parsed feeds only. RSS-family feeds: 412,014; Atom feeds: 122,181. Parenthetical percentages use feeds in that format as the denominator and show the share with operational quality > 0.5.
Format | Count | Quality > 0.5 | Mean quality
rss1.0 2 2/2 (100.0%) 0.732
rss0.93 1 1/1 (100.0%) 0.682
rss 5 5/5 (100.0%) 0.622
rss0.92 3,224 1,220/3,224 (37.8%) 0.287
rss2.0 378,287 92,687/378,287 (24.5%) 0.230
rss10 28,256 6,608/28,256 (23.4%) 0.217
atom10 122,181 19,940/122,181 (16.3%) 0.209
rss0.91 2,217 368/2,217 (16.6%) 0.150
rss2.00 18 1/18 (5.6%) 0.052
rss.92 4 0/4 (0.0%) 0.000
Charset per Format
Source: HTTP Content-Type header only.
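Format | Charset | Count
rss2.0 | utf-8 | 316,850
rss2.0 | unknown | 54,486
rss2.0 | iso-8859-1 | 4,203
rss2.0 | windows-1251 | 1,267
rss2.0 | iso-8859-15 | 914
rss2.0 | windows-1252 | 231
rss2.0 | utf8 | 152
rss2.0 | windows-1256 | 69
rss2.0 | $charset | 36
rss2.0 | iso-8859-9 | 18
rss2.0 | iso-8859-2 | 16
rss2.0 | gb2312 | 16
rss2.0 | euc-kr | 12
rss2.0 | windows-1250 | 9
rss2.0 | big5 | 3
rss2.0 | windows1251 | 2
rss2.0 | gbk | 1
rss2.0 | uft-8 | 1
rss2.0 | latin2 | 1
atom10 | utf-8 | 89,909
atom10 | unknown | 31,945
atom10 | iso-8859-1 | 315
atom10 | iso-8859-15 | 12
rss10 | utf-8 | 20,106
rss10 | unknown | 7,960
rss10 | iso-8859-1 | 161
rss10 | iso-8859-15 | 13
rss10 | euc-jp | 9
rss10 | utf8 | 3
rss10 | gb2312 | 1
rss10 | windows-1251 | 1
rss10 | windows-1250 | 1
rss10 | gbk | 1
rss0.91 | iso-8859-2 | 787
rss0.91 | unknown | 764
rss0.91 | utf-8 | 629
rss0.91 | koi8-r | 31
rss0.91 | gb2312 | 3
rss0.91 | iso-8859-1 | 2
rss0.91 | windows-1250 | 1
rss0.92 | utf-8 | 2,501
rss0.92 | iso-8859-1 | 714
rss0.92 | unknown | 6
rss0.92 | windows-1251 | 2
rss0.92 | iso-8859-15 | 1
rss2.00 | unknown | 18
rss.92 | utf-8 | 4
rss | unknown | 3
rss | utf-8 | 2
rss1.0 | utf-8 | 1
rss1.0 | unknown | 1
rss0.93 | utf-8 | 1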
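Since the charset column comes from the HTTP Content-Type header alone, values such as `utf8`, `uft-8`, and `$charset` are reported as sent. The sketch below shows one way to pull the raw label out of a header with Python's standard library, and how a separate lookup could tell known aliases from unrecognized labels; both helper names are hypothetical, and the alias lookup is an optional extra that the table above does not appear to apply.

```python
import codecs
from email.message import Message


def header_charset(content_type):
    """Return the lowercased charset parameter, or 'unknown' if absent."""
    if not content_type:
        return "unknown"
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or "unknown"


def canonical_codec(label):
    """Map a charset label to Python's codec name, or None if unrecognized."""
    try:
        return codecs.lookup(label).name
    except LookupError:
        return None


print(header_charset("application/rss+xml; charset=UTF-8"))
print(header_charset("text/xml"))
print(canonical_codec("utf8"), canonical_codec("uft-8"))
```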
Entry Content Profile
Content type classification based on Atom type attribute and RSS element semantics. Parenthetical percentages in the first count column use all successfully parsed feeds as the denominator. High-quality feeds have operational quality > 0.5.
Profile | All parsed feeds | Among 120,832 high-quality feeds
html 210,747 (39.5%) 58,025 (48.0%)
unknown 120,849 (22.6%) 1,401 (1.2%)
plain 119,857 (22.4%) 38,246 (31.7%)
mixed 77,791 (14.6%) 22,759 (18.8%)
xhtml 4,951 (0.9%) 401 (0.3%)
Entry Counts per Feed
Feed Languages
Top feed language declarations across successfully parsed feeds. Parenthetical percentages in the first count column use all successfully parsed feeds as the denominator. High-quality feeds have operational quality > 0.5.
Language | All parsed feeds | Among 120,832 high-quality feeds
unknown 250,463 (46.9%) 32,683 (27.0%)
en 83,341 (15.6%) 22,163 (18.3%)
en-us 77,634 (14.5%) 25,184 (20.8%)
ja 13,340 (2.5%) 6,038 (5.0%)
fr 11,981 (2.2%) 3,488 (2.9%)
es 9,764 (1.8%) 3,208 (2.7%)
de 8,528 (1.6%) 3,051 (2.5%)
ru 6,420 (1.2%) 1,841 (1.5%)
fr-fr 5,979 (1.1%) 1,946 (1.6%)
en-gb 5,843 (1.1%) 2,214 (1.8%)
pt-br 4,175 (0.8%) 1,318 (1.1%)
de-de 3,821 (0.7%) 1,927 (1.6%)
pt 2,801 (0.5%) 795 (0.7%)
cs 2,740 (0.5%) 831 (0.7%)
utf-8 2,543 (0.5%) 5 (0.0%)
it-it 2,469 (0.5%) 686 (0.6%)
ja-jp 2,335 (0.4%) 1,250 (1.0%)
zh-tw 2,328 (0.4%) 502 (0.4%)
es-es 2,066 (0.4%) 672 (0.6%)
zh-cn 1,634 (0.3%) 323 (0.3%)
Language Tagging
Feed-internal language signals are read from xml:lang attributes, RSS <language>, Dublin Core dc:language, and Atom link hreflang. Counts are successfully parsed feeds; categories can overlap except the no-language row. HTTP/feed mismatches are only counted when an HTTP language conflicts with a feed-level language.
No language information 249,759 (46.8%)
HTTP Content-Language 72,201 (13.5%)
Feed-level language 269,430 (50.4%)
Entry-level language 5,698 (1.1%)
Both HTTP and feed-level language 59,294 (11.1%)
Mismatching HTTP and feed-level language 6,806 (1.3%)
Multiple entry languages 5,645 (1.1%)
Uses hreflang 1,343 (0.3%)
Multiple entry languages includes direct entry tags plus feed/HTTP languages that untagged entries inherit.
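A mismatch check between the HTTP Content-Language header and a feed-level language might look like the sketch below. The report does not say how tags are compared, so comparing only the primary subtag (treating `en` and `en-US` as compatible) is an assumption, as are the helper names.

```python
def primary_subtag(tag: str) -> str:
    # "en-US", "en_us" and "EN" all reduce to "en".
    return tag.strip().lower().replace("_", "-").split("-")[0]


def http_feed_mismatch(content_language, feed_language) -> bool:
    """True only when both signals exist and their primary subtags conflict."""
    if not content_language or not feed_language:
        return False  # a mismatch needs both an HTTP and a feed-level language
    # Content-Language may carry a comma-separated list of tags.
    http_langs = {primary_subtag(t) for t in content_language.split(",") if t.strip()}
    return primary_subtag(feed_language) not in http_langs


print(http_feed_mismatch("en-US", "en"))
print(http_feed_mismatch("de", "en-us"))
```

A stricter comparison on the full tag would report more mismatches than this sketch does.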
Feed Recency (CDF)
% of feeds with a last-update date whose update falls within the given age.
Newest Entry Recency (CDF)
% of feeds with at least one entry whose most recent entry falls within the given age. 113,217 zero-entry feeds excluded from denominator.
Feed History Depth (CDF of Oldest Entry Age)
% of feeds with at least one entry whose oldest entry is at most the given age, i.e. how far back the feed's window reaches. A client polling every N days can miss entries in feeds whose oldest entry is less than N days old when the poll fires, because older entries may already have rotated out of the feed. 113,217 zero-entry feeds excluded from denominator.
Inferred Update Cadence (CDF)
% of feeds with enough dated entries whose average interval between entries is at most the given time. Cadence is inferred from the span between oldest and newest entry dates divided by entry count. 269,909 feeds lack enough dated entries to infer cadence.
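The cadence formula described above amounts to a single division. In the sketch below, `inferred_cadence` is a hypothetical helper, and the `min_dated` threshold of two entries is an assumption, since the report does not state how many dated entries count as "enough".

```python
from datetime import datetime


def inferred_cadence(entry_dates, min_dated=2):
    """Average interval: span between oldest and newest date / entry count."""
    if len(entry_dates) < min_dated:
        return None  # not enough dated entries to infer a cadence
    span = max(entry_dates) - min(entry_dates)
    # Divides by the entry count, as the description states
    # (not by the number of gaps between entries).
    return span / len(entry_dates)


dates = [datetime(2026, 1, 1), datetime(2026, 1, 11),
         datetime(2026, 1, 21), datetime(2026, 1, 31)]
print(inferred_cadence(dates))
```

For the four example dates the span is 30 days, giving an inferred cadence of 7.5 days even though the actual gaps are 10 days, so the measure runs slightly short for feeds with few entries.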
Entry Content Length
Distribution of entry body or summary text lengths, measured after parsing successfully parsed feeds. This helps distinguish feeds that carry full content from feeds that mostly expose short summaries, links, or empty entries.
Feed Link Relations
Feed-level Atom link relations, including Atom links embedded in RSS channels. Self/canonical means rel=self; hub covers WebSub/PubSubHubbub discovery; paging and archive cover feed navigation and archived-feed links. Parenthetical percentages use all successfully parsed feeds as the denominator.
Signal | All parsed feeds | Among 120,832 high-quality feeds
self/canonical URL 293,773 (55.0%) 67,874 (56.2%)
paging links 34,939 (6.5%) 4,240 (3.5%)
WebSub/PubSubHubbub hub 33,941 (6.4%) 7,818 (6.5%)
archive links 8 (0.0%) 1 (0.0%)
Top Feed Extensions
Non-core namespace elements encountered across successfully parsed feeds. Parenthetical percentages in the first count column use all successfully parsed feeds as the denominator. High-quality feeds have operational quality > 0.5.
Element | All parsed feeds | Among 120,832 high-quality feeds
dc:creator 142,073 (26.6%) 38,730 (32.1%)
content:encoded 59,448 (11.1%) 23,124 (19.1%)
sy:updatePeriod 50,606 (9.5%) 12,441 (10.3%)
sy:updateFrequency 50,529 (9.5%) 12,366 (10.2%)
dc:date 45,539 (8.5%) 13,092 (10.8%)
itunes:explicit 30,681 (5.7%) 14,266 (11.8%)
itunes:author 30,286 (5.7%) 14,124 (11.7%)
itunes:image 29,996 (5.6%) 14,013 (11.6%)
itunes:owner 29,466 (5.5%) 13,840 (11.5%)
itunes:name 28,616 (5.4%) 13,621 (11.3%)
itunes:category 28,268 (5.3%) 13,406 (11.1%)
itunes:duration 26,876 (5.0%) 13,165 (10.9%)
itunes:summary 26,867 (5.0%) 12,577 (10.4%)
itunes:email 22,625 (4.2%) 9,672 (8.0%)
itunes:type 22,511 (4.2%) 10,902 (9.0%)
Platform Fingerprints
Known feed generators or platform headers observed on parsed feeds. This is intentionally conservative; absent fingerprints mean “not identified,” not “custom-built.” Parenthetical percentages in the first count column use all successfully parsed feeds as the denominator. High-quality feeds have operational quality > 0.5; the final column shows the share of feeds in that fingerprint row that clear the threshold.
Fingerprint | All parsed feeds | Among 120,832 high-quality feeds | Quality within fingerprint
unknown 450,091 (84.3%) 104,636 (86.6%) 23.2%
wordpress 36,821 (6.9%) 8,142 (6.7%) 22.1%
drupal 22,830 (4.3%) 5,722 (4.7%) 25.1%
blogger 19,954 (3.7%) 1,177 (1.0%) 5.9%
joomla 3,988 (0.7%) 898 (0.7%) 22.5%
ghost 237 (0.0%) 76 (0.1%) 32.1%
substack 191 (0.0%) 152 (0.1%) 79.6%
squarespace 46 (0.0%) 17 (0.0%) 37.0%
medium 36 (0.0%) 11 (0.0%) 30.6%
feedburner 1 (0.0%) 1 (0.0%) 100.0%
Autodiscovered Feed Quality by Source Platform
Quality of successfully parsed feeds found through HTML autodiscovery, grouped by recognized platform hints on the source page. The unknown row covers autodiscovered feeds whose source page had no recognized platform fingerprint. Parenthetical percentages use parsed feeds in that source-platform row as the denominator.
Source platform | Autodiscovered parsed feeds | Quality > 0.5 | Mean quality
unknown 72,571 18,912/72,571 (26.1%) 0.247
drupal 10,721 2,734/10,721 (25.5%) 0.183
wordpress 9,258 3,376/9,258 (36.5%) 0.322
blogger 2,330 281/2,330 (12.1%) 0.224
joomla 636 317/636 (49.8%) 0.384
substack 180 162/180 (90.0%) 0.768
wix 49 21/49 (42.9%) 0.350
ghost 39 26/39 (66.7%) 0.553
shopify 24 15/24 (62.5%) 0.474
squarespace 10 9/10 (90.0%) 0.732
medium 1 1/1 (100.0%) 0.717