Web Feed Survey

Common Crawl Archive: CC-MAIN-2026-12

📅 Crawl ended 2026-03-17

⚠️ Filtered to top 500,000 entries (Tranco subdomain-inclusive, by registrable site)

1. The Crawl

Percentages describe this Common Crawl result set, not the entire Web. Metrics reflect what Common Crawl fetched, what sites allowed it to fetch, and the configured Tranco list/sample limits for this run. Site counts and TOP_N scoping use the Tranco subdomain-inclusive list, normalized to registrable sites with the Public Suffix List, including private suffixes for hosted sub-sites.
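To make the registrable-site normalization concrete, here is a minimal sketch. It assumes a tiny hard-coded suffix set (`TOY_SUFFIXES`, hypothetical) in place of the real Public Suffix List, which is far larger and also carries wildcard and exception rules that this sketch ignores. The function name is illustrative, not the survey's own code.

```python
# Toy stand-in for the Public Suffix List (assumption): two ICANN suffixes,
# one multi-label suffix, and two "private" suffixes for hosted sub-sites.
TOY_SUFFIXES = {"com", "org", "co.uk", "github.io", "blogspot.com"}


def registrable_site(host: str, suffixes=TOY_SUFFIXES) -> str:
    """Return the public suffix plus one label to its left."""
    labels = host.lower().rstrip(".").split(".")
    # Scanning from the left finds the longest matching suffix first.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            if i == 0:
                return candidate  # the host is itself a public suffix
            return ".".join(labels[i - 1:])
    # No known suffix: fall back to the last two labels.
    return ".".join(labels[-2:])
```

With private suffixes included, `blog.alice.github.io` and `alice.github.io` collapse to the same site (`alice.github.io`), while `news.example.com` collapses to `example.com`.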

Responses Processed

425,459,512

All responses after Tranco filtering; 418,120,515 had HTML, feed, XML, or sniffable Content-Type values and were analyzed further.

HTML Pages Processed

416,036,676

HTML/XHTML responses; used as the denominator for autodiscovery page rates.

Unique Sites

196,598

Distinct registrable sites among processed responses (HyperLogLog estimate).
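For readers unfamiliar with HyperLogLog, the sketch below shows how a fixed-size register array can estimate a distinct count. The precision (`p=12`), the SHA-1 based 64-bit hash, and the class name are assumptions for illustration; the survey does not state which implementation or parameters it used.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog with 2**p registers (illustrative only)."""

    def __init__(self, p: int = 12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        # Take 64 bits of a stable hash: first p bits pick a register,
        # the remaining bits determine the rank.
        h = int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")
        index = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        # Rank = number of leading zeros in the remaining bits, plus one.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def estimate(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        # Small-range correction: fall back to linear counting.
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)
        return raw
```

Because the result is an estimate, the Unique Sites figure should be read as approximate; with 4,096 registers the typical relative error is on the order of one to two percent.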

Feed URLs Checked

543,577

232,910 generic XML/text/plain/octet-stream responses sniffed as feeds; the rest had exact RSS/Atom media types.

Analyzed Response Content-Type Distribution

HTTP Content-Type header values among responses that passed the analysis prefilter. “Other XML” is XML content without an exact RSS/Atom media type, such as application/xml or text/xml. “Other Non-XML” is sniffable non-XML content such as text/plain or application/octet-stream. Either is only counted as a feed URL if sniffing finds RSS/Atom. Parenthetical Content-Type percentages use all analyzed responses as the denominator; sniffed outcome percentages use sniffed feeds as the denominator.
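The bucketing and sniffing described above can be sketched roughly as follows. The exact media-type lists and the sniffing rule (inspect the first element name in the first 2 KB) are assumptions; the survey only names the buckets and a few example types.

```python
import re
from typing import Optional

# First start tag in the document; "<?xml" and "<!--" do not match because
# the name must begin with a letter or underscore.
FIRST_TAG = re.compile(rb"<\s*([A-Za-z_][\w:.-]*)")


def content_type_bucket(header: str) -> str:
    """Map a Content-Type header value to one of the report's buckets."""
    media = header.split(";", 1)[0].strip().lower()
    if media in ("text/html", "application/xhtml+xml"):
        return "HTML"
    if media == "application/atom+xml":
        return "Atom"
    if media == "application/rss+xml":
        return "RSS"
    if media in ("application/xml", "text/xml") or media.endswith("+xml"):
        return "Other XML"
    if media in ("text/plain", "application/octet-stream"):
        return "Other Non-XML"
    return "Not analyzed"


def sniff_feed(body: bytes) -> Optional[str]:
    """Guess RSS or Atom from the root element, or None if neither."""
    match = FIRST_TAG.search(body[:2048])
    if not match:
        return None
    root = match.group(1).lower()
    if root == b"feed":
        return "Atom"
    if root in (b"rss", b"rdf:rdf"):
        return "RSS"
    return None
```

Under this scheme a `text/xml` response only becomes a feed URL when `sniff_feed` returns a format; otherwise it stays in the Other XML bucket without being counted as a feed.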

HTML

416,036,676

99.50% of total

Non-HTML types (bars scaled relative to each other):

Atom

113,060 (0.0270%)

RSS

198,747 (0.0475%)

Other XML

966,142 (0.2311%)

Other Non-XML

805,890 (0.1927%)

Sniffed feed outcomes (bars scaled relative to sniffed feeds):

RSS (sniffed)

212,171 (92.1%)

Atom (sniffed)

18,253 (7.9%)

Other parsed (sniffed)

0 (0.0%)

2. Feed Autodiscovery

Autodiscovery Coverage

Pages with feed links

81,943,367

19.70% of analyzed HTML responses

Sites with feed links

70,636

35.93% of analyzed registrable sites

Sites with feed links is the autodiscovery subset: unique registrable sites exposing at least one RSS/Atom link in HTML.

Discovery link relations

Pages with feed rel="alternate" 81,940,235

Pages with feed rel="feed" 12,796

Pages using both relations 9,664

Pages with a multi-rel link 9,546

Multi-rel links 16,977

These counts include only RSS/Atom autodiscovery links: HTML <link> elements whose type is RSS, Atom, or RDF feed XML. Other uses of rel="alternate" are not counted. Pages using both relations includes pages with separate alternate and feed links, plus pages with a single link whose rel contains both. Multi-rel links counts those single link elements.
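As an illustration of the counting rules above, the following sketch extracts feed autodiscovery links from a page with Python's standard `html.parser`. The set of accepted media types (`FEED_TYPES`) and the helper names are assumptions; the survey's own extractor may differ.

```python
from html.parser import HTMLParser

# Assumed media types for "RSS, Atom, or RDF feed XML".
FEED_TYPES = {"application/rss+xml", "application/atom+xml", "application/rdf+xml"}
FEED_RELS = {"alternate", "feed"}


class FeedLinkFinder(HTMLParser):
    """Collect <link> elements that qualify as feed autodiscovery links."""

    def __init__(self):
        super().__init__()
        self.links = []  # list of (href, matching rel tokens)

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rels = set(attr.get("rel", "").lower().split()) & FEED_RELS
        media = attr.get("type", "").split(";", 1)[0].strip().lower()
        # Other uses of rel="alternate" (e.g. hreflang) fail the type check.
        if rels and media in FEED_TYPES:
            self.links.append((attr.get("href", ""), rels))


def summarize(html: str) -> dict:
    finder = FeedLinkFinder()
    finder.feed(html)
    has_alternate = any("alternate" in rels for _, rels in finder.links)
    has_feed = any("feed" in rels for _, rels in finder.links)
    return {
        "links": len(finder.links),
        "has_alternate": has_alternate,
        "has_feed": has_feed,
        "uses_both": has_alternate and has_feed,
        "multi_rel_links": sum(1 for _, rels in finder.links if rels == FEED_RELS),
    }
```

A page with one `rel="alternate"` link and one `rel="alternate feed"` link would count once under each relation, once under "using both", and contribute one multi-rel link.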

Autodiscovery Links per Page

Unique feed URLs per HTML page with at least one feed link. Analyzed HTML responses with zero feed links omitted: 334,093,309.

Autodiscovery Links per Site

Unique feed URLs per registrable site. Sites with zero links omitted: 125,962.

Pages with Duplicate Feeds

127

0.5% of 24,317 multi-feed pages

Pages that link to redundant format variants of the same feed (same internal title and link after light normalization).

Coincident formats	Pages
rss2.0 and rss2.0	181
atom10 and rss2.0	100
atom10 and rss10	13
rss10 and rss2.0	8
atom10 and atom10	5
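A rough sketch of how coincident formats might be detected on one page follows. The "light normalization" here (collapse whitespace, case-fold the title, ignore scheme and trailing slash in the link) is an assumption; the survey does not spell out its rules, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from urllib.parse import urlsplit


def feed_key(title: str, link: str) -> tuple:
    """Identity of a feed after light normalization (assumed rules)."""
    norm_title = " ".join(title.split()).casefold()
    parts = urlsplit(link.strip())
    return (norm_title, parts.netloc.lower() + parts.path.rstrip("/"))


def coincident_formats(feeds: list) -> list:
    """feeds: dicts with 'format', 'title', 'link' for the feeds linked from one page."""
    groups = defaultdict(list)
    for feed in feeds:
        groups[feed_key(feed["title"], feed["link"])].append(feed["format"])
    pairs = []
    for formats in groups.values():
        # Every pair of feeds sharing a key is a redundant variant.
        for a, b in combinations(sorted(formats), 2):
            pairs.append(f"{a} and {b}")
    return pairs
```

Pairing by identity rather than by format also explains rows such as "rss2.0 and rss2.0": two different URLs can serve the same feed in the same format.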

Quality: Autodiscovered vs. Not

Distribution of quality scores among successfully parsed feeds (% of each group per decile). With autodiscovery: 94,824 feeds, mean 0.246. Without: 439,371 feeds, mean 0.220.

Autodiscovery Usage by HTML Platform

Known page-side platform hints among analyzed HTML responses, plus an unknown row for pages with no recognized fingerprint. Parenthetical percentages use HTML pages in that row as the denominator and show how often those pages expose RSS/Atom links. Detection is conservative and based on generator metadata plus common asset markers.

Fingerprint	HTML pages	With autodiscovery
unknown	353,974,158	53,145,055 (15.0%)
wordpress	45,450,449	26,802,369 (59.0%)
drupal	13,190,628	758,470 (5.8%)
shopify	1,097,732	99,423 (9.1%)
joomla	930,240	203,428 (21.9%)
substack	490,567	389,821 (79.5%)
wix	430,723	82,752 (19.2%)
blogger	335,807	305,603 (91.0%)
ghost	229,434	218,332 (95.2%)
squarespace	20,798	1,017 (4.9%)
medium	2,994	41 (1.4%)

3. Feeds

Successfully Parsed Feeds

534,195

Percentages use feed URLs checked as the denominator for parse results, and successfully parsed feeds as the denominator for autodiscovery rows.

Parse success rate 98.3%

Broken/unparseable 9,382 (1.7%)

With autodiscovery link 94,824 (17.8%)

Without autodiscovery link 439,371 (82.2%)

Sites with Feeds Found

13,796

Unique registrable sites that host at least one successfully parsed feed URL.

Share of analyzed sites 7.02%

Analyzed sites 196,598

Feeds with Entry Content

413,346

Coverage rate 77.4%

Full content 127,484 (23.9%)

Summary only 285,862 (53.5%)

Neither 120,849 (22.6%)

Denominator: successfully parsed feeds.

Entries Found

7,996,206

Percentages use successfully parsed feeds as the denominator, except the dated-entries row, which uses parsed feeds with entries.

Mean entries per parsed feed 15.0

Parsed feeds with entries 420,978 (78.8%)

Feeds with dated entries 395,919 (94.0%)

Feeds with feed-level dates 372,724 (69.8%)

Feeds with repeated entry titles 54,426

Feeds with default entry titles 95

Feeds with repeated entry links 63,440

Default entry titles are generic placeholder-like values such as “Default Title”.

Feed Parse Errors

9,382

1.7% of 543,577 feed URLs checked

Error-row percentages use total parse errors as the denominator.

Error	Count	Share of parse errors
XML declaration allowed only at the start of the document	4,258	45.4%
EntityRef: expecting ';'	1,137	12.1%
CData section not finished	794	8.5%
no element found (line 0)	364	3.9%
xmlParseEntityRef: no name	327	3.5%
Unknown root tag: div	286	3.0%
Invalid bytes in character encoding	188	2.0%
Unknown root tag: quakeml	177	1.9%
Opening and ending tag mismatch: title line 4 and feed	139	1.5%
Extra content at the end of the document	104	1.1%
Blank needed here	100	1.1%
Not XML	97	1.0%
Namespace prefix owl on sameAs is not defined	77	0.8%
Start tag expected	67	0.7%
Malformed declaration expecting version	64	0.7%
Unknown root tag: br	63	0.7%
Namespace prefix rdf for resource on license is not defined	52	0.6%
Namespace prefix sn on type is not defined	46	0.5%
Unknown root tag: entry	45	0.5%
Unknown root tag: style	45	0.5%

Feed Availability and Freshness

Step-down view from every feed URL checked to parsed feeds with entries that appear fresh. Fresh means the newest entry date, or feed-level updated date when no entry date exists, is within 365 days of the crawl response time. Funnel percentages use feed URLs checked as the denominator.

Feed URLs checked

543,577

Parsed RSS/Atom

534,195

Freshness signal within cutoff

203,483

Active with entries

141,380
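The freshness rule behind this funnel can be sketched as below. The function and parameter names are hypothetical, and the handling of future-dated entries (treated as fresh here) is an assumption the survey does not address.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Sequence


def is_fresh(
    entry_dates: Sequence[datetime],
    feed_updated: Optional[datetime],
    fetched_at: datetime,
    cutoff_days: int = 365,
) -> bool:
    """True when the freshness signal is within the cutoff of the crawl response time."""
    # Prefer the newest entry date; use the feed-level date only when
    # no entry carries a date.
    signal = max(entry_dates) if entry_dates else feed_updated
    if signal is None:
        return False  # undated feed: no freshness signal at all
    return fetched_at - signal <= timedelta(days=cutoff_days)
```

Note that a feed whose entries are all stale is not rescued by a recent feed-level date, because the entry dates take precedence whenever they exist.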

Operational Quality

0.225

mean score, 0-1, across successfully parsed feeds

Parenthetical percentages use successfully parsed feeds as the denominator.

Quality > 0.5 120,832 (22.6%)

Freshness signal within cutoff 203,483 (38.1%)

Mean among those feeds 0.590

Active with entries 141,380 (26.5%)

Undated or stale 330,712

Feed Quality Distribution

Operational, non-editorial score 0–1 among successfully parsed feeds. In split tables, “quality > 0.5” means a feed has a usable freshness signal and enough basic entry/feed metadata to look usable; lower-scoring feeds remain in the all-feeds columns so abandoned or sparse feeds still count. Feeds with no usable date, or no freshness signal within 365 days, score 0 here; the Operational Quality card above also shows the mean with those feeds excluded. Excluded from the freshness-filtered mean: 57,319 undated and 273,393 stale feeds.
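The relationship between the two means can be checked with a few lines of arithmetic. This sketch uses only figures reported on this page; because the overall mean is rounded to three decimals, the result only approximately reproduces the reported freshness-filtered mean.

```python
parsed_feeds = 534_195
undated_feeds = 57_319
stale_feeds = 273_393
overall_mean = 0.225  # rounded value from the Operational Quality card

# Undated and stale feeds score 0, so they add nothing to the score total.
fresh_feeds = parsed_feeds - undated_feeds - stale_feeds
total_score = overall_mean * parsed_feeds
fresh_mean = total_score / fresh_feeds

print(f"fresh feeds: {fresh_feeds:,}")
print(f"mean among fresh feeds: {fresh_mean:.3f}")
```

The computed feed count matches the "Freshness signal within cutoff" figure exactly, and the computed mean lands close to the reported 0.590.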

Quality Score Components

Mean component scores across successfully parsed feeds. Bar labels show each component's weight in the composite score. Repeated/default-looking entry titles and repeated entry links reduce the entry metadata component and can cap the final score when severe.

Quality by Format

% of feeds in each quality tier per format, ordered by mean score (↓). Dot = mean score. Normalised for population size so rare formats compare fairly against common ones.

Feed Formats

Successfully parsed feeds only. RSS-family feeds: 412,014; Atom feeds: 122,181. Parenthetical percentages use feeds in that format as the denominator and show the share with operational quality > 0.5.

Format	Count	Quality > 0.5	Mean quality
rss1.0	2	2/2 (100.0%)	0.732
rss0.93	1	1/1 (100.0%)	0.682
rss	5	5/5 (100.0%)	0.622
rss0.92	3,224	1,220/3,224 (37.8%)	0.287
rss2.0	378,287	92,687/378,287 (24.5%)	0.230
rss10	28,256	6,608/28,256 (23.4%)	0.217
atom10	122,181	19,940/122,181 (16.3%)	0.209
rss0.91	2,217	368/2,217 (16.6%)	0.150
rss2.00	18	1/18 (5.6%)	0.052
rss.92	4	0/4 (0.0%)	0.000

Charset per Format

Source: HTTP Content-Type header only.

Format	Charset	Count
rss2.0	`utf-8`	316,850
	`unknown`	54,486
	`iso-8859-1`	4,203
	`windows-1251`	1,267
	`iso-8859-15`	914
	`windows-1252`	231
	`utf8`	152
	`windows-1256`	69
	`$charset`	36
	`iso-8859-9`	18
	`iso-8859-2`	16
	`gb2312`	16
	`euc-kr`	12
	`windows-1250`	9
	`big5`	3
	`windows1251`	2
	`gbk`	1
	`uft-8`	1
	`latin2`	1
atom10	`utf-8`	89,909
	`unknown`	31,945
	`iso-8859-1`	315
	`iso-8859-15`	12
rss10	`utf-8`	20,106
	`unknown`	7,960
	`iso-8859-1`	161
	`iso-8859-15`	13
	`euc-jp`	9
	`utf8`	3
	`gb2312`	1
	`windows-1251`	1
	`windows-1250`	1
	`gbk`	1
rss0.91	`iso-8859-2`	787
	`unknown`	764
	`utf-8`	629
	`koi8-r`	31
	`gb2312`	3
	`iso-8859-1`	2
	`windows-1250`	1
rss0.92	`utf-8`	2,501
	`iso-8859-1`	714
	`unknown`	6
	`windows-1251`	2
	`iso-8859-15`	1
rss2.00	`unknown`	18
rss.92	`utf-8`	4
rss	`unknown`	3
	`utf-8`	2
rss1.0	`utf-8`	1
	`unknown`	1
rss0.93	`utf-8`	1
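Because the charset column comes from the HTTP header alone, labels such as `utf8`, `uft-8`, or `$charset` appear verbatim. The sketch below shows one way to read the parameter and, separately, to test whether a label is a codec Python recognizes. The helper names are hypothetical, and using Python's codec registry as the reference is an assumption, not the survey's method.

```python
import codecs
from email.message import Message
from typing import Optional


def header_charset(content_type: Optional[str]) -> str:
    """Return the lowercased charset parameter, or 'unknown' if absent."""
    if not content_type:
        return "unknown"
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset")
    if isinstance(charset, str) and charset:
        return charset.lower()
    return "unknown"


def canonical_charset(label: str) -> Optional[str]:
    """Canonical codec name for a label, or None if Python does not know it."""
    try:
        return codecs.lookup(label).name
    except LookupError:
        return None
```

A label like `utf8` resolves to the same codec as `utf-8`, whereas a typo such as `uft-8` resolves to nothing, which is why both show up as separate rows when labels are counted as sent.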

Entry Content Profile

Content type classification based on Atom type attribute and RSS element semantics. Parenthetical percentages in the first count column use all successfully parsed feeds as the denominator. High-quality feeds have operational quality > 0.5.

Profile	All parsed feeds	Among 120,832 high-quality feeds
html	210,747 (39.5%)	58,025 (48.0%)
unknown	120,849 (22.6%)	1,401 (1.2%)
plain	119,857 (22.4%)	38,246 (31.7%)
mixed	77,791 (14.6%)	22,759 (18.8%)
xhtml	4,951 (0.9%)	401 (0.3%)

Entry Counts per Feed

Feed Languages

Top feed language declarations across successfully parsed feeds. Parenthetical percentages in the first count column use all successfully parsed feeds as the denominator. High-quality feeds have operational quality > 0.5.

Language	All parsed feeds	Among 120,832 high-quality feeds
unknown	250,463 (46.9%)	32,683 (27.0%)
en	83,341 (15.6%)	22,163 (18.3%)
en-us	77,634 (14.5%)	25,184 (20.8%)
ja	13,340 (2.5%)	6,038 (5.0%)
fr	11,981 (2.2%)	3,488 (2.9%)
es	9,764 (1.8%)	3,208 (2.7%)
de	8,528 (1.6%)	3,051 (2.5%)
ru	6,420 (1.2%)	1,841 (1.5%)
fr-fr	5,979 (1.1%)	1,946 (1.6%)
en-gb	5,843 (1.1%)	2,214 (1.8%)
pt-br	4,175 (0.8%)	1,318 (1.1%)
de-de	3,821 (0.7%)	1,927 (1.6%)
pt	2,801 (0.5%)	795 (0.7%)
cs	2,740 (0.5%)	831 (0.7%)
utf-8	2,543 (0.5%)	5 (0.0%)
it-it	2,469 (0.5%)	686 (0.6%)
ja-jp	2,335 (0.4%)	1,250 (1.0%)
zh-tw	2,328 (0.4%)	502 (0.4%)
es-es	2,066 (0.4%)	672 (0.6%)
zh-cn	1,634 (0.3%)	323 (0.3%)

Language Tagging

Feed-internal language signals are read from xml:lang attributes, RSS <language>, Dublin Core dc:language, and Atom link hreflang. Counts are successfully parsed feeds; categories can overlap except the no-language row. HTTP/feed mismatches are only counted when an HTTP language conflicts with a feed-level language.

No language information 249,759 (46.8%)

HTTP Content-Language 72,201 (13.5%)

Feed-level language 269,430 (50.4%)

Entry-level language 5,698 (1.1%)

Both HTTP and feed-level language 59,294 (11.1%)

Mismatching HTTP and feed-level language 6,806 (1.3%)

Multiple entry languages 5,645 (1.1%)

Uses hreflang 1,343 (0.3%)

Multiple entry languages includes direct entry tags plus feed/HTTP languages that untagged entries inherit.
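One plausible way to decide whether an HTTP language conflicts with a feed-level language is to compare primary subtags, as sketched here. This comparison rule is an assumption; the survey does not say whether, for example, `en` versus `en-US` counts as a mismatch.

```python
from typing import Optional


def primary_subtag(tag: str) -> str:
    """'en-US' -> 'en'; tolerant of case and underscores."""
    return tag.strip().lower().replace("_", "-").split("-")[0]


def languages_conflict(http_language: Optional[str], feed_language: Optional[str]) -> bool:
    """True only when both signals exist and share no primary subtag."""
    if not http_language or not feed_language:
        return False  # a missing signal cannot conflict
    # Content-Language may list several tags separated by commas.
    http_tags = {primary_subtag(t) for t in http_language.split(",") if t.strip()}
    return primary_subtag(feed_language) not in http_tags
```

Under this rule, a feed declaring `en-us` served with `Content-Language: de` is a mismatch, while `en` versus `en-US` is not.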

Feed Recency (CDF)

% of feeds with a last-update date whose update falls within the given age.

Newest Entry Recency (CDF)

% of feeds with at least one entry whose most recent entry falls within the given age. 113,217 zero-entry feeds excluded from denominator.

Feed History Depth (CDF of Oldest Entry Age)

% of feeds with at least one entry whose oldest entry is at most the given age, i.e. how far back the feed's window reaches. A client polling every N days can miss entries in feeds whose oldest entry is less than N days old when the poll fires, because older entries may already have rotated out of the feed. 113,217 zero-entry feeds excluded from denominator.
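Read this way, the CDF value at N days is the share of feeds where a client polling every N days is at risk of missing entries. A minimal sketch, with a hypothetical function name and ages expressed in days:

```python
from typing import Iterable


def share_at_risk(oldest_entry_ages_days: Iterable[float], poll_interval_days: float) -> float:
    """Percent of feeds whose window is shorter than the polling interval."""
    ages = list(oldest_entry_ages_days)
    if not ages:
        return 0.0
    # A feed is at risk when its oldest visible entry is newer than the
    # previous poll, so anything older may already have rotated out.
    at_risk = sum(1 for age in ages if age < poll_interval_days)
    return 100.0 * at_risk / len(ages)
```

Being at risk does not mean entries were actually lost; a slow-moving feed with only recent entries may simply be new.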

Inferred Update Cadence (CDF)

% of feeds with enough dated entries whose average interval between entries is at most the given time. Cadence is inferred from the span between oldest and newest entry dates divided by entry count. 269,909 feeds lack enough dated entries to infer cadence.
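The cadence inference described above can be sketched as follows. The minimum of two dated entries is an assumption, since the text does not state how many dated entries count as "enough"; the function name is hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence


def inferred_cadence(entry_dates: Sequence[datetime], min_dated: int = 2) -> Optional[timedelta]:
    """Average interval: span between oldest and newest entry divided by entry count."""
    if len(entry_dates) < min_dated:
        return None  # not enough dated entries to infer a cadence
    span = max(entry_dates) - min(entry_dates)
    return span / len(entry_dates)
```

Dividing by the entry count, as the description states, gives a slightly shorter interval than dividing by the number of gaps between entries (count minus one); the difference matters mostly for feeds with very few entries.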

Entry Content Length

Distribution of entry body or summary text lengths, measured after parsing successfully parsed feeds. This helps distinguish feeds that carry full content from feeds that mostly expose short summaries, links, or empty entries.

Feed Link Relations

Feed-level Atom link relations, including Atom links embedded in RSS channels. Self/canonical means rel=self; hub covers WebSub/PubSubHubbub discovery; paging and archive cover feed navigation and archived-feed links. Parenthetical percentages in the first count column use all successfully parsed feeds as the denominator; the second column uses high-quality feeds (operational quality > 0.5).

Signal	All parsed feeds	Among 120,832 high-quality feeds
self/canonical URL	293,773 (55.0%)	67,874 (56.2%)
paging links	34,939 (6.5%)	4,240 (3.5%)
WebSub/PubSubHubbub hub	33,941 (6.4%)	7,818 (6.5%)
archive links	8 (0.0%)	1 (0.0%)
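A sketch of how these signals could be read from a feed with the standard library is shown below. The grouping of `rel` values into paging and archive follows the link relations defined for paged and archived feeds in RFC 5005 (plus the common `prev` variant); whether the survey uses exactly these sets is an assumption, and the sketch only handles Atom feeds and non-namespaced RSS channels.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
PAGING_RELS = {"first", "last", "previous", "prev", "next"}
ARCHIVE_RELS = {"prev-archive", "next-archive", "current"}


def feed_link_signals(xml_text: str) -> set:
    """Return the feed-level link signals present: self, hub, paging, archive."""
    root = ET.fromstring(xml_text)
    # Feed-level container: <feed> for Atom, <channel> for RSS 0.9x/2.0.
    container = root if root.tag == ATOM + "feed" else root.find("channel")
    signals = set()
    if container is None:
        return signals
    for link in container.findall(ATOM + "link"):
        rel = link.get("rel", "alternate").lower()
        if rel == "self":
            signals.add("self")
        elif rel == "hub":
            signals.add("hub")
        elif rel in PAGING_RELS:
            signals.add("paging")
        elif rel in ARCHIVE_RELS:
            signals.add("archive")
    return signals
```

Only direct children of the feed or channel are inspected, so entry-level links do not affect these counts.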

Top Feed Extensions

Non-core namespace elements encountered across successfully parsed feeds. Parenthetical percentages in the first count column use all successfully parsed feeds as the denominator. High-quality feeds have operational quality > 0.5.

Element	All parsed feeds	Among 120,832 high-quality feeds
`dc:creator`	142,073 (26.6%)	38,730 (32.1%)
`content:encoded`	59,448 (11.1%)	23,124 (19.1%)
`sy:updatePeriod`	50,606 (9.5%)	12,441 (10.3%)
`sy:updateFrequency`	50,529 (9.5%)	12,366 (10.2%)
`dc:date`	45,539 (8.5%)	13,092 (10.8%)
`itunes:explicit`	30,681 (5.7%)	14,266 (11.8%)
`itunes:author`	30,286 (5.7%)	14,124 (11.7%)
`itunes:image`	29,996 (5.6%)	14,013 (11.6%)
`itunes:owner`	29,466 (5.5%)	13,840 (11.5%)
`itunes:name`	28,616 (5.4%)	13,621 (11.3%)
`itunes:category`	28,268 (5.3%)	13,406 (11.1%)
`itunes:duration`	26,876 (5.0%)	13,165 (10.9%)
`itunes:summary`	26,867 (5.0%)	12,577 (10.4%)
`itunes:email`	22,625 (4.2%)	9,672 (8.0%)
`itunes:type`	22,511 (4.2%)	10,902 (9.0%)

Platform Fingerprints

Known feed generators or platform headers observed on parsed feeds. This is intentionally conservative; absent fingerprints mean “not identified,” not “custom-built.” Parenthetical percentages in the first count column use all successfully parsed feeds as the denominator. High-quality feeds have operational quality > 0.5; the final column shows the share of feeds in that fingerprint row that clear the threshold.

Fingerprint	All parsed feeds	Among 120,832 high-quality feeds	Quality within fingerprint
unknown	450,091 (84.3%)	104,636 (86.6%)	23.2%
wordpress	36,821 (6.9%)	8,142 (6.7%)	22.1%
drupal	22,830 (4.3%)	5,722 (4.7%)	25.1%
blogger	19,954 (3.7%)	1,177 (1.0%)	5.9%
joomla	3,988 (0.7%)	898 (0.7%)	22.5%
ghost	237 (0.0%)	76 (0.1%)	32.1%
substack	191 (0.0%)	152 (0.1%)	79.6%
squarespace	46 (0.0%)	17 (0.0%)	37.0%
medium	36 (0.0%)	11 (0.0%)	30.6%
feedburner	1 (0.0%)	1 (0.0%)	100.0%

Autodiscovered Feed Quality by Source Platform

Quality of successfully parsed feeds found through HTML autodiscovery, grouped by recognized platform hints on the source page. The unknown row covers autodiscovered feeds whose source page had no recognized platform fingerprint. Parenthetical percentages use parsed feeds in that source-platform row as the denominator.

Source platform	Autodiscovered parsed feeds	Quality > 0.5	Mean quality
unknown	72,571	18,912/72,571 (26.1%)	0.247
drupal	10,721	2,734/10,721 (25.5%)	0.183
wordpress	9,258	3,376/9,258 (36.5%)	0.322
blogger	2,330	281/2,330 (12.1%)	0.224
joomla	636	317/636 (49.8%)	0.384
substack	180	162/180 (90.0%)	0.768
wix	49	21/49 (42.9%)	0.350
ghost	39	26/39 (66.7%)	0.553
shopify	24	15/24 (62.5%)	0.474
squarespace	10	9/10 (90.0%)	0.732
medium	1	1/1 (100.0%)	0.717