Web Feed Survey

Common Crawl Archive: CC-MAIN-2026-12

📅 Crawl ended 2026-03-17

⚠️ Filtered to top 500,000 entries (Tranco subdomain-inclusive, by registrable site)

1. The Crawl

Percentages describe this Common Crawl result set, not the entire Web. Metrics reflect what Common Crawl fetched, what sites allowed it to fetch, and the configured Tranco list/sample limits for this run. Site counts and TOP_N scoping use the Tranco subdomain-inclusive list, normalized to registrable sites with the Public Suffix List, including private suffixes for hosted sub-sites.
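A minimal sketch of the registrable-site normalization described above. The four-entry suffix set is a hypothetical stand-in for the full Public Suffix List, and the function name is illustrative; real PSL wildcard and exception rules are not handled here.

```python
# Stand-in for the Public Suffix List. A real run would load the full list,
# including its private section; these four entries are illustrative only.
SUFFIXES = {"com", "co.uk", "github.io", "blogspot.com"}


def registrable_site(host: str) -> str:
    """Return the matching public suffix plus one label, or the host if no rule matches."""
    labels = host.lower().rstrip(".").split(".")
    # Candidates run from longest to shortest, so "github.io" wins over "io"
    # and "co.uk" wins over "uk".
    for i in range(len(labels)):
        if ".".join(labels[i:]) in SUFFIXES:
            return ".".join(labels[max(i - 1, 0):])
    return ".".join(labels)


print(registrable_site("www.example.com"))
print(registrable_site("a.b.example.co.uk"))
print(registrable_site("blog.user.github.io"))
```

Treating a private suffix such as github.io as a suffix is what lets each hosted sub-site count as its own site rather than being folded into the hosting provider.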

Responses Processed

425,471,592

All responses after Tranco filtering; 418,132,532 had HTML, feed, XML, or sniffable content types and were analyzed further.

HTML Pages Processed

416,048,595

HTML/XHTML responses; used as the denominator for autodiscovery page rates.

Unique Sites

196,598

Distinct registrable sites among processed responses (HyperLogLog estimate).
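For readers unfamiliar with HyperLogLog, here is a small sketch of how such an estimate can be produced. The precision (p = 12), the SHA-1 hash, and the class name are assumptions for illustration; the report does not state which parameters or implementation this run used.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog counter with 2**p registers (p = 12 is an assumption)."""

    def __init__(self, p: int = 12) -> None:
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        # Take 64 bits of a stable hash of the item.
        h = int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")
        index = h >> (64 - self.p)                 # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def estimate(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        # Small-range correction: fall back to linear counting.
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)
        return raw
```

With 4,096 registers the typical relative error is on the order of 1 to 2 percent, which is why the site count above is labeled an estimate.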

Feed URLs Checked

311,382

No generic XML, text/plain, or application/octet-stream responses sniffed as feeds; every feed URL checked had an exact RSS/Atom media type.

Analyzed Response Content-Type Distribution

HTTP Content-Type header values among responses that passed the analysis prefilter. “Other XML” is XML content without an exact RSS/Atom media type, such as application/xml or text/xml. “Other Non-XML” is sniffable non-XML content such as text/plain or application/octet-stream. Either is only counted as a feed URL if sniffing finds RSS/Atom. Parenthetical percentages use all analyzed responses as the denominator.

HTML

416,048,595

99.50% of total

Non-HTML types:

Atom

113,064 (0.0270%)

RSS

198,759 (0.0475%)

Other XML

966,177 (0.2311%)

Other Non-XML

805,937 (0.1927%)
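The bucketing and sniffing rules described in this section could be sketched roughly as follows. The exact media-type lists and the root-element test are assumptions; the report names the buckets but does not publish the matching rules.

```python
import io
import xml.etree.ElementTree as ET


def bucket(content_type: str) -> str:
    """Map an HTTP Content-Type header value to one of the report's buckets."""
    media = content_type.split(";", 1)[0].strip().lower()
    if media in ("text/html", "application/xhtml+xml"):
        return "HTML"
    if media == "application/atom+xml":
        return "Atom"
    if media == "application/rss+xml":
        return "RSS"
    if media in ("application/xml", "text/xml") or media.endswith("+xml"):
        return "Other XML"
    if media in ("text/plain", "application/octet-stream"):
        return "Other Non-XML"
    return "Not analyzed"


def sniffs_as_feed(body: bytes) -> bool:
    """True if the body parses far enough to show an RSS, Atom, or RDF root element."""
    try:
        for _event, elem in ET.iterparse(io.BytesIO(body), events=("start",)):
            local_name = elem.tag.rsplit("}", 1)[-1]  # drop any namespace
            return local_name in {"rss", "feed", "RDF"}
    except ET.ParseError:
        return False
    return False
```

Under this sketch, an "Other XML" or "Other Non-XML" response would only be counted as a feed URL when sniffs_as_feed returns True.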

2. Feed Autodiscovery

Autodiscovery Coverage

Pages with feed links

81,945,315

19.70% of analyzed HTML responses

Sites with feed links

70,633

35.93% of analyzed registrable sites

The site count is an estimated lower bound, derived from the sampled source sites retained for each feed URL.

Discovery link relations

Pages with feed rel="alternate" 81,942,184

Pages with feed rel="feed" 12,793

Pages using both relations 9,662

Pages with a multi-rel link 9,544

Multi-rel links 16,970

These counts include only RSS/Atom autodiscovery links: HTML <link> elements whose type is RSS, Atom, or RDF feed XML. Other uses of rel="alternate" are not counted. Pages using both relations includes pages with separate alternate and feed links, plus pages with a single link whose rel contains both. Multi-rel links counts those single link elements.
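The counting rules above can be illustrated with Python's standard HTML parser. The class and function names are hypothetical, and the three media types are an assumption about what "RSS, Atom, or RDF feed XML" covers.

```python
from html.parser import HTMLParser

# Assumed media types for "RSS, Atom, or RDF feed XML".
FEED_TYPES = {"application/rss+xml", "application/atom+xml", "application/rdf+xml"}


class FeedLinkFinder(HTMLParser):
    """Collects <link> elements that qualify as RSS/Atom autodiscovery links."""

    def __init__(self) -> None:
        super().__init__()
        self.links = []  # list of (href, set of matching rel tokens)

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rels = set(attr.get("rel", "").lower().split()) & {"alternate", "feed"}
        media = attr.get("type", "").split(";")[0].strip().lower()
        # Other uses of rel="alternate" (no feed media type) are ignored.
        if rels and media in FEED_TYPES and attr.get("href"):
            self.links.append((attr["href"], rels))


def summarize(html: str) -> dict:
    finder = FeedLinkFinder()
    finder.feed(html)
    has_alternate = any("alternate" in rels for _, rels in finder.links)
    has_feed = any("feed" in rels for _, rels in finder.links)
    return {
        "alternate": has_alternate,
        "feed": has_feed,
        "both": has_alternate and has_feed,
        "multi_rel_links": sum(1 for _, rels in finder.links if len(rels) == 2),
    }
```

A page with a single link carrying rel="alternate feed" therefore counts toward all four rows, while a page with two separate links counts toward "both" but contributes no multi-rel links.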

Autodiscovery Links per Page

Unique feed URLs per discovered HTML page (analyzed HTML responses with zero feed links omitted: 334,103,280).

Autodiscovery Links per Site

Unique feed URLs per registrable site. Based on sampled source sites retained for each discovered feed URL. Sites with zero links omitted: 125,965.

Pages with Duplicate Feeds

0.0% of 25,079 multi-feed pages

Pages that link to redundant format variants of the same feed (same internal title and link after light normalization).

Coincident formats	Pages
atom10 and rss2.0	4
rss2.0 and rss2.0	2
atom10 and atom10	1
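One way the "same internal title and link after light normalization" test might work is sketched below. The specific normalization steps (whitespace collapsing, case folding, ignoring scheme, query, and trailing slash) are assumptions; the report does not define them.

```python
import re
from urllib.parse import urlsplit


def feed_identity(title: str, link: str) -> tuple:
    """Build a comparison key from a feed's internal title and link (assumed rules)."""
    norm_title = re.sub(r"\s+", " ", title).strip().casefold()
    parts = urlsplit(link.strip())
    # Ignore scheme and query; compare host and path without a trailing slash.
    norm_link = parts.netloc.lower() + parts.path.rstrip("/")
    return norm_title, norm_link


def has_duplicate_feeds(feeds: list) -> bool:
    """feeds holds one (format, title, link) tuple per feed linked from a page."""
    seen = set()
    for _format, title, link in feeds:
        key = feed_identity(title, link)
        if key in seen:
            return True
        seen.add(key)
    return False
```

Because the format is not part of the key, an atom10 and an rss2.0 variant of the same feed collapse to one identity, which matches the "coincident formats" rows in the table above.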

Quality: Autodiscovered vs. Not

Distribution of quality scores among successfully parsed feeds (% of each group per decile). With autodiscovery: 53,096 feeds, mean 0.251. Without: 250,694 feeds, mean 0.179.
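As a consistency check, the two group means can be combined into an overall mean and compared with the 0.191 reported under Operational Quality. The variable names are mine; the figures are the ones quoted above.

```python
with_n, with_mean = 53_096, 0.251
without_n, without_mean = 250_694, 0.179

# Weighted mean of the two groups, weighted by feed count.
overall = (with_n * with_mean + without_n * without_mean) / (with_n + without_n)
print(f"{overall:.4f}")
```

The result is about 0.1916. Since both group means are themselves rounded to three decimals, this agrees with the reported 0.191 to within rounding.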

Autodiscovery Usage by HTML Platform

Known page-side platform hints among analyzed HTML responses, plus an unknown row for pages with no recognized fingerprint. Parenthetical percentages use HTML pages in that row as the denominator and show how often those pages expose RSS/Atom links. Detection is conservative and based on generator metadata plus common asset markers.

Fingerprint	HTML pages	With autodiscovery
unknown	353,984,603	53,146,358 (15.0%)
wordpress	45,451,537	26,802,985 (59.0%)
drupal	13,190,932	758,482 (5.8%)
shopify	1,097,730	99,413 (9.1%)
joomla	930,272	203,431 (21.9%)
substack	490,578	389,821 (79.5%)
wix	430,754	82,764 (19.2%)
blogger	335,819	305,604 (91.0%)
ghost	229,444	218,342 (95.2%)
squarespace	20,792	1,017 (4.9%)
medium	2,995	41 (1.4%)

3. Feeds

Successfully Parsed Feeds

303,790

Percentages use feed URLs checked as the denominator for parse results, and successfully parsed feeds as the denominator for autodiscovery rows.

Parse success rate 97.6%

Broken/unparseable 7,592 (2.4%)

With autodiscovery link 53,096 (17.5%)

Without autodiscovery link 250,694 (82.5%)

Feeds with Entry Content

237,123

Coverage rate 78.1%

Full content 90,416 (29.8%)

Summary only 146,707 (48.3%)

Neither 66,667 (21.9%)

Denominator: successfully parsed feeds.

Entries Found

3,189,317

Percentages use successfully parsed feeds as the denominator, except for the dated-entries row, which uses parsed feeds with entries.

Mean entries per parsed feed 10.5

Parsed feeds with entries 239,438 (78.8%)

Feeds with dated entries 232,646 (97.2%)

Feeds with feed-level dates 222,276 (73.2%)

Feeds with repeated entry titles 27,674

Feeds with default entry titles 79

Feeds with repeated entry links 24,211

Default entry titles are generic placeholder-like values such as “Default Title”.

Feed Parse Errors

7,592

2.4% of 311,382 feed URLs checked

Error-row percentages use total parse errors as the denominator.

Error	Count	Share of parse errors
XML declaration allowed only at the start of the document	3,971	52.3%
EntityRef: expecting ';'	1,001	13.2%
CData section not finished	601	7.9%
Unknown root tag: div	277	3.6%
xmlParseEntityRef: no name	219	2.9%
Unknown root tag: quakeml	177	2.3%
Opening and ending tag mismatch: title line 4 and feed	137	1.8%
no element found (line 0)	105	1.4%
Extra content at the end of the document	80	1.1%
Not XML	73	1.0%
Malformed declaration expecting version	64	0.8%
Namespace prefix rdf for resource on license is not defined	52	0.7%
Invalid bytes in character encoding	50	0.7%
Namespace prefix sn on type is not defined	46	0.6%
Unknown root tag: entry	45	0.6%
Unknown root tag: style	45	0.6%
PCDATA invalid Char value 12	44	0.6%
Empty response	32	0.4%
Start tag expected	30	0.4%
Unknown root tag: head	29	0.4%
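The most common error, an XML declaration that is not at the very start of the document, is easy to reproduce. The sketch below uses Python's built-in expat-based parser, so its messages are worded differently from those in the table; whether the survey's parser applies any leniency before parsing is not stated, and the lstrip() workaround shown is only one possible client-side repair.

```python
import xml.etree.ElementTree as ET


def parse_error(body: bytes):
    """Return the parser's error message, or None if the body is well-formed."""
    try:
        ET.fromstring(body)
    except ET.ParseError as exc:
        return str(exc)
    return None


good = b'<?xml version="1.0"?><rss version="2.0"><channel/></rss>'
bad = b"\n" + good  # a single stray newline before the declaration
unescaped = b"<rss><channel><title>A & B</title></channel></rss>"

print(parse_error(good))          # None
print(parse_error(bad))           # declaration-not-at-start error
print(parse_error(bad.lstrip()))  # None once leading whitespace is stripped
print(parse_error(unescaped))     # a bare "&" is also rejected
```

A stray blank line or other output emitted before the feed body is enough to trigger the first case, and an unescaped ampersand in a title is enough to trigger the entity-reference errors.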

Feed Availability and Freshness

Step-down view from every feed URL checked to parsed feeds with entries that appear fresh. Fresh means the newest entry date, or feed-level updated date when no entry date exists, is within 365 days of the crawl response time. Funnel percentages use feed URLs checked as the denominator.

Feed URLs checked

311,382

Parsed RSS/Atom

303,790

Freshness signal within cutoff

100,643

Active with entries

67,997
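The freshness rule stated above can be written as a short predicate. The function and parameter names are illustrative, and the handling of dates that lie in the future relative to the crawl is an assumption (this sketch treats them as fresh).

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

FRESH_WINDOW = timedelta(days=365)


def is_fresh(
    response_time: datetime,
    newest_entry: Optional[datetime],
    feed_updated: Optional[datetime],
) -> bool:
    """Newest entry date wins; the feed-level date is used only when no entry date exists."""
    signal = newest_entry if newest_entry is not None else feed_updated
    if signal is None:
        return False  # undated feeds have no freshness signal
    return response_time - signal <= FRESH_WINDOW


crawl = datetime(2026, 3, 17, tzinfo=timezone.utc)
print(is_fresh(crawl, crawl - timedelta(days=30), None))
print(is_fresh(crawl, None, None))
```

Note that a recent feed-level date does not rescue a feed whose newest entry is stale, because the entry date takes precedence whenever it exists.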

Operational Quality

0.191

mean score, 0-1, across successfully parsed feeds

Parenthetical percentages use successfully parsed feeds as the denominator.

Quality > 0.5 57,995 (19.1%)

Freshness signal within cutoff 100,643 (33.1%)

Mean among those feeds 0.578

Active with entries 67,997 (22.4%)

Undated or stale 203,147

Feed Quality Distribution

Operational, non-editorial score 0–1 among successfully parsed feeds. In split tables, “quality > 0.5” means a feed has a usable freshness signal and enough basic entry/feed metadata to look usable; lower-scoring feeds remain in the all-feeds columns so abandoned or sparse feeds still count. Feeds with no usable date, or no freshness signal within 365 days, score 0 here; the card above also shows the mean with those feeds excluded. Excluded from the freshness-filtered mean: 26,891 undated and 176,256 stale feeds.
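Because undated and stale feeds score 0, the all-feeds mean should equal the fresh-feed mean scaled by the share of fresh feeds. A quick check using only figures reported in this section (the variable names are mine):

```python
parsed_feeds = 303_790
fresh_feeds = 100_643          # freshness signal within cutoff
undated, stale = 26_891, 176_256
fresh_mean = 0.578             # mean among fresh feeds

# Every excluded feed contributes 0 to the sum of scores.
overall_mean = fresh_mean * fresh_feeds / parsed_feeds
print(f"{overall_mean:.4f}")
print(undated + stale == parsed_feeds - fresh_feeds)
```

The product is about 0.1915, which rounds to the reported overall mean of 0.191, and the undated and stale counts add up to the 203,147 feeds outside the cutoff.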

Quality Score Components

Mean component scores across successfully parsed feeds. Bar labels show each component's weight in the composite score. Repeated/default-looking entry titles and repeated entry links reduce the entry metadata component and can cap the final score when severe.

Quality by Format

% of feeds in each quality tier per format, ordered by mean score (↓). Dot = mean score. Normalised for population size so rare formats compare fairly against common ones.

Feed Formats

Successfully parsed feeds only. RSS-family feeds: 199,853; Atom feeds: 103,937. Parenthetical percentages use feeds in that format as the denominator and show the share with operational quality > 0.5.

Format	Count	Quality > 0.5	Mean quality
rss1.0	1	1/1 (100.0%)	0.782
rss	1	1/1 (100.0%)	0.650
rss0.92	3,200	1,209/3,200 (37.8%)	0.286
atom10	103,937	14,520/103,937 (14.0%)	0.195
rss2.0	181,975	40,508/181,975 (22.3%)	0.191
rss0.91	106	23/106 (21.7%)	0.167
rss10	14,552	1,732/14,552 (11.9%)	0.143
rss2.00	18	1/18 (5.6%)	0.052

Charset per Format

Source: HTTP Content-Type header only.

Format	Charset	Count
rss2.0	`utf-8`	164,428
	`unknown`	16,544
	`iso-8859-1`	549
	`windows-1251`	446
	`windows1251`	2
	`iso-8859-2`	1
	`windows-1250`	1
	`utf8`	1
	`windows-1252`	1
	`uft-8`	1
	`latin2`	1
atom10	`utf-8`	78,892
	`unknown`	25,033
	`iso-8859-1`	12
rss0.92	`utf-8`	2,483
	`iso-8859-1`	714
	`windows-1251`	2
	`unknown`	1
rss0.91	`utf-8`	88
	`unknown`	15
	`iso-8859-1`	2
	`gb2312`	1
rss10	`utf-8`	11,281
	`unknown`	3,252
	`iso-8859-1`	19
rss2.00	`unknown`	18
rss	`unknown`	1
rss1.0	`utf-8`	1
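A sketch of reading the charset from the HTTP Content-Type header alone, using the standard library's header parser. The function name and the "unknown" fallback label mirror the table but are otherwise assumptions about how the survey extracts the value.

```python
from email.message import Message
from typing import Optional


def header_charset(content_type: Optional[str]) -> str:
    """Return the lowercased charset parameter, or "unknown" when none is declared."""
    if not content_type:
        return "unknown"
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or "unknown"


print(header_charset("application/rss+xml; charset=UTF-8"))
print(header_charset("application/atom+xml"))
print(header_charset('text/xml; charset="uft-8"'))
```

No alias resolution or spelling repair is applied in this sketch, which is consistent with the table listing labels such as `utf8`, `windows1251`, and the misspelled `uft-8` as separate rows.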

Entry Content Profile

Content type classification based on Atom type attribute and RSS element semantics. Parenthetical percentages in the first count column use all successfully parsed feeds as the denominator. High-quality feeds have operational quality > 0.5.

Profile	All parsed feeds	Among 57,995 high-quality feeds
html	132,524 (43.6%)	31,651 (54.6%)
unknown	66,667 (21.9%)	272 (0.5%)
plain	50,672 (16.7%)	13,525 (23.3%)
mixed	49,074 (16.2%)	12,153 (21.0%)
xhtml	4,853 (1.6%)	394 (0.7%)

Entry Counts per Feed

Feed Languages

Top feed language declarations across successfully parsed feeds. Parenthetical percentages in the first count column use all successfully parsed feeds as the denominator. High-quality feeds have operational quality > 0.5.

Language	All parsed feeds	Among 57,995 high-quality feeds
unknown	145,914 (48.0%)	15,240 (26.3%)
en	58,749 (19.3%)	11,608 (20.0%)
en-us	42,865 (14.1%)	14,029 (24.2%)
es	7,941 (2.6%)	2,330 (4.0%)
fr	6,633 (2.2%)	1,853 (3.2%)
de	4,898 (1.6%)	1,464 (2.5%)
ru	3,550 (1.2%)	1,182 (2.0%)
en-gb	3,441 (1.1%)	890 (1.5%)
pt-br	3,275 (1.1%)	971 (1.7%)
ja	1,867 (0.6%)	664 (1.1%)
it-it	1,338 (0.4%)	367 (0.6%)
de-de	1,335 (0.4%)	722 (1.2%)
es-es	1,274 (0.4%)	312 (0.5%)
uk	1,191 (0.4%)	388 (0.7%)
fr-fr	1,027 (0.3%)	542 (0.9%)
fi	1,001 (0.3%)	150 (0.3%)
ja-jp	915 (0.3%)	374 (0.6%)
nl	886 (0.3%)	349 (0.6%)
zh-tw	842 (0.3%)	190 (0.3%)
km	794 (0.3%)	62 (0.1%)

Language Tagging

Feed-internal language signals are read from xml:lang attributes, RSS <language>, Dublin Core dc:language, and Atom link hreflang. Counts are successfully parsed feeds; categories can overlap except the no-language row. HTTP/feed mismatches are only counted when an HTTP language conflicts with a feed-level language.

No language information 145,812 (48.0%)

HTTP Content-Language 58,387 (19.2%)

Feed-level language 151,815 (50.0%)

Entry-level language 3,571 (1.2%)

Both HTTP and feed-level language 52,559 (17.3%)

Mismatching HTTP and feed-level language 2,384 (0.8%)

Multiple entry languages 2,527 (0.8%)

Uses hreflang 116 (0.0%)

Multiple entry languages includes direct entry tags plus feed/HTTP languages that untagged entries inherit.

Feed Recency (CDF)

% of feeds with a last-update date whose update falls within the given age.

Newest Entry Recency (CDF)

% of feeds with at least one entry whose most recent entry falls within the given age. 64,352 zero-entry feeds excluded from denominator.

Feed History Depth (CDF of Oldest Entry Age)

% of feeds with at least one entry whose oldest entry is at most the given age, i.e. how far back the feed's window reaches. A client polling every N days will miss entries in feeds whose oldest entry is less than N days old when the poll fires. 64,352 zero-entry feeds excluded from denominator.

Inferred Update Cadence (CDF)

% of feeds with enough dated entries whose average interval between entries is at most the given time. Cadence is inferred from the span between oldest and newest entry dates divided by entry count. 159,690 feeds lack enough dated entries to infer cadence.
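The cadence inference described above, as a sketch. The minimum of two dated entries is an assumption, since the report does not say how many entries count as "enough".

```python
from datetime import datetime, timedelta
from typing import Optional


def inferred_cadence(entry_dates: list, min_entries: int = 2) -> Optional[timedelta]:
    """Span between oldest and newest entry date, divided by the entry count."""
    if len(entry_dates) < min_entries:
        return None  # not enough dated entries to infer a cadence
    span = max(entry_dates) - min(entry_dates)
    return span / len(entry_dates)


dates = [datetime(2026, 1, 1) + timedelta(days=d) for d in (0, 2, 5, 8)]
print(inferred_cadence(dates))
```

Dividing by the entry count, as the report states, gives 2 days for four entries spanning 8 days; dividing by the number of gaps (count minus one) would give a slightly longer mean interval, so the two conventions should not be mixed when comparing figures.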

Entry Content Length

Distribution of entry body or summary text lengths, measured after parsing successfully parsed feeds. This helps distinguish feeds that carry full content from feeds that mostly expose short summaries, links, or empty entries.

Feed Link Relations

Feed-level Atom link relations, including Atom links embedded in RSS channels. Self/canonical means rel=self; hub covers WebSub/PubSubHubbub discovery; paging and archive cover feed navigation and archived-feed links. Parenthetical percentages in the first count column use all successfully parsed feeds as the denominator; those in the second column use the 57,995 high-quality feeds (operational quality > 0.5).

Signal	All parsed feeds	Among 57,995 high-quality feeds
self/canonical URL	190,131 (62.6%)	36,487 (62.9%)
paging links	31,150 (10.3%)	3,601 (6.2%)
WebSub/PubSubHubbub hub	27,329 (9.0%)	3,776 (6.5%)
archive links	5 (0.0%)	0 (0.0%)
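A sketch of collecting feed-level Atom link relations from both Atom feeds and RSS 2.0 channels. The grouping of rel values into paging and archive signals is an assumption loosely based on the RFC 5005 relation names, and namespaced RSS 1.0 channels are not handled.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Assumed mapping from rel values to the signals in the table above.
SIGNALS = {
    "self/canonical URL": {"self"},
    "WebSub/PubSubHubbub hub": {"hub"},
    "paging links": {"first", "last", "next", "previous", "prev"},
    "archive links": {"prev-archive", "next-archive", "current"},
}


def feed_level_rels(xml_bytes: bytes) -> set:
    """Return rel values of Atom links that are direct children of the feed or channel."""
    root = ET.fromstring(xml_bytes)
    container = root if root.tag == ATOM + "feed" else root.find("channel")
    if container is None:
        return set()
    # In Atom, a link with no rel attribute defaults to "alternate".
    return {link.get("rel", "alternate") for link in container.findall(ATOM + "link")}


def signals(rels: set) -> set:
    return {name for name, wanted in SIGNALS.items() if rels & wanted}
```

Only direct children are inspected, so entry-level links such as enclosures do not count toward the feed-level signals.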

Top Feed Extensions

Non-core namespace elements encountered across successfully parsed feeds. Parenthetical percentages in the first count column use all successfully parsed feeds as the denominator. High-quality feeds have operational quality > 0.5.

Element	All parsed feeds	Among 57,995 high-quality feeds
`dc:creator`	87,143 (28.7%)	23,287 (40.2%)
`sy:updatePeriod`	37,823 (12.5%)	7,557 (13.0%)
`sy:updateFrequency`	37,749 (12.4%)	7,486 (12.9%)
`content:encoded`	29,140 (9.6%)	8,829 (15.2%)
`dc:date`	19,770 (6.5%)	4,289 (7.4%)
`opensearch:totalResults`	17,585 (5.8%)	613 (1.1%)
`opensearch:startIndex`	17,585 (5.8%)	613 (1.1%)
`opensearch:itemsPerPage`	17,585 (5.8%)	613 (1.1%)
`gd:image`	17,583 (5.8%)	613 (1.1%)
`itunes:author`	12,386 (4.1%)	4,323 (7.5%)
`itunes:explicit`	12,360 (4.1%)	4,319 (7.4%)
`itunes:image`	12,334 (4.1%)	4,315 (7.4%)
`itunes:owner`	12,292 (4.0%)	4,286 (7.4%)
`itunes:name`	12,248 (4.0%)	4,260 (7.3%)
`itunes:summary`	11,881 (3.9%)	4,025 (6.9%)

Platform Fingerprints

Known feed generators or platform headers observed on parsed feeds. This is intentionally conservative; absent fingerprints mean “not identified,” not “custom-built.” Parenthetical percentages in the first count column use all successfully parsed feeds as the denominator. High-quality feeds have operational quality > 0.5; the final column shows the share of feeds in that fingerprint row that clear the threshold.

Fingerprint	All parsed feeds	Among 57,995 high-quality feeds	Quality within fingerprint
unknown	230,473 (75.9%)	45,239 (78.0%)	19.6%
wordpress	29,009 (9.5%)	5,588 (9.6%)	19.3%
drupal	22,613 (7.4%)	5,625 (9.7%)	24.9%
blogger	17,585 (5.8%)	613 (1.1%)	3.5%
joomla	3,901 (1.3%)	868 (1.5%)	22.3%
ghost	207 (0.1%)	60 (0.1%)	29.0%
feedburner	1 (0.0%)	1 (0.0%)	100.0%
squarespace	1 (0.0%)	1 (0.0%)	100.0%

Autodiscovered Feed Quality by Source Platform

Quality of successfully parsed feeds found through HTML autodiscovery, grouped by recognized platform hints on the source page. The unknown row covers autodiscovered feeds whose source page had no recognized platform fingerprint. Parenthetical percentages use parsed feeds in that source-platform row as the denominator.

Source platform	Autodiscovered parsed feeds	Quality > 0.5	Mean quality
unknown	31,873	8,619/31,873 (27.0%)	0.257
drupal	10,634	2,676/10,634 (25.2%)	0.180
wordpress	8,205	3,049/8,205 (37.2%)	0.331
blogger	2,330	281/2,330 (12.1%)	0.224
joomla	563	286/563 (50.8%)	0.397
substack	40	34/40 (85.0%)	0.740
ghost	30	21/30 (70.0%)	0.579
shopify	12	7/12 (58.3%)	0.432
wix	9	8/9 (88.9%)	0.719
squarespace	5	4/5 (80.0%)	0.676