Common Crawl Archive: CC-MAIN-2026-12
Content-Type header values among responses that passed the analysis prefilter. "Other XML" is XML content without an exact RSS/Atom media type, such as application/xml or text/xml. "Other Non-XML" is sniffable non-XML content such as text/plain or application/octet-stream. Either is only counted as a feed URL if sniffing finds RSS/Atom. Parenthetical Content-Type percentages use all analyzed responses as the denominator; sniffed outcome percentages use sniffed feeds as the denominator.
<link> elements whose type is RSS, Atom, or RDF feed XML. Other uses of rel="alternate" are not counted. "Pages using both relations" includes pages with separate alternate and feed links, plus pages with a single link whose rel contains both. "Multi-rel links" counts those single link elements.
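The link-counting rule in the caption above can be sketched as follows. This is a minimal illustration, not the analyzer's actual code: the media-type list, the class name `FeedLinkCollector`, and the function `summarize_page` are assumptions, as is the choice to require a feed media type for rel="feed" links as well as rel="alternate" links.

```python
from html.parser import HTMLParser

# Assumed feed media types (RSS, Atom, RDF); the report's exact list is not shown.
FEED_TYPES = {"application/rss+xml", "application/atom+xml", "application/rdf+xml"}
FEED_RELS = {"alternate", "feed"}


class FeedLinkCollector(HTMLParser):
    """Collects <link> elements that carry a feed media type and a feed relation."""

    def __init__(self):
        super().__init__()
        self.links = []  # list of (matching rel tokens, href)

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        media_type = attr.get("type", "").split(";")[0].strip().lower()
        # rel is a space-separated token list; keep only the two relations of interest.
        rels = set(attr.get("rel", "").lower().split()) & FEED_RELS
        if media_type in FEED_TYPES and rels:
            self.links.append((rels, attr.get("href", "")))


def summarize_page(html):
    """Hypothetical per-page summary mirroring the caption's categories."""
    collector = FeedLinkCollector()
    collector.feed(html)
    uses_alternate = any("alternate" in rels for rels, _ in collector.links)
    uses_feed = any("feed" in rels for rels, _ in collector.links)
    return {
        "uses_both": uses_alternate and uses_feed,
        # Single <link> elements whose rel contains both relations.
        "multi_rel_links": sum(1 for rels, _ in collector.links if rels == FEED_RELS),
        "feed_urls": sorted({href for _, href in collector.links if href}),
    }
```

A link such as `<link rel="alternate" type="text/html">` is ignored here because its type is not a feed media type, which matches the caption's note that other uses of rel="alternate" are not counted.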
Unique feed URLs per discovered HTML page (analyzed HTML responses with zero feed links omitted: 334,093,309).
Unique feed URLs per registrable site. Sites with zero links omitted: 125,962.
Pages that link to redundant format variants of the same feed (same internal title and link after light normalization).
| Coincident formats | Pages |
|---|---|
| rss2.0 and rss2.0 | 181 |
| atom10 and rss2.0 | 100 |
| atom10 and rss10 | 13 |
| rss10 and rss2.0 | 8 |
| atom10 and atom10 | 5 |
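The report does not spell out what "light normalization" means, so the following is only a sketch under stated assumptions: titles are compared after collapsing whitespace and ignoring case, and links are compared after ignoring the scheme, a leading `www.`, and a trailing slash. The function names and the dictionary keys (`format`, `title`, `link`) are hypothetical.

```python
import re
from collections import defaultdict
from itertools import combinations
from urllib.parse import urlsplit


def normalize_title(title):
    """Assumed rule: collapse runs of whitespace and ignore case."""
    return re.sub(r"\s+", " ", title).strip().casefold()


def normalize_link(link):
    """Assumed rule: ignore scheme, host case, a leading 'www.', and a trailing slash."""
    parts = urlsplit(link.strip())
    host = (parts.hostname or "").removeprefix("www.")  # requires Python 3.9+
    path = parts.path.rstrip("/")
    query = "?" + parts.query if parts.query else ""
    return host + path + query


def coincident_formats(feeds):
    """Return sorted format pairs for feeds sharing a normalized (title, link)."""
    groups = defaultdict(list)
    for feed in feeds:
        key = (normalize_title(feed["title"]), normalize_link(feed["link"]))
        groups[key].append(feed["format"])
    pairs = []
    for formats in groups.values():
        for a, b in combinations(formats, 2):
            pairs.append(tuple(sorted((a, b))))
    return pairs
```

Under these assumptions, an RSS 2.0 feed and an Atom 1.0 feed linked from the same page with matching title and site link would be reported as the pair `("atom10", "rss2.0")`, and two RSS 2.0 variants as `("rss2.0", "rss2.0")`.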
Distribution of quality scores among successfully parsed feeds (% of each group per decile). With autodiscovery: 94,824 feeds, mean 0.246. Without: 439,371 feeds, mean 0.220.
| Fingerprint | HTML pages | With autodiscovery |
|---|---|---|
| unknown | 353,974,158 | 53,145,055 (15.0%) |
| wordpress | 45,450,449 | 26,802,369 (59.0%) |
| drupal | 13,190,628 | 758,470 (5.8%) |
| shopify | 1,097,732 | 99,423 (9.1%) |
| joomla | 930,240 | 203,428 (21.9%) |
| substack | 490,567 | 389,821 (79.5%) |
| wix | 430,723 | 82,752 (19.2%) |
| blogger | 335,807 | 305,603 (91.0%) |
| ghost | 229,434 | 218,332 (95.2%) |
| squarespace | 20,798 | 1,017 (4.9%) |
| medium | 2,994 | 41 (1.4%) |

| Error | Count | Share | Error | Count | Share |
|---|---|---|---|---|---|
| XML declaration allowed only at the start of the document | 4,258 | 45.4% | Blank needed here | 100 | 1.1% |
| EntityRef: expecting ';' | 1,137 | 12.1% | Not XML | 97 | 1.0% |
| CData section not finished | 794 | 8.5% | Namespace prefix owl on sameAs is not defined | 77 | 0.8% |
| no element found (line 0) | 364 | 3.9% | Start tag expected | 67 | 0.7% |
| xmlParseEntityRef: no name | 327 | 3.5% | Malformed declaration expecting version | 64 | 0.7% |
| Unknown root tag: div | 286 | 3.0% | Unknown root tag: br | 63 | 0.7% |
| Invalid bytes in character encoding | 188 | 2.0% | Namespace prefix rdf for resource on license is not defined | 52 | 0.6% |
| Unknown root tag: quakeml | 177 | 1.9% | Namespace prefix sn on type is not defined | 46 | 0.5% |
| Opening and ending tag mismatch: title line 4 and feed | 139 | 1.5% | Unknown root tag: entry | 45 | 0.5% |
| Extra content at the end of the document | 104 | 1.1% | Unknown root tag: style | 45 | 0.5% |
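As a rough illustration of how failures like those above can be bucketed, the sketch below separates documents that are not well-formed XML from documents that parse but have a non-feed root element. It uses Python's standard `xml.etree.ElementTree`, so the error wording will differ from the messages in the table, which appear to come from the analyzer's own parser. The function name `classify_document` and the accepted root names are assumptions.

```python
import xml.etree.ElementTree as ET

# Assumed feed root elements: RSS (<rss>), Atom (<feed>), and RSS 1.0 (<rdf:RDF>).
FEED_ROOTS = {"rss", "feed", "RDF"}


def classify_document(data):
    """Return 'feed', 'Unknown root tag: NAME', or 'parse error: MESSAGE'."""
    try:
        root = ET.fromstring(data)
    except ET.ParseError as exc:
        # Not well-formed: stray bytes before the XML declaration,
        # unterminated CDATA, undefined namespace prefixes, empty bodies, etc.
        return f"parse error: {exc}"
    # ElementTree writes namespaced tags as '{namespace}local'; keep the local part.
    local_name = root.tag.rsplit("}", 1)[-1]
    if local_name in FEED_ROOTS:
        return "feed"
    return f"Unknown root tag: {local_name}"
```

For example, an HTML fragment served at a feed URL parses as XML when it happens to be well-formed, and is then reported by its root element, which is one plausible reading of rows such as "Unknown root tag: div".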
| Format | Count | Quality > 0.5 | Mean quality |
|---|---|---|---|
| rss1.0 | 2 | 2/2 (100.0%) | 0.732 |
| rss0.93 | 1 | 1/1 (100.0%) | 0.682 |
| rss | 5 | 5/5 (100.0%) | 0.622 |
| rss0.92 | 3,224 | 1,220/3,224 (37.8%) | 0.287 |
| rss2.0 | 378,287 | 92,687/378,287 (24.5%) | 0.230 |
| rss10 | 28,256 | 6,608/28,256 (23.4%) | 0.217 |
| atom10 | 122,181 | 19,940/122,181 (16.3%) | 0.209 |
| rss0.91 | 2,217 | 368/2,217 (16.6%) | 0.150 |
| rss2.00 | 18 | 1/18 (5.6%) | 0.052 |
| rss.92 | 4 | 0/4 (0.0%) | 0.000 |
Content-Type header only.
| Format | Charset | Count |
|---|---|---|
| rss2.0 | utf-8 | 316,850 |
| rss2.0 | unknown | 54,486 |
| rss2.0 | iso-8859-1 | 4,203 |
| rss2.0 | windows-1251 | 1,267 |
| rss2.0 | iso-8859-15 | 914 |
| rss2.0 | windows-1252 | 231 |
| rss2.0 | utf8 | 152 |
| rss2.0 | windows-1256 | 69 |
| rss2.0 | $charset | 36 |
| rss2.0 | iso-8859-9 | 18 |
| rss2.0 | iso-8859-2 | 16 |
| rss2.0 | gb2312 | 16 |
| rss2.0 | euc-kr | 12 |
| rss2.0 | windows-1250 | 9 |
| rss2.0 | big5 | 3 |
| rss2.0 | windows1251 | 2 |
| rss2.0 | gbk | 1 |
| rss2.0 | uft-8 | 1 |
| rss2.0 | latin2 | 1 |
| atom10 | utf-8 | 89,909 |
| atom10 | unknown | 31,945 |
| atom10 | iso-8859-1 | 315 |
| atom10 | iso-8859-15 | 12 |
| rss10 | utf-8 | 20,106 |
| rss10 | unknown | 7,960 |
| rss10 | iso-8859-1 | 161 |
| rss10 | iso-8859-15 | 13 |
| rss10 | euc-jp | 9 |
| rss10 | utf8 | 3 |
| rss10 | gb2312 | 1 |
| rss10 | windows-1251 | 1 |
| rss10 | windows-1250 | 1 |
| rss10 | gbk | 1 |
| rss0.91 | iso-8859-2 | 787 |
| rss0.91 | unknown | 764 |
| rss0.91 | utf-8 | 629 |
| rss0.91 | koi8-r | 31 |
| rss0.91 | gb2312 | 3 |
| rss0.91 | iso-8859-1 | 2 |
| rss0.91 | windows-1250 | 1 |
| rss0.92 | utf-8 | 2,501 |
| rss0.92 | iso-8859-1 | 714 |
| rss0.92 | unknown | 6 |
| rss0.92 | windows-1251 | 2 |
| rss0.92 | iso-8859-15 | 1 |
| rss2.00 | unknown | 18 |
| rss.92 | utf-8 | 4 |
| rss | unknown | 3 |
| rss | utf-8 | 2 |
| rss1.0 | utf-8 | 1 |
| rss1.0 | unknown | 1 |
| rss0.93 | utf-8 | 1 |
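The charset labels above are raw values, which is why spellings such as utf8, uft-8, and $charset appear as separate rows. A minimal sketch of reading the charset parameter from a Content-Type header and checking whether the label names a known codec, assuming the table was built from that header parameter; the helper names `header_charset` and `canonical_charset` are hypothetical, and Python's codec registry stands in for whatever the analyzer used.

```python
import codecs
from email.message import Message


def header_charset(content_type):
    """Return the lowercased charset parameter of a Content-Type value, or 'unknown'."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or "unknown"


def canonical_charset(label):
    """Map a charset label to Python's codec name, or None if it is not recognized."""
    try:
        return codecs.lookup(label).name
    except LookupError:
        # Typos (uft-8) and unexpanded template variables ($charset) end up here.
        return None
```

With this approach, utf8 and utf-8 resolve to the same codec and could be merged into one row, while uft-8 and $charset are flagged as unrecognized rather than silently treated as UTF-8.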
type attribute and RSS element semantics. Parenthetical percentages in the first count column use all successfully parsed feeds as the denominator. High-quality feeds have operational quality > 0.5.
| Profile | All parsed feeds | Among 120,832 high-quality feeds |
|---|---|---|
| html | 210,747 (39.5%) | 58,025 (48.0%) |
| unknown | 120,849 (22.6%) | 1,401 (1.2%) |
| plain | 119,857 (22.4%) | 38,246 (31.7%) |
| mixed | 77,791 (14.6%) | 22,759 (18.8%) |
| xhtml | 4,951 (0.9%) | 401 (0.3%) |
| Language | All parsed feeds | Among 120,832 high-quality feeds |
|---|---|---|
| unknown | 250,463 (46.9%) | 32,683 (27.0%) |
| en | 83,341 (15.6%) | 22,163 (18.3%) |
| en-us | 77,634 (14.5%) | 25,184 (20.8%) |
| ja | 13,340 (2.5%) | 6,038 (5.0%) |
| fr | 11,981 (2.2%) | 3,488 (2.9%) |
| es | 9,764 (1.8%) | 3,208 (2.7%) |
| de | 8,528 (1.6%) | 3,051 (2.5%) |
| ru | 6,420 (1.2%) | 1,841 (1.5%) |
| fr-fr | 5,979 (1.1%) | 1,946 (1.6%) |
| en-gb | 5,843 (1.1%) | 2,214 (1.8%) |
| pt-br | 4,175 (0.8%) | 1,318 (1.1%) |
| de-de | 3,821 (0.7%) | 1,927 (1.6%) |
| pt | 2,801 (0.5%) | 795 (0.7%) |
| cs | 2,740 (0.5%) | 831 (0.7%) |
| utf-8 | 2,543 (0.5%) | 5 (0.0%) |
| it-it | 2,469 (0.5%) | 686 (0.6%) |
| ja-jp | 2,335 (0.4%) | 1,250 (1.0%) |
| zh-tw | 2,328 (0.4%) | 502 (0.4%) |
| es-es | 2,066 (0.4%) | 672 (0.6%) |
| zh-cn | 1,634 (0.3%) | 323 (0.3%) |
xml:lang attributes, RSS <language>, Dublin Core dc:language, and Atom link hreflang. Counts are successfully parsed feeds; categories can overlap except the no-language row. HTTP/feed mismatches are only counted when an HTTP language conflicts with a feed-level language.
| Source | Parsed feeds |
|---|---|
| Content-Language | 72,201 (13.5%) |
| hreflang | 1,343 (0.3%) |

rel=self; hub covers WebSub/PubSubHubbub discovery; paging and archive cover feed navigation and archived-feed links. Parenthetical percentages use all successfully parsed feeds as the denominator.
| Signal | All parsed feeds | Among 120,832 high-quality feeds |
|---|---|---|
| self/canonical URL | 293,773 (55.0%) | 67,874 (56.2%) |
| paging links | 34,939 (6.5%) | 4,240 (3.5%) |
| WebSub/PubSubHubbub hub | 33,941 (6.4%) | 7,818 (6.5%) |
| archive links | 8 (0.0%) | 1 (0.0%) |
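The signals in this table correspond to feed-level Atom link relations. The sketch below shows one way to detect them with the standard library; the mapping from rel values to signal names is an assumption based on the caption and on the paging and archive relations of RFC 5005, and `feed_signals` is a hypothetical helper, not the analyzer's code.

```python
import xml.etree.ElementTree as ET

ATOM_LINK = "{http://www.w3.org/2005/Atom}link"

# Assumed grouping of link relations into the table's signal names.
SIGNALS = {
    "self": "self/canonical URL",
    "hub": "WebSub/PubSubHubbub hub",
    "first": "paging links",
    "last": "paging links",
    "next": "paging links",
    "previous": "paging links",
    "prev-archive": "archive links",
    "next-archive": "archive links",
}


def feed_signals(xml_bytes):
    """Return the set of signal names found on feed-level atom:link elements."""
    root = ET.fromstring(xml_bytes)
    # Atom: links are direct children of <feed>.
    # RSS 2.0: atom:link elements sit inside <channel>.
    links = root.findall(ATOM_LINK) + root.findall("channel/" + ATOM_LINK)
    found = set()
    for link in links:
        rel = link.get("rel", "").strip().lower()
        if rel in SIGNALS:
            found.add(SIGNALS[rel])
    return found
```

Only feed-level links are inspected, so a paging or archive relation that appears inside an individual entry does not count toward the feed.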
| Element | All parsed feeds | Among 120,832 high-quality feeds |
|---|---|---|
| dc:creator | 142,073 (26.6%) | 38,730 (32.1%) |
| content:encoded | 59,448 (11.1%) | 23,124 (19.1%) |
| sy:updatePeriod | 50,606 (9.5%) | 12,441 (10.3%) |
| sy:updateFrequency | 50,529 (9.5%) | 12,366 (10.2%) |
| dc:date | 45,539 (8.5%) | 13,092 (10.8%) |
| itunes:explicit | 30,681 (5.7%) | 14,266 (11.8%) |
| itunes:author | 30,286 (5.7%) | 14,124 (11.7%) |
| itunes:image | 29,996 (5.6%) | 14,013 (11.6%) |
| itunes:owner | 29,466 (5.5%) | 13,840 (11.5%) |
| itunes:name | 28,616 (5.4%) | 13,621 (11.3%) |
| itunes:category | 28,268 (5.3%) | 13,406 (11.1%) |
| itunes:duration | 26,876 (5.0%) | 13,165 (10.9%) |
| itunes:summary | 26,867 (5.0%) | 12,577 (10.4%) |
| itunes:email | 22,625 (4.2%) | 9,672 (8.0%) |
| itunes:type | 22,511 (4.2%) | 10,902 (9.0%) |
| Fingerprint | All parsed feeds | Among 120,832 high-quality feeds | Quality within fingerprint |
|---|---|---|---|
| unknown | 450,091 (84.3%) | 104,636 (86.6%) | 23.2% |
| wordpress | 36,821 (6.9%) | 8,142 (6.7%) | 22.1% |
| drupal | 22,830 (4.3%) | 5,722 (4.7%) | 25.1% |
| blogger | 19,954 (3.7%) | 1,177 (1.0%) | 5.9% |
| joomla | 3,988 (0.7%) | 898 (0.7%) | 22.5% |
| ghost | 237 (0.0%) | 76 (0.1%) | 32.1% |
| substack | 191 (0.0%) | 152 (0.1%) | 79.6% |
| squarespace | 46 (0.0%) | 17 (0.0%) | 37.0% |
| medium | 36 (0.0%) | 11 (0.0%) | 30.6% |
| feedburner | 1 (0.0%) | 1 (0.0%) | 100.0% |
| Source platform | Autodiscovered parsed feeds | Quality > 0.5 | Mean quality |
|---|---|---|---|
| unknown | 72,571 | 18,912/72,571 (26.1%) | 0.247 |
| drupal | 10,721 | 2,734/10,721 (25.5%) | 0.183 |
| wordpress | 9,258 | 3,376/9,258 (36.5%) | 0.322 |
| blogger | 2,330 | 281/2,330 (12.1%) | 0.224 |
| joomla | 636 | 317/636 (49.8%) | 0.384 |
| substack | 180 | 162/180 (90.0%) | 0.768 |
| wix | 49 | 21/49 (42.9%) | 0.350 |
| ghost | 39 | 26/39 (66.7%) | 0.553 |
| shopify | 24 | 15/24 (62.5%) | 0.474 |
| squarespace | 10 | 9/10 (90.0%) | 0.732 |
| medium | 1 | 1/1 (100.0%) | 0.717 |