Common Crawl Archive: CC-MAIN-2026-12
Content-Type header values among responses that passed the analysis prefilter. “Other XML” is XML content without an exact RSS/Atom media type, such as application/xml or text/xml. “Other Non-XML” is sniffable non-XML content such as text/plain or application/octet-stream. Either is only counted as a feed URL if sniffing finds RSS/Atom. Parenthetical percentages use all analyzed responses as the denominator.

<link> elements whose type is RSS, Atom, or RDF feed XML. Other uses of rel="alternate" are not counted. Pages using both relations includes pages with separate alternate and feed links, plus pages with a single link whose rel contains both. Multi-rel links counts those single link elements.
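The link-counting rules above can be sketched in a few lines. This is only an illustration under stated assumptions, not the report's actual pipeline: the `FEED_TYPES` set, the `FeedLinkCollector` class, and the `classify_page` helper are hypothetical names, and requiring a feed media type for both rel values is one reading of the caption.

```python
from html.parser import HTMLParser

# Assumed set of feed media types; the report's exact list is not given.
FEED_TYPES = {
    "application/rss+xml",
    "application/atom+xml",
    "application/rdf+xml",
}


class FeedLinkCollector(HTMLParser):
    """Collect <link> elements that look like feed autodiscovery links."""

    def __init__(self):
        super().__init__()
        self.links = []  # list of (href, set of matched rel tokens)

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = {k: (v or "") for k, v in attrs}
        rels = set(a.get("rel", "").lower().split())
        mime = a.get("type", "").split(";")[0].strip().lower()
        matched = rels & {"alternate", "feed"}
        # rel="alternate" without a feed type (e.g. hreflang variants) is ignored.
        if matched and mime in FEED_TYPES and a.get("href"):
            self.links.append((a["href"], matched))


def classify_page(html):
    """Return per-page flags mirroring the categories in the caption."""
    parser = FeedLinkCollector()
    parser.feed(html)
    parser.close()
    rels_seen = set()
    for _, rels in parser.links:
        rels_seen |= rels
    return {
        "uses_both": {"alternate", "feed"} <= rels_seen,
        "multi_rel_links": sum(1 for _, r in parser.links if len(r) == 2),
        "feed_urls": sorted({href for href, _ in parser.links}),
    }
```

A page with one `rel="alternate feed"` link sets `uses_both` and contributes one multi-rel link, which matches the caption's description of single links whose rel contains both relations.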
Unique feed URLs per discovered HTML page (analyzed HTML responses with zero feed links omitted: 334,103,280).
Unique feed URLs per registrable site. Based on sampled source sites retained for each discovered feed URL. Sites with zero links omitted: 125,965.
Pages that link to redundant format variants of the same feed (same internal title and link after light normalization).
| Coincident formats | Pages |
|---|---|
| atom10 and rss2.0 | 4 |
| rss2.0 and rss2.0 | 2 |
| atom10 and atom10 | 1 |
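One way to detect such redundant variants is to key each linked feed on its normalized title and link. The sketch below is an assumption-laden illustration: the report does not define "light normalization", so the rules here (trim, collapse whitespace, lowercase, drop a trailing slash) and the dict keys `format`, `title`, and `link` are hypothetical.

```python
import re
from collections import defaultdict


def normalize(text):
    """Light normalization (assumed): collapse whitespace, trim, lowercase."""
    return re.sub(r"\s+", " ", text).strip().lower()


def normalize_link(url):
    """Assumed link normalization: as above, plus dropping a trailing slash."""
    return normalize(url).rstrip("/")


def coincident_formats(feeds):
    """Group one page's linked feeds by (title, link) and report duplicates.

    `feeds` is a list of dicts with hypothetical keys: format, title, link.
    Returns a list of sorted format tuples, one per redundant group.
    """
    groups = defaultdict(list)
    for feed in feeds:
        key = (normalize(feed["title"]), normalize_link(feed["link"]))
        groups[key].append(feed["format"])
    return [tuple(sorted(fmts)) for fmts in groups.values() if len(fmts) > 1]
```

Under this scheme a row such as "rss2.0 and rss2.0" would arise when two distinct feed URLs on one page resolve to feeds with the same internal title and link.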
Distribution of quality scores among successfully parsed feeds (% of each group per decile). With autodiscovery: 53,096 feeds, mean 0.251. Without: 250,694 feeds, mean 0.179.
| Fingerprint | HTML pages | With autodiscovery |
|---|---|---|
| unknown | 353,984,603 | 53,146,358 (15.0%) |
| wordpress | 45,451,537 | 26,802,985 (59.0%) |
| drupal | 13,190,932 | 758,482 (5.8%) |
| shopify | 1,097,730 | 99,413 (9.1%) |
| joomla | 930,272 | 203,431 (21.9%) |
| substack | 490,578 | 389,821 (79.5%) |
| wix | 430,754 | 82,764 (19.2%) |
| blogger | 335,819 | 305,604 (91.0%) |
| ghost | 229,444 | 218,342 (95.2%) |
| squarespace | 20,792 | 1,017 (4.9%) |
| medium | 2,995 | 41 (1.4%) |

| Parse error | Count | Share |
|---|---|---|
| XML declaration allowed only at the start of the document | 3,971 | 52.3% |
| EntityRef: expecting ';' | 1,001 | 13.2% |
| CData section not finished | 601 | 7.9% |
| Unknown root tag: div | 277 | 3.6% |
| xmlParseEntityRef: no name | 219 | 2.9% |
| Unknown root tag: quakeml | 177 | 2.3% |
| Opening and ending tag mismatch: title line 4 and feed | 137 | 1.8% |
| no element found (line 0) | 105 | 1.4% |
| Extra content at the end of the document | 80 | 1.1% |
| Not XML | 73 | 1.0% |
| Malformed declaration expecting version | 64 | 0.8% |
| Namespace prefix rdf for resource on license is not defined | 52 | 0.7% |
| Invalid bytes in character encoding | 50 | 0.7% |
| Namespace prefix sn on type is not defined | 46 | 0.6% |
| Unknown root tag: entry | 45 | 0.6% |
| Unknown root tag: style | 45 | 0.6% |
| PCDATA invalid Char value 12 | 44 | 0.6% |
| Empty response | 32 | 0.4% |
| Start tag expected | 30 | 0.4% |
| Unknown root tag: head | 29 | 0.4% |

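Errors such as "Unknown root tag: div" suggest that each response is parsed as XML and then classified by its root element. The following is a minimal sketch of that idea using Python's standard library, not the report's actual parser: the `sniff_feed` function is hypothetical, the mapping of an RDF root to the label `rss10` is a guess, and the parser error strings it returns will differ from the messages in the table.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def sniff_feed(body):
    """Return (format, error); exactly one of the two is None."""
    if not body.strip():
        return None, "Empty response"
    try:
        root = ET.fromstring(body)
    except ET.ParseError as exc:
        # Malformed XML: report the parser's own message.
        return None, str(exc)
    if root.tag == "rss":
        # Label by the declared version, e.g. "rss2.0"; bare "rss" if absent.
        return "rss" + root.get("version", ""), None
    if root.tag == f"{{{ATOM_NS}}}feed":
        return "atom10", None
    if root.tag == f"{{{RDF_NS}}}RDF":
        return "rss10", None  # assumed label for RDF-based RSS 1.0
    local = root.tag.rsplit("}", 1)[-1]  # strip any namespace prefix
    return None, f"Unknown root tag: {local}"
```

Labelling RSS by its declared version string would also produce rare labels such as `rss2.00` or a bare `rss` whenever a feed declares an unusual or missing version.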
| Format | Count | Quality > 0.5 | Mean quality |
|---|---|---|---|
| rss1.0 | 1 | 1/1 (100.0%) | 0.782 |
| rss | 1 | 1/1 (100.0%) | 0.650 |
| rss0.92 | 3,200 | 1,209/3,200 (37.8%) | 0.286 |
| atom10 | 103,937 | 14,520/103,937 (14.0%) | 0.195 |
| rss2.0 | 181,975 | 40,508/181,975 (22.3%) | 0.191 |
| rss0.91 | 106 | 23/106 (21.7%) | 0.167 |
| rss10 | 14,552 | 1,732/14,552 (11.9%) | 0.143 |
| rss2.00 | 18 | 1/18 (5.6%) | 0.052 |
Content-Type header only.

| Format | Charset | Count |
|---|---|---|
| rss2.0 | utf-8 | 164,428 |
| rss2.0 | unknown | 16,544 |
| rss2.0 | iso-8859-1 | 549 |
| rss2.0 | windows-1251 | 446 |
| rss2.0 | windows1251 | 2 |
| rss2.0 | iso-8859-2 | 1 |
| rss2.0 | windows-1250 | 1 |
| rss2.0 | utf8 | 1 |
| rss2.0 | windows-1252 | 1 |
| rss2.0 | uft-8 | 1 |
| rss2.0 | latin2 | 1 |
| atom10 | utf-8 | 78,892 |
| atom10 | unknown | 25,033 |
| atom10 | iso-8859-1 | 12 |
| rss0.92 | utf-8 | 2,483 |
| rss0.92 | iso-8859-1 | 714 |
| rss0.92 | windows-1251 | 2 |
| rss0.92 | unknown | 1 |
| rss0.91 | utf-8 | 88 |
| rss0.91 | unknown | 15 |
| rss0.91 | iso-8859-1 | 2 |
| rss0.91 | gb2312 | 1 |
| rss10 | utf-8 | 11,281 |
| rss10 | unknown | 3,252 |
| rss10 | iso-8859-1 | 19 |
| rss2.00 | unknown | 18 |
| rss | unknown | 1 |
| rss1.0 | utf-8 | 1 |
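
Reading the charset from the Content-Type header alone can be sketched as follows. This is an illustrative assumption about how such labels might be extracted, not the report's code; `header_charset` is a hypothetical helper. It lowercases the declared label but does not resolve aliases or typos, which would explain separate rows for values like `utf8`, `uft-8`, and `windows1251`.

```python
from email.message import Message


def header_charset(content_type):
    """Return the charset parameter of a Content-Type header, lowercased,
    or "unknown" when the header is missing or has no charset."""
    if not content_type:
        return "unknown"
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lowercases the value and returns None if absent.
    return msg.get_content_charset() or "unknown"
```

For example, `header_charset("text/xml")` yields `"unknown"` because no charset parameter is present, even if the XML declaration inside the body names an encoding.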

type attribute and RSS element semantics. Parenthetical percentages in the first count column use all successfully parsed feeds as the denominator. High-quality feeds have operational quality > 0.5.

| Profile | All parsed feeds | Among 57,995 high-quality feeds |
|---|---|---|
| html | 132,524 (43.6%) | 31,651 (54.6%) |
| unknown | 66,667 (21.9%) | 272 (0.5%) |
| plain | 50,672 (16.7%) | 13,525 (23.3%) |
| mixed | 49,074 (16.2%) | 12,153 (21.0%) |
| xhtml | 4,853 (1.6%) | 394 (0.7%) |
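
The two percentage columns use different denominators, which the short calculation below makes explicit. The figure 303,790 is not stated directly; it is the sum of the "All parsed feeds" column above. The `share` helper is a hypothetical name used only for this illustration.

```python
def share(count, total):
    """Format count as a percentage of total with one decimal place."""
    return f"{100 * count / total:.1f}%"


ALL_PARSED = 303_790    # sum of the "All parsed feeds" column
HIGH_QUALITY = 57_995   # feeds with operational quality > 0.5

print(share(132_524, ALL_PARSED))   # html profile among all parsed feeds: 43.6%
print(share(31_651, HIGH_QUALITY))  # html profile among high-quality feeds: 54.6%
```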
| Language | All parsed feeds | Among 57,995 high-quality feeds |
|---|---|---|
| unknown | 145,914 (48.0%) | 15,240 (26.3%) |
| en | 58,749 (19.3%) | 11,608 (20.0%) |
| en-us | 42,865 (14.1%) | 14,029 (24.2%) |
| es | 7,941 (2.6%) | 2,330 (4.0%) |
| fr | 6,633 (2.2%) | 1,853 (3.2%) |
| de | 4,898 (1.6%) | 1,464 (2.5%) |
| ru | 3,550 (1.2%) | 1,182 (2.0%) |
| en-gb | 3,441 (1.1%) | 890 (1.5%) |
| pt-br | 3,275 (1.1%) | 971 (1.7%) |
| ja | 1,867 (0.6%) | 664 (1.1%) |
| it-it | 1,338 (0.4%) | 367 (0.6%) |
| de-de | 1,335 (0.4%) | 722 (1.2%) |
| es-es | 1,274 (0.4%) | 312 (0.5%) |
| uk | 1,191 (0.4%) | 388 (0.7%) |
| fr-fr | 1,027 (0.3%) | 542 (0.9%) |
| fi | 1,001 (0.3%) | 150 (0.3%) |
| ja-jp | 915 (0.3%) | 374 (0.6%) |
| nl | 886 (0.3%) | 349 (0.6%) |
| zh-tw | 842 (0.3%) | 190 (0.3%) |
| km | 794 (0.3%) | 62 (0.1%) |

xml:lang attributes, RSS <language>, Dublin Core dc:language, and Atom link hreflang. Counts are successfully parsed feeds; categories can overlap except the no-language row. HTTP/feed mismatches are only counted when an HTTP language conflicts with a feed-level language.

| Language source | Parsed feeds |
|---|---|
| Content-Language | 58,387 (19.2%) |
| hreflang | 116 (0.0%) |

rel=self; hub covers WebSub/PubSubHubbub discovery; paging and archive cover feed navigation and archived-feed links. Parenthetical percentages use all successfully parsed feeds as the denominator.

| Signal | All parsed feeds | Among 57,995 high-quality feeds |
|---|---|---|
| self/canonical URL | 190,131 (62.6%) | 36,487 (62.9%) |
| paging links | 31,150 (10.3%) | 3,601 (6.2%) |
| WebSub/PubSubHubbub hub | 27,329 (9.0%) | 3,776 (6.5%) |
| archive links | 5 (0.0%) | 0 (0.0%) |
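
These signals are typically expressed as Atom `link` elements at the feed level, including inside an RSS `<channel>`. The sketch below shows one way to detect them, assuming the paging and archive relations are those defined in RFC 5005; the `feed_signals` helper and the choice to inspect only feed-level links are assumptions, not the report's documented method.

```python
import xml.etree.ElementTree as ET

ATOM_LINK = "{http://www.w3.org/2005/Atom}link"
PAGING = {"first", "last", "previous", "next"}       # RFC 5005 paged feeds
ARCHIVE = {"prev-archive", "next-archive", "current"}  # RFC 5005 archived feeds


def feed_signals(body):
    """Report which feed-level link relations a parsed feed declares."""
    root = ET.fromstring(body)
    channel = root.find("channel")  # RSS keeps feed-level links in <channel>
    container = channel if channel is not None else root
    # An Atom link without rel defaults to "alternate".
    rels = {link.get("rel", "alternate") for link in container.findall(ATOM_LINK)}
    return {
        "self": "self" in rels,
        "hub": "hub" in rels,
        "paging": bool(rels & PAGING),
        "archive": bool(rels & ARCHIVE),
    }
```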

| Element | All parsed feeds | Among 57,995 high-quality feeds |
|---|---|---|
| dc:creator | 87,143 (28.7%) | 23,287 (40.2%) |
| sy:updatePeriod | 37,823 (12.5%) | 7,557 (13.0%) |
| sy:updateFrequency | 37,749 (12.4%) | 7,486 (12.9%) |
| content:encoded | 29,140 (9.6%) | 8,829 (15.2%) |
| dc:date | 19,770 (6.5%) | 4,289 (7.4%) |
| opensearch:totalResults | 17,585 (5.8%) | 613 (1.1%) |
| opensearch:startIndex | 17,585 (5.8%) | 613 (1.1%) |
| opensearch:itemsPerPage | 17,585 (5.8%) | 613 (1.1%) |
| gd:image | 17,583 (5.8%) | 613 (1.1%) |
| itunes:author | 12,386 (4.1%) | 4,323 (7.5%) |
| itunes:explicit | 12,360 (4.1%) | 4,319 (7.4%) |
| itunes:image | 12,334 (4.1%) | 4,315 (7.4%) |
| itunes:owner | 12,292 (4.0%) | 4,286 (7.4%) |
| itunes:name | 12,248 (4.0%) | 4,260 (7.3%) |
| itunes:summary | 11,881 (3.9%) | 4,025 (6.9%) |

| Fingerprint | All parsed feeds | Among 57,995 high-quality feeds | Quality within fingerprint |
|---|---|---|---|
| unknown | 230,473 (75.9%) | 45,239 (78.0%) | 19.6% |
| wordpress | 29,009 (9.5%) | 5,588 (9.6%) | 19.3% |
| drupal | 22,613 (7.4%) | 5,625 (9.7%) | 24.9% |
| blogger | 17,585 (5.8%) | 613 (1.1%) | 3.5% |
| joomla | 3,901 (1.3%) | 868 (1.5%) | 22.3% |
| ghost | 207 (0.1%) | 60 (0.1%) | 29.0% |
| feedburner | 1 (0.0%) | 1 (0.0%) | 100.0% |
| squarespace | 1 (0.0%) | 1 (0.0%) | 100.0% |
| Source platform | Autodiscovered parsed feeds | Quality > 0.5 | Mean quality |
|---|---|---|---|
| unknown | 31,873 | 8,619/31,873 (27.0%) | 0.257 |
| drupal | 10,634 | 2,676/10,634 (25.2%) | 0.180 |
| wordpress | 8,205 | 3,049/8,205 (37.2%) | 0.331 |
| blogger | 2,330 | 281/2,330 (12.1%) | 0.224 |
| joomla | 563 | 286/563 (50.8%) | 0.397 |
| substack | 40 | 34/40 (85.0%) | 0.740 |
| ghost | 30 | 21/30 (70.0%) | 0.579 |
| shopify | 12 | 7/12 (58.3%) | 0.432 |
| wix | 9 | 8/9 (88.9%) | 0.719 |
| squarespace | 5 | 4/5 (80.0%) | 0.676 |