Review of gbif-col265.tsv — 7,746,724 rows, 30 columns,
~3.9 GB. Basic match metrics, matchType / matchIssues
distributions, and patterns behind the no-matches.
UNITE SH OTUs are rows whose gbif:scientificName matches
^SH\d+\.\d+FU (e.g. SH220494.07FU, rank
unranked). The version suffix differs between GBIF and COL, so all
609,269 are no-matches by construction (100%). They are removed from every metric
below. Working total: 7,137,455 rows.
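
The exclusion can be sketched in a few lines of Python. This is only an illustration of the pattern and the arithmetic quoted here; the helper name `is_sh_otu` is hypothetical and not part of any existing tooling.

```python
import re

# Pattern quoted in the review for UNITE Species Hypothesis identifiers.
SH_OTU = re.compile(r"^SH\d+\.\d+FU")

def is_sh_otu(scientific_name: str) -> bool:
    """Return True when the name looks like a UNITE SH identifier."""
    return SH_OTU.match(scientific_name) is not None

# Row counts as reported in the review.
TOTAL_ROWS = 7_746_724
SH_ROWS = 609_269
working_total = TOTAL_ROWS - SH_ROWS

print(is_sh_otu("SH220494.07FU"), working_total)
```
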
Three families of “names” are really external identifiers. They behave very differently against COL:
| Artefact | Detection | Rows | No-match | Verdict |
|---|---|---|---|---|
| UNITE SH OTUs | ^SH\d+\.\d+FU | 609,269 | 100% | Excluded — version differs (GBIF 09 vs COL 10), can never match. |
| BOLD BINs | ^BOLD: | 617,160 | 0.4% | Kept — present in COL, match almost perfectly. Not a no-match source. |
| GTDB placeholders | kingdom Bacteria/Archaea AND (name has a digit OR ends _<CAPS>) | 40,473 | 100% | Deterministic non-matches; not wanted in COL. |
Any Bacteria or Archaea name that contains a digit, or ends in an _A/_B-style
suffix, is a GTDB placeholder. This catches uninomial codes (ARS69,
CAIJMQ01, 2-02-FULL-45-22_A), pure _CAPS splits
(Bacillus_D, Myxococcota_A, Desulfobacterota_C) and
binomial placeholders (… sp002774085, 28,514 of them).
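
A minimal sketch of the detection rule from the table, assuming the kingdom and name are available as plain strings. The function name and the two regular expressions are illustrative choices, and `Examplegenus` is a made-up genus standing in for the elided one.

```python
import re

HAS_DIGIT = re.compile(r"\d")           # any digit anywhere in the name
CAPS_SUFFIX = re.compile(r"_[A-Z]+$")   # trailing _A, _B, _AB ... suffix
PROKARYOTE_KINGDOMS = {"Bacteria", "Archaea"}

def is_gtdb_placeholder(kingdom: str, name: str) -> bool:
    """Kingdom Bacteria/Archaea AND (name has a digit OR ends in _<CAPS>)."""
    if kingdom not in PROKARYOTE_KINGDOMS:
        return False
    return bool(HAS_DIGIT.search(name) or CAPS_SUFFIX.search(name))

examples = [
    ("Bacteria", "ARS69"),
    ("Bacteria", "2-02-FULL-45-22_A"),
    ("Bacteria", "Bacillus_D"),
    ("Bacteria", "Examplegenus sp002774085"),  # hypothetical genus name
    ("Bacteria", "Escherichia coli"),          # ordinary binomial, not flagged
]
for kingdom, name in examples:
    print(name, is_gtdb_placeholder(kingdom, name))
```

The kingdom guard matters: without it, any name containing a digit in another kingdom would be flagged as well.
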
By rank: species 28,565, genus 8,877, family 1,919, order 818, class 202, phylum 92.
By kingdom: Bacteria 37,311, Archaea 3,162.

| Outcome | Rows | Share |
|---|---|---|
| Matched to COL | 5,810,113 | 81.4% |
| No match | 1,327,342 | 18.6% |
With the SH OTUs included the match rate is 75.0%; removing them lifts it to 81.4%; additionally removing the 40,473 GTDB placeholders (all deterministic misses) lifts it to 81.9% (5,810,113 / 7,096,982).
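
The three match rates follow from one numerator and three denominators. The sketch below simply reproduces that arithmetic from the counts reported in this review; the variable names are illustrative.

```python
# Counts as reported in the review.
matched = 5_810_113
total_rows = 7_746_724
sh_otus = 609_269
gtdb_placeholders = 40_473

# Each scenario removes one more group of deterministic misses.
denominators = {
    "all rows": total_rows,
    "without SH OTUs": total_rows - sh_otus,
    "without SH OTUs and GTDB placeholders": total_rows - sh_otus - gtdb_placeholders,
}
match_rates = {label: matched / n for label, n in denominators.items()}

for label, rate in match_rates.items():
    print(f"{label}: {rate:.1%}")
```
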
| matchType | Rows | Share |
|---|---|---|
| exact | 4,967,502 | 69.6% |
| variant | 815,950 | 11.4% |
| ambiguous | 26,661 | 0.4% |
| none | 1,292,771 | 18.1% |
| (blank) | 34,571 | 0.5% |
exact + variant + ambiguous = matched;
none + blank = unmatched (blank = matching apparently not attempted).
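
As a sketch, the grouping above can be expressed as a small predicate and checked against the table counts. The helper name `is_matched` is hypothetical; treating a blank matchType as an empty string is an assumption about how the TSV is read.

```python
MATCHED_TYPES = {"exact", "variant", "ambiguous"}

def is_matched(match_type: str) -> bool:
    """exact / variant / ambiguous count as matched; none and blank do not."""
    return (match_type or "").strip() in MATCHED_TYPES

# matchType counts from the table; "" stands for the blank value.
counts = {
    "exact": 4_967_502,
    "variant": 815_950,
    "ambiguous": 26_661,
    "none": 1_292_771,
    "": 34_571,
}
matched = sum(n for t, n in counts.items() if is_matched(t))
unmatched = sum(n for t, n in counts.items() if not is_matched(t))
print(matched, unmatched)
```
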
matchIssues are negligible: only ~13,032 rows file-wide (<0.2%) carry any flag, almost
entirely name-parsing noise (INDETERMINED,
PARTIALLY_PARSABLE_NAME, UNPARSABLE_NAME). Not a
meaningful driver of anything.
| col:status | Rows | Share of matched |
|---|---|---|
| accepted | 3,064,120 | 52.7% |
| synonym | 2,498,133 | 43.0% |
| ambiguous synonym | 124,263 | 2.1% |
| provisionally accepted | 120,967 | 2.1% |
| misapplied | 2,630 | 0.05% |
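About 45% of matches (synonym, ambiguous synonym and misapplied: 2,625,026 rows)
point to a non-accepted COL name, or about 47% if provisionally accepted names
are counted as well. For those, follow col:acceptedID /
col:acceptedScientificName rather than col:ID.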
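
One way to apply this when consuming the file is a small resolver that picks the accepted identifier where it applies. This is a sketch only: it assumes rows are read as dictionaries keyed by the column names used here, treats synonym, ambiguous synonym and misapplied as the statuses that need redirecting, and uses made-up identifiers in the example.

```python
# Statuses assumed to require following the accepted-name pointer.
REDIRECT_STATUSES = {"synonym", "ambiguous synonym", "misapplied"}

def resolved_col_id(row: dict) -> str:
    """Return col:acceptedID for redirected statuses, else col:ID."""
    status = (row.get("col:status") or "").strip()
    accepted_id = (row.get("col:acceptedID") or "").strip()
    if status in REDIRECT_STATUSES and accepted_id:
        return accepted_id
    # Fall back to the matched record itself (accepted names, or no pointer).
    return (row.get("col:ID") or "").strip()

# Hypothetical identifiers, for illustration only.
synonym_row = {"col:status": "synonym", "col:ID": "SYN1", "col:acceptedID": "ACC1"}
accepted_row = {"col:status": "accepted", "col:ID": "ACC2", "col:acceptedID": ""}
print(resolved_col_id(synonym_row), resolved_col_id(accepted_row))
```
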
| Kingdom | Total | No-match | Rate |
|---|---|---|---|
| Animalia | 4,432,182 | 650,880 | 14.7% |
| Plantae | 1,989,382 | 476,264 | 23.9% |
| Fungi | 397,510 | 45,710 | 11.5% |
| Chromista | 205,793 | 88,604 | 43.1% |
| Viruses | 19,641 | 8,156 | 41.5% |
| Protozoa | 13,729 | 4,545 | 33.1% |
| Bacteria | 72,136 | 47,559 | 65.9% |
| Archaea | 4,468 | 3,568 | 79.9% |
| incertae sedis | 2,605 | 2,052 | 78.8% |
Removing the SH OTUs collapses the apparent Fungi problem (60% → 11.5%); the high Bacteria/Archaea rates are ~79% GTDB placeholders (see §0).
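
The ~79% figure can be checked directly from the counts in this review. The sketch below divides the GTDB placeholder counts by the Bacteria and Archaea no-match counts; the per-kingdom split is derived arithmetic, not a number stated elsewhere in the report.

```python
# No-match counts from the kingdom table.
no_match = {"Bacteria": 47_559, "Archaea": 3_568}
# GTDB placeholder counts by kingdom (all are deterministic misses).
gtdb = {"Bacteria": 37_311, "Archaea": 3_162}

# Share of each kingdom's misses that are GTDB placeholders.
gtdb_share = {k: gtdb[k] / no_match[k] for k in no_match}
combined_share = sum(gtdb.values()) / sum(no_match.values())

for kingdom, share in gtdb_share.items():
    print(f"{kingdom}: {share:.1%}")
print(f"combined: {combined_share:.1%}")
```
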
| Rank | Total | No-match | Rate |
|---|---|---|---|
| species | 4,994,322 | 929,605 | 18.6% |
| genus | 547,518 | 146,645 | 26.8% |
| variety | 421,934 | 87,057 | 20.6% |
| subspecies | 380,481 | 82,181 | 21.6% |
| unranked | 666,605 | 50,345 | 7.6% |
| form | 87,936 | 22,881 | 26.0% |
| family | 34,356 | 7,037 | 20.5% |
| order | 3,092 | 1,076 | 34.8% |
| class | 845 | 316 | 37.4% |
| phylum | 347 | 195 | 56.2% |
The no-match rate for unranked drops from 51.7% to 7.6% once SH OTUs are removed.
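
The before/after figures can be reproduced from the table, assuming every SH OTU carries rank unranked (as stated for the SH pattern) and every one is a no-match:

```python
sh_otus = 609_269          # all unranked, all no-matches by construction
unranked_total = 666_605   # unranked rows after removing SH OTUs
unranked_miss = 50_345     # unranked no-matches after removing SH OTUs

# Add the SH OTUs back to both numerator and denominator for the "before" rate.
rate_with_sh = (unranked_miss + sh_otus) / (unranked_total + sh_otus)
rate_without_sh = unranked_miss / unranked_total

print(f"with SH OTUs: {rate_with_sh:.1%}")
print(f"without SH OTUs: {rate_without_sh:.1%}")
```
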
| gbif:status | Total | No-match | Rate |
|---|---|---|---|
| accepted | 3,795,210 | 588,224 | 15.5% |
| synonym | 3,030,793 | 547,560 | 18.1% |
| provisionally accepted | 292,906 | 188,396 | 64.3% |
| ambiguous synonym | 18,546 | 3,162 | 17.0% |
| taxGroup | Total | No-match | Rate |
|---|---|---|---|
| diptera | 434,653 | 28,792 | 6.6% |
| lepidoptera | 553,950 | 49,398 | 8.9% |
| ascomycetes | 250,866 | 27,533 | 11.0% |
| hymenoptera | 385,221 | 44,063 | 11.4% |
| coleoptera | 767,552 | 104,588 | 13.6% |
| arachnids | 212,874 | 30,356 | 14.3% |
| gastropods | 326,327 | 52,095 | 16.0% |
| chordates | 424,423 | 99,245 | 23.4% |
| angiosperms | 1,717,697 | 410,970 | 23.9% |
| pteridophytes | 86,511 | 26,048 | 30.1% |
| bacteria | 72,137 | 47,560 | 65.9% |
| algae | 126,407 | 89,997 | 71.2% |
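
The breakdown tables in this review all have the same shape: total, no-match count and rate per group. A minimal sketch of that aggregation is shown below on a tiny in-memory sample. It assumes the grouping column and `matchType` are literal column headers in the TSV, which is not confirmed here, and the function name is hypothetical.

```python
from collections import Counter

UNMATCHED_TYPES = {"none", ""}  # blank matchType counts as unmatched

def no_match_rates(rows, group_col, match_col="matchType"):
    """Return {group: (total, no_match, rate)} for an iterable of dict rows."""
    totals, misses = Counter(), Counter()
    for row in rows:
        group = row.get(group_col, "")
        totals[group] += 1
        if (row.get(match_col) or "").strip() in UNMATCHED_TYPES:
            misses[group] += 1
    return {g: (totals[g], misses[g], misses[g] / totals[g]) for g in totals}

# Toy rows for illustration; not taken from the real file.
sample = [
    {"taxGroup": "algae", "matchType": "none"},
    {"taxGroup": "algae", "matchType": "exact"},
    {"taxGroup": "diptera", "matchType": "variant"},
    {"taxGroup": "diptera", "matchType": ""},
    {"taxGroup": "diptera", "matchType": "exact"},
]
rates = no_match_rates(sample, "taxGroup")
for group, (total, miss, rate) in sorted(rates.items()):
    print(group, total, miss, f"{rate:.1%}")
```

For the full file, the same function could be fed from `csv.DictReader(f, delimiter="\t")` so that the ~3.9 GB TSV is streamed row by row rather than loaded into memory.
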
Matched rows with a non-accepted col:status: use col:acceptedID, not col:ID, for those.
No-match hotspots include provisionally accepted backbone names, with a 64% miss rate (auto-generated names COL doesn't carry), and incertae sedis (79%).