GBIF Backbone → COL 26.5 XR mapping — analysis

Review of gbif-col265.tsv — 7,746,724 rows, 30 columns, ~3.9 GB. Basic match metrics, matchType / matchIssues distributions, and patterns behind the no-matches.

UNITE SH OTUs excluded. 609,269 rows are UNITE Species Hypothesis OTU identifiers used as names (gbif:scientificName matching ^SH\d+\.\d+FU, e.g. SH220494.07FU, rank unranked). The version suffix differs between GBIF and COL, so all 609,269 are no-matches by construction (100%). They are removed from every metric below. Working total: 7,137,455 rows.
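A minimal sketch of how this exclusion could be applied, assuming rows are available as dictionaries keyed by column name. The pattern is the one quoted above; the helper name `is_sh_otu` and the sample rows are illustrative only.

```python
import re

# Pattern quoted in the text: UNITE Species Hypothesis identifiers used as names.
SH_OTU = re.compile(r"^SH\d+\.\d+FU")

def is_sh_otu(scientific_name):
    """Return True if the name looks like a UNITE SH OTU identifier."""
    return bool(SH_OTU.match(scientific_name or ""))

# Hypothetical sample rows; only the column name gbif:scientificName
# and the first example value come from the text.
rows = [
    {"gbif:scientificName": "SH220494.07FU"},
    {"gbif:scientificName": "Puma concolor"},
]

# Keep everything that is not an SH OTU identifier.
kept = [r for r in rows if not is_sh_otu(r["gbif:scientificName"])]
print(len(kept))
```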

0. Identifier-as-name artefacts

Three families of “names” are really external identifiers. They behave very differently against COL:

| Artefact | Detection | Rows | No-match | Verdict |
|---|---|---|---|---|
| UNITE SH OTUs | ^SH\d+\.\d+FU | 609,269 | 100% | Excluded: version differs (GBIF 09 vs COL 10), can never match. |
| BOLD BINs | ^BOLD: | 617,160 | 0.4% | Kept: present in COL, match almost perfectly. Not a no-match source. |
| GTDB placeholders | kingdom Bacteria/Archaea AND (name has a digit OR ends _<CAPS>) | 40,473 | 100% | Deterministic non-matches; not wanted in COL. |
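The three detection rules can be sketched as a single classifier. This is an illustration under stated assumptions, not the code used for the analysis: "ends _<CAPS>" is read here as an underscore followed by one or more uppercase letters at the end of the name, the name is assumed to be passed without authorship (a year in the authorship would otherwise trip the digit test), and the function name and sample names are hypothetical.

```python
import re

SH_OTU = re.compile(r"^SH\d+\.\d+FU")
BOLD_BIN = re.compile(r"^BOLD:")
HAS_DIGIT = re.compile(r"\d")
# Assumption: "_<CAPS>" means an underscore plus uppercase letters at the end.
CAPS_SUFFIX = re.compile(r"_[A-Z]+$")

def classify_artefact(name, kingdom):
    """Return the artefact family for a name, or None for an ordinary name."""
    if SH_OTU.match(name):
        return "UNITE SH OTU"
    if BOLD_BIN.match(name):
        return "BOLD BIN"
    # The GTDB rule only applies within Bacteria and Archaea.
    if kingdom in ("Bacteria", "Archaea") and (
        HAS_DIGIT.search(name) or CAPS_SUFFIX.search(name)
    ):
        return "GTDB placeholder"
    return None

# Illustrative inputs only.
for name, kingdom in [
    ("SH220494.07FU", "Fungi"),
    ("BOLD:AAA1234", "Animalia"),
    ("Bacillus_A", "Bacteria"),
    ("Escherichia coli", "Bacteria"),
]:
    print(name, "->", classify_artefact(name, kingdom))
```

The SH and BOLD checks run first so that the digits inside those identifiers are never mistaken for the GTDB digit rule.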

1. Match outcome (SH OTUs excluded)

| Outcome | Rows | Share |
|---|---|---|
| Matched to COL | 5,810,113 | 81.4% |
| No match | 1,327,342 | 18.6% |

With the SH OTUs included the match rate is 75.0%; removing them lifts it to 81.4%; additionally removing the 40,473 GTDB placeholders (all deterministic misses) lifts it to 81.9% (5,810,113 / 7,096,982).
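The three rates follow directly from the row counts reported above. A short check, with variable names chosen for illustration:

```python
# Counts taken from the text.
total_rows = 7_746_724
sh_otus = 609_269
gtdb_placeholders = 40_473
matched = 5_810_113

def pct(numerator, denominator):
    """Percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)

with_sh = pct(matched, total_rows)
without_sh = pct(matched, total_rows - sh_otus)
without_sh_gtdb = pct(matched, total_rows - sh_otus - gtdb_placeholders)

print(with_sh, without_sh, without_sh_gtdb)  # 75.0 81.4 81.9
```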

matchType

| matchType | Rows | Share |
|---|---|---|
| exact | 4,967,502 | 69.6% |
| variant | 815,950 | 11.4% |
| ambiguous | 26,661 | 0.4% |
| none | 1,292,771 | 18.1% |
| (blank) | 34,571 | 0.5% |

exact + variant + ambiguous = matched; none + blank = unmatched (blank = matching apparently not attempted).
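The grouping rule can be written as a small predicate and checked against the counts in the table. This sketch assumes matchType values arrive as the lowercase strings shown above and that a blank arrives as an empty string or None; the lowercasing is only defensive.

```python
MATCHED_TYPES = {"exact", "variant", "ambiguous"}

def is_matched(match_type):
    """True for exact/variant/ambiguous; False for none or blank."""
    return (match_type or "").strip().lower() in MATCHED_TYPES

# Row counts per matchType, taken from the table above ("" stands for blank).
counts = {
    "exact": 4_967_502,
    "variant": 815_950,
    "ambiguous": 26_661,
    "none": 1_292_771,
    "": 34_571,
}

matched = sum(n for t, n in counts.items() if is_matched(t))
unmatched = sum(n for t, n in counts.items() if not is_matched(t))
print(matched, unmatched)  # 5810113 1327342
```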

matchIssues

Negligible — only ~13,032 rows file-wide (<0.2%) carry any flag, almost entirely name-parsing noise (INDETERMINED, PARTIALLY_PARSABLE_NAME, UNPARSABLE_NAME). Not a meaningful driver of anything.

2. What matched rows resolve to (col:status)

| col:status | Rows | Share of matched |
|---|---|---|
| accepted | 3,064,120 | 52.7% |
| synonym | 2,498,133 | 43.0% |
| ambiguous synonym | 124,263 | 2.1% |
| provisionally accepted | 120,967 | 2.1% |
| misapplied | 2,630 | 0.05% |

~45% of matches point to a non-accepted COL name (synonym, ambiguous synonym, or misapplied; ~47% if provisionally accepted is also counted). For those, follow col:acceptedID / col:acceptedScientificName rather than col:ID.
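A minimal sketch of that resolution step, assuming each row is a dictionary keyed by the column names mentioned above. The function name, the sample identifiers, and the decision to treat "provisionally accepted" as accepted are assumptions made for illustration.

```python
# Statuses for which the matched name is not itself the accepted taxon.
NON_ACCEPTED = {"synonym", "ambiguous synonym", "misapplied"}

def resolve_accepted_id(row):
    """Return col:acceptedID for non-accepted matches, otherwise col:ID."""
    status = (row.get("col:status") or "").strip().lower()
    accepted_id = row.get("col:acceptedID")
    if status in NON_ACCEPTED and accepted_id:
        return accepted_id
    # Accepted and provisionally accepted names resolve to themselves.
    return row.get("col:ID")

# Hypothetical rows; the identifiers are placeholders, not real COL IDs.
rows = [
    {"col:ID": "ID-1", "col:status": "accepted", "col:acceptedID": ""},
    {"col:ID": "ID-2", "col:status": "synonym", "col:acceptedID": "ID-9"},
]
print([resolve_accepted_id(r) for r in rows])  # ['ID-1', 'ID-9']
```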

3. No-match patterns (SH OTUs excluded)

By kingdom

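| Kingdom | Total | No-match | Rate |
|---|---|---|---|
| Animalia | 4,432,182 | 650,880 | 14.7% |
| Plantae | 1,989,382 | 476,264 | 23.9% |
| Fungi | 397,510 | 45,710 | 11.5% |
| Chromista | 205,793 | 88,604 | 43.1% |
| Viruses | 19,641 | 8,156 | 41.5% |
| Protozoa | 13,729 | 4,545 | 33.1% |
| Bacteria | 72,136 | 47,559 | 65.9% |
| Archaea | 4,468 | 3,568 | 79.9% |
| incertae sedis | 2,605 | 2,052 | 78.8% |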

Removing the SH OTUs collapses the apparent Fungi problem (60% → 11.5%). The high Bacteria/Archaea rates are driven by GTDB placeholders, which account for ~79% of the no-matches in those two kingdoms (40,473 of 51,127; see §0).
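The ~79% figure can be reproduced from the kingdom table and the GTDB count in §0, on the assumption (consistent with the detection rule) that every GTDB placeholder is a no-match row in Bacteria or Archaea:

```python
# No-match counts from the kingdom table; placeholder count from section 0.
bacteria_no_match = 47_559
archaea_no_match = 3_568
gtdb_placeholders = 40_473

prokaryote_no_match = bacteria_no_match + archaea_no_match
share = round(100 * gtdb_placeholders / prokaryote_no_match, 1)
print(prokaryote_no_match, share)  # 51127 79.2
```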

By rank

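| Rank | Total | No-match | Rate |
|---|---|---|---|
| species | 4,994,322 | 929,605 | 18.6% |
| genus | 547,518 | 146,645 | 26.8% |
| variety | 421,934 | 87,057 | 20.6% |
| subspecies | 380,481 | 82,181 | 21.6% |
| unranked | 666,605 | 50,345 | 7.6% |
| form | 87,936 | 22,881 | 26.0% |
| family | 34,356 | 7,037 | 20.5% |
| order | 3,092 | 1,076 | 34.8% |
| class | 845 | 316 | 37.4% |
| phylum | 347 | 195 | 56.2% |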

unranked drops from 51.7% to 7.6% once SH OTUs are removed.
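Both unranked rates can be checked by adding the SH OTUs back in, given that all 609,269 of them are unranked no-matches:

```python
# Unranked counts from the rank table (SH OTUs already excluded).
unranked_total = 666_605
unranked_no_match = 50_345
sh_otus = 609_269  # all unranked, all no-matches

rate_without_sh = round(100 * unranked_no_match / unranked_total, 1)
rate_with_sh = round(
    100 * (unranked_no_match + sh_otus) / (unranked_total + sh_otus), 1
)
print(rate_with_sh, rate_without_sh)  # 51.7 7.6
```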

By gbif:status (strongest remaining signal)

| gbif:status | Total | No-match | Rate |
|---|---|---|---|
| accepted | 3,795,210 | 588,224 | 15.5% |
| synonym | 3,030,793 | 547,560 | 18.1% |
| provisionally accepted | 292,906 | 188,396 | 64.3% |
| ambiguous synonym | 18,546 | 3,162 | 17.0% |
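Breakdowns like the ones in this section can be produced in a single streaming pass over the TSV. The sketch below is one possible approach, not the method used for the analysis. It assumes the file has a header row and that the match type column is literally named `matchType`; the function name and the in-memory sample are illustrative.

```python
import csv
import io
from collections import Counter

MATCHED_TYPES = {"exact", "variant", "ambiguous"}

def no_match_rates(rows, key):
    """Return {group: (total, no_match, rate_percent)} for one grouping column."""
    totals, misses = Counter(), Counter()
    for row in rows:
        group = row.get(key) or "(blank)"
        totals[group] += 1
        # none and blank both count as unmatched.
        if (row.get("matchType") or "") not in MATCHED_TYPES:
            misses[group] += 1
    return {
        g: (totals[g], misses[g], round(100 * misses[g] / totals[g], 1))
        for g in totals
    }

# Tiny in-memory sample standing in for the real file.
sample = (
    "gbif:status\tmatchType\n"
    "accepted\texact\n"
    "accepted\tnone\n"
    "synonym\tvariant\n"
    "provisionally accepted\t\n"
)
reader = csv.DictReader(io.StringIO(sample), delimiter="\t")
rates = no_match_rates(reader, "gbif:status")
print(rates)
```

For the real file, the same function could be fed a `csv.DictReader` opened on gbif-col265.tsv with `newline=""` and `delimiter="\t"`. Because rows are consumed one at a time, the ~3.9 GB file never needs to be held in memory.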

By taxGroup (selected)

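| taxGroup | Total | No-match | Rate |
|---|---|---|---|
| diptera | 434,653 | 28,792 | 6.6% |
| lepidoptera | 553,950 | 49,398 | 8.9% |
| ascomycetes | 250,866 | 27,533 | 11.0% |
| hymenoptera | 385,221 | 44,063 | 11.4% |
| coleoptera | 767,552 | 104,588 | 13.6% |
| arachnids | 212,874 | 30,356 | 14.3% |
| gastropods | 326,327 | 52,095 | 16.0% |
| chordates | 424,423 | 99,245 | 23.4% |
| angiosperms | 1,717,697 | 410,970 | 23.9% |
| pteridophytes | 86,511 | 26,048 | 30.1% |
| bacteria | 72,137 | 47,560 | 65.9% |
| algae | 126,407 | 89,997 | 71.2% |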

4. Takeaways