occurrenceID	taxonID	catalogNumber	collectionCode	institutionCode	typeStatus	verbatimLabel	sex	individualCount	eventDate	recordedBy	recordNumber	decimalLatitude	decimalLongitude	minimumElevationInMeters	maximumElevationInMeters	minimumDepthInMeters	maximumDepthInMeters	country	stateProvince	municipality	locality	references	associatedOccurrences	associatedReferences	associatedSequences	basisOfRecord
442187AFFFCCFFD22869F8D5FA520388.mc.7CE03CE4FFCCFFD22889F8EDFB9B0347	442187AFFFCCFFD22869F8D5FA520388.taxon			PERL		In order to identify class II defensin sequences, we designed a semiautomatic pipeline (Fig. 1). For that, initially all proteins from the A. thaliana Uniprot database were downloaded. The dataset consisted of 86,486 sequences (March 2017). From this dataset, 387 sequences were retrieved by using regular expression (RegEx) search (step 2, Fig. 1). From these, 285 had up to 130 amino acid residues (step 3, Fig. 1). This criterion allows eventual larger C-terminal prodomains to be identified. Then, we used a PERL		1	2017-03													https://treatment.plazi.org/id/442187AFFFCCFFD22869F8D5FA520388#7CE03CE4FFCCFFD22889F8EDFB9B0347				MaterialCitation
442187AFFFCCFFD22869F8D5FA520388.mc.7CE03CE4FFCCFFD22C04FC25FA5E0388	442187AFFFCCFFD22869F8D5FA520388.taxon	A7, REG2, REG4		A, REG		script to select the sequences with the following flags: hypothetical, unknown, unnamed and / or uncharacterized (step 4, Fig. 1), resulting in 15 sequences. From 15 sequences, seven were incomplete and therefore were discarded, (step 5, Fig. 1). From the remaining sequences, two sequences without signal peptide or with transmembrane domains were discarded (step 6, Fig. 1). Finally, from six remaining sequences, two sequences with a potential C-Terminal prodomain were selected, with accession codes A 7 REG 2 and A 7 REG 4		1														https://treatment.plazi.org/id/442187AFFFCCFFD22869F8D5FA520388#7CE03CE4FFCCFFD22C04FC25FA5E0388				MaterialCitation
