Difference between revisions of "ENA Submission Pipeline"

From BGBM Collection Workflows
(Add Annotations to Sequences)
(Most common software errors)
 
(43 intermediate revisions by the same user not shown)
Line 1: Line 1:
 
'''Attention: BGBM submissions to ENA including registration of new names should only be done by authorised staff! Please contact the DNA Bank if you want to submit sequences SIX weeks ahead of paper submission!'''
 
'''Attention: BGBM submissions to ENA including registration of new names should only be done by authorised staff! Please contact the DNA Bank if you want to submit sequences SIX weeks ahead of paper submission!'''
 
The following guide shall help you to prepare all required data and to understand the complexity of sequence submissions. The submission itself will be done by authorised staff only!
 
The following guide shall help you to prepare all required data and to understand the complexity of sequence submissions. The submission itself will be done by authorised staff only!
 
'''!To Do: Text includes a Mix of German and English !'''
 
  
  
Line 8: Line 6:
 
==INPUT required for any submission==
 
==INPUT required for any submission==
  
# DNA alignment in NEXUS format with annotations following the INSDC vocabulary
+
# DNA alignment in NEXUS format with annotations following the [http://www.insdc.org/files/feature_table.html INSDC vocabulary]
 +
## most common errors: wrong coordinate syntax, use of umlauts or special characters such as °
 
# Sample metadata in CSV format following the BGBM standard (TODO: description of this standard)
 
# Sample metadata in CSV format following the BGBM standard (TODO: description of this standard)
 
# List of standardized scientific names
 
# List of standardized scientific names
Line 14: Line 13:
  
 
==Check all taxon names against NCBI Taxonomy==
 
==Check all taxon names against NCBI Taxonomy==
Option A: Use the NCBI tool: https://www.ncbi.nlm.nih.gov/Taxonomy/TaxIdentifier/tax_identifier.cgi
+
Use the NCBI tool: https://www.ncbi.nlm.nih.gov/Taxonomy/TaxIdentifier/tax_identifier.cgi
 
 
Option B: Use OpenRefine: TODO: documentation
 
  
 
* Names must follow the IPNI standard (except for unidentified organisms and newly described taxa). The most common mistake is a blank in the authorship (e.g. H. Karst); IPNI always writes it without blanks (H.Karst). BGBM follows this standard in all its collection databases.

* Names must follow the IPNI standard (except for unidentified organisms and newly described taxa). The most common mistake is a blank in the authorship (e.g. H. Karst); IPNI always writes it without blanks (H.Karst). BGBM follows this standard in all its collection databases.

* Names not found at ENA/NCBI Taxonomy must be registered at ENA (by BGBM DNA Bank!). Note that you will only find names associated with <u>published</u> sequences!

* Names not found at ENA/NCBI Taxonomy must be registered at ENA (by BGBM DNA Bank!). Note that you will only find names associated with <u>published</u> sequences!

* ENA has rules for the registration of new names that must be followed. BGBM documents those name submissions centrally so that sequence records can be updated after publication of new names.

* ENA has rules for the registration of new names that must be followed. BGBM documents those name submissions centrally so that sequence records can be updated after publication of new names.
** Unidentified or unpublished names must be unique, e.g. "Minuartia sp. G.Parolly et al. 15015" instead of "Minuartia spec. 1"
+
** Unidentified or unpublished names must be unique, e.g. "Minuartia sp. DB 12843" instead of "Minuartia spec. 1"
  
'''Attention: the registration of new names might require up to two weeks; make sure ALL names are checked carefully and submitted in good time'''
+
'''Attention: the registration of new names might require up to six weeks; make sure ALL names are checked carefully and submitted in good time'''
  
 
==Add Annotations to Sequences==
 
==Add Annotations to Sequences==
 
* Required Software: Phyde, Geneious, any common text editor (e.g. Geany, Notepad++, Textpad)

* Required Software: Phyde, Geneious, any common text editor (e.g. Geany, Notepad++, Textpad)
 
* Video Tutorial: https://www.youtube.com/watch?v=CY1e2RkULas
 
* Video Tutorial: https://www.youtube.com/watch?v=CY1e2RkULas
* '''Attention: ENA requires use of certain vocabulary for annotations see Annotations'''
+
* '''Attention: ENA requires use of certain vocabulary for annotations see [[Internal:ENA Annotations | ENA Annotations]]'''
  
 
Export your file into Nexus format!
 
Export your file into Nexus format!
In the NEXUS file, annotations show up at the end in the following format:
+
In the NEXUS file, annotations show up at the end of the file in the following format. Annotation names must not contain spaces!
 
<pre>
 
<pre>
 
BEGIN SETS;
 
BEGIN SETS;
 
charset 18S_rRNA = 1-378;
 
charset 18S_rRNA = 1-378;
charset 18S gene = 1-378;
+
charset 18S_gene = 1-378;
  
 
END;
 
END;
Line 48: Line 45:
 
## todo: add example requests

## todo: add example requests
  
Email from Carol Hotton (NCBI): Here’s the GenBank format for non-canonical names:
+
===Mandatory conventions for non-canonical names===
 
* Sensu stricto names are treated as canonical, e.g.: Arabis hirsuta s. str. = Arabis hirsuta
 
* Sensu stricto names are treated as canonical, e.g.: Arabis hirsuta s. str. = Arabis hirsuta
 
* Sensu lato names are treated as cf., e.g.: Festuca ovina s. l. = Festuca cf. ovina  
 
* Sensu lato names are treated as cf., e.g.: Festuca ovina s. l. = Festuca cf. ovina  
* Cf. (and aff.) names we treat as non-unique, to which a unique string is attached, e.g.: Festuca cf. ovina GD-2019
+
* Cf. (and aff.) names we treat as non-unique, to which a unique string is attached, e.g.: Festuca cf. ovina DB 12342
 
* Hybrid formulas repeat the genus, i.e.: Carex cespitosa x Carex nigra
 
* Hybrid formulas repeat the genus, i.e.: Carex cespitosa x Carex nigra
* We use an ‘x’ rather than a multiplication sign in hybrid names because not all applications can handle non-ASCII characters (as you have done, so that’s not a problem).
+
* Use an ‘x’ rather than a multiplication sign in hybrid names, because not all applications can handle non-ASCII characters
  
 
==Create project at ENA to get project number==
 
==Create project at ENA to get project number==
Line 60: Line 57:
 
# Fill out relevant information (Short name, Title of study, Abstract, Release date)
 
# Fill out relevant information (Short name, Title of study, Abstract, Release date)
  
==Checking the correspondence between DNA alignment and metadata==
+
==Converting the DNA alignment (NEXUS file) to an EMBL flat file and integrating the metadata==
Required software: any spreadsheet software (e.g. LibreOffice), any text editor (e.g. Geany)
+
Required Software: https://github.com/michaelgruenstaeudl/annonex2embl

# The sequence names of the DNA alignment must exactly match the entries in one of the columns of the metadata file.
+
===Metadata File===
# The metadata file may contain more entries than the DNA alignment, but the DNA alignment must not contain more entries than the metadata file.
+
* MUST NOT contain any empty columns or rows
 +
* '''Comma''' is the only accepted '''delimiter''' (Windows -> change region to English (USA) -> CSV export is then done with comma instead of semicolon)
 +
* '''DATE''' must be converted into '''YYYY-MM-DD'''. ''Attention: when you do the export in Windows, dates will be converted into US format!''
 +
* '''Coordinates''' must follow this syntax: "47.94 N 28.12 W" (no commas, keep spaces!)
 +
* MUST NOT contain any umlaut or "strange" character such as °
 +
* Column headers must match the [http://www.insdc.org/files/feature_table.html INSDC feature table]
 +
* for '''matK''' use '''organelle = plastid'''; for others column "organelle" is not needed
  
==Step 6 DNA Bank: Converting the DNA alignment from NEXUS format to flat-file format, integrating the metadata==

Required software: https://github.com/michaelgruenstaeudl/annonex2embl


===Metadata file===
+
===Most common software errors===
* must not contain empty rows
+
{|class="wikitable"
* must be comma-separated (Windows -> change region to English (USA) -> CSV export then uses comma instead of semicolon)
+
!Error
* must not contain umlauts
+
!Solution
* column names must match the INSDC features
+
|-
* matK requires organelle = plastid
+
|annonex2embl ERROR: expected string or bytes-like object
 +
|CSV parsing error in the metadata file, typically caused by " in strings or by umlauts and accented characters (ó, é, á, í etc.); replace all of them with "regular" characters
 +
|-
 +
|annonex2embl ERROR with qualifiers of `SAL169`: list index out of range
 +
|sample metadata missing, add data to metadata.csv
 +
|-
 +
| annonex2embl ERROR: Not enough taxa in matrix.
 +
| DIMENSIONS NTAX= doesn't match number of sequences in nexus file, update number in nexus file
 +
|-
 +
| annonex2embl ERROR: csv-file does not contain a column labelled isolate.
 +
| You have probably forgotten that ";" is NOT supported as a delimiter; see above for how to export a CSV with "," as delimiter
 +
|-
 +
| ValueError: End location (1410) must be greater than or equal to start location (1417)
 +
| CharSet error in the NEXUS file, e.g. 1874-2222; 2222-2335; instead of 1874-2221; 2222-2335;
 +
|}
  
 
<pre>
 
<pre>
Line 80: Line 95:
 
INPUT=examples/DNA_Alignment.nex
 
INPUT=examples/DNA_Alignment.nex
 
METAD=examples/Metadata.csv
 
METAD=examples/Metadata.csv
DESCR="description_of_alignment"
+
DESCR="description_of_alignment"  # note: spaces are unfortunately not yet supported!
 
EMAIL=your_email_here@bgbm.org
 
EMAIL=your_email_here@bgbm.org
AUTHR="Your_name_here"
+
AUTHR="Your_name_here"  # note: spaces are unfortunately not yet supported!
  
 
annonex2embl -n $INPUT -c $METAD -o ${INPUT%.nex*}.embl -d $DESCR -e $EMAIL -a $AUTHR
 
annonex2embl -n $INPUT -c $METAD -o ${INPUT%.nex*}.embl -d $DESCR -e $EMAIL -a $AUTHR
Line 89: Line 104:
 
SET INPUT=examples/DNA_Alignment.nex
 
SET INPUT=examples/DNA_Alignment.nex
 
SET METAD=examples/Metadata.csv
 
SET METAD=examples/Metadata.csv
SET DESCR="description_of_alignment"
+
REM note: spaces are unfortunately not yet supported in DESCR!
SET DESCR="description_of_alignment"
 
SET EMAIL=your_email_here@bgbm.org
 
SET EMAIL=your_email_here@bgbm.org
SET AUTHR="Your_name_here"
+
REM note: spaces are unfortunately not yet supported in AUTHR!
SET AUTHR="Your_name_here"
  
 
annonex2embl -n %INPUT% -c %METAD% -o output.embl -d %DESCR% -e %EMAIL% -a %AUTHR%
 
annonex2embl -n %INPUT% -c %METAD% -o output.embl -d %DESCR% -e %EMAIL% -a %AUTHR%
 
</pre>
 
</pre>
  
Examples for DESCR:
+
Examples for DESCR: '''note that spaces are unfortunately not yet supported, so use underscores here and replace them with spaces in the next step'''
 
* ITS: "18S rRNA gene (partial), ITS1, 5.8S rRNA gene, ITS2 and 28S rRNA gene (partial)"
 
* ITS: "18S rRNA gene (partial), ITS1, 5.8S rRNA gene, ITS2 and 28S rRNA gene (partial)"
 
* trnLF: "tRNA-Leu (trnL) gene, partial sequence; trnL-trnF intergenic spacer, complete sequence; and tRNA-Phe (trnF) gene"
 
* trnLF: "tRNA-Leu (trnL) gene, partial sequence; trnL-trnF intergenic spacer, complete sequence; and tRNA-Phe (trnF) gene"
Line 102: Line 117:
 
* rpl16: "rpl16 intron, partial sequence"
 
* rpl16: "rpl16 intron, partial sequence"
 
* 18S: "partial 18S rRNA gene"
 
* 18S: "partial 18S rRNA gene"
 +
* rbcL: "partial rbcL gene"
 +
* petD: "petB gene, partial sequence; petD gene and intron, partial sequence"
 +
 +
===Check EMBL File before you continue===
 +
* first row '''mol_type''' formatted correctly?
 +
** WRONG: ID  D257_023a; SV 1; linear; '''DNA'''; ; PLN; 378 BP.
 +
** CORRECTED: ID  D257_023a; SV 1; linear; '''genomic DNA'''; ; PLN; 378 BP.
 +
* all dates in ISO format?
 +
* Identifiers do not have to be changed
 +
* "isolate DB" to be replaced by "isolate DB "
 +
* "isolate=DB" to be replaced by "isolate=DB "
 +
* 58S etc. to be replaced by "5 8S" etc.
 +
* replace underscores in descriptions by spaces
  
===Checking the EMBL file===
+
==Validating EMBL flatfile==
* first row: mol_type transferred correctly?
+
* Required Software: https://github.com/enasequence/webin-cli/releases  -> download latest jar file
* dates in ISO format?
+
* '''Mandatory: Manifest file''', file ending ".manifest", example file can be found at https://kb.bgbm.org/share/s/zWNjSrM1RyW1DNyNb5c15Q
* identifiers can stay as they are
+
* '''Mandatory: create output folder''', otherwise the software will tell you it can't write to the output folder
* replace "isolate DB" with "isolate DB "
+
* The EMBL Validator will provide you with a report
* replace "isolate=DB" with "isolate=DB "
+
* Most common errors: wrong syntax for coordinates, unregistered names, typos in the isolate name, i.e. the first part of a sequence row in the NEXUS file (e.g. "annonex2embl ERROR with qualifiers of `AmCoc01`: list index out of range" means that AmCoc01 was found in the NEXUS file but not in the metadata file)
* replace description_of_alignment

* replace 58S etc.


==Validating the EMBL flatfile==

Required software: https://mvnrepository.com/artifact/uk.ac.ebi.ena.sequence/embl-api-validator  -> choose a version and then download the jar
 
  
 
<pre>
 
<pre>
  ##LINUX##
+
  ##LINUX and WINDOWS##
INPUT=examples/DNA_Alignment.embl
+
  java -jar ~/Downloads/webin-cli-3.5.0.jar -context sequence -userName TOPSECRET -password EVENMORESECRET -manifest ~/Desktop/ENA/YOUR_FOLDER/MANIFEST_FILE.manifest -outputDir ~/Desktop/ENA/YOUR_FOLDER/output -inputDir ~/Desktop/ENA/YOUR_FOLDER/ -validate
  java -jar embl-api-validator-1.1.265.jar $INPUT ###Check current version number!!!
+
</pre>
  
  ##WINDOWS##
+
<pre>
  SET INPUT=examples/DNA_Alignment.embl
+
  ##Example Manifest file##
  java -jar embl-api-validator-1.1.265.jar %INPUT% ###Check current version number!!!
+
  STUDY PRJEB42197 // see above, you'll need to register a study as a very first step
 +
NAME partial 18S rRNA gene // same as used in DESCR parameter before, but spaces are fine this time
 +
  FLATFILE Cocconeis_crawfordii_18SV4.embl.gz // the zipped embl file name
 
</pre>
 
</pre>
  
==Choosing the type of submission==
+
==Testing submission==
Option A: interactive/GUI-based submission; described in steps 9a and 9b
+
If validation was successful you can test the submission ('''NEVER SKIP THIS STEP''')


Option B: programmatic/command-line-based submission; described in step 10



==9a Converting the EMBL flatfile to an ENA checklist==

Required software: EMBL2checklists (Gruenstaeudl & Hartmaring 2019)
 
 
 
See details here: https://www.protocols.io/view/usage-of-embl2checklists-v6me9c6
 
  
 
<pre>
 
<pre>
  ##LINUX##
+
  ##LINUX and WINDOWS##
  INPUT=examples/DNA_Alignment.embl
+
  java -jar ~/Downloads/webin-cli-3.5.0.jar -context sequence -userName TOPSECRET -password EVENMORESECRET -manifest ~/Desktop/ENA/YOUR_FOLDER/MANIFEST_FILE.manifest -outputDir ~/Desktop/ENA/YOUR_FOLDER/output -inputDir ~/Desktop/ENA/YOUR_FOLDER/ -test -submit
CHOSEN_CHECKLIST="trnK_matK"
 
 
 
EMBL2checklists_CLI -i $INPUT -o ${INPUT%.embl*}.tsv -c $CHOSEN_CHECKLIST -e no
 
 
 
##WINDOWS##
 
SET INPUT=examples/DNA_Alignment.embl
 
SET CHOSEN_CHECKLIST="trnK_matK"
 
 
 
EMBL2checklists_CLI -i %INPUT% -o output.tsv -c %CHOSEN_CHECKLIST% -e no
 
 
</pre>
 
</pre>
  
==9b Uploading the ENA checklist, specifying the project number==
+
==Final submission==
# Log in at ENA's Webin
+
If the test submission was successful you can do the final submission
# Click on "Submit other assembled and annotated sequences [formerly EMBL-Bank]" --> Next

# Select the correct study (namely the one created in step 1) --> Next

# Click on "Submit Completed Spreadsheet" and select the file created in step 8a --> Next


==10a DNA Bank: Preparing the XML files==

# compress the EMBL file and determine its checksum
 
 
<pre>
 
<pre>
gzip Akhani-et-al_Willdenowia_Tamarix_trnG-S_ENA-submission.embl
+
##LINUX and WINDOWS##
md5sum Akhani-et-al_Willdenowia_Tamarix_trnG-S_ENA-submission.embl.gz
+
java -jar ~/Downloads/webin-cli-3.5.0.jar -context sequence -userName TOPSECRET -password EVENMORESECRET -manifest ~/Desktop/ENA/YOUR_FOLDER/MANIFEST_FILE.manifest -outputDir ~/Desktop/ENA/YOUR_FOLDER/output -inputDir ~/Desktop/ENA/YOUR_FOLDER/ -submit
 
</pre>
 
</pre>
# Two XML files are required: *_submission.xml and *_analysis.xml

# Fill in the *_analysis.xml file; the STUDY_REF accession is the number of the study that we created in step 1!

# the *_submission.xml file contains only an empty schema


==10b DNA Bank: Uploading the files via FTP and command line==
+
The command line will report the accession assigned to the submission, e.g.:
# ENA's FTP server can either be accessed directly from the command line or with any FTP program. In either case the ENA Webin login credentials are required
+
 
<pre>
+
INFO : The submission has been completed successfully. The following analysis accession was assigned to the submission: ERZ1740481
ftp webin.ebi.ac.uk  ##Login
 
Type bin to use binary mode.
 
Type ls command to check the content of your drop box.
 
Type prompt to switch off confirmation for each file uploaded.
 
Use mput command to upload files. ##upload the following: the *.gz file, the *_submission.xml and the *_analysis.xml

Use bye command to exit the ftp client.
 
</pre>
 
# Test submission on the ENA test server (ALWAYS do this before the real submission!)
 
<pre>curl -u Webin-NUMMER:TOPSECRETPASSWORD -F "SUBMISSION=@Akhani-et-al_Willdenowia_Tamarix_trnG-S_ENA-submission.xml" -F "ANALYSIS=@Akhani-et-al_Willdenowia_Tamarix_trnG-S_ENA-analysis.xml" "https://wwwdev.ebi.ac.uk/ena/submit/drop-box/submit/" > Akhani-et-al_Willdenowia_Tamarix_trnG-S_Test.xml</pre>
 
# Check the output of the test -> *_Test.xml

# If the test was successful -> submission on the ENA server
 
<pre>curl -u Webin-NUMMER:TOPSECRETPASSWORD -F "SUBMISSION=@Akhani-et-al_Willdenowia_Tamarix_trnG-S_ENA-submission.xml" -F "ANALYSIS=@Akhani-et-al_Willdenowia_Tamarix_trnG-S_ENA-analysis.xml" "https://www.ebi.ac.uk/ena/submit/drop-box/submit/" > Akhani-et-al_Willdenowia_Tamarix_trnG-S_Test.xml</pre>
 
  
'''Done!'''
+
'''You will NOT receive a confirmation email, but the submission will be listed in the Webin user interface.'''

Latest revision as of 13:48, 23 July 2021

Attention: BGBM submissions to ENA including registration of new names should only be done by authorised staff! Please contact the DNA Bank if you want to submit sequences SIX weeks ahead of paper submission! The following guide shall help you to prepare all required data and to understand the complexity of sequence submissions. The submission itself will be done by authorised staff only!


1 Researchers

1.1 INPUT required for any submission

  1. DNA alignment in NEXUS format with annotations following the INSDC vocabulary
    1. most common errors: wrong coordinate syntax, use of umlauts or special characters such as °
  2. Sample metadata in CSV format following the BGBM standard (TODO: description of this standard)
  3. List of standardized scientific names
  4. Project metadata (short name, title of study, abstract, release date, authors/contributors)

1.2 Check all taxon names against NCBI Taxonomy

Use the NCBI tool: https://www.ncbi.nlm.nih.gov/Taxonomy/TaxIdentifier/tax_identifier.cgi

  • Names must follow the IPNI standard (except for unidentified organisms and newly described taxa). The most common mistake is a blank in the authorship (e.g. H. Karst); IPNI always writes it without blanks (H.Karst). BGBM follows this standard in all its collection databases.
  • Names not found at ENA/NCBI Taxonomy must be registered at ENA (by BGBM DNA Bank!). Note that you will only find names associated with published sequences!
  • ENA has rules for the registration of new names that must be followed. BGBM documents those name submissions centrally so that sequence records can be updated after publication of new names.
    • Unidentified or unpublished names must be unique, e.g. "Minuartia sp. DB 12843" instead of "Minuartia spec. 1"

Attention: the registration of new names might require up to six weeks; make sure ALL names are checked carefully and submitted in good time
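
A quick way to spot the blank-in-authorship mistake before names are handed over is a pattern search over the name list. The following is only a sketch: the function name `flag_author_blanks` and the one-name-per-line text file are assumptions, and the pattern is a heuristic that flags candidates for manual review; it is not an IPNI validation and will not catch every case.

```shell
# Hypothetical helper: print (with line numbers) every name that contains a
# single initial followed by a blank and a capital letter, e.g. "H. Karst",
# which IPNI would write without the blank ("H.Karst").
# Returns non-zero if nothing was flagged (grep's usual behaviour).
flag_author_blanks() {
  grep -nE '(^|[ (])[A-Z]\. [A-Z]' "$1"
}
```

Usage: `flag_author_blanks names.txt`, where `names.txt` is an assumed file name. Abbreviations such as "sp." are not flagged because the pattern only looks at single capital initials.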

1.3 Add Annotations to Sequences

Export your file into NEXUS format! In the NEXUS file, annotations show up at the end of the file in the following format. Annotation names must not contain spaces!

BEGIN SETS;
charset 18S_rRNA = 1-378;
charset 18S_gene = 1-378;

END;
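
The no-spaces rule for annotation names can be checked mechanically before the file is passed on. This is a minimal sketch, assuming one charset statement per line as in the example above; the function name `check_charset_names` is made up for illustration.

```shell
# Hypothetical helper: list charset lines whose name consists of more than
# one token (e.g. "charset 18S gene = 1-378;") and return 1 if any exist.
check_charset_names() {
  # "charset", then a token, then whitespace, then another token before "=".
  if grep -niE '^[[:space:]]*charset[[:space:]]+[^=[:space:]]+[[:space:]]+[^=[:space:]]' "$1"; then
    return 1
  fi
  return 0
}
```

Usage: `check_charset_names DNA_Alignment.nex && echo "charset names OK"`. Offending lines are printed with their line numbers so they can be fixed in a text editor.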

2 DNA Bank

2.1 Registration of all new names at NCBI/ENA

  1. Login at ENA's Webin
  2. Click on "Taxonomy Check/Request" --> Next
  3. Enter organism name / upload list of names
    1. Save list of names as txt file (one per row) or enter individually
    2. todo: add example requests

2.1.1 Mandatory conventions for non-canonical names

  • Sensu stricto names are treated as canonical, e.g.: Arabis hirsuta s. str. = Arabis hirsuta
  • Sensu lato names are treated as cf., e.g.: Festuca ovina s. l. = Festuca cf. ovina
  • Cf. (and aff.) names we treat as non-unique, to which a unique string is attached, e.g.: Festuca cf. ovina DB 12342
  • Hybrid formulas repeat the genus, i.e.: Carex cespitosa x Carex nigra
  • Use an ‘x’ rather than a multiplication sign in hybrid names, because not all applications can handle non-ASCII characters

2.2 Create project at ENA to get project number

  1. Login at ENA's Webin
  2. Click on "Register study (project)" --> Next
  3. Fill out relevant information (Short name, Title of study, Abstract, Release date)

2.3 Converting the DNA alignment (NEXUS file) to an EMBL flat file and integrating the metadata

Required Software: https://github.com/michaelgruenstaeudl/annonex2embl

2.3.1 Metadata File

  • MUST NOT contain any empty columns or rows
  • Comma is the only accepted delimiter (Windows -> change region to English (USA) -> CSV export is then done with comma instead of semicolon)
  • DATE must be converted into YYYY-MM-DD. Attention: when you do the export in Windows, dates will be converted into US format!
  • Coordinates must follow this syntax: "47.94 N 28.12 W" (no commas, keep spaces!)
  • MUST NOT contain any umlaut or "strange" character such as °
  • Column headers must match the INSDC feature table
  • for matK use organelle = plastid; for others column "organelle" is not needed
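
Several of the rules above can be pre-checked with standard command-line tools before annonex2embl is run. The sketch below is deliberately naive: it assumes a simple CSV without quoted commas, and it assumes the date column is called `collection_date` (adjust this to the actual header in your file). The function name `check_metadata_csv` is hypothetical, and coordinates and empty columns are not checked.

```shell
# Hypothetical pre-flight check for the metadata CSV.
# Assumptions: no quoted commas; date column named "collection_date".
check_metadata_csv() {
  local csv="$1" status=0

  # 1. The header must be comma-delimited, not semicolon-delimited.
  if head -n 1 "$csv" | grep -q ';'; then
    echo "ERROR: header contains ';', export the CSV with ',' as delimiter"
    status=1
  fi

  # 2. No empty rows (blank lines or rows consisting only of delimiters).
  if grep -nE '^[,;[:space:]]*$' "$csv"; then
    echo "ERROR: empty rows found (line numbers above)"
    status=1
  fi

  # 3. No umlauts, accents, degree signs or other non-ASCII bytes.
  if LC_ALL=C grep -naE '[^[:print:][:space:]]' "$csv"; then
    echo "ERROR: non-ASCII characters found (line numbers above)"
    status=1
  fi

  # 4. Dates must be YYYY-MM-DD.
  if ! awk -F',' '
      { sub(/\r$/, "") }   # tolerate Windows line endings
      NR == 1 { for (i = 1; i <= NF; i++) if ($i == "collection_date") col = i; next }
      col && $col != "" && $col !~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$/ {
        printf "ERROR: line %d: date \"%s\" is not YYYY-MM-DD\n", NR, $col
        bad = 1
      }
      END { exit bad + 0 }
    ' "$csv"; then
    status=1
  fi

  return "$status"
}
```

Usage: `check_metadata_csv examples/Metadata.csv`. A return value of 0 only means that none of these four checks fired, not that the file is guaranteed to be accepted.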


2.3.2 Most common software errors

Error | Solution
annonex2embl ERROR: expected string or bytes-like object | CSV parsing error in the metadata file, typically caused by " in strings or by umlauts and accented characters (ó, é, á, í etc.); replace all of them with "regular" characters
annonex2embl ERROR with qualifiers of `SAL169`: list index out of range | sample metadata missing, add data to metadata.csv
annonex2embl ERROR: Not enough taxa in matrix. | DIMENSIONS NTAX= doesn't match the number of sequences in the NEXUS file, update the number in the NEXUS file
annonex2embl ERROR: csv-file does not contain a column labelled isolate. | You have probably forgotten that ";" is NOT supported as a delimiter; see above for how to export a CSV with "," as delimiter
ValueError: End location (1410) must be greater than or equal to start location (1417) | CharSet error in the NEXUS file, e.g. 1874-2222; 2222-2335; instead of 1874-2221; 2222-2335;

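
The NTAX mismatch can be detected before running the conversion. The following is a rough sketch only: it assumes a non-interleaved NEXUS file in which `MATRIX` and the terminating `;` each stand on their own line, and the function name `check_ntax` is invented for this example.

```shell
# Hypothetical helper: compare the declared NTAX with the number of
# sequence rows between "MATRIX" and the closing ";".
# Assumes a non-interleaved matrix with one sequence per row.
check_ntax() {
  awk '
    { low = tolower($0) }
    low ~ /ntax *= *[0-9]+/ {
      s = low
      sub(/.*ntax *= */, "", s)   # drop everything up to the number
      sub(/[^0-9].*/, "", s)      # drop everything after the number
      declared = s + 0
    }
    in_matrix && low ~ /^[ \t]*;/ { in_matrix = 0 }
    in_matrix && NF >= 2          { rows++ }
    low ~ /^[ \t]*matrix[ \t\r]*$/ { in_matrix = 1 }
    END {
      printf "declared=%d found=%d\n", declared, rows
      exit (declared == rows ? 0 : 1)
    }
  ' "$1"
}
```

Usage: `check_ntax examples/DNA_Alignment.nex`. If the two numbers differ, update `NTAX=` in the NEXUS file as described in the table.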
##LINUX##
INPUT=examples/DNA_Alignment.nex
METAD=examples/Metadata.csv
DESCR="description_of_alignment"  # note: spaces are unfortunately not yet supported!
EMAIL=your_email_here@bgbm.org
AUTHR="Your_name_here"  # note: spaces are unfortunately not yet supported!

annonex2embl -n "$INPUT" -c "$METAD" -o "${INPUT%.nex*}.embl" -d "$DESCR" -e "$EMAIL" -a "$AUTHR"

##WINDOWS##
REM note: spaces are unfortunately not yet supported in DESCR and AUTHR!
SET INPUT=examples/DNA_Alignment.nex
SET METAD=examples/Metadata.csv
SET DESCR="description_of_alignment"
SET EMAIL=your_email_here@bgbm.org
SET AUTHR="Your_name_here"

annonex2embl -n %INPUT% -c %METAD% -o output.embl -d %DESCR% -e %EMAIL% -a %AUTHR%

Examples for DESCR: note that spaces are unfortunately not yet supported, so use underscores here and replace them with spaces in the next step

  • ITS: "18S rRNA gene (partial), ITS1, 5.8S rRNA gene, ITS2 and 28S rRNA gene (partial)"
  • trnLF: "tRNA-Leu (trnL) gene, partial sequence; trnL-trnF intergenic spacer, complete sequence; and tRNA-Phe (trnF) gene"
  • trnKmatK: "tRNA-Lys (trnK) gene and intron, partial sequence; maturase K (matK) gene, complete cds; psbA gene, partial sequence"
  • rpl16: "rpl16 intron, partial sequence"
  • 18S: "partial 18S rRNA gene"
  • rbcL: "partial rbcL gene"
  • petD: "petB gene, partial sequence; petD gene and intron, partial sequence"

2.3.3 Check EMBL File before you continue

  • first row mol_type formatted correctly?
    • WRONG: ID D257_023a; SV 1; linear; DNA; ; PLN; 378 BP.
    • CORRECTED: ID D257_023a; SV 1; linear; genomic DNA; ; PLN; 378 BP.
  • all dates in ISO format?
  • Identifiers do not have to be changed
  • "isolate DB" to be replaced by "isolate DB "
  • "isolate=DB" to be replaced by "isolate=DB "
  • 58S etc. to be replaced by "5 8S" etc.
  • replace underscores in descriptions by spaces
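
Two of the corrections above lend themselves to a scripted search-and-replace. The sketch below covers only the mol_type fix in the ID line and the underscores in the description; it assumes that the description text ends up in the `DE` lines of the flat file (an assumption about the annonex2embl output that should be verified on a real file). The function name `fix_embl` is hypothetical. The isolate and 58S replacements are left for manual editing.

```shell
# Hypothetical helper: write a corrected copy of an EMBL flat file to stdout.
#  - ID line: "linear; DNA;" becomes "linear; genomic DNA;"
#  - DE lines (assumed to hold the description): underscores become spaces
fix_embl() {
  sed -E \
    -e '/^ID /s/; linear; DNA;/; linear; genomic DNA;/' \
    -e '/^DE /s/_/ /g' \
    "$1"
}
```

Usage: `fix_embl examples/DNA_Alignment.embl > examples/DNA_Alignment_fixed.embl`, where the output file name is only an example. Writing to a new file keeps the original for comparison; underscores in identifiers such as `D257_023a` are untouched because only `DE` lines are edited.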

2.4 Validating EMBL flatfile

  • Required Software: https://github.com/enasequence/webin-cli/releases -> download latest jar file
  • Mandatory: Manifest file, file ending ".manifest", example file can be found at https://kb.bgbm.org/share/s/zWNjSrM1RyW1DNyNb5c15Q
  • Mandatory: create output folder, otherwise the software will tell you it can't write to the output folder
  • The EMBL Validator will provide you with a report
  • Most common errors: wrong syntax for coordinates, unregistered names, typos in the isolate name, i.e. the first part of a sequence row in the NEXUS file (e.g. "annonex2embl ERROR with qualifiers of `AmCoc01`: list index out of range" means that AmCoc01 was found in the NEXUS file but not in the metadata file)


 ##LINUX and WINDOWS##
 java -jar ~/Downloads/webin-cli-3.5.0.jar -context sequence -userName TOPSECRET -password EVENMORESECRET -manifest ~/Desktop/ENA/YOUR_FOLDER/MANIFEST_FILE.manifest -outputDir ~/Desktop/ENA/YOUR_FOLDER/output -inputDir ~/Desktop/ENA/YOUR_FOLDER/ -validate
 ##Example Manifest file##
 STUDY    PRJEB42197 // see above: you need to register a study as the very first step
 NAME     partial 18S rRNA gene // same as used in the DESCR parameter before, but spaces are fine this time
 FLATFILE Cocconeis_crawfordii_18SV4.embl.gz // the gzipped EMBL file name
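
The preparation steps before validation (output folder, gzipped flat file, manifest) can be bundled into one small helper. This is a sketch under assumptions: the function name `prepare_webin_run` and the manifest file name derived from the EMBL file name are made up, and the explanatory `//` remarks from the example are left out because it is not assumed here that the manifest parser accepts them.

```shell
# Hypothetical helper: prepare a folder for a webin-cli run.
# Arguments: folder, study accession, submission name, EMBL file name
# (the EMBL file is expected inside the folder).
prepare_webin_run() {
  local dir="$1" study="$2" name="$3" embl="$4"

  # webin-cli cannot write its report unless the output folder exists.
  mkdir -p "$dir/output"

  # The manifest refers to the gzipped flat file; the original is kept.
  gzip -c "$dir/$embl" > "$dir/$embl.gz"

  # One field per line, field name and value separated by a tab.
  printf 'STUDY\t%s\nNAME\t%s\nFLATFILE\t%s.gz\n' \
    "$study" "$name" "$embl" > "$dir/${embl%.embl}.manifest"
}
```

Usage: `prepare_webin_run ~/Desktop/ENA/YOUR_FOLDER PRJEB42197 "partial 18S rRNA gene" Cocconeis_crawfordii_18SV4.embl`, followed by the `-validate` call shown above with `-manifest` pointing at the generated file.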

2.5 Testing submission

If validation was successful you can test the submission (NEVER SKIP THIS STEP)

 ##LINUX and WINDOWS##
 java -jar ~/Downloads/webin-cli-3.5.0.jar -context sequence -userName TOPSECRET -password EVENMORESECRET -manifest ~/Desktop/ENA/YOUR_FOLDER/MANIFEST_FILE.manifest -outputDir ~/Desktop/ENA/YOUR_FOLDER/output -inputDir ~/Desktop/ENA/YOUR_FOLDER/ -test -submit

2.6 Final submission

If the test submission was successful you can do the final submission

 ##LINUX and WINDOWS##
 java -jar ~/Downloads/webin-cli-3.5.0.jar -context sequence -userName TOPSECRET -password EVENMORESECRET -manifest ~/Desktop/ENA/YOUR_FOLDER/MANIFEST_FILE.manifest -outputDir ~/Desktop/ENA/YOUR_FOLDER/output -inputDir ~/Desktop/ENA/YOUR_FOLDER/ -submit

The command line will report the accession assigned to the submission, e.g.:

INFO : The submission has been completed successfully. The following analysis accession was assigned to the submission: ERZ1740481

You will NOT receive a confirmation email, but the submission will be listed in the Webin user interface.
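
Since no confirmation email is sent, it can be useful to record the accession straight from the command-line output. A minimal sketch, assuming the accession always has the form `ERZ` followed by digits as in the example message; the function name `extract_accession` is hypothetical.

```shell
# Hypothetical helper: read webin-cli output on stdin and print the first
# analysis accession (assumed pattern: "ERZ" followed by digits).
extract_accession() {
  grep -oE 'ERZ[0-9]+' | head -n 1
}
```

Usage: append `| tee submit.log` to the final submission command and then run `extract_accession < submit.log`; the log file name is only an example.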