Revision as of 11:44, 12 February 2021

Attention: BGBM submissions to ENA, including the registration of new names, must only be done by authorised staff! If you want to submit sequences, please contact the DNA Bank SIX weeks ahead of paper submission! The following guide helps you to prepare all required data and to understand the complexity of sequence submissions. The submission itself is done by authorised staff only!

!To Do: Text includes a Mix of German and English !

Researchers

INPUT required for any submission

DNA alignment in NEXUS format with annotations following the INSDC vocabulary
Sample metadata in CSV format following the BGBM standard (TODO: description of this standard)
List of standardized scientific names
Project metadata (short name, title of study, abstract, release date, authors/contributors)

Check all taxon names against NCBI Taxonomy

Option A: Use the NCBI tool: https://www.ncbi.nlm.nih.gov/Taxonomy/TaxIdentifier/tax_identifier.cgi

Option B: Use OpenRefine: TODO: documentation

Names must follow the IPNI standard (except for unidentified organisms and newly described taxa). The most common mistake is a blank in the authorship (e.g. H. Karst); IPNI always writes it without blanks (H.Karst). BGBM follows this standard in all its collection databases.
Names not found in the ENA/NCBI Taxonomy must be registered at ENA (by the BGBM DNA Bank!). Note that you will only find names that are associated with published sequences!
ENA has rules for the registration of new names that must be followed. BGBM documents those name submissions centrally, so that sequence records can be updated after the new names are published.
- Unidentified or unpublished names must be unique, e.g. "Minuartia sp. G.Parolly et al. 15015" instead of "Minuartia spec. 1"
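
The blank-in-authorship mistake described above can be caught before a name list is handed to the DNA Bank. The following is only a minimal sketch and not part of any BGBM tool: the function name `strip_author_blanks` is hypothetical, and the pattern covers just the simple case of a single upper-case initial followed by a blank (as in "H. Karst"). Anything more unusual still needs to be checked by hand against IPNI.

```python
import re

# Assumption: an author initial is one upper-case letter plus a full stop,
# followed by blank(s) and another upper-case letter, e.g. "H. Karst".
_INITIAL_BLANK = re.compile(r"(?<=\b[A-Z]\.)\s+(?=[A-Z])")


def strip_author_blanks(name: str) -> str:
    """Remove blanks after author initials so the name follows the IPNI style."""
    return _INITIAL_BLANK.sub("", name)


if __name__ == "__main__":
    for raw in ["Picea abies (L.) H. Karst", "Minuartia sp. G.Parolly et al. 15015"]:
        print(raw, "->", strip_author_blanks(raw))
```

Names that are already written without blanks pass through unchanged, so the helper can be applied to a whole list.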

Attention: the registration of new names might take up to two weeks. Make sure ALL names are checked carefully and submitted in good time.

Add Annotations to Sequences

Required Software: Phyde, Geneious, any common text editor (e.g. Geany, Notepad++, Textpad)
Video Tutorial: https://www.youtube.com/watch?v=CY1e2RkULas
Attention: ENA requires the use of a specific vocabulary for annotations, see Annotations

Export your file in NEXUS format! In the NEXUS file, the annotations appear at the end of the file in the following format. Annotations must not contain spaces!

BEGIN SETS;
charset 18S_rRNA = 1-378;
charset 18S_gene = 1-378;

END;
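
The rule that annotations must not contain spaces can be checked on the exported file before it is passed on. This is a rough sketch under the assumption that every annotation sits on its own line in the form `charset NAME = START-END;` as in the example above; the function name `read_charsets` is hypothetical, and other charset notations are not handled.

```python
import re

# Assumption: one charset per line, written as "charset NAME = START-END;".
CHARSET = re.compile(r"^\s*charset\s+(.+?)\s*=\s*(\d+)\s*-\s*(\d+)\s*;\s*$", re.IGNORECASE)


def read_charsets(nexus_text):
    """Return (name, start, end) tuples; raise ValueError if a name contains spaces."""
    found = []
    for line in nexus_text.splitlines():
        match = CHARSET.match(line)
        if not match:
            continue
        name, start, end = match.group(1), int(match.group(2)), int(match.group(3))
        if any(ch.isspace() for ch in name):
            raise ValueError(f"annotation name contains spaces: {name!r}")
        found.append((name, start, end))
    return found


if __name__ == "__main__":
    example = "BEGIN SETS;\ncharset 18S_rRNA = 1-378;\ncharset 18S_gene = 1-378;\n\nEND;\n"
    print(read_charsets(example))
```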

DNA Bank

Registration of all new names at NCBI/ENA

Login at ENA's Webin
Click on "Taxonomy Check/Request" --> Next
Enter organism name / upload list of names
1. Save list of names as txt file (one per row) or enter individually
2. TODO: add example requests

Email from Carol Hotton (NCBI): Here’s the GenBank format for non-canonical names:

Sensu stricto names are treated as canonical, e.g.: Arabis hirsuta s. str. = Arabis hirsuta
Sensu lato names are treated as cf., e.g.: Festuca ovina s. l. = Festuca cf. ovina
Cf. (and aff.) names we treat as non-unique, to which a unique string is attached, e.g.: Festuca cf. ovina GD-2019
Hybrid formulas repeat the genus, i.e.: Carex cespitosa x Carex nigra
We use an ‘x’ rather than a multiplication sign in hybrid names because not all applications can handle non-ASCII characters (as you have done, so that’s not a problem).
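
The conventions from the email can be applied to a name list before it is uploaded. The sketch below is an illustration only: the function name `to_genbank_style` and the `suffix` parameter are hypothetical, the unique string (such as "GD-2019" in the email) has to be chosen by the submitter, and only the exact spellings "s. str.", "s. l.", "cf." and "aff." are recognised.

```python
import re


def to_genbank_style(name: str, suffix: str = "") -> str:
    """Rewrite a non-canonical name following the conventions quoted above."""
    # Hybrid formulas: plain ASCII "x" instead of the multiplication sign.
    name = name.replace("\u00d7", "x")
    # Sensu stricto is treated as the canonical name.
    name = re.sub(r"\s+s\.\s*str\.$", "", name)
    # Sensu lato is treated as "cf." placed before the epithet.
    match = re.match(r"^(\S+)\s+(\S+)\s+s\.\s*l\.$", name)
    if match:
        name = f"{match.group(1)} cf. {match.group(2)}"
    # cf./aff. names are non-unique, so a unique string is attached if given.
    if suffix and re.search(r"\b(cf|aff)\.", name):
        name = f"{name} {suffix}"
    return name


if __name__ == "__main__":
    print(to_genbank_style("Festuca ovina s. l.", suffix="GD-2019"))
```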

Create project at ENA to get project number

Login at ENA's Webin
Click on "Register study (project)" --> Next
Fill out relevant information (Short name, Title of study, Abstract, Release date)

Checking the correspondence between DNA alignment and metadata

Required Software: any spreadsheet software (e.g. LibreOffice), any text editor (e.g. Geany)

The sequence names of the DNA alignment must match the entries in one of the columns of the metadata file exactly.
The metadata file may contain more entries than the DNA alignment, but the DNA alignment must not contain more entries than the metadata file.
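
The correspondence check can also be done with a short script instead of a spreadsheet. The following sketch makes several assumptions that are not part of this guide: the NEXUS matrix is non-interleaved, sequence names contain no spaces, the closing `;` of the matrix stands on its own line, and the metadata column holding the sequence names is called `isolate` (a hypothetical name, use whatever your file has).

```python
import csv
import io


def matrix_names(nexus_text):
    """Collect the first token of every line between MATRIX and the closing ';'."""
    names, in_matrix = [], False
    for line in nexus_text.splitlines():
        stripped = line.strip()
        if not in_matrix:
            in_matrix = stripped.upper() == "MATRIX"
            continue
        if stripped.startswith(";"):
            break
        if stripped:
            names.append(stripped.split()[0])
    return names


def missing_in_metadata(nexus_text, csv_text, column):
    """Return alignment names that have no exact match in the given metadata column."""
    known = {row[column] for row in csv.DictReader(io.StringIO(csv_text))}
    return [name for name in matrix_names(nexus_text) if name not in known]


if __name__ == "__main__":
    nexus = "#NEXUS\nBEGIN DATA;\nMATRIX\nD257_023a ACGT\nD257_024 ACGT\n;\nEND;\n"
    meta = "isolate,country\nD257_023a,Germany\nD257_099,Iran\n"
    print(missing_in_metadata(nexus, meta, "isolate"))
```

An empty result means every sequence has a metadata row; extra metadata rows are ignored, which matches the rule above.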

Step 6 DNA Bank: Conversion of the DNA alignment from NEXUS format to flatfile format, integration of the metadata

Required Software: https://github.com/michaelgruenstaeudl/annonex2embl

Metadata File

MUST NOT contain any empty columns or rows
Comma is the only accepted delimiter (on Windows, change the region settings so that the CSV export uses a comma instead of a semicolon)
DATE must be converted to YYYY-MM-DD. Attention: when you export on Windows, dates will be converted to US format!
MUST NOT contain any umlauts or "strange" characters such as °
Column headers must match the INSDC feature table
For matK use organelle = plastid; for other markers the column "organelle" is not needed
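
Most of the rules above can be pre-checked automatically before the file goes into annonex2embl. The sketch below is not part of annonex2embl; the function name `check_metadata` is hypothetical, and the default date column name `collection_date` is an assumption that must be adapted to your own header. It does not verify that the headers match the INSDC feature table.

```python
import csv
import io
import re

ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def check_metadata(csv_text, date_column="collection_date"):
    """Return a list of problems found; an empty list means no findings."""
    problems = []
    if not csv_text.isascii():
        problems.append("non-ASCII character found")
    rows = list(csv.reader(io.StringIO(csv_text), delimiter=","))
    if not rows:
        return problems + ["file is empty"]
    header, body = rows[0], rows[1:]
    if len(header) == 1 and ";" in header[0]:
        problems.append("semicolon used as delimiter")
    for number, row in enumerate(body, start=2):
        if not any(cell.strip() for cell in row):
            problems.append(f"row {number} is empty")
    for index, title in enumerate(header):
        cells = [row[index] for row in body if index < len(row)]
        if not title.strip() or not any(cell.strip() for cell in cells):
            problems.append(f"column {index + 1} is empty")
    if date_column in header:
        index = header.index(date_column)
        for number, row in enumerate(body, start=2):
            value = row[index].strip() if index < len(row) else ""
            if value and not ISO_DATE.match(value):
                problems.append(f"row {number}: date {value!r} is not YYYY-MM-DD")
    return problems


if __name__ == "__main__":
    for problem in check_metadata("isolate,collection_date\nD1,05/03/2019\n"):
        print(problem)
```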

##LINUX##
INPUT=examples/DNA_Alignment.nex
METAD=examples/Metadata.csv
# Note: spaces are unfortunately not yet supported in DESCR and AUTHR!
DESCR="description_of_alignment"
EMAIL=your_email_here@bgbm.org
AUTHR="Your_name_here"

annonex2embl -n "$INPUT" -c "$METAD" -o "${INPUT%.nex*}.embl" -d "$DESCR" -e "$EMAIL" -a "$AUTHR"

##WINDOWS##
REM Note: spaces are unfortunately not yet supported in DESCR and AUTHR!
SET INPUT=examples/DNA_Alignment.nex
SET METAD=examples/Metadata.csv
SET DESCR="description_of_alignment"
SET EMAIL=your_email_here@bgbm.org
SET AUTHR="Your_name_here"

annonex2embl -n %INPUT% -c %METAD% -o output.embl -d %DESCR% -e %EMAIL% -a %AUTHR%

Examples for DESCR (note that spaces are unfortunately not yet supported; use underscores instead of the spaces shown here and replace them again in the next step):

ITS: "18S rRNA gene (partial), ITS1, 5.8S rRNA gene, ITS2 and 28S rRNA gene (partial)"
trnLF: "tRNA-Leu (trnL) gene, partial sequence; trnL-trnF intergenic spacer, complete sequence; and tRNA-Phe (trnF) gene"
trnKmatK: "tRNA-Lys (trnK) gene and intron, partial sequence; maturase K (matK) gene, complete cds; psbA gene, partial sequence"
rpl16: "rpl16 intron, partial sequence"
18S: "partial 18S rRNA gene"

Check EMBL File before you continue

first row mol_type formatted correctly?
- WRONG: ID D257_023a; SV 1; linear; DNA; ; PLN; 378 BP.
- CORRECTED: ID D257_023a; SV 1; linear; genomic DNA; ; PLN; 378 BP.
all dates in ISO format?
The identifier doesn't have to be changed
"isolate DB" to be replaced by "isolate DB "
"isolate=DB" to be replaced by "isolate=DB "
58S etc. to be replaced by "5 8S" etc.
replace underscores in descriptions with spaces
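
Two of the checks above, the mol_type in the ID line and the underscores in the description, are mechanical enough to script. This is a sketch only and not a replacement for reading the file: the function name `fix_flatfile` is hypothetical, it assumes the ID line has the seven semicolon-separated fields shown in the example, and it assumes the description appears on lines starting with `DE` exactly as it was passed to annonex2embl. The isolate and 5.8S replacements are left for manual editing.

```python
def fix_flatfile(text, descr):
    """Set mol_type to 'genomic DNA' in ID lines and restore spaces in the description.

    descr is the underscore version that was given to annonex2embl via -d.
    """
    fixed = []
    for line in text.splitlines():
        if line.startswith("ID"):
            parts = line.split(";")
            # Field 4 of the ID line holds the molecule type.
            if len(parts) == 7 and parts[3].strip() == "DNA":
                parts[3] = " genomic DNA"
                line = ";".join(parts)
        elif line.startswith("DE") and descr in line:
            # Only the known description is touched, identifiers keep their underscores.
            line = line.replace(descr, descr.replace("_", " "))
        fixed.append(line)
    return "\n".join(fixed) + "\n"


if __name__ == "__main__":
    sample = "ID   D257_023a; SV 1; linear; DNA; ; PLN; 378 BP.\nDE   partial_18S_rRNA_gene\n"
    print(fix_flatfile(sample, "partial_18S_rRNA_gene"), end="")
```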

Validating EMBL flatfile

Required Software: https://github.com/enasequence/webin-cli/releases -> download latest jar file
Mandatory: Manifest file, file ending ".manifest"

 ##LINUX and WINDOWS##
 java -jar ~/Downloads/webin-cli-3.5.0.jar -context sequence -userName TOPSECRET -password EVENMORESECRET -manifest ~/Desktop/ENA/YOUR_FOLDER/MANIFEST_FILE.manifest -outputDir ~/Desktop/ENA/YOUR_FOLDER/output -inputDir ~/Desktop/ENA/YOUR_FOLDER/ -validate

 ##Example Manifest file##
 STUDY	 PRJEB42197 // see above, you'll need to register a study as the very first step
 NAME	 partial 18S rRNA gene // same as used in DESCR parameter before
 FLATFILE Cocconeis_crawfordii_18SV4.embl.gz // the zipped embl file name
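
If many alignments are submitted, the manifest can be generated instead of typed. The sketch below only reproduces the three fields of the example manifest above; the function name `manifest_text` is hypothetical, and the explanatory `//` remarks of the example are deliberately left out of the generated file.

```python
def manifest_text(study, name, flatfile):
    """Build the content of a .manifest file with one tab-separated field per line."""
    rows = [("STUDY", study), ("NAME", name), ("FLATFILE", flatfile)]
    return "".join(f"{key}\t{value}\n" for key, value in rows)


if __name__ == "__main__":
    content = manifest_text(
        "PRJEB42197",
        "partial 18S rRNA gene",
        "Cocconeis_crawfordii_18SV4.embl.gz",
    )
    print(content, end="")
```

The returned string can be written to a file ending in ".manifest" and passed to webin-cli with -manifest.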

Choosing the type of submission

Option A: Interactive/GUI-based submission; described in steps 9a and 9b

Option B: Programmatic/command-line-based submission; described in step 10

9a Conversion of the EMBL flatfile to an ENA checklist

Required Software: EMBL2checklists (Gruenstaeudl & Hartmaring 2019)

See details here: https://www.protocols.io/view/usage-of-embl2checklists-v6me9c6

 ##LINUX##
 INPUT=examples/DNA_Alignment.embl
 CHOSEN_CHECKLIST="trnK_matK"

 EMBL2checklists_CLI -i $INPUT -o ${INPUT%.embl*}.tsv -c $CHOSEN_CHECKLIST -e no

 ##WINDOWS##
 SET INPUT=examples/DNA_Alignment.embl
 SET CHOSEN_CHECKLIST="trnK_matK"

 EMBL2checklists_CLI -i %INPUT% -o output.tsv -c %CHOSEN_CHECKLIST% -e no

9b Upload of the ENA checklist, stating the project number

Log in at ENA's Webin
Click on "Submit other assembled and annotated sequences [formerly EMBL-Bank]" --> Next
Select the correct study (the one created under "Create project at ENA to get project number") --> Next
Click on "Submit Completed Spreadsheet" and select the file that was created in step 9a --> Next

10a DNA Bank: Preparing the xml files

Compress the embl file and determine its checksum

gzip Akhani-et-al_Willdenowia_Tamarix_trnG-S_ENA-submission.embl
md5sum Akhani-et-al_Willdenowia_Tamarix_trnG-S_ENA-submission.embl.gz

Two xml files are required: *_submission.xml and *_analysis.xml
Fill in the *_analysis.xml file; the STUDY_REF accession is the number of the study that was created under "Create project at ENA to get project number"!
The *_submission.xml file only contains an empty schema
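
On systems without gzip and md5sum (e.g. Windows), the compression and checksum step can be done with the Python standard library. This is a minimal sketch; the function name `pack_and_checksum` is hypothetical. Unlike the gzip command, it keeps the uncompressed file.

```python
import gzip
import hashlib
import shutil
from pathlib import Path


def pack_and_checksum(embl_path):
    """Write <file>.gz next to the input and return (gz_path, md5_hex_digest)."""
    source = Path(embl_path)
    target = source.with_name(source.name + ".gz")
    with open(source, "rb") as fin, gzip.open(target, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    # The checksum is taken from the compressed file, as md5sum does above.
    digest = hashlib.md5(target.read_bytes()).hexdigest()
    return target, digest


if __name__ == "__main__":
    import sys

    if len(sys.argv) == 2:
        gz_path, md5 = pack_and_checksum(sys.argv[1])
        print(md5, gz_path.name)
```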

10b DNA Bank: Uploading the files via ftp and command line

ENA's ftp server can either be accessed directly from the command line or with any ftp program. In either case the ENA Webin login credentials are required

ftp webin.ebi.ac.uk  ##Login
Type bin to use binary mode.
Type ls command to check the content of your drop box.
Type prompt to switch off confirmation for each file uploaded.
Use mput command to upload files. ##upload the following: 1) the *.gz file, 2) the *_submission.xml and 3) the *_analysis.xml
Use the bye command to exit the ftp client.

Test submission on the ENA test server (ALWAYS do this before the real submission!)

curl -u Webin-NUMMER:TOPSECRETPASSWORD -F "SUBMISSION=@Akhani-et-al_Willdenowia_Tamarix_trnG-S_ENA-submission.xml" -F "ANALYSIS=@Akhani-et-al_Willdenowia_Tamarix_trnG-S_ENA-analysis.xml" "https://wwwdev.ebi.ac.uk/ena/submit/drop-box/submit/" > Akhani-et-al_Willdenowia_Tamarix_trnG-S_Test.xml

Check the output of the test -> *_Test.xml
If the test was successful -> submission on the ENA server

curl -u Webin-NUMMER:TOPSECRETPASSWORD -F "SUBMISSION=@Akhani-et-al_Willdenowia_Tamarix_trnG-S_ENA-submission.xml" -F "ANALYSIS=@Akhani-et-al_Willdenowia_Tamarix_trnG-S_ENA-analysis.xml" "https://www.ebi.ac.uk/ena/submit/drop-box/submit/" > Akhani-et-al_Willdenowia_Tamarix_trnG-S_Test.xml

Done!

ENA Submission Pipeline: Difference between revisions
