ENA Submission Pipeline
Revision as of 11:44, 12 February 2021
Attention: BGBM submissions to ENA, including the registration of new names, should only be done by authorised staff! If you want to submit sequences, please contact the DNA Bank SIX weeks ahead of paper submission! The following guide will help you to prepare all required data and to understand the complexity of sequence submissions. The submission itself will be done by authorised staff only!
!To Do: Text includes a Mix of German and English !
Contents
- 1 Researchers
- 2 DNA Bank
- 2.1 Registration of all new names at NCBI/ENA
- 2.2 Create project at ENA to get project number
- 2.3 Checking the correspondence between DNA alignment and metadata
- 2.4 Converting DNA alignment (nexus file) to EMBL flat file, integration of metadata
- 2.5 Validating EMBL flatfile
- 2.6 Testing submission
- 2.7 Final submission
1 Researchers
1.1 INPUT required for any submission
- DNA alignment in NEXUS format with annotations following the INSDC vocabulary
- most common errors: wrong coordinate syntax, use of umlauts or special characters such as °
- Sample metadata in CSV format following the BGBM standard (TODO: description of this standard)
- List of standardized scientific names
- Project metadata (short name, title of study, abstract, release date, authors/contributors)
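Umlauts and characters such as ° are easy to overlook in a long alignment or metadata file. The following is a minimal sketch of a check that reports every non-ASCII character with its position; the function name `find_non_ascii` and the sample text are illustrative assumptions, not part of the BGBM pipeline.

```python
def find_non_ascii(text):
    """Return (line_number, column, character) for every non-ASCII character."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:  # anything outside plain ASCII
                hits.append((line_no, col, ch))
    return hits


# Hypothetical sample content, for illustration only.
sample = "Locality: Köln\nCoordinates: 47.94° N\nCollector: G.Parolly"
for line_no, col, ch in find_non_ascii(sample):
    print(f"line {line_no}, column {col}: {ch!r}")
```

To check a real file, read it with `open(path, encoding="utf-8").read()` and pass the text to the function.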
1.2 Check all taxon names against NCBI Taxonomy
Option A: Use the NCBI tool: https://www.ncbi.nlm.nih.gov/Taxonomy/TaxIdentifier/tax_identifier.cgi
Option B: Use OpenRefine: TODO: documentation
- Names must follow the IPNI standard (except for unidentified organisms and newly described taxa). The most common mistake is blanks in the authorship (e.g. H. Karst); IPNI always writes it without blanks (H.Karst). BGBM follows this standard in all its collection databases.
- Names not found in the ENA/NCBI Taxonomy must be registered at ENA (by the BGBM DNA Bank!). Note that you will only find names associated with published sequences!
- ENA has rules for the registration of new names that must be followed. BGBM documents those name submissions centrally so that sequence records can be updated after new names are published.
- Unidentified or unpublished names must be unique, e.g. "Minuartia sp. G.Parolly et al. 15015" instead of "Minuartia spec. 1"
Attention: the registration of new names may take up to two weeks. Make sure ALL names are checked carefully and submitted in time.
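The blank-in-authorship mistake mentioned above can be caught mechanically. The sketch below removes the blank between a single-letter initial and the capitalised name that follows; it is an assumption-laden simplification (ASCII initials only, apply it to the authorship string and not to the full name), and the function name `compact_authorship` is hypothetical.

```python
import re

# A single-letter initial such as "H." followed by blanks and another capital.
_INITIAL_GAP = re.compile(r"\b([A-Z]\.)\s+(?=[A-Z])")


def compact_authorship(author):
    """Remove blanks after initials, e.g. 'H. Karst' -> 'H.Karst'."""
    return _INITIAL_GAP.sub(r"\1", author)


for author in ["H. Karst", "J. F. Gmel.", "DC. ex Meisn."]:
    print(f"{author!r} -> {compact_authorship(author)!r}")
```

Multi-letter abbreviations such as "DC." are deliberately left alone, so the blanks around "ex" are kept.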
1.3 Add Annotations to Sequences
- Required Software: Phyde, Geneious, any common text editor (e.g. Geany, Notepad++, Textpad)
- Video Tutorial: https://www.youtube.com/watch?v=CY1e2RkULas
- Attention: ENA requires the use of a specific vocabulary for annotations, see Annotations
Export your file to Nexus format! In the nexus file, annotations appear at the end of the file in the following format. Annotations must not contain spaces!
    BEGIN SETS;
    charset 18S_rRNA = 1-378;
    charset 18S_gene = 1-378;
    END;
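A quick way to verify the exported annotations is to read the `charset` statements back and flag any name that contains whitespace. This is a minimal sketch assuming simple `charset NAME = START-END;` statements as in the example above; the helper names are hypothetical and more complex NEXUS set definitions are not handled.

```python
import re

# Assumes the simple form "charset NAME = START-END;" shown in the example.
CHARSET = re.compile(r"charset\s+(.+?)\s*=\s*(\d+)\s*-\s*(\d+)\s*;", re.IGNORECASE)


def read_charsets(nexus_text):
    """Return a list of (name, start, end) tuples from 'charset' statements."""
    return [(m.group(1), int(m.group(2)), int(m.group(3)))
            for m in CHARSET.finditer(nexus_text)]


def names_with_spaces(charsets):
    """Return the annotation names that contain whitespace."""
    return [name for name, _, _ in charsets if any(c.isspace() for c in name)]


sets_block = "BEGIN SETS;\ncharset 18S_rRNA = 1-378;\ncharset 18S gene = 1-378;\nEND;"
charsets = read_charsets(sets_block)
print(charsets)
print("names with spaces:", names_with_spaces(charsets))
```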
2 DNA Bank
2.1 Registration of all new names at NCBI/ENA
- Login at ENA's Webin
- Click on "Taxonomy Check/Request" --> Next
- Enter organism name / upload list of names
- Save list of names as txt file (one per row) or enter individually
- TODO: add example requests
Email from Carol Hotton (NCBI): Here’s the GenBank format for non-canonical names:
- Sensu stricto names are treated as canonical, e.g.: Arabis hirsuta s. str. = Arabis hirsuta
- Sensu lato names are treated as cf., e.g.: Festuca ovina s. l. = Festuca cf. ovina
- Cf. (and aff.) names we treat as non-unique, to which a unique string is attached, e.g.: Festuca cf. ovina GD-2019
- Hybrid formulas repeat the genus, i.e.: Carex cespitosa x Carex nigra
- We use an ‘x’ rather than a multiplication sign in hybrid names because not all applications can handle non-ASCII characters (as you have done, so that’s not a problem).
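The conventions quoted above can be applied to a name list before it is uploaded. The following sketch covers only the cases named in the email (s. str., s. l., the multiplication sign, and an optional unique string for cf./aff. names); the function name `genbank_name` and the parameter `unique_suffix` are hypothetical, and repeating the genus in hybrid formulas is not attempted.

```python
import re


def genbank_name(name, unique_suffix=None):
    """Rewrite a non-canonical name following the conventions quoted above."""
    # Use a plain 'x' in hybrid formulas and tidy up whitespace.
    name = re.sub(r"\s+", " ", name.replace("×", "x")).strip()

    # Sensu stricto: drop the qualifier, the name is treated as canonical.
    match = re.fullmatch(r"(.+?) s\. ?str\.", name)
    if match:
        return match.group(1)

    # Sensu lato: treated as "cf.", placed between genus and epithet.
    match = re.fullmatch(r"(\S+) (.+?) s\. ?l\.", name)
    if match:
        name = f"{match.group(1)} cf. {match.group(2)}"

    # cf./aff. names are non-unique, so a unique string may be attached.
    if unique_suffix and re.search(r" (cf|aff)\. ", name):
        name = f"{name} {unique_suffix}"
    return name


print(genbank_name("Arabis hirsuta s. str."))
print(genbank_name("Festuca ovina s. l.", unique_suffix="GD-2019"))
print(genbank_name("Carex cespitosa × Carex nigra"))
```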
2.2 Create project at ENA to get project number
- Login at ENA's Webin
- Click on "Register study (project)" --> Next
- Fill out relevant information (Short name, Title of study, Abstract, Release date)
2.3 Checking the correspondence between DNA alignment and metadata
Required software: any spreadsheet software (e.g. LibreOffice), any text editor (e.g. Geany)
- The sequence names of the DNA alignment must match the entries of one of the columns of the metadata file exactly.
- The metadata file may contain more entries than the DNA alignment, but the DNA alignment must not contain more entries than the metadata file.
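The correspondence check between alignment and metadata can also be scripted instead of being done by eye in a spreadsheet. Below is a minimal sketch that reports sequence names with no exact match in a given metadata column; the column name `isolate`, the sample values, and the function name are assumptions for illustration, and extracting the sequence names from the NEXUS file is left out.

```python
import csv
import io


def names_missing_from_metadata(sequence_names, metadata_file, column):
    """Return alignment names that have no exact match in the metadata column."""
    reader = csv.DictReader(metadata_file)
    known = {row[column] for row in reader}
    # Extra metadata rows are allowed; extra alignment names are not.
    return sorted(set(sequence_names) - known)


# Hypothetical example data.
metadata = io.StringIO("isolate,country\nDB001,Germany\nDB002,Turkey\nDB003,Greece\n")
alignment_names = ["DB001", "DB002", "DB004"]
print(names_missing_from_metadata(alignment_names, metadata, "isolate"))
```

An empty result means every sequence in the alignment has a metadata row. Matching is exact on purpose, so trailing blanks or different capitalisation are reported as missing.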
2.4 Converting DNA alignment (nexus file) to EMBL flat file, integration of metadata
Required Software: https://github.com/michaelgruenstaeudl/annonex2embl
2.4.1 Metadata File
- MUST NOT contain any empty columns or rows
- Comma is the only accepted delimiter (Windows: change the region settings so that the CSV export uses commas instead of semicolons)
- DATE must be converted into YYYY-MM-DD. Attention: when you export on Windows, dates will be converted into US format!
- Coordinates must follow this syntax: "47.94 N 28.12 W" (no commas, keep spaces!)
- MUST NOT contain any umlauts or special characters such as °
- Column headers must match the INSDC feature table
- for matK use organelle = plastid; for other markers the column "organelle" is not needed
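Several of the rules above (no empty rows or columns, ISO dates, coordinate syntax) can be checked before running the conversion. The sketch below is one possible pre-flight check, not part of annonex2embl; the column names `collection_date` and `lat_lon` are assumptions and should be replaced by the headers actually used in your metadata file.

```python
import csv
import io
import re
from datetime import datetime

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")
LAT_LON = re.compile(r"\d{1,3}(\.\d+)? [NS] \d{1,3}(\.\d+)? [EW]")


def is_iso_date(value):
    """True for a valid calendar date written as YYYY-MM-DD."""
    if not ISO_DATE.fullmatch(value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


def check_metadata(csv_file, date_column="collection_date", coord_column="lat_lon"):
    """Return a list of human-readable problems found in the metadata CSV."""
    rows = list(csv.reader(csv_file))
    header, data = rows[0], rows[1:]
    problems = []
    for row_no, row in enumerate(data, start=2):  # row 1 is the header
        if not any(cell.strip() for cell in row):
            problems.append(f"row {row_no}: empty row")
            continue
        record = dict(zip(header, row))
        date = record.get(date_column, "")
        if date and not is_iso_date(date):
            problems.append(f"row {row_no}: date {date!r} is not YYYY-MM-DD")
        coords = record.get(coord_column, "")
        if coords and not LAT_LON.fullmatch(coords):
            problems.append(f"row {row_no}: coordinates {coords!r} have wrong syntax")
    for index, name in enumerate(header):
        cells = [row[index] for row in data if index < len(row)]
        if not any(cell.strip() for cell in cells):
            problems.append(f"column {index + 1} ({name!r}): no values")
    return problems


# Hypothetical example: the second data row uses a US date and a comma in the coordinates.
example = io.StringIO(
    "isolate,collection_date,lat_lon\n"
    "DB001,2019-06-14,47.94 N 28.12 W\n"
    'DB002,06/14/2019,"47.94 N, 28.12 W"\n'
)
for problem in check_metadata(example):
    print(problem)
```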
##LINUX##
    # note that spaces are unfortunately not yet supported in DESCR and AUTHR!
    INPUT=examples/DNA_Alignment.nex
    METAD=examples/Metadata.csv
    DESCR="description_of_alignment"
    EMAIL=your_email_here@bgbm.org
    AUTHR="Your_name_here"
    annonex2embl -n "$INPUT" -c "$METAD" -o "${INPUT%.nex*}.embl" -d "$DESCR" -e "$EMAIL" -a "$AUTHR"
##WINDOWS##
    REM note that spaces are unfortunately not yet supported in DESCR and AUTHR!
    SET INPUT=examples/DNA_Alignment.nex
    SET METAD=examples/Metadata.csv
    SET DESCR="description_of_alignment"
    SET EMAIL=your_email_here@bgbm.org
    SET AUTHR="Your_name_here"
    annonex2embl -n %INPUT% -c %METAD% -o output.embl -d %DESCR% -e %EMAIL% -a %AUTHR%
Examples for DESCR. Note that spaces are unfortunately not yet supported, so use underscores instead of spaces when running the command; you will replace the underscores with spaces in the next step.
- ITS: "18S rRNA gene (partial), ITS1, 5.8S rRNA gene, ITS2 and 28S rRNA gene (partial)"
- trnLF: "tRNA-Leu (trnL) gene, partial sequence; trnL-trnF intergenic spacer, complete sequence; and tRNA-Phe (trnF) gene"
- trnKmatK: "tRNA-Lys (trnK) gene and intron, partial sequence; maturase K (matK) gene, complete cds; psbA gene, partial sequence"
- rpl16: "rpl16 intron, partial sequence"
- 18S: "partial 18S rRNA gene"
- rbcL: "partial rbcL gene"
2.4.2 Check EMBL File before you continue
- first row mol_type formatted correctly?
- WRONG: ID D257_023a; SV 1; linear; DNA; ; PLN; 378 BP.
- CORRECTED: ID D257_023a; SV 1; linear; genomic DNA; ; PLN; 378 BP.
- all dates in ISO format?
- The identifier doesn't have to be changed
- "isolate DB" to be replaced by "isolate DB "
- "isolate=DB" to be replaced by "isolate=DB "
- 58S etc. to be replaced by "5 8S" etc.
- replace underscores in descriptions with spaces
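Two of the checks above lend themselves to a scripted fix: the mol_type in the ID line and the underscores in the description. The sketch below assumes that the description sits on lines starting with `DE`, as in standard EMBL flat files; verify this against your own output first. The function name `fix_embl_lines` is hypothetical, and the remaining checks still need to be done by hand.

```python
def fix_embl_lines(lines):
    """Apply the ID-line and description fixes to a list of EMBL lines."""
    fixed = []
    for line in lines:
        if line.startswith("ID ") and "; DNA;" in line:
            # mol_type must read "genomic DNA"; the identifier stays untouched.
            line = line.replace("; DNA;", "; genomic DNA;", 1)
        elif line.startswith("DE "):
            # Assumption: the description is on DE lines; underscores become spaces.
            line = "DE" + line[2:].replace("_", " ")
        fixed.append(line)
    return fixed


example = [
    "ID D257_023a; SV 1; linear; DNA; ; PLN; 378 BP.",
    "DE   partial_18S_rRNA_gene",
]
for line in fix_embl_lines(example):
    print(line)
```

Run it on a copy of the flat file and compare the result with the original before continuing.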
2.5 Validating EMBL flatfile
- Required Software: https://github.com/enasequence/webin-cli/releases -> download latest jar file
- Mandatory: a manifest file with the file ending ".manifest"; an example file can be found at https://library.ggbn.org/share/s/zWNjSrM1RyW1DNyNb5c15Q
- The EMBL Validator will provide you with a report
- Most common errors: wrong syntax for coordinates, unregistered names
##LINUX and WINDOWS##
    java -jar ~/Downloads/webin-cli-3.5.0.jar -context sequence -userName TOPSECRET -password EVENMORESECRET -manifest ~/Desktop/ENA/YOUR_FOLDER/MANIFEST_FILE.manifest -outputDir ~/Desktop/ENA/YOUR_FOLDER/output -inputDir ~/Desktop/ENA/YOUR_FOLDER/ -validate
##Example Manifest file##
    STUDY PRJEB42197
    NAME partial 18S rRNA gene
    FLATFILE Cocconeis_crawfordii_18SV4.embl.gz
- STUDY: see above, you'll need to register a study as a very first step
- NAME: same as used in the DESCR parameter before, but spaces are fine this time
- FLATFILE: the name of the zipped EMBL file
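Zipping the flat file and writing the manifest can be done in one step so that the FLATFILE entry always matches the actual file name. This is a minimal sketch using only the three manifest fields shown in the example; the function name `prepare_manifest` and the choice to name the manifest after the EMBL file are assumptions.

```python
import gzip
import shutil
from pathlib import Path


def prepare_manifest(embl_path, study, name):
    """Gzip the EMBL flat file and write a matching .manifest file next to it."""
    embl_path = Path(embl_path)
    gz_path = embl_path.with_name(embl_path.name + ".gz")
    with open(embl_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

    # Same three fields as in the example manifest: STUDY, NAME, FLATFILE.
    manifest_path = embl_path.with_suffix(".manifest")
    manifest_path.write_text(
        f"STUDY {study}\nNAME {name}\nFLATFILE {gz_path.name}\n",
        encoding="ascii",
    )
    return manifest_path


# Usage (paths and values are placeholders):
# prepare_manifest("Cocconeis_crawfordii_18SV4.embl", "PRJEB42197", "partial 18S rRNA gene")
```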
2.6 Testing submission
If validation was successful, you can test the submission (NEVER SKIP THIS STEP)
##LINUX and WINDOWS##
    java -jar ~/Downloads/webin-cli-3.5.0.jar -context sequence -userName TOPSECRET -password EVENMORESECRET -manifest ~/Desktop/ENA/YOUR_FOLDER/MANIFEST_FILE.manifest -outputDir ~/Desktop/ENA/YOUR_FOLDER/output -inputDir ~/Desktop/ENA/YOUR_FOLDER/ -test -submit
2.7 Final submission
If the test submission was successful, you can do the final submission
##LINUX and WINDOWS##
    java -jar ~/Downloads/webin-cli-3.5.0.jar -context sequence -userName TOPSECRET -password EVENMORESECRET -manifest ~/Desktop/ENA/YOUR_FOLDER/MANIFEST_FILE.manifest -outputDir ~/Desktop/ENA/YOUR_FOLDER/output -inputDir ~/Desktop/ENA/YOUR_FOLDER/ -submit
The command line will provide you with a submission number, e.g.:
INFO : The submission has been completed successfully. The following analysis accession was assigned to the submission: ERZ1740481
You will NOT receive a confirmation email.
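Since no email is sent, the accession should be recorded straight from the command-line output. The sketch below extracts it from a message worded like the example above; the exact wording may differ between Webin-CLI versions, so treat the pattern as an assumption, and the function name `extract_accession` is hypothetical.

```python
import re

# Pattern based on the example message above; adjust if the wording differs.
ACCESSION = re.compile(
    r"analysis accession was assigned to the submission:\s*(ERZ\d+)"
)


def extract_accession(output):
    """Return the ERZ accession from the command-line output, or None."""
    match = ACCESSION.search(output)
    return match.group(1) if match else None


log = ("INFO : The submission has been completed successfully. "
       "The following analysis accession was assigned to the submission: ERZ1740481")
print(extract_accession(log))
```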