DNA Sequences and Derivatives

From BGBM Collection Workflows
Revision as of 12:01, 6 May 2026 by GabiDroege

DNA sequence data and their derivatives are primarily organized by sequencing method (Sanger vs. NGS) and by lab code and DNA region, with folders for raw data, processed data and analysis results.

DNA sequences

Nature: pherograms / reads from Sanger sequencing or NGS platforms

File formats:

- Sanger sequences: pherograms in *.scf format for own sequences; files in *.abi format can be converted to scf.

- NGS: *.fastq.gz

Storage / folder organization:

- Sanger sequences are organized by taxonomic group and the DNA region.

Version control: multiple reads derived from the same sample are distinguished by consecutive numbering of the PCR products they result from.

Metadata: Linked to source data under section 1. by labcode (own data) or INSDC accession number (published data)

- Sample metadata: as described under section 1.

- Sequencing details: for NGS data, the sequencing platform or method and relevant parameters

Retention:

- Sanger sequences: Pherograms are stored indefinitely. Bad pherograms resulting from failed sequencing are deleted once sequencing is successful, or at the latest after completion of a study.

- ab1, txt and pdf files from Macrogen are not retained.

- NGS raw data: A backup copy of the raw data as received from the sequencing company is saved in BGBM NGS raw data storage. In addition, the raw data files (usually renamed for further processing) are stored in a separate subdirectory of the project directory.
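Because the working copy of the NGS raw data is usually renamed, it can no longer be matched to the backup by file name. One way to confirm that a renamed file is still identical to its backup is to compare checksums. The following is a minimal sketch using only the Python standard library; the function names and the choice of SHA-256 are assumptions for illustration, not part of the BGBM workflow.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large *.fastq.gz files need little memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(backup_file: Path, working_file: Path) -> bool:
    """True if the (possibly renamed) working copy is byte-identical to the backup."""
    return sha256_of(backup_file) == sha256_of(working_file)
```

The comparison ignores file names entirely, so it works regardless of how the project copy was renamed.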

Publication: pherograms are normally not published

Sanger Sequences

Sanger sequences are organized by DNA region, by study group or project, and by labcode.

  • Main folder: The folder containing pherograms and contigs is named “sequences” or includes the word “sequences”.

Example: CAR_sequences

  • Folder naming: the folder name must contain the unambiguous name of the DNA region sequenced, e.g. trnL-F. Naming folders after persons is not permitted, even if multiple people are working on the same project.

Example: CAR_trnK-matK

  • New sequences: Store in a subfolder named “new_sequences” until they are processed.
  • No separation by study: Sequences under one labcode are not separated by studies or projects. For example, all Caryophyllaceae sequences (labcode CAR) are stored in one folder CAR_sequences, not separated by the genera studied.
  • Batch organization: Depending on the number of sequences, they can either be stored in one folder or in subfolders of batches of 100 sequences. The second option is highly recommended.
  • Labcode / DB number: For sequences that have no labcode but a DB number, the structure is the same, using DB numbers instead of labcodes.
  • Deprecated or old sequences: Move old or deprecated sequences to an archive folder within the same structure, clearly labelled with the date of archival.

Example: Archive_CA_trnL-F_Jan2024
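The folder names in the examples above follow a simple pattern that can be generated programmatically. The sketch below is illustrative only: the function names are hypothetical, and the month-plus-year date stamp is inferred from the single example Archive_CA_trnL-F_Jan2024 rather than from a stated rule.

```python
from datetime import date

# Fixed English abbreviations, so the result does not depend on the system locale.
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def region_folder(prefix: str, region: str) -> str:
    """Folder for one DNA region, e.g. CAR_trnK-matK."""
    return f"{prefix}_{region}"


def archive_folder(prefix: str, region: str, archived_on: date) -> str:
    """Archive folder labelled with the month and year of archival (assumed format)."""
    stamp = f"{MONTHS[archived_on.month - 1]}{archived_on.year}"
    return f"Archive_{prefix}_{region}_{stamp}"
```

For example, `archive_folder("CA", "trnL-F", date(2024, 1, 15))` returns `"Archive_CA_trnL-F_Jan2024"`.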

Example structure:

Cactaceae
└── CA_sequences
    ├── CA_trnL-F
    ├── CA_trnK-matK
    └── CA_new_sequences

Naming conventions

Pherograms

<Lab code>_<PCR product number>_<primer>.scf

Example: CAR001_NK304_ITS4.scf

This naming is automatically generated by Macrogen if the order sheet is filled out in this way.

DB numbers are used if the sample has no labcode.
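A file name following this pattern can be checked and split into its parts with a short script. This is a minimal sketch under two assumptions that the page does not state: none of the three parts contains an underscore or whitespace, and a DB number has the same general shape as a lab code. The names `Pherogram` and `parse_pherogram_name` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class Pherogram(NamedTuple):
    sample_id: str    # lab code, or DB number if the sample has no lab code
    pcr_product: str  # PCR product number, e.g. NK304
    primer: str       # primer name, e.g. ITS4


# Three underscore-separated parts followed by the .scf extension.
_PHEROGRAM = re.compile(
    r"(?P<sample_id>[^_\s]+)_(?P<pcr_product>[^_\s]+)_(?P<primer>[^_\s]+)\.scf"
)


def parse_pherogram_name(filename: str) -> Optional[Pherogram]:
    """Split a pherogram file name into its parts, or return None if it does not fit."""
    match = _PHEROGRAM.fullmatch(filename)
    if match is None:
        return None
    return Pherogram(**match.groupdict())
```

`parse_pherogram_name("CAR001_NK304_ITS4.scf")` yields the parts `CAR001`, `NK304` and `ITS4`, while a name with a stray space after an underscore is rejected and returns `None`.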

Contigs

The name always contains the lab code and the DNA region name. The species / taxon name is optional.

<Lab code>_<region name>.pde

Example: CRD095_ITS.pde

<Lab code>_<taxon>_<region name>.pde

Example: CRD095_Jurinea-arachnoidea_ITS.pde
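Both contig name variants can be handled by splitting the file name at the underscores. The sketch below assumes that the lab code, taxon and region name themselves contain no underscores (as in the examples, where the taxon is hyphenated); the names `Contig` and `parse_contig_name` are hypothetical.

```python
from pathlib import PurePath
from typing import NamedTuple, Optional


class Contig(NamedTuple):
    sample_id: str        # lab code
    taxon: Optional[str]  # optional species / taxon name
    region: str           # DNA region name, e.g. ITS


def parse_contig_name(filename: str) -> Optional[Contig]:
    """Parse <Lab code>_<region>.pde or <Lab code>_<taxon>_<region>.pde."""
    path = PurePath(filename)
    if path.suffix != ".pde":
        return None
    parts = path.stem.split("_")
    # Reject empty parts and parts that contain whitespace.
    if not all(part and not any(ch.isspace() for ch in part) for part in parts):
        return None
    if len(parts) == 2:
        return Contig(parts[0], None, parts[1])
    if len(parts) == 3:
        return Contig(parts[0], parts[1], parts[2])
    return None
```

Note that the parser cannot tell what the middle part of a three-part name means; it simply treats it as the taxon.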

File structure within the folders

When working with PhyDE, the pherograms must be either

a. in the same folder as the contig files, or

b. in a subfolder of the folder containing the contig files.

This ensures a stable link between pherograms and contigs. The link is preserved when the data are migrated (all folders together), but it is lost if the folders holding the pherograms are renamed.

Option 1. Contigs and pherograms in the same folder (convenient for small datasets)

📂 CAR_ITS

CAR001_ITS.pde [contig]

CAR001_NK304_ITS4.scf [pherogram]

CAR001_NK304_ITS5.scf [pherogram]

Option 2. Pherograms in a subfolder (better for larger datasets)

📂 CAR_ITS

CAR001_ITS.pde

📂 pherograms

CAR001_NK304_ITS4.scf

CAR001_NK304_ITS5.scf
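Whether a pherogram sits in one of the two permitted locations relative to its contig can be checked from the paths alone. The following sketch only inspects the folder layout and does not read or validate anything inside PhyDE files; it also assumes that "subfolder" includes nested subfolders. The function name is hypothetical.

```python
from pathlib import PurePosixPath


def layout_is_allowed(contig: str, pherogram: str) -> bool:
    """True if the pherogram is in the contig's folder or in a subfolder of it."""
    contig_dir = PurePosixPath(contig).parent
    # .parents lists every ancestor folder of the pherogram, starting with its own folder.
    return contig_dir in PurePosixPath(pherogram).parents
```

With the examples above, `layout_is_allowed("CAR_ITS/CAR001_ITS.pde", "CAR_ITS/pherograms/CAR001_NK304_ITS4.scf")` is true, whereas a pherogram kept in a sibling folder of CAR_ITS would fail the check.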

Batch organization. Batches of 100 for large datasets and trnK-matK data

📂 CA_rpl16

CA101_NK205_rpl16.pde

📂 CA001-099

📂 CA100-199

CA101_NK205_CArps3F.scf

CA101_NK205_rpl16R.scf
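The batch folder for a given sample can be derived from the numeric part of its lab code. This is a sketch under the assumption that lab codes consist of letters followed by a zero-padded number and that batches are labelled as in the example (001-099, 100-199, and so on); the function name `batch_folder` is hypothetical.

```python
import re


def batch_folder(sample_id: str, batch_size: int = 100) -> str:
    """Return the batch folder name for a lab code, e.g. CA101 -> CA100-199."""
    match = re.fullmatch(r"([A-Za-z]+)(\d+)", sample_id)
    if match is None:
        raise ValueError(f"unexpected lab code format: {sample_id!r}")
    prefix, digits = match.groups()
    number = int(digits)
    if number < 1:
        raise ValueError("sample numbers are expected to start at 1")
    width = len(digits)  # keep the zero padding used in the lab code
    start = (number // batch_size) * batch_size
    end = start + batch_size - 1
    start = max(start, 1)  # the first batch is labelled 001-099, not 000-099
    return f"{prefix}{start:0{width}d}-{end:0{width}d}"
```

`batch_folder("CA042")` returns `"CA001-099"` and `batch_folder("CA101")` returns `"CA100-199"`.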

Additional guidelines

GenBank sequences: Store alongside contigs. Alternatively, include directly in an alignment with the sequence accession number. Documenting the accession numbers and publication is crucial for accurate source citation.

Sequences from external collaborators: Store separately from own sequences; the folder name may contain the collaborator’s name. Save any e-mails or other information regarding these sequences in the same folder and clearly document all associated information.

Sequences with different lab codes for one study: In case of multiple lab codes for one taxonomic group (like ERS and CRD for Cardueae), store them together in a folder under the labcode that refers to the taxonomic group (CRD).

NGS Sequences

Derivatives of DNA sequences

Nature:

(a) contigs or assembled reads; assembled cp genomes; cleaned and deduplicated NGS reads (including those from partners or downloaded published reads); reads mapped to Hyb-Seq targets.

(b) alignments: master alignments, single-locus alignments, concatenated alignments.

File formats: various formats, including FASTA, PHYLIP, NEXUS and PhyDE.

Storage / folder organization: Sequences are organized by taxonomic group or project, and the DNA region. Sequences obtained from other sources are stored in the same location as own contigs.

Naming convention: see separate document.

Version control: For alignments, include version numbers or dates in the file names. If relevant, also include the number of terminals or any other dataset parameter.
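As an illustration of how version, date and number of terminals could be combined in an alignment file name, consider the sketch below. The actual naming convention is defined in a separate document, so the pattern used here (including the `v` prefix, the ISO date and the `t` suffix for terminals) is purely hypothetical, as is the function name.

```python
from datetime import date


def alignment_name(dataset: str, region: str, version: int,
                   saved_on: date, terminals: int, extension: str = "nex") -> str:
    """Build an alignment file name carrying version, date and number of terminals.

    Hypothetical pattern: <dataset>_<region>_v<NN>_<YYYY-MM-DD>_<N>t.<ext>
    """
    return (f"{dataset}_{region}_v{version:02d}_"
            f"{saved_on.isoformat()}_{terminals}t.{extension}")
```

For instance, `alignment_name("CRD", "ITS", 3, date(2024, 1, 15), 120)` gives `"CRD_ITS_v03_2024-01-15_120t.nex"`. Zero-padded versions and ISO dates have the practical advantage that the files sort chronologically in a directory listing.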

Metadata:

Sample metadata: As described above

Data source: For sequences obtained from other sources (e.g., collaborators), include the origin, accession numbers, citation and any other relevant information. For GenBank sequences, include accession number and citations.

Link to source data: Linked to source data under section 1. by labcode (own data) or INSDC accession number (published data).

Link to sample metadata: through labcode or DB number

Accession numbers for published sequences are included in the DNA master list.

Retention:

- Contigs: stored indefinitely

- Alignments: published alignments are always retained and linked to the publication. Master alignments are stored as versions; outdated versions can be deleted.

Publication:

- Sequences used in a manuscript are submitted to GenBank or ENA, along with the sample metadata, using a specified structure including standardized locality metadata, identifiers, and marker designations.

- Final alignments are published along with the manuscript, either as supplement, or uploaded to a repository.

Sequence submission files: Project info text file, metadata file(s) (original and formatted for submission), alignments files, taxonomic info files (e.g., for unpublished names), flat files for submission (ENA), other ENA-specific files, and the file(s) containing accession numbers are archived along with the manuscript files.