Output Files

This section describes the output file formats generated by the TRISTAN toolkit. Each format offers a different view of the results, from raw model predictions to tabular summaries and standardized genome annotations. Understanding these formats is essential for integrating TRISTAN results into downstream workflows.

📦 Raw Predictions (.npy)

Raw predictions for every nucleotide position on the transcripts are saved as pickled NumPy objects. Each transcript is stored as one entry holding the transcript label, an array of model outputs (e.g., scores or probabilities) for each nucleotide position, and an array of the matching true labels. For TIS Transformer, predictions are also saved within the HDF5 database.

Each entry in the numpy array is a list with three elements:

  • the sample ID and transcript identifier, joined by a pipe character: <sample_id>|<transcript_id>. The <sample_id> is seq when predictions come from TIS Transformer.

  • the predicted scores for each position of that transcript

  • the true labels for each position of that transcript, as derived from the GTF file used to create the database.

>>> import numpy as np
>>> results = np.load('<out_prefix>_<sample>.npy', allow_pickle=True)
>>> results[0]
array(['seq|ENST00000410304',
       array([2.3891837e-09, 7.0824785e-07, 8.3791534e-09, 4.3269135e-09,
              4.9220684e-08, 1.5315813e-10, 7.0196869e-08, 2.4103475e-10,
              4.5873511e-10, 1.4299616e-10, 6.1071654e-09, 1.9664975e-08,
              2.9255699e-07, 4.7719610e-08, 7.7600065e-10, 9.2305236e-10,
              3.3297397e-07, 3.5771163e-07, 4.1942007e-05, 4.5123262e-08,
              1.0270607e-11, 1.1841109e-09, 7.9038587e-10, 6.5511790e-10,
              6.0892291e-13, 1.6157842e-11, 6.9130129e-10, 4.5778301e-11,
              2.1682500e-03, 2.3315516e-09, 2.2578116e-11], dtype=float32),
       array([False False False False False False False False False False False False
              False False False False False False False False False False False False
              False False False False False False False], dtype=bool)],
      dtype=object)
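
As a rough illustration of how one such entry might be unpacked, the sketch below splits the label and lists the highest-scoring positions. The helper names split_label and top_positions are hypothetical and not part of TRISTAN. The example entry is hand-built from plain lists with made-up scores to mirror the three-element structure above; the same code should also accept an entry loaded with np.load.

```python
def split_label(label):
    """Split '<sample_id>|<transcript_id>' into its two parts."""
    sample_id, transcript_id = label.split("|", 1)
    return sample_id, transcript_id


def top_positions(scores, k=3):
    """Return the k highest-scoring (0-based array index, score) pairs."""
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return [(idx, float(score)) for idx, score in ranked[:k]]


# Hand-built stand-in for one entry: [label, scores, true labels].
# With real output, use: entry = np.load(path, allow_pickle=True)[0]
entry = [
    "seq|ENST00000410304",
    [2.4e-09, 4.2e-05, 2.2e-03, 7.1e-07],  # made-up scores
    [False, False, False, False],
]

label, scores, targets = entry
sample_id, transcript_id = split_label(label)
print(sample_id, transcript_id)
print(top_positions(scores, k=2))
```

Note that the indices returned here are 0-based array indices, whereas the positions reported in the CSV table are 1-based.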

📅 CSV Table (.csv)

The CSV table gives an extended view of the predicted translated ORFs together with their associated features. Each row corresponds to a single ORF and captures model scores, genomic coordinates, transcript details, and additional annotations.

Column Name               Description
------------------------  ------------------------------------------------------------
seqname                   Chromosome or scaffold name from GTF.
ORF_id                    Unique ID from transcript ID and ORF TIS position.
ORF_len                   Length of the ORF in nucleotides.
ribotie_score             RiboTIE: RiboTIE model score (0-1, not a probability).
ribotie_rank              RiboTIE: Prediction rank in table (1 is highest).
tis_transformer_score     TIS Transformer: TIS Transformer model score.
transcript_id             Transcript ID from GTF.
transcript_len            Length of transcript in nucleotides.
start_codon               Start codon sequence (3-char string).
stop_codon                Stop codon sequence (3-char string or None).
strand                    Transcript strand (+ or -).
ORF_type                  Classification of ORFs (annotated CDS, u(o)ORF, d(o)ORF, intORF, varRNA-ORF, lncRNA-ORF).
TIS_pos                   TIS position on transcript (1-based).
TTS_pos                   TTS position on transcript (1-based; -1 if no stop).
shared_CDS_clones         Transcript IDs of CDSs covering the same genomic regions as the ORF.
shared_CDS_TIS            Boolean: ORF shares TIS with annotated CDS.
shared_CDS_TTS            Boolean: ORF shares TTS with annotated CDS.
shared_in_frame_CDS_frac  Fraction of ORF within annotated CDS genomic regions.
correction                Correction to find nearest in-frame ATG.
dist_from_canonical_TIS   Nucleotide distance from canonical TIS.
frame_wrt_canonical_TIS   Frame relative to canonical TIS (2 is 1-nt upstream shift).
TTS_on_transcript         Boolean: valid stop codon found on transcript.
reads_in_transcript       RiboTIE: Total ribosome reads mapping to transcript genomic region.
reads_in_ORF              RiboTIE: Total ribosome reads mapping to ORF genomic region.
reads_in_frame_frac       RiboTIE: Fraction of in-frame reads relative to reads_in_ORF.
reads_5UTR                RiboTIE: Reads in 5’ UTR (relative to called ORF).
reads_3UTR                RiboTIE: Reads in 3’ UTR (relative to called ORF).
reads_coverage_frac       RiboTIE: Fraction of ORF with at least one aligned read.
TIS_coord                 TIS genomic coordinate (first nucleotide of start codon).
TIS_exon                  Transcript exon number for TIS (1-based).
LTS_coord                 Last Translation Site (LTS) genomic coordinate (last nucleotide of ORF).
LTS_exon                  Transcript exon number for LTS (1-based).
TTS_coord                 TTS genomic coordinate (first nucleotide of stop codon; -1 if not present).
TTS_exon                  Transcript exon number for TTS (1-based).
canonical_TIS_coord       Canonical TIS genomic coordinate (first nucleotide of start codon; -1 if not present).
canonical_TIS_exon        Transcript exon number for canonical TIS (1-based).
canonical_TIS_pos         TIS position of canonical CDS on transcript (1-based).
canonical_TTS_coord       Canonical TTS genomic coordinate (first nucleotide of stop codon; -1 if not present).
canonical_TTS_pos         TTS position of canonical CDS on transcript (1-based).
canonical_TTS_exon        Transcript exon number for canonical TTS (1-based).
protein_seq               Predicted protein sequence.
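
A table like this can be filtered with standard tooling. The following is a minimal sketch using only Python's built-in csv module; the function name select_orfs, the two example rows, and the score threshold are made up for illustration and carry no recommendation about which cutoff to use.

```python
import csv
import io

# Hypothetical two-row excerpt with a subset of the columns described above;
# the values are invented for illustration only.
example_csv = io.StringIO(
    "ORF_id,ORF_type,ribotie_score,ribotie_rank\n"
    "ENST00000222800_1,uoORF,0.20,2\n"
    "ENST00000000001_55,annotated CDS,0.95,1\n"
)


def select_orfs(handle, min_score=0.5):
    """Return rows with ribotie_score >= min_score, best rank first."""
    rows = [
        row for row in csv.DictReader(handle)
        if float(row["ribotie_score"]) >= min_score
    ]
    return sorted(rows, key=lambda row: int(row["ribotie_rank"]))


selected = select_orfs(example_csv, min_score=0.1)
print([row["ORF_id"] for row in selected])
```

For a real run, the StringIO object would be replaced by an open file handle on the CSV output.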

🗄️ GTF format (.gtf)


The GTF (Gene Transfer Format) files produced provide a standardized representation of the identified translation events.

The resulting GTF includes a limited set of features, specifically tailored to represent the translated regions identified by TRISTAN tools:

  • transcript: Represents the entire transcript associated with the identified ORF.

  • exon: Denotes the exonic regions of the transcript.

  • start_codon: Marks the precise genomic location of the translation initiation site.

  • CDS: Defines the Coding Sequence, the region that is translated into protein.

  • stop_codon: Indicates the translation termination site.

Attention

Pay attention to the modified transcript_id: in the output GTF, the transcript_id is actually the ORF_id (i.e., <transcript_id>_<TIS_pos>), where <TIS_pos> is the transcript coordinate of the translation initiation site. This modification is essential because the standard GTF format does not support annotating multiple distinct CDSs on a single transcript without implying a single, contiguous coding region. Users loading data from this GTF for downstream processing should be aware of this convention and may need to strip the _<TIS_pos> suffix to recover the original transcript identifier.
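
Recovering the original transcript identifier could look like the sketch below. The function name split_orf_id is hypothetical. It splits on the last underscore only, on the assumption that transcript identifiers from other annotation sources may themselves contain underscores.

```python
def split_orf_id(orf_id):
    """Split '<transcript_id>_<TIS_pos>' into (transcript_id, TIS_pos)."""
    # Split on the LAST underscore: the transcript ID itself may contain underscores.
    transcript_id, tis_pos = orf_id.rsplit("_", 1)
    return transcript_id, int(tis_pos)


print(split_orf_id("ENST00000222800_1"))  # ('ENST00000222800', 1)
```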

Attention

The phase column in the GTF files is not populated and is currently set to . for all CDS features.
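
If a downstream tool requires phase values, they can be derived from the CDS segments themselves. The sketch below is not part of TRISTAN; the function name cds_phases is hypothetical, and it assumes that the first CDS segment begins at the start codon. It applies the usual GTF definition of phase: the number of bases to skip at the start of a segment to reach the first base of the next complete codon.

```python
def cds_phases(segments, strand):
    """Compute the GTF phase (0, 1, or 2) for each CDS segment.

    segments: list of (start, end) genomic coordinates, 1-based inclusive.
    Returns (start, end, phase) tuples in 5'->3' order of translation.
    """
    # Order segments in the direction of translation.
    ordered = sorted(segments, reverse=(strand == "-"))
    result = []
    consumed = 0  # coding nucleotides seen before the current segment
    for start, end in ordered:
        phase = (3 - consumed % 3) % 3
        result.append((start, end, phase))
        consumed += end - start + 1
    return result


# CDS segments of ENST00000222800_1, taken from the RiboTIE GTF snippet
# at the end of this section (minus strand).
segments = [(73738646, 73738802), (73738328, 73738463), (73737732, 73737735)]
for start, end, phase in cds_phases(segments, "-"):
    print(start, end, phase)
```

For this ORF the three segments are 157, 136, and 4 nucleotides long, which gives phases 0, 2, and 1.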

Beyond standard GTF attributes like gene_id, transcript_id, gene_name, and exon_number, the GTF adds specific attributes to provide detailed information about the identified ORFs, particularly for start_codon, CDS, and stop_codon features:

  • ORF_id: This attribute explicitly states the unique identifier for the predicted Open Reading Frame. As mentioned above, it follows the format <original_transcript_id>_<TIS_pos>.

  • ORF_type: Categorizes the type of ORF identified (e.g., “uORF” for an upstream ORF or “CDS” for a canonical CDS). The example snippet at the end of this section only shows “uoORF”.

  • ribotie_score: A numerical score indicating RiboTIE’s confidence in the prediction of this ORF. Higher scores generally indicate stronger evidence.

  • tis_transformer_score: The score assigned by the TIS Transformer model, providing an additional measure of confidence specifically for the translation initiation site prediction.
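
These attributes can be read back with a small parser. The following is a minimal sketch rather than a full GTF parser; the names ATTRIBUTE_PATTERN and parse_gtf_line are hypothetical, and the regular expression assumes every attribute follows the key "value"; pattern seen in the output GTF. The example line reproduces the start_codon record from the snippet at the end of this section.

```python
import re

# Matches attribute pairs of the form: key "value";
ATTRIBUTE_PATTERN = re.compile(r'(\S+) "([^"]*)";')


def parse_gtf_line(line):
    """Parse one tab-delimited GTF line into a dict of fields and attributes."""
    fields = line.rstrip("\n").split("\t")
    seqname, source, feature, start, end, score, strand, phase, attributes = fields
    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "attributes": dict(ATTRIBUTE_PATTERN.findall(attributes)),
    }


line = "\t".join([
    "7", "RiboTIE", "start_codon", "73738800", "73738802", ".", "-", ".",
    'gene_id "ENSG00000106077"; transcript_id "ENST00000222800_1"; '
    'gene_name "ABHD11"; ORF_id "ENST00000222800_1"; ORF_type "uoORF"; '
    'ribotie_score "0.20319226384162903"; '
    'tis_transformer_score "0.003580695716664195"; exon_number "1";',
])
record = parse_gtf_line(line)
print(record["attributes"]["ORF_type"], float(record["attributes"]["ribotie_score"]))
```

All attribute values come back as strings, so scores need an explicit float conversion.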

GTF files can facilitate further analysis in several ways:

  • Extend reference assembly GTF: Researchers can append the <sample>.novel.gtf file generated by RiboTIE to their existing reference assembly GTF.

  • Quantify read counts on novel sequences: The extended GTF can be used with tools like gffread (to extract the corresponding sequences) and salmon (to quantify reads against them), providing insights into the expression levels of these novel or alternative coding sequences.

  • Visualization in genome browsers: The GTF output can be directly loaded into genome browsers (e.g., IGV, UCSC Genome Browser).
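
Extending a reference annotation amounts to concatenating the two files. The sketch below shows one way to do this in Python; the function name extend_gtf and the file names are hypothetical, and the demonstration runs on throwaway placeholder files rather than real annotations.

```python
import os
import tempfile


def extend_gtf(reference_gtf, novel_gtf, out_gtf):
    """Write the reference GTF followed by the novel ORF records to out_gtf."""
    with open(out_gtf, "w") as out:
        for path in (reference_gtf, novel_gtf):
            with open(path) as handle:
                for line in handle:
                    # Guard against a missing trailing newline at the end of a file.
                    out.write(line if line.endswith("\n") else line + "\n")


# Demonstration on throwaway files; real paths would replace these.
with tempfile.TemporaryDirectory() as tmp:
    ref = os.path.join(tmp, "reference.gtf")
    novel = os.path.join(tmp, "sample.novel.gtf")
    out = os.path.join(tmp, "extended.gtf")
    with open(ref, "w") as fh:
        fh.write("# reference placeholder")  # no trailing newline on purpose
    with open(novel, "w") as fh:
        fh.write("# novel ORF placeholder\n")
    extend_gtf(ref, novel, out)
    with open(out) as fh:
        merged = fh.read()
print(merged)
```

Because the novel records use the ORF_id as their transcript_id, they would not be expected to collide with transcript identifiers in the reference annotation.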

RiboTIE GTF snippet:

7	RiboTIE	transcript	73736094	73738802	.	-	.	gene_id "ENSG00000106077"; transcript_id "ENST00000222800_1"; gene_name "ABHD11";
7	RiboTIE	exon	73738646	73738802	.	-	.	gene_id "ENSG00000106077"; transcript_id "ENST00000222800_1"; gene_name "ABHD11"; exon_number "1";
7	RiboTIE	exon	73738328	73738463	.	-	.	gene_id "ENSG00000106077"; transcript_id "ENST00000222800_1"; gene_name "ABHD11"; exon_number "2";
7	RiboTIE	exon	73737562	73737735	.	-	.	gene_id "ENSG00000106077"; transcript_id "ENST00000222800_1"; gene_name "ABHD11"; exon_number "3";
7	RiboTIE	exon	73737221	73737391	.	-	.	gene_id "ENSG00000106077"; transcript_id "ENST00000222800_1"; gene_name "ABHD11"; exon_number "4";
7	RiboTIE	exon	73736929	73737110	.	-	.	gene_id "ENSG00000106077"; transcript_id "ENST00000222800_1"; gene_name "ABHD11"; exon_number "5";
7	RiboTIE	exon	73736094	73736691	.	-	.	gene_id "ENSG00000106077"; transcript_id "ENST00000222800_1"; gene_name "ABHD11"; exon_number "6";
7	RiboTIE	start_codon	73738800	73738802	.	-	.	gene_id "ENSG00000106077"; transcript_id "ENST00000222800_1"; gene_name "ABHD11"; ORF_id "ENST00000222800_1"; ORF_type "uoORF"; ribotie_score "0.20319226384162903"; tis_transformer_score "0.003580695716664195"; exon_number "1";
7	RiboTIE	CDS	73738646	73738802	.	-	.	gene_id "ENSG00000106077"; transcript_id "ENST00000222800_1"; gene_name "ABHD11"; ORF_id "ENST00000222800_1"; ORF_type "uoORF"; ribotie_score "0.20319226384162903"; tis_transformer_score "0.003580695716664195"; exon_number "1";
7	RiboTIE	CDS	73738328	73738463	.	-	.	gene_id "ENSG00000106077"; transcript_id "ENST00000222800_1"; gene_name "ABHD11"; ORF_id "ENST00000222800_1"; ORF_type "uoORF"; ribotie_score "0.20319226384162903"; tis_transformer_score "0.003580695716664195"; exon_number "2";
7	RiboTIE	CDS	73737732	73737735	.	-	.	gene_id "ENSG00000106077"; transcript_id "ENST00000222800_1"; gene_name "ABHD11"; ORF_id "ENST00000222800_1"; ORF_type "uoORF"; ribotie_score "0.20319226384162903"; tis_transformer_score "0.003580695716664195"; exon_number "3";
7	RiboTIE	stop_codon	73737729	73737731	.	-	.	gene_id "ENSG00000106077"; transcript_id "ENST00000222800_1"; gene_name "ABHD11"; ORF_id "ENST00000222800_1"; ORF_type "uoORF"; ribotie_score "0.20319226384162903"; tis_transformer_score "0.003580695716664195"; exon_number "3";