Output Files
This section describes the output file formats generated by the TRISTAN toolkit. Each format offers a different view of the results, from raw model predictions to tabular summaries and standardized genome annotations. Understanding these formats is essential for integrating TRISTAN results into downstream workflows.
Raw Predictions (.npy)
Raw predictions for every nucleotide on the transcripts are saved as pickled NumPy objects. For each transcript, the output is an array containing the transcript label and a corresponding array of model outputs (e.g., scores or probabilities) for each nucleotide position. For TIS Transformer, predictions are also saved within the HDF5 database.
Each entry in the numpy array is a list with three elements:
the sample ID and transcript identifier, separated by a pipe character: <sample_id>|<transcript_id>. The <sample_id> is seq when predictions come from TIS Transformer.
the predicted scores for each position of that transcript.
the true labels for each position of that transcript, as derived from the GTF file used to create the database.
>>> import numpy as np
>>> results = np.load('<out_prefix>_<sample>.npy', allow_pickle=True)
>>> results[0]
array(['seq|ENST00000410304',
array([2.3891837e-09, 7.0824785e-07, 8.3791534e-09, 4.3269135e-09,
4.9220684e-08, 1.5315813e-10, 7.0196869e-08, 2.4103475e-10,
4.5873511e-10, 1.4299616e-10, 6.1071654e-09, 1.9664975e-08,
2.9255699e-07, 4.7719610e-08, 7.7600065e-10, 9.2305236e-10,
3.3297397e-07, 3.5771163e-07, 4.1942007e-05, 4.5123262e-08,
1.0270607e-11, 1.1841109e-09, 7.9038587e-10, 6.5511790e-10,
6.0892291e-13, 1.6157842e-11, 6.9130129e-10, 4.5778301e-11,
2.1682500e-03, 2.3315516e-09, 2.2578116e-11], dtype=float32),
array([False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False], dtype=bool)],
dtype=object)
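As a minimal sketch of how these entries might be consumed, the snippet below iterates over entries with the three-element layout described above and collects the positions whose score exceeds a cutoff. The function name `high_scoring_positions`, the 0.5 cutoff, and the toy transcripts are illustrative assumptions and not part of TRISTAN. The toy list stands in for the array returned by `np.load(..., allow_pickle=True)`, and positions are reported as 0-based array indices.

```python
def high_scoring_positions(results, threshold=0.5):
    """Collect (sample_id, transcript_id, index, score) for scores above threshold.

    `results` is any iterable of (label, scores, labels) entries, such as the
    object array returned by np.load(..., allow_pickle=True).
    """
    hits = []
    for label, scores, _true_labels in results:
        # The label has the form '<sample_id>|<transcript_id>'.
        sample_id, transcript_id = str(label).split("|", 1)
        for index, score in enumerate(scores):
            if score > threshold:
                hits.append((sample_id, transcript_id, index, float(score)))
    # Highest score first.
    return sorted(hits, key=lambda hit: hit[3], reverse=True)


# Toy stand-in for the loaded array (hypothetical transcript IDs and scores).
toy_results = [
    ["seq|TX_A", [0.0, 0.875, 0.125], [False, True, False]],
    ["seq|TX_B", [0.75, 0.0], [True, False]],
]

hits = high_scoring_positions(toy_results, threshold=0.5)
for sample_id, transcript_id, index, score in hits:
    print(sample_id, transcript_id, index, score)
```

Because the function only iterates over its input, it should work equally on the loaded object array and on plain Python lists. A suitable cutoff depends on the model and the data, so the value used here is a placeholder.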
CSV Table (.csv)
The CSV table gives an extended view of the predicted translated ORFs and their associated features. Each row corresponds to a single ORF and captures model scores, genomic coordinates, transcript details, and additional annotations.
| Column Name | Description |
|---|---|
| | Chromosome or scaffold name from GTF. |
| | Unique ID from transcript ID and ORF TIS position. |
| | Length of the ORF in nucleotides. |
| | RiboTIE: RiboTIE model score (0-1, not a probability). |
| | RiboTIE: Prediction rank in table ( |
| | TIS Transformer: Prediction rank in table ( |
| | Transcript ID from GTF. |
| | Length of transcript in nucleotides. |
| | Start codon sequence (3-char string). |
| | Stop codon sequence (3-char string or |
| | Transcript strand ( |
| | Classification of ORFs ( |
| | TIS position on transcript (1-based). |
| | TTS position on transcript ( |
| | Transcript IDs of CDSs with identical genomic regions as ORF. |
| | Boolean: ORF shares TIS with annotated CDS. |
| | Boolean: ORF shares TTS with annotated CDS. |
| | Fraction of ORF within annotated CDS genomic regions. |
| | Correction to find nearest in-frame ATG. |
| | Nucleotide distance from canonical TIS. |
| | Frame relative to canonical TIS ( |
| | Boolean: valid stop codon found on transcript. |
| | RiboTIE: Total ribosome reads mapping to transcript genomic region. |
| | RiboTIE: Total ribosome reads mapping to ORF genomic region. |
| | RiboTIE: Fraction of in-frame reads relative to |
| | RiboTIE: Reads in 5' UTR (relative to called ORF). |
| | RiboTIE: Reads in 3' UTR (relative to called ORF). |
| | RiboTIE: Fraction of ORF with at least one aligned read. |
| | TIS genomic coordinate (first nucleotide of start codon). |
| | Transcript exon number for TIS (1-based). |
| | Last Translation Site (LTS) genomic coordinate (last nucleotide of ORF). |
| | Transcript exon number for LTS (1-based). |
| | TTS genomic coordinate (first nucleotide of stop codon; |
| | Transcript exon number for TTS (1-based). |
| | Canonical TIS genomic coordinate (first nucleotide of start codon; |
| | Transcript exon number for canonical TIS (1-based). |
| | TIS position of canonical CDS on transcript (1-based). |
| | Canonical TTS genomic coordinate (first nucleotide of stop codon; |
| | TTS position of canonical CDS on transcript (1-based). |
| | Transcript exon number for canonical TTS (1-based). |
| | Predicted protein sequence. |
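The sketch below illustrates one way such a table could be filtered and summarized with the Python standard library. The header names `ORF_id`, `ORF_type`, and `ribotie_score` are assumptions borrowed from the GTF attribute names and may not match the CSV header exactly, so check the first row of your own file. The rows, the helper `filter_orfs`, and the 0.5 cutoff are made up for illustration.

```python
import csv
import io
from collections import Counter


def filter_orfs(rows, min_score):
    """Keep rows whose (assumed) ribotie_score column is at least min_score."""
    kept = [row for row in rows if float(row["ribotie_score"]) >= min_score]
    # Sort by descending score so the strongest calls come first.
    return sorted(kept, key=lambda row: float(row["ribotie_score"]), reverse=True)


# Toy CSV content; in practice, open the .csv file written by the tool instead.
toy_csv = io.StringIO(
    "ORF_id,ORF_type,ribotie_score\n"
    "TX_A_1,uORF,0.25\n"
    "TX_A_120,CDS,0.75\n"
    "TX_B_45,CDS,0.5\n"
)

rows = list(csv.DictReader(toy_csv))
confident = filter_orfs(rows, min_score=0.5)
type_counts = Counter(row["ORF_type"] for row in confident)

print([row["ORF_id"] for row in confident])
print(dict(type_counts))
```

`csv.DictReader` returns every value as a string, which is why the score is converted with `float` before it is compared.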
GTF format (.gtf)
The GTF (Gene Transfer Format) file provides a standardized representation of the identified translation events.
The result GTF includes a limited set of features, specifically tailored to represent the translated regions identified by TRISTAN tools:
transcript: Represents the entire transcript associated with the identified ORF.
exon: Denotes the exonic regions of the transcript.
start_codon: Marks the precise genomic location of the translation initiation site.
CDS: Defines the Coding Sequence, the region that is translated into protein.
stop_codon: Indicates the translation termination site.
Attention
Pay attention to the modified transcript_id: In the output GTF, the transcript_id is actually the ORF_id (i.e., <transcript_id>_<TIS_pos>), where <TIS_pos> is the transcript coordinate of the translation initiation site. This modification is essential because the standard GTF format does not support the annotation of multiple distinct CDSs on a single transcript without implying a single, contiguous coding region. Users loading data from this GTF for downstream processing should be aware of this convention and may need to split off the _<TIS_pos> part if they wish to revert to the original transcript identifier.
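A minimal sketch of recovering the original transcript identifier from an ORF_id is shown below. The helper name `split_orf_id` is hypothetical. Splitting on the last underscore is deliberate, since some annotation sources use underscores inside the transcript ID itself; the second example ID is illustrative only.

```python
def split_orf_id(orf_id):
    """Split '<transcript_id>_<TIS_pos>' into (transcript_id, TIS_pos)."""
    # rsplit with maxsplit=1 cuts at the LAST underscore only, so
    # underscores inside the transcript ID are preserved.
    transcript_id, tis_pos = orf_id.rsplit("_", 1)
    return transcript_id, int(tis_pos)


transcript_id, tis_pos = split_orf_id("ENST00000222800_1")
print(transcript_id, tis_pos)

# An ID that itself contains an underscore (illustrative value).
print(split_orf_id("NM_000546.6_203"))
```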
Attention
The phase value in the GTF files is not populated and is currently set to . for all CDS features.
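Downstream tools that require a phase can derive it from the CDS segment lengths. The sketch below uses the usual GTF/GFF definition, where the phase is the number of bases to skip at the start of a segment to reach the first base of the next complete codon. It assumes the segments are supplied in 5' to 3' transcript order and that the first segment begins at the first base of the start codon. The function name `cds_phases` is hypothetical, and the coordinates are the three CDS segments from the RiboTIE GTF snippet in this section.

```python
def cds_phases(segment_lengths):
    """Return the phase (0, 1 or 2) of each CDS segment.

    segment_lengths: CDS segment lengths in 5'->3' transcript order.
    """
    phases = []
    consumed = 0  # coding bases seen in earlier segments
    for length in segment_lengths:
        # Bases still needed to complete the codon left open by earlier segments.
        phases.append((3 - consumed % 3) % 3)
        consumed += length
    return phases


# CDS segments of ENST00000222800_1 (minus strand), listed in transcript order.
segments = [(73738646, 73738802), (73738328, 73738463), (73737732, 73737735)]
lengths = [end - start + 1 for start, end in segments]  # GTF ends are inclusive

phases = cds_phases(lengths)
print(lengths, phases)
```

For a minus-strand transcript, transcript order means descending genomic coordinates, which is the order used in the example.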
Beyond standard GTF attributes like gene_id, transcript_id, gene_name, and exon_number, the GTF adds specific attributes to provide detailed information about the identified ORFs, particularly for start_codon, CDS, and stop_codon features:
ORF_id: This attribute explicitly states the unique identifier for the predicted Open Reading Frame. As mentioned above, it follows the format <original_transcript_id>_<TIS_pos>.
ORF_type: Categorizes the type of ORF identified (e.g., "uORF" for upstream ORF, "CDS" for canonical CDS, etc., though the example snippet only shows "uoORF").
ribotie_score: A numerical score indicating RiboTIE's confidence in the prediction of this ORF. Higher scores generally indicate stronger evidence.
tis_transformer_score: A score derived from a Translation Initiation Site (TIS) Transformer model, providing an additional measure of confidence specifically for the initiation site prediction.
GTF files can facilitate further analysis in several ways:
Extend reference assembly GTF: Researchers can append the <sample>.novel.gtf file generated by RiboTIE to their existing reference assembly GTF.
Quantify read counts on novel sequences: Tools like gffread and salmon can use the extended GTF to quantify reads mapping to these novel or alternative coding sequences, providing insight into their expression levels.
Visualization in genome browsers: The GTF output can be directly loaded into genome browsers (e.g., IGV, UCSC Genome Browser).
RiboTIE GTF snippet:
7⸠RiboTIEâ¸transcript⸠73736094⸠73738802⸠.⸠-⸠.⸠gene_id "ENSG00000106077"; transcript_id "ENST00000222800_1"; gene_name "ABHD11";
7⸠RiboTIEâ¸exon⸠73738646⸠73738802⸠.⸠-⸠.⸠gene_id "ENSG00000106077"; transcript_id "ENST00000222800_1"; gene_name "ABHD11"; exon_number "1";
7⸠RiboTIEâ¸exon⸠73738328⸠73738463⸠.⸠-⸠.⸠gene_id "ENSG00000106077"; transcript_id "ENST00000222800_1"; gene_name "ABHD11"; exon_number "2";
7⸠RiboTIEâ¸exon⸠73737562⸠73737735⸠.⸠-⸠.⸠gene_id "ENSG00000106077"; transcript_id "ENST00000222800_1"; gene_name "ABHD11"; exon_number "3";
7⸠RiboTIEâ¸exon⸠73737221⸠73737391⸠.⸠-⸠.⸠gene_id "ENSG00000106077"; transcript_id "ENST00000222800_1"; gene_name "ABHD11"; exon_number "4";
7⸠RiboTIEâ¸exon⸠73736929⸠73737110⸠.⸠-⸠.⸠gene_id "ENSG00000106077"; transcript_id "ENST00000222800_1"; gene_name "ABHD11"; exon_number "5";
7⸠RiboTIEâ¸exon⸠73736094⸠73736691⸠.⸠-⸠.⸠gene_id "ENSG00000106077"; transcript_id "ENST00000222800_1"; gene_name "ABHD11"; exon_number "6";
7⸠RiboTIEâ¸start_codon⸠73738800⸠73738802⸠.⸠-⸠.⸠gene_id "ENSG00000106077"; transcript_id "ENST00000222800_1"; gene_name "ABHD11"; ORF_id "ENST00000222800_1"; ORF_type "uoORF"; ribotie_score "0.20319226384162903"; tis_transformer_score "0.003580695716664195"; exon_number "1";
7⸠RiboTIEâ¸CDS⸠73738646⸠73738802⸠.⸠-⸠.⸠gene_id "ENSG00000106077"; transcript_id "ENST00000222800_1"; gene_name "ABHD11"; ORF_id "ENST00000222800_1"; ORF_type "uoORF"; ribotie_score "0.20319226384162903"; tis_transformer_score "0.003580695716664195"; exon_number "1";
7⸠RiboTIEâ¸CDS⸠73738328⸠73738463⸠.⸠-⸠.⸠gene_id "ENSG00000106077"; transcript_id "ENST00000222800_1"; gene_name "ABHD11"; ORF_id "ENST00000222800_1"; ORF_type "uoORF"; ribotie_score "0.20319226384162903"; tis_transformer_score "0.003580695716664195"; exon_number "2";
7⸠RiboTIEâ¸CDS⸠73737732⸠73737735⸠.⸠-⸠.⸠gene_id "ENSG00000106077"; transcript_id "ENST00000222800_1"; gene_name "ABHD11"; ORF_id "ENST00000222800_1"; ORF_type "uoORF"; ribotie_score "0.20319226384162903"; tis_transformer_score "0.003580695716664195"; exon_number "3";
7⸠RiboTIEâ¸stop_codon⸠73737729⸠73737731⸠.⸠-⸠.⸠gene_id "ENSG00000106077"; transcript_id "ENST00000222800_1"; gene_name "ABHD11"; ORF_id "ENST00000222800_1"; ORF_type "uoORF"; ribotie_score "0.20319226384162903"; tis_transformer_score "0.003580695716664195"; exon_number "3";