# Output Files

This section describes the output file formats generated by the TRISTAN toolkit. Each format offers a different view of the results, from raw model predictions to tabular summaries and standardized genome annotations. Understanding these formats is essential for integrating TRISTAN results into downstream workflows.

(npy-layout)=
## 📦 Raw Predictions (.npy)

Raw predictions for every nucleotide on the transcripts are saved as pickled numpy objects. For each transcript, the output is an array containing the transcript label and a corresponding array of model outputs (e.g., scores or probabilities) for each nucleotide position. For TIS Transformer, predictions are also saved within the `hdf5` database.

Each entry in the numpy array holds three elements:
- the sample ID and transcript identifier, joined by `|` as `<sample_id>|<transcript_id>`. The `<sample_id>` is `seq` when predictions come from TIS Transformer.
- the predicted scores for each position of that transcript.
- the true labels for each position of that transcript, as derived from the GTF file used to create the database.

```python
>>> import numpy as np
>>> results = np.load('<out_prefix>_<sample>.npy', allow_pickle=True)
>>> results[0]
array(['seq|ENST00000410304',
       array([2.3891837e-09, 7.0824785e-07, 8.3791534e-09, 4.3269135e-09,
              4.9220684e-08, 1.5315813e-10, 7.0196869e-08, 2.4103475e-10,
              4.5873511e-10, 1.4299616e-10, 6.1071654e-09, 1.9664975e-08,
              2.9255699e-07, 4.7719610e-08, 7.7600065e-10, 9.2305236e-10,
              3.3297397e-07, 3.5771163e-07, 4.1942007e-05, 4.5123262e-08,
              1.0270607e-11, 1.1841109e-09, 7.9038587e-10, 6.5511790e-10,
              6.0892291e-13, 1.6157842e-11, 6.9130129e-10, 4.5778301e-11,
              2.1682500e-03, 2.3315516e-09, 2.2578116e-11], dtype=float32),
       array([False, False, False, False, False, False, False, False, False,
              False, False, False, False, False, False, False, False, False,
              False, False, False, False, False, False, False, False, False,
              False, False, False, False], dtype=bool)],
      dtype=object)
```
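
To work with these entries, something like the following sketch may be useful. It assumes the three-element layout described above; the helper name `summarize_entry` and the `threshold` value are illustrative and not part of TRISTAN, and the entry is synthetic rather than real model output.

```python
import numpy as np


def summarize_entry(entry, threshold=0.001):
    """Split the entry label and list array indices scoring above `threshold`.

    `threshold` is an arbitrary illustrative cutoff, not a recommended value.
    """
    sample_id, transcript_id = str(entry[0]).split("|", 1)
    scores = np.asarray(entry[1], dtype=float)
    labels = np.asarray(entry[2], dtype=bool)
    hits = np.flatnonzero(scores > threshold)
    return {
        "sample_id": sample_id,
        "transcript_id": transcript_id,
        "n_positions": int(scores.size),
        "hit_indices": hits.tolist(),      # 0-based array indices
        "n_labelled": int(labels.sum()),   # positions labelled True
    }


# Synthetic entry mimicking the documented layout (values are made up).
entry = [
    "seq|ENST_EXAMPLE",
    np.array([1e-9, 2e-3, 5e-8, 0.9], dtype=np.float32),
    np.array([False, False, False, True]),
]
summary = summarize_entry(entry)
print(summary)
```

With a real file, the entries would come from `np.load('<out_prefix>_<sample>.npy', allow_pickle=True)` instead of the hand-built list.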

(csv-layout)=
## 📅 CSV Table (.csv)
The CSV table gives an extended view of the predicted translated ORFs and their associated features. Each row corresponds to a single ORF and captures model scores, genomic coordinates, transcript details, and additional annotations.

| Column Name                 | Description                                                                                         |
| :-------------------------- | :-------------------------------------------------------------------------------------------------- |
| `seqname`                   | Chromosome or scaffold name from GTF.                                                               |
| `ORF_id`                    | Unique ID from transcript ID and ORF TIS position.                                                  |
| `ORF_len`                   | Length of the ORF in nucleotides.                                                                   |
| `ribotie_score`             | **RiboTIE:** RiboTIE model score (0-1, not a probability).                                                       |
| `ribotie_rank`              | **RiboTIE:** Prediction rank in table (`1` is highest).                                                          |
| `tis_transformer_score`     | **TIS Transformer:** TIS Transformer model score.                                                   |
| `transcript_id`             | Transcript ID from GTF.                                                                             |
| `transcript_len`            | Length of transcript in nucleotides.                                                                |
| `start_codon`               | Start codon sequence (3-char string).                                                               |
| `stop_codon`                | Stop codon sequence (3-char string or `None`).                                                      |
| `strand`                    | Transcript strand (`+` or `-`).                                                                     |
| `ORF_type`                  | Classification of ORFs (`annotated CDS`, `u(o)ORF`, `d(o)ORF`, `intORF`, `varRNA-ORF`, `lncRNA-ORF`). |
| `TIS_pos`                   | TIS position on transcript (1-based).                                                               |
| `TTS_pos`                   | TTS position on transcript (`-1` if no stop, 1-based).                                              |
| `shared_CDS_clones`         | Transcript IDs of CDSs with identical genomic regions as ORF.                                       |
| `shared_CDS_TIS`            | Boolean: ORF shares TIS with annotated CDS.                                                         |
| `shared_CDS_TTS`            | Boolean: ORF shares TTS with annotated CDS.                                                         |
| `shared_in_frame_CDS_frac`  | Fraction of ORF within annotated CDS genomic regions.                                               |
| `correction`                | Correction to find nearest in-frame ATG.                                                            |
| `dist_from_canonical_TIS`   | Nucleotide distance from canonical TIS.                                                             |
| `frame_wrt_canonical_TIS`   | Frame relative to canonical TIS (`2` is 1-nt upstream shift).                                       |
| `TTS_on_transcript`         | Boolean: valid stop codon found on transcript.                                                      |
| `reads_in_transcript`       | **RiboTIE:** Total ribosome reads mapping to transcript genomic region.                        |
| `reads_in_ORF`              | **RiboTIE:** Total ribosome reads mapping to ORF genomic region.                               |
| `reads_in_frame_frac`       | **RiboTIE:** Fraction of in-frame reads relative to `reads_in_ORF`.                            |
| `reads_5UTR`                | **RiboTIE:** Reads in 5' UTR (relative to called ORF).                                         |
| `reads_3UTR`                | **RiboTIE:** Reads in 3' UTR (relative to called ORF).                                         |
| `reads_coverage_frac`       | **RiboTIE:** Fraction of ORF with at least one aligned read.                                   |
| `TIS_coord`                 | TIS genomic coordinate (first nucleotide of start codon).                                           |
| `TIS_exon`                  | Transcript exon number for TIS (1-based).                                                           |
| `LTS_coord`                 | Last Translation Site (LTS) genomic coordinate (last nucleotide of ORF).                            |
| `LTS_exon`                  | Transcript exon number for LTS (1-based).                                                           |
| `TTS_coord`                 | TTS genomic coordinate (first nucleotide of stop codon; `-1` if not present).                     |
| `TTS_exon`                  | Transcript exon number for TTS (1-based).                                                           |
| `canonical_TIS_coord`       | Canonical TIS genomic coordinate (first nucleotide of start codon; `-1` if not present).            |
| `canonical_TIS_exon`        | Transcript exon number for canonical TIS (1-based).                                                 |
| `canonical_TIS_pos`         | TIS position of canonical CDS on transcript (1-based).                                              |
| `canonical_TTS_coord`       | Canonical TTS genomic coordinate (first nucleotide of stop codon; `-1` if not present).             |
| `canonical_TTS_pos`         | TTS position of canonical CDS on transcript (1-based).                                              |
| `canonical_TTS_exon`        | Transcript exon number for canonical TTS (1-based).                                                 |
| `protein_seq`               | Predicted protein sequence.                                                                         |
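
A common first step is to filter the table on a few of these columns. The sketch below uses only the standard-library `csv` module on a small synthetic table; the rows are made up, the helper names are hypothetical, and it assumes booleans are written as `True`/`False` text, which should be checked against a real output file.

```python
import csv
import io

# Synthetic rows with a subset of the documented columns (values are made up).
EXAMPLE_CSV = """ORF_id,transcript_id,ORF_type,ribotie_score,ribotie_rank,TTS_on_transcript
TX1_10,TX1,annotated CDS,0.98,1,True
TX2_5,TX2,uORF,0.61,2,True
TX3_40,TX3,lncRNA-ORF,0.35,3,False
"""


def load_orfs(handle):
    """Read ORF rows and convert the numeric and boolean columns used below."""
    rows = []
    for row in csv.DictReader(handle):
        row["ribotie_score"] = float(row["ribotie_score"])
        row["ribotie_rank"] = int(row["ribotie_rank"])
        # Assumption: booleans are serialized as the text "True"/"False".
        row["TTS_on_transcript"] = row["TTS_on_transcript"].strip().lower() == "true"
        rows.append(row)
    return rows


def non_cds_orfs(rows):
    """Keep ORFs that are not annotated CDSs and have a stop codon, best rank first."""
    keep = [
        r for r in rows
        if r["ORF_type"] != "annotated CDS" and r["TTS_on_transcript"]
    ]
    return sorted(keep, key=lambda r: r["ribotie_rank"])


rows = load_orfs(io.StringIO(EXAMPLE_CSV))
selected_ids = [r["ORF_id"] for r in non_cds_orfs(rows)]
print(selected_ids)
```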

(gtf-layout)=
## 🗄️ GTF format (.gtf)
---
The GTF (Gene Transfer Format) file provides a standardized representation of the identified translation events.

The resulting GTF includes a limited set of features, tailored to represent the translated regions identified by TRISTAN tools:

* **transcript**: Represents the entire transcript associated with the identified ORF.
* **exon**: Denotes the exonic regions of the transcript.
* **start_codon**: Marks the precise genomic location of the translation initiation site.
* **CDS**: Defines the Coding Sequence, the region that is translated into protein.
* **stop_codon**: Indicates the translation termination site.

:::{attention}
:class: myclass1 myclass2
:name: transcript-id
Pay attention to the **modified `transcript_id`**: In the output GTF, the `transcript_id` is actually the `ORF_id` (i.e., `<transcript_id>_<TIS_pos>`), where `<TIS_pos>` is the transcript coordinate of the translation initiation site. This modification is essential because the standard GTF format does not support the annotation of multiple distinct CDSs on a single transcript without implying a single, contiguous coding region. Users loading data from this GTF for downstream processing should be aware of this convention and may need to split off the `_<TIS_pos>` part if they wish to revert to the original transcript identifier.
:::
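
Recovering the original identifier could look like the sketch below. It assumes `<TIS_pos>` is always the part after the last underscore, which matters for transcript IDs that themselves contain underscores; `split_orf_id` is a hypothetical helper, not a TRISTAN function, and the second ID is a made-up example.

```python
def split_orf_id(orf_id):
    """Split '<transcript_id>_<TIS_pos>' into (transcript_id, TIS_pos)."""
    # rsplit on the last underscore so IDs containing underscores stay intact.
    transcript_id, tis_pos = orf_id.rsplit("_", 1)
    return transcript_id, int(tis_pos)


ensembl_style = split_orf_id("ENST00000222800_1")
underscore_style = split_orf_id("NM_000000.1_25")  # made-up ID with an underscore
print(ensembl_style, underscore_style)
```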

:::{attention}
:class: myclass1 myclass2
:name: phase
The phase value (column 8) is not populated in the GTF files and is currently set to `.` for all CDS features.
:::


Beyond standard GTF attributes like `gene_id`, `transcript_id`, `gene_name`, and `exon_number`, the GTF adds specific attributes to provide detailed information about the identified ORFs, particularly for `start_codon`, `CDS`, and `stop_codon` features:
- `ORF_id`: The unique identifier of the predicted open reading frame. As mentioned above, it follows the format `<original_transcript_id>_<TIS_pos>`.
- `ORF_type`: The category of the identified ORF (e.g., "uORF" for an upstream ORF or "CDS" for a canonical CDS; the example snippet below contains only "uoORF").
- `ribotie_score`: A numerical score indicating RiboTIE's confidence in the prediction of this ORF. Higher scores generally indicate stronger evidence.
- `tis_transformer_score`: The score from the TIS Transformer model, an additional measure of confidence specifically for the translation initiation site (TIS) prediction.
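
The attributes listed above live in the ninth GTF column as `key "value";` pairs. A minimal sketch for turning that column into a dictionary, assuming every value is double-quoted as in the RiboTIE output shown in this section; `parse_attributes` is an illustrative helper rather than part of TRISTAN.

```python
import re

# Matches `key "value";` pairs as they appear in the attribute column.
ATTRIBUTE_RE = re.compile(r'(\w+) "([^"]*)";')


def parse_attributes(attribute_field):
    """Return the GTF attribute column as a dict of strings."""
    return dict(ATTRIBUTE_RE.findall(attribute_field))


attribute_field = (
    'gene_id "ENSG00000106077"; transcript_id "ENST00000222800_1"; '
    'gene_name "ABHD11"; ORF_id "ENST00000222800_1"; ORF_type "uoORF"; '
    'ribotie_score "0.20319226384162903"; '
    'tis_transformer_score "0.003580695716664195"; exon_number "1";'
)
attrs = parse_attributes(attribute_field)
ribotie_score = float(attrs["ribotie_score"])  # scores are stored as text
print(attrs["ORF_type"], ribotie_score)
```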

GTF files can facilitate further analysis in several ways:

* **Extend reference assembly GTF**: Researchers can append the `<sample>.novel.gtf` file generated by RiboTIE to their existing reference assembly GTF. 
* **Quantify read counts on novel sequences**: Tools like [gffread](https://github.com/gpertea/gffread) and [salmon](https://salmon.readthedocs.io/en/latest/salmon.html) can use the extended GTF to quantify reads mapping to these novel or alternative coding sequences, giving insight into their expression levels.
* **Visualization in genome browsers**: The GTF output can be directly loaded into genome browsers (e.g., IGV, UCSC Genome Browser).
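
Extending a reference GTF amounts to concatenating the two files. A minimal sketch, assuming both files are plain-text (uncompressed) GTFs; the function name and file names are placeholders, and the demo writes tiny stand-in files to a temporary directory.

```python
import tempfile
from pathlib import Path


def extend_gtf(reference_gtf, novel_gtf, out_gtf):
    """Write reference records followed by novel records; return lines written."""
    n_written = 0
    with open(out_gtf, "w") as out:
        for path in (reference_gtf, novel_gtf):
            with open(path) as handle:
                for line in handle:
                    if not line.strip():
                        continue  # skip blank lines
                    # Make sure every record ends with a newline before appending.
                    out.write(line if line.endswith("\n") else line + "\n")
                    n_written += 1
    return n_written


with tempfile.TemporaryDirectory() as tmp:
    ref = Path(tmp) / "reference.gtf"
    novel = Path(tmp) / "sample.novel.gtf"
    merged = Path(tmp) / "extended.gtf"
    # Stand-in records; real files would come from the assembly and RiboTIE.
    ref.write_text('1\tsrc\texon\t1\t10\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n')
    novel.write_text('1\tRiboTIE\tCDS\t4\t9\t.\t+\t.\tgene_id "g1"; transcript_id "t1_4";')
    n_lines = extend_gtf(ref, novel, merged)
    merged_lines = merged.read_text().splitlines()

print(n_lines)
```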

RiboTIE GTF snippet:
```tsv
7	RiboTIE	transcript	73736094	73738802	.	-	.	gene_id "ENSG00000106077"; transcript_id "ENST00000222800_1"; gene_name "ABHD11";
7	RiboTIE	exon	73738646	73738802	.	-	.	gene_id "ENSG00000106077"; transcript_id "ENST00000222800_1"; gene_name "ABHD11"; exon_number "1";
7	RiboTIE	exon	73738328	73738463	.	-	.	gene_id "ENSG00000106077"; transcript_id "ENST00000222800_1"; gene_name "ABHD11"; exon_number "2";
7	RiboTIE	exon	73737562	73737735	.	-	.	gene_id "ENSG00000106077"; transcript_id "ENST00000222800_1"; gene_name "ABHD11"; exon_number "3";
7	RiboTIE	exon	73737221	73737391	.	-	.	gene_id "ENSG00000106077"; transcript_id "ENST00000222800_1"; gene_name "ABHD11"; exon_number "4";
7	RiboTIE	exon	73736929	73737110	.	-	.	gene_id "ENSG00000106077"; transcript_id "ENST00000222800_1"; gene_name "ABHD11"; exon_number "5";
7	RiboTIE	exon	73736094	73736691	.	-	.	gene_id "ENSG00000106077"; transcript_id "ENST00000222800_1"; gene_name "ABHD11"; exon_number "6";
7	RiboTIE	start_codon	73738800	73738802	.	-	.	gene_id "ENSG00000106077"; transcript_id "ENST00000222800_1"; gene_name "ABHD11"; ORF_id "ENST00000222800_1"; ORF_type "uoORF"; ribotie_score "0.20319226384162903"; tis_transformer_score "0.003580695716664195"; exon_number "1";
7	RiboTIE	CDS	73738646	73738802	.	-	.	gene_id "ENSG00000106077"; transcript_id "ENST00000222800_1"; gene_name "ABHD11"; ORF_id "ENST00000222800_1"; ORF_type "uoORF"; ribotie_score "0.20319226384162903"; tis_transformer_score "0.003580695716664195"; exon_number "1";
7	RiboTIE	CDS	73738328	73738463	.	-	.	gene_id "ENSG00000106077"; transcript_id "ENST00000222800_1"; gene_name "ABHD11"; ORF_id "ENST00000222800_1"; ORF_type "uoORF"; ribotie_score "0.20319226384162903"; tis_transformer_score "0.003580695716664195"; exon_number "2";
7	RiboTIE	CDS	73737732	73737735	.	-	.	gene_id "ENSG00000106077"; transcript_id "ENST00000222800_1"; gene_name "ABHD11"; ORF_id "ENST00000222800_1"; ORF_type "uoORF"; ribotie_score "0.20319226384162903"; tis_transformer_score "0.003580695716664195"; exon_number "3";
7	RiboTIE	stop_codon	73737729	73737731	.	-	.	gene_id "ENSG00000106077"; transcript_id "ENST00000222800_1"; gene_name "ABHD11"; ORF_id "ENST00000222800_1"; ORF_type "uoORF"; ribotie_score "0.20319226384162903"; tis_transformer_score "0.003580695716664195"; exon_number "3";
```
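
As a sanity check on the snippet above, the CDS segments of one ORF can be summed to get its coding length. The sketch below is illustrative only: `cds_length` is a hypothetical helper, and the lines are rebuilt from the CDS coordinates in the snippet with a shortened attribute column.

```python
def cds_length(gtf_lines, orf_id):
    """Sum the lengths of all CDS segments carrying the given ORF_id."""
    total = 0
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "CDS":
            continue
        if f'ORF_id "{orf_id}";' not in fields[8]:
            continue
        # GTF coordinates are 1-based and inclusive.
        total += int(fields[4]) - int(fields[3]) + 1
    return total


# Shortened attribute column; coordinates copied from the CDS rows above.
attributes = 'gene_id "ENSG00000106077"; transcript_id "ENST00000222800_1"; ORF_id "ENST00000222800_1";'
segments = [("73738646", "73738802"), ("73738328", "73738463"), ("73737732", "73737735")]
gtf_lines = [
    "\t".join(["7", "RiboTIE", "CDS", start, end, ".", "-", ".", attributes])
    for start, end in segments
]
n_nucleotides = cds_length(gtf_lines, "ENST00000222800_1")
print(n_nucleotides, n_nucleotides // 3)  # 297 nt, 99 codons
```

In this snippet the `stop_codon` feature lies outside the CDS segments, so the 297 nt counted here exclude the stop codon.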