Bioinformatic Tools

Data Pre-processing

Sequencing reads are provided as compressed (zipped) fastq files. They must first be processed to remove sequencing adapters and unique barcodes, then demultiplexed using common tools. File prefixes and paths will vary according to the experimental conditions and file origin, and should be substituted for instances of "$Outprefix" or "$path" in subsequent steps.

Trimmomatic

When performing sequencing using an Illumina instrument, sequences that correspond to the library adapter can be [more often than not] present in the fastq files at the 5'- and 3'- ends of the reads, if the read length is greater than the insert size. Since library inserts are shorter than the total read length of the sequencer, adapters must be trimmed away prior to alignment. Trimmomatic is used to remove the 3'-end adapters (Bolger et al. 2014).

java -jar trimmomatic-0.36.jar SE -threads 16 -phred33 $fastq $Outprefix'_trim3.fastq' ILLUMINACLIP:$fasta:3:20:10 SLIDINGWINDOW:4:20 MINLEN:18

Note that the described parameters are specific to this pipeline. For instance, the reads from PARIS and SHARC-based experiments are single-ended, therefore only one set of input and output files is specified. The processing steps (trimming, cropping, adapter clipping, etc.) are specified as additional arguments after the input/output files. Multiple steps can be specified as required, by using additional arguments at the end as described in the section processing steps (separated by a single space). For input and output files, adding .gz/.bz2 as an extension tells Trimmomatic that the file is provided in gzip/bzip2 format or that Trimmomatic should gzip/bzip2 the file, respectively. Note that most processing steps take one or more settings, delimited by ':' (colon).

java -jar <trimmomatic.jar> SE [-threads <threads>] [-phred33 | -phred64] <input.fastq.gz> <output.fastq.gz> ILLUMINACLIP:<adapters.fa>:<seed mismatches>:<palindrome clip threshold>:<simple clip threshold> SLIDINGWINDOW:<windowSize>:<requiredQuality> MINLEN:<length>

SE: Single ended mode.
-threads: Indicates the number of computing threads to use, which improves performance on multi-core computers. If not specified, it will be chosen automatically.
-phred33 or -phred64: Specifies the base quality encoding.
Input/output: For input and output files, specifying .gz tells Trimmomatic that the file provided is in compressed format or that Trimmomatic should gzip the file, respectively.
ILLUMINACLIP: Finds and removes specifically Illumina adapters.
fastaWithAdaptersEtc: Specifies the path to a fasta file containing all the adapters, PCR sequences, etc. The naming of the various sequences within this file determines how they are used.
seedMismatches: Specifies the maximum mismatch count which will still allow a full match to be performed.; To speed up the search, short sections of each adapter (up to 16bp) are tested at all possible positions to find "seeds" that trigger a full alignment.
palindromeClipThreshold: Specifies how accurate the match between the two 'adapter ligated' reads must be for PE palindrome read alignment.
simpleClipThreshold: Specifies how accurate the match between any adapter etc. sequence must be against a read.; Suggested values are 7-15 depending on the length of the Score; increase 0.6 per match, decrease by Q/10 per mismatch where Q is the quality score.
Sliding window: Performs sliding window trimming when average quality within the window falls below specified threshold.; By considering multiple bases, a single poor quality base will not cause the removal of high quality data later in the read.
Sliding window size: Length of window in number of base pairs.
Sliding window minimum quality: The average quality required during trimming.
Minimum read length: Removes reads that fall below the specified minimum length. Reads removed by this step are included in the 'dropped reads' count. Length written in base pairs.

Note that this final pre-processing step takes place after readCollapse and splitFastq (see below). Trimmomatic is used to remove the 5'-end adapter sequences. HEADCROP:17 option is used to trim off the designed random and multiplexing barcodes.

java -jar trimmomatic-0.36.jar SE -threads 16 -phred33 $Outprefix'_trim3_nodup_'$barcode'.fastq' $Outprefix'_trim_no-dup_'$barcode'.fastq' HEADCROP:17 MINLEN:20

java -jar <trimmomatic.jar> SE -threads 16 -phred33 <input.fastq> <output.fastq> HEADCROP:17 MINLEN:20

icSHAPE Tools

Carryover of code from the iCSHAPE pipeline. More details to be added.

perl readCollapse $Outprefix'_trim3.fastq' $Outprefix'_trim3_nodup.fastq'

perl <readCollapse.pl> <input.fastq> <output.fastq>

perl splitFastq.pl -U $Outprefix'_trim3_nodup.fastq' -b 6:6 -l CGTGAT:R701::ACATCG:R702::GCCTAA:R703 -s -d ./

perl splitFastq.pl -U <input.fastq> -b 6:6 -l L -s -d <Output Path>

-b: Specifies the position and length of hexamer for reads to be used for demultiplexing.
-l “L”: Specifies the list of specific length (-b) of barcodes "L" to be used for separation and subsequent names for the output file(s).; Multiple barcodes can be provided separated with two colons "::".; For example, providing "CGTGAT:R701" will take reads with barcode "CGTGAT" and separate them into file "R701.fastq".; Use format BARCODE1:LIB_NAME1::BARCODE2:LIB_NAME2::...
-s: Generates a simple statistics output file if separated by library barcode.
-d: Specifies the directory at which to output file(s).

Quality Control (FastQC)

FastQC

After pre-processing, data must be checked for quality.

Alignment to reference genome

STAR

The mapping of pre-processed data is standardized from PARIS- and SHARC-based sequencing. STAR and various scripts from the CRSSANT pipeline are used. More specific details can be found in the supplemental material provided as part of the CRSSANT pipeline paper.


                        /STAR-2.7.1a/bin/Linux_x86_64/STAR

                        --runThreadN 8

                        --runMode alignReads

                        --genomeDir $StaridxPath

                        --readFilesIn $Fastq

                        --outFileNamePrefix $Outprefix'_1_'

                        --genomeLoad NoSharedMemory

                        --outReadsUnmapped Fastx

                        --outFilterMultimapNmax 10

                        --outFilterScoreMinOverLread 0

                        --outSAMattributes All

                        --outSAMtype BAM Unsorted SortedByCoordinate

                        --alignIntronMin 1

                        --scoreGap 0

                        --scoreGapNoncan 0

                        --scoreGapGCAG 0

                        --scoreGapATAC 0

                        --scoreGenomicLengthLog2scale -1

                        --chimOutType WithinBAM HardClip

                        --chimSegmentMin 5

                        --chimJunctionOverhangMin 5

                        --chimScoreJunctionNonGTAG 0

                        --chimScoreDropMax 80

                        --chimNonchimScoreDropMin 20

Primary Read Extraction and Classification

Extraction by samtools


                        samtools view -h x_Aligned.sortedByCoord.out.bam | awk '$1~/^@/ || NF<21' > x_nonchimeric_temp.sam

                        samtools view -h x_Aligned.sortedByCoord.out.bam | awk '$1!~/^@/ && NF==21'> x_chimeric_temp.sam'

                        samtools view -bS -F 0x900 -o x_nonchimeric_pri.bam x_nonchimeric_temp.sam

                        samtools view -h x_nonchimeric_pri.bam > x_nonchimeric_pri.sam

                        cat x_nonchimeric_pri.sam x_chimeric_temp.sam > x_1_pri.sam

                        rm -f x_nonchimeric_temp.sam x_chimeric_temp.sam x_nonchimeric_pri.bam x_nonchimeric_pri.sam

Classifying Primary Reads

python3 gaptypes.py x_1_pri.sam x_1_pri -1 15 1

Rearrange Softclipped Continuous Reads

python softreverse_v2.py x_1_pricont.sam softrev.fastq

Repeat STAR mapping with the softrev.fastq as input file, then extract and classify the alignments.

Combine output from the two rounds of STAR mapping


                        python3 merger_sams.py x_1_prigap1.sam x_2_prigap1.sam x gap1 x_prigap1.tmp

                        python3 merger_sams.py x_1_prigapm.sam x_2_prigapm.sam x gapm x_prigapm.tmp

                        python3 merger_sams.py x_1_prihomo.sam x_2_prihomo.sam x homo x_prihomo.tmp

                        python3 merger_sams.py x_1_pritrans.sam x_2_pritrans.sam x trans x_pritrans.tmp


                        samtools view -H x_1_Aligned.sortedByCoord.out.bam > header


                        cat header x_prigap1.tmp > x_prigap1.sam

                        cat header x_prigapm.tmp > x_prigapm.sam

                        cat header x_prihomo.tmp > x_prihomo.sam
 
                        cat header x_pritrans.tmp > x_pritrans.sam

                        rm -f header *.tmp

Filter splices and short gapped reads


                        python gapfilter.py Gtf x_prigap1.sam x_prigap1_filtered.sam 13 yes

                        python gapfilter.py Gtf x_prigapm.sam x_prigapm_filtered.sam 13 yes

Generate Alignment Statistics

Gap Length


                        cat x_prigap1_filtered.sam x_prigapm_filtered.sam > x_prigaps_filtered.sam

                        python gaplendist.py x_prigaps_filtered.sam sam x.list all

                        python gaplendist.py x.list list x_gaplen.pdf all

                        rm -f x_prigaps_filtered.sam x.list

Arm Length


                        cat x_prigap1_filtered.sam x_prigapm_filtered.sam x_pritrans.sam > x_prifiltered.sam

                        python seglendist.py x_prifiltered.sam sam x_prifiltered.list

                        python seglendist.py x_prifiltered.list list x_seglen.pdf

                        rm -f x_prifiltered.sam x_prifiltered.list

Read Counting


                        samtools view -bS -o x_prigap1.bam x_prigap1.sam

                        python CountCdsUtr.py x_prigap1.bam CdsUtr.bed none x_pri


                        bedtools bamtobed -i x_prigap1.bam > x.bed

                        sort -k1,1 -k2,2n x.bed > x_sorted.bed

                        rm x.bed

                        bedtools coverage -g chrNameLength_sorted.txt -a reference.bed -b x_sorted.bed -counts -sorted -nonamecheck > x_coverage.txt

CRSSANT

Input file preparation


                        python merger.py x_prigap1_filtered.sam x_pritrans.sam x_pri_crssant.sam

OPTIONAL: Tag with unique identifier

Prepare the files for CRSSANT DG/NG assembly, tagging with a label.


                        awk '$0!~/^@/ {printf $1"-TAG1\t"; for(i=2;i<=NF;++i) printf $i "\t"; printf"\n"}' x_pri_crssant.sam > x_pri_crssant_TAG1_tmp.sam

                        awk '$0!~/^@/ {printf $1"-TAG2\t"; for(i=2;i<=NF;++i) printf $i "\t"; printf"\n"}' x_pri_crssant.sam > x_pri_crssant_TAG2_tmp.sam

                        samtools view -H x_pri_crssant.sam > header

                        cat header x_pri_crssant_TAG1_tmp.sam x_pri_crssant_TAG2_tmp.sam > x_pri_crssant.sam

                        rm -f header x_pri_crssant_TAG1_tmp.sam x_pri_crssant_TAG2_tmp.sam

(AFTER CRSSANT DG ASSEMBLY) Separate the alignments by the tagged identifier.


                        awk '$0~/^@/ || $1~/-TAG/' x_pri_crssant.cliques.t_o0.2.sam > x_ crssant_TAG1.sam

Build "genesfile" in BED format


                        python genes2bed.py x_ReadCount.txt CdsUtr.bed min_reads outname

Generate bedgraphs


                        bedtools genomecov -bg -split -strand + -ibam x_pri_crssant.bam -g staridxPath/chrNameLength.txt > x_plus.bedgraph

                        bedtools genomecov -bg -split -strand - -ibam x_pri_crssant.bam -g staridxPath/chrNameLength.txt > x_minus.bedgraph

DG and NG assembly


                        python crssant.py -cluster cliques -t_o 0.2 -out ./ x_pri_crssant.sam $Bed x_plus.bedgraph,x_minus.bedgraph

TG assembly


                        python gapmcluster.py x.bedpe x_gapm.sam

Useful Resources

●●●●●●●

IGV Browser

●●●●●●●