Bioinformatic Tools


Data Pre-processing

Sequencing reads are provided as compressed (zipped) fastq files. They must first be processed to remove sequencing adapters and unique barcodes, then demultiplexed using common tools. File prefixes and paths will vary according to the experimental conditions and file origin, and should be substituted for instances of "$Outprefix" or "$path" in subsequent steps.

Trimmomatic

When performing sequencing using an Illumina instrument, sequences that correspond to the library adapter can be [more often than not] present in the fastq files at the 5'- and 3'- ends of the reads, if the read length is greater than the insert size. Since library inserts are shorter than the total read length of the sequencer, adapters must be trimmed away prior to alignment. Trimmomatic is used to remove the 3'-end adapters (Bolger et al. 2014).

java -jar trimmomatic-0.36.jar SE -threads 16 -phred33 $fastq $Outprefix'_trim3.fastq' ILLUMINACLIP:$fasta:3:20:10 SLIDINGWINDOW:4:20 MINLEN:18

Note that the described parameters are specific to this pipeline. For instance, the reads from PARIS and SHARC-based experiments are single-ended, therefore only one set of input and output files is specified. The processing steps (trimming, cropping, adapter clipping, etc.) are specified as additional arguments after the input/output files. Multiple steps can be specified as required, by using additional arguments at the end as described in the section processing steps (separated by a single space). For input and output files, adding .gz/.bz2 as an extension tells Trimmomatic that the file is provided in gzip/bzip2 format or that Trimmomatic should gzip/bzip2 the file, respectively. Note that most processing steps take one or more settings, delimited by ':' (colon).

java -jar <trimmomatic.jar> SE [-threads <threads>] [-phred33 | -phred64] <input.fastq.gz> <output.fastq.gz> ILLUMINACLIP:<adapters.fa>:<seed mismatches>:<palindrome clip threshold>:<simple clip threshold> SLIDINGWINDOW:<windowSize>:<requiredQuality> MINLEN:<length>
SE
Single ended mode.
-threads
Indicates the number of computing threads to use, which improves performance on multi-core computers. If not specified, it will be chosen automatically.
-phred33 or -phred64
Specifies the base quality encoding.
Input/output
For input and output files, specifying .gz tells Trimmomatic that the file provided is in compressed format or that Trimmomatic should gzip the file, respectively.
ILLUMINACLIP
Finds and removes specifically Illumina adapters.
fastaWithAdaptersEtc
Specifies the path to a fasta file containing all the adapters, PCR sequences, etc. The naming of the various sequences within this file determines how they are used.
seedMismatches
Specifies the maximum mismatch count which will still allow a full match to be performed.
To speed up the search, short sections of each adapter (up to 16bp) are tested at all possible positions to find "seeds" that trigger a full alignment.
palindromeClipThreshold
Specifies how accurate the match between the two 'adapter ligated' reads must be for PE palindrome read alignment.
simpleClipThreshold
Specifies how accurate the match between any adapter etc. sequence must be against a read.
Suggested values are 7-15 depending on the length of the Score; increase 0.6 per match, decrease by Q/10 per mismatch where Q is the quality score.
Sliding window
Performs sliding window trimming when average quality within the window falls below specified threshold.
By considering multiple bases, a single poor quality base will not cause the removal of high quality data later in the read.
Sliding window size
Length of window in number of base pairs.
Sliding window minimum quality
The average quality required during trimming.
Minimum read length
Removes reads that fall below the specified minimum length. Reads removed by this step are included in the 'dropped reads' count. Length written in base pairs.

Note that this final pre-processing step takes place after readCollapse and splitFastq (see below). Trimmomatic is used to remove the 5'-end adapter sequences. HEADCROP:17 option is used to trim off the designed random and multiplexing barcodes.

java -jar trimmomatic-0.36.jar SE -threads 16 -phred33 $Outprefix'_trim3_nodup_'$barcode'.fastq' $Outprefix'_trim_no-dup_'$barcode'.fastq' HEADCROP:17 MINLEN:20
java -jar <trimmomatic.jar> SE -threads 16 -phred33 <input.fastq> <output.fastq> HEADCROP:17 MINLEN:20

icSHAPE Tools

Carryover of code from the iCSHAPE pipeline. More details to be added.

perl readCollapse $Outprefix'_trim3.fastq' $Outprefix'_trim3_nodup.fastq'
perl <readCollapse.pl> <input.fastq> <output.fastq>
perl splitFastq.pl -U $Outprefix'_trim3_nodup.fastq' -b 6:6 -l CGTGAT:R701::ACATCG:R702::GCCTAA:R703 -s -d ./
perl splitFastq.pl -U <input.fastq> -b 6:6 -l L -s -d <Output Path>
-b
Specifies the position and length of hexamer for reads to be used for demultiplexing.
-l  ā€œLā€
Specifies the list of specific length (-b) of barcodes "L" to be used for separation and subsequent names for the output file(s).
Multiple barcodes can be provided separated with two colons "::".
For example, providing "CGTGAT:R701" will take reads with barcode "CGTGAT" and separate them into file "R701.fastq".
Use format BARCODE1:LIB_NAME1::BARCODE2:LIB_NAME2::...
-s
Generates a simple statistics output file if separated by library barcode.
-d
Specifies the directory at which to output file(s).

Quality Control (FastQC)

FastQC

After pre-processing, data must be checked for quality.

Alignment to reference genome

STAR

The mapping of pre-processed data is standardized from PARIS- and SHARC-based sequencing. STAR and various scripts from the CRSSANT pipeline are used. More specific details can be found in the supplemental material provided as part of the CRSSANT pipeline paper.

/STAR-2.7.1a/bin/Linux_x86_64/STAR
--runThreadN 8
--runMode alignReads
--genomeDir $StaridxPath
--readFilesIn $Fastq
--outFileNamePrefix $Outprefix'_1_'
--genomeLoad NoSharedMemory
--outReadsUnmapped Fastx
--outFilterMultimapNmax 10
--outFilterScoreMinOverLread 0
--outSAMattributes All
--outSAMtype BAM Unsorted SortedByCoordinate
--alignIntronMin 1
--scoreGap 0
--scoreGapNoncan 0
--scoreGapGCAG 0
--scoreGapATAC 0
--scoreGenomicLengthLog2scale -1
--chimOutType WithinBAM HardClip
--chimSegmentMin 5
--chimJunctionOverhangMin 5
--chimScoreJunctionNonGTAG 0
--chimScoreDropMax 80
--chimNonchimScoreDropMin 20

Primary Read Extraction and Classification

Extraction by samtools

samtools view -h x_Aligned.sortedByCoord.out.bam | awk '$1~/^@/ || NF<21' > x_nonchimeric_temp.sam
samtools view -h x_Aligned.sortedByCoord.out.bam | awk '$1!~/^@/ && NF==21'> x_chimeric_temp.sam'
samtools view -bS -F 0x900 -o x_nonchimeric_pri.bam x_nonchimeric_temp.sam
samtools view -h x_nonchimeric_pri.bam > x_nonchimeric_pri.sam
cat x_nonchimeric_pri.sam x_chimeric_temp.sam > x_1_pri.sam
rm -f x_nonchimeric_temp.sam x_chimeric_temp.sam x_nonchimeric_pri.bam x_nonchimeric_pri.sam

Classifying Primary Reads

python3 gaptypes.py x_1_pri.sam x_1_pri -1 15 1

Rearrange Softclipped Continuous Reads

python softreverse_v2.py x_1_pricont.sam softrev.fastq

Repeat STAR mapping with the softrev.fastq as input file, then extract and classify the alignments.

Combine output from the two rounds of STAR mapping

python3 merger_sams.py x_1_prigap1.sam x_2_prigap1.sam x gap1 x_prigap1.tmp
python3 merger_sams.py x_1_prigapm.sam x_2_prigapm.sam x gapm x_prigapm.tmp
python3 merger_sams.py x_1_prihomo.sam x_2_prihomo.sam x homo x_prihomo.tmp
python3 merger_sams.py x_1_pritrans.sam x_2_pritrans.sam x trans x_pritrans.tmp

samtools view -H x_1_Aligned.sortedByCoord.out.bam > header

cat header x_prigap1.tmp > x_prigap1.sam
cat header x_prigapm.tmp > x_prigapm.sam
cat header x_prihomo.tmp > x_prihomo.sam
cat header x_pritrans.tmp > x_pritrans.sam
rm -f header *.tmp

Filter splices and short gapped reads

python gapfilter.py Gtf x_prigap1.sam x_prigap1_filtered.sam 13 yes
python gapfilter.py Gtf x_prigapm.sam x_prigapm_filtered.sam 13 yes

Generate Alignment Statistics

Gap Length

cat x_prigap1_filtered.sam x_prigapm_filtered.sam > x_prigaps_filtered.sam
python gaplendist.py x_prigaps_filtered.sam sam x.list all
python gaplendist.py x.list list x_gaplen.pdf all
rm -f x_prigaps_filtered.sam x.list

Arm Length

cat x_prigap1_filtered.sam x_prigapm_filtered.sam x_pritrans.sam > x_prifiltered.sam
python seglendist.py x_prifiltered.sam sam x_prifiltered.list
python seglendist.py x_prifiltered.list list x_seglen.pdf
rm -f x_prifiltered.sam x_prifiltered.list

Read Counting

samtools view -bS -o x_prigap1.bam x_prigap1.sam
python CountCdsUtr.py x_prigap1.bam CdsUtr.bed none x_pri
bedtools bamtobed -i x_prigap1.bam > x.bed
sort -k1,1 -k2,2n x.bed > x_sorted.bed
rm x.bed
bedtools coverage -g chrNameLength_sorted.txt -a reference.bed -b x_sorted.bed -counts -sorted -nonamecheck > x_coverage.txt

CRSSANT

Input file preparation

python merger.py x_prigap1_filtered.sam x_pritrans.sam x_pri_crssant.sam

OPTIONAL: Tag with unique identifier

Prepare the files for CRSSANT DG/NG assembly, tagging with a label.

awk '$0!~/^@/ {printf $1"-TAG1\t"; for(i=2;i<=NF;++i) printf $i "\t"; printf"\n"}' x_pri_crssant.sam > x_pri_crssant_TAG1_tmp.sam
awk '$0!~/^@/ {printf $1"-TAG2\t"; for(i=2;i<=NF;++i) printf $i "\t"; printf"\n"}' x_pri_crssant.sam > x_pri_crssant_TAG2_tmp.sam
samtools view -H x_pri_crssant.sam > header
cat header x_pri_crssant_TAG1_tmp.sam x_pri_crssant_TAG2_tmp.sam > x_pri_crssant.sam
rm -f header x_pri_crssant_TAG1_tmp.sam x_pri_crssant_TAG2_tmp.sam

(AFTER CRSSANT DG ASSEMBLY) Separate the alignments by the tagged identifier.

awk '$0~/^@/ || $1~/-TAG/' x_pri_crssant.cliques.t_o0.2.sam > x_ crssant_TAG1.sam

Build "genesfile" in BED format

python genes2bed.py x_ReadCount.txt CdsUtr.bed min_reads outname

Generate bedgraphs

bedtools genomecov -bg -split -strand + -ibam x_pri_crssant.bam -g staridxPath/chrNameLength.txt > x_plus.bedgraph
bedtools genomecov -bg -split -strand - -ibam x_pri_crssant.bam -g staridxPath/chrNameLength.txt > x_minus.bedgraph

DG and NG assembly

python crssant.py -cluster cliques -t_o 0.2 -out ./ x_pri_crssant.sam $Bed x_plus.bedgraph,x_minus.bedgraph

TG assembly

python gapmcluster.py x.bedpe x_gapm.sam

Useful Resources

●●●●●●●

IGV Browser


●●●●●●●