Bioinformatic Tools
Data Pre-processing
Sequencing reads are provided as compressed (zipped) fastq files. They must first be processed to remove sequencing adapters and unique barcodes, then demultiplexed using common tools. File prefixes and paths will vary according to the experimental conditions and file origin, and should be substituted for instances of "$Outprefix" or "$path" in subsequent steps.
Trimmomatic
When performing sequencing using an Illumina instrument, sequences that correspond to the library adapter can be [more often than not] present in the fastq files at the 5'- and 3'- ends of the reads, if the read length is greater than the insert size. Since library inserts are shorter than the total read length of the sequencer, adapters must be trimmed away prior to alignment. Trimmomatic is used to remove the 3'-end adapters (Bolger et al. 2014).
java -jar trimmomatic-0.36.jar SE -threads 16 -phred33 $fastq $Outprefix'_trim3.fastq' ILLUMINACLIP:$fasta:3:20:10 SLIDINGWINDOW:4:20 MINLEN:18
Note that the described parameters are specific to this pipeline. For instance, the reads from PARIS and SHARC-based experiments are single-ended, therefore only one set of input and output files is specified. The processing steps (trimming, cropping, adapter clipping, etc.) are specified as additional arguments after the input/output files. Multiple steps can be specified as required, by using additional arguments at the end as described in the section processing steps (separated by a single space). For input and output files, adding .gz/.bz2 as an extension tells Trimmomatic that the file is provided in gzip/bzip2 format or that Trimmomatic should gzip/bzip2 the file, respectively. Note that most processing steps take one or more settings, delimited by ':' (colon).
java -jar <trimmomatic.jar> SE [-threads <threads>] [-phred33 | -phred64] <input.fastq.gz> <output.fastq.gz> ILLUMINACLIP:<adapters.fa>:<seed mismatches>:<palindrome clip threshold>:<simple clip threshold> SLIDINGWINDOW:<windowSize>:<requiredQuality> MINLEN:<length>
- SE
- Single ended mode.
- -threads
- Indicates the number of computing threads to use, which improves performance on multi-core computers. If not specified, it will be chosen automatically.
- -phred33 or -phred64
- Specifies the base quality encoding.
- Input/output
- For input and output files, specifying .gz tells Trimmomatic that the file provided is in compressed format or that Trimmomatic should gzip the file, respectively.
- ILLUMINACLIP
- Finds and removes specifically Illumina adapters.
- fastaWithAdaptersEtc
- Specifies the path to a fasta file containing all the adapters, PCR sequences, etc. The naming of the various sequences within this file determines how they are used.
- seedMismatches
- Specifies the maximum mismatch count which will still allow a full match to be performed.
- To speed up the search, short sections of each adapter (up to 16bp) are tested at all possible positions to find "seeds" that trigger a full alignment.
- palindromeClipThreshold
- Specifies how accurate the match between the two 'adapter ligated' reads must be for PE palindrome read alignment.
- simpleClipThreshold
- Specifies how accurate the match between any adapter etc. sequence must be against a read.
- Suggested values are 7-15 depending on the length of the Score; increase 0.6 per match, decrease by Q/10 per mismatch where Q is the quality score.
- Sliding window
- Performs sliding window trimming when average quality within the window falls below specified threshold.
- By considering multiple bases, a single poor quality base will not cause the removal of high quality data later in the read.
- Sliding window size
- Length of window in number of base pairs.
- Sliding window minimum quality
- The average quality required during trimming.
- Minimum read length
- Removes reads that fall below the specified minimum length. Reads removed by this step are included in the 'dropped reads' count. Length written in base pairs.
Note that this final pre-processing step takes place after readCollapse and splitFastq (see below). Trimmomatic is used to remove the 5'-end adapter sequences. HEADCROP:17 option is used to trim off the designed random and multiplexing barcodes.
java -jar trimmomatic-0.36.jar SE -threads 16 -phred33 $Outprefix'_trim3_nodup_'$barcode'.fastq' $Outprefix'_trim_no-dup_'$barcode'.fastq' HEADCROP:17 MINLEN:20
java -jar <trimmomatic.jar> SE -threads 16 -phred33 <input.fastq> <output.fastq> HEADCROP:17 MINLEN:20
icSHAPE Tools
Carryover of code from the iCSHAPE pipeline. More details to be added.
perl readCollapse $Outprefix'_trim3.fastq' $Outprefix'_trim3_nodup.fastq'
perl <readCollapse.pl> <input.fastq> <output.fastq>
perl splitFastq.pl -U $Outprefix'_trim3_nodup.fastq' -b 6:6 -l CGTGAT:R701::ACATCG:R702::GCCTAA:R703 -s -d ./
perl splitFastq.pl -U <input.fastq> -b 6:6 -l L -s -d <Output Path>
- -b
- Specifies the position and length of hexamer for reads to be used for demultiplexing.
- -l āLā
- Specifies the list of specific length (-b) of barcodes "L" to be used for separation and subsequent names for the output file(s).
- Multiple barcodes can be provided separated with two colons "::".
- For example, providing "CGTGAT:R701" will take reads with barcode "CGTGAT" and separate them into file "R701.fastq".
- Use format BARCODE1:LIB_NAME1::BARCODE2:LIB_NAME2::...
- -s
- Generates a simple statistics output file if separated by library barcode.
- -d
- Specifies the directory at which to output file(s).
Quality Control (FastQC)
FastQC
After pre-processing, data must be checked for quality.
Alignment to reference genome
STAR
The mapping of pre-processed data is standardized from PARIS- and SHARC-based sequencing. STAR and various scripts from the CRSSANT pipeline are used. More specific details can be found in the supplemental material provided as part of the CRSSANT pipeline paper.
/STAR-2.7.1a/bin/Linux_x86_64/STAR
--runThreadN 8
--runMode alignReads
--genomeDir $StaridxPath
--readFilesIn $Fastq
--outFileNamePrefix $Outprefix'_1_'
--genomeLoad NoSharedMemory
--outReadsUnmapped Fastx
--outFilterMultimapNmax 10
--outFilterScoreMinOverLread 0
--outSAMattributes All
--outSAMtype BAM Unsorted SortedByCoordinate
--alignIntronMin 1
--scoreGap 0
--scoreGapNoncan 0
--scoreGapGCAG 0
--scoreGapATAC 0
--scoreGenomicLengthLog2scale -1
--chimOutType WithinBAM HardClip
--chimSegmentMin 5
--chimJunctionOverhangMin 5
--chimScoreJunctionNonGTAG 0
--chimScoreDropMax 80
--chimNonchimScoreDropMin 20
Primary Read Extraction and Classification
Extraction by samtools
samtools view -h x_Aligned.sortedByCoord.out.bam | awk '$1~/^@/ || NF<21' > x_nonchimeric_temp.sam
samtools view -h x_Aligned.sortedByCoord.out.bam | awk '$1!~/^@/ && NF==21'> x_chimeric_temp.sam'
samtools view -bS -F 0x900 -o x_nonchimeric_pri.bam x_nonchimeric_temp.sam
samtools view -h x_nonchimeric_pri.bam > x_nonchimeric_pri.sam
cat x_nonchimeric_pri.sam x_chimeric_temp.sam > x_1_pri.sam
rm -f x_nonchimeric_temp.sam x_chimeric_temp.sam x_nonchimeric_pri.bam x_nonchimeric_pri.sam
Classifying Primary Reads
python3 gaptypes.py x_1_pri.sam x_1_pri -1 15 1
Rearrange Softclipped Continuous Reads
python softreverse_v2.py x_1_pricont.sam softrev.fastq
Repeat STAR mapping with the softrev.fastq as input file, then extract and classify the alignments.
Combine output from the two rounds of STAR mapping
python3 merger_sams.py x_1_prigap1.sam x_2_prigap1.sam x gap1 x_prigap1.tmp
python3 merger_sams.py x_1_prigapm.sam x_2_prigapm.sam x gapm x_prigapm.tmp
python3 merger_sams.py x_1_prihomo.sam x_2_prihomo.sam x homo x_prihomo.tmp
python3 merger_sams.py x_1_pritrans.sam x_2_pritrans.sam x trans x_pritrans.tmp
samtools view -H x_1_Aligned.sortedByCoord.out.bam > header
cat header x_prigap1.tmp > x_prigap1.sam
cat header x_prigapm.tmp > x_prigapm.sam
cat header x_prihomo.tmp > x_prihomo.sam
cat header x_pritrans.tmp > x_pritrans.sam
rm -f header *.tmp
Filter splices and short gapped reads
python gapfilter.py Gtf x_prigap1.sam x_prigap1_filtered.sam 13 yes
python gapfilter.py Gtf x_prigapm.sam x_prigapm_filtered.sam 13 yes
Generate Alignment Statistics
Gap Length
cat x_prigap1_filtered.sam x_prigapm_filtered.sam > x_prigaps_filtered.sam
python gaplendist.py x_prigaps_filtered.sam sam x.list all
python gaplendist.py x.list list x_gaplen.pdf all
rm -f x_prigaps_filtered.sam x.list
Arm Length
cat x_prigap1_filtered.sam x_prigapm_filtered.sam x_pritrans.sam > x_prifiltered.sam
python seglendist.py x_prifiltered.sam sam x_prifiltered.list
python seglendist.py x_prifiltered.list list x_seglen.pdf
rm -f x_prifiltered.sam x_prifiltered.list
Read Counting
samtools view -bS -o x_prigap1.bam x_prigap1.sam
python CountCdsUtr.py x_prigap1.bam CdsUtr.bed none x_pri
bedtools bamtobed -i x_prigap1.bam > x.bed
sort -k1,1 -k2,2n x.bed > x_sorted.bed
rm x.bed
bedtools coverage -g chrNameLength_sorted.txt -a reference.bed -b x_sorted.bed -counts -sorted -nonamecheck > x_coverage.txt
CRSSANT
Input file preparation
python merger.py x_prigap1_filtered.sam x_pritrans.sam x_pri_crssant.sam
OPTIONAL: Tag with unique identifier
Prepare the files for CRSSANT DG/NG assembly, tagging with a label.
awk '$0!~/^@/ {printf $1"-TAG1\t"; for(i=2;i<=NF;++i) printf $i "\t"; printf"\n"}' x_pri_crssant.sam > x_pri_crssant_TAG1_tmp.sam
awk '$0!~/^@/ {printf $1"-TAG2\t"; for(i=2;i<=NF;++i) printf $i "\t"; printf"\n"}' x_pri_crssant.sam > x_pri_crssant_TAG2_tmp.sam
samtools view -H x_pri_crssant.sam > header
cat header x_pri_crssant_TAG1_tmp.sam x_pri_crssant_TAG2_tmp.sam > x_pri_crssant.sam
rm -f header x_pri_crssant_TAG1_tmp.sam x_pri_crssant_TAG2_tmp.sam
(AFTER CRSSANT DG ASSEMBLY) Separate the alignments by the tagged identifier.
awk '$0~/^@/ || $1~/-TAG/' x_pri_crssant.cliques.t_o0.2.sam > x_ crssant_TAG1.sam
Build "genesfile" in BED format
python genes2bed.py x_ReadCount.txt CdsUtr.bed min_reads outname
Generate bedgraphs
bedtools genomecov -bg -split -strand + -ibam x_pri_crssant.bam -g staridxPath/chrNameLength.txt > x_plus.bedgraph
bedtools genomecov -bg -split -strand - -ibam x_pri_crssant.bam -g staridxPath/chrNameLength.txt > x_minus.bedgraph
DG and NG assembly
python crssant.py -cluster cliques -t_o 0.2 -out ./ x_pri_crssant.sam $Bed x_plus.bedgraph,x_minus.bedgraph
TG assembly
python gapmcluster.py x.bedpe x_gapm.sam