DNA 2. maftools of genomic variation analysis artifact in SCI articles

Click attention, Huanfeng gene

Huanfeng gene

Bioinformatics analysis, SCI article writing and bioinformatics basic knowledge learning: R language learning, perl basic programming, linux system commands, Python meet you better

78 original content

official account

preface

With the progress of cancer genomics, mutation annotation format (MAF) is widely accepted and used to store detected somatic variants. The Cancer Genome Atlas Project has sequenced more than 30 different cancers, with a sample size of more than 200 for each cancer. The result data composed of somatic variants is stored in the form of mutation annotation format. As long as the data is in MAF format, this package attempts to summarize, analyze, annotate and visualize MAF files in an effective way from TCGA sources or any internal research.

Instance analysis

1. Software installation

When installing the software maftools, you need to install BioManager first, and then install maftools, as follows:

if (!require("BiocManager")) {
    install.packages("BiocManager")
}
if (!require(maftools)) {
    BiocManager::install("maftools")
}
library(maftools)

2. Data reading

The maftools tool needs to read in two files, as follows:

1.MAF file - can be gz compressed. Necessary;

2. Optional recommended clinical data related to each sample / tumor sample barcode in MAF;

3. An optional copy number data: it can be GISTIC output or user-defined table.

1.maf file format

MAF files contain many fields, from chromosome names to cosmic annotations. However, most analyses in maftools use the following: 1 Hugo_ Symbol;

2.Chromosome;

3.Start_Position;

4.End_Position;

5.Reference_Allele;

6.Tumor_Seq_Allele2;

7.Variant_Classification;

8.Variant_Type;

9.Tumor_Sample_Barcode.

Read the maf file and clinical information file at the same time, and see the results as follows:

# path to TCGA LAML MAF file
laml.maf = system.file("extdata", "tcga_laml.maf.gz", package = "maftools")
# clinical information containing survival information and histology. This is
# optional
laml.clin = system.file("extdata", "tcga_laml_annot.tsv", package = "maftools")
laml = read.maf(maf = laml.maf, clinicalData = laml.clin)
## -Reading
## -Validating
## -Silent variants: 475 
## -Summarizing
## -Processing clinical data
## -Finished in 3.790s elapsed (0.550s cpu)
print(laml@data[1:5, ])
##    Hugo_Symbol Entrez_Gene_Id           Center NCBI_Build Chromosome
## 1:      ABCA10          10349 genome.wustl.edu         37         17
## 2:       ABCA4             24 genome.wustl.edu         37          1
## 3:      ABCB11           8647 genome.wustl.edu         37          2
## 4:       ABCC3           8714 genome.wustl.edu         37         17
## 5:       ABCF1             23 genome.wustl.edu         37          6
##    Start_Position End_Position Strand Variant_Classification Variant_Type
## 1:       67170917     67170917      +            Splice_Site          SNP
## 2:       94490594     94490594      +      Missense_Mutation          SNP
## 3:      169780250    169780250      +      Missense_Mutation          SNP
## 4:       48760974     48760974      +      Missense_Mutation          SNP
## 5:       30554429     30554429      +      Missense_Mutation          SNP
##    Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 Tumor_Sample_Barcode
## 1:                T                 T                 C         TCGA-AB-2988
## 2:                C                 C                 T         TCGA-AB-2869
## 3:                G                 G                 A         TCGA-AB-3009
## 4:                C                 C                 T         TCGA-AB-2887
## 5:                G                 G                 A         TCGA-AB-2920
##    Protein_Change i_TumorVAF_WU i_transcript_name
## 1:        p.K960R      45.66000       NM_080282.3
## 2:       p.R1517H      38.12000       NM_000350.2
## 3:       p.A1283V      46.97218       NM_003742.2
## 4:       p.P1271S      56.41000       NM_003786.1
## 5:        p.G658S      40.95000    NM_001025091.1

2. Clinical information format

The clinical data format includes: bar code of each sample / tumor sample, and corresponding clinical data, as follows:

print(laml@clinical.data[1:5, ])
##    Tumor_Sample_Barcode FAB_classification days_to_last_followup
## 1:         TCGA-AB-2802                 M4                   365
## 2:         TCGA-AB-2803                 M3                   792
## 3:         TCGA-AB-2804                 M3                  2557
## 4:         TCGA-AB-2805                 M0                   577
## 5:         TCGA-AB-2806                 M1                   945
##    Overall_Survival_Status
## 1:                       1
## 2:                       1
## 3:                       0
## 4:                       1
## 5:                       1

3. Copy number variation format

Copy number data includes sample name, gene name and copy number status (Amp or Del).

all.lesions <- system.file("extdata", "all_lesions.conf_99.txt", package = "maftools")
amp.genes <- system.file("extdata", "amp_genes.conf_99.txt", package = "maftools")
del.genes <- system.file("extdata", "del_genes.conf_99.txt", package = "maftools")
scores.gis <- system.file("extdata", "scores.gistic", package = "maftools")

laml.gistic = readGistic(gisticAllLesionsFile = all.lesions, gisticAmpGenesFile = amp.genes,
    gisticDelGenesFile = del.genes, gisticScoresFile = scores.gis, isTCGA = TRUE)
## -Processing Gistic files..
## --Processing amp_genes.conf_99.txt
## --Processing del_genes.conf_99.txt
## --Processing scores.gistic
## --Summarizing by samples
laml.gistic@data[1:5, ]
##    Hugo_Symbol Tumor_Sample_Barcode Variant_Classification Variant_Type
## 1:       KMT2A         TCGA-AB-2805                    Amp          CNV
## 2:   LINC00689         TCGA-AB-2805                    Del          CNV
## 3:      MIR595         TCGA-AB-2805                    Del          CNV
## 4:   RN7SL142P         TCGA-AB-2805                    Del          CNV
## 5:         SHH         TCGA-AB-2805                    Del          CNV
##        Cytoband
## 1: AP_2:11q23.3
## 2:  DP_4:7q32.3
## 3:  DP_4:7q32.3
## 4:  DP_4:7q32.3
## 5:  DP_4:7q32.3

3. Data statistics

Summarize the overall data, count the sample size, mutation gene quantity and mutation type, and finally sort them into the table as follows:

# Shows sample summry.
getSampleSummary(laml)
##      Tumor_Sample_Barcode Frame_Shift_Del Frame_Shift_Ins In_Frame_Del
##   1:         TCGA-AB-3009               0               5            0
##   2:         TCGA-AB-2807               1               0            1
##   3:         TCGA-AB-2959               0               0            0
##   4:         TCGA-AB-3002               0               0            0
##   5:         TCGA-AB-2849               0               1            0
##  ---                                                                  
## 189:         TCGA-AB-2942               0               0            0
## 190:         TCGA-AB-2946               0               0            0
## 191:         TCGA-AB-2954               0               0            0
## 192:         TCGA-AB-2982               0               0            0
## 193:         TCGA-AB-2903               0               0            0
##      In_Frame_Ins Missense_Mutation Nonsense_Mutation Splice_Site total
##   1:            1                25                 2           1    34
##   2:            0                16                 3           4    25
##   3:            0                22                 0           1    23
##   4:            0                15                 1           5    21
##   5:            0                16                 1           2    20
##  ---                                                                   
## 189:            1                 0                 0           0     1
## 190:            0                 1                 0           0     1
## 191:            0                 1                 0           0     1
## 192:            0                 1                 0           0     1
## 193:            0                 0                 0           0     0
# Shows gene summary.
getGeneSummary(laml)
##       Hugo_Symbol Frame_Shift_Del Frame_Shift_Ins In_Frame_Del In_Frame_Ins
##    1:        FLT3               0               0            1           33
##    2:      DNMT3A               4               0            0            0
##    3:        NPM1               0              33            0            0
##    4:        IDH2               0               0            0            0
##    5:        IDH1               0               0            0            0
##   ---                                                                      
## 1237:      ZNF689               0               0            0            0
## 1238:      ZNF75D               0               0            0            0
## 1239:      ZNF827               1               0            0            0
## 1240:       ZNF99               0               0            0            0
## 1241:        ZPBP               0               0            0            0
##       Missense_Mutation Nonsense_Mutation Splice_Site total MutatedSamples
##    1:                15                 0           3    52             52
##    2:                39                 5           6    54             48
##    3:                 1                 0           0    34             33
##    4:                20                 0           0    20             20
##    5:                18                 0           0    18             18
##   ---                                                                     
## 1237:                 1                 0           0     1              1
## 1238:                 1                 0           0     1              1
## 1239:                 0                 0           0     1              1
## 1240:                 1                 0           0     1              1
## 1241:                 1                 0           0     1              1
##       AlteredSamples
##    1:             52
##    2:             48
##    3:             33
##    4:             20
##    5:             18
##   ---               
## 1237:              1
## 1238:              1
## 1239:              1
## 1240:              1
## 1241:              1
# shows clinical data associated with samples
getClinicalData(laml)
##      Tumor_Sample_Barcode FAB_classification days_to_last_followup
##   1:         TCGA-AB-2802                 M4                   365
##   2:         TCGA-AB-2803                 M3                   792
##   3:         TCGA-AB-2804                 M3                  2557
##   4:         TCGA-AB-2805                 M0                   577
##   5:         TCGA-AB-2806                 M1                   945
##  ---                                                              
## 189:         TCGA-AB-3007                 M3                  1581
## 190:         TCGA-AB-3008                 M1                   822
## 191:         TCGA-AB-3009                 M4                   577
## 192:         TCGA-AB-3011                 M1                  1885
## 193:         TCGA-AB-3012                 M3                  1887
##      Overall_Survival_Status
##   1:                       1
##   2:                       1
##   3:                       0
##   4:                       1
##   5:                       1
##  ---                        
## 189:                       0
## 190:                       1
## 191:                       1
## 192:                       0
## 193:                       0
# Shows all fields in MAF
getFields(laml)
##  [1] "Hugo_Symbol"            "Entrez_Gene_Id"         "Center"                
##  [4] "NCBI_Build"             "Chromosome"             "Start_Position"        
##  [7] "End_Position"           "Strand"                 "Variant_Classification"
## [10] "Variant_Type"           "Reference_Allele"       "Tumor_Seq_Allele1"     
## [13] "Tumor_Seq_Allele2"      "Tumor_Sample_Barcode"   "Protein_Change"        
## [16] "i_TumorVAF_WU"          "i_transcript_name"
# Writes maf summary to an output file with basename laml.
write.mafSummary(maf = laml, basename = "laml")

Data summary statistical drawing, the number of variables in each sample is displayed as stacked barplot, and the variable type is displayed as variant_ Box diagram summarized by classification. As follows:

plotmafSummary(maf = laml, rmOutlier = TRUE, addStat = "median", dashboard = TRUE,
    titvRaw = FALSE)

4. Mutation data visualization

Waterfall Plot

# oncoplot for top ten mutated genes.
oncoplot(maf = laml, top = 10)

CIS trans graph

laml.titv = titv(maf = laml, plot = FALSE, useSyn = TRUE)
# plot titv summary
plotTiTv(res = laml.titv)

Lollipop chart (amino acid)

# lollipop plot for DNMT3A, which is one of the most frequent mutated gene in
# Leukemia.
lollipopPlot(maf = laml, gene = "DNMT3A", AACol = "Protein_Change", showMutationRate = TRUE,
    labelPos = 882)
##      HGNC refseq.ID protein.ID aa.length
## 1: DNMT3A NM_022552  NP_072046       912
## 2: DNMT3A NM_153759  NP_715640       723
## 3: DNMT3A NM_175629  NP_783328       912

Lollipop chart (protein)

plotProtein(gene = "TP53", refSeqID = "NM_000546")

Rainfall map

brca <- system.file("extdata", "brca.maf.gz", package = "maftools")
brca = read.maf(maf = brca, verbose = FALSE)
rainfallPlot(maf = brca, detectChangePoints = TRUE, pointSize = 0.4)
##    Chromosome Start_Position End_Position nMuts Avg_intermutation_dist Size
## 1:          8       98129348     98133560     7               702.0000 4212
## 2:          8       98398549     98403536     9               623.3750 4987
## 3:          8       98453076     98456466     9               423.7500 3390
## 4:          8      124090377    124096810    22               306.3333 6433
## 5:         12       97436055     97439705     7               608.3333 3650
## 6:         17       29332072     29336153     8               583.0000 4081
##    Tumor_Sample_Barcode C>G C>T
## 1:         TCGA-A8-A08B   4   3
## 2:         TCGA-A8-A08B   1   8
## 3:         TCGA-A8-A08B   1   8
## 4:         TCGA-A8-A08B   1  21
## 5:         TCGA-A8-A08B   4   3
## 6:         TCGA-A8-A08B   4   4

Comparison between catastrophe load and TCGA

laml.mutload = tcgaCompare(maf = laml, cohortName = "Example-LAML", logscale = TRUE,
    capture_size = 50)

VAF distribution map

plotVaf(maf = laml, vafCol = 'i_TumorVAF_WU')

CNV visualization

Genome distribution

gisticChromPlot(gistic = laml.gistic, markBands = "all")

Bubble Diagram

gisticBubblePlot(gistic = laml.gistic)

Waterfall Plot

gisticOncoPlot(gistic = laml.gistic, clinicalData = getClinicalData(x = laml), clinicalFeatures = "FAB_classification",
    sortByAnnotation = TRUE, top = 10)

CBS visualization

tcga.ab.009.seg <- system.file("extdata", "TCGA.AB.3009.hg19.seg.txt", package = "maftools")
plotCBSsegments(cbsFile = tcga.ab.009.seg)
## NULL

This issue mainly talks about simple visual data. In the next issue, we will talk about using maftools to analyze some substantive operations. Please look forward to it. Those who want to learn Shengxin can go to station B and upload tutorials every day!

The live broadcast course of station B and the training course of biological information analysis of tumor cloning and evolution are not recorded and broadcast. Teachers who need the analysis content in this regard can come and communicate!

Our official account of Huan Feng gene also introduced a series of articles on clonal evolution.

Topic 1. sciClone of clonal evolution
Topic 2. ClonEvol of clonal evolution
Topic 3. fishplot of clonal evolution
Topic 4. Pyclone of clonal evolution
Topic 5. CITUP of clonal evolution
Topic 6. Clonal evolution Canopy
Topic 7. Cardelino of clonal evolution
Topic 8. RobustClone of clonal evolution
Topic 9. TimeScape of clonal evolution
Clone 1. The past and present life of tumor clonal evolution
Clone 2. Different evolutionary models of tumor clonal evolution

Teachers who do tumor metastasis, cancer drug resistance, subcloning, targeted drugs and other topics can contact me directly to ensure the best service, the top scientific research methods and help score high marks!

References:

  1. Mayakonda A, Lin DC, Assenov Y, Plass C, Koeffler HP. 2018. Maftools: efficient and comprehensive analysis of somatic variants in cancer. Genome Resarch. PMID: 30341162

Tags: Machine Learning AI Data Analysis Data Mining

Posted by halex on Wed, 20 Apr 2022 01:26:43 +0300