Bioinfo02 - making simulated single cell data with splatter

preparation

Recently, in a paper on single-cell algorithms, I came across a package that can generate many kinds of simulated data on demand. That makes it very convenient for producing ground truth.

Let's try it.

Installation is straightforward:

BiocManager::install("splatter")
# p_load() comes from the pacman package; it loads (and installs if needed) each package
pacman::p_load(splatter, scuttle, scater)

1 - function introduction

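To be honest, the logo of this package is pretty good. The package also includes splatPop [3], whose vignette lists the kinds of effects it can simulate: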

  • Cell group effects: Where multiple, heterogeneous cell-groups are simulated for each individual. These groups could represent different cell-types or the same cell-type before/after a treatment. Group effects include group-specific differential expression (DE) and/or group-specific expression Quantitative Trait Loci (eQTL) effects.
  • Conditional effects between individuals: Where individuals are simulated as belonging to different conditional cohorts (e.g. different treatment groups or groups with different disease statuses). Conditional effects include DE and/or eQTL effects.
  • Batch effect from multiplexed experimental designs: Like in splat, batch effects are simulated by assigning small batch-specific DE effects to all genes. splatPop allows for the simulation of different patterns of batch effects, such as those resulting from multiplexed sequencing designs.

How does splatter simulate single-cell data?

At its core, the Splat model uses a gamma-Poisson distribution to generate a gene-by-cell count matrix. The documentation describes it as follows:

The core of the Splat model is a gamma-Poisson distribution used to generate a gene by cell matrix of counts. Mean expression levels for each gene are simulated from a gamma distribution[4] and the Biological Coefficient of Variation is used to enforce a mean-variance trend before counts are simulated from a Poisson distribution[5]. Splat also allows you to simulate expression outlier genes (genes with mean expression outside the gamma distribution) and dropout (random knock out of counts based on mean expression). Each cell is given an expected library size (simulated from a log-normal distribution) that makes it easier to match to a given dataset.
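
To make the description above concrete, here is a minimal base-R sketch of a gamma-Poisson simulation. It is not splatter's actual code: the sizes are made up, the parameter values are borrowed from the printouts in this post for illustration, and the BCV is held constant instead of following a mean-variance trend. Outliers and dropout are left out.

```r
# Minimal gamma-Poisson sketch in base R (an illustration, not splatter's code)
set.seed(1)
n_genes <- 200   # hypothetical sizes chosen for this example
n_cells <- 50

# 1. Gene means from a gamma distribution
gene_means <- rgamma(n_genes, shape = 0.6, rate = 0.3)

# 2. Expected library size per cell from a log-normal distribution
lib_sizes <- rlnorm(n_cells, meanlog = 11, sdlog = 0.2)

# 3. Scale each cell so that its expected total matches its library size
cell_means <- outer(gene_means / sum(gene_means), lib_sizes)

# 4. Add variability with a fixed BCV (assumption: constant 0.1 for every gene).
#    A gamma with shape 1 / bcv^2 and mean cell_means has CV equal to bcv.
bcv <- 0.1
noisy_means <- matrix(
  rgamma(n_genes * n_cells, shape = 1 / bcv^2, scale = cell_means * bcv^2),
  nrow = n_genes
)

# 5. Poisson counts around the noisy means
counts <- matrix(rpois(n_genes * n_cells, lambda = noisy_means), nrow = n_genes)
dim(counts)
```

Drawing Poisson counts around gamma-distributed means is what makes the marginal distribution gamma-Poisson (negative binomial).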

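2 - SplatParams object

All parameters of a Splat simulation are stored in a SplatParams object. The other simulations in the package have their own Params classes; the first example below creates the simpler SimpleParams object: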

params <- newSimpleParams()
params <- newSimpleParams(nGenes = 200, nCells = 10)

> params
A Params object of class SimpleParams 
Parameters can be (estimable) or [not estimable], 'Default' or  'NOT DEFAULT' 
Secondary parameters are usually set during simulation

Global: 
(GENES)  (CELLS)   [Seed] 
    200       10   977126 

3 additional parameters 

Mean: 
 (Rate)  (Shape) 
    0.3      0.4 

Counts: 
[Dispersion] 
         0.1 

Access and modify parameters

For now, we can think of the SplatParams object as the parameter object that the Splat model uses to create simulated single-cell data; it holds all the parameters of the model. Besides basic information such as the number of genes and cells, it also covers the mean, batch, outlier and dropout parameters, among others. For details, please refer to [[SplatParams detailed parameter introduction]].

Access:

# Create a SplatParams object with default values
params <- newSplatParams()
getParam(params, "nGenes") 
#> [1] 10000

Modification:

# Set multiple parameters at once (using a list)
params <- setParams(params, update = list(nGenes = 8000, mean.rate = 0.5))
# Extract multiple parameters as a list
getParams(params, c("nGenes", "mean.rate", "mean.shape"))
#> $nGenes
#> [1] 8000
#> 
#> $mean.rate
#> [1] 0.5
#> 
#> $mean.shape
#> [1] 0.6
# Set multiple parameters at once (using additional arguments)
params <- setParams(params, mean.shape = 0.5, de.prob = 0.2)
params
#> A Params object of class SplatParams 
#> Parameters can be (estimable) or [not estimable], 'Default' or  'NOT DEFAULT' 
#> Secondary parameters are usually set during simulation
#> 
#> Global: 
#> (GENES)  (Cells)   [SEED] 
#>    8000      100    81261 
#> 
#> 29 additional parameters 
#> 
#> Batches: 
#>     [Batches]  [Batch Cells]     [Location]        [Scale]       [Remove] 
#>             1            100            0.1            0.1          FALSE 
#> 
#> Mean: 
#>  (RATE)  (SHAPE) 
#>     0.5      0.5 
#> 
#> Library size: 
#> (Location)     (Scale)      (Norm) 
#>         11         0.2       FALSE 
#> 
#> Exprs outliers: 
#> (Probability)     (Location)        (Scale) 
#>          0.05              4            0.5 
#> 
#> Groups: 
#>      [Groups]  [Group Probs] 
#>             1              1 
#> 
#> Diff expr: 
#> [PROBABILITY]    [Down Prob]     [Location]        [Scale] 
#>           0.2            0.5            0.1            0.4 
#> 
#> BCV: 
#> (Common Disp)          (DoF) 
#>           0.1             60 
#> 
#> Dropout: 
#>     [Type]  (Midpoint)     (Shape) 
#>       none           0          -1 
#> 
#> Paths: 
#>         [From]         [Steps]          [Skew]    [Non-linear]  [Sigma Factor] 
#>              0             100             0.5             0.1             0.8
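
Parameters can also be set when the object is created, or one at a time with setParam. The short sketch below uses only functions exported by splatter; the values are arbitrary. It also shows that nCells and nBatches are derived from batchCells and are not set directly.

```r
library(splatter)

# Set parameters when the object is created
params <- newSplatParams(nGenes = 5000, batchCells = c(50, 50))

# Change a single parameter by name, then read values back
params <- setParam(params, "de.prob", 0.3)
getParam(params, "de.prob")
getParam(params, "nBatches")  # derived from the length of batchCells
getParam(params, "nCells")    # derived from the sum of batchCells
```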

Estimating parameters from real data

Splat can also estimate its parameters directly from real single-cell data stored in a SingleCellExperiment (SCE) object.

Create a mock dataset:

set.seed(1)
sce <- mockSCE(ncells = 200, ngenes = 2000, nspikes = 100)

> sce
class: SingleCellExperiment 
dim: 2000 200 
metadata(0):
assays(1): counts
rownames(2000): Gene_0001 Gene_0002 ... Gene_1999 Gene_2000
rowData names(0):
colnames(200): Cell_001 Cell_002 ... Cell_199 Cell_200
colData names(3): Mutation_Status Cell_Cycle Treatment
reducedDimNames(0):
altExpNames(1): Spikes

Estimate the parameters with splatEstimate:

> params <- splatEstimate(sce)
NOTE: Library sizes have been found to be normally distributed instead of log-normal. You may want to check this is correct.
> params
A Params object of class SplatParams 
Parameters can be (estimable) or [not estimable], 'Default' or  'NOT DEFAULT' 
Secondary parameters are usually set during simulation

Global: 
(GENES)  (CELLS)   [Seed] 
   2000      200   977126 

29 additional parameters 

Batches: 
    [BATCHES]  [BATCH CELLS]     [Location]        [Scale] 
            1            200            0.1            0.1 
     [Remove] 
        FALSE 

Mean: 
           (RATE)            (SHAPE) 
0.002962686167343  0.496997730070513 

Library size: 
      (LOCATION)           (SCALE)            (NORM) 
      357331.235  11607.2332293176              TRUE 

Exprs outliers: 
(PROBABILITY)     (Location)        (Scale) 
            0              4            0.5 

Groups: 
     [Groups]  [Group Probs] 
            1              1 

Diff expr: 
[Probability]    [Down Prob]     [Location]        [Scale] 
          0.1            0.5            0.1            0.4 

BCV: 
    (COMMON DISP)              (DOF) 
0.752043426792845   11211.8933424157 

Dropout: 
           [Type]         (MIDPOINT)            (SHAPE) 
             none   2.71153535179343  -1.37209356733765 

Paths: 
        [From]         [Steps]          [Skew]    [Non-linear] 
             0             100             0.5             0.1 
[Sigma Factor] 
           0.8 

Estimating Splat parameters from real (here, mock) data involves the following steps:

  • Mean parameters are estimated by fitting a gamma distribution to the mean expression levels.
  • Library size parameters are estimated by fitting a log-normal distribution to the library sizes. (As I understand it, the library size is the total count per cell. As the NOTE in the output above shows, a normal distribution is used when it describes the data better.)
  • Expression outlier parameters are estimated by determining the number of outliers and fitting a log-normal distribution to their difference from the median.
  • BCV parameters are estimated using the estimateDisp function from the edgeR package. (BCV is the Biological Coefficient of Variation.)
  • Dropout parameters are estimated by checking if dropout is present and fitting a logistic function to the relationship between mean expression and proportion of zeros.
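
The dropout step in the list above can be sketched in base R. This assumes the usual logistic form with a midpoint and a shape applied to the log of the mean expression; the function name logistic_dropout is hypothetical, and the midpoint and shape are rounded from the splatEstimate printout above.

```r
# Hypothetical helper: probability that a count is zeroed out, as a logistic
# function of log mean expression (assumed form, not copied from splatter)
logistic_dropout <- function(log_mean, midpoint, shape) {
  1 / (1 + exp(-shape * (log_mean - midpoint)))
}

log_means <- c(0, 2.71, 5, 8)
p_drop <- logistic_dropout(log_means, midpoint = 2.71, shape = -1.37)
round(p_drop, 3)
```

With a negative shape, the dropout probability falls as expression rises, and it is exactly 0.5 at the midpoint.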

For details, see Splat simulation parameters (bioconductor.org) [6].

3 - construct the expression matrix from the estimated Splat parameters

Once the SplatParams object is configured (that is, once the simulation parameters are set), pass it to the splatSimulate function to run the simulation.

> sim <- splatSimulate(params, nGenes = 1000, 
                     batchCells = rep(100,10))
Getting parameters...
Creating simulation object...
Simulating library sizes...
Simulating gene means...
Simulating BCV...
Simulating counts...
Simulating dropout (if needed)...
Sparsifying assays...
Automatically converting to sparse matrices, threshold = 0.95
Skipping 'BatchCellMeans': estimated sparse size 1.5 * dense matrix
Skipping 'BaseCellMeans': estimated sparse size 1.5 * dense matrix
Skipping 'BCV': estimated sparse size 1.5 * dense matrix
Skipping 'CellMeans': estimated sparse size 1.5 * dense matrix
Skipping 'TrueCounts': estimated sparse size 2.79 * dense matrix
Skipping 'counts': estimated sparse size 2.79 * dense matrix
Done!
> sim
class: SingleCellExperiment 
dim: 1000 200 
metadata(1): Params
assays(6): BatchCellMeans BaseCellMeans ... TrueCounts
  counts
rownames(1000): Gene1 Gene2 ... Gene999 Gene1000
rowData names(4): Gene BaseGeneMean OutlierFactor GeneMean
colnames(200): Cell1 Cell2 ... Cell199 Cell200
colData names(3): Cell Batch ExpLibSize
reducedDimNames(0):
altExpNames(0):

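As you can see, we created a single-cell object with 1000 genes and 200 cells. (The output shown here has 200 cells in a single batch, so it was evidently produced without the batchCells argument; with batchCells = rep(100, 10) the result would have 1000 cells in 10 batches.)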

> head(rowData(sim))
DataFrame with 6 rows and 4 columns
             Gene BaseGeneMean OutlierFactor  GeneMean
      <character>    <numeric>     <numeric> <numeric>
Gene1       Gene1    367.75129             1 367.75129
Gene2       Gene2    183.59043             1 183.59043
Gene3       Gene3    460.64653             1 460.64653
Gene4       Gene4      5.16081             1   5.16081
Gene5       Gene5     90.97209             1  90.97209
Gene6       Gene6     81.68033             1  81.68033
> head(colData(sim))
DataFrame with 6 rows and 3 columns
             Cell       Batch ExpLibSize
      <character> <character>  <numeric>
Cell1       Cell1      Batch1     364054
Cell2       Cell2      Batch1     337135
Cell3       Cell3      Batch1     353075
Cell4       Cell4      Batch1     354925
Cell5       Cell5      Batch1     357749
Cell6       Cell6      Batch1     356384

We can also visualize it. Here I use the visualiseDim function from the CiteFuse package:

sim <- logNormCounts(sim)
# Plot PCA
sim <- runPCA(sim)
CiteFuse::visualiseDim(sim,
             dimNames = "PCA",
             colour_by = "Batch")
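
If CiteFuse is not installed, scater's own plotPCA draws an equivalent plot. The sketch below is self-contained and uses two batches of 100 cells, chosen only for illustration, so that the colouring shows something.

```r
library(splatter)
library(scater)

# Two batches of 100 cells each (illustrative sizes)
sim <- splatSimulate(nGenes = 1000, batchCells = c(100, 100), verbose = FALSE)
sim <- logNormCounts(sim)
sim <- runPCA(sim)

# plotPCA() reads the "PCA" entry stored by runPCA()
plotPCA(sim, colour_by = "Batch")
```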

The output of the splatSimulate function is an SCE object, so it can be used directly in downstream single-cell analysis.

This function outputs the following information:

  • Cell information (colData)
    • Cell - Unique cell identifier.
    • Group - The group or path the cell belongs to.
    • ExpLibSize - The expected library size for that cell.
    • Step (paths only) - How far along the path each cell is.
  • Gene information (rowData)
    • Gene - Unique gene identifier.
    • BaseGeneMean - The base expression level for that gene.
    • OutlierFactor - Expression outlier factor for that gene (1 is not an outlier).
    • GeneMean - Expression level after applying outlier factors.
    • DEFac[Group] - The differential expression factor for each gene in a particular group (1 is not differentially expressed).
    • GeneMean[Group] - Expression level of a gene in a particular group after applying differential expression factors.
  • Gene by cell information (assays)
    • BaseCellMeans - The expression of genes in each cell adjusted for expected library size.
    • BCV - The Biological Coefficient of Variation for each gene in each cell.
    • CellMeans - The expression level of genes in each cell adjusted for BCV.
    • TrueCounts - The simulated counts before dropout.
    • Dropout - Logical matrix showing which counts have been dropped in which cells

In fact, these outputs correspond to the parameters stored in the SplatParams object described above.
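
As a quick illustration, the intermediate values listed above can be pulled out with the standard SingleCellExperiment accessors. This is a small sketch with arbitrary sizes and a fixed seed.

```r
library(splatter)
library(SingleCellExperiment)

sim <- splatSimulate(nGenes = 500, batchCells = 50, verbose = FALSE, seed = 1)

counts(sim)[1:3, 1:3]               # final simulated counts
assay(sim, "TrueCounts")[1:3, 1:3]  # counts before dropout
head(rowData(sim)$GeneMean)         # per-gene information
head(colData(sim)$ExpLibSize)       # per-cell information

# With the default dropout.type = "none", nothing is dropped
all(counts(sim) == assay(sim, "TrueCounts"))
```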

4 - two modes of simulation

Besides the single cell population shown above, the splatSimulate function can also generate multiple cell clusters or trajectory data. The documentation describes the method argument as follows:

which simulation method to use. Options are "single" which produces a single population, "groups" which produces distinct groups (eg. cell types), or "paths" which selects cells from continuous trajectories (eg. differentiation processes).

4.1 - cluster

The batchCells parameter controls both the number of batches and the size of each batch (the number of batches cannot be set directly). In the same way, group.prob controls both the number of groups and the proportion of cells in each group:

# group
sim.groups <- splatSimulate(group.prob = c(0.5, 0.5), method = "groups",
                            verbose = FALSE, batchCells = rep(1000,2))
sim.groups <- logNormCounts(sim.groups)
sim.groups <- runPCA(sim.groups)
# visualiseDim() is the CiteFuse plotting function used earlier
CiteFuse::visualiseDim(sim.groups,
                       dimNames = "PCA",
                       colour_by = "Group", shape_by = "Batch")

4.2 - path (trajectory)

Modify the corresponding method parameter to the paths mode:

# path
library(ggplot2)
sim.paths <- splatSimulate(de.prob = 0.2, nGenes = 1000, method = "paths",
                           verbose = FALSE, batchCells = rep(200,2))
sim.paths <- logNormCounts(sim.paths)
sim.paths <- runPCA(sim.paths)
# Combine the cell metadata with the first two principal components
sim.paths.df <- as.data.frame(colData(sim.paths))
sim.paths.df <- cbind(sim.paths.df, reducedDim(sim.paths, "PCA")[, 1:2])
ggplot(sim.paths.df) + 
  geom_point(
    aes(PC1, PC2, color = Step, shape = Batch)
  ) + viridis::scale_colour_viridis()
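
The example above has a single path. As a hedged sketch of a branching trajectory, the path.from parameter (the "[From]" entry in the Paths section of the parameter printout) says where each path starts, and group.prob sets how many paths there are. The proportions and sizes here are arbitrary.

```r
library(splatter)

# Path1 starts from the origin (0); Path2 and Path3 both start at the end of Path1
sim.branch <- splatSimulate(method = "paths", nGenes = 1000, batchCells = 300,
                            group.prob = c(0.4, 0.3, 0.3),
                            path.from = c(0, 1, 1),
                            de.prob = 0.2, verbose = FALSE)

# Each cell records which path it is on and how far along it is
table(sim.branch$Group)
range(sim.branch$Step)
```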

5 - other simulations

batch

In fact, in the above data, I have specified the batch through batchCells.

Because the difference between Group 1 and Group 2 is clearly larger than the difference within a group, the PCA plots above also show a clear separation between the groups.
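
The strength of the batch effect is controlled by batch.facLoc and batch.facScale, the "[Location]" and "[Scale]" entries in the Batches section of the parameter printout. A small sketch comparing a weak and a strong setting, with values picked only for illustration:

```r
library(splatter)
library(SingleCellExperiment)

sim.small <- splatSimulate(nGenes = 1000, batchCells = c(100, 100),
                           batch.facLoc = 0.05, batch.facScale = 0.05,
                           verbose = FALSE, seed = 1)
sim.large <- splatSimulate(nGenes = 1000, batchCells = c(100, 100),
                           batch.facLoc = 0.5, batch.facScale = 0.5,
                           verbose = FALSE, seed = 1)

# Per-gene batch factors are stored in rowData; a wider spread on the
# log scale means a stronger batch effect
sd.small <- sd(log(rowData(sim.small)$BatchFacBatch1))
sd.large <- sd(log(rowData(sim.large)$BatchFacBatch1))
c(small = sd.small, large = sd.large)
```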

In fact, the Splat simulation is itself a small family of functions:

Each of the Splatter simulation methods has its own convenience function. To simulate a single population use splatSimulateSingle() (equivalent to splatSimulate(method = "single")), to simulate groups use splatSimulateGroups() (equivalent to splatSimulate(method = "groups")) or to simulate paths use splatSimulatePaths() (equivalent to splatSimulate(method = "paths")).

Other methods

listSims() lists all 15 simulations included in the package:

listSims()
#> Splatter currently contains 15 simulations 
#> 
#> Splat (splat) 
#> DOI: 10.1186/s13059-017-1305-0    GitHub: Oshlack/splatter    Dependencies:  
#> The Splat simulation generates means from a gamma distribution, adjusts them for BCV and generates counts from a gamma-poisson. Dropout and batch effects can be optionally added. 
#> 
#> Splat Single (splatSingle) 
#> DOI: 10.1186/s13059-017-1305-0    GitHub: Oshlack/splatter    Dependencies:  
#> The Splat simulation with a single population. 
#> 
#> Splat Groups (splatGroups) 
#> DOI: 10.1186/s13059-017-1305-0    GitHub: Oshlack/splatter    Dependencies:  
#> The Splat simulation with multiple groups. Each group can have it's own differential expression probability and fold change distribution. 
#> 
#> Splat Paths (splatPaths) 
#> DOI: 10.1186/s13059-017-1305-0    GitHub: Oshlack/splatter    Dependencies:  
#> The Splat simulation with differentiation paths. Each path can have it's own length, skew and probability. Genes can change in non-linear ways. 
#> 
#> Kersplat (kersplat) 
#> DOI:      GitHub: Oshlack/splatter    Dependencies: scuttle, igraph 
#> The Kersplat simulation extends the Splat model by adding a gene network, more complex cell structure, doublets and empty cells (Experimental). 
#> 
#> splatPop (splatPop) 
#> DOI: 10.1186/s13059-021-02546-1   GitHub: Oshlack/splatter    Dependencies: VariantAnnotation, preprocessCore 
#> The splatPop simulation enables splat simulations to be generated for multiple individuals in a population, accounting for correlation structure by simulating expression quantitative trait loci (eQTL). 
#> 
#> Simple (simple) 
#> DOI: 10.1186/s13059-017-1305-0    GitHub: Oshlack/splatter    Dependencies:  
#> A simple simulation with gamma means and negative binomial counts. 
#> 
#> Lun (lun) 
#> DOI: 10.1186/s13059-016-0947-7    GitHub: MarioniLab/Deconvolution2016    Dependencies:  
#> Gamma distributed means and negative binomial counts. Cells are given a size factor and differential expression can be simulated with fixed fold changes. 
#> 
#> Lun 2 (lun2) 
#> DOI: 10.1093/biostatistics/kxw055     GitHub: MarioniLab/PlateEffects2016     Dependencies: scran, scuttle, lme4, pscl, limSolve 
#> Negative binomial counts where the means and dispersions have been sampled from a real dataset. The core feature of the Lun 2 simulation is the addition of plate effects. Differential expression can be added between two groups of plates and optionally a zero-inflated negative-binomial can be used. 
#> 
#> scDD (scDD) 
#> DOI: 10.1186/s13059-016-1077-y    GitHub: kdkorthauer/scDD    Dependencies: scDD 
#> The scDD simulation samples a given dataset and can simulate differentially expressed and differentially distributed genes between two conditions. 
#> 
#> BASiCS (BASiCS) 
#> DOI: 10.1371/journal.pcbi.1004333     GitHub: catavallejos/BASiCS     Dependencies: BASiCS 
#> The BASiCS simulation is based on a bayesian model used to deconvolve biological and technical variation and includes spike-ins and batch effects. 
#> 
#> mfa (mfa) 
#> DOI: 10.12688/wellcomeopenres.11087.1     GitHub: kieranrcampbell/mfa     Dependencies: mfa 
#> The mfa simulation produces a bifurcating pseudotime trajectory. This can optionally include genes with transient changes in expression and added dropout. 
#> 
#> PhenoPath (pheno) 
#> DOI: 10.1038/s41467-018-04696-6   GitHub: kieranrcampbell/phenopath   Dependencies: phenopath 
#> The PhenoPath simulation produces a pseudotime trajectory with different types of genes. 
#> 
#> ZINB-WaVE (zinb) 
#> DOI: 10.1038/s41467-017-02554-5   GitHub: drisso/zinbwave     Dependencies: zinbwave 
#> The ZINB-WaVE simulation simulates counts from a sophisticated zero-inflated negative-binomial distribution including cell and gene-level covariates. 
#> 
#> SparseDC (sparseDC) 
#> DOI: 10.1093/nar/gkx1113      GitHub: cran/SparseDC   Dependencies: SparseDC 
#> The SparseDC simulation simulates a set of clusters across two conditions, where some clusters may be present in only one condition.

For example, the scDD simulation handles cells from two different conditions:

The scDD simulation samples a given dataset and can simulate differentially expressed and differentially distributed genes between two conditions.

6 - comparison with real data sets

Comparing datasets individually

splatter provides a way to compare the properties of individual single-cell datasets.

The compareSCEs function accepts a named list of SCE objects:

set.seed(1)
sce <- mockSCE(ncells = 200, ngenes = 2000, nspikes = 100)
params <- splatEstimate(sce)
sim1 <- splatSimulate(params, nGenes = 2000)
sim2 <- splatSimulate(nGenes = 2000)
sim3 <- simpleSimulate(nGenes = 2000, verbose = FALSE)
comparison <- compareSCEs(list(sce = sce, sim1 = sim1,
                               sim2 = sim2, sim3 = sim3))

For example, here I compare:

  • The mock data generated by mockSCE (sce);
  • Data simulated by Splat with parameters estimated from the mockSCE data (sim1);
  • Data simulated with default parameters by two methods, Splat (sim2) and Simple (sim3).

> head(comparison$ColData)
         Dataset    sum detected  total PctZero
Cell_001     sce 392907     1502 402665   24.90
Cell_002     sce 398904     1509 405090   24.55
Cell_003     sce 358855     1503 365409   24.85
Cell_004     sce 378909     1527 386163   23.65
Cell_005     sce 384063     1532 389848   23.40
Cell_006     sce 370596     1522 378267   23.90
> head(comparison$RowData)
          Dataset    mean detected MeanCounts  VarCounts CVCounts
Gene_0001     sce  11.460     48.5     11.460   950.1994 2.689817
Gene_0002     sce  78.560     92.0     78.560  9041.7049 1.210385
Gene_0003     sce  21.505     55.5     21.505  2641.8090 2.390074
Gene_0004     sce  20.780     57.0     20.780  1827.4890 2.057225
Gene_0005     sce  18.290     50.0     18.290  1979.6140 2.432633
Gene_0006     sce 191.455     99.5    191.455 46423.7367 1.125391
          MedCounts MADCounts   MeanCPM     VarCPM    CVCPM
Gene_0001       0.0    0.0000  29.69738   6420.309 2.698111
Gene_0002      46.0   63.7518 205.45104  62237.453 1.214276
Gene_0003       1.0    1.4826  56.01672  17949.158 2.391687
Gene_0004       2.0    2.9652  54.13480  12419.975 2.058656
Gene_0005       0.5    0.7413  47.92973  13753.695 2.446835
Gene_0006     119.0  126.7623 500.52564 321714.428 1.133206
              MedCPM     MADCPM MeanLogCPM VarLogCPM  CVLogCPM
Gene_0001   0.000000   0.000000   2.183822  6.987662 1.2104552
Gene_0002 119.251041 163.890597   6.125446  7.454117 0.4457182
Gene_0003   2.607953   3.866551   2.792780  9.128253 1.0818252
Gene_0004   5.085582   7.539884   2.973293  9.256661 1.0232681
Gene_0005   1.231218   1.825403   2.613443  9.025811 1.1495559
Gene_0006 308.946214 330.608840   8.044234  3.727668 0.2400125
          MedLogCPM MADLogCPM PctZero
Gene_0001 0.0000000  0.000000    51.5
Gene_0002 6.9098774  2.455602     8.0
Gene_0003 1.8511790  2.744558    44.5
Gene_0004 2.6051412  3.862382    43.0
Gene_0005 0.8958936  1.328252    50.0
Gene_0006 8.2758656  1.782199     0.5

> table(comparison$RowData$Dataset)

 sce sim1 sim2 sim3 
2000 2000 2000 2000 
> table(comparison$ColData$Dataset)

 sce sim1 sim2 sim3 
 200  200  100  100 

The result contains detailed per-gene and per-cell statistics for each dataset, plus a variety of plots:

> names(comparison$Plots)
[1] "Means"        "Variances"    "MeanVar"      "LibrarySizes"
[5] "ZerosGene"    "ZerosCell"    "MeanZeros"    "VarGeneCor"  
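
The plots are returned as ordinary ggplot objects and the statistics as data frames, so both can be reused. A self-contained sketch with two small simulations (sizes, seeds and dataset names are arbitrary):

```r
library(splatter)
library(ggplot2)

sim.a <- splatSimulate(nGenes = 500, batchCells = 50, verbose = FALSE, seed = 1)
sim.b <- simpleSimulate(nGenes = 500, nCells = 50, verbose = FALSE, seed = 1)
comp <- compareSCEs(list(Splat = sim.a, Simple = sim.b))

# Pick one plot out of the list and restyle it like any other ggplot
p <- comp$Plots$Means + ggtitle("Mean expression per dataset")

# Summarise a per-cell statistic by dataset
aggregate(PctZero ~ Dataset, data = as.data.frame(comp$ColData), FUN = median)
```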

The plots look good: plain but polished. They use notched box plots:

Many to one

Use the diffSCEs function. Here the datasets need to have the same dimensions, so the simulations are regenerated with matching numbers of genes and cells.

The ref argument must also have length one (hence "many to one"); otherwise you get:

Error in diffSCEs(list(sce = sce, sim1 = sim1, sim2 = sim2, sim3 = sim3), : Assertion on 'ref' failed: Must have length 1.

# compare some to ref
set.seed(1)
sce <- mockSCE(ncells = 200, ngenes = 2000, nspikes = 100)
params <- splatEstimate(sce)
sim1 <- splatSimulate(params)
sim2 <- splatSimulate(nGenes = 2000, batchCells = 200)
sim3 <- simpleSimulate(nGenes = 2000, nCells = 200, verbose = FALSE)

difference <- diffSCEs(list(sce = sce, sim1 = sim1,
              sim2 = sim2, sim3 = sim3), ref = "sce")

We can compare it with ref:

Other contents

For example, you may want to add TPM or FPKM values. First simulate gene lengths with addGeneLengths; after that, the scater methods for SCE objects can be used directly:

sim <- simpleSimulate(verbose = FALSE)
sim <- addGeneLengths(sim)
head(rowData(sim))
#> DataFrame with 6 rows and 3 columns
#>              Gene  GeneMean    Length
#>       <character> <numeric> <numeric>
#> Gene1       Gene1 0.5641399       917
#> Gene2       Gene2 0.0764411       765
#> Gene3       Gene3 2.6791742      5972
#> Gene4       Gene4 1.3782005      3491
#> Gene5       Gene5 4.0117653     15311
#> Gene6       Gene6 0.3536760      1190

tpm(sim) <- calculateTPM(sim, rowData(sim)$Length)
tpm(sim)[1:5, 1:5]
#> 5 x 5 sparse Matrix of class "dgCMatrix"
#>           Cell1    Cell2     Cell3     Cell4     Cell5
#> Gene1 342.21897  .         .       169.73637 170.06608
#> Gene2   .        .         .         .         .      
#> Gene3 131.36922  .       187.68277 182.44101  78.34089
#> Gene4  89.89252  .       183.46630  89.17115   .      
#> Gene5  30.74405 20.50798  83.66284  81.32623  40.74211
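
For reference, the TPM arithmetic itself is short. The base-R sketch below uses made-up counts and lengths and follows the usual definition: divide counts by gene length, then rescale each cell to sum to one million. It is an illustration of the formula, not the scater implementation.

```r
# Toy data: 3 genes x 2 cells (made-up values)
counts <- matrix(c(10, 0, 30,
                   20, 5, 60),
                 nrow = 3,
                 dimnames = list(c("GeneA", "GeneB", "GeneC"),
                                 c("Cell1", "Cell2")))
gene_lengths <- c(GeneA = 1000, GeneB = 500, GeneC = 3000)

# Length-normalised rate: each row (gene) is divided by its length
rate <- counts / gene_lengths

# Rescale each column (cell) so that it sums to one million
tpm_manual <- t(t(rate) / colSums(rate)) * 1e6
tpm_manual
colSums(tpm_manual)
```

GeneA and GeneC end up with the same TPM in each cell because their counts differ by the same factor as their lengths.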

If you do not need everything stored in the SCE object created by Splat, you can drop some of the metadata or assays to reduce the object size:

sim <- splatSimulate()
#> Getting parameters...
#> Creating simulation object...
#> Simulating library sizes...
#> Simulating gene means...
#> Simulating BCV...
#> Simulating counts...
#> Simulating dropout (if needed)...
#> Sparsifying assays...
#> Automatically converting to sparse matrices, threshold = 0.95
#> Skipping 'BatchCellMeans': estimated sparse size 1.5 * dense matrix
#> Skipping 'BaseCellMeans': estimated sparse size 1.5 * dense matrix
#> Skipping 'BCV': estimated sparse size 1.5 * dense matrix
#> Skipping 'CellMeans': estimated sparse size 1.49 * dense matrix
#> Skipping 'TrueCounts': estimated sparse size 1.65 * dense matrix
#> Skipping 'counts': estimated sparse size 1.65 * dense matrix
#> Done!
minimiseSCE(sim)
#> Minimising SingleCellExperiment...
#> Original size: 43.9 Mb
#> Removing all rowData columns
#> Removing all colData columns
#> Removing all metadata items
#> Keeping 1 assays: counts
#> Removing 5 assays: BatchCellMeans, BaseCellMeans, BCV, CellMeans, TrueCounts
#> Sparsifying assays...
#> Automatically converting to sparse matrices, threshold = 0.95
#> Skipping 'counts': estimated sparse size 1.65 * dense matrix
#> Minimised size: 5.3 Mb (12% of original)
#> class: SingleCellExperiment 
#> dim: 10000 100 
#> metadata(0):
#> assays(1): counts
#> rownames(10000): Gene1 Gene2 ... Gene9999 Gene10000
#> rowData names(0):
#> colnames(100): Cell1 Cell2 ... Cell99 Cell100
#> colData names(0):
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):

minimiseSCE(sim, rowData.keep = "Gene", colData.keep = c("Cell", "Batch"),
            metadata.keep = TRUE)
#> Minimising SingleCellExperiment...
#> Original size: 43.9 Mb
#> Keeping 1 rowData columns: Gene
#> Removing 3 rowData columns: BaseGeneMean, OutlierFactor, GeneMean
#> Keeping 2 colData columns: Cell, Batch
#> Removing 1 colData columns: ExpLibSize
#> Keeping 1 assays: counts
#> Removing 5 assays: BatchCellMeans, BaseCellMeans, BCV, CellMeans, TrueCounts
#> Sparsifying assays...
#> Automatically converting to sparse matrices, threshold = 0.95
#> Skipping 'counts': estimated sparse size 1.65 * dense matrix
#> Minimised size: 5.9 Mb (14% of original)
#> class: SingleCellExperiment 
#> dim: 10000 100 
#> metadata(1): Params
#> assays(1): counts
#> rownames(10000): Gene1 Gene2 ... Gene9999 Gene10000
#> rowData names(1): Gene
#> colnames(100): Cell1 Cell2 ... Cell99 Cell100
#> colData names(2): Cell Batch
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):

By default, rowData(sce), colData(sce) and metadata(sce) are all removed and only the counts assay is kept.

splatter also provides functions that assemble the plots from a comparison into a single panel:

p1 <- makeCompPanel(comparison)

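To be honest, it's ugly, so I won't show it.

ps: makeCompPanel fails on the many-to-one result. This is not a bug: makeCompPanel expects the output of compareSCEs (a list of length 3), while diffSCEs returns a list of length 5. For diffSCEs results the package provides makeDiffPanel, and makeOverallPanel combines both.

makeCompPanel(difference)
#> Error in makeCompPanel(difference) : Assertion on 'comp' failed: Must have length 3, but has length 5.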
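
A sketch of which panel function goes with which result, using small simulations so it runs quickly (sizes, seeds and dataset names are arbitrary; the panel functions need the cowplot package):

```r
library(splatter)

ref   <- splatSimulate(nGenes = 500, batchCells = 50, verbose = FALSE, seed = 1)
sim.a <- splatSimulate(nGenes = 500, batchCells = 50, verbose = FALSE, seed = 2)
sim.b <- simpleSimulate(nGenes = 500, nCells = 50, verbose = FALSE, seed = 3)
sces  <- list(Ref = ref, Splat = sim.a, Simple = sim.b)

comparison <- compareSCEs(sces)
difference <- diffSCEs(sces, ref = "Ref")

comp.panel    <- makeCompPanel(comparison)   # for compareSCEs() output
diff.panel    <- makeDiffPanel(difference)   # for diffSCEs() output
overall.panel <- makeOverallPanel(comparison, difference)
```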

Personally, I think the splatter suite is well worth playing with. Next, let's see whether we can customize the batch effect size of each SCE or the amount of variation between groups.

References

[1] Introduction to Splatter (bioconductor.org): https://bioconductor.org/packages/devel/bioc/vignettes/splatter/inst/doc/splatter.html

[2] Oshlack/splatter: Simple simulation of single-cell RNA sequencing data (github.com): https://github.com/Oshlack/splatter

[3] splatPop: simulating single-cell data for populations: http://www.bioconductor.org/packages/devel/bioc/vignettes/splatter/inst/doc/splatPop.html#2_Quick_start

[4] gamma distribution: https://en.wikipedia.org/wiki/Gamma_distribution

[5] Poisson distribution: https://en.wikipedia.org/wiki/Poisson_distribution

[6] Splat simulation parameters (bioconductor.org): https://bioconductor.org/packages/devel/bioc/vignettes/splatter/inst/doc/splat_params.html#27_Dropout_parameters

Posted by Toonster on Thu, 19 May 2022 11:36:01 +0300