Advanced single-cell data analysis: initial dimensionality reduction and clustering | Dimensionality reduction | Clustering

Dimensionality reduction.

Throughout the manuscript we use diffusion maps, a non-linear dimensionality reduction technique (ref. 37). We calculate a cell-to-cell distance matrix as 1 - Pearson correlation and use the diffuse function of the diffusionMap R package with default parameters to obtain the first 50 diffusion map coordinates (DMCs). To determine the significant DMCs, we look at how quickly the eigenvalues associated with the DMCs decay: every dimension whose eigenvalue is at least 4% of the sum of the first 50 eigenvalues is considered significant. The retained dimensions are then scaled to have mean 0 and standard deviation 1.
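The distance and eigenvalue rules above can be sketched in base R. This is only an illustration under stated assumptions: the expression values, eigenvalues and coordinates are made-up toy numbers (not output of diffusionMap or values from the study), and `select.dims` is a hypothetical helper name.

```r
set.seed(1)

# Toy expression matrix (genes x cells); cor() correlates columns, i.e. cells
expr <- matrix(rnorm(30 * 20), nrow = 30)
dmat <- 1 - cor(expr)  # cell-to-cell distance: 1 - Pearson correlation

# Toy eigenvalues in decreasing order (illustrative values only)
ev <- c(0.40, 0.25, 0.15, 0.08, 0.05, 0.03, 0.02, 0.01, 0.005, 0.005)

# Hypothetical helper: indices of dimensions whose eigenvalue is at least
# `th` relative to the sum of all supplied eigenvalues
select.dims <- function(ev, th) {
  which(ev / sum(ev) >= th)
}

sig <- select.dims(ev, th = 0.04)

# Toy coordinates (cells x dimensions), standing in for diffusion map output
coords <- matrix(rnorm(20 * length(ev)), nrow = 20)

# Keep the significant dimensions and scale each to mean 0, standard deviation 1
coords.sig <- scale(coords[, sig, drop = FALSE])
```

With these toy eigenvalues the first five dimensions pass the 4% threshold. Lowering `th` to 0.02 would retain more dimensions, which is the effect of the looser cutoff used for the initial clustering.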

Initial clustering of all cells.

To identify contaminating cell populations and assess overall heterogeneity in the data, we clustered all single cells. We first combined all Drop-seq samples and normalized the data (21,566 cells, 10,791 protein-coding genes detected in at least 3 cells and mean UMI at least 0.005) using regularized negative binomial regression as outlined above (correcting for sequencing depth related factors and cell cycle). We identified 731 highly variable genes; that is, genes for which the z-scored standard deviation was at least 1. We used the variable genes to perform dimensionality reduction using diffusion maps as outlined above (with relative eigenvalue cutoff of 2%), which returned 10 significant dimensions.
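A minimal sketch of the variable-gene rule, assuming a genes x cells matrix of normalized values. The data are simulated and the names `resid`, `z.sd` and `var.genes` are illustrative; the per-gene standard deviation is z-scored across genes, and genes with a z-score of at least 1 are kept.

```r
set.seed(42)
n.genes <- 200
n.cells <- 100

# Simulated normalized expression: most genes have unit spread,
# the first 10 genes are given a much larger spread
gene.sd <- c(rep(4, 10), rep(1, n.genes - 10))
resid <- matrix(rnorm(n.genes * n.cells), nrow = n.genes) * gene.sd
rownames(resid) <- paste0("gene", seq_len(n.genes))

# Per-gene standard deviation, z-scored across genes
z.sd <- as.numeric(scale(apply(resid, 1, sd)))
names(z.sd) <- rownames(resid)

# Genes whose z-scored standard deviation is at least 1
var.genes <- names(z.sd)[z.sd >= 1]
```

Because the z-score is relative to all genes passed in, the set of selected genes depends on which genes survive the detection filters beforehand.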

For clustering we used a modularity optimization algorithm that finds community structure in the data, with Jaccard similarities (neighbourhood size 9, Euclidean distance in diffusion map coordinates) as edge weights between cells (ref. 38). We chose a small neighbourhood size to overcluster the data and so identify rare populations. This resulted in 15 clusters, two of which were clearly separated from the rest and expressed marker genes expected from contaminating cells (Neurod6 from excitatory neurons, Igfbp7 from epithelial cells). These cells represent rare cellular contaminants in the original sample (2.6% and 1%) and were excluded from further analysis, leaving 20,788 cells.
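To make the edge weights concrete: if every cell has k nearest neighbours and two cells share s of them, their Jaccard similarity is s / (2k - s). The following base-R sketch computes this on six toy points. It is a dense illustration only, assuming neighbour lists that exclude the cell itself; the coordinates and variable names are made up.

```r
# Toy coordinates: 6 cells in two well-separated groups of 3 (illustrative only)
coords <- rbind(c(0, 0), c(0, 1), c(1, 0),
                c(10, 10), c(10, 11), c(11, 10))
k <- 2

# k nearest neighbours per cell by Euclidean distance, excluding the cell itself
d <- as.matrix(dist(coords))
diag(d) <- Inf
nn <- t(apply(d, 1, function(x) order(x)[1:k]))

# Binary cell x cell neighbour indicator matrix (row = cell, column = neighbour)
n <- nrow(coords)
nn.mat <- matrix(0, n, n)
nn.mat[cbind(rep(1:n, each = k), as.vector(t(nn)))] <- 1

# Shared-neighbour counts, then Jaccard similarity s / (2k - s)
shared <- nn.mat %*% t(nn.mat)
jacc <- shared / (2 * k - shared)
```

Cells in different groups share no neighbours and get similarity 0, so a graph built from `jacc` has no edges between the two groups. A small k keeps neighbourhoods local, which is what favours many small clusters.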

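# for clustering
# Packages assumed from the function calls below: diffusionMap (diffuse),
# FNN (get.knn), Matrix (sparseMatrix), igraph (graph_from_adjacency_matrix, cluster_louvain)
library(diffusionMap)
library(FNN)
library(Matrix)
library(igraph)
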
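# Diffusion-map dimensionality reduction.
# expr: expression matrix (genes x cells); max.dim: maximum number of coordinates;
# ev.red.th: relative eigenvalue threshold for a coordinate to count as significant
dim.red <- function(expr, max.dim, ev.red.th, plot.title=NA, do.scale.result=FALSE) {
  cat('Dimensionality reduction via diffusion maps using', nrow(expr), 'genes and', ncol(expr), 'cells\n')
  # cell-to-cell distance: 1 - Pearson correlation
  if (sum(is.na(expr)) > 0) {
    dmat <- 1 - cor(expr, use = 'pairwise.complete.obs')
  } else {
    dmat <- 1 - cor(expr)
  }

  # cap the number of coordinates at half the number of cells (whole number)
  max.dim <- min(max.dim, floor(nrow(dmat)/2))
  dmap <- diffuse(dmat, neigen=max.dim, maxdim=max.dim)
  ev <- dmap$eigenvals

  # eigenvalues relative to their sum; keep up to the last one above the threshold
  ev.red <- ev/sum(ev)
  evdim <- rev(which(ev.red > ev.red.th))[1]

  # optional eigenvalue plot with the cutoff marked
  if (is.character(plot.title)) {
    plot(ev, ylim=c(0, max(ev)), main = plot.title)
    abline(v=evdim + 0.5, col='blue')
  }

  # always keep at least two coordinates
  evdim <- max(2, evdim, na.rm=TRUE)
  cat('Using', evdim, 'significant DM coordinates\n')

  colnames(dmap$X) <- paste0('DMC', 1:ncol(dmap$X))
  res <- dmap$X[, 1:evdim]
  if (do.scale.result) {
    # scale each retained coordinate to mean 0 and standard deviation 1
    res <- scale(dmap$X[, 1:evdim])
  }
  return(res)
}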

# jaccard similarity
# rows in 'mat' are cells
jacc.sim <- function(mat, k) {
  # generate a sparse nearest neighbor matrix (row = cell, column = neighbor)
  nn.indices <- get.knn(mat, k)$nn.index
  j <- as.numeric(t(nn.indices))
  i <- ((1:length(j))-1) %/% k + 1
  nn.mat <- sparseMatrix(i=i, j=j, x=1)
  rm(nn.indices, i, j)
  # turn nn matrix into SNN matrix and then into Jaccard similarity
  # (s shared neighbors out of k each gives s / (2k - s))
  snn <- nn.mat %*% t(nn.mat)
  snn.summary <- summary(snn)
  # pass dims explicitly so the result stays cells x cells
  snn <- sparseMatrix(i=snn.summary$i, j=snn.summary$j, x=snn.summary$x/(2*k-snn.summary$x),
                      dims=dim(snn))
  rm(snn.summary)
  return(snn)
}

# Cluster cells: select variable genes, reduce dimensions, build a Jaccard graph, run Louvain.
# cm: count matrix (genes x cells); expr: normalized expression (genes x cells);
# k: neighborhood size; sel.g: optional vector of genes to use
cluster.the.data.simple <- function(cm, expr, k, sel.g=NA, min.mean=0.001,
                                    min.cells=3, z.th=1, ev.red.th=0.02, seed=NULL,
                                    max.dim=50) {
  if (all(is.na(sel.g))) {
    # no genes specified, use most variable genes
    # keep genes detected in at least min.cells cells with mean count of at least min.mean
    goi <- rownames(expr)[apply(cm[rownames(expr), ]>0, 1, sum) >= min.cells & apply(cm[rownames(expr), ], 1, mean) >= min.mean]
    # z-score the root sum of squares of the normalized values across genes
    sspr <- apply(expr[goi, ]^2, 1, sum)
    sel.g <- goi[scale(sqrt(sspr)) > z.th]
  }
  cat(sprintf('Selected %d variable genes\n', length(sel.g)))
  sel.g <- intersect(sel.g, rownames(expr))
  cat(sprintf('%d of these are in expression matrix.\n', length(sel.g)))

  if (is.numeric(seed)) {
    set.seed(seed)
  }

  dm <- dim.red(expr[sel.g, ], max.dim, ev.red.th, do.scale.result = TRUE)

  sim.mat <- jacc.sim(dm, k)

  # weighted undirected graph with Jaccard similarities as edge weights
  gr <- graph_from_adjacency_matrix(sim.mat, mode='undirected', weighted=TRUE, diag=FALSE)
  cl <- as.numeric(membership(cluster_louvain(gr)))

  results <- list()
  results$dm <- dm
  results$clustering <- cl
  results$sel.g <- sel.g
  results$sim.mat <- sim.mat
  results$gr <- gr
  cat('Clustering table\n')
  print(table(results$clustering))
  return(results)
}

Original article: https://www.cnblogs.com/leezx/p/8648390.html

Time: 2024-10-08 14:49:02
