
Integrate scRNA-seq datasets#

scRNA-seq data integration is the joint analysis of data from several scRNA-seq experiments to uncover shared or distinct biological patterns.

Here, we'll demonstrate how to query two scRNA-seq datasets by registered metadata, such as cell types, and then integrate them.

Setup#

!lamin load test-scrna
💡 found cached instance metadata: /home/runner/.lamin/instance--testuser1--test-scrna.env
✅ loaded instance: testuser1/test-scrna

import lamindb as ln
import lnschema_bionty as lb
import anndata as ad
✅ loaded instance: testuser1/test-scrna (lamindb 0.52.2)
ln.track()
💡 notebook imports: anndata==0.9.2 lamindb==0.52.2 lnschema_bionty==0.30.4
✅ saved: Transform(id='agayZTonayqAz8', name='Integrate scRNA-seq datasets', short_name='scrna2', version='0', type=notebook, updated_at=2023-09-06 17:23:24, created_by_id='DzTjkKse')
✅ saved: Run(id='wRp7wmEH6RhrW5PYetqh', run_at=2023-09-06 17:23:24, transform_id='agayZTonayqAz8', created_by_id='DzTjkKse')

Access#

Query files by provenance metadata#

users = ln.User.lookup()
ln.Transform.filter(created_by=users.testuser1).search("register scrna")
name                                    id              __ratio__
Validate & register scRNA-seq datasets  Nv48yAceNSh8z8  90.0
Integrate scRNA-seq datasets            agayZTonayqAz8  85.5
transform = ln.Transform.filter(id="Nv48yAceNSh8z8").one()
ln.File.filter(transform=transform).df()
id                    storage_id  key   suffix  accessor  description            version  size      hash                    hash_type  transform_id    run_id                initial_version_id  updated_at           created_by_id
7XqFQvVXP2vYY29SEF8V  kSdUg2Cl    None  .h5ad   AnnData   Conde22                None     28049505  WEFcMZxJNmMiUOFrcSTaig  md5        Nv48yAceNSh8z8  vqkf0y8uN3Qdq8ROUKih  None                2023-09-06 17:22:49  DzTjkKse
LqUYQ5NBOBmTw4af4ZKn  kSdUg2Cl    None  .h5ad   AnnData   10x reference pbmc68k  None     660792    GU-hbSJqGkENOxVKFLmvbA  md5        Nv48yAceNSh8z8  vqkf0y8uN3Qdq8ROUKih  None                2023-09-06 17:23:17  DzTjkKse

Query files based on biological metadata#

assays = lb.ExperimentalFactor.lookup()
species = lb.Species.lookup()
cell_types = lb.CellType.lookup()
query = ln.File.filter(
    experimental_factors=assays.single_cell_rna_sequencing,
    species=species.human,
    cell_types=cell_types.cd8_positive_alpha_beta_memory_t_cell,
)
query.df()
id                    storage_id  key   suffix  accessor  description            version  size      hash                    hash_type  transform_id    run_id                initial_version_id  updated_at           created_by_id
7XqFQvVXP2vYY29SEF8V  kSdUg2Cl    None  .h5ad   AnnData   Conde22                None     28049505  WEFcMZxJNmMiUOFrcSTaig  md5        Nv48yAceNSh8z8  vqkf0y8uN3Qdq8ROUKih  None                2023-09-06 17:22:49  DzTjkKse
LqUYQ5NBOBmTw4af4ZKn  kSdUg2Cl    None  .h5ad   AnnData   10x reference pbmc68k  None     660792    GU-hbSJqGkENOxVKFLmvbA  md5        Nv48yAceNSh8z8  vqkf0y8uN3Qdq8ROUKih  None                2023-09-06 17:23:17  DzTjkKse

Transform#

Compare gene sets#

Get file objects:

file1, file2 = query.list()
file1.describe()
💡 File(id='7XqFQvVXP2vYY29SEF8V', suffix='.h5ad', accessor='AnnData', description='Conde22', size=28049505, hash='WEFcMZxJNmMiUOFrcSTaig', hash_type='md5', updated_at=2023-09-06 17:22:49)

Provenance:
  🗃️ storage: Storage(id='kSdUg2Cl', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna', type='local', updated_at=2023-09-06 17:23:22, created_by_id='DzTjkKse')
  📔 transform: Transform(id='Nv48yAceNSh8z8', name='Validate & register scRNA-seq datasets', short_name='scrna', version='0', type='notebook', updated_at=2023-09-06 17:23:16, created_by_id='DzTjkKse')
  👣 run: Run(id='vqkf0y8uN3Qdq8ROUKih', run_at=2023-09-06 17:22:08, transform_id='Nv48yAceNSh8z8', created_by_id='DzTjkKse')
  👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-09-06 17:23:22)
Features:
  var: FeatureSet(id='Bs3qnL2mUb5A67MjKBeY', n=36503, type='number', registry='bionty.Gene', hash='dnRexHCtxtmOU81_EpoJ', updated_at=2023-09-06 17:22:44, modality_id='XgCPze0r', created_by_id='DzTjkKse')
    'RPLP1', 'SLC3A2', 'IGHV1-67', 'TLR8-AS1', 'HDGFL3', 'CA14', 'None', 'None', 'EMSY', 'URAD', ...
  obs: FeatureSet(id='u0Fwvp8ZeUa3qRFkRDSE', n=4, registry='core.Feature', hash='Lxv_RV1GMXi24AlwilEg', updated_at=2023-09-06 17:22:49, modality_id='8hmGOB2Q', created_by_id='DzTjkKse')
    🔗 donor (12, core.Label): 'A35', 'A31', 'A36', '582C', 'A37', 'D503', '640C', 'A29', 'D496', '637C', ...
    🔗 assay (4, bionty.ExperimentalFactor): 'single-cell RNA sequencing', '10x 3' v3', '10x 5' v2', '10x 5' v1'
    🔗 cell_type (32, bionty.CellType): 'gamma-delta T cell', 'mucosal invariant T cell', 'effector memory CD4-positive, alpha-beta T cell', 'lymphocyte', 'megakaryocyte', 'germinal center B cell', 'naive thymus-derived CD8-positive, alpha-beta T cell', 'effector memory CD8-positive, alpha-beta T cell, terminally differentiated', 'animal cell', 'CD16-positive, CD56-dim natural killer cell, human', ...
    🔗 tissue (17, bionty.Tissue): 'caecum', 'jejunal epithelium', 'lung', 'blood', 'ileum', 'duodenum', 'mesenteric lymph node', 'omentum', 'skeletal muscle tissue', 'lamina propria', ...
Labels:
  🏷️ species (1, bionty.Species): 'human'
  🏷️ tissues (17, bionty.Tissue): 'caecum', 'jejunal epithelium', 'lung', 'blood', 'ileum', 'duodenum', 'mesenteric lymph node', 'omentum', 'skeletal muscle tissue', 'lamina propria', ...
  🏷️ cell_types (32, bionty.CellType): 'gamma-delta T cell', 'mucosal invariant T cell', 'effector memory CD4-positive, alpha-beta T cell', 'lymphocyte', 'megakaryocyte', 'germinal center B cell', 'naive thymus-derived CD8-positive, alpha-beta T cell', 'effector memory CD8-positive, alpha-beta T cell, terminally differentiated', 'animal cell', 'CD16-positive, CD56-dim natural killer cell, human', ...
  🏷️ experimental_factors (4, bionty.ExperimentalFactor): 'single-cell RNA sequencing', '10x 3' v3', '10x 5' v2', '10x 5' v1'
  🏷️ labels (12, core.Label): 'A35', 'A31', 'A36', '582C', 'A37', 'D503', '640C', 'A29', 'D496', '637C', ...
file1.view_flow()
https://d33wubrfki0l68.cloudfront.net/e356c03669a95f9f9cb121f8544b219a0fedf98a/9e2e6/_images/841d6fbe0c32d06cabae87404c883547fd29f295d9eb056e0b7ba88518710cb2.svg
file2.describe()
💡 File(id='LqUYQ5NBOBmTw4af4ZKn', suffix='.h5ad', accessor='AnnData', description='10x reference pbmc68k', size=660792, hash='GU-hbSJqGkENOxVKFLmvbA', hash_type='md5', updated_at=2023-09-06 17:23:17)

Provenance:
  🗃️ storage: Storage(id='kSdUg2Cl', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna', type='local', updated_at=2023-09-06 17:23:22, created_by_id='DzTjkKse')
  📔 transform: Transform(id='Nv48yAceNSh8z8', name='Validate & register scRNA-seq datasets', short_name='scrna', version='0', type='notebook', updated_at=2023-09-06 17:23:16, created_by_id='DzTjkKse')
  👣 run: Run(id='vqkf0y8uN3Qdq8ROUKih', run_at=2023-09-06 17:22:08, transform_id='Nv48yAceNSh8z8', created_by_id='DzTjkKse')
  👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-09-06 17:23:22)
Features:
  var: FeatureSet(id='caaugrojoACm1HppamYy', n=754, type='number', registry='bionty.Gene', hash='WMDxN7253SdzGwmznV5d', updated_at=2023-09-06 17:23:16, modality_id='XgCPze0r', created_by_id='DzTjkKse')
    'SYPL1', 'GTF2E2', 'IFITM2', 'MRPS21', 'SLC3A2', 'FYB1', 'CD74', 'DENND2D', 'ARID4B', 'ATP6V0E1', ...
  obs: FeatureSet(id='Nss5TL03yTdOqC7wVn4B', n=1, registry='core.Feature', hash='Q9xarVfGpJg6dU53Jh_Q', updated_at=2023-09-06 17:23:17, modality_id='8hmGOB2Q', created_by_id='DzTjkKse')
    🔗 cell_type (9, bionty.CellType): 'CD16-positive, CD56-dim natural killer cell, human', 'mature T cell', 'B cell, CD19-positive', 'Cd4-negative, CD8_alpha-negative, CD11b-positive dendritic cell', 'dendritic cell', 'CD8-positive, alpha-beta memory T cell', 'central memory CD8-positive, alpha-beta T cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'monocyte'
  external: FeatureSet(id='ogV09OEwuyEShQdmbeGc', n=2, registry='core.Feature', hash='FWcE11dG0K-9jYgLf5d3', updated_at=2023-09-06 17:23:17, modality_id='8hmGOB2Q', created_by_id='DzTjkKse')
    🔗 species (1, bionty.Species): 'human'
    🔗 assay (1, bionty.ExperimentalFactor): 'single-cell RNA sequencing'
Labels:
  🏷️ species (1, bionty.Species): 'human'
  🏷️ cell_types (9, bionty.CellType): 'CD16-positive, CD56-dim natural killer cell, human', 'mature T cell', 'B cell, CD19-positive', 'Cd4-negative, CD8_alpha-negative, CD11b-positive dendritic cell', 'dendritic cell', 'CD8-positive, alpha-beta memory T cell', 'central memory CD8-positive, alpha-beta T cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'monocyte'
  🏷️ experimental_factors (1, bionty.ExperimentalFactor): 'single-cell RNA sequencing'
file2.view_flow()
https://d33wubrfki0l68.cloudfront.net/019e7b7954b01fb887d9264c43709e3a849926db/6fbe8/_images/373acecc932410c67a9b38a236649ee446d8e593ecd5d147dd83349fbf9fa241.svg

Load files into memory:

file1_adata = file1.load()
file2_adata = file2.load()
💡 adding file 7XqFQvVXP2vYY29SEF8V as input for run wRp7wmEH6RhrW5PYetqh, adding parent transform Nv48yAceNSh8z8
💡 adding file LqUYQ5NBOBmTw4af4ZKn as input for run wRp7wmEH6RhrW5PYetqh, adding parent transform Nv48yAceNSh8z8

The shared genes can be computed from the registered feature sets, without loading the files into memory:

file1_genes = file1.features["var"]
file2_genes = file2.features["var"]

shared_genes = file1_genes & file2_genes
len(shared_genes)
749
shared_genes.list("symbol")[:10]
['SLC3A2',
 'DUSP2',
 'WDR13',
 'TKT',
 'BST2',
 'ICAM4',
 'H1-10',
 'NFE2',
 'HLA-DMA',
 'HVCN1']
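
The `&` above intersects the two registered gene sets. As a rough illustration of the same idea outside LaminDB, the sketch below intersects two plain Python lists of gene symbols while keeping the order of the first list. The lists are small toy excerpts of the symbols shown in the `describe()` output above, and `shared_symbols` is a hypothetical helper, not part of lamindb.

```python
def shared_symbols(first, second):
    """Return symbols present in both lists, in the order of `first`."""
    second_set = set(second)
    seen = set()
    shared = []
    for symbol in first:
        # Keep each shared symbol once, even if `first` repeats it
        if symbol in second_set and symbol not in seen:
            shared.append(symbol)
            seen.add(symbol)
    return shared


# Toy excerpts of the symbols listed by describe() above (not the full gene sets)
file1_symbols = ["RPLP1", "SLC3A2", "IGHV1-67", "TLR8-AS1", "HDGFL3", "CA14", "EMSY", "URAD"]
file2_symbols = ["SYPL1", "GTF2E2", "IFITM2", "MRPS21", "SLC3A2", "FYB1", "CD74"]

print(shared_symbols(file1_symbols, file2_symbols))  # ['SLC3A2']
```

On the registered data the intersection is computed on the full gene sets, which is where the 749 shared genes come from.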

Compare cell types#

file1_celltypes = file1.cell_types.all()
file2_celltypes = file2.cell_types.all()

shared_celltypes = file1_celltypes & file2_celltypes
shared_celltypes_names = shared_celltypes.list("name")
shared_celltypes_names
['CD16-positive, CD56-dim natural killer cell, human',
 'CD8-positive, alpha-beta memory T cell']

We can now subset the two datasets by shared cell types:

file1_adata_subset = file1_adata[
    file1_adata.obs["cell_type"].isin(shared_celltypes_names)
]

file2_adata_subset = file2_adata[
    file2_adata.obs["cell_type"].isin(shared_celltypes_names)
]
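
The subsetting above relies on a boolean mask built with pandas `isin`. A minimal sketch of that step on its own, assuming a toy `obs` table as a stand-in for `adata.obs` (the cell names and the table are made up for illustration):

```python
import pandas as pd

# Hypothetical stand-in for adata.obs: one row per cell
obs = pd.DataFrame(
    {
        "cell_type": [
            "monocyte",
            "CD8-positive, alpha-beta memory T cell",
            "dendritic cell",
            "CD16-positive, CD56-dim natural killer cell, human",
        ]
    },
    index=["cell_0", "cell_1", "cell_2", "cell_3"],
)

shared_names = [
    "CD16-positive, CD56-dim natural killer cell, human",
    "CD8-positive, alpha-beta memory T cell",
]

# True for every cell whose annotation is one of the shared cell types
mask = obs["cell_type"].isin(shared_names)
subset = obs[mask]

print(mask.tolist())          # [False, True, False, True]
print(subset.index.tolist())  # ['cell_1', 'cell_3']
```

Indexing an AnnData object with such a mask returns a view of the original object; calling `.copy()` on the result gives an independent object if one is needed.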

Concatenate subsetted datasets:

adata_concat = ad.concat(
    [file1_adata_subset, file2_adata_subset],
    label="file",
    keys=[file1.description, file2.description],
)
adata_concat
AnnData object with n_obs × n_vars = 244 × 749
    obs: 'cell_type', 'file'
    obsm: 'X_umap'
adata_concat.obs.value_counts()
cell_type                                           file                 
CD8-positive, alpha-beta memory T cell              Conde22                  120
CD16-positive, CD56-dim natural killer cell, human  Conde22                  114
CD8-positive, alpha-beta memory T cell              10x reference pbmc68k      7
CD16-positive, CD56-dim natural killer cell, human  10x reference pbmc68k      3
dtype: int64
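
As a quick sanity check, the counts above can be summed per file. The sketch below copies the four numbers from the output by hand (it does not read them from `adata_concat`) and confirms that they add up to the 244 observations reported for the concatenated object.

```python
from collections import defaultdict

# (cell_type, file) -> count, copied from the value_counts() output above
counts = {
    ("CD8-positive, alpha-beta memory T cell", "Conde22"): 120,
    ("CD16-positive, CD56-dim natural killer cell, human", "Conde22"): 114,
    ("CD8-positive, alpha-beta memory T cell", "10x reference pbmc68k"): 7,
    ("CD16-positive, CD56-dim natural killer cell, human", "10x reference pbmc68k"): 3,
}

# Sum the counts over cell types for each file label
cells_per_file = defaultdict(int)
for (_cell_type, file_label), n in counts.items():
    cells_per_file[file_label] += n

total = sum(cells_per_file.values())

print(dict(cells_per_file))  # {'Conde22': 234, '10x reference pbmc68k': 10}
print(total)                 # 244
```

The two files contribute very different numbers of cells (234 versus 10).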
# clean up test instance
!lamin delete --force test-scrna
!rm -r ./test-scrna
💡 deleting instance testuser1/test-scrna
✅     deleted instance settings file: /home/runner/.lamin/instance--testuser1--test-scrna.env
✅     instance cache deleted
✅     deleted '.lndb' sqlite file
❗     consider manually deleting your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna