ISCC-SUM: High-Performance Content Identification for Science

Version 0.1.0

First optimized implementation of ISCC Data-Code and Instance-Code generation.

  • 🚀 50-130× Faster


    Process large amounts of scientific data in minutes, not hours. Built with Rust for maximum performance.

  • 🔬 Scientific Focus


    Designed for bioimaging workflows and large-scale scientific data management.

  • 🛡 ISO Standard


    Based on ISO 24138:2024, ensuring global interoperability.

  • 🔧 Easy Integration


    Drop-in replacement for checksum tools with familiar CLI and Python API.

From the BIO-CODES Project

ISCC-SUM addresses fundamental challenges in scientific data identification, particularly for bioimaging and large-scale research datasets.

Challenge: Large Data Volumes

Scientific instruments generate terabytes of data daily.

Solution: Our Rust-based implementation processes data up to 130× faster than the pure-Python ISO reference implementation, making it practical for:

  • High-throughput microscopy facilities
  • Genomic sequencing centers
  • Climate modeling archives
  • Astronomical observation data

Challenge: Complex File Formats

Scientific data comes in hundreds of specialized formats, from DICOM medical images to HDF5 datasets.

Solution: ISCC-SUM focuses on optimizing the media-agnostic ISCC-UNITs, Data-Code and Instance-Code, which work with any data format:

✓ ZARR (multidimensional array data)
✓ HDF5 (hierarchical data)
✓ NetCDF (climate/ocean data)
✓ DICOM (medical imaging)
✓ FASTQ (sequencing)
✓ FITS (astronomy)
✓ SEG-Y (seismic)
✓ Any binary format

Challenge: Scientific Adoption

Researchers need tools that integrate seamlessly with existing workflows.

Solution: Familiar checksum-style interface:

# Just like md5sum or sha256sum
iscc-sum --tree imagedata.zarr
ISCC:KAA7WQPPQ6J54VLNZJ4LSMDTTEMI2DDUEHCG5DQVWCJVKENQCHSTOSA  *imagedata.zarr/

# Process entire datasets
iscc-sum /data/microscopy/*.tiff > checksums.txt

Core Components

  • Rust Library - High-performance implementations of Data-Code and Instance-Code algorithms, optimized for parallel processing on modern multi-core systems
  • Python Extensions - Native bindings for seamless Python integration
  • CLI Tool - Unix-style command familiar to developers
  • Single-Pass Processing - Generate both codes reading data only once
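
The single-pass idea from the list above can be illustrated with a minimal sketch: each chunk is read once and handed to two independent hashers. This is not the ISCC-SUM implementation. `hashlib.sha256` and `hashlib.blake2b` are stand-ins for the Data-Code and Instance-Code algorithms, and the function name `two_digests_single_pass` is hypothetical.

```python
import hashlib
import io


def two_digests_single_pass(stream, chunk_size=1024 * 1024):
    """Read `stream` once and feed every chunk to two independent hashers."""
    # Stand-ins only: the real Data-Code and Instance-Code algorithms differ.
    hasher_a = hashlib.sha256()
    hasher_b = hashlib.blake2b(digest_size=32)
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # The same bytes go to both hashers, so the data is read only once.
        hasher_a.update(chunk)
        hasher_b.update(chunk)
        total += len(chunk)
    return hasher_a.hexdigest(), hasher_b.hexdigest(), total


# In-memory demo data; a file opened in "rb" mode works the same way.
data = b"example payload " * 1000
digest_a, digest_b, size = two_digests_single_pass(io.BytesIO(data), chunk_size=4096)
```

The benefit is that I/O cost is paid once no matter how many digests are derived, which matters most when the data is far larger than memory.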

Why ISCC for Science?

Beyond Simple Checksums

ISCC provides content-derived similarity hashes that can verify data integrity and find similar data at the same time.
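
To make the difference from a cryptographic checksum concrete, here is a minimal sketch of how similarity digests are typically compared: by counting differing bits (Hamming distance). The digests below are made-up 64-bit values, not real ISCC codes, and the helper name `hamming_distance` is hypothetical. It is an assumption here that comparison works this way for the codes ISCC-SUM produces.

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the bits that differ between two equal-length digests."""
    if len(a) != len(b):
        raise ValueError("digests must have equal length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# Hypothetical 64-bit similarity digests, invented for illustration.
original = bytes.fromhex("a1b2c3d4e5f60718")
near_copy = bytes.fromhex("a1b2c3d4e5f60719")  # differs in a single bit
unrelated = bytes.fromhex("0f1e2d3c4b5a6978")

near = hamming_distance(original, near_copy)
far = hamming_distance(original, unrelated)
print(near, far)
```

A small distance suggests similar content, while a traditional checksum would only report that the two inputs are not identical.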

Unique Advantages for Research

| Feature | Traditional Checksums | ISCC-SUM |
|---|---|---|
| Data Similarity Detection | ❌ No | ✅ Built-in |
| Container Level Checksums | ❌ No | ✅ Yes, storage agnostic |
| Standard Compliance | ⚠ Various standards | ✅ ISO 24138:2024 |

Real-World Applications

🔬 Microscopy Facilities

  • Duplicate Detection: Identify redundant datasets across different studies
  • Data Integrity: Verify images haven't been corrupted during transfer
  • Collaboration: Share verifiable references to specific datasets
from iscc_sum import code_iscc_sum

# Generate ISCC for microscopy image
code = code_iscc_sum("cell_culture_z047.ome.tiff")

Scientific Repositories

  • Deduplication: Save storage by identifying duplicate submissions
  • Version Tracking: Track dataset evolution over time
  • Citation: Create persistent, verifiable data citations
# Process entire archive
iscc-sum --similar /archive

HPC Workflows

  • Provenance: Track inputs/outputs in complex pipelines
  • Reproducibility: Verify exact datasets used in publications
  • Distribution: Efficiently sync datasets across compute nodes
# Verify dataset before processing
iscc-sum --check dataset.iscc
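
A check-style workflow like the one above can be sketched in a few lines. The sketch assumes each line of the checksum file follows the `<code>  *<name>` layout shown in the earlier CLI output. The actual file format and verification logic of `iscc-sum --check` may differ. `parse_checksum_line`, `verify`, and `compute_code` are hypothetical names, and `compute_code` stands in for whatever function regenerates the code for a path.

```python
def parse_checksum_line(line):
    """Split one '<code>  *<name>' line into (code, name)."""
    code, _, rest = line.rstrip("\n").partition(" ")
    name = rest.lstrip(" ")
    if name.startswith("*"):  # '*' marks binary mode in md5sum-style output
        name = name[1:]
    return code, name


def verify(lines, compute_code):
    """Return {name: bool} by recomputing each code with `compute_code`."""
    results = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        expected, name = parse_checksum_line(line)
        results[name] = compute_code(name) == expected
    return results


# Line copied from the CLI example earlier on this page.
line = "ISCC:KAA7WQPPQ6J54VLNZJ4LSMDTTEMI2DDUEHCG5DQVWCJVKENQCHSTOSA  *imagedata.zarr/"
code, name = parse_checksum_line(line)

# Demo with a fake generator that always returns the recorded code.
report = verify([line], compute_code=lambda path: code)
```

This simple parser does not handle file names that begin with a space or an asterisk.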

Technical Innovation

Extending the Standard

ISCC-SUM introduces several enhancements beneficial for scientific computing:

TREEWALK : Efficient deterministic storage tree hashing for large dataset collections

SUBTYPE WIDE : Extended codes for higher precision in similarity detection

CHECKSUM API : Drop-in replacement for existing checksum workflows

Future: WebAssembly : Process bioimages directly in web browsers (planned)
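
The point of deterministic tree hashing is that the same directory content always yields the same result, regardless of the order in which the file system lists entries. The following is a minimal sketch of that general idea using sorted traversal. It is not the TREEWALK algorithm itself, whose ordering and filtering rules are not described here. `walk_sorted`, `tree_digest`, and `build_tree` are hypothetical helpers, and SHA-256 is a stand-in hash.

```python
import hashlib
import os
import tempfile


def walk_sorted(root):
    """Yield file paths under `root` in a stable, name-sorted order."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # in-place sort controls the order os.walk descends
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)


def tree_digest(root):
    """Hash the contents of all files in traversal order (stand-in: SHA-256)."""
    hasher = hashlib.sha256()
    for path in walk_sorted(root):
        with open(path, "rb") as fh:
            for chunk in iter(lambda: fh.read(65536), b""):
                hasher.update(chunk)
    return hasher.hexdigest()


def build_tree(root, names):
    """Create small demo files; each file's content is its relative name."""
    for rel in names:
        path = os.path.join(root, rel)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as fh:
            fh.write(rel.encode())


files = ["b/2.bin", "a/1.bin", "top.bin"]
with tempfile.TemporaryDirectory() as t1, tempfile.TemporaryDirectory() as t2:
    build_tree(t1, files)
    build_tree(t2, list(reversed(files)))  # different creation order
    digest_1 = tree_digest(t1)
    digest_2 = tree_digest(t2)
    order = [os.path.relpath(p, t1).replace(os.sep, "/") for p in walk_sorted(t1)]
```

Both trees produce the same digest because traversal order depends only on names, not on creation order or file system internals.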

Performance Benchmarks

Real-world performance of in-memory data processing

| Data Size | ISO Reference | ISCC-SUM | Speedup |
|---|---|---|---|
| 1 MB | 5.97 MB/s | 476.17 MB/s | 79× |
| 10 MB | 6.48 MB/s | 956.14 MB/s | 147× |
| 100 MB | 6.09 MB/s | 1121.44 MB/s | 184× |
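
The speedup column is the ratio of the two throughput figures, which the short sketch below recomputes from the table values. The 1 TB projection at the end is a hypothetical extrapolation, not a measured result. It assumes the 100 MB in-memory throughput would hold at that scale and ignores disk and network I/O.

```python
# (data size, ISO reference MB/s, ISCC-SUM MB/s) copied from the table above
benchmarks = [
    ("1 MB", 5.97, 476.17),
    ("10 MB", 6.48, 956.14),
    ("100 MB", 6.09, 1121.44),
]

# Speedup is simply the throughput ratio.
speedups = {size: fast / ref for size, ref, fast in benchmarks}
for size, ratio in speedups.items():
    print(f"{size}: {ratio:.2f}x")


def seconds_for(megabytes, mb_per_s):
    """Time needed to process `megabytes` at a constant throughput."""
    return megabytes / mb_per_s


# Hypothetical extrapolation to 1 TB (1,000,000 MB), compute time only.
reference_hours = seconds_for(1_000_000, 6.09) / 3600
iscc_sum_minutes = seconds_for(1_000_000, 1121.44) / 60
print(f"reference: {reference_hours:.1f} h, ISCC-SUM: {iscc_sum_minutes:.1f} min")
```

The table's speedup figures correspond to these ratios with the fractional part dropped.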

Get Started Today

  • Quick Start


    Install and generate your first ISCC in under 5 minutes

  • User Guide


    Comprehensive documentation for all features

  • API Reference


    Integrate ISCC-SUM into your Python applications

  • GitHub


    View source code and contribute


About the BIO-CODES Project

ISCC-SUM is developed as part of BIO-CODES, funded by the European Union's Horizon Europe programme (Grant Agreement No 101060954). Our mission is to make advanced content identification accessible to the global scientific community.

Learn more about ISCC