ISCC-SUM: High-Performance Content Identification for Science#
Version 0.1.0
First optimized implementation of ISCC Data-Code and Instance-Code generation.
- 50-130× Faster: Process large amounts of scientific data in minutes, not hours. Built with Rust for maximum performance.
- Scientific Focus: Designed for bioimaging workflows and large-scale scientific data management.
- ISO Standard: Based on ISO 24138:2024, ensuring global interoperability.
- Easy Integration: Drop-in replacement for checksum tools with familiar CLI and Python API.
From the BIO-CODES Project
ISCC-SUM addresses fundamental challenges in scientific data identification, particularly for bioimaging and large-scale research datasets.
Challenge: Large Data Volumes#
Scientific instruments generate terabytes of data daily.
Solution: Our Rust-based implementation processes data up to 130× faster than the pure-Python ISO reference implementation, making it practical for:
- High-throughput microscopy facilities
- Genomic sequencing centers
- Climate modeling archives
- Astronomical observation data
Challenge: Complex File Formats#
Scientific data comes in hundreds of specialized formats - from DICOM medical images to HDF5 datasets.
Solution: ISCC-SUM focuses on improving the media-agnostic ISCC-UNITs (Data-Code and Instance-Code), which work with any data format.
Challenge: Scientific Adoption#
Researchers need tools that integrate seamlessly with existing workflows.
Solution: Familiar checksum-style interface:
# Just like md5sum or sha256sum
iscc-sum --tree imagedata.zarr
ISCC:KAA7WQPPQ6J54VLNZJ4LSMDTTEMI2DDUEHCG5DQVWCJVKENQCHSTOSA *imagedata.zarr/
# Process entire datasets
iscc-sum /data/microscopy/*.tiff > checksums.txt
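Because the output follows the checksum-tool convention, a saved file such as `checksums.txt` can be read back with a few lines of Python. The sketch below is illustrative only: the function name `parse_checksum_lines` is hypothetical, and it assumes every line has the `<ISCC> *<path>` shape of the single sample line shown above.

```python
from typing import Dict


def parse_checksum_lines(text: str) -> Dict[str, str]:
    """Map each path to its ISCC, assuming '<ISCC> *<path>' lines."""
    entries: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Assumption: code and path are separated by " *", as in the sample.
        code, sep, path = line.partition(" *")
        if not sep or not code.startswith("ISCC:"):
            raise ValueError(f"unrecognized line: {line!r}")
        entries[path] = code
    return entries


sample = (
    "ISCC:KAA7WQPPQ6J54VLNZJ4LSMDTTEMI2DDUEHCG5DQVWCJVKENQCHSTOSA"
    " *imagedata.zarr/\n"
)
manifest = parse_checksum_lines(sample)
print(manifest)
```

A mapping like this can then be compared against a freshly generated listing to spot changed or missing entries.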
Core Components#
- Rust Library - High-performance implementations of Data-Code and Instance-Code algorithms
- Python Extensions - Native bindings for seamless Python integration
- CLI Tool - Unix-style command familiar to developers
- Single-Pass Processing - Generate both codes reading data only once
- Optimized for parallel processing on modern multi-core systems
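The single-pass idea can be illustrated with a short sketch: each chunk is read once and fed to two independent hashers. This is not the ISCC-SUM implementation. SHA-256 and BLAKE2b from the Python standard library stand in for the Data-Code and Instance-Code algorithms, and the function name `two_digests_single_pass` is hypothetical.

```python
import hashlib
import io
from typing import BinaryIO, Tuple


def two_digests_single_pass(
    stream: BinaryIO, chunk_size: int = 1024 * 1024
) -> Tuple[str, str]:
    """Compute two digests while reading the stream exactly once."""
    first = hashlib.sha256()                   # stand-in hasher, not Data-Code
    second = hashlib.blake2b(digest_size=32)   # stand-in hasher, not Instance-Code
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # The same chunk feeds both hashers, so the data is read only once.
        first.update(chunk)
        second.update(chunk)
    return first.hexdigest(), second.hexdigest()


payload = b"example microscopy bytes" * 1000
digests = two_digests_single_pass(io.BytesIO(payload), chunk_size=4096)
print(digests)
```

For large files the read is usually the expensive step, so sharing it between both computations avoids doubling the I/O cost.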
Why ISCC for Science?#
Beyond Simple Checksums
ISCC provides content-derived similarity hashes that can verify data integrity and find similar data at the same time.
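Similarity hashes are commonly compared by counting differing bits (Hamming distance): a distance of zero means identical codes, and a small distance suggests similar content. The sketch below shows only that comparison step on raw bytes. Decoding an ISCC string into its bit body is not shown, and the function name `hamming_distance` and the sample byte values are illustrative assumptions.

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the bits that differ between two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# Made-up example bodies, not real ISCC codes.
code_a = bytes([0b10110000, 0xFF])
code_b = bytes([0b10010000, 0xFE])

distance = hamming_distance(code_a, code_b)
print(distance)
```

A cryptographic checksum, by contrast, only answers whether two inputs are byte-for-byte identical.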
Unique Advantages for Research#
| Feature | Traditional Checksums | ISCC-SUM |
|---|---|---|
| Data Similarity Detection | No | Yes |
| Container Level Checksums | No | Yes |
| Standard Compliance | | ISO 24138:2024 |
Real-World Applications#
Microscopy Facilities#
- Duplicate Detection: Identify redundant datasets across different studies
- Data Integrity: Verify images haven't been corrupted during transfer
- Collaboration: Share verifiable references to specific datasets
Scientific Repositories#
- Deduplication: Save storage by identifying duplicate submissions
- Version Tracking: Track dataset evolution over time
- Citation: Create persistent, verifiable data citations
HPC Workflows#
- Provenance: Track inputs/outputs in complex pipelines
- Reproducibility: Verify exact datasets used in publications
- Distribution: Efficiently sync datasets across compute nodes
Technical Innovation#
Extending the Standard
ISCC-SUM introduces several enhancements beneficial for scientific computing:
TREEWALK : Efficient deterministic storage tree hashing for large dataset collections
SUBTYPE WIDE : Extended codes for higher precision in similarity detection
CHECKSUM API : Drop-in replacement for existing checksum workflows
Future: WebAssembly : Process bioimages directly in web browsers (planned)
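To show why deterministic traversal matters for tree hashing, here is a minimal sketch that visits entries in name-sorted order, so the same directory content yields the same digest regardless of creation order. It is not the TREEWALK algorithm itself. The ordering rule, the SHA-256 stand-in, and the names `walk_sorted` and `tree_digest` are assumptions made for illustration, and only file contents (not names) are hashed.

```python
import hashlib
import os
import tempfile
from pathlib import Path
from typing import Iterable, Iterator


def walk_sorted(root: Path) -> Iterator[Path]:
    """Yield files under root in a stable, name-sorted order."""
    for entry in sorted(os.scandir(root), key=lambda e: e.name):
        if entry.is_dir(follow_symlinks=False):
            yield from walk_sorted(Path(entry.path))
        elif entry.is_file(follow_symlinks=False):
            yield Path(entry.path)


def tree_digest(root: Path) -> str:
    """Hash the contents of all files in traversal order (SHA-256 stand-in)."""
    hasher = hashlib.sha256()
    for path in walk_sorted(root):
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                hasher.update(chunk)
    return hasher.hexdigest()


def make_tree(root: Path, names: Iterable[str]) -> None:
    """Create small files in the given order (demo helper)."""
    for name in names:
        target = root / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(name.encode())


with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
    make_tree(Path(a), ["b/2.bin", "a.bin", "b/1.bin"])
    make_tree(Path(b), ["a.bin", "b/1.bin", "b/2.bin"])
    same = tree_digest(Path(a)) == tree_digest(Path(b))
    order = [p.relative_to(a).as_posix() for p in walk_sorted(Path(a))]

print(same, order)
```

Without the explicit sort, the order returned by the file system may differ between machines, which would make the resulting digest unstable.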
Performance Benchmarks#
Real-world performance of in-memory data processing
| Data Size | ISO Reference | ISCC-SUM | Speedup |
|---|---|---|---|
| 1 MB | 5.97 MB/s | 476.17 MB/s | 79× |
| 10 MB | 6.48 MB/s | 956.14 MB/s | 147× |
| 100 MB | 6.09 MB/s | 1121.44 MB/s | 184× |
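The Speedup column appears to be the ratio of the two throughput figures, truncated to a whole number. The following sketch recomputes it from the values in the table; the truncation rule is inferred from the numbers, not stated by the benchmark.

```python
# (data size, ISO reference MB/s, ISCC-SUM MB/s), copied from the table above
rows = [
    ("1 MB", 5.97, 476.17),
    ("10 MB", 6.48, 956.14),
    ("100 MB", 6.09, 1121.44),
]

# Speedup = ISCC-SUM throughput / reference throughput, truncated to an integer.
speedups = {size: int(fast / reference) for size, reference, fast in rows}

for size, factor in speedups.items():
    print(f"{size}: {factor}x")
```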
Get Started Today#
- Install and generate your first ISCC in under 5 minutes
- Comprehensive documentation for all features
- Integrate ISCC-SUM into your Python applications
- View source code and contribute
About the BIO-CODES Project
ISCC-SUM is developed as part of BIO-CODES, funded by the European Union's Horizon Europe programme (Grant Agreement No 101060954). Our mission is to make advanced content identification accessible to the global scientific community.