Agent skill
reshard-c4-data
Guide for implementing reversible data resharding systems with hierarchical constraints (max files/folders per directory, max file size). Use when building compress/decompress scripts that reorganize datasets while maintaining full reconstruction capability.
Install this agent skill to your Project
npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/reshard-c4-data
SKILL.md
Data Resharding with Hierarchical Constraints
Overview
This skill provides guidance for implementing reversible data transformation systems that reshard datasets according to constraints (maximum files per directory, maximum file size). The implementation requires paired compress/decompress operations that preserve data integrity through perfect round-trip reconstruction.
When to Use This Skill
Use this skill when tasked with:
- Creating compress/decompress scripts for datasets with size or count constraints
- Implementing shard/unshard systems for directory reorganization
- Building split/merge tools for large file management
- Developing reversible data transformations with structural constraints
Critical Constraint Understanding
Universal Constraint Application
The most common failure mode: When requirements state "maximum X items in each directory," this applies to EVERY directory in the output structure:
- The root output directory
- All intermediate parent directories
- Leaf directories containing actual files
Example violation: With 9,898 files and max 30 files per directory:
- Need 330 shards (9,898 / 30 = 330)
- Placing 330 shards at root level violates the 30-item constraint
- Solution: Create hierarchical structure with multiple nesting levels
Hierarchical Structure Design
Calculate whether hierarchical nesting is required:
N = number of items
K = max items per directory
If ceil(N/K) > K, hierarchical structure is mandatory.
Required depth = ceil(log_K(N)) levels
Correct hierarchical pattern:
output/
├── group_00/
│ ├── shard_0000/ (up to 30 files)
│ ├── shard_0001/
│ └── ... (up to 30 shards)
├── group_01/
│ └── ... (up to 30 shards)
└── ... (up to 30 groups)
Implementation Approach
1. Explore Input Data First
Before writing code, examine the input data structure:
- Total file count
- File size distribution (identify files exceeding size limit)
- Directory depth and structure
- Naming patterns
This informs whether hierarchical output structure is needed.
2. Design Metadata Schema Upfront
Metadata is the contract between compression and decompression. Define it completely before coding:
metadata = {
"version": "1.0",
"constraints": {
"max_items_per_dir": 30,
"max_file_size_bytes": 15 * 1024 * 1024
},
"original_structure": {
# Map resharded paths to original paths
"shard_0000/file_000001.txt": "original/path/to/file.txt"
},
"split_files": {
# Track files split due to size constraints
"large_file.bin": ["chunk_0", "chunk_1", "chunk_2"]
}
}
Metadata best practices:
- Store in hidden file (e.g.,
.reshard_metadata.json) at output root - Include version field for forward compatibility
- Validate required keys exist when reading metadata
- Use relative paths for portability
3. Handle Filename Collisions
When files from different subdirectories have the same basename:
Problem: Copying dir1/file.txt and dir2/file.txt to same shard overwrites one.
Solutions:
- Sequential numbering with metadata mapping:
file_000001.txtmaps todir1/file.txt - Path-encoded naming:
dir1__file.txt,dir2__file.txt - Preserve directory structure within shards
Sequential numbering with metadata is most robust for varied input structures.
4. File Splitting for Size Constraints
For files exceeding the maximum size:
MAX_SIZE = 15 * 1024 * 1024 # 15MB
def split_large_file(file_path, output_dir, max_size):
chunks = []
with open(file_path, 'rb') as f:
chunk_idx = 0
while data := f.read(max_size):
chunk_name = f"{file_path.name}.chunk{chunk_idx}"
chunk_path = output_dir / chunk_name
chunk_path.write_bytes(data)
chunks.append(chunk_name)
chunk_idx += 1
return chunks # Record in metadata for reconstruction
Remember: Split chunks count toward directory item limits.
5. Build System Configuration
For UV/Python projects with minimal dependencies:
[project]
name = "dataset-resharding"
version = "0.1.0"
requires-python = ">=3.8"
dependencies = []
[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"
[tool.setuptools]
py-modules = []
Verify the setup works:
cd /app
uv sync
uv run python compress.py --help
Verification Strategy
Progressive Testing
Test in order of increasing complexity:
- Minimal: Single file, flat structure
- Basic: Multiple files, flat structure
- Nested: Files in subdirectories (catches filename collision bugs early)
- Constraint boundary: Exactly at limits (30 files, 15MB)
- Over boundary: Just over limits (31 files, 15.1MB)
- Full dataset: Production-scale data
Round-Trip Verification
# Generate checksums for original
cd input_dir && find . -type f -exec md5sum {} + | sort -k2 > /tmp/original.md5
# Run compress → decompress
python compress.py input_dir output_dir
python decompress.py output_dir
# Verify checksums match
cd output_dir && find . -type f -exec md5sum {} + | sort -k2 > /tmp/result.md5
diff /tmp/original.md5 /tmp/result.md5
Constraint Compliance Validation
After compression, verify ALL directories meet constraints:
def validate_output(directory, max_items=30, max_file_size=15*1024*1024):
violations = []
for root, dirs, files in os.walk(directory):
# Check item count at EVERY directory level
total_items = len(dirs) + len(files)
if total_items > max_items:
violations.append(f"{root}: {total_items} items (max {max_items})")
# Check every file size
for f in files:
path = os.path.join(root, f)
size = os.path.getsize(path)
if size > max_file_size:
violations.append(f"{path}: {size} bytes (max {max_file_size})")
return violations
Run this validation before claiming completion.
Artifact Cleanup Verification
After decompression, verify no transformation artifacts remain:
def verify_cleanup(directory):
"""Ensure no shards, chunks, or metadata remain"""
patterns = [".reshard_metadata.json", "shard_*", "*.chunk*"]
for pattern in patterns:
matches = list(Path(directory).glob(f"**/{pattern}"))
if matches:
raise ValueError(f"Artifacts remain: {matches}")
Common Pitfalls
1. Narrow Constraint Interpretation
Mistake: Applying "max 30 per directory" only to leaf shards, not to the output root. Fix: Walk entire output tree and validate every directory.
2. Incomplete Metadata Design
Mistake: Adding metadata fields during debugging instead of designing upfront. Fix: Specify complete metadata schema before implementation.
3. Testing Only Flat Structures
Mistake: Tests pass on flat input but fail on nested directories due to filename collisions. Fix: Include nested directory tests early in test progression.
4. Forgetting Metadata in Item Counts
Mistake: Output root has 30 shards plus metadata file = 31 items (violation). Fix: Account for metadata file when distributing items.
5. Trial-and-Error with Build Configuration
Mistake: Repeatedly modifying pyproject.toml without understanding UV/setuptools requirements. Fix: Start with minimal working configuration, add only what's needed.
6. Truncated File Writes
Mistake: Large files written via tools get truncated silently. Fix: Verify file contents after writing, especially for large scripts.
Implementation Checklist
Before Coding:
- Examine input data (file count, sizes, structure)
- Calculate if hierarchical output structure is needed
- Design complete metadata schema
- Plan test progression from simple to complex
During Implementation:
- Implement compression with hierarchical structure support
- Track all file mappings in metadata
- Handle large file splitting with metadata tracking
- Implement decompression using metadata for reconstruction
- Add constraint validation function
Before Completion:
- Run validation on output structure (all directories)
- Verify round-trip with checksums
- Confirm artifact cleanup after decompression
- Test with production-scale data
- Verify build system works (
uv sync,uv run)
Summary
Successful data resharding requires:
- Universal constraint application at every directory level
- Hierarchical structure when item counts exceed single-level capacity
- Complete metadata design before coding
- Progressive testing including nested structures early
- Explicit validation of all constraints before completion
The key insight: constraints apply globally, not just locally. Validate the entire output tree, not just individual shards.
Recommended Agent Skills
Expand your agent's capabilities with these related and highly-rated skills.
agent-ops-spec
Manage specification documents in .agent/specs/. Use when user provides requirements, acceptance criteria, or feature descriptions that need to be tracked and validated against implementation.
agent-ops-state
Maintain .agent state files. Use at session start, after meaningful steps, and before concluding: read/update constitution/memory/focus/issues/baseline consistently.
agent-ops-spec
Manage specification documents in .agent/specs/. Use when user provides requirements, acceptance criteria, or feature descriptions that need to be tracked and validated against implementation.
agent-ops-testing
Test strategy, execution, and coverage analysis. Use when designing tests, running test suites, or analyzing test results beyond baseline checks.
agent-ops-testing
Test strategy, execution, and coverage analysis. Use when designing tests, running test suites, or analyzing test results beyond baseline checks.
agent-ops-state
Maintain .agent state files. Use at session start, after meaningful steps, and before concluding: read/update constitution/memory/focus/issues/baseline consistently.
Didn't find tool you were looking for?