Agent skill
sqlite-db-truncate
This skill should be used when recovering data from corrupted or truncated SQLite databases. Apply this skill when a .db file cannot be opened with standard SQLite tools, when binary truncation has occurred, or when manual parsing of the SQLite binary format is needed to extract recoverable data.
Install this agent skill to your Project
npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/sqlite-db-truncate
SKILL.md
SQLite Database Truncation Recovery
Overview
Recover data from SQLite databases that have been corrupted through binary truncation. This skill provides a systematic approach to binary format analysis and data extraction when standard SQLite libraries fail.
Recommended Approach
Phase 1: Initial Assessment (Do This First)
Before writing any recovery code, perform thorough analysis:
-
Check file size and basic properties
pythonimport os file_size = os.path.getsize('database.db') print(f"File size: {file_size} bytes") print(f"Expected pages: {file_size // 4096} (assuming 4KB pages)") -
Attempt standard recovery methods first
bash# Try sqlite3 recovery (may work for partial corruption) sqlite3 database.db ".recover" > recovered.sql 2>&1 sqlite3 database.db "PRAGMA integrity_check;" -
Create comprehensive hex dump before coding
pythonwith open('database.db', 'rb') as f: data = f.read() # Examine first 256 bytes thoroughly for i in range(0, min(256, len(data)), 16): hex_str = ' '.join(f'{b:02x}' for b in data[i:i+16]) ascii_str = ''.join(chr(b) if 32 <= b < 127 else '.' for b in data[i:i+16]) print(f"{i:04x}: {hex_str:<48} {ascii_str}")
Critical observations to document:
- Is the SQLite header present? (Bytes 0-15 should be "SQLite format 3\0")
- What is the page size? (Bytes 16-17, big-endian)
- Is page 0 corrupted/zeroed?
- Where does data appear to start?
- What page type byte is present? (Look for 0x0d = leaf table)
Phase 2: Recovery Method Selection
Choose the appropriate method based on Phase 1 findings:
If standard SQLite tools work: Use sqlite3 library directly, no manual parsing needed.
If header is intact but data is truncated: Standard recovery may partially work; supplement with manual parsing for truncated pages.
If header is corrupted/missing: Manual binary parsing is required. Common scenario for truncated databases where first page is damaged.
Phase 3: Structured Binary Parsing
Follow this order strictly to avoid iteration waste:
Step 1: Identify Page Structure
# For truncated DBs, data often starts at a page boundary
# Common page types:
# 0x0d (13) = Leaf table page (contains actual row data)
# 0x05 (5) = Interior table page (contains child page pointers)
# 0x0a (10) = Leaf index page
# 0x02 (2) = Interior index page
page_type = data[0] # Or data[100] if header is intact
if page_type == 0x0d:
print("Found leaf table page - contains recoverable data")
Step 2: Extract Cell Count and Pointers
import struct
# For leaf table page (0x0d), header is 8 bytes
# Bytes 3-4: number of cells (big-endian uint16)
num_cells = struct.unpack('>H', data[3:5])[0]
print(f"Cell count: {num_cells}")
# Cell pointer array starts at byte 8
cell_pointers = []
for i in range(num_cells):
ptr_offset = 8 + (i * 2)
if ptr_offset + 2 <= len(data):
ptr = struct.unpack('>H', data[ptr_offset:ptr_offset+2])[0]
# Validate pointer is within file bounds
if ptr < len(data):
cell_pointers.append(ptr)
else:
print(f"Warning: Cell pointer {ptr} exceeds file size {len(data)}")
Step 3: Parse ONE Cell First (Critical)
Before processing all cells, validate the parsing logic on a single cell:
def parse_cell_verbose(data, offset):
"""Parse a single cell with detailed debug output."""
print(f"\n=== Cell at offset {offset} (0x{offset:04x}) ===")
print(f"Raw bytes: {' '.join(f'{b:02x}' for b in data[offset:offset+32])}")
# Read payload size (varint)
payload_size, pos = read_varint(data, offset)
print(f"Payload size: {payload_size} (next pos: {pos})")
# Read row ID (varint)
row_id, pos = read_varint(data, pos)
print(f"Row ID: {row_id} (next pos: {pos})")
# Read header size (varint)
header_size, header_start = read_varint(data, pos)
print(f"Header size: {header_size} (header starts at: {header_start})")
# Calculate header end
header_end = pos + header_size
print(f"Header end: {header_end}")
# Read serial types
serial_types = []
current = header_start
while current < header_end:
serial_type, current = read_varint(data, current)
serial_types.append(serial_type)
print(f" Serial type: {serial_type} (meaning: {describe_serial_type(serial_type)})")
# Parse column values
body_offset = header_end
columns = []
for i, st in enumerate(serial_types):
value, body_offset = decode_serial_type(st, data, body_offset)
print(f" Column {i}: {value}")
columns.append(value)
return {'row_id': row_id, 'columns': columns}
# Test on first cell
result = parse_cell_verbose(data, cell_pointers[0])
print(f"\nParsed result: {result}")
Validation checks:
- Does the row ID look reasonable?
- Do column values make sense for the expected schema?
- Does string data decode properly as UTF-8?
- Are numeric values within expected ranges?
Step 4: Generalize to All Cells
Only after single-cell validation succeeds:
recovered_rows = []
failed_cells = []
for ptr in cell_pointers:
try:
if ptr >= len(data):
failed_cells.append({'offset': ptr, 'reason': 'beyond file boundary'})
continue
result = parse_cell(data, ptr)
# Validate completeness
if all(v is not None for v in result['columns']):
recovered_rows.append(result)
else:
failed_cells.append({'offset': ptr, 'reason': 'incomplete data', 'partial': result})
except Exception as e:
failed_cells.append({'offset': ptr, 'reason': str(e)})
print(f"Recovered: {len(recovered_rows)} rows")
print(f"Failed: {len(failed_cells)} cells")
Phase 4: Output Generation
import json
# Format output according to task requirements
output = [
{'word': row['columns'][0], 'value': row['columns'][1]}
for row in recovered_rows
if row['columns'][0] is not None and row['columns'][1] is not None
]
with open('recover.json', 'w') as f:
json.dump(output, f, indent=2)
# Also save recovery log for debugging
with open('recovery.log', 'w') as f:
f.write(f"Total cells found: {len(cell_pointers)}\n")
f.write(f"Successfully recovered: {len(output)}\n")
f.write(f"Failed/partial: {len(failed_cells)}\n")
for fail in failed_cells:
f.write(f" {fail}\n")
Common Pitfalls to Avoid
Pitfall 1: Writing Multiple Trial Scripts
Problem: Creating attempt1.py, attempt2.py, etc. wastes time and creates confusion.
Solution: Build ONE modular script with helper functions. Use verbose debug output to understand issues rather than rewriting from scratch. Keep all utility functions (varint parsing, serial type decoding) in reusable form.
Pitfall 2: Skipping Initial Analysis
Problem: Jumping straight to code without understanding the data structure leads to many wasted iterations.
Solution: Always complete Phase 1 (Initial Assessment) fully. Document observations about:
- Header presence/absence
- Page size and type
- Visible patterns in hex dump
- Estimated truncation point
Pitfall 3: Not Validating Single Cell First
Problem: Processing all cells with buggy parsing logic produces incorrect output that's hard to debug.
Solution: Parse ONE cell with verbose output first (Phase 3, Step 3). Verify each field matches expectations before generalizing.
Pitfall 4: Including Incomplete Records
Problem: Adding records with null values pollutes the output.
Solution: Only include records where ALL required fields are successfully recovered. Log partial recoveries separately for debugging.
Pitfall 5: Cleaning Up Before Verification
Problem: Deleting scripts/debug output before fully verifying results makes later debugging impossible.
Solution: Keep all artifacts until final output is verified correct. Only clean up as a final step after confirmation.
Pitfall 6: Syntax Errors from Rushed Coding
Problem: Basic syntax errors like if48 <= (missing space) or 12and (missing space).
Solution: Write code carefully. Review before executing. Common mistakes:
- Missing spaces around operators
- Incomplete string concatenation
- Incorrect indentation
Pitfall 7: Confusing Serial Type Encoding
Problem: Misinterpreting serial type codes leads to wrong data extraction.
Key formulas:
- TEXT length:
(serial_type - 13) // 2(for odd values >= 13) - BLOB length:
(serial_type - 12) // 2(for even values >= 12) - Type 1 = 8-bit int, Type 7 = 64-bit float
Verification Strategies
1. Sanity Check Record Count
# Cell count should match expected number of records
print(f"Expected records: (check task description)")
print(f"Cell pointers found: {len(cell_pointers)}")
print(f"Successfully recovered: {len(recovered_rows)}")
2. Check Value Patterns
# Look for expected patterns in recovered data
words = [r['columns'][0] for r in recovered_rows]
values = [r['columns'][1] for r in recovered_rows]
print(f"Word pattern: {words[:3]} ... {words[-1]}")
print(f"Value range: {min(values)} to {max(values)}")
3. Validate JSON Output
# Re-read and validate output file
with open('recover.json') as f:
output = json.load(f)
print(f"Output records: {len(output)}")
for record in output[:3]:
print(f" {record}")
# Check for null values
null_count = sum(1 for r in output if None in r.values())
print(f"Records with null values: {null_count}")
Bundled Resources
scripts/sqlite_recovery.py
Utility functions for SQLite binary parsing:
read_varint(data, offset)- Decode variable-length integersdecode_serial_type(serial_type, data, offset)- Extract column values based on serial typedescribe_serial_type(serial_type)- Human-readable serial type descriptionhex_dump(data, offset, length)- Format binary data for debugging
Execute the script directly to analyze a database file:
python scripts/sqlite_recovery.py database.db
references/sqlite_format.md
Comprehensive reference for SQLite binary format including:
- Database header structure (100 bytes)
- Page types and their headers
- Varint encoding algorithm
- Serial type codes (0-11 and formula-based for text/blob)
- Cell structure for leaf table pages
- Common truncation patterns
Grep patterns for quick lookup:
"Serial Type"- Find type code reference"Varint"- Variable-length integer encoding"Page Header"- Page structure details"Truncation"- Handling truncated data
Decision Flowchart
Database recovery task
│
▼
┌─────────────────────────┐
│ Phase 1: Initial │
│ Assessment │
│ - Check file size │
│ - Try standard sqlite3 │
│ - Create hex dump │
│ - Document observations │
└───────────┬─────────────┘
│
▼
Standard tools work?
│ │
YES NO
│ │
▼ ▼
Use sqlite3 Manual parsing required
directly │
▼
┌─────────────────────────┐
│ Phase 3: Binary Parsing │
│ Step 1: Identify pages │
│ Step 2: Extract cells │
│ Step 3: Parse ONE cell │◄── CRITICAL
│ Step 4: Validate │
│ Step 5: Generalize │
└───────────┬─────────────┘
│
▼
┌─────────────────────────┐
│ Phase 4: Output │
│ - Only complete records │
│ - Log partial/failed │
│ - Verify before cleanup │
└─────────────────────────┘
Recommended Agent Skills
Expand your agent's capabilities with these related and highly-rated skills.
agent-ops-spec
Manage specification documents in .agent/specs/. Use when user provides requirements, acceptance criteria, or feature descriptions that need to be tracked and validated against implementation.
agent-ops-state
Maintain .agent state files. Use at session start, after meaningful steps, and before concluding: read/update constitution/memory/focus/issues/baseline consistently.
agent-ops-spec
Manage specification documents in .agent/specs/. Use when user provides requirements, acceptance criteria, or feature descriptions that need to be tracked and validated against implementation.
agent-ops-testing
Test strategy, execution, and coverage analysis. Use when designing tests, running test suites, or analyzing test results beyond baseline checks.
agent-ops-testing
Test strategy, execution, and coverage analysis. Use when designing tests, running test suites, or analyzing test results beyond baseline checks.
agent-ops-state
Maintain .agent state files. Use at session start, after meaningful steps, and before concluding: read/update constitution/memory/focus/issues/baseline consistently.
Didn't find tool you were looking for?