# Data Processing Tutorial
Symusic's high performance makes it particularly well-suited for processing large datasets of symbolic music files, a common task in Music Information Retrieval (MIR) and machine learning.
## Why Symusic for Data Processing?
- **Speed**: Loading and basic manipulation of MIDI files are significantly faster than with pure Python libraries.
- **Efficiency**: The C++ backend reduces memory overhead compared to Python object representations.
- **Multiprocessing**: Symusic objects are efficiently serializable (pickleable), so they work seamlessly with Python's `multiprocessing` module for parallel processing (a quick round-trip check follows this list).
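To verify the pickling claim in your own environment, you can round-trip a `Score` through `pickle` directly, which is exactly what `multiprocessing` does when it ships objects between processes. A minimal sketch; the path `example.mid` is a placeholder:

```python
import pickle

from symusic import Score

# Round-trip a Score through pickle, as multiprocessing does internally
# when sending arguments and results between processes.
score = Score("example.mid")  # placeholder path
payload = pickle.dumps(score)
restored = pickle.loads(payload)
assert restored.note_num() == score.note_num()
```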
## Common Data Processing Tasks
- **Loading and Filtering**: Reading a large dataset and selecting relevant files or tracks.
- **Data Cleaning**: Removing empty files, fixing timing issues, standardizing formats.
- **Feature Extraction**: Computing musical features (e.g., pitch histograms, note density, tempo).
- **Representation Conversion**: Converting MIDI to other formats, such as piano roll or custom event sequences, for model input.
- **Data Augmentation**: Applying transformations like transposition or time stretching (a small transposition sketch follows this list).
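As a concrete example of the augmentation step, transposition takes only a few lines. This sketch assumes Symusic's `shift_pitch` method (which returns a copy of the score with all note pitches shifted); the helper name and paths are illustrative:

```python
from pathlib import Path

from symusic import Score


def transpose_copy(midi_path: Path, semitones: int, output_dir: Path) -> None:
    """Hypothetical helper: save a transposed copy of a MIDI file."""
    score = Score(midi_path)
    # shift_pitch shifts every note's pitch by the given number of semitones.
    transposed = score.shift_pitch(semitones)
    output_dir.mkdir(parents=True, exist_ok=True)
    transposed.dump_midi(output_dir / f"{midi_path.stem}_t{semitones:+d}.mid")


# Example usage: one semitone up and one down.
# transpose_copy(Path("example.mid"), +1, Path("augmented"))
# transpose_copy(Path("example.mid"), -1, Path("augmented"))
```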
## Using `multiprocessing`
Python's `multiprocessing` library is the standard way to parallelize CPU-bound tasks across multiple cores, and Symusic objects integrate well with it.
```python
import multiprocessing as mp
import time
from pathlib import Path

from symusic import Score


def worker_function(midi_path: Path) -> dict | None:
    """Loads a MIDI file and extracts some basic features."""
    try:
        # Load the score
        score = Score(midi_path)

        # Skip empty scores
        if score.empty():
            return None

        # Basic feature extraction
        num_notes = score.note_num()
        duration_seconds = score.to("second").end()
        avg_tempo = 0
        if score.tempos:
            avg_tempo = sum(t.qpm for t in score.tempos) / len(score.tempos)

        return {
            "filename": midi_path.name,
            "num_notes": num_notes,
            "duration_sec": duration_seconds,
            "avg_tempo": avg_tempo,
        }
    except Exception as e:
        print(f"Error processing {midi_path.name}: {e}")
        return None


if __name__ == "__main__":
    start_time = time.time()

    input_dir = Path("path/to/your/midi/dataset")
    midi_files = list(input_dir.glob("**/*.mid"))  # Recursively find MIDI files
    print(f"Found {len(midi_files)} MIDI files.")

    # Determine number of processes (use all available cores)
    num_processes = mp.cpu_count()
    print(f"Using {num_processes} processes.")

    # Process files in parallel
    with mp.Pool(processes=num_processes) as pool:
        # Use map to apply the worker function to each file path
        results = pool.map(worker_function, midi_files)

    # Filter out None results (errors or empty files)
    valid_results = [r for r in results if r is not None]

    end_time = time.time()
    print(f"\nProcessed {len(valid_results)} files in {end_time - start_time:.2f} seconds.")

    # Example: Save results to a file (e.g., JSON or CSV)
    # import json
    # with open("dataset_features.json", "w") as f:
    #     json.dump(valid_results, f, indent=2)

    # Example: Print some stats
    if valid_results:
        avg_notes = sum(r["num_notes"] for r in valid_results) / len(valid_results)
        avg_duration = sum(r["duration_sec"] for r in valid_results) / len(valid_results)
        print(f"Average notes per file: {avg_notes:.1f}")
        print(f"Average duration per file: {avg_duration:.1f} seconds")
```
**Explanation:**

- `worker_function`: Takes a single `Path` object, loads the MIDI file with `Score`, performs some processing (feature extraction in this case), and returns the result (or `None` on error or an empty score).
- `if __name__ == "__main__":`: Essential for `multiprocessing` to work correctly, especially on Windows.
- `mp.Pool`: Creates a pool of worker processes.
- `pool.map(worker_function, midi_files)`: The core of the parallel processing. It distributes the `midi_files` list among the worker processes, each calling `worker_function` on a subset of the files, and collects the results in order (a `chunksize` variant is shown after this list).
- **Result Handling**: The collected `results` list contains the return values from `worker_function` for each file.
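One practical knob when the file list is long: `pool.map` accepts an optional `chunksize` argument that batches tasks per worker, reducing inter-process communication overhead. The value below is purely illustrative and worth tuning for your dataset:

```python
# Send paths to workers in batches of 32 instead of one at a time.
results = pool.map(worker_function, midi_files, chunksize=32)
```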
## Tips for Large Datasets
- **Error Handling**: Include `try...except` blocks in your worker function to gracefully handle corrupted or problematic MIDI files without crashing the entire run.
- **Memory Management**: While Symusic is efficient, processing extremely large files or extracting complex features may still consume significant memory. Monitor memory usage; if needed, process files individually or use generators so that results don't all have to be in memory at once.
- **Logging**: For long-running jobs, use Python's `logging` module instead of `print` for better tracking of progress and errors, especially when redirecting output to files.
- **Serialization Cost**: Symusic's pickling is fast, but transferring large processed `Score` objects back from worker processes can still have overhead. If only extracted features are needed, return just those features from the worker function, not the entire `Score` object.
- **Intermediate Results**: For very large datasets, consider saving intermediate results periodically to avoid losing work in case of interruptions (see the sketch after this list).
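The sketch below combines several of these tips, reusing `worker_function` from the first example: it reports progress via `logging`, streams results with `pool.imap_unordered` so they never all accumulate in memory, and appends each result to a JSON Lines file as soon as it arrives. The output filename and chunk size are illustrative assumptions:

```python
import json
import logging
import multiprocessing as mp
from pathlib import Path

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger(__name__)

if __name__ == "__main__":
    midi_files = list(Path("path/to/your/midi/dataset").glob("**/*.mid"))

    with mp.Pool() as pool, open("dataset_features.jsonl", "w") as f:
        # imap_unordered yields each result as soon as a worker finishes it,
        # so results never pile up in a single in-memory list.
        done = 0
        for result in pool.imap_unordered(worker_function, midi_files, chunksize=16):
            done += 1
            if result is not None:
                f.write(json.dumps(result) + "\n")  # persist immediately
            if done % 1000 == 0:
                logger.info("Processed %d / %d files", done, len(midi_files))
```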
## Example: Filtering and Saving Piano Tracks
This example processes a dataset, keeps only the piano tracks, and saves the modified files.
```python
import multiprocessing as mp
import time
from pathlib import Path

from symusic import Score


def filter_piano_tracks(midi_path: Path, output_dir: Path) -> str:
    """Loads a MIDI file, keeps only piano tracks, and saves it to output_dir."""
    try:
        score = Score(midi_path)
        if score.empty():
            return f"Skipped (empty): {midi_path.name}"

        # Find piano tracks (Program 0, not drums)
        piano_tracks = [t for t in score.tracks if t.program == 0 and not t.is_drum]
        if not piano_tracks:
            return f"Skipped (no piano): {midi_path.name}"

        # Create a new score with only piano tracks.
        # Important: global events like tempos and time signatures must be copied over.
        new_score = Score(score.ticks_per_quarter)
        new_score.tracks = piano_tracks  # Assign the filtered tracks
        new_score.tempos = score.tempos.copy()
        new_score.time_signatures = score.time_signatures.copy()
        new_score.key_signatures = score.key_signatures.copy()
        new_score.markers = score.markers.copy()
        # Note: Deep copies might be needed depending on subsequent modifications

        output_path = output_dir / midi_path.name
        new_score.dump_midi(output_path)
        return f"Processed: {midi_path.name}"
    except Exception as e:
        return f"Error processing {midi_path.name}: {e}"


if __name__ == "__main__":
    start_time = time.time()

    input_dir = Path("path/to/your/midi/dataset")
    output_dir = Path("path/to/filtered/piano_midi")
    output_dir.mkdir(parents=True, exist_ok=True)

    midi_files = list(input_dir.glob("**/*.mid"))
    print(f"Found {len(midi_files)} MIDI files.")

    num_processes = mp.cpu_count()
    print(f"Using {num_processes} processes.")

    # Prepare arguments for starmap
    args = [(mf, output_dir) for mf in midi_files]

    with mp.Pool(processes=num_processes) as pool:
        # Use starmap for functions taking multiple arguments
        results = pool.starmap(filter_piano_tracks, args)

    end_time = time.time()
    print(f"\nFinished in {end_time - start_time:.2f} seconds.")

    # Print summary/errors
    processed_count = sum(1 for r in results if r.startswith("Processed"))
    skipped_count = sum(1 for r in results if r.startswith("Skipped"))
    error_count = sum(1 for r in results if r.startswith("Error"))
    print(f"Processed: {processed_count}, Skipped: {skipped_count}, Errors: {error_count}")

    # Optionally log the specific error messages
    # for r in results:
    #     if r.startswith("Error"):
    #         print(r)
```
This tutorial demonstrates how to leverage Symusic's performance for efficient batch processing of symbolic music data using Python's standard multiprocessing tools.