sklearn-model-trainer
// Scikit-learn model training skill with cross-validation, hyperparameter tuning, pipeline construction, and model serialization. Enables automated ML model development using scikit-learn's comprehensive toolkit.
updated:March 4, 2026
SKILL.md Frontmatter
name: sklearn-model-trainer
description: Scikit-learn model training skill with cross-validation, hyperparameter tuning, pipeline construction, and model serialization. Enables automated ML model development using scikit-learn's comprehensive toolkit.
allowed-tools: Read, Grep, Write, Bash, Edit, Glob
Scikit-learn Model Trainer
Train machine learning models using scikit-learn with cross-validation, hyperparameter tuning, and pipeline construction.
Overview
This skill covers the full model development workflow in scikit-learn: data preprocessing, model training, evaluation, and serialization.
Capabilities
Model Training
- Train classification models (LogisticRegression, RandomForest, SVM, etc.)
- Train regression models (LinearRegression, GradientBoosting, etc.)
- Train clustering models (KMeans, DBSCAN, etc.)
- Build ensembles (VotingClassifier, StackingClassifier, etc.)
Cross-Validation
- K-fold cross-validation
- Stratified K-fold for imbalanced datasets
- Time series split for temporal data
- Leave-one-out and leave-p-out validation
- Custom cross-validation strategies
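The strategies listed above are all splitter objects that can be passed as `cv=` to scikit-learn's evaluation helpers. The following is a minimal sketch on synthetic data (the dataset, model, and fold counts are illustrative assumptions, not part of the skill) contrasting a stratified split with a time series split:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    StratifiedKFold,
    TimeSeriesSplit,
    cross_val_score,
)

# Synthetic, imbalanced data (roughly 90% / 10%), used purely for illustration.
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.9, 0.1], random_state=0
)
model = LogisticRegression(max_iter=1000)

# Stratified K-fold keeps the class ratio roughly constant in every fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
stratified_scores = cross_val_score(
    model, X, y, cv=skf, scoring="balanced_accuracy"
)

# Time series split: every training window ends before its test window starts.
tscv = TimeSeriesSplit(n_splits=4)
ordered_splits = [
    (int(train_idx.max()), int(test_idx.min()))
    for train_idx, test_idx in tscv.split(X)
]

print(f"Balanced accuracy per fold: {np.round(stratified_scores, 3)}")
print(f"(last train index, first test index): {ordered_splits}")
```

A custom strategy can also be supplied as any iterable of `(train_indices, test_indices)` pairs.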
Hyperparameter Tuning
- GridSearchCV for exhaustive search
- RandomizedSearchCV for random sampling
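- Successive halving searches (HalvingGridSearchCV, HalvingRandomSearchCV) for efficiency; these are experimental in scikit-learn and require `from sklearn.experimental import enable_halving_search_cv`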
- Custom scoring functions
- Multi-metric evaluation
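To illustrate how random sampling, a custom scoring function, and multi-metric evaluation fit together, here is a small sketch. The data, the parameter ranges, and the choice of an F2 scorer are assumptions made for the example:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Custom scorer: F2 weights recall more heavily than precision.
f2_scorer = make_scorer(fbeta_score, beta=2)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(10, 50),  # sampled uniformly from [10, 50)
        "max_depth": [3, 5, None],
    },
    n_iter=5,                 # evaluate 5 sampled candidates, not the full grid
    cv=3,
    scoring={"accuracy": "accuracy", "f2": f2_scorer},
    refit="f2",               # with several metrics, name the one used to pick the winner
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(sorted(k for k in search.cv_results_ if k.startswith("mean_test_")))
```

With multiple metrics, `cv_results_` holds one `mean_test_<name>` column per scorer, and `refit` decides which one defines `best_params_`.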
Pipeline Construction
- Feature preprocessing pipelines
- Column transformers for heterogeneous data
- Feature selection integration
- Composite pipelines with caching
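Caching matters most during a search, where the same early pipeline steps would otherwise be refit for every candidate. A minimal sketch, assuming a temporary cache directory and an illustrative scaler/PCA/classifier pipeline:

```python
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=200, n_features=12, random_state=0)

cache_dir = mkdtemp()  # temporary location for cached fitted transformers
try:
    cached_pipeline = Pipeline(
        steps=[
            ("scaler", StandardScaler()),
            ("pca", PCA(n_components=5)),
            ("clf", LogisticRegression(max_iter=1000)),
        ],
        memory=cache_dir,  # fitted transformers are reused when inputs and params match
    )
    # Only the classifier's C changes, so scaler and PCA fits can come from the cache.
    search = GridSearchCV(cached_pipeline, {"clf__C": [0.1, 1.0, 10.0]}, cv=3)
    search.fit(X, y)
    best_c = search.best_params_["clf__C"]
finally:
    rmtree(cache_dir)  # clean up the cache directory

print(f"Best C: {best_c}")
```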
Model Serialization
- Save models with joblib (recommended)
- Pickle serialization
- ONNX export for interoperability
- Model versioning support
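A save/load round trip with joblib might look like the sketch below. The file names and the JSON sidecar used for versioning are hypothetical conventions chosen for the example, not something the skill prescribes:

```python
import json
import tempfile
from pathlib import Path

import joblib
import numpy as np
import sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data and a small model, for illustration only.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model-v1.joblib"   # hypothetical naming scheme
    meta_path = Path(tmp) / "model-v1.json"      # hypothetical metadata sidecar

    joblib.dump(model, model_path)
    # Record the library version: pickled estimators are not guaranteed
    # to load correctly under a different scikit-learn version.
    meta_path.write_text(
        json.dumps({"model_version": "v1", "sklearn_version": sklearn.__version__})
    )

    restored = joblib.load(model_path)
    metadata = json.loads(meta_path.read_text())

same_predictions = bool(np.array_equal(model.predict(X), restored.predict(X)))
print(same_predictions, metadata)
```

Both joblib and pickle can execute arbitrary code on load, so only load model files from trusted sources.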
Prerequisites
Installation
pip install "scikit-learn>=1.0.0" joblib pandas numpy
Optional Dependencies
# For ONNX export
pip install skl2onnx onnxruntime
# For additional preprocessing
pip install category_encoders imbalanced-learn
Usage Patterns
Basic Model Training
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report
import joblib
# Split data (X is the feature matrix and y the target, assumed to be loaded beforehand)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Train model
model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    random_state=42
)
model.fit(X_train, y_train)
# Cross-validation
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
print(f"CV Accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")
# Evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
# Save model
joblib.dump(model, 'model.joblib')
Pipeline with Preprocessing
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier
# Define preprocessing
numeric_features = ['age', 'income', 'score']
categorical_features = ['category', 'region']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)
# Create full pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', GradientBoostingClassifier())
])
# Train
pipeline.fit(X_train, y_train)
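To make the preprocessing behavior concrete, the sketch below rebuilds the same preprocessor and runs it on a tiny hand-made frame. The data values and the `make_frame` and `to_dense` helpers are hypothetical; only the column names and transformer settings come from the pipeline above:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def make_frame(age, income, score, category, region):
    """Hypothetical helper: build a frame with object-typed categorical columns."""
    return pd.DataFrame({
        "age": age,
        "income": income,
        "score": score,
        "category": pd.Series(category, dtype=object),
        "region": pd.Series(region, dtype=object),
    })


def to_dense(matrix):
    """ColumnTransformer may return a sparse matrix; convert for inspection."""
    return matrix.toarray() if hasattr(matrix, "toarray") else np.asarray(matrix)


train_df = make_frame(
    age=[25, 32, np.nan, 51, 46, 38],
    income=[40_000, 52_000, 61_000, np.nan, 83_000, 58_000],
    score=[0.2, 0.5, 0.9, 0.4, np.nan, 0.7],
    category=["a", "b", np.nan, "a", "b", "a"],
    region=["north", "south", "south", np.nan, "north", "south"],
)

preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), ["age", "income", "score"]),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["category", "region"]),
])

# 3 scaled numeric columns + 3 one-hot columns per categorical feature
# (two observed values plus the imputed "missing" level).
encoded_train = to_dense(preprocessor.fit_transform(train_df))

# A category never seen during fit ("z") is encoded as all zeros, not an error.
new_df = make_frame([30], [50_000], [0.5], ["z"], ["north"])
onehot_part = to_dense(preprocessor.transform(new_df))[0, 3:]

print(encoded_train.shape)
print(onehot_part)
```

Because the imputers and the encoder are fit only on the data passed to `fit`, the same learned medians and category lists are reused at prediction time.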
Hyperparameter Tuning with GridSearchCV
from sklearn.model_selection import GridSearchCV
# Define parameter grid (keys are '<pipeline step name>__<parameter name>')
param_grid = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [3, 5, 10, None],
    'classifier__learning_rate': [0.01, 0.1, 0.2]
}
# Grid search
grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1,
    verbose=2
)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.3f}")
# Get best model
best_model = grid_search.best_estimator_
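Exhaustive search grows multiplicatively with each parameter, so it helps to count the fits before launching a run. A short sketch using the grid above (the variable names `cv_folds`, `n_candidates`, and `total_fits` are introduced here for illustration):

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "classifier__n_estimators": [50, 100, 200],
    "classifier__max_depth": [3, 5, 10, None],
    "classifier__learning_rate": [0.01, 0.1, 0.2],
}
cv_folds = 5

# 3 * 4 * 3 parameter combinations
n_candidates = len(ParameterGrid(param_grid))
# Each candidate is fit once per fold, plus one final refit of the best candidate.
total_fits = n_candidates * cv_folds + 1

print(f"{n_candidates} candidates x {cv_folds} folds + 1 refit = {total_fits} fits")
```

If that count is too high for the available compute, sampling a fixed number of candidates with RandomizedSearchCV is a common alternative.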
Feature Selection
from sklearn.feature_selection import SelectFromModel, RFE
from sklearn.ensemble import RandomForestClassifier
# Method 1: SelectFromModel (keeps features whose importance is at least the median)
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold='median'
)
X_selected = selector.fit_transform(X_train, y_train)
# Method 2: Recursive Feature Elimination
rfe = RFE(
    estimator=RandomForestClassifier(n_estimators=100, random_state=42),
    n_features_to_select=10,
    step=1
)
X_rfe = rfe.fit_transform(X_train, y_train)
# Get selected feature names (assumes X_train is a pandas DataFrame)
selected_features = X_train.columns[rfe.support_].tolist()
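Fitting a selector on the full training set before cross-validation lets information from the validation folds influence which features are kept. One way to avoid that, sketched here on synthetic data with an assumed LogisticRegression as the final model, is to make the selector a pipeline step so it is refit inside each fold:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data: 20 features, 5 of them informative (illustration only).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

selection_pipeline = Pipeline(steps=[
    ("select", SelectFromModel(
        RandomForestClassifier(n_estimators=50, random_state=0),
        threshold="median",
    )),
    ("clf", LogisticRegression(max_iter=1000)),
])

# The selector is refit on the training portion of every fold.
scores = cross_val_score(selection_pipeline, X, y, cv=5)

# Fit once on all data to inspect how many features were kept.
selection_pipeline.fit(X, y)
n_kept = int(selection_pipeline.named_steps["select"].get_support().sum())

print(f"Mean CV accuracy: {scores.mean():.3f}, features kept: {n_kept}")
```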
Integration with Babysitter SDK
Task Definition Example
const sklearnTrainingTask = defineTask({
  name: 'sklearn-model-training',
  description: 'Train a scikit-learn model with cross-validation',
  inputs: {
    modelType: { type: 'string', required: true },
    trainDataPath: { type: 'string', required: true },
    targetColumn: { type: 'string', required: true },
    hyperparameters: { type: 'object', default: {} },
    cvFolds: { type: 'number', default: 5 },
    scoringMetric: { type: 'string', default: 'accuracy' }
  },
  outputs: {
    modelPath: { type: 'string' },
    cvScores: { type: 'array' },
    bestScore: { type: 'number' },
    featureImportances: { type: 'object' }
  },
  async run(inputs, taskCtx) {
    return {
      kind: 'skill',
      title: `Train ${inputs.modelType} model`,
      skill: {
        name: 'sklearn-model-trainer',
        context: {
          operation: 'train_with_cv',
          modelType: inputs.modelType,
          trainDataPath: inputs.trainDataPath,
          targetColumn: inputs.targetColumn,
          hyperparameters: inputs.hyperparameters,
          cvFolds: inputs.cvFolds,
          scoringMetric: inputs.scoringMetric
        }
      },
      io: {
        inputJsonPath: `tasks/${taskCtx.effectId}/input.json`,
        outputJsonPath: `tasks/${taskCtx.effectId}/result.json`
      }
    };
  }
});
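The task definition above only describes inputs and outputs; it does not show what the skill does with them. The following is a hypothetical Python sketch of the training side, written under several assumptions: the training data is a CSV file, `MODEL_REGISTRY` and `train_with_cv` are names invented for this example, and `bestScore` is taken to mean the highest fold score. None of this is part of the Babysitter SDK.

```python
from pathlib import Path

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical mapping from the task's modelType string to an estimator class.
MODEL_REGISTRY = {
    "RandomForestClassifier": RandomForestClassifier,
    "LogisticRegression": LogisticRegression,
}


def train_with_cv(context: dict, output_dir: Path) -> dict:
    """Train a model from the task context and return the declared outputs."""
    data = pd.read_csv(context["trainDataPath"])  # assumes CSV input
    y = data[context["targetColumn"]]
    X = data.drop(columns=[context["targetColumn"]])

    model_cls = MODEL_REGISTRY[context["modelType"]]
    model = model_cls(**(context.get("hyperparameters") or {}))

    scores = cross_val_score(
        model, X, y,
        cv=context.get("cvFolds", 5),
        scoring=context.get("scoringMetric", "accuracy"),
    )

    # Refit on all training data before saving.
    model.fit(X, y)
    model_path = output_dir / "model.joblib"
    joblib.dump(model, model_path)

    # Not every estimator exposes feature_importances_.
    importances = getattr(model, "feature_importances_", None)
    return {
        "modelPath": str(model_path),
        "cvScores": scores.tolist(),
        "bestScore": float(scores.max()),  # assumption: best single fold
        "featureImportances": (
            {} if importances is None
            else dict(zip(X.columns, map(float, importances)))
        ),
    }
```

The returned dictionary mirrors the `outputs` declared in the task, so it could be serialized to the result JSON path with `json.dump`.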
Model Selection Guide
Classification Models
| Model | Use Case | Pros | Cons |
|---|---|---|---|
| LogisticRegression | Binary/multiclass, interpretable | Fast, interpretable | Linear boundary |
| RandomForestClassifier | General purpose | Robust, handles nonlinearity | Can overfit |
| GradientBoostingClassifier | High accuracy needed | Often strong accuracy on tabular data | Slower training |
| SVC | Small/medium datasets | Effective in high dimensions | Slow on large data |
| XGBClassifier (from the separate xgboost package, not scikit-learn) | Competition/production | Fast, accurate | Many hyperparameters |
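The table gives rules of thumb only; which model wins depends on the data. A hedged way to check is to compare candidates under identical cross-validation, as in this sketch on synthetic data (the candidate list and settings are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

candidates = {
    # Linear and kernel models get a scaler; tree ensembles do not need one.
    "LogisticRegression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "SVC": make_pipeline(StandardScaler(), SVC()),
    "RandomForestClassifier": RandomForestClassifier(
        n_estimators=50, random_state=0
    ),
    "GradientBoostingClassifier": GradientBoostingClassifier(random_state=0),
}

# Same folds and metric for every candidate so the scores are comparable.
results = {
    name: float(cross_val_score(estimator, X, y, cv=5).mean())
    for name, estimator in candidates.items()
}

for name, score in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:28s} {score:.3f}")
```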
Regression Models
| Model | Use Case | Pros | Cons |
|---|---|---|---|
| LinearRegression | Baseline, interpretable | Simple, fast | Assumes linearity |
| Ridge/Lasso | Regularization needed | Prevents overfitting | Still linear |
| RandomForestRegressor | General purpose | Handles nonlinearity | Can overfit |
| GradientBoostingRegressor | High accuracy | Excellent performance | Slower |
| SVR | Small datasets | Robust to outliers | Slow scaling |
Best Practices
- Always Use Pipelines: Prevent data leakage by including preprocessing in pipelines
- Stratified Splits: Use stratified sampling for imbalanced classification
- Cross-Validation: Tune hyperparameters with cross-validation on the training set; never tune on test data
- Feature Scaling: Scale features for distance-based and margin-based models (e.g., KNN, SVM)
- Random Seeds: Set random_state for reproducibility
- Model Persistence: Prefer joblib over pickle for models that hold large NumPy arrays
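Several of these practices can be combined in a few lines. The sketch below, using synthetic imbalanced data and an assumed KNN model, applies a stratified split, keeps scaling inside a pipeline, fixes the random seeds, and touches the test set only once at the end:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import (
    StratifiedKFold,
    cross_val_score,
    train_test_split,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with roughly an 80% / 20% class balance (illustration only).
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=42
)

# Stratified split: train and test keep nearly the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling lives inside the pipeline, so each CV fold fits the scaler
# on its own training portion only (no leakage from validation data).
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(
    knn, X_train, y_train, cv=cv, scoring="balanced_accuracy"
)

# The test set is used once, after all model decisions are made.
knn.fit(X_train, y_train)
test_accuracy = knn.score(X_test, y_test)

print(f"CV balanced accuracy: {cv_scores.mean():.3f}")
print(f"Held-out accuracy:    {test_accuracy:.3f}")
```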