
sklearn-model-trainer

Scikit-learn model training skill with cross-validation, hyperparameter tuning, pipeline construction, and model serialization. Enables automated ML model development using scikit-learn's comprehensive toolkit.

stars: 384
forks: 73
updated: March 4, 2026
SKILL.md Frontmatter
name: sklearn-model-trainer
description: Scikit-learn model training skill with cross-validation, hyperparameter tuning, pipeline construction, and model serialization. Enables automated ML model development using scikit-learn's comprehensive toolkit.
allowed-tools: Read, Grep, Write, Bash, Edit, Glob

Scikit-learn Model Trainer

Train machine learning models using scikit-learn with cross-validation, hyperparameter tuning, and pipeline construction.

Overview

This skill covers the full scikit-learn model development workflow: data preprocessing, model training, evaluation, and serialization.

Capabilities

Model Training

  • Train classification models (LogisticRegression, RandomForest, SVM, etc.)
  • Train regression models (LinearRegression, GradientBoosting, etc.)
  • Train clustering models (KMeans, DBSCAN, etc.)
  • Train ensemble models (VotingClassifier, StackingClassifier, etc.)
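
As an illustration of the ensemble support listed above, the following is a minimal sketch of a soft-voting ensemble. The synthetic dataset, the choice of base models, and all variable names are assumptions made for the example, not part of the skill itself.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Soft voting averages the predicted class probabilities of the base models.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)

test_accuracy = ensemble.score(X_test, y_test)
print(f"Ensemble test accuracy: {test_accuracy:.3f}")
```

Soft voting requires every base model to implement predict_proba; with models that lack it, voting="hard" (majority vote) is the usual fallback.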

Cross-Validation

  • K-fold cross-validation
  • Stratified K-fold for imbalanced datasets
  • Time series split for temporal data
  • Leave-one-out and leave-p-out validation
  • Custom cross-validation strategies
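
To make the difference between these splitters concrete, here is a small sketch on toy data (the arrays and variable names are illustrative assumptions). It shows that StratifiedKFold preserves the class ratio per fold and that TimeSeriesSplit never trains on samples that come after the test window.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit

# Toy imbalanced labels: 16 negatives, 4 positives (illustrative data only).
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 16 + [1] * 4)

# StratifiedKFold keeps the class ratio roughly constant in every fold.
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
positives_per_fold = [int(y[test_idx].sum()) for _, test_idx in skf.split(X, y)]
print("Positives per test fold:", positives_per_fold)

# TimeSeriesSplit only trains on samples that precede the test window.
tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    print(f"train up to row {train_idx.max()}, test rows {test_idx.min()}..{test_idx.max()}")
```

Either splitter object can be passed as the cv argument of cross_val_score or a search class in place of an integer.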

Hyperparameter Tuning

  • GridSearchCV for exhaustive search
  • RandomizedSearchCV for random sampling
  • Halving search strategies for efficiency
  • Custom scoring functions
  • Multi-metric evaluation
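
A randomized search with multi-metric evaluation might look like the sketch below. The dataset, parameter ranges, and metric choice are assumptions for illustration; note that with several metrics, refit has to name the metric used to select the best candidate.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Distributions are sampled n_iter times instead of enumerating a full grid.
param_distributions = {
    "n_estimators": randint(10, 50),
    "max_depth": randint(2, 8),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=5,
    cv=3,
    scoring={"accuracy": "accuracy", "f1": "f1"},  # multi-metric evaluation
    refit="f1",  # metric used to pick and refit the best candidate
    random_state=0,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Mean accuracy per candidate:", search.cv_results_["mean_test_accuracy"])
```

At the time of writing, the halving searches (HalvingGridSearchCV, HalvingRandomSearchCV) are still marked experimental and need an explicit `from sklearn.experimental import enable_halving_search_cv` import before they can be imported from sklearn.model_selection.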

Pipeline Construction

  • Feature preprocessing pipelines
  • Column transformers for heterogeneous data
  • Feature selection integration
  • Composite pipelines with caching
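
The sketch below combines feature selection and transformer caching in one pipeline. The temporary cache directory, the SelectKBest selector, and the synthetic data are assumptions chosen for the example; caching pays off mainly when expensive transformers are refit repeatedly, for instance during a grid search.

```python
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

# Fitted transformers are cached on disk and reused when inputs are unchanged.
cache_dir = mkdtemp()
pipe = Pipeline(
    steps=[
        ("scale", StandardScaler()),
        ("select", SelectKBest(f_classif, k=5)),
        ("clf", LogisticRegression(max_iter=1000)),
    ],
    memory=cache_dir,
)
pipe.fit(X, y)

# With caching enabled, inspect fitted steps through named_steps.
n_selected = int(pipe.named_steps["select"].get_support().sum())
print(f"Selected {n_selected} of {X.shape[1]} features")

# Remove the cache directory once it is no longer needed.
rmtree(cache_dir)
```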

Model Serialization

  • Save models with joblib (recommended)
  • Pickle serialization
  • ONNX export for interoperability
  • Model versioning support
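
One possible way to pair joblib serialization with lightweight versioning is to store the model alongside metadata, as sketched below. The dictionary keys, the version string, and the file name are hypothetical conventions, not something the skill prescribes.

```python
import os
import tempfile

import joblib
import sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Bundle the model with metadata; these keys are a hypothetical convention.
artifact = {
    "model": model,
    "sklearn_version": sklearn.__version__,
    "model_version": "1.0.0",
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model-v1.0.0.joblib")
    joblib.dump(artifact, path)
    restored = joblib.load(path)

# Warn before predicting with a model saved under another library version.
if restored["sklearn_version"] != sklearn.__version__:
    print("Warning: model was saved with a different scikit-learn version")

same_predictions = bool((restored["model"].predict(X) == model.predict(X)).all())
print("Round trip preserved predictions:", same_predictions)
```

Recording the scikit-learn version matters because serialized estimators are not guaranteed to load correctly under a different release.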

Prerequisites

Installation

# Quote the version specifier so the shell does not treat ">" as a redirect
pip install "scikit-learn>=1.0.0" joblib pandas numpy

Optional Dependencies

# For ONNX export
pip install skl2onnx onnxruntime

# For additional preprocessing
pip install category_encoders imbalanced-learn

Usage Patterns

Basic Model Training

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report
import joblib

# Split data (X and y are assumed to be loaded already, e.g. from a pandas DataFrame)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train model
model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    random_state=42
)
model.fit(X_train, y_train)

# Cross-validation
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
print(f"CV Accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")

# Evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

# Save model
joblib.dump(model, 'model.joblib')

Pipeline with Preprocessing

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier

# Define preprocessing
numeric_features = ['age', 'income', 'score']
categorical_features = ['category', 'region']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

# Create full pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', GradientBoostingClassifier())
])

# Train (X_train must be a pandas DataFrame containing the columns listed above)
pipeline.fit(X_train, y_train)

Hyperparameter Tuning with GridSearchCV

from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [3, 5, 10, None],
    'classifier__learning_rate': [0.01, 0.1, 0.2]
}

# Grid search
grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1,
    verbose=2
)

grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.3f}")

# Get best model
best_model = grid_search.best_estimator_

Feature Selection

from sklearn.feature_selection import SelectFromModel, RFE
from sklearn.ensemble import RandomForestClassifier

# Method 1: SelectFromModel
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold='median'
)
X_selected = selector.fit_transform(X_train, y_train)

# Method 2: Recursive Feature Elimination
rfe = RFE(
    estimator=RandomForestClassifier(n_estimators=100, random_state=42),
    n_features_to_select=10,
    step=1
)
X_rfe = rfe.fit_transform(X_train, y_train)

# Get selected feature names (assumes X is a pandas DataFrame)
selected_features = X.columns[rfe.support_].tolist()

Integration with Babysitter SDK

Task Definition Example

const sklearnTrainingTask = defineTask({
  name: 'sklearn-model-training',
  description: 'Train a scikit-learn model with cross-validation',

  inputs: {
    modelType: { type: 'string', required: true },
    trainDataPath: { type: 'string', required: true },
    targetColumn: { type: 'string', required: true },
    hyperparameters: { type: 'object', default: {} },
    cvFolds: { type: 'number', default: 5 },
    scoringMetric: { type: 'string', default: 'accuracy' }
  },

  outputs: {
    modelPath: { type: 'string' },
    cvScores: { type: 'array' },
    bestScore: { type: 'number' },
    featureImportances: { type: 'object' }
  },

  async run(inputs, taskCtx) {
    return {
      kind: 'skill',
      title: `Train ${inputs.modelType} model`,
      skill: {
        name: 'sklearn-model-trainer',
        context: {
          operation: 'train_with_cv',
          modelType: inputs.modelType,
          trainDataPath: inputs.trainDataPath,
          targetColumn: inputs.targetColumn,
          hyperparameters: inputs.hyperparameters,
          cvFolds: inputs.cvFolds,
          scoringMetric: inputs.scoringMetric
        }
      },
      io: {
        inputJsonPath: `tasks/${taskCtx.effectId}/input.json`,
        outputJsonPath: `tasks/${taskCtx.effectId}/result.json`
      }
    };
  }
});

Model Selection Guide

Classification Models

Model | Use Case | Pros | Cons
LogisticRegression | Binary/multiclass, interpretable | Fast, interpretable | Linear boundary
RandomForestClassifier | General purpose | Robust, handles nonlinearity | Can overfit
GradientBoostingClassifier | High accuracy needed | State-of-the-art performance | Slower training
SVC | Small/medium datasets | Effective in high dimensions | Slow on large data
XGBClassifier (separate xgboost package) | Competition/production | Fast, accurate | Many hyperparameters

Regression Models

Model | Use Case | Pros | Cons
LinearRegression | Baseline, interpretable | Simple, fast | Assumes linearity
Ridge/Lasso | Regularization needed | Prevents overfitting | Still linear
RandomForestRegressor | General purpose | Handles nonlinearity | Can overfit
GradientBoostingRegressor | High accuracy | Excellent performance | Slower
SVR | Small datasets | Robust to outliers | Slow scaling

Best Practices

  1. Always Use Pipelines: Prevent data leakage by fitting preprocessing inside the pipeline, so it only sees training folds
  2. Stratified Splits: Use stratified sampling for imbalanced classification
  3. Cross-Validation: Tune hyperparameters with cross-validation on the training set, never on test data
  4. Feature Scaling: Apply appropriate scaling for distance-based models such as SVM and KMeans
  5. Random Seeds: Set random_state for reproducibility
  6. Model Persistence: Prefer joblib over pickle for models that hold large NumPy arrays
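
Several of these practices can be combined in a few lines. The following is a sketch under assumed data and model choices: scaling lives inside the pipeline, the splits are stratified, and random seeds are fixed.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Imbalanced synthetic data (assumption for illustration): about 80% / 20%.
X, y = make_classification(
    n_samples=200, n_features=10, weights=[0.8, 0.2], random_state=42
)

# The scaler is refit on each training fold, so no test-fold statistics leak in.
model = make_pipeline(StandardScaler(), SVC())

# Stratified, seeded folds keep the class ratio stable and results reproducible.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
print(f"F1 per fold: {scores.round(3)}")
```

Scaling the full dataset before splitting would let test-fold statistics influence training, which is the leakage the first practice warns about.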
