python - Decision Tree of sklearn: Overfitting or Bug?
I'm analyzing the training error and validation error of a decision tree model using the tree package of sklearn.
```python
# compute the error rate
def compute_error(x, y, model):
    yfit = model.predict(x.toarray())
    return np.mean(y != yfit)

def drawlearningcurve(model, xtrain, ytrain, xtest, ytest):
    sizes = np.linspace(2, 25000, 50).astype(int)
    train_error = np.zeros(sizes.shape)
    crossval_error = np.zeros(sizes.shape)

    for i, size in enumerate(sizes):
        model = model.fit(xtrain[:size, :].toarray(), ytrain[:size])
        # compute validation error
        crossval_error[i] = compute_error(xtest, ytest, model)
        # compute training error
        train_error[i] = compute_error(xtrain[:size, :], ytrain[:size], model)

from sklearn import tree
clf = tree.DecisionTreeClassifier()
drawlearningcurve(clf, xtr, ytr, xte, yte)
```
The problem (I don't know whether it really is a problem) is that if I give a decision tree model to the function drawlearningcurve, I receive a training error of 0.0 in each loop. Is this related to the nature of the dataset, or to the tree package of sklearn? Or is there something else wrong?

PS: the training error is absolutely not 0.0 with other models such as naive Bayes, kNN or ANN.
The comments give pretty useful directions. I'd add that a parameter you might want to tweak is called max_depth.
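To illustrate the effect (a minimal sketch on synthetic data, not the asker's dataset): an unconstrained tree keeps splitting until every leaf is pure, so it memorizes the training set and its training error drops to 0, while a depth-limited tree cannot.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
# Synthetic, noisy data: the label depends only weakly on the first feature
X = rng.rand(1000, 5)
y = (X[:, 0] + 0.5 * rng.rand(1000) > 0.75).astype(int)

# Unconstrained tree: grows until every leaf is pure
deep = DecisionTreeClassifier(random_state=0).fit(X, y)
# Depth-limited tree: too small to memorize the label noise
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

print(np.mean(deep.predict(X) != y))     # 0.0 on the training set
print(np.mean(shallow.predict(X) != y))  # strictly greater than 0
```

A zero training error from a fully grown tree is therefore expected behaviour, not a bug; the interesting curve is the validation error as max_depth varies.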
What worries me more is that the compute_error function is odd. The fact that it returns an error of 0 says the classifier makes no errors on the training set. However, even if it did make mistakes, the error function wouldn't necessarily tell you that:
```python
import numpy as np

np.mean([0,0,0,0] != [0,0,0,0])     # perfect match, error 0
# 0.0
np.mean([0,0,0,0] != [1, 1, 1, 1])  # 100% wrong answers
# 1.0
np.mean([0,0,0,0] != [1, 1, 1, 0])  # 75% wrong answers
# 1.0
np.mean([0,0,0,0] != [1, 1, 0, 0])  # 50% wrong answers
# 1.0
np.mean([0,0,0,0] != [1, 1, 2, 2])  # 50% wrong answers
# 1.0
```
What you want is np.sum(y != yfit), or, better yet, one of the error functions that come with sklearn, such as accuracy_score.
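A corrected compute_error along those lines might look like the sketch below (assuming dense inputs; the original called x.toarray() on sparse matrices first). Converting both sides with np.asarray guards against the list-vs-list pitfall above, where != collapses to a single boolean instead of comparing element-wise.

```python
import numpy as np
from sklearn.metrics import accuracy_score

def compute_error(x, y, model):
    """Misclassification rate of `model` on (x, y)."""
    yfit = model.predict(x)
    # asarray makes the comparison element-wise even if y is a plain list
    return np.mean(np.asarray(y) != np.asarray(yfit))

def compute_error_sklearn(x, y, model):
    """Same quantity via sklearn's own metric: error = 1 - accuracy."""
    return 1.0 - accuracy_score(y, model.predict(x))
```

Both versions return a rate in [0, 1], so the learning-curve plots stay comparable across training-set sizes.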