machine learning - Ideal classifiers in Python to fit sparse high-dimensional features (with hierarchical classification)
This is the task:
I have a set of hierarchical classes (e.g. "object/architecture/building/residential building/house/farmhouse"), and I've written two ways of classifying:
1. Treating each class independently (using one model/classifier overall).
2. Using a tree in which each node represents a decision (the root represents "object/", and each level decreases in generality), with a specific model/classifier for each node. Here, I consider the c (usually 3) highest probabilities that come out of each node, propagate the probabilities down to the leaves (summing log probs), and choose the highest.
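To make method 2 concrete, here is a minimal sketch of the descent, not my actual code. The toy hierarchy, the NODE_PROBS table (standing in for each node's classifier output), and the descend function are all hypothetical names invented for illustration. For simplicity this version only scores leaves and cannot stop at an internal node.

```python
import math

# Hypothetical toy hierarchy: each node maps to its children's probabilities,
# standing in for the output of that node's own classifier.
NODE_PROBS = {
    "object": {"object/architecture": 0.7, "object/animal": 0.3},
    "object/architecture": {
        "object/architecture/building": 0.6,
        "object/architecture/bridge": 0.4,
    },
    "object/animal": {"object/animal/dog": 0.9, "object/animal/cat": 0.1},
}


def descend(root, node_probs, c=3):
    """Keep the c most probable children at each node, sum log-probabilities
    along each path, and return (score, leaf) for the best-scoring leaf."""
    frontier = [(0.0, root)]  # (cumulative log-probability, node)
    leaves = []
    while frontier:
        score, node = frontier.pop()
        children = node_probs.get(node)
        if not children:  # no classifier at this node, so treat it as a leaf
            leaves.append((score, node))
            continue
        top = sorted(children.items(), key=lambda kv: kv[1], reverse=True)[:c]
        for child, p in top:
            frontier.append((score + math.log(p), child))
    return max(leaves)


best_score, best_leaf = descend("object", NODE_PROBS, c=2)
```

Summing log probabilities is the same as multiplying the per-node probabilities along a path, which is why longer paths tend to score lower unless something compensates for depth.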
I had to introduce a way to incentivize going further down the tree (as it could otherwise stop at object/architecture/building, if there was corresponding training data), and I used an arbitrary trial-and-error process to decide how (I don't feel comfortable with this):
    if numcategories == 4:
        tempscore += 1
    elif numcategories == 5:
        tempscore += 1.3
    elif numcategories == 6:
        tempscore += 1.5
    elif numcategories > 6:
        tempscore += 2
It is important to note that I have around 290k training samples and ~150k (currently/mostly) boolean features (represented as 1.0 or 0.0). The data is highly sparse, so I use scipy's sparse matrices. Also, there are ~6500 independent classes (though far fewer at each node in method 2).
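For reference, a small sketch of the kind of representation I mean, assuming the features are stored in CSR format (the toy indices and the 3 x 5 shape are made up; the real matrix is roughly 290k x 150k):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy data: (sample index, feature index) pairs for active features.
rows = np.array([0, 0, 1, 2, 2, 2])
cols = np.array([1, 4, 0, 2, 3, 4])
vals = np.ones(len(rows))  # boolean features stored as 1.0

# Only the non-zero entries are stored; everything else is an implicit 0.0.
X = csr_matrix((vals, (rows, cols)), shape=(3, 5))
density = X.nnz / (X.shape[0] * X.shape[1])
```

Only the active entries take up memory, which is what makes a feature matrix of this size workable at all.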
With method 1, I get around 75-76% accuracy with scikit's SGDClassifier(loss='hinge'), and around 76-77% with LinearSVC (although it's 8-9 times slower).
However, for the second method (which I think can/should perform better), neither of these classifiers produces true probabilities, and while I've attempted to scale the confidence scores produced by .decision_function(), it didn't work well (accuracies of 10-25%). Thus, I switched to LogisticRegression(), which gets me ~62-63% accuracy. Also, NB-based classifiers seem to perform substantially worse.
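To illustrate the difference, here is a rough sketch on random toy data. The softmax rescaling is only one plausible way to scale the confidence scores and is an assumption for illustration, not necessarily the scaling I tried; the data, sizes, and variable names are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = csr_matrix((rng.rand(60, 20) < 0.2).astype(float))  # sparse 0.0/1.0 features
y = np.arange(60) % 3  # three hypothetical classes

# LinearSVC gives unbounded per-class margins, not probabilities.
svm = LinearSVC().fit(X, y)
scores = svm.decision_function(X)  # shape (n_samples, n_classes)

# Assumed rescaling: a softmax over the margins. The rows sum to 1,
# but the values are not calibrated probabilities.
shifted = scores - scores.max(axis=1, keepdims=True)
pseudo = np.exp(shifted)
pseudo /= pseudo.sum(axis=1, keepdims=True)

# LogisticRegression exposes probability estimates directly.
logreg = LogisticRegression(max_iter=1000).fit(X, y)
proba = logreg.predict_proba(X)
```

Both arrays have rows that sum to 1, but only the second comes from a model trained to output probabilities, which matters once log probs from different nodes are added together.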
Ultimately, I have two-ish questions:
- Is there a better classifier (than scikit's LogisticRegression()) implemented in Python (could be in scikit or mlpy/nltk/orange/etc.) that can (i) handle sparse matrices, (ii) produce (something close to) probabilities, and (iii) work with multiclass classification?
- Is there a way to handle method 2 better?
  2.a. Specifically, is there a way to better handle incentivizing the classifier to produce results further down the tree?