machine learning - Ideal classifiers in Python to fit sparse high-dimensional features (with hierarchical classification)


The task:

I have a set of hierarchical classes (e.g. "object/architecture/building/residential building/house/farmhouse"), and I have written two ways of classifying:

  1. Treating each class independently (using one model/classifier overall).

  2. Using a tree in which each node represents a decision (the root represents "object/", and each level decreases in generality), with a specific model/classifier for each node. Here, I consider the c (usually 3) highest probabilities that come out of each node, propagate the probabilities down to the leaves (summing log probabilities), and choose the highest.

    I had to introduce a way to incentivize going further down the tree (as it could otherwise stop at object/architecture/building if there is corresponding training data), and I used an arbitrary trial-and-error process to decide how (I don't feel comfortable with this):

        if numcategories == 4:
            tempscore += 1
        elif numcategories == 5:
            tempscore += 1.3
        elif numcategories == 6:
            tempscore += 1.5
        elif numcategories > 6:
            tempscore += 2
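
To make the propagation step in method 2 concrete, here is a minimal sketch, not my actual code. It assumes a hypothetical toy hierarchy (`NODE_PROBS`) whose per-node probabilities would, in the real setup, come from each node's classifier for a single sample. It also assumes that only leaves count as outcomes, so it leaves out the depth bonus shown above.

```python
import heapq
import math

# Hypothetical toy hierarchy: each node path maps to {child: P(child | node)}.
# In practice these values would be the per-node classifier's predicted
# probabilities for one sample.
NODE_PROBS = {
    "object": {"architecture": 0.7, "animal": 0.2, "tool": 0.1},
    "object/architecture": {"building": 0.8, "bridge": 0.2},
    "object/architecture/building": {"residential": 0.6, "office": 0.4},
    "object/animal": {"dog": 0.9, "cat": 0.1},
}


def best_leaf(root, node_probs, c=3):
    """Expand the c most probable children of every node, summing
    log-probabilities along each path, and return the best leaf path."""
    best_path, best_score = None, -math.inf
    stack = [(root, 0.0)]
    while stack:
        path, score = stack.pop()
        children = node_probs.get(path)
        if not children:
            # No classifier below this node: treat it as a leaf.
            if score > best_score:
                best_path, best_score = path, score
            continue
        # Keep only the c highest-probability children of this node.
        top = heapq.nlargest(c, children.items(), key=lambda kv: kv[1])
        for name, prob in top:
            stack.append((f"{path}/{name}", score + math.log(prob)))
    return best_path, best_score


path, score = best_leaf("object", NODE_PROBS, c=3)
print(path, round(math.exp(score), 3))
```

Summing log probabilities is the same as multiplying the conditional probabilities along the path, which is why deeper paths tend to score lower and why some correction for depth was needed in the first place.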

It is important to note that I have around 290k training samples and ~150k (currently mostly) boolean features (represented as 1.0 or 0.0). The data is highly sparse, so I use scipy's sparse matrices. Also, there are ~6500 independent classes (though far fewer at each node in method 2).

With method 1, I get around 75-76% accuracy with scikit's SGDClassifier(loss=hinge), and around 76-77% with LinearSVC (although it's 8-9 times slower).
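
For reference, the flat setup of method 1 looks roughly like the sketch below. The data here is synthetic and tiny (the sizes, the random labels and the variable names are placeholders, not my real data), so it only illustrates the shape of the pipeline: a scipy CSR matrix of 1.0/0.0 features fed straight into SGDClassifier with hinge loss.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
n_samples, n_features, n_classes = 300, 1000, 5  # placeholder sizes
nnz = 3000  # roughly 1% of entries are non-zero

# Build a sparse boolean feature matrix; only non-zero entries are stored.
rows = rng.integers(0, n_samples, size=nnz)
cols = rng.integers(0, n_features, size=nnz)
X = sparse.csr_matrix(
    (np.ones(nnz), (rows, cols)), shape=(n_samples, n_features)
)
X.data[:] = 1.0  # duplicate (row, col) pairs are summed; reset them to 1.0

# Random labels, purely for illustration.
y = rng.integers(0, n_classes, size=n_samples)

# One linear model over all classes (one-vs-rest under the hood).
clf = SGDClassifier(loss="hinge", random_state=0)
clf.fit(X, y)

scores = clf.decision_function(X)  # margins, not probabilities
pred = clf.predict(X)
print(scores.shape, pred.shape)
```

Note that `decision_function` returns signed margins per class, which is what makes this classifier awkward to plug into the tree of method 2.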

However, for the second method (which I think can/should ultimately perform better), neither of these classifiers produces true probabilities, and while I've attempted to scale the confidence scores produced by their .decision_function() methods, it didn't work well (accuracies of 10-25%). Thus, I switched to LogisticRegression(), which gets me ~62-63% accuracy. Also, NB-based classifiers seem to perform substantially less well.
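
One commonly used way to turn margin scores into something closer to probabilities is to fit a calibration layer on top of the linear classifier. The sketch below assumes scikit-learn's CalibratedClassifierCV wrapped around LinearSVC with sigmoid (Platt-style) calibration. This is not the scaling I attempted above, the data is synthetic, and I have not verified that it would help on the real problem.

```python
import numpy as np
from scipy import sparse
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_samples, n_features, n_classes = 300, 200, 3  # placeholder sizes
nnz = 3000

# Synthetic sparse boolean features and random labels, for illustration only.
rows = rng.integers(0, n_samples, size=nnz)
cols = rng.integers(0, n_features, size=nnz)
X = sparse.csr_matrix(
    (np.ones(nnz), (rows, cols)), shape=(n_samples, n_features)
)
X.data[:] = 1.0
y = rng.integers(0, n_classes, size=n_samples)

# Fit LinearSVC on cross-validation folds and learn a sigmoid mapping
# from its decision scores to probabilities on the held-out parts.
calibrated = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=3)
calibrated.fit(X, y)

proba = calibrated.predict_proba(X[:5])  # rows sum to 1
# Log-probabilities that could be summed along a path in the tree.
log_proba = np.log(np.clip(proba, 1e-12, None))
print(proba.shape)
```

The calibrated outputs could then replace the LogisticRegression() probabilities at each node, at the cost of fitting the base classifier once per fold.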

Ultimately, I have two-ish questions:

  1. Is there a better classifier (than scikit's LogisticRegression()) implemented in Python (it could be in scikit or mlpy/nltk/orange/etc.) that can (i) handle sparse matrices, (ii) produce (something close to) probabilities, and (iii) work with multiclass classification?
  2. Is there a way to handle method 2 better?
     2.a. Specifically, is there a way to better handle incentivizing the classifier to produce results further down the tree?

