machine learning - How to pre-process a dataset for maximum effectiveness with the LibSVM Weka implementation


I read a paper that said pre-processing a dataset correctly can increase LibSVM classification accuracy dramatically. I'm using the Weka implementation and would like some help making sure my dataset is optimal.

Here are my (example) attributes:

    Power    numeric (real numbers, range is 0 to 1.5132, 9000+ unique values)
    Voltage  numeric (similar to Power)
    Light    numeric (0 and 1 are the only 2 possible values)
    Day      numeric (1 through 20 are the possible values, equal number of each value)
    Range    nominal {1,2,3,4,5}  <---- these are the classes

My question is: which Weka pre-processing filters should I apply to make this dataset more effective for LibSVM?

  1. Should I normalize and/or standardize the Power and Voltage data values?
  2. Should I use a discretization filter on anything?
  3. Should I be binning the Power/Voltage values into a much smaller number of bins?
  4. Should I make the Light value binary instead of numeric?
  5. Should I normalize the Day values? Does it even make sense to do that?
  6. Should I be using the Nominal to Binary or a Nominal to something else filter for the classes "Range"?

Please advise on these questions and on anything else you think I might have missed...

Thanks in advance!!

Normalization is very important, because it influences the concept of distance used by the SVM. The two main approaches to normalization are:

  1. Scale each input dimension to the same interval, for example [0, 1]. This is the most common approach by far. It is necessary to prevent some input dimensions from dominating others. It is recommended by the LibSVM authors in their beginner's guide (see Appendix B for examples).
  2. Scale each instance to a given length. This is common in text mining / computer vision.
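
The two approaches above can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic, not Weka's or LibSVM's implementation; the function names and the sample rows (a made-up Power and Day column) are assumptions for the example.

```python
import math

def min_max_scale_columns(rows, lo=0.0, hi=1.0):
    """Approach 1: scale each input dimension (column) to [lo, hi] independently."""
    columns = list(zip(*rows))
    mins = [min(c) for c in columns]
    maxs = [max(c) for c in columns]
    scaled = []
    for row in rows:
        new_row = []
        for value, cmin, cmax in zip(row, mins, maxs):
            if cmax == cmin:
                new_row.append(lo)  # constant column: map every value to lo
            else:
                new_row.append(lo + (hi - lo) * (value - cmin) / (cmax - cmin))
        scaled.append(new_row)
    return scaled

def scale_to_unit_length(row):
    """Approach 2: scale one instance (row) so its Euclidean norm is 1."""
    norm = math.sqrt(sum(v * v for v in row))
    return list(row) if norm == 0 else [v / norm for v in row]

# Hypothetical rows: (power, day). Unscaled, day would dominate any distance.
rows = [[0.0, 1.0], [0.7566, 10.0], [1.5132, 20.0]]
print(min_max_scale_columns(rows))
print(scale_to_unit_length([3.0, 4.0]))
```

When a separate test set is used, the minimum and maximum found on the training data should be reused to scale the test data, rather than recomputed.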

As for handling the different types of inputs:

  1. Continuous: no work needed, the SVM works on these implicitly.
  2. Ordinal: treat as continuous variables. For example, cold, lukewarm, hot could be modeled as 1, 2, 3 without implicitly defining an unnatural structure.
  3. Nominal: perform one-hot encoding, e.g. for an input with N levels, generate N new binary input dimensions. This is necessary because you must avoid implicitly defining a varying distance between nominal levels. For example, modelling cat, dog, bird as 1, 2 and 3 implies that dog and bird are more similar than cat and bird, which is nonsense.
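
To make the cat, dog, bird argument concrete, here is a minimal sketch in plain Python comparing integer coding with one-hot encoding. The helper names `one_hot` and `squared_distance` are made up for this illustration and are not part of Weka or LibSVM.

```python
def one_hot(value, levels):
    """Return one binary input per level; exactly one of them is 1.0."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return [1.0 if value == level else 0.0 for level in levels]

def squared_distance(a, b):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

levels = ["cat", "dog", "bird"]
integer_code = {"cat": [1.0], "dog": [2.0], "bird": [3.0]}

# Integer coding: cat and bird end up farther apart than dog and bird.
print(squared_distance(integer_code["cat"], integer_code["bird"]))  # 4.0
print(squared_distance(integer_code["dog"], integer_code["bird"]))  # 1.0

# One-hot coding: every pair of distinct levels is equally far apart.
print(squared_distance(one_hot("cat", levels), one_hot("bird", levels)))  # 2.0
print(squared_distance(one_hot("dog", levels), one_hot("bird", levels)))  # 2.0
```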

Normalization must be done after substituting inputs where necessary.
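
The ordering can be sketched as follows: first replace a nominal column by its dummy columns, then scale every final column, dummies included. This is a plain-Python illustration under assumed names (`encode_then_scale`, `nominal_index`) with invented sample rows, not a description of how any Weka filter is implemented.

```python
def encode_then_scale(rows, nominal_index, levels):
    """One-hot encode one nominal column, then min-max scale all final columns to [0, 1]."""
    # Step 1: substitute the nominal column with one binary column per level.
    expanded = []
    for row in rows:
        dummies = [1.0 if row[nominal_index] == level else 0.0 for level in levels]
        numeric = [float(v) for i, v in enumerate(row) if i != nominal_index]
        expanded.append(numeric + dummies)

    # Step 2: scale every final column, including the dummy columns.
    columns = list(zip(*expanded))
    mins = [min(c) for c in columns]
    spans = [max(c) - min(c) for c in columns]
    return [
        [(v - m) / s if s else 0.0 for v, m, s in zip(row, mins, spans)]
        for row in expanded
    ]

# Hypothetical rows: (power, day, animal)
rows = [[0.2, 1, "cat"], [1.5, 20, "dog"], [0.85, 10, "bird"]]
for row in encode_then_scale(rows, nominal_index=2, levels=["cat", "dog", "bird"]):
    print(row)
```

With min-max scaling to [0, 1] the dummy columns are left unchanged as long as each level occurs at least once, but the order matters for other schemes such as standardization or a different target interval.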


To answer your questions:

  1. Should I normalize and/or standardize the Power and Voltage data values?

    Yes, standardize all (final) input dimensions to the same interval (including dummies!).

  2. Should I use a discretization filter on anything?

    No.

  3. Should I be binning the Power/Voltage values into a much smaller number of bins?

    No. Treat them as continuous variables (e.g. 1 input each).

  4. Should I make the Light value binary instead of numeric?

    No, the SVM has no concept of binary variables and treats everything as numeric. Converting it will just lead to a type-cast internally.

  5. Should I normalize the Day values? Does it even make sense to do that?

    If you want to use it as 1 input dimension, you must normalize it just like the others.

  6. Should I be using the Nominal to Binary or a Nominal to something else filter for the classes "Range"?

    Nominal to Binary, using one-hot encoding.

