machine learning - How to pre-process dataset for maximum effectiveness with LibSVM Weka implementation
So I read a paper that said that processing your dataset correctly can increase LibSVM classification accuracy dramatically... I'm using the Weka implementation and want to make sure my dataset is optimal.
Here are my (example) attributes:
- power: numeric (real numbers, range 0 to 1.5132, 9000+ unique values)
- voltage: numeric (similar to power)
- light: numeric (0 or 1, only 2 possible values)
- day: numeric (1 through 20 are the possible values, equal number of each value)
- range: nominal {1,2,3,4,5} <---- these are the classes
My question is: which Weka pre-processing filters should I apply to make this dataset more effective for LibSVM?
- Should I normalize and/or standardize the power and voltage data values?
- Should I use a discretization filter on anything?
- Should I be binning the power/voltage values into a much smaller number of bins?
- Should I make the light value binary instead of numeric?
- Should I normalize the day values? Does it even make sense to do that?
- Should I be using the nominal-to-binary or nominal-to-something-else filter for the classes "range"?
Please advise on these questions and on anything else you think I might have missed...
Thanks in advance!!
Normalization is very important, as it influences the concept of distance used by the SVM. The 2 main approaches to normalization are:
- Scale each input dimension to the same interval, for example [0, 1]. This is the most common approach by far. It is necessary to prevent some input dimensions from dominating others. It is recommended by the LIBSVM authors in their beginner's guide (see Appendix B for examples).
- Scale each instance to a given length. This is common in text mining / computer vision.
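As a rough illustration of the two approaches, here is a minimal sketch in plain Python. The helper names `min_max_scale` and `unit_length` and the sample values are made up for this example; in Weka itself the first approach is what the unsupervised `Normalize` attribute filter does by default.

```python
import math


def min_max_scale(rows):
    """Approach 1: scale every column (input dimension) to [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        # A constant column has no spread, so map it to 0.0.
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def unit_length(rows):
    """Approach 2: scale every row (instance) to Euclidean length 1."""
    result = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row))
        result.append([v / norm if norm > 0 else 0.0 for v in row])
    return result


# Made-up values: column 0 resembles "power", column 1 resembles "day".
rows = [[0.0, 1.0], [0.75, 10.0], [1.5, 20.0]]
scaled = min_max_scale(rows)   # both columns now span [0, 1]
normed = unit_length(rows)     # every row now has length 1
```

Without the first step, the day column (range 1 to 20) would contribute far more to any distance than the power column (range 0 to about 1.5). When a separate test set is used, the minimum and maximum found on the training data should be reused for it rather than recomputed.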
As to handling the types of inputs:
- Continuous: no work needed, SVM works on these directly.
- Ordinal: treat as continuous variables. For example, cold, lukewarm, hot could be modeled as 1, 2, 3 without implicitly defining an unnatural structure.
- Nominal: perform one-hot encoding, e.g. for an input with N levels, generate N new binary input dimensions. This is necessary because you must avoid implicitly defining a varying distance between nominal levels. For example, modelling cat, dog, bird as 1, 2, 3 implies that dog and bird are more similar than cat and bird, which is nonsense.
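The cat/dog/bird argument can be checked numerically. The following is a small sketch in plain Python; the helpers `one_hot` and `distance` are hypothetical names written for this illustration, not part of LibSVM or Weka.

```python
import math


def one_hot(value, levels):
    """Return one binary dimension per level, with a 1.0 at the matching level."""
    return [1.0 if value == level else 0.0 for level in levels]


def distance(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


levels = ["cat", "dog", "bird"]

# Integer coding: one dimension, levels mapped to 1, 2, 3.
as_integer = {"cat": [1.0], "dog": [2.0], "bird": [3.0]}
# One-hot coding: three binary dimensions.
as_one_hot = {name: one_hot(name, levels) for name in levels}

int_cat_bird = distance(as_integer["cat"], as_integer["bird"])
int_dog_bird = distance(as_integer["dog"], as_integer["bird"])
hot_cat_bird = distance(as_one_hot["cat"], as_one_hot["bird"])
hot_dog_bird = distance(as_one_hot["dog"], as_one_hot["bird"])
```

With the integer coding, cat and bird end up twice as far apart as dog and bird. With one-hot coding every pair of distinct levels is the same distance apart, so no artificial ordering is introduced.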
Normalization must be done after substituting inputs where necessary, so that the substituted dimensions are scaled as well.
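To make the ordering concrete, here is a minimal sketch in plain Python that first substitutes the ordinal and nominal inputs and only then scales all resulting dimensions. The three-attribute rows (a power-like value, a temperature, an animal) are invented for illustration and are not the dataset from the question; `substitute` and `min_max_scale` are hypothetical helper names.

```python
ORDINAL = {"cold": 1.0, "lukewarm": 2.0, "hot": 3.0}
ANIMALS = ["cat", "dog", "bird"]


def substitute(row):
    """Step 1: map the ordinal input to 1, 2, 3 and one-hot encode the nominal input."""
    power, temperature, animal = row
    encoded = [power, ORDINAL[temperature]]
    encoded += [1.0 if animal == a else 0.0 for a in ANIMALS]
    return encoded


def min_max_scale(rows):
    """Step 2: scale every final input dimension to [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


raw = [
    (0.0, "cold", "cat"),
    (0.75, "lukewarm", "dog"),
    (1.5, "hot", "bird"),
]
substituted = [substitute(r) for r in raw]  # 5 dimensions per row
final = min_max_scale(substituted)          # all 5 dimensions in [0, 1]
```

Scaling before substitution would leave the ordinal column spanning 1 to 3 while the other inputs span 0 to 1, which is the imbalance that normalization is meant to remove.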
To answer your questions:
Should I normalize and/or standardize the power and voltage data values?
Yes, standardize all (final) input dimensions to the same interval (including dummies!).
Should I use a discretization filter on anything?
No.
Should I be binning the power/voltage values into a much smaller number of bins?
No. Treat them as continuous variables (e.g. 1 input each).
Should I make the light value binary instead of numeric?
No, SVM has no concept of binary variables and treats everything as numeric, so converting it would just lead to a type-cast internally.
Should I normalize the day values? Does it even make sense to do that?
If you want to use it as 1 input dimension, you must normalize it like all the others.
Should I be using the nominal-to-binary or nominal-to-something-else filter for the classes "range"?
Nominal to binary, using one-hot encoding.