R - tapply - creating NA?
I'm trying to calculate the mean number of unique fruits per person (my usual practice data). It works with both of these lines of code:
    with(df, tapply(fruit, names, FUN = function(x) length(unique(x)))) -> uniques
    sum(uniques) / length(unique(df$names))

    aggregate(df[, "fruit"], by = list(id = names), FUN = function(x) length(unique(x))) -> d1
    sum(d1$x) / length(unique(df$names))

My problem is that when I use this code on my real data it doesn't work. The real data is prescribing data, and I want the mean number of unique drugs per person. The tapply code appears to have created brand new patient IDs that do not exist in the original df, and it has given 1000s of NA values. There are no missing values in the id column, and none in the drug_code column either.
    with(dt3, tapply(drug_code, id, FUN = function(x) length(unique(x)))) -> uniques
    head(uniques)
                       uniques
    patient hai0000001      NA
    patient hai0000003      NA
    patient hai0000008      NA
    patient hai0000010      NA
    patient hai0000014      NA
    patient hai0000020      NA

    ## checking to see if ha10000001 occurs in the original df
    ## (dim of df is 228954 rows and 5 cols)
    table(dt3$id == "patient hai0000001")

     FALSE
    228954

For the aggregate code, the error is:
    aggregate(dt3[, "drug_code"], by = list(id = id), FUN = function(x) length(unique(x))) -> d1
    Error in aggregate.data.frame(as.data.frame(x), ...) :
      arguments must have same length

I don't understand what's happening. The real data is similar to the practice data in that it has an id col and a drug/fruit column. There is no missing data in either df. I know lapply is better for data frames, but I don't need a df back, and in any case the tapply code works on the practice data df. Does anyone have an idea of what is happening here?
Practice df:

    names <- as.character(c("john", "john", "john", "john", "john",
                            "mary", "mary", "mary", "mary", "mary",
                            "jim", "sylvia", "ted", "ted", "mary",
                            "sylvia", "jim", "ted", "john", "ted"))
    dates <- as.Date(c("2010-07-01", "2010-09-01", "2010-11-01", "2010-12-01", "2011-01-01",
                       "2010-08-12", "2010-11-11", "2010-05-12", "2010-12-03", "2010-07-12",
                       "2010-12-21", "2010-02-18", "2010-10-29", "2010-08-13", "2010-11-11",
                       "2010-05-12", "2010-04-01", "2010-05-06", "2010-09-28", "2010-11-28"))
    fruit <- as.character(c("kiwi", "apple", "banana", "orange", "apple",
                            "orange", "apple", "orange", "apple", "apple",
                            "pineapple", "peach", "nectarine", "grape", "melon",
                            "apricot", "plum", "lychee", "watermelon", "apple"))
    df <- data.frame(names, dates, fruit)

Example of the real data:
    head(dt3)
                      id quantity date_of_claim drug_code      index
    1 patient hai0000560        1    2009-10-15   r03ac02 2010-04-06
    2 patient hai0000560        1    2009-10-15   r03ak06 2010-04-06
    3 patient hai0000560       30    2009-10-15   r03bb04 2010-04-06
    4 patient hai0000560       30    2009-10-15   a02bc01 2010-04-06
    5 patient hai0000560       50    2009-10-15   m02aa15 2010-04-06
    6 patient hai0000560       30    2009-10-15   n02be51 2010-04-06
In this case you are asking for a single number: the mean of the individual lengths of a particular vector (unique(fruit)) within each patient ID. This shows first the individual unique counts and then the result of the mean function:
    > with(df, tapply(fruit, names, function(x) length(unique(x))))
       jim   john   mary sylvia    ted
         2      5      3      2      4
    > mean(with(df, tapply(fruit, names, function(x) length(unique(x)))))
    [1] 3.2

I would comment that the test for containment of a particular value in your code above had a trailing space that might have caused problems. "string " is not equal to "string". I have put a copy of the trim function from the gdata package (gdata::trim) in my .Rprofile file to make it easier for me to handle this possibility.
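The trailing-space point can be checked with a small sketch. Everything in it is made up for illustration: the toy data frame `dt`, its ids `"P1 "` and `"P2"`, and the helper `n_unique` do not come from the real data. It uses base R's `trimws()` rather than `gdata::trim`. The second half shows another situation that is known to produce NA from `tapply()`, namely factor levels with no rows; whether that applies to `dt3` is an assumption, since the question does not show how `dt3` was built.

```r
# Hypothetical toy data; ids and codes are invented for illustration only.
dt <- data.frame(
  id        = c("P1 ", "P1 ", "P2", "P2", "P2"),  # "P1 " carries a trailing space
  drug_code = c("A", "B", "A", "A", "C"),
  stringsAsFactors = TRUE
)

# Comparing against the value without the space matches nothing.
before <- sum(dt$id == "P1")   # 0

# trimws() (base R) strips leading and trailing whitespace.
dt$id <- trimws(as.character(dt$id))
after <- sum(dt$id == "P1")    # 2

# Helper assumed for this sketch: number of distinct values in a vector.
n_unique <- function(x) length(unique(x))

# A factor level with no rows gives NA in tapply().
sub <- dt[dt$id == "P2", ]
sub$id <- factor(sub$id, levels = c("P1", "P2"))  # "P1" is now an empty level
with_na <- with(sub, tapply(drug_code, id, n_unique))
print(with_na)                 # P1 is NA, P2 is 2

# droplevels() removes the empty level before summarising.
sub$id <- droplevels(sub$id)
counts <- with(sub, tapply(drug_code, id, n_unique))
print(mean(counts))
```

If `dt3$id` turns out to be a factor, comparing `nlevels(dt3$id)` with `length(unique(dt3$id))` would show quickly whether empty levels are present.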