多列间有规律的计算
【问题】
I’ve done a fair amount of searching through SO and have not found or perhaps have not understood the solution to the following question. I have the following example data frame:
df <- data.frame(Day = c('Day 1', 'Day 2', 'Day 3', 'Day 1', 'Day 2', 'Day 3'),
WBC.esc = c('Yes', 'Yes', 'Yes', 'No', 'No', 'No'),
WBC.deesc = c('No', 'No', 'No', 'No', 'Yes', 'Yes'),
WBC.cdm = c('Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes'),
Hgb.esc = c('Yes', 'No', 'Yes', 'No', 'Yes', 'No'),
Hgb.deesc = c('No', 'Yes', 'No', 'No', 'No', 'Yes'),
Hgb.cdm = c('No', 'No', 'No', 'No', 'No', 'No'),
Plt.esc = c('No', 'No', 'Yes', 'No', 'No', 'No'),
Plt.deesc = c('Yes', 'Yes', 'No', 'No', 'Yes', 'Yes'),
Plt.cdm = c('Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes'))
# Day WBC.esc WBC.deesc WBC.cdm Hgb.esc Hgb.deesc Hgb.cdm Plt.esc Plt.deesc Plt.cdm
\# 1 Day 1 Yes No Yes Yes No No No Yes Yes
\# 2 Day 2 Yes No Yes No Yes No No Yes Yes
\# 3 Day 3 Yes No Yes Yes No No Yes No Yes
\# 4 Day 1 No No Yes No No No No No Yes
\# 5 Day 2 No Yes Yes Yes No No No Yes Yes
\# 6 Day 3 No Yes Yes No Yes No No Yes Yes
Days represent the number of days a patient has been in the hospital. The other column names refer to types of orders. I would like to answer the question of whether there were orders made (“Yes”) in any of the WBC/Hgb/Plt.esc or WBC/Hgb/Plt.deesc categories. In the large dataset there are other prefixes before *.esc such as HeartRate.esc, BP.esc so just searching on the .esc suffix won’t work. I know how to do this inefficiently by specifying each column name. Using the following code:
sum(df$Day=="Day 1" & (df$WBC.esc=="Yes" | df$WBC.deesc=="Yes" | df$Hgb.esc=="Yes" | df$Hgb.deesc=="Yes" | df$Plt.esc=="Yes" | df$Plt.deesc=="Yes"))
# [1] 1
This gives the correct answer of 1.
I have referenced the columns by their numbers which saves me from using the full column name (ex: dt[,2] instead of dt$WBC.esc)but even this seems prohibitively tedious when in my actual data set there are sometimes 40-50 columns to reference. In the full data set I would like to be able to do something like the following (using excel terminology here since that’s what I’m familiar with, with alphabets being column names). The idea being that it will take a row, look through all the specified columns in that row for a “Yes” value and if it finds it, count it as a 1 or if it does not find “Yes” then count it as a 0.
countif(B:E=="Yes" | H:Y=="Yes" | AE:AG=="Yes")
I also tried this as:
sum(df$Day=="Day 1" & (dt\[,2:3\]=="Yes" | df\[,5:6\]=="Yes" | df\[,8:9\]=="Yes"),na.rm=TRUE)
which does not work as it returns an answer of “2” suggesting that it is counting each “Yes.” Also this function does not run unless the number of columns reference in each is uneven such as dt[,2:4] | dt[,8:9] in which case I get:
“Error: binary operation on non-conformable arrays”
I’ve searched through a number of questions but in the posts I’ve seen the column names are individually listed out, not referred to in aggregate. How can I perform this operation referencing multiple sets of adjacent columns without having to list each variable separately?
【回答】
输入列号的序列 (比如 2,3,5,6,8,9) 就能执行查询,只改变序列的长度和值就能改变查询条件,要实现类似的算法,需要动态引用列号。这类结构化数据运算用 SPL 实现比较简单
A |
|
1 |
=file("D://data.csv").import@t() |
2 |
=[2,3,5,6,8,9] |
3 |
=A2.("#"/~/"==\"Yes\"").concat("||") |
4 |
=A1.count(Day=="Day1" &&(${A3})) |
A3 格的结果值是字符串:#2==“Yes”||#3==“Yes”||#5==“Yes”||#6==“Yes”||#8==“Yes”||#9==“Yes”
集算器应用环境类似 R,参考【集算器实现文本处理的应用方案】,关于宏的用法参考【集算器高级代码之活用字符串】