Notes on Learning Statistics with R
The book, plus the datasets and code it uses:
http://health.adelaide.edu.au/psychology/ccs/teaching/lsr/
Learning statistics with R (notes)
Part I Research design:
Measurement
Operationalization: define precisely what you want to measure; determine the measurement method; define the set of allowable values the measurement can take
A theoretical construct: the concept or idea you are trying to measure
A measure: method or tool used to make observations
An operationalization: the logical connection between the previous two
A variable: actual data
Scales of measurement: Nominal(categorical), Ordinal, Interval, Ratio scale
Continuous versus discrete values
Likert scale: a quasi-interval scale
Reliability (precision) & validity (accuracy): reliability is the repeatability or consistency of a measurement; validity is its correctness
Ways of measuring reliability:
Test-retest reliability (consistency over time)
Inter-rater reliability (consistency across people)
Parallel forms reliability (theoretically equivalent measurements)
Internal consistency reliability (different parts that perform similar functions)
Dependent variable (DV): the outcome & independent variable (IV): the predictor
Randomization: randomly assign people to different groups and then give each group a different treatment
Non-experimental research: quasi-experimental research and case studies
Types of validity: internal; external; construct; face; ecological
Internal validity: the extent to which you are able to draw correct conclusions about the causal relationships between variables. Relationships within the study itself.
External validity: the generalizability of findings.
Construct validity: whether you’re measuring what you want to be measuring
Face validity: whether or not a measure “looks like” it’s doing what it’s supposed to, nothing more.
Ecological validity: the study should closely approximate the real world scenario that is being investigated.
Confounds, artifacts and other threats to validity
Confound: additional, unmeasured variable that turns out to be related to both the predictor and the outcomes.
Artifact: a result that only holds in the special situation you happened to test in your study.
History effects: specific events may occur during the study itself that might influence the outcomes.
Maturation effects: just how people change on their own over time
Repeated testing: learning and practice; familiarity with the testing situation; auxiliary changes caused by testing.
Selection bias:
differential attrition: drop-out that differs across groups or conditions (contrast with homogeneous attrition, which is the same for all groups, treatments or conditions)
non-response bias: missing data.
Regression to the mean:
It refers to any situation where you select data based on an extreme value on some measure
Experimenter bias comes in multiple forms [countered with double-blind studies]
Demand effects and reactivity: good participant; negativistic participant; faithful participant; apprehensive participant
Placebo effect.
Data fabrication; hoaxes; data misrepresentation; study misdesign; data mining & post hoc hypothesizing; publication bias & self-censoring
-----------------------------------------------------------------------------------------
Part II Intro to R
Assignment uses <- and ->
Variable names may contain periods "." and underscores "_", but may not contain spaces or be any of the reserved words: if, else, repeat, while, function, for, in, next, break, TRUE, FALSE, NULL, Inf, NaN, NA, NA_integer_, NA_real_, NA_complex_, NA_character_. Names are case sensitive and must start with a letter or a period (.variable <- "for special purposes")
PrincipleforName <- c("informative", "short", "conventional styles for multi-word variable names")
sqrt(x) and x ^ 0.5 give the same result, but the former is calling a function # though I'm not entirely sure what the practical difference is.
abs() # absolute value; round() # rounding, used as round(x = , digits = ): x is the value, digits is the number of decimal places, default = 0
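A tiny sketch of these operations (the numbers are made up):
x <- 3.14159                     # assignment with <-
3.14159 -> y                     # the arrow also works in the other direction
sqrt(x)                          # calling a function; same result as x ^ 0.5
abs(-5)                          # absolute value: 5
round(x = 3.14159, digits = 2)   # 3.14; digits defaults to 0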
indexing [ ], [c( )], [ : ]
length(x = list )
nchar( x = string)
"!" # not, "|" # or, "&" # and
x <- c(TRUE, FALSE, TRUE) # logical value can be stored
logical indexing: returns only the elements whose index value is TRUE
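A small indexing sketch (the vector is invented):
scores <- c(10, 20, 30, 40, 50)
scores[2]                                  # one element: 20
scores[c(1, 3)]                            # several elements: 10 30
scores[2:4]                                # a range: 20 30 40
scores[c(TRUE, FALSE, TRUE, FALSE, TRUE)]  # logical indexing keeps the TRUE positions: 10 30 50
scores[scores > 25]                        # same idea via a condition: 30 40 50
length(x = scores)                         # 5
nchar(x = "hello")                         # 5 characters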
q() # quit R
getwd() # working directory
setwd() # change working directory
list.files() # list all the files
path.expand() # expands a leading "~" in a path into the full path of the home directory
.Rdata file extension # workspace files
.csv file extension # comma-separated values
.R file extension # Script files
load(file = "thefullpath.filetype"), or setwd("thefullpath") followed by load("nameoffile.typeoffile")
read.csv(file = "filename.csv")
View() # note the capital V
save.image( file = "filename.filetype" )
save(variablename1, variablename2, file = "filename")
save.me <- c("variablename1", "variablename2")
save( file = "filename.filetype", list = save.me)
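A hedged sketch of saving and loading (all file and variable names here are placeholders):
setwd("~/myproject")                          # point R at the folder you want
x <- 1:10
y <- c("a", "b")
save(x, y, file = "somevars.Rdata")           # save specific variables
save.me <- c("x", "y")
save(list = save.me, file = "somevars.Rdata") # the same thing via a character vector of names
save.image(file = "everything.Rdata")         # save the whole workspace
load(file = "somevars.Rdata")                 # bring the variables back later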
special value:
Inf # 1 / 0,
NaN # 0 / 0
NA # Not available
NULL # No value
names() # assign names to value
# names(variable) <- c("name1", "name2", "name3", "name4"...)
names(variable) <- NULL # remove the names again
class() # the kind of variable from R's point of view (numeric, character, logical, factor, data.frame, ...)
mode() # the format of the information the variable stores, e.g. text data or numeric data
typeof() # the low-level storage type (e.g. "double", "integer", "character")
as.factor() # convert a variable into a factor (a grouping variable)
# group <- as.factor(group); the variable is now of type factor
levels(variable) <- c(...) # assign meaningful labels to the different levels of the factor
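A factor sketch; group here is a made-up vector of numeric codes:
group <- c(1, 1, 2, 2, 3)
group <- as.factor(group)                          # R now treats it as a nominal variable
levels(group)                                      # "1" "2" "3"
levels(group) <- c("control", "drug A", "drug B")  # attach meaningful labels to the levels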
mydataframe <- data.frame(var1, var2, var3, var4) # combine var1 ... var4 into one data frame called mydataframe
# to extract one of them, use mydataframe$var1
names() # extract the names
list() # can hold all sorts of information
# zyy <- list(age = 100, gender = "female", pet = "cat"); access with zyy$gender
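A data frame / list sketch with invented data:
age    <- c(21, 35, 42)
gender <- c("male", "female", "female")
score  <- c(12, 18, 15)
mydataframe <- data.frame(age, gender, score)
mydataframe$age                # pull out one variable with $
names(mydataframe)             # "age" "gender" "score"
zyy <- list(age = 100, gender = "female", pet = "cat")
zyy$gender                     # lists are accessed the same way: "female"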
createaformula <- a ~ b # the ~ operator creates a formula ("a is modelled by b")
print.default()
# About help: ?functionname. Here the author mostly complains that the help pages aren't very useful, which I totally agree with.
www.rseek.org
PART III Working with Data
【Descriptive Stats】
mean(x = , trim = ) # mean: the average; trim drops that proportion of observations from each end before averaging
median(x = ) # median
sort() # sort values
Tips: for nominal data use the mode; for ordinal data use the median; for interval and ratio data either the mean or the median works
table()
modeOf(x = ) # the modal value (the value that occurs most often); from the lsr package
maxFreq(x = ) # the frequency of that modal value; also from lsr
max()
min()
range()
The interquartile range (IQR) is like the range, but instead of the difference between the biggest and smallest values it is the difference between the 25th percentile and the 75th percentile.
A quantile is just a percentile: the 10th percentile of a data set is the smallest number x such that 10% of the data is less than x.
quantile() # calculates quantiles: quantile(x = , probs = )
IQR()
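Quantiles and the IQR on a made-up vector:
x <- c(2, 4, 4, 5, 7, 8, 9, 12, 15, 20)
quantile(x = x, probs = 0.10)           # 10th percentile
quantile(x = x, probs = c(0.25, 0.75))  # 25th and 75th percentiles
IQR(x)                                  # the difference between those two
range(x)                                # smallest and largest values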
The lsr package has aad() # (mean) absolute deviation
var() # variance
sd() # standard deviation
mad() # median absolute deviation
skewness <- "a measure of asymmetry" # extreme values trailing off to the left (a long left tail) mean the data are negatively skewed, while extreme values trailing off to the right (a long right tail) mean the data are positively skewed.
In the psych package you can use skew(x = )
Kurtosis is a measure of the "pointiness" of a data set: a normal curve has kurtosis 0 (mesokurtic); platykurtic means too flat (negative value); leptokurtic means too pointy (positive value).
Getting a summary: summary(object = ). For a factor we get a frequency table (like table()); but for a character vector (if we convert it with as.character()) the output changes.
In psych, describe() is intended just for interval or ratio scale data, and provides a block of descriptive statistics.
describeBy(x = , group = ) an additional argument which specifies a grouping variable.
by(data = , INDICES = , FUN) data specifies the data set, INDICES specifies the grouping variable, and the FUN specifies the name of a function.
multiple grouping using aggregate(formula = , data = , FUN = )
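A grouped-descriptives sketch; dat is an invented data frame and the psych package is assumed to be installed:
dat <- data.frame(score = c(5, 7, 6, 9, 10, 8),
                  sex   = c("m", "f", "m", "f", "m", "f"),
                  cond  = c("a", "a", "b", "b", "a", "b"))
library(psych)
describeBy(x = dat$score, group = dat$sex)            # descriptives split by one grouping variable
by(data = dat$score, INDICES = dat$sex, FUN = mean)   # base-R version, one function at a time
aggregate(score ~ sex + cond, data = dat, FUN = mean) # several grouping variables at once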
Standard scores (a standard score, or z-score, is a better way to compare)
formula: standard score <- (raw_score - mean) / (standard deviation)
pnorm() can tell the percentile rank of the z-score
z-scores can be compared because each standardised score is a statement about where an observation falls relative to its own population, so it is possible to compare standardised scores across completely different variables.
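A standardisation sketch (the numbers are invented):
raw.score  <- 35
grade.mean <- 25
grade.sd   <- 5
z <- (raw.score - grade.mean) / grade.sd   # z = 2
pnorm(z)                                   # about 0.977, i.e. roughly the 98th percentile
scale(c(10, 20, 30))                       # scale() standardises a whole vector at once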
Correlations (r = 1: perfect positive relationship; r = -1: perfect negative relationship)
Cov(X, Y) = (1 / (N - 1)) * Σ (x_i - mean(x)) * (y_i - mean(y))
r_XY = Cov(X, Y) / (sd(X) * sd(Y))
cor(x, y) # and always look at the graph first
Spearman's rank correlation: 1. rank the data, e.g. rank.the.data <- rank(file$variable); 2. run cor() on the ranked variables
or just cor(a1, a2, method = "spearman")
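A correlation sketch with made-up vectors:
hours <- c(2, 4, 6, 8, 10)
grade <- c(50, 60, 65, 80, 90)
plot(hours, grade)                       # always look at the scatterplot first
cor(hours, grade)                        # Pearson correlation (the default)
cor(rank(hours), rank(grade))            # Spearman by hand, via rank()
cor(hours, grade, method = "spearman")   # or directly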
Handling missing values (NA): use na.rm = TRUE (the default is FALSE, which lets NAs propagate)
For correlations there is no na.rm argument; instead cor() takes a use argument:
1. "complete.obs" # ignore a row of the data frame completely if it contains any NA, even when the NA is not in the variables being correlated
2. "pairwise.complete.obs" # drop NAs pair by pair, using all available data for each pair of variables
(the lsr package's correlate() automatically uses the "pairwise complete" method)
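A missing-values sketch; the NA is placed deliberately:
x <- c(1, 2, NA, 4, 5)
y <- c(2, 4, 6, 8, 10)
mean(x)                                  # NA: missing values propagate
mean(x, na.rm = TRUE)                    # 3: drop the NA first
cor(x, y)                                # NA again; cor() has no na.rm argument
cor(x, y, use = "complete.obs")          # drop any case containing an NA
cor(x, y, use = "pairwise.complete.obs") # drop NAs pair by pair
# correlate(data.frame(x, y))            # lsr alternative, pairwise-complete by default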
The death of one man is a tragedy.
The death of millions is a statistic.
– Josef Stalin, Potsdam 1945
【drawing graphs】
----------------------------------------------------------------------------------------
probability theory
dbinom() refers to the density: the probability of getting one particular outcome out of all possible outcomes.
pbinom() refers to the cumulative probability up to a particular quantile # p always means the cumulative probability
qbinom() refers to the quantile corresponding to a given cumulative probability. # the same d/p/q pattern exists for the normal distribution: dnorm(), pnorm(), qnorm()
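A binomial sketch, 20 fair coin flips (my own example, not from the book):
dbinom(x = 10, size = 20, prob = 0.5)   # P(exactly 10 heads)
pbinom(q = 10, size = 20, prob = 0.5)   # P(10 or fewer heads): cumulative probability
qbinom(p = 0.5, size = 20, prob = 0.5)  # number of heads at the 50th percentile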
-----------------------------------
Data within a sample are descriptive; using the sample to estimate the population is inferential.
There are many sampling methods: simple random, snowball, stratified sampling, and so on; the right one is whichever suits the question. Note the difference between sampling with and without replacement: with replacement means the same observation can be drawn more than once. Psychology experiments are generally without replacement, whereas probability theory mostly assumes with replacement, but when the population is large the two barely differ.
The point of computing sample statistics is to say something about the whole population, so we need the estimated population parameters (the estimated population mean and the estimated population standard deviation).
According to the law of large numbers, the larger N is, the more accurate the estimate becomes.
According to the central limit theorem, as the experiment is repeated over and over the sampling distribution of the mean clusters ever more closely around the population mean (and approaches a normal shape), regardless of the sample size.
A small sample size does, however, affect the estimate of the standard deviation.
Here the number of repetitions is like the number of observations, i.e. N, while the sampling number is the number of trials. # I think that's how it works #
Standard deviation: in general the sd of a sample is smaller than the population sd, which is why the estimate divides by N - 1 rather than N.
The two plots (in the book) are quite different: on average, the average sample mean is equal to the population mean; it is an unbiased estimator, which is essentially the reason why your best estimate for the population mean is the sample mean. The plot on the right is quite different: on average, the sample standard deviation s is smaller than the population standard deviation σ; it is a biased estimator. In other words, if we want to make a "best guess" σ̂ about the value of the population standard deviation σ, we should make sure our guess is a little bit larger than the sample standard deviation s.
Symbol | What it is | Do we know its value?
s | sample standard deviation | yes, calculated from the raw data
σ | population standard deviation | almost never known for sure
σ̂ | estimate of the population standard deviation | yes, but not the same as the sample standard deviation
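A small simulation of my own (not from the book) illustrating the bias: the mean of many sample means matches the population mean, while the mean of many sample standard deviations falls short of the population sd:
set.seed(1)
pop.mean <- 100; pop.sd <- 15
sample.means <- replicate(10000, mean(rnorm(5, pop.mean, pop.sd)))
sample.sds   <- replicate(10000, sd(rnorm(5, pop.mean, pop.sd)))
mean(sample.means)   # close to 100: unbiased
mean(sample.sds)     # noticeably below 15: biased, even though sd() already divides by N - 1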
confidence interval for the mean: an attempt to quantify the amount of uncertainty that attaches to our estimate.
95% chance: a normally-distributed quantity will fall within 1.96 standard deviations of the true mean.
N <- sample size
qt( p =, df = N-1)
The critical value from qt() is larger when N = 10 than when N = 10000, so the interval is wider when N = 10.
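A hand-rolled 95% CI sketch (data invented); qt() supplies the t-based critical value:
x <- c(12, 15, 14, 10, 18, 16, 13, 17, 11, 14)
N <- length(x)                      # sample size, 10
crit <- qt(p = 0.975, df = N - 1)   # about 2.26 here; approaches 1.96 as N grows
se <- sd(x) / sqrt(N)               # standard error of the mean
mean(x) + c(-1, 1) * crit * se      # lower and upper bounds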
plotting confidence intervals in R
The lsr package has ciMean() to calculate the confidence interval for a mean.
You can also use plotmeans() from the gplots package, or bargraph.CI() and lineplot.CI() from sciplot.
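A package-based sketch; it assumes lsr (and, for the commented lines, gplots and sciplot) are installed, and dat/outcome/group are placeholders:
library(lsr)
x <- c(12, 15, 14, 10, 18, 16, 13, 17, 11, 14)   # the same invented scores as above
ciMean(x)                                        # 95% confidence interval for the mean
# library(gplots)
# plotmeans(outcome ~ group, data = dat)         # group means with CI bars
# library(sciplot)
# bargraph.CI(x.factor = group, response = outcome, data = dat)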