Mining Partisan Differences in Science Consumption from Online Book Purchases
This article was published in Nature Human Behaviour in April 2017. Original paper: Millions of online book co-purchases reveal partisan differences in the consumption of science
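Someone on Zhihu has already written about this paper, so credit where it is due: 白婷 (Bai Ting), "Nature 2017:从购书行为探索不同党派消费差异" (Nature 2017: exploring partisan differences in consumption through book purchases)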
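This paper left a deep impression on me when I first saw it last year, but I never read it closely. Since last March I have been working with reading-behaviour data from an app, and more than a year has gone by. Along the way there was plenty of confusion and plenty of reflection, so rereading the paper today was very rewarding, and I am noting my takeaways here. This post looks at how the paper was conceived, mainly from the angle of data and research design.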
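1. Core finding: liberals prefer basic science, conservatives prefer applied science. Readers of political books on both sides like to read about science, but they are not reading the same science.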
Findings reveal partisan preferences both within and across scientific disciplines. Across fields, customers for liberal or ‘blue’ political books prefer basic science (for example, physics, astronomy and zoology), whereas conservative or ‘red’ customers prefer applied and commercial science (for example, criminology, medicine and geophysics). Within disciplines, ‘red’ books tend to be co-purchased with a narrower subset of science books on the periphery of the discipline. We conclude that the political left and right share an interest in science in general, but not science in particular.
2. Research question: why the consumption of science?
We may disagree on emotionally charged social issues, but at least we can agree on science. (That is, people may split on social issues, but science is supposed to be common ground.)
3. Analytical perspective: books, not people.
Individuals are the units in surveys, but online retailers do not provide access to individual customer behaviour. Instead, we use individual books as the unit of analysis in constructing a co-purchase network. (Survey methods usually take people as the unit of analysis; here books are used to build a "co-purchase" network.)
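This is a very clever move. Many of the large-scale datasets used in research, especially in computational social science, contain only behavioural data and no individual attributes: I know what you did, but not who you are, where you come from, what kind of life you lead, or what kind of person you are. For social science research this is a fatal gap. If all you mine is a pile of behavioural regularities, social science professors will shoot back right away: What is your research question? Without individual attributes, it is hard to have a real sense of "the research problem".
So what is gained by taking books as the unit of analysis? Buying political books becomes an individual attribute in itself; in particular, buying liberal or conservative books largely reflects the buyer's political leaning (only to some extent, of course). To stay rigorous, though, the paper does not study individuals at all. It studies only how "liberal books" and "conservative books" are co-purchased with science books, and that is already enough to tell this very interesting story. Brilliant.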
Rather, our attention is focused exclusively on the science preferences of those who purchase liberal and conservative political books, a group whose science preferences could differ from those who do not shop for books online or who shop for science books but not for politics.
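The book-level design discussed above can be illustrated with a minimal sketch. It is not the authors' pipeline: the book titles, the `book_label` mapping, and the choice to treat co-purchase links as undirected are all assumptions made for illustration. It only shows how "red" and "blue" links to science books can be counted without any information about individual buyers.

```python
from collections import defaultdict

# Hypothetical toy data: each pair stands for a "customers who bought A
# also bought B" link. Titles are invented for illustration only.
co_purchases = [
    ("red_book_1", "criminology_101"),
    ("red_book_2", "criminology_101"),
    ("blue_book_1", "astronomy_today"),
    ("blue_book_1", "zoology_basics"),
    ("red_book_1", "astronomy_today"),
]

# Hypothetical labels: "red" (conservative), "blue" (liberal), or "science".
book_label = {
    "red_book_1": "red",
    "red_book_2": "red",
    "blue_book_1": "blue",
    "criminology_101": "science",
    "astronomy_today": "science",
    "zoology_basics": "science",
}


def build_network(pairs):
    """Build an undirected co-purchase graph with books as nodes.

    Assumption: links are treated as symmetric, which is a simplification.
    """
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def political_neighbours(graph, labels):
    """For each science book, count its red and blue co-purchased books."""
    counts = {}
    for book, neighbours in graph.items():
        if labels.get(book) != "science":
            continue
        red = sum(1 for n in neighbours if labels.get(n) == "red")
        blue = sum(1 for n in neighbours if labels.get(n) == "blue")
        counts[book] = {"red": red, "blue": blue}
    return counts


graph = build_network(co_purchases)
counts = political_neighbours(graph, book_label)
for title in sorted(counts):
    print(title, counts[title])
```

Note that no customer appears anywhere in the data structure: the political signal is carried entirely by which books sit next to each other in the network.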
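As the paper itself says, the study works only at the population level and does not go down to individual behaviour, so it cannot uncover causal relationships (causal modelling). This is the most common predicament in computational social science and in big-data analysis generally. It is a consequence of the data, so it is hard to fault, and finding significant correlations is already a good result. Telling such a clear and rigorous story with nothing but behavioural data is truly admirable.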
Second, we do not have individual-level co-purchase data and therefore cannot pursue analytical strategies for which such data would be required, such as multivariate causal modelling. Co-purchasing patterns reveal the population-level distribution of political interests in science but not the underlying individual-level causes.
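In addition, the data are limited to online behaviour, and nothing is known about people who buy books offline, so the results cannot readily be generalized. That is understandable too.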
First, although half of the US population purchases books online, this is not a random sample, which limits the ability to generalize our results to the other half.
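As for the choice of datasets, besides Amazon purchase data the authors also collected data from a second company (Barnes & Noble), which makes the results much more convincing.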
These three sets of concerns are mitigated, however, by similarity in the results derived from distinct bookstores, with different purchase patterns and company-specific algorithms.
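On this dataset, the paper measures four main indicators: Political Relevance, Political Alignment, Political Polarization and Scientific Breadth (see the original paper for how each is measured), and then examines the correlations among these variables. The analysis proceeds in three main steps: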
Our analysis proceeds in three steps. We first assess political relevance, alignment and polarization as measures of political interest in science compared with political interest in books and topics outside science; second, we report differences in these measures across scientific disciplines, broken down by the continuum of applied and basic science; third, we report the scientific breadth and location of red and blue books within disciplines.
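To make the four indicators more concrete, here is a rough sketch that computes simplified stand-ins for them per discipline. These formulas are my own assumptions for illustration and are not the definitions used in the paper (see the original for those): relevance is taken as the share of science books with any political link, alignment as (red - blue) / (red + blue) links, polarization as the mean absolute alignment of individual books, and breadth as the number of distinct science books linked to each side. The input records are invented.

```python
# Hypothetical input: one record per science book, with its discipline and
# the number of co-purchase links to red and blue political books.
books = [
    {"title": "astro_a", "discipline": "astronomy", "red": 1, "blue": 3},
    {"title": "astro_b", "discipline": "astronomy", "red": 0, "blue": 2},
    {"title": "crim_a", "discipline": "criminology", "red": 4, "blue": 0},
    {"title": "crim_b", "discipline": "criminology", "red": 0, "blue": 0},
]


def alignment(red, blue):
    """Stand-in alignment score in [-1, 1]; -1 is all blue, +1 is all red.

    Returns None when the book has no political links at all.
    """
    total = red + blue
    if total == 0:
        return None
    return (red - blue) / total


def summarize(records):
    """Compute the simplified stand-in measures for each discipline."""
    by_discipline = {}
    for rec in records:
        by_discipline.setdefault(rec["discipline"], []).append(rec)

    summary = {}
    for discipline, items in by_discipline.items():
        scores = [alignment(b["red"], b["blue"]) for b in items]
        linked = [s for s in scores if s is not None]
        red_total = sum(b["red"] for b in items)
        blue_total = sum(b["blue"] for b in items)
        summary[discipline] = {
            # Share of science books with at least one political link.
            "relevance": len(linked) / len(items),
            # Net lean of the discipline as a whole.
            "alignment": alignment(red_total, blue_total),
            # How one-sided individual books are, on average.
            "polarization": (
                sum(abs(s) for s in linked) / len(linked) if linked else None
            ),
            # Number of distinct science books each side links to.
            "red_breadth": sum(1 for b in items if b["red"] > 0),
            "blue_breadth": sum(1 for b in items if b["blue"] > 0),
        }
    return summary


summary = summarize(books)
for discipline, measures in summary.items():
    print(discipline, measures)
```

With measures of this kind in hand, the comparison across disciplines in the second step and the within-discipline breadth comparison in the third step reduce to comparing rows of `summary`.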