https://nlp.lab.arizona.edu/sites/nlp.lab.arizona.edu/files/Kauchak-Leroy-Hogue-JASIST-2017.pdf
In previous work, we conducted a preliminary corpus study of grammar frequency which showed that difficult texts use a wider variety of high-level grammatical structures (Kauchak et al., 2012). However, because of the large number of structural variations possible, no clear indication was found showing specific structures predominantly appearing in either easy or difficult documents.
In this work, we propose a much more fine-grained analysis. We propose a measure of text difficulty based on grammatical frequency and show how it can be used to identify sentences with difficult syntactic structures. In particular, the grammatical difficulty of a sentence is measured based on the frequency of occurrence of the top-level parse tree structure of the sentence in a large corpus.
Building on term familiarity, the authors introduce the notion of grammar familiarity:
Grammar familiarity is measured as the frequency of the 3rd level sentence parse tree
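A minimal sketch of the measure, assuming parses are available as bracketed strings (handled here with nltk.Tree; the paper used the Berkeley Parser) and that a corpus-wide frequency table is built separately. Exactly how levels are counted (e.g. whether ROOT is level 1) is an assumption here, not taken from the paper.

```python
from nltk import Tree

def third_level_structure(parse_str):
    """Truncate a bracketed parse to its top 3 levels (labels only, no words)."""
    def truncate(node, depth):
        if not isinstance(node, Tree):
            return None                      # drop lexical items (the actual words)
        if depth == 3:
            return node.label()              # keep the label, discard everything below level 3
        children = [c for c in (truncate(ch, depth + 1) for ch in node) if c is not None]
        return "(%s %s)" % (node.label(), " ".join(children)) if children else node.label()
    return truncate(Tree.fromstring(parse_str), 1)

parse = "(ROOT (S (NP (PRP He)) (VP (VBD ran)) (. .)))"
print(third_level_structure(parse))   # -> (ROOT (S NP VP .))
# Grammar familiarity of a sentence = corpus frequency of this truncated structure.
```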
Experiment:
The Wikipedia sentences were split into 11 bins according to their 3rd level parse tree structures, and the frequency of each bin was computed;
then 20 sentences were randomly sampled from each bin (controlling away the influence of sentence length and term familiarity), and about thirty participants were recruited to rate each sentence (5-point scale) and complete a Cloze test;
the result: even when performance showed little visible difference, sentences whose 3rd level parse tree occurred more frequently were rated as easier and took less time.
Hypothesis:
examine how grammatical frequency impacts the difficulty of a sentence and introduce a new measure of sentence-level text difficulty based on the grammatical structure of the sentence.
Sentence difficulty is divided into perceived and actual difficulty.
Our work here makes a step towards better simplification tools by 1) introducing a sentence-level, data-driven approach for measuring the grammatical difficulty of a sentence and 2) specifically measuring the impact of this measure using both how difficult a sentence looks (perceived difficulty) as well as how difficult a sentence is to understand (actual difficulty).
Function words and verbs are more prevalent in simpler texts:
simple texts use simpler words, fewer overall words and words that are more general (Coster & Kauchak, 2011; Napoles & Dredze, 2010; Zhu, Bernhard, & Gurevych, 2010). Certain types of words have also been found to be more prevalent in simpler texts including function words and verbs (Kauchak, Leroy, & Coster, 2012; Leroy & Endicott, 2011).
The Role of Syntax in Simplification
The syntax or grammar of a language dictates how words and phrases interact to form sentences.
Splitting long sentences has been shown to improve Cloze scores (Kandula, Curtis, & Zeng-Treitler, 2010), and additive and causal connectors (expressing reason or purpose) were easier to fill in than adversative or sequential connectors (expressing contrast or temporal order) (Goldman & Murray, 1992). It has been suggested that grammatical difficulty is particularly important for L2 learners since they are still trying to learn appropriate grammatical structures for the language (Callan & Eskenazi, 2007; Clahsen & Felser, 2006).
(Note: on logical connectors, see https://staff.washington.edu/marynell/grammar/logicalconnectors.html)
Simpler texts use a higher proportion of verbs, function words, and adverbs, while difficult texts use a higher proportion of adjectives and nouns and longer noun phrases; in medical texts, simpler texts are more likely to use the active voice; subject-verb-object versus object-subject-verb ordering also differs.
Some initial success has been achieved by automated simplification systems that perform syntactic transformations, e.g. reducing prepositional phrases and infinitives or changing verb tense.
How the parse tree structure was chosen:
We chose to focus on the 3rd level since it represents a compromise between generality and specificity.
45% of sentences in the corpus (2.47M) have unique 4th level parse tree structures, often because the 4th level regularly includes lexical components. Once actual words are included, a structure hardly generalizes to other sentences.
To remove anomalous data and likely misparses, we ignored any structure that had only been seen once among the 5.4 million sentences. After filtering, this results in 139,969 unique 3rd level structures.
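A sketch of the counting and filtering step, assuming the third_level_structure helper above and an iterable of parse strings; the names and corpus handling are illustrative, not the paper's code.

```python
from collections import Counter

def build_structure_table(parse_strings):
    """Count 3rd level structures over a corpus and drop singletons (likely misparses)."""
    counts = Counter(third_level_structure(p) for p in parse_strings)
    return Counter({s: c for s, c in counts.items() if c > 1})   # keep structures seen at least twice

# structure_counts = build_structure_table(wikipedia_parses)   # hypothetical corpus iterable
```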
Table 1:
pairs of sentences that share the same 3rd level structure; the structures vary in frequency and are ordered from most frequent to least. Because we focus on the high-level structure, the length of sentences with the same structure can also vary widely.
Figure 2:
grammatical frequency follows a Zipf-like distribution, with the most common structures occurring very frequently and many structures occurring infrequently.
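A quick way to eyeball the Zipf-like shape is a log-log rank-frequency plot; the snippet below uses synthetic 1/rank counts as a stand-in so it runs on its own, but structure_counts from the sketch above would normally supply the frequencies.

```python
import numpy as np
import matplotlib.pyplot as plt

ranks = np.arange(1, 10001)
freqs = np.round(1e6 / ranks)          # idealized Zipf placeholder: frequency proportional to 1/rank
# freqs = sorted(structure_counts.values(), reverse=True)   # real data would go here

plt.loglog(ranks[:len(freqs)], freqs)
plt.xlabel("structure rank")
plt.ylabel("frequency")
plt.title("Rank-frequency of 3rd level structures (roughly linear if Zipf-like)")
plt.show()
```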
This approach for measuring the grammatical difficulty of text represents a generalized and data-driven approach that goes beyond specific, theory-based grammatical components of text difficulty (e.g. active vs. passive voice, self-embedded clauses, etc. (Meyer & Rice, 1984)) and provides a generic framework for measuring grammatical difficulty.
Evaluation:
To minimize confounding factors that might influence sentence difficulty, we control for sentence length and term familiarity.
1) We ranked the 139,939 unique 3rd level structures and divided them into 11 frequency bins: the first bin covers the most frequent structures making up roughly the first 1% of the total frequency mass, and each subsequent bin covers roughly another 10% (a sketch of this binning follows the list below).
2)Each of the 5.4 million Wikipedia sentences can be mapped to one of the 11 frequency bins and we selected a subset of these for our study.
3) Only sentences near the average length were kept: assuming sentence length is roughly uniformly distributed, the longest sixth and the shortest sixth were removed, leaving two thirds of the sentences.
4) 20 sentences were randomly sampled per bin, to examine grammar frequency against sentence length and term familiarity:
Sentence length: the remaining sentences were split into 3 length tiers; of the 20 sentences chosen per bin, 10 came from the longest tier and 10 from the shortest.
Term familiarity: using the Google Web corpus, the mean familiarity of the words in each sentence was computed; the sentences were split into 3 familiarity tiers, and of the 20 per bin, 10 came from the highest-familiarity tier and 10 from the lowest.
This process resulted in a sample of 220 sentences in 11 frequency bins, with each bin containing 5 long sentences with high familiarity, 5 long with low familiarity, 5 short with high familiarity, and 5 short with low familiarity.
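A rough sketch of the cumulative-frequency binning described in step 1 above; the 1%/10% boundaries follow the note rather than a verified spec from the paper, and the length/term-familiarity sampling is omitted.

```python
import bisect
from collections import Counter

def build_bins(structure_counts, first_cut=0.01, step=0.10):
    """Assign each structure to one of 11 bins by cumulative frequency mass:
    bin 0 covers the most frequent structures up to ~1% of the total mass,
    each later bin covers roughly another 10% (boundaries assumed)."""
    total = sum(structure_counts.values())
    boundaries = [first_cut + step * i for i in range(11)]        # 0.01, 0.11, ..., 1.01
    bins, cumulative = {}, 0.0
    for structure, count in structure_counts.most_common():       # most frequent first
        cumulative += count / total
        bins[structure] = min(bisect.bisect_left(boundaries, cumulative), 10)
    return bins

# Toy usage (the real input would be the ~140K filtered structures):
toy = Counter({"(ROOT (S NP VP .))": 900, "(ROOT (S PP , NP VP .))": 80, "(ROOT (SQ VBZ NP VP .))": 20})
print(build_bins(toy))
```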
For each of the 220 sentences, we recruited 30 participants for a total of N=6,600 samples. To ensure the quality and accuracy of the data, participants were restricted to be within the United States and to have a previous approval rating of 95%.
Crowdsourcing: MTurk is a crowdsourcing tool where requesters can upload tasks to be accomplished by a set of workers for a fee.
Results:
A paired-samples t–test showed our two control variables to be effective, with length significantly different between short and long sentences (t(10) = -60.47, p < 0.001) and word frequency significantly different between the high and low group (t(10) = -38.47, p < 0.001).
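For reference, a paired-samples t-test of this kind can be run with scipy; the per-bin numbers below are placeholders, not the study's data.

```python
import numpy as np
from scipy import stats

# Placeholder per-bin average sentence lengths (11 bins), NOT the study's values.
short_group = np.array([8.1, 7.9, 8.3, 8.0, 8.2, 7.8, 8.1, 8.0, 7.9, 8.2, 8.1])
long_group  = np.array([18.4, 18.9, 18.2, 18.7, 18.5, 18.8, 18.3, 18.6, 18.9, 18.4, 18.7])

t, p = stats.ttest_rel(short_group, long_group)   # paired across the 11 frequency bins
print(f"t({len(short_group) - 1}) = {t:.2f}, p = {p:.4g}")
```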
1) Actual difficulty: To measure actual difficulty (first dependent variable) we used a Cloze test. The basic Cloze test involves replacing every nth word in a text with a blank. Participants are then asked to fill in the blanks and are scored based on how many of their answers matched the original text (Taylor, 1953).
We employed a multiple-choice Cloze test. For each sentence, four nouns were randomly selected and replaced with blanks. For each sentence, we created five multiple-choice options containing the four removed words in different random orders, one of which is the correct ordering.
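A sketch of how such multiple-choice Cloze items could be generated; noun detection here uses NLTK's POS tagger as a stand-in, and details like distinct-noun handling are assumptions beyond what the paper describes.

```python
import itertools
import random
import nltk   # assumes the punkt and averaged_perceptron_tagger resources are downloaded

def make_cloze_item(sentence, n_blanks=4, n_options=5, seed=0):
    """Blank out n_blanks random nouns and build multiple-choice orderings of the removed words."""
    rng = random.Random(seed)
    tokens = nltk.word_tokenize(sentence)
    nouns = [i for i, (_, tag) in enumerate(nltk.pos_tag(tokens)) if tag.startswith("NN")]
    blanks = sorted(rng.sample(nouns, n_blanks))          # assumes >= n_blanks distinct nouns
    answer = tuple(tokens[i] for i in blanks)             # the removed words in their correct order
    for i in blanks:
        tokens[i] = "_____"
    distractors = [p for p in itertools.permutations(answer) if p != answer]
    options = [answer] + rng.sample(distractors, n_options - 1)
    rng.shuffle(options)
    return " ".join(tokens), options, answer
```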
2)To measure perceived difficulty (second dependent variable), participants were asked to rate the sentences on a 5-point Likert scale with higher numbers representing more difficult sentences.
Each condition (11 x 2 x 2) had 5 sentences, and for each sentence we gathered 30 responses, resulting in a dataset of N=6,600.
An ANOVA shows these differences to be significant: F(10,6556) = 3.453, p < 0.001 for grammar frequency and sentence length, and F(10,6556) = 1.870, p = 0.044 for grammar frequency and term familiarity. In addition, the interaction between all three variables is also significant (F(10, 6556) = 4.650, p < 0.001).
(Note: ANOVA stands for Analysis of Variance.)
1. An independent-samples t-test generally only asks whether two groups of data differ and how significant that difference is, e.g. comparing the height or weight of two groups of people; the two groups are independent and unrelated, and the test only checks for a statistically significant difference between them.
2. One-way ANOVA examines whether the different levels of a single factor have a significant effect on the observed variable, i.e. it tests how significantly changes in x affect y, so it is used where some influence relation between the variables is assumed; ANOVA is generally run on matched (paired) data.
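The factorial design reported above (11 frequency bins x 2 lengths x 2 familiarity levels, with interactions) is more naturally expressed with statsmodels than with a one-way ANOVA; a sketch on synthetic placeholder data, since the paper's exact model specification is not reproduced here.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Synthetic data shaped like the study's design, NOT the real responses.
rng = np.random.default_rng(0)
n = 6600
df = pd.DataFrame({
    "freq_bin":    rng.integers(0, 11, n),
    "length":      rng.choice(["short", "long"], n),
    "familiarity": rng.choice(["low", "high"], n),
    "correct":     rng.random(n),                 # stand-in for Cloze accuracy
})

model = ols("correct ~ C(freq_bin) * C(length) * C(familiarity)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))            # main effects and interaction terms
```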
Less frequent structures are harder (one-tailed Pearson correlation coefficient):
To complete this analysis and understand the strength of the effect on actual difficulty, we calculated a one-tailed Pearson correlation coefficient between the grammar frequency and the actual difficulty (percentage correct) for both the raw scores and scores aggregated by frequency bin. There was a negative correlation between grammar frequency and the actual difficulty of the sentence (raw scores: N = 6,600, r = -0.053, p < 0.01; bin averages: N = 11, r = -0.596, p < 0.05) indicating that sentences that used less frequent structures were harder to understand.
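The same kind of correlation can be computed with scipy.stats.pearsonr, which returns a two-tailed p-value (halving it gives the one-tailed value when the sign matches the hypothesis); the per-bin numbers below are placeholders, not the paper's data.

```python
import numpy as np
from scipy import stats

bins           = np.arange(11)                    # placeholder ordinal grammar-frequency bins
cloze_accuracy = np.array([0.82, 0.81, 0.80, 0.80, 0.79, 0.78, 0.78, 0.77, 0.76, 0.76, 0.75])

r, p_two = stats.pearsonr(bins, cloze_accuracy)
p_one = p_two / 2 if r < 0 else 1 - p_two / 2     # one-tailed, hypothesizing a negative correlation
print(f"r = {r:.3f}, one-tailed p = {p_one:.4g}")
```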
At moderate structure frequencies, accuracy on long and short sentences is about the same:
In contrast to actual difficulty, we also find a main effect of the sentence length on perceived difficulty with longer sentences seen as more difficult (averaged 2.2) than the shorter sentences (averaged 2.0). Surprisingly, there was no effect of the average term frequency on perceived difficulty.
The effect of grammar frequency on perceived difficulty is smaller in shorter sentences and those with lower term frequency
Both high and low frequency sentences show a jump in difficulty, though it occurs earlier (bin 7) for low frequency sentences than for high frequency sentences (bin 8)
we found a significant correlation between how well readers performed on the Cloze test and how difficult they thought a sentence was. Lower accuracy correlated with higher difficulty scores (N = 11, r = -0.574, p < 0.05; N = 6600, r = -0.203, p < 0.01)
Actual and perceived difficulty as measured in our user study for the 220 sentences binned by the Flesch-Kincaid grade level:
Even though the 220 sentences differ very little in difficulty according to the FK formula, the differences in perceived difficulty are substantial.
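For reference, the Flesch-Kincaid grade level is 0.39 * (words per sentence) + 11.8 * (syllables per word) - 15.59; a sketch for a single sentence using a crude vowel-group syllable heuristic (real implementations typically use a pronunciation dictionary such as CMUdict).

```python
import re

def count_syllables(word):
    """Very rough heuristic: count vowel groups, at least one per word."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade_level(sentence):
    words = re.findall(r"[A-Za-z']+", sentence)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) + 11.8 * (syllables / len(words)) - 15.59   # one sentence

print(round(fk_grade_level("The committee postponed the negotiations indefinitely."), 1))
```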
GRAMMAR FAMILIARITY AS AN ANALYSIS TOOL:
Corpus:
Each of the texts was tokenized and split into sentences using the Stanford CoreNLP toolkit and then parsed using the Berkeley Parser (the same preprocessing as for the frequency bins).
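The original pipeline (Stanford CoreNLP for tokenization and sentence splitting plus the Berkeley Parser) can be approximated today with benepar, the Berkeley Neural Parser, as a spaCy component; this is a modern stand-in, not the toolchain the authors used.

```python
import spacy
import benepar   # pip install benepar; assumes the benepar_en3 model has been downloaded

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("benepar", config={"model": "benepar_en3"})

doc = nlp("Grammar familiarity is measured from the parse tree. Frequent structures read as easier.")
for sent in doc.sents:                 # spaCy handles tokenization and sentence splitting
    print(sent._.parse_string)         # bracketed constituency parse, usable with the sketches above
```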
Summary:
1. The paper lays out the mismatch between actual readability and perceived difficulty in existing research.
2. It demonstrates the correlation between 3rd level parse tree frequency and both actual and perceived difficulty, and the effectiveness of the measure.
3. In short sentences, actual difficulty is barely affected by grammar, because shorter sentences are easy to understand and any effect of grammar is difficult to detect (ceiling effect).
Similarly, in sentences with low term familiarity (i.e. more difficult words) the grammar familiarity doesn’t impact text difficulty since users are struggling with the lexical difficulty
However, in sentences with very familiar terms, which are easier to understand, grammar frequency does have an impact on actual difficulty; only in sentences where the words are more familiar does the grammatical frequency have a strong effect. Interestingly, there was very little impact overall of term frequency on actual difficulty.
Based on these observations, we hypothesize that there is a relation between grammatical frequency and term frequency. Future studies are required to fully validate these hypotheses. Our study has limitations. Text comprehension was measured with individual sentences rather than complete texts.
Original post: https://www.cnblogs.com/rosyYY/p/10523069.html