[Audio] Hello everyone, I am Sen Yang. I am here to present the paper "Challenges to Open-Domain Constituency Parsing". This is collaborative work with Leyang Cui, Ruoxi Ning, Di Wu and Yue Zhang.
[Audio] For the past few decades, constituency parsers have been trained and tested on the Penn Treebank, whose raw texts come from the Wall Street Journal and thus belong to the newswire domain.
[Audio] With the power of neural parsers and large pretrained language models, recent parsers have achieved a performance of over 95% F1 score on the PTB.
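For reference, the F1 here is the standard labeled bracketing F1 over constituent spans. A minimal per-sentence sketch, with the caveat that real evaluation tools such as EVALB aggregate match counts over the whole corpus and handle duplicate spans more carefully:

def bracket_f1(gold, pred):
    # gold, pred: sets of (label, start, end) constituent spans for one sentence
    matched = len(gold & pred)
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0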
[Audio] Thus, recent research has turned to investigating the cross-domain generalization of neural constituency parsers. However, these findings were made on a rather limited number of domains, i.e., the biomedical, web text and literature domains.
[Audio] It remains an interesting research question to understand the performance of constituency parsing on a wider range of domains and text genres, in order to understand the boundaries of existing techniques and identify the main challenges for robust open-domain constituency parsing. We thus evaluate three strong parsers on twenty-three test sets. These test sets and their data statistics are shown on the slide. We also manually label constituency structures for five typical domains, resulting in a test set of one thousand sentences for each domain. The three parsers include the non-neural BLLIP parser, the transition-based in-order parser and the chart-based Berkeley self-attentive parser.
[Audio] We aim to answer three research questions. First, do these parsers generalize well in the open domain? Second, what are the relative strengths of each parser? These three parsers differ in multiple aspects, such as neural versus non-neural and chart-based versus transition-based. Third, we want to find out the challenges for cross-domain constituency parsing.
[Audio] Let us see the first question.
[Audio] As we can see from the table, even for the strongest Berkeley neural parser, cross-domain performance can drop to 84.34% F1, a relative error increase of more than 200%. Among the domains, the review and Switchboard domains are the most difficult, while the law domain is the easiest. We assume this is because the law domain mainly consists of formal English, which is the closest to the newswire domain.
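As a rough illustration of how the relative error increase is computed, assume a hypothetical in-domain F1 of 95.7% on the PTB (the exact score varies by parser):

# Hypothetical numbers for illustration: 95.7% F1 in domain, 84.34% out of domain.
in_domain_err = 100.0 - 95.7      # 4.3% bracketing error
out_domain_err = 100.0 - 84.34    # 15.66% bracketing error
rel_increase = (out_domain_err - in_domain_err) / in_domain_err
print(f"{rel_increase:.0%}")      # ~264%, i.e. more than 200%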
[Audio] Now let us turn to the second question.
[Audio] Previous work investigated this question on three domains. Here we give a more fine-grained analysis. As shown in the table, BLLIP gives a cross-domain performance drop similar to that of the Berkeley parser, which shows that a discrete parser does not necessarily have weaker cross-domain robustness than a neural parser. Without BERT, both neural parsers give rather similar performances across domains. After augmenting with BERT, the in-order parser outperforms the chart-based parser by a small margin on most test sets.
[Audio] Now let us see the third and most interesting question: what are the challenges for cross-domain constituency parsing?
[Audio] To answer this question, we first need to identify the differences between the PTB training set and each test set. We report these differences in the table, adopting a list of linguistic features from previous work. Each cell in the table gives the Jensen-Shannon divergence between the distribution of a specific feature in the PTB training set and the corresponding distribution in a specific test set. Each value ranges from 0 to 1, and a higher value reflects a larger distributional difference on that feature between the PTB training set and the corresponding test set. The linguistic features include n-gram tokens, n-gram constituents, grammar rules, head-lexicalized grammar rules and grandparent rules.
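A minimal sketch of how such a divergence can be computed, assuming token n-gram features and relative-frequency distributions (the helper names here are our own, not from the paper):

import math
from collections import Counter

def ngram_counts(sentences, n=2):
    # sentences: list of token lists; count token n-grams
    counts = Counter()
    for toks in sentences:
        for i in range(len(toks) - n + 1):
            counts[tuple(toks[i:i + n])] += 1
    return counts

def js_divergence(p_counts, q_counts):
    # Jensen-Shannon divergence with log base 2, so values lie in [0, 1]
    p_total, q_total = sum(p_counts.values()), sum(q_counts.values())
    jsd = 0.0
    for item in set(p_counts) | set(q_counts):
        p = p_counts[item] / p_total
        q = q_counts[item] / q_total
        m = (p + q) / 2
        if p > 0:
            jsd += 0.5 * p * math.log2(p / m)
        if q > 0:
            jsd += 0.5 * q * math.log2(q / m)
    return jsd

# e.g. js_divergence(ngram_counts(ptb_train_sents), ngram_counts(test_sents))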
[Audio] After that, we calculate the Pearson correlation between parser performance and feature JS-divergence for all five parsers. In the figures, each column shows the Pearson correlation of a specific parser with a specific feature, where a longer bar reflects stronger reliance on the feature. We make the following observations.
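Concretely, for each parser-feature pair this boils down to one correlation computed over the test sets; a sketch with purely hypothetical numbers:

from scipy.stats import pearsonr

# Hypothetical values for illustration: parser F1 on each test set, and the
# JS divergence of one feature between PTB training data and that test set.
f1_by_domain = {"law": 92.1, "dialogue": 85.0, "review": 84.3, "literature": 88.7}
jsd_by_domain = {"law": 0.42, "dialogue": 0.61, "review": 0.65, "literature": 0.55}

domains = sorted(f1_by_domain)
r, p_value = pearsonr([jsd_by_domain[d] for d in domains],
                      [f1_by_domain[d] for d in domains])
# A strongly negative r indicates reliance on the feature: performance drops
# as the feature distribution diverges from the training data.
print(r)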
[Audio] First, overall, all the parsers are more influenced by larger grammatical structures, shown in green, such as grammar rules (GR), grandparent rules (GP) and n-gram sub-constituents, while they are less influenced by the features shown in blue, such as word-level n-gram features and simple constituent label features. This shows that the cross-domain challenge arises mostly from more complex structural variations, rather than from cross-domain word and n-gram distribution differences.
[Audio] Second, the non-neural parser shows strong reliance on lexical features but is less sensitive to syntactic patterns. This suggests that the strong representational power of neural models allows them to learn more about syntactic structures.
[Audio] Third, after being augmented with BERT, the neural parsers show stronger dependence on uni-gram and bi-gram constituent features and weaker dependence on tri-gram and four-gram constituent features.