Line 487: |
Line 487: |
| == Evaluation == | | == Evaluation == |
| | | |
− | The accuracy of a sentiment analysis system is, in principle, how well it agrees with human judgments. This is usually measured by variant measures based on [[precision and recall]] over the two target categories of negative and positive texts. However, according to research human raters typically only agree about 80%<ref> | + | The accuracy of a sentiment analysis system is, in principle, how well it agrees with human judgments. This is usually measured by variant measures based on [[precision and recall]] over the two target categories of negative and positive texts. However, according to research human raters typically only agree about 80%<ref name=":26"> |
| {{cite news | | {{cite news |
| | last = Ogneva | first = M. | | | last = Ogneva | first = M. |
Line 493: |
Line 493: |
| | url=http://mashable.com/2010/04/19/sentiment-analysis/ | publisher = Mashable | | | url=http://mashable.com/2010/04/19/sentiment-analysis/ | publisher = Mashable |
| |access-date=2012-12-13}} | | |access-date=2012-12-13}} |
− | </ref> of the time (see [[Inter-rater reliability]]). Thus, a program that achieves 70% accuracy in classifying sentiment is doing nearly as well as humans, even though such accuracy may not sound impressive. If a program were "right" 100% of the time, humans would still disagree with it about 20% of the time, since they disagree that much about ''any'' answer.<ref> | + | </ref> of the time (see [[Inter-rater reliability]]). Thus, a program that achieves 70% accuracy in classifying sentiment is doing nearly as well as humans, even though such accuracy may not sound impressive. If a program were "right" 100% of the time, humans would still disagree with it about 20% of the time, since they disagree that much about ''any'' answer.<ref name=":27"> |
| {{cite book | | {{cite book |
| | last = Roebuck | first = K. | | | last = Roebuck | first = K. |
Line 502: |
Line 502: |
| </ref> | | </ref> |
| | | |
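− | In principle, the accuracy of a sentiment analysis system is the degree to which it agrees with human judgments. It is usually measured with variants of precision and recall over the two target categories of negative and positive texts. However, according to existing research, human raters typically agree with one another only about 80% of the time (see Inter-rater reliability). Thus, a sentiment classification program that reaches 70% accuracy is performing almost as well as human annotators, even though such a figure may not sound impressive. Note also that, because humans themselves may disagree substantially about any sentiment label, even if a program were 100% accurate, humans would still disagree with its judgments about 20% of the time. | + | In principle, the accuracy of a sentiment analysis system is the degree to which it agrees with human judgments. It is usually measured with variants of precision and recall over the two target categories of negative and positive texts. However, according to existing research, human raters typically agree with one another only about 80%<ref name=":26" /> of the time (see Inter-rater reliability). Thus, a sentiment classification program that reaches 70% accuracy is performing almost as well as human annotators, even though such a figure may not sound impressive. Note also that, because humans themselves may disagree substantially about any sentiment label, even if a program were 100% accurate, humans would still disagree with its judgments about 20% of the time.<ref name=":27" /> |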
| | | |
| | | |
− | On the other hand, computer systems will make very different errors than human assessors, and thus the figures are not entirely comparable. For instance, a computer system will have trouble with negations, exaggerations, [[joke]]s, or sarcasm, which typically are easy to handle for a human reader: some errors a computer system makes will seem overly naive to a human. In general, the utility for practical commercial tasks of sentiment analysis as it is defined in academic research has been called into question, mostly since the simple one-dimensional model of sentiment from negative to positive yields rather little actionable information for a client worrying about the effect of public discourse on e.g. brand or corporate reputation.<ref> | + | |
| + | |
| + | On the other hand, computer systems will make very different errors than human assessors, and thus the figures are not entirely comparable. For instance, a computer system will have trouble with negations, exaggerations, [[joke]]s, or sarcasm, which typically are easy to handle for a human reader: some errors a computer system makes will seem overly naive to a human. In general, the utility for practical commercial tasks of sentiment analysis as it is defined in academic research has been called into question, mostly since the simple one-dimensional model of sentiment from negative to positive yields rather little actionable information for a client worrying about the effect of public discourse on e.g. brand or corporate reputation.<ref name=":28"> |
| [[Jussi Karlgren|Karlgren, Jussi]], [[Magnus Sahlgren]], Fredrik Olsson, Fredrik Espinoza, and Ola Hamfors. "Usefulness of sentiment analysis." In European Conference on Information Retrieval, pp. 426-435. Springer Berlin Heidelberg, 2012. | | [[Jussi Karlgren|Karlgren, Jussi]], [[Magnus Sahlgren]], Fredrik Olsson, Fredrik Espinoza, and Ola Hamfors. "Usefulness of sentiment analysis." In European Conference on Information Retrieval, pp. 426-435. Springer Berlin Heidelberg, 2012. |
− | </ref><ref> | + | </ref><ref name=":29"> |
| [[Jussi Karlgren|Karlgren, Jussi]]. "The relation between author mood and affect to sentiment in text and text genre." In Proceedings of the fourth workshop on Exploiting semantic annotations in information retrieval, pp. 9-10. ACM, 2011. | | [[Jussi Karlgren|Karlgren, Jussi]]. "The relation between author mood and affect to sentiment in text and text genre." In Proceedings of the fourth workshop on Exploiting semantic annotations in information retrieval, pp. 9-10. ACM, 2011. |
− | </ref><ref> | + | </ref><ref name=":30"> |
| [[Jussi Karlgren|Karlgren, Jussi]]. "[http://www.diva-portal.org/smash/get/diva2:1042636/FULLTEXT01.pdf Affect, appeal, and sentiment as factors influencing interaction with multimedia information]." In Proceedings of Theseus/ImageCLEF workshop on visual information retrieval evaluation, pp. 8-11. 2009. | | [[Jussi Karlgren|Karlgren, Jussi]]. "[http://www.diva-portal.org/smash/get/diva2:1042636/FULLTEXT01.pdf Affect, appeal, and sentiment as factors influencing interaction with multimedia information]." In Proceedings of Theseus/ImageCLEF workshop on visual information retrieval evaluation, pp. 8-11. 2009. |
| </ref> | | </ref> |
| | | |
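− | On the other hand, computer systems make very different errors from human assessors, so these figures are not entirely comparable. For example, computer systems have trouble with negation, exaggeration, jokes, or sarcasm, which human readers usually handle easily: some of the errors a computer system makes seem overly naive to a human. In general, the usefulness of sentiment analysis as defined in academic research for practical commercial tasks has been questioned, mainly because the simple one-dimensional model of sentiment from negative to positive yields little actionable information for a client worried about the effect of public discourse on, e.g., brand or corporate reputation. Karlgren, Jussi, Magnus Sahlgren, Fredrik Olsson, Fredrik Espinoza, and Ola Hamfors. "Usefulness of sentiment analysis." In European Conference on Information Retrieval, pp. 426-435. Springer Berlin Heidelberg, 2012. Karlgren, Jussi. "The relation between author mood and affect to sentiment in text and text genre." In Proceedings of the fourth workshop on Exploiting semantic annotations in information retrieval, pp. 9-10. ACM, 2011. Karlgren, Jussi. "Affect, appeal, and sentiment as factors influencing interaction with multimedia information." In Proceedings of Theseus/ImageCLEF workshop on visual information retrieval evaluation, pp. 8-11. 2009.
| + | On the other hand, computer systems make very different errors from human raters, so the figures are not entirely comparable. For example, computer systems have difficulty with negations, exaggerations, jokes, and sarcasm, which human readers usually handle with ease; that is, some of the errors a computer system makes will seem overly naive to a human. Overall, the usefulness of sentiment analysis as defined in academic research for practical commercial tasks has been called into question, mainly because the simple one-dimensional model of sentiment from negative to positive offers little actionable information to a client worried about the effect of public discourse on brand or corporate reputation.<ref name=":28" /><ref name=":29" /><ref name=":30" /> |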
| + | |
| | | |
− | To better fit market needs, evaluation of sentiment analysis has moved to more task-based measures, formulated together with representatives from PR agencies and market research professionals. The focus in e.g. the RepLab evaluation data set is less on the content of the text under consideration and more on the effect of the text in question on [[brand image|brand reputation]].<ref> | + | To better fit market needs, evaluation of sentiment analysis has moved to more task-based measures, formulated together with representatives from PR agencies and market research professionals. The focus in e.g. the RepLab evaluation data set is less on the content of the text under consideration and more on the effect of the text in question on [[brand image|brand reputation]].<ref name=":31"> |
| Amigó, Enrique, Adolfo Corujo, Julio Gonzalo, Edgar Meij, and [[Maarten de Rijke]]. "Overview of RepLab 2012: Evaluating Online Reputation Management Systems." In CLEF (Online Working Notes/Labs/Workshop). 2012. | | Amigó, Enrique, Adolfo Corujo, Julio Gonzalo, Edgar Meij, and [[Maarten de Rijke]]. "Overview of RepLab 2012: Evaluating Online Reputation Management Systems." In CLEF (Online Working Notes/Labs/Workshop). 2012. |
− | </ref><ref> | + | </ref><ref name=":32"> |
| Amigó, Enrique, Jorge Carrillo De Albornoz, Irina Chugur, Adolfo Corujo, Julio Gonzalo, Tamara Martín, Edgar Meij, [[Maarten de Rijke]], and Damiano Spina. "Overview of replab 2013: Evaluating online reputation monitoring systems." In International Conference of the Cross-Language Evaluation Forum for European Languages, pp. 333-352. Springer Berlin Heidelberg, 2013. | | Amigó, Enrique, Jorge Carrillo De Albornoz, Irina Chugur, Adolfo Corujo, Julio Gonzalo, Tamara Martín, Edgar Meij, [[Maarten de Rijke]], and Damiano Spina. "Overview of replab 2013: Evaluating online reputation monitoring systems." In International Conference of the Cross-Language Evaluation Forum for European Languages, pp. 333-352. Springer Berlin Heidelberg, 2013. |
| </ref><ref name="replab2014"> | | </ref><ref name="replab2014"> |
Line 524: |
Line 527: |
| | | |
| | | |
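− | To better fit market needs, the evaluation of sentiment analysis has shifted toward more task-based measures, formulated together with representatives of PR agencies and market research professionals. The focus in, e.g., the RepLab evaluation data set is less on the content of the text and more on the effect of the text on brand reputation. Amigó, Enrique, Adolfo Corujo, Julio Gonzalo, Edgar Meij, and Maarten de Rijke. "Overview of RepLab 2012: Evaluating Online Reputation Management Systems." In CLEF (Online Working Notes/Labs/Workshop). 2012. Amigó, Enrique, Jorge Carrillo De Albornoz, Irina Chugur, Adolfo Corujo, Julio Gonzalo, Tamara Martín, Edgar Meij, Maarten de Rijke, and Damiano Spina. "Overview of replab 2013: Evaluating online reputation monitoring systems." In International Conference of the Cross-Language Evaluation Forum for European Languages, pp. 333-352. Springer Berlin Heidelberg, 2013. Amigó, Enrique, Jorge Carrillo-de-Albornoz, Irina Chugur, Adolfo Corujo, Julio Gonzalo, Edgar Meij, Maarten de Rijke, and Damiano Spina. "Overview of replab 2014: author profiling and reputation dimensions for online reputation management." In International Conference of the Cross-Language Evaluation Forum for European Languages, pp. 307-322. Springer International Publishing, 2014.
| + | To better fit market needs, the evaluation of sentiment analysis has shifted toward more task-based measures, developed together with representatives of PR agencies and market research professionals. For example, the RepLab evaluation data set focuses less on the content of the text and more on the text's effect on brand reputation.<ref name=":31" /><ref name=":32" /><ref name="replab2014" /> |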
| | | |
| Because evaluation of sentiment analysis is becoming increasingly task-based, each implementation needs a separate training model to get a more accurate representation of sentiment for a given data set. | | Because evaluation of sentiment analysis is becoming increasingly task-based, each implementation needs a separate training model to get a more accurate representation of sentiment for a given data set. |
| | | |
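− | Because the evaluation of sentiment analysis is increasingly task-based, each implementation needs a separate training model to represent the sentiment of a given data set more accurately.
| + | Because the evaluation of sentiment analysis is increasingly based on specific tasks, each classifier needs its own trained model to identify the sentiment expressed in a given data set more accurately. |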
| | | |
| == Web 2.0 == | | == Web 2.0 == |