Changes

614 bytes added, 15:46, 9 August 2021 (Mon)
However, researchers have recognized several challenges in developing fixed sets of rules for classifying expressions. Much of the challenge in rule development stems from the nature of textual information. Six challenges have been recognized by several researchers: 1) metaphorical expressions, 2) discrepancies in writing, 3) context sensitivity, 4) time sensitivity, 5) words with fewer usages, and 6) ever-growing volume.
# Metaphorical expressions. Text containing metaphorical expressions may impact the performance of extraction.<ref name=":13">{{Cite journal|last1=Wiebe|first1=Janyce|last2=Riloff|first2=Ellen|date=July 2011|title=Finding Mutual Benefit between Subjectivity Analysis and Information Extraction|url=https://ieeexplore.ieee.org/document/5959154|journal=IEEE Transactions on Affective Computing|volume=2|issue=4|pages=175–191|doi=10.1109/T-AFFC.2011.19|s2cid=16820846|issn=1949-3045}}</ref> Besides, metaphors take different forms, which may increase the difficulty of detection.
 
# Discrepancies in writings. For text obtained from the Internet, discrepancies in the writing style of the targeted text data involve distinct writing genres and styles.
 
# Context-sensitive. Classification may vary based on the subjectiveness or objectiveness of previous and following sentences.<ref name=":1">{{Cite journal|last1=Pang|first1=Bo|last2=Lee|first2=Lillian|date=2008-07-06|title=Opinion Mining and Sentiment Analysis|url=https://www.nowpublishers.com/article/Details/INR-011|journal=Foundations and Trends in Information Retrieval|language=en|volume=2|issue=1–2|pages=1–135|doi=10.1561/1500000011|issn=1554-0669}}</ref>
# Time-sensitive. The task is challenged by the time-sensitive attribute of some textual data. If a group of researchers wants to confirm a fact in the news, they may need more time for cross-validation than it takes for the news to become outdated.
# Words with fewer usages. Key cue words may be used only rarely.
# Ever-growing volume. The task is also challenged by the sheer volume of textual data, whose ever-growing nature makes it difficult for researchers to complete the task on time.
    
Previously, the research mainly focused on document-level classification. However, classification at the document level is less accurate, as an article may contain diverse types of expressions. Research evidence shows that a set of news articles expected to be dominated by objective expression in fact contained over 40% subjective expression.<ref name="Wiebe 2005 486–497"/>
# Variation in comprehension. During manual annotation, annotators may disagree on subjective or objective instances because of the ambiguity of language.
# Human error. Manual annotation is a meticulous task that requires intense concentration to complete.
# Time-consuming. Manual annotation is assiduous work. Riloff (1996) showed that 160 texts took one annotator 8 hours to finish.<ref>{{Cite journal|last=Riloff|first=Ellen|date=1996-08-01|title=An empirical study of automated dictionary construction for information extraction in three domains|url=https://dx.doi.org/10.1016%2F0004-3702%2895%2900123-9|journal=Artificial Intelligence|language=en|volume=85|issue=1|pages=101–134|doi=10.1016/0004-3702(95)00123-9|issn=0004-3702|doi-access=free}}</ref>
 
All these mentioned reasons can impact the efficiency and effectiveness of subjective and objective classification. Accordingly, two bootstrapping methods were designed to learn linguistic patterns from unannotated text data. Both methods start with a handful of seed words and unannotated textual data.
# Meta-Bootstrapping by Riloff and Jones in 1999. Level One: Generate extraction patterns based on predefined rules, and rank the patterns by the number of seed words each pattern contains. Level Two: The top 5 words are marked and added to the dictionary. Repeat.
# Basilisk (<u>B</u>ootstrapping <u>A</u>pproach to <u>S</u>emantIc <u>L</u>exicon <u>I</u>nduction using <u>S</u>emantic <u>K</u>nowledge) by Thelen and Riloff.<ref>{{Cite journal|last1=Thelen|first1=Michael|last2=Riloff|first2=Ellen|date=2002-07-06|title=A bootstrapping method for learning semantic lexicons using extraction pattern contexts|journal=Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10|series=EMNLP '02|volume=10|location=USA|publisher=Association for Computational Linguistics|pages=214–221|doi=10.3115/1118693.1118721|s2cid=137155|doi-access=free}}</ref> Step One: Generate extraction patterns. Step Two: Move the best patterns from the Pattern Pool to the Candidate Word Pool. Step Three: The top 10 words are marked and added to the dictionary. Repeat.
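The bootstrapping loops above share the same skeleton: generate patterns, keep the patterns that best match the current lexicon, and promote the top candidate words. The following is a minimal sketch of a Basilisk-style loop; the toy corpus, the one-word "patterns", and the scoring are illustrative stand-ins, not the statistics used by Thelen and Riloff:

```python
# Simplified sketch of a Basilisk-style bootstrapping loop.
# The corpus, extraction patterns, and scoring below are toy
# stand-ins; the real algorithm scores patterns and candidate
# words with corpus statistics (Thelen & Riloff, 2002).

def generate_patterns(corpus):
    """Step One: generate candidate extraction patterns.
    Here a 'pattern' is simply the word preceding a target word."""
    patterns = {}
    for sentence in corpus:
        words = sentence.split()
        for prev, word in zip(words, words[1:]):
            patterns.setdefault(prev, set()).add(word)
    return patterns

def basilisk(corpus, seeds, iterations=2, top_k=2):
    lexicon = set(seeds)
    for _ in range(iterations):
        patterns = generate_patterns(corpus)              # Step One
        # Step Two: score each pattern by how many of its extracted
        # words are already in the lexicon; the best pattern's
        # extractions form the candidate word pool.
        best_pattern = max(patterns,
                           key=lambda p: len(patterns[p] & lexicon))
        candidates = sorted(patterns[best_pattern] - lexicon)
        # Step Three: mark the top candidates, add them to the
        # dictionary, then repeat.
        lexicon.update(candidates[:top_k])
    return lexicon

corpus = [
    "reviewers praised the camera",
    "reviewers praised the screen",
    "users liked the battery",
]
print(sorted(basilisk(corpus, seeds={"camera"})))
# -> ['battery', 'camera', 'screen']
```

Starting from the single seed word "camera", the shared pattern ("the &lt;word&gt;") pulls "screen" and "battery" into the lexicon, which is the essence of both methods.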
* Email analysis: the subjective and objective classifier detects spam by tracing the language patterns of target words.
=== Feature/aspect-based ===
 
It refers to determining the opinions or sentiments expressed on different features or aspects of entities, e.g., of a cell phone, a digital camera, or a bank.<ref name="HuLiu04">{{cite conference
  | first1 = Minqing | last1 = Hu
  | url = http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
}}
</ref> A feature or aspect is an attribute or component of an entity, e.g., the screen of a cell phone, the service for a restaurant, or the picture quality of a camera. The advantage of feature-based sentiment analysis is the possibility of capturing nuances about objects of interest. Different features can generate different sentiment responses; for example, a hotel can have a convenient location but mediocre food.<ref name=":14">{{Cite journal|title = Good location, terrible food: detecting feature sentiment in user-generated reviews|journal = Social Network Analysis and Mining|date = 2013-06-22|issn = 1869-5450|pages = 1149–1163|volume = 3|issue = 4|doi = 10.1007/s13278-013-0119-7|first1 = Mario|last1 = Cataldi|first2 = Andrea|last2 = Ballatore|first3 = Ilaria|last3 = Tiddi|first4 = Marie-Aude|last4 = Aufaure|citeseerx = 10.1.1.396.9313|s2cid = 5025282}}</ref> This problem involves several sub-problems, e.g., identifying relevant entities, extracting their features/aspects, and determining whether an opinion expressed on each feature/aspect is positive, negative, or neutral.<ref name="LiuHuCheng04">{{cite conference
  | first1 = Bing | last1 = Liu
  | first2 = Minqing | last2 = Hu | first3 = Junsheng | last3 = Cheng
  | url = http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
}}
</ref> The automatic identification of features can be performed with syntactic methods, with [[topic model]]ing,<ref name=":15">{{Cite book|title = Constrained LDA for Grouping Product Features in Opinion Mining|publisher = Springer Berlin Heidelberg|date = 2011-01-01|isbn = 978-3-642-20840-9|pages = 448–459|series = Lecture Notes in Computer Science|doi = 10.1007/978-3-642-20841-6_37|first1 = Zhongwu|last1 = Zhai|first2 = Bing|last2 = Liu|first3 = Hua|last3 = Xu|first4 = Peifa|last4 = Jia|editor-first = Joshua Zhexue|editor-last = Huang|editor-first2 = Longbing|editor-last2 = Cao|editor-first3 = Jaideep|editor-last3 = Srivastava|citeseerx = 10.1.1.221.5178}}</ref><ref name=":16">{{Cite book|title = Modeling Online Reviews with Multi-grain Topic Models|publisher = ACM|journal = Proceedings of the 17th International Conference on World Wide Web|date = 2008-01-01|location = New York, NY, USA|isbn = 978-1-60558-085-2|pages = 111–120|series = WWW '08|doi = 10.1145/1367497.1367513|first1 = Ivan|last1 = Titov|first2 = Ryan|last2 = McDonald|arxiv = 0801.1063|s2cid = 13609860}}</ref> or with [[deep learning]].<ref name="Poria">{{cite journal
  | first = Soujanya | last = Poria | display-authors=etal
  | title = Aspect extraction for opinion mining with a deep convolutional neural network
  | url = http://www.cs.uic.edu/~liub/FBS/NLP-handbook-sentiment-analysis.pdf
}}
</ref> A more detailed discussion of sentiment analysis at this level can be found in the NLP handbook chapter "Sentiment Analysis and Subjectivity".<ref name="Liu2010" />
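As a rough illustration of the sub-problems just described (finding aspect terms and deciding the polarity expressed on each), the sketch below pairs known aspect words with the nearest opinion word. The aspect list and opinion lexicon are toy assumptions; real systems identify aspects with the syntactic, topic-modeling, or deep-learning methods mentioned above:

```python
# Toy sketch of feature/aspect-based sentiment analysis: locate
# known aspect terms and assign each the polarity of the nearest
# opinion word. The aspect set and opinion lexicon are illustrative
# assumptions, not a published resource.
ASPECTS = {"location", "food", "screen", "battery"}
OPINIONS = {"convenient": "positive", "great": "positive",
            "mediocre": "negative", "terrible": "negative"}

def aspect_sentiment(review):
    tokens = [t.strip(",.") for t in review.lower().split()]
    result = {}
    for i, tok in enumerate(tokens):
        if tok in ASPECTS:
            # The nearest opinion word decides this aspect's polarity.
            nearest = min(
                (j for j, t in enumerate(tokens) if t in OPINIONS),
                key=lambda j: abs(j - i),
                default=None,
            )
            result[tok] = (OPINIONS[tokens[nearest]]
                           if nearest is not None else "neutral")
    return result

print(aspect_sentiment("A convenient location, but mediocre food."))
# -> {'location': 'positive', 'food': 'negative'}
```

Note how the hotel example from the text yields opposite polarities for its two aspects, which document-level classification would collapse into a single score.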
      
== Methods and features ==
Existing approaches to sentiment analysis can be grouped into three main categories: knowledge-based techniques, statistical methods, and hybrid approaches. Knowledge-based techniques classify text by affect categories based on the presence of unambiguous affect words such as happy, sad, afraid, and bored. Some knowledge bases not only list obvious affect words but also assign arbitrary words a probable "affinity" to particular emotions. Statistical methods leverage elements from machine learning such as latent semantic analysis, support vector machines, "bag of words", "pointwise mutual information" for semantic orientation, and deep learning. More sophisticated methods try to detect the holder of a sentiment (i.e., the person who maintains that affective state) and the target (i.e., the entity about which the affect is felt). To mine the opinion in context and get the speaker's opinion, the grammatical relationships of words are used. Grammatical dependency relations are obtained by deep parsing of the text. Hybrid approaches leverage both machine learning and elements from knowledge representation such as ontologies and semantic networks, in order to detect semantics expressed in a subtle manner, e.g., through the analysis of concepts that do not explicitly convey relevant information but are implicitly linked to other concepts that do.
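A knowledge-based technique of the kind described above can be sketched as a simple lexicon lookup; the affect lexicon here is a toy illustration rather than any published knowledge base:

```python
# Minimal sketch of a knowledge-based (lexicon) affect classifier.
# The affect lexicon is a toy illustration; real knowledge bases
# also assign arbitrary words a probable "affinity" to particular
# emotions rather than listing only unambiguous affect words.
from collections import Counter

AFFECT_LEXICON = {
    "happy": "joy", "delighted": "joy",
    "sad": "sadness", "miserable": "sadness",
    "afraid": "fear", "terrified": "fear",
    "bored": "boredom",
}

def classify_affect(text):
    """Count unambiguous affect words and return the dominant affect
    category, or None if no affect word occurs in the text."""
    counts = Counter(
        AFFECT_LEXICON[token]
        for token in text.lower().split()
        if token in AFFECT_LEXICON
    )
    if not counts:
        return None
    return counts.most_common(1)[0][0]

print(classify_affect("I was happy and delighted but a little sad"))
# -> joy
```

Statistical and hybrid approaches replace this fixed lookup with learned models, but the classification target (an affect category per text) is the same.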
Open source software tools as well as a range of free and paid sentiment analysis tools deploy [[machine learning]], statistics, and natural language processing techniques to automate sentiment analysis on large collections of texts, including web pages, online news, internet discussion groups, online reviews, web blogs, and social media.<ref name="AkcoraBayirDemirbasFerhatosmanoglu2010">
 
{{cite conference
 
| first1 = Cuneyt Gurcan | last1 = Akcora | first2 = Murat Ali | last2 = Bayir | first3 = Murat | last3 = Demirbas | first4 = Hakan | last4 = Ferhatosmanoglu