Changes

Data science (view source)

Revision as of 12:48, 15 May 2020 (Fri)

1,223 bytes removed, 12:48, 15 May 2020 (Fri)

→ Technologies and techniques involved

Line 218: Line 218:

=== Techniques ===

−

~~====Clustering====~~

*[[聚类分析|Clustering]] is a technique for grouping data together.
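
As one illustration of grouping data, the following is a minimal sketch of k-means clustering on one-dimensional values. The function name `kmeans_1d`, the sample points, and the starting centroids are assumptions made for this example and do not come from the article; k-means is only one of many clustering methods.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Group 1-D points around the nearest centroid (hypothetical helper)."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Illustrative data: two visibly separate groups of values.
points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
final_centroids, final_clusters = kmeans_1d(points, centroids=[0.0, 5.0])
print(final_centroids)
print(final_clusters)
```

The starting centroids are passed in explicitly so the run is deterministic; practical implementations usually choose them randomly or with a seeding heuristic.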

Line 233: Line 232: −

* [[Dimensionality reduction]] is used to reduce the complexity of data computation so that it can be performed more quickly.

+

*[[降维|Dimensionality reduction]] is used to reduce the complexity of data computation so that it can be performed more quickly.

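Dimensionality reduction lowers a model's computational cost and running time, lessens the influence of noisy variables on model results, makes the reduced dimensions easier to present visually, and saves storage space. For these reasons, high-dimensional data usually calls for dimensionality reduction.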

+

There are two approaches to dimensionality reduction: feature selection and dimension transformation.

+

# Feature selection

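Feature selection means choosing, by rules and experience, a subset of the original dimensions to take part in computation and modeling. The selected features stand in for the full set; the original features are left unchanged and no new feature values are created.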

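The advantage of feature selection is that it reduces dimensionality while keeping the original features, so it meets downstream processing and modeling needs and preserves each dimension's original business meaning, which helps business understanding and application. For business-analysis applications, a model's interpretability and usability often take priority over technical metrics such as accuracy and efficiency. For example, the feature rules produced by a decision tree can serve as the base conditions for selecting user samples, and those rules are derived from the input dimensions.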
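
A minimal sketch of feature selection, assuming a simple variance rule as the selection criterion: columns whose values barely vary carry little information and are dropped, while the kept columns retain their original names and values. The helper `select_features`, the column names, and the sample rows are hypothetical and only illustrate the idea; the article does not prescribe a particular rule.

```python
from statistics import pvariance


def select_features(rows, names, min_variance):
    """Keep only columns whose population variance exceeds min_variance.

    Hypothetical helper: the kept columns are copied unchanged, so no new
    feature values are created.
    """
    columns = list(zip(*rows))  # transpose rows into columns
    keep = [i for i, col in enumerate(columns) if pvariance(col) > min_variance]
    kept_names = [names[i] for i in keep]
    reduced = [[row[i] for i in keep] for row in rows]
    return kept_names, reduced


# Illustrative data: the third column is constant and adds no information.
names = ["age", "visits", "constant_flag"]
rows = [
    [25, 3, 1],
    [32, 10, 1],
    [47, 1, 1],
    [51, 7, 1],
]
kept_names, reduced = select_features(rows, names, min_variance=0.0)
print(kept_names)
print(reduced)
```

Because the surviving columns keep their original names, a downstream rule such as "visits greater than 5" can still be read in business terms.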

+

# Dimension transformation

Line 249: Line 251: −

* [[Machine learning]] is a technique used to perform tasks by inferencing patterns from data.

*[[机器学习|Machine learning]] is a technique for performing tasks by inferring patterns from data.

Line 255: Line 256:

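It studies how computers can simulate or carry out human learning behavior in order to acquire new knowledge or skills and to reorganize existing knowledge structures so that their performance keeps improving. It is the core of artificial intelligence and the fundamental way to make computers intelligent. Machine learning also uses data or past experience to optimize a computer program's performance criteria.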
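
To make "inferring patterns from data" concrete, here is a minimal sketch that fits a straight line to a few observations by ordinary least squares and then predicts an unseen value. The function `fit_line` and the sample data are assumptions for illustration; linear regression is only one simple instance of machine learning.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Illustrative "past experience": points that follow y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept  # estimate for an input not seen before
print(slope, intercept, prediction)
```

The program is never told the rule behind the data; it recovers the slope and intercept from the examples alone and reuses them for a new input.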

−

~~--[[用户:趣木木|趣木木]] ([[用户讨论:趣木木|talk]]) Translate the meaning first, then add some supplementary material~~

−

=== ~~Technologies~~ ===

+

=== Technologies ===

−

Technologies

+

* [[Python(编程语言)|Python]] is a programming language with simple syntax that is widely used in data science. Many Python libraries are used in data science, including numpy, pandas, and scipy.<ref>{{Cite web|url=https://sites.engineering.ucsb.edu/~shell/che210d/python.pdf|title=An introduction to Python for scientific computing|last=Shell|first=M Scott|date=September 24, 2019|website=|url-status=live|archive-url=|archive-date=|access-date=April 2, 2020}}</ref>

−

* [[Python (programming language)|Python]] is a programming language with simple syntax that is commonly used for data science.<ref>{{Cite web|url=https://sites.engineering.ucsb.edu/~shell/che210d/python.pdf|title=An introduction to Python for scientific computing|last=Shell|first=M Scott|date=September 24, 2019|website=|url-status=live|archive-url=|archive-date=|access-date=April 2, 2020}}</ref> There are a number of python libraries that are used in data science including numpy, pandas, and scipy.

−

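* [[Python(编程语言)|Python]] is a programming language with simple syntax that is widely used in data science. Many Python libraries are used in data science, including numpy, pandas, and scipy.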

−

+

*[[R(程序设计语言)|R]] is a programming language designed for statisticians and data mining,<ref>{{Cite web|url=https://cran.r-project.org/doc/FAQ/R-FAQ.html#What-is-R_003f|title=R FAQ|website=cran.r-project.org|access-date=2020-04-03}}</ref> and is optimized for computation.

−

* [[R (~~programming language~~)|R]] ~~is a programming language that was designed for statisticians and data mining~~<ref>{{Cite web|url=https://cran.r-project.org/doc/FAQ/R-FAQ.html#What-is-R_003f|title=R FAQ|website=cran.r-project.org|access-date=2020-04-03}}</ref> ~~and is optimized for computation.~~

−

*[[R(程序设计语言)|R]] is a programming language designed for statisticians and data mining, and is optimized for computation.

Line 277: Line 272: −

−

* [[TensorFlow]] is a framework for creating machine learning models developed by Google.

*[[TensorFlow]] is a framework developed by Google for creating machine learning models.

Line 287: Line 280: −

−

* [[Pytorch]] is another framework for machine learning developed by Facebook.

*[[Pytorch]] is another machine learning framework, developed by Facebook.

Line 298: Line 289: −

−

* [[Jupyter Notebook]] is an interactive web interface for Python that allows faster experimentation.

*[[Jupyter Notebook]] is an interactive web interface for Python that allows faster experimentation.

Line 306: Line 295: −

+

*[[Tableau软件|Tableau]] makes a variety of software used for data visualization.<ref>{{Cite journal|url=https://www.wired.com/2014/07/a-drag-and-drop-toolkit-that-lets-anyone-create-interactive-maps/|journal=Wired|access-date=2020-04-03|title=A Dead-Simple Tool That Lets Anyone Create Interactive Maps|date=15 July 2014|last1=Rhodes|first1=Margaret}}</ref>

−

* [[~~Tableau Software~~|Tableau]] ~~makes a variety of software that is used for data visualization~~<ref>{{Cite journal|url=https://www.wired.com/2014/07/a-drag-and-drop-toolkit-that-lets-anyone-create-interactive-maps/|journal=Wired|access-date=2020-04-03|title=A Dead-Simple Tool That Lets Anyone Create Interactive Maps|date=15 July 2014|last1=Rhodes|first1=Margaret}}</ref>.

−

*[[Tableau软件|Tableau]] makes a variety of software used for data visualization.

Line 314: Line 301: −

−

* [[Apache Hadoop]] is a software framework that is used to process data over large distributed systems.

*[[Apache Hadoop]] is a software framework used to process data over large distributed systems.

乐多多 (763 edits)