遗传算法

来自集智百科 - 复杂系统|人工智能|复杂科学|复杂网络|自组织
Qige96讨论 | 贡献2021年3月20日 (六) 18:06的版本
跳到导航 跳到搜索
Dna.jpg

Genetic algorithms are based on the classic view of a chromosome as a string of genes. R.A. Fisher used this view to found mathematical genetics, providing mathematical formula specifying the rate at which particular genes would spread through a population (Fisher, 1958). Key elements of Fisher’s formulation are:

遗传算法的基础是“染色体是一串基因”这一经典观点。R.A. Fisher 使用这种观点建立了mathematical genetics 数理遗传学,提供了数学公式来说明特定基因在整个种群中的扩散速率(Fisher,1958)。 费舍尔的工作可以总结为以下关键要点:

  • a specified set of alternatives (alleles) for each gene, thereby specifying the allowable strings of genes (the possible chromosomes),
  • a generation-by-generation view of evolution where, at each stage, a population of individuals produces a set of offspring that constitutes the next generation,
  • a fitness function that assigns to each string of alleles the number of offspring the individual carrying that chromosome will contribute to the next generation, and
  • a set of genetic operators, particularly mutation in Fisher’s formulation, that modify the offspring of an individual so that the next generation differs from the current generation.
  • 染色体每个位置的一组可以相互替代的基因(等位基因),从而指定了所有可能的基因串(染色体)
  • 一代又一代的进化观,在每个阶段,一群个体生产一组后代个体,构成下一代,
  • 一个适应度函数,计算携带某个染色体的个体能贡献的后代数量,以及
  • 一组遗传算子,特别是Fisher的形式化中的突变算子,可以改变个体的后代,从而使下一代不同于当前一代。


Natural selection, in this formulation, can be thought of as a procedure for searching through the set of possible individuals, the search space, to find individuals of progressively higher fitness. Even when a natural population consists of a single species – say the current generation of humans – there is considerable variation within that population. These variants constitute samples of the search space.

在这种表述中,自然选择可以被认为是一种搜索方法,用以搜索可能个体集合(即搜索空间),以找到具有更高适应度的个体。即使当自然种群仅由一个物种组成时(例如,当前这一代人类),该种群中也存在相当大的差异。 这些变体就构成搜索空间的所有样本。


定义

A genetic algorithm (GA) is a generalized, computer-executable version of Fisher’s formulation (Holland J, 1995). The generalizations consist of:

遗传算法(GA)是Fisher公式的一种泛化后的计算机程序版本,泛化的要点包括:

  • concern with interaction of genes on a chromosome, rather than assuming alleles act independently of each other, and
  • enlargement of the set of genetic operators to include other well-known genetic operators such as crossing-over (recombination) and inversion.
  • 关注染色体上基因的交互,而非假设等位基因相互独立起作用,以及
  • 一套更广泛的遗传算子,包括了交叉 crossing-over(重组 recombination)和到位 inversion等著名的遗传算子。


Under the first generalization, the fitness function becomes a complex nonlinear function that usually cannot be usefully approximated by summing up the effects of individual genes. The second generalization puts emphasis on genetic mechanisms, such as crossover, that operate regularly on chromosomes. Crossover takes place in every mating whereas mutation of a given gene typically occurs in less than 1 in a million individuals. Crossover is the central reason that mammals produce offspring exhibiting a mix of their parents’ characteristics – consider the human population for a vivid, familiar example – and crossover makes possible the artificial cross-breeding of selected animals and plants to produce superior varieties. Clearly crossover’s high frequency gives it an important role in evolution, and it has an essential role in the operation of GAs, as outlined below.

上述第一种泛化使得适应度函数变成了一个复杂的非线性函数,不能再简单地用各个基因效应的加和来近似。第二种泛化则强调了像交叉这种作用于染色体的常见遗传算子。交叉操作再每一次染色体配对中都会出现,而变异操作则在每一百万个个体中才可能出现一次。交叉是哺乳动物能产生混合了亲代特征的子代的核心原因。人类群体就是最生动的例证。交叉还使得使用人工杂交育种来创造出具有优良特征的变种动植物成为可能。很显然,交叉的高频出现使得它成为演化中的一个重要角色,以及遗传算法中的一个关键步骤。


Crossover in its simplest form (one-point crossover) is easily defined: Two chromosomes are lined up, a point along the chromosome is randomly selected, then the pieces to the left of that point are exchanged between the chromosomes, producing a pair of offspring chromosomes.

交叉算子的最简单形式(单点交叉)可以被很容易的定义:两条染色体对齐排列好,随机选择染色体上的一点,两条染色体交换从该点起直到末尾的这一段,从而生成了两个后代染色体。

CrossOver.png

This simple version of crossover is a good approximation to what actually takes places in mating organisms. Under one-point crossover, alleles that are close together on the chromosome are likely to be passed as a unit to one of the offspring, while alleles that are far apart are likely to be separated by crossover with one allele appearing in one offspring and the other allele appearing in the other offspring. For example, given a chromosome with 1001 genes, there is only one chance in a thousand that an adjacent pair of alleles will be separated in the offspring, while alleles at opposite ends of the string will always be separated. In standard genetic terminology, this phenomenon is called linkage. By observing the frequency of separation of allele-determined characteristics in successive generations, linkage can be determined experimentally. Indeed, assuming one-point crossover made gene sequencing possible long before we had any knowledge of DNA – in 1944, using experimentally determined linkage, the Carnegie Institution of Washington published a large volume recording the order of genes in the fruit fly (Lindsley & Grell, 1944).

这种简单版本的交叉操作能很好地近似在实际的配中出现的交叉现象。在单点交叉的作用下,z唉一条染色体中,相距较近的等位基因可能会被分配到同一个后代中,而相距较远的等位基因则更可能被分割开来,分配到不同的后代中。比如,在一条有1001个等位基因的染色体中,只有1/1001的概率使得一堆临近的等位基因被分隔开,而在染色体头尾两端的基因则总是会被分割开来。在标准的遗传学术语中,这种现象叫做 连锁 linkage。通过观察由等位基因决定的性状在后代中被分隔的频率,连锁现象就可以被确定。事实上,假设单点交叉就能让基因测序变得可能,即便我们没有任何关于DNA的知识——在1944年,使用实验方法来确定连锁效应,华盛顿卡内基研究所 Carnegie Institution of Washington 就发表了一大本记录了果蝇基因序列的手册。


The genetic algorithm, following Fisher’s formulation, uses the differing fitness of variants in the current generation to propagate improvements to the next generation, but the GA places strong emphasis on the variants produced by crossover. The basic GA subroutine, which produces the next generation from the current one, executes the following steps (where, for simplicity, it is assumed that each individual is specified by a single chromosome):

在费舍尔的工作基础上,遗传算法使用了不同的适应度。但是遗传算法着重强调了由交叉算子产生的变异。基本的遗传算法程序通过下列步骤来根据当代群体产生后代群体(为简单期间,假设每一个个体都只有一个染色体):


  1. Start with a population of [math]\displaystyle{ N }[/math] individual strings of alleles (perhaps generated at random).
  2. Select two individuals at random from the current population, biasing selection toward individuals with higher fitness.
  3. Use crossover (with occasional use of mutation) to produce two new individuals which are assigned to the next generation.
  4. Repeat steps (1) and (2) [math]\displaystyle{ N/2 }[/math] times to produce a new generation of N individuals.
  5. Return to step (1) to produce the next generation.
  1. 从有[math]\displaystyle{ N }[/math]个等位基因串的群体开始(最初可能是随机产生的)
  2. 随机从当代群体中选择两个个体,有更高适应度的个体更可能被选中
  3. 使用交叉算子(偶尔用一下变异算子)来产生两个新的个体,这两个新个体会参与组成下一代群体
  4. 重复步骤(2)和(3)[math]\displaystyle{ N/2 }[/math]次,产生新一代N个个体。
  5. 回到步骤(1),根据新一代群体来继续产生后代

Of course, there are many ways to modify these steps, but most characteristics of a GA are already exhibited by this basic program.

当然,可以有很多种修该这些步骤的方法,但这即基本的流程已经展示了遗传算法的基本性质。

交叉

Crossover introduces a considerable complication in the study of successive generations: Highly effective individuals in the parent generation (say, an “Einstein”), will not reappear in the next generation, because only a subset of the parent’s alleles is passed on to any given offspring. This raises an important question: What is passed from generation to generation if the parents’ specific chromosomes are never passed on? An answer to this question requires a prediction of the generation-by-generation spread of clusters of alleles, requiring a substantial generalization of Fisher’s fundamental theorem. The schema theorem is one such generalization; it deals with arbitrary allele clusters call schemas. A schema is specified using the symbol * (‘don’t care’) to specify places along the chromosome not belonging to the cluster. For example, if there are two distinct alleles for each position, call them 1 and 0, then the cluster consisting of allele 1 at position 2, allele 0 at position 4, and allele 0 at position 5, is designated by the string *1*00**...*. Let N(s,t) be the number of chromosomes carrying schema s in generation t. The schema theorem specifies the (expected) number [math]\displaystyle{ N(s,t+1) }[/math] of chromosomes carrying schema [math]\displaystyle{ s }[/math] in the next generation. A simplified version has the following form:

交叉算子在后续世代地研究中引入了大量地复杂性:在亲代中及其有效的个体(比如爱因斯坦)在后代中不会出现,因为该个体中地等位基因被分割开来,传到了不同地后代个体中。这就引发了一个重要的问题:如果不是亲代的染色体,那到底是什么在一代又一代中遗传着呢?回答这个问题,需要需要预测在一代又一代中扩散的等位基因团,还需要对费舍尔的基本理论做进一步的扩展和泛化。模式定理 schema theorem 就是这么一个泛化。模式定理把任意一个等位基因团成为“模式”。一个模式使用符号 * 来表示染色体中不属于该模式的位置。例如,如果一个位置中由两个不同的等位基因,分别为1和0,那么位置2为1,位置4为0,位置5为0的基因团就被表示成符号串*1*00**...*。用 N(s,t) 表示在第t代中携带模式s的染色体数目模式定理揭示了携带模式s的染色体在下一代中预期的数目[math]\displaystyle{ N(s,t+1) }[/math]。下面是一个简化后的版本:


[math]\displaystyle{ N(s,t+1) = u(s,t)[1-e]N(s,t) }[/math]


where [math]\displaystyle{ u(s,t) }[/math] is the average fitness of the chromosomes carrying schema [math]\displaystyle{ s }[/math] at time [math]\displaystyle{ t }[/math] (the observed average fitness), and [math]\displaystyle{ e }[/math] is the overall probability (usually quite small) that the cluster s will be destroyed (or created) by mutation or crossover. (Note that [math]\displaystyle{ e }[/math] does become large if the alleles in the cluster are spread over a considerable part of the chromosome). This formula for [math]\displaystyle{ N(s,t+1) }[/math] can be restated in terms of probabilities (proportions) [math]\displaystyle{ P(s,t)\ , }[/math] a more typical form for mathematical genetics, by noting that [math]\displaystyle{ P(s,t) = N(s,t)/N(t)\ , }[/math] were [math]\displaystyle{ N(t) }[/math] is the size of the population at time [math]\displaystyle{ t\ . }[/math]

其中 [math]\displaystyle{ u(s,t) }[/math] 指在第[math]\displaystyle{ t }[/math]代中携带模式[math]\displaystyle{ s }[/math]的染色体的平均适应度, [math]\displaystyle{ e }[/math] 指模式s会被变异算子和交叉算子破坏的概率(通常会很小,但如果模式中的等位基因在染色体中分布很广,即模式的长度很大,则[math]\displaystyle{ e }[/math]会很大)。上述公式可以被重写为一个概率值形式 [math]\displaystyle{ P(s,t)\ , }[/math],这是在数理遗传学中更加典型的形式:[math]\displaystyle{ P(s,t) = N(s,t)/N(t)\ , }[/math] ,其中[math]\displaystyle{ N(t) }[/math]是群体在第[math]\displaystyle{ t\ . }[/math]代的规模。


The schema theorem shows that every cluster of alleles present in the population increases or decreases its proportion in the population at a rate determined by the observed average fitness of its carriers. In particular, schemas consistently associated with strings of above-average fitness spread rapidly through the population. As a result, crossover regularly tries out new combinations of these above-average schemas; they become building blocks for further attempts at solution. Though a GA only directly processes [math]\displaystyle{ N(t) }[/math] strings in a generation, it effectively processes the much larger number of schemata carried in the population. For example, the number of schemas on a single string of length 40 is [math]\displaystyle{ 2 * exp(40) \sim 16,000\ , }[/math] so a GA processing a few dozen strings of length 40 effectively processes thousands of schemas. For problems in which schemas capture regularities in the search space, this is a tremendous speedup.

模式定理说明了,群体中每一个等位基因团的比例都会提高活降低,期速率由它所能带来的平均适应度决定。特别地,能带来更高适应度的模式,会在群体中快速传播开来。结果就是,交叉算子会使用这些高于平均适应度的模式,尝试各种新的组合。这些模式成为了构建解的组件块 building blocks。尽管遗传算法每一代只处理[math]\displaystyle{ N(t) }[/math]个符号串,它能有效处理在群体中承载的多得多的模式。比如,一个长度为40的符号串中包含的模式的数量为[math]\displaystyle{ 2 * exp(40) \sim 16,000\ , }[/math],因此遗传算法只要处理几十个长度为40的符号串,就能处理成千上万个模式。这是一个及其出色的速度。


使用

Genetic algorithms are routinely used to find good solutions for problems that do not yield to standard techniques such as gradient ascent (“hill-climbing”) or additive approximations (“the whole equals the sum of the parts”) (Mitchell, 2009). Some typical problems on which the GA has been used are control of pipelines, jet engine design, scheduling, protein folding, machine learning, modeling language acquisition and evolution, and modeling complex adaptive systems (such as markets and ecosystems). To use the GA, the search space must be represented as strings over some fixed alphabet, much as biological chromosomes are represented by strings over 4 nucleotides. The strings can represent anything from biological organisms, or rules for processing signals, to agents in a complex adaptive system. The GA is initialized with a population of these strings, which may be simply selected at random from the search space, or the initial population may be “salted” with strings picked out using prior knowledge of the problem. The GA then processes the strings, and successive generations uncover schemas of above-average observed fitness. When above-average, well-linked schemas are regularly associated with improvements, the GA rapidly exploits those improvements.

遗传算法经常被用于为那些没有标准解法(如梯度上升、相加近似)的问题寻找一个好的解。一些已经应用了遗传算法的典型问题包括控制流设计、飞机引擎设计、规划、蛋白质折叠、机器学习、对语言的获取和演化建模、以及对复杂适应系统(例如市场和生态系统)建模等。要使用遗传算法,搜索空间必须被表示成用一个固定符号集生成的一组符号序列,就像生物的染色体只由四种碱基构成一样。这些符号串可以表示任何事物,从生物体,到信号处理规则,到复杂实应系统中的智能体等。遗传算法要初始化一个由这些符号串组成的群体,这可以随机的方法生成,也可以用一些对于问题的先验知识来生成。然后,遗传算法就会处理这些符号串,后续的世代就会显示出一些模式,这些模式能是个体拥有高于平均值的适应度。当高于平均、优良连接的模式频繁地出现时,遗传算法就会迅速发现并用来优化。


References

  • R.A., Fisher (1958). The Genetical Theory of Natural Selection Oxford University Press : .
  • D.L.(1944). Genetic Variations of Drosophila Melanogaster Carnegie Institution of Washington. : .