5
  1. Can a CFG over a large alphabet describe the language in which each terminal appears an even number of times?

  2. If yes, would the Chomsky Normal Form be polynomial in |Σ|?

EDIT: What about a language where each terminal appears 0 or 2 times?

EDIT: Solving this may give an interesting extension to "Restricted read twice BDDs and context free grammars"

Find a context-free grammar (CFG) for the language L of all words, over a possibly large alphabet Σ, in which each terminal occurs an even number of times.

My long approach is (the only nonterminal is S):

S ⟶ ε | SS

x ∈ Σ : S ⟶ xSx

x,y ∈ Σ : S ⟶ xxSyy | yySxx | xySxy | xySyx | yxSyx | yxSxy

Second try, incremental approach.

S ⟶ ε | SS
Terminal productions (suboptimal): x,y ∈ Σ : S ⟶ xx | yy | xxyy | yyxx | xyxy | xyyx | yxyx | yxxy
Incremental: x ∈ Σ : S ⟶ SxSxS

By the way, the homework is over now.

Technically it wasn't homework assigned to me. I don't know the answer and thought the question was easy. Best.

EDIT: I didn't award the bounty :-)

  • 0
    Please specify the desired language precisely. I have a different idea that uses only regular rules. How would you build an automaton that recognizes the language?2010-12-14
  • 0
    Each terminal in a word must appear an even number of times. I wouldn't try regular expressions because palindromes are in the language. I need this for a graph problem.2010-12-14
  • 1
    `abcabc` isn't in your language, I think!2010-12-14
  • 0
    "abcabc isn't in your language, I think!" According to a parser it is...2010-12-14
  • 3
    Whoops! Have I messed up? The proof outline I had in mind went as follows: none of the second and third productions matches up (as `a` != `c` and `ab` uses different letters than `bc`) so the starting production would have to be `S -> SS`. But each `S` will eventually produce each terminal an even number of times. Chopping `abcabc` up as `ab`/`cabc` or `abca`/`bc` won't work either, for obvious reasons. What sequence of productions did your parser find?2010-12-14
  • 0
    @User4750: could you sketch the parse sequence for that? I also don't see it.2010-12-14
  • 0
    @all: Thank you! It may be a parser bug (java cfgto) - if someone is interested I can upload the CNF somewhere in about 14 hours.2010-12-14
  • 0
    Please do, I'm getting curious by now! You may also want to look at the third set of productions, I think alternatives 1 & 2 and 3 & 5 & 6 are the same. What you want for this grammar is `S -> xxSyy | xySxy | xySyx` since you can swap `x` and `y` around. (I suppose you meant to put `yxSxy` in there somewhere :)2010-12-14
  • 1
    Sorry, I can't make much sense out of that... For the alphabet `{a,b,c}` you need the following grammar, if you follow Carl's construction: `S -> eps | aT | bU | cV [], T -> aS | bW | cX [a], U -> aW | bS | cY [b], V -> aX | bY | cS [c], W -> aU | bT | cZ [ab], X -> aV | bZ | cT [ac], Y -> aZ | bV | cU [bc], Z -> aY | bX | cW [abc]` In brackets behind each non-terminal is the set of terminals that has occurred an odd number of times.2010-12-15
  • 0
    Instead of joking around with a parser, you might as well prove correctness.2010-12-15
  • 0
    I will bet my construction is wrong. What I dislike in Carl's construction is being exponential :-( Thank you all!2010-12-15
  • 1
    Alas! Your second try is also not working, again with `abcabc` as counterexample. The second production can't be used since there are no substrings matching those patterns. For the third, x is either `a` but then S needs to produce `bc` twice *independently!* , `b` but then S needs to produce `a` which can't be done, or `c` which has the same problem as `a`. My gut feeling is it can't be done, but I can't back that up...2010-12-15
  • 1
    Of course, if you are just looking for a parser you can as well use full power (TM/RM) and recognize the language in time $\mathcal{O}(n)$ with $k$ bits in memory.2010-12-16
  • 0
    Can someone recommend reading, with examples for dummies, on what poly-size CFGs can describe over a large alphabet?2010-12-16
  • 0
    What about a language where each terminal appears 0 or 2 times?2010-12-16
  • 0
    For 0 or 2 times, probably using a pumping lemma (with bounds depending on the size of the grammar) would show that the grammar needs to be extremely verbose.2010-12-20
  • 0
    @yatima2975 I will appreciate help in quality assurance of positive answers for the bounty.2010-12-21
  • 0
    If someone knows better forum he/she or I may ask there. I am surprised the way I earn reputation points here. I give email on my profile. Help on improving the quality of the question is appreciated.2010-12-22
  • 0
    @jerr18: I will look at Yuval's proof more closely, but it's Christmas as well :) OTOH, Christmas is boring and my brain could use some stimulation...2010-12-26
  • 0
    @yatima2975 Merry Christmas Everyone!2010-12-26
  • 0
    _Technically_ it wasn't homework assigned to me. I don't know the answer and thought the question was easy.2010-12-26

5 Answers

4

There is a deterministic finite state machine that accepts a word if and only if each terminal appears an even number of times. Assume that there are $k$ symbols in the alphabet. The machine has $2^k$ states, each of which is associated with a subset of the alphabet, and we identify the states with the subsets. The machine is defined so that:

  • The initial state corresponds to the empty set, and this is the only accepting state.
  • The transition rules are defined as follows. From state $A$, on input $l$, if $l \in A$ then let $B = A - \{l\}$ and move to state $B$. Otherwise let $B = A \cup \{l\}$ and move to state $B$.

The idea is that a state $B$ corresponds to the set of terminals that have appeared an odd number of times so far. Being in state $\emptyset$ corresponds to every terminal appearing an even number of times.

Since the language is accepted by a DFSM, the language is regular. This means that there is, trivially, a CFG that accepts the language. The grammar has one nonterminal for each state of the machine, and the rules of the grammar match up with transitions of the machine. I don't know yet whether there is a context free grammar for this language that is polynomial size compared to the size of the alphabet, though.
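As a quick illustration of the construction, here is a minimal Python sketch of the machine (my own sketch; names are mine). The current state is represented directly as the set of symbols seen an odd number of times, so the $2^k$ states never need to be materialized:

```python
def accepts(word):
    """Simulate the 2^k-state DFA: the state is the set of symbols
    that have appeared an odd number of times so far."""
    odd = set()  # initial state: the empty set (also the only accepting state)
    for symbol in word:
        # transition: A - {l} if l in A, else A | {l}; i.e. toggle membership
        odd ^= {symbol}
    return not odd  # accept iff every symbol occurred an even number of times

print(accepts("abba"))    # True
print(accepts("abcabc"))  # True: each of a, b, c occurs twice
print(accepts("ab"))      # False
```

With the alphabet enumerated, the state could equally be a $k$-bit mask, which matches the later comment that a RAM/TM can recognize the language in $O(n)$ time with $k$ bits of memory.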

  • 2
    Exactly what I had in mind.2010-12-15
  • 0
    @Carl: Thank you!! Do you have an opinion or intuition about polynomial size CNF - need this for a graph problem.2010-12-15
  • 2
    I don't know about a polynomial size CNF but I did work out that you can't do it with a polynomial size DFSM, in fact if there are $k$ letters then you need at least $2^k$ states for *any* correct DFSM. So my answer is optimal for DFSMs that accept the language. The problem for context free grammars seems much more difficult.2010-12-15
  • 0
    Thank you again. I hope the second edit is marginally better.2010-12-15
  • 0
    Solving this may give an interesting extension to "Restricted read twice IBDDs and context free grammars" http://cstheory.stackexchange.com/questions/3816/restricted-read-twice-ibdds-and-context-free-grammars2010-12-16
  • 0
    @Carl Mummert I suppose if a solution exists it would be easier to find it than to prove negative result?2010-12-18
  • 0
    @Carl Mummert since a bounty is running I fail to see your opinion about the poly-size part?2010-12-20
  • 0
    I don't know anything about the polynomial size question, but I would be very interested to see a solution if anyone has one.2010-12-21
  • 0
    @Carl Mummert Thank you! I am unsure the second answer here is correct...2010-12-21
  • 0
    @Carl, using the Myhill–Nerode theorem I think it is easy to see that your automaton is actually the minimal automaton recognizing the language.2010-12-26
2

Edit: Here is a more verbose proof. The original proof is retained below.

Introduction

Let $\Sigma$ be some alphabet with $n$ symbols. Let $L$ be the language of words in which each symbol appears an even number of times. Suppose $G$ is a grammar for $L$. Our goal is to give an exponential lower bound on the size of $G$ in terms of $n$.

Transition to Chomsky Normal Form

According to Wikipedia (and not difficult to prove), there is a grammar $G'$ for $L$ in Chomsky Normal Form such that $|G'| = O(|G|^2)$. We will show an exponential lower bound on $|G'|$, and this will imply an exponential lower bound on $|G|$ (if $|G'| = \Omega(c^n)$ for $c > 1$ then $|G| = \Omega(\sqrt{c}^n)$).

For our purposes, a grammar is in Chomsky Normal Form (CNF) if the productions are of the following two types: $A \rightarrow BC$ and $A \rightarrow a$, where $A,B,C$ are non-terminals and $a$ is a terminal. Some people also place the restriction that $B,C$ cannot equal the starting symbol $S$, but we do not need this condition.

Word derivation in CNF grammars

Let $w \in L$. Choose some derivation of $w$ in $G'$ (there might be several sharing the same derivation tree). The derivation can be thought of as a binary tree (derivation tree) where each non-leaf node is labeled by a non-terminal, and each leaf is labeled by a terminal; a non-leaf node either has two non-leaf children or one leaf child.

As an example, consider the derivation $$S \rightarrow AB \rightarrow ACD \rightarrow^* xyz.$$ The corresponding derivation tree can be described as follows (sorry for the missing graphics):

  1. The root is labeled $S$.
  2. The root has two children labeled $A,B$.
  3. The node labeled $B$ has two children labeled $C,D$.
  4. The node labeled $A$ has one child labeled $x$, which is a leaf.
  5. The node labeled $C$ has one child labeled $y$, which is a leaf.
  6. The node labeled $D$ has one child labeled $z$, which is a leaf.

Note that in general, several nodes can share a label (indeed, this is how one proves the pumping lemma).

Each node in the tree generates a subword of the original word, which is formed from its descendant leaf nodes in order. Continuing our example:

  1. The root generates the entire word $xyz$.
  2. The node labeled $A$ generates the subword $x$.
  3. The node labeled $B$ generates the subword $yz$.
  4. The node labeled $C$ generates the subword $y$.
  5. The node labeled $D$ generates the subword $z$.

A subword lemma

Lemma. Suppose $w \in L$ is of length $m$, and choose some positive $\ell \leq m$. Then there is a node in the derivation tree of $w$ that generates a subword of length $\alpha$ satisfying $\ell \leq \alpha < 2\ell$.

Proof. Run the following algorithm:

  1. Set $v$ to the root of the derivation tree.
  2. While the subword generated by $v$ has size at least $2\ell$, replace $v$ by its child whose generated subword is longest.
  3. Output the subword generated by $v$.

Note that the node $v$ always has two children, since otherwise it would generate a subword of length $1$. If the loop in step (2) never runs then certainly $v$ generates a subword of the requisite length, by our assumption that $\ell \leq m$.

Otherwise, consider the last iteration of the loop. Denote by $w(u)$ the subword generated by a node $u$. In the last iteration of the loop, there is a node $v$ with $|w(v)| \geq 2\ell$, with children $v_1,v_2$ such that $|w(v_1)| \geq |w(v_2)|$; we move to $v_1$. Note that $$2\ell \leq |w(v)| = |w(v_1)| + |w(v_2)| \leq 2|w(v_1)|$$ so that $|w(v_1)| \geq \ell$. Since the loop stopped, $|w(v_1)| < 2\ell$ and $w(v_1)$ is of the requisite size.
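The descent in the proof can be sketched as follows: a toy derivation tree caches the length of the subword each node generates, and we repeatedly step into the heavier child (all names here are mine, purely for illustration):

```python
class Node:
    """Toy derivation-tree node caching the length of its generated subword."""
    def __init__(self, children=()):
        self.children = list(children)
        self.size = sum(c.size for c in self.children) if children else 1

def find_subword_node(root, l):
    """Descend as in the lemma: return a node v with l <= size(v) < 2*l."""
    v = root
    while v.size >= 2 * l:
        # replace v by the child whose generated subword is longest
        v = max(v.children, key=lambda c: c.size)
    return v

# a balanced derivation tree with 8 leaves (a word of length 8)
pair = lambda: Node([Node(), Node()])
tree = Node([Node([pair(), pair()]), Node([pair(), pair()])])
v = find_subword_node(tree, 3)
print(v.size)  # 4, and indeed 3 <= 4 < 6
```

The loop invariant is exactly the inequality in the proof: when a node of size at least $2\ell$ is split, its heavier child has size at least half of it, hence at least $\ell$.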

The proof

We consider the set $P$ of words in $L$ of the form $\pi(\Sigma) \pi(\Sigma)$, where $\pi$ runs over all permutations of $n$ symbols. For example, if $\Sigma = \{0,1\}$ then $$P = \{0101,1010\}.$$ Note that indeed $P \subseteq L$.
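For a small alphabet the set $P$ can be enumerated directly, and one can sanity-check that $P \subseteq L$ (this is a check, not part of the proof; the names are mine):

```python
from itertools import permutations
from collections import Counter

sigma = "abc"
# P = { pi(Sigma) pi(Sigma) : pi a permutation of Sigma }
P = {"".join(p) * 2 for p in permutations(sigma)}
print(sorted(P))  # 3! = 6 words of length 6, e.g. 'abcabc', 'acbacb', ...

# every word in P lies in L: each symbol occurs exactly twice
assert all(all(v % 2 == 0 for v in Counter(w).values()) for w in P)
```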

Since all words in $P$ are of length $2n$ (recall $n$ is the size of the alphabet $\Sigma$), using the lemma with $\ell = n/3$ (assume for simplicity that $n$ is divisible by $3$) we get that some node in the derivation tree of any $w \in P$ generates a subword $x(w)$ whose size satisfies $$n/3 \leq |x(w)| < 2n/3.$$ Denote by $s(w)$ the non-terminal generating $x(w)$ (it is the label of the node).

Since $|x(w)| \leq n$, all symbols in $x(w)$ are distinct: indeed, any subword of $w$ of length at most $n$ is a subword of a cyclic rotation of the original permutation (think of all subwords of length $3$ in $abcabc$: $abc,bca,cab,abc$).

Our strategy now will be to show that any single non-terminal can generate only a relatively small number of the $x(w)$; since $P$ is big, it will follow that $G'$ must have lots of non-terminals. More explicitly, let $$S = \{ s(w) : w \in P\}.$$ We can find a list of representative, distinct words $w_1,\ldots,w_{|S|}$ such that $$S = \{ s(w_i) : 1 \leq i \leq |S| \}.$$ We will show that for each $w_i$, the set of words $w \in P$ with $s(w) = s(w_i)$ consists of at most $B$ words ($B$ will be determined below). It will follow that the number of distinct non-terminals is at least $$\frac{|P|}{B} = \frac{n!}{B}.$$

So let $w \in P$, and suppose that $s(w) = s(w_i)$ for some fixed $i$. What this means is that we can replace $x(w)$ in $w$ with $x(w_i)$ and still get a word in $L$. The word $w$ has a decomposition $w = lx(w)r$. Every symbol appearing in $x(w)$ appears exactly once in $lr$, and every other symbol appears exactly twice in $lr$. Note that this defines the word $x(w)$ up to permutation, i.e. since we assumed that $lx(w_i)r \in L$, then $x(w_i)$ must be a permutation of $x(w)$ (more accurately, the set of (unique) symbols occurring in $x(w),x(w_i)$ must be the same).

We will now bound the number of words in $P$ that have a subword of length $|x(w_i)|$ which is a permutation of $x(w_i)$. Any such word $w$ can be broken as $lyzr$ where $yz$ is a permutation of $x(w_i)$ and $|ly|=|zr|=n$; in fact, since $w \in P$ then $ly = zr$. There are $|x(w_i)|!$ possible choices for $yz$. Together, these define $|x(w_i)|$ points of the permutation $\pi(\Sigma) = ly$. For the rest of the points we have $(n-|x(w_i)|)!$ possible choices. Finally, $yz$ can start in at most $n$ "cyclic" locations inside $\pi$, and so $$B \leq n|x(w_i)|!(n-|x(w_i)|)!.$$ When is this expression maximized, as a function of $\alpha = |x(w_i)|$? Taking the logarithm of the bound, $$\log B \leq \log n + \log \alpha! + \log (n-\alpha)!.$$ The factorial function (in fact, the Gamma function) is log-convex, and so subject to the condition $n/3 \leq \alpha \leq 2n/3$, this expression is maximized for $\alpha=n/3$ or $\alpha =2n/3$ (the value is the same). That is $$B \leq n(n/3)!(2n/3)!.$$

Plugging in the value of $B$ that we have laboriously estimated, we conclude that the number of different non-terminals in $G'$ must be at least $$\frac{|P|}{B} = \frac{n!}{n(n/3)!(2n/3)!} = \frac{1}{n} \binom{n}{n/3}.$$ Now we will use Stirling's approximation $$m! \sim \sqrt{2\pi m} (m/e)^m.$$ Plugging in the approximation, we get $$\frac{n!}{n(n/3)!(2n/3)!} \sim \frac{\sqrt{2\pi n}(n/e)^n}{n\sqrt{2\pi(n/3)}(n/3e)^{n/3} \sqrt{2\pi(2n/3)}(2n/3e)^{2n/3}}.$$ All powers of $n$ and $e$ cancel, and we're left with the following $$\frac{n!}{n(n/3)!(2n/3)!} \sim \frac{1}{n\sqrt{4\pi n/9}} 3^{n/3} (3/2)^{2n/3} = \Omega\left(n^{-3/2} c^n \right),$$ where the constant $c$ is $$c = 3^{1/3} (3/2)^{2/3} = \frac{3}{2^{2/3}} > 1.$$

In particular, the original grammar $G$ must be of size at least $\Omega(n^{-3/4} \sqrt{c}^n)$.
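The identity and the value of the constant can be checked numerically (a sanity check of the arithmetic above, using exact rational arithmetic; names are mine):

```python
from fractions import Fraction
from math import comb, factorial

def lower_bound(n):
    """The bound n! / (n * (n/3)! * (2n/3)!) on the number of non-terminals."""
    assert n % 3 == 0
    return Fraction(factorial(n), n * factorial(n // 3) * factorial(2 * n // 3))

for n in (6, 12, 24):
    # identity check: the bound equals (1/n) * C(n, n/3)
    assert lower_bound(n) == Fraction(comb(n, n // 3), n)

c = 3 / 2 ** (2 / 3)
print(c)                       # ≈ 1.88988, the base of the exponential growth
print(float(lower_bound(24)))  # ≈ 30644.6 already at n = 24
```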

Exactly 0 or 2 occurrences

Another question concerned the language $L'$ of all words which contain exactly $0$ or $2$ occurrences of each symbol in $\Sigma$. The proof given in the preceding section applies equally well to $L'$, since the set of words $P$ we used belongs to $L'$; the rest of the proof uses the fact that certain words are not in $L$, and since $L \supset L'$, they are a fortiori not in $L'$.

Generalization

We have seen that the lower bound holds for both $L$ and $L'$. When does the proof hold for a language $\Lambda$? We used only two properties of $\Lambda$:

  1. $\Lambda$ contains $P$.
  2. If $w_1,w_2 \in P$, $x_1,x_2$ are the subwords of $w_1,w_2$ chosen using the subword lemma, and $w_1(x_1:=x_2),w_2(x_2:=x_1) \in \Lambda$ (i.e. the word resulting by replacing $x_1$ with $x_2$ in $w_1$, and the word resulting by replacing $x_2$ with $x_1$ in $w_2$) then $x_2$ is a permutation of $x_1$.

When will the second condition fail? The words $x_1,x_2$ both have no repeated symbols. If they are not permutations of each other then $w_1(x_1:=x_2)$ will have a symbol repeated $1$ or $3$ times. So words containing such symbols must lie outside $\Lambda$. Considering $w_2(x_2:=x_1)$, we see that it is enough to exclude one of the possibilities. We get the following theorem.

Theorem. Suppose $\Lambda$ is a language containing $P$ in which no word contains any symbol exactly once, or in which no word contains any symbol exactly $3$ times. Then any grammar for $\Lambda$ must have size $\Omega(n^{-3/4} C^n)$, where $C = \sqrt{3}/\sqrt[3]{2}$.

As an example, the theorem applies for the language in which the number of occurrences of each symbol lies in $S$ as long as $2 \in S$ and $1,3 \notin S$. For example, $L$ corresponds to $S = \{0,2,4,6,\ldots\}$ and $L'$ corresponds to $S = \{0,2\}$.

We leave the reader the following easy generalizations.

Theorem. Suppose $\Lambda$ is a language containing $\{ \pi(\Sigma)^\ell \}$ for some $\ell > 0$, in which no word contains any symbol exactly $\ell + \epsilon$ times, where $\epsilon$ is either $1$ or $-1$. Then any grammar for $\Lambda$ must have size $\Omega(n^{-3/4} C^n)$, where $C = \sqrt{3}/\sqrt[3]{2}$.


Original, condensed proof

Let $L$ be the language of words in which each symbol appears an even number of times. We give an exponential lower bound (in $n = |\Sigma|$) on the size of a Chomsky Normal Form (CNF) grammar for $L$. This implies an exponential lower bound on the size of any context-free grammar for $L$, since (according to Wikipedia) the blowup to CNF is at most quadratic.

Given any derivation of a word $w$ of length $m$ and any $\ell < m/2$ we can always find a non-terminal symbol in the derivation which derives a subword $x$ of $w$ of length $\ell \leq |x| < 2\ell$: start from the root of the derivation, and always pick the larger branch. The first time we hit a branch smaller than $2\ell$, it must be at least $\ell$ in size.

Consider the set of all $S = n!$ words in $L$ of the form $\pi\pi$ (all of size $2n$), where $\pi$ is a permutation of $\Sigma$. For each such word $w$ we can find a symbol $s_w$ generating a subword $x_w$ of size $n/3 \leq |x_w| < 2n/3$. Note that all symbols in $x_w$ are distinct. Therefore if $s_u = s_v$ then $x_u,x_v$ must be permutations of each other.

Given $w$, for how many words $u$ can it be that $s_w = s_u$? To get from $w$ to $u$ we need to permute $x_w$, to permute the rest of the permutation, and to choose the starting location of $x_u$ within the permutation generating $u$. All in all, the number of possibilities is at most $$M = n|x_w|!(n-|x_w|)! \leq n(n/3)!(2n/3)!.$$ Therefore the number of distinct symbols in the grammar is at least $$\frac{S}{M} = \frac{n!}{n(n/3)!(2n/3)!} = \Omega\left(\frac{c^n}{n^{1.5}}\right),$$ where $c = 3/2^{2/3} \approx 1.88988$ (thanks to Wolfram Alpha).

  • 0
    +1 for just answering. **I have no idea if the answer is correct**.2010-12-22
  • 0
    @jerr18 Maybe the new version is more understandable. If there are still points which you find unclear, I can expound on them.2010-12-24
  • 0
    Thank you. I have some irrational doubts about the correctness of your solution. _Nothing personal_.2010-12-25
  • 0
    Unfortunately I lack the knowledge to verify your proof. So far I resort to watching peer review :-) [homework hunters should trust the peer review actors, by the way. There are some blatantly wrong statements on this forum...]2010-12-26
1

Edit: This solution is wrong, as explained in the comments below. The proof breaks in the italicized phrase.

Let $\Sigma = \{ \sigma_i \}$. Take the following grammar of size $O(|\Sigma|)$: $$S \longrightarrow \sigma_i S \sigma_i S | \varepsilon$$ Clearly in every word produced, every terminal appears an even number of times (proof by induction). In order to generate a word $w$ matching this description, suppose that $w = \sigma_i w'$. Then $\sigma_i$ must appear again in $w'$, say $w = \sigma_i z_1 \sigma_i z_2$. Note that $z_1,z_2$ also match the description.

Now in CNF: $$\begin{align*} S &\longrightarrow \varepsilon \\ S &\longrightarrow S_i S_i \\ S_i &\longrightarrow \sigma_i | A_i T \\ A_i &\longrightarrow \sigma_i \\ T &\longrightarrow S_i S_i \end{align*}$$
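The gap in this grammar can be checked mechanically with a small memoized membership test for $S \to \sigma_i S \sigma_i S \mid \varepsilon$ (my own sketch, not the parser mentioned in the comments):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def derivable(w):
    """Can S -> s S s S | eps derive the word w?"""
    if w == "":
        return True
    s = w[0]
    # S -> s S s S: try every later occurrence of s as the second terminal
    return any(w[j] == s and derivable(w[1:j]) and derivable(w[j + 1:])
               for j in range(1, len(w)))

print(derivable("abba"))    # True:  S -> a S a S with the first S -> b S b S
print(derivable("abcabc"))  # False: the counterexample from the comments
```

Every word the grammar derives does have all-even counts (the soundness direction holds by induction), but as `abcabc` shows, it is incomplete: the inner `bc` would have to be derived from $S$ on its own, which is impossible.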

  • 0
    Thank you. Before accepting I will need some time.2010-12-21
  • 0
    Thank you. I have some irrational doubts about the correctness of your solution (if I understand it correctly). Seems epsillon removal will require some troubling sub-words. I can live with your CNF if the rest is correct...2010-12-21
  • 0
    @Yuval Filmus over the alphabet {a,b,c} I don't see how to generate **abcabc** using your rules and the parser cgm claims it is not in the language (may be a parser bug). Am I missing something?2010-12-21
  • 0
    cgm is at http://www.ncc.up.pt/cgm/2010-12-21
  • 0
    You got me! I'll keep on thinking.2010-12-21
  • 0
    I am unsure about your motivation, but you may comment/edit your $``CNF''$2010-12-22
  • 0
    Why vote for this answer? It is clearly wrong.2010-12-26
1

I have no idea how to comment. But I do know how to answer. Sorry about that.

I considered the pumping lemma. And I considered how many words are necessary to pump the language.

Consider the language over {0,1}.

We need the following words: 00, 11, 0101, and 1010. To pump we can combine any words by concatenation or insertion. For this example a grammar would be

S => ε | S0S0S | S1S1S | S0S1S0S1S | S1S0S1S0S

A quick count shows the number of words necessary for the grammar with n terminals is exponential in n. I guess that shows there is no polynomial sized grammar.

  • 0
    This is not a proof! Maybe the idea can be turned into a proof, can you try to explain it more verbosely? What do you mean by "how many words are necessary to pump the language"?2010-12-23
  • 0
    +1 for just answering. **I have no idea if the answer is correct.**2010-12-23
  • 0
    @user5055 If your goal is just to win 50 pts, a **faster** way will be to answer other questions on the site.2010-12-23
1

jerr18 ---

Ignoring the fact that my posted "answer" was clearly indicated to be a "comment" ...

Perhaps you do not understand what a proof is. As a writer: a proof is what makes me believe a result. As a reader whatever you wish to accept is a proof. A counterexample is what is needed to show a proof is defective. Sometimes a believer will use a comment to provide a counterexample. (I am not asking that you make me disbelieve my comment/proof by providing a counter example.)

As for your question, I believe that the empty language is poly for any alphabet and it satisfies your requirement that every terminal appears an even number of times. I suppose your question should be reposed to conform to the effort that Yuval Filmus put forth.

  • 0
    I think the OP meant the language consisting of all words in which each symbol appears an even number of times (never or twice in the variant).2010-12-26
  • 0
    Your concept of proof is very Lakatosian. Working mathematicians believe (perhaps naively) that there are some absolute standard of proof, such that whenever a result is proven, no counterexample is possible. That's the common meaning of "proof" in the mathematical community.2010-12-26
  • 0
    +1. Other suggestions for reposing? *Technically* it wasn't homework. At least not assigned to me.2010-12-26
  • 0
    Thank you. Trying to decrease my crankiness level: If I knew the answer, I wouldn't ask. The negative answer is probably on the right side of the bet (I suppose the answer to this question is negative). The answer is not convincing for my level of expertise and the answerer tried the other side of the bet too.2010-12-27