The Chomsky Objection the AI Industry Has Been Quietly Working Around
DEV Community


A useful technical idea, repeated often enough, eventually generates an unuseful philosophical claim. The current example is grammar-constrained decoding. The technique is straightforward: at each generation step, the language model's next-token logits are masked so that only tokens whose continuation can satisfy a formal grammar remain selectable, and the output is, by construction, structurally valid. JSON parses. SQL is well-formed. Function-call signatures match. There is a real engineering payoff and a healthy ecosystem of libraries that deliver it.

The drift is not in the engineering. It is in the rhetorical move that follows the engineering. A growing corner of 2025-2026 AI writing argues, more or less explicitly, that constraining a model's output is making the model approach meaning: that filtering linear sequences is somehow building structure, and that structure is somehow building understanding. I want to take that drift seriously, because it is the same conflation Chomsky and collaborators flagged in their March 2023 essay in the New York Times, and the engineering literature on constrained decoding agrees with Chomsky on the substantive question, even when the marketing copy doesn't.

## What grammar-constrained decoding actually is

A language model produces output one token at a time. At each step, the model emits a probability distribution over its vocabulary, and the decoding strategy (greedy, top-k, nucleus, etc.) picks one token. Without modification, the model is free to emit any continuation; the resulting text might happen to be valid JSON, or it might not. Grammar-constrained decoding intervenes in that step. A formal grammar (typically a context-free grammar, sometimes a regular expression, sometimes a JSON schema or Pydantic model) defines what counts as valid output.
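
As a rough illustration of the masking step this section describes, here is a minimal sketch in plain Python. Everything in it is a hypothetical stand-in: the seven-token `VOCAB`, the `GRAMMAR` state machine for lists such as `[1,2]`, and `toy_logits`, which replaces a real model's forward pass with fixed scores. Production engines work over subword tokens and full context-free grammars, so treat this as a picture of the idea, not of any library's implementation.

```python
import math

# Hypothetical toy vocabulary; a real tokenizer has tens of thousands of entries.
VOCAB = ["[", "]", ",", "1", "2", "hello", "<eos>"]

# Hypothetical grammar for lists such as [1,2], written as a state machine:
# each state maps the tokens allowed next to the state they lead to.
GRAMMAR = {
    "start": {"[": "after_open"},
    "after_open": {"1": "after_digit", "2": "after_digit", "]": "done"},
    "after_digit": {",": "after_comma", "]": "done"},
    "after_comma": {"1": "after_digit", "2": "after_digit"},
    "done": {"<eos>": "end"},
}


def toy_logits(prefix):
    """Stand-in for a model forward pass: fixed scores that favour 'hello'."""
    scores = {"[": 0.0, "]": 1.5, ",": 0.5, "1": 2.0, "2": 1.0,
              "hello": 5.0, "<eos>": -1.0}
    return [scores[tok] for tok in VOCAB]


def mask_logits(logits, allowed):
    """Set every token the grammar does not allow next to negative infinity."""
    return [x if tok in allowed else -math.inf for tok, x in zip(VOCAB, logits)]


def constrained_greedy(max_steps=20):
    """Greedy decoding restricted to tokens the grammar still permits."""
    state, out = "start", []
    for _ in range(max_steps):
        if state == "end":
            break
        masked = mask_logits(toy_logits(out), GRAMMAR[state])
        token = VOCAB[masked.index(max(masked))]
        state = GRAMMAR[state][token]
        if token != "<eos>":
            out.append(token)
    return "".join(out)


def unconstrained_greedy(steps=3):
    """Greedy decoding with no mask, for contrast."""
    out = []
    for _ in range(steps):
        logits = toy_logits(out)
        out.append(VOCAB[logits.index(max(logits))])
    return "".join(out)


if __name__ == "__main__":
    print(unconstrained_greedy())  # hellohellohello
    print(constrained_greedy())    # [1]
```

The mask changes which tokens are reachable, not what the scoring function prefers: the constrained run produces a well-formed list, but the `1` inside it is simply the highest-scoring token the grammar happened to allow at that step.
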
At each generation step, the constraint engine computes which next tokens could lead to a continuation that is still satisfiable under the grammar, masks the logits for all other tokens to negative infinity, and lets the model's distribution operate within the remaining set. The first token that closes the grammar terminates the generation.

The Python library Outlines, maintained by .txt, is the most widely used implementation. The minimal usage is short:

```python
import outlines
from transformers import AutoTokenizer, AutoModelForCausalLM
from pydantic import BaseModel


class Person(BaseModel):
    name: str
    age: int
    city: str


# Load a small open-weights model and wrap it with Outlines.
hf_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct", device_map="auto"
)
hf_tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
model = outlines.from_transformers(hf_model, hf_tokenizer)

# At each token step, only tokens whose continuation can satisfy
# the Person schema remain selectable. The output is guaranteed
# to parse as valid Person JSON; whether the contents are *correct*
# is a different question.
result = model(
    "Extract the person from: 'Marie, 34, Paris'",
    Person,
    max_new_tokens=200,
)
print(result)  # {"name": "Marie", "age": 34, "city": "Paris"}
```

The same pattern is supported by llguidance (Microsoft), lm-format-enforcer, and llama.cpp's GBNF grammars: different libraries, same algorithm class. Production agent pipelines, structured-extraction pipelines, and typed-API surfaces use this technique as a matter of course in 2026, and the reliability gain over "ask the model nicely to emit JSON" is large enough that nobody serious questions whether the engineering tool is useful. The question is what the engineering tool is doing.

## The Chomsky objection

On 8 March 2023, Noam Chomsky, Ian Roberts, and Jeffrey Watumull published "The False Promise of ChatGPT" in the New York Times.
The essay's central claim is sharp and easy to misread, and worth reading in its actual phrasing rather than its summary. Their argument is that statistical pattern continuation, which is what large language models do, is not the same activity as the construction of explanations. Human language, on the Chomskyan view, is not a long sequence of tokens that gets filtered through constraints into surface-correct strings. Human language is the externalisation of an underlying generative system whose primary work is the construction of hierarchical syntactic structures, which then linearise into the speech or text the listener perceives.

This is not new. Chomsky has been arguing some version of this since Syntactic Structures in 1957: that "language as a finite filter over linear sequences" is the wrong level of description, and that "language as a generative system that produces hierarchical structure" is the right one. The 2023 essay applies that long-running argument to the specific case of large language models, and the framing is precisely the one the constrained-decoding rhetorical drift collides with.

The drift goes: we constrained the decoder so the output is structurally valid; structure is what language is; therefore the model is closer to language. Each step in that chain is wrong on its own terms. Constraint is not structure. Linear validity is not hierarchy. Nothing in the model's filtered token stream is doing what Chomsky calls the work of language: distinguishing the possible from the impossible, constructing causal explanations, anchoring statements to truth rather than to probability.

The drift is also, in a particular sense, structurally similar to other rhetorical moves the field has been making. Wrap an LLM in a planner, a memory store, and a tool-call interface, and the result is often called an "agent", a label that imports the philosophical baggage of agency without the system having any of it.
Wrap a decoder in a grammar, and the result is sometimes called "approaching language", a label that imports the philosophical baggage of meaning without the system doing any of the work meaning would require. External orchestration gets credited as internal cognition. The same move, in two different costumes.

## What the engineering literature actually says

The argument that constrained decoding produces meaning is not a claim the technical literature on constrained decoding makes. The literature is, in fact, careful in the other direction.

Raspanti, Ozcelebi, and Holenderski's ACL 2025 industry-track paper, "Grammar-Constrained Decoding Makes Large Language Models Better Logical Parsers," shows that constraint-based generation improves parsing reliability on logical-form tasks. It does not claim the technique improves the model's reasoning. The improvements are in format adherence; the semantic-error rate within validly formatted outputs is largely unchanged.

Beurer-Kellner, Fischer, and Vechev, "Guiding LLMs The Right Way: Fast, Non-Invasive Constrained Generation," ICML 2024, the paper that introduced the DOMINO approach, documents the token-misalignment problem: the LLM's subword tokenisation does not align cleanly with grammar terminals, so the constraint engine sometimes forces the model through unnatural intermediate token paths that distort its native probability distribution. This is a documented engineering cost of the technique, not a step toward better understanding.

Banerjee, Suresh, Ugare, Misailovic, and Singh, "CRANE: Reasoning with constrained LLM generation," 2025 (arXiv:2502.09061), and Schall and de Melo, "The Hidden Cost of Structure: How Constrained Decoding Affects Language Model Performance," RANLP 2025, are even more pointed. Both papers find that strict constraints applied during reasoning steps actively degrade model performance on the underlying task.
When the model is forced to emit only syntactically valid intermediate states, it loses the chain-of-thought scaffolding that gives it whatever reasoning ability it has. The fix, where there is one, is to relax the grammar around the reasoning steps and re-tighten only at the final answer.

A small but important paper on a neighbouring question, Kallini, Papadimitriou, Futrell, Mahowald, and Potts, "Mission: Impossible Language Models," ACL 2024, tests the learnability claim directly: the authors construct deliberately impossible synthetic languages (with grammar rules no human language uses), train GPT-2 models on them, and report that the models learn the impossible languages less efficiently than English. That finding cuts against the strongest form of the Chomskyan position, that LLMs learn possible and impossible languages equally well, and an honest reading of the literature has to include it. It does not rescue the drift: whatever sensitivity the model has to humanly possible structure comes from training, and a decode-time mask adds none.

The engineering picture, summarised in one table:

| Property | Does constrained decoding address it? | Documented in |
|---|---|---|
| Format validity (parses against schema) | ✓ Guaranteed by construction | Outlines / DOMINO docs; engineering benchmarks |
| Type compliance (fields match required types) | ✓ Guaranteed when grammar is well-specified | Outlines / DOMINO docs; engineering benchmarks |
| Enumeration adherence (only allowed values) | ✓ Guaranteed when grammar is well-specified | Outlines / DOMINO docs; engineering benchmarks |
| Factual accuracy of contents | ✗ Wrong facts in correctly-shaped wrappers | Raspanti et al., ACL 2025 |
| Semantic correctness of the output | ✗ Plausible nonsense passes the constraint | Raspanti et al., ACL 2025; broader hallucination literature |
| Reasoning quality | ✗ Strict grammars during reasoning degrade performance | Banerjee et al., CRANE 2025; Schall & de Melo, RANLP 2025 |
| Native probability distribution | ✗ Token-misalignment artifacts distort it | Beurer-Kellner et al., DOMINO ICML 2024 |
| Sensitivity to humanly-possible vs impossible languages | ✗ Unchanged; it is a property of the trained model, not of the decoder | Kallini et al., ACL 2024 |

The engineering work is honest about what the technique is and isn't. The marketing layer above the engineering is the part that occasionally drifts, and that drift is the part the Chomsky objection lands directly on.

## The same move keeps repeating

It is worth pausing on the rhetorical pattern, because it is consistent across the AI industry and predicts where the next conflation will
