Glass Biome — The Process

Glass Biome is based in Davis. UC Davis has the #1 US agriculture program. The USDA-ARS Western Human Nutrition Research Center is on campus — the only West Coast USDA human-nutrition lab. AI-extracted claims are validated at WHNRC before being released as cultivation protocols.

O.01 The Core Question

For any crop we grow, we want to know: which growing parameters (light, pH, salinity, humidity, harvest timing, etc.) change which nutrients (vitamins, antioxidants, alkaloids, terpenes, etc.) — by how much, and in which direction?

The answer lives scattered across tens of thousands of published papers. No human can read them all. AI scans research papers, extracts claims about how growing conditions affect nutrient outcomes, and scores them. Top findings get validated at the USDA-ARS lab on campus.

O.02 Current Scale

Crops: 87
Nutrients: 90
Parameters: 55
Verticals: 28
Papers Indexed: 17,839
Claims Extracted: 5,894

Crops, nutrients, and parameters are organized into 12 market verticals: longevity & anti-aging, nootropic & cognitive, immune & adaptogenic, precision nutrition, dermatological actives, flavor & aroma chemistry, pigment & colorant, stimulant & alkaloid, essential oil & terpene, sweetener & functional food, marine & aquaculture, and rare & ultra-premium.

O.03 Five-Stage Flow

Every research scan follows the same deterministic sequence:

Stage 1 (Scan): Build queries from the ontology and fetch papers from 3 APIs

Stage 2 (Extract): Parse text, match entities, infer direction & effect size

Stage 3 (Link): Aggregate triples, discover nutrient pairs & triads

Stage 4 (Model): Fit dose-response curves, classify evidence strength

Stage 5 (Protocol): Generate grower-ready recommendations with confidence

O.04 Query Generation

Search queries are built from the ontology, a curated knowledge base of crops, nutrients, parameters, and their known relationships. For each crop, the ontology identifies target nutrients (e.g., spinach targets iron, folate, vitamin C, nitrates) and priority parameters (e.g., light intensity, salinity, harvest timing).

Queries are structured as crop × nutrient × parameter triples. A full scan produces ~1,131 query lanes. Each lane is a specific question: “How does light intensity affect vitamin C in spinach?”
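As a rough illustration, lane enumeration could look like the sketch below. The dictionary layout, the `build_lanes` name, and the question template are hypothetical and are not taken from the engine's actual schema.

```python
from itertools import product

# Hypothetical ontology fragment; the real YAML schema may differ.
ONTOLOGY = {
    "spinach": {
        "target_nutrients": ["iron", "folate", "vitamin C", "nitrates"],
        "priority_parameters": ["light intensity", "salinity", "harvest timing"],
    },
}

def build_lanes(ontology):
    """Return one query lane per crop x nutrient x parameter triple."""
    lanes = []
    for crop, entry in ontology.items():
        pairs = product(entry["target_nutrients"], entry["priority_parameters"])
        for nutrient, parameter in pairs:
            lanes.append({
                "crop": crop,
                "nutrient": nutrient,
                "parameter": parameter,
                "question": f"How does {parameter} affect {nutrient} in {crop}?",
            })
    return lanes

lanes = build_lanes(ONTOLOGY)
print(len(lanes))  # 4 nutrients x 3 parameters = 12 lanes for spinach
```

Each lane carries both the structured triple and a readable question, so the same record can drive API queries and dashboard labels.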

O.05 Multi-Source Paper Fetching

Queries are sent to three scientific literature APIs simultaneously:

Europe PMC — Primary source. Structured field searches across title and abstract. Supports full-text XML access for open-access papers.
Semantic Scholar — Broad relevance-ranked keyword search. Provides citation context and open-access PDF links.
OpenAlex — Open bibliographic data. Reconstructed abstracts from inverted index, high coverage of recent publications.

After fetching, Unpaywall enrichment looks up DOIs to find free full-text versions. Papers with PDFs can optionally be processed through GROBID, which converts PDFs into structured XML with sections, tables, and figures.

Deduplication removes exact duplicates by content hash and DOI matching. The result is a unified paper collection with each record tagged by ingestion depth: full text, abstract only, or title only.
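A minimal sketch of this step, assuming each paper is a dictionary with optional `doi`, `title`, `abstract`, and `full_text` fields (these field names and the depth labels are illustrative assumptions):

```python
import hashlib

def content_hash(paper):
    """Hash the normalized title + abstract so exact textual duplicates collide."""
    text = ((paper.get("title") or "") + "\n" + (paper.get("abstract") or ""))
    return hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()

def ingestion_depth(paper):
    if paper.get("full_text"):
        return "full_text"
    if paper.get("abstract"):
        return "abstract_only"
    return "title_only"

def deduplicate(papers):
    """Keep the first record seen for each DOI and each content hash."""
    seen_dois, seen_hashes, unique = set(), set(), []
    for paper in papers:
        doi = (paper.get("doi") or "").lower()
        digest = content_hash(paper)
        if (doi and doi in seen_dois) or digest in seen_hashes:
            continue
        if doi:
            seen_dois.add(doi)
        seen_hashes.add(digest)
        unique.append({**paper, "depth": ingestion_depth(paper)})
    return unique

papers = [
    {"doi": "10.1/abc", "title": "Light and vitamin C", "abstract": "We tested three light levels."},
    {"doi": "10.1/ABC", "title": "Light and vitamin C in spinach"},          # same DOI, different case
    {"doi": None, "title": "Light and vitamin C", "abstract": "We tested three light levels."},  # same text
    {"doi": "10.2/xyz", "title": "Salinity study"},
]
unique = deduplicate(papers)
```

This sketch keeps whichever copy arrives first; a production merge might instead prefer the record with the greatest ingestion depth.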

O.06 Entity Matching

Each paper’s text is split into sentences. For every sentence, the engine searches for mentions of crops, parameters, and nutrients using alias-based word-boundary matching. Each entity in the ontology has multiple aliases (e.g., “vitamin C”, “ascorbic acid”, “L-ascorbate” all map to vitamin_c).

When an entity is missing from a sentence but present in a neighboring sentence (±2), the engine can borrow it — but only if the sentences share at least one other entity. This prevents cross-paragraph false associations.
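The matching and borrowing rules can be sketched as follows. The alias table, the `KIND` map, and the one-entity-per-kind simplification are assumptions made for illustration only.

```python
import re

# Hypothetical alias table; canonical IDs follow the vitamin_c style above.
ALIASES = {
    "vitamin_c": ["vitamin C", "ascorbic acid", "L-ascorbate"],
    "spinach": ["spinach", "Spinacia oleracea"],
    "light_intensity": ["light intensity", "PPFD"],
}
KIND = {"vitamin_c": "nutrient", "spinach": "crop", "light_intensity": "parameter"}

def compile_alias_pattern(aliases):
    body = "|".join(re.escape(a) for a in sorted(aliases, key=len, reverse=True))
    # Lookarounds act as word boundaries, so "spinach" does not match "spinachlike".
    return re.compile(rf"(?<!\w)(?:{body})(?!\w)", re.IGNORECASE)

PATTERNS = {entity: compile_alias_pattern(a) for entity, a in ALIASES.items()}

def match_entities(sentence):
    return {entity for entity, pattern in PATTERNS.items() if pattern.search(sentence)}

def resolve(sentences, i, window=2):
    """Entities for sentence i, borrowing missing kinds from neighbors within +/- window."""
    own = match_entities(sentences[i])
    resolved = {KIND[e]: e for e in sorted(own)}
    for j in range(max(0, i - window), min(len(sentences), i + window + 1)):
        if j == i:
            continue
        neighbor = match_entities(sentences[j])
        if not (own & neighbor):
            continue  # no shared entity, so borrowing is not allowed
        for entity in sorted(neighbor):
            resolved.setdefault(KIND[entity], entity)
    return resolved

sentences = [
    "Spinach was grown under three light intensity levels.",
    "Plants were harvested after 30 days.",
    "Vitamin C in spinach rose by 15%.",
]
triple = resolve(sentences, 2)
```

Here the third sentence lacks a parameter, but it shares "spinach" with the first sentence, so `light_intensity` is borrowed from it.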

O.07 Direction Inference

For each crop × parameter × nutrient triple found, AI models read the surrounding text and determine the direction of the relationship:

Positive — the parameter increases the nutrient (cues: “increased”, “higher”, “enhanced”, “promoted”, etc.)
Negative — the parameter decreases the nutrient (cues: “decreased”, “lower”, “reduced”, “inhibited”, etc.)
Nonlinear — the relationship is dose-dependent or biphasic (cues: “quadratic”, “U-shaped”, “bell-shaped”)
Unknown — entities are mentioned together but no clear directional signal found

Negation detection flips direction when terms like “not”, “did not”, or “failed to” appear within 3 words of a direction cue. If both positive and negative cues are present, the claim is marked nonlinear.
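A simplified, rule-based sketch of these direction rules is shown below. It uses only the cue words listed above (the real cue lists are longer) and assumes that negation is checked in the 3 words preceding a cue; both choices are illustrative assumptions.

```python
import re

POSITIVE_CUES = {"increased", "higher", "enhanced", "promoted"}
NEGATIVE_CUES = {"decreased", "lower", "reduced", "inhibited"}
NONLINEAR_CUES = ("quadratic", "u-shaped", "bell-shaped")
NEGATION = re.compile(r"\b(?:not|failed to)\b")  # "did not" is covered by "not"

def infer_direction(text, window=3):
    lowered = text.lower()
    if any(cue in lowered for cue in NONLINEAR_CUES):
        return "nonlinear"
    tokens = re.findall(r"[a-z]+", lowered)
    signs = set()
    for i, token in enumerate(tokens):
        sign = 1 if token in POSITIVE_CUES else -1 if token in NEGATIVE_CUES else 0
        if sign == 0:
            continue
        preceding = " ".join(tokens[max(0, i - window):i])
        if NEGATION.search(preceding):
            sign = -sign  # negation flips the direction
        signs.add(sign)
    if signs == {1}:
        return "positive"
    if signs == {-1}:
        return "negative"
    if signs == {1, -1}:
        return "nonlinear"  # mixed cues are treated as nonlinear
    return "unknown"

print(infer_direction("UV-B exposure increased anthocyanin content."))    # positive
print(infer_direction("Shading did not result in increased vitamin C."))  # negative
```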

O.08 Effect Size & Evidence Type

Quantitative signals are extracted from the text by AI: percentage changes (+15%), fold changes (2.3x), concentration units (mg/kg, µmol), and p-values (p<0.05). The first percentage found becomes the claim’s effect size.
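For illustration, the signals named above can be captured with a few regular expressions. The patterns and the `extract_signals` name are assumptions, and they cover only the formats mentioned in this section.

```python
import re

PERCENT = re.compile(r"([+-]?\d+(?:\.\d+)?)\s*%")
FOLD = re.compile(r"(\d+(?:\.\d+)?)\s*(?:x|-?fold)\b", re.IGNORECASE)
UNIT = re.compile(r"\d+(?:\.\d+)?\s*(mg/kg|µmol)")
P_VALUE = re.compile(r"\bp\s*([<=>])\s*(0?\.\d+)", re.IGNORECASE)

def extract_signals(text):
    percents = [float(value) for value in PERCENT.findall(text)]
    return {
        # The first percentage found becomes the claim's effect size.
        "effect_size_pct": percents[0] if percents else None,
        "fold_changes": [float(value) for value in FOLD.findall(text)],
        "units": UNIT.findall(text),
        "p_values": [(op, float(value)) for op, value in P_VALUE.findall(text)],
    }

signals = extract_signals(
    "Vitamin C rose by +15% (2.3x in cultivar B, 12% in cultivar C), "
    "reaching 48 mg/kg (p<0.05)."
)
print(signals["effect_size_pct"])  # 15.0
```

Taking the first percentage is simple and reproducible, but it can pick up an unrelated number when a sentence reports several, which is one reason an effect size is only one input to the confidence score.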

Evidence type is classified from study design keywords:

Meta-analysis — systematic reviews, PRISMA, Cochrane, pooled analyses
Replicated — RCTs, field trials, factorial experiments, multi-site studies
Single study — individual experiments (default)
Gray literature — conference proceedings, theses, posters
Theoretical — in silico, computational modeling, hypothesis papers
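A keyword classifier along these lines might look like the sketch below. The exact keyword patterns and the order in which the categories are checked are assumptions; the source only states that single study is the default.

```python
import re

# Checked in order; the first matching rule wins (assumed priority).
EVIDENCE_RULES = [
    ("meta_analysis", r"meta-analys[ie]s|systematic review|prisma|cochrane|pooled analys[ie]s"),
    ("replicated", r"rcts?|randomi[sz]ed controlled|field trials?|factorial|multi-site"),
    ("gray_literature", r"conference proceedings|thes[ie]s|posters?"),
    ("theoretical", r"in silico|computational model(?:l?ing)?|hypothesis paper"),
]
COMPILED = [
    (label, re.compile(rf"\b(?:{pattern})\b", re.IGNORECASE))
    for label, pattern in EVIDENCE_RULES
]

def classify_evidence(text):
    for label, pattern in COMPILED:
        if pattern.search(text):
            return label
    return "single_study"  # default when no design keyword is found

print(classify_evidence("A two-season factorial field trial in lettuce."))  # replicated
print(classify_evidence("Basil was grown under UV-B for 14 days."))         # single_study
```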

O.09 Three-Tier Extraction

The engine uses three extraction methods with decreasing precision but increasing recall:

Tier 1: Standard regex extraction. Sentence-level entity co-occurrence. All three entities (crop, parameter, nutrient) must appear in the sentence or its ±2 neighbors. Highest precision. Free and fast.

Tier 2: Deep re-extraction. Paper-level entity context. Targets “scanned, no claim” lanes where papers were found but standard extraction failed. Relaxes co-occurrence to paper level with a 0.75× confidence penalty. Recovers claims from papers that discuss entities across different sections.

Tier 3: LLM-powered extraction. Uses a language model to read abstracts and extract structured claims. Targets unknown-direction claims, unresolved conflicts, and high-value empty lanes. Highest recall and semantic understanding. Per-call API cost.

O.10 Quality Gates

Every claim must pass validation before entering the database:

Minimum source length — source text must be ≥40 characters (rejects truncation artifacts)
Entity presence — at least the crop or the nutrient must actually appear in the source sentence (rejects neighbor-borrowing false positives)
Parameter context — if the parameter was borrowed from a neighbor sentence and no experimental language is detected (“treatment”, “effect”, “exposure”, etc.), the claim receives a 0.6× confidence penalty
Deduplication — no duplicate (paper, crop, parameter, nutrient, direction) combinations within a document
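The four gates can be sketched as a single filter function. The claim field names (`source_text`, `crop_in_sentence`, `parameter_borrowed`, and so on) are hypothetical, and only the three experimental terms quoted above are checked.

```python
EXPERIMENTAL_TERMS = ("treatment", "effect", "exposure")

def apply_quality_gates(claim, seen_keys):
    """Return the claim (possibly penalized), or None if it is rejected."""
    source = claim["source_text"]
    if len(source) < 40:
        return None  # minimum source length
    if not (claim["crop_in_sentence"] or claim["nutrient_in_sentence"]):
        return None  # entity presence
    key = (claim["paper_id"], claim["crop"], claim["parameter"],
           claim["nutrient"], claim["direction"])
    if key in seen_keys:
        return None  # duplicate within the document
    seen_keys.add(key)
    confidence = claim["confidence"]
    has_experimental_language = any(t in source.lower() for t in EXPERIMENTAL_TERMS)
    if claim["parameter_borrowed"] and not has_experimental_language:
        confidence *= 0.6  # parameter context penalty
    return {**claim, "confidence": round(confidence, 4)}

claim = {
    "paper_id": "p1", "crop": "spinach", "parameter": "light_intensity",
    "nutrient": "vitamin_c", "direction": "positive", "confidence": 0.63,
    "source_text": "Vitamin C in spinach increased under the higher light level.",
    "crop_in_sentence": True, "nutrient_in_sentence": True, "parameter_borrowed": True,
}
seen = set()
first = apply_quality_gates(claim, seen)   # kept, with confidence 0.63 * 0.6
second = apply_quality_gates(claim, seen)  # rejected as a duplicate
```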

O.11 How Claims Are Scored

Every claim receives a confidence score between 0 and 1. This is not a p-value — it reflects extraction certainty based on multiple signals:

Evidence type base — meta-analysis (0.82) > replicated (0.68) > single study (0.48) > gray (0.36) > theoretical (0.26)
Direction signal — +0.12 for explicit positive/negative, +0.10 for nonlinear, +0 for unknown
Cue density — +0.03 per directional cue word found (max +0.15)
Effect size — +0.14 if a percentage/numeric effect was extracted
Quantitative signals — +0.025 per unit/p-value found if no effect size (max +0.10)
Study design bonus — +0.06 if meta-analysis or replicated AND has quantitative signals
Conflict penalty — −0.08 per conflicting claim on the same triple

Result: a single-study abstract with one “increased” keyword and no numbers scores ~0.63. A replicated field trial with an effect size, 3 directional cues, and p-values scores ~0.95+. The spread is intentional — it separates “mentioned together” from “quantitatively demonstrated.”
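The arithmetic behind these two examples can be reproduced with a short sketch. The function name and arguments are illustrative. Two details are assumptions: the score is clamped to the 0 to 1 range, and an extracted effect size counts as a quantitative signal for the study design bonus.

```python
BASE = {
    "meta_analysis": 0.82, "replicated": 0.68, "single_study": 0.48,
    "gray_literature": 0.36, "theoretical": 0.26,
}
DIRECTION_BONUS = {"positive": 0.12, "negative": 0.12, "nonlinear": 0.10, "unknown": 0.0}

def score_claim(evidence_type, direction, cue_count=0,
                has_effect_size=False, quant_signals=0, conflicts=0):
    score = BASE[evidence_type] + DIRECTION_BONUS[direction]
    score += min(0.03 * cue_count, 0.15)               # cue density
    if has_effect_size:
        score += 0.14                                   # effect size
    else:
        score += min(0.025 * quant_signals, 0.10)       # other quantitative signals
    has_quant = has_effect_size or quant_signals > 0    # assumption, see lead-in
    if evidence_type in ("meta_analysis", "replicated") and has_quant:
        score += 0.06                                   # study design bonus
    score -= 0.08 * conflicts                           # conflict penalty
    return round(min(max(score, 0.0), 1.0), 2)          # clamp is an assumption

print(score_claim("single_study", "positive", cue_count=1))  # 0.63
print(score_claim("replicated", "positive", cue_count=3,
                  has_effect_size=True, quant_signals=2))    # 1.0
```

The second example sums to 0.68 + 0.12 + 0.09 + 0.14 + 0.06 = 1.09 before clamping, which is consistent with the "~0.95+" figure above.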

O.12 First-Order Aggregation

Claims are grouped by their crop × parameter × nutrient triple. For each triple, the engine computes:

Total claim count and supporting paper count
Dominant direction (most common across claims)
Direction distribution (how many positive vs. negative vs. nonlinear)
Mean confidence (average across claims)
Confidence-weighted average effect size
Effect range and variance flag (>30% range = high variance)
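A minimal sketch of this aggregation, assuming each claim is a dictionary with `paper_id`, `direction`, `confidence`, and an optional `effect_pct`, and reading the variance flag as a range above 30 percentage points (both are assumptions):

```python
from collections import Counter, defaultdict

def aggregate(claims):
    groups = defaultdict(list)
    for claim in claims:
        groups[(claim["crop"], claim["parameter"], claim["nutrient"])].append(claim)
    summary = {}
    for triple, group in groups.items():
        directions = Counter(c["direction"] for c in group)
        sized = [c for c in group if c.get("effect_pct") is not None]
        weight = sum(c["confidence"] for c in sized)
        effects = [c["effect_pct"] for c in sized]
        effect_range = max(effects) - min(effects) if effects else None
        summary[triple] = {
            "claim_count": len(group),
            "paper_count": len({c["paper_id"] for c in group}),
            "dominant_direction": directions.most_common(1)[0][0],
            "direction_distribution": dict(directions),
            "mean_confidence": sum(c["confidence"] for c in group) / len(group),
            "weighted_effect_pct": (
                sum(c["confidence"] * c["effect_pct"] for c in sized) / weight
                if weight else None
            ),
            "effect_range": effect_range,
            "high_variance": effect_range is not None and effect_range > 30,
        }
    return summary

base = {"crop": "spinach", "parameter": "light_intensity", "nutrient": "vitamin_c"}
claims = [
    {**base, "paper_id": "p1", "direction": "positive", "confidence": 0.8, "effect_pct": 20},
    {**base, "paper_id": "p2", "direction": "positive", "confidence": 0.6, "effect_pct": 10},
    {**base, "paper_id": "p2", "direction": "negative", "confidence": 0.5, "effect_pct": None},
]
result = aggregate(claims)[("spinach", "light_intensity", "vitamin_c")]
```

In this example the weighted effect is (0.8 × 20 + 0.6 × 10) / 1.4, roughly 15.7%, so the higher-confidence claim pulls the average toward its value.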

O.13 Second-Order Links (Nutrient Pairs)

For each crop, the engine looks at all pairs of nutrients that share evidence under the same growing parameters. If spinach has claims for both iron and vitamin C under similar light conditions, those nutrients form a second-order link.

Links are scored by: combined claim support × average confidence × density balance × bioavailability multiplier (from curated pair rules that encode known nutrient synergies, e.g., vitamin C enhancing non-heme iron absorption).
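The product above can be written out as a small function. The source does not define "density balance", so this sketch assumes it is the ratio of the smaller claim count to the larger one; the multiplier value in the example is also made up.

```python
def link_score(claims_a, claims_b, mean_conf_a, mean_conf_b,
               bioavailability_multiplier=1.0):
    """Score a nutrient pair; density balance is an assumed min/max ratio."""
    if claims_a <= 0 or claims_b <= 0:
        return 0.0  # no shared evidence, no link
    support = claims_a + claims_b
    average_confidence = (mean_conf_a + mean_conf_b) / 2
    density_balance = min(claims_a, claims_b) / max(claims_a, claims_b)
    return support * average_confidence * density_balance * bioavailability_multiplier

# Iron (6 claims, mean confidence 0.6) with vitamin C (3 claims, 0.7),
# and a hypothetical 1.2x multiplier from the pair rules.
score = link_score(6, 3, 0.6, 0.7, bioavailability_multiplier=1.2)
print(round(score, 2))  # 3.51
```

Under this reading, a pair with lopsided evidence (many claims for one nutrient, few for the other) scores lower than a balanced pair with the same total support.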

O.14 Third-Order Links (Nutrient Triads)

When two high-scoring second-order pairs share a common “anchor” nutrient, the engine forms a triad — a three-way nutrient cluster that may indicate a shared metabolic pathway. These are rare (currently 199 identified) but represent the highest-value discovery targets for protocol optimization.
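Triad formation can be sketched as a search for pairs that overlap in exactly one nutrient. The `min_score` threshold is hypothetical, since the source does not say what counts as "high-scoring".

```python
from itertools import combinations

def find_triads(pair_scores, min_score):
    """Return (anchor, (other_a, other_b)) for strong pairs sharing one nutrient."""
    strong = [frozenset(pair) for pair, score in pair_scores.items() if score >= min_score]
    triads = set()
    for first, second in combinations(strong, 2):
        shared = first & second
        if len(shared) == 1:
            (anchor,) = shared
            others = tuple(sorted((first | second) - shared))
            triads.add((anchor, others))
    return triads

pair_scores = {
    ("iron", "vitamin_c"): 3.5,
    ("vitamin_c", "folate"): 2.8,
    ("iron", "zinc"): 0.4,  # below the threshold, so it cannot anchor a triad
}
triads = find_triads(pair_scores, min_score=1.0)
print(triads)  # {('vitamin_c', ('folate', 'iron'))}
```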

O.15 Parameter Interactions

The engine also identifies parameter pairs that co-modulate the same nutrient within a crop. For example, if both light intensity and UV-B exposure independently affect anthocyanin content in basil, their interaction may be worth testing as a combined protocol.

O.16 Dose-Response Modeling

For triples with numeric effect sizes from 2+ papers, the engine fits a linear summary model: mean effect, median, standard deviation, range, coefficient of variation, and direction consistency.

Relationships are classified by strength:

Strong — 3+ data points, ≥80% direction consistency, mean effect ≥10%, CV ≤1.2
Moderate — ≥70% consistency, mean effect ≥5%, CV ≤1.5
Weak — ≥50% consistency
Conflicting — inconsistent directions across studies
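The summary statistics and strength thresholds can be sketched as follows. This assumes effect sizes are percentage magnitudes, direction is tracked separately per data point, and consistency is the share of the most common direction; these are interpretations, not confirmed details.

```python
from collections import Counter
from statistics import mean, median, stdev

def summarize(effects_pct, directions):
    average = mean(effects_pct)
    spread = stdev(effects_pct) if len(effects_pct) > 1 else 0.0
    top_count = Counter(directions).most_common(1)[0][1]
    return {
        "n": len(effects_pct),
        "mean": average,
        "median": median(effects_pct),
        "sd": spread,
        "range": max(effects_pct) - min(effects_pct),
        "cv": spread / abs(average) if average else float("inf"),
        "consistency": top_count / len(directions),
    }

def classify(s):
    if s["n"] >= 3 and s["consistency"] >= 0.8 and abs(s["mean"]) >= 10 and s["cv"] <= 1.2:
        return "strong"
    if s["consistency"] >= 0.7 and abs(s["mean"]) >= 5 and s["cv"] <= 1.5:
        return "moderate"
    if s["consistency"] >= 0.5:
        return "weak"
    return "conflicting"

print(classify(summarize([12, 18, 15], ["positive"] * 3)))  # strong
print(classify(summarize([6, 9], ["positive"] * 2)))        # moderate
```

The second example is consistent and above the 5% mean threshold, but with only two data points it cannot reach the strong tier.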

O.17 Deficiency-Gap Prioritization

Nutrients are ranked by research priority based on human deficiency prevalence. The engine cross-references its evidence base against a curated list of global micronutrient deficiencies (iron, zinc, folate, vitamin D, etc.) to identify where growing protocol optimization could have the most public health impact.

Priority score = deficiency weight × (1 + 0.6 × evidence depth) × unmet bonus (1.22× for nutrients with zero current evidence).
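As a worked example of the formula, the sketch below assumes evidence depth is normalized to the 0 to 1 range and that the unmet bonus is 1.0 whenever any claims exist. The deficiency weights used here are invented for illustration.

```python
def priority_score(deficiency_weight, evidence_depth, claim_count):
    """Priority = deficiency weight x (1 + 0.6 x evidence depth) x unmet bonus."""
    unmet_bonus = 1.22 if claim_count == 0 else 1.0
    return deficiency_weight * (1 + 0.6 * evidence_depth) * unmet_bonus

# Hypothetical inputs: a well-studied nutrient and one with no evidence yet.
studied = priority_score(deficiency_weight=0.9, evidence_depth=0.5, claim_count=4)
unstudied = priority_score(deficiency_weight=0.8, evidence_depth=0.0, claim_count=0)
print(round(studied, 3), round(unstudied, 3))  # 1.17 0.976
```

The unmet bonus lifts nutrients with no evidence so that they are not ranked purely by how much has already been published about them.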

O.18 When Papers Disagree

When the same crop × parameter × nutrient triple has claims in opposite directions (one paper says “increased,” another says “decreased”), the engine flags a conflict set. All claims in a conflict receive a confidence penalty proportional to the number of opposing claims.

Conflicts are not failures — they often represent real scientific complexity: dose-dependent responses, cultivar differences, or interaction effects. The dashboard surfaces them explicitly so researchers can investigate the underlying conditions.
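A small sketch of conflict detection for one triple, using the −0.08 per opposing claim figure from the scoring rules. Treating only positive and negative as opposing directions is an assumption.

```python
OPPOSITE = {"positive": "negative", "negative": "positive"}
PENALTY_PER_CONFLICT = 0.08

def conflict_penalties(directions):
    """Return the confidence penalty for each claim on a single triple."""
    penalties = []
    for direction in directions:
        opposing = directions.count(OPPOSITE[direction]) if direction in OPPOSITE else 0
        penalties.append(round(PENALTY_PER_CONFLICT * opposing, 2))
    return penalties

def has_conflict(directions):
    return "positive" in directions and "negative" in directions

directions = ["positive", "positive", "negative", "unknown"]
print(has_conflict(directions))        # True
print(conflict_penalties(directions))  # [0.08, 0.08, 0.16, 0.0]
```

The lone negative claim takes the largest penalty because it is opposed by two claims, while each positive claim is opposed by only one.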

O.19 Knowledge Base Structure

The ontology is the engine’s vocabulary. It defines every entity the system can recognize, stored in human-readable YAML files:

Ontology file manifest (7 YAML files)

crops.yaml — 56 crops with species names, growth types, cycle lengths, and aliases
nutrients.yaml — 81 nutrients with units, categories, chemical forms, and aliases
parameters.yaml — 55 growing parameters with domains, units, and aliases
nutrient_pair_rules.yaml — known synergy/antagonism rules between nutrient pairs with bioavailability multipliers
parameter_interaction_rules.yaml — known parameter co-modulation effects
human_deficiency_rules.yaml — global deficiency prevalence weights for prioritization
verticals.yaml — market vertical groupings (which crops, nutrients, parameters belong to each vertical)

The ontology is decoupled from code. Adding a new crop or nutrient means adding a YAML entry with aliases — the entire pipeline automatically picks it up on the next scan.

O.20 Batch Sweep Architecture

A full scan is divided into 17 crop batches (spinach & kale, microgreens, adaptogens, tea varieties, mushrooms, etc.). Each batch runs as an isolated subprocess with its own query scope, paper collection, and claim set.

After all batches complete, results are merged incrementally: claims are deduplicated by ID, papers by source ID, and lane statuses are accumulated. This provides fault tolerance (one batch can fail without losing others) and allows the dashboard to update after each batch finishes.
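The incremental merge can be sketched as below. The batch result layout, the use of `None` to represent a failed batch, and the choice to accumulate lane statuses as lists are all illustrative assumptions.

```python
def merge_batches(batch_results):
    """Merge (batch_name, result) pairs; a result of None marks a failed batch."""
    merged = {"claims": {}, "papers": {}, "lane_statuses": {}, "failed": []}
    for name, result in batch_results:
        if result is None:
            merged["failed"].append(name)  # one failure does not lose the others
            continue
        for claim in result["claims"]:
            merged["claims"].setdefault(claim["id"], claim)           # dedupe by claim ID
        for paper in result["papers"]:
            merged["papers"].setdefault(paper["source_id"], paper)    # dedupe by source ID
        for lane, status in result["lane_statuses"].items():
            merged["lane_statuses"].setdefault(lane, []).append(status)
    return merged

batches = [
    ("spinach_kale", {
        "claims": [{"id": "c1"}, {"id": "c2"}],
        "papers": [{"source_id": "pmc:1"}],
        "lane_statuses": {"spinach|vitamin_c|light": "claimed"},
    }),
    ("mushrooms", None),  # simulated subprocess failure
    ("microgreens", {
        "claims": [{"id": "c2"}, {"id": "c3"}],
        "papers": [{"source_id": "pmc:1"}, {"source_id": "oa:7"}],
        "lane_statuses": {"spinach|vitamin_c|light": "scanned, no claim"},
    }),
]
merged = merge_batches(batches)
```

Because each batch is folded in one at a time, the same function could be called after every batch to refresh the dashboard rather than once at the end.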

O.21 Why Regex + Heuristics (Not Just AI)

The core extraction pipeline is deterministic: the same input always produces the same output. This is a deliberate choice:

Reproducibility — every claim can be traced to exact cue words in exact sentences
Auditability — no black-box model decisions; the scoring formula is explicit
Cost — regex extraction is free; LLM extraction has per-call costs
Speed — regex processes thousands of papers in minutes

LLM extraction (Tier 3) runs on top of regex results — resolving unknowns, adjudicating conflicts, and handling complex language that regex cannot parse. The tiers complement each other: regex provides breadth and reproducibility, LLM provides depth and semantic understanding.

O.22 What This Is Not

This engine does not replace peer review, meta-analysis, or domain expertise. Confidence scores are extraction certainty metrics, not statistical significance. “Strong evidence” in dose-response means consistent direction across multiple papers — not clinical-trial-grade proof. The output identifies where the literature has converged, where it disagrees, and where evidence is missing.
