Glass Biome · Methodology
How the Engine Works
Our literature-to-protocol research engine reads thousands of published scientific papers, extracts specific claims about how growing conditions affect crop nutrition, scores them for reliability, discovers hidden relationships between nutrients, and produces protocol-ready recommendations for growers.
Glass Biome is based in Davis. UC Davis has the #1 US agriculture program. The USDA-ARS Western Human Nutrition Research Center is on campus — the only West Coast USDA human-nutrition lab. AI-extracted claims are validated at WHNRC before being released as cultivation protocols.
O.01 The Core Question
For any crop we grow, we want to know: which growing parameters (light, pH, salinity, humidity, harvest timing, etc.) change which nutrients (vitamins, antioxidants, alkaloids, terpenes, etc.) — by how much, and in which direction?
The answer lives scattered across tens of thousands of published papers. No human can read them all. AI scans research papers, extracts claims about how growing conditions affect nutrient outcomes, and scores them. Top findings get validated at the USDA-ARS lab on campus.
O.02 Current Scale
Coverage is organized into 12 market verticals: longevity & anti-aging, nootropic & cognitive, immune & adaptogenic, precision nutrition, dermatological actives, flavor & aroma chemistry, pigment & colorant, stimulant & alkaloid, essential oil & terpene, sweetener & functional food, marine & aquaculture, and rare & ultra-premium.
O.03 Five-Stage Flow
Every research scan follows the same deterministic sequence:
O.04 Query Generation
Search queries are built from the ontology, a curated knowledge base of crops, nutrients, parameters, and their known relationships. For each crop, the ontology identifies target nutrients (e.g., spinach targets iron, folate, vitamin C, nitrates) and priority parameters (e.g., light intensity, salinity, harvest timing).
Queries are structured as crop × nutrient × parameter triples. A full scan produces ~1,131 query lanes. Each lane is a specific question: “How does light intensity affect vitamin C in spinach?”
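As a minimal sketch of how lanes could be enumerated, the snippet below crosses each crop's target nutrients with its priority parameters. The `CROP_TARGETS` dictionary and the `build_lanes` function are hypothetical names used for illustration; only the spinach entries and the question wording come from the text above.

```python
from itertools import product

# Hypothetical ontology excerpt; the real entries live in the ontology files.
CROP_TARGETS = {
    "spinach": {
        "nutrients": ["iron", "folate", "vitamin C", "nitrates"],
        "parameters": ["light intensity", "salinity", "harvest timing"],
    },
}

def build_lanes(targets):
    """Return one (crop, nutrient, parameter, question) tuple per query lane."""
    lanes = []
    for crop, spec in targets.items():
        # Only the crop's own target nutrients and priority parameters are crossed.
        for nutrient, parameter in product(spec["nutrients"], spec["parameters"]):
            question = f"How does {parameter} affect {nutrient} in {crop}?"
            lanes.append((crop, nutrient, parameter, question))
    return lanes

lanes = build_lanes(CROP_TARGETS)
```

Restricting the product to per-crop targets is what keeps the lane count in the low thousands rather than the full crop × nutrient × parameter space.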
O.05 Multi-Source Paper Fetching
Queries are sent to three scientific literature APIs simultaneously:
- Europe PMC — Primary source. Structured field searches across title and abstract. Supports full-text XML access for open-access papers.
- Semantic Scholar — Broad relevance-ranked keyword search. Provides citation context and open-access PDF links.
- OpenAlex — Open bibliographic data. Reconstructed abstracts from inverted index, high coverage of recent publications.
After fetching, Unpaywall enrichment looks up DOIs to find free full-text versions. Papers with PDFs can optionally be processed through GROBID, which converts PDFs into structured XML with sections, tables, and figures.
Deduplication removes exact duplicates by content hash and DOI matching. The result is a unified paper collection with each record tagged by ingestion depth: full text, abstract only, or title only.
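The deduplication step could look roughly like the sketch below. It assumes each paper is a dictionary with optional `doi`, `title`, and `abstract` keys, and that the content hash is taken over whitespace-normalized, lowercased title plus abstract; those field names and the normalization are assumptions, not the engine's actual schema.

```python
import hashlib

def content_hash(text):
    """SHA-256 of the lowercased text with runs of whitespace collapsed."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def deduplicate(papers):
    """Keep the first record seen for each DOI and each content hash."""
    seen_dois, seen_hashes, unique = set(), set(), []
    for paper in papers:
        doi = (paper.get("doi") or "").strip().lower()
        digest = content_hash(paper.get("title", "") + " " + paper.get("abstract", ""))
        if (doi and doi in seen_dois) or digest in seen_hashes:
            continue  # duplicate by DOI or by content
        if doi:
            seen_dois.add(doi)
        seen_hashes.add(digest)
        unique.append(paper)
    return unique
```

Checking both keys matters because the same paper can arrive from one source with a DOI and from another without one.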
O.06 Entity Matching
Each paper’s text is split into sentences. For every sentence, the engine searches for mentions of crops, parameters, and nutrients using alias-based word-boundary matching. Each entity in the ontology has multiple aliases (e.g., “vitamin C”, “ascorbic acid”, “L-ascorbate” all map to vitamin_c).
When an entity is missing from a sentence but present in a neighboring sentence (±2), the engine can borrow it — but only if the sentences share at least one other entity. This prevents cross-paragraph false associations.
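The matching and borrowing rules can be sketched as follows. The alias table, the function names, and the nearest-neighbor-first search order are illustrative assumptions; the ±2 window and the shared-entity requirement come from the description above.

```python
import re

# Hypothetical alias table; the real aliases live in the ontology.
ALIASES = {
    ("nutrient", "vitamin_c"): ["vitamin C", "ascorbic acid", "L-ascorbate"],
    ("crop", "spinach"): ["spinach", "Spinacia oleracea"],
    ("parameter", "light_intensity"): ["light intensity", "PPFD"],
}

def compile_patterns(aliases):
    """Build one case-insensitive, word-bounded pattern per entity."""
    patterns = {}
    for key, names in aliases.items():
        alternation = "|".join(re.escape(n) for n in sorted(names, key=len, reverse=True))
        patterns[key] = re.compile(rf"(?<!\w)(?:{alternation})(?!\w)", re.IGNORECASE)
    return patterns

PATTERNS = compile_patterns(ALIASES)

def match_entities(sentence):
    """Return {kind: set of entity ids} for entities mentioned in the sentence."""
    found = {}
    for (kind, entity_id), pattern in PATTERNS.items():
        if pattern.search(sentence):
            found.setdefault(kind, set()).add(entity_id)
    return found

def borrow(sentences, index, kind, window=2):
    """Borrow a missing entity kind from a neighbor that shares another entity."""
    own = match_entities(sentences[index])
    if kind in own:
        return own[kind]
    own_ids = {(k, e) for k, ids in own.items() for e in ids}
    # Try the nearest neighbors first (assumption), staying inside the window.
    for offset in sorted(range(-window, window + 1), key=abs):
        j = index + offset
        if offset == 0 or not 0 <= j < len(sentences):
            continue
        other = match_entities(sentences[j])
        other_ids = {(k, e) for k, ids in other.items() for e in ids}
        if kind in other and own_ids & other_ids:
            return other[kind]
    return set()
```

A sentence with no entities of its own can never borrow, because it has nothing to share with its neighbors.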
O.07 Direction Inference
For each crop × parameter × nutrient triple found, AI models read the surrounding text and determine the direction of the relationship:
- Positive — the parameter increases the nutrient (cues: “increased”, “higher”, “enhanced”, “promoted”, etc.)
- Negative — the parameter decreases the nutrient (cues: “decreased”, “lower”, “reduced”, “inhibited”, etc.)
- Nonlinear — the relationship is dose-dependent or biphasic (cues: “quadratic”, “U-shaped”, “bell-shaped”)
- Unknown — entities are mentioned together but no clear directional signal found
Negation detection flips direction when terms like “not”, “did not”, or “failed to” appear within 3 words of a direction cue. If both positive and negative cues are present, the claim is marked nonlinear.
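A cue-word version of these rules might look like the sketch below. The cue sets are limited to the examples listed above, the negation window is assumed to cover the three words before a cue, and checking nonlinear cues first is also an assumption.

```python
import re

POSITIVE = {"increased", "higher", "enhanced", "promoted"}
NEGATIVE = {"decreased", "lower", "reduced", "inhibited"}
NONLINEAR = ("quadratic", "u-shaped", "bell-shaped")
NEGATION = re.compile(r"\b(?:not|failed to)\b")  # "did not" is covered by "not"

def infer_direction(text):
    """Classify a passage as positive, negative, nonlinear, or unknown."""
    lowered = text.lower()
    if any(cue in lowered for cue in NONLINEAR):
        return "nonlinear"
    tokens = re.findall(r"[a-z]+", lowered)
    signs = set()
    for i, token in enumerate(tokens):
        if token in POSITIVE:
            sign = 1
        elif token in NEGATIVE:
            sign = -1
        else:
            continue
        # Flip the sign when a negator appears in the three preceding words.
        if NEGATION.search(" ".join(tokens[max(0, i - 3):i])):
            sign = -sign
        signs.add(sign)
    if signs == {1}:
        return "positive"
    if signs == {-1}:
        return "negative"
    if signs == {1, -1}:
        return "nonlinear"  # mixed cues are treated as nonlinear
    return "unknown"
```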
O.08 Effect Size & Evidence Type
Quantitative signals are extracted from the text by AI: percentage changes (+15%), fold changes (2.3x), concentration units (mg/kg, µmol), and p-values (p<0.05).
The first percentage found becomes the claim’s effect size.
Evidence type is classified from study design keywords:
- Meta-analysis — systematic reviews, PRISMA, Cochrane, pooled analyses
- Replicated — RCTs, field trials, factorial experiments, multi-site studies
- Single study — individual experiments (default)
- Gray literature — conference proceedings, theses, posters
- Theoretical — in silico, computational modeling, hypothesis papers
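For illustration, the signal extraction and study-design classification could be approximated with regular expressions as below. The patterns, the label names, and the order in which evidence types are checked are assumptions; the signal kinds, the first-percentage rule, and the single-study default follow the text above.

```python
import re

PERCENT = re.compile(r"([+\-−]?\d+(?:\.\d+)?)\s?%")
FOLD = re.compile(r"(\d+(?:\.\d+)?)\s?(?:x\b|-?fold)", re.IGNORECASE)
P_VALUE = re.compile(r"\bp\s*[<=]\s*(0?\.\d+)", re.IGNORECASE)

# Checked in this order (assumption); the first matching label wins.
EVIDENCE_PATTERNS = [
    ("meta_analysis", r"meta-analys[ei]s|systematic review|prisma|cochrane|pooled analys[ei]s"),
    ("replicated", r"rct|randomi[sz]ed controlled|field trials?|factorial|multi-site"),
    ("gray", r"conference proceedings|thes[ei]s|posters?"),
    ("theoretical", r"in silico|computational model(?:ing|s)?|hypothesis paper"),
]
EVIDENCE = [(label, re.compile(rf"\b(?:{p})\b", re.IGNORECASE)) for label, p in EVIDENCE_PATTERNS]

def extract_signals(text):
    """Pull percentages, fold changes, and p-values out of a passage."""
    percents = [float(m.replace("−", "-")) for m in PERCENT.findall(text)]
    return {
        "effect_size_pct": percents[0] if percents else None,  # first percentage wins
        "fold_changes": [float(m) for m in FOLD.findall(text)],
        "p_values": [float(m) for m in P_VALUE.findall(text)],
    }

def classify_evidence(text):
    """Return the evidence type, defaulting to a single study."""
    for label, pattern in EVIDENCE:
        if pattern.search(text):
            return label
    return "single_study"
```

Word boundaries in the evidence patterns keep "photosynthesis" from matching the "thesis" keyword.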
O.09 Three-Tier Extraction
The engine uses three extraction methods with decreasing precision but increasing recall:
Tier 1: Standard regex extraction. Sentence-level entity co-occurrence. All three entities (crop, parameter, nutrient) must appear in the sentence or its ±2 neighbors. Highest precision. Free and fast.
Tier 2: Deep re-extraction. Paper-level entity context. Targets “scanned, no claim” lanes where papers were found but standard extraction failed. Relaxes co-occurrence to paper level with a 0.75× confidence penalty. Recovers claims from papers that discuss entities across different sections.
Tier 3: LLM-powered extraction. Uses a language model to read abstracts and extract structured claims. Targets unknown-direction claims, unresolved conflicts, and high-value empty lanes. Highest recall and semantic understanding. Per-call API cost.
O.10 Quality Gates
Every claim must pass validation before entering the database:
- Minimum source length — source text must be ≥40 characters (rejects truncation artifacts)
- Entity presence — at least the crop or the nutrient must actually appear in the source sentence (rejects neighbor-borrowing false positives)
- Parameter context — if the parameter was borrowed from a neighbor sentence and no experimental language is detected (“treatment”, “effect”, “exposure”, etc.), the claim receives a 0.6× confidence penalty
- Deduplication — no duplicate (paper, crop, parameter, nutrient, direction) combinations within a document
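The four gates above can be sketched as a single filter function. The claim field names (`source_text`, `crop_in_sentence`, `parameter_borrowed`, and so on) are hypothetical, and the list of experimental terms is limited to the three examples given.

```python
EXPERIMENTAL_TERMS = ("treatment", "effect", "exposure")

def apply_quality_gates(claim, seen_keys):
    """Return the claim (possibly penalized), or None if it is rejected."""
    source = claim["source_text"]
    if len(source) < 40:
        return None  # likely a truncation artifact
    if not (claim["crop_in_sentence"] or claim["nutrient_in_sentence"]):
        return None  # neither crop nor nutrient is in the source sentence
    key = (claim["paper_id"], claim["crop"], claim["parameter"],
           claim["nutrient"], claim["direction"])
    if key in seen_keys:
        return None  # duplicate within the document
    seen_keys.add(key)
    result = dict(claim)
    has_context = any(term in source.lower() for term in EXPERIMENTAL_TERMS)
    if claim["parameter_borrowed"] and not has_context:
        result["confidence"] = claim["confidence"] * 0.6
    return result
```

The caller passes one `seen_keys` set per document so that the duplicate check stays scoped to that document.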
O.11 How Claims Are Scored
Every claim receives a confidence score between 0 and 1. This is not a p-value — it reflects extraction certainty based on multiple signals:
- Evidence type base — meta-analysis (0.82) > replicated (0.68) > single study (0.48) > gray (0.36) > theoretical (0.26)
- Direction signal — +0.12 for explicit positive/negative, +0.10 for nonlinear, +0 for unknown
- Cue density — +0.03 per directional cue word found (max +0.15)
- Effect size — +0.14 if a percentage/numeric effect was extracted
- Quantitative signals — +0.025 per unit/p-value found if no effect size (max +0.10)
- Study design bonus — +0.06 if meta-analysis or replicated AND has quantitative signals
- Conflict penalty — −0.08 per conflicting claim on the same triple
Result: a single-study abstract with one “increased” keyword and no numbers scores ~0.63. A replicated field trial with an effect size, 3 directional cues, and p-values scores ~0.95+. The spread is intentional — it separates “mentioned together” from “quantitatively demonstrated.”
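The arithmetic behind these two examples can be reproduced with a short function. The constants come from the list above; the function name, the argument names, and the final clamp to the 0 to 1 range are assumptions (the replicated example sums to 1.09 before clamping).

```python
BASE = {"meta_analysis": 0.82, "replicated": 0.68, "single_study": 0.48,
        "gray": 0.36, "theoretical": 0.26}
DIRECTION_BONUS = {"positive": 0.12, "negative": 0.12, "nonlinear": 0.10, "unknown": 0.0}

def score_claim(evidence_type, direction, cue_count=0, has_effect_size=False,
                quant_signals=0, conflicts=0):
    """Combine the scoring signals into a confidence between 0 and 1."""
    score = BASE[evidence_type] + DIRECTION_BONUS[direction]
    score += min(0.03 * cue_count, 0.15)                # cue density, capped
    if has_effect_size:
        score += 0.14
    else:
        score += min(0.025 * quant_signals, 0.10)       # only without an effect size
    has_quant = has_effect_size or quant_signals > 0
    if evidence_type in ("meta_analysis", "replicated") and has_quant:
        score += 0.06                                   # study design bonus
    score -= 0.08 * conflicts                           # conflict penalty
    return round(min(max(score, 0.0), 1.0), 2)          # clamp is an assumption

single = score_claim("single_study", "positive", cue_count=1)
replicated = score_claim("replicated", "positive", cue_count=3,
                         has_effect_size=True, quant_signals=2)
```

Here `single` works out to 0.48 + 0.12 + 0.03 = 0.63, matching the first example, and `replicated` reaches the upper clamp.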
O.12 First-Order Aggregation
Claims are grouped by their crop × parameter × nutrient triple. For each triple, the engine computes:
- Total claim count and supporting paper count
- Dominant direction (most common across claims)
- Direction distribution (how many positive vs. negative vs. nonlinear)
- Mean confidence (average across claims)
- Confidence-weighted average effect size
- Effect range and variance flag (>30% range = high variance)
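A minimal sketch of this grouping step follows. The claim field names are hypothetical, and the variance flag is interpreted here as a spread of more than 30 percentage points between the smallest and largest effect, which is one reading of the rule above.

```python
from collections import Counter, defaultdict

def aggregate(claims):
    """Summarize claims per (crop, parameter, nutrient) triple."""
    groups = defaultdict(list)
    for c in claims:
        groups[(c["crop"], c["parameter"], c["nutrient"])].append(c)
    summary = {}
    for triple, group in groups.items():
        directions = Counter(c["direction"] for c in group)
        sized = [c for c in group if c.get("effect_pct") is not None]
        weight = sum(c["confidence"] for c in sized)
        effects = [c["effect_pct"] for c in sized]
        effect_range = max(effects) - min(effects) if effects else None
        summary[triple] = {
            "claims": len(group),
            "papers": len({c["paper_id"] for c in group}),
            "dominant_direction": directions.most_common(1)[0][0],
            "direction_distribution": dict(directions),
            "mean_confidence": sum(c["confidence"] for c in group) / len(group),
            # Effects are weighted by the confidence of the claim reporting them.
            "weighted_effect_pct": (
                sum(c["confidence"] * c["effect_pct"] for c in sized) / weight
                if weight else None
            ),
            "effect_range": effect_range,
            "high_variance": effect_range is not None and effect_range > 30,
        }
    return summary
```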
O.13 Second-Order Links (Nutrient Pairs)
For each crop, the engine looks at all pairs of nutrients that share evidence under the same growing parameters. If spinach has claims for both iron and vitamin C under similar light conditions, those nutrients form a second-order link.
Links are scored by: combined claim support × average confidence × density balance × bioavailability multiplier (from curated pair rules that encode known nutrient synergies, e.g., vitamin C enhancing non-heme iron absorption).
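As a rough sketch of the product above, the function below multiplies the four factors. The source does not define how each factor is computed, so every definition here is an assumption: combined support is taken as the sum of the two nutrients' claim counts, average confidence as the mean of their mean confidences, and density balance as the ratio of the smaller claim count to the larger.

```python
def link_score(claims_a, claims_b, mean_conf_a, mean_conf_b, bioavailability=1.0):
    """Score a nutrient pair; all factor definitions are illustrative assumptions."""
    if claims_a <= 0 or claims_b <= 0:
        return 0.0  # no shared evidence, no link
    support = claims_a + claims_b                              # combined claim support
    avg_conf = (mean_conf_a + mean_conf_b) / 2                 # average confidence
    balance = min(claims_a, claims_b) / max(claims_a, claims_b)  # density balance
    return support * avg_conf * balance * bioavailability
```

Under these definitions a pair with lopsided evidence (many claims for one nutrient, few for the other) scores lower than a balanced pair with the same total support.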
O.14 Third-Order Links (Nutrient Triads)
When two high-scoring second-order pairs share a common “anchor” nutrient, the engine forms a triad — a three-way nutrient cluster that may indicate a shared metabolic pathway. These are rare (currently 199 identified) but represent the highest-value discovery targets for protocol optimization.
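Triad formation can be sketched as a search for pairs that overlap in exactly one nutrient. The `threshold` that defines a "high-scoring" pair is a hypothetical parameter, as is the representation of pairs as frozensets.

```python
from itertools import combinations

def form_triads(pair_scores, threshold):
    """pair_scores maps frozenset({nutrient_a, nutrient_b}) to a link score."""
    strong = [pair for pair, score in pair_scores.items() if score >= threshold]
    triads = set()
    for p1, p2 in combinations(strong, 2):
        shared = p1 & p2
        if len(shared) == 1:  # exactly one common anchor nutrient
            anchor = next(iter(shared))
            others = tuple(sorted((p1 | p2) - shared))
            triads.add((anchor, others))
    return triads

pair_scores = {
    frozenset({"iron", "vitamin_c"}): 0.9,
    frozenset({"iron", "folate"}): 0.8,
    frozenset({"zinc", "folate"}): 0.2,
}
triads = form_triads(pair_scores, threshold=0.5)
```

In this toy input, iron anchors the only triad because the zinc and folate pair falls below the threshold.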
O.15 Parameter Interactions
The engine also identifies parameter pairs that co-modulate the same nutrient within a crop. For example, if both light intensity and UV-B exposure independently affect anthocyanin content in basil, their interaction may be worth testing as a combined protocol.
O.16 Dose-Response Modeling
For triples with numeric effect sizes from 2+ papers, the engine fits a linear summary model: mean effect, median, standard deviation, range, coefficient of variation, and direction consistency.
Relationships are classified by strength:
- Strong — 3+ data points, ≥80% direction consistency, mean effect ≥10%, CV ≤1.2
- Moderate — ≥70% consistency, mean effect ≥5%, CV ≤1.5
- Weak — ≥50% consistency
- Conflicting — inconsistent directions across studies
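The summary statistics and strength thresholds can be sketched as below. Two points are assumptions rather than stated rules: direction consistency is taken as the share of papers reporting the most common direction, and the mean-effect thresholds are applied to the absolute value of the signed mean.

```python
from collections import Counter
from statistics import mean, median, stdev

def summarize(observations):
    """observations: list of (direction, signed_effect_pct), one per paper, 2+ entries."""
    effects = [effect for _, effect in observations]
    directions = Counter(direction for direction, _ in observations)
    avg = mean(effects)
    sd = stdev(effects)
    return {
        "n": len(effects),
        "mean": avg,
        "median": median(effects),
        "sd": sd,
        "range": (min(effects), max(effects)),
        "cv": sd / abs(avg) if avg else float("inf"),
        "consistency": directions.most_common(1)[0][1] / len(effects),
    }

def classify(summary):
    """Apply the strength thresholds from strongest to weakest."""
    s = summary
    if s["n"] >= 3 and s["consistency"] >= 0.8 and abs(s["mean"]) >= 10 and s["cv"] <= 1.2:
        return "strong"
    if s["consistency"] >= 0.7 and abs(s["mean"]) >= 5 and s["cv"] <= 1.5:
        return "moderate"
    if s["consistency"] >= 0.5:
        return "weak"
    return "conflicting"
```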
O.17 Deficiency-Gap Prioritization
Nutrients are ranked by research priority based on human deficiency prevalence. The engine cross-references its evidence base against a curated list of global micronutrient deficiencies (iron, zinc, folate, vitamin D, etc.) to identify where growing protocol optimization could have the most public health impact.
Priority score = deficiency weight × (1 + 0.6 × evidence depth) × unmet bonus (1.22× for nutrients with zero current evidence).
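The formula translates directly into code. The sketch assumes that evidence depth is a normalized value supplied by the caller (the source does not define its scale) and that "zero current evidence" means a claim count of zero; the function and argument names are hypothetical.

```python
def priority_score(deficiency_weight, evidence_depth, claim_count):
    """deficiency weight × (1 + 0.6 × evidence depth) × unmet bonus."""
    unmet_bonus = 1.22 if claim_count == 0 else 1.0  # bonus for nutrients with no evidence
    return deficiency_weight * (1 + 0.6 * evidence_depth) * unmet_bonus
```

For example, a nutrient with deficiency weight 0.9 and no evidence scores 0.9 × 1.0 × 1.22 = 1.098, while one with weight 0.5, depth 0.5, and ten claims scores 0.5 × 1.3 × 1.0 = 0.65.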
O.18 When Papers Disagree
When the same crop × parameter × nutrient triple has claims in opposite directions (one paper says “increased,” another says “decreased”), the engine flags a conflict set. All claims in a conflict receive a confidence penalty proportional to the number of opposing claims.
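Conflict flagging might be sketched as follows. The claim field names are hypothetical, and the 0.08 penalty per opposing claim is taken from the scoring section; only positive and negative claims are treated as opposing each other here.

```python
from collections import defaultdict

CONFLICT_PENALTY = 0.08  # per opposing claim, from the scoring section

def flag_conflicts(claims):
    """Return copies of the claims with conflict flags and penalized confidence."""
    groups = defaultdict(list)
    for claim in claims:
        groups[(claim["crop"], claim["parameter"], claim["nutrient"])].append(claim)
    flagged = []
    for group in groups.values():
        positives = sum(c["direction"] == "positive" for c in group)
        negatives = sum(c["direction"] == "negative" for c in group)
        in_conflict = positives > 0 and negatives > 0
        for c in group:
            opposing = 0
            if in_conflict and c["direction"] == "positive":
                opposing = negatives
            elif in_conflict and c["direction"] == "negative":
                opposing = positives
            flagged.append({
                **c,
                "conflict": opposing > 0,
                "opposing_claims": opposing,
                "confidence": max(0.0, c["confidence"] - CONFLICT_PENALTY * opposing),
            })
    return flagged
```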
Conflicts are not failures — they often represent real scientific complexity: dose-dependent responses, cultivar differences, or interaction effects. The dashboard surfaces them explicitly so researchers can investigate the underlying conditions.
O.19 Knowledge Base Structure
The ontology is the engine’s vocabulary. It defines every entity the system can recognize, stored in human-readable YAML files:
Ontology file manifest (7 YAML files)
- crops.yaml — 56 crops with species names, growth types, cycle lengths, and aliases
- nutrients.yaml — 81 nutrients with units, categories, chemical forms, and aliases
- parameters.yaml — 55 growing parameters with domains, units, and aliases
- nutrient_pair_rules.yaml — known synergy/antagonism rules between nutrient pairs with bioavailability multipliers
- parameter_interaction_rules.yaml — known parameter co-modulation effects
- human_deficiency_rules.yaml — global deficiency prevalence weights for prioritization
- verticals.yaml — market vertical groupings (which crops, nutrients, parameters belong to each vertical)
The ontology is decoupled from code. Adding a new crop or nutrient means adding a YAML entry with aliases — the entire pipeline automatically picks it up on the next scan.
O.20 Batch Sweep Architecture
A full scan is divided into 17 crop batches (spinach & kale, microgreens, adaptogens, tea varieties, mushrooms, etc.). Each batch runs as an isolated subprocess with its own query scope, paper collection, and claim set.
After all batches complete, results are merged incrementally: claims are deduplicated by ID, papers by source ID, and lane statuses are accumulated. This provides fault tolerance (one batch can fail without losing others) and allows the dashboard to update after each batch finishes.
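The incremental merge can be sketched as a function that folds one batch result into a running state. The dictionary layout (`claims`, `papers`, `lane_statuses`) and the choice to accumulate lane statuses as a list per lane are assumptions made for illustration.

```python
def new_state():
    """Empty merged state, keyed for fast deduplication."""
    return {"claims": {}, "papers": {}, "lane_statuses": {}}

def merge_batch(state, batch):
    """Fold one batch into the merged state; the first record seen for an ID wins."""
    for claim in batch["claims"]:
        state["claims"].setdefault(claim["id"], claim)
    for paper in batch["papers"]:
        state["papers"].setdefault(paper["source_id"], paper)
    for lane, status in batch["lane_statuses"].items():
        state["lane_statuses"].setdefault(lane, []).append(status)
    return state
```

Because each call only adds to the state, a dashboard can read it after every batch, and a failed batch simply contributes nothing.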
O.21 Why Regex + Heuristics (Not Just AI)
The core extraction pipeline is deterministic: the same input always produces the same output. This is a deliberate choice:
- Reproducibility — every claim can be traced to exact cue words in exact sentences
- Auditability — no black-box model decisions; the scoring formula is explicit
- Cost — regex extraction is free; LLM extraction has per-call costs
- Speed — regex processes thousands of papers in minutes
LLM extraction (Tier 3) runs on top of regex results — resolving unknowns, adjudicating conflicts, and handling complex language that regex cannot parse. The tiers complement each other: regex provides breadth and reproducibility, LLM provides depth and semantic understanding.
O.22 What This Is Not
This engine does not replace peer review, meta-analysis, or domain expertise. Confidence scores are extraction certainty metrics, not statistical significance. “Strong evidence” in dose-response means consistent direction across multiple papers — not clinical-trial-grade proof. The output identifies where the literature has converged, where it disagrees, and where evidence is missing.