Glass Biome · Methodology

How the Engine Works

Our literature-to-protocol research engine reads thousands of published scientific papers, extracts specific claims about how growing conditions affect crop nutrition, scores them for reliability, discovers hidden relationships between nutrients, and produces protocol-ready recommendations for growers.

What it does

Glass Biome is based in Davis. UC Davis has the #1 US agriculture program. The USDA-ARS Western Human Nutrition Research Center is on campus — the only West Coast USDA human-nutrition lab. AI-extracted claims are validated at WHNRC before being released as cultivation protocols.

O.01 The Core Question

For any crop we grow, we want to know: which growing parameters (light, pH, salinity, humidity, harvest timing, etc.) change which nutrients (vitamins, antioxidants, alkaloids, terpenes, etc.) — by how much, and in which direction?

The answer lives scattered across tens of thousands of published papers. No human can read them all. AI scans research papers, extracts claims about how growing conditions affect nutrient outcomes, and scores them. Top findings get validated at the USDA-ARS lab on campus.

O.02 Current Scale

  • Crops: 100
  • Nutrients: 96
  • Parameters: 55
  • Verticals: 28
  • Papers Indexed: 21,851
  • Claims Extracted: 8,635

Crops, nutrients, and parameters are organized into 12 market verticals: longevity & anti-aging, nootropic & cognitive, immune & adaptogenic, precision nutrition, dermatological actives, flavor & aroma chemistry, pigment & colorant, stimulant & alkaloid, essential oil & terpene, sweetener & functional food, marine & aquaculture, and rare & ultra-premium.

The Pipeline

O.03 Five-Stage Flow

Every research scan follows the same deterministic sequence:

  • Stage 1 (Scan): Build queries from the ontology, fetch papers from 3 APIs
  • Stage 2 (Extract): Parse text, match entities, infer direction & effect size
  • Stage 3 (Link): Aggregate triples, discover nutrient pairs & triads
  • Stage 4 (Model): Fit dose-response curves, classify evidence strength
  • Stage 5 (Protocol): Generate grower-ready recommendations with confidence

Stage 1 — Scan

O.04 Query Generation

Search queries are built from the ontology — a curated knowledge base of crops, nutrients, parameters, and their known relationships. For each crop, it identifies target nutrients (e.g., spinach targets iron, folate, vitamin C, nitrates) and priority parameters (e.g., light intensity, salinity, harvest timing).

Queries are structured as crop × nutrient × parameter triples. A full scan produces ~1,131 query lanes. Each lane is a specific question: “How does light intensity affect vitamin C in spinach?”
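The triple expansion described above can be sketched in a few lines of Python. This is an illustration only: the `ONTOLOGY` dictionary, its field names, and `build_query_lanes` are hypothetical stand-ins for the real ontology files, and the question template simply mirrors the example lane quoted above.

```python
from itertools import product

# Hypothetical ontology slice; the real entries live in the ontology files.
ONTOLOGY = {
    "spinach": {
        "nutrients": ["iron", "folate", "vitamin C", "nitrates"],
        "parameters": ["light intensity", "salinity", "harvest timing"],
    },
}

def build_query_lanes(ontology):
    """Return one (crop, nutrient, parameter, question) lane per triple."""
    lanes = []
    for crop, spec in ontology.items():
        for nutrient, parameter in product(spec["nutrients"], spec["parameters"]):
            question = f"How does {parameter} affect {nutrient} in {crop}?"
            lanes.append((crop, nutrient, parameter, question))
    return lanes

lanes = build_query_lanes(ONTOLOGY)
print(len(lanes))  # 4 nutrients x 3 parameters = 12 lanes for this one crop
```

Each lane stays small and self-describing, which makes it easy to track per-lane status later in the pipeline.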

O.05 Multi-Source Paper Fetching

Queries are sent to three scientific literature APIs simultaneously:

After fetching, Unpaywall enrichment looks up DOIs to find free full-text versions. Papers with PDFs can optionally be processed through GROBID, which converts PDFs into structured XML with sections, tables, and figures.

Deduplication removes exact duplicates by content hash and DOI matching. The result is a unified paper collection with each record tagged by ingestion depth: full text, abstract only, or title only.
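A minimal sketch of the deduplication step, assuming each paper is a dictionary with optional `doi`, `title`, and `text` fields (hypothetical field names). Hashing whitespace-normalized, lowercased text and keeping the first record seen are assumptions of this sketch, not confirmed engine behavior.

```python
import hashlib

def content_hash(paper):
    """Hash the lowercased, whitespace-normalized title and text."""
    raw = f"{paper.get('title', '')} {paper.get('text', '')}"
    normalized = " ".join(raw.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def deduplicate(papers):
    """Keep the first record seen for each DOI and each content hash."""
    seen_dois, seen_hashes, unique = set(), set(), []
    for paper in papers:
        doi = (paper.get("doi") or "").strip().lower()
        digest = content_hash(paper)
        if (doi and doi in seen_dois) or digest in seen_hashes:
            continue  # duplicate by DOI or by identical content
        if doi:
            seen_dois.add(doi)
        seen_hashes.add(digest)
        unique.append(paper)
    return unique

papers = [
    {"doi": "10.1/a", "title": "A", "text": "x"},
    {"doi": "10.1/A", "title": "A (preprint)", "text": "y"},  # same DOI, different case
    {"doi": None, "title": "A", "text": "x"},                 # same content, no DOI
    {"doi": None, "title": "B", "text": "z"},
]
unique_papers = deduplicate(papers)
```

A production version might prefer the record with the deepest ingestion level (full text over abstract) rather than the first one encountered.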

Stage 2 — Extract

O.06 Entity Matching

Each paper’s text is split into sentences. For every sentence, the engine searches for mentions of crops, parameters, and nutrients using alias-based word-boundary matching. Each entity in the ontology has multiple aliases (e.g., “vitamin C”, “ascorbic acid”, “L-ascorbate” all map to vitamin_c).

When an entity is missing from a sentence but present in a neighboring sentence (±2), the engine can borrow it — but only if the sentences share at least one other entity. This prevents cross-paragraph false associations.
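The alias matching and neighbor-borrowing rules above could look roughly like the following. The alias table, function names, and the choice to borrow from the first qualifying neighbor in scan order are illustrative assumptions; only the word-boundary matching, the ±2 window, and the shared-entity requirement come from the description.

```python
import re

# Hypothetical alias table: alias -> (entity type, canonical ID).
ALIASES = {
    "spinach": ("crop", "spinach"),
    "light intensity": ("parameter", "light_intensity"),
    "vitamin c": ("nutrient", "vitamin_c"),
    "ascorbic acid": ("nutrient", "vitamin_c"),
    "l-ascorbate": ("nutrient", "vitamin_c"),
}

def match_entities(sentence):
    """Return {entity type: set of canonical IDs} found in one sentence."""
    found = {}
    for alias, (kind, canonical) in ALIASES.items():
        # Lookarounds act as word boundaries that also work for hyphenated aliases.
        pattern = r"(?<!\w)" + re.escape(alias) + r"(?!\w)"
        if re.search(pattern, sentence, flags=re.IGNORECASE):
            found.setdefault(kind, set()).add(canonical)
    return found

def flatten(entities):
    """Turn {type: ids} into a set of (type, id) pairs for overlap checks."""
    return {(kind, eid) for kind, ids in entities.items() for eid in ids}

def borrow_missing(sentences, index, window=2):
    """Fill missing entity types from neighbors that share an entity."""
    per_sentence = [match_entities(s) for s in sentences]
    result = {kind: set(ids) for kind, ids in per_sentence[index].items()}
    own = flatten(per_sentence[index])
    start = max(0, index - window)
    stop = min(len(sentences), index + window + 1)
    for j in range(start, stop):
        if j == index or not (own & flatten(per_sentence[j])):
            continue  # no shared entity, so nothing may be borrowed
        for kind, ids in per_sentence[j].items():
            result.setdefault(kind, set(ids))  # only fills missing types
    return result

sentences = [
    "Spinach was grown under high light intensity.",
    "Ascorbic acid in spinach rose sharply.",
]
merged = borrow_missing(sentences, 1)
```

Here the second sentence borrows the parameter from the first because both mention spinach.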

O.07 Direction Inference

For each crop × parameter × nutrient triple found, AI models read the surrounding text and determine the direction of the relationship:

Negation detection flips direction when terms like “not”, “did not”, or “failed to” appear within 3 words of a direction cue. If both positive and negative cues are present, the claim is marked nonlinear.
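As a rough illustration of the negation rule, the sketch below uses a deterministic keyword approach. The cue lists are hypothetical, the 3-word window is checked only on the words preceding a cue, and the nonlinear check is applied after negation flips; all three are assumptions of this sketch rather than documented engine behavior.

```python
import re

# Hypothetical cue lists; the engine's actual vocabulary is not shown here.
POSITIVE = {"increase", "increased", "enhanced", "elevated"}
NEGATIVE = {"decrease", "decreased", "reduced", "lowered"}
NEGATORS = [("not",), ("did", "not"), ("failed", "to")]

def infer_direction(sentence, window=3):
    """Classify a sentence as positive, negative, nonlinear, or unknown."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    directions = set()
    for i, token in enumerate(tokens):
        if token in POSITIVE:
            direction = "positive"
        elif token in NEGATIVE:
            direction = "negative"
        else:
            continue
        # Assumption: only the `window` words before the cue are checked.
        preceding = tokens[max(0, i - window):i]
        negated = any(
            tuple(preceding[k:k + len(neg)]) == neg
            for neg in NEGATORS
            for k in range(len(preceding))
        )
        if negated:
            direction = "negative" if direction == "positive" else "positive"
        directions.add(direction)
    if not directions:
        return "unknown"
    if len(directions) == 2:
        return "nonlinear"
    return directions.pop()

examples = {
    "High light increased vitamin C.": infer_direction("High light increased vitamin C."),
    "Salinity did not increase folate.": infer_direction("Salinity did not increase folate."),
}
```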

O.08 Effect Size & Evidence Type

Quantitative signals are extracted from the text by AI: percentage changes (+15%), fold changes (2.3x), concentration units (mg/kg, µmol), and p-values (p<0.05). The first percentage found becomes the claim’s effect size.
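For illustration, the same signals can be pulled out with regular expressions. This is a regex sketch, which may differ from how the engine extracts them; the patterns and the `extract_signals` name are assumptions. It does follow the stated rule that the first percentage found becomes the effect size.

```python
import re

# The lookbehind stops "10-15%" from being read as a negative 15.
PERCENT = re.compile(r"(?<![\d.])([+-]?\d+(?:\.\d+)?)\s*%")
FOLD = re.compile(r"(\d+(?:\.\d+)?)\s*(?:x\b|-?fold\b)", re.IGNORECASE)
P_VALUE = re.compile(r"\bp\s*([<=>])\s*(0?\.\d+)", re.IGNORECASE)

def extract_signals(text):
    """Collect percentage, fold-change, and p-value mentions from text."""
    percents = [float(m) for m in PERCENT.findall(text)]
    return {
        # First percentage wins, per the rule described above.
        "effect_size_pct": percents[0] if percents else None,
        "fold_changes": [float(m) for m in FOLD.findall(text)],
        "p_values": [(op, float(value)) for op, value in P_VALUE.findall(text)],
    }

signals = extract_signals(
    "Vitamin C rose by +15% (2.3x the control, p<0.05), then another 40%."
)
```

Taking the first percentage is simple but can pick up an unrelated figure (for example a treatment dose expressed as a percentage), which is one reason effect sizes are best read alongside the source sentence.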

Evidence type is classified from study design keywords:

O.09 Three-Tier Extraction

The engine uses three extraction methods with decreasing precision but increasing recall:

Tier 1: Standard regex extraction. Sentence-level entity co-occurrence. All three entities (crop, parameter, nutrient) must appear in the sentence or its ±2 neighbors. Highest precision. Free and fast.

Tier 2: Deep re-extraction. Paper-level entity context. Targets “scanned, no claim” lanes where papers were found but standard extraction failed. Relaxes co-occurrence to paper level with a 0.75× confidence penalty. Recovers claims from papers that discuss entities across different sections.

Tier 3: LLM-powered extraction. Uses a language model to read abstracts and extract structured claims. Targets unknown-direction claims, unresolved conflicts, and high-value empty lanes. Highest recall and semantic understanding. Per-call API cost.

O.10 Quality Gates

Every claim must pass validation before entering the database:

Confidence Scoring

O.11 How Claims Are Scored

Every claim receives a confidence score between 0 and 1. This is not a p-value — it reflects extraction certainty based on multiple signals:

Result: a single-study abstract with one “increased” keyword and no numbers scores ~0.63. A replicated field trial with an effect size, 3 directional cues, and p-values scores ~0.95+. The spread is intentional — it separates “mentioned together” from “quantitatively demonstrated.”

Stage 3 — Link

O.12 First-Order Aggregation

Claims are grouped by their crop × parameter × nutrient triple. For each triple, the engine computes:

O.13 Second-Order Links (Nutrient Pairs)

For each crop, the engine looks at all pairs of nutrients that share evidence under the same growing parameters. If spinach has claims for both iron and vitamin C under similar light conditions, those nutrients form a second-order link.

Links are scored by: combined claim support × average confidence × density balance × bioavailability multiplier (from curated pair rules that encode known nutrient synergies, e.g., vitamin C enhancing non-heme iron absorption).
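The scoring product can be written out as a small function. The four factors come from the description above, but how "density balance" is computed is not specified, so this sketch assumes it is the ratio of the smaller claim count to the larger one. That definition, along with the function and parameter names, is hypothetical.

```python
def link_score(claims_a, claims_b, confidences, bioavailability=1.0):
    """Score a nutrient pair as support x avg confidence x balance x multiplier."""
    if claims_a == 0 or claims_b == 0 or not confidences:
        return 0.0  # a pair needs evidence on both sides
    support = claims_a + claims_b
    avg_confidence = sum(confidences) / len(confidences)
    # Assumed definition: 1.0 when both nutrients have equal claim counts.
    density_balance = min(claims_a, claims_b) / max(claims_a, claims_b)
    return support * avg_confidence * density_balance * bioavailability

# Example: 4 iron claims, 2 vitamin C claims, and a 1.5x curated synergy rule.
score = link_score(4, 2, [0.8, 0.6], bioavailability=1.5)
```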

O.14 Third-Order Links (Nutrient Triads)

When two high-scoring second-order pairs share a common “anchor” nutrient, the engine forms a triad — a three-way nutrient cluster that may indicate a shared metabolic pathway. These are rare (currently 199 identified) but represent the highest-value discovery targets for protocol optimization.
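One way to picture triad formation, assuming pair scores are stored in a dictionary keyed by nutrient pairs and that "high-scoring" means at or above some threshold. The `min_score` cutoff and the data layout are hypothetical.

```python
from itertools import combinations

def find_triads(pair_scores, min_score):
    """Join strong pairs that share exactly one anchor nutrient."""
    strong = [pair for pair, score in pair_scores.items() if score >= min_score]
    triads = set()
    for first, second in combinations(strong, 2):
        shared = set(first) & set(second)
        if len(shared) == 1:
            anchor = next(iter(shared))
            others = tuple(sorted((set(first) | set(second)) - shared))
            triads.add((anchor, *others))  # anchor first, then the two partners
    return triads

pair_scores = {
    ("iron", "vitamin_c"): 3.0,
    ("vitamin_c", "folate"): 2.5,
    ("zinc", "calcium"): 4.0,
    ("iron", "zinc"): 0.5,  # below threshold, so it cannot seed a triad
}
triads = find_triads(pair_scores, min_score=2.0)
```

In this example vitamin C anchors the only triad, linking iron and folate.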

O.15 Parameter Interactions

The engine also identifies parameter pairs that co-modulate the same nutrient within a crop. For example, if both light intensity and UV-B exposure independently affect anthocyanin content in basil, their interaction may be worth testing as a combined protocol.
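A minimal sketch of how co-modulating parameter pairs might be enumerated, assuming claims are available as (crop, parameter, nutrient) tuples. The function name and data shape are illustrative.

```python
from collections import defaultdict
from itertools import combinations

def parameter_pairs(claims):
    """Map each (crop, nutrient) to the parameter pairs that both affect it."""
    by_target = defaultdict(set)
    for crop, parameter, nutrient in claims:
        by_target[(crop, nutrient)].add(parameter)
    return {
        target: list(combinations(sorted(params), 2))
        for target, params in by_target.items()
        if len(params) >= 2
    }

claims = [
    ("basil", "light_intensity", "anthocyanin"),
    ("basil", "uvb_exposure", "anthocyanin"),
    ("basil", "salinity", "vitamin_c"),
]
pairs = parameter_pairs(claims)
```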

Stage 4 — Model

O.16 Dose-Response Modeling

For triples with numeric effect sizes from 2+ papers, the engine fits a linear summary model: mean effect, median, standard deviation, range, coefficient of variation, and direction consistency.
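The summary statistics listed above can be computed with Python's standard library. The sketch assumes one effect size (in percent) per paper, uses the sample standard deviation, and defines direction consistency as the share of effects that agree with the majority sign. Those three choices are assumptions, not documented behavior.

```python
import statistics

def summarize_effects(effects_pct):
    """Summarize per-paper effect sizes for one triple (needs 2+ values)."""
    if len(effects_pct) < 2:
        return None
    mean = statistics.mean(effects_pct)
    stdev = statistics.stdev(effects_pct)  # sample standard deviation
    positives = sum(1 for e in effects_pct if e > 0)
    negatives = sum(1 for e in effects_pct if e < 0)
    return {
        "mean": mean,
        "median": statistics.median(effects_pct),
        "stdev": stdev,
        "range": (min(effects_pct), max(effects_pct)),
        "cv": stdev / abs(mean) if mean != 0 else None,
        "direction_consistency": max(positives, negatives) / len(effects_pct),
    }

summary = summarize_effects([10.0, 20.0, 30.0])
```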

Relationships are classified by strength:

O.17 Deficiency-Gap Prioritization

Nutrients are ranked by research priority based on human deficiency prevalence. The engine cross-references its evidence base against a curated list of global micronutrient deficiencies (iron, zinc, folate, vitamin D, etc.) to identify where growing protocol optimization could have the most public health impact.

Priority score = deficiency weight × (1 + 0.6 × evidence depth) × unmet bonus (1.22× for nutrients with zero current evidence).
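The formula translates directly into code. The constants 0.6 and 1.22 come from the formula above; the argument names, the use of a claim count to detect "zero current evidence", and the example values are assumptions for illustration.

```python
def priority_score(deficiency_weight, evidence_depth, claim_count):
    """deficiency weight x (1 + 0.6 x evidence depth) x unmet bonus."""
    # The 1.22x bonus applies only to nutrients with no evidence yet.
    unmet_bonus = 1.22 if claim_count == 0 else 1.0
    return deficiency_weight * (1 + 0.6 * evidence_depth) * unmet_bonus

# Hypothetical inputs: a well-studied nutrient and one with no claims at all.
studied = priority_score(deficiency_weight=0.9, evidence_depth=0.5, claim_count=12)
unstudied = priority_score(deficiency_weight=0.8, evidence_depth=0.0, claim_count=0)
```

With these inputs the studied nutrient scores 0.9 × 1.3 = 1.17 and the unstudied one scores 0.8 × 1.22 = 0.976.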

Conflict Detection

O.18 When Papers Disagree

When the same crop × parameter × nutrient triple has claims in opposite directions (one paper says “increased,” another says “decreased”), the engine flags a conflict set. All claims in a conflict receive a confidence penalty proportional to the number of opposing claims.

Conflicts are not failures — they often represent real scientific complexity: dose-dependent responses, cultivar differences, or interaction effects. The dashboard surfaces them explicitly so researchers can investigate the underlying conditions.
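Conflict flagging as described above might be sketched like this. Claims are assumed to be dictionaries with `triple`, `direction`, and `confidence` keys, and the penalty rate of 0.05 per opposing claim is a made-up value; the description only says the penalty is proportional to the number of opposing claims.

```python
from collections import defaultdict

def apply_conflict_penalty(claims, penalty_per_opposing=0.05):
    """Flag triples with opposing directions and lower their confidence."""
    groups = defaultdict(list)
    for claim in claims:
        groups[claim["triple"]].append(claim)

    opposite_of = {"positive": "negative", "negative": "positive"}
    adjusted = []
    for group in groups.values():
        counts = {
            "positive": sum(c["direction"] == "positive" for c in group),
            "negative": sum(c["direction"] == "negative" for c in group),
        }
        in_conflict = counts["positive"] > 0 and counts["negative"] > 0
        for claim in group:
            opposite = opposite_of.get(claim["direction"])
            opposing = counts[opposite] if in_conflict and opposite else 0
            confidence = max(0.0, claim["confidence"] - penalty_per_opposing * opposing)
            adjusted.append({**claim, "confidence": confidence, "conflict": in_conflict})
    return adjusted

t = ("spinach", "light_intensity", "vitamin_c")
u = ("basil", "salinity", "anthocyanin")
adjusted = apply_conflict_penalty([
    {"triple": t, "direction": "positive", "confidence": 0.9},
    {"triple": t, "direction": "positive", "confidence": 0.8},
    {"triple": t, "direction": "negative", "confidence": 0.7},
    {"triple": u, "direction": "positive", "confidence": 0.6},
])
```

The original claims are left untouched; each adjusted copy carries a `conflict` flag so a dashboard can surface the set explicitly.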

The Ontology

O.19 Knowledge Base Structure

The ontology is the engine’s vocabulary. It defines every entity the system can recognize, stored in human-readable YAML files:

Ontology file manifest (7 YAML files)
  • crops.yaml — 56 crops with species names, growth types, cycle lengths, and aliases
  • nutrients.yaml — 81 nutrients with units, categories, chemical forms, and aliases
  • parameters.yaml — 55 growing parameters with domains, units, and aliases
  • nutrient_pair_rules.yaml — known synergy/antagonism rules between nutrient pairs with bioavailability multipliers
  • parameter_interaction_rules.yaml — known parameter co-modulation effects
  • human_deficiency_rules.yaml — global deficiency prevalence weights for prioritization
  • verticals.yaml — market vertical groupings (which crops, nutrients, parameters belong to each vertical)

The ontology is decoupled from code. Adding a new crop or nutrient means adding a YAML entry with aliases — the entire pipeline automatically picks it up on the next scan.

Execution Model

O.20 Batch Sweep Architecture

A full scan is divided into 17 crop batches (spinach & kale, microgreens, adaptogens, tea varieties, mushrooms, etc.). Each batch runs as an isolated subprocess with its own query scope, paper collection, and claim set.

After all batches complete, results are merged incrementally: claims are deduplicated by ID, papers by source ID, and lane statuses are accumulated. This provides fault tolerance (one batch can fail without losing others) and allows the dashboard to update after each batch finishes.
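The incremental merge can be pictured as folding each finished batch into a shared state. The field names (`id`, `source_id`, `lane_status`) and the choice to accumulate lane statuses as a per-lane list are assumptions of this sketch.

```python
def merge_batch(state, batch):
    """Fold one batch result into the running state, skipping duplicates."""
    for claim in batch["claims"]:
        state["claims"].setdefault(claim["id"], claim)  # first copy wins
    for paper in batch["papers"]:
        state["papers"].setdefault(paper["source_id"], paper)
    for lane, status in batch["lane_status"].items():
        state["lane_status"].setdefault(lane, []).append(status)
    return state

state = {"claims": {}, "papers": {}, "lane_status": {}}
merge_batch(state, {
    "claims": [{"id": "c1"}],
    "papers": [{"source_id": "p1"}],
    "lane_status": {"lane1": "claim_found"},
})
merge_batch(state, {
    "claims": [{"id": "c1"}, {"id": "c2"}],
    "papers": [{"source_id": "p1"}],
    "lane_status": {"lane1": "scanned_no_claim"},
})
```

Because each call only adds to the state, a failed batch can simply be skipped or retried without disturbing results already merged.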

Design Philosophy

O.21 Why Regex + Heuristics (Not Just AI)

The core extraction pipeline is deterministic: same input always produces same output. This is a deliberate choice:

LLM extraction (Tier 3) runs on top of regex results — resolving unknowns, adjudicating conflicts, and handling complex language that regex cannot parse. The tiers complement each other: regex provides breadth and reproducibility, LLM provides depth and semantic understanding.

O.22 What This Is Not

This engine does not replace peer review, meta-analysis, or domain expertise. Confidence scores are extraction certainty metrics, not statistical significance. “Strong evidence” in dose-response means consistent direction across multiple papers — not clinical-trial-grade proof. The output identifies where the literature has converged, where it disagrees, and where evidence is missing.