Draft — Merck In Silico Cup 2026

REGISTRATION FORM

Applicant name: Eniola Olutogun
Role: Sole founder and author
Track: Startup
Venture stage: Pre-seed, proof-of-concept validated, not yet incorporated
Email: ennyolutogun@gmail.com
ORCID: [INSERT ORCID iD]
Venture one-line description: Predicting drug resistance mutations from protein sequence alone, with no crystal structure required, at 100 percent mutation coverage.
Sector tags: computational biology, pharma AI, drug discovery, oncology, antivirals, antimicrobial resistance
Compute environment for sprint: CPU-only VPS, consistent with the Merck In Silico Cup constraint
Why this competition: The Merck In Silico Cup rewards drug discovery methods that work under a CPU-only budget. My production stack was engineered for exactly that constraint, so the 48-hour sprint tests the system on its home ground rather than forcing a redesign.

TEAM AND MOTIVATION STATEMENT

Most resistance variants a discovery team actually chases have no solved crystal structure. Structure-dependent prediction tools cover only 17.6 percent of mutations in the Platinum benchmark; the other 82.4 percent return no answer at all. My system answers all of them, because it reads the protein sequence directly. That coverage gap is the problem I built the venture to close, and it maps onto Merck KGaA's own pipeline: EGFR and KRAS acquired resistance in oncology, MSI-H tumor evolution, and the antimicrobial resistance programmes where new variants appear faster than crystallographers can keep up.

The method is deliberately lean. ESM-2 protein language model delta-embeddings capture the effect of a point mutation, ECFP4 fingerprints encode the drug, and a Random Forest classifier ties the two together. The full pipeline runs on CPU. That was an engineering decision made for a VPS budget, and it is the reason I can enter a CPU-only 48-hour sprint without rebuilding anything.
On the Platinum benchmark of 357 mutations under protein-grouped cross-validation, the pipeline reaches AUROC 0.634, compared with roughly 0.70 reported for published structure-based work such as mCSM-lig, while covering roughly six times as many mutations.

My background is why the model is grounded in pharmacology rather than being a benchmark exercise. Training as a pharmacist before moving into machine learning means I read a resistance prediction as a clinical event, not only as a label. That perspective shapes which mutations and drug pairs matter and keeps the feature choices tied to pharmacology. For the Merck In Silico Cup specifically, it lets me frame any sprint result against the practical question a discovery chemist asks: does this variant blunt this compound, and how confident should I be?

Winning visibility with a Merck drug discovery jury is worth more to a pre-seed venture than the prize itself, though the 10,000 euro non-dilutive prize would directly fund the next milestone: fine-tuning ESM-2 on the SKEMPI set of roughly 3,000 mutations to push AUROC toward 0.70, ahead of a planned pilot with Servier in Suresnes.

TECHNICAL REPORT

Problem framing

Drug resistance prediction is gated by structural data. Methods that need a co-crystal or docked complex cannot score the majority of clinically relevant variants because no structure exists for them. The Merck In Silico Cup's CPU-only environment also rules out heavy structure-based simulation within a 48-hour window. Both constraints point to the same solution: a sequence-first method that is cheap to run.

Method

The pipeline has three components. First, ESM-2 produces per-residue embeddings for wild-type and mutant sequences; the delta between them encodes the mutation's effect without any structural input. Second, the candidate drug is represented by ECFP4 circular fingerprints. Third, a Random Forest classifier predicts whether the mutation confers resistance to that drug.
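
A minimal sketch of how the first two components could be combined into one classifier input. The draft does not say how per-residue embeddings are pooled, so mean pooling is an assumption here. Embeddings and fingerprint bits are passed in as plain Python lists standing in for real ESM-2 and ECFP4 output, and the function names (`mean_pool`, `delta_embedding`, `build_feature_vector`) are hypothetical. The Random Forest step is left out.

```python
from typing import List, Sequence

Vector = List[float]


def mean_pool(per_residue: Sequence[Sequence[float]]) -> Vector:
    """Average per-residue embeddings into one fixed-length vector (assumed pooling)."""
    if not per_residue:
        raise ValueError("embedding matrix is empty")
    width = len(per_residue[0])
    if any(len(row) != width for row in per_residue):
        raise ValueError("per-residue rows have inconsistent widths")
    n = len(per_residue)
    return [sum(row[i] for row in per_residue) / n for i in range(width)]


def delta_embedding(wild_type: Sequence[Sequence[float]],
                    mutant: Sequence[Sequence[float]]) -> Vector:
    """Pooled mutant embedding minus pooled wild-type embedding."""
    wt = mean_pool(wild_type)
    mt = mean_pool(mutant)
    if len(wt) != len(mt):
        raise ValueError("wild-type and mutant embeddings differ in width")
    return [m - w for w, m in zip(wt, mt)]


def build_feature_vector(delta: Sequence[float],
                         fingerprint_bits: Sequence[int]) -> Vector:
    """Concatenate the mutation delta with the drug fingerprint bits."""
    if any(bit not in (0, 1) for bit in fingerprint_bits):
        raise ValueError("fingerprint must contain only 0/1 bits")
    return list(delta) + [float(bit) for bit in fingerprint_bits]


if __name__ == "__main__":
    # Toy 2-residue, 2-dimensional embeddings; real ESM-2 vectors are much wider.
    wt = [[0.0, 1.0], [2.0, 3.0]]
    mut = [[0.0, 1.0], [4.0, 3.0]]
    print(build_feature_vector(delta_embedding(wt, mut), [1, 0, 1]))
```

Each mutation and drug pair yields one such vector, which is the row a classifier would be trained on.
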
Every stage runs on CPU, with no GPU dependency at inference, which keeps the method inside the competition's VPS limits.

Validation to date

On the Platinum benchmark (357 mutations), evaluated under protein-grouped cross-validation so that no protein appears in both train and test folds, the classifier reaches AUROC 0.634. Mutation coverage is 100 percent, against 17.6 percent for structure-limited tools on the same set. Published structure-based comparators such as mCSM-lig report AUROC near 0.70 but only on the minority of cases where a structure is available. The trade the system makes is explicit: a modest accuracy gap in exchange for answering every case.

Sprint plan for the 48-hour challenge

- Hour 0 to 6: ingest the provided dataset, map it to the wild-type and mutant sequence plus drug-pair format, run sanity checks on label balance and protein grouping.
- Hour 6 to 24: generate ESM-2 delta-embeddings and ECFP4 fingerprints on CPU, baseline the Random Forest, record AUROC and coverage.
- Hour 24 to 40: feature ablations and threshold calibration tuned to the jury's stated metric, with grouped cross-validation to avoid leakage.
- Hour 40 to 48: finalize predictions, write results, and document the coverage advantage against any structure-dependent baseline in the provided data.

Relevance to Merck KGaA

The same model that scores a benchmark variant scores an EGFR or KRAS resistance mutation, an MSI-H associated change, or an antimicrobial resistance substitution. For a Merck discovery team weighing which variants threaten a compound, a method that returns a calibrated score for every variant, structure or not, removes the blind spot that structure-dependent tools leave behind.

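
The protein-grouped evaluation referred to in the validation and sprint-plan sections can be illustrated with a short standard-library sketch. It is not the splitter used for the reported 0.634 figure: the greedy balancing rule, the function names `grouped_folds` and `auroc`, and the toy protein labels are all assumptions made for illustration. In practice a library splitter such as scikit-learn's `GroupKFold` would serve the same purpose.

```python
from collections import defaultdict
from typing import Hashable, List, Sequence


def grouped_folds(groups: Sequence[Hashable], n_folds: int) -> List[List[int]]:
    """Assign sample indices to folds so each group (protein) sits in exactly one fold."""
    by_group = defaultdict(list)
    for idx, group in enumerate(groups):
        by_group[group].append(idx)
    if n_folds < 2 or n_folds > len(by_group):
        raise ValueError("n_folds must be between 2 and the number of groups")
    folds: List[List[int]] = [[] for _ in range(n_folds)]
    # Greedy balancing: largest groups first, each into the currently smallest fold.
    ordered = sorted(by_group, key=lambda g: (-len(by_group[g]), str(g)))
    for group in ordered:
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(by_group[group])
    return folds


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy labels: one entry per mutation, naming the protein it belongs to.
    proteins = ["EGFR", "EGFR", "EGFR", "KRAS", "KRAS", "P3"]
    folds = grouped_folds(proteins, n_folds=2)
    for test_idx in folds:
        test_proteins = {proteins[i] for i in test_idx}
        train_proteins = {p for i, p in enumerate(proteins) if i not in test_idx}
        # Leakage check: no protein may appear on both sides of a split.
        assert not (test_proteins & train_proteins)
    print(folds)
    print(auroc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]))
```

The leakage assertion in the demo is the same sanity check listed for hours 0 to 6: if any protein shows up in both train and test, the reported AUROC would be optimistic.
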
Roadmap beyond the sprint

Fine-tuning ESM-2 on the SKEMPI set of roughly 3,000 mutations is the next step toward AUROC 0.70, after which a pilot with Servier in Suresnes converts the method into a partner-validated tool, with Sanofi in Gentilly and Paris-Saclay groups including I2BC and Institut Pasteur as the surrounding network.

CHECKLIST

- [ ] Complete and submit the registration form by June 30, 2026
- [ ] Confirm Startup track selection on the registration form
- [ ] Insert ORCID iD into the registration form
- [ ] Verify the CPU-only VPS is provisioned and the ESM-2 plus Random Forest pipeline runs end to end on it before July 1
- [ ] Participate in the 48-hour sprint, July 1 to 3, 2026
- [ ] Submit the solution file in the format the organizers specify
- [ ] Submit the technical report with sprint-specific results filled in
- [ ] Confirm submission email matches ennyolutogun@gmail.com

EDITOR NOTES

- Eligibility risk: registration closes June 30 and the sprint is July 1 to 3. Register today and confirm whether the Startup track has team-size or incorporation requirements, since the venture is not yet incorporated.
- Verify before submitting: confirm the exact prize terms (10,000 euro to top teams, not guaranteed), the official submission format for both solution and report, and the precise scoring metric so the sprint plan's calibration step targets the right number.
- Fill in personal detail not in this profile: ORCID iD is a placeholder; add it. If the organizers ask for prior competition history, affiliations, or a headshot, those are not in the profile.
- Technical report caveat: the sprint dataset is unknown in advance, so all sprint results in the report are placeholders until July 3. Do not submit the report with the validation-set numbers presented as sprint results; clearly separate prior Platinum benchmark figures from sprint output.
- Confirm the CPU-only pipeline timing on the actual VPS ahead of time. ESM-2 embedding generation on CPU can be slow for large inputs; benchmark throughput before the clock starts so the hour 6 to 24 block is realistic.
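
One way to run that throughput check is sketched below, using only the standard library. The names `estimate_hours` and `fits_budget` are hypothetical, `embed` stands for whatever function wraps the real ESM-2 call, and the 2x safety factor is an arbitrary assumption. The extrapolation is linear in residue count, which is only a rough guide because transformer cost grows faster than linearly with sequence length, so the sample should resemble the real data in length.

```python
import time
from typing import Callable, Sequence


def estimate_hours(embed: Callable[[str], object],
                   sample: Sequence[str],
                   total_residues: int,
                   clock: Callable[[], float] = time.perf_counter) -> float:
    """Time `embed` on a sample and extrapolate linearly to `total_residues`, in hours."""
    sample_residues = sum(len(seq) for seq in sample)
    if sample_residues == 0:
        raise ValueError("sample contains no residues")
    start = clock()
    for seq in sample:
        embed(seq)  # result discarded; only the wall-clock cost matters here
    elapsed_seconds = clock() - start
    seconds_per_residue = elapsed_seconds / sample_residues
    return seconds_per_residue * total_residues / 3600.0


def fits_budget(estimated_hours: float,
                budget_hours: float = 18.0,   # the hour 6 to 24 block
                safety_factor: float = 2.0) -> bool:
    """True if the estimate, padded by the safety factor, fits inside the budget."""
    return estimated_hours * safety_factor <= budget_hours


if __name__ == "__main__":
    # Stand-in for the real embedding call; replace with the ESM-2 wrapper on the VPS.
    def fake_embed(seq: str) -> int:
        return sum(ord(ch) for ch in seq)

    hours = estimate_hours(fake_embed, ["MKTAYIAK", "GSHMLEDP"], total_residues=500_000)
    print(f"estimated {hours:.6f} h, fits budget: {fits_budget(hours)}")
```

If the padded estimate does not fit, the options are a smaller ESM-2 checkpoint, caching wild-type embeddings so each protein is embedded once, or trimming the ablation block.
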
