Executive Summary
Until now, answering a question as basic as "how much of our grant portfolio goes to travel?" required opening hundreds of individual budget spreadsheets. This report presents the results of a systematic extraction of budget data from 658 grants awarded between 2023 and 2025, covering 90% of the Foundation's G-prefix (non-fellowship) portfolio and representing $253 million in total budgeted costs.
The extraction was performed using a two-track pipeline: a deterministic parser for the Foundation's standard Excel budget template (23% of grants) and an AI-powered extractor using Claude for PDF budgets (77% of grants). The results are stored in a queryable database and presented through an interactive dashboard. This report describes the method, presents the findings, and notes areas where further classification is possible.
Method
Data Source
Grant records were queried from the Foundation's Salesforce Nonprofit Cloud instance via the REST API. The query targeted all records with a G-prefix grant number (excluding Sloan Research Fellowships) awarded in fiscal years 2023, 2024, and 2025. This yielded 731 grants: 248 in 2023, 231 in 2024, and 252 in 2025.
For each grant, attached files (ContentDocuments in Salesforce) were examined to identify the budget. File selection used a heuristic: filenames containing "budget" or "b&e" were preferred; when multiple budget files existed, the most recently uploaded version was used. Budget files were found on 690 of 731 grants (94%). Of those, 152 were Excel spreadsheets (.xlsx) and 538 were PDFs.
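The selection heuristic can be stated compactly. A simplified sketch, in which a `(filename, uploaded_at)` tuple stands in for a Salesforce ContentDocument record (the helper name is illustrative, not the production code):

```python
def pick_budget_file(files):
    """Select the most likely budget attachment from a grant's files.

    `files` is a list of (filename, uploaded_at) tuples -- a simplified
    stand-in for Salesforce ContentDocument records.
    """
    candidates = [
        (name, uploaded) for name, uploaded in files
        if "budget" in name.lower() or "b&e" in name.lower()
    ]
    if not candidates:
        return None
    # When several budget files exist, prefer the most recent upload.
    return max(candidates, key=lambda f: f[1])[0]
```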
Extraction Pipeline
At a glance: 3 fiscal years · 94% budget-file hit rate · two-track pipeline · PostgreSQL storage · interactive dashboard.
Track 1: Deterministic Template Parsing
The Foundation's standard budget template (Version 7.5, last modified January 2025) uses a fixed structure derived from the NSF/NIH budget format. The template has known cell locations for each budget category: salaries at row 34, fringe benefits at row 39, travel at rows 45–46, and so on through indirect costs at row 79 and the grand total at row 94. Year-by-year amounts appear in columns O, V, and AC; totals in column AJ.
A Python parser using openpyxl reads these cells directly. Template detection checks that cell A1 contains "Alfred P Sloan Foundation" and A2 contains "Budget." The parser also handles a two-year template variant with a different row layout for overhead-exempt costs and totals. When the parser detects the template, extraction is deterministic: no interpretation, no API calls, sub-second processing, and results accurate to the penny.
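A sketch of the parser logic, with the cell map reduced to a handful of categories (the production map covers every template row). The worksheet is abstracted to a plain `{cell_ref: value}` mapping so the sketch does not depend on openpyxl; with openpyxl one would populate that mapping from `ws[ref].value` on the loaded worksheet:

```python
# Cell map for the standard template (v7.5) -- an illustrative subset
# of the full category list described in the report.
CATEGORY_ROWS = {
    "salaries": 34,
    "fringe_benefits": 39,
    "travel_domestic": 45,
    "travel_international": 46,
    "indirect_costs": 79,
    "grand_total": 94,
}
YEAR_COLUMNS = ("O", "V", "AC")   # year 1..3 amounts
TOTAL_COLUMN = "AJ"               # all-years total

def is_sloan_template(cells):
    """Detect the standard template from its fixed header cells."""
    return ("Alfred P Sloan Foundation" in str(cells.get("A1", ""))
            and "Budget" in str(cells.get("A2", "")))

def parse_template(cells):
    """Read known cell locations from a {cell_ref: value} mapping."""
    if not is_sloan_template(cells):
        return None
    return {
        category: {
            "by_year": [cells.get(f"{col}{row}", 0) or 0 for col in YEAR_COLUMNS],
            "total": cells.get(f"{TOTAL_COLUMN}{row}", 0) or 0,
        }
        for category, row in CATEGORY_ROWS.items()
    }
```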
All 152 Excel files matched the standard template (including the two-year variant). Extracted totals were validated against Salesforce award amounts and matched to the cent in every case tested.
Track 2: AI-Powered PDF Extraction
The remaining 538 budget files were PDFs, of which 506 were ultimately extracted — overwhelmingly the same Sloan template printed to PDF, not a different budget format. A sample of 10 PDFs found that 9 were printed Sloan templates (versions 7.0–7.5) and only 1 used a genuinely custom format (a film production budget).
These PDFs were sent to the Anthropic API (Claude Sonnet 4) as base64-encoded documents with a structured extraction prompt. The prompt instructed the model to identify the Sloan template layout, extract amounts by section letter (A through L), and return JSON conforming to the Foundation's budget taxonomy. For non-template PDFs, the model mapped line items to the closest standard category and assigned a confidence score.
Extraction was performed via the Anthropic Batch API in batches of 150–200 PDFs, with results polled and retrieved programmatically. Success rate: 97% on the first pass. Failures were primarily oversized files (>15MB) or image-only PDFs without a text layer. Total API cost across all three years: $22.46.
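A sketch of the request construction. The model ID and prompt text are abbreviated stand-ins for the production values; the payload shape follows the Anthropic Message Batches format of one `custom_id` plus Messages-API params per request:

```python
import base64
import os

# Illustrative stand-ins, not the exact production values.
MODEL = "claude-sonnet-4-20250514"
EXTRACTION_PROMPT = (
    "Identify the Sloan budget template layout, extract amounts by "
    "section letter (A-L), and return JSON conforming to the budget taxonomy."
)

def build_batch_requests(pdf_paths):
    """Build one Message Batches request per PDF, keyed by filename."""
    requests = []
    for path in pdf_paths:
        with open(path, "rb") as f:
            data = base64.standard_b64encode(f.read()).decode("ascii")
        requests.append({
            "custom_id": os.path.basename(path),
            "params": {
                "model": MODEL,
                "max_tokens": 4096,
                "messages": [{
                    "role": "user",
                    "content": [
                        # The PDF travels as a base64 document block.
                        {"type": "document",
                         "source": {"type": "base64",
                                    "media_type": "application/pdf",
                                    "data": data}},
                        {"type": "text", "text": EXTRACTION_PROMPT},
                    ],
                }],
            },
        })
    return requests
```

Submission and polling then go through the SDK: `client.messages.batches.create(requests=...)`, followed by periodic `retrieve` calls until processing ends.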
Data Storage and Validation
Extracted data is stored in a Supabase (PostgreSQL) database in two tables: budget_extractions (one row per grant, with metadata including extraction method, file type, and confidence score) and budget_line_items (one row per budget category per grant, with the original label from the source file preserved alongside the standardized category code). Every record carries an extraction_method field (template or ai) for provenance tracking.
Automated validation checks compare the sum of extracted line items against the document's stated total and against the Salesforce award amount. Discrepancies greater than 5% are flagged for review. For AI extractions, a model-assigned confidence score (0–1) is stored; extractions below 0.7 are flagged for manual verification.
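These checks reduce to a small amount of arithmetic. A hypothetical sketch, with field names and default thresholds mirroring the description above:

```python
def validate_extraction(line_items_total, stated_total, award_amount,
                        confidence=None, tolerance=0.05, min_confidence=0.7):
    """Return a list of review flags for one extracted budget.

    Mirrors the checks described above; names are illustrative, not the
    production schema.
    """
    flags = []
    # Compare the line-item sum against each available reference total.
    for label, reference in [("stated_total", stated_total),
                             ("award_amount", award_amount)]:
        if reference and abs(line_items_total - reference) / reference > tolerance:
            flags.append(f"sum_mismatch_vs_{label}")
    # AI extractions below the confidence floor go to manual review.
    if confidence is not None and confidence < min_confidence:
        flags.append("low_confidence")
    return flags
```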
Budget Taxonomy
All budget line items are classified into a standardized taxonomy derived from the Foundation's template structure and the NSF/NIH budget format. The taxonomy contains 20 template-derived categories spanning personnel costs, direct costs, overhead-exempt costs, indirect costs, and sub-awards, plus three AI-derived subcategories (marked ✦ in the table below).
| Category | Template Section | Portfolio Total | % of Total | Grants |
|---|---|---|---|---|
| Salaries & Wages | A+B | $90.0M | 35.6% | 526 |
| Sub-awards | J | $32.1M | 12.7% | 126 |
| Indirect Costs (F&A) | I | $24.0M | 9.5% | 404 |
| Other Direct Costs | F.7 | $16.8M | 6.6% | 253 |
| Fringe Benefits | C | $15.9M | 6.3% | 406 |
| Consultant Services | F.4 | $11.8M | 4.7% | 194 |
| Travel — Domestic | D.1 | $7.4M | 2.9% | 316 |
| Film & Media Production ✦ | F.7 → AI | $6.5M | 2.6% | 33 |
| Participant Stipends | E.1 | $5.8M | 2.3% | 79 |
| Prizes, Awards & Fellowships ✦ | F.7 → AI | $5.5M | 2.2% | 28 |
| Materials & Supplies | F.1 | $5.2M | 2.1% | 229 |
| Participant Subsistence | E.3 | $4.7M | 1.9% | 93 |
| Participant Travel | E.2 | $4.4M | 1.7% | 130 |
| Events & Meetings ✦ | F.7 → AI | $3.8M | 1.5% | 104 |
| Honoraria | F.3 | $3.7M | 1.5% | 116 |
| Other Overhead-Exempt | G.2 | $3.0M | 1.2% | 29 |
| Tuition | G.1 | $2.6M | 1.0% | 66 |
| Participant Other | E.4 | $2.4M | 0.9% | 58 |
| Equipment | F.6 | $2.2M | 0.9% | 72 |
| Computer Services | F.5 | $1.9M | 0.8% | 88 |
| Publication Costs | F.2 | $1.8M | 0.7% | 140 |
| Travel — International | D.2 | $0.9M | 0.4% | 72 |
| Overhead on Sub-awards | K | $0.7M | 0.3% | 81 |
✦ Subcategory derived from AI classification of the "Other Direct Costs" (F.7) write-in field. Original labels preserved; classification is reversible. See "Partial Classification" below.
Findings
Portfolio Composition
Personnel costs (salaries plus fringe benefits) account for 42% of total budgeted costs — consistent with a research-focused portfolio. The second-largest allocation is sub-awards at 12.7%, concentrated in 126 multi-institution grants. Indirect costs (facilities and administration) come to 9.5% of total budgeted costs, well below the Foundation's 20% cap. The gap reflects two structural factors: grants of $50,000 or less carry no indirect costs by Foundation policy, and the 20% rate applies to Modified Total Direct Costs (MTDC), which excludes tuition, participant support, equipment, and the portion of each sub-award above $25,000.
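As a minimal sketch of that arithmetic: the 20% rate, the $50,000 no-indirect floor, and the exclusions come from the policy described above; the category names and the function itself are illustrative, and sub-awards are assumed to count toward MTDC only up to their first $25,000 each.

```python
# Categories excluded from Modified Total Direct Costs under the policy
# described above (illustrative codes, not the database's actual values).
MTDC_EXCLUSIONS = {"tuition", "participant_support", "equipment"}
SUBAWARD_MTDC_CAP = 25_000    # only the first $25K of each sub-award counts
INDIRECT_RATE = 0.20
NO_INDIRECT_THRESHOLD = 50_000

def max_indirect(direct_costs, subawards=()):
    """Compute the maximum allowable indirect charge.

    `direct_costs` maps category -> amount; `subawards` lists individual
    sub-award amounts (kept separate from the category map).
    """
    total = sum(direct_costs.values()) + sum(subawards)
    if total <= NO_INDIRECT_THRESHOLD:
        return 0.0   # small grants carry no indirect costs
    mtdc = sum(v for k, v in direct_costs.items() if k not in MTDC_EXCLUSIONS)
    mtdc += sum(min(s, SUBAWARD_MTDC_CAP) for s in subawards)
    return INDIRECT_RATE * mtdc
```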
Program-Level Patterns
| Program | Total Budget | Grants | Distinctive Pattern |
|---|---|---|---|
| Research | $114.9M | 273 | Salary-heavy (45% of direct); highest fringe and tuition allocations |
| Public Understanding | $50.9M | 159 | Concentrates nearly all media production spending; highest consultant share |
| Higher Education | $35.7M | 81 | Largest participant support (stipends, subsistence); significant prizes/awards |
| Technology | $30.3M | 77 | Highest sub-award share (multi-institution collaborations); computing costs |
| Energy & Environment | $14.2M | 42 | Above-average travel allocation; equipment-intensive fieldwork grants |
| Economics | $7.0M | 26 | Smallest program; highest salary-to-total ratio |
Year-over-Year
| Fiscal Year | Total Budget | Grants | Avg Grant | Coverage |
|---|---|---|---|---|
| 2023 | $83.9M | 218 | $385K | 88% |
| 2024 | $76.4M | 209 | $366K | 90% |
| 2025 | $92.7M | 231 | $401K | 92% |
Partial Classification of "Other Direct Costs"
The Foundation's budget template includes a write-in field for "Other Direct Costs" (section F.7), where grantees describe expenses that don't fit the predefined categories. Across 658 grants, $32.7 million (14% of direct costs) was entered in this field, making it in aggregate the second-largest category in the portfolio after salaries.
To understand what this bucket actually contains, the original labels from all 654 line items were clustered using AI classification (Claude Sonnet 4). Labels such as "VFX and post-production," "3D conversion," and "music licensing" clustered naturally into Film & Media Production; "conference costs," "workshop expenses," and "venue rental" into Events & Meetings; and "seed grants," "fellowship awards," and "Scialog prizes" into Prizes, Awards & Fellowships.
| Cluster | Amount | % of F.7 | Items | Examples |
|---|---|---|---|---|
| Administrative & Operating | $9.6M | 29% | 219 | Often the residual — "Other Direct Costs" with no further detail |
| Film & Media Production ✦ | $6.5M | 20% | 100 | VFX, post-production, music licensing, aerial units |
| Prizes, Awards & Fellowships ✦ | $5.5M | 17% | 40 | Scialog awards, seed grants, fellowship stipends |
| Events & Meetings ✦ | $3.8M | 12% | 145 | Conference costs, venue rental, workshop expenses |
| Facilities & Occupancy | $2.5M | 8% | 30 | Lab space rental, office costs, renovation |
| All other clusters | $4.8M | 14% | 120 | Marketing, specialized services, participant incentives |
Three clusters (marked with ✦) were promoted to named subcategories in the database. The reclassification is reversible: the original grantee-supplied label is preserved on every record, and each reclassified item carries a note indicating the AI-derived provenance. The interactive dashboard includes a toggle to collapse these subcategories back into the original "Other Direct Costs" aggregate for comparison with external benchmarks.
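The dashboard's collapse toggle amounts to a simple re-aggregation. A sketch, with illustrative category codes rather than the database's actual values:

```python
# AI-derived subcategories that roll back up into "Other Direct Costs"
# (F.7) when the dashboard toggle is off.
F7_SUBCATEGORIES = {"film_media_production", "prizes_awards_fellowships",
                    "events_meetings"}

def collapse_f7(totals):
    """Fold the AI-derived subcategories back into other_direct_costs.

    `totals` maps category code -> dollar amount.
    """
    collapsed = {k: v for k, v in totals.items() if k not in F7_SUBCATEGORIES}
    reclassified = sum(totals.get(k, 0) for k in F7_SUBCATEGORIES)
    collapsed["other_direct_costs"] = (
        totals.get("other_direct_costs", 0) + reclassified)
    return collapsed
```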
Further classification is possible. The "Administrative & Operating" cluster ($9.6M) and the five smaller clusters ($4.8M combined) were not reclassified because the underlying labels are too generic or too varied to assign with confidence. A targeted review of the 73 grants whose labels consist solely of "Other Direct Costs" or "Other" could yield additional resolution through examination of the budget justification narratives.
Technical Notes
Infrastructure
The extraction pipeline runs as part of the Foundation's sloan-ops operational automation platform. Source code is in the apps/budget-analysis/ directory of a private GitHub repository. Scripts are written in Python and import shared utilities for Salesforce authentication, Anthropic API calls, and Supabase database operations. The pipeline can be re-run for new fiscal years in under an hour.
Models and Costs
| Component | Technology | Cost |
|---|---|---|
| Template parsing | Python / openpyxl | $0 (no API calls) |
| PDF extraction | Claude Sonnet 4 via Anthropic Batch API | ~$0.05 per file |
| Label clustering | Claude Sonnet 4 (single prompt) | <$1 total |
| Data storage | Supabase PostgreSQL (Pro plan) | $25/month (shared) |
| Dashboard hosting | Cloudflare Pages | Free tier |
Total incremental cost for the complete three-year extraction: approximately $23.
Limitations
Coverage gaps. 73 grants (10%) could not be extracted — primarily grants with no budget file attached in Salesforce or with oversized/malformed PDF documents. These are disproportionately older (2023) and smaller grants.
Budgeted vs. actual expenditure. All figures represent budgeted allocations at the time of award, not actual expenditures. Grantees may reallocate between categories during the grant period. Actual spending is not currently captured in a structured format.
AI extraction confidence. The 506 AI-extracted budgets carry a model-assigned confidence score. Ninety-five percent scored 0.90 or above. A small number of edge cases — institutional operating budgets, film production budgets, and multi-workbook PDFs — required manual review or produced partial extractions.
Sub-award detail. Sub-award totals are captured for all grants, but the internal breakdown (what the sub-awardee spent on salary vs. travel vs. equipment) is extracted only where the sub-award budget appears in the same document. A dedicated sub-award extraction pass could increase granularity for the 126 grants with sub-awards.
Data Access
The interactive dashboard is available to Foundation staff at ops.sloan.org/budget/ (Okta authentication required). The underlying data can be queried directly via the Supabase SQL editor or REST API. Four pre-built SQL views provide summary aggregations by year, by program, by category percentage, and with the expanded Other Direct Costs subcategories.