Executive Summary
Until now, answering a question as basic as "how much of our grant portfolio goes to travel?" required opening hundreds of individual budget spreadsheets. This report presents the results of a systematic extraction of budget data from 658 grants awarded between 2023 and 2025, covering 90% of the Foundation's G-prefix (non-fellowship) portfolio and representing $253 million in total budgeted costs.
The extraction was performed using a two-track pipeline: a deterministic parser for the Foundation's standard Excel budget template (23% of grants) and an AI-powered extractor using Claude for PDF budgets (77% of grants). The results are stored in a queryable database and presented through an interactive dashboard. This report describes the method, presents the findings, and notes areas where further classification is possible.
Method
Data Source
Grant records were queried from the Foundation's Salesforce Nonprofit Cloud instance via the REST API. The query targeted all records with a G-prefix grant number (excluding Sloan Research Fellowships) awarded in fiscal years 2023, 2024, and 2025. This yielded 731 grants: 248 in 2023, 231 in 2024, and 252 in 2025.
For each grant, attached files (ContentDocuments in Salesforce) were examined to identify the budget. File selection used a heuristic: filenames containing "budget" or "b&e" were preferred; when multiple budget files existed, the most recently uploaded version was used. Budget files were found on 690 of 731 grants (94%). Of those, 152 were Excel spreadsheets (.xlsx) and 538 were PDFs.
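The selection heuristic can be stated compactly. A simplified sketch, in which a `(filename, uploaded_at)` tuple stands in for a Salesforce ContentDocument record (the helper name is illustrative, not the production code):

```python
def pick_budget_file(files):
    """Select the most likely budget attachment from a grant's files.

    `files` is a list of (filename, uploaded_at) tuples -- a simplified
    stand-in for Salesforce ContentDocument records.
    """
    candidates = [
        (name, uploaded) for name, uploaded in files
        if "budget" in name.lower() or "b&e" in name.lower()
    ]
    if not candidates:
        return None
    # When several budget files exist, prefer the most recent upload.
    return max(candidates, key=lambda f: f[1])[0]
```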
Extraction Pipeline
At a glance: 3 fiscal years · 94% budget-file hit rate · two-track pipeline · PostgreSQL storage · interactive dashboard.
Track 1: Deterministic Template Parsing
The Foundation's standard budget template (Version 7.5, last modified January 2025) uses a fixed structure derived from the NSF/NIH budget format. The template has known cell locations for each budget category: salaries at row 34, fringe benefits at row 39, travel at rows 45–46, and so on through indirect costs at row 79 and the grand total at row 94. Year-by-year amounts appear in columns O, V, and AC; totals in column AJ.
A Python parser using openpyxl reads these cells directly. Template detection checks that cell A1 contains "Alfred P Sloan Foundation" and A2 contains "Budget." The parser also handles a two-year template variant with a different row layout for overhead-exempt costs and totals. When the parser detects the template, extraction is deterministic: no interpretation, no API calls, sub-second processing, and results accurate to the penny.
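A sketch of the parser logic, with the cell map reduced to a handful of categories (the production map covers every template row). The worksheet is abstracted to a plain `{cell_ref: value}` mapping so the sketch does not depend on openpyxl; with openpyxl one would populate that mapping from `ws[ref].value` on the loaded worksheet:

```python
# Cell map for the standard template (v7.5) -- an illustrative subset
# of the full category list described in the report.
CATEGORY_ROWS = {
    "salaries": 34,
    "fringe_benefits": 39,
    "travel_domestic": 45,
    "travel_international": 46,
    "indirect_costs": 79,
    "grand_total": 94,
}
YEAR_COLUMNS = ("O", "V", "AC")   # year 1..3 amounts
TOTAL_COLUMN = "AJ"               # all-years total

def is_sloan_template(cells):
    """Detect the standard template from its fixed header cells."""
    return ("Alfred P Sloan Foundation" in str(cells.get("A1", ""))
            and "Budget" in str(cells.get("A2", "")))

def parse_template(cells):
    """Read known cell locations from a {cell_ref: value} mapping."""
    if not is_sloan_template(cells):
        return None
    return {
        category: {
            "by_year": [cells.get(f"{col}{row}", 0) or 0 for col in YEAR_COLUMNS],
            "total": cells.get(f"{TOTAL_COLUMN}{row}", 0) or 0,
        }
        for category, row in CATEGORY_ROWS.items()
    }
```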
All 152 Excel files matched the standard template (including the two-year variant). Extracted totals were validated against Salesforce award amounts and matched to the cent in every case tested.
Track 2: AI-Powered PDF Extraction
The remaining 538 budget files were PDFs, of which 506 were ultimately extracted — overwhelmingly the same Sloan template printed to PDF, not a different budget format. A sample of 10 PDFs found that 9 were printed Sloan templates (versions 7.0–7.5) and only 1 used a genuinely custom format (a film production budget).
These PDFs were sent to the Anthropic API (Claude Sonnet 4) as base64-encoded documents with a structured extraction prompt. The prompt instructed the model to identify the Sloan template layout, extract amounts by section letter (A through L), and return JSON conforming to the Foundation's budget taxonomy. For non-template PDFs, the model mapped line items to the closest standard category and assigned a confidence score.
Extraction was performed via the Anthropic Batch API in batches of 150–200 PDFs, with results polled and retrieved programmatically. Success rate: 97% on the first pass. Failures were primarily oversized files (>15MB) or image-only PDFs without a text layer. Total API cost across all three years: $22.46.
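A sketch of the request construction. The model ID and prompt text are abbreviated stand-ins for the production values; the payload shape follows the Anthropic Message Batches format of one `custom_id` plus Messages-API params per request:

```python
import base64
import os

# Illustrative stand-ins, not the exact production values.
MODEL = "claude-sonnet-4-20250514"
EXTRACTION_PROMPT = (
    "Identify the Sloan budget template layout, extract amounts by "
    "section letter (A-L), and return JSON conforming to the budget taxonomy."
)

def build_batch_requests(pdf_paths):
    """Build one Message Batches request per PDF, keyed by filename."""
    requests = []
    for path in pdf_paths:
        with open(path, "rb") as f:
            data = base64.standard_b64encode(f.read()).decode("ascii")
        requests.append({
            "custom_id": os.path.basename(path),
            "params": {
                "model": MODEL,
                "max_tokens": 4096,
                "messages": [{
                    "role": "user",
                    "content": [
                        # The PDF travels as a base64 document block.
                        {"type": "document",
                         "source": {"type": "base64",
                                    "media_type": "application/pdf",
                                    "data": data}},
                        {"type": "text", "text": EXTRACTION_PROMPT},
                    ],
                }],
            },
        })
    return requests
```

Submission and polling then go through the SDK: `client.messages.batches.create(requests=...)`, followed by periodic `retrieve` calls until processing ends.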
Data Storage and Validation
Extracted data is stored in a Supabase (PostgreSQL) database in two tables: budget_extractions (one row per grant, with metadata including extraction method, file type, and confidence score) and budget_line_items (one row per budget category per grant, with the original label from the source file preserved alongside the standardized category code). Every record carries an extraction_method field (template or ai) for provenance tracking.
Automated validation checks compare the sum of extracted line items against the document's stated total and against the Salesforce award amount. Discrepancies greater than 5% are flagged for review. For AI extractions, a model-assigned confidence score (0–1) is stored; extractions below 0.7 are flagged for manual verification.
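These checks reduce to a small amount of arithmetic. A hypothetical sketch, with field names and default thresholds mirroring the description above:

```python
def validate_extraction(line_items_total, stated_total, award_amount,
                        confidence=None, tolerance=0.05, min_confidence=0.7):
    """Return a list of review flags for one extracted budget.

    Mirrors the checks described above; names are illustrative, not the
    production schema.
    """
    flags = []
    # Compare the line-item sum against each available reference total.
    for label, reference in [("stated_total", stated_total),
                             ("award_amount", award_amount)]:
        if reference and abs(line_items_total - reference) / reference > tolerance:
            flags.append(f"sum_mismatch_vs_{label}")
    # AI extractions below the confidence floor go to manual review.
    if confidence is not None and confidence < min_confidence:
        flags.append("low_confidence")
    return flags
```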
Budget Taxonomy
All budget line items are classified into a standardized taxonomy derived from the Foundation's template structure and the NSF/NIH budget format. The taxonomy contains 20 template-derived categories spanning personnel costs, direct costs, overhead-exempt costs, indirect costs, and sub-awards, plus three AI-derived subcategories (marked ✦ in the table below).
| Category | Template Section | Portfolio Total | % of Total | Grants |
|---|---|---|---|---|
| Salaries & Wages | A+B | $90.0M | 35.6% | 526 |
| Sub-awards | J | $32.1M | 12.7% | 126 |
| Indirect Costs (F&A) | I | $24.0M | 9.5% | 404 |
| Other Direct Costs | F.7 | $16.8M | 6.6% | 253 |
| Fringe Benefits | C | $15.9M | 6.3% | 406 |
| Consultant Services | F.4 | $11.8M | 4.7% | 194 |
| Travel — Domestic | D.1 | $7.4M | 2.9% | 316 |
| Film & Media Production ✦ | F.7 → AI | $6.5M | 2.6% | 33 |
| Participant Stipends | E.1 | $5.8M | 2.3% | 79 |
| Prizes, Awards & Fellowships ✦ | F.7 → AI | $5.5M | 2.2% | 28 |
| Materials & Supplies | F.1 | $5.2M | 2.1% | 229 |
| Participant Subsistence | E.3 | $4.7M | 1.9% | 93 |
| Participant Travel | E.2 | $4.4M | 1.7% | 130 |
| Events & Meetings ✦ | F.7 → AI | $3.8M | 1.5% | 104 |
| Honoraria | F.3 | $3.7M | 1.5% | 116 |
| Other Overhead-Exempt | G.2 | $3.0M | 1.2% | 29 |
| Tuition | G.1 | $2.6M | 1.0% | 66 |
| Participant Other | E.4 | $2.4M | 0.9% | 58 |
| Equipment | F.6 | $2.2M | 0.9% | 72 |
| Computer Services | F.5 | $1.9M | 0.8% | 88 |
| Publication Costs | F.2 | $1.8M | 0.7% | 140 |
| Travel — International | D.2 | $0.9M | 0.4% | 72 |
| Overhead on Sub-awards | K | $0.7M | 0.3% | 81 |
✦ Subcategory derived from AI classification of the "Other Direct Costs" (F.7) write-in field. Original labels preserved; classification is reversible. See "Partial Classification" below.
Findings
Portfolio Composition
Personnel costs (salaries plus fringe benefits) account for 42% of total budgeted costs — consistent with a research-focused portfolio. The second-largest allocation is sub-awards at 12.7%, concentrated in 126 multi-institution grants. Indirect costs (facilities and administration) come to 9.5% of total budgeted costs, well below the Foundation's 20% cap. The gap reflects two structural factors: grants of $50,000 or less carry no indirect costs by Foundation policy, and the 20% rate applies to Modified Total Direct Costs (MTDC), which excludes tuition, participant support, equipment, and the portion of each sub-award above $25,000.
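As a minimal sketch of that arithmetic: the 20% rate, the $50,000 no-indirect floor, and the exclusions come from the policy described above; the category names and the function itself are illustrative, and sub-awards are assumed to count toward MTDC only up to their first $25,000 each.

```python
# Categories excluded from Modified Total Direct Costs under the policy
# described above (illustrative codes, not the database's actual values).
MTDC_EXCLUSIONS = {"tuition", "participant_support", "equipment"}
SUBAWARD_MTDC_CAP = 25_000    # only the first $25K of each sub-award counts
INDIRECT_RATE = 0.20
NO_INDIRECT_THRESHOLD = 50_000

def max_indirect(direct_costs, subawards=()):
    """Compute the maximum allowable indirect charge.

    `direct_costs` maps category -> amount; `subawards` lists individual
    sub-award amounts (kept separate from the category map).
    """
    total = sum(direct_costs.values()) + sum(subawards)
    if total <= NO_INDIRECT_THRESHOLD:
        return 0.0   # small grants carry no indirect costs
    mtdc = sum(v for k, v in direct_costs.items() if k not in MTDC_EXCLUSIONS)
    mtdc += sum(min(s, SUBAWARD_MTDC_CAP) for s in subawards)
    return INDIRECT_RATE * mtdc
```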
Program-Level Patterns
| Program | Total Budget | Grants | Distinctive Pattern |
|---|---|---|---|
| Research | $114.9M | 273 | Salary-heavy (45% of direct); highest fringe and tuition allocations |
| Public Understanding | $50.9M | 159 | Concentrates nearly all media production spending; highest consultant share |
| Higher Education | $35.7M | 81 | Largest participant support (stipends, subsistence); significant prizes/awards |
| Technology | $30.3M | 77 | Highest sub-award share (multi-institution collaborations); computing costs |
| Energy & Environment | $14.2M | 42 | Above-average travel allocation; equipment-intensive fieldwork grants |
| Economics | $7.0M | 26 | Smallest program; highest salary-to-total ratio |
Year-over-Year
| Fiscal Year | Total Budget | Grants | Avg Grant | Coverage |
|---|---|---|---|---|
| 2023 | $83.9M | 218 | $385K | 88% |
| 2024 | $76.4M | 209 | $366K | 90% |
| 2025 | $92.7M | 231 | $401K | 92% |
Partial Classification of "Other Direct Costs"
The Foundation's budget template includes a write-in field for "Other Direct Costs" (section F.7), where grantees describe expenses that don't fit the predefined categories. Across 658 grants, $32.7 million (14% of direct costs) was entered in this field, making it in aggregate the second-largest category in the portfolio after salaries.
To understand what this bucket actually contains, the original labels from all 654 line items were clustered using AI classification (Claude Sonnet 4). Labels such as "VFX and post-production," "3D conversion," and "music licensing" clustered naturally into Film & Media Production; "conference costs," "workshop expenses," and "venue rental" into Events & Meetings; and "seed grants," "fellowship awards," and "Scialog prizes" into Prizes, Awards & Fellowships.
| Cluster | Amount | % of F.7 | Items | Examples |
|---|---|---|---|---|
| Administrative & Operating | $9.6M | 29% | 219 | Often the residual — "Other Direct Costs" with no further detail |
| Film & Media Production ✦ | $6.5M | 20% | 100 | VFX, post-production, music licensing, aerial units |
| Prizes, Awards & Fellowships ✦ | $5.5M | 17% | 40 | Scialog awards, seed grants, fellowship stipends |
| Events & Meetings ✦ | $3.8M | 12% | 145 | Conference costs, venue rental, workshop expenses |
| Facilities & Occupancy | $2.5M | 8% | 30 | Lab space rental, office costs, renovation |
| All other clusters | $4.8M | 14% | 120 | Marketing, specialized services, participant incentives |
Three clusters (marked with ✦) were promoted to named subcategories in the database. The reclassification is reversible: the original grantee-supplied label is preserved on every record, and each reclassified item carries a note indicating the AI-derived provenance. The interactive dashboard includes a toggle to collapse these subcategories back into the original "Other Direct Costs" aggregate for comparison with external benchmarks.
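The dashboard's collapse toggle amounts to a simple re-aggregation. A sketch, with illustrative category codes rather than the database's actual values:

```python
# AI-derived subcategories that roll back up into "Other Direct Costs"
# (F.7) when the dashboard toggle is off.
F7_SUBCATEGORIES = {"film_media_production", "prizes_awards_fellowships",
                    "events_meetings"}

def collapse_f7(totals):
    """Fold the AI-derived subcategories back into other_direct_costs.

    `totals` maps category code -> dollar amount.
    """
    collapsed = {k: v for k, v in totals.items() if k not in F7_SUBCATEGORIES}
    reclassified = sum(totals.get(k, 0) for k in F7_SUBCATEGORIES)
    collapsed["other_direct_costs"] = (
        totals.get("other_direct_costs", 0) + reclassified)
    return collapsed
```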
Further classification is possible. The "Administrative & Operating" cluster ($9.6M) and the five smaller clusters ($4.8M combined) were not reclassified because the underlying labels are too generic or too varied to assign with confidence. A targeted review of the 73 grants whose labels consist solely of "Other Direct Costs" or "Other" could yield additional resolution through examination of the budget justification narratives.
Technical Notes
Infrastructure
The extraction pipeline runs as part of the Foundation's sloan-ops operational automation platform. Source code is in the apps/budget-analysis/ directory of a private GitHub repository. Scripts are written in Python and import shared utilities for Salesforce authentication, Anthropic API calls, and Supabase database operations. The pipeline can be re-run for new fiscal years in under an hour.
Models and Costs
| Component | Technology | Cost |
|---|---|---|
| Template parsing | Python / openpyxl | $0 (no API calls) |
| PDF extraction | Claude Sonnet 4 via Anthropic Batch API | ~$0.05 per file |
| Label clustering | Claude Sonnet 4 (single prompt) | <$1 total |
| Data storage | Supabase PostgreSQL (Pro plan) | $25/month (shared) |
| Dashboard hosting | Cloudflare Pages | Free tier |
Total incremental cost for the complete three-year extraction: approximately $23.
Limitations
Coverage gaps. 73 grants (10%) could not be extracted — primarily grants with no budget file attached in Salesforce or with oversized/malformed PDF documents. These are disproportionately older (2023) and smaller grants.
Budgeted vs. actual expenditure. All figures represent budgeted allocations at the time of award, not actual expenditures. Grantees may reallocate between categories during the grant period. Actual spending is not currently captured in a structured format.
AI extraction confidence. The 506 AI-extracted budgets carry a model-assigned confidence score. Ninety-five percent scored 0.90 or above. A small number of edge cases — institutional operating budgets, film production budgets, and multi-workbook PDFs — required manual review or produced partial extractions.
Sub-award detail. Sub-award totals are captured for all grants, but the internal breakdown (what the sub-awardee spent on salary vs. travel vs. equipment) is extracted only where the sub-award budget appears in the same document. A dedicated sub-award extraction pass could increase granularity for the 126 grants with sub-awards.
Data Access
The interactive dashboard is available to Foundation staff at ops.sloan.org/budget/ (Okta authentication required). The underlying data can be queried directly via the Supabase SQL editor or REST API. Four pre-built SQL views provide summary aggregations by year, by program, by category percentage, and with the expanded Other Direct Costs subcategories.