At UNICEF's Project Connect, one of our core challenges was building a comprehensive global database of schools. Government data was often incomplete, outdated, or simply unavailable. OpenStreetMap (OSM) — the crowd-sourced geographic database — turned out to be one of the richest supplementary sources. But getting usable school data out of OSM is harder than it sounds.
This article documents the ETL pipeline I built to extract, transform, and load school data from OSM across multiple countries — including Albania and Ukraine — and the technical challenges that came with it.
The problem
OSM data is messy by nature. It's contributed by thousands of volunteer mappers with varying standards. School data specifically suffers from:
- Inconsistent classification: a school might be tagged as amenity=school, amenity=kindergarten, amenity=college, or even just building=school
- Multilingual content: school names in Arabic, Cyrillic, and Latin scripts, sometimes with multiple names per entry
- Duplicate records: the same school appearing as a node, a way, and a relation
- Missing or inconsistent metadata: education levels, student counts, and contact info are rarely standardized
The approach
The pipeline follows a three-phase methodology:
Extract: Data is pulled via the Overpass API, OSM's read-only query interface. We query for all elements carrying education-related tags within a target country's bounding box. This returns nodes (points), ways (ordered lists of nodes, which outline a building or campus when closed), and relations (groups of elements, such as multipolygons), and each of these represents a school differently.
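As a rough illustration of the extract step, the sketch below builds an Overpass QL query for the tags listed earlier and posts it to a public Overpass endpoint. The function names, the timeout values, and the choice of endpoint are assumptions made for this example, not the pipeline's actual code.

```python
import json
import urllib.parse
import urllib.request

# Public Overpass endpoint (assumption: the article does not say which instance was used).
OVERPASS_URL = "https://overpass-api.de/api/interpreter"


def build_school_query(south, west, north, east, timeout=180):
    """Build an Overpass QL query for education-related elements in a bounding box."""
    bbox = f"({south},{west},{north},{east})"  # Overpass order: south, west, north, east
    return (
        f"[out:json][timeout:{timeout}];\n"
        "(\n"
        # nwr matches nodes, ways, and relations in one statement.
        f'  nwr["amenity"~"^(school|kindergarten|college)$"]{bbox};\n'
        f'  nwr["building"="school"]{bbox};\n'
        ");\n"
        # "center" adds a single representative point to ways and relations.
        "out tags center;"
    )


def fetch_elements(query, url=OVERPASS_URL):
    """POST the query and return the list of OSM elements (performs a network call)."""
    payload = urllib.parse.urlencode({"data": query}).encode("utf-8")
    with urllib.request.urlopen(url, data=payload, timeout=300) as response:
        return json.load(response)["elements"]
```

For example, `build_school_query(39.6, 19.2, 42.7, 21.1)` covers an approximate bounding box around Albania. A bounding box also captures schools just across the border, so results would still need a country filter afterwards.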
Transform: This is where most of the complexity lives. The pipeline normalizes different OSM element types into a single schema, handles multilingual name fields, classifies education levels from inconsistent tags, runs duplicate detection using geographic proximity and name similarity, and validates coordinates and metadata.
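The first part of the transform step, flattening nodes, ways, and relations into one schema, might look like the following minimal sketch. It assumes the Overpass JSON was requested with a `center` point for ways and relations; `SchoolRecord` and `normalize_element` are hypothetical names, and the real schema carries more fields than shown here.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SchoolRecord:
    """Hypothetical unified record; the production schema is not given in the article."""
    osm_type: str
    osm_id: int
    lat: float
    lon: float
    name: Optional[str]
    tags: dict = field(default_factory=dict)


def normalize_element(element):
    """Flatten one Overpass JSON element; return None if it has no usable position."""
    if element["type"] == "node":
        # Nodes carry their coordinates directly.
        lat, lon = element.get("lat"), element.get("lon")
    else:
        # Ways and relations only have a position if "center" was requested.
        center = element.get("center", {})
        lat, lon = center.get("lat"), center.get("lon")
    if lat is None or lon is None:
        return None
    # Basic coordinate validation: reject values outside the valid ranges.
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return None
    tags = element.get("tags", {})
    return SchoolRecord(element["type"], element["id"], lat, lon, tags.get("name"), tags)
```

Keeping the original `osm_type` and `osm_id` on each record makes it possible to trace a cleaned record back to its source element later.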
Load: Clean records are loaded into a standardized database format compatible with Project Connect's global school registry. The schema accommodates regional variations while enforcing a minimum set of required fields.
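Enforcing a minimum set of required fields before loading could be sketched as below. The field names in `REQUIRED_FIELDS` are purely hypothetical placeholders, since the registry's actual schema is not described here.

```python
# Hypothetical required fields; the registry's real schema is not given in this article.
REQUIRED_FIELDS = ("source_id", "name", "lat", "lon", "country_code")


def missing_fields(record):
    """Return the required fields that are absent or empty in a record dict."""
    return [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]


def partition_records(records):
    """Split records into (loadable, rejected) based on the required fields."""
    loadable, rejected = [], []
    for record in records:
        (rejected if missing_fields(record) else loadable).append(record)
    return loadable, rejected
```

Returning the rejected records rather than dropping them silently keeps them available for manual review or a later enrichment pass.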
Key findings
Data quality varies dramatically by region. Countries with active OSM mapping communities had significantly better coverage and consistency. Duplicate detection was essential — without it, school counts were inflated by 15-30% in some regions.
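To make the duplicate check concrete, here is a minimal sketch that combines geographic proximity with name similarity. The 150 m and 0.85 thresholds are illustrative assumptions rather than the values used in the pipeline, `difflib.SequenceMatcher` stands in for whatever string metric was actually used, and each record is assumed to be a dict with `lat`, `lon`, and a string `name`.

```python
import math
from difflib import SequenceMatcher


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    r = 6371000.0  # mean Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def name_similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def is_duplicate(a, b, max_distance_m=150.0, min_similarity=0.85):
    """Treat two records as duplicates if they are both close and similarly named."""
    close = haversine_m(a["lat"], a["lon"], b["lat"], b["lon"]) <= max_distance_m
    return close and name_similarity(a["name"], b["name"]) >= min_similarity


def deduplicate(records):
    """Keep the first record of each duplicate group (quadratic; fine for a sketch)."""
    kept = []
    for record in records:
        if not any(is_duplicate(record, k) for k in kept):
            kept.append(record)
    return kept
```

At country scale the pairwise comparison would need a spatial index or grid bucketing so that only nearby records are compared.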
The education level classifier achieved reasonable accuracy but exposed a fundamental limitation: OSM's tagging schema wasn't designed for the granularity that educational planning requires. A "school" in OSM could be anything from a preschool to a university.
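A rule-based classifier along these lines is one way to picture the problem. This is a sketch, not the classifier described above: it prefers the `isced:level` tag where mappers have filled it in and otherwise falls back to coarse labels derived from `amenity`, which is exactly where the granularity runs out. The label strings and function names are assumptions.

```python
import re

# Coarse fallback labels keyed by the amenity value (hypothetical label names).
AMENITY_LEVELS = {
    "kindergarten": "early childhood",
    "school": "unspecified school",
    "college": "post-secondary",
    "university": "tertiary",
}

# ISCED 2011 levels 0 to 3, the range relevant to schools.
ISCED_LABELS = {0: "early childhood", 1: "primary", 2: "lower secondary", 3: "upper secondary"}


def parse_isced(value):
    """Expand an isced:level value such as "1;2" or "1-3" into a sorted list of ints."""
    levels = set()
    for part in re.split(r"[;,]", value):
        part = part.strip()
        match = re.fullmatch(r"(\d)\s*-\s*(\d)", part)
        if match:
            levels.update(range(int(match.group(1)), int(match.group(2)) + 1))
        elif part.isdigit():
            levels.add(int(part))
    return sorted(levels)


def classify_levels(tags):
    """Return a list of education-level labels for one element's tags."""
    isced = parse_isced(tags.get("isced:level", ""))
    if isced:
        return [ISCED_LABELS.get(level, f"ISCED {level}") for level in isced]
    amenity = tags.get("amenity")
    if amenity in AMENITY_LEVELS:
        return [AMENITY_LEVELS[amenity]]
    if tags.get("building") == "school":
        return ["unspecified school"]
    return ["unknown"]
```

With only `amenity=school` to go on, the best available answer is "unspecified school", which illustrates why the tag schema alone cannot separate primary from secondary schools.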
Multilingual handling proved critical for countries like Ukraine, where school names appear in both Ukrainian and Russian, and for countries in the MENA region with Arabic-script names that need careful normalization.
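The sketch below shows one possible shape for the multilingual handling: collecting OSM's `name` and `name:<lang>` tags into one mapping, and deriving a comparison key that applies Unicode NFKC normalization and strips Arabic short-vowel marks and tatweel. The function names and the exact set of stripped characters are assumptions; the key is meant for matching only, not for display.

```python
import re
import unicodedata

# Arabic harakat (U+064B to U+0652), superscript alef (U+0670), and tatweel (U+0640).
ARABIC_MARKS = re.compile("[\u064B-\u0652\u0670\u0640]")

# Matches keys such as name:uk, name:ru, or name:uk-Latn, but not name:etymology.
LANG_NAME_KEY = re.compile(r"name:([a-z]{2,3}(?:-[A-Za-z0-9]+)*)")


def collect_names(tags):
    """Return {language: name}; the plain name tag is stored under 'default'."""
    names = {}
    for key, value in tags.items():
        if key == "name":
            names["default"] = value
        else:
            match = LANG_NAME_KEY.fullmatch(key)
            if match:
                names[match.group(1)] = value
    return names


def match_key(name):
    """Normalize a name for comparison: NFKC, no Arabic marks, casefolded, single spaces."""
    text = unicodedata.normalize("NFKC", name)
    text = ARABIC_MARKS.sub("", text)
    return " ".join(text.casefold().split())
```

Stripping only a targeted set of Arabic marks, rather than every combining character, avoids damaging letters such as the Ukrainian "й" and "ї", whose decomposed forms also contain combining marks.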
Takeaway
Crowd-sourced geographic data is powerful but requires significant engineering to make usable at scale. The pipeline processed school records across multiple nations and became part of Project Connect's broader data infrastructure — contributing to the platform that now tracks connectivity for 2.1M+ schools globally.
The full technical article is available on Zenodo.