Stop Building Geospatial Pipelines from Scratch! Use GeoFlo Instead
What if I told you that data scientists worldwide are wasting 40+ hours per project on the same repetitive task? Not model training. Not hyperparameter tuning. Something far more mundane—and far more costly.
I'm talking about geospatial feature extraction.
Every environmental researcher, every poverty prediction team, every urban planner working with machine learning eventually hits the same wall. You need satellite imagery statistics. You need land cover ratios. You need building footprints, night time lights, road networks. And you need them all aggregated to the exact same administrative boundaries your labels use.
So you open Google Earth Engine. You write batch export scripts. You wait. You download from Google Drive. You merge CSV files. You reproject, standardise, join, clean. By the time you're ready to actually model, your energy is drained, your timeline is shot, and your code is a fragile mess of one-off scripts.
Here's the secret top geospatial ML teams know: they stopped building this pipeline years ago.
They use GeoFlo—an automated, modular, battle-tested pipeline for downloading, preprocessing, and extracting geospatial features from Earth Observation and survey data. And in this deep dive, I'm going to show you exactly how to go from zero to production-ready features in four commands.
Ready to reclaim your time? Let's get into it.
What is GeoFlo?
GeoFlo is an open-source geospatial data pipeline created by the Data Science Unit—a team that clearly understands the pain of operationalising Earth Observation data for machine learning. The repository lives at github.com/Data-Science-Unit/GeoFlo and has quietly become one of the most practical tools for admin-level ML tasks like poverty prediction, socioeconomic analysis, and environmental monitoring.
But why is it trending now?
Three forces are converging. First, Google Earth Engine has matured into a production-grade platform, but its Python API remains notoriously finicky for batch workflows. Second, the explosion of foundation models for remote sensing means researchers need standardised feature pipelines more than ever—not custom scripts per project. Third, organisations like the World Bank, UN agencies, and national statistics offices are racing to build machine learning models for Sustainable Development Goals, all requiring consistent admin-level features across countries.
GeoFlo solves this by abstracting the entire pipeline into four deterministic stages: Data Download → Preprocessing → Feature Extraction → Feature Labelling. Each stage is a standalone Python script with YAML configuration. No hidden state. No magic. Just clean, reproducible, auditable geospatial engineering.
The architecture is deliberately modular. Swap data sources. Change admin levels. Retarget to a new country. Everything is configuration-driven, which means your "code" becomes portable documentation that any team member can understand.
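To make the four-stage flow concrete, here is a minimal sketch of a wrapper that builds the command line for each stage. The script and config file names are the ones quoted in this article; the `STAGES` table and the `stage_command` helper are hypothetical and not part of GeoFlo.

```python
import sys
from pathlib import Path

# Script names as quoted in this article; config file names are the three
# YAML files described later. The stage keys are made up for this sketch.
STAGES = [
    ("download", "1_data_download.py", "download_config.yml"),
    ("preprocess", "2_data_preprocessing.py", "download_config.yml"),
    ("features", "3_create_admin_features.py", "feature_extraction_config.yml"),
    ("label", "label_features_with_population.py", "population_labelling_config.yml"),
]


def stage_command(stage, config_dir, verbose=False):
    """Return the argv list for one stage (hypothetical helper)."""
    for name, script, config in STAGES:
        if name == stage:
            cmd = [sys.executable, script, str(Path(config_dir) / config)]
            # The article documents --verbose for the download script only.
            if verbose and name == "download":
                cmd.append("--verbose")
            return cmd
    raise ValueError(f"unknown stage: {stage}")
```

You would hand the result to `subprocess.run(cmd, check=True)` one stage at a time. The stages are deliberately not chained, because the GEE exports have to be moved out of Google Drive by hand between the download and preprocessing stages.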
Key Features That Make GeoFlo Insane
Let's dissect what makes this pipeline genuinely powerful—not just convenient.
Multi-Source Earth Observation Integration
GeoFlo doesn't lock you into one satellite or one dataset. It orchestrates five distinct data sources simultaneously: Landsat for surface reflectance statistics, ESRI Land Cover for classification ratios, VIDA building footprints for structural density metrics, Night Time Lights for economic activity proxies, and OpenStreetMap for infrastructure and amenity features. Each source has dedicated download modules that handle source-specific quirks: Landsat's cloud masking, OSM's tagging complexity, NTL's annual compositing.
Google Earth Engine Batch Orchestration
The pipeline doesn't just "use" GEE; it makes the batch export lifecycle explicit. It submits export tasks, expects the outputs to land in Google Drive, and provides clear monitoring instructions. This matters because GEE batch exports are asynchronous by design: the script returns once the tasks are queued, and the files still have to be fetched from Drive by hand, which is the step that tends to trip people up. GeoFlo spells that step out and keeps it manageable.
Administrative Level Flexibility
Whether you're modelling at ADM1 (provinces), ADM2 (districts), or finer granularities, the pipeline adapts through configuration. The standardised ADM{level}_PCODE / ADM{level}_PT naming convention ensures your features join cleanly with virtually any demographic or survey dataset.
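As a small illustration of that naming convention, the sketch below builds the expected column names for a given admin level and reports which ones are missing from a CSV header. Both helper names are hypothetical; only the `ADM{level}_PCODE` / `ADM{level}_PT` pattern comes from the text above.

```python
def admin_columns(level):
    """Column names for one admin level, following ADM{level}_PCODE / ADM{level}_PT."""
    if not isinstance(level, int) or level < 0:
        raise ValueError("admin level must be a non-negative integer")
    return {"code": f"ADM{level}_PCODE", "name": f"ADM{level}_PT"}


def missing_admin_columns(header, level):
    """Return the expected admin columns that are absent from a CSV header."""
    expected = admin_columns(level).values()
    return [col for col in expected if col not in header]
```

Running a check like this on your label file before Stage 4 is a cheap way to catch a mismatched admin level before the join silently drops rows.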
Geographic Preprocessing Automation
Reprojection to EPSG:4326, CRS standardisation, batch file merging: it's all handled in Stage 2. The preprocessing module consolidates GEE's fragmented tile outputs into clean, admin-level datasets ready for feature engineering.
Pluggable Labelling Architecture
The optional Stage 4 isn't hardcoded to population. The reference implementation shows the pattern: join any label dataset on ADM{level}_PCODE. Wealth indices, survey scores, health indicators, crop yields—if it has an admin code, it integrates.
Deterministic, Logged Execution
Every download run generates timestamped logs. Use --verbose for DEBUG output. This isn't just debugging convenience; it's audit trail compliance for research reproducibility.
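If you want the same behaviour in your own helper scripts, the pattern is easy to reproduce with the standard library. This is a sketch of the general technique, not GeoFlo's actual logging code: the `logs` directory, the `download_<timestamp>.log` file name, and both function names are assumptions.

```python
import argparse
import logging
from datetime import datetime
from pathlib import Path


def build_parser():
    """CLI with a positional config path and a --verbose switch."""
    parser = argparse.ArgumentParser(description="Hypothetical stage runner")
    parser.add_argument("config", help="path to a YAML config file")
    parser.add_argument("--verbose", action="store_true", help="log at DEBUG level")
    return parser


def configure_logging(verbose, log_dir="logs"):
    """Log to a timestamped file and the console; return the log file path."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    log_path = Path(log_dir) / f"download_{stamp}.log"
    logging.basicConfig(
        level=logging.DEBUG if verbose else logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
        handlers=[logging.FileHandler(log_path), logging.StreamHandler()],
        force=True,  # replace handlers left over from an earlier basicConfig call
    )
    return log_path
```

One log file per run, named by start time, means you can always tie a feature CSV back to the exact run that produced it.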
Use Cases: Where GeoFlo Absolutely Dominates
1. Poverty Prediction from Satellite Imagery
The original killer app. Combine night time lights (economic activity proxy), building footprints (housing quality), land cover (agricultural dependence), and OSM road density (market access) to predict subnational poverty measures. Because GeoFlo aggregates to admin units, your features line up directly with census data or with survey indicators (DHS, for example) that have been aggregated to the same admin level.
2. Post-Disaster Damage Assessment
When hurricanes or earthquakes strike, you need rapid feature extraction across affected regions. GeoFlo's modular design lets you swap in pre-event and post-event Landsat imagery, extract change features at ADM2 level, and feed directly into damage classification models.
3. Urban Growth Monitoring
Track expansion of built-up areas by running the pipeline annually with VIDA building footprints and ESRI land cover. The consistent admin-level outputs enable time-series analysis of urbanisation rates, supporting SDG 11 indicators.
4. Agricultural Yield Estimation
Replace population labelling with crop survey data. Use Landsat vegetation indices, land cover ratios, and OSM irrigation infrastructure as predictors. The pipeline's flexibility means rural ADM2 units in your country of interest get the exact feature set your yield model needs.
5. Healthcare Accessibility Modelling
OSM-derived road networks and amenity locations, combined with population distribution, enable travel time estimation to nearest health facilities. GeoFlo preps the foundational layers; your routing engine does the rest.
Step-by-Step Installation & Setup Guide
Let's get GeoFlo running on your machine. The setup is intentionally lightweight—you're building a pipeline, not fighting dependency hell.
Prerequisites
- Python 3.x
- A valid Google Cloud project with Earth Engine access (if using GEE data sources)
- Basic familiarity with YAML configuration
Environment Setup
The maintainers recommend starting with geoai-py, which bundles most geospatial dependencies:
# Install the recommended base environment
pip install geoai-py
This gives you geopandas, earthengine-api, and related libraries in one shot. Then add the remaining dependency:
pip install pyyaml
Alternative: If you prefer minimal installations or have conflicts, install individually:
pip install earthengine-api geopandas pandas pyyaml
Repository Setup
Clone the repository and navigate into it:
git clone https://github.com/Data-Science-Unit/GeoFlo.git
cd GeoFlo
Google Earth Engine Authentication
Before running downloads, authenticate with GEE:
earthengine authenticate
Follow the OAuth flow in your browser. This creates credentials that the pipeline uses for batch exports.
Configuration Files
GeoFlo uses three YAML files that you'll edit for your project:
| Config File | Purpose |
|---|---|
| download_config.yml | Controls Stage 1 (download) and Stage 2 (preprocessing) |
| feature_extraction_config.yml | Controls Stage 3 (feature extraction) |
| population_labelling_config.yml | Controls Stage 4 (optional labelling) |
Copy the provided templates and customise for your country, year, and admin levels. The detailed structure is covered in the next section with actual code.
REAL Code Examples from the Repository
Now for the meat. Let me walk you through actual code from GeoFlo's README, with detailed explanations of how each piece works.
Example 1: Download Configuration (download_config.yml)
This YAML file is the control centre for Stages 1 and 2. Here's the exact structure from the repository:
earth_engine_info:
  project_name: "<your-project-name>" # Your GEE project identifier
  drive_folder: "<your-drive-folder>" # Google Drive folder for exports
boundary_information:
  boundary_path: "<path-to-boundary-shapefile>" # National boundaries
  admin_level_path: "<path-to-admin-level-shapefile>" # Subnational boundaries
  country: "<country-name>" # Human-readable country name
  country_iso: "<ISO-code>" # ISO 3166-1 alpha-3 code (e.g., "KEN" for Kenya)
  year: <year> # Analysis year for temporal consistency
landsat_download:
  download: true # Toggle: set false to skip this source
  year: <year> # Override year for Landsat specifically
  admin_levels:
    - <level-1> # e.g., 1 for ADM1 (provinces)
    - <level-2> # e.g., 2 for ADM2 (districts)
land_class_download:
  download: true
  year: <year>
  admin_levels:
    - <level-1>
    - <level-2>
buildings_download:
  download: true
  admin_levels:
    - <level-1>
    - <level-2>
  ee_file_name: "<ee-file-name>" # Specific GEE asset name for buildings
night_time_light_download:
  download: true
  year: <year>
  admin_levels:
    - <level-1>
    - <level-2>
osm_download:
  download: true
  output_path: "<path-to-output-directory>" # Local path for OSM data
pre_processing_info:
  crs: "EPSG:4326" # Standard WGS84 coordinate system
  gee_downloads_dir: "<path-to-gee-downloads-dir>" # Where YOU put GEE exports
  base_downloads_dir: "<path-to-base-downloads-dir>" # Output for merged data
  admin_levels:
    - <level-1>
    - <level-2>
  year: <year>
  delete_original_data: false # Safety: keep raw files for debugging
Why this design matters: Each data source has its own toggle (download: true/false), so you can iterate without re-downloading everything. The admin_levels list lets you extract multiple granularities in one run. The delete_original_data flag protects you from accidental data loss during experimentation.
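To see how those toggles behave once the file is parsed, here is a minimal sketch that lists the enabled sources. It assumes each `*_download` section is a mapping with a boolean `download` key, as in the listing above. The `enabled_sources` helper and the example values are made up, and the plain dict stands in for what `yaml.safe_load` would return.

```python
def enabled_sources(config):
    """Return the names of *_download sections whose download flag is true."""
    names = []
    for key, section in config.items():
        if key.endswith("_download") and isinstance(section, dict):
            if section.get("download") is True:
                names.append(key[: -len("_download")])
    return names


# Stand-in for a parsed download_config.yml; all values are illustrative.
example_config = {
    "landsat_download": {"download": True, "year": 2023, "admin_levels": [1, 2]},
    "land_class_download": {"download": False, "year": 2023, "admin_levels": [1, 2]},
    "osm_download": {"download": True, "output_path": "data/osm"},
    "pre_processing_info": {"crs": "EPSG:4326"},
}
print(enabled_sources(example_config))  # ['landsat', 'osm']
```

Printing this list before submitting anything to GEE is a quick sanity check that you are not about to re-export a source you already have.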
Example 2: Running the Download Stage
Here's the actual command from the repository, with the optional verbose flag:
# Basic execution
python 1_data_download.py path/to/download_config.yml
# With debug logging for troubleshooting
python 1_data_download.py path/to/download_config.yml --verbose
Critical workflow detail: This script submits batch tasks to GEE, not direct downloads. The outputs land in your Google Drive. You must manually move them to gee_downloads_dir before Stage 2. Monitor progress at https://code.earthengine.google.com/tasks.
This two-step download (cloud export → local transfer) is GEE's architecture, not GeoFlo's limitation. The pipeline makes this explicit rather than hiding the complexity with fragile automation.
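If your Google Drive folder is synced to a local directory, the manual transfer can be scripted. The sketch below is one way to do it under that assumption. The `collect_gee_exports` helper is hypothetical, it assumes the exports are CSV files, and it assumes Stage 2 is happy with a flat `gee_downloads_dir`, which you should check against the repository.

```python
import shutil
from pathlib import Path


def collect_gee_exports(drive_sync_dir, gee_downloads_dir, pattern="*.csv"):
    """Move exported files from a locally synced Drive folder into gee_downloads_dir.

    Returns the list of destination paths. Existing files are not overwritten.
    """
    src = Path(drive_sync_dir)
    dst = Path(gee_downloads_dir)
    dst.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in sorted(src.glob(pattern)):
        target = dst / path.name
        if target.exists():
            continue  # skip rather than clobber an earlier download
        shutil.move(str(path), str(target))
        moved.append(target)
    return moved
```

Only run it after the task list shows every export as completed, otherwise Stage 2 will merge a partial set of files.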
Example 3: Preprocessing and Feature Extraction Commands
Stages 2 and 3 follow the same pattern—pass a config, get deterministic output:
# Stage 2: Merge batch CSVs and apply geographic preprocessing
python 2_data_preprocessing.py path/to/download_config.yml
# Stage 3: Extract and merge all features into single CSV per admin level
python 3_create_admin_features.py path/to/feature_extraction_config.yml
Stage 2 solves a genuinely annoying problem: GEE exports each tile as a separate CSV. For a country with 50 tiles across the four GEE-backed sources (OSM is downloaded separately), that's 200 files to merge. This script handles concatenation, deduplication, and CRS standardisation automatically.
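For intuition, the merge-and-deduplicate part of that job can be sketched with the standard library alone. This is not GeoFlo's implementation: `merge_csv_batches` is a hypothetical helper, it assumes every batch file carries an admin code column to deduplicate on, and it leaves CRS handling out entirely.

```python
import csv
from pathlib import Path


def merge_csv_batches(input_dir, output_path, key_column):
    """Concatenate all CSVs in input_dir, keeping the first row seen per key."""
    seen = set()
    fieldnames = []
    rows = []
    for path in sorted(Path(input_dir).glob("*.csv")):
        with open(path, newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            # Build the union of columns, preserving first-seen order.
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)
            for row in reader:
                key = row[key_column]
                if key not in seen:
                    seen.add(key)
                    rows.append(row)
    with open(output_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Write the merged file somewhere outside `input_dir`, or a second run will read its own output back in as another batch.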
Stage 3 produces the golden output:
output/features/admin_2_extracted_features.csv
# One row per admin unit, one column per feature
Example 4: Population Labelling (Reference Implementation)
The optional Stage 4 shows how to join any label dataset. Here's the exact config:
features_csv: "<path-to-extracted-features-csv>" # Output from Stage 3
population_csv: "<path-to-population-csv>" # Your label data
admin_level: <admin-level> # e.g., 2
admin_code_column: "<admin-code-column>" # Column matching ADM{level}_PCODE
output_fpath: "<path-to-output-csv>" # Final labelled dataset
And the command:
python label_features_with_population.py path/to/population_labelling_config.yml
The power of this pattern: The population CSV needs only two things—a T_TL column with total population, and an admin code column that joins to ADM{level}_PCODE. The script automatically handles UTF-8 and latin-1 encodings (common in international development data) and adds a log_plus_one_pop column for modelling convenience.
For non-population labels (wealth indices, survey scores, crop yields), write a similar join script using this one as a template. The abstraction is clean: features on one side, labels on the other, joined on standardised admin codes.
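The join itself is small enough to sketch in a few lines. The version below works on rows already read into dicts and is only an approximation of the reference script: `label_features` is a hypothetical helper, dropping unmatched admin units is my assumption, and so is reading `log_plus_one_pop` as the natural log of one plus `T_TL`.

```python
import math


def label_features(feature_rows, population_rows, level, admin_code_column):
    """Join population onto features by admin code and add log(1 + population)."""
    pcode = f"ADM{level}_PCODE"
    population = {row[admin_code_column]: float(row["T_TL"]) for row in population_rows}
    labelled = []
    for row in feature_rows:
        total = population.get(row[pcode])
        if total is None:
            continue  # assumption: drop units with no matching label
        labelled.append({**row, "T_TL": total, "log_plus_one_pop": math.log1p(total)})
    return labelled
```

To adapt it for another label, swap `T_TL` for your own column and decide explicitly what should happen to admin units that have features but no label.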
Advanced Usage & Best Practices
Optimise Your GEE Quotas
GEE limits how many batch tasks run concurrently. For large countries, stagger your admin_levels or data sources across multiple config files, and queue the long exports to run overnight rather than waiting on them. Use --verbose logging to catch quota errors early.
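One way to stagger by data source is to generate a config variant per source from a single parsed config. This is a sketch under the assumption that sources are switched on and off through the `download` key of each `*_download` section; `split_by_source` is a hypothetical helper, not something GeoFlo ships.

```python
import copy


def split_by_source(config):
    """Yield (source_key, config_copy) pairs with exactly one source enabled each."""
    sources = [
        key
        for key, section in config.items()
        if key.endswith("_download") and isinstance(section, dict) and section.get("download")
    ]
    for active in sources:
        variant = copy.deepcopy(config)  # never mutate the caller's config
        for key in sources:
            variant[key]["download"] = key == active
        yield active, variant
```

Each variant can be written out with PyYAML's `yaml.safe_dump` and passed to the download script in turn, so only one source's tasks are in the GEE queue at a time.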
Version Your Configs
Treat YAML files as code. Name them download_config_kenya_2023.yml, commit to Git, document changes. This makes your entire feature extraction reproducible and auditable.
Custom Feature Engineering
The extracted CSV is your foundation. Add derived features (NDVI trends, built-up change rates, road density per capita) before modelling. The clean admin-level structure makes this trivial in pandas.
Parallel Country Runs
Since each country is self-contained in config, run multiple countries simultaneously on separate machines. The pipeline has no global state conflicts.
Handle Encoding Edge Cases
International admin names often contain non-ASCII characters. The labelling script's dual encoding support (UTF-8 and latin-1) is there for a reason. If you hit encoding errors in custom labels, follow this pattern.
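The fallback pattern looks roughly like this. It is a sketch of the general technique rather than the labelling script's actual code, and `read_csv_rows` is a hypothetical name. Order matters: latin-1 can decode any byte sequence, so it has to come last or it will mask genuine UTF-8 files.

```python
import csv


def read_csv_rows(path, encodings=("utf-8", "latin-1")):
    """Read a CSV as a list of dicts, trying each encoding in order."""
    last_error = None
    for encoding in encodings:
        try:
            with open(path, newline="", encoding=encoding) as handle:
                return list(csv.DictReader(handle))
        except UnicodeDecodeError as error:
            last_error = error  # remember the failure and try the next encoding
    raise last_error or ValueError("no encodings given")
```

Because latin-1 never fails, a file in some third encoding will load without error but with mangled names, so spot-check a few admin names after loading.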
Comparison with Alternatives
| Capability | GeoFlo | Custom Scripts | Google Earth Engine Code Editor | geemap |
|---|---|---|---|---|
| Multi-source orchestration | ✅ Built-in | ❌ Manual | ⚠️ Partial | ⚠️ Partial |
| Admin-level aggregation | ✅ Native | ❌ Build yourself | ❌ Manual | ❌ Manual |
| Batch export management | ✅ Explicit workflow | ❌ Fragile | ✅ GUI only | ⚠️ API only |
| Reproducible configuration | ✅ YAML-driven | ❌ Ad hoc | ❌ Click-based | ❌ Code-based |
| Feature merging & CSV output | ✅ Automated | ❌ Manual | ❌ Not designed for | ❌ Not designed for |
| Labelling flexibility | ✅ Pluggable | ❌ Custom each time | ❌ Not applicable | ❌ Not applicable |
| Learning curve | Medium | High | Medium | Medium |
| Production readiness | ✅ Logging, modularity | ❌ Varies | ❌ Interactive only | ⚠️ Requires wrapping |
The verdict: GeoFlo occupies a unique niche. It's not a GEE alternative—it's a workflow layer above GEE that turns exploratory satellite analysis into production ML features. If you're running one-off visualisations, use the Code Editor. If you're building operational poverty prediction pipelines, GeoFlo saves weeks of engineering.
FAQ
Q: Do I need a paid Google Earth Engine account?
A: Not for research or other noncommercial work, where Earth Engine access is free and handles most country-scale analyses; commercial use requires a paid plan. Large countries or high-resolution extractions may hit quotas; request a quota increase if needed.
Q: Can I use GeoFlo without Google Earth Engine?
A: Partially. The OSM download stage runs independently. For non-GEE satellite data, you'd need to adapt the download modules to your source APIs.
Q: What admin levels are supported?
A: Any level your boundary shapefiles contain. The code uses the ADM{level}_PCODE convention, so levels 0 (national) through 4+ are all valid if your data supports them.
Q: How long does a full pipeline run take?
A: GEE batch exports dominate runtime, typically 15 minutes to 2 hours depending on country size and GEE load. Preprocessing and feature extraction take minutes once downloads complete.
Q: Can I add custom data sources?
A: Yes. The modular structure means you add a new module under data_download/ and a corresponding config section. Follow the existing patterns for Landsat or OSM.
Q: Is GeoFlo suitable for real-time applications?
A: No. It's designed for batch, research, and operational modelling workflows. Real-time satellite pipelines need on-demand or streaming services rather than batch exports to Google Drive.
Q: How do I cite GeoFlo in research?
A: Cite the GitHub repository directly: Data Science Unit, "GeoFlo: An automated data pipeline for downloading, preprocessing, and feature extraction of geospatial datasets," available at https://github.com/Data-Science-Unit/GeoFlo.
Conclusion
Here's the hard truth: geospatial feature extraction is a solved problem that most teams keep re-solving badly. Every custom script, every fragile Jupyter notebook chain, every manual Google Drive download is technical debt that compounds across projects.
GeoFlo offers something better: a battle-tested, configuration-driven, openly documented pipeline that turns weeks of engineering into four commands and three YAML files. The Data Science Unit has done the hard work of abstracting GEE's quirks, standardising admin-level aggregation, and creating a labelling pattern that generalises across domains.
Whether you're predicting poverty in East Africa, monitoring urban growth in South Asia, or building any admin-level ML model that needs Earth Observation features, GeoFlo deserves your attention.
Stop building from scratch. Start with GeoFlo.
Clone the repository today, configure for your country of interest, and join the teams who've already reclaimed their time for actual modelling and insight generation. The pipeline is waiting—your features are one config away.
Found this guide valuable? Star the repository, share with your geospatial ML colleagues, and watch your pipeline development time collapse.