How to Automate GIS Workflows with Python
Problem statement
Many GIS tasks are repetitive: open a file, check the CRS, reproject it, clean field names, remove bad records, and save a new copy. That works for one dataset, but it becomes slow and error-prone when you need to process 10, 50, or 200 files.
A common case is batch processing vector data such as shapefiles and GeoJSON files. You may receive files from different sources and need to standardize them to one CRS, keep only required fields, remove empty geometries, and export cleaned outputs to a separate folder.
This page shows how to automate a vector GIS workflow in Python with GeoPandas. The goal is to run the same steps on many files consistently.
Quick answer
To automate a GIS workflow in Python:
- define the repeated steps
- collect input files from a folder
- process each file with the same GeoPandas operations
- save outputs to a separate folder
Typical tools:
- Python
- GeoPandas
- pathlib
A basic automation script can batch process shapefiles and GeoJSON files, reproject them, clean attributes, filter bad geometry records, and write standardized outputs automatically.
Step-by-step solution
Define the workflow first
Before writing code, list the exact steps you want to repeat. A simple vector workflow might be:
- load shapefiles or GeoJSON files
- check whether each file has a CRS
- reproject to a target CRS such as EPSG:3857
- standardize column names
- drop rows with missing required values
- remove null, empty, or invalid geometries
- save processed files to a new folder
This matters because automation works best when the workflow is explicit and consistent.
Set up the Python environment
Install GeoPandas if needed:
```bash
pip install geopandas
```
Imports used in this tutorial:
```python
from pathlib import Path
import geopandas as gpd
```
If you use Python regularly for GIS, keep a dedicated environment for geospatial packages.
Read vector files from an input folder
Use pathlib to collect files instead of hard-coding names.
```python
from pathlib import Path

input_folder = Path("data/raw")
output_folder = Path("data/processed")
output_folder.mkdir(parents=True, exist_ok=True)

input_files = list(input_folder.glob("*.shp")) + list(input_folder.glob("*.geojson"))
print(f"Found {len(input_files)} files")

for file_path in input_files:
    print(file_path.name)
```
This pattern is useful for batch GIS work because you can drop new files into one folder and rerun the same script.
Apply the same processing steps to each file
The script below does four practical things:
- skips files with missing CRS metadata
- reprojects valid files to a target CRS
- standardizes column names
- removes bad geometry records and exports cleaned output
```python
from pathlib import Path
import geopandas as gpd

input_folder = Path("data/raw")
output_folder = Path("data/processed")
output_folder.mkdir(parents=True, exist_ok=True)

target_crs = "EPSG:3857"
input_files = list(input_folder.glob("*.shp")) + list(input_folder.glob("*.geojson"))

for file_path in input_files:
    gdf = gpd.read_file(file_path)

    if gdf.crs is None:
        print(f"Skipping {file_path.name}: missing CRS")
        continue

    if gdf.crs.to_string() != target_crs:
        gdf = gdf.to_crs(target_crs)

    # standardize column names
    gdf.columns = [col.lower().replace(" ", "_") for col in gdf.columns]

    # remove empty or null geometries
    gdf = gdf[gdf.geometry.notnull()]
    gdf = gdf[~gdf.geometry.is_empty]

    # drop invalid geometries
    gdf = gdf[gdf.is_valid]

    # drop rows with null values in a required field if it exists
    if "name" in gdf.columns:
        gdf = gdf[gdf["name"].notna()]

    output_path = output_folder / f"{file_path.stem}_processed.geojson"
    gdf.to_file(output_path, driver="GeoJSON")
    print(f"Processed: {file_path.name} -> {output_path.name}")
```
This gives you a repeatable pattern for routine vector data processing.
Save outputs safely
A few details in the script are important:
- output_folder.mkdir(parents=True, exist_ok=True) creates the output folder if needed
- outputs use a new suffix such as _processed
- original files stay unchanged
- exporting to GeoJSON avoids some shapefile limitations during automation
If you need shapefile output instead:
```python
output_path = output_folder / f"{file_path.stem}_processed.shp"
gdf.to_file(output_path)
```
Wrap the workflow in a reusable function
If you plan to run the workflow more than once, turn it into a function.
```python
from pathlib import Path
import geopandas as gpd

def process_vector_files(input_folder, output_folder, target_crs="EPSG:3857"):
    input_folder = Path(input_folder)
    output_folder = Path(output_folder)
    output_folder.mkdir(parents=True, exist_ok=True)

    input_files = list(input_folder.glob("*.shp")) + list(input_folder.glob("*.geojson"))

    for file_path in input_files:
        gdf = gpd.read_file(file_path)

        if gdf.crs is None:
            print(f"Skipping {file_path.name}: missing CRS")
            continue

        if gdf.crs.to_string() != target_crs:
            gdf = gdf.to_crs(target_crs)

        gdf.columns = [col.lower().replace(" ", "_") for col in gdf.columns]
        gdf = gdf[gdf.geometry.notnull()]
        gdf = gdf[~gdf.geometry.is_empty]
        gdf = gdf[gdf.is_valid]

        if "name" in gdf.columns:
            gdf = gdf[gdf["name"].notna()]

        output_path = output_folder / f"{file_path.stem}_processed.geojson"
        gdf.to_file(output_path, driver="GeoJSON")
        print(f"Saved {output_path}")

if __name__ == "__main__":
    process_vector_files("data/raw", "data/processed", target_crs="EPSG:3857")
```
This structure is easier to reuse in scheduled jobs, command-line scripts, or larger data pipelines.
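For command-line use, the function can be wrapped with argparse. A minimal sketch, assuming the process_vector_files function above; the build_parser helper is illustrative, not part of GeoPandas:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical CLI wrapper; wire the parsed args into
    # process_vector_files(args.input_folder, args.output_folder, args.target_crs).
    parser = argparse.ArgumentParser(description="Batch-process vector files")
    parser.add_argument("input_folder")
    parser.add_argument("output_folder")
    parser.add_argument("--target-crs", default="EPSG:3857")
    return parser

# Parsing an explicit argument list here as a stand-in for sys.argv:
args = build_parser().parse_args(
    ["data/raw", "data/processed", "--target-crs", "EPSG:32633"]
)
```

In a real script you would call `build_parser().parse_args()` with no arguments so argparse reads sys.argv.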
Code examples
Example 1: Batch read and export vector files
```python
from pathlib import Path
import geopandas as gpd

input_folder = Path("data/raw")
output_folder = Path("data/exported")
output_folder.mkdir(parents=True, exist_ok=True)

for file_path in list(input_folder.glob("*.shp")) + list(input_folder.glob("*.geojson")):
    gdf = gpd.read_file(file_path)
    output_path = output_folder / f"{file_path.stem}.geojson"
    gdf.to_file(output_path, driver="GeoJSON")
```
Example 2: Reproject all files to one CRS
```python
from pathlib import Path
import geopandas as gpd

target_crs = "EPSG:32633"
output_folder = Path("data/reprojected")
output_folder.mkdir(parents=True, exist_ok=True)

for file_path in Path("data/raw").glob("*.shp"):
    gdf = gpd.read_file(file_path)
    if gdf.crs is None:
        continue
    if gdf.crs.to_string() != target_crs:
        gdf = gdf.to_crs(target_crs)
    gdf.to_file(output_folder / f"{file_path.stem}_utm33n.geojson", driver="GeoJSON")
```
Example 3: Clean attributes during automation
```python
import geopandas as gpd

gdf = gpd.read_file("data/raw/parcels.shp")
gdf = gdf.rename(columns={"Parcel ID": "parcel_id", "OwnerName": "owner_name"})
gdf = gdf[["parcel_id", "owner_name", "geometry"]]
gdf = gdf[gdf["parcel_id"].notna()]
gdf.to_file("data/processed/parcels_clean.geojson", driver="GeoJSON")
```
Example 4: Wrap a cleaning step in a reusable function
```python
from pathlib import Path
import geopandas as gpd

def clean_layer(file_path, output_folder, target_crs):
    gdf = gpd.read_file(file_path)
    if gdf.crs is None:
        return
    if gdf.crs.to_string() != target_crs:
        gdf = gdf.to_crs(target_crs)
    gdf.columns = [c.lower().replace(" ", "_") for c in gdf.columns]
    gdf = gdf[gdf.geometry.notnull() & ~gdf.geometry.is_empty & gdf.is_valid]
    output_folder = Path(output_folder)
    output_folder.mkdir(parents=True, exist_ok=True)
    output_path = output_folder / f"{Path(file_path).stem}_clean.geojson"
    gdf.to_file(output_path, driver="GeoJSON")
```
Explanation
GIS automation with Python is useful when you need to apply the same rules to many datasets. Instead of repeating the same clicks in desktop software, you write the workflow once and run it on every file.
GeoPandas is a good fit for this type of vector workflow because it handles:
- reading and writing common vector formats
- column and attribute cleanup
- CRS conversion with to_crs()
- filtering rows based on geometry or attribute rules
This approach works well for:
- repeated vector data cleaning
- standardizing data from multiple sources
- routine reporting workflows
- small to medium batch jobs
It is also easier to audit. When the logic is in code, you can see exactly how each dataset was processed.
Edge cases or notes
Missing CRS metadata
If gdf.crs is None, do not guess unless you know the source CRS. A wrong CRS assignment can break the rest of the workflow. Skip those files and fix the metadata first.
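When you do know the true source CRS (for example, a missing .prj file for data you know is WGS84), GeoPandas can assign the metadata with set_crs. A small sketch using an in-memory layer; the field values are illustrative:

```python
import geopandas as gpd
from shapely.geometry import Point

# An in-memory layer with no CRS metadata, standing in for a file
# whose projection information was lost.
gdf = gpd.GeoDataFrame({"name": ["site_a"]}, geometry=[Point(13.4, 52.5)])
assert gdf.crs is None

# set_crs assigns metadata only; it does NOT transform coordinates.
# Use it only when the true source CRS is known.
gdf = gdf.set_crs("EPSG:4326")
```

The key distinction: set_crs labels the data, to_crs reprojects it. Using set_crs with a guessed CRS silently corrupts every downstream step.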
Different CRS values across input files
Different source CRS values are normal. The important part is that each file has correct CRS metadata before reprojection.
Invalid geometries
The line below removes invalid features:
```python
gdf = gdf[gdf.is_valid]
```
That is useful for cleaning, but it drops features rather than repairing them. If you need to keep invalid features, use a geometry repair step instead of filtering them out.
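If you need to repair rather than drop, Shapely's make_valid (available in Shapely 1.8+) is one option. A sketch using a deliberately invalid "bowtie" polygon:

```python
import geopandas as gpd
from shapely.geometry import Polygon
from shapely.validation import make_valid

# A self-intersecting "bowtie" polygon: a classic invalid geometry.
bowtie = Polygon([(0, 0), (2, 2), (2, 0), (0, 2)])
gdf = gpd.GeoDataFrame(geometry=[bowtie], crs="EPSG:3857")
assert not gdf.is_valid.iloc[0]

# Repair instead of filtering: make_valid splits the bowtie into a
# valid multi-part geometry without discarding the feature.
gdf["geometry"] = gdf.geometry.apply(make_valid)
```

Whether dropping or repairing is correct depends on the dataset; repair keeps the feature but may change its geometry type (here a Polygon becomes a MultiPolygon).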
Area calculations need the right projected CRS
EPSG:3857 is convenient for web mapping, but it is not a good default for accurate area calculations. If you need area values, reproject to an appropriate projected CRS for your region and analysis before using gdf.area.
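As a quick illustration of the difference, this sketch measures a one-degree square around latitude 50 N in an equal-area CRS and in Web Mercator. EPSG:6933 is used here as one example of a global equal-area projection; pick a CRS appropriate to your own region:

```python
import geopandas as gpd
from shapely.geometry import Polygon

# A one-degree square around latitude 50 N, in geographic coordinates.
square = Polygon([(0, 50), (1, 50), (1, 51), (0, 51)])
gdf = gpd.GeoDataFrame(geometry=[square], crs="EPSG:4326")

# Equal-area projection gives a usable area; Web Mercator inflates it
# increasingly away from the equator.
area_equal_km2 = gdf.to_crs("EPSG:6933").area.iloc[0] / 1e6
area_mercator_km2 = gdf.to_crs("EPSG:3857").area.iloc[0] / 1e6

print(f"equal-area: {area_equal_km2:.0f} km2, web mercator: {area_mercator_km2:.0f} km2")
```

At this latitude the Web Mercator figure comes out more than twice the true area, which is why CRS choice matters before calling gdf.area.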
Shapefile limitations
Shapefiles have field name length limits and more restrictive data typing. For automated workflows, GeoJSON or GeoPackage is often easier to work with as an output format.
Large datasets
GeoPandas works well for many file-based workflows, but large datasets can use a lot of memory. For bigger jobs, consider formats and tools such as GeoPackage, Parquet, or PostGIS.
Avoid overwriting source data
Write outputs to a separate folder and use clear file name suffixes such as _processed or _clean. That reduces the risk of accidental data loss.
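A cheap extra safeguard is to fail fast if the two folders resolve to the same location. A sketch using the folder names from this tutorial:

```python
from pathlib import Path

input_folder = Path("data/raw")
output_folder = Path("data/processed")

# Refuse to run if outputs would land in the input folder.
if output_folder.resolve() == input_folder.resolve():
    raise ValueError("output_folder must differ from input_folder")
```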
Internal links
For a broader overview, see Python GIS automation basics.
Related task guides:
- How to Read a Shapefile in Python with GeoPandas
- How to Reproject Spatial Data in Python (GeoPandas)
If you need to export cleaned outputs, see How to Export GeoJSON in Python with GeoPandas.
FAQ
How do I automate multiple shapefiles in Python?
Use pathlib to collect all .shp files in a folder, loop through them with GeoPandas, apply the same processing steps, and save each result to an output folder.
What Python library is best for GIS workflow automation?
For vector file workflows, GeoPandas is usually the best starting point. It covers reading, writing, reprojection, filtering, and attribute cleanup in one workflow.
Can I automate reprojection and attribute cleaning in one script?
Yes. A single GeoPandas script can read a file, reproject it with to_crs(), rename fields, drop rows, filter invalid geometries, and export the cleaned result.
What should I do if some input files have different CRS values?
That is usually fine if the CRS metadata is correct. Read each file, check gdf.crs, and convert it to a shared target CRS before export. If CRS metadata is missing, fix that first.