Classification

Contains every function to do with map classification. This includes model creation, map classification and processes for array manipulation into scikit-learn compatible forms.

For details on how to build a class shapefile, see the notebook PyEO_sepal_model_training.ipynb within the notebooks directory in the PyEO GitHub.

All models are serialised with joblib.dump and deserialised with joblib.load, and are saved with the .pkl extension.

Key functions

create_model_for_region() Creates a model from a directory of class shapefile and .tif pairs. Because the model can be trained on data from multiple regions, it tends to generalise better.

create_trained_model() Creates a model from a class shapefile and a .tif.

Alternatively, these two functions are suitable for those wishing to create a simpler model:

  1. extract_features_to_csv() Extracts class signatures from a class shapefile and a .tif

  2. create_model_from_signatures() Creates a model from a .csv of classes and band signatures

Finally, a raster can be classified using:

classify_image() Produces a classification map from an image using a model.

Function reference

pyeo.classification.change_from_composite(image_path: str, composite_path: str, model_path: str, class_out_path: str, prob_out_path: str | None = None, skip_existing: bool = False, apply_mask: bool = False) None

Stacks an image with a composite and classifies each pixel change with a scikit-learn model.

The image that is classified has the following bands:

    1. composite blue

    2. composite green

    3. composite red

    4. composite IR

    5. image blue

    6. image green

    7. image red

    8. image IR
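
The band order above amounts to placing the composite bands ahead of the image bands along the band axis. The following is a minimal NumPy sketch of that stacking, not PyEO's own implementation. It assumes both rasters are already loaded as arrays of shape (bands, y, x) with four bands each; the function name stack_composite_and_image and the synthetic data are hypothetical.

```python
import numpy as np


def stack_composite_and_image(composite: np.ndarray, image: np.ndarray) -> np.ndarray:
    """Stack two (bands, y, x) arrays so the composite bands come first."""
    if composite.shape[1:] != image.shape[1:]:
        raise ValueError("composite and image must cover the same pixel grid")
    return np.concatenate([composite, image], axis=0)


# Synthetic 4-band rasters on a 3 x 2 pixel grid (hypothetical data).
composite = np.zeros((4, 3, 2), dtype=np.uint16)
image = np.ones((4, 3, 2), dtype=np.uint16)

stacked = stack_composite_and_image(composite, image)
print(stacked.shape)  # (8, 3, 2)
```

A model passed as model_path would then see eight features per pixel, in the order listed.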

Parameters:
  • image_path (str) – The path to the image

  • composite_path (str) – The path to the composite

  • model_path (str) – The path to a .pkl of a scikit-learn classifier that takes 8 features

  • class_out_path (str) – A location to save the resulting classification .tif

  • prob_out_path (str, optional) – A location to save the probability raster of each pixel.

  • skip_existing (bool, optional) – If true, do not run if class_out_path already exists. Defaults to False.

  • apply_mask (bool, optional) – If True, uses the .msk file corresponding to the image at image_path to skip any invalid pixels. Default False.

Return type:

None

pyeo.classification.classify_directory(in_dir: str, model_path: str, class_out_dir: str, prob_out_dir: str | None = None, apply_mask: bool = False, out_type: str = 'GTiff', chunks: int = 4, skip_existing: bool = False) None

Classifies every file ending in .tif in in_dir using model at model_path. Outputs are saved in class_out_dir and prob_out_dir, named [input_name]_class and _prob, respectively.

See the documentation for classify_image() for more details.
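
The output naming can be illustrated with a short sketch. It assumes the .tif extension is kept on the output files, which the description does not state explicitly; the helper output_paths and the directory names are hypothetical and are not part of PyEO.

```python
from pathlib import Path


def output_paths(in_path: str, class_out_dir: str, prob_out_dir: str) -> tuple[Path, Path]:
    """Build the class and probability output paths for one input raster."""
    stem = Path(in_path).stem  # "scene.tif" -> "scene"
    class_path = Path(class_out_dir) / f"{stem}_class.tif"
    prob_path = Path(prob_out_dir) / f"{stem}_prob.tif"
    return class_path, prob_path


class_path, prob_path = output_paths("in/scene.tif", "classified", "probabilities")
print(class_path.name, prob_path.name)
```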

Parameters:
  • in_dir (str) – The path to the directory containing the rasters to be classified.

  • model_path (str) – The path to the .pkl file containing the model.

  • class_out_dir (str) – The directory that will store the classified maps

  • prob_out_dir (str, optional) – If present, the directory that will store the probability maps of the classified maps. If not provided, will not generate probability maps.

  • apply_mask (bool, optional) – If True, uses the corresponding .msk files to mask the rasters. Defaults to False.

  • out_type (str, optional) – The raster format of the class image. Defaults to “GTiff” (GeoTIFF). See the GDAL docs for valid formats.

  • chunks (int, optional) – The number of chunks to break each image into for processing. See classify_image()

  • skip_existing (boolean, optional) – If True, skips the classification if the output file already exists.

Return type:

None

pyeo.classification.classify_image(image_path: str, model_path: str, class_out_path: str, prob_out_path: str | None = None, apply_mask: bool = False, out_format: str = 'GTiff', chunks: int = 4, nodata: int = 0, skip_existing: bool = False) str

Produces a class map from a raster and a model.

This applies the model’s predict() function to each pixel in the input raster, and saves the result into an output raster. The model is presumed to be a fitted scikit-learn model created using one of the other functions in this library (create_model_from_signatures() or create_trained_model()).

Parameters:
  • image_path (str) – The path to the raster image to be classified.

  • model_path (str) – The path to the .pkl file containing the model.

  • class_out_path (str) – The path that the classified map will be saved at.

  • prob_out_path (str, optional) – If present, the path that the class probability map will be stored at. Default None

  • apply_mask (bool, optional) – If True, uses the .msk file corresponding to the image at image_path to skip any invalid pixels. Default False.

  • out_format (str, optional) – The raster format of the class image. Defaults to “GTiff” (GeoTIFF). See the GDAL docs for valid formats.

  • chunks (int, optional) – The number of chunks the image is broken into prior to classification. The smaller this number, the faster classification will run, but the more likely you are to get an out-of-memory error. Defaults to 4.

  • nodata (int, optional) – The value to write to masked pixels. Defaults to 0.

  • skip_existing (bool, optional) – If true, do not run if class_out_path already exists. Defaults to False.

Returns:

class_out_path – The output path for the classified image.

Return type:

str
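
The effect of the chunks parameter can be sketched as follows: the pixels are split into pieces and only one piece is handed to the model at a time, which bounds peak memory. This is an illustration under assumptions, not the library's code. It assumes the pixels are already in (samples, features) form; predict_in_chunks and the toy brightest_band function are hypothetical.

```python
import numpy as np


def predict_in_chunks(predict, pixels: np.ndarray, chunks: int = 4) -> np.ndarray:
    """Apply predict() to a (samples, features) array one chunk at a time."""
    parts = np.array_split(pixels, chunks, axis=0)
    # Only one chunk is passed to predict() at a time, which bounds peak memory.
    return np.concatenate([predict(part) for part in parts])


def brightest_band(pixels: np.ndarray) -> np.ndarray:
    """Toy stand-in for model.predict(): index of the brightest band, plus 1."""
    return pixels.argmax(axis=1) + 1


pixels = np.array([[9, 1, 1], [1, 9, 1], [1, 1, 9], [9, 1, 1], [1, 9, 1]])
classes = predict_in_chunks(brightest_band, pixels, chunks=4)
print(classes)  # [1 2 3 1 2]
```

The result is the same as a single call over all pixels; only the amount of data held by the model at once changes.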

Notes

If you want to create a custom model, the object is presumed to have the following methods and attributes:

    • model.n_cores : The number of CPU cores used to run the model

    • model.predict() : A function that will take a set of band inputs from a pixel and produce a class.

    • model.predict_proba() : If called with prob_out_path, a function that takes a set of n band inputs from a pixel and produces n_classes_ outputs corresponding to the probabilities of the pixel belonging to each class
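
As a rough illustration of the interface listed above, the sketch below defines a toy object exposing n_cores, predict() and predict_proba(). The class ThresholdModel, its threshold rule and the classes_ attribute are hypothetical, and whether classify_image() requires anything beyond the listed members is not confirmed here.

```python
import numpy as np


class ThresholdModel:
    """Toy two-class model: class 2 if the mean band value exceeds a threshold, else class 1."""

    def __init__(self, threshold: float):
        self.threshold = threshold
        self.n_cores = 1  # attribute listed in the notes above
        self.classes_ = np.array([1, 2])  # hypothetical label set

    def predict_proba(self, pixels: np.ndarray) -> np.ndarray:
        """Return an (n_pixels, n_classes) array of class probabilities."""
        # Logistic squashing of the distance from the threshold (arbitrary choice).
        p_high = 1.0 / (1.0 + np.exp(-(pixels.mean(axis=1) - self.threshold)))
        return np.column_stack([1.0 - p_high, p_high])

    def predict(self, pixels: np.ndarray) -> np.ndarray:
        """Return one class label per pixel."""
        return self.classes_[self.predict_proba(pixels).argmax(axis=1)]


model = ThresholdModel(threshold=100.0)
pixels = np.array([[10, 20, 30, 40], [200, 210, 220, 230]])
labels = model.predict(pixels)
probs = model.predict_proba(pixels)
print(labels)  # [1 2]
```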

pyeo.classification.classify_rf(raster_path: str, modelfile: str, outfile: str, verbose: bool = False)

This function:

Reads in a pickle file of a random forest model and a raster file with feature layers, and classifies the raster file using the model.

Parameters:
  • raster_path (str) – filename and path to the raster file to be classified (in tiff uint16 format)

  • modelfile (str) – filename and path to the pickled file with the random forest model in uint8 format

  • outfile (str) – filename and path to the output file with the classified map in uint8 format

  • verbose (bool, optional) – Defaults to False. If True, provides additional printed output.

Return type:

None

pyeo.classification.create_model_for_region(path_to_region: str, model_out: str, scores_out: str, attribute: str = 'CODE') None

Takes all .tif files in a given folder and creates a pickled scikit-learn model for classifying them. Wraps create_trained_model(); see docs for that for the details.

Parameters:
  • path_to_region (str) – Path to the folder containing the tifs.

  • model_out (str) – Path to location to save the .pkl file

  • scores_out (str) – Path to save the cross-validation scores

  • attribute (str, optional) – The label of the field in the training shapefiles that contains the classification labels. Defaults to “CODE”.

Return type:

None

pyeo.classification.create_model_from_signatures(sig_csv_path: str, model_out: str, sig_datatype=<class 'numpy.int32'>)

Takes a .csv file containing class signatures - produced by extract_features_to_csv - and uses it to train and pickle a scikit-learn model.

Parameters:
  • sig_csv_path (str) – The path to the signatures file

  • model_out (str) – The location to save the pickled model to.

  • sig_datatype (dtype, optional) – The datatype to read the csv as. Defaults to int32.

Return type:

None

Notes

At present, the model is an ExtraTreesClassifier arrived at by tpot:

model = ens.ExtraTreesClassifier(bootstrap=False, criterion="gini", max_features=0.55, min_samples_leaf=2,
      min_samples_split=16, n_estimators=100, n_jobs=-1, class_weight='balanced')

pyeo.classification.create_rf_model_for_region(path_to_region: str, model_out: str, attribute: str = 'CODE', band_names: list[str] = [], gridsearch: int = 1, k_fold: int = 5) None

Takes all .tif files in a given folder and creates a pickled scikit-learn random forest model.

Parameters:
  • path_to_region (str) – Path to the folder containing the tifs.

  • model_out (str) – Path to location to save the .pkl file

  • attribute (str) – The label of the field in the training shapefiles that contains the classification labels. Defaults to “CODE”.

  • band_names (list[str]) – List of band names used in labelling the signatures in the signature file. Can be left as an empty list [].

  • gridsearch (int, optional) – Number of randomized random forests for gridsearch. Defaults to 1.

  • k_fold (int, optional) – Number of groups for k-fold validation during gridsearch. Defaults to 5.

Return type:

None

pyeo.classification.create_trained_model(training_image_file_paths: list[str], cross_val_repeats: int = 10, attribute: str = 'CODE')

Creates a trained model from a set of training images with associated shapefiles.

This assumes that each image in training_image_file_paths has in the same directory a folder of the same name containing a shapefile of the same name. For example, in the folder training_data:

training_data

  • area1.tif

  • area1

    • area1.shp

    • area1.dbf

    • area1.cpg

    • area1.shx

    • area1.prj

  • area2.tif

  • area2

    • area2.shp

    • area2.dbf

    • area2.cpg

    • area2.shx

    • area2.prj
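
The layout above implies a simple mapping from each training image to its shapefile. A minimal sketch of that mapping follows; the helper name expected_shapefile is hypothetical and only mirrors the convention described, it is not a PyEO function.

```python
from pathlib import Path


def expected_shapefile(image_path: str) -> Path:
    """Return the shapefile path implied by the layout above for one training image."""
    image = Path(image_path)
    # training_data/area1.tif -> training_data/area1/area1.shp
    return image.with_suffix("") / f"{image.stem}.shp"


shp = expected_shapefile("training_data/area1.tif")
print(shp.as_posix())  # training_data/area1/area1.shp
```

Checking that this path exists for every entry in training_image_file_paths before training is a cheap way to catch a misnamed folder early.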

Parameters:
  • training_image_file_paths (list[str]) – A list of filepaths to training images.

  • cross_val_repeats (int, optional) – The number of cross-validation repeats to use. Defaults to 10.

  • attribute (str, optional) – The label of the field in the training shapefiles that contains the classification labels. Defaults to “CODE”.

Returns:

  • model (sklearn.classifier) – A fitted scikit-learn model. See notes.

  • scores (tuple(float, float, float, float)) – The cross-validation scores for model

Notes

For full details of how to create an appropriate shapefile, see [here](../docs/build/html/index.html#training_data). At present, the model is an ExtraTreesClassifier arrived at by tpot:

model = ens.ExtraTreesClassifier(bootstrap=False, criterion="gini", max_features=0.55,
    min_samples_leaf=2, min_samples_split=16, n_estimators=100, n_jobs=-1, class_weight='balanced')

pyeo.classification.extract_features_to_csv(in_ras_path: str, training_shape_path: str, out_path: str, attribute: str = 'CODE')

Given a raster and a shapefile containing training polygons, extracts all pixels into a CSV file for further analysis.

This produces a CSV file where each row corresponds to a pixel. The columns are as follows:

  • Column 1: Class labels from the shapefile field labelled as ‘attribute’.

  • Column 2+ : Band values from the raster at in_ras_path.
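
The column layout can be illustrated with the standard csv module. The sketch below writes a few made-up rows and reads them back into labels and band values. The absence of a header row, the comma delimiter and the four-band values are assumptions for illustration only.

```python
import csv
import io

# Hypothetical signatures: class label first, then one value per band.
rows = [
    [1, 312, 540, 610, 2890],
    [1, 298, 525, 598, 2950],
    [2, 820, 910, 1040, 1500],
]

buffer = io.StringIO()
csv.writer(buffer).writerows(rows)

# Read the rows back and split them into labels and band values.
buffer.seek(0)
parsed = [[int(value) for value in row] for row in csv.reader(buffer)]
labels = [row[0] for row in parsed]
features = [row[1:] for row in parsed]
print(labels)       # [1, 1, 2]
print(features[0])  # [312, 540, 610, 2890]
```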

Parameters:
  • in_ras_path (str) – The path to the raster used for creating the training dataset

  • training_shape_path (str) – The path to the shapefile containing classification polygons

  • out_path (str) – The path for the new .csv file

  • attribute (str, optional) – The label of the field in the training shapefile that contains the classification labels. Defaults to “CODE”.

Return type:

None

pyeo.classification.get_shp_extent(shapefile: str)

Get the extent of the first layer, the CRS and the EPSG code from a shapefile.

Parameters:

shapefile (str) – path of the shapefile (.shp)

Returns:

  • extent : tuple(float, float, float, float) - Extent of the shapefile represented as (min_x, min_y, max_x, max_y).

  • SpatialRef : osgeo.osr.SpatialReference - Coordinate referencing system of the shapefile.

  • EPSG : int - EPSG code of the shapefile.

Return type:

tuple

pyeo.classification.get_training_data(image_path: str, shape_path: str, attribute: str = 'CODE')

Given an image and a shapefile with categories, returns training data and features suitable for fitting a scikit-learn classifier. The image and shapefile must be in the same map projection / coordinate referencing system.

For full details of how to create an appropriate shapefile, see [here](../index.html#training_data).

Parameters:
  • image_path (str) – The path to the raster image to extract signatures from

  • shape_path (str) – The path to the shapefile containing labelled class polygons

  • attribute (str, optional) – The shapefile field containing the class labels. Defaults to “CODE”.

Returns:

  • training_data (np.ndarray) – A numpy array of shape (n_pixels, bands), where n_pixels is the number of pixels covered by the training polygons

  • training_pixels (np.ndarray) – A 1-d numpy array of length (n_pixels) containing the class labels for the corresponding pixel in training_data

Notes

For performance, this uses SciPy’s sparse.nonzero() function to get the location of each training data pixel. As a consequence, any class with a label of ‘0’ is ignored.
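
The consequence for class 0 can be seen in a small sketch. Here np.nonzero stands in for the sparse lookup, which is an assumption made to keep the example self-contained; the arrays are made up.

```python
import numpy as np

# Hypothetical rasterised training polygons: 0 means "no polygon here".
label_raster = np.array([[0, 1, 1],
                         [0, 0, 2]])
# Matching 2-band image in (bands, y, x) order.
image = np.arange(12).reshape(2, 2, 3)

rows, cols = np.nonzero(label_raster)
training_labels = label_raster[rows, cols]   # shape (n_pixels,)
training_data = image[:, rows, cols].T       # shape (n_pixels, bands)
print(training_labels)      # [1 1 2]
print(training_data.shape)  # (3, 2)
```

Because pixels labelled 0 are indistinguishable from pixels outside any polygon, class labels in the shapefile should start at 1.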

pyeo.classification.load_signatures(sig_csv_path: str, sig_datatype=<class 'numpy.int32'>)

Extracts features and class labels from a signature CSV.

Parameters:
  • sig_csv_path (str) – The path to the csv

  • sig_datatype (dtype, optional) – The type of pixel data in the signature CSV. Defaults to np.int32

Returns:

  • features (np.ndarray) – a numpy array of the shape (feature_count, sample_count)

  • class_labels (np.ndarray) – a 1d numpy array of class labels (int) corresponding to the samples in features.

pyeo.classification.plot_signatures(learning_path: str, out_path: str, format: str = 'PNG')

This function:

Creates a graphics image file of the signature scatterplots.

Parameters:
  • learning_path (str) – The string containing the full directory path to the learning data input file from the model training stage saved by joblib.dump()

  • out_path (str) – The string containing the full directory path to the output file for graphical plots

  • format (str, optional) – GDAL format for the quicklook raster file, defaults to PNG

Return type:

None

pyeo.classification.raster_reclass_binary(img_path: str, rcl_value: int, outFn: str, outFmt: str = 'GTiff', write_out: bool = True) ndarray

Takes a raster and reclassifies rcl_value to 1, with all others becoming 0. In-place operation if write_out is True.

Parameters:
  • img_path (str) – Path to 1 band input raster.

  • rcl_value (int) – Integer indicating the value that should be reclassified to 1. All other values will be 0.

  • outFn (str) – Output file name.

  • outFmt (str, optional) – Output format. Set to GTiff by default. Other GDAL options available.

  • write_out (bool, optional) – Set to True by default. Will write raster to disk. If False, only an array is returned.

Returns:

in_array – Reclassified numpy array

Return type:

np.ndarray
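
The array-level operation can be sketched in NumPy as below. This covers only the reclassification itself, not the reading or writing of rasters; the helper name reclass_binary and the uint8 output type are assumptions.

```python
import numpy as np


def reclass_binary(array: np.ndarray, rcl_value: int) -> np.ndarray:
    """Return 1 where array equals rcl_value and 0 everywhere else."""
    return (array == rcl_value).astype(np.uint8)


class_map = np.array([[1, 5, 5],
                      [3, 5, 0]])
binary = reclass_binary(class_map, rcl_value=5)
print(binary)
```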

pyeo.classification.reshape_ml_out_to_raster(classes: ndarray, width: int, height: int) ndarray

Takes the output of a pixel classifier and reshapes to a single band image.

Parameters:
  • classes (array_like of int) – A 1-d numpy array of classes from a pixel classifier

  • width (int) – The width in pixels of the image that produced the classification

  • height (int) – The height in pixels of the image that produced the classification

Returns:

image_array – A 2-dimensional Numpy array of shape (width, height)

Return type:

np.ndarray

pyeo.classification.reshape_prob_out_to_raster(probs: ndarray, width: int, height: int)

Takes the probability output of a pixel classifier and reshapes it to a raster.

Parameters:
  • probs (np.ndarray) – A numpy array of shape (n_pixels, n_classes)

  • width (int) – The width in pixels of the image that produced the probability classification

  • height (int) – The height in pixels of the image that produced the probability classification

Returns:

image_array – The reshaped image array

Return type:

np.ndarray
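
One plausible way to perform this reshaping is sketched below. It assumes pixels are stored in row-major order and that the output has shape (n_classes, height, width); neither is stated above, so treat both as assumptions. The helper name probs_to_raster is hypothetical.

```python
import numpy as np


def probs_to_raster(probs: np.ndarray, width: int, height: int) -> np.ndarray:
    """Reshape (n_pixels, n_classes) probabilities to (n_classes, height, width)."""
    n_classes = probs.shape[1]
    # Transpose so each class becomes one row, then fold the pixels into a grid.
    return probs.T.reshape(n_classes, height, width)


probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5],
                  [0.7, 0.3], [0.4, 0.6], [1.0, 0.0]])
prob_raster = probs_to_raster(probs, width=3, height=2)
print(prob_raster.shape)  # (2, 2, 3)
```

Each band of the result then holds the probability surface for one class.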

pyeo.classification.reshape_raster_for_ml(image_array: ndarray) ndarray

A low-level function that reshapes an array from gdal order [band, y, x] to scikit features order [x*y, band]

For classification, scikit-learn functions take a 2-dimensional array of features of the shape (samples, features). For pixel classification, features correspond to bands and samples correspond to specific pixels.

Parameters:

image_array (np.ndarray) – A 3-dimensional Numpy array of shape (bands, y, x) containing raster data.

Returns:

image_array – A 2-dimensional Numpy array of shape (samples, features)

Return type:

np.ndarray
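
The reshaping from [band, y, x] to [x*y, band] can be sketched as a transpose followed by a reshape. This is an illustrative version under the assumption of row-major pixel order; the helper name to_samples_by_features is hypothetical.

```python
import numpy as np


def to_samples_by_features(image_array: np.ndarray) -> np.ndarray:
    """Reshape a (bands, y, x) array to (y * x, bands)."""
    bands, y, x = image_array.shape
    # Move bands to the last axis, then flatten the two spatial axes.
    return image_array.transpose(1, 2, 0).reshape(y * x, bands)


image_array = np.arange(24).reshape(3, 2, 4)  # 3 bands, 2 rows, 4 columns
samples = to_samples_by_features(image_array)
print(samples.shape)  # (8, 3)
```

Each row of samples holds the band values of one pixel, which is the layout scikit-learn estimators expect.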

pyeo.classification.shapefile_to_raster(shapefilename: str, inraster_filename: str, outraster_filename: str, verbose=False, nodata=0, attribute: str = 'CODE')

This function:

Reads in a shapefile with polygons and produces a raster file that aligns with an input rasterfile (same corner coordinates, resolution, coordinate reference system and geotransform). Each pixel value in the output raster will indicate the number from the shapefile based on the selected attribute column.

Parameters:
  • shapefilename (str) – String pointing to the input shapefile in ESRI format.

  • inraster_filename (str) – String pointing to the input raster file that we want to align the output raster to.

  • outraster_filename (str) – String pointing to the output raster file.

  • verbose (boolean) – True or False. If True, additional text output will be printed to the log file.

  • nodata (int) – No data value.

  • attribute (str) – Name of the column of the attribute table of the shapefile that will be burned into the raster. If None, use the first attribute.

Returns:

outraster_filename

Return type:

str

Notes

Based on https://gis.stackexchange.com/questions/151339/rasterize-a-shapefile-with-geopandas-or-fiona-python

pyeo.classification.train_rf_model(raster_paths: list[str], modelfile: str, ntrees: int = 101, attribute: str = 'CODE', band_names: list[str] = [], weights: list[int] | None = None, balanced: bool = True, gridsearch: int = 1, k_fold: int = 5)

This function:

Trains a random forest classifier model based on a raster file with bands as features and a second raster file with training data, in which pixel values indicate the class.

Parameters:
  • raster_paths (list[str]) – list of filenames and paths to the raster files to be classified in tiff format. It is a condition that shapefiles of matching name exist in the same directory.

  • modelfile (str) – filename and path to a pickle file to save the trained model to

  • ntrees (int) – number of trees in the random forest, default = 101

  • attribute (str) – string naming the attribute column to be rasterised in the shapefile

  • band_names (list[str]) – list of strings indicating the names of the bands (used for text output and labelling the learning data output file). If [], a sequence of numbers is assigned.

  • weights (list[int], optional) – a list of integers giving weights for all classes. If not specified, all weights will be equal.

  • balanced (bool, optional) – if True, use a balanced number of training pixels per class

  • gridsearch (int, optional) – Number of randomized random forests for gridsearch. Defaults to 1.

  • k_fold (int, optional) – Number of groups for k-fold validation during gridsearch. Defaults to 5.

Return type:

random forest model object

Notes

Adapted from pygge.py