Calculating a DOPA indicator means, in general, to do a spatial intersection within at least two layers. EG:
Things are complicated by:
Due to the reasons above, updating the indicators requires a long processing time, which is a constraint for frequence of update.
A few (but not cheap) steps have been identified as necessary to improve the process in quality and speed:
Above targets are reached scripting all the sequence of actions needed to build a pseudo-topology (eg: merging information from countries, ecoregions and protected areas to build the base layer).
Pseudo-topology is meant as a simple feature layer, flattened, which keeps the information from the original sources at the level of each single atomic geometry in which it can be destructured.
Building a real topology would be too expensive due to the size of the input datasets.
Key parts of the process are:
This step, despite increasing the number of vector nodes participating to each tile, simplifies the overlapping geometries, in a more efficient way respect to other (tested) approaches: snap to grid; snap to other geometries; simplify; the already mentioned, unsustainable, real topology building.
The result is a spatial table, in which each box has a unique ID (qid) and contains MultiPolygons, named (cid) according to the unique combination of the original components (topic1, topic2, topic_N_ …)
Basically, the process consists of:
Running in sequence the above scripts, will in turn run the postgis functions, which in turn fill fill the postgres tables.
It takes 27 hours to process the whole CEP (global country, ecoregion, protection) on 40 cores hardware.
Some sample output deriving from this 10 Mb sample inputs:
.
A lot of effort has been put in making the above task universal, and not exclusively targeted to the above objects (country/ecoregion/pa). It can be applied as-it-is to pre-process other datasets (eg IUCN species).
It is tested on
A .pgpass file is required.
As alternative, add the line:
export PGPASSWORD=<your password here>
to the #database parameters
section in the env file.
In workflow_parameters.conf
TOPICS=NN
– defines the number of processed “topics” (eg: for CEP (country+ecoregion+pa) it would be 3)
TOPIC_N="topic name"
– defines a name to be used for topic N
VERSION_TOPIC_N="schema name.table name"
– specifies which table contains data for topic N
FID_TOPIC_N="topic ID"
– specifies which is the numeric field (not required to be unique at this stage) for topic N
… repeat the above for the NN number of the defined topics)…
SCH="working schema name"
– this one defines the working schema, WHICH WILL BE DROPPED AND RECREATED!
GS=x
– defines grid tile size in degrees, integer submultiple of 180; default is 1 degree
CS=y
– defines raster cell size in arcasec, integer submultiple of 3600; default is 1 arcsec
If needed, chmod +x
all the pre-existing scripts.
Run ./00_create_infrastructure.sh
, which will create from scratch the following scripts:
a_input_topic_n.sh
populates input tables: copies data inside the working schema; dumps them as single Polygons (redundant by FID) and check geometries; it starts the function f_pop_input()
, which will write results into table a_input_topic_n_
;b_clip_topic_n.sh
clips input geometries according to grid tile size; it starts the function f_clip()
, which will write results into table b_clip_topic_n
;c_rast_topic_n.sh
pseudo-rasterizes above clipped data: rasterize at cell size, then vectorize back collecting as MultiPolygons; it starts the function f_raster()
, which will write results into table c_raster_topic_n
;da_tiled_topic_n.sh
dumps and checks above geometries to single part, by tile, by topic; it starts the function f_pop_tiled()
, which will write results into table da_tiled_topic_n
;db_tiled_all.sh
collects above geometries by tile in a single table; it starts the function f_pop_tiled_temp()
, which will write results:
db_tiled_temp
(for each topic);dc_tiled_all
(for all the topics);e_flat_all.sh
flat all above polygons by tile: breaks polygons at intersections, collects unique geometries, then calculates the centroid; it starts the function f_flatter()
, which will write results into table e_flat_all
(except the field cid; see later);f_attributes_all.sh
calculate and define numeric id (CID) for unique combinations of topics within the whole dataset; it starts the function f_pop_atts_tile()
, which will write results:
fa_atts_tile
(all combinations of qid,tid,topic-array);fb_atts_all
(all the unique combinations of topic-arry, with unique id-cid);g_final_all.sh
JOINS flat geometries to unique combinations of attributes; it starts the function f_flat_recode()
, which will write results:
e_flat_all
(UPDATE only the field CID);g_flat_temp
(this is actually the result, just not ordered);h_output.sh
exports the flat final layer with dynamic SQL. This is the only single core process, the rest is parallelized on multicores.o_raster.sh
rasterize the flat layer at the same resolution of the pseudo-rasterization step
2. p_export_raster
export the flat layer as external raster: vrt (gdal virtual raster) made out of tif files (which in turn are made by 10x10 blocks of original vector tiles). It also exports an attribute table (cid=pixel value=unique combination of input topics).All the scripts from a_*
to g_*
(and o_
, p_
) must run in parallel, launching them in background.
EG: ./a_input_country.sh Ncores > logs/a_input_country_log.txt 2>&1
where Ncores is the number of cores to assign to the process, and the following part of the command will write a detailed log.
All the scripts (not the resulting tables, which are striclty interconnected) are independent from each other: this allows to debug (through the aforementioned logs) every step, and check every output.
Still, all the commands can be collected and launched in a unique script.
EG:z_do_it_all.sh
is the real world script used to generate DOPA CEP (read inline comments for instructions), which will generate the infrastructure according to the provided workflow_parameters.conf, and will launch in sequence all the scripts generated.
Further ancillary scripts for specific tasks should be self-explanatory:
EG: 01_create_filter.sh
gives the option to filter only few tiles (EG: specific BIOPAMA needs).
level | step | script | function | w_table | w_fields | r_table | cores | description |
---|---|---|---|---|---|---|---|---|
by topic | 1 | a_input_topic_n.sh | f_pop_input() | a_input_topic_n_ | fid, geom | |||
2 | b_clip_topic_n.sh | f_clip() | b_clip_topic_n | fid, geom | ||||
3 | c_rast_topic_n.sh | f_raster() | c_raster_topic_n | fid, rast, geom | ||||
4 | da_tiled_topic_n.sh | f_pop_tiled() | da_tiled_topic_n | fid, geom | ||||
aggregated | 5a | db_tiled_all.sh | f_pop_tiled_temp() | db_tiled_temp | fid, source, geom | |||
5b | db_tiled_all.sh | SELECT | dc_tiled_all | fid, source, geom | ||||
6 | e_flat_all.sh | f_flatter() | e_flat_all | tid, geom, point | ||||
7 | f_attributes_all.sh | f_pop_atts_tile() | fa_atts_tile | tid, topic1, topicN | ||||
8 | g_final_all.sh | f_flat_recode() | e_flat_all | cid | ||||
g_final_all.sh | SELECT | g_flat_temp | qid, cid, geom, country,ecoregion,wdpa,sqkm | |||||
9 | h_output.sh | SELECT | h_flat | qid, cid, geom, country,ecoregion,wdpa,sqkm | ||||
10 | o_raster.sh | f_pop_o_raster() | o_raster | qid, rast | ||||
11 | p_export_raster.sh | GDAL | raster_output.vrt | |||||
p_export_raster.sh | COPY | raster_output_attibutes.csv | ||||||
DOPA CEP has been generated using:
loop | stage | script | n dedicated threads | n objects | time (seconds) |
---|---|---|---|---|---|
for each input topic | a-input tables | a_input_country.sh | 72 | 2409 | 11 |
a_input_ecoregion.sh | 72 | 405496 | 83 | ||
a_input_wdpa.sh | 72 | 940614 | 914 | ||
b-clip tables | b_clip_country.sh | 72 | 77261 | 215 | |
b_clip_ecoregion.sh | 72 | 532073 | 626 | ||
b_clip_wdpa.sh | 72 | 1004269 | 1353 | ||
c-raster tables | c_rast_country.sh | 72 | 74235 | 3120 | |
c_rast_ecoregion.sh | 72 | 103900 | 4022 | ||
c_rast_wdpa.sh | 72 | 278315 | 1691 | ||
da-tiled tables | da_tiled_country.sh | 72 | 78551 | 266 | |
da_tiled_ecoregion.sh | 72 | 651673 | 780 | ||
da_tiled_wdpa.sh | 72 | 2144176 | 797 | ||
for aggregated results | db-tiled aggregated table | db_tiled_all.sh | 72 | 2874400 | 6514 |
e-flat table | e_flat_all.sh | 36 | 7425625 | 77273 | |
f-attributes table | f_attributes_all.sh | 36 | 424944 | 27294 | |
g-final table | g_final_all.sh | 36 | 598264 | 8524 | |
h-output | h_output.sh | 1 | 598264 | 280 | |
total | 133763 (37 hours) |