### Planet PostGIS

Welcome to Planet PostGIS

November 17, 2018

Simon Greener

Spatial Representation of a Communications Pit

This article shows how to represent a communications pit using logical referencing rather than hard spatial objects in PostGIS. The solution was originally created on Oracle.

by Simon Greener at November 17, 2018 12:05 AM

November 16, 2018

Simon Greener

PostGIS: Creating a line inside a polygon

An algorithm that lets you create a line inside a polygon knowing only a starting point and two bearings: one to the start point of the line, the other to its end point.

by Simon Greener at November 16, 2018 11:21 AM

November 03, 2018

PostGIS Development

Howard Hughes Medical Institute

As a software engineer at the Howard Hughes Medical Institute, I work on a collaborative neuron reconstruction and analysis tool called CATMAID, which is used for neuroscience research. We use PostGIS to represent neurons in 3D space.

Neurons consist of 3D points that reference their parent nodes (or are the root, i.e. the soma of the neuron, if they have no parent). Together with synapses, point clouds and TIN meshes for modeling compartments in a dataset, they model the spatial aspects of our neuroscience world. Users create these neuron reconstructions manually in a collaborative fashion, and segmentation programs can be used as an additional data source. Using its spatial indices, PostGIS helps us to quickly query neurons in a particular field of view. The space of a single project sometimes contains hundreds of millions of interconnected individual points. We also do bounding-box intersection queries between neurons and compartment meshes, which we then refine in the front end with more precise intersection tests.

This software is used by quite a few research labs, and as far as I know they all do their own hosting on a dedicated server, as do we. The main reason is that with larger datasets we benefit from machines with a lot of RAM (>256 GB), fast SSD/NVMe drives and many CPUs, as well as fast local access to data such as image data.

Thanks so much for making PostGIS work well in non-GIS contexts too; it makes my life much easier!

by Tom Kazimiers at November 03, 2018 12:00 AM

November 02, 2018

PostGIS Development

Vanguard Appraisals

Vanguard Appraisals is new to the GIS world. In fact, we aren’t really in the GIS world; we just kind of brush up against it. We do mass property appraisal for entire county and city jurisdictions, and we develop software to collect, price and maintain values. We also host assessment data online so that homeowners can search for and find property information much more easily from the comfort of their own homes. Our software and websites are used in 7 states (IA, IL, MN, MO, NE, ND, SD).

Continue Reading by clicking title hyperlink ..

by Andy Colson at November 02, 2018 12:00 AM

October 25, 2018

Safe Software

A Beginner’s Guide to… Bulk Database Updates with FME

This article is a simple guide to bulk database updates with FME.

Sometimes my blog posts are like a courtroom novel. The recent article on precision is an example of this. I reach a successful verdict, but getting there involves an opening statement, a journey through obscure aspects of theory, cross-examining our functionality, and making a final argument for various design decisions. Only after all that can I prove beyond doubt that our work is fit for purpose.

That sort of article is fun to write and it does illustrate Safe’s thought processes in designing FME functionality. But sometimes – like reading a courtroom novel – you want to skip to the final chapter. You just want the juicy part where our hero attorney has already got the result, and simply illuminates the solution in logical steps.

Well, today, I am that hero attorney. I am the Perry Mason of the FME world, and this article is the final chapter of “Updating Databases with FME”. No theory or cross examination involved. Step by step I’ll cast light on how FME made it easy to push mass updates to a database.

There is just one design issue I’ll touch on, and it is important. In law there are multiple jurisdictions, and a technique Perry Mason tries in a California court may not work for Atticus Finch in Alabama. Similarly, in our world of data there are multiple databases. But! FME techniques you use for Oracle will work for SQL Server, and FME techniques you use for Postgres will work for an Esri Geodatabase.

How can this happen? It’s because we’ve worked really hard to harmonize (or standardize) the interfaces inside FME. So whatever database applies in your jurisdiction – or even if your work involves several database types – this post has you covered.

So let’s get on with it…

Setting up an FME Database Connection

When you want to read from or write to a database – including for database updates – you need authorization. FME defines authorization parameters using a connection tool. It’s accessed through Tools > FME Options in FME Workbench. So start Workbench, select Tools > FME Options and you will get this dialog:

When I click on Database Connections, I get a list of available connections. If I want to create a new one I press the plus button and get this dialog:

Just enter your database details in there, test and then save. Now you have a connection defined, you can use it wherever you like in FME. So first let’s use it to insert some data into a database…

Inserting Data to a Database with FME

To carry out database updates, first you need data in the database! Let’s say you want to read a dataset – maybe an Excel spreadsheet – and write it to a database, creating a table at the same time. That’s simply done in FME by generating a new workspace and choosing a database format. The generate option exists on the start page of FME Workbench, or you can use the shortcut Ctrl+G. That opens the basic dialog for defining a translation:

Here I have filled in the fields to define a translation of data about parks in the city of Vancouver. The translation is from MapInfo TAB (a spatial data format) to PostGIS (a standard PostgreSQL database with a spatial extension). The spatial part is not necessary – the same setup works for data without a spatial component – but maps are cool so I’ll go with it.

The database connection I selected is the one I defined above, saving me the effort of re-entering my authentication details.

Click OK and that dialog creates a workspace that looks like this:

Each object on the left is a table, layer or class in your source data. Each object on the right is a table in your database.

Setting Database Parameters

When using databases the key settings for each table are accessed by clicking the cogwheel icon on those objects, like this:

Database Insert Parameters

The table name is the first parameter, and so I can rename the table to be something different; and I can choose which schema (Table Qualifier) to write to as well.

But the most important parameter (Feature Operation) tells us that we are INSERTing data, and I can also choose in what way the table is created:

So I can choose to create the table regardless of if it exists (Drop and Create), create it if it doesn’t already exist (Create if Needed), just add to the existing table (Use Existing), or empty it if it already exists (Truncate Existing).

Which I use depends on the scenario I am working through, but in this case – to create and fill a table – I’ll use Create if Needed. The advantage over Drop and Create is that if another user already has a table with that name (and I haven’t checked) then at least I won’t delete their content first.

Anyway, I run the workspace and FME loads the data:

Of course, at some point in the future I might find the source of the data (Parks.tab) has changed, and I need to update my data based on that changed dataset…

Updating Records in a Database with FME

Say I receive a dataset named ParksUpdates.tab. The simplest way to update my database is to do the same process as above, but to use Drop and Create as the table operation. That way I am just replacing everything:

But that relies on ParksUpdates.tab being the FULL replacement dataset. What if it only includes the records that need an update? Well in that scenario I simply change the operation to UPDATE (instead of INSERT) and choose to Use Existing table:

Database Updates Parameters

Notice that when I pick UPDATE, then another parameter becomes available to me: Match Columns. I need this to define which feature updates which record. In this case I have an attribute called parkid in my source data and a field (column) named parkid in my database table; so that’s the attribute I select:

So if an incoming feature has the attribute parkid=13, then its contents are used to update the database record where parkid=13.

That’s simple enough, but to add a little complexity (not too much) there is also a WHERE clause I can use instead. This lets me define the match where the attribute and field names are not the same (for example ParkNumber and parkid) but also allows me to add extra conditions using field names:

 

Here, for example, I’m updating records where ParkNumber = parkid, but also only where the neighborhoodname field is “Downtown”. So records outside of the Downtown area aren’t updated, even if the park ID matches. I could do a similar test for a status field (active, inactive) among many other examples.
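In database terms, the effect is roughly equivalent to FME issuing one UPDATE per incoming feature along these lines. This is illustrative SQL only: the table, column names and values below are placeholders, and FME generates the real statements for you.

-- Illustrative only: FME builds statements of roughly this shape per feature.
UPDATE parks
   SET name = 'Updated Park Name'       -- attribute values taken from the incoming feature
 WHERE parkid = 13                      -- match condition: ParkNumber = parkid
   AND neighborhoodname = 'Downtown';   -- extra WHERE-clause condition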

So we do updates here, and that’s simple enough; but sometimes we also want to delete records…

Deleting Records from a Database with FME

Let’s assume my ParksUpdates dataset is a list of records to delete from the database table, not add to it. To do that I simply change the operation from UPDATE to DELETE:

Database Delete Parameters

I get the same Match Column parameter (or WHERE clause) to define which incoming features should delete which existing records, and this is again easy to define.

So deletes are no more complex than updates; the key question is what happens when I want to both delete and update records simultaneously?

Updating AND Deleting Database Records with FME

Let’s say some of the incoming records are updates, while others are deletions:

Obviously I can’t set the operation parameter to both DELETE and UPDATE for the entire table. What I do instead is tag each feature with the operation it will carry out. I do this using an attribute called fme_db_operation:

Database Updates AND Deletes

You can see here that I have added an attribute to each stream of data, using an AttributeCreator transformer. The attribute name is fme_db_operation. For one lot of data I set the value to UPDATE. The other set of data has a value of DELETE. This is how I tag each feature with its own operation.

I still have to set the operation type on the table itself. But this time, rather than choosing Insert, Update, or Delete, I choose the option labelled fme_db_operation:

Now when I run the workspace, the features tagged UPDATE update database records, while the features tagged DELETE delete database records. The Match Column (or WHERE clause) provides a match between features and records.

The one assumption is that we already know which features are updates and which are deletes. In the above example, the source data is already divided into two. If we aren’t sure of that, we might need to do what is called Change Detection…

Change Detection and Database Updates with FME

Change Detection is where we have a new dataset and want to compare it to existing records to find what has changed. Here is such a workspace:

It looks quite simple, and in fact it is. I have added a reader (Readers > Add Reader) to read the existing contents of my database table, and an UpdateDetector transformer to compare those records to the ParksUpdates dataset and identify where changes have occurred. I can detect changes on field values, on spatial contents, or both.

Then it’s just a case of writing the results back to the database table. I don’t even need to create the fme_db_operation attribute; the UpdateDetector has done that for me. I just need to check that the table is set to the correct operation (fme_db_operation) and that Match Column is set.

At this point you probably know more than enough to carry out database updates; but there is one more scenario I can perhaps mention. What if each feature has a different match column? In that case you can write your match in the form of a where clause, and store it as an attribute. Then use that attribute for the match in the table parameters:

…just like that!

Wrap Up

I hope that was a good explanation for database updates, short on theory and long on practical examples. Speaking of which, we have an online tutorial on database updates, where you can carry out exercises, step by step, using similar examples to what I showed today. So if you want to try some of these techniques in a safe practice environment, click the above link to visit the tutorial.

As I mentioned, although I used Postgres here, most of our database formats use an identical interface because we’ve invested a lot of time standardizing them all. That work is still ongoing, so one of the tutorial articles includes a list of standardized formats, and what to do if your format is not yet updated.

And now, if you’ll excuse me, I’m going to do like Perry Mason and wrap up my case by going out to celebrate with steak and cocktails!

PS: Safe Software now has an Instagram page. I’m not sure what an instagram is, but here’s a picture of me presenting at a prior user conference (oh, and riding an inflatable horse):

https://www.instagram.com/safesoftware/


by Mark Ireland at October 25, 2018 10:03 AM

October 15, 2018

Oslandia

French Ministry in charge of the ecological transition selected Oslandia

Ministère de la Transition Écologique et Solidaire Logo

The French Ministry of the Ecological Transition selected Oslandia for two of the three packages of its call for tenders dedicated to geomatics tools. We are very proud to dedicate our team to one of the strongest supporters of geomatics and open source in France for the next 2 to 4 years.

The first package is dedicated to expert studies covering spatial databases, software, components, protocols, norms and standards in the geomatics field.

The second package provides support and development for QGIS, the PostGIS spatial extension for PostgreSQL, and their components. We are really happy to continue the joint work already begun under the previous contract.

This is yet more evidence of a major trend toward open source investment, with geomatics components currently among the most dynamic and strongest open source projects. It also confirms that organizations are now deeply integrating the economic rationale of open source contribution into their policies.

by Régis Haubourg at October 15, 2018 02:08 PM

October 01, 2018

Paul Ramsey

PostGIS Code Sprint 2018 #2

An important topic of conversation this sprint was what kinds of core PostgreSQL features might make PostGIS better in the future?

PostGIS Code Sprint 2018 #2

Wider Typmod

The current attribute typmod column is a 32-bit integer. For geometry, we are already packing those 32 bits to the limit: a couple of bits for dimensionality, some more for the type number, and the big kahuna, a bunch for the SRID number. Even a 64-bit typmod would allow more interesting things, like declared coordinate precision, to fit in there. Maybe we are abusing typmod and there’s a better way to do what we want, though?
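For readers who have not run into typmod before: it is the integer modifier that rides along with a column’s type declaration. A minimal sketch (the sample_points table here is just an example):

-- Geometry type, Z/M dimensionality and SRID are all packed into a single
-- 32-bit typmod on the column.
CREATE TABLE sample_points (
    id   serial PRIMARY KEY,
    geom geometry(PointZM, 4326)
);

-- The packed value is visible in the system catalog.
SELECT atttypmod
  FROM pg_attribute
 WHERE attrelid = 'sample_points'::regclass
   AND attname  = 'geom';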

Parallel GIST Scan

The PostGIS spatial index is built using the PostgreSQL GIST index infrastructure, so anything that makes GIST scans faster is a win for us. This would be a big win for folks with large tables (and thus deep trees) and who run scans that return a lot of indexed tuples.

Faster GIST Index Building

B-Tree index builds are accelerated by pre-sorting the inputs; could the same trick be used when building GIST indexes? Again, for large tables, GIST index building is slower than B-Tree, and “faster” is the #1 feature all existing users want.

Multi-Threading in Functions

This isn’t a feature request, so much as a request for clarification and assurance: PostGIS calls out to other libraries, like GEOS, and it’s possible we could make some of our algorithms there faster via parallel processing. If we do our parallel processing within a function call, so the PostgreSQL thread of execution isn’t doing anything until we return, is it OK for us to do threading? We use pthreads, or maybe OpenMP.

Compressed Datum Slicing

“Detoast datum slice” doesn’t actually get around to the slicing step until after the datum is decompressed, which can make some queries quite slow. We already try to read boxes from the headers of our objects, and for large objects that means decompressing the whole thing: it would be nice to only decompress the first few bytes. I have an ugly patch I will be testing to try and get committed.

Forced Inlining

A problem we have with PostgreSQL right now is that we cannot effectively cost our functions due to the current inlining behaviour on our wrapper functions like ST_Intersects(). When raised on the list, the hackers came to a tentative conclusion that improving the value caching behaviour would be a good way to avoid having inlining in incorrect places. It sounded like subtle and difficult work, and nobody jumped to it.

We propose leaving the current inlining logic unchanged, but adding an option to SQL language functions to force inlining for those functions. That way there is no ambiguity: we always want our wrapper functions inlined; always, always, always. If we could use an INLINE option on our wrapper function definitions all the current careful logic can remain in place for more dynamic use cases.
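To make the proposal concrete: PostGIS’s wrappers are plain SQL-language functions along these general lines, and the idea is to be able to mark such a function for unconditional inlining. The sketch below is simplified, and the INLINE keyword is hypothetical, not existing PostgreSQL syntax.

-- Simplified wrapper in the spirit of ST_Intersects(); the shipped definition differs in detail.
CREATE OR REPLACE FUNCTION my_intersects(g1 geometry, g2 geometry)
RETURNS boolean LANGUAGE sql IMMUTABLE
-- INLINE   -- hypothetical option: always inline this wrapper, regardless of argument costs
AS $$ SELECT g1 && g2 AND _ST_Intersects(g1, g2) $$;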

Extension Version Dependency

PostGIS is growing a small forest of dependent extensions

  • some inside the project, like postgis_topology and now postgis_raster
  • some outside the project, like PgRouting and pgpointcloud

When a new version of PostGIS is installed, we want all the PostGIS extensions to be updated. When a third party extension is installed, it may require features from a particular recent version of PostGIS.

The extension framework supports dependency, but for safety, as the ecosystem grows, version dependencies are going to be required eventually.

Global Upgrade Paths

Right now extension upgrade paths have to explicitly state the start and end version of the path. So an upgrade file might be named postgis--2.3.4--2.3.5.sql. That’s great if you have four or five versions. We have way more than that. The number of upgrade files we have keeps on growing and growing.

Unlike the upgrade files of smaller projects, ours drop and recreate all the functions. That means our current version upgrade file is actually capable of upgrading any prior version. Nonetheless, we have to make a copy, or a symlink, for many, many version combinations.

If there was a global “version”, we could use our master upgrade script, and only ship one script for each new version: postgis--ANY--2.3.5.sql

Size Based Costing in the Planner

Right now costing in the planner is based heavily on the “number of rows” a given execution path might generate at each stage. This is fine when the cost of processing each tuple is fairly uniform.

For us, the cost of processing a tuple can vary wildly: calculating the area of a 4 point polygon is far cheaper than calculating the area of a 40000 point polygon. Pulling a large feature out of TOAST tuples is more expensive than pulling it from main storage.

Having function COST taken into more consideration in plans, and having that COST scale with the average size of tuples would make for better plans for PostGIS. It would also make for better plans for PostgreSQL types that can get very large, like text and tsvector.

The analysis hooks might have to be enriched to also ask for stats on average tuple size for a query key, in addition to selectivity, so a query that pulled a moderate number of huge objects might have a higher cost than one that pulled quite a few small objects.

We Are Not Unreasonable People

We just want our due, you know?

Thanks, PostgreSQL team!

October 01, 2018 08:00 AM

September 30, 2018

Paul Ramsey

PostGIS Code Sprint 2018 #1

When I tell people I am heading to an open source “code sprint”, which I try to do at least once a year, they ask me “what do you do there?”

When I tell them, “talk, mostly”, they are usually disappointed. There’s a picture, which is not unearned, of programmers bent over their laptops, quietly tapping away. And that happens, but the real value, even when there is lots of tapping, is in the high-bandwidth, face-to-face communication.

PostGIS Code Sprint 2018 #1

So, inevitably I will be asked what I coded, this week at the PostGIS Code Sprint and I will answer… “uhhhhh”. I did a branch of PostgreSQL that will do partial decompression of compressed tuples, but didn’t get around to testing it. I tested some work that others had done. But mostly, we talked.

PostGIS 3

Why move to PostGIS 3 for the next release? Not necessarily because we will have any earth-shattering features, but to carry out a number of larger changes. Unlike most PostgreSQL extensions, PostGIS has a lot of legacy from past releases and has added, removed and renamed functions over time. These things are disruptive, and we’d like to do some of the known disruptive things at one time.

Split Vector and Raster

When we brought raster into PostGIS, we included it in the “postgis” extension, so if you CREATE EXTENSION postgis you get both vector and raster features. The rationale was that if we left it optional, packagers wouldn’t build it, and thus most people wouldn’t have access to the functionality, so it wouldn’t get used, so we’d be maintaining unused garbage code.

Even being included in the extension, by and large people haven’t used it much, and the demand from packagers and other users for a “thin” PostGIS with only vector functionality has finally prevailed: when you ALTER EXTENSION postgis UPDATE TO '3.0.0' the raster functions will be unbundled from the extension. They can then be re-bundled into a “postgis_raster” dependent package and either dropped or kept around depending on user preference.
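If the plan lands as described, opting back into raster would look roughly like this; the extension name is the one discussed at the sprint, and the final packaging could still change.

CREATE EXTENSION postgis;          -- vector-only core
CREATE EXTENSION postgis_raster;   -- raster support, as a dependent extension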

Remove postgis.so Minor Version

For users in production, working with packaged PostgreSQL, in deb or rpm packages, the packaging system often forces you to have only one version of PostGIS installed at a time. When upgrading PostgreSQL and PostGIS the net effect is to break pg_upgrade, meaning PostGIS users are mandated to do a full dump/restore.

Removing the minor version will allow the pg_upgrade process to run through, and users can then run the sql ALTER EXTENSION postgis UPDATE command to synchronize their SQL functions with the new binary postgis.so library.

This is good for most users. It’s bad for users who expect to be able to run multiple versions of PostGIS on one server: they won’t easily be able to. There will be a switch to make it possible to build with minor versions again, but we expect it will mostly be used by us (developers) for testing and development.

Serialization

As I discussed recently, the compression of geometries in PostgreSQL can have a very large effect on performance.

A new serialization could:

  • use a more effective compression format for our kind of data, arrays of autocorrelated doubles
  • add space for more flag bits for things like
    • encoding a smaller point format
    • flagging empty geometries
    • noting the validity of the object
    • noting the presence of a unique hash code
    • extra version bits
    • optional on-disk internal indexes or summary shapes

It’s not clear that a new serialization is a great idea. The only really pressing problem is that we are starting to use up our flag space.

Validity Flag

Testing geometry validity is computationally expensive, so for workflows that require validity a lot of time is spent checking and rechecking things that have already been confirmed to be valid. Having a flag on the object would allow the state to be marked once, the first time the check is done.

The downside of a validity flag is that every operation that alters coordinates must then carefully make sure to turn the flag off again, as the object may have been rendered invalid by the processing.

Exception Policy

A common annoyance for advanced users of PostGIS is when a long running computation stops suddenly and the database announces “TopologyException”!

It would be nice to provide some ways for users to indicate to the database that they are OK losing some data or changing some data, if that will complete the processing for 99% of the data.

We discussed adding some standard parameters to common processing functions:

  • null_on_exception (true = return null, false = exception)
  • repair_on_exception (true = makevalid() the inputs, false = do the null_on_exception action)
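To make the idea concrete, a call against hypothetical tables might eventually look like the sketch below; the parameter name and tables are purely illustrative and not part of any released PostGIS API.

-- Hypothetical: return NULL instead of raising on a TopologyException
SELECT ST_Intersection(a.geom, b.geom, null_on_exception => true)
  FROM parcels a
  JOIN zones b ON ST_Intersects(a.geom, b.geom);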

Modern C

C is not a quickly changing language, but since the PostgreSQL project has moved to C99, we will also be moving to C99 as our checked standard language.

Named Parameters

A lot of our function definitions were written before the advent of default values and named parameters as PostgreSQL function features. We will modernize our SQL extension file so we’re using named parameters everywhere. For users this will mean that strict parameter order is no longer required; it becomes optional if you use named parameters.
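As a quick illustration of what named notation buys you (the clip_buffer helper and the parks table are invented for this example; actual PostGIS parameter names may differ):

-- A helper with defaults, defined the "modern" way.
CREATE FUNCTION clip_buffer(geom geometry, radius float8 DEFAULT 10.0, quad_segs integer DEFAULT 8)
RETURNS geometry LANGUAGE sql IMMUTABLE
AS $$ SELECT ST_Buffer(geom, radius, quad_segs) $$;

-- Named notation: argument order no longer matters, and defaults can simply be omitted.
SELECT clip_buffer(radius => 25.0, geom => geom) FROM parks;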

M Coordinate

PostGIS supports “4 dimensional” features, with X, Y, Z and M, but while everyone knows what X, Y and Z are, only GIS aficionados know what “M” is. We will document M and also try to bring M support into GEOS so that operations in GEOS are “M preserving”.
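If you have never used M: it is simply a fourth per-vertex value, typically a measure such as a timestamp or a distance along a route. For example:

SELECT ST_M(ST_PointN(geom, 2)) AS second_measure
  FROM (SELECT 'LINESTRINGM(0 0 100, 1 1 200)'::geometry AS geom) AS t;
-- returns 200, the measure stored on the second vertex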

Project Actions

Web Site

The web site content is getting a little crufty, and the slick styling of 7 years ago is not quite as slick anymore. We agreed that it would be OK to modernize, with particular emphasis on:

  • thinking about new user onboarding and the process of learning
  • supporting mobile devices with a responsive site style
  • using “standard” static site infrastructure (jekyll probably)

Standard Data

We agreed that having some standard data that was easy to install would make a whole bunch of other tasks much easier:

  • writing standard workshops and tutorials, so all the examples lined up
  • writing a performance harness that tracked the performance of common queries over time
  • having examples in the reference documentation that didn’t always have to generate their inputs on the fly

News File Policy

It’s a tiny nit, but for developers back-porting fixes over our 4-5 stable branches, knowing where to note bugs in the NEWS files, and doing it consistently is important.

  • Bug fixes applied to stable branches always listed in NEWS for each branch applied to
  • Bug fixes applied to stable and trunk should be listed in NEWS for each branch and NEWS in trunk
  • Bug fixes applied to only trunk can be left out of trunk NEWS (these bugs are development artifacts, not bugs that users in production have reported)

Next…

We also discussed features and changes to PostgreSQL that would help PostGIS improve, and I’ll write about those in the next post.

September 30, 2018 08:00 AM

September 28, 2018

Paul Ramsey

5x Faster Spatial Join with this One Weird Trick

My go-to performance test for PostGIS is the point-in-polygon spatial join: given a collection of polygons of variables sizes and a collection of points, count up how many points are within each polygon. It’s a nice way of testing indexing, point-in-polygon calculations and general overhead.

Setup

First download some polygons and some points.

Load the shapes into your database.

shp2pgsql -s 4326 -D -I ne_10m_admin_0_countries.shp countries | psql performance
shp2pgsql -s 4326 -D -I ne_10m_populated_places.shp places | psql performance

Now we are ready with 255 countries and 7343 places.

Countries and Places

One thing to note about the countries is that they are quite large objects, with 149 of them having enough vertices to be stored in TOAST tuples.

SELECT count(*) 
  FROM countries 
  WHERE ST_NPoints(geom) > (8192 / 16);

Baseline Performance

Now we can run the baseline performance test.

SELECT count(*), c.name 
  FROM countries c 
  JOIN places p 
  ON ST_Intersects(c.geom, p.geom) 
  GROUP BY c.name;

On my laptop, this query takes 25 seconds.

If you stick the process into a profiler while running it you’ll find that over 20 of those seconds are spent in the pglz_decompress function. Not doing spatial algorithms or computational geometry, just decompressing the geometry before handing it on to the actual processing.

Among the things we talked about this week at our PostGIS code sprint have been clever ways to avoid this overhead:

  • Patch PostgreSQL to allow partial decompression of geometries.
  • Enrich our serialization format to include a unique hash key at the front of geometries.

These are cool have-your-cake-and-eat-too ways to both retain compression for large geometries and be faster when feeding them into the point-in-polygon machinery.

However, they ignore a more brutal and easily testable approach to avoiding decompression: just don’t compress in the first place.

One Weird Trick

PostGIS uses the “main” storage option for its geometry type. The main option tries to keep geometries in their original table until they get too large, then compresses them in place, then moves them to TOAST.

There’s another option “external” that keeps geometries in place, and if they get too big moves them to TOAST uncompressed. PostgreSQL allows you to change the storage on columns at run-time, so no hacking or code is required to try this out.

-- Change the storage type
ALTER TABLE countries
  ALTER COLUMN geom
  SET STORAGE EXTERNAL;

-- Force the column to rewrite
UPDATE countries
  SET geom = ST_SetSRID(geom, 4326);

-- Re-run the query  
SELECT count(*), c.name 
  FROM countries c 
  JOIN places p 
  ON ST_Intersects(c.geom, p.geom) 
  GROUP BY c.name;

The spatial join now runs in under 4 seconds.

What’s the penalty?

  • With “main” storage, the table+toast+index is 6MB.
  • With “external” storage, the table+toast+index is 9MB.
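If you want to reproduce those numbers on your own copy of the data, the combined table + TOAST + index size is what pg_total_relation_size() reports:

SELECT pg_size_pretty(pg_total_relation_size('countries'));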

Conclusion

For a 50% storage penalty, on a table that has far more large objects than most spatial tables, we achieved a 500% performance improvement. Maybe we shouldn’t apply compression to our large geometries at all?

Using “main” storage was a judgement call back when we decided on it; it wasn’t benchmarked or anything, so it’s possible that we were just wrong. Also, only large objects are compressed; since most tables are full of lots of small objects (short lines, points), changing to “external” by default wouldn’t have any effect on their storage size at all.

September 28, 2018 08:00 AM

September 23, 2018

PostGIS Development

PostGIS 2.5.0

The PostGIS development team is pleased to release PostGIS 2.5.0.

Although this release will work for PostgreSQL 9.4 and above, to take full advantage of what PostGIS 2.5 offers, you should be running PostgreSQL 11beta4+ and GEOS 3.7.0 which were released recently.

Best served with PostgreSQL 11 beta4 and pgRouting 2.6.1.

WARNING: If compiling with PostgreSQL+JIT, LLVM >= 6 is required.

Supported PostgreSQL versions for this release: PostgreSQL 9.4 - PostgreSQL 12 (in development). GEOS >= 3.5.

2.5.0

Continue Reading by clicking title hyperlink ..

by Regina Obe at September 23, 2018 12:00 AM

September 16, 2018

PostGIS Development

PostGIS 2.5.0rc2

The PostGIS development team is pleased to release PostGIS 2.5.0rc2.

Although this release will work for PostgreSQL 9.4 and above, to take full advantage of what PostGIS 2.5 offers, you should be running PostgreSQL 11beta3+ and GEOS 3.7.0 which were released recently.

Best served with PostgreSQL 11beta3.

WARNING: If compiling with PostgreSQL 11+JIT, LLVM >= 6 is required

2.5.0rc2

Changes since PostGIS 2.5.0rc1 release are as follows:

  • 4162, ST_DWithin documentation examples for storing geometry and radius in table (Darafei Praliaskouski, github user Boscop).
  • 4163, MVT: Fix resource leak when the first geometry is NULL (Raúl Marín)
  • 4172, Fix memory leak in lwgeom_offsetcurve (Raúl Marín)
  • 4164, Parse error on incorrectly nested GeoJSON input (Paul Ramsey)
  • 4176, ST_Intersects supports GEOMETRYCOLLECTION (Darafei Praliaskouski)
  • 4177, Postgres 12 disallows variable length arrays in C (Laurenz Albe)
  • 4160, Use qualified names in topology extension install (Raúl Marín)
  • 4180, installed liblwgeom includes sometimes getting used instead of source ones (Regina Obe)

View all closed tickets for 2.5.0.

After installing the binaries or after running pg_upgrade, make sure to do:

ALTER EXTENSION postgis UPDATE;

— if you use the other extensions packaged with postgis — make sure to upgrade those as well

ALTER EXTENSION postgis_sfcgal UPDATE;
ALTER EXTENSION postgis_topology UPDATE;
ALTER EXTENSION postgis_tiger_geocoder UPDATE;

If you use legacy.sql or legacy_minimal.sql, make sure to rerun the version packaged with these releases.

by Regina Obe at September 16, 2018 12:00 AM

September 13, 2018

Simon Greener

PGAdmin Finally Has A Spatial Viewer

A great day has arrived for PostgreSQL developers (not just spatial geeks): an integrated spatial viewer has finally been added to pgAdmin4 3.3, as announced by Regina Obe.

You can see the announcement and some examples here.

by Simon Greener at September 13, 2018 11:10 PM

September 12, 2018

PostGIS Development

PostGIS Patch Release 2.4.5

The PostGIS development team is pleased to provide bug fix 2.4.5 for the 2.4 stable branch.

View all closed tickets for 2.4.5.

After installing the binaries or after running pg_upgrade, make sure to do:

ALTER EXTENSION postgis UPDATE;

— if you use the other extensions packaged with postgis — make sure to upgrade those as well

ALTER EXTENSION postgis_sfcgal UPDATE;
ALTER EXTENSION postgis_topology UPDATE;
ALTER EXTENSION postgis_tiger_geocoder UPDATE;

If you use legacy.sql or legacy_minimal.sql, make sure to rerun the version packaged with these releases.

2.4.5

by Paul Ramsey at September 12, 2018 12:00 AM

September 10, 2018

Paul Ramsey

Parallel PostGIS and PgSQL 11

A little under a year ago, with the release of PostgreSQL 10, I evaluated the parallel query infrastructure and how well PostGIS worked with it.

The results were less than stellar for my example data, which was small-but-not-too-small: under default settings of PostgreSQL and PostGIS, parallel behaviour did not occur.

However, unlike in previous years, as of PostgreSQL 10 it was possible to get parallel plans by making changes to PostGIS settings only. This was a big improvement over PostgreSQL 9.6, in which substantial changes to the PostgreSQL default settings were needed to force parallel plans.

PostgreSQL 11 promises more improvements to parallel query:

  • Parallelized hash joins
  • Parallelized CREATE INDEX for B-tree indexes
  • Parallelized CREATE TABLE .. AS, CREATE MATERIALIZED VIEW, and certain queries using UNION

With the exception of CREATE TABLE ... AS, none of these are going to affect spatial parallel query. However, there have also been some non-headline changes that have improved parallel planning and thus spatial queries.

Parallel PostGIS and PgSQL 11

TL;DR:

PostgreSQL 11 has slightly improved parallel spatial query:

  • Costly spatial functions on the query target list (aka, the SELECT ... line) will now trigger a parallel plan.
  • Under default PostGIS costings, parallel plans do not kick in as soon as they should.
  • Parallel aggregates parallelize readily under default settings.
  • Parallel spatial joins require higher costings on functions than they probably should, but will kick in if the costings are high enough.

Setup

In order to run these tests yourself, you will need:

  • PostgreSQL 11
  • PostGIS 2.5

You’ll also need a multi-core computer to see actual performance changes. I used a 4-core desktop for my tests, so I could expect 4x improvements at best.

The setup instructions show where to download the Canadian polling division data used for the testing:

  • pd a table of ~70K polygons
  • pts a table of ~70K points
  • pts_10 a table of ~700K points
  • pts_100 a table of ~7M points

PDs

We will work with the default configuration parameters and just mess with the max_parallel_workers_per_gather at run-time to turn parallelism on and off for comparison purposes.

When max_parallel_workers_per_gather is set to 0, parallel plans are not an option.

  • max_parallel_workers_per_gather sets the maximum number of workers that can be started by a single Gather or Gather Merge node. Setting this value to 0 disables parallel query execution. Default 2.

Before running tests, make sure you have a handle on what your parameters are set to: I frequently found I accidentally tested with max_parallel_workers set to 1, which will result in two processes working: the leader process (which does real work when it is not coordinating) and one worker.

show max_worker_processes;
show max_parallel_workers;
show max_parallel_workers_per_gather;

Aggregates

Behaviour for aggregate queries is still good, as seen in PostgreSQL 10 last year.

SET max_parallel_workers = 8;
SET max_parallel_workers_per_gather = 4;

EXPLAIN ANALYZE 
  SELECT Sum(ST_Area(geom)) 
    FROM pd;

Boom! We get a 3-worker parallel plan and execution about 3x faster than the sequential plan.

Scans

The simplest spatial parallel scan adds a spatial function to the target list or filter clause.

SET max_parallel_workers = 8;
SET max_parallel_workers_per_gather = 4;

EXPLAIN ANALYZE 
  SELECT ST_Area(geom)
    FROM pd; 

Unfortunately, that does not give us a parallel plan.

The ST_Area() function is defined with a COST of 10. If we move it up, to 100, we can get a parallel plan.

SET max_parallel_workers_per_gather = 4;

ALTER FUNCTION ST_Area(geometry) COST 100;

EXPLAIN ANALYZE 
  SELECT ST_Area(geom)
    FROM pd;

Boom! Parallel scan with three workers. This is an improvement from PostgreSQL 10, where a spatial function on the target list would not trigger a parallel plan at any cost.

Joins

Starting with a simple join of all the polygons to the 100 points-per-polygon table, we get:

SET max_parallel_workers_per_gather = 4;

EXPLAIN  
 SELECT *
  FROM pd 
  JOIN pts_100 pts
  ON ST_Intersects(pd.geom, pts.geom);

PDs & Points

In order to give the PostgreSQL planner a fair chance, I started with the largest table, thinking that the planner would recognize that a “70K rows against 7M rows” join could use some parallel love, but no dice:

Nested Loop  
(cost=0.41..13555950.61 rows=1718613817 width=2594)
 ->  Seq Scan on pd  
     (cost=0.00..14271.34 rows=69534 width=2554)
 ->  Index Scan using pts_gix on pts  
     (cost=0.41..192.43 rows=232 width=40)
       Index Cond: (pd.geom && geom)
       Filter: _st_intersects(pd.geom, geom)

As with all parallel plans, it is a nested loop, but that’s fine since all PostGIS joins are nested loops.

First, note that our query can be re-written like this, to expose the components of the spatial join:

EXPLAIN  
 SELECT *
  FROM pd 
  JOIN pts_100 pts
   ON pd.geom && pts.geom 
   AND _ST_Intersects(pd.geom, pts.geom);

The default cost of _ST_Intersects() is 100. If we adjust it up by a factor of 100, we can get a parallel plan.

ALTER FUNCTION _ST_Intersects(geometry, geometry) COST 10000;

Can we achieve the same effect by adjusting the cost of the && operator? The && operator could activate one of two functions:

  • geometry_overlaps(geom, geom) is bound to the && operator
  • geometry_gist_consistent_2d(internal, geometry, int4) is bound to the 2d spatial index

However, no amount of increasing their COST causes the operator-only query plan to flip into a parallel mode:

ALTER FUNCTION  geometry_overlaps(geometry, geometry) COST 1000000000000;
ALTER FUNCTION  geometry_gist_consistent_2d(internal, geometry, int4) COST 10000000000000;

So for operator-only queries, it seems the only way to force a parallel spatial join is to muck with the parallel_tuple_cost parameter.
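For experimentation only (these are session-level settings, not a recommendation), that kind of nudge looks like this:

SET parallel_tuple_cost = 0.01;          -- default is 0.1; lower values make parallel paths look cheaper
SET max_parallel_workers_per_gather = 4;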

Costing PostGIS?

A relatively simple way to push more parallel behaviour out to the PostGIS user community would be applying a global increase of PostGIS function costs. Unfortunately, doing so has knock-on effects that will break other use cases badly.

In brief, PostGIS uses wrapper functions, like ST_Intersects() to hide the index operators that speed up queries. So a query that looks like this:

SELECT ...
FROM ...
WHERE ST_Intersects(A, B)

Will be expanded by PostgreSQL “inlining” to look like this:

SELECT ...
FROM ...
WHERE A && B AND _ST_Intersects(A, B)

The expanded version includes both an index operator (for a fast, loose evaluation of the filter) and an exact operator (for an expensive and correct evaluation of the filter).

If the arguments “A” and “B” are both geometry, this will always work fine. But if one of the arguments is a highly costed function, then PostgreSQL will no longer inline the function. The index operator will then be hidden from the planner, and index scans will not come into play. PostGIS performance falls apart.

This isn’t unique to PostGIS; it’s just a side effect of some old code in PostgreSQL, and it can be replicated using PostgreSQL built-in types too.

It is possible to change current inlining behaviour with a very small patch but the current inlining behaviour is useful for people who want to use SQL wrapper functions as a means of caching expensive calculations. So “fixing” the behaviour PostGIS would break it for some non-empty set of existing PostgreSQL users.

Tom Lane and Andres Freund briefly discussed a solution involving a smarter approach to inlining that would preserve the ability to inline while avoiding double work when inlining expensive functions, but the discussion petered out after that.

As it stands, PostGIS functions cannot be properly costed to take maximum advantage of parallelism until PostgreSQL inlining behaviour is made more tolerant of costly parameters.

Conclusions

  • PostgreSQL seems to weight declared cost of functions relatively low in the priority of factors that might trigger parallel behaviour.

    • In sequential scans, costs of 100+ are required.
    • In joins, costs of 10000+ are required. This is suspicious (100x more than scan costs?) and, even with fixes in function costing, probably not desirable.
  • Required changes in PostGIS costs for improved parallelism will break other aspects of PostGIS behaviour until changes are made to PostgreSQL inlining behaviour…

September 10, 2018 04:00 PM

September 09, 2018

Postgres OnLine Journal (Leo Hsu, Regina Obe)

PGOpen 2018 Data Loading Presentation Slides

At PGOpen 2018 in San Francisco, we gave a talk on 10 ways to load data into Postgres. This is one of the rare talks where we didn't talk much about PostGIS. However, we did showcase the tools ogr_fdw, ogr2ogr and shp2pgsql, which are commonly used for loading spatial data but are equally good for loading non-spatial data. Below are the slide links.

Continue reading "PGOpen 2018 Data Loading Presentation Slides"

by Leo Hsu and Regina Obe (nospam@example.com) at September 09, 2018 04:16 AM

September 08, 2018

Boston GIS (Regina Obe, Leo Hsu)

pgAdmin4 now offers PostGIS geometry viewer

pgAdmin4 version 3.3, released this week, comes with a PostGIS geometry viewer. You will be able to see the graphical output of your query directly in pgAdmin, provided you output a geometry or geography column. If your column is of SRID 4326 (WGS 84 lon/lat), pgAdmin will automatically display it against an OpenStreetMap background.

We have Xuri Gong to thank for working on this as a PostGIS/pgAdmin Google Summer of Code (GSOC) project. We'd like to thank Victoria Rautenbach and Frikan Erwee for mentoring.

Continue reading "pgAdmin4 now offers PostGIS geometry viewer"

by Regina Obe (nospam@example.com) at September 08, 2018 09:13 PM

September 04, 2018

Paul Ramsey

Moving on to Crunchy Data

Today is my first day with my new employer Crunchy Data. Haven’t heard of them? I’m not surprised: outside of the world of PostgreSQL, they are not particularly well known, yet.

Moving on to Crunchy Data

I’m leaving behind a pretty awesome gig at CARTO, and some fabulous co-workers. Why do such a thing?

While CARTO is turning in constant growth and finding good traction with some core spatial intelligence use cases, the path to success is leading them into solving problems of increasing specificity: logistics optimization, siting, market evaluation.

Moving to Crunchy Data means transitioning from being the database guy (boring!) in a geospatial intelligence company, to being the geospatial guy (super cool!) in a database company. Without changing anything about myself, I get to be the most interesting guy in the room! What could be better than that?

Crunchy Data has quietly assembled an exceptionally deep team of PostgreSQL community members: Tom Lane, Stephen Frost, Joe Conway, Peter Geoghegan, Dave Cramer, David Steele, and Jonathan Katz are all names that will be familiar to followers of the PostgreSQL mailing lists.

They’ve also quietly assembled expertise in key areas of interest to large enterprises: security deployment details (STIGs, RLS, Common Criteria); Kubernetes and PaaS deployments; and now (ta da!) geospatial.

Why does this matter? Because the database world is at a technological inflection point.

Core enterprise systems change very infrequently, and only under pressure from multiple sources. The last major inflection point was around the early 2000s, when the fleet of enterprise proprietary UNIX systems came under pressure from multiple sources:

  • The RISC architecture began to fall noticeably behind x86 and, in particular, x86-64.
  • Pricing on RISC systems began to diverge sharply from x86 systems.
  • A compatible UNIX operating system (Linux) was available on the alternative architecture.
  • A credible support company (Red Hat) was available and speaking the language of the enterprise.

The timeline of the Linux tidal wave was (extremely roughly):

  • 90s - Linux becomes the choice of the tech cognoscenti.
  • 00s - Linux becomes the choice of everyone for greenfield applications.
  • 10s - Linux becomes the choice of everyone for all things.

By my reckoning, PostgreSQL is on the verge of a Linux-like tidal wave that washes away much of the enterprise proprietary database market (aka Oracle DBMS). Bear in mind, these things pan out over 30 year timelines, but:

  • Oracle DBMS offers no important feature differentiation for most workloads.
  • Oracle DBMS price hikes are driving customers to distraction.
  • Server-in-a-cold-room architectures are being replaced with the cloud.
  • PostgreSQL in the cloud, deployed as PaaS or otherwise, is mature.
  • A credible support industry (including Crunchy Data) is at hand to support migrators.

I’d say we’re about half way through the evolution of PostgreSQL from “that cool database” to “the database”, but the next decade of change is going to be the one people notice. People didn’t notice Linux until it was already running practically everything, from web servers to airplane seatback entertainment systems. The same thing will obtain in database land; people won’t recognize the inevitability of PostgreSQL as the “default database” until the game is long over.

Having a chance to be a part of that change, and to promote geospatial as a key technology while it happens, is exciting to me, so I’m looking forward to my new role at Crunchy Data a great deal!

Meanwhile, I’m going to be staying on as a strategic advisor to CARTO on geospatial and database affairs, so I get to have a front seat on their continued growth too. Thanks to CARTO for three great years, I enjoyed them immensely!

September 04, 2018 04:00 PM

August 19, 2018

PostGIS Development

PostGIS 2.5.0rc1

The PostGIS development team is pleased to release PostGIS 2.5.0rc1.

This release is a work in progress. Remaining time will be focused on bug fixes and documentation until PostGIS 2.5.0 release. Although this release will work for PostgreSQL 9.4 and above, to take full advantage of what PostGIS 2.5 offers, you should be running PostgreSQL 11beta3+ and GEOS 3.7.0rc1 which were released recently.

Best served with PostgreSQL 11beta3 which was recently released.

Changes since PostGIS 2.5.0beta2 release are as follows:

  • 4146, Fix compilation error against Postgres 12 (Raúl Marín).
  • 4147, 4148, Honor SOURCEDATEEPOCH when present (Christoph Berg).

View all closed tickets for 2.5.0.

After installing the binaries or after running pg_upgrade, make sure to do:

ALTER EXTENSION postgis UPDATE;

— if you use the other extensions packaged with postgis — make sure to upgrade those as well

ALTER EXTENSION postgis_sfcgal UPDATE;
ALTER EXTENSION postgis_topology UPDATE;
ALTER EXTENSION postgis_tiger_geocoder UPDATE;

If you use legacy.sql or legacy_minimal.sql, make sure to rerun the version packaged with these releases.

2.5.0rc1

by Regina Obe at August 19, 2018 12:00 AM

August 11, 2018

PostGIS Development

PostGIS 2.5.0beta2

The PostGIS development team is pleased to release PostGIS 2.5.0beta2.

This release is a work in progress. Remaining time will be focused on bug fixes and documentation until PostGIS 2.5.0 release. Although this release will work for PostgreSQL 9.4 and above, to take full advantage of what PostGIS 2.5 offers, you should be running PostgreSQL 11beta3+ and GEOS 3.7.0beta2 which were released recently.

Best served with PostgreSQL 11beta3 which was recently released.

Changes since PostGIS 2.5.0beta1 release are as follows:

  • 4115, Fix a bug that created MVTs with incorrect property values under parallel plans (Raúl Marín).
  • 4120, ST_AsMVTGeom: Clip using tile coordinates (Raúl Marín).
  • 4132, ST_Intersection on Raster now works without throwing TopologyException (Vinícius A.B. Schmidt, Darafei Praliaskouski)
  • 4109, Fix WKT parser accepting and interpreting numbers with multiple dots (Raúl Marín, Paul Ramsey)
  • 4140, Use user-provided CFLAGS in address standardizer and the topology module (Raúl Marín)
  • 4143, Fix backend crash when ST_OffsetCurve fails (Dan Baston)
  • 4145, Speedup MVT column parsing (Raúl Marín)

View all closed tickets for 2.5.0.

After installing the binaries or after running pg_upgrade, make sure to do:

ALTER EXTENSION postgis UPDATE;

— if you use the other extensions packaged with postgis — make sure to upgrade those as well

ALTER EXTENSION postgis_sfcgal UPDATE;
ALTER EXTENSION postgis_topology UPDATE;
ALTER EXTENSION postgis_tiger_geocoder UPDATE;

If you use legacy.sql or legacy_minimal.sql, make sure to rerun the version packaged with these releases.

2.5.0beta2

by Regina Obe at August 11, 2018 12:00 AM

August 03, 2018

Simon Greener

How to apply spatial constraints to PostGIS tables [2]

Spatial Constraints PostGis Ddl

by Simon Greener at August 03, 2018 02:44 AM

July 19, 2018

Anita Graser (Underdark)

Movement data in GIS #15: writing a PL/pgSQL stop detection function for PostGIS trajectories

Do you sometimes start writing an SQL query and, around line 50, get the feeling that it might be getting out of hand? If so, it might be useful to start breaking it down into smaller chunks and wrap those up in custom functions. Never done that? Don’t despair! There’s an excellent PL/pgSQL tutorial on postgresqltutorial.com to get you started.

To get an idea of the basic structure of a PL/pgSQL function and to prove that PostGIS data types work just fine in this context, here’s a basic function that takes a trajectory geometry and outputs its duration, i.e. the difference between its last and first timestamps:

CREATE OR REPLACE FUNCTION AG_Duration(traj geometry) 
RETURNS numeric LANGUAGE 'plpgsql'
AS $BODY$ 
BEGIN
RETURN ST_M(ST_EndPoint(traj))-ST_M(ST_StartPoint(traj));
END; $BODY$;
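For example, once the function above is created, a quick sanity check (the M values here stand in for timestamps in seconds):

SELECT AG_Duration('LINESTRINGM(0 0 0, 1 1 30, 2 2 60)'::geometry);
-- returns 60: the M of the last point minus the M of the first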

My end goal for this exercise was to implement a function that takes a trajectory and outputs the stops along this trajectory. Commonly, a stop is defined as a long stay within an area with a small radius. This leads us to the following definition:

CREATE OR REPLACE FUNCTION AG_DetectStops(
   traj geometry, 
   max_size numeric, 
   min_duration numeric)
RETURNS TABLE(sequence integer, geom geometry) 
-- implementation follows here!

Note how this function uses RETURNS TABLE to enable it to return all the stops that it finds. To add a line to the output table, we need to assign values to the sequence and geom variables and then use RETURN NEXT.

Another reason to use PL/pgSQL is that it enables us to write loops. And loops I wanted for my stop detection function! Specifically, I wanted to go through all the points in the trajectory:

FOR pt IN SELECT (ST_DumpPoints(traj)).geom LOOP
-- here comes the magic!
END LOOP;

Eventually the function should go through the trajectory and identify all segments that stay within an area with max_size diameter for at least min_duration time. To test for the area size, we can use:

IF ST_MaxDistance(segment,pt) <= max_size THEN is_stop := true; 

Putting everything together, my current implementation looks like this:

CREATE OR REPLACE FUNCTION AG_DetectStops(
   traj geometry,
   max_size numeric,
   min_duration numeric)
RETURNS TABLE(sequence integer, geom geometry) 
LANGUAGE 'plpgsql'
AS $BODY$
DECLARE 
   pt geometry;
   segment geometry;
   is_stop boolean;
   previously_stopped boolean;
   stop_sequence integer;
   p1 geometry;
BEGIN
segment := NULL;
sequence := 0;
is_stop := false;
previously_stopped := false;
p1 := NULL;
FOR pt IN SELECT (ST_DumpPoints(traj)).geom LOOP
   IF segment IS NULL AND p1 IS NULL THEN 
      p1 := pt; 
   ELSIF segment IS NULL THEN 
      segment := ST_MakeLine(p1,pt); 
      p1 := NULL;
      IF ST_Length(segment) <= max_size THEN is_stop := true; END IF;
   ELSE
      segment := ST_AddPoint(segment,pt);
      -- if we're in a stop, we want to grow the segment, otherwise we remove points to the specified min_duration
      IF NOT is_stop THEN
         WHILE ST_NPoints(segment) > 2 AND AG_Duration(ST_RemovePoint(segment,0)) >= min_duration LOOP
            segment := ST_RemovePoint(segment,0);
         END LOOP;
      END IF;
      -- a stop is identified if the segment stays within a circle of diameter = max_size
      IF ST_Length(segment) <= max_size THEN is_stop := true;
      ELSIF ST_Distance(ST_StartPoint(segment),pt) > max_size THEN is_stop := false;
      ELSIF ST_MaxDistance(segment,pt) <= max_size THEN is_stop := true;
      ELSE is_stop := false;
      END IF;
      -- if we found the end of a stop, we need to check if it lasted long enough
      IF NOT is_stop AND previously_stopped THEN
         IF ST_M(ST_PointN(segment,ST_NPoints(segment)-1))-ST_M(ST_StartPoint(segment)) >= min_duration THEN
            geom := ST_RemovePoint(segment,ST_NPoints(segment)-1); 
            RETURN NEXT;
            sequence := sequence + 1;
            segment := NULL;
            p1 := pt;
         END IF;
      END IF;
   END IF;
   previously_stopped := is_stop;
END LOOP;
IF previously_stopped AND AG_Duration(segment) >= min_duration THEN 
   geom := segment; 
   RETURN NEXT; 
END IF;
END; $BODY$;

While this function is not really short, it’s so much more readable than my previous attempts of doing this in pure SQL. Some of the lines for determining is_stop are not strictly necessary but they do speed up processing.

Performance still isn’t quite where I’d like it to be. I suspect that all the adding and removing of points from linestring geometries is not ideal. In general, it’s quicker to find shorter stops in smaller areas than longer stops in bigger areas.

Let’s test! 

Looking for a testing framework for PL/pgSQL, I found plpgunit on Github. While I did not end up using it, I did use its examples for inspiration to write a couple of tests, e.g.

CREATE OR REPLACE FUNCTION test.stop_at_beginning() RETURNS void LANGUAGE 'plpgsql'
AS $BODY$
DECLARE t0 integer; n0 integer;
BEGIN
WITH temp AS ( SELECT AG_DetectStops(
   ST_GeometryFromText('LinestringM(0 0 0, 0 0 1, 0.1 0.1 2, 2 2 3)'),
   1,1) stop 
)
SELECT ST_M(ST_StartPoint((stop).geom)), 
       ST_NPoints((stop).geom) FROM temp INTO t0, n0;	
IF t0 = 0 AND n0 = 3
   THEN RAISE INFO 'PASSED - Stop at the beginning of the trajectory';
   ELSE RAISE INFO 'FAILED - Stop at the beginning of the trajectory';
END IF;
END; $BODY$;

Basically, each test is yet another PL/pgSQL function that doesn’t return anything (i.e. returns void) but outputs messages about the status of the test. Here I made heavy use of the PERFORM statement, which executes the provided function but discards the result:
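As a rough sketch of what such a PERFORM-based runner can look like (the test.run_all name is mine, purely illustrative):

CREATE OR REPLACE FUNCTION test.run_all() RETURNS void LANGUAGE 'plpgsql'
AS $BODY$
BEGIN
   -- each PERFORM executes a test function and throws away its (void) result
   PERFORM test.stop_at_beginning();
END; $BODY$;

SELECT test.run_all();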


Update: The source code for this function is now available on https://github.com/anitagraser/postgis-spatiotemporal

by underdark at July 19, 2018 09:31 PM

July 08, 2018

Postgres OnLine Journal (Leo Hsu, Regina Obe)

Using procedures for batch geocoding and other batch processing

One of the features we are looking forward to in the upcoming PostgreSQL 11 is the introduction of procedures via the CREATE PROCEDURE ANSI-SQL construct. The major benefit that sets procedures apart from functions is that procedures are not wrapped in an outer transaction and can have COMMITs within them. This means it's not all-or-nothing like it is with functions. Even if you stop a procedure in motion, whatever work has been done and committed is saved. In the case of functions, a stop or failure would roll back all the work. It also means you can see the work in progress of a stored procedure, since that work will already have been committed. This is a huge benefit for batch processing. Batch processing covers a lot of use cases of PostGIS users, since a good chunk of PostGIS work involves doing some kind of batch processing of data you get from third parties or machines.
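
As a rough illustration of the idea (a sketch only; the addr_to_geocode table, its columns, and the batch size are made up, and this needs PostgreSQL 11+):

CREATE OR REPLACE PROCEDURE batch_mark_processed()
LANGUAGE plpgsql
AS $$
BEGIN
  LOOP
    -- process a limited batch of rows at a time
    UPDATE addr_to_geocode
       SET processed = true
     WHERE id IN (SELECT id FROM addr_to_geocode
                   WHERE NOT processed LIMIT 1000);
    EXIT WHEN NOT FOUND;
    -- COMMIT is allowed in a procedure (but not in a function),
    -- so every finished batch stays saved even if a later one fails
    COMMIT;
  END LOOP;
END;
$$;

CALL batch_mark_processed();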

Continue reading "Using procedures for batch geocoding and other batch processing"

by Leo Hsu and Regina Obe (nospam@example.com) at July 08, 2018 05:40 AM

July 03, 2018

PostGIS Development

PostGIS 2.5.0beta1

The PostGIS development team is pleased to release PostGIS 2.5.0beta1.

This release is a work in progress. No more API changes will be made from this point on before the 2.5.0 release. The remaining time will be focused on bug fixes and documentation for the new functionality, and on performance enhancements under the covers. Although this release will work for PostgreSQL 9.4 and above, to take full advantage of what PostGIS 2.5 will offer, you should be running PostgreSQL 11beta2+ and GEOS 3.7.0beta1, which were released recently.

Best served with PostgreSQL 11beta2 which was recently released.

View all closed tickets for 2.5.0.

After installing the binaries or after running pg_upgrade, make sure to do:

ALTER EXTENSION postgis UPDATE;

— if you use the other extensions packaged with postgis — make sure to upgrade those as well

ALTER EXTENSION postgis_sfcgal UPDATE;
ALTER EXTENSION postgis_topology UPDATE;
ALTER EXTENSION postgis_tiger_geocoder UPDATE;

If you use legacy.sql or legacy_minimal.sql, make sure to rerun the version packaged with these releases.

2.5.0beta1

by Regina Obe at July 03, 2018 12:00 AM

June 25, 2018

Boston GIS (Regina Obe, Leo Hsu)

New in QGIS 3.2 Save Project to PostgreSQL

We've got customers discovering PostGIS and GIS in general or migrating away from ArcGIS family of tools. When they ask, "How do I see my data?", we often point them at QGIS which is an open source GIS desktop with rich integration with PostGIS/PostgreSQL.

QGIS is something that is great for people who need to live in their GIS environment, since it allows for easily layering on other datasources, web services and maps. The DBManager tool allows for more advanced querying (like writing spatial SQL queries that take advantage of the 100s of functions PostGIS has to offer), the ability to import/export data, and the ability to create PostgreSQL views.

QGIS has this thing called Projects, which allow for defining map layers and the symbology associated with them: for example, what colors you use for your roads, any extra symbols, and which field attributes you overlay - street name etc. Projects are usually saved in files with a .qgs or .qgz extension. If you spent a lot of time styling these layers, chances are you want to share them with other people in your group. This can become challenging if your group is not connected via a network share.

Continue reading "New in QGIS 3.2 Save Project to PostgreSQL"

by Regina Obe (nospam@example.com) at June 25, 2018 05:47 AM

June 21, 2018

Paul Ramsey

PostGIS Polygon Splitting

One of the joys of geospatial processing is the variety of tools in the tool box, and the ways that putting them together can yield surprising results. I have been in the guts of PostGIS for so long that I tend to think in terms of primitives: either there’s a function that does what you want or there isn’t. I’m too quick to ignore the power of combining the parts that we already have.

A community member on the users list asked (paraphrased): “is there a way to split a polygon into sub-polygons of more-or-less equal areas?”

I didn’t see the question, which is lucky, because I would have said: “No, you’re SOL, we don’t have a good way to solve that problem.” (An exact algorithm showed up in the Twitter thread about this solution, and maybe I should implement that.)

PostGIS developer Darafei Praliaskouski did answer, and provided a working solution that is absolutely brilliant in combining the parts of the PostGIS toolkit to solve a pretty tricky problem. He said:

The way I see it, for any kind of polygon:

  • Convert a polygon to a set of points proportional to the area by ST_GeneratePoints (the more points, the more beautiful it will be, guess 1000 is ok);
  • Decide how many parts you’d like to split into, (ST_Area(geom)/max_area), let it be K;
  • Take KMeans of the point cloud with K clusters;
  • For each cluster, take a ST_Centroid(ST_Collect(point));
  • Feed these centroids into ST_VoronoiPolygons, that will get you a mask for each part of polygon;
  • ST_Intersection of original polygon and each cell of Voronoi polygons will get you a good split of your polygon into K parts.

Let’s take it one step at a time to see how it works.

We’ll use Peru as the example polygon; it’s got a nice concavity to it, which makes it a little trickier than an average shape.

CREATE TABLE peru AS 
  SELECT *
  FROM countries
  WHERE name = 'Peru'

Original Polygon (Peru)

Now create a point field that fills the polygon. On average, each randomly placed point ends up “occupying” an equal area within the polygon.

CREATE TABLE peru_pts AS
  SELECT (ST_Dump(ST_GeneratePoints(geom, 2000))).geom AS geom
  FROM peru
  WHERE name = 'Peru'

2000 Random Points

Now, cluster the point field, setting the number of clusters to the number of pieces you want the polygon divided into. Visually, you can now see the divisions in the polygon! But, we still need to get actual lines to represent those divisions.

CREATE TABLE peru_pts_clustered AS
  SELECT geom, ST_ClusterKMeans(geom, 10) over () AS cluster
  FROM peru_pts;

10 Clusters

Using a point field and K-means clustering to get the split areas was inspired enough. The steps to get actual polygons are equally inspired.

First, calculate the centroid of each point cluster, which will be the center of mass for each cluster.

CREATE TABLE peru_centers AS
  SELECT cluster, ST_Centroid(ST_collect(geom)) AS geom
  FROM peru_pts_clustered
  GROUP BY cluster;

Centroids of Clusters

Now, use a voronoi diagram to get actual dividing edges between the cluster centroids, which end up closely matching the places where the clusters divide!

CREATE TABLE peru_voronoi AS
  SELECT (ST_Dump(ST_VoronoiPolygons(ST_collect(geom)))).geom AS geom
  FROM peru_centers;

Voronoi of Centroids

Finally, intersect the voronoi areas with the original polygon to get final output polygons that incorporate both the outer edges of the polygon and the voronoi dividing lines.

CREATE TABLE peru_divided AS
  SELECT ST_Intersection(a.geom, b.geom) AS geom
  FROM peru a
  CROSS JOIN peru_voronoi b;

Intersection with Original Polygon

Done!

Clustering a point field to get mostly equal areas, and then using the voronoi to extract actual dividing lines are wonderful insights into spatial processing. The final picture of all the components of the calculation is also beautiful.

All the Components Together

I’m not 100% sure, but it might be possible to use Darafei’s technique for even more interesting subdivisions, like “map of the USA subdivided into areas of equal GDP”, or “map of New York subdivided into areas of equal population” by generating the initial point field using an economic or demographic weighting.

June 21, 2018 08:00 PM

June 18, 2018

Boston GIS (Regina Obe, Leo Hsu)

Coming PostGIS 2.5 ST_OrientedEnvelope

PostGIS 2.5 is just around the corner. One of the new functions coming is ST_OrientedEnvelope. This is something I've been waiting on for years. It is the true minimum bounding rectangle, as opposed to ST_Envelope, which is an axis-aligned bounding rectangle.

Below is a pictorial view showing the difference between the two.
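
If you'd rather see the difference in WKT than in a picture, a quick sketch like this works on PostGIS 2.5+ (the input polygon is just a made-up, tilted shape):

SELECT ST_AsText(ST_Envelope(geom))         AS axis_aligned,
       ST_AsText(ST_OrientedEnvelope(geom)) AS oriented
FROM (SELECT 'POLYGON((0 0, 4 1, 5 4, 1 3, 0 0))'::geometry AS geom) AS t;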

Continue reading "Coming PostGIS 2.5 ST_OrientedEnvelope"

by Regina Obe (nospam@example.com) at June 18, 2018 09:42 PM

June 09, 2018

Postgres OnLine Journal (Leo Hsu, Regina Obe)

PostgresVision 2018 Slides and Impressions

Leo and I attended PostgresVision 2018 which ended a couple of days ago.

We gave a talk on spatial extensions with main focus being PostGIS. Here are links to our slides PostgresVision2018_SpatialExtensions HTML version PDF.

Unfortunately there are no slides of the pgRouting part, except the one that says PGRouting Live Demos because Leo will only do live demos. He has no fear of his demos not working.

Side note, if you are on windows and use the PostGIS bundle, all the extensions listed in the PostGIS box of the spatial extensions diagram, as well as the pointcloud, pgRouting, and ogr_fdw are included in the bundle.

Continue reading "PostgresVision 2018 Slides and Impressions"

by Leo Hsu and Regina Obe (nospam@example.com) at June 09, 2018 08:56 AM

May 28, 2018

PostGIS Development

PostGIS 2.5.0alpha

The PostGIS development team is pleased to release PostGIS 2.5.0alpha.

This release is a work in progress, and some features are still slated to be added. Although this release will work for PostgreSQL 9.4 and above, to take full advantage of what PostGIS 2.5 will offer, you should be running PostgreSQL 11beta+.

Best served with PostgreSQL 11beta which was recently released.

View all closed tickets for 2.5.0.

After installing the binaries or after running pg_upgrade, make sure to do:

ALTER EXTENSION postgis UPDATE;

— if you use the other extensions packaged with postgis — make sure to upgrade those as well

ALTER EXTENSION postgis_sfcgal UPDATE;
ALTER EXTENSION postgis_topology UPDATE;
ALTER EXTENSION postgis_tiger_geocoder UPDATE;

If you use legacy.sql or legacy_minimal.sql, make sure to rerun the version packaged with these releases.

2.5.0alpha

by Regina Obe at May 28, 2018 12:00 AM

May 14, 2018

Paul Ramsey

PostGIS Talks @ FOSS4G North America

I presented my “PostGIS for Managers” talk for the last time (at least in this form) today at FOSS4G North America. The reason it’s for the last time is that the central conceit it’s built around, that a core decision is between a proprietary and an open source database, isn’t really operative anymore. The real decisions are now being driven by other considerations, like cloud platforms, and the services available there. So, it’s not really PostgreSQL versus Oracle anymore.

I also presented my “SQL Festival” talk, for the first time! New material is always a little challenging: will it work, is it the right level for the audience? It seemed to be well received, a mix of generic SQL tidbits, and some specific interesting queries you can do with PostGIS.

May 14, 2018 04:00 PM

April 16, 2018

Anita Graser (Underdark)

Movement data in GIS #12: why you should be using PostGIS trajectories

In short: both writing and executing trajectory queries is considerably faster using PostGIS trajectories (as LinestringM) than with the commonly used point-based approach.

Here are a couple of examples to give you an impression of the differences.

Spoiler alert! Trajectory queries are up to 500 times faster than comparable point-based queries.

A quick look at indexing

In both cases, we have indexed the tracker id, geometry, and time columns to speed up query processing.

The trajectory table has 3 indexes (see the CREATE INDEX sketch after these lists)

  • gist (time_range)
  • gist (track gist_geometry_ops_nd)
  • btree (tracker)

The point-based table has 4 indexes

  • gist (pt)
  • btree (trajectory_id)
  • btree (tracker)
  • btree (t)
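
For the trajectory table, those three indexes can be created roughly like this (a sketch using the table and column names from the queries below; gist_geometry_ops_nd is the n-dimensional operator class listed above):

CREATE INDEX ON geolife.trajectory_ext USING gist (time_range);
CREATE INDEX ON geolife.trajectory_ext USING gist (track gist_geometry_ops_nd);
CREATE INDEX ON geolife.trajectory_ext (tracker);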

Length

First, let’s see how to determine trajectory length for all observed moving objects (identified by a tracker id).

Using the point-based approach, we first need to ensure that the points are in the correct temporal order, create the lines, and finally sum up their length:

WITH ordered AS (
 SELECT trajectory_id, tracker, t, pt
 FROM geolife.trajectory_pt
 ORDER BY t
), tmp AS (
 SELECT trajectory_id, tracker, st_makeline(pt) traj
 FROM ordered 
 GROUP BY trajectory_id, tracker
)
SELECT tracker, round(sum(ST_Length(traj::geography)))
FROM tmp
GROUP BY tracker 
ORDER BY tracker

With trajectories, we can go right to computing lengths:

SELECT tracker, round(sum(ST_Length(track::geography)))
FROM geolife.trajectory_ext
GROUP BY tracker
ORDER BY tracker

On my test system, the trajectory query run time is 22.7 sec instead of 43.0 sec for the point-based approach:

Duration

Compared to trajectory length, duration is less complicated in the point-based approach:

WITH tmp AS (
 SELECT trajectory_id, tracker, min(t) start_time, max(t) end_time
 FROM geolife.trajectory_pt
 GROUP BY trajectory_id, tracker
)
SELECT tracker, sum(end_time - start_time)
FROM tmp
GROUP BY tracker
ORDER BY tracker

Still, the trajectory query is less complex and much faster at 31 ms instead of 6.0 sec:

SELECT tracker, sum(upper(time_range) - lower(time_range))
FROM geolife.trajectory_ext
GROUP BY tracker
ORDER BY tracker

Temporal filter

Extracting trajectories that occurred during a certain time frame is another common use case:

WITH tmp AS (
 SELECT trajectory_id, tracker, min(t) start_time, max(t) end_time
 FROM geolife.trajectory_pt
 GROUP BY trajectory_id, tracker
)
SELECT trajectory_id, tracker, start_time, end_time
FROM tmp
WHERE end_time > '2008-11-26 11:00'
AND start_time < '2008-11-26 15:00'
ORDER BY tracker

This point-based query takes 6.0 sec while the shorter trajectory query finishes in 12 ms:

SELECT id, tracker, time_range
FROM geolife.trajectory_ext
WHERE time_range && '[2008-11-26 11:00+1,2008-11-26 15:00+01]'::tstzrange

or equally fast (12 ms) by making use of the n-dimensional index:

WHERE track &&& ST_Collect(
 ST_MakePointM(-180, -90, extract(epoch from '2008-11-26 11:00'::timestamptz)),
 ST_MakePointM(180, 90, extract(epoch from '2008-11-26 15:00'::timestamptz))
)

Spatial filter

Finally, of course, let’s have a look at spatial filters, for example, trajectories that start in a certain area:

WITH my AS ( 
 SELECT ST_Buffer(ST_SetSRID(ST_MakePoint(116.31894,39.97472),4326),0.0005) areaA
), tmp AS (
 SELECT trajectory_id, tracker, min(t) t 
 FROM geolife.trajectory_pt
 GROUP BY trajectory_id, tracker
)
SELECT distinct traj.tracker, traj.trajectory_id 
FROM tmp
JOIN geolife.trajectory_pt traj
ON tmp.trajectory_id = traj.trajectory_id AND traj.t = tmp.t
JOIN my
ON ST_Within(traj.pt, my.areaA)

This point-based query takes 6.0 sec while the shorter trajectory query finishes in 488 ms:

WITH my AS ( 
 SELECT ST_Buffer(ST_SetSRID(ST_MakePoint(116.31894, 39.97472),4326),0.0005) areaA
)
SELECT id, tracker, ST_AsText(track)
FROM geolife.trajectory_ext
JOIN my
ON areaA && track
AND ST_Within(ST_StartPoint(track), areaA)

For more generic “does this trajectory intersect another geometry”, the points can also be aggregated to a linestring on the fly but that takes 21.9 sec:
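
A sketch of what such an on-the-fly aggregation could look like (an approximation reusing the table and column names from the queries above, not necessarily the exact query behind the 21.9 sec figure):

WITH my AS ( 
 SELECT ST_Buffer(ST_SetSRID(ST_MakePoint(116.31894, 39.97472),4326),0.0005) areaA
), tmp AS (
 SELECT trajectory_id, tracker, ST_MakeLine(pt ORDER BY t) traj
 FROM geolife.trajectory_pt
 GROUP BY trajectory_id, tracker
)
SELECT DISTINCT tmp.tracker, tmp.trajectory_id
FROM tmp
JOIN my
ON ST_Intersects(tmp.traj, my.areaA)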

I’ll be presenting more work on PostGIS trajectories at GI_Forum in Salzburg in July. In the talk, I’ll also have a look at the custom PG-Trajectory datatype. Here’s the full open-access paper:

Graser, A. (2018) Evaluating Spatio-temporal Data Models for Trajectories in PostGIS Databases. GI_Forum ‒ Journal of Geographic Information Science, 1-2018, 16-33. DOI: 10.1553/giscience2018_01_s16.

You can find my fork of the PG-Trajectory project – including all necessary fixes – on Bitbucket.


This post is part of a series. Read more about movement data in GIS.

by underdark at April 16, 2018 08:30 PM

April 10, 2018

Michal Zimmermann

PostgreSQL Backup and Recovery Orchestration: systemd Automation

Posts in this series have described the basic automation of PostgreSQL backup/recovery strategy. The process itself consists of different periodic tasks that shouldn’t be executed manually. There are essentially two tools dedicated to periodic task running in Linux: cron and systemd.

Cron used to be my first choice for automation in Linux, as it’s very easy to use. On the other hand, it’s quite messy (running crontab -e under different users to find out which one has the job defined) and a bit difficult to test - many times I ran into a situation where the underlying bash script executed just fine, while the cron job kept failing for reasons unknown.

My own cron experience, together with a few words from a workmate, brought me into the arms of systemd, which is a Linux system and service manager. It’s capable of running periodic tasks just like cron, while making them more transparent.

Important bits

Understanding the whole systemd is way out of scope of a poor GIS guy, yet I managed to tame three important parts of the ecosystem:

  • services
  • timers
  • targets

Services

A service is a configuration saved inside a “.service” file, specifying what you want systemd to do. The following code shows how you can tell systemd to vacuum your database once in a while.

[Unit]
Description=CR vacuumdb
OnFailure=unit-status-mail@%n.service unit-status-slack@%n.service
Wants=cr-sunday.timer

[Service]
User=postgres
Group=postgres
Type=simple
ExecStart=/bin/bash /usr/local/sbin/pgsql-vacuumdb.sh --port %i

[Install]
WantedBy=cr-sunday.target

Unit files come with several handy features. First of all, they are orchestrated with systemctl. Second, any service configuration file containing @ in its filename might be symlinked/copied and run for different instances. Third, notice OnFailure directive in the code above. If anything goes wrong, systemd might serve as a postman delivering the bad news. I set up both e-mail and Slack notifications and they’ve been working like a charm ever since.

On top of that, I find systemd orchestration much easier to test and maintain compared to cron.

With the above code saved in /lib/systemd/system/pgsql-vacuumdb@.service, you can copy the file to /lib/systemd/system/pgsql-vacuumdb@5432.service, /lib/systemd/system/pgsql-vacuumdb@5433.service, etc. If you look at the ExecStart part of the service file, you’ll notice %i being used at the end - a placeholder replaced with the string between @ and .service in the filename.

This systemd service file is no more than a simple wrapper around the following bash code. We run three different database clusters on one machine and this approach makes their maintenance pretty comfortable.

#!/bin/bash
#
# @author: Michal Zimmermann <michal.zimmermann@clevermaps.cz>
# Vacuums the whole database cluster running on a given port.

while [[ $# > 0 ]]
do
    key="$1"

    case $key in
        -p|--port)
            PORT="$2"
            shift
            ;;
        *)
            echo "Usage: `basename $0` --port|-p [port_number]"
            exit 1
            ;;
    esac
    shift
done

if [[ -z "$PORT" ]]
then
    echo "Port not provided!"
    $0 *
    exit 2
fi

/usr/bin/vacuumdb -U postgres -p $PORT --all --full --analyze

What you get so far is the possibility to run systemctl start pgsql-vacuumdb@5432 instead of calling the underlying bash code manually. Not much, really. That’s where timers come to the party.

Timers

Timer files end with “.timer” and are responsible for running services at a given time. The code below, coming from the /lib/systemd/system/cr-sunday.timer file, runs the pgsql-vacuumdb service every Sunday at 3:45 am.

[Unit]
Description=CR Sunday timer

[Timer]
OnCalendar=Sun *-*-* 03:45
Persistent=yes
Unit=cr-sunday.target

[Install]
WantedBy=multi-user.target

Targets

Target files end with “.target” and are used to group units in general. In our case, the target file for vacuumdb service is as simple as the following code.

[Unit]
Description=CR Sunday target
StopWhenUnneeded=yes

Targets might be called by other targets. Running systemctl start cr-sunday.target would eventually lead to running all the services wanted by that target.
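
To wire everything up, something along these lines should do (a sketch; unit names as above):

sudo systemctl daemon-reload
sudo systemctl enable pgsql-vacuumdb@5432.service   # adds the service to cr-sunday.target's wants
sudo systemctl enable --now cr-sunday.timer         # schedules the Sunday runs
systemctl list-timers cr-sunday.timer               # check when it fires next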

As I already mentioned, I find systemd services easy to code and test. If any of them should fail, you’d find a message in syslog or via systemctl status pgsql-vacuumdb.

by Michal Zimmermann at April 10, 2018 10:00 AM

April 06, 2018

PostGIS Development

PostGIS Patch Releases 2.2.7, 2.3.7, and 2.4.4

The PostGIS development team is pleased to provide bug fix release 2.2.7, 2.3.7 and 2.4.4 for the 2.2, 2.3, and 2.4 stable branches.

Please note that PostGIS 2.2 is reaching end-of-life and there will be one more patch release later in the year for it. If you have not upgraded to at least PostGIS 2.3, we encourage you do so before we discontinue support for PostGIS 2.2.

View all closed tickets for 2.4.4, 2.3.7, 2.2.7.

After installing the binaries or after running pg_upgrade, make sure to do:

ALTER EXTENSION postgis UPDATE;

— if you use the other extensions packaged with postgis — make sure to upgrade those as well

ALTER EXTENSION postgis_sfcgal UPDATE;
ALTER EXTENSION postgis_topology UPDATE;
ALTER EXTENSION postgis_tiger_geocoder UPDATE;

If you use legacy.sql or legacy_minimal.sql, make sure to rerun the version packaged with these releases.

2.2.7

2.3.7

2.4.4

by Regina Obe at April 06, 2018 12:00 AM

March 02, 2018

Michal Zimmermann

PostgreSQL Backup and Recovery Orchestration: Bash Automation

There is a bunch of periodic database-related tasks in the life of a PostgreSQL administrator. Some should be done daily, others weekly, and others can wait for a whole month. Many of them are essential for your database’s health. Forget to run such a task, or screw up the run accidentally, and you’ll be snowed under with fixing your database.

Those tasks are easily done with bash, which is the first step to full automation. The following tasks are perfect candidates to be implemented as bash scripts:

  • full backups (both creation and removal)
  • WAL backups (both creation and removal)
  • vacuum
  • pgBadger log analysis (both creation and removal)
  • log maintenance (if you don’t use log rotate)

Full backup creation is just one example of how powerful bash can be.

#!/bin/bash
#
# @author: Michal Zimmermann <michal.zimmermann@clevermaps.cz>
# Creates base backup.

CUR_DIR=$(dirname "$0")
if [[ ! -f ${CUR_DIR}/pgsql-common.sh ]]
then
    echo "pgsql-common.sh not found!"
    exit 1
fi

source "${CUR_DIR}/pgsql-common.sh"
source "$CONFIG"

if [[ -d ${CR_BASE_BACKUP_DIR}/${CR_LABEL} ]]
then
    echo "${CR_BASE_BACKUP_DIR}/${CR_LABEL} already exists and is not empty!"
    exit 2
fi

pg_basebackup \
    --pgdata=${CR_BASE_BACKUP_DIR}/${CR_LABEL} \
    --format=plain \
    --write-recovery-conf \
    --wal-method=stream \
    --label=${CR_LABEL} \
    --checkpoint=fast \
    --progress \
    --verbose

if [[ $? -gt 0 ]]
then
    rm -rf ${CR_BASE_BACKUP_DIR}/${CR_LABEL}
    echo "pg_basebackup on ${CR_LABEL} failed!"
    exit 3
fi

tar -czf ${CR_BASE_BACKUP_DIR}/${CR_LABEL}.tar.gz ${CR_BASE_BACKUP_DIR}/${CR_LABEL} && rm -rf ${CR_BASE_BACKUP_DIR}/${CR_LABEL}

As you probably noticed, a pgsql-common.sh file is sourced at the beginning of the script. This script in turn just loads the proper config file that provides variables to the other devops scripts. As you might need those variables in several of your scripts, it is a good idea to put their loading into a separate file.

#!/bin/bash
#
# @author: Michal Zimmermann <michal.zimmermann@clevermaps.cz>
# Sourced in pgsql-*.sh scripts.

while [[ $# > 0 ]]
do
    key="$1"

    case $key in
        -c|--config)
            CONFIG="$2"
            shift
            ;;
        *)
            echo "Usage: `basename $0` --config|-c [config_file]"
            exit 1
            ;;
    esac
    shift
done
# /Input parameters

if [[ -z "$CONFIG" ]]
then
    echo "Config file is not set! See the script usage below."
    $0 *
    exit 2
fi

if [[ ! -f "$CONFIG" ]]
then
    echo "$CONFIG not found!"
    exit 3
fi

A config file might remain as simple as this:

# Base backup location
export CR_BASE_BACKUP_DIR="/mnt/backup/symap/base/"
# WAL backup location
export CR_WAL_BACKUP_DIR="/mnt/backup/symap/wal"
# PostgreSQL WAL location
export CR_PG_XLOG_DIR="/var/lib/postgresql/10/symap/pg_wal"
export CR_PG_LOG_DIR="/var/lib/postgresql/10/symap/pg_log"
# Base backup pattern (set to YYYYMMDD)
export CR_LABEL=symap_$(date +%Y%m%d)
export PGPORT=5432

Another, likely the simplest, example is a vacuumdb task:

#!/bin/bash
#
# @author: Michal Zimmermann <michal.zimmermann@clevermaps.cz>
# Vacuums the whole database cluster running on a given port.

while [[ $# > 0 ]]
do
    key="$1"

    case $key in
        -p|--port)
            PORT="$2"
            shift
            ;;
        *)
            echo "Usage: `basename $0` --port|-p [port_number]"
            exit 1
            ;;
    esac
    shift
done

if [[ -z "$PORT" ]]
then
    echo "Port not provided!"
    $0 *
    exit 2
fi

/usr/bin/vacuumdb -U postgres -p $PORT --all --full --analyze
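
For the record, a typical invocation might look like this (the pgsql-basebackup.sh name and the config path are assumptions; only pgsql-common.sh and pgsql-vacuumdb.sh are named in this series):

./pgsql-basebackup.sh --config /etc/cr/backup.conf   # full base backup driven by the sourced config
./pgsql-vacuumdb.sh --port 5432                      # vacuum the cluster listening on port 5432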

Tips

  • Always test your bash scripts before production deployment. Even a single line of code might lead to a very different, possibly unexpected, outcome.
  • Try to stay as defensive as possible. Imagine a variable did not get sourced properly. Is it going to blow your database? Trust me, I know what I am talking about (see the tweet below).

Pitfalls

You do not want to run your bash scripts by hand. You probably do not want them to be run by cron. You want to run them with systemd. More on this next time.

by Michal Zimmermann at March 02, 2018 03:00 PM

February 22, 2018

Oslandia

Database migrations and Pum

At Oslandia we often need to deal with database migrations. So it’s an important topic to us. And
it should be an important topic to anyone maintaining databases in production.

First of all I want to mention a good series of articles by K. Scott Allen on the philosophy and
practice of database migrations and version control.

These articles are not about specific databases and migration tools. Instead they cover the basics of having your databases under version control and managing migrations.

Many database migration tools exist, both in the opensource and proprietary worlds.

Some tools relate to specific programming languages and ORMs (Object-Relational Mapping), such as Rails Migrations and Alembic. I’ve personally used Alembic in a number of Python projects in the past. Alembic is a fantastic tool for Python applications based on the SQLAlchemy toolkit. I would highly recommend it. Other popular database migration tools include Flyway and Liquibase. With Flyway, migrations are written in either Java or SQL. With Liquibase, migrations are described using declarative formats such as XML, YAML and JSON; “plain SQL” is also supported.

Pum, which stands for “Postgres Upgrades Manager”, is a new migration tool. It was created in the context of the QWAT and QGEP projects, but it is completely independent and agnostic to those projects. Pum was initially created by OPENGIS.ch, with help from Oslandia and the whole QWAT community for the design and specifications of the tool.

Pum was greatly inspired by Flyway and Liquibase. Flyway and Liquibase are written in Java and support various database systems. In contrast, Pum is written in Python and focuses on the PostgreSQL database system. Pum is simple, and easy to use and extend. Also, with Pum, migrations, called “deltas” in the Pum jargon, are written in “plain SQL”. Pum fully embraces SQL and doesn’t attempt to hide it behind a declarative or programming language.

Pum is a CLI (Command Line Interface) tool. The main commands provided by Pum are

  • check
  • baseline
  • upgrade
  • test-and-upgrade

The check command compares two databases and shows the differences. Let’s create a simple example to illustrate it:

$ createdb prod  # create the "production" database
$ createdb ref   # create the "reference" database
$
$ cat > pg_service.conf << EOF  # create a PostgreSQL service file for "prod" and "ref"
> [prod]
> dbname=prod
> [ref]
> dbname=ref
> EOF
$ export PGSERVICEFILE=pg_service.conf
$
$ pum check -p1 prod -p2 ref
Check...OK
columns: []
constraints: []
functions: []
indexes: []
rules: []
sequences: []
tables: []
triggers: []
views: []

The check command reports that the databases “prod” and “ref” are identical. They’re actually both empty. Now let’s add a table to the “ref” database, and run the check command again:

$ # add the "test" table to the "ref" database
$ psql -d ref -c "create table test (name text)"
$
$ pum check -p1 prod -p2 ref
Check...DIFFERENCES FOUND
columns:
- + ('public', 'test', 'name', None, 'YES', 'text', None, None, None, None)
constraints: []
functions: []
indexes: []
rules: []
sequences: []
tables:
- + ('public', 'test')
triggers: []
views: []

This time the check command reports that the “ref” database has a table and a column that the “prod” database does not have. If we created the same “test” table in “prod”, the check command would report that “prod” and “ref” are the same again.

The baseline command assigns a version to a database. For example let’s assign the version 0.0.1 to the “prod” database, and the version 0.0.2 to the “ref” database:

$ pum baseline -p prod -t public.pum_upgrades -d deltas -b 0.0.1
$ pum baseline -p ref -t public.pum_upgrades -d deltas -b 0.0.2
$
$ # check that version 0.0.1 was assigned to the "prod" database
$ psql service=prod -c "table pum_upgrades"
 id | version | description | type | script | checksum | installed_by |        installed_on        | execution_time | success 
----+---------+-------------+------+--------+----------+--------------+----------------------------+----------------+---------
  1 | 0.0.1   | baseline    |    0 |        |          | postgres     | 2018-02-22 16:33:40.948143 |              1 | t
(1 row)

$
$ # check that version 0.0.2 was assigned to the "ref" database
$ psql service=ref -c "table pum_upgrades"
 id | version | description | type | script | checksum | installed_by |       installed_on        | execution_time | success
----+---------+-------------+------+--------+----------+--------------+---------------------------+----------------+---------
  1 | 0.0.2   | baseline    |    0 |        |          | postgres     | 2018-02-22 16:56:25.19542 |              1 | t
(1 row)

Finally, we’re going to write a migration script to migrate the “prod” database, and use Pum to do the migration.

We create the migration script delta_0.0.2_000.sql with the following content:

create table test (name text);

Let’s now migrate “prod”, checking that it now matches the “ref” database:

$ # a "test" database is required by the test-and-upgrade command
$ createdb test
$ cat >> pg_service.conf << EOF
> [test]
> dbname=test
> EOF
$
$ # now perform the migration
$ pum test-and-upgrade -pp prod -pt test -pc ref -t public.pum_upgrades -d delta -f output.backup
Test and upgrade...Dump...OK
Restore...OK
Upgrade...     Applying delta 0.0.2... OK
OK
Check...OK
columns: []
constraints: []
functions: []
indexes: []
rules: []
sequences: []
tables: []
triggers: []
views: []

Apply deltas to prod? [n]|y: y
Upgrade...     Applying delta 0.0.2... OK
OK
OK

The “prod” database has been migrated, and during the process PUM has checked that the “prod” and “ref” are now in the same state.

That’s all folks! Please contact us at infos@oslandia.com if you have any questions or want to discuss this further.

by Eric Lemoine at February 22, 2018 06:20 PM

February 16, 2018

Michal Zimmermann

PostgreSQL Backup and Recovery Orchestration: Recovery

PostgreSQL continuous backups are very powerful, if you know how to use them for recovery. And there’s no better way to be sure about that than actually trying it. Personally, I see recovery as a single process with two possibly different outcomes:

  • you’re recovering to the same state your cluster is/was in (because of a hardware failure, provider switch, …) - it’s more of a data migration, but you need your backup anyway
  • you’re doing a point-in-time-recovery (someone dropped the wrong table, data got corrupted, …)

Both scenarios follow the same steps and differ slightly at the end.

  1. Stop the PostgreSQL cluster.
  2. Copy the current PGDATA_DIR somewhere safe, just in case you screw up.
  3. Replace the PGDATA_DIR with the full backup. If you start the cluster right away, it will boot to the last full backup state (in my case, missing a week of WAL segments tops).

General recovery

In this case, you’re trying to recover as far as possible. With the previous steps done successfully, the following come next:

  • Copy all archived WAL segments created after the last full backup to PGDATA_DIR/pg_xlog. These can be found with the find -newer command run against the corresponding .backup file in your wal-archive/u/p directory (see the sketch after this list).
  • If your full backup strategy includes recovery.conf file creation, you can safely move it or remove it.
  • Start the database cluster again. It is going to boot to the last working state.
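
A sketch of that copy step (the archive path and the .backup file name are made up for illustration):

# copy every archived segment newer than the last base backup's .backup file
cd /mnt/backup/wal/u/p
find . -type f -newer 000000010000000000000042.00000028.backup \
    -exec cp {} "$PGDATA_DIR/pg_xlog/" \;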

If you’re about to migrate your data, you might be better off with simple pg_dump, pg_dumpall and pg_restore commands rather than using full backup/WAL segments combination.

Point-in-time-recovery

PostgreSQL’s PITR can help you restore your accidentally deleted/corrupted data. After the first three steps mentioned above, you should follow with these:

  • Copy all archived segments created after the last full backup somewhere the PostgreSQL user can read them (/your-wal-recovery-folder/ for example).
  • Set up the recovery.conf file properly. If you know something nasty happened at 2018-01-29 08:00:00, try to recover right to that point (or to any other, as described in the documentation).
restore_command = 'cp /your-wal-recovery-folder/%f "%p"'
recovery_target_time = '2018-01-29 08:00:00'
  • Start the database cluster again. It is going to boot to the last full backup and then play all the WAL segments until the recovery target. Depending on how many WAL segments are about to be used, this might take a while.

Pitfalls

You don’t want to find yourself in the middle of the biggest database failure of the century just to find out your backups don’t work, and even if they did, you would have no idea how to use them. Or, even worse, there are no backups at all, because your backup strategy has been failing silently without a single notice for several months.

Tips

Try to recover from your backups once in a while.

I forget things and make mistakes. We all do. That’s why I built an ensemble that takes care of our database automatically. Nothing fancy, just a bunch of good old Bash scripts managed with systemd rather than cron. Next time, I’d like to show you the code and walk you through our current setup.

by Michal Zimmermann at February 16, 2018 03:00 PM

February 15, 2018

Postgres OnLine Journal (Leo Hsu, Regina Obe)

FDWS for PostgreSQL Windows 32 and 64-bit

We've updated our binaries for PostgreSQL 10 Windows, both 32 and 64-bit. The 64-bit should work fine with EnterpriseDB Windows as well as BigSQL.

Continue reading "FDWS for PostgreSQL Windows 32 and 64-bit"

by Leo Hsu and Regina Obe (nospam@example.com) at February 15, 2018 05:25 AM

February 12, 2018

Michal Zimmermann

PostgreSQL Backup and Recovery Orchestration: WAL Archiving

Just a very few of my day-to-day work tasks can be accomplished without PostgreSQL. For years I’ve been a (power) user of this wonderful relational database, knowing almost nothing about how its internals really work. Faced with the need to build a backup and recovery strategy, I’ve recently read up a lot on this topic.

As I don’t find it very odd for a GIS person to be given such an extraordinary task (nobody wants to lose the priceless spatial data, right?), I hope this series might shed light on how to prepare and manage the backup/recovery process for those who are up to such a task. I won’t be discussing backup strategies based on the pg_backup tool, as those offer neither continuous archiving nor point-in-time recovery (PITR) - the lack of those two features disqualifies them as a CleverMaps production backup strategy.

That leaves us with taking periodic base backups combined with continuous WAL archiving, as described below.

Taking base backups

Archived WAL segments are worthless without a base backup they can be run on. It’s crucial to have consistent, periodic base backups to keep your data safe.

pg_basebackup takes base backup of PostgreSQL cluster. Nothing fancy. Gzipping the output folder once the backup is done is definitely a good idea.

pg_basebackup \
    --pgdata=/mnt/backup/base/backup_number \
    --format=plain \
    --write-recovery-conf \
    --xlog-method=stream \
    --label=${CR_LABEL} \
    --checkpoint=fast \
    --progress \
    --verbose

In our current environment, we take a base backup of each of our clusters once a week.

WAL archiving configuration

To properly set up WAL archiving, several postgresql.conf settings have to be adjusted:

  • wal_level = replica
  • archive_mode = on
  • archive_command = test ! -f /backup/wal/%f && cp %p /backup/wal/%f

Setting wal_level to replica writes enough information for WAL archiving. Turning on archive_mode will run archive_command each time a WAL segment is completed. The archive_command might be anything from a simple cp to rsync or aws s3 cp commands. It is absolutely critical that the command returns a non-zero exit code in case of failure (including when a file with the same name already exists in your backup folder).
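
For instance, a remote variant might look like this (a sketch; the host and bucket names are made up):

archive_command = 'rsync -a %p backuphost:/backup/wal/%f'
# or, for an S3 bucket:
archive_command = 'aws s3 cp %p s3://my-wal-archive/%f'
# note: unlike the cp variant above, these two do not guard against
# overwriting a file that already exists in the archive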

That’s it - after reloading the PostgreSQL service, new WAL files should be copied to the /backup/wal directory. The PostgreSQL process user (usually postgres) has to be able to write to that location.

Pitfalls

  • If archive_command fails, WAL segment remains on your database drive. If it keeps failing long enough, you’ll run out of space and the database will crash.
  • If the backup location fills up, the above-mentioned happens as well.
  • If you lose or corrupt any of the archived WAL segments, you won’t be able to pass through. That’s why you want to be sure that your archive_command actually does what you think it does.

Tips

It might be a real PITA (fiddling around with WAL segments included) to start a crashed database cluster with no space left. Keeping a dummy file in your pg_xlog location might save you a lot of trouble. Create one with the following command. If you run out of space, remove this file and you get 300 MB for free. Don’t forget to recreate it after you start the cluster.

dd if=/dev/zero of=/path_to_your_database_cluster/pg_xlog/DO_NOT_MOVE_THIS_FILE bs=1MB count=300

There’s no need to keep archived WAL segments forever. They’re only needed until you take another base backup. Again, deleting WAL segments manually (or using find ! -newer previous_base_backup.tar.gz) might lead to accidental corruption of your backups. It’s much safer to use pg_archivecleanup pointed at your WAL backup folder, referencing the last successful full backup. Below is the script we use to keep our WAL backup folder at a reasonable size, keeping the last three full backups.

# Find base_backup files not older than 3 weeks
# Sort by date
# Use the oldest one as a reference
OLDEST_BASE_BACKUP=$(basename $(find ${CR_WAL_BACKUP_DIR}/u/p/ -type f -iname "*.backup" -mtime -21 -print0 | \
xargs -0 ls -t | \
tail -n 1))

# Find all subfolders
# Except the u/p backup subfolder
# Execute pg_archivecleanup for each of the subfolders
find $CR_WAL_BACKUP_DIR \
    -type d \
    -not -path "${CR_WAL_BACKUP_DIR}u*" \
    -exec pg_archivecleanup -d {} $OLDEST_BASE_BACKUP \;

Functional backups are crucial part of a solid backup/recovery system. They’re still just one half of that system, though. If not tested thoroughly, they’re even less than that. More on testing backups and recovering from failures next time.

by Michal Zimmermann at February 12, 2018 03:00 PM

January 17, 2018

PostGIS Development

PostGIS Patch Releases 2.3.6 and 2.4.3

The PostGIS development team is pleased to provide bug fix release 2.3.6 and 2.4.3 for the 2.3 and 2.4 stable branches.

Key fixes in these releases are Brin upgrade, ST_Transform schema qualification to fix issues with restore, foreign table, and materialized view use, ClusterKMeans and encoded polyline fixes.

View all closed tickets for 2.4.3 and 2.3.6.

After installing the binaries or after running pg_upgrade, make sure to do:

ALTER EXTENSION postgis UPDATE;

— if you use the other extensions packaged with postgis — make sure to upgrade those as well

ALTER EXTENSION postgis_sfcgal UPDATE;
ALTER EXTENSION postgis_topology UPDATE;
ALTER EXTENSION postgis_tiger_geocoder UPDATE;

If you use legacy.sql or legacy_minimal.sql, make sure to rerun the version packaged with these releases.

2.3.6

2.4.3

by Regina Obe at January 17, 2018 12:00 AM

December 14, 2017

Paul Ramsey

PostGIS Scaling

Earlier this month I got to speak at the Spatial Data Science Conference hosted by my employer Carto at our funky warehouse offices in Bushwick, Brooklyn. The topic: PostGIS scaling.

PostGIS Scaling

Now.

“Make it go faster” is a hard request to respond to in the generic: what is “it”, what are you doing with “it”, are you sure that your performance isn’t already excellent but you’re just too damned demanding?

So, the talk covers a number of routes to better performance: changing up query patterns, adding special PostgreSQL extensions, leaning on new features of PostgreSQL, and just plain old waiting for PostgreSQL to get better. Which it does, every release.

December 14, 2017 04:00 PM

December 06, 2017

Paul Ramsey

PostGIS "Fund Me" Milestone

On the twitter this morning, there was a good question:

TL;DR: If you find a feature in “Fund Me” and want to fund it, join the postgis-devel mailing list and make yourself known.

If you go to the PostGIS ticket report and scroll through the pages you’ll first see some milestones tied to released versions. These are usually bug reports, both big and small, valid and invalid, and will eventually be closed.

We unfortunately carry a lot of tickets in the current development milestone (2.5 right now) which are, at best, speculative. They should probably be closed (we really will never do them and don’t much care) or moved to the “Fund Me” category (they are valid, but we have no personal/professional impetus to address them).

The “Fund Me” category used to be called “Future”. This was a bad name, as it implied that sometime in the “Future” the ticket might actually be addressed, and all you needed was sufficient patience to wait. The reality is that the way a ticket got into the “Future” category was that it was ignored for long enough that we couldn’t stand to see it in the current milestone anymore.

The PostGIS development community includes all kinds of developers, who make livings in all kinds of ways, and there are folks who will work on tasks for money. The “Fund Me” milestone is a way of pointing up that there are tasks that can be done, if only someone is willing to pay a developer to do them.

That’s the good news!

The bad news is that the tickets all look the same, but they are wildly variable in terms of level of effort and even feasibility.

  • #220 “Implement ST_Numcurves and ST_CurveN” would probably take a couple hours at the outside, and almost any C developer could do it, even one with zero experience in the PostGIS/PostgreSQL/GEOS ecosystem.
  • #2597 “[raster] St_Grayscale” would require some knowledge of the PostGIS raster implementation and image processing routines or at least the GDAL library.
  • #2910 “Implement function to output Mapbox Vector Tiles” actually happened in 2.4, but the (duplicate) ticket remained open, as a reminder that we’re terrible at ticket management.

And then there’s the “big kahunas”, tasks that live quietly in one ticket but actually encompass massive research and development projects spanning months or years.

  • #1629 “Tolerance and Precision strategy” is a super idea that would allow functions like ST_Intersects() or ST_Equals() to return true if a condition was met within a tolerance. However, it would require substantial enhancement to GEOS, to allow predicate evaluation within a tolerance context, as well as changes to non-GEOS backed distance functions, and new signatures for every geometry relationship function. Given the depth of the GEOS problem, I’d estimate multiple months of effort, and a potential for zero deliverables at all if things went pear-shaped.
  • #472 “Missing ST_IsValid for Geography Types” is even worse than the tolerance problem, since it should really be implemented as a complete rewrite of GEOS to understand non-linear edge types, either through a cheater’s strategy of doing local projections of geographic edges, or as a full understanding of geographic edges. On the upside, doing that would allow many of the other GEOS functions to support geography, which would vastly expand geography functionality in one stroke. On the downside, it is again in the category of a year-long effort with a potential failure at the end of it if for unexpected reasons it turns out to be impossible within that timeframe.

These kind of core features basically never get funded, because the marginal benefit they provide is generally much lower than the development cost for any one organization. This is a common open source weakness: aggregating funding is something everyone agrees is a great idea in principle but rarely happens in practice.

Occasionally, lightning does strike and a major funded feature happens. PostGIS topology was funded by a handful of European governments, and my work on the geography type was funded entirely by Palantir. However, usually funders show up with a few thousand dollars in hand and are dismayed when they learn of the distance between their funds and their desires.

December 06, 2017 04:00 PM

November 30, 2017

Michal Zimmermann

QGIS Plugin Development: Testing Your Code

Good news, everyone! The AttributeTransfer plugin has been approved for QGIS Python Plugins Repository. It’s available via QGIS Manage and Install Plugins menu. Feel free to download!

Nevertheless, this post (the last in the series) covers QGIS plugin testing rather than my personal feelings about the aforementioned success.

Testing means mocking

To test a QGIS plugin you need to simulate the environment it’s meant to run in. And that environment is obviously QGIS itself, yet it’s not feasible to launch QGIS every time you run a test. Luckily, there’s a great QGIS mock that gets you going in no time (it completely slipped my mind where I found that piece of code though).

Testing means you need data

Every test is run again and again, which means it has to reset the data being used to its default state. This might be a PITA if the test changes the data in an unpredictable manner.

Using QGIS memory layers you can prepare fresh data for each of your tests, effectively putting the whole data manipulation process aside.

Writing tests

Each of the AttributeTransfer plugin tests inherits from unittest.TestCase, which comes with several methods you might be familiar with from other languages: setUp() is run before every test method, while tearDown() is run after each of them. Tests are defined as methods whose names start with the word test.

Each test should call some assertWhatever method that checks whether the test passed or failed. Here’s an example of such a test covering non-point layers.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Date    : 2017-11-18 18:40:50
# @Author  : Michal Zimmermann <zimmicz@gmail.com>

import os
import sip
import sys
import unittest
from qgis.core import QgsMapLayerRegistry, QgsVectorLayer, QgsFeature, QgsGeometry, QgsPoint
from utilities import get_qgis_app

sys.path.append(os.path.dirname(os.path.abspath(__file__)) + "/..")
from attribute_transfer import AttributeTransfer
from create_dummy_data import create_dummy_data_polygon_or_line

sip.setapi('QtCore', 2)
sip.setapi('QString', 2)
sip.setapi('QDate', 2)
sip.setapi('QDateTime', 2)
sip.setapi('QTextStream', 2)
sip.setapi('QTime', 2)
sip.setapi('QUrl', 2)
sip.setapi('QVariant', 2)

QGIS_APP = get_qgis_app()
IFACE = QGIS_APP[2]


class AttributeTransferTestPolygonOrLine(unittest.TestCase):

    def setUp(self):
        self.source_layer = QgsVectorLayer(
            "Polygon?crs=epsg:4326&field=id:integer&field=textAttr:string&field=intAttr:integer&field=decAttr:double&field=dateAttr:date&index=yes", "source layer", "memory")
        self.target_layer = QgsVectorLayer(
            "Linestring?crs=epsg:4326&field=id:integer&index=yes", "target layer", "memory")
        self.widget = AttributeTransfer(IFACE)

        registry = QgsMapLayerRegistry.instance()
        registry.removeAllMapLayers()
        registry.addMapLayers([self.source_layer, self.target_layer])
        create_dummy_data_polygon_or_line(self.source_layer, self.target_layer)
        self.widget.initGui()
        self.widget.vectors = [self.source_layer, self.target_layer]
        self.widget.editable_vectors = [self.source_layer, self.target_layer]
        self.widget.dlg.sourceLayer.addItems(["source layer", "target layer"])

    def test_text_attr(self):
        ATTRIBUTE_NAME = "textAttr"
        ATTRIBUTE_INDEX = 1

        self._test_attr(ATTRIBUTE_NAME, ATTRIBUTE_INDEX)

    def test_int_attr(self):
        ATTRIBUTE_NAME = "intAttr"
        ATTRIBUTE_INDEX = 2

        self._test_attr(ATTRIBUTE_NAME, ATTRIBUTE_INDEX)

    def test_dec_attr(self):
        ATTRIBUTE_NAME = "decAttr"
        ATTRIBUTE_INDEX = 3

        self._test_attr(ATTRIBUTE_NAME, ATTRIBUTE_INDEX)

    def test_date_attr(self):
        ATTRIBUTE_NAME = "dateAttr"
        ATTRIBUTE_INDEX = 4

        self._test_attr(ATTRIBUTE_NAME, ATTRIBUTE_INDEX)

    def test_existing_attr(self):
        ATTRIBUTE_NAME = "id"
        ATTRIBUTE_INDEX = 0

        self.widget.dlg.sourceAttribute.setCurrentIndex(ATTRIBUTE_INDEX)
        self.widget.dlg.targetAttribute.setText(ATTRIBUTE_NAME)

        self.assertEqual(
            self.widget.dlg.sourceAttribute.currentText(), ATTRIBUTE_NAME)
        self.assertFalse(self.widget.transfer())

    def _test_attr(self, attr_name, attr_index):
        self.widget.dlg.sourceAttribute.setCurrentIndex(attr_index)
        self.widget.dlg.targetAttribute.setText(attr_name)

        self.assertEqual(
            self.widget.dlg.sourceAttribute.currentText(), attr_name)

        self.widget.transfer()

        target_fields = [f.name()
                         for f in self.target_layer.dataProvider().fields()]
        self.assertIn(attr_name, target_fields)

        source_features = [f for f in self.source_layer.getFeatures()]
        target_features = [f for f in self.target_layer.getFeatures()]

        for idx, f in enumerate(source_features):
            self.assertEqual(f.attribute(attr_name), target_features[
                             idx].attribute(attr_name))


if __name__ == "__main__":
    unittest.main()

by Michal Zimmermann at November 30, 2017 04:36 PM

November 27, 2017

Paul Ramsey

Nested Loop Join with FDW

Update: See below, but I didn’t test the full pushdown case, and the result is pretty awesome.

I have been wondering for a while if Postgres would correctly plan a spatial join over FDW, in which one table was local and one was remote. The specific use case would be “keeping a large pile of data on one side of the link, and joining to it”.

Because spatial joins always plan out to a “nested loop” execution, where one table is chosen to drive the loop, and the other to be filtered on the rows from the driver, there’s nothing to prevent the kind of remote execution I was looking for.

I set up my favourite spatial join test: BC voting areas against BC electoral districts, with local and remote versions of both tables.

CREATE EXTENSION postgres_fdw;

-- Loopback foreign server connects back to
-- this same database
CREATE SERVER test
  FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (
    host '127.0.0.1', 
    dbname 'test', 
    extensions 'postgis'
    );

CREATE USER MAPPING FOR pramsey
        SERVER test
        OPTIONS (user 'pramsey', password '');
        
-- Foreign versions of the local tables        
CREATE FOREIGN TABLE ed_2013_fdw
( 
  gid integer, 
  edname text, 
  edabbr text, 
  geom geometry(MultiPolygon,4326)
) SERVER test 
  OPTIONS (
    table_name 'ed_2013', 
    use_remote_estimate 'true');

CREATE FOREIGN TABLE va_2013_fdw
( 
  gid integer OPTIONS (column_name 'gid'), 
  id text OPTIONS (column_name 'id'),
  vaabbr text OPTIONS (column_name 'vaabbr'), 
  edabbr text OPTIONS (column_name 'edabbr'), 
  geom geometry(MultiPolygon,4326) OPTIONS (column_name 'geom')
) SERVER test 
  OPTIONS (
    table_name 'va_2013', 
    use_remote_estimate 'true');

The key option here is use_remote_estimate set to true. This tells postgres_fdw to query the remote server for an estimate of the remote table selectivity, which is then fed into the planner. Without use_remote_estimate, PostgreSQL will generate a terrible plan that pulls the contents of the va_2013_fdw table local before joining.

With use_remote_estimate in place, the plan is just right:

SELECT count(*), e.edabbr
  FROM ed_2013 e
  JOIN va_2013_fdw v
  ON ST_Intersects(e.geom, v.geom)
  WHERE e.edabbr in ('VTB', 'VTS')
  GROUP BY e.edabbr;
GroupAggregate  (cost=241.14..241.21 rows=2 width=12)
 Output: count(*), e.edabbr
 Group Key: e.edabbr
 ->  Sort  (cost=241.14..241.16 rows=6 width=4)
     Output: e.edabbr
     Sort Key: e.edabbr
     ->  Nested Loop  (cost=100.17..241.06 rows=6 width=4)
         Output: e.edabbr
         ->  Seq Scan on public.ed_2013 e  (cost=0.00..22.06 rows=2 width=158496)
             Output: e.gid, e.edname, e.edabbr, e.geom
             Filter: ((e.edabbr)::text = ANY ('{VTB,VTS}'::text[]))
         ->  Foreign Scan on public.va_2013_fdw v  (cost=100.17..109.49 rows=1 width=4236)
             Output: v.gid, v.id, v.vaabbr, v.edabbr, v.geom
             Remote SQL: SELECT geom FROM public.va_2013 WHERE (($1::public.geometry(MultiPolygon,4326) OPERATOR(public.&&) geom)) AND (public._st_intersects($1::public.geometry(MultiPolygon,4326), geom))

For FDW drivers other than postgres_fdw this means there’s a benefit to going to the trouble to support the FDW estimation callbacks, though the lack of exposed estimation functions in a lot of back-ends may mean the support will be ugly hacks and hard-coded nonsense. PostgreSQL is pretty unique in exposing fine-grained information about table statistics.

Update

One “bad” thing about the join pushdown plan above is that it still pulls all the resultant records back to the source before aggregating them, so there’s a missed opportunity there. However, if both the tables in the join condition are remote, the system will correctly plan the query as a remote join and aggregation.

SELECT count(*), e.edabbr
  FROM ed_2013_fdw e
  JOIN va_2013_fdw v
  ON ST_Intersects(e.geom, v.geom)
  WHERE e.edabbr in ('VTB', 'VTS')
  GROUP BY e.edabbr;
 Foreign Scan  
   (cost=157.20..157.26 rows=1 width=40) 
   (actual time=32.750..32.752 rows=2 loops=1)
   Output: (count(*)), e.edabbr
   Relations: Aggregate on ((public.ed_2013_fdw e) INNER JOIN (public.va_2013_fdw v))
   Remote SQL: SELECT count(*), r1.edabbr FROM (public.ed_2013 r1 INNER JOIN public.va_2013 r2 ON (((r1.geom OPERATOR(public.&&) r2.geom)) AND (public._st_intersects(r1.geom, r2.geom)) AND ((r1.edabbr = ANY ('{VTB,VTS}'::text[]))))) GROUP BY r1.edabbr
 Planning time: 12.752 ms
 Execution time: 33.145 ms

November 27, 2017 04:00 PM

November 23, 2017

Michal Zimmermann

QGIS Plugin Development: AttributeTransfer Plugin

This part finally brings the whole source code of the QGIS AttributeTransfer plugin.

The plugin itself resides in the attribute_transfer.py file. When the run() method is invoked, the Qt form pops up with combos prefilled with the available vector layers that support attribute editing.

Source and target layer combos are mutually exclusive, thus it’s not possible to transfer the attribute within the same layer.

Coding the plugin, I came across minor issues related mainly to the QgsSpatialIndex implementation. In the nearest neighbor analysis part of the series, the QgsSpatialIndex.nearestNeighbor method was mentioned. Yet, as I found out, this method only works with QgsPoint geometries. Those are impossible to get from QgsPolygon or QgsPolyline, though. What can one possibly do, facing such a misfortune? Well… draw a solution matrix.

|         | point | line | polygon |
|---------|-------|------|---------|
| point   | QgsSpatialIndex.nearestNeighbor | QgsSpatialIndex.nearestNeighbor; layers have to be switched, e.g. source layer = line | QgsSpatialIndex.nearestNeighbor; layers have to be switched, e.g. source layer = polygon |
| line    | QgsSpatialIndex.nearestNeighbor | QgsSpatialIndex.intersects with QgsGeometry.distance | QgsSpatialIndex.intersects with QgsGeometry.distance |
| polygon | QgsSpatialIndex.nearestNeighbor | QgsSpatialIndex.intersects with QgsGeometry.distance | QgsSpatialIndex.intersects with QgsGeometry.distance |
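
For the non-point combinations, here is a minimal sketch of the intersects-plus-distance approach (QGIS 2.x API; source_index is a QgsSpatialIndex built over the source layer and source_features maps feature ids to features - both names are mine, not the plugin's):

def nearest_source_feature_id(target_feature, source_index, source_features):
    # candidates whose bounding boxes intersect the target's bounding box
    candidate_ids = source_index.intersects(target_feature.geometry().boundingBox())
    nearest_id, nearest_distance = None, float("inf")
    for candidate_id in candidate_ids:
        distance = target_feature.geometry().distance(source_features[candidate_id].geometry())
        if distance < nearest_distance:
            nearest_id, nearest_distance = candidate_id, distance
    return nearest_id  # None when no bounding boxes intersect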

Using the spatial index brings one more issue I came to realize only after implementing the special comparison workflows for different geometry types: the nearest feature found via bounding boxes might not actually be the nearest feature. In that case, I chose to find the most distant vertex of such a feature and use it to construct a rectangle around the target feature. If there are any source features inside that rectangle, it’s very likely one of them is the real nearest feature.
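
A rough sketch of that safeguard (again QGIS 2.x API and my own names, assuming the candidate is a single-ring polygon so its vertices can be enumerated with asPolygon()):

from qgis.core import QgsGeometry

def ids_within_safe_rectangle(target_feature, candidate_feature, source_index):
    target_geom = target_feature.geometry()
    # distance from the target to the candidate's most distant vertex
    max_distance = max(
        target_geom.distance(QgsGeometry.fromPoint(vertex))
        for vertex in candidate_feature.geometry().asPolygon()[0]
    )
    # grow the target's bounding box by that distance and query the index again
    rect = target_geom.boundingBox()
    rect.setXMinimum(rect.xMinimum() - max_distance)
    rect.setXMaximum(rect.xMaximum() + max_distance)
    rect.setYMinimum(rect.yMinimum() - max_distance)
    rect.setYMaximum(rect.yMaximum() + max_distance)
    return source_index.intersects(rect)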

Right now, I’m working on finding the nearest feature even if no bounding box intersection is found. Meanwhile, the plugin is under review for inclusion in the QGIS Plugins repository. Fingers crossed.

I thought this was going to be the last part of the series. But how could one possibly claim the coding project done without writing tests? Stay tuned for the next episode.

by Michal Zimmermann at November 23, 2017 06:00 PM

November 16, 2017

Boston GIS (Regina Obe, Leo Hsu)

Happy PostGIS Day 2017

To commemorate PostGIS Day 2017, I put together a 3D scene listing my favorite functions.

You can see it here: PostGIS Day in 3D

by Regina Obe (nospam@example.com) at November 16, 2017 11:51 PM

Michal Zimmermann

QGIS Plugin Development: Creating GUI with Qt Designer

After fiddling with QGIS Python console and implementing nearest neighbor analysis, I’m going to create a very simple GUI for the plugin at last.

While the QGIS API docs took me a few hours to grasp, the PyQGIS ecosystem knocked my socks off. Here comes the list of tools you should incorporate into your development process as soon as possible.

Plugin Builder

The QGIS Plugin Builder is a plugin created to create… well, other plugins. It gets you going in minutes and lets you code instead of setting up things you don’t want to be setting up. A definite must-have. Note that on Linux you should put the plugin inside the QGIS plugins folder (which defaults to ~/.qgis2/python/plugins).

Remember to run pyrcc4 -o resources.py resources.qrc inside your plugin folder before you add it to QGIS.

Plugin Reloader

The QGIS Plugin Reloader is a plugin (possibly created with QGIS Plugin Builder) that lets you live reload your plugin while you code. No QGIS restarts needed. A definite must-have.

Qt Designer

Qt Designer comes with the qt4-designer package on Ubuntu. It is tailored to designing and building GUIs from Qt components that can be used within QGIS. Its drag&drop interface lets you prototype quickly.

Thanks to the Plugin Builder you can load the attribute_transfer_dialog_base.ui file straight into the Qt Designer and adjust it to your needs.

It doesn’t take much, just one QLineEdit and a few QComboBox widgets. Those will be available in the attribute_transfer.py file as properties of the AttributeTransferDialog class.

The widget name can be customized in the right sidebar and I advise you to do so. I chose the following:

Once loaded with Plugins -> Manage and Install Plugins -> AttributeTransfer, the plugin is available right from the toolbar or Vector menu. It is missing the business logic completely, but I have this covered in the previous part.

All that is to be done is to bind those two parts together.

by Michal Zimmermann at November 16, 2017 02:00 PM

November 15, 2017

PostGIS Development

PostGIS Patch Releases

The PostGIS development team has uploaded bug fix releases for the 2.3 and 2.4 stable branches.

2.3.5

2.4.2

by Paul Ramsey at November 15, 2017 12:00 AM

November 09, 2017

Michal Zimmermann

QGIS Plugin Development: Finding Nearest Neighbors

I described basics of vector layers manipulation in the previous part of the series. With my goal in mind (fully functional custom plugin capable of writing an attribute value from a source layer to a target layer based on a feature distance), I’d like to discuss spatial indexing and nearest neighbor analysis.

The picture above illustrates the task that can be solved solely by using QGIS API. Imagine you’re given a source layer with an attribute filled with values. You’re given a target layer as well, sadly though, the values in this layer are missing (not so rare in the GIS world, right?). Yet you know that the missing attribute value of each feature in the target layer can be filled by the value of its nearest neighbor from the source layer. How do you do that?

Generating dummy data

Let’s create two memory data sets with id and value attributes. Both of them will have ten features.

from qgis.core import QgsMapLayerRegistry, QgsVectorLayer, QgsFeature, QgsGeometry, QgsPoint, QgsSpatialIndex
from qgis.utils import iface

source_layer = QgsVectorLayer("point?crs=epsg:4326&field=id:integer&field=value:integer", "Source layer", "memory")
target_layer = QgsVectorLayer("point?crs=epsg:4326&field=id:integer&field=value:integer", "Target layer", "memory")

def create_dummy_data():

    source_layer.startEditing()
    target_layer.startEditing()

    feature = QgsFeature(source_layer.pendingFields())

    for i in range(10):
        feature.setGeometry(QgsGeometry.fromPoint(QgsPoint(i, i)))
        feature.setAttribute("id", i)
        feature.setAttribute("value", i)
        source_layer.addFeature(feature)

    # a fresh feature with the target layer's fields (same schema as the source);
    # the "value" attribute is intentionally left unset - it will be filled later
    feature = QgsFeature(target_layer.pendingFields())

    for i in range(10):
        feature.setGeometry(QgsGeometry.fromPoint(QgsPoint(i + i, i)))
        feature.setAttribute("id", i)
        target_layer.addFeature(feature)

    source_layer.commitChanges()
    target_layer.commitChanges()

    QgsMapLayerRegistry.instance().addMapLayer(source_layer)
    QgsMapLayerRegistry.instance().addMapLayer(target_layer)

create_dummy_data()

Writing values from the nearest neighbor

The actual nearest neighbor analysis can be done in ten lines of code! First, a qgis.core.QgsSpatialIndex is built from all the source_layer features. Then you iterate over the target_layer features and for each of them get just one nearest neighbor (nearestNeighbor(f.geometry().asPoint(), 1)[0]). Finally, you write the nearest neighbor’s attribute value to the target layer and commit the changes. Use the following code together with the code above.

def write_values_from_nn():
    # spatial index over all source features for fast nearest neighbour lookups
    source_layer_index = QgsSpatialIndex(source_layer.getFeatures())
    source_layer_features = {feature.id(): feature for feature in source_layer.getFeatures()}
    target_layer_features = target_layer.getFeatures()

    target_layer.startEditing()

    for f in target_layer_features:
        nearest = source_layer_index.nearestNeighbor(f.geometry().asPoint(), 1)[0]
        value = source_layer_features[nearest].attribute("value")
        # field index 1 is the "value" attribute
        target_layer.changeAttributeValue(f.id(), 1, value)

    target_layer.commitChanges()

write_values_from_nn()

Missing pieces or what’s next

I’m one step closer to my goal. What’s missing?

  • capabilities checks: does the target layer support edits? Check the layer data provider capabilities to find out (see the sketch after this list).
  • user logging: notices, warnings or errors are completely missing. It would be great to have them shown inside qgis.gui.QgsMessageBar.
  • custom attributes: this version expects both layers to have the same attribute with the same data type.
  • GUI: a very simple PyQt widget will turn this console-based script into a custom plugin. That’s what’s going to be next.
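
A minimal sketch of the capabilities check from the first bullet (assuming the target_layer from the snippets above):

from qgis.core import QgsVectorDataProvider

caps = target_layer.dataProvider().capabilities()
if not caps & QgsVectorDataProvider.ChangeAttributeValues:
    print("target layer does not support changing attribute values")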

by Michal Zimmermann at November 09, 2017 02:00 PM

November 07, 2017

PostGIS Development

Move PostGIS extension to a different schema

As of PostGIS 2.3, the postgis extension was changed to no longer allow relocation. All function calls within the extension are now schema qualified.

While this change fixed some issues with database restore, it created a new problem: if you installed PostGIS in a schema other than the one you intended, it is not obvious how to move it to a different schema. Luckily there is a way to do this.

For this exercise, I will install PostGIS in the default schema and then demonstrate how to move it into another schema location.

You can run these steps using psql or pgAdmin or any other PostgreSQL tool you want.
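
The complete, tested steps are in the full post, but as a rough sketch of the general approach (not necessarily the exact steps from the post; new_schema is a placeholder):

-- temporarily mark the extension as relocatable, then move it
UPDATE pg_extension SET extrelocatable = true WHERE extname = 'postgis';
ALTER EXTENSION postgis SET SCHEMA new_schema;
-- a follow-up ALTER EXTENSION postgis UPDATE ... to your installed version
-- is typically needed so internal references pick up the new schema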

Continue Reading by clicking title hyperlink ..

by Regina Obe at November 07, 2017 12:00 AM

November 06, 2017

Paul Ramsey

Parallel PostGIS IIA

One of the core complaints in my review of PostgreSQL parallelism was that the cost of functions executed on rows returned by queries does not get included in evaluating the cost of a plan.

So for example, the planner appeared to consider these two queries equivalent:

SELECT *
FROM pd;

SELECT ST_Area(geom)
FROM pd;

They both retrieve the same number of rows and both have no filter on them, but the second one includes a fairly expensive function evaluation. No amount of changing the cost of the ST_Area() function would cause a parallel plan to materialize. Only changing the size of the table (making it bigger) would flip the plan into parallel mode.

Fortunately, when I raised this issue on pgsql-hackers, it turned out to have been reported and discussed last month, and Amit Kapila had already prepared a patch, which he kindly rebased for me.

With the patch in place, I now see rational behavior from the planner. Using the default PostGIS function costs, a simple area calculation on my 60K row polling division table is sequential:

EXPLAIN
SELECT ST_Area(geom)
FROM pd;
Seq Scan on pd  
(cost=0.00..14445.17 rows=69534 width=8)

However, if the ST_Area() function is costed a little more realistically, the plan shifts.

ALTER FUNCTION ST_Area(geometry) COST 100;

EXPLAIN
SELECT ST_Area(geom)
FROM pd;
 Gather  
 (cost=1000.00..27361.20 rows=69534 width=8)
   Workers Planned: 3
   ->  Parallel Seq Scan on pd  
       (cost=0.00..19407.80 rows=22430 width=8)

Perfect!

While not every query receives what I consider a “perfect plan”, it now appears that we at least have some reasonable levers available to get better plans by applying sensible (higher) costs across the PostGIS code base.
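
For anyone experimenting with this, the cost currently assigned to a function can be checked in the system catalog with plain SQL before (or after) altering it:

SELECT proname, procost
  FROM pg_proc
 WHERE proname = 'st_area';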

November 06, 2017 08:00 PM

November 02, 2017

Michal Zimmermann

QGIS Plugin Development: Using Python Console

As mentioned in the previous part of the series, the QGIS Python console is an entry point to GIS workflow automation within QGIS. Remember there’s an iface object representing the qgis.gui.QgisInterface instance within the console that gives you access to the whole QGIS GUI. Let’s see what we can do inside the console.

Loading vector layers folder

import glob
from qgis.core import QgsMapLayerRegistry, QgsVectorLayer

def load_folder(folder):
    VALID_EXTENSIONS = ('.geojson', '.gpkg', '.shp')
    files = [f for f in glob.glob("{}/*".format(folder)) if f.endswith(VALID_EXTENSIONS)]

    for f in files:
        layer = QgsVectorLayer(f, f.split('/')[-1], 'ogr')

        if not layer.isValid():
            iface.messageBar().pushCritical("Failed to load:", f)
            continue

        QgsMapLayerRegistry.instance().addMapLayer(layer)

load_folder("path/to/your/vector/files/folder")
  • QgsMapLayerRegistry represents Layers Panel as present in the QGIS GUI
  • iface.messageBar() returns the message bar of the main app and lets you notify the user of what’s going on under the hood
  • QgsVectorLayer represents a vector layer with its underlying vector data sets

Editing active layer attribute table

The following code demonstrates how to edit a vector layer attribute table via the console.

  • Any attribute to be written has to come in the form of a qgis.core.QgsField - this is more or less an encapsulation of an attribute name and its type (PyQt4.QtCore.QVariant to be precise)
  • The underlying data provider has to be capable of attribute addition (caps & QgsVectorDataProvider.AddAttributes)
  • QgsVectorLayer.addAttribute method returns boolean rather than throwing an exception
from qgis.core import QgsField, QgsVectorDataProvider
from qgis.gui import QgsMessageBar
from PyQt4.QtCore import QVariant


def edit_active_layer(attr_name, attr_type):
    layer = iface.activeLayer()
    caps = layer.dataProvider().capabilities()

    if caps & QgsVectorDataProvider.AddAttributes:
        layer.startEditing()
        if layer.addAttribute(QgsField(attr_name, attr_type)):
            iface.messageBar().pushMessage("Attribute {0} was successfully added to the active layer.".format(attr_name), QgsMessageBar.SUCCESS)
            layer.commitChanges()
        else:
            iface.messageBar().pushMessage("Attribute {0} was not added. Does it already exist?".format(attr_name), QgsMessageBar.CRITICAL)
            layer.rollBack()

edit_active_layer("new_string_attribute", QVariant.String)

The whole series aims to present a plugin capable of writing a new attribute and its value to an existing layer. Thus, this code might come in handy in the future.

Creating a new vector layer

It’s possible to create a whole new vector layer with QGIS Python console. I present a very simple create_new_layer function, yet I hope you can imagine the ways it can be tweaked.

from qgis.core import (QGis, QgsCoordinateReferenceSystem, QgsFeature, QgsField,
                       QgsFields, QgsGeometry, QgsMapLayerRegistry, QgsPoint,
                       QgsVectorDataProvider, QgsVectorFileWriter, QgsVectorLayer)
from qgis.gui import QgsMessageBar
from PyQt4.QtCore import QVariant

def create_new_layer():
    filename = "/path/to/your/vector/file.gpkg"

    fields = QgsFields()
    fields.append(QgsField("attr1", QVariant.String))
    fields.append(QgsField("attr2", QVariant.Int))

    # the writer creates the GeoPackage file; deleting it flushes and closes the file
    writer = QgsVectorFileWriter(
        filename,
        "UTF8",
        fields,
        QGis.WKBPoint,
        QgsCoordinateReferenceSystem(4326),
        "GPKG"
    )
    del writer

    layer = QgsVectorLayer(filename, filename.split("/")[-1], "ogr")
    QgsMapLayerRegistry.instance().addMapLayer(layer)

    # stop early if the data provider does not allow edits
    if not layer.dataProvider().capabilities() & QgsVectorDataProvider.AddAttributes:
        iface.messageBar().pushMessage("The layer does not support editing.", QgsMessageBar.CRITICAL)
        return

    feature = QgsFeature(layer.pendingFields())
    feature.setGeometry(QgsGeometry.fromPoint(QgsPoint(0, 0)))
    feature.setAttribute("attr1", "attr1")
    feature.setAttribute("attr2", 2)

    layer.startEditing()

    if layer.addFeature(feature, True):
        layer.commitChanges()
    else:
        layer.rollBack()
        iface.messageBar().pushMessage("Feature addition failed.", QgsMessageBar.CRITICAL)

create_new_layer()

Those were just a few examples of what can be done with the QGIS API and the Python console. Next time, I’d like to focus on spatial joins inside QGIS - another step towards the final plugin.

by Michal Zimmermann at November 02, 2017 02:00 PM