Welcome to Planet PostGIS

July 06, 2016

UpStats blog (Stefan Petrea)

Geolocation using multiple services

Intro

In a previous post I wrote about PostGIS and ways of querying geographical data.

This post will focus on building a system that queries free geolocation services 1 and aggregates their results.

Overview

In summary, we're making requests to different web services (or APIs), doing reverse geocoding on the results and then aggregating them.

Comparing geonames and openstreetmap

To relate to the previous post, here are some differences between geonames and openstreetmap:

criterion                         OSM                geonames
size                              50 GB compressed   309 MB compressed
entities                          3.25 billion       11 million
has administrative area data      yes                yes
has lat/long city data            yes                yes
has neighbourhood/district data   yes                yes
has region/area polygonal areas   yes                no
has intracity-level metadata      yes                no
has terrain metadata              yes                no

They are meant for different purposes. Geonames is meant for city/administrative area/country data and can be used for geocoding. Openstreetmap has much more detailed data (one could probably extract the geonames data from openstreetmap) and can be used for geocoding, route planning and more.

Asynchronous requests to geolocation services

We're using the gevent library to make asynchronous requests to the geolocation services.

import gevent
import gevent.greenlet
from gevent import monkey; gevent.monkey.patch_all()
import urllib2

geoip_service_urls=[
        ['geoplugin'    , 'http://www.geoplugin.net/json.gp?ip={ip}' ],
        ['ip-api'       , 'http://ip-api.com/json/{ip}'              ],
        ['nekudo'       , 'https://geoip.nekudo.com/api/{ip}'        ],
        ['geoiplookup'  , 'http://api.geoiplookup.net/?query={ip}'   ],
        ]

# fetch url in asynchronous mode (makes use of gevent)
def fetch_url_async(url, tag, timeout=2.0):
    data = None
    try:
        opener = urllib2.build_opener(urllib2.HTTPSHandler())
        opener.addheaders = [('User-agent', 'Mozilla/')]
        urllib2.install_opener(opener)
        data = urllib2.urlopen(url,timeout=timeout).read()
    except Exception, e:
        pass

    return [tag, data]

# expects req_data to be in this format: [ ['tag', url], ['tag', url], .. ]
def fetch_multiple_urls_async(req_data):

    # start the threads (greenlets)
    threads_ = []
    for u in req_data:
        (tag, url) = u
        new_thread = gevent.spawn(fetch_url_async, url, tag)
        threads_.append(new_thread)

    # wait for threads to finish
    gevent.joinall(threads_)

    # retrieve threads return values
    results = []
    for t in threads_:
        results.append(t.get(block=True, timeout=5.0))

    return results

def process_service_answers(location_data):
    # 1) extract lat/long data from responses
    # 2) reverse geocoding using geonames
    # 3) aggregate location data
    #    (for example, one way of doing this would
    #     be to choose the location that most services
    #     agree on)
    pass

def geolocate_ip(ip):
    urls = []
    for grp in geoip_service_urls:
        tag, url = grp
        urls.append([tag, url.format(ip=ip)])
    results = fetch_multiple_urls_async(urls)
    answer = process_service_answers(results)
    return answer
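The post leaves process_service_answers as a stub. As a rough sketch (not the post's actual implementation), steps 1 and 3 could look like the following; the JSON field names and the rounding-based majority vote are assumptions for illustration, and each service's real response format needs its own handling:

import json
from collections import Counter

# A possible extractor for step 1. The field names below are assumptions;
# each service returns its own JSON (or XML) layout and needs its own handling.
def extract_latlong(tag, raw_response):
    if raw_response is None:
        return None
    try:
        doc = json.loads(raw_response)
    except ValueError:
        return None
    candidates = [('latitude', 'longitude'),
                  ('lat', 'lon'),
                  ('geoplugin_latitude', 'geoplugin_longitude')]
    for lat_key, long_key in candidates:
        if lat_key in doc and long_key in doc:
            try:
                return (float(doc[lat_key]), float(doc[long_key]))
            except (TypeError, ValueError):
                return None
    return None

# One way to do step 3: round coordinates so near-identical answers vote
# together, then keep the lat/long that most services agree on.
def process_service_answers(location_data):
    votes = Counter()
    for tag, raw_response in location_data:
        coords = extract_latlong(tag, raw_response)
        if coords:
            votes[(round(coords[0], 2), round(coords[1], 2))] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]

Step 2, the reverse geocoding against geonames, is what the rest of the post builds.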

City name ambiguity

Cities with the same name within the same country

There are many cities with the same name within a country, in different states/administrative regions. There are also cities with the same name in different countries.

For example, according to Geonames, there are 24 cities named Clinton in the US (in 23 different states, with two cities named Clinton in the same state of Michigan).

WITH duplicate_data AS (
    SELECT
    city_name,
    array_agg(ROW(country_code, region_code)) AS dupes
    FROM city_region_data
    WHERE country_code = 'US'
    GROUP BY city_name, country_code
    ORDER BY COUNT(ROW(country_code, region_code)) DESC
)
SELECT
city_name,
ARRAY_LENGTH(dupes, 1) AS duplicity,
( CASE WHEN ARRAY_LENGTH(dupes,1) > 9 
  THEN CONCAT(SUBSTRING(ARRAY_TO_STRING(dupes,','), 1, 50), '...')
  ELSE ARRAY_TO_STRING(dupes,',') END
) AS sample
FROM duplicate_data
LIMIT 5;
city_name    duplicity  sample
Clinton      24         (US,NY),(US,AR),(US,NC),(US,MA),(US,MD),(US,OH),(U…
Franklin     19         (US,ME),(US,MA),(US,NC),(US,TX),(US,NC),(US,LA),(U…
Springfield  19         (US,MN),(US,KY),(US,SD),(US,MI),(US,VA),(US,IL),(U…
Madison      18         (US,CT),(US,MN),(US,NJ),(US,ME),(US,SD),(US,FL),(U…
Greenville   18         (US,NC),(US,SC),(US,MS),(US,KY),(US,RI),(US,ME),(U…

Cities with the same name in the same country and region

Worldwide, even in the same region of a country, there can be multiple cities with the exact same name.

Take for example Georgetown, in Indiana. Geonames says there are 3 towns with that name in Indiana. Wikipedia says there are even more:

WITH duplicate_data AS (
    SELECT
    city_name,
    array_agg(ROW(country_code, region_code)) AS dupes
    FROM city_region_data
    WHERE country_code = 'US'
    GROUP BY city_name, region_code, country_code
    ORDER BY COUNT(ROW(country_code, region_code)) DESC
)
SELECT
city_name,
ARRAY_LENGTH(dupes, 1) AS duplicity,
( CASE WHEN ARRAY_LENGTH(dupes,1) > 9 
  THEN CONCAT(SUBSTRING(ARRAY_TO_STRING(dupes,','), 1, 50), '...')
  ELSE ARRAY_TO_STRING(dupes,',') END
) AS sample
FROM duplicate_data
LIMIT 4;
city_name      duplicity  sample
Plantation     3          (US,FL),(US,FL),(US,FL)
Georgetown     3          (US,IN),(US,IN),(US,IN)
Robinwood      3          (US,MD),(US,MD),(US,MD)
Prospect Park  2          (US,NJ),(US,NJ)

Reverse geocoding

Both (city_name, country_code) and (city_name, country_code, region_name) tuples have failed as candidates to uniquely identify a location.

Zip codes or postal codes would be another option, but we can't use those since most geolocation services don't offer them.

But most geolocation services do offer longitude and latitude, and we can use those to eliminate ambiguity.

Geometric data types in PostgreSQL

Looking further into the PostgreSQL docs, I found that it also has geometric data types and functions for 2D geometry. Out of the box you can model points, boxes, paths, polygons and circles, store them and query them.

PostgreSQL has some additional extensions in the contrib directory. They are available out of the box with most Postgres installs.

In this situation we're interested in the cube and earthdistance extensions 2. The cube extension allows you to model n-dimensional vectors, and the earthdistance extension uses 3-cubes to store vectors and represent points on the surface of the Earth.

We'll be using the following (a quick illustration follows this list):

  • the earth_distance function, which computes the great-circle distance between two points
  • the earth_box function, to check whether a point is within a certain distance of a reference point
  • a GiST expression index on the expression ll_to_earth(lat, long), to make spatial queries fast and find nearby points
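As a quick illustration of these building blocks (arbitrary coordinates, and assuming the cube and earthdistance extensions are already installed):

-- great-circle distance in meters between two arbitrary points
SELECT earth_distance(ll_to_earth(44.43, 26.10),
                      ll_to_earth(45.65, 25.60)) AS meters;

-- is the second point within 150 km of the first?
-- (earth_box gives a bounding box, so this is a fast but slightly
--  coarse test; earth_distance can be used to refine it)
SELECT earth_box(ll_to_earth(44.43, 26.10), 150000)
       @> ll_to_earth(45.65, 25.60) AS within_150km;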

Designing a view for city & region data

Geonames data was imported into 3 tables: geo_geoname (the main place table, with names, lat/long and population), geo_admin1 (first-level administrative regions) and geo_countryinfo (country metadata).

Then we create a view that pulls everything together 3. We now have population data, city/region/country data, and lat/long data, all in one place.

CREATE OR REPLACE VIEW city_region_data AS ( 
    SELECT
        b.country AS country_code,
        b.asciiname AS city_name,
        a.name AS region_name,
        b.region_code,
        b.population,
        b.latitude AS city_lat,
        b.longitude AS city_long,
        c.name    AS country_name
    FROM geo_admin1 a
    JOIN (
        SELECT *, (country || '.' || admin1) AS country_region, admin1 AS region_code
        FROM geo_geoname
        WHERE fclass = 'P'
    ) b ON a.code = b.country_region
    JOIN geo_countryinfo c ON b.country = c.iso_alpha2
);
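A quick sanity check against the view might look like this (the country filter is just an example):

SELECT city_name, region_name, country_name, population, city_lat, city_long
FROM city_region_data
WHERE country_code = 'US'
ORDER BY population DESC
LIMIT 3;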

Designing a nearby-city query and function

In the innermost SELECT, we keep only the cities within a 23 km radius around the reference point, then we apply a country filter and a city pattern filter (both filters are optional), and we take only the 50 results closest to the reference point.

Next, we reorder by population, because geonames sometimes has districts and neighbourhoods around bigger cities 4 and does not mark them in a specific way, so we want to select the larger city rather than one of its districts (for example, if the geolocation service returned a lat/long that resolves to a district of a larger metropolitan area, I'd like to resolve it to the larger city it's associated with).

We also create a GiST index (the @> operator will make use of it), which we use to find points within a given radius of a reference point.

This function takes a point (using latitude and longitude) and returns the city, region and country that is associated with that point.

CREATE INDEX geo_geoname_latlong_idx ON geo_geoname USING gist(ll_to_earth(latitude,longitude));
CREATE OR REPLACE FUNCTION geo_find_nearest_city_and_region(
    latitude double precision,
    longitude double precision,
    filter_countries_arr varchar[],
    filter_city_pattern  varchar
) RETURNS TABLE(
    country_code varchar,
    city_name varchar,
    region_name varchar,
    region_code varchar,
    population bigint,
    _lat double precision,
    _long double precision,
    country_name varchar,
    distance numeric
    ) AS $$
BEGIN
    RETURN QUERY
    SELECT *
    FROM (
        SELECT
        *
        FROM (
            SELECT 
            *,
            ROUND(earth_distance(
                   ll_to_earth(c.city_lat, c.city_long),
                   ll_to_earth(latitude, longitude)
                  )::numeric, 3) AS distance_
            FROM city_region_data c
            WHERE earth_box(ll_to_earth(latitude, longitude), 23000) @> ll_to_earth(c.city_lat, c.city_long) AND
                  (filter_countries_arr IS NULL OR c.country_code=ANY(filter_countries_arr)) AND
                  (filter_city_pattern  IS NULL OR c.city_name LIKE filter_city_pattern)
            ORDER BY distance_ ASC
            LIMIT 50
        ) d
        ORDER BY d.population DESC
    ) e
    LIMIT 1;
END;
$$
LANGUAGE plpgsql;
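Calling the function then looks like this (the reference point is arbitrary; both filters are left NULL, so no country or city-name restriction is applied):

SELECT city_name, region_name, country_name, distance
FROM geo_find_nearest_city_and_region(
       42.3601, -71.0589,        -- hypothetical reference point
       NULL::varchar[],          -- no country filter
       NULL::varchar             -- no city-name pattern
     );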

Conclusion

We started from the design of a system that queries multiple geoip services, gathers the data and then aggregates it to get a more reliable result.

We first looked at some ways of uniquely identifying locations.

We've then picked a way that would eliminate ambiguity in identifying them. In the second half, we've looked at different ways of structuring, storing and querying geographical data in PostgreSQL.

We then built a view and a function to find cities near a reference point, which allowed us to do reverse geocoding.

Footnotes:

1

By using multiple services (and assuming they use different data sources internally), after aggregation we can have a more reliable answer than if we were using just one.

Another advantage is that we're using free services: no setup is required and we don't have to take care of updates, since these services are maintained by their owners.

However, querying all these web services will be slower than querying a local geoip data structure. But there are city/country/region geolocation databases out there, such as geoip2 from maxmind, ip2location or db-ip.

2

There's a nice post here using the earthdistance module to compute distances to nearby or far away pubs.

3

Geonames has geonameIds as well, which are geonames-specific ids we can use to accurately refer to locations.

4

geonames does not have polygonal data for cities/neighbourhoods or metadata about the type of urban area (like openstreetmap does), so you can't query for all city polygons (as opposed to districts/neighbourhoods) that contain a given point.

July 06, 2016 04:00 AM

June 18, 2016

Stephen Mather

Using PostGIS for Hydrologic Modeling (reblog)

The Problem We have to filter out the roads and ditches without removing streams that cross roads or follow them closely. I’m going to use PostGIS to find the intersection of the streams lines data with a buffered roads polygon. If the intersected line is less than 50% of the length of the stream line, […]

via Filtering Roads from Extracted Streams Data — GeoKota


by smathermather at June 18, 2016 01:57 AM

June 04, 2016

Boston GIS (Regina Obe, Leo Hsu)

FOSS4GNA 2016: pgRouting - A Crash course video is out

Leo's pgRouting : a Crash Course video made it thru great. Better than mine. Leo doesn't believe in slides, so this is all live demo stuff. The data he used in the video is part of our code/data download for pgRouting: A Practical Guide.


Continue reading "FOSS4GNA 2016: pgRouting - A Crash course video is out"

by Regina Obe (nospam@example.com) at June 04, 2016 07:05 AM

May 31, 2016

Boston GIS (Regina Obe, Leo Hsu)

FOSS4GNA 2016 PostGIS Spatial Tricks video is out

The videos for FOSS4G NA 2016 have started coming out. Recently Andrea Ross posted PostGIS Spatial Tricks talk video. I'm happy to say it looks pretty good and I didn't suck as badly as I worried I would. Thank you very much Andrea. Some talks unfortunately did not come thru. I'm hoping Leo's pgRouting : a Crash Course video made it thru okay as well, and will post that later if it does.

The only small nit-picks are that the first 2-5 minutes or so didn't make it thru and the blue colors on the slides got a little drowned out, but here are the slides if you need full resolution.


Continue reading "FOSS4GNA 2016 PostGIS Spatial Tricks video is out"

by Regina Obe (nospam@example.com) at May 31, 2016 08:36 PM

May 30, 2016

Stephen Mather

Using foreign data wrapper to use PostGIS with SQLServer

Here was the problem that needed solved last week (we have a few similar problems in upcoming projects, so this was an exciting thing to try): we needed to use PostGIS to access data in a SQLServer database. The SQLServer database backs the web site in question, the underlying content management system, etc., so no– removing SQLServer isn’t really an option at this stage. Obviously PostGIS is a requirement too… .

Before I go further, I used tds_fdw as the foreign data wrapper. The limitations here are as follows: it is a read-only connection, and only works with Sybase and SQLServer, as it uses tabular data stream protocol for communicating between client and server. This is not as generic a solution as we can use. Next time I’ll try ogr_fdw which is more generic as it can connect with other databases and other data types. Another advantage to ogr_fdw is we can use IMPORT FOREIGN SCHEMA. Regina Obe and Leo Hsu warn though to limit this to 150 tables or so for performance reasons.

With the limitations listed above, this is how we built the thang:

DROP SERVER beach_fdw CASCADE;

-- Create the server connection with FOREIGN DATA WRAPPER
CREATE SERVER beach_fdw
FOREIGN DATA WRAPPER tds_fdw
OPTIONS (servername 'name_or_ip', port '1433', database 'non-spatial-database', tds_version '7.1', msg_handler 'notice');

-- We map the postgres user to the user that can read the table we're interested in
CREATE USER MAPPING FOR postgres
SERVER beach_fdw
OPTIONS (username 'user', password 'password');

-- Create the actual foreign table connection
CREATE FOREIGN TABLE beach_closures (
AutoNumber int NOT NULL,
Title varchar NOT NULL,
StartDate timestamp without time zone NOT NULL,
WaterQuality varchar NOT NULL,
Latitude varchar NOT NULL,
Longitude varchar NOT NULL,
BeachStatus varchar NOT NULL,
ClosureColor varchar NOT NULL)
SERVER beach_fdw
OPTIONS (schema_name 'schema_name', table_name 'vw_CMPBeachClosures');

-- Now we create a spatial view using our longitude and latitude
CREATE VIEW v_beach_closures AS
SELECT
AutoNumber, Title, StartDate, WaterQuality, Latitude,
Longitude, BeachStatus, ClosureColor, ST_SetSRID(ST_MakePoint(Longitude::numeric, Latitude::numeric), 4326)	AS geom
FROM beach_closures;

Voila! A nice little PostGIS enabled view of a SQLServer view or table!
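From here, ordinary PostGIS queries work against the view. For example (a hypothetical query; the point and the 10 km radius are made up):

-- beach closures within 10 km of a reference point
SELECT Title, BeachStatus, StartDate
FROM v_beach_closures
WHERE ST_DWithin(geom::geography,
                 ST_SetSRID(ST_MakePoint(-70.6, 41.5), 4326)::geography,
                 10000);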


by smathermather at May 30, 2016 02:34 AM

May 25, 2016

UpStats blog (Stefan Petrea)

Redesigning a notification system

Intro

As previously mentioned, UpStatsBot is part of UpStats. It's a notification system in the form of a bot. You can get notifications about jobs while you're on the street or in a cafe.

This bot has been running for ~7 months now, and it has been running quite well, except for one thing: the Postgres logs were growing a lot. Not long ago, I analyzed one week of PostgreSQL logs using pgbadger. The results were startling:

As the image shows, the bot was responsible for more than 40% of all queries to that database. In addition, the Pg logs were growing quite a lot.

The bot would run queries to poll for new items every 10 minutes; those queries would run regardless of whether the collector had brought in new data since the last run.

This blog post will describe how I fixed that issue using PostgreSQL's LISTEN/NOTIFY feature 1, 2.

The main advantage of LISTEN/NOTIFY is the ability to receive near-realtime notifications with a fraction of the queries.

Overview of the UpStats project

The diagram above summarizes how UpStats and UpStatsBot work and how the system is composed:

  • A data collector
  • A PostgreSQL db
  • A web app
  • Metrics (that are recomputed periodically)
  • A notification system

A user can access it from the Telegram bot or from the web app. The data is collected from the UpWork API and placed in a PostgreSQL database. Metrics can be computed (via more complex SQL queries) and notifications dispatched using the Telegram API.

We aim to move the logic responsible for notification computation from the bot into the collector, in order to achieve near real-time dispatch (i.e. whenever new data becomes available).

Tables involved in notifications

The relevant tables here are:

  • odesk_job
  • odesk_search_job
  • odesk_telegram

In summary, we use search keywords from odesk_telegram and search for them in odesk_job via odesk_search_job. The odesk_search_job table holds full-text indexes for the jobs. The odesk_telegram table holds search keywords and active search streams for each subscribed user.
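As a point of reference, here is a hypothetical sketch of the columns involved, inferred from the queries later in this post; the real tables certainly have more fields:

CREATE TABLE odesk_telegram (
    id          serial PRIMARY KEY,
    chat_id     bigint,           -- Telegram chat to notify
    search      text,             -- the user's search keywords
    last_job_ts int,              -- epoch of the newest job already notified
    paused      boolean DEFAULT false,
    deleted     boolean DEFAULT false
);

CREATE TABLE odesk_job (
    job_id       text PRIMARY KEY,
    job_title    text,
    date_created int               -- epoch timestamp of the job posting
);

CREATE TABLE odesk_search_job (
    job_id    text REFERENCES odesk_job (job_id),
    tsv_basic tsvector             -- full-text index over the job text
);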

PostgreSQL's LISTEN/NOTIFY in Python

To offer some background, some use-cases of LISTEN/NOTIFY include:

  • using it in conjunction with websockets to build chat systems 3
  • building asynchronous and trigger-based replication systems 4
  • keeping caches in sync with a PostgreSQL database 5

We're using the psycopg2 connector 6 for PostgreSQL. The connector uses a socket to talk to the Postgres database server, and that socket has a file descriptor. That descriptor is used in the select call. Select checks if the descriptor is ready for reading 7 .
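Before the Python code, it's worth noting that at the SQL level the mechanism is just two statements (the channel name and payload below are examples):

-- in one session: subscribe to a channel
LISTEN botchan;

-- in another session: send a message on that channel (the payload is optional)
NOTIFY botchan, '{"test": "value"}';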

In order to exemplify this, we'll write a simple Python class that allows us to send and listen to notifications 8.

import select
import psycopg2
import psycopg2.extensions
from psycopg2.extensions import QuotedString
import json
import time

__doc__="""
This class is used to create an easy to use queue
mechanism where you can send and listen to messages.
"""

class PgQueue:
    dbuser = None
    dbpass = None
    dbname = None

    conn = None
    curs = None
    channel = None

    continue_recv = True

    def __init__(self,channel,dbname=None,dbuser=None,dbpass=None):
        """
        Connect to the database.
        If one of dbname, dbuser or dbpassword are not provided,
        the responsibility of providing (and setting a connection on
        this object) will fall on the calling code. 

        Otherwise, this will create a connection to the database.
        """
        self.dbname = dbname
        self.dbuser = dbuser
        self.dbpass = dbpass
        self.channel = channel

        if not channel:
            raise Exception('No channel provided')

        if dbname and dbuser and dbpass:
            # store connection
            self.conn = psycopg2.connect( \
                    'dbname={dbname} user={dbuser} password={dbpass} host=127.0.0.1'.format(\
                    dbname=dbname,dbuser=dbuser,dbpass=dbpass))
            # this is required mostly by the NOTIFY statement because it has
            # to commit after the query has been executed
            self.conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)

    def recvLoop(self):
        """
        Loop that's concerned with receiving notifications
        """

        self.curs = self.conn.cursor()
        self.curs.execute("LISTEN {0};".format(self.channel))

        conn = self.conn
        curs = self.curs

        while self.continue_recv:
            if select.select([conn],[],[],6) == ([],[],[]):
                print "consumer: timeout"
            else:
                conn.poll()
                print "consumer: received messages"
                while conn.notifies:
                    notif = conn.notifies.pop(0)
                    # print "Got NOTIFY:", notif.pid, notif.channel, notif.payload
                    self.recvCallback(notif)

    def recvCallback(self, notification):
        """
        Needs to be implemented with notification handling logic
        """
        pass

    def send(self, data):
        """
        Send a notification
        """
        curs = self.conn.cursor()

        message = {}
        print "producer: sending.."
        # equip the message object with a timestamp
        message['time'] = time.time()
        message['data'] = data
        messageJson = json.dumps(message)
        messagePg = QuotedString(messageJson).getquoted()

        query = 'NOTIFY {0}, {1};'.format(self.channel, messagePg )
        print query
        curs.execute(query)

Now that we've implemented the class we can use it. The producer will be quite simple, and the consumer will need to either patch the recvCallback method or subclass PgQueue to override that method. We'll use the former and patch the method. We'll run the producer and the consumer in separate threads.

def sample_producer_thread():
    q = PgQueue('botchan', dbname='dbname', dbuser='username', dbpass='password')

    while(True):
        time.sleep(0.4)
        message = {}
        message['test'] = "value"
        q.send(message)

def sample_consumer_thread():
    q = PgQueue('botchan', dbname='dbname', dbuser='username', dbpass='password')

    def newCallback(m):
        if m.payload:
            payload = m.payload
            print "callback: ", payload

    # replace the receiver callback
    q.recvCallback = newCallback
    q.recvLoop()

if __name__ == '__main__':
    import signal 
    from threading import Thread

    thread_producer = Thread(target=sample_producer_thread)
    thread_consumer = Thread(target=sample_consumer_thread)
    thread_producer.start()
    thread_consumer.start()
    thread_producer.join()
    thread_consumer.join()

Putting together user notifications

The regexes below are creating tsquery-compatible strings. Then those strings are used to run full-text searches on the job table. This way we can build notifications for each user and for each of their active search streams.

The last_job_ts is used to make sure we limit our searches to the new data.

We make use of wCTE (WITH common table expressions) because they're easy to work with and allow for gradually refining results of previous queries until the desired data can be extracted.

Near the end of the query we neatly pack all the data using PostgreSQL's JSON functions.

WITH user_notifs AS (
    SELECT
    id,
    last_job_ts,
    search,
    chat_id,
    regexp_replace(
           LOWER(
               regexp_replace(
                   rtrim(ltrim(search,' '),' '),
                   '\s+',' ','g'
               )
           ),
        '(\s*,\s*|\s)' , ' & ', 'g'
    )
    AS fts_query
    FROM odesk_telegram
    WHERE paused = false AND deleted = false
), jobs AS (
    SELECT A.job_id, A.tsv_basic, B.job_title, B.date_created
    FROM odesk_search_job A
    JOIN odesk_job B ON A.job_id = B.job_id
    WHERE B.date_created > EXTRACT(epoch FROM (NOW() - INTERVAL '6 HOURS'))::int
), new AS (
    SELECT
    A.id, A.chat_id, A.search, B.job_id, B.job_title, B.date_created
    FROM user_notifs AS A
    JOIN jobs B ON (
        B.tsv_basic @@ A.fts_query::tsquery AND
        B.date_created > A.last_job_ts
    )
), json_packed AS (
    SELECT
    A.id,
    A.search,
    A.chat_id,
    json_agg(
    json_build_object (
        'job_id', A.job_id,
        'job_title', A.job_title,
        'date_created', A.date_created
    )) AS j
    FROM new A
    GROUP BY A.id, A.search, A.chat_id
)
SELECT * FROM json_packed;

Tightening the constraints

Near the end of the collector program we compute the notifications; an expensive search query needs to be run in order to find out what to send and whom to send it to.

However, each time this query runs, we can store the latest timestamp on that search stream, so that the next time we compute notifications we can tighten the search constraints and only search after that timestamp.

In order to do this, the search streams' last_job_ts needs to be updated:

UPDATE odesk_telegram SET last_job_ts = %(new_ts)d WHERE id = %(id)d;

For the active search streams that had new jobs, the earliest timestamp can be passed as a parameter to this query.

Even for the active search streams that have seen no new jobs, we still have to tighten the search by updating their last_job_ts to the time when the collector started (we can only go so far, any later than this and we might miss jobs that were posted while the collector was running).
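A minimal sketch of how the collector might apply this update (assuming psycopg2; note that psycopg2 placeholders are %s-style regardless of the value's type):

def tighten_stream(conn, stream_id, new_ts):
    # move the stream's lower bound forward so the next notification
    # query only has to look at jobs newer than new_ts
    with conn.cursor() as curs:
        curs.execute(
            "UPDATE odesk_telegram SET last_job_ts = %(new_ts)s WHERE id = %(id)s",
            {'new_ts': new_ts, 'id': stream_id})
    conn.commit()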

Optimizing the search streams

If enough data and enough users are present, we could craft a better query for this. For example, the search keywords could be organized in a tree structure.

This particular tree would store keywords based on the number of search results they're present in; in other words, the more queries a keyword exists in, the closer to the root that keyword will be.

A search stream corresponds to a path in this tree.

For example, in the tree below, php, wordpress is a search stream and user3 has registered for that particular search stream. Accordingly, user3 will receive job ads that match the words php and wordpress. Given the logic described above, php will match more jobs than wordpress 9.

A temporary table can be created for the high-volume top-level search streams. To get to the more specific search streams, a JOIN on this table followed by the conditions for the lower-volume keywords would be enough.

For example, there are two search streams php,wordpress (for user3) and php,mysql (for user2). We could cache the ids of the notifications for the larger stream php and then refine it in order to get the two streams we're interested in.

This would be particularly interesting for a situation with a large number of subscribed users and a lot of active search streams as the tree would expand, preferably in depth rather than breadth.

Conclusion

This blog post describes the redesign of a notification system and some ideas about improving its performance.

Footnotes:

1

Notifications in PostgreSQL have been around for a long time, but starting with version 9.0 they are equipped with a payload.

2

The message passing described in this blog post is a simple one.

There are situations in which message delivery is more critical than the method described here.

For that purpose, this article provides a better approach that handles delivery failures. It uses a queue table and two triggers. The queue table is used to persist the messages that are sent.

One trigger will be placed on the table whose changes are of interest.

The other trigger will be placed on the queue table. So the message will be the "cascading" result of modifying the actual table. The advantage here is persistence (among other advantages that you can read in that article).

What's more, for the first trigger, the function row_to_json offers a way to serialize the changes in a structure-agnostic way.

Here's another article describing a queue-table-centric approach (without any LISTEN/NOTIFY). There's an emphasis put on locking and updating the 'processed' state of each item in the queue table, and different approaches for that. This would be more in line with a distributed queue.

3

For example this presentation in which the author explains how Postgres sends notifications to a Python backend via LISTEN/NOTIFY, which are then forwarded to the browser via websockets. The presentation is also available on youtube here.

5

This article describes a scenario where a PostgreSQL database updates a cache by broadcasting changes to it.

6

Although the code here is in Python, you may certainly use PostgreSQL's notifications in other languages (it's a well-supported feature) including the following:

7

More details about the conn.poll() statement. The poll() method comes from psycopg2 (it's called conn_poll in the C code) and it reads all the notifications from PQsocket (the ones we have issued a LISTEN statement for). There are 5 functions involved:

  • conn_poll (which in turn calls _conn_poll_query)
  • _conn_poll_query (in turn, calls pq_is_busy)
  • pq_is_busy (which calls conn_notifies_process and conn_notice_process)
  • conn_notifies_process (reads them from the PQsocket and populates C data structures)
  • conn_notice_process (turns the available notifications into Python data structures)
8

Keep in mind that the payloads are limited to 8000 bytes.

9

This section about optimizing and caching search results is just an idea at this point. Details such as how to keep the tree updated, which search streams should be cached in the temporary table, and how to represent it are not yet worked out. It's not yet implemented, it will probably be used later on.

May 25, 2016 04:00 AM

May 21, 2016

Boston GIS (Regina Obe, Leo Hsu)

pgRouting 2.2.3 released with support for PostgreSQL 9.6beta1

pgRouting 2.2.3 was released last week. Main change is this version now supports PostgreSQL 9.6. Many thanks to Vicky Vergara for working thru the issues with PostgreSQL 9.6 and getting it to work. Vicky has also been doing a good chunk of the coding (a lot of Boost refactoring and integrating more Boost features), testing, and documentation in pgRouting, osm2pgrouting, and QGIS pgRoutingLayer in general for pgRouting 2.1, 2.2, and upcoming 2.3. We are very indebted to her for her hard work.

If you are a windows user testing the waters of PostgreSQL 9.6beta1, we have pgRouting 2.2.3 binaries and PostGIS 2.3.0dev binaries at http://postgis.net/windows_downloads.


Continue reading "pgRouting 2.2.3 released with support for PostgreSQL 9.6beta1"

by Regina Obe (nospam@example.com) at May 21, 2016 04:24 PM

May 13, 2016

Postgres OnLine Journal (Leo Hsu, Regina Obe)

PLV8 binaries for PostgreSQL 9.6beta1 windows both 32-bit and 64-bit

To celebrate recent release of PostgreSQL 9.6beta1, we've started to experiment with our favorite extensions. For starters, PLV8 (aka PL/JavaScript) binaries listed below and upcoming PostGIS 2.3.0 and ogr_fdw detailed here


Continue reading "PLV8 binaries for PostgreSQL 9.6beta1 windows both 32-bit and 64-bit"

by Leo Hsu and Regina Obe (nospam@example.com) at May 13, 2016 10:56 PM

Boston GIS (Regina Obe, Leo Hsu)

PostgreSQL 9.6beta1 out, help test PostGIS 2.3 windows binaries available

PostgreSQL 9.6beta1 came out yesterday. It is the first version of PostgreSQL that will have parallel query support and PostGIS 2.3 will be the first PostGIS to support parallelism in queries. Although work is not yet committed in PostGIS repo to support this, you can expect to see this soon (currently here - https://github.com/pramsey/postgis/tree/parallel , waiting for someone to you know who you are do something about it.)

UPDATE: Some parallel support has been committed. More to come. pgRouting 9.6 issues resolved, many thanks to Vicky Vergara. Now pgRouting 2.2.3 windows binaries available for PostgreSQL 9.6beta1. Also recently tested these with BigSQL 9.6beta1 Windows 64-bit (at least the PostGIS ones), and seem to work fine. BigSQL paths are a little different from EDB (so be careful when copying to copy to right folder, zip structure follows the EDB structure. Also make sure not to overwrite files that came packaged with BigSQL, since there is more file overlap since we both use mingw64, but their gcc is newer).

Because of the newness of the parallelization feature, there are some caveats. As with all big things, we expect there to be a lot of bugs, and the more eyeballs on those and real workloads we've got hammering on them, the sweeter the PostGIS 2.3.0 and PostgreSQL 9.6 release will be.

Binaries for Windows users

For windows users, winnie the PostGIS windows buildbot is now building PostGIS for 9.6. Get PostgreSQL 9.6 binaries and installers from PostgreSQL 9.6beta1 for windows.

Once you have that, just copy the contents of the respective PostGIS 2.3 9.6 binaries listed here - http://postgis.net/windows_downloads/ into your install folder.

In the extras folder, you'll also find ogr_fdw foreign data wrapper latest development version which we covered extensively in FOSS4GNA2016 PostGIS Spatial Tricks. Talk also covered some new PostGIS 2.3.0 stuff.

We don't have pgRouting binaries available yet. pgRouting team is working out some compatibility issues with PostgreSQL 9.6. Once those are resolved, we will publish pgRouting binaries as well.

by Regina Obe (nospam@example.com) at May 13, 2016 09:02 PM

May 10, 2016

UpStats blog (Stefan Petrea)

Parallel testing multiple versions of ImageMagick

While investigating some test failures on a project involving ImageMagick (a parser for the metadata that ImageMagick extracts out of documents), I noticed that different versions of ImageMagick were producing outputs in different formats. Not only were the formats different, but some of the data was missing or the values were different.

I only knew about two versions that were affected:

  • 6.7.7-10 2014-03-06 Q16
  • 6.9.2-4 Q16 x86_64 2015-11-27

On the newer one I got failures, so I quickly solved them. But I went a bit further, because I was interested to know which other versions in between these two were failing any tests.

I found the ImageMagick git repository for Debian. I cloned it, and looked at the branches. There were multiple branches, but the ones I was interested in were origin/debian/*.

Next I wrote a bash script to build all versions from 6.7.7.9-1 all the way up to 6.9.2.3-1.

#!/bin/bash
GIT_REPO=https://anonscm.debian.org/git/collab-maint/imagemagick.git
GIT_BRANCHES=()
GIT_BRANCHES+=('remotes/origin/debian/6.7.7.9-1')
# ...
GIT_BRANCHES+=('remotes/origin/debian/6.9.2.3-1')

# Clone the debian repository
git clone $GIT_REPO imagemagick

# We're only interested in the identify binary so we aim
# at making a custom build for each branch. We're not
# building any documentation, we're also not going to let
# the IM build run any tests, we only want the binaries.
for BRANCH in "${GIT_BRANCHES[@]}"; do
    BRANCH_VER=$(echo "$BRANCH" | perl -ne 'm{(\d+(?:\.\d+)+.*)$} && print "$1"')
    IM_VER=../custom-build/im-$BRANCH_VER/
    pushd .
    cd imagemagick
    # remove any unversioned files created by the previous build
    git clean -dfx .
    # undo any changes to versioned files
    git reset --hard HEAD
    # checkout a new branch for the build
    git checkout $BRANCH
    # building a minimal imagemagick for testing purposes
    autoreconf -i
    ./configure --without-perl --without-magick-plus-plus --disable-docs --prefix="$PWD/$IM_VER"
    # speed up the build with 4 parallel jobs
    make -j4
    mkdir -p "$IM_VER"
    make install
    popd
done

It took one hour to build 57 different 1 versions of ImageMagick, resulting in a directory (4.2 GB) with each version's binaries.

Now I wanted to know on which versions of ImageMagick, the unit-tests were failing. So I wrote another bash script to run the tests in parallel for each version of ImageMagick. To parallelize the unit tests I've only made use of xargs -P which I read about some time ago 2 (there are other ways to do that too, but I found this one to be the easiest in this situation).

#!/bin/bash
BUILD_PATHS=$(find custom-build/ -maxdepth 1 -mindepth 1)
OUT_DIR=$(mktemp -d)
for P in $BUILD_PATHS; do
    # extract IM version from the directory
    DIR_VERSION=$(echo "$P" | perl -ne 'm{(\d+(?:\.\d+)+[^\/]*)\/?} && print "$1"')
    # build output path
    OUT_FILE="$OUT_DIR/$DIR_VERSION"'.txt'
    # run the tests using a custom built version of imagemagick
    # (prepend custom built path so it can be used by the tests)
    NEW_PATH="$PWD/$P/bin:$PATH"
    CMD="PATH=\"" "$NEW_PATH" "\" ./run-tests.sh >$OUT_FILE 2>&1;"
    echo "$CMD"
done | xargs -I% -P 6 /bin/bash -c '%';

echo "$OUT_DIR"
echo "Test reports with failures:"
grep -lir FAIL "$OUT_DIR"

It took under 33 seconds to run through all the versions, and I got this result

Test reports with failures:
/tmp/tmp.8f4zL7ImPd/6.8.0.7-1.txt
/tmp/tmp.8f4zL7ImPd/6.8.1.5-1.txt
/tmp/tmp.8f4zL7ImPd/6.8.1.0-1.txt
/tmp/tmp.8f4zL7ImPd/6.8.0.10-1.txt

Now I was down to 4 ImageMagick versions that I had to look into. Looking further into those failed tests, they were testing properties of a tree built from the PDF metadata. For those versions, IM only outputs data for the first page of the PDF, not for each page of the PDF (as is the case with most other versions), which is why they were failing.

I then used unittest.skipIf to skip those tests if the IM version used was one of the four ones listed above.
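A sketch of what that can look like (the version-detection helper and the exact version strings below are illustrative, not the project's actual code):

import re
import subprocess
import unittest

def identify_version():
    """Return e.g. '6.8.0-7' as reported by `identify -version`."""
    out = subprocess.check_output(['identify', '-version']).decode('utf-8', 'replace')
    match = re.search(r'ImageMagick (\d+\.\d+\.\d+-\d+)', out)
    return match.group(1) if match else ''

# builds whose PDF metadata output only covers the first page
SINGLE_PAGE_PDF_VERSIONS = {'6.8.0-7', '6.8.0-10', '6.8.1-0', '6.8.1-5'}

class PdfMetadataTests(unittest.TestCase):
    @unittest.skipIf(identify_version() in SINGLE_PAGE_PDF_VERSIONS,
                     "this ImageMagick build only reports metadata for the first PDF page")
    def test_metadata_present_for_every_page(self):
        pass  # the real assertions would go here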

This made for an interesting analysis, and I was happy to learn more about the way IM works.

Footnotes:

1

There's a more complicated scenario in which all dependencies (libraries IM depends on) would be taken into consideration, not just IM's source code. The assumption here is that IM alone is responsible for the output of the identify utility. And that's a fair assumption, at least for the format of the output.

May 10, 2016 04:00 AM

April 29, 2016

Paul Ramsey

OGR FDW Update

I’ve had a productive couple of weeks here, despite the intermittently lovely weather and the beginning of Little League baseball season (not coaching, just supporting my pitcher-in-training).

13 Days

The focus of my energies has been a long-awaited (by me) update to the OGR FDW extension for PostgreSQL. By binding the multi-format OGR library into PostgreSQL, we get access to the many formats supported by OGR, all with just one piece of extension code.

As usual, the hardest part of the coding was remembering how things worked in the first place! But after getting my head back in the game the new code flowed out and now I can reveal the new improved OGR FDW!


The new features are:

  • Column name mapping between OGR layers and PgSQL tables is now completely configurable. The extension will attempt to guess mapping automagically, using names and type consistency, but you can over-ride mappings using the table-level column_name option.
  • Foreign tables are now updateable! That means, for OGR sources that support it, you can run INSERT, UPDATE and DELETE commands on your OGR FDW tables and the changes will be applied to the source.

    • You can control which tables and foreign servers are updateable by setting the UPDATEABLE option on the foreign server and foreign table definitions (see the sketch after this list).
  • PostgreSQL 9.6 is supported. It’s not released yet, but we can now build against it.
  • Geometry type and spatial reference system are propagated from OGR. If your OGR source defines a geometry type and spatial reference identifier, the FDW tables in PostGIS will now reflect that, for easier integration with your local geometry data.
  • GDAL2 and GDAL1 are supported. Use of GDAL2 syntax has been made the default in the code-base, with mappings back to GDAL1 for compatibility, so the code is now future-ready.
  • Regression tests and continuous integration are in place, for improved code reliability. Thanks to help from Even Rouault, we are now using Travis-CI for integration testing, and I’ve enabled a growing number of integration tests.
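As a rough sketch of what the column-mapping and updateability options look like in practice (the datasource path, layer and column names are invented; check the ogr_fdw documentation for the authoritative option list):

CREATE SERVER shapedata
  FOREIGN DATA WRAPPER ogr_fdw
  OPTIONS (datasource '/data/boundaries', format 'ESRI Shapefile', updateable 'true');

CREATE FOREIGN TABLE parcels (
  fid bigint,
  owner_name varchar OPTIONS (column_name 'OWNER_NM'),  -- explicit column mapping
  geom geometry
) SERVER shapedata
  OPTIONS (layer 'parcels', updateable 'true');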

As usual, I’m in debt to Regina Obe for her timely feedback and willingness to torture-test very fresh code.

For now, early adopters can get the code by cloning and building the project master branch, but I will be releasing a numbered version in a week or two when any obvious bugs have been shaken out.

April 29, 2016 06:00 PM

April 21, 2016

Postgres OnLine Journal (Leo Hsu, Regina Obe)

PGConfUS 2016 PostGIS slides and tutorial material

We gave a PostGIS Intro Training and a PostGIS talk at PGConfUS 2016 in Brooklyn, New York and just got back. A number of people asked if we'd make the slides and material available. We have these posted on our presentation page: http://www.postgis.us/presentations and will be putting them on the PostgreSQL Wiki as well in due time. There will be a video coming along for the talk, but the training was not recorded.

We also have two more talks coming up in North Carolina in Early May at FOSS4G NA 2016 - one on PostGIS Spatial Tricks which has more of a GIS specialist focus than the top 10 talk we gave, but there will be some overlap. The other talk is a topic a couple of people asked us in training and after our talk, on routing along constrained paths. If you are attending FOSS4G NA 2016, you won't want to miss our talk pgRouting: A Crash Course which is also the topic of our upcoming book.

Just like FOSS4G NA 2015, there is a pgDay track which is PostgreSQL specific material, useful to a spatial crowd, but not PostGIS focused.


Continue reading "PGConfUS 2016 PostGIS slides and tutorial material"

by Leo Hsu and Regina Obe (nospam@example.com) at April 21, 2016 05:04 PM

April 19, 2016

Archaeogeek (Jo Cook)

PortableGIS 5.6

I’m pleased to announce the latest release of Portable GIS. This version (v5.6) has the following changes:

  • QGIS 2.14.1 LTR
  • By popular demand: Geoserver 2.8

You can download the setup exe and the md5 checksum here.

Older versions are still available but have been archived to avoid confusion.

As always, please let me know of any problems via the Portable GIS google group.

Note that I will shortly be publicising a GitLabs repository for the changed files, along with developer and user documentation, to allow people to roll their own versions or contribute to development. This work is nearly complete, so watch this space!

April 19, 2016 10:20 AM

April 15, 2016

Postgres OnLine Journal (Leo Hsu, Regina Obe)

First Look at pgAdmin 4

When David Page announced pgAdmin 4, I was really excited to try it out. I was impressed I could compile it so easily on windows. I had a few bumps, but not too bad.

One of the reasons I'm excited about it is that it's built on Python and a web framework, and there is a large Python and web developer following in the GIS community, so I suspect someone will step up to the plate to add a mapviewer plugin to this so I can have a seamless PostGIS experience.

The interface is also very slick and pretty and I love the sorting and paging capability now in the query window. Check this sampling from our workshop database.


Continue reading "First Look at pgAdmin 4 "

by Leo Hsu and Regina Obe (nospam@example.com) at April 15, 2016 11:16 PM

April 05, 2016

PostGIS Development

Nautilytics Case Study

Nautilytics is a small data visualization and GIS startup based out of Boston, MA. We use PostGIS and PostgreSQL, among other open-source tools to build powerful web applications for US government organizations, public, and private sector companies.

Continue Reading by clicking title hyperlink ..

by Christopher Lanoue at April 05, 2016 12:00 AM

April 04, 2016

PostGIS Development

PostGIS for fast prototyping and Research

I used PostGIS (+ecosystem) extensively for my PhD thesis, in several ways. The first is that PostGIS is a good steady horse (elephant?): a database is the perfect place to store a lot of very different information in the same place and put it in relation. For geospatial data, PostGIS means you always have a way to put data in relation (are they at the same place?).

Continue Reading by clicking title hyperlink ..

by Rémi Cura at April 04, 2016 12:00 AM

April 02, 2016

Postgres OnLine Journal (Leo Hsu, Regina Obe)

PostGIS 2.2 Windows users hold off on installing latest PostgreSQL patch release

Someone reported recently on the PostGIS mailing list that they were unable to install the PostGIS 2.2.1 bundle or PostGIS 2.2.2 binaries on a clean PostgreSQL 9.5.2 install. Someone also complained about PostgreSQL 9.3 (though the exact version isn't clear), so it's not clear whether that is a separate issue or the same one. I have tested on PostgreSQL 9.5.2 Windows 64-bit and confirmed the issue. The issue does not affect PostgreSQL 9.5.1 and older. I haven't confirmed it's an issue with the 32-bit installs, but I suspect so too. This issue will affect OGR_FDW users and people who used our compiled WWW_FDW.


Continue reading "PostGIS 2.2 Windows users hold off on installing latest PostgreSQL patch release"

by Leo Hsu and Regina Obe (nospam@example.com) at April 02, 2016 03:34 PM

March 31, 2016

Paul Ramsey

Parallel PostGIS Joins

In my earlier post on new parallel query support coming in PostgreSQL 9.6 I was unable to come up with a parallel join query, despite much kicking and thumping the configuration and query.


It turns out, I didn’t have all the components of my query marked as PARALLEL SAFE, which is required for the planner to attempt a parallel plan. My query was this:

EXPLAIN ANALYZE 
 SELECT Count(*) 
  FROM pd 
  JOIN pts 
  ON pd.geom && pts.geom
  AND _ST_Intersects(pd.geom, pts.geom);

And _ST_Intersects() was marked as safe, but I neglected to mark the function behind the && operator – geometry_overlaps – as safe. With both functions marked as safe, and assigned a hefty function cost of 1000, I get this query:

 Nested Loop  
 (cost=0.28..1264886.46 rows=21097041 width=2552) 
 (actual time=0.119..13876.668 rows=69534 loops=1)
   ->  Seq Scan on pd  
   (cost=0.00..14271.34 rows=69534 width=2512) 
   (actual time=0.018..89.653 rows=69534 loops=1)
   ->  Index Scan using pts_gix on pts  
   (cost=0.28..17.97 rows=2 width=40) 
   (actual time=0.147..0.190 rows=1 loops=69534)
         Index Cond: (pd.geom && geom)
         Filter: _st_intersects(pd.geom, geom)
         Rows Removed by Filter: 2
 Planning time: 8.365 ms
 Execution time: 13885.837 ms

Hey wait! That’s not parallel either!

It turns out that parallel query involves a secret configuration sauce, just like parallel sequence scan and parallel aggregate, and naturally it’s different from the other modes (gah!)

The default parallel_tuple_cost is 0.1. If we reduce that by an order of magnitude, to 0.01, we get this plan instead:

 Gather  
 (cost=1000.28..629194.94 rows=21097041 width=2552) 
 (actual time=0.950..6931.224 rows=69534 loops=1)
   Number of Workers: 3
   ->  Nested Loop  
   (cost=0.28..628194.94 rows=21097041 width=2552) 
   (actual time=0.303..6675.184 rows=17384 loops=4)
         ->  Parallel Seq Scan on pd  
         (cost=0.00..13800.30 rows=22430 width=2512) 
         (actual time=0.045..46.769 rows=17384 loops=4)
         ->  Index Scan using pts_gix on pts  
         (cost=0.28..17.97 rows=2 width=40) 
         (actual time=0.285..0.366 rows=1 loops=69534)
               Index Cond: (pd.geom && geom)
               Filter: _st_intersects(pd.geom, geom)
               Rows Removed by Filter: 2
 Planning time: 8.469 ms
 Execution time: 6945.400 ms

Ta da! A parallel plan, and executing almost twice as fast, just like the doctor ordered.

Complaints

Mostly the parallel support in core “just works” as advertised. PostGIS does need to mark our functions as quite costly, but that’s reasonable since they actually are quite costly. What is not good is the need to tweak the configuration once the functions are properly costed:

  • Having to cut parallel_tuple_cost by a factor of 10 for the join case is not any good. No amount of COST increases seemed to have an effect, only changing the core parameter did.
  • Having to increase the cost of functions used in aggregates by a factor of 100 over cost of functions used in sequence filters is also not any good.

So, with a few more changes to PostGIS, we are quite close, but the planner for parallel cases needs to make more rational use of function costs before we have achieved parallel processing nirvana for PostGIS.

March 31, 2016 06:00 PM

March 29, 2016

PostGIS Development

Plenario (University of Chicago)

The Urban Center for Computation and Data (UrbanCCD) is a research initiative of Argonne National Laboratory and the Computation Institute of the University of Chicago. We create computational tools to better understand cities. One of these is Plenario, our hub for open geospatial data. PostGIS makes the spatial operations at the heart of Plenario possible.

Continue Reading by clicking title hyperlink ..

by William Engler at March 29, 2016 12:00 AM

March 26, 2016

Paul Ramsey

Parallel PostGIS

Parallel query support in PostgreSQL in the upcoming 9.6 release will be available for a number of query types: sequence scans, aggregates and joins. Because PostGIS tends to involve CPU-intensive calculations on geometries, support for parallel query has been at the top of our request list to the core team for a long time. Now that it is finally arriving the question is: does it really help?


TL;DR:

  • With some adjustments to function COST both parallel sequence scan and parallel aggregation deliver very good parallel performance results.
  • The cost adjustments for sequence scan and aggregate scan are not consistent in magnitude.
  • Parallel join does not seem to work for PostGIS indexes yet, but perhaps there is some magic to learn from PostgreSQL core on that.

Setup

In order to run these tests yourself, you will need to check out and build:

  • PostgreSQL 9.6 (beta)
  • PostGIS 2.3 (development branch)

For testing, I used the set of 69534 polling divisions defined by Elections Canada.

shp2pgsql -s 3347 -I -D -W latin1 PD_A.shp pd | psql parallel

It’s worth noting that this data set is, in terms of number of rows, very very small as databases go. This will become important as we explore the behaviour of the parallel processing, because the assumptions of the PostgreSQL developers about what constitutes a “parallelizable load” might not match our assumptions in the GIS world.

With the data loaded, we can do some tests on parallel query. Note that there are some new configuration options for parallel behaviour that will be useful during testing:

  • max_parallel_degree sets the maximum degree of parallelism for an individual parallel operation. Default 0.
  • parallel_tuple_cost sets the planner’s estimate of the cost of transferring a tuple from a parallel worker process to another process. The default is 0.1.
  • parallel_setup_cost sets the planner’s estimate of the cost of launching parallel worker processes. The default is 1000.
  • force_parallel_mode allows the use of parallel queries for testing purposes even in cases where no performance benefit is expected. Default ‘off’.

Parallel Sequence Scan

Before we can test parallelism, we need to turn it on! The default max_parallel_degree is zero, so we need a non-zero value. For my tests, I’m using a 2-core laptop, so:

SET max_parallel_degree=2;

Now we are ready to run a query with a spatial filter. Using EXPLAIN ANALYZE suppressed the actual answer in favour of returning the query plan and the observed execution time:

EXPLAIN ANALYZE 
  SELECT Count(*) 
    FROM pd 
    WHERE ST_Area(geom) > 10000;

And the answer we get back is:

 Aggregate  
 (cost=14676.95..14676.97 rows=1 width=8) 
 (actual time=757.489..757.489 rows=1 loops=1)
   ->  Seq Scan on pd  
   (cost=0.00..14619.01 rows=23178 width=0) 
   (actual time=0.160..747.161 rows=62158 loops=1)
         Filter: (st_area(geom) > 10000)
         Rows Removed by Filter: 7376
 Planning time: 0.137 ms
 Execution time: 757.553 ms

Two things we can learn here:

  • There is no parallelism going on here, the query plan is just a single-threaded one.
  • The single-threaded execution time is about 750ms.

Now we have a number of options to fix this problem:

  • We can force parallelism using force_parallel_mode, or
  • We can force parallelism by decreasing the parallel_setup_cost, or
  • We can adjust the cost of ST_Area() to try and get the planner to do the right thing automatically.

It turns out that the current definition of ST_Area() has a default COST setting, so it is considered to be no more or less expensive than something like addition or subtraction. Since calculating area involves multiple floating point operations per polygon segment, that’s a stupid cost.

In general, all PostGIS functions are going to have to be reviewed and costed to work better with parallelism.

If we redefine ST_Area() with a big juicy cost, things might get better.

CREATE OR REPLACE FUNCTION ST_Area(geometry)
  RETURNS FLOAT8
  AS '$libdir/postgis-2.3','area'
  LANGUAGE 'c' IMMUTABLE STRICT PARALLEL SAFE
  COST 100;

Now the query plan for our filter is much improved:

Finalize Aggregate  
(cost=20482.97..20482.98 rows=1 width=8) 
(actual time=345.855..345.856 rows=1 loops=1)
->  Gather  
   (cost=20482.65..20482.96 rows=3 width=8) 
   (actual time=345.674..345.846 rows=4 loops=1)
     Number of Workers: 3
     ->  Partial Aggregate  
         (cost=19482.65..19482.66 rows=1 width=8) 
         (actual time=336.663..336.664 rows=1 loops=4)
           ->  Parallel Seq Scan on pd  
               (cost=0.00..19463.96 rows=7477 width=0) 
               (actual time=0.154..331.815 rows=15540 loops=4)
                 Filter: (st_area(geom) > 10000)
                 Rows Removed by Filter: 1844
Planning time: 0.145 ms
Execution time: 349.345 ms

Three important things to note:

  • We have a parallel query plan!
  • Some of the execution results output are wrong! They say that only 1844 rows were removed by the filter, but in fact 7376 were (as we can confirm by running the queries without the EXPLAIN ANALYZE). This is a known limitation, reporting on the results of only one parallel worker, which (should) maybe, hopefully be fixed before 9.6 comes out.
  • The execution time has been halved, just as we would hope for a 2-core machine!

Now for the disappointing part, try this:

EXPLAIN ANALYZE
  SELECT ST_Area(geom) 
    FROM pd;

Even though the work being carried out (run ST_Area() on 70K polygons) is exactly the same as in our first example, the planner does not parallelize it, because the work is not in the filter.

 Seq Scan on pd  
 (cost=0.00..31654.84 rows=69534 width=8) 
 (actual time=0.130..722.286 rows=69534 loops=1)
 Planning time: 0.078 ms
 Execution time: 727.344 ms

For geospatial folks, who tend to do a fair amount of expensive calculation in the SELECT parameters, this is a bit disappointing. However, we still get impressive parallelism on the filter!

Parallel Aggregation

The aggregate most PostGIS users would like to see parallelized is ST_Union() so it’s worth explaining why that’s actually a little hard.

PostgreSQL Aggregates

All aggregate functions in PostgreSQL consist of at least two functions:

  • A “transfer function” that takes in a value and a transfer state, and adds the value to the state. For example, the Avg() aggregate has a transfer state consisting of the sum of all values seen so far, and the count of all values processed.
  • A “final function” that takes in a transfer state and converts it to the final aggregate output value. For example, the Avg() aggregate final function divides the sum of all values by the count of all values and returns that number.

For parallel processing, PostgreSQL adds a third kind of function:

  • A “combine” function, that takes in two transfer states and outputs a single combined state. For the Avg() aggregate, this would add the sums from each state and counts from each state and return that as the new combined state.

So, in order to get parallel processing in an aggregate, we need to define “combine functions” for all the aggregates we want parallelized. That way the master process can take the completed transfer states of all parallel workers, combine them, and then hand that final state to the final function for output.

To sum up, in parallel aggregation:

  • Each worker runs “transfer functions” on the records it is responsible for, generating a partial “transfer state”.
  • The master takes all those partial “transfer states” and “combines” them into a “final state”.
  • The master then runs the “final function” on the “final state” to get the completed aggregate.

Note where the work occurs: the workers only run the transfer functions, and the master runs both the combine and final functions.
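To make that concrete, here is a toy parallel-ready average aggregate (not from the post), with the transfer state held in a two-element {sum, count} array:

CREATE FUNCTION my_avg_transfn(state float8[], val float8)
RETURNS float8[] LANGUAGE sql IMMUTABLE PARALLEL SAFE AS $$
  -- transfer function: fold one value into the running {sum, count} state
  SELECT ARRAY[state[1] + coalesce(val, 0), state[2] + (val IS NOT NULL)::int];
$$;

CREATE FUNCTION my_avg_combinefn(s1 float8[], s2 float8[])
RETURNS float8[] LANGUAGE sql IMMUTABLE PARALLEL SAFE AS $$
  -- combine function: merge the partial states from two workers
  SELECT ARRAY[s1[1] + s2[1], s1[2] + s2[2]];
$$;

CREATE FUNCTION my_avg_finalfn(state float8[])
RETURNS float8 LANGUAGE sql IMMUTABLE PARALLEL SAFE AS $$
  -- final function: turn the combined state into the aggregate result
  SELECT CASE WHEN state[2] > 0 THEN state[1] / state[2] END;
$$;

CREATE AGGREGATE my_avg(float8) (
  sfunc       = my_avg_transfn,
  stype       = float8[],
  combinefunc = my_avg_combinefn,
  finalfunc   = my_avg_finalfn,
  initcond    = '{0,0}',
  parallel    = safe
);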

PostGIS ST_Union Aggregate

One of the things we are proud of in PostGIS is the performance of our ST_Union() implementation, which gains performance from the use of a cascaded union algorithm.

Cascaded union involves the following steps:

  • Collects all the geometries of interest into an array (aggregate transfer function), then
  • Builds a tree on those geometries and unions them from the leaves of the tree upwards (aggregate final function).

Note that all the hard work happens in the final step. The transfer functions (which is what would be run on the workers) do very little work, just gathering geometries into an array.

Converting this process into a parallel one by adding a combine function that does the union would not make things any faster, because the combine step also happens on the master. What we need is an approach that does more work during the transfer function step.

PostGIS ST_MemUnion Aggregate

“Fortunately” we have such an aggregate, the old union implementation from before we added “cascaded union”. The “memory friendly” union saves memory by not building up the array of geometries in memory, at the cost of spending lots of CPU unioning each input geometry into the transfer state.

In that respect, it is the perfect example to use for testing parallel aggregate.

The non-parallel definition of ST_MemUnion() is this:

CREATE AGGREGATE ST_MemUnion (
  basetype = geometry,
  sfunc = ST_Union,
  stype = geometry
 );

No special types or functions required: the transfer state is a geometry, and as each new value comes in the two-parameter version of the ST_Union() function is called to union it onto the state. There is no final function because the transfer state is the output value. Making the parallel version is as simple as adding a combine function that also uses ST_Union() to merge the partial states:

CREATE AGGREGATE ST_MemUnion (
  basetype = geometry,
  sfunc = ST_Union,
  combinefunc = ST_Union,
  stype = geometry
 );

Now we can run an aggregation using ST_MemUnion() to see the results. We will union the polling districts of just one riding, so 169 polygons:

EXPLAIN ANALYZE 
  SELECT ST_Area(ST_MemUnion(geom)) 
    FROM pd 
    WHERE fed_num = 47005;

Hm, no parallelism in the plan, and an execution time of 3.7 seconds:

 Aggregate  
 (cost=14494.92..14495.18 rows=1 width=8) 
 (actual time=3784.781..3784.782 rows=1 loops=1)
   ->  Seq Scan on pd  
   (cost=0.00..14445.17 rows=199 width=2311) 
   (actual time=0.078..49.605 rows=169 loops=1)
         Filter: (fed_num = 47005)
         Rows Removed by Filter: 69365
 Planning time: 0.207 ms
 Execution time: 3784.997 ms

We have to bump the cost of the two parameter version of ST_Union() up to 10000 before parallelism kicks in:

CREATE OR REPLACE FUNCTION ST_Union(geom1 geometry, geom2 geometry)
  RETURNS geometry
  AS '$libdir/postgis-2.3','geomunion'
  LANGUAGE 'c' IMMUTABLE STRICT PARALLEL SAFE
  COST 10000;

Now we get a parallel execution! And the time drops down substantially, though not quite a 50% reduction.

 Finalize Aggregate  
 (cost=16536.53..16536.79 rows=1 width=8) 
 (actual time=2263.638..2263.639 rows=1 loops=1)
   ->  Gather  
   (cost=16461.22..16461.53 rows=3 width=32) 
   (actual time=754.309..757.204 rows=4 loops=1)
         Number of Workers: 3
         ->  Partial Aggregate  
         (cost=15461.22..15461.23 rows=1 width=32) 
         (actual time=676.738..676.739 rows=1 loops=4)
               ->  Parallel Seq Scan on pd  
               (cost=0.00..13856.38 rows=64 width=2311) 
               (actual time=3.009..27.321 rows=42 loops=4)
                     Filter: (fed_num = 47005)
                     Rows Removed by Filter: 17341
 Planning time: 0.219 ms
 Execution time: 2264.684 ms

The punchline though, is what happens when we run the query using a single-threaded ST_Union() with cascaded union:

EXPLAIN ANALYZE 
  SELECT ST_Area(ST_Union(geom)) 
    FROM pd 
    WHERE fed_num = 47005;

Good algorithms beat brute force still:

 Aggregate  
 (cost=14445.67..14445.93 rows=1 width=8) 
 (actual time=2031.230..2031.231 rows=1 loops=1)
   ->  Seq Scan on pd  
   (cost=0.00..14445.17 rows=199 width=2311) 
   (actual time=0.124..66.835 rows=169 loops=1)
         Filter: (fed_num = 47005)
         Rows Removed by Filter: 69365
 Planning time: 0.278 ms
 Execution time: 2031.887 ms

The open question is, “can we combine the subtlety of the cascaded union algorithm with the brute force of parallel execution”?

Maybe, but it seems to involve magic numbers: if the transfer function paused every N rows (magic number) and used cascaded union to combine the geometries received thus far, it could possibly milk performance from both smart evaluation and multiple CPUs. The use of a magic number is concerning however, and the approach would be very sensitive to the order in which rows arrived at the transfer functions.

Parallel Join

To test parallel join, we’ll build a synthetic set of points, such that each point falls into one polling division polygon:

CREATE TABLE pts AS 
SELECT 
  ST_PointOnSurface(geom)::Geometry(point, 3347) AS geom, 
  gid, fed_num 
FROM pd;

CREATE INDEX pts_gix 
  ON pts USING GIST (geom);

Points and Polling Divisions

A simple join query looks like this:

EXPLAIN ANALYZE 
 SELECT Count(*) 
  FROM pd 
  JOIN pts 
  ON ST_Intersects(pd.geom, pts.geom);

But the query plan has no parallel elements! Uh oh!

Aggregate  
(cost=222468.56..222468.57 rows=1 width=8) 
(actual time=13830.361..13830.362 rows=1 loops=1)
   ->  Nested Loop  
   (cost=0.28..169725.95 rows=21097041 width=0) 
   (actual time=0.703..13815.008 rows=69534 loops=1)
         ->  Seq Scan on pd  
         (cost=0.00..14271.34 rows=69534 width=2311) 
         (actual time=0.086..90.498 rows=69534 loops=1)
         ->  Index Scan using pts_gix on pts  
         (cost=0.28..2.22 rows=2 width=32) 
         (actual time=0.146..0.189 rows=1 loops=69534)
               Index Cond: (pd.geom && geom)
               Filter: _st_intersects(pd.geom, geom)
               Rows Removed by Filter: 2
 Planning time: 6.348 ms
 Execution time: 13843.946 ms

The plan does involve a nested loop, so there should be an opportunity for parallel join to work its magic. Unfortunately, no variation of the query, the parallel configuration variables, or the function costs will change the situation: the query refuses to parallelize!

SET parallel_tuple_cost=0.001;
SET force_parallel_mode=on;
SET parallel_setup_cost=1;

The ST_Intersects() function is actually a SQL wrapper on top of the && operator and the _ST_Intersects() function, but unwrapping it and using the components directly also has no effect.

EXPLAIN ANALYZE 
 SELECT Count(*) 
  FROM pd 
  JOIN pts 
  ON pd.geom && pts.geom
  AND _ST_Intersects(pd.geom, pts.geom);

The only variant I could get to parallelize omitted the && index operator.

EXPLAIN                       
 SELECT *        
  FROM pd 
  JOIN pts 
  ON _ST_Intersects(pd.geom, pts.geom);

Unfortunately, without the index operator the query is so inefficient that it doesn’t matter that it’s being run in parallel; it would take days to run to completion.

 Gather  
 (cost=1000.00..721919734.88 rows=1611658891 width=2552)
   Number of Workers: 2
   ->  Nested Loop  
   (cost=0.00..576869434.69 rows=1611658891 width=2552)
         Join Filter: _st_intersects(pd.geom, pts.geom)
         ->  Parallel Seq Scan on pd  
         (cost=0.00..13865.73 rows=28972 width=2512)
         ->  Seq Scan on pts  
         (cost=0.00..1275.34 rows=69534 width=40)

So, thus far, parallel query seems to be a damp squib for PostGIS, though I hope with some help from PostgreSQL core we can figure out where the problem lies.

UPDATE: Marking the geometry_overlaps function which is bound to the && operator as PARALLEL SAFE allows PostgreSQL to generate parallel join plans when the index is in place.
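
For reference, marking that function parallel safe is a one-line change, something like the sketch below; the exact signature of the function backing && is an assumption here, so verify it against your PostGIS install first.

-- Assumed signature for the C function bound to &&; check with \df geometry_overlaps
ALTER FUNCTION geometry_overlaps(geometry, geometry) PARALLEL SAFE;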

Conclusions

While it is tempting to think “yay, parallelism! all my queries will run $ncores times faster!” in fact parallelism still only applies in a limited number of cases:

  • When there is a sequence scan large (costly) enough to be worth parallelizing.
  • When there is an aggregate large (costly) enough to be worth parallelizing, and the aggregate function can actually parallelize the work effectively.
  • (Theoretically) when there is a (nested loop) join large (costly) enough to be worth parallelizing.

Additionally there is still work to be done on PostGIS for optimal use of the parallel features we have available:

  • Every function is going to need a cost, and those costs may have to be quite high to signal to the planner that we are not garden variety computations.
  • Differences in COST adjustments for different modes need to be explored: why was a 10000 cost needed to kick the aggregation into action, while a 100 cost sufficed for sequence scan?
  • Aggregation functions that currently back-load work to the final function may have to be re-thought to do more work in the transfer stage.
  • Whatever issue is preventing our joins from parallelizing needs to be tracked down.

All that being said, the potential is to see a large number of queries get $ncores times faster, so this promises to be the most important core development we’ve seen since the extension framework arrived back in PostgreSQL 9.1.

March 26, 2016 08:00 PM

Boston GIS (Regina Obe, Leo Hsu)

FOSS4G NA 2016 Early Bird ending soon and come see us

Early bird registration for FOSS4G NA 2016 in North Carolina (running May 2nd-5th, with May 2nd as the workshop day) will be ending this month. This means that after this month, ticket prices will go up by as much as $200, depending on what kind of pass you need. If you are a speaker you should register now so we have a good head-count. Hotel prices will also go up after this coming week, since the conference room block will expire, so that's all the more reason to register now.

On a related note, both Leo and I will be giving talks.

There will be other PostGIS talks and PostgreSQL day (May 3rd) talks for PostGIS and PostgreSQL users.

by Regina Obe (nospam@example.com) at March 26, 2016 07:26 PM

March 22, 2016

PostGIS Development

PostGIS 2.2.2 Released

The PostGIS development team is happy to release a patch for PostGIS 2.2, the 2.2.2 release. As befits a patch release, the focus is on bugs and breakages.

Bug Fixes and Improvements

  • #3463, Fix crash on face-collapsing edge change
  • #3422, Improve ST_Split robustness on standard precision double systems (arm64, ppc64el, s390c, powerpc, …)
  • #3427, Update spatial_ref_sys to EPSG version 8.8
  • #3433, ST_ClusterIntersecting incorrect for MultiPoints
  • #3435, ST_AsX3D fix rendering of concave geometries
  • #3436, memory handling mistake in ptarrayclonedeep
  • #3437, ST_Intersects incorrect for MultiPoints
  • #3461, ST_GeomFromKML crashes Postgres when there are innerBoundaryIs and no outerBoundaryIs
  • #3429, upgrading to 2.3 or from 2.1 can cause loop/hang on some platforms
  • #3460, ST_ClusterWithin “Tolerance not defined” error after upgrade
  • #3490, Raster data restore issues, materialized views. Added scripts postgis_proc_set_search_path.sql, rtpostgis_proc_set_search_path.sql
  • #3426, failing POINT EMPTY tests on fun architectures

See the full list of changes in the news file and please report bugs that you find in the release.

Binary packages will appear in repositories over the coming weeks as packagers roll out builds.

View all closed tickets for 2.2.2.

by Paul Ramsey at March 22, 2016 12:00 AM

March 20, 2016

Stephen Mather

Taking Slices from LiDAR data: Part VI

I finally got PDAL properly compiled with Point Cloud Library (PCL) baked in. Word to the wise — CLANG is what the makers are using to compile. The PDAL crew were kind enough to revert the commit which broke GCC support, but why swim upstream? If you are compiling PDAL yourself, use CLANG. (Side note, the revert to support GCC was really helpful for ensuring we could embed PDAL into OpenDroneMap without any compiler changes for that project.)

With a compiled version of PDAL with the PCL dependencies built in, I can bypass using the docker instance. When I was spawning tens of threads of Docker and then killing them, recovery was a problem (it would often hose my docker install completely). I’m sure there’s some bug to report there, or perhaps spawning 40 docker threads is ill advised for some grander reason, but regardless, running PDAL outside a container has many benefits, including simpler code. If you recall our objectives with this script, we want to:

  • Calculate relative height of LiDAR data
  • Slice that data into bands of heights
  • Load the data into a PostgreSQL/PostGIS/pgPointCloud database.

The control script without docker becomes as follows:

#!/bin/bash 

# readlink gets us the full path to the file. This is necessary for docker
readlinker=`readlink -f $1`
# returns just the directory name
pathname=`dirname $readlinker`
# basename will strip off the directory name and the extension
name=`basename $1 .las`

# PDAL must be built with PCL.
# See http://www.pdal.io/tutorial/calculating-normalized-heights.html

pdal translate "$name".las "$name".bpf height --writers.bpf.output_dims="X,Y,Z,Intensity,ReturnNumber,NumberOfReturns,ScanDirectionFlag,EdgeOfFlightLine,Classification,ScanAngleRank,UserData,PointSourceId,HeightAboveGround"

# Now we split the lidar data into slices of heights, from 0-1.5 ft, etc.
# on up to 200 feet. We're working in the Midwest, so we don't anticipate
# trees much taller than ~190 feet
for START in 0:1.5 1.5:3 3:6 6:15 15:30 30:45 45:60 60:105 105:150 150:200
        do
        # We'll use the height classes to name our output files and tablename.
        # A little cleanup is necessary, so we're removing the colon ":".
        nameend=`echo $START | sed s/:/-/g`

        # Name our output
        bpfname=$name"_"$nameend.bpf

        # Implement the height range filter
        pdal translate $name.bpf $bpfname -f range --filters.range.limits="HeightAboveGround[$START)"

        # Now we put our data in the PostgreSQL database.
        pdal pipeline -i pipeline.xml --writers.pgpointcloud.table='pa_layer_'$nameend --readers.bpf.filename=$bpfname --writers.pgpointcloud.overwrite='false'
done

We still require our pipeline xml in order to set our default options as follows:

<?xml version="1.0" encoding="utf-8"?>
<Pipeline version="1.0">
  <Writer type="writers.pgpointcloud">
    <Option name="connection">
      host='localhost' dbname='user' user='user' password='password'
    </Option>
    <Option name="table">54001640PAN_heightasz_0-1.5</Option>
    <Option name="compression">dimensional</Option>
    <Filter type="filters.chipper">
      <Option name="capacity">400</Option>
      <Reader type="readers.bpf">
      <Option name="filename">54001640PAN_heightasz_0-1.5.bpf</Option>
      </Reader>
    </Filter>
  </Writer>
</Pipeline>

And as before, we can use parallel to make this run a lot faster:

find . -name '*.las' | parallel -j20 ./pdal_processor.sh

For the record, I found out through testing that my underlying host only has 20 processors (though more cores). There is no point in running more processes than that.

So, when point clouds get loaded, they’re broken up into “chips”, or collections of points. How many chips do we have so far?

user=# SELECT COUNT(*) FROM "pa_layer_0-1.5";
  count   
----------
 64413535
(1 row)

Now, how many rows is too many in a PostgreSQL database? Effectively, there is no hard limit: PostgreSQL does not cap the number of rows in a table, so the practical constraints are table size, storage, and query performance.

In other words, your typical state full of LiDAR (Pennsylvania or Ohio, for example) is not too large to store, retrieve, and analyze. If you’re in California or Texas, or have super dense data that’s been flown recently, you will have to provide some structure, for example by partitioning your data into separate tables based on geography. You could also modify your “chipper” size in the XML file. I have used the default of 400 points per patch (for about 25,765,414,000 points total), which is fine for my use case, since I do not exceed 100 million rows once the points are chipped:

      <Option name="capacity">400</Option>
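
If you want to check the point count as well as the row count, pgPointCloud’s PC_NumPoints() can sum the points across patches. A small sketch, assuming the PDAL writer kept its default patch column name of "pa":

-- Total points in one sliced layer, not just the number of patches (rows).
-- Assumes the default pgpointcloud patch column name "pa".
SELECT Sum(PC_NumPoints(pa)) AS total_points
  FROM "pa_layer_0-1.5";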

by smathermather at March 20, 2016 02:43 AM

March 17, 2016

Oslandia

OSGeo CS 2016 report : Point Clouds

#TOSprint

The annual OSGeo Codesprint took place from the 23rd to the 26th of February in Paris, at Mozilla's Foundation offices. Here is a summary of some achievements made during this event by developers working on Point Cloud data.

Point Clouds


Point Cloud, as seen in the famous musical masterpiece, is a data type growing in popularity lately. It often comes from LIDAR sensors, which generate 300k to 500k points/s these days, but point clouds can also come out of a Structure-from-Motion computation, based on simple pictures.


Data volumes, as well as some specific aspects of this data type (e.g. no fixed schema), lead to new challenges in the software needed to handle Point Clouds. A similarly growing number of people work on these challenges, especially in the OSGeo community.


This post summarizes some of the achievements in this field from the OSGeo Codesprint in Paris.

Greyhound & Entwine



Hobu's team was present at the codesprint, with Howard Butler, Connor Manning and Andrew Bell. They presented two technologies they have been working on over the last months: Greyhound and Entwine.


Entwine is a Point Cloud data preparation library, indexing large amounts of data in distributed file systems. Connor Manning improved the performance of this new software. Greyhound is a Point Cloud streaming server. It is based on Entwine for data access, and answers queries by streaming the corresponding points given a location, LOD and other information. Greyhound and Entwine focus on serving large amounts of data with high performance.


Greyhound was originally bundled with Plas.io, a WebGL client for visualizing point cloud data served by Greyhound. During the codesprint, Hobu's collaborators teamed up with Maarten van Meersbergen and Oscar Martinez Rubi and managed to plug Potree into Greyhound. This is a really interesting improvement, since Potree is currently the most advanced open source point cloud visualization tool. Furthermore, it is a step towards interoperability: since the Open Geospatial Consortium is currently interested in Point Cloud format and protocol standardization, the work achieved during the codesprint provides a good pragmatic implementation to base standardization discussions on.


There are now online demonstrations of the Entwine-Greyhound-Potree coupling:

iTowns


The iTowns project is a 3D WebGL geospatial visualization solution, developed by the French national mapping agency (IGN), Oslandia and AtolCD. We are currently working on the future version of iTowns, improving the code base and adding new features.


During the Codesprint, Quoc worked on integrating Potree inside iTowns for massive Point Cloud data visualization. Both projects are based on THREE.js, so the integration makes sense. He managed to get a working prototype displaying Point Clouds from a Potree data source (flat file structure) inside iTowns. This work is currently in a specific branch, as it required some modifications on the Potree side. The next step will be to propose a modularization refactoring to Potree so that integration into other frameworks can be done easily.


Our PhD student, Jeremy Gaillard, spent some time improving his building server. It is a piece of software serving 3D objects, namely buildings, from a PostGIS database to a web client. Jeremy enhanced the indexing mechanism, which loads data on the client following priorities given by a cost function. This allows, for example, specific landmarks to be visualized first, or big buildings, or any other characteristic used as a priority.

And more

You will find more details in Howard Butler's post :

by Vincent Picavet at March 17, 2016 11:00 AM

PostGIS Development

Vanguard Appraisals

Vanguard Appraisals is new to the GIS world. In fact, we aren’t really in the GIS world; we just kind of brush up against it. We do mass property appraisal for entire county and city jurisdictions, and we develop software to collect, price and maintain values. We also host assessment data online so that homeowners can search and find property information much more simply from the comfort of their own home. Our software and websites are used in 7 states (IA, IL, MN, MO, NE, ND, SD).

Continue Reading by clicking title hyperlink ..

by Andy Colson at March 17, 2016 12:00 AM

PostGIS Development

Clever Maps

Clever°Maps is a three-year-old startup based in the Czech Republic. We create web apps for four distinct market segments:

  1. business/location intelligence - helping companies to make decisions based on data, not on feelings
  2. farming - simplifying agenda and everyday work
  3. road infrastructure administration - settlement of land property rights, treaty evidence, speeding up the whole administrative process
  4. assets administration - treaty management, land purchases
Continue Reading by clicking title hyperlink ..

by Michal Zimmermann at March 17, 2016 12:00 AM

March 16, 2016

Oslandia

OSGeo CS 2016 report : PostGIS

#TOSprint

The annual OSGeo Codesprint took place from the 23rd to the 26th of February in Paris, at Mozilla's Foundation offices. Here is a summary of some achievements made during this event by the PostGIS team.

PostGIS


A big team of PostGIS developers worked together during the sprint on numerous issues.

New functions and dev guide

Remi Cura from IGN had prepared a full list of improvements to existing PostGIS functions and new features to develop. Together with Marc Ducobu (Champs Libres), they managed to go all the way from simple PostGIS users to PostGIS developers. They dove into the code and wrote clean patches for functions and documentation.

They documented their work and created a step-by-step guide for new PostGIS contributors.

New clustering features


New features have been introduced for PostGIS 2.3, among them the clustering methods ST_ClusterKMeans() and ST_ClusterDBSCAN() by Paul Ramsey and Dan Baston. These are really nice window functions for PostGIS!


On a side note, the documentation for PostGIS Window functions has been improved too ( R. Obe )

OGR ODBC FDW

Regina Obe managed to travel to Paris, and she put effort into analyzing why the ODBC performance of the ogr_fdw PostGIS foreign data wrapper was not as good as expected. This should lead to performance improvements later on. She also worked on problems encountered when restoring PostGIS data (especially rasters), related to how the search_path is handled. There is still some work left to experiment with EVENT TRIGGERS, but the issue is on a good path to being solved.

3D in PostGIS : SFCGAL


Our team at Oslandia has been working on SFCGAL, the 3D library behind PostGIS 3D features. Vincent Mora and Hugo Mercier teamed up with Mickael Borne (IGN) and Sébastien Loriot (GeometryFactory) to break things and rebuild them, better. They focused mainly on performance:

  • using different computation and precision kernels in CGAL efficiently, to be much faster
  • setting a flag on valid geometries, so as not to re-test for validity in every operation
  • lowering the serialization overhead when passing data between PostGIS, CGAL and SFCGAL

This effort led to significant improvements in speed. Preliminary tests on 3D intersections showed the runtime drop from 15 seconds down to 100 milliseconds, which is an impressive result.


A lot of code refactoring remains to be done, and this work has also started, in order to simplify SFCGAL and ease its use.

New indexes for big data


Another significant contribution is the work on BRIN indexes for PostGIS. At PGConf Europe, we had already spoken with the team at Dalibo about the new index type in PostgreSQL 9.5: BRIN use cases and implementation in a GIS context. Some time before the OSGeo code sprint, we realized that Giuseppe Broccolo from 2ndQuadrant had started a prototype implementation, so we invited him to join. Giuseppe teamed up with Ronan Dunklau and Julien Rouhaud from Dalibo, and together they got a working implementation of this new type of index for PostGIS.


Having the opportunity for PostgreSQL developers to meet in this geospatial environment was the ideal way to get things done efficiently.


PostGIS BRIN will be detailed after some code consolidation and benchmarks, but here is the idea. BRIN indexes are block range indexes: they summarize data blocks (a given number of pages) rather than indexing individual rows. You can set the number of pages per block range. This makes the indexes much faster to build, and a lot smaller than classic indexes (GiST for PostGIS). Of course, the tradeoff is that they are slower to query. And since they group data in blocks, they lose their efficiency if the data is not clustered on disk.


Some of our use cases, with tables full of billions of points (using PgPointCloud), are a good match for BRIN indexes: the data is naturally clustered spatially (patches of points are loaded sequentially), and sometimes the data is too big for a table's index to fit entirely in memory.

  • E.g. we have a 3TB table, resulting in a 20GB GiST index.

A BRIN index can be a really good compromise for this: preliminary tests on a medium-size dataset show a performance degradation of 5x in querying, while the index is 1000x smaller. And this tradeoff is adjustable, according to your dataset size and hardware configuration.
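
For illustration, once the operator class from this prototype is available, declaring the index looks like any other PostgreSQL BRIN index. This is only a sketch with a hypothetical table and column, not code from the prototype itself:

-- Hypothetical table "patches" with geometry column "geom".
-- pages_per_range controls the block-size tradeoff discussed above.
CREATE INDEX patches_geom_brin
  ON patches
  USING BRIN (geom)
  WITH (pages_per_range = 128);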

And more

Other topics which have been worked on (mainly by Paul, see his blog post) :

  • Expanded object header, to avoid serialization / deserialization steps
  • A nasty upgrade bug ( 2.2 blocker )

You will find more details in the following reports :

Thanks to the PostGIS team for such hard work and for their reports !

Figures credits : Openstreetmap's Soup, OpenScienceMap, Wikipedia

by Vincent Picavet at March 16, 2016 11:00 AM

March 15, 2016

Oslandia

OSGeo CS 2016 report : some achievements


#TOSprint

The annual OSGeo Codesprint took place from the 23rd to the 26th of February in Paris, at Mozilla's Foundation offices. Here is a summary of some achievements made during this event.

OSGeo projects at the codesprint

A lot of OSGeo projects were represented at the codesprint. Each had set up its own plan, with tasks to work on, goals to achieve and issues to discuss. Look at the projects and their initial goals:

A big part of the plans has been achieved, and most participants were happy with their results.

Some reports

Some projects have already reported on what they achieved during the codesprint. Here are some of them; the next articles on this blog will give more insight into PostGIS as well as Point Cloud technologies.

Orfeo ToolBox

Victor Poughon explains the work done on OTB during the sprint. They worked on QGIS / OTB integration, so OTB algorithms are now available through QGIS Processing. OSGeo Live OTB documentation has been improved. OGR integration in OTB has been analyzed, and bash completion for GDAL and OTB has been added. Some progress on ITK pipeline optimization has been made too.

You can read the full report here :

OpenSuse

Bruno Friedmann, an OpenSuse Linux packager, was present and worked with Angelos Kalkas on improving geospatial packages in OpenSuse. GDAL, PostGIS, pgRouting, pgPointCloud, PDAL and MapServer received updates and improvements.


Read more on this report :

GDAL

Even Rouault, GDAL maintainer, has been working on various features and bug fixes. Support for the M dimension of geometries was one of the issues, and not an easy one to solve.


Rob Emmanuele put effort into building a better error reporting system for some GDAL drivers. The new design should be in GDAL 2.1.


It is not a joke: Yann Chemin explored how to use GDAL for other planets and satellites. That is spatial geospatial development...


Among other improvements to GDAL: the OGR ODBC driver (Regina Obe) and bash completion for CLI tools (Guillaume Pasero).


You can read Even's complete report here :

MapServer

The MapServer project released some new versions of MapServer with bugfixes. The 7.0.1 and 6.4.3 maintenance versions were released on the 25th of February. You can see the changelog for 7.0.1 or for 6.4.3. A new version of MapCache, 1.4.1, has also been released, and all three components are available for download.


The MapServer team also closed 430 tickets during the codesprint, flooding 80 happy watchers' inboxes in the process, thanks to an automated script from Stephan Meißl.


Daniel Morissette added a service providers page to the MapServer website. Mike fixed a few bugs, and added some Docker container information for MapServer.


A new demo and tutorial have been set up for MapServer. This is the work of Lars, Seth and Julien and is available here (RFC 115):

Proj.4

Proj.4 now has a new website, thanks to the work of Howard Butler!

The next article will present what the PostGIS team achieved...

by Vincent Picavet at March 15, 2016 11:00 AM

March 14, 2016

Oslandia

OSGeo CodeSprint 2016 in Paris


C Tribe OSGeo Code Sprint, 2016 Paris edition


At the end of February, Mozilla Paris hosted the annual OSGeo "C tribe" Code Sprint event. Around 50 developers from several Open Source GIS projects gathered to work together.


The "C tribe" code sprint was born in North America almost a decade ago. It took place in Europe for the first time in 2014 (the Vienna edition). Vienna was a successful event, and an incentive to alternate the location between the old and the new continent. This allows more local developers to attend, and new places to be visited. This is why we landed in Paris.


This event can be seen as a collection of several smaller sprints (several projects involved), each with its own agenda :

But at the same time, each attendee benefits from all the resources gathered in one place. This really makes a difference for some issues...


During a code sprint, coding obviously takes a central place. But discussions about project direction and software design are just as important, as are all the informal exchanges between participants.


You can have a look at Vincent's pictures on Flickr.

A special event

This event had its special taste and color, thanks to some exceptional aspects :

  • The venue itself really helped! Located in the very center of Paris, the main room is magnificent; a perfect mix between old-fashioned style à la Versailles (the place used to be an embassy) and the latest high-tech facilities. From a symbolic point of view, the Mozilla Paris offices could be seen as an Open Source temple, borrowing the codes of the old aristocratic powers. It also conveys the message that Open Source is no longer only written in underground places.
  • Each project was encouraged to define its own agenda before the sprint (as linked above). This is a good bootstrap for the first day.
  • I also wanted to enlarge the attendee list to people not used to attending OSGeo Code Sprints. So we reached out to other communities and organizations and invited them to join. All of them apparently enjoyed the experience a lot. We can be proud that the OSGeo community is able to welcome new energy, faces and skills.
  • In a similar way, during the second evening, we opened the doors to external people curious to see what a code sprint looks like. Participants had the opportunity to briefly present what they were working on in front of the whole audience. Getting a chance to share the knowledge, followed by a buffet, made for a really nice moment.
  • Strong attention was paid to the locations for the evening events, as well as to the quality of food and drinks. This may be the French way of life, but it really improves the social side of the event.

It apparently paid off, as the sprinters seemed enthusiastic:


Yes, it was worth it!

Gathering around 50 developers for almost a week represents more or less one man-year of effort, plus extra resources (travel fees and more). Was it worth it? As an organizer, I obviously think it was!


A code sprint really changes the way developers interact with each other. Tough discussions are much easier face to face than on IRC or a mailing list, and decision-making is greatly improved. Spending a week close to the others changes the way you interact afterwards and eases communication. Moreover, having highly skilled people in the same room makes problem solving simple, as someone almost always has the right answer to any tricky issue.


This sprint helped create many bridges, be they between individuals, organizations or projects. Most of them will grow into new interactions in the coming weeks and months. The event was also a good way to inject new energy into the community, and to welcome newcomers into this growing group of people, for the good of open source geospatial technologies.


Therefore, a lot of "Mercis" to all who contributed to this event :

by Olivier Courtin at March 14, 2016 11:00 AM

March 13, 2016

PostGIS Development

Selecting only pixels of particular range of values with ST_Reclass

This raster question comes up quite a bit on the PostGIS mailing lists and Stack Overflow, and the best answer often involves the frequently forgotten ST_Reclass function, which has existed since PostGIS 2.0.
People often resort to the much slower though more flexible ST_MapAlgebra, or to dumping out their rasters as pixel-valued polygons that they then filter with WHERE val > 90, when ST_Reclass does the same thing orders of magnitude faster.
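
As a concrete sketch of that pattern (the table and column names here are made up; adjust the band, ranges and pixel type to your data), a single ST_Reclass() call can keep the pixels above 90 and push everything else to NODATA:

-- Hypothetical table "elev" with raster column "rast", 8BUI band 1.
-- Values 0-90 become NODATA (0); values 91-255 are kept as-is.
SELECT ST_Reclass(rast, 1, '0-90:0, 91-255:91-255', '8BUI', 0) AS rast
  FROM elev;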

Continue Reading by clicking title hyperlink ..

by Regina Obe at March 13, 2016 12:00 AM

March 12, 2016

Paul Ramsey

libpostal for PostgreSQL

Dealing with addresses is a common problem in information systems: people live and work in buildings which are addressed using “standard” postal systems. The trouble is, the postal address systems and the way people use them aren’t really all that standard.

libpostal for PostgreSQL

Postal addressing systems are just standard enough that your average programmer can whip up a script to handle 80% of the cases correctly. Or a good programmer can handle 90%. Which just leaves all the rest of the cases. And also all the cases from countries where the programmer doesn’t live.

The classic resource on postal addressing is called Falsehoods programmers believe about addresses and includes such gems as:

An address will start with, or at least include, a building number.

Counterexample: Royal Opera House, Covent Garden, London, WC2E 9DD, United Kingdom.

No buildings are numbered zero

Counterexample: 0 Egmont Road, Middlesbrough, TS4 2HT

A street name won’t include a number

Counterexample: 8 Seven Gardens Burgh, WOODBRIDGE, IP13 6SU (pointed out by Raphael Mankin)

Most solutions to address parsing and normalization have used rules, hand-coded by programmers. These solutions can take years to write, as special cases are handled as they are uncovered, and are generally restricted in the language/country domains they cover.

There’s now an open source, empirical approach to address parsing and normalization: libpostal.

Libpostal is built using machine learning techniques on top of Open Street Map input data to produce parsed and normalized addresses from arbitrary input strings. It has bindings for lots of languages: Perl, PHP, Python, Ruby and more.

And now, it also has a binding for PostgreSQL: pgsql-postal.

You can do the same things with the PostgreSQL binding as you can with the other languages: convert raw strings into normalized or parsed addresses. The normalization function returns an array of possible normalized forms:

SELECT unnest(
  postal_normalize('412 first ave, victoria, bc')
  );
                  unnest                  
------------------------------------------
 412 1st avenue victoria british columbia
 412 1st avenue victoria bc
(2 rows)

The parsing function returns a jsonb object holding the various parse components:

SELECT postal_parse('412 first ave, victoria, bc');
         postal_parse                                   
----------------------------------
 {"city": "victoria", 
  "road": "first ave", 
  "state": "bc", 
  "house_number": "412"}
(1 row)

The core library is very fast once it has been initialized, and the binding has been shown to be acceptably fast, despite some unfortunate implementation tradeoffs.

Thanks to Darrell Fuhriman for motivating this work!

March 12, 2016 11:00 AM

March 08, 2016

Postgres OnLine Journal (Leo Hsu, Regina Obe)

Paris OSGeo Code Sprint 2016 Highlights

Leo and I attended the Paris OSGeo Code Sprint at Mozilla Foundation put together by Oslandia and funded by several companies. It was a great event. There was quite a bit of PostGIS related hacking that happened by many new faces. We have detailed at BostonGIS: OSGeo Code Sprint 2016 highlights some of the more specific PostGIS hacking highlights. Giuseppe Broccolo of 2nd Quadrant already mentioned BRIN for PostGIS: my story at the Code Sprint 2016 in Paris.


Continue reading "Paris OSGeo Code Sprint 2016 Highlights"

by Leo Hsu and Regina Obe (nospam@example.com) at March 08, 2016 01:01 AM

Boston GIS (Regina Obe, Leo Hsu)

Paris OSGeo Code Sprint 2016 PostGIS Highlights

We'd like to give a big thanks to Olivier Courtin and Oslandia for organizing this year's OSGeo Code Sprint in Paris. It was quite memorable and I walked away feeling energized.

Leo and I attended the Paris OSGeo Code Sprint for 1.5 days. I'm really glad we were able to attend at least part of it. It would have been nicer to attend all of it. Though we didn't get too much done while there, we did have some interesting conversations and learned about what others were doing. I walked out with a TO DO list, some of which I'm happy to say I've now accomplished.

While we were there, Leo spent time cleaning up my Mingw64 compile scripts and starting to test PostGIS against PostgreSQL 9.6 in preparation for parallelization testing of PostGIS 2.3 features specifically targeted for PostgreSQL 9.6.


Continue reading "Paris OSGeo Code Sprint 2016 PostGIS Highlights"

by Regina Obe (nospam@example.com) at March 08, 2016 01:01 AM

March 07, 2016

Paul Ramsey

Paris Code Sprint, PostGIS Recap

At the best of times, I find it hard to generate a lot of sympathy for my work-from-home lifestyle as an international coder-of-mystery. However, the last few weeks have been especially difficult, as I try to explain my week-long business trip to Paris, France to participate in an annual OSGeo Code Sprint.

Paris Code Sprint

Yes, really, I “had” to go to Paris for my work. Please, stop sobbing. Oh, that was light jealous retching? Sorry about that.

Anyhow, my (lovely, wonderful, superterrific) employer, CartoDB was an event sponsor, and sent me and my co-worker Paul Norman to the event, which we attended with about 40 other hackers on PDAL, GDAL, PostGIS, MapServer, QGIS, Proj4, PgPointCloud etc.

Paul Norman got set up to do PostGIS development and crunched through a number of feature enhancements. The feature enhancement ideas were courtesy of Remi Cura, who brought in some great power-user ideas for making the functions more useful. As developers, it is frequently hard to distinguish between features that are interesting to us and features that are useful to others, so having feedback from folks like Remi is invaluable.

The Oslandia team was there in force, naturally, as they were the organizers. Because they work a lot in the 3D/CGAL space, they were interested in making CGAL faster, which meant they were interested in some “expanded object header” experiments I did last month. Basically the EOH code allows you to return an unserialized reference to a geometry on return from a function, instead of a flat serialization, so that calls that look like ST_Function(ST_Function(ST_Function())) don’t end up with a chain of three serialize/deserialize steps in them. When the deserialize step is expensive (as it is for their 3D objects) the benefit of this approach is actually measurable. For most other cases it’s not.

(The exception is in things like mutators, called from within PL/PgSQL, for example doing array appends or insertions in a tight loop. Tom Lane wrote up this enhancement of PgSQL with examples for array manipulation and did find big improvements for that narrow use case. So we could make things like ST_SetPoint() called within PL/PgSQL much faster with this approach, but for most other operations probably the overhead of allocating our objects isn’t all that high to make it worthwhile.)

There was also a team from Dalibo and 2nd Quadrant. They worked on a binding for geometry to the BRIN indexes (9.5+). I was pretty sceptical, since BRIN indexes require useful ordering, and spatial data is not necessarily well ordered, unlike something like time, for example. However, they got a prototype working, and showed the usual good BRIN properties: indexes were extremely small and extremely cheap to build. For narrow range queries, they were about 5x slower than GIST-rtree, however, the differences were on the order of 25ms vs 5ms, so not completely unusable. They managed this result with presorted data, and with some data in its “natural” order, which worked because the “natural” order of GIS data is often fairly spatially autocorrelated.

I personally thought I would work on merging the back-log of GitHub pull-requests that have built up on the PostGIS git mirror, and did manage to merge several, both new ones from Remi’s group and some old ones. I merged in my ST_ClusterKMeans() clustering function, and Dan Baston merged in his ST_ClusterDBSCAN() one, so PostGIS 2.3 will have a couple of new clustering implementations.

However, in the end I spent probably 70% of my time on a blocker in 2.2, which was related to upgrade. Because the bug manifests during upgrade, when there are two copies of the postgis.so library floating in memory, and because it only showed up on particular Ubuntu platforms, it was hard to debug: but in the end we found the problem and put in a fix, so we are once again able to do upgrades on all platforms.

The other projects also got lots done, and there are more write-ups at the event feedback page. Thanks to Olivier Courtin from Oslandia for taking on the heavy weight of organizing such an amazing event!

March 07, 2016 08:00 PM

February 17, 2016

Oslandia

iTowns 1.0 release : 3D web visualization framework


Oslandia is pleased to announce the first release of iTowns, a new 3D geospatial data visualization web framework developed by the iTowns project, including people from French IGN, Oslandia and AtolCD.


Contact for this project at Oslandia : infos+itowns@oslandia.com



iTowns is a web framework written in Javascript/WebGL for visualisation of 3D geographic data, allowing precise measurements in 3D. Its first purpose is the visualisation of street view images and terrestrial lidar point clouds, though it now supports many more data types.


Version 1.0 is the first open source release of iTowns. It is the core of the original iTowns application developed over the last years at the French IGN MATIS research laboratory. Oslandia worked closely with IGN during the last months to prepare this move from an in-house application to an open source project.



This version provides the following features :

  • Load and project Oriented Images on mesh (cube or city model)
  • Load and display Panoramic Images
  • Load Depth Panoramic Image and render in 3D
  • Load 2D multipolygons with height (building footprints) from WFS or a local file and triangulate them to create building boxes. This mesh can then be used for texture projection.
  • Navigate through Image Data using click and go functions
  • Load and display Point Cloud from PLY files.
  • Load and display 3D textured models (B3D, 3DS).
  • Simple API interface.

You will find all information and a demo on iTowns website :

https://itowns.github.io

French IGN provided a sample dataset, which can be viewed on the iTowns demo repository :

http://itowns.github.io/itowns-sample-data/

You can download iTowns Version 1.0 here :

https://github.com/iTowns/itowns/releases/tag/v1.0

OpenSource

iTowns is an OpenSource organization, delivering software components for 3D geospatial data visualization on the web. The iTowns project is open to contributions from anyone in the community. You can fork the project on GitHub and send Pull Requests, and we will be pleased to integrate them. Do not hesitate to consult the iTowns wiki for general information on how to contribute.


Collaboration

On the 3rd and 4th of February, 2016, the iTowns project held a codesprint hosted by IGNFab in Paris. It gathered around 15 developers and made it possible to finalize the 1.0 release, work on iTowns project organization, documentation and demos, as well as design and test new features for the next generation of the iTowns framework. A PSC has been formed, with Alexandre Devaux (IGN) as chair.


This collaboration has been fruitful, and is a good kick-off for all developers willing to work together on 3D geospatial data visualization.

Support

iTowns is an original work from French IGN, MATIS research laboratory. It has been funded through various research programs involving the French National Research Agency, Cap Digital, UPMC, Mines ParisTec, CNRS, LCPC.


iTowns is currently maintained by IGN ( http://www.ign.fr ) and Oslandia ( http://www.oslandia.com )


Should you wish to have more information or need support, please contact infos+itowns@oslandia.com

by Vincent Picavet at February 17, 2016 10:30 AM

February 11, 2016

Stephen Mather

PostgreSQL table drop — a little fun with ouroboros

My latest posts on PDAL have been fun. For the moment, here's a quick bit of code for dropping all the tables in your PostgreSQL database. BTW, the following is a bad idea. Many bad ideas are really useful. This is a really useful but bad idea:

echo "\dt" | psql | grep layer | awk '{print "DROP TABLE \"" $3"\";"}' | psql

How does this work? Let’s do a quick step through.

First we echo “\dt”.

$ echo "\dt"
\dt

This just prints literally (the literally kind of literally) “\dt”. \dt is a psql command for listing all the tables in your database. We “pipe” (this symbol is a pipe: |) that into psql. Essentially what this does is run this command within psql as follows:

$ echo "\dt" | psql
               List of relations
 Schema |        Name        | Type  |  Owner  
--------+--------------------+-------+---------
 public | layer_0-1.5        | table | user
 public | layer_1.5-3        | table | user
 public | layer_105-150      | table | user
 public | layer_15-30        | table | user
 public | layer_150-200      | table | user
 public | layer_3-6          | table | user
 public | layer_30-45        | table | user
 public | layer_45-60        | table | user
 public | layer_6-15         | table | user
 public | layer_60-105       | table | user
 public | paw_points         | table | user
 public | pointcloud_formats | table | user
 public | spatial_ref_sys    | table | user
(13 rows)

Now that we have a list of tables, we can manipulate that list in order to manipulate our tables. We’ll use the grep command to go line by line and only return the ones with the word “layer” in them. These are the table names that we are interested in dropping.

$ echo "\dt" | psql | grep layer
 public | layer_0-1.5        | table | user
 public | layer_1.5-3        | table | user
 public | layer_105-150      | table | user
 public | layer_15-30        | table | user
 public | layer_150-200      | table | user
 public | layer_3-6          | table | user
 public | layer_30-45        | table | user
 public | layer_45-60        | table | user
 public | layer_6-15         | table | user
 public | layer_60-105       | table | user

Now we use the awk command to grab just the column we want. By default, awk assumes spaces are our column delimiter, so we'll grab the third column; we could also use the pipes as our delimiter with the -F flag:

$ echo "\dt" | psql | grep layer | awk '{print $3}'
layer_0-1.5
layer_1.5-3
layer_105-150
layer_15-30
layer_150-200
layer_3-6
layer_30-45
layer_45-60
layer_6-15
layer_60-105

Let us next extend the awk command to print the commands we want to run against our database. Note I had to escape my double quotes with a “\”:

$ echo "\dt" | psql | grep layer | awk '{print "DROP TABLE \"" $3"\";"}' 
DROP TABLE "layer_0-1.5";
DROP TABLE "layer_1.5-3";
DROP TABLE "layer_105-150";
DROP TABLE "layer_15-30";
DROP TABLE "layer_150-200";
DROP TABLE "layer_3-6";
DROP TABLE "layer_30-45";
DROP TABLE "layer_45-60";
DROP TABLE "layer_6-15";
DROP TABLE "layer_60-105";

Now we have the commands we need to feed back into psql to run against the database. Here is where the dragon eats its own tail:

echo "\dt" | psql | grep layer | awk '{print "DROP TABLE \"" $3"\";"}' | psql
DROP TABLE
DROP TABLE
DROP TABLE
DROP TABLE
DROP TABLE
DROP TABLE
DROP TABLE
DROP TABLE
DROP TABLE
DROP TABLE


by smathermather at February 11, 2016 05:00 AM

February 10, 2016

Stephen Mather

Taking Slices from LiDAR data: Part V

For this post, let’s combine the work in the last 4 posts in order to get a single pipeline for doing the following:

  • Calculate relative height of LiDAR data
  • Slice that data into bands of heights
  • Load the data into a PostgreSQL/PostGIS/pgPointCloud database.
#!/bin/bash 

# readlink gets us the full path to the file. This is necessary for docker
readlinker=`readlink -f $1`
# returns just the directory name
pathname=`dirname $readlinker`
# basename will strip off the directory name and the extension
name=`basename $1 .las`

# Docker run allows us to leverage a pdal machine with pcl built in,
# thus allowing us to calculate height.
# See http://www.pdal.io/tutorial/calculating-normalized-heights.html
docker run -v $pathname:/data pdal/master pdal translate //data/"$name".las //data/"$name"_height.bpf height --writers.bpf.output_dims="X,Y,Z,Intensity,ReturnNumber,NumberOfReturns,ScanDirectionFlag,EdgeOfFlightLine,Classification,ScanAngleRank,UserData,PointSourceId,Height";

# Now we split the lidar data into slices of heights, from 0-1.5 ft, etc.
# on up to 200 feet. We're working in the Midwest, so we don't anticipate
# trees much taller than ~190 feet
for START in 0:1.5 1.5:3 3:6 6:15 15:30 30:45 45:60 60:105 105:150 150:200
 do
  # We'll use the height classes to name our output files and tablename.
  # A little cleanup is necessary, so we're removing the colon ":".
  nameend=`echo $START | sed s/:/-/g`

  # Name our output
  bpfname=$name"_"$nameend.bpf

  # Implement the height range filter
  pdal translate $name"_height".bpf $bpfname -f range --filters.range.limits="Height[$START)"

  # Now we put our data in the PostgreSQL database.
  pdal pipeline -i pipeline.xml --writers.pgpointcloud.table='layer_'$nameend --readers.bpf.filename=$bpfname --writers.pgpointcloud.overwrite='false'
done

Now, we can use parallel to make this run a little faster:

find . -name "*.las" | parallel -j6 ./pdal_processor.sh {}&

Sadly, we can run into issues in running this in parallel:

PDAL: ERROR:  duplicate key value violates unique constraint "pointcloud_formats_pkey"
DETAIL:  Key (pcid)=(1) already exists.


PDAL: ERROR:  duplicate key value violates unique constraint "pointcloud_formats_pkey"
DETAIL:  Key (pcid)=(1) already exists.

This is a one-time issue, however: we just can't parallelize table creation. Once the tables are created, I believe we can parallelize without issue. I'll report back if I find otherwise.


by smathermather at February 10, 2016 04:24 PM

February 05, 2016

Stephen Mather

Taking Slices from LiDAR data: Part IV

Trans-Alaska Pipeline System Luca Galuzzi 2005

In PDAL, a pipeline file can be used to do a variety of operations. Within the following context, I think of a pipeline file like an ad hoc preferences file, where I can use an external command to iterate through the things I want to change, while holding constant everything else in the pipeline file.

In my use case for this vignette, I’ll use the pipeline file to hold my database preferences for putting the point clouds into my PostGIS database. For the record, I’m using vpicavet’s docker pggis as the starting place for installing PostGIS with the pgpointcloud extension. I have adapted the following pipeline file from the PDAL writers.pgpointcloud example.

<?xml version="1.0" encoding="utf-8"?>
<Pipeline version="1.0">
  <Writer type="writers.pgpointcloud">
    <Option name="connection">
      host='localhost' dbname='lidar' user='user' password='password'
    </Option>
    <Option name="table">54001640PAN_heightasz_0-1.5</Option>
    <Option name="compression">dimensional</Option>
<!--    <Option name="srid">3270</Option> -->
    <Filter type="filters.chipper">
      <Option name="capacity">400</Option>
      <Reader type="readers.las">
          <Option name="filename">54001640PAN_heightasz_0-1.5.las</Option>
<!--          <Option name="spatialreference">EPSG:3270</Option> -->
      </Reader>
    </Filter>
  </Writer>
</Pipeline>

Some things to note. I have commented out the SRID and readers.las.spatialreference in the XML above. We’ll rely on PDAL to discover this, and use the default output of epsg:4326 to store our point clouds for the time being.

Our wrapper script for the pipeline file is very simple. We will use the wrapper script to specify the table and the input file name.

#!/bin/bash

TABLE=`basename $1 .bpf`
INPUT=$1

#echo $TABLE
pdal pipeline -i pipeline.xml --writers.pgpointcloud.table=$TABLE --readers.bpf.filename=$INPUT

Now, to run it, we'll use GNU Parallel, as much for its xargs-like functionality as for scalability:

ls *.bpf | parallel -j6 ./pipeliner.sh {}

Now we can see what tables got loaded:

psql -d lidar
psql (9.5.0)
Type "help" for help.

lidar=# \dt
                    List of relations
 Schema |             Name              | Type  |  Owner  
--------+-------------------------------+-------+---------
 public | 54001640PAN_heightasz_0-1.5   | table | user
 public | 54001640PAN_heightasz_1.5-6   | table | user
 public | 54001640PAN_heightasz_100-200 | table | user
 public | 54001640PAN_heightasz_30-45   | table | user
 public | 54001640PAN_heightasz_45-55   | table | user
 public | 54001640PAN_heightasz_55-65   | table | user
 public | 54001640PAN_heightasz_6-30    | table | user
 public | 54001640PAN_heightasz_65-75   | table | user
 public | 54001640PAN_heightasz_75-85   | table | user
 public | 54001640PAN_heightasz_85-100  | table | user
 public | pointcloud_formats            | table | user
 public | spatial_ref_sys               | table | user
(12 rows)

W00t! We’ve got point clouds in our database! Next, we will visualize the data, and extract some analyses from it.


by smathermather at February 05, 2016 11:08 PM

Stephen Mather

Taking Slices from LiDAR data: Part III

forest_structure
Borrowed from: http://irwantoshut.com

Continuing my series on slicing LiDAR data in order to analyze a forest, one of the objectives of the current project is to understand the habitats that particular species of birds prefer. This will be accomplished using field info from breeding bird surveys combined with LiDAR data of forest structure to help predict what habitats are necessary for particular species of breeding birds.

There are a number of studies doing just this for a variety of regions which I will detail later, but suffice it to say, structure of vegetation matters a lot to birds, so using LiDAR to map out structure can be an important predictive tool for mapping bird habitat. Being able to do that at scale across entire ecoregions—well, that’s just an exciting prospect.
Let’s get to some coding. I would like to take a few slices through our forest based on functional height groups:

Forest Canopy >15 meters
Sub-Canopy 10-15 meters
Tall Shrub 2-10 meters
Short Shrub 0.5-2 meters

(For the record, we don’t have to get these perfectly right. Sub-canopy, for example could be taller or shorter depending on how tall a forest it is in. This is just a convenience for dividing and reviewing the data for the time being. Also, we’ll split a little finer than the above numbers just for giggles.)

We’ll start by just echoing our pdal translate to make sure we like what we are getting for output:

for START in 0:0.5 0.5:1 1:2 2:5 5:10 10:15 15:20 20:35 35:50 50:75 75:100 100:200
 do
  nameend=`echo $START | sed s/:/-/g`
  echo pdal translate 54001640PAN_heightasz.bpf 54001640PAN_heightasz_$nameend.bpf -f range --filters.range.limits="Height[$START)"
done

Thus we get the following output:

pdal translate 54001640PAN_heightasz.bpf 54001640PAN_heightasz_0-0.5.bpf -f range --filters.range.limits=Height[0:0.5)
pdal translate 54001640PAN_heightasz.bpf 54001640PAN_heightasz_0.5-1.bpf -f range --filters.range.limits=Height[0.5:1)
pdal translate 54001640PAN_heightasz.bpf 54001640PAN_heightasz_1-2.bpf -f range --filters.range.limits=Height[1:2)
pdal translate 54001640PAN_heightasz.bpf 54001640PAN_heightasz_2-5.bpf -f range --filters.range.limits=Height[2:5)
... etc.

Let’s remove our echo statement so this actually runs:

for START in 0:0.5 0.5:1 1:2 2:5 5:10 10:15 15:20 20:35 35:50 50:75 75:100 100:200
 do
  nameend=`echo $START | sed s/:/-/g`
  pdal translate 54001640PAN_heightasz.bpf 54001640PAN_heightasz_$nameend.bpf -f range --filters.range.limits="Height[$START)"
done

We should generalize this a bit too. Let’s make this a script to which we can pass our filenames and ranges:

#!/bin/bash
namer=`basename $1 .bpf`
for START in $2
 do
  nameend=`echo $START | sed s/:/-/g`
  pdal translate $namer.bpf $namer"_"$nameend".bpf" -f range --filters.range.limits="Height[$START)"
done

Which to run, we use a statement as follows:

./tree_slicer.sh 54001640PAN_heightasz.bpf "0:0.5 0.5:1 1:2 2:5 5:10 10:15 15:20 20:35 35:50 50:75 75:100 100:200"

I forgot that all my data is in feet while my height classes are in meters, so we’ll just multiply our input values by a factor of 3 to get in the same ballpark (future refinement likely):

./tree_slicer.sh 54001640PAN_heightasz.bpf "0:1.5 1.5:3 3:6 6:15 15:30 30:45 45:60 60:105 105:150 150:200"
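
If we wanted the exact conversion rather than the ×3 ballpark, a quick sketch can generate the feet break points from the metre ones and feed them straight to tree_slicer.sh:

# Convert metre break points to feet (1 m ≈ 3.28084 ft) and build the
# range string expected by tree_slicer.sh.
METERS="0 0.5 1 2 5 10 15 20 35 50 75 100 200"
RANGES=$(echo $METERS | awk '{ for(i=1;i<NF;i++) printf "%s%.1f:%.1f", (i>1?" ":""), $i*3.28084, $(i+1)*3.28084 }')
./tree_slicer.sh 54001640PAN_heightasz.bpf "$RANGES"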

We could alternatively stick to our original categories (short shrub, tall shrub, sub-canopy, canopy) as our break points:

./tree_slicer.sh 54001640PAN_heightasz.bpf "0:1.5 1.5:6 6:30 30:45 45:200"

Finally, we can convert to laz, load all our slices of point clouds into plas.io, and animate between the slices:

for OUTPUT in $(ls *.bpf); do docker run -v /home/gisuser/test/test:/data pdal/master pdal translate //data/$OUTPUT //data/$OUTPUT.laz; done
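
Since these conversions are independent of each other, the same GNU Parallel trick applies here too; a sketch using the same docker image and mount path as the loop above:

# Fan the laz conversions out over 6 processes with GNU Parallel.
ls *.bpf | parallel -j6 docker run -v /home/gisuser/test/test:/data pdal/master pdal translate //data/{} //data/{}.laz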

If you look closely, you should be able to see where a tornado in the 80s knocked down much of the forest here. Its signature is in the tall shrub / sub-canopy layer:


by smathermather at February 05, 2016 01:12 PM

UpStats blog (Stefan Petrea)

Simple job queue in Bash using a FIFO

Intro

I was looking for a simple example of a producer-consumer bash script that would distribute tasks to multiple consumers through a job queue. I found a few incomplete examples and decided to try my hand at writing my own.

So this blog post will focus on a few ways of implementing a queue like that, first by using fifos, then by using SysV message queues, and finally using Redis.

Here are a few use-cases:

  • enhancing the speed of document indexing (for example, the maff_table_index function described in this blog post)
  • writing a parallel (or distributed, if you use a distributed message queue) crawler
  • writing a parallel (or distributed) RSS/Atom news feed fetcher
  • speeding up test-suites

Each version starts with a description of how that variant of the message queue works and of the problems I encountered with it, followed by a flowchart and then the producer and consumer implementations.

Testing

The fact that messages flow from the producer to the consumers is not enough. It's also important that everything produced is consumed. To this end, we run a simple test: we write the consumers' output to a file and then compute some simple statistics on it to see how many messages got dropped, how many were delivered successfully and how many there were in total.

seq_stats() {
    AWK_PROG='''
    BEGIN {
        missing_list = "";
    }
    { 
        if(min==""){min=max=$1};
        if($1>max){max=$1};
        if($1<min){min=$1};
        sum+=$1;
        observed+=1;

        ## if there is a gap, store the missing numbers
        if($1-prev>1) {
            for(i=prev+1;i<$1;i++) {
                missing_list = missing_list "," i;
            };
        };
        prev=$1
    }
    END {
        expected=(max-min+1);
        missing=(expected-observed);
        printf "observed: %d\nmin: %d\nmax: %d\nmissing: %d\nmissing_p: %.5f\n",
        observed,min,max,missing,(100.0*(missing/expected));
        printf "missing_list: %s\n",missing_list;
    }
    '''
    sort -g | awk "$AWK_PROG"
}

We can run it with all the integers from 1 to 40 but the integers from 10 to 20 missing and see what we get back

#!/bin/bash
. $PWD/queue-samples/seq-stats.sh
(seq 1 9; seq 21 40) | seq_stats
observed: 29
min: 1
max: 40
missing: 11
missing_p: 27.50000
missing_list: ,10,11,12,13,14,15,16,17,18,19,20

FIFO versions

one fifo, multiple consumers, without locking

Here, the producer/consumer open the fifo for every message.

There are certain limitations, such as only being able to write at most 4096 bytes atomically to a fifo 1 , 2 , 3, which leads to problems such as consumers reading only parts of the messages sent to them.

There are several places 4 that advise against having multiple readers read from a single fifo. Usually fifos are used 5 in one-consumer, one-producer scenarios.

In this case, the producer will throw an error such as ./producer.sh: line 10: echo: write error: Broken pipe. This happens when the producer writes to a pipe that was closed by a consumer that exited too early. So this version will drop messages.

Flowchart

Producer

#!/bin/bash
FIFO=/tmp/fifo1
n=0
mkfifo $FIFO

while true; do
    n=$((n+1))
    MESSAGE="message $n"
    echo "$MESSAGE" >$FIFO
done

Consumer

#!/bin/bash
FIFO=/tmp/fifo1
while true; do
    cat $FIFO
done
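
To see the dropped messages for yourself, a small test harness along the lines of the Testing section is enough. This is just a sketch with assumed file names (producer.sh, consumer.sh and seq-stats.sh in the current directory):

#!/bin/bash
# Run one producer against three consumers for a few seconds, then let
# seq_stats report how many message numbers never arrived.
. ./seq-stats.sh
rm -f /tmp/fifo1 consumed.txt
timeout 5 ./producer.sh & PROD=$!
sleep 0.1                        # give the producer time to create the fifo
for i in 1 2 3; do ./consumer.sh >> consumed.txt & done
wait $PROD                       # let the producer run for its 5 seconds
pkill -f consumer.sh
grep -o '[0-9]\+' consumed.txt | seq_stats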

one fifo, multiple consumers, with locking

This is very similar to the previous version, except that every read/write is only done after acquiring a lock. One of two things happens here:

  1. The consumer acquires the lock first, it then starts reading from the pipe but since nobody else has opened the other end for writing, it will block 6. The consumer won't release the lock until it has read something from the fifo. The producer then tries to acquire the lock, but since the consumer has it, it won't be able to.
  2. The producer acquires the lock first, it starts writing to the pipe, but the write operation blocks 6 until the reading end of the fifo is opened. As in the previous case, the consumer can't acquire the lock anymore.

So this is a typical deadlock: neither the consumer nor the producer can make progress because of the way they acquire the lock and the blocking fifo operations.

Flowchart

Producer

#!/bin/bash
FIFO=/tmp/fifo1
LOCK=/var/lock/.prod.ex.lock
N=0
mkfifo $FIFO

trap ':' SIGPIPE   # no-op handler so the loop keeps running on broken pipes
trap '{ echo -e "\nlast N=$N"; exit 0; }' EXIT

while true; do
    N=$((N+1))
    (
    flock -w 0.03 -n 200
    if [[ "$?" != 0 ]]; then
        echo "can't acquire lock!";
    fi
    MSG="msg $N"
    echo "$MSG" > $FIFO
    ) 200>$LOCK
done

Consumer

#!/bin/bash
FIFO=/tmp/fifo1
LOCK=/var/lock/.prod.ex.lock

# read message from FIFO
while true; do
    (
    flock -w 0.03 -n 200
    if [[ "$?" != 0 ]]; then
        echo "can't acquire lock!";
    fi
    read MSG < $FIFO
    echo "$MSG"
    ) 200>$LOCK
done

multiple fifos and dirlocks

This version is a bit different. We create the fifos, start the consumers and record their pids. Afterwards, the producer starts generating messages. Every time the producer needs to send a message (through the fifo) to a consumer, it first acquires a dirlock; there is one dirlock per consumer. So that the producer doesn't block while sending a message, the write operation on the fifo is put in the background. The dirlock is only released by the consumer once it has finished consuming the message it received through the fifo. The producer keeps cycling through the consumers to find one that is free and does the same for each of them. The consumer just reads new messages from the fifo and removes the dirlock, which allows new messages to be delivered on that fifo. The dirlock 7 is one of several kinds of locks used primarily in bash scripts.
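
The dirlock itself is nothing more exotic than mkdir/rmdir used as an atomic test-and-set; a minimal standalone sketch (not part of the scripts below):

#!/bin/bash
# mkdir is atomic: it either creates the lock directory (lock acquired)
# or fails because it already exists (someone else holds the lock).
LOCK=/tmp/example.lock
if mkdir "$LOCK" 2>/dev/null; then
    echo "lock acquired, doing work"
    rmdir "$LOCK"    # release the lock
else
    echo "lock is busy, try again later"
fi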

Flowchart

Producer

#!/bin/bash
## In this scenario, we create a number of consumers, each with their
## own fifo, and then the producer will hand out messages by cycling through
## their fifos and writing to each of them.

FIFO_PREF="/tmp/bgfifo"
LOCK_PREF="/tmp/bgfifo.lock."
# message number
N=1
# number of consumers
C=10

create_fifos() {
    ## create the fifos
    for i in $(seq 0 $((C-1))); do
        CONSUMER_FIFO="$FIFO_PREF$i"
        CONSUMER_LOCK="$LOCK_PREF$i"
        mkfifo "$CONSUMER_FIFO"
        rmdir $CONSUMER_LOCK
    done
}

# make a table with these columns
# pid_consumer,pid_echo
# this table should be used when checking the status of writes
# or checking which pid has which index
declare -A status
init_consumers() {
    pids=$(pgrep -f "consumer-back")
    # C rows, 2 columns
    for ((i=0;i<=C-1;i++)) do
        CONSUMER_FIFO="$FIFO_PREF$i"
        # we start the consumer in the background, it will wait for data
        # to be available on its associated fifo
        echo "Creating consumer.."
        nohup ./consumer-back.sh "$i" & 
        # consumer pid
        status[$i,0]=$!
        # slot for consumer echo pid
        status[$i,1]=0
    done
}


kill_consumers() {
    # needs to be reimplemented
    echo "closing consumers"
    for ((i=0;i<=C-1;i++)) do
        CONSUMER_LOCK="$LOCK_PREF$i"
        cpid=${status[$i,0]}
        epid=${status[$i,1]}

        if [[ $cpid > 0 ]]; then
            echo "closing consumer $cpid"
            kill -9 $cpid 2>/dev/null
        fi

        if [[ $epid > 0 ]]; then
            kill -9 $epid 2>/dev/null
        fi

    done
}


producer() {
    ## start generating messages and distribute them
    ## among the consumers
    while (( N <= 5000 )); do
        ## NOTE: We overcome the problems related to blocking writes because
        ## we put in the background every single write we make

        # produce message and send it to the first free consumer
        for ((i=0;i<=C-1;i++)) do
            CONSUMER_FIFO="$FIFO_PREF$i"
            CONSUMER_LOCK="$LOCK_PREF$i"
            cpid=${status[$i,0]}
            epid=${status[$i,1]}

            kill -0 $cpid 2>/dev/null
            ok_cons=$?
            kill -0 $epid 2>/dev/null
            ok_echo=$?


            # if the consumer is still there and there's no previous
            # writes on this consumer's fifo then dispatch a message to it
            if [[ $ok_cons -eq 0 && $( mkdir $CONSUMER_LOCK 2>/dev/null; echo $?;) == 1 ]]; then
                echo "consumer $cpid was free"
                # produce next message
                # and send it
                echo "message $N" > $CONSUMER_FIFO 2>/dev/null &
                # update epid
                status[$i,1]=$!
                # we could break here and wait for the next iteration
                # find another free consumer
                N=$((N+1))
            else
                echo "consumer $cpid was busy"
            fi
        done
        echo "===================="
    done
}

# kill consumers upon exit
trap '{ kill_consumers; exit 0; }' SIGINT

create_fifos
init_consumers
producer

wait

Consumer

#!/bin/bash
FIFO_PREF="/tmp/bgfifo"
FIFO="$FIFO_PREF$1"
LOCK_PREF="/tmp/bgfifo.lock."
LOCK="$LOCK_PREF$1"
while true; do
    ## if no message was received, skip to the
    ## next iteration
    read line < $FIFO
    if [[ -z $line ]]; then
        continue
    fi
    ## message was received, just print it
    echo "$$ -> $line"
    rmdir $LOCK 2>/dev/null
done

Redis & Bash version

Redis provides multiple data structures, and lists are one of them. For lists, it provides blocking and non-blocking operations that act on the head of the list as well as the tail of the list. Using these primitives, we're going to implement the reliable queue pattern 8 according to the Redis documentation. I've also bumped into an error about connections 9 but was able to overcome it.

So in this instance we have 2 queues, q1 and q2. The producer puts messages on q1, and the consumer atomically pops an element off of q1 and puts it onto q2. After it's been processed by the consumer, the message is deleted from q2. The second queue (q2) is used to recover from failures (network problems or consumer crashes). If messages are equipped with a timestamp, their age can be measured, and if they sit in q2 for too long they can be transferred back to q1 to be re-processed (by a consumer).

Flowchart

Producer

#!/bin/bash
REDIS_CLI="redis-cli -h 127.0.0.1"
n=1
nmax=1000
q1="queue"
q2="processing"

clean() {
    echo "DEL $q1" | $REDIS_CLI
    echo "DEL $q2" | $REDIS_CLI
}

produce() {
    while (($n <= $nmax)); do
        MSG="message $n"
        echo "LPUSH $q1 \"$MSG\"" | $REDIS_CLI
        n=$((n+1))
    done
}

clean
produce

Consumer

#!/bin/bash
REDIS_CLI="redis-cli -h 127.0.0.1"
q1="queue"
q2="processing"
# redis nil reply
nil=$(echo -n -e '\r\n')

consume() {
    while true; do
        # move message to processing queue
        MSG=$(echo "RPOPLPUSH $q1 $q2" | $REDIS_CLI)
        if [[ -z "$MSG" ]]; then
            break
        fi
        # processing message
        echo "$MSG"
        # remove message from processing queue
        echo "LREM $q2 1 \"$MSG\"" | $REDIS_CLI >/dev/null
    done
}

consume
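
The recovery path described earlier (moving stale messages from q2 back onto q1) isn't implemented in the consumer above. A rough sketch of what it could look like, assuming the producer were changed to prefix each message with an epoch timestamp:

#!/bin/bash
# Hypothetical recovery script: requeue anything that has been sitting in
# the processing queue for more than 60 seconds, assuming messages look
# like "<epoch-timestamp> message <n>".
REDIS_CLI="redis-cli -h 127.0.0.1"
q1="queue"
q2="processing"
NOW=$(date +%s)
echo "LRANGE $q2 0 -1" | $REDIS_CLI | while read -r MSG; do
    TS=${MSG%% *}
    if (( NOW - TS > 60 )); then
        echo "LREM $q2 1 \"$MSG\"" | $REDIS_CLI >/dev/null
        echo "LPUSH $q1 \"$MSG\"" | $REDIS_CLI >/dev/null
    fi
done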

Perl version using message queues

In this section we use a SysV message queue. The producer will write to it and the consumers will read from it.

The maximum size 10 of a message in a message queue is 8192 bytes and the total size of the message queue is 16384 bytes (although these limits can be changed, but these are the default values).

The advantage is that when a consumer successfully retrieves a message from the message queue, the message is removed from the queue; there's no risk of retrieving just part of a message. In addition, the queue has a maximum capacity 10 and calls to msgsnd will block until there is enough room in the queue to write more messages.
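
On Linux these defaults can be inspected (and, as root, raised) through procfs; for example:

# Default SysV message queue limits on Linux.
cat /proc/sys/kernel/msgmax    # max size of a single message, typically 8192 bytes
cat /proc/sys/kernel/msgmnb    # max total size of a queue, typically 16384 bytes
ipcs -q                        # list the message queues currently on the system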

Producer

#!/usr/bin/env perl
## This is a basic Producer
use strict;
use warnings;
use Carp;
use IPC::SysV qw(
    IPC_PRIVATE IPC_RMID IPC_CREAT S_IRWXU S_IRUSR
    S_IWUSR ftok IPC_STAT IPC_PRIVATE MSG_NOERROR);

# flush output
$| = 1;

my $queue_file = "/tmp/queue1";

# create file (will be used to generate the IPC key)
open(my $fh,'>',$queue_file);
close($fh);

# use file to generate an IPC key value
my $msgkey = ftok($queue_file);

# check if the IPC key is defined
if(!defined $msgkey) {
    croak "couldn't generate IPC key value";
};

# create the message queue
my $ipc_id = msgget( $msgkey, IPC_CREAT | S_IRUSR | S_IWUSR );

my $n = 0;

$SIG{INT} = sub {
    print "Last n=$n\n";
    exit 0;
};
# start sending messages
while(1) {
    my $mtype = 1;
    my $buffer_size = 200;
    $n++;
    my $buffer = "message $n";
    my $msg = pack('V V a200', $mtype, $buffer_size, $buffer);
    msgsnd( $ipc_id, $msg, 0);
};

#sleep 30;
# dispose of the queue file
#unlink $queue_file;

Consumer

#!/usr/bin/env perl
## This is a basic Consumer
use strict;
use warnings;
use Carp;
use IPC::SysV qw(
    IPC_PRIVATE IPC_RMID IPC_CREAT S_IRWXU S_IRUSR
    S_IWUSR ftok IPC_STAT IPC_PRIVATE MSG_NOERROR IPC_NOWAIT);
use Errno qw(:POSIX);
use Time::HiRes qw(usleep);
# flush output
$| = 1;

my $queue_file = "/tmp/queue1";

# create file (will be used to generate the IPC key)
open(my $fh,'>',$queue_file);
close($fh);

# use file to generate an IPC key value
my $msgkey = ftok($queue_file);

# check if the IPC key is defined
if(!defined $msgkey) {
    croak "couldn't generate IPC key value";
};

# create the message queue
my $ipc_id = msgget( $msgkey, IPC_CREAT | S_IRUSR | S_IWUSR );

my $qempty_tries_max = 1000;
my $qempty_tries = $qempty_tries_max;


# start receiving messages
while(1) {
    my $msg;
    # read raw message from queue
    #
    # IPC_NOWAIT will cause msgrcv to not block and return immediately
    # with ENOMSG if there are no messages of that type in the message
    # queue.
    my $bytes_recv = msgrcv($ipc_id, $msg, 208, 0, IPC_NOWAIT);
    if($!{ENOMSG}) {
        $qempty_tries--;
        if($qempty_tries == 0) {
            # exit loop because we've exhausted the number of tries
            last;
        };
        # sleep 1% of a second (we're basically polling for
        # a new message on the queue. we give up if no message
        # is found and we exhaust the number of tries)
        usleep(1_000_000/100);
    } else {
        # refill tries if a message was present in the queue
        $qempty_tries = $qempty_tries_max;
    };

    # skip (no bytes received)
    next if $bytes_recv == 0;

    # split the message according to its format
    my ($mtype,$buffer_size,$buffer) = unpack("V V a200", $msg);

    print "$buffer\n";
};
warn '[dbg] Queue is empty';

# dispose of the queue file
#unlink $queue_file;

Using ipcmd

This will be very similar to the Perl version above. We're still using a SysV message queue, except now we're using Bash and a command-line interface to SysV. The interface is not as complete (though it does provide msgget, msgrcv, msgsnd and ftok), but it's enough for what's needed here. The manpage for ipcmd was quite good too, abounding in examples.

Producer

#!/bin/bash
ipcmd="/home/user/sources/ipcmd/bin/ipcmd"
ftok="$ipcmd ftok"
msgget="$ipcmd msgget"
msgsnd="$ipcmd msgsnd"
msgrcv="$ipcmd msgrcv"
queue_file="/tmp/queue.ipcmd"

producer() {
    touch $queue_file
    msgkey=$($ftok $queue_file)
    ipc_id=$($msgget -Q "$msgkey")
    n=1
    while true; do
        msg="message $n"
        echo "$msg" | $msgsnd -q "$ipc_id"
        n=$((n+1))
    done
}

producer

Consumer

#!/bin/bash
ipcmd="/home/user/sources/ipcmd/bin/ipcmd"
ftok="$ipcmd ftok"
msgget="$ipcmd msgget"
msgsnd="$ipcmd msgsnd"
msgrcv="$ipcmd msgrcv"
queue_file="/tmp/queue.ipcmd"

consumer() {
    touch $queue_file
    msgkey=$($ftok $queue_file)
    echo $msgkey
    # using -e will only retrieve the id of the queue
    # instead of attempting to create a new one
    ipc_id=$($msgget -e -Q "$msgkey")
    echo $ipc_id
    while true; do
        msg=$($msgrcv -q "$ipc_id")
        echo $msg
    done
}

consumer

Conclusion

Using a fifo for anything other than a 1-producer, 1-consumer situation will cause problems. Initially I was interested in using a fifo as a cheap, minimalistic message queue. Upon realizing that this was not viable, I decided to look at some other options.

We haven't looked at situations with multiple producers. Below there's a table that summarizes the specifics of each version described in this blog post.

version          uses 3rd party software   drops messages   deadlock   distributed   max message size
fifo v1          [ ]                       [X]              [ ]        [ ]           ≤ 4096 bytes 3
fifo v2          [ ]                       [ ]              [X]        [ ]           ≤ 4096 bytes
fifo v3          [ ]                       [ ]              [ ]        [ ]           ≤ 4096 bytes
perl & sysv mq   [ ]                       [ ]              [ ]        [ ]           ≤ 8192 bytes 10
bash & ipcmd     [X]                       [ ]              [ ]        [ ]           ≤ 8192 bytes
bash & redis     [X]                       [ ]              [ ]        [X]           ≤ 512 MB 11

Footnotes:

3

http://stackoverflow.com/a/34016872/827519

#define PIPE_BUF        4096	/* # bytes in atomic write to a pipe */
4

Having multiple readers on a single fifo is also mentioned in Beej's networking tutorial:

Finally, what happens if you have multiple readers? Well, strange things happen. Sometimes one of the readers get everything. Sometimes it alternates between readers. Why do you want to have multiple readers, anyway?

6

The fifo manpage describes how it blocks until both ends are opened:

The kernel maintains exactly one pipe object for each FIFO special file that is opened by at least one process. The FIFO must be opened on both ends (reading and writing) before data can be passed. Normally, opening the FIFO blocks until the other end is opened also.

9

this error is caused by the kernel not cleaning up TCP connections fast enough (see issue 340 in the redis project)

Could not connect to Redis at 127.0.0.1:6379: Cannot assign requested address

February 05, 2016 05:00 AM

January 30, 2016

Oslandia

Tempus V2.0: routing faster, better, stronger !


We are pleased to announce the release of a new version of our route planning framework: Tempus



Tempus is an Open Source C++/Python framework that lets you use, develop and test route planning algorithms, with a particular focus on multimodal routing, where every possible transport mode is taken into consideration for a trip: private cars, public transport, shared vehicles, bikes, etc.


This new 2.0 version results from a partnership with Mappy and focuses on performance improvements and scalability. It brings the following major changes:

  • use of modern C++11 constructs for better code
  • a new plugin that implements the Contraction Hierarchies algorithm on road networks
  • more flexibility on graph data structures: specialized graph types can exist, in addition to the multimodal graph
  • a smaller memory footprint of in-memory structures thanks to optimisations and use of serialization to avoid memory fragmentation
  • continuous integration with Travis CI and Appveyor for stronger code

The contraction hierarchies algorithm makes it possible to answer road itinerary requests very quickly, thanks to an efficient preprocessing phase in which shortcuts are added to the graph. In conjunction with the newly improved data structures, it is now possible to process requests on country- or continental-sized networks.



Tempus is on GitHub, get it, fork it, PR it !


Developments are still going on to improve state-of-the-art algorithms. Please let us know if you are interested in development, training or consulting around these pieces of technology. Do not hesitate to contact us at infos@oslandia.com .

by Hugo Mercier at January 30, 2016 09:00 AM

January 29, 2016

Postgres OnLine Journal (Leo Hsu, Regina Obe)

An almost idiot's guide to install PostgreSQL 9.5, PostGIS 2.2 and pgRouting 2.1.0 with Yum

If you already have a working PostgreSQL 9.5 install, and just want to skip to relevant sections, follow this list:

As a general note, these instructions are what I did for CentOS 7. For lower versions there are some differences in the packages you'll get. For example, if you are currently installing on CentOS 6 (and, I presume, by extension the rest of the 6 family), you won't get SFCGAL and might get pgRouting 2.0 (instead of 2.1).


Continue reading "An almost idiot's guide to install PostgreSQL 9.5, PostGIS 2.2 and pgRouting 2.1.0 with Yum"

by Leo Hsu and Regina Obe (nospam@example.com) at January 29, 2016 08:21 AM

January 28, 2016

Boston GIS (Regina Obe, Leo Hsu)

pgRouting: A PRACTICAL GUIDE now on sale

Our upcoming book pgRouting: A PRACTICAL GUIDE (publisher LocatePress) just went on sale a couple of days ago. We currently have a little over 100 pages of content. We expect the final published book to weigh in at a little over 250 pages.

pgRouting is a PostgreSQL extension that also utilizes PostGIS for solving routing problems. Routing problems involve applying costs to network paths as well as considering the parties involved, such as the kinds of vehicles, people, packages, etc. that will be carried across these networks. While routing is often applied to physical networks like roads and bike paths, the fundamentals of routing go beyond the physical. Spatial concepts such as edges and nodes are a convenient way of visualizing costs across even a logical network.

If you purchase now, you'll get the following benefits:

  • Significant discount off the eBook final price. Currently 50% off
  • Our current preview draft and updates to the draft as we update it
  • Input via our discussion page of what you'd like to see covered. That could be specific functions or kinds of examples or just some clarity.

The focus of this book is pgRouting 2.1+. It's really hard to talk about pgRouting without giving some air time to PostGIS and PostgreSQL, which are some of the key foundations of pgRouting. As such, expect to see a lot of talk of PostGIS and PostgreSQL in this book. Since pgRouting 2.0 is so different from pgRouting 2.1, we had to make a difficult decision to focus on the future rather than the past. The appendix does cover some of how things were done in pgRouting 2.0 and how they changed in pgRouting 2.1+. You can expect the appendix discussion to be expanded. For the core of the book we've largely abandoned any mention of pgRouting 2.0, though pgRouting 2.1 is for the most part backward compatible with pgRouting 2.0 if you choose to be limited by the pgRouting 2.0 function signatures.

We've been pushing package maintainers to focus on releasing pgRouting 2.1 rather than sticking with pgRouting 2.0 and also encouraging maintainers who haven't carried pgRouting before, to start doing so. The PostGIS 2.2 windows bundle comes packaged with pgRouting 2.1 for example. Yes these are selfish reasons, but FOSS is inherently a selfish scratch your own itch kind of pursuit, and that's a good thing since all our common self-interests come together to make something very special that will change the world.

We are planning for the book to be released in another 3-4 months. I know we had planned to release around now, but as happens with every book we've written so far, as we get deeper into it, we realize that keeping it baking just a little longer is so much better than rushing through the process and outputting a half-baked pie. Although pgRouting 2.2 is still being heavily worked on, the development version is almost feature complete and slated for release in the next three to four months. As such we plan to cover some new features coming in pgRouting 2.2 and hope pgRouting 2.2 is out just before or around the same time as this book.

by Regina Obe (nospam@example.com) at January 28, 2016 03:55 PM

Stephen Mather

parallel processing in PDAL

Frankfurt Airport tunnel.JPG
“Frankfurt Airport tunnel” by Peter Isotalo, own work. Licensed under CC BY-SA 3.0 via Commons.

In my ongoing quest to process all the LiDAR data for Pennsylvania and Ohio into one gigantic usable dataset, I finally had to break down and learn how to do parallel processing in BASH. Yes, I still need to jump on the Python band wagon (the wagon is even long in the tooth, if we choose to mix metaphors), but BASH makes me soooo happy.

So, in a previous post, I wanted to process relative height in a point cloud. By relative height, I mean height relative to ground. PDAL has a nice utility for this, and it’s pretty easy to use, if you get PDAL installed successfully.

pdal translate 55001640PAN.las 55001640PAN_height.bpf height --writers.bpf.output_dims="X,Y,Z,Height";

Installing PDAL is not too easy, so I used the dockerized version of PDAL and it worked great. Problem is, the dockerized version complicates my commands on the command line, especially if I want to run it on a bunch of files.

Naturally, the next step is to run it on a whole bunch of LiDAR files. For that I need a little control script which I called pdal_height.sh, and then I need to run that in a “for” loop.

#!/bin/bash
# Get the pathname from the input value
pathname="${1%/*}";

# Get the short name of the file, sans path and sans extension
name=`basename $1 .las`

docker run -v $pathname:/data pdal/master pdal translate //data/"$name".las //data/"$name"_height.bpf height --writers.bpf.output_dims="X,Y,Z,Intensity,ReturnNumber,NumberOfReturns,ScanDirectionFlag,EdgeOfFlightLine,Classification,ScanAngleRank,UserData,PointSourceId";

Now a basic for loop will take care of sending the las files into our pdal_height.sh, thus looping through all available las files:

for OUTPUT in $(ls *.las); do ~/./pdal_height.sh $OUTPUT; done;

This is great, but I calculated it would take 13 days to complete on my 58366 LiDAR files. We’re talking approximately 41,000 square miles of non-water areas for Ohio, and approximately 44,000 square miles of non-water areas for Pennsylvania. I’m on no particular timeline, but I’m not really that patient. A quick duckduckgo search later, and I remembered the GNU Parallel project. It’s wicked easy to use for this use case.

ls *.las | parallel -j24 ~/./pdal_height.sh

How simple! First we list our las files, then we “pipe” them as a list to parallel, telling parallel to spawn 24 independent processes using that list as the input for our pdal_height script.

Now we can run it on 24 cores simultaneously. Sadly, I have slow disks😦 so really I ran it like this:

ls *.las | parallel -j6 ~/./pdal_height.sh

Time to upgrade my disks! Finally, I want to process my entire LiDAR dataset irrespective of location. For this, we use the find command, name the starting directory location, and tell it we want to search based on name.

find /home/myspecialuser/LiDAR/ -name "*.las" | parallel -j6 ~/./pdal_height.sh

Estimated completion time: 2 days. I can live with that until I get better disks. Only one problem, I should make sure this doesn’t stop if my network drops for any reason. Let’s wrap this in nohup which will prevent network-based hangups:

nohup sh -c 'find /home/myspecialuser/LiDAR -name "*.las" | parallel -j6 ~/./pdal_height.sh {}'
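
If the run does get interrupted anyway, GNU Parallel can pick up where it left off. A small variation on the command above, using parallel's job log (log file name is just an example):

# --joblog records what finished; --resume skips those entries on a re-run.
nohup sh -c 'find /home/myspecialuser/LiDAR -name "*.las" | parallel -j6 --joblog height.joblog --resume ~/./pdal_height.sh {}'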

by smathermather at January 28, 2016 01:00 PM

January 27, 2016

Stephen Mather

wget for downloading boatloads of data

My current project to create a complete dataset of airborne LiDAR data for Ohio and Pennsylvania has been teaching me some simple, albeit really powerful tricks. We’ll just discuss one today — recursive use of wget. This allows us to download entire trees of web sites to mirror, or in our case download all the data. Additionally, wget works on ftp trees as well, with no real change in syntax.

wget -r ftp://pamap.pasda.psu.edu/pamap_lidar/cycle1/LAS/


I recognize that curl can be a more powerful utility than wget, but for dead simple problems like this, a dead simple tool is more than adequate.
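
A few extra wget flags are worth knowing for runs like this; a sketch against the same ftp tree:

# -c resumes partially downloaded files, -np keeps wget from wandering up
# the tree, and -A restricts the recursion to LAS files.
wget -r -c -np -A "*.las" ftp://pamap.pasda.psu.edu/pamap_lidar/cycle1/LAS/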


by smathermather at January 27, 2016 01:04 PM

January 25, 2016

Oslandia

Meilleurs vœux 2016 ! Best wishes for 2016 !


The whole Oslandia team wishes you the very best for 2016 in all your projects.


The past year has been rich and intense, and Oslandia keeps strengthening its team. With new collaborators, new experience and expertise, and new projects, 2016 is off to a good start.


For this new year, we are offering you, as of now, the release of Tempus 2.0, the multimodal route planner that we develop together with IFSTTAR. On the menu: "Contraction Hierarchies", and other improvements enabling high-performance route computation. The announcement will be published shortly.


This is only a beginning: in 2016 our collaboration with IGN will help move open source 3D solutions forward. We are still actively involved in QGIS development, and version 3 should see the light of day this year. On the water-management side, progress on qWAT and qgis-epanet is already on the agenda.


We look forward to collaborating with you on innovative and motivating projects. Do not hesitate to contact us for any need in consulting, development or training!


See you very soon,

The Oslandia team


For any information or request: infos@oslandia.com


Happy new year 2016


The whole team at Oslandia wishes you all the best for 2016 in your projects.


Last year has been rich and intense, and Oslandia keeps reinforcing its team. With new collaborators, new experiences and skills, and new projects, 2016 makes a good start.


For this new year, we offer you a new release of Tempus, our multimodal routing engine developped in collaboration with IFSTTAR : version 2.0 has just seen light. It features "Contraction Hierarchies" and other improvements for high performance routing. Stay tuned, official release announcement soon !


This is just the beginning. In 2016 our collaboration with French IGN will allow us to improve OpenSource 3D solutions. We still contribute actively to QGIS, and version 3.0 should be out later this year. As for the water management domain, qWAT and qgis-epanet progress is already planned.


We will be really pleased to work with you on innovative and motivating projects. Do not hesitate to contact us for any need in consulting, development or training !


À très bientôt,

Oslandia Team

For any information : infos@oslandia.com

by Vincent Picavet at January 25, 2016 11:00 AM

January 24, 2016

Boston GIS (Regina Obe, Leo Hsu)

Early look at Upcoming PostGIS 2.3

PostGIS 2.3 is still in its incubation, but I suspect it will be even more impressive than PostGIS 2.2. PostGIS 2.2 was long in the oven, and a lot came out of it; I'm hoping PostGIS 2.3 will have a shorter gestation period but be just as impressive as, if not more impressive than, PostGIS 2.2.

Before starting my "what's already there and what I hope to be there" list for PostGIS 2.3, I'll point out things that PostGIS 2.2 brought

PostGIS 2.2 offerings

Many of these are itemized in the PostGIS 2.2 new features list

  • KNN for geography and 3D geometries
  • Geometry clustering functions
  • True distance KNN for PostgreSQL 9.5+. If you are using 9.5, your KNN is not just bounding box distance, but true real distance.
  • Faster topology, now a lot has been moved into C for better performance.
  • Lots more 3D advanced functions via SFCGAL, SFCGAL now installable as extension.
  • Lots of new geometry functions like geometry clustering, geometry clipping, subdivide, trajectory functions
  • More raster stats and create overview table function without having to rely on raster2pgsql
  • Tiger geocoder updated for Tiger 2015 data. Address standardizer now part of PostGIS.
  • Last but not least, we added Dan Baston to our Core Contributor list. He gave us all the clustering functions in PostGIS 2.2, fixed lots of bugs and already has more commits and proposed plans than anyone else for PostGIS 2.3. Go Dan go, and welcome.

Continue reading "Early look at Upcoming PostGIS 2.3"

by Regina Obe (nospam@example.com) at January 24, 2016 10:32 AM

January 23, 2016

UpStats blog (Stefan Petrea)

E-commerce solutions market research

Introduction

Many business owners hire freelance programmers to help them integrate, configure and set up online ecommerce websites. But there are at least 3 key questions that need to be answered along the way:

  • Which ecommerce solution should I use ?
  • Which payment gateway/service should I use ?
  • How should I promote/advertise my ecommerce website ?

This blogpost will focus on answering those questions using data from 336,067 freelance job ads collected over the course of 1 year and 2 months.

Ecommerce platforms

You can split ecommerce platforms into two categories:

  • SaaS solutions
  • selfhosted

The first type, SaaS solutions, are centrally hosted; they can be less flexible and customizable, but you have the advantage that someone else is taking care of the hosting, reliability and infrastructure. Some examples of SaaS solutions are shopify and bigcommerce.

The selfhosted solutions are very flexible: you can customize/modify/tune/integrate them if you're experienced with that particular solution, but you also have to take care of installation, maintenance, monitoring, security and scalability. Some examples of selfhosted solutions are magento, prestashop or woocommerce.

Another aspect is whether you're able to export the data. Some SaaS solutions make it harder to export data than others (it may be that at some point you'll want to migrate from a SaaS to a selfhosted solution).

WITH classes(label, re1, re2) AS (
    VALUES
    ('magento'      , 'magento'                              , NULL       ),
    ('prestashop'   , 'prestashop|presta shop|presta-shop'   , NULL       ),
    ('woocommerce'  , 'woocommerce|woo-commerce|woo commerce', NULL       ),
    ('bigcommerce'  , 'bigcommerce'                          , NULL       ),
    ('shopify'      , 'shopify'                              , NULL       ),
    ('zencart'      , 'zen cart|zencart'                     , NULL       ),
    ('sharetribe'   , 'sharetribe'                           , NULL       ),
    ('oscommerce'   , 'oscommerce'                           , NULL       ),
    ('nopcommerce'  , 'nopcommerce'                          , NULL       ),
    ('lemonstand'   , 'lemonstand.com'                       , NULL       ),
    ('squarespace'  , 'squarespace'                          , 'commerce' ),
    ('nopcommerce'  , 'nopcommerce'                          , NULL       ),
    ('volusion'     , 'volusion'                             , NULL       ),
    ('wix'          , 'www.wix.com| wix.com| wix '           , NULL       ),
    ('3dcart'       , '3dcart'                               , NULL       ),
    ('shoprocket'   , 'shoprocket'                           , NULL       ),
    ('jigoshop'     , 'jigoshop'                             , NULL       ),
    ('ecwid'        , 'ecwid'                                , NULL       ),
    ('pimcore'      , 'pimcore'                              , NULL       )
), jobs AS (
    SELECT 
    job_id,
    LOWER(job_title || ' ' || snippet) AS jdata
    FROM odesk_job
)
SELECT
c.label AS ecommerce_solution, COUNT(j.job_id) AS cnt
FROM classes c
LEFT JOIN jobs j ON (
    j.jdata ~ ('(' || c.re1 || ')') AND
    (
     CASE WHEN c.re2 IS NULL THEN true
     ELSE (j.jdata ~ ('(' || c.re2 || ')'))
     END
    )
)
GROUP BY c.label
ORDER BY cnt DESC;

Ecommerce marketing/advertising solutions

WITH classes(label, re1, re2) AS (
    VALUES
    ('google adwords'       , 'adwords|google adwords|google.com/adwords|adwords.com'                                 , NULL       ),
    ('facebook ads'         , 'ads manager|facebook.com/business|facebook ads'                                        , NULL       ),
    ('google merchant'      , 'google merchant|google.com/retail/merchant'                                            , NULL       ),
    ('pinterest ads'        , 'ads.pinterest.com|pinterest ads'                                                       , NULL       ),
    ('bing ads'             , 'bingads|bing ads'                                                                      , NULL       ),
    ('twitter ads'          , 'twitter ads|ads.twitter.com'                                                           , NULL       ),
    ('amazon advertising'   , 'amazon sponsored links|advertising.amazon.com|amazon ppc|amazon ads|amazon advertising', NULL       )
), jobs AS (
    SELECT 
    job_id,
    LOWER(job_title || ' ' || snippet) AS jdata
    FROM odesk_job
    WHERE LOWER(job_title || ' ' || snippet) ~ '(commerce|webstore|web store|online shop|[^a-zA-Z]eshop|[^a-zA-Z]e-shop)'
)
SELECT
c.label AS advertising_solution, COUNT(j.job_id) AS cnt
FROM classes c
LEFT JOIN jobs j ON (
    j.jdata ~ ('(' || c.re1 || ')') AND
    (
     CASE WHEN c.re2 IS NULL THEN true
     ELSE (j.jdata ~ ('(' || c.re2 || ')'))
     END
    )
)
GROUP BY c.label
ORDER BY cnt DESC;

Ecommerce payment solutions

WITH classes(label, re1, re2) AS (
    VALUES
    ('skrill'             , 'skrill|moneybookers'                   , NULL),
    ('payoneer'           , 'payoneer'                              , NULL),
    ('amazon'             , 'amazon payments|payments.amazon.com'   , NULL),
    ('paypal'             , 'paypal'                                , NULL),
    ('google wallet'      , 'google wallet|google.com/wallet'       , NULL),
    ('ukash'              , 'ukash'                                 , NULL),
    ('intuit'             , 'intuit payments|payments.intuit.com'   , NULL),
    ('2checkout'          , '2checkout'                             , NULL),
    ('square'             , 'square'                                , NULL),
    ('stripe'             , 'stripe'                                , NULL),
    ('network merchants'  , ' nmi.com|network merchants'            , NULL)
), jobs AS (
    SELECT 
    job_id,
    LOWER(job_title || ' ' || snippet) AS jdata
    FROM odesk_job
    WHERE LOWER(job_title || ' ' || snippet) ~ '(payment)'
)
SELECT
c.label AS payment_solution, COUNT(j.job_id) AS cnt
FROM classes c
LEFT JOIN jobs j ON (
    j.jdata ~ ('(' || c.re1 || ')') AND
    (
     CASE WHEN c.re2 IS NULL THEN true
     ELSE (j.jdata ~ ('(' || c.re2 || ')'))
     END
    )
)
GROUP BY c.label
ORDER BY cnt DESC;

Conclusion

At the time of writing this, it seems that magento + google adwords + paypal is a good combination of solutions to use when building an ecommerce website.

January 23, 2016 05:00 AM

January 20, 2016

Stephen Mather

Point Clouds – the (re)start of a journey

If you have followed this blog for a while, you will notice a continual returning to and refinement of ideas and topics. That’s how the blog started, and the role it has served as a touchstone in my exploration of topics is critical to how I use it. I hope it is useful to you too, as a reader.

So, let’s talk about point clouds again. Point clouds are like ordinary geographic points — they have X, Y, and Z (typically), but there are a whole lot more of them than usual. How many more? Well, instead of thinking in the hundreds of thousands, we think more along the lines of hundreds of millions or billions of points.

The other thing we need to recognize about point clouds is that they may be n-dimensional — a LiDAR point cloud may have X, Y, Z, intensity, return number, class, etc..

Good Free and Open Source Software (FOSS) tools for dealing with point clouds include a spectrum of projects, including PDAL (Point Data Abstraction Library), and the Pointcloud extension for PostgreSQL (with PostGIS integration, of course).

Previously, I was attempting to process large amounts of LiDAR point clouds in order to model bird habitat. Now that I have the requisite storage, and PDAL has a height calculator, I am ready to dive back into this.

Patch heights showing vegetation and open areas
Isometric 3D visualization of 3D chipping
Height cloud overlaid with patches

In the meantime, there are some cool things to check in this space. Let’s start with this presentation by Michael Smith on PDAL from FOSS4G in Seoul last year.


Want to see some of the incredible things you can do in the browser? Look no further than here:


This goes deeper than simple vector tile approaches and employs level-of-detail optimizations that are necessary for a use case this large. More on this later.

postscript 1/20/2016:

Looks like I outed the above project… . Apologies to https://hobu.co for this. For the record and from their site:

Hobu is a team of three software engineers located in Iowa City, Iowa and Cambridge, Iowa. We have been on the forefront of open source LiDAR software for over five years, and we have been building open source GIS software for even longer. We provide open source lidar software systems development, open source GIS software development in addition to contract research and development, systems design, evaluation and implementation.

Please contact us about any of the open source software for which we provide support.

We are the primary support organization for the following libraries:

  • PDAL — Point Data Abstraction Library
  • plas.io — WebGL point cloud rendering
  • Greyhound — Point cloud data streaming
  • libLAS — ASPRS LAS read/write

Welp, there we go, an unintentional public launch. Awesome work though… .


by smathermather at January 20, 2016 01:39 AM

January 17, 2016

UpStats blog (Stefan Petrea)

The automatic hiring of distributed teams

Intro

I'm reading a lot of startup articles/blogposts these days. Many of them describe difficulties in finding the right people and in building cost-effective teams. That is what this blogpost will focus on.

The problem of selecting candidates to form a team has been studied more thoroughly 1 and has been modeled in different ways.

This blogpost will focus on finding optimal teams while at the same time complying with some constraints depending on the needs for that particular team.

The scenarios described are those where a business owner wants to staff a project according to business needs and automatically select the initial team members from a talent pool of qualified candidates.

Overview

Throughout this blogpost we'll first build a candidate pool (using a PostgreSQL query) and then use the GLPK modeling language (GNU Linear Programming Kit) to find an optimal team while taking into consideration the constraints and goals/objectives for that particular situation.

The SQL query will find the right candidates for the team and it will write its output in a CSV file. The CSV file will then be used by the GLPK model to compute an optimal team.
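
Tying the two steps together only takes a couple of shell commands. A sketch with hypothetical database and file names (the .sql file is assumed to hold one of the pool queries below, which writes its CSV via \o):

# Build the candidate pool CSV, then solve for the optimal team.
psql -d upwork -f team-solver-1-pool.sql   # hypothetical names; produces team-solver-1-exp.csv
glpsol --math team-solver-1.mod            # reads the CSV and prints the chosen team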

Below, there's a diagram describing the composition of each type of team we'll cover

Web app team (minimizing the cost, and allow multiple roles for one team member)

In this example, we're aiming to build a team comprised of the following roles:

  • backend
  • frontend
  • web designer
  • sysadmin

In this particular case we want to minimize the cost. Some of the candidates that make their way into the solution might fill both roles (which they actually do when these programs are run on real data), such as one candidate might fill the roles {frontend,backend} , which sounds a lot like fullstack, or similarly, a candidate might fill the roles {sysadmin,backend} which is close to the definition of devops.

Building the candidate pool

\f ','
\a
\pset footer off
\o '../modules/team-solver-3-exp.csv'
WITH candidates AS (
    SELECT *
    FROM odesk_freelancer
    WHERE tzoffset IS NOT NULL AND bill_rate > 0
), backend AS (
    SELECT
    DISTINCT ON (f.freelancer_id, bill_rate)
    f.freelancer_id,
    f.bill_rate AS rate,
    'backend'::text AS role,
    tzoffset
    FROM candidates f
    JOIN odesk_job_feedback jf ON f.freelancer_id = jf.freelancer_id
    JOIN (
        SELECT
        j.*
        FROM odesk_job j
        JOIN odesk_search_job sj ON sj.job_id = j.job_id
        WHERE tsv_basic @@ to_tsquery('backend & developer & (php | nodejs | python)')
    ) j ON jf.job_id = j.job_id
    ORDER BY bill_rate
    LIMIT 400
), frontend AS (
    SELECT
    DISTINCT ON (f.freelancer_id, bill_rate)
    f.freelancer_id,
    f.bill_rate AS rate,
    'frontend'::text AS role,
    tzoffset
    FROM candidates f
    JOIN odesk_job_feedback jf ON f.freelancer_id = jf.freelancer_id
    JOIN (
        SELECT
        j.*
        FROM odesk_job j
        JOIN odesk_search_job sj ON sj.job_id = j.job_id
        WHERE tsv_basic @@ to_tsquery('frontend & developer & (javascript | angular | backbone | reactjs)')
    ) j ON jf.job_id = j.job_id
    ORDER BY bill_rate
    LIMIT 400
), designer AS (
    SELECT
    DISTINCT ON (f.freelancer_id, bill_rate)
    f.freelancer_id,
    f.bill_rate AS rate,
    'design'::text AS role,
    tzoffset
    FROM candidates f
    JOIN odesk_job_feedback jf ON f.freelancer_id = jf.freelancer_id
    JOIN (
        SELECT
        j.*
        FROM odesk_job j
        JOIN odesk_search_job sj ON sj.job_id = j.job_id
        WHERE tsv_basic @@ to_tsquery('web & designer & html & (css | photoshop | illustrator)')
    ) j ON jf.job_id = j.job_id
    ORDER BY bill_rate
    LIMIT 400
), sysadmin AS (
    SELECT
    DISTINCT ON (f.freelancer_id, bill_rate)
    f.freelancer_id,
    f.bill_rate AS rate,
    'sysadmin'::text AS role,
    tzoffset
    FROM candidates f
    JOIN odesk_job_feedback jf ON f.freelancer_id = jf.freelancer_id
    JOIN (
        SELECT
        j.*
        FROM odesk_job j
        JOIN odesk_search_job sj ON sj.job_id = j.job_id
        WHERE tsv_basic @@ to_tsquery('(linux | centos | redhat | debian | ubuntu) & server & (sysadmin | (system & admin) | (system & administrator) | (system & administration))')
    ) j ON jf.job_id = j.job_id
    ORDER BY bill_rate
    LIMIT 400
)
SELECT
freelancer_id,
rate,
role,
tzoffset
FROM (
    SELECT * FROM frontend
    UNION ALL
    SELECT * FROM backend
    UNION ALL
    SELECT * FROM designer
    UNION ALL
    SELECT * FROM sysadmin
) x;
\o

Finding the optimal team

## run with
##    glpsol --math team-solver-1.mod
##

param budget;
set F, dimen 4;
set SKILLS;
table data IN "CSV" "team-solver-3-exp.csv": F <- [freelancer_id,rate,role,tzoffset];

/* Indices */
set I := setof{(f,r,o,t) in F} f;

/* Assignment */
var a{i in I}, binary;

/*
   Create another set of tuples containing only the freelancer_id and the rate.
   Each freelancer will only appear once (since this is a set), and this is important
   because although he may have multiple roles, his rate is only added once
   to the total cost.
*/ 
set B := setof{(f,r,o,t) in F} (f,r);

/* The set B is used(instead of F) in order to avoid the wrong cost by adding a freelancer's rate twice
   because he has two skills */
minimize cost: sum {i in I, (f,r) in B: i == f} r * a[i];

/* 
   Constraint 1: cover all skills 
   (each skill is covered by at least one freelancer)
*/

s.t. c1 {s in SKILLS}:
	(sum {(f,r,o,t) in F: s == o} a[f]) >= 1;

/* Constraint 2: stay within budget */
s.t. c8 :
	sum{(f,r,o,t) in F, i in I: i == f} r * a[f] <= budget;

solve;

printf "Freelancers picked:\n";

# go over freelancers that have at least one skill active
# and print their skills
for {i in I: a[i] == 1} {
   printf "%s ", i;
   printf "[";
   printf {(f,r,o,t) in F: f == i} " %s", o;
   printf " ]";
   printf "\n";
}
printf "\n";

printf "Optimal cost:\n";
printf "%.2f", cost;
printf "\n";

data;

# project budget
param budget := 100;
set SKILLS := 
	frontend
	backend
	sysadmin
	design;

end;

Web app team (team members have separate roles, the cost is minimized)

This will be a similar situation to the one described above. The difference here, is we'll have the following roles:

  • 2 x backend
  • 2 x frontend
  • 2 x web designer
  • 1 x sysadmin
  • 1 x project manager

Here are the constraints we'll use here:

  • one candidate in the solution covers exactly one role (we won't allow for candidates to cover multiple roles like we did in the previous case)
  • the total cost(the sum of the hourly rates for each member of the team) is capped by the budget
  • we want all the team members to be within ± 2h of UTC+00 so they can attend a daily meeting
  • we want the total cost of the team minimized

Building the candidate pool

\f ','
\a
\pset footer off
\o 'team-solver-1-exp.csv'
WITH candidates AS (
    SELECT *
    FROM odesk_freelancer
    WHERE tzoffset IS NOT NULL AND bill_rate > 0
), backend AS (
    SELECT
    f.freelancer_id,
    f.bill_rate AS rate,
    'backend'::text AS role,
    tzoffset
    FROM candidates f
    JOIN odesk_job_feedback jf ON f.freelancer_id = jf.freelancer_id
    JOIN (
        SELECT
        j.*
        FROM odesk_job j
        JOIN odesk_search_job sj ON sj.job_id = j.job_id
        WHERE tsv_basic @@ to_tsquery('backend & developer & (php | nodejs | python)')
    ) j ON jf.job_id = j.job_id
    ORDER BY bill_rate
    LIMIT 400
), frontend AS (
    SELECT
    f.freelancer_id,
    f.bill_rate AS rate,
    'frontend'::text AS role,
    tzoffset
    FROM candidates f
    JOIN odesk_job_feedback jf ON f.freelancer_id = jf.freelancer_id
    JOIN (
        SELECT
        j.*
        FROM odesk_job j
        JOIN odesk_search_job sj ON sj.job_id = j.job_id
        WHERE tsv_basic @@ to_tsquery('frontend & developer & javascript')
    ) j ON jf.job_id = j.job_id
    ORDER BY bill_rate
    LIMIT 400
), designer AS (
    SELECT
    f.freelancer_id,
    f.bill_rate AS rate,
    'design'::text AS role,
    tzoffset
    FROM candidates f
    JOIN odesk_job_feedback jf ON f.freelancer_id = jf.freelancer_id
    JOIN (
        SELECT
        j.*
        FROM odesk_job j
        JOIN odesk_search_job sj ON sj.job_id = j.job_id
        WHERE tsv_basic @@ to_tsquery('web & designer & css & html')
    ) j ON jf.job_id = j.job_id
    ORDER BY bill_rate
    LIMIT 400
), sysadmin AS (
    SELECT
    f.freelancer_id,
    f.bill_rate AS rate,
    'sysadmin'::text AS role,
    tzoffset
    FROM candidates f
    JOIN odesk_job_feedback jf ON f.freelancer_id = jf.freelancer_id
    JOIN (
        SELECT
        j.*
        FROM odesk_job j
        JOIN odesk_search_job sj ON sj.job_id = j.job_id
        WHERE tsv_basic @@ to_tsquery('(linux | centos | redhat | debian | ubuntu) & server & (sysadmin | (system & admin) | (system & administrator) | (system & administration))')
    ) j ON jf.job_id = j.job_id
    ORDER BY bill_rate
    LIMIT 400
), project_manager AS (
    SELECT
    f.freelancer_id,
    f.bill_rate AS rate,
    'project_manager'::text AS role,
    tzoffset
    FROM candidates f
    JOIN odesk_job_feedback jf ON f.freelancer_id = jf.freelancer_id
    JOIN (
        SELECT
        j.*
        FROM odesk_job j
        JOIN odesk_search_job sj ON sj.job_id = j.job_id
        WHERE tsv_basic @@ to_tsquery('(manager | management) & team & scrum')
    ) j ON jf.job_id = j.job_id
    ORDER BY bill_rate
    LIMIT 400
)
SELECT
-- row_number() over () AS freelancer_id,
freelancer_id,
rate,
role,
tzoffset
FROM (
    SELECT * FROM frontend
    UNION
    SELECT * FROM backend
    UNION
    SELECT * FROM designer
    UNION
    SELECT * FROM sysadmin
    UNION
    SELECT * FROM project_manager
) x;

Finding the optimal team

## run with
##    glpsol --math team-solver-1.mod
##
## use-case: typical web development team

param budget;
param tz_max;
param tz_ref;
set F, dimen 4;
table data IN "CSV" "team-solver-1-exp.csv": F <- [freelancer_id,rate,role,tzoffset];


# indices
set I := setof{(f,r,o,t) in F} f;

# assignment
var a{I}, binary;



minimize cost:
	sum{(f,r,o,t) in F} r * a[f];



## two backend
s.t. c1 :
	sum{(f,r,'backend',t) in F} a[f] = 2;

## two frontend
s.t. c2 :
	sum{(f,r,'frontend',t) in F} a[f] = 2;

## two designers
s.t. c3 :
	sum{(f,r,'design',t) in F} a[f] = 2;

## one sysadmin
s.t. c4 :
	sum{(f,r,'sysadmin',t) in F} a[f] = 1;

## one project manager
s.t. c5 :
	sum{(f,r,'project_manager',t) in F} a[f] = 1;

## timezone constraints
s.t. c6 {(f,r,o,t) in F}: (t - tz_ref) * a[f] <= tz_max;
s.t. c7 {(f,r,o,t) in F}: (tz_ref - t) * a[f] <= tz_max;

## stay within budget
s.t. c8 :
	sum{(f,r,o,t) in F} r * a[f] <= budget;


solve;

printf "Freelancers picked:\n";
printf {(f,r,o,t) in F: a[f] == 1} " %s", f;
# printf {(f,r,o,t) in F: a[f] == 1} " %i", f;
printf "\n";

printf "Optimal cost:\n";
printf "%.2f", cost;
printf "\n";

data;

# project budget
param budget := 100;
param tz_max := 7200;
param tz_ref := 0;

end;

Sysadmin team (covering a 24h cycle)

This is a similar situation, except now we want to select sysadmins to take care of the site reliability/operations area of a business.

If the product is global and used by everyone, then it needs support throughout the 24-hour day.

This is usually modeled by allocating team members to work in shifts to cover parts of the 24 hour business day, but we won't do that.

Since this is a distributed team, the actual advantage here is that people can work their local 09:00-17:00 while still covering the business need of offering support for the product across the whole 24h cycle.

So we split the 24h day into 3 intervals relative to UTC:

  • 00:00-08:00
  • 08:00-16:00
  • 16:00-24:00

After this, we'll find 2 sysadmins whose local 09:00 is 00:00 UTC, 2 whose local 09:00 is 08:00 UTC and 2 whose local 09:00 is 16:00 UTC. We will keep this constraint while minimizing the total cost of the team.

Building the candidate pool

\f ','
\a
\pset footer off
\o 'team-solver-2-exp.csv'
WITH candidates AS (
    SELECT *
    FROM odesk_freelancer
    WHERE tzoffset IS NOT NULL AND bill_rate > 0
), sysadmin AS (
    SELECT
    DISTINCT ON (f.freelancer_id, bill_rate)
    f.freelancer_id,
    f.bill_rate AS rate,
    'sysadmin'::text AS role,
    -- their local 09:00 in UTC as seconds elapsed since 00:00 UTC
    -- (the start of their workday)
    (((9 * 3600) - tzoffset + (24 * 3600)) % (24 * 3600)) AS utc_start_hour
    FROM candidates f
    JOIN odesk_job_feedback jf ON f.freelancer_id = jf.freelancer_id
    JOIN (
        SELECT
        j.*
        FROM odesk_job j
        JOIN odesk_search_job sj ON sj.job_id = j.job_id
        WHERE tsv_basic @@ to_tsquery('(linux | centos | redhat | debian | ubuntu) & server & (sysadmin | (system & admin) | (system & administrator) | (system & administration))')
    ) j ON jf.job_id = j.job_id
    ORDER BY bill_rate, freelancer_id
    LIMIT 400
)
SELECT
freelancer_id,
rate,
role,
utc_start_hour AS start_hour
FROM sysadmin;
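
As a worked example of the utc_start_hour formula above: for a hypothetical candidate at UTC+05:30, tzoffset is 19800 seconds, so their local 09:00 falls at 03:30 UTC:

# Worked example of the utc_start_hour formula from the query above.
tzoffset=19800
echo $(( (9*3600 - tzoffset + 24*3600) % (24*3600) ))   # prints 12600, i.e. 03:30 UTC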

Finding the optimal team

## Run with
##    glpsol --math team-solver-1.mod
##
## use-case: sysadmin covering the 24hour business day
##
## timezone offsets to cover
## intervals
## - 00:00-08:00
## - 08:00-16:00
## - 16:00-24:00
##
##  (the respective timezone offsets are 0, 28800, 57600 for the start
##  of each interval)
##
## These are intervals in GMT+0 so they will have to be adapted to the
## freelancer's schedule (which would like to work 09:00-17:00 on his
## local timezone)
##
## The solution will be comprised of 3 groups, each covering a third 
## of the 24-hour cycle.
##
## Exactly two sysadmins will be in each group.
##

param budget;
param tz_delta;
set OFFSETS;
set F, dimen 4;
table data IN "CSV" "team-solver-2-exp.csv": F <- [freelancer_id,rate,role,start_hour];

/* freelancer indices */
set I := setof{(f,r,o,t) in F} f;

/* Assignment */
var a{i in I, t in OFFSETS}, binary;

minimize cost:
	sum{(f,r,o,t) in F, s in OFFSETS} r * a[f,s];


/* Constraint 1: each group needs exactly 2 sysadmins */
s.t. c1 {t in OFFSETS}:
	sum {i in I} a[i,t] = 2;

/* Constraint 2: each sysadmin is assigned to at most one of the 3 groups */
s.t. c2 {i in I}:
	sum {s in OFFSETS} a[i,s] <= 1;

/*
   Constraint 3: a sysadmin is a viable candidate for an interval
   if his timezone matches that interval.

   We want to ensure that |t-s| < tz_delta
*/

# s - t <= tz_delta
s.t. c3 {(f,r,o,t) in F, s in OFFSETS}: 
	 a[f,s] * (s - t) <= tz_delta;

# t - s <= tz_delta
s.t. c4 {(f,r,o,t) in F, s in OFFSETS}: 
	 a[f,s] * (t - s) <= tz_delta;

solve;

printf "Sysadmins picked:\n";
printf {(f,r,o,t) in F, s in OFFSETS: a[f,s] == 1} "%i => %s \n", s, f;
printf "\n";


printf "Optimal cost:\n";
printf "%.2f", cost;
printf "\n";


data;
param budget := 100; # project budget
set OFFSETS  := 0 28800 57600;
param tz_delta := 3600;

end;

Conclusion

The same type of reasoning can be adapted to build other types of teams too:

  • game development teams
  • global call-center/customer support teams
  • teams developing mobile apps

Since the profiles are acquired from an online job market, we have several other features for each profile:

  • the candidate's rating by previous clients he's worked with
  • the candidate's total work hours, and work hours spent on certain types of projects
  • the location (city/country) of the candidate
  • the candidate's spoken language
  • the candidate's availability (20-30h/week, 40h/week)

We could also have added constraints using these features and fine-tuned our search in various other ways.

While searching for the best team, the fullstack and devops roles arise naturally because of certain constraints. We've looked at some ways of building distributed teams automatically using candidate profiles from a database.

Footnotes:

1

There's an interesting paper I came across on arxiv that describes a hiring game and relates it to the set cover problem. In that same paper, a greedy set-cover approximation algorithm is presented. According to wikipedia, the optimization version of the set-cover problem is NP-hard.

The objective in the paper is to minimize the number of candidates, but at the same time there's a discussion about what the candidate brings to the table and whether he really is worth hiring (this is expressed in the paper using a utility function).

January 17, 2016 05:00 AM