[Repost] Making GTFS query more convenient

url:http://ontrakinfo.wordpress.com/2012/10/29/making-gtfs-query-more-convenient/

This says exactly what I have been thinking.

I have been spending a lot of time parsing GTFS data. On the surface it is just a set of simple CSV files. But extracting useful information from GTFS is often unexpectedly difficult. For example, finding the stops of a bus line in sequential order might sound like a basic thing to do. But it is actually non-trivial with GTFS.

One reason is that transit service is more complex than it seems. It might seem that a bus service just hits all the stops in sequence. But the actual service has a lot of variables. The schedule is often different on weekends compared to weekdays, and so is the exact route that it covers. Sometimes a bus is scheduled to run a short route rather than covering the whole length. In more complex cases there can be branching, where there is a common main trunk and then the buses split to serve two or more alternative destinations.

This is the reason why in GTFS one “route” may be associated with multiple “shapes”. To find out what shapes are associated with a route, we have to make a query like this:

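-- Distinct shapes used by the trips of the routes.
SELECT
 trips.shape_id
FROM routes
 JOIN trips ON trips.route_id = routes.route_id
 JOIN shapes ON shapes.shape_id = trips.shape_id
GROUP BY trips.shape_id;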
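
As a rough illustration of this query, the sketch below loads a tiny, made-up feed into an in-memory SQLite database with Python's standard sqlite3 module and lists the shapes per route. The table and column names follow the GTFS files routes.txt and trips.txt; the route, trip and shape identifiers are invented for the example, and the helper name shapes_by_route is hypothetical. Because trips already carries shape_id, the sketch skips the shapes table.

```python
import sqlite3


def shapes_by_route(conn):
    """Return {route_id: list of distinct shape_ids}, read from trips."""
    rows = conn.execute(
        """
        SELECT routes.route_id, trips.shape_id
        FROM routes
         JOIN trips ON trips.route_id = routes.route_id
        GROUP BY routes.route_id, trips.shape_id
        ORDER BY routes.route_id, trips.shape_id
        """
    ).fetchall()
    result = {}
    for route_id, shape_id in rows:
        result.setdefault(route_id, []).append(shape_id)
    return result


# Toy data, invented for illustration: route "38" has a full-length
# pattern and a short-turn pattern, route "5" has a single pattern.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE routes (route_id TEXT PRIMARY KEY, route_short_name TEXT);
    CREATE TABLE trips (trip_id TEXT PRIMARY KEY, route_id TEXT, shape_id TEXT);
    INSERT INTO routes VALUES ('38', '38'), ('5', '5');
    INSERT INTO trips VALUES
      ('t1', '38', '38_full'), ('t2', '38', '38_full'),
      ('t3', '38', '38_short'), ('t4', '5', '5_main');
    """
)
print(shapes_by_route(conn))
```

Grouping by both route_id and shape_id keeps the route-to-shape association in the result, which is what an application usually wants to cache.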

Finding the stops is even more complex. Here we need to join one more table, stop_times. It is also the biggest table in GTFS, so this is also the most computationally intensive query.

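-- Distinct stops served under each shape.
SELECT
 trips.shape_id, stop_times.stop_id
FROM routes
 JOIN trips ON trips.route_id = routes.route_id
 JOIN stop_times ON stop_times.trip_id = trips.trip_id
 JOIN stops ON stops.stop_id = stop_times.stop_id
GROUP BY trips.shape_id, stop_times.stop_id;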
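
The grouped query gives the set of stops per shape but not their order. One possible way to get the stops in sequential order is sketched below with Python's sqlite3 module. It assumes that all trips sharing a shape_id serve the same stop pattern, which GTFS does not guarantee, so it simply takes one representative trip per shape and sorts by stop_sequence. The toy data and the helper name ordered_stops are invented for the example.

```python
import sqlite3


def ordered_stops(conn, shape_id):
    """Stop ids of one representative trip of the shape, in travel order."""
    rows = conn.execute(
        """
        SELECT stop_times.stop_id
        FROM stop_times
        WHERE stop_times.trip_id = (
            -- Assumption: any trip of the shape is representative.
            SELECT MIN(trip_id) FROM trips WHERE shape_id = ?
        )
        ORDER BY stop_times.stop_sequence
        """,
        (shape_id,),
    ).fetchall()
    return [stop_id for (stop_id,) in rows]


# Toy data, invented for illustration: a full-length pattern A-B-C-D
# (two trips) and a short-turn pattern B-C (one trip).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE trips (trip_id TEXT PRIMARY KEY, route_id TEXT, shape_id TEXT);
    CREATE TABLE stop_times (trip_id TEXT, stop_sequence INTEGER, stop_id TEXT);
    INSERT INTO trips VALUES
      ('t1', '38', '38_full'), ('t2', '38', '38_full'), ('t3', '38', '38_short');
    INSERT INTO stop_times VALUES
      ('t1', 1, 'A'), ('t1', 2, 'B'), ('t1', 3, 'C'), ('t1', 4, 'D'),
      ('t2', 1, 'A'), ('t2', 2, 'B'), ('t2', 3, 'C'), ('t2', 4, 'D'),
      ('t3', 1, 'B'), ('t3', 2, 'C');
    """
)
print(ordered_stops(conn, "38_full"))
print(ordered_stops(conn, "38_short"))
```

If trips of the same shape can skip stops in a real feed, the representative-trip shortcut would need to be replaced by something stricter, such as grouping trips by their full stop sequence.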

Still, most people have a clear concept of what a transit line is and where it runs. It shouldn’t be such a pain to compute. A more useful structure would look like the one below.

    GTFS             More Useful
  Structure           Structure

    route              line
     |                   |
     |                   V
     |                 route*
     |                   |
     |    shape          |  +-> route_shape
     |     ^             |  |
     |    /              |  +-> route_stops*
     |   /               |
     V  /                V
    trips              trips
     |                   |
     |        stops      |          stops
     |        ^          |
     |       /           |
     V      /            V
    stop_times         stop_times

Here I shift the terminology a bit. The top-level entity is a line (i.e. GTFS’ route). This is the service that people know, like a numbered bus line or a metro line. Below that are routes: the collection of alternative routes a line may run. Routes are not explicitly represented in GTFS, but you can find them by querying all unique shape_id values using the first SQL. Another missing piece is the stops. If we pre-compute all the route_stops once using the second SQL, for the most part we don’t need the giant stop_times table. For applications that do not deal with scheduled times, this is a huge saving. There is one assumption my structure makes, though: different lines do not share the same route. It should be a reasonable assumption. And if two lines do indeed share a route and shape, we should just replicate them as two separate entities.
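
A minimal sketch of this pre-computation, using Python's sqlite3 module, might look like the following. The table names route and route_stops come from the diagram; the column names (line_id, shape_id, stop_sequence, stop_id), the helper names and the toy data are assumptions made for the example. It also assumes that one trip per shape is representative of that route's stops. After the two tables are built, stop_times is dropped to show that the lookup no longer depends on it.

```python
import sqlite3


def build_route_tables(conn):
    """Precompute route and route_stops from the raw GTFS tables."""
    conn.executescript(
        """
        -- One row per (line, shape): the alternative routes of each line.
        CREATE TABLE route AS
        SELECT DISTINCT route_id AS line_id, shape_id
        FROM trips;

        -- Stops of each route in travel order, taken from one
        -- representative trip per shape (an assumption of this sketch).
        CREATE TABLE route_stops AS
        SELECT trips.shape_id, stop_times.stop_sequence, stop_times.stop_id
        FROM trips
         JOIN stop_times ON stop_times.trip_id = trips.trip_id
        WHERE trips.trip_id IN (
            SELECT MIN(trip_id) FROM trips GROUP BY shape_id
        );
        """
    )


def stops_on_line(conn, line_id):
    """List (shape_id, stop_id) pairs of a line without touching stop_times."""
    return conn.execute(
        """
        SELECT route.shape_id, route_stops.stop_id
        FROM route
         JOIN route_stops ON route_stops.shape_id = route.shape_id
        WHERE route.line_id = ?
        ORDER BY route.shape_id, route_stops.stop_sequence
        """,
        (line_id,),
    ).fetchall()


# Toy data, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE trips (trip_id TEXT PRIMARY KEY, route_id TEXT, shape_id TEXT);
    CREATE TABLE stop_times (trip_id TEXT, stop_sequence INTEGER, stop_id TEXT);
    INSERT INTO trips VALUES
      ('t1', '38', '38_full'), ('t2', '38', '38_full'), ('t3', '38', '38_short');
    INSERT INTO stop_times VALUES
      ('t1', 1, 'A'), ('t1', 2, 'B'), ('t1', 3, 'C'), ('t1', 4, 'D'),
      ('t2', 1, 'A'), ('t2', 2, 'B'), ('t2', 3, 'C'), ('t2', 4, 'D'),
      ('t3', 1, 'B'), ('t3', 2, 'C');
    """
)
build_route_tables(conn)
conn.execute("DROP TABLE stop_times")  # no longer needed for this lookup
print(stops_on_line(conn, "38"))
```

Because route_stops holds one row per stop per route rather than one per stop per trip, it stays small even when stop_times has many trips per day.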

The original GTFS structure seems to take a transit-operator-centric view. It allows operators maximum flexibility to author and publish their service data. But for application developers, it is not structured for easy traversal. Adding the route and route_stops tables as indicated would greatly facilitate querying and operating on transit information.

Time: 2024-10-15 14:09:59
