Trafi-opendata

API and web interface to Finnish open vehicle registration data

Live version: vehicledata.fi

Source: github.com/ppar/trafi-opendata

Contact: github.com/ppar


Contents

What

JSON API and a web interface for accessing open vehicle registration data released by http://www.trafi.fi/, the Finnish Transport Safety Agencey.

Why

There hasn't been a practical interface available for this data, and it looked like there could be some demand in the hobbyist community for it. Also, I wanted to experiment with Node.js, MongoDB and some frontend frameworks.

Implementation

The JSON API runs on Node.js and Express.js and provides a simple read-only interface to a (currently) single 5+M row table stored in MariaDB/MySQL.

There is experimental support for using a MongoDB backend instead of MySQL (with Mongoose), but MongoDB proved to have unacceptable performance early on during development.

The Web interface uses React as the base framework, jQuery and the wonderful jQueryQueryBuilder plugin for widget-level things and Bootstrap for managing the layout.

Python scripts are used to sanitize and import the CSV data provided by Trafi.

Performance

MariaDB/MySQL

When imported into MariaDB/MySQL, with all columns indexed and MongoDB's silly foobar_UPPER columns still in place, the binary InnoDB table takes only a "modest" 6 GB on disk. Without the junk columns, the 800 MB of original raw CSV translates to a ~3.4 GB InnoDB table, indexes included.

Trivial queries hitting indexes and/or queries where an indexed column significantly reduces the # of seekable rows (like selecting a specific car make) perform quite well. Substring queries causing long disk seeks are still painfully slow.

Pages with high numbers also load slowly - is it a slow LIMIT ... OFFSET ... implementation or just data absent from IO cache? Using a memory based engine for the whole R/O database should also give nice overall performance.

At any rate, the database should be normalized, at least over the "make" column, to reduce size.

MongoDB

Importing the ~ 800 MB CSV file consumes ~ 8 GB in MongoDB without indexes. With single-column indexes, the database bloats up to 12-14 GB.

Even with indexes in place, queries like

{ $and: [ { prop0: 'FOO' }, { prop0: 'BAR' } ] }

use plain seeking instead of index + seek and take forever to execute. To add insult to injury, MongoDB 2.6 can't explain() queries without executing them first. MongoDB 3.x would seem to have a better explain() feature without this problem.

MongoDB can't do case-insensitive searches over indexed fields; you have to resort to tricks like shadow foobar_UPPER columns instead, which adds to database bloat and obviously reduces disk cache efficiency.

Data Set

Trafi publishes quarterly releases of anonymized open data on registered land vehicles, boats and ships. This project initially focuses on vehicle data.

Current original data releases are available at

Vehicle data has been released since 2014. Release history is not available on Trafi's website, but thanks to archive.org, can be found in /trafi-data-releases.txt in the GitHub repo.

Issues

The provided data set presents some difficulties, such as:

Inconsistent data

Vehicle make and model are freetext fields.

'O's are spelled as zeroes, duplicates are rife, etc.

As a result of individual officials typing (and mistyping) entries into various registers over the course of decades, there is no consistent way to identify vehicle brands and models in the data.

There are vehicles with 4000+ liter engine displacements because someone forgot to move the commma

Varying dates

Earlier registrations only contain the registration year, some entries contain no date fields, others a full date

Lack of temporarily decommissioned vehicles

Only vehicles with "commissioned for road use" status are incuded in Trafi's quarterly data dump.

It's common for owners of hobbyist vehicles to temporarily decommission their cars when they're not being used in order to save on insurance and taxes, so each released data set contains only a semi-random snapshot of the vehicles that, in fact, do see regular use. This is even more tragic for someone more interested in 1957 DeSotos than the annual rates of Nissan vs. VW registrations

Continuously merging all released data sets from Trafi would allow us to slightly improve coverage, but this is far from perfect. Because the data is anonymized, a method of identifying individual vehicles across datasets and avoiding duplicates would need to be developed.

Ideally, Trafi should release more complete data sets: one with all known vehicles, and another one with all vehicles that have been road-commissioned e.g. within the last five years.

TODO

License

Code

TBD. It's Open Source. I'm a coder, Jim, not a lawyer.

Metadata

The files metadata.json, www/public/js/columns.json and all files under schema/ are derived from the "Description of data content" document (17629-avoin_data_ajoneuvojen_luokitukset.xls), released by Trafi.fi on 8 th of April 2016 as part of "Open data for vehicles 4.5" at http://www.trafi.fi/en/services/open_data under the "Creative Commons Attribution 4.0 International (CC BY 4.0)" license: http://creativecommons.org/licenses/by/4.0/deed.en