Wednesday 22 August 2012

OSM should not have a database

OpenStreetMap should not have a database. Having a database positively hurts the whole community. There, I've said it and I feel better already.

Where there is a database there are nerdy types who want to normalise it, formalise the ontology of it, rationalise it, enhance its performance and all the other things computer science bods and other nerds love to do. They do it because their training or their gut feel tells them it must be an improvement that everyone will welcome. They are wrong.

If OSM didn't have a database, it would be easier to explain that we don't do that, and we don't want you to do that either. It would be easier to direct these meddling nuisances to some other project, maybe opendatabasewrangling or openstringuntangling. That way we could keep the freedom to carefully choose the tags we use without the risk that some prat would mass-edit them into his view of an organised world, losing all the detail and nuances carefully placed there by hundreds of other mappers. The really, really annoying thing is that most of these people don't actually consume the data in a useful way; it just seems like a good idea to them. If they did use the data they would quickly see that selecting what you want always takes a little preprocessing, that this preprocessing is easy, and that adding some extras to cope with variety is fairly simple. Write the code once and it works over and over again.
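
As a rough illustration of that kind of preprocessing, here is a minimal Python sketch. Everything in it is an assumption made up for the example: the variant table, the function names and the dict-shaped elements are hypothetical and not taken from any OSM tool. The point is only that the variety is handled on the consumer's side and nothing is edited in OSM.

```python
# Hypothetical consumer-side preprocessing: map tag variants onto the
# categories this one consumer cares about, leaving the OSM data untouched.

# Assumed variant table; the entries are illustrative, not an agreed OSM list.
ROAD_CLASS = {
    "motorway": "major",
    "freeway": "major",
    "expressway": "major",
    "residential": "minor",
}


def road_class(tags):
    """Return this consumer's road class for an element's tags, or None."""
    value = tags.get("highway")
    if value is None:
        return None  # not a highway at all, so not selected
    # Tolerate stray whitespace and capitalisation; unknown values become "other".
    return ROAD_CLASS.get(value.strip().lower(), "other")


def select_roads(elements):
    """Yield (element, road_class) for every element tagged as a highway."""
    for element in elements:
        cls = road_class(element.get("tags", {}))
        if cls is not None:
            yield element, cls
```

Coping with a newly spotted variant then means adding one line to the table, and the original tag is still there for anyone who cares about the difference.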

Of course we need to store the data somewhere and in reality that needs to be a database, such as the one we have. I must make it clear that I'm not criticising the database, its design or the way it is managed or run at all, just the fact that something called a database attracts unwelcome urges from a few people. Maybe we could just stop calling it a database. Can we rename it to the tag-pile, or the OSM toy-box or anything that doesn't convey 'database'?

I wish the people who want to reduce the tags to a prescribed list well - I just wish them well somewhere outside of OSM. If they want an organised, limited list of tags, they can take OSM data and play with it in their world as they want to - just don't upload the changes as a mass-edit back into our toy-box. There is real value in nuanced data and, more importantly, real value in not upsetting the mappers whose carefully chosen tags get squashed to homogenised blandness by these unthinking mass-editors.

What would you rename the place we store our data to?

6 comments:

Gregory Marler said...

For this to work we should rename the tags and values, because the values of this secret-database are human-language words.

The tag 'amenity' will be changed to t1, the tag 'highway' will be renamed to t2, the value 'pub' will be renamed to v1, and so on.
Therefore I map something as t1=v2. Your choice of editor/viewer should deal with translating it into a language you can work with (and not call it a database); languages could be French, British, American, etc. I add a pub in my editor and that gets saved/sent to the OSM tag-pile as t1=v1. When I render a map, it translates t1=v1 into pub.jpg.

I quite like the name of tag-pile.

vdp said...

I... honestly don't understand your point.

People tend to want to normalise data because that makes the data easier to use. The fact that we store data history suggests that we should not do unnecessary edits. Some edits may well be silly or undesirable, but that has nothing to do with how we store the data. The problem would be the same with our current PostgreSQL or with hand-written notes on paper. Incidentally, the current DB schema and API are very flexible and purposefully don't enforce any rules.

As for optimising the code or the way we store the data, I can't see how that would be a bad thing. The various OSM tools are decent, but because of the amount of data that they have to handle, they're definitely no speed demons. I'll take any speedup I'm given (unless of course it restricts what data we can enter or what we can do with the data, but I haven't come across any such issue so far).

In summary, the fact that we have a "database" (you know that any somewhat-organised repository of data is a database, whether PostgreSQL or a text file or a bookshelf, right? I assume you meant "RDBMS"?) is not the problem. That's not what you want to "fix".

Chris Hill said...

@vdp
I'm afraid you may have missed my point, probably I didn't put it well enough.

People assume that because we store OSM data in a database, the data somehow needs to be rationalised, simplified or similarly homogenised. I think this is horribly wrong. There can be real value in the small differences that people use. The people who extract the data, remove these differences and upload the homogenised result as a mass-edit ruin the details. The result of these edits not only loses detail, it upsets the people whose detail was lost. These mappers are our most precious resource and we cannot afford to let a few people scare them away.

As you say, the API is very flexible and doesn't impose many rules. Unfortunately some people make up their own rules and try to impose these on everyone else with mass-edits.

I will continue to speak out against this process.

Yurik said...

@chris, I think you mix two concepts into one. There is "meaning" - what the person intended to describe, and there is the ID of that meaning (the tag). In order for all data consumers to reasonably be able to consume the data, they need to know the list of meanings they need. But there is absolutely no reason why we should have multiple IDs to describe the same meaning. Let's not jump to extremes - most of the time, the difference in tagging is NOT due to a difference in meaning, but simply because a novice user didn't know the proper way to document something.

Chris Hill said...

@Yurik,
This post from four and a half years ago was somewhat tongue-in-cheek. I am in favour of helping beginners (we need them all), I am in favour of sorting out obvious mistakes like typo errors. I am hugely against homogenising the data so there is only one set of tags to cover all things in the whole world. I am not really in favour of using standard highway tags to cover things that just don't fit, such as using highway=motorway in the US but not allowing freeway or expressway for example.

OSM is *not* a computer science project. It is a community. It is really interesting that the part of the community who consume OSM data very rarely shout about needing homogenised data. They quietly write code to consume what they want to use and understand that once written it can be used again and again at almost no cost, without upsetting anyone who chose the tags.

Much of this is not the problem it was in 2012, as editors use presets to 'suggest' what to use, so most people end up using the presets, which by the way do not always agree with the wiki. But then, who does?

Yurik said...

@Chris, I don't agree with two points:
* That homogenizing is the same as restricting - in your example, highway=motorway vs highway=freeway -- if they are different, simply document the difference on the wiki, and everyone will be happy to use whichever value is more appropriate. But both should be documented, so that each editor and consumer has clear guidance on which value to use when, and how to process/draw it. There is a huge list of various religious denominations documented for the "denomination" tag - no one is requiring some church to fit the mold of a pre-set list.
* There is a huge cost to supporting non-homogenized data. I was involved in the efforts to build map support for Wikipedia - we spent an enormous amount of effort handling all the variations of the data, and we only handled the most basic cases. If you look at the size of the Mapnik stylesheet, it will be a good indication of the complexities involved. And that stylesheet is not reusable for any other map, because other maps would have a different audience and different goals, so they would have to start nearly from scratch and handle most of the nuances again. The more homogenized the data, the less effort it will be to process it. But again, not to the detriment of quality - see point #1