The growing number of navigation service users provides many opportunities to improve driving conditions by making better use of the road network. However, collecting large quantities of such data can be difficult due to both cost and privacy concerns. TomTom collects large amounts of anonymized traffic data while respecting users’ right to privacy: https://www.tomtom.com/en_gb/privacy/. This allows us to build statistical descriptions from which we can generate a large number of synthetic trips with properties similar to real data, but without the associated issues. In this post we’ll explore our approach to creating such a generator.

We’ll start by defining some terms that will be used throughout this post:

- **Point:** A location (latitude and longitude), a speed and a time stamp.
- **Trip:** A sequence of points from one origin to one destination with a fixed interval between consecutive points.
- **Trace:** A sequence of one or more trips recorded by the same device, all with the same time interval between points.
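To make these definitions concrete, here is a minimal sketch of how the three terms could be modelled. The class and field names, and the choice of seconds for the time stamp, are assumptions made for illustration; they are not taken from the generator itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Point:
    lat: float        # latitude in degrees
    lon: float        # longitude in degrees
    speed: float      # unit is not stated in this post; any consistent unit works
    timestamp: float  # assumed here to be seconds since an arbitrary epoch


@dataclass
class Trip:
    points: List[Point]

    def has_fixed_interval(self) -> bool:
        """True if every pair of consecutive points is the same time apart."""
        gaps = {b.timestamp - a.timestamp
                for a, b in zip(self.points, self.points[1:])}
        return len(gaps) <= 1


@dataclass
class Trace:
    trips: List[Trip]  # all recorded by the same device
```

The `has_fixed_interval` check only mirrors the "fixed interval between consecutive points" part of the Trip definition.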

Each trace is generated from a few parameters sampled from the following distributions:

- **Trip origin:** A 2D histogram that divides the map into cells where trips start.
- **Trip destination:** A 2D histogram that divides the map into cells where trips end.
- **Trace begin:** year, month, weekday and minute of day.
- **Number of trips per trace.**
- **Minutes between consecutive trips.**
- **Trip air line** (or great-circle) distance.
- **Seconds between consecutive points.**
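As an illustration of what sampling from one of the 2D histograms might look like, the sketch below picks a cell with probability proportional to its trip count and then a random location inside that cell. The grid layout (rows running south to north, square cells of `cell_deg` degrees) and the function names are assumptions for this example, not a description of the actual implementation.

```python
import random


def sample_cell(histogram, rng):
    """Pick a (row, col) cell with probability proportional to its count."""
    cells = [(r, c)
             for r, row in enumerate(histogram)
             for c in range(len(row))]
    weights = [count for row in histogram for count in row]
    return rng.choices(cells, weights=weights, k=1)[0]


def cell_to_location(cell, south, west, cell_deg, rng):
    """Pick a uniformly random (lat, lon) inside the chosen cell.

    Assumes row 0 starts at latitude `south`, column 0 at longitude `west`,
    and that every cell is `cell_deg` degrees wide and high.
    """
    r, c = cell
    lat = south + (r + rng.random()) * cell_deg
    lon = west + (c + rng.random()) * cell_deg
    return lat, lon
```

The one-dimensional distributions (number of trips per trace, seconds between points, and so on) can be sampled the same way, with bins in place of cells.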

We start generating a trace by sampling the number of trips in the trace and the seconds between points from their distributions. The first trip’s starting time and location are also sampled from the appropriate distributions, and the trip is then simulated.

The process used to generate a single trip is illustrated in Figure 1. Given a node on the map as the origin of the trip, we sample from the **trip destination** and the **air line distance** distributions to choose a destination for the trip. The **trip simulator** then uses the map to compute the fastest route between the two nodes and creates points along it at a fixed interval, defined by the parameter **seconds between points**. The **trip simulator** is also responsible for simulating noise in the coordinates. The output is a list of points, similar to Figure 2.
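The point-creation step can be sketched as follows. This is a simplified illustration, not the trip simulator itself: it assumes the route has already been computed and is given as a list of `(lat, lon, seconds_from_start)` vertices, interpolates linearly between them, and models coordinate noise as independent Gaussian offsets in degrees (the post does not describe the actual noise model). Speed is left out for brevity.

```python
import random


def simulate_points(route, seconds_between_points,
                    start_time=0.0, noise_deg=0.0, rng=None):
    """Emit (lat, lon, timestamp) tuples along a route at a fixed interval.

    route: at least two (lat, lon, seconds_from_start) vertices, sorted by time.
    noise_deg: standard deviation of the assumed Gaussian coordinate noise.
    """
    rng = rng or random.Random()
    points = []
    total = route[-1][2]
    seg = 0
    t = 0.0
    while t <= total:
        # Advance to the segment that contains time t.
        while seg < len(route) - 2 and route[seg + 1][2] < t:
            seg += 1
        lat0, lon0, t0 = route[seg]
        lat1, lon1, t1 = route[seg + 1]
        frac = 0.0 if t1 == t0 else (t - t0) / (t1 - t0)
        lat = lat0 + frac * (lat1 - lat0) + rng.gauss(0.0, noise_deg)
        lon = lon0 + frac * (lon1 - lon0) + rng.gauss(0.0, noise_deg)
        points.append((lat, lon, start_time + t))
        t += seconds_between_points
    return points
```

For example, a 20-second route sampled every 5 seconds yields five points, the first at the origin and the last at the destination.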

The points are then appended to the end of the trace, which aggregates the points of all its trips. The process is then repeated for the remaining trips; however, both the start node and the start time are defined by the end of the previous trip. When all trips have been generated, the trace is represented as RDF (Turtle) and the next trace can be generated.
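The chaining of trips into a trace can be sketched like this. The two callables are hypothetical stand-ins for the trip simulator and for sampling the **minutes between consecutive trips** distribution, and the rule that the next trip starts at the previous end time plus the sampled gap is an assumption about how that distribution is applied. The RDF serialization is not shown.

```python
def generate_trace(num_trips, first_origin, first_start,
                   simulate_trip, sample_gap_minutes):
    """Build one trace as a flat list of (lat, lon, timestamp) points.

    simulate_trip(origin, start_time) -> non-empty list of points (hypothetical)
    sample_gap_minutes() -> minutes to wait before the next trip (hypothetical)
    """
    trace = []
    origin, start = first_origin, first_start
    for _ in range(num_trips):
        points = simulate_trip(origin, start)
        trace.extend(points)
        # The next trip begins where, and after, the previous one ended.
        last_lat, last_lon, last_time = points[-1]
        origin = (last_lat, last_lon)
        start = last_time + 60 * sample_gap_minutes()
    return trace
```

Passing the simulator and the sampler in as functions keeps the chaining logic independent of the map and of the distributions, which makes it easy to try out with stubs.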

As we can see in Figure 3, mimicking traffic with this statistical approach produces densities quite similar to those of the real data. Further analysis and explanation of the generator and its results can be found in [1].

[1] Bösche K., Sellam T., Pirk H., Beier R., Mieth P., Manegold S. (2013) Scalable Generation of Synthetic GPS Traces with Real-Life Data Characteristics. In: Nambiar R., Poess M. (eds) Selected Topics in Performance Evaluation and Benchmarking. TPCTC 2012. Lecture Notes in Computer Science, vol 7755. Springer, Berlin, Heidelberg.
