Research Reflections: Football

Showing posts with label Football. Show all posts

Tuesday, August 24, 2010

PATAT 2010: Multi-objective Sports (Football) Scheduling

At the recent PATAT (8th International Conference on the Practice and Theory of Automated Timetabling) conference I was fortunate enough to be invited to give a plenary presentation.

My talk focussed on sports scheduling. Indeed, the title was "Scheduling Football (Soccer) Fixtures: Progress Made to Date and Future Challenges". I focussed on the conflicting objectives when trying to minimise travel distances, whilst also trying to reduce pair clashes (which can be considered as local derbies for the sake of this discussion).

This is a classic case of a multi-objective problem where minimising one objective causes the other to increase and vice versa. It is not (usually) possible to minimise both objectives, instead you are looking for a trade off. These are plotted on a pareto front where a user would then decide which trade off solution is the best.

In the plenary talk, I showed that it was possible to reduce both the distance and the pair clashes such that (sometimes) the distance did not significantly increase. This is a potentially useful result as it means that supporters do not have to travel any further (statistically) and the policing costs are reduced as they do not have to police so many local derbies.

I should say that this result is only work in progress at the moment in that it has not been verified by the football authorites or the police, but I would hope that it would be of interest to them.

In the same talk, I also discussed how I collected the data for this work (which essentially is the various distances between football clubs). This involved using Google maps and Multimap APIs. I'll talk about this in the next blog. I'll also provide a link to the paper (as I don't have it to hand at the moment). But, if you are interested the reference is:

Kendall G., McCollum B., Cruz F. and McMullan P. (2010) Scheduling English Football Fixtures: Consideration of Two Conflicting Objectives. In proceedings of the 8th International Conference on the Practice and Theory of Automated Timetabling (PATAT 2010), 11-13 August 2010, Queen’s University, Belfast, UK, pp 1-15

Friday, August 28, 2009

Random Number Generation

We all know that computer generated random numbers are not really random at all, but just pseudo-random. And I know that is a lot of discussion about how best to generate random numbers.

To be honest, I have not taken much notice of this in all the programming that I have done but today I found out that it is important.

Without going into the (boring) details, I have been generating a random number that is either 1 or 2 - but I generate quite a few very quickly.

For anybody that is interested the code is something like

r = rand() % n +1 [where n is the range you are interested in]

The problem is, I was returning the same number every time. Only very occasionally did I get the odd difference.

If I increased the range (say generating numbers between 1 and 20), this made no difference (I always got the same number).

However, if I put in a delay between each call to rand(), this "solved" the problem, but this is not a good solution.

Doing some reading around the subject, I think it is to do with the low/high order bits when generating the random numbers.

But that does not help me generate a decent distribution when making very frequent calls to rand().

I seem to recall that there is a C++ class called RNG (Random Number Generator) available and I am sure that Numerical Recipes in C will have something to say on the issue, but I need to look into these.

Wednesday, August 26, 2009

Football Prediction: A decision to be made

Today I have been working on my research that is investigating if it is possible to predict the outcome of football matches. The measure I will eventually use, to see if the predictions can be considered successful, will include if it can make money at the bookmakers, if it it more successful than other tipsters etc.

One of the functions I have in my system is to be able to generate the league table for a given date. That is, taking into account the fixtures played to date, generate the league table for any point in the season.

I believe that my function is working correctly and today I was carrying out some tests to see if the league tables I generated were:

Correct at the end of the season. That is, taking into account every match played, is my input data correct and does my algorithm process that data correctly.
Does my algorithm, given a date, generate the correct table for that point in the season.

I initially thought that point 2 would be very time consuming to check but I found a very useful web site. http://www.statto.com is not only a very useful web site (for all sorts of reasons) but one of the facilities it offers is to generate a league table for a particular point in the season.

When doing my checks (and there are still a lot more to do), I have found some problems where my generated tables are not correct. This is almost certainly down to my inputting the results incorrectly, so I need to check all those.

However, my checks also highlighted another problem. Actually, I knew this was something that I needed to address but I had not really thought it through.

The problem arises when teams having points taken away. I knew that this happened and I had yet to include it in my system so I was expecting the tables not to match exactly.

However, I had assumed that the points were deducted at the start of the season but this does not seem to be the case. It appears that the points can be deducted at any time in the season.
This is not too much of an issue. It just makes the programming more complex than I had hoped.

The real issue is what do I do when a team has had points deducted?

Let me give you an example. A team has won 3 games and drawn 2. That means that they have received 11 points (you get 3 points for a win and 1 point or a draw). But, if they have had 10 points deducted then they will only have 1 point. This obviously affects their league position. If I am using the league position as one of the contributory factors in my predictive model, is this fair - or should I ignore the points deduction for the purposes of prediction?
On the other hand, their league position, with the points deduction, may affect the way they play, and could be a factor in the prediction.

I'm not quite sure what I am going to do yet.

Monday, August 24, 2009

Football Fixture Scheduling: Are all clashes equal?

In a paper published in JORS:

Kendall G. (2008) Scheduling English Football Fixtures Over Holiday Periods. Journal of the Operational Research Society, 59(6), pages 743-755 (doi:10.1057/palgrave.jors.2602382)

I investigated if it was possible to produce superior fixtures for the Christmas/New Year Period with respect to minimising the distance that is traveled by supporters on two particular days.

One of the issues that the underlying model had to capture was that certain teams could not play at home on the same day. For example:

Manchester United and Manchester City
Liverpool, Everton and Tranmere
Chelsea and Fulham
Colchester, Ipswich and Norwich
etc.

Teams that are paired in this way are called pairs (even though there might be more than two teams involved). A problem arises as it is not possible to eliminate all the pair clashes. That is, some paired teams have to play at home on the same day.

In the JORS paper, all the pair clashes were treated equally (e.g. Manchester United and Manchester City playing at home on the same day is not considered any more, or any less, important than Liverpool and Tranmere playing at home on the same day).

I am not sure if any given pair clash should be considered more (or less) important than any other, but I suspect so.

One of the things I plan to do is analyse the past few seasons fixtures and gauge if certain pair clashes are allowed more than others. Then I will use this evidence to weight the clashes during the process of searching for a good set of fixtures.

Wednesday, August 19, 2009

Football Fixture Scheduling: Another Project

Over the past nine months or so I have been working with a local company who provides an online service to schedule various types of sporting events. They contacted me after seeing my contact details on wikipedia (which was a nice way to be contacted).
The company need to produce double round robin tournaments but they have some factors which are not present in the major English leagues. For example

Teams might share pitches so those two teams cannot play at home on the same day.
There are a number of different leagues and the teams in different leagues might share pitches (see above) but the leagues play in different timeslots. That is, pitches are available in certain timeslots and teams can play in certain timeslots and these are not consistent across divisions.
We want to provide a good spread of games such that repeat fixtures (team i vs team j and team j vs team i) are as far apart as possible.

There followed a series of meetings (at 7:30am for reasons best known to ourselves), which discussed the problem so that we both fully understand the task at hand.

At the moment, I am trying to develop a constructive heuristic and then carry out some additional optimisation with simulated annealing.

It has proven to be a very interesting problem and there is still a lot that I want to do, but I have just about got a constructive heuristic that works (in that it produces a feasible solution).

My focus over the coming week or so (as I have some time to work on the problem) is the following.

Investigate if I can produce a more efficient constructive heuristic. At the moment, it takes many iterations to get a feasible solution and I'd like to do it much more quickly.
Run the local search optimisation (in this case simulated annealing) which will take the feasible solution from the constructive heusristic and try to improve on it. In addition, it will also attempt to spread out the repeat fixtures.
Once I have something that seems worthwhile I want to document the algorithm. This is for the purposes of a potential paper (the holy grail in our world!) and also to discuss the approach with the company.

So, the next couple of weeks should be very interesting (well, at least for me!).

Wednesday, August 12, 2009

MISTA Conference: Plenary Talk (David Hine)

The last plenary at MISTA was given by David Hine. He is a sergeant with the Public Order Unit at New Scotland Yard (in particular Football Intelligence). His talk gave the delegates an overview of the issues and problems he faces in organising resources to police football matches in London, the wider UK and also when fans travel overseas (whether visiting other countries or fans coming to the UK).
He said that he wanted to take us into his world and also take us out of our comfort zone and he certainly did that, with some of the shocking videos that he showed. The feedback I received after his talk is captured by comments such as "an excellent presenter", "one of the best conference talks I have ever seen" and "it really brings home the impact that the schedules have on everybody's life."
I invited Dave to give the talk a few months back (after meeting up with him in London) and I would like to offer my personal thanks to him for taking a couple of days out of his busy schedule (and at the start of the football season) to give the talk to the MISTA audience.

Thursday, July 30, 2009

Football (Soccer) Prediction: Data Collection (#002)

If you have read previous versions of this blog you'll know that a) I have an interest in trying to predict football (soccer) matches and b) I am currently developing a system (as a research project) that I hope to get up and running in the next month or so.

What kept me busy for a lot of last season was collecting the data.

Fixtures
The fixtures were easy to collect, being readily available from various sources. As my basic reference guide I used the Sky Sports Yearbook as this is a piece of published work that will be available to future generations. I could have used may of the web sites that are available but as a scientist we don't really like to rely on web sites as they may not stand the test of time. The fact that web sites are not peer reviewed are another factor we also have to consider.

Results
The results are also easy to collect as they are a matter of public record and are reported in the media and, from a research point of view, we can validate them for years to come (e.g. newspapers, next seasons Sky Sports Yearbook etc.)

Bookmakers Odds
One of the other things I wanted to collect was the bookmakers odds. This proved a lot more challenging. There are a couple of problems. As far as I am aware they are not a matter of public record (at least that stand the test of time). Or, to put it another way, can you go and verify what the odds were for a given match at a given time - just by accessing publicly available information? Importantly, if two (or more) people are independently given the same task will they come back with the same answer? And, is it the correct answer anyway?
The second problem is that odds change over time anyway, with the weight of money that has been bet.

Anyway, over the course of last season, I made regular visits to the bookmakers to pick up their fixed odd coupons and I filed them away as evidence of the odds I was using.

Since carrying out the data collection (particularly the odds) I have discovered a couple of interesting other sources. I believe that the Racing Post (which, importantly, is also published as a daily paper, so is a matter of public record) publishes the best odds available on the fixtures for a given day.
I was also pointed to a web site recently (http://www.football-data.co.uk). This is a very good web site that not only has a lot of information but also has at least seven seasons worth of fixtures data including results and odds information from a selection of bookmakers.

I have checked the odds I collected last season against the ones on this web site and they match up, which is encouraging.
It also provides me with more than just last season to carry out initial testing before I use the system in anger on this season.

The downside of the http://www.football-data.co.uk web site is two-fold.

I don't think they will make their data available until a few hours before the kick off time. This might be a problem for the system that I am developing.
The data is still a web site so, from a research point of view, I should not really cite it as the web site may not be available in 1/10/100 years time.

Please don't take this as a criticism of the web site. They have done (and are doing) a fantastic job of collating all this data and, for this particular research project, will save me HOURS of data collection and data entry time.

Monday, July 27, 2009

Football (Soccer) Prediction: Development Framework (#001)

As the new football (soccer in the USA) season approaches I am trying to get a football prediction system up and running. I think I will struggle to get it ready for the start of the new season (which starts Aug 7th) but that is not so important as this is mostly a research project. In any case, the system I have in mind will take a few weeks before it is usable as I need to get some results posted for the prediction system to work on.

I did a quick check on how much time I have spent so far on the programming. As a rough estimate, I think it is about 100 hours, mostly (if not all) at weekends. I still have a lot to do but I almost have the "football framework" that I need. That is, I can read in the data that I have been collecting, generate a league table for a given date in the season and collate various other statistics that I will eventually need. I also have various data structures that I will "pass around" the prediction part of the system.

I reckon that I need about another 20 hours and then I'll have the framework completed. Then I can start to work on the prediction parts of the system.

One thing that I need to implement is an Artificial Neural Network (ANN). I have one from another project I worked on (stock market forecasting) but I want to re-engineer it. At the moment the ANN is only a feed forward network as it was used in an evolutionary setting. That is, the predictions were evolved rather than a more traditional training mechanism.
One thing lacking in my ANN class (I program in C++) is a back propagation training (BP) mechanism So, apart from tidying up the code, I also want to implement a back propagation method, as this seems one potential way to carry out the prediction.

So I have my work cut out over the coming weeks, but I hope that it will be interesting and, you never know, it might just work.

Wednesday, July 22, 2009

Football Prediction: Follow up

Whilst searching around the net looking for relevant resources for my plan to predict football matches, I came across The Sports Exchange. It looks like a relatively new web site, but seems very nice.

I posted a comment in their blog (about football pools prediction), and received a number of replies. The blog entry can be seen here.

Monday, July 20, 2009

Predicting the Results of Football Matches

I have recently become interested in trying to predict the results of football matches.

The interest grew from wondering what else I could use the data for, that I had collected for generating football fixtures (see JORS paper). The data included the travel distances between all the teams in each division. I also maintained the fixtures that were actually played over the Christmas/New Year period so that I could compare the fixtures I generated against those that were actually played. Needless to say, it took a long time to collect all this data.

As I had gone to all the time and trouble of collecting the data I want to maximise its usage, so I began to look for other uses that I could put it to, and prediction seemed an obvious challenge.

With this in mind, for most of last season I collected additional data that I think might be important in predicting football matches. For example, I have been updating all the scores as each fixture was played. I have also been keeping a record of the odds that bookmakers were offering. Just collecting this data was a large data collection exercise in itself and I certainly did not get the odds on every fixture, but I have enough to be going on with.

I'm not totally sure what I am going to do with this data yet but I know if I don't collect some of it as and when it is available, it becomes almost impossible (or at least a lot more difficult) to collect.

I have started programming some support functions. For example, given a date and a set of results I can generate the relevant league table for that point in the season.

My ultimate goal is to develop a prediction model and test out how good it is on the 2008-2009 season and, if I find a good model I will try and predict the fixtures for the 2009-2010 season before the matches are actually played.

The new season starts quite soon (kick off is 8th August 2009, but there is one match scheduled on the 7th August 2009).

If I am going to have ever chance of testing the prediction model over the coming season, I need to get programming!

Thursday, June 18, 2009

Scheduling Football Fixtures

The fixtures for the English football season have just been published (Wednesday 17th June 2009), ready for the kick off in August. Paul Fletcher's (another blogger) excellent article describes some of the complexities that have to be considered when generating the fixtures.

My research has focused on scheduling the fixtures over the Christmas/New Year period. On Boxing Day and New Years Day every team has to play. If a team plays at home on Boxing Day, they must play away on New Years Day (and vice versa), but they are not allowed to play a reverse fixture (for example, if Chelsea play Liverpool on Boxing Day, Liverpool cannot play Chelsea on New Years Day). There are also other factors which have to be considered (such as the number of fixtures that can take place in London etc.)

One of the most important aspects for the fixtures on Boxing Day and New Years Day fixture is to reduce the amount of traveling for the supporters. If you look at these fixtures for (at least) the past seven seasons you will see that these two days have the lowest travel distances when compared to other dates when all the teams play.

This was the focus of a paper I published in Journal of the Operational Research Society (you can access the paper here, or from the journal itself). The paper shows that it is possible to produce a set of fixtures which involves less travel than the fixtures that were actually used.

There are a couple of caveats.

I have tried to make sure that I have captured all the aspects of the real world problem, but I cannot be totally sure. For example, I consider all pair clashes (see the paper if you are interested in what a pair clash is) as being equal. Perhaps, some should be avoided more than others?
There are some seasons (typically World Cup years) where, over the Christmas and New Year period, every team has to play four matches and they must be sequenced Home, Away, Home, Away (or Away, Home, Away, Home). This makes the scheduling problem a lot more difficult, especially when you still attempting to minimise the travel distances. I still need to address this aspect of the problem.

The paper referred to above only considers four seasons. I have other work in progress which not only uses seven seasons of data but also uses more sophisticated search techniques. Not only am I able to significantly speed up the search (from about 30 hours to a matter of minutes) but I also managed to improve on the results reported in the JORS paper. I'll report more if/when the paper is published.

In the meantime, my attention is turning towards the 2009-2010 season. One of the biggest tasks each year is to collect the data representing distances between each football club. For the past seven years I have used http://www.greenflag.com. It is a time consuming task as I have to manually collect the data for each pair of teams. This year I am looking at exploiting google maps. They have an API (Application Programming Interface) which I hope will enable me to automate this data collection task. I will report on the success (or failure) of this approach soon.

Research Reflections