Deriving values for a non-existent table
Tue Oct 16, 2007 at 16.06
This situation might arise in rare cases but if it does there is a simple solution that should get you on your way. Here is the problem: You have two tables that are related by a column in each of them. However, there may or may not be a record in either of the tables for any given key. It is a bit tricky to explain in words, so perhaps an example is in order. For whatever reason you are working with the following tables
game_id int,
venue_id int,
game_date datetime,
total_points int,
team_id int,
primary key(game_id, venue_id, game_date),
unique ven_date_uniq(venue_id, game_date)
);
insert into league_games values
(1, 1, '2007-01-01 12:00:00', 89, 1),
(2, 1, '2007-01-01 16:30:00', 72, 2),
(3, 1, '2007-01-02 11:00:00', 103, 7),
(4, 1, '2007-01-02 19:30:00', 97, 10),
(5, 2, '2007-01-08 16:00:00', 65, 4),
(6, 2, '2007-01-08 19:00:00', 78, 2)
(17, 2, '2007-01-09 17:30:00', 102, 12),
(18, 2, '2007-01-09 19:00:00', 58, 13),
(19, 3, '2007-01-19 14:30:00', 78, 17),
(20, 3, '2007-01-19 16:00:00', 87, 22),
(21, 3, '2007-01-19 18:00:00', 109, 27);
create table tournament_games(
tournament_id int,
game_id int,
game_date datetime,
total_points int,
team_id int,
primary key(tournament_id, game_id),
unique team_date_uniq(team_id, game_date)
);
insert into tournament_games values
(1, 7, '2007-03-07 09:30:00', 88, 1),
(1, 8, '2007-03-07 12:00:00', 76, 3),
(1, 9, '2007-03-07 14:30:00', 92, 5),
(1, 10, '2007-03-07 17:00:00', 80, 7),
(1, 11, '2007-03-07 19:30:00', 109, 9),
(1, 12, '2007-03-08 09:30:00', 97, 2),
(1, 13, '2007-03-08 12:00:00', 56, 4),
(1, 14, '2007-03-08 14:30:00', 85, 6),
(1, 15, '2007-03-08 17:00:00', 105, 8),
(1, 16, '2007-03-08 19:30:00', 93, 10);
From these tables we are trying to calculate the total number of points accumulated by each team in both tournament and league play for the Spring 2007 (January - May 2007) season. It is important to note that we do not have a teams table to join against, which is the crux of the problem. However before we jump into a solution we need to discuss another fact. While it may seem a little counter-intuitive that a team would play in the tournament but not in the leauge, our data supports this situation. Also a team may play in the league but not in the tournament. This adds a dimension to our problem because you must account for the missing values (which in an OUTER JOIN will produce NULLs) in your query. This causes problems because the sum of an integer and NULL is not defined, so it is NULL. But NULL isn't a number so the total points cannot be NULL. To get around this problem we will use the COALESCE function which takes the first non-NULL value in its argument list as the return value. Let's have a look at our first query attempt
, sum(coalesce(l.total_points, 0) + coalesce(t.total_points,0)) total_points
from league_games l
left outer join tournament_games t
on l.team_id = t.team_id
and t.game_date > '2006-12-31 23:59:59'
and t.game_date < '2007-05-01 00:00:00'
where l.game_date > '2006-12-31 23:59:59'
and l.game_date < '2007-05-01 00:00:00'
group by l.team_id
Now this works fine provided there are no teams in the tournament that are not also in the league play. Given the connotation of league and tournament play, it doesn't make sense for this situation to arise, but humor me. We are, given the data above, missing point totals for teams 3, 5, 9, 6, and 8. To get these values we can reverse the order of the tables in query but then we will be missing totals for teams 4, 12, 13, 17, 22, and 27. The only way is to add a third table to the query that gives the valid teams. In an ideal situation we would have a table, teams or even teams_seasons, that would provide the necessary information, but we are not so lucky in our example. To get this data we will derive our teams table in the query using a union. Here is the query to derive the teams table
from league_games l
where l.game_date > '2006-12-31 23:59:59'
and l.game_date < '2007-05-01 00:00:00'
union
select l.team_id
from tournament_games l
where l.game_date > '2006-12-31 23:59:59'
and l.game_date < '2007-05-01 00:00:00'
order by team_id
This produces a list of team_ids that appeared in the Spring 2007 season which we can join against both games tables to get our final result. You might be tempted to just join against this table and forget about the date range for the games tables, but that would be a mistake. While it is possible that a given team only played during one season, it is more likely that they played in multiple seasons and not including the conditions on the games tables would skew your results to total points over all time and not only in the given season. Putting it all together we have our final solution
, sum(coalesce(l.total_points, 0) + coalesce(t.total_points,0)) total_points
from (
select l.team_id
from league_games l
where l.game_date > '2006-12-31 23:59:59'
and l.game_date < '2007-05-01 00:00:00'
union
select l.team_id
from tournament_games l
where l.game_date > '2006-12-31 23:59:59'
and l.game_date < '2007-05-01 00:00:00'
order by team_id
) teams
left outer join league_games l
on teams.team_id = l.team_id
and l.game_date > '2006-12-31 23:59:59'
and l.game_date < '2007-05-01 00:00:00'
left outer join tournament_games t
on teams.team_id = t.team_id
and t.game_date > '2006-12-31 23:59:59'
and t.game_date < '2007-05-01 00:00:00'
group by l.team_id
order by teams.team_id
Which correctly answers our query and is an elegant solution to the presented problem.
- Tagged:
- technicalarticles
- database
- mysql
No comments yet, be the first!