In the previous post on this we saw what is a basic graph data structure, and the layout of an example – a movie database with it. In this I am going to explain how to convert that design into graph tables.
The SQL Server graph architecture is explained really well here.
I already have some relational tables with data for this model.
My relational model looks like below.
Why is this a good example to convert into a graph data model? There are atleast two many to many relationships – moviesactor and moviesdirector. And if you consider making a table of who acted with whom – coactors, that would make it 3. Many to many relationships/bridge tables are a key indicator of what makes a good candidate for graph data. So we are past Step 1, identifying if the data is a good candidate. The next step is to list the questions we want this new model to answer. Mine may be as below:
1 Who are the actors in this movie?
2 Who is this movie directed by?
3 Who is the most prolific actor, according this dataset?
4 How many actors are also directors?
5 What is the shortest path/number of connections Person A needs to reach Person B? (A person may be an actor or a director).
..and so on. If we look at the nouns here – actors/directors – those make up a node. Movie, is another node. We can choose to make actors and directors separate nodes, or put them into one node, called person node. The only criteria here is how much that one table is going to get hit query wise. In this case, since it is a small dataset with minimal querying, I choose to go with one node, called Person Node.
The edges are the verbs – acted, directed and co starred. So each of them make up an edge table. My data model looks like below.
DROP TABLE IF EXISTS PersonNode;
CREATE TABLE PersonNode (
PersonID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
PersonName varchar(500) ) AS NODE;
Now I get a node table..i’d like to see what the node table looks like.
I have highlighted in red what SQL Server adds to the table – the two system columns – graph id, which is bigint, and node id, which is nvarchar and stores json, and the unique index to help with queries.
We can also see from constraint type that this table is similar to other relational tables – it can be enabled for replication and can have related delete or update actions defined on it if need be.
We can now insert data into this table from relational tables as below.
INSERT INTO PersonNode (PersonName)
SELECT actor_name FROM MovieData.dbo.Actor
SELECT director_name FROM MovieData.dbo.Director
Selecting data from the table gives us below. It is interesting and informative to see how node id is stored as JSON.
SELECT * FROM PersonNode
We can create the movie node similarly, as below.
DROP TABLE IF EXISTS MovieNode;
CREATE TABLE MovieNode (
MovieID INT PRIMARY KEY,
MovieTitle varchar(500) NOT NULL,
MovieLanguage varchar(500) NULL,
MovieCountry varchar(500) NULL,
MovieFacebookLikes INT NULL
) AS NODE;
INSERT INTO MovieNode (movieid,movietitle,movielanguage,moviecountry,moviefacebooklikes)
SELECT [MovieId],[Movie_Title],[Language],[Country],movie_facebook_likes FROM MovieData.[dbo].[movies]
Now that we have the two basic nodes, lets try and create the edges or connections between them.
DROP TABLE IF EXISTS MoviesActorLink;
CREATE TABLE MoviesActorLink (
Link BIT NOT NULL DEFAULT 0,
MovieActorLevel smallint NULL,
MovieActorFacebooklikes DECIMAL NULL,
CONSTRAINT EDG_MoviesWithActor CONNECTION (MovieNode TO PersonNode)
) AS EDGE;
Let us explore what an edge table looks like under the covers.
As highlighted in the red square on top, there are a whole bunch of system columns added for an edge table – the first two, graph id and edge id are unique identifying columns while the rest are for the nodes we plan to connect using this edge. There is also a default index added on graph id.
I chose to add a constraint on two nodes i plan to connect – movie node and person node. In case anyone wants to delete data from either of the node tables – there should not be an orphaned record in the edge table that points to the deleted node. This constraint can come in very useful.
Now I am ready to insert data into the edge table from relational table. I do this as below.
INSERT INTO MoviesActorLink ($from_id, $to_id,movieactorlevel,movieactorfacebooklikes)
SELECT a.$node_id, p.$node_id,c.actor_level,c.actor_facebook_likes FROM dbo.MovieNode a
INNER JOIN MovieData.dbo.MoviesActor c ON a.movieid = c.movieid
INNER JOIN MovieData.dbo.Actor b ON c.actorid = b.actorid
INNER JOIN dbo.PersonNode p ON b.actor_name = p.personname
The data in my the newly created edge table can look as below.
SELECT * FROM dbo.MoviesActorLink
As we can see the node and edge ids are all json based pointing to existing nodes. In the next post I will explore running simple queries off of this data. Thanks for reading.