Graph Databases and Baseball
Baseball is a statistics-driven sport and has been since its inception. Can a graph database be used to further analyze the data? You'd be surprised...
Join the DZone community and get the full member experience.
Join For FreeIn my opinion, stating that baseball remains America's Pastime in 2023 is a tough argument to make, as it often is the third-most popular sport in the United States. However, it is very obvious that baseball has been and always will be statistics-driven: baseball-reference.com has data all the way back to the 1876 National League season. Two franchises from that season still exist — the Chicago Cubs and Atlanta Braves — but interestingly, both used a name subsequently used by their later-arriving American League brethren.
And the data available continues to grow: Retrosheet researchers continue to search for box scores and play-by-plays for older seasons, their latest release including box scores and play-by-play for the 1919 and 1920 seasons. Occasionally this research changes individuals' stats. The data is available for download if interested.
Transactions
As a fan of the Oakland (soon-to-be Las Vegas) Athletic, I harass my wife constantly by identifying the former Oakland players on other teams — a bone-crushingly simple exercise in recent decades as stars are traded for prospects instead of retaining them at higher salaries. Very frustrating.
Similarly, there's the ex-Cubs Factor theory, tying teams' playoff success (or lack thereof) to the number of ex-Cubs who played for each team in a playoff series or World Series; fortunately, it did not prevent Oakland in 1973 from winning the World Series (though mitigating factors are claimed).
In recent years the Minnesota Twins seem hesitant to trade promising-but-underperforming young players because of David Ortiz's success in Boston, keeping Miguel Sano until his struggles were too obvious — and painful to ignore, who is currently out of baseball after being released following the 2022 season.
With this background, I decided to explore player movements between teams over time. Through the 1975 season, baseball's Reserve Clause made it basically impossible for players to have any say in where they played or what they were paid; that said, players could change teams if owners decided to trade or sell players. Free agency was introduced before the 1976 season, dramatically changing the game.
Fortunately — and perhaps unsurprisingly — data about all player transactions are available for analysis, and I thought that I could use Neo4J to interrogate the data.
Transactions
As a fan of the Oakland (soon-to-be Las Vegas) Athletic, I harass my wife constantly by identifying the former Oakland players on other teams — a bone-crushingly simple exercise in recent decades as stars are traded for prospects instead of retaining them at higher salaries. Very frustrating.
Similarly, there's the ex-Cubs Factor theory, tying teams' playoff success (or lack thereof) to the number of ex-Cubs who played for each team in a playoff series or World Series; fortunately, it did not prevent Oakland in 1973 from winning the World Series (though mitigating factors are claimed).
In recent years the Minnesota Twins seem hesitant to trade promising-but-underperforming young players because of David Ortiz's success in Boston, keeping Miguel Sano until his struggles were too obvious — and painful to ignore, who is currently out of baseball after being released following the 2022 season.
With this background, I decided to explore player movements between teams over time. Through the 1975 season, baseball's Reserve Clause made it basically impossible for players to have any say in where they played or what they were paid; that said, players could change teams if owners decided to trade or sell players. Free agency was introduced before the 1976 season, dramatically changing the game.
Fortunately — and perhaps unsurprisingly — data about all player transactions are available for analysis, and I thought that I could use Neo4J to interrogate the data.
Raw Data
As mentioned above, Retrosheet has an abundance of raw data available, of which the player and transaction data appeared relevant. Future enhancements may incorporate additional data, such as batting average, earned run average, wins above replacement, etc.
Players
Retrosheet's biographical database contains information about players, managers, coaches, and umpires, including name, birth and death information, first and last appearance date, bat, throws, height, weight, death, and cemetery information. The values are comma-separated and double-quoted only when required.
The biographic zip file is available here.
Transactions
Retrosheet's transaction database contains the details of all transactions: e.g., transaction date, the player involved, the team(s) involved, and the transaction type. The values are comma-separated, and every string is double-quoted, whether required or not.
Baseball has approximately sixty transaction types that describe how players join or leave teams, from the mundane — player drafted, traded, sold — to the interesting — jumping teams, refusing to report, purchase voided.
When the team or teams involved are from the three baseball major leagues — e.g., American, Federal, National — the well-known team abbreviation is used: LAD for Los Angeles Dodgers, BRO for Brooklyn Dodgers, etc. For all other leagues, the team and league names are explicitly spelled out. e.g., Providence of the International League.
The transactions zip file is available here.
Note #1: the home page provides links for year-by-year transactions; however, the page has not been updated since 2020. The zip file contains all transactions since 2022.
Note #2: some Negro League data is available but is separated from the other major leagues. I have not yet looked into what is available and how it may be used.
Note #3: I only loaded post-1900 transaction data as pre-1900 transaction data is less complete and consistent and required additional exception handling. And let's be honest: how many of you even know that the 1899 Cleveland Spiders lost 134 games, much less who was on their roster?
Teams
Retrosheet does not provide a definitive list of teams, so I created my own.
The file teams.txt is a list of all current major league teams or, for the Federal Leagues, the teams that existed when the league folded. No American or National League team has folded since the American League was founded in 1901.
The file teammap.txt allows us to aggregate transactions for a franchise, e.g., BRO maps to LAD, combining Dodger transactions for both Brooklyn and Los Angeles into the current Los Angeles location.
Transactions Types
Though the many transaction types uniquely describe how players moved between teams, the transaction types can be grouped into three basic types.
To Transaction
A player joins a team after being unsigned by any team, e.g., drafted out of high school or college or signed as an international player.
This shows that John Samuel Wilson was drafted by the Red Sox; the transaction date (October 1926) is a property on the relationship.
From Transaction
A player leaves or is dissociated with their team after their signed contract is invalidated, e.g., the contract expires, the player is released by the team, or the player retires.
Since a player must be signed to a team before leaving a team, this graph shows that John Cornelius Ray was traded to the Angels before the 1988 season before being released.
FromTo Transaction
A player disassociates from one team and joins another, most often through a trade or sale of the player.
Roger Maris, best known for breaking Babe Ruth's season home run mark in 1961, was traded twice, though, from the graph, it's impossible to determine the specifics. [In fact, he was traded by Cleveland to the Athletics in 1958 and then to the Yankees in 1960.]
Extreme Player Movements
Octavio Dotel played for 13 teams, none more than once.
This is only the second-most, as Edwin Jackson played for 14 teams.
Examples
Team Frequency
For a pair of teams, how many players have been exchanged, either through trade or sale? Most are trades, but scrolling through the results you do see player sales as well.
Cypher Query
MATCH (t1:Team)-[tx1]->(p:Player)-[tx2]->(t2:Team)
WHERE tx1.retrosheetId = tx2.retrosheetId AND t1.retrosheetId <> t2.retrosheetId
RETURN t1.retrosheetId, t2.retrosheetId, tx1.transactionType, COUNT(p)
ORDER BY COUNT(p) DESC, tx1.transactionType, t1.retrosheetId, t2.retrosheetId
Results
Additional Notes
This query would not be difficult in a relational database or even in Excel but demonstrates a Neo4J Cypher query using the data model. An easy place to start.
Teammates during a career
With whom did a player play, by the team, and how (transaction type) did they join the team?
This query uses retired pitcher Jon Lester as the base player.
Cypher Query
MATCH (bp:Player {firstName: "Jonathan Tyler", lastName: "Lester"})-[tx1]-(bt:Team)
WITH bt.retrosheetId AS teamId,
bp.playerDebut AS debut,
MAX(tx1.transactionDate) AS lastTxn
MATCH (ot:Team)<-[tx2]-(op:Player)
WHERE ot.retrosheetId = teamId AND
tx2.transactionDate >= debut AND
tx2.transactionDate <= lastTxn
RETURN teamId,
op.name,
tx2.transactionDate,
tx2.transactionType
ORDER BY teamId,
op.name,
tx2.transactionDate
Results
Players Returning To the Same Team
Identify players who left a team and later returned, including details about the team and the transaction.
This query only considers players who debuted in the 1960s, which is a property on the Player node.
Cypher Query
MATCH (t1:Team)-[tx1]->(p:Player)-[tx2]->(t2:Team)
WHERE DATETIME(p.playerDebut) > DATETIME('1960-01-01T00:00:00')
AND DATETIME(p.playerDebut) > DATETIME('1970-01-01T00:00:00')
AND t1 = t2
AND DATETIME(tx1.transactionDate) < DATETIME(tx2.transactionDate)
RETURN p.name,
t1.name, tx1.transactionType, tx1.transactionDate,
tx2.transactionType, tx2.transactionDate
Results
How To Read
- Stan Bahnsen became a free agent in October 1980 and resigned from the team in December 1980.
- Rich Baney was sold by the Orioles in June 1971 but returned to the Orioles in September 1971 for unknown reasons.
Players Traded Back To Original Team by Number of Days
Identify players who were traded between two teams, ordered by the number of days between the transactions.
Cypher Query
MATCH (t1:Team)-[tx1:TRADED_FROM]->(p:Player)-[tx2:TRADED_TO]->(t2:Team)-[tx3:TRADED_FROM]->(p:Player)-[tx4:TRADED_TO]->(t1:Team)
WHERE tx1.retrosheetId = tx2.retrosheetId
AND tx3.retrosheetId = tx4.retrosheetId
AND t1.retrosheetId IS NOT NULL
AND t2.retrosheetId IS NOT NULL
AND tx1.transactionDate < tx3.transactionDate
RETURN p.name,
t1.name AS OriginalTeam,
t2.name AS TradedTeam,
tx1.transactionDate AS Trade1Date,
tx3.transactionDate AS Trade2Date,
Duration.inDays(DATETIME(tx1.transactionDate), DATETIME(tx3.transactionDate)).days as DaysBetween
ORDER BY DaysBetween,
p.name,
t1.name,
t2.name
Results
How To Read
Colin David Rea was traded from the Padres to the Marlins and back to the Padres within 3 days.
Closing Thoughts
These examples scratch the surface of how the transactions can be analyzed.
With additional player information, e.g., their lifetime statistics, more detailed analysis could be achieved, such as using WAR to provide a value to a transaction.
The most awkward aspect of this data model is ordering transactions, as relationships have no concept of sequencing. Sometimes the transaction date is easy to use, but not in all situations.
Opinions expressed by DZone contributors are their own.
Comments