PageRank Centrality¶
PageRank is a link analysis algorithm that assigns importance scores to nodes based on the structure of incoming links.
Function Signature¶
pg.centrality.pagerank(
    graph: Union[PyGraph, PyDiGraph],
    damping: float = 0.85,
    max_iter: int = 100,
    tolerance: float = 1e-6,
    nstart: Optional[Dict[int, float]] = None
) -> Dict[int, float]
Parameters¶
- graph: The graph to analyze (directed or undirected)
- damping: Damping factor (probability of following a link vs random jump). Default: 0.85
- max_iter: Maximum number of iterations. Default: 100
- tolerance: Convergence tolerance. Default: 1e-6
- nstart (optional): Starting values for the PageRank iteration
  - Type: Dict[int, float], mapping node IDs to initial values
  - Default: None (uniform distribution)
  - Use case: Provide custom initial values or continue from a previous computation
  - Example: nstart={0: 0.5, 1: 0.3, 2: 0.2}
Returns¶
Dictionary mapping node IDs to PageRank scores. Scores sum to 1.0.
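Since the scores form a probability distribution over the nodes, their sum is a quick sanity check. A minimal sketch using the API documented above:
import pygraphina as pg
g = pg.PyDiGraph()
a, b, c = [g.add_node(i) for i in range(3)]
g.add_edge(a, b, 1.0)
g.add_edge(b, c, 1.0)
g.add_edge(c, a, 1.0)
scores = pg.centrality.pagerank(g, damping=0.85, max_iter=100, tolerance=1e-6)
# The returned dictionary behaves like a probability distribution
print("Sum of scores:", sum(scores.values()))  # ~1.0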
Description¶
PageRank models the behavior of a random surfer who:
- With probability damping, follows a random outgoing link
- With probability 1 - damping, jumps to a random node
The PageRank score represents the long-term probability of finding the surfer at each node.
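To make the model concrete, the short, illustrative simulation below (plain Python, independent of the pygraphina API) follows a random surfer and counts visits; the visit frequencies approximate the PageRank scores.
import random
# Toy directed graph as an adjacency list: node -> list of successors
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
def simulate_surfer(links, damping=0.85, steps=100_000):
    """Estimate PageRank by counting where a random surfer spends its time."""
    nodes = list(links)
    visits = {n: 0 for n in links}
    current = random.choice(nodes)
    for _ in range(steps):
        visits[current] += 1
        out = links[current]
        if out and random.random() < damping:
            current = random.choice(out)    # follow a random outgoing link
        else:
            current = random.choice(nodes)  # random jump to any node
    total = sum(visits.values())
    return {n: count / total for n, count in visits.items()}
print(simulate_surfer(links))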
Algorithm¶
The algorithm iteratively updates scores using:

PR(u) = \frac{1 - d}{N} + d \sum_{v \in B(u)} \frac{PR(v)}{\mathrm{outdeg}(v)}

where d is the damping factor, N is the number of nodes, B(u) is the set of nodes v with edges to u, and outdeg(v) is the out-degree of v.
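For reference, the update rule can be written out directly. The sketch below is a simplified, standalone Python implementation of power iteration (it ignores dangling-node handling, which the library takes care of automatically):
def pagerank_power_iteration(links, damping=0.85, max_iter=100, tolerance=1e-6):
    """Power iteration over an adjacency list: node -> list of successors."""
    n = len(links)
    scores = {u: 1.0 / n for u in links}
    for _ in range(max_iter):
        new_scores = {}
        for u in links:
            # Sum contributions from every node v that links to u
            incoming = sum(scores[v] / len(links[v]) for v in links if u in links[v])
            new_scores[u] = (1.0 - damping) / n + damping * incoming
        # Stop once the total change falls below the tolerance
        if sum(abs(new_scores[u] - scores[u]) for u in links) < tolerance:
            return new_scores
        scores = new_scores
    return scores
print(pagerank_power_iteration({0: [1, 2], 1: [2], 2: [0], 3: [2]}))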
Time Complexity¶
O(V + E) per iteration, for a total of O(k·(V + E)), where k is the number of iterations (usually < 100).
Examples¶
Basic Usage¶
import pygraphina as pg
# Create a simple link structure
g = pg.PyDiGraph()
a, b, c, d = [g.add_node(i) for i in range(4)]
g.add_edge(a, b, 1.0)
g.add_edge(a, c, 1.0)
g.add_edge(b, c, 1.0)
g.add_edge(c, a, 1.0)
g.add_edge(d, c, 1.0)
# Calculate PageRank
scores = pg.centrality.pagerank(g, damping=0.85, max_iter=100, tolerance=1e-6)
print("PageRank scores:")
for node, score in sorted(scores.items(), key=lambda x: x[1], reverse=True):
    print(f" Node {node}: {score:.4f}")
Web Page Ranking¶
import pygraphina as pg
# Create a web graph (pages linking to each other)
web = pg.PyDiGraph()
# Add pages
home = web.add_node(1)
about = web.add_node(2)
products = web.add_node(3)
blog = web.add_node(4)
contact = web.add_node(5)
# Add links (directed edges)
web.add_edge(home, about, 1.0)
web.add_edge(home, products, 1.0)
web.add_edge(home, blog, 1.0)
web.add_edge(about, contact, 1.0)
web.add_edge(products, blog, 1.0)
web.add_edge(blog, home, 1.0)
web.add_edge(blog, products, 1.0)
# Calculate page importance
pagerank = pg.centrality.pagerank(web, damping=0.85)
page_names = {home: "Home", about: "About", products: "Products",
blog: "Blog", contact: "Contact"}
print("Page Importance Ranking:")
for node, score in sorted(pagerank.items(), key=lambda x: x[1], reverse=True):
    print(f" {page_names[node]}: {score:.4f}")
Social Network Influence¶
import pygraphina as pg
# Create a social network (who follows whom)
social = pg.PyDiGraph()
# Add users
users = [social.add_node(i) for i in range(10)]
# Add follow relationships
follows = [
(0, 1), (0, 2), (1, 2), (2, 3), (3, 4),
(4, 5), (5, 0), (6, 7), (7, 8), (8, 6), (9, 0)
]
for follower, followee in follows:
    social.add_edge(follower, followee, 1.0)
# Calculate influence scores
influence = pg.centrality.pagerank(social, damping=0.85)
# Find top influencers
top_influencers = sorted(influence.items(), key=lambda x: x[1], reverse=True)[:3]
print("Top 3 Influencers:")
for rank, (user, score) in enumerate(top_influencers, 1):
    print(f" {rank}. User {user}: {score:.4f}")
Parameter Tuning¶
Damping Factor¶
- 0.85 (default): Standard value used by Google
- Higher (0.90-0.95): More emphasis on link structure
- Lower (0.70-0.80): More random jumps, less emphasis on links
# Compare different damping factors
for damping in [0.70, 0.85, 0.95]:
    scores = pg.centrality.pagerank(g, damping=damping)
    top_node = max(scores, key=scores.get)
    print(f"Damping {damping}: Top node {top_node} = {scores[top_node]:.4f}")
Convergence¶
Adjust max_iter and tolerance for speed vs accuracy:
# Fast approximation (fewer iterations)
fast = pg.centrality.pagerank(g, max_iter=10, tolerance=1e-3)
# High precision (more iterations, tighter tolerance)
precise = pg.centrality.pagerank(g, max_iter=200, tolerance=1e-9)
Warm Start with nstart¶
Use nstart to continue iterations from a previous state or provide custom initial values:
import pygraphina as pg
g = pg.PyDiGraph()
nodes = [g.add_node(i) for i in range(5)]
for i in range(4):
    g.add_edge(nodes[i], nodes[i+1], 1.0)
# First computation
pr1 = pg.centrality.pagerank(g, max_iter=50, tolerance=1e-6)
print(f"Initial PageRank: {pr1}")
# Continue from previous result (warm start)
pr2 = pg.centrality.pagerank(g, max_iter=50, tolerance=1e-6, nstart=pr1)
print(f"After warm start: {pr2}")
# Custom initialization (prefer certain nodes)
custom_start = {0: 0.5, 1: 0.3, 2: 0.1, 3: 0.05, 4: 0.05}
pr3 = pg.centrality.pagerank(g, nstart=custom_start)
print(f"Custom init: {pr3}")
print(f"Node 0 has highest score: {pr3[0]:.4f}")
Comparison with Eigenvector Centrality¶
PageRank and eigenvector centrality are closely related, but they differ in a few key ways:
| Feature | PageRank | Eigenvector |
|---|---|---|
| Damping | Yes (random jumps) | No |
| Disconnected graphs | Handles well | May fail |
| Convergence | Always converges | May not converge |
| Use case | Web ranking | Social influence |
# Compare the two
pagerank_scores = pg.centrality.pagerank(g, 0.85, 100, 1e-6)
eigenvector_scores = pg.centrality.eigenvector(g, 100, 1e-6)
print("Comparison:")
for node in g.nodes:
    print(f"Node {node}: PR={pagerank_scores[node]:.4f}, "
          f"EV={eigenvector_scores[node]:.4f}")
Common Pitfalls¶
1. Using on Undirected Graphs¶
PageRank works on both directed and undirected graphs, but the interpretation differs, as the comparison sketch after this list shows:
- Directed: Models link following behavior
- Undirected: All edges are bidirectional
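A minimal sketch that runs PageRank on the same edge list twice, once as a PyDiGraph and once as a PyGraph (assuming both classes expose the add_node/add_edge API used above):
import pygraphina as pg
edges = [(0, 1), (1, 2), (2, 0), (3, 2)]
# Directed version: node 2 collects incoming links, node 3 only points outward
directed = pg.PyDiGraph()
d_nodes = [directed.add_node(i) for i in range(4)]
for u, v in edges:
    directed.add_edge(d_nodes[u], d_nodes[v], 1.0)
# Undirected version: every edge counts in both directions
undirected = pg.PyGraph()
u_nodes = [undirected.add_node(i) for i in range(4)]
for u, v in edges:
    undirected.add_edge(u_nodes[u], u_nodes[v], 1.0)
print("Directed:  ", pg.centrality.pagerank(directed, damping=0.85))
print("Undirected:", pg.centrality.pagerank(undirected, damping=0.85))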
2. Dangling Nodes¶
Nodes with no outgoing edges (sinks) are handled automatically by the algorithm.
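For example, adding a sink node does not break the computation, and the returned scores still sum to 1.0 (a minimal sketch):
import pygraphina as pg
g = pg.PyDiGraph()
a, b, sink = [g.add_node(i) for i in range(3)]
g.add_edge(a, b, 1.0)
g.add_edge(b, sink, 1.0)  # 'sink' has no outgoing edges (dangling node)
scores = pg.centrality.pagerank(g, damping=0.85)
print(scores)
print("Total:", sum(scores.values()))  # still ~1.0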
3. Disconnected Components¶
PageRank handles disconnected graphs gracefully due to the random jump mechanism.
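For instance, a graph with two components that share no edges still gets well-defined scores from a single call (a minimal sketch):
import pygraphina as pg
g = pg.PyDiGraph()
nodes = [g.add_node(i) for i in range(6)]
# Component 1: a 3-node cycle
g.add_edge(nodes[0], nodes[1], 1.0)
g.add_edge(nodes[1], nodes[2], 1.0)
g.add_edge(nodes[2], nodes[0], 1.0)
# Component 2: a separate 3-node cycle, not connected to the first
g.add_edge(nodes[3], nodes[4], 1.0)
g.add_edge(nodes[4], nodes[5], 1.0)
g.add_edge(nodes[5], nodes[3], 1.0)
scores = pg.centrality.pagerank(g, damping=0.85)
print(scores)  # every node gets a score; the random jump bridges the components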
Performance¶
For large graphs, consider:
- Parallel implementation: Use pg.parallel.pagerank_parallel()
- Early stopping: Use a larger tolerance for faster (approximate) results
- Fewer iterations: Reduce max_iter if exact convergence isn't critical
# For large graphs
import pygraphina as pg
# Standard (sequential)
scores = pg.centrality.pagerank(large_graph, 0.85, 100, 1e-6)
# Parallel (faster for large graphs)
scores = pg.parallel.pagerank_parallel(large_graph, 0.85, 100, 1e-6)
Applications¶
- Web search: Ranking web pages
- Social networks: Finding influential users
- Citation networks: Identifying important papers
- Recommendation systems: Ranking items
- Protein networks: Finding critical proteins
- Knowledge graphs: Entity importance
Personalized PageRank¶
Personalized PageRank (PPR) allows you to bias the random jumps toward specific nodes, making the results more relevant to a particular starting point.
Implementation Note
The current implementation uses the power iteration method. While generally fast, for extremely large graphs with slow convergence, other methods (such as Arnoldi iteration) might be considered in future releases. Personalized PageRank is not yet exposed directly; see the following workaround.
Workaround for Personalized PageRank¶
You can approximate Personalized PageRank by computing PageRank on a modified graph:
import pygraphina as pg
# Create a graph
g = pg.PyGraph()
nodes = [g.add_node(i) for i in range(10)]
# Add edges
for i in range(9):
    g.add_edge(nodes[i], nodes[i + 1], 1.0)
g.add_edge(nodes[9], nodes[0], 1.0)
# To personalize to a specific node, extract its ego graph
# and compute PageRank on that subgraph
personalization_node = 0
ego = g.ego_graph(personalization_node, radius=3)
# Compute PageRank on the ego graph
local_scores = pg.centrality.pagerank(ego, 0.85, 100, 1e-6)
print(f"Local importance around node {personalization_node}: {local_scores}")
See Also¶
- Eigenvector Centrality: Similar algorithm without damping
- Personalized PageRank: PageRank with custom start nodes
- Katz Centrality: Weighted path-based centrality
- All Centrality Measures: Overview of centrality algorithms