Basic Concepts

Fundamental concepts and design principles of PyGraphina.

Graph Model

PyGraphina represents graphs using an adjacency list data structure, which provides efficient storage and fast neighbor lookups for sparse graphs (common in real-world networks).

While PyGraphina allows modifying the graph structure, it's generally more efficient to build the graph first and then analyze it. Frequent modifications (especially removals) can be slower than bulk building.
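
A minimal sketch of the build-then-analyze pattern, using only calls shown elsewhere on this page; the edge list is made up for illustration:

import pygraphina as pg

g = pg.PyGraph()
ids = [g.add_node(i) for i in range(5)]        # add all nodes first
for u, v in [(0, 1), (1, 2), (2, 3), (3, 4)]:  # then add all edges
    g.add_edge(ids[u], ids[v], 1.0)

# Analyze only after construction, rather than interleaving removals with queries
scores = pg.centrality.pagerank(g, 0.85, 100, 1e-6)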

Undirected vs. Directed Graphs

PyGraphina supports both undirected and directed graphs:

Undirected Graphs (PyGraph)

  • Edges have no direction
  • If node A is connected to node B, then B is also connected to A
  • Used for: social networks, collaboration networks, molecular structures

import pygraphina as pg

g = pg.PyGraph()
a = g.add_node(1)
b = g.add_node(2)
g.add_edge(a, b, 1.0)

# In undirected graph, both directions exist
assert g.contains_edge(a, b)
assert g.contains_edge(b, a)

Directed Graphs (PyDiGraph)

  • Edges have direction (from source to target)
  • An edge from A to B doesn't imply an edge from B to A
  • Used for: web links, citation networks, workflow dependencies

dg = pg.PyDiGraph()
a = dg.add_node(1)
b = dg.add_node(2)
dg.add_edge(a, b, 1.0)  # Edge from a to b

# Only the specified direction exists
assert dg.contains_edge(a, b)
assert not dg.contains_edge(b, a)  # No edge from b to a

Nodes

Node IDs

When you add a node, PyGraphina returns a unique node ID (a simple integer):

g = pg.PyGraph()
node_0 = g.add_node(100)  # Returns 0
node_1 = g.add_node(200)  # Returns 1
node_2 = g.add_node(300)  # Returns 2

# Node IDs are sequential starting from 0
assert node_0 == 0
assert node_1 == 1
assert node_2 == 2

Node IDs After Deletion

When you remove a node, its ID is not reused. The next added node gets the next sequential ID.
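
A short sketch of this behavior, assuming node removal works exactly as described above:

g = pg.PyGraph()
a = g.add_node(10)  # ID 0
b = g.add_node(20)  # ID 1
c = g.add_node(30)  # ID 2

g.remove_node(b)    # ID 1 is retired and never reused
d = g.add_node(40)  # receives the next sequential ID

assert d == 3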

Node Attributes

Each node has a single integer attribute associated with it (stored as i64 internally):

g = pg.PyGraph()
node_id = g.add_node(42)  # 42 is the node attribute (must be an integer)

# Retrieve the attribute
attr = g.get_node_attr(node_id)
assert attr == 42

# Update the attribute
g.update_node(node_id, 100)
assert g.get_node_attr(node_id) == 100

Integer-Only Attributes

Node attributes must be integers. The valid range is -2^63 to 2^63-1 (64-bit signed integer).

  • ✅ Integers: g.add_node(42), g.add_node(-100)
  • ❌ Floats: g.add_node(3.14) → TypeError
  • ❌ Strings: g.add_node("Alice") → TypeError
  • ❌ Objects: g.add_node({"name": "Bob"}) → TypeError
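
A small sketch of the failure mode, assuming non-integer attributes raise TypeError as listed above:

g = pg.PyGraph()
try:
    g.add_node(3.14)  # floats are rejected
except TypeError:
    print("Node attributes must be plain integers")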

Use Cases for Node Attributes

Node attributes can represent various properties:

  • User IDs in a social network
  • Station codes in a transportation network
  • Molecule types in a chemical structure
  • Entity IDs for mapping to external data
  • Array indices for external data structures

Storing Rich Node Attributes

If you need to store complex attributes (strings, objects, etc.), use an external dictionary:

import pygraphina as pg

# Create graph and external data store
g = pg.PyGraph()
node_data = {}

# Add nodes with rich attributes stored externally
people = [
    {"name": "Alice", "age": 30, "city": "NYC"},
    {"name": "Bob", "age": 25, "city": "LA"},
]

for i, person in enumerate(people):
    node_id = g.add_node(i)  # Store simple index in graph
    node_data[node_id] = person  # Store rich data externally

# Add edges
g.add_edge(0, 1, 1.0)

# Use algorithms as normal
pagerank = pg.centrality.pagerank(g, 0.85, 100, 1e-6)

# Access rich data when needed
for node_id, score in pagerank.items():
    person = node_data[node_id]
    print(f"{person['name']}: {score:.4f}")

This pattern is common in high-performance graph libraries and allows PyGraphina to maintain its speed advantages while giving you flexibility for complex data.

Edges

Edge Weights

All edges in PyGraphina have a floating-point weight:

g = pg.PyGraph()
a, b = g.add_node(1), g.add_node(2)

# Add edge with weight
g.add_edge(a, b, 2.5)

# Weights represent different things depending on context:
# - Distance (shorter is closer)
# - Similarity (higher is more similar)
# - Strength (higher is stronger connection)
# - Cost (higher is more expensive)

Multiple Edges

PyGraphina does not support multiple edges between the same pair of nodes (no multigraph support):

g = pg.PyGraph()
a, b = g.add_node(1), g.add_node(2)

g.add_edge(a, b, 1.0)
g.add_edge(a, b, 2.0)  # This will update the existing edge

# Only one edge exists between a and b
assert g.edge_count() == 1

Self-Loops

Self-loops (edges from a node to itself) are supported:

g = pg.PyGraph()
a = g.add_node(1)
g.add_edge(a, a, 1.0)  # Self-loop

assert g.contains_edge(a, a)

Graph Operations

Basic Queries

g = pg.PyGraph()
a, b, c = [g.add_node(i) for i in range(3)]
g.add_edge(a, b, 1.0)
g.add_edge(b, c, 1.0)

# Node count
print(g.node_count())  # 3

# Edge count
print(g.edge_count())  # 2

# Check if directed
print(g.is_directed())  # False

# Graph density (ratio of existing edges to possible edges)
print(g.density())  # 0.666...

# Get all nodes
print(g.nodes)  # [0, 1, 2]

# Get neighbors of a node
print(g.neighbors(a))  # [1]
print(g.neighbors(b))  # [0, 2]

# Get node degree
print(g.degree[b])  # 2
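
As a sanity check on the density value above: for an undirected graph, density is 2m / (n(n - 1)). This sketch continues the example and assumes density() follows that standard definition, which matches the 0.666... printed above:

n, m = g.node_count(), g.edge_count()
assert abs(g.density() - (2 * m) / (n * (n - 1))) < 1e-9  # 2*2 / (3*2) ≈ 0.667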

Modifying Graphs

g = pg.PyGraph()

# Add nodes
a = g.add_node(1)
b = g.add_node(2)

# Add edges
g.add_edge(a, b, 1.0)

# Remove edge
g.remove_edge(a, b)

# Remove node (also removes incident edges)
g.remove_node(a)

# Clear entire graph
g.clear()
assert g.node_count() == 0

Algorithm Organization

PyGraphina organizes algorithms into logical modules:

pg.centrality

Measures of node importance and influence:

# PageRank centrality
pagerank = pg.centrality.pagerank(g, damping=0.85, max_iter=100, tolerance=1e-6)

# Betweenness centrality
betweenness = pg.centrality.betweenness(g, True)

# Degree centrality
degree = pg.centrality.degree(g)

pg.community

Community detection and clustering:

# Label propagation
communities = pg.community.label_propagation(g, max_iter=100)

# Louvain method
communities = pg.community.louvain(g)

# Connected components
components = pg.community.connected_components(g)

pg.links

Link prediction algorithms:

# Jaccard coefficient
jaccard = pg.links.jaccard_coefficient(g)

# Adamic-Adar index
adamic_adar = pg.links.adamic_adar_index(g)

# Resource allocation
resource_alloc = pg.links.resource_allocation_index(g)

pg.approximation

Approximation algorithms for hard problems:

# Approximate maximum clique
clique_size = pg.approximation.large_clique_size(g)

# Approximate clustering coefficient
clustering = pg.approximation.average_clustering_approx(g)

# Approximate diameter
diameter = pg.approximation.diameter(g)

pg.core

Core graph generators and utilities:

# Graph generators
g = pg.core.erdos_renyi(n=100, p=0.1, seed=42)
g = pg.core.barabasi_albert(n=100, m=2, seed=42)
g = pg.core.watts_strogatz(n=100, k=4, beta=0.3, seed=42)
g = pg.core.complete_graph(n=10)

# Shortest paths use graph methods directly:
distances = g.dijkstra(source)  # Returns dict of distances
result = g.shortest_path(source, target)  # Returns (distance, path) or None

Graph Metrics

Graph metrics are available as instance methods on graph objects:

# Graph-level metrics (instance methods)
diameter = g.diameter()
radius = g.radius()
avg_clustering = g.average_clustering()
transitivity = g.transitivity()

# Or use top-level convenience functions
diameter = pg.diameter(g)
radius = pg.radius(g)

# Node-level metrics (instance methods)
clustering = g.clustering_of(node)
triangles = g.triangles_of(node)

pg.mst

Minimum spanning tree algorithms:

# Kruskal's algorithm
mst = pg.mst.kruskal(g)

# Prim's algorithm
mst = pg.mst.prim(g, start_node)

pg.traversal

Graph traversal algorithms:

# Breadth-first search
bfs_order = pg.traversal.bfs(g, start_node)

# Depth-first search
dfs_order = pg.traversal.dfs(g, start_node)

pg.subgraphs

Subgraph extraction:

# Extract induced subgraph
subgraph = pg.subgraphs.induced_subgraph(g, node_set)

# Extract ego graph (k-hop neighborhood)
ego = pg.subgraphs.ego_graph(g, center_node, radius=2)

pg.parallel

Parallel implementations for large graphs:

# Parallel PageRank
pagerank = pg.parallel.pagerank_parallel(g, 0.85, 100, 1e-6)

# Parallel BFS
bfs_order = pg.parallel.bfs_parallel(g, [start_node])

Performance Considerations

Graph Size

PyGraphina is designed for graphs with:

  • Millions of nodes
  • Tens of millions of edges
  • Sparse connectivity (typical of real-world networks)

Memory Usage

  • Each node requires minimal memory (just an ID and attribute)
  • Edges are stored efficiently in adjacency lists
  • Memory usage grows approximately linearly with the number of edges

Parallel Processing

Many algorithms have parallel implementations in pg.parallel that can leverage multiple CPU cores:

# Sequential version
result = pg.centrality.pagerank(large_graph, 0.85, 100, 1e-6)

# Parallel version (faster for large graphs)
result = pg.parallel.pagerank_parallel(large_graph, 0.85, 100, 1e-6)

Data Types

Return Types

Algorithms return different data types depending on their nature:

  • Dict[int, float]: Node-to-score mappings (centrality algorithms)
  • Dict[int, int]: Node-to-cluster mappings (community detection)
  • Dict[tuple, float]: Edge-to-score mappings (link prediction)
  • List[int]: Node sequences (paths, traversals)
  • float: Single values (density, diameter)

Example:

# Returns Dict[int, float] mapping node IDs to PageRank scores
pagerank = pg.centrality.pagerank(g, 0.85, 100, 1e-6)
print(pagerank)  # {0: 0.25, 1: 0.35, 2: 0.40}

# Returns Dict[int, int] mapping node IDs to community labels
communities = pg.community.label_propagation(g, 100)
print(communities)  # {0: 0, 1: 0, 2: 1}

# Returns List[int] with node IDs in traversal order
bfs = pg.traversal.bfs(g, start_node=0)
print(bfs)  # [0, 1, 2, 3, 4]

Best Practices

1. Use Appropriate Graph Type

Choose the right graph type for your use case:

  • Use PyGraph for symmetric relationships
  • Use PyDiGraph for asymmetric relationships
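
For example (a minimal sketch; the relationship names are made up):

# "Friendship" is mutual, so an undirected graph fits
friendship = pg.PyGraph()

# "Follows" is one-way, so a directed graph fits
follows = pg.PyDiGraph()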

2. Batch Operations

When building large graphs, add nodes and edges in loops rather than writing a separate statement for each one, and construct the full graph before running algorithms (see the sketch below).
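
A minimal sketch of bulk construction, using a made-up edge list and only calls shown earlier on this page:

import pygraphina as pg

# Hypothetical edge list; in practice this might come from a file or database
edges = [(0, 1, 1.0), (1, 2, 0.5), (2, 3, 2.0)]

g = pg.PyGraph()
ids = [g.add_node(i) for i in range(4)]  # add all nodes up front
for u, v, w in edges:                    # then add all edges in one pass
    g.add_edge(ids[u], ids[v], w)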

3. Choose the Right Algorithm

For large graphs:

  • Use parallel implementations when available
  • Use approximation algorithms when exact solutions are too slow
  • Consider the time complexity of algorithms

4. Handle Missing Data

Always check for None or empty results:

result = g.shortest_path(source, target)
if result:
    distance, path = result
    print(f"Path found: {path}, distance: {distance}")
else:
    print("No path exists")

Next Steps