Building payslip-parser: Extracting Structured Data from Payslip PDFs

Payslips are data-rich documents that almost nobody actually analyses. They contain your gross pay, net pay, income tax, National Insurance, pension contributions, and year-to-date totals - all the information you need to track salary progression over time, verify your tax code is correct, and prepare for self-assessment. But they arrive as PDFs, and nobody manually transfers PDF numbers into a spreadsheet.

payslip-parser does that transfer automatically. Drop your PDF payslips into a folder, run one command, and get a clean Excel workbook with monthly pay data, P60 annual summaries, and pay review letters - all structured and formatted.

The PDF Parsing Challenge

PDF parsing sounds simple until you try it. There are two problems:

Problem 1: Text extraction order. A PDF page is a flat collection of text objects with x/y coordinates. When you extract text naively, the library reads objects in the order they appear in the PDF stream - which is usually left-to-right, top-to-bottom, but not always, and never in a way that preserves multi-column layouts.

A payslip has three visual columns: Pay & Allowances on the left, Deductions in the middle, Totals & Balances on the right. Naive extraction gives you one jumbled sequence mixing all three.

Problem 2: No consistent structure. Different payroll providers format their PDFs differently. Even the same provider changes layouts between versions. An approach that works for one employer’s payslips won’t necessarily work for another’s.

Coordinate-Based Column Splitting

The solution to Problem 1 is to stop treating the PDF as text and treat it as a spatial document. pdfplumber exposes word-level bounding boxes - each word has x0, top, x1, bottom coordinates.

Using these coordinates, the parser reconstructs the three-column layout:

_COL2_X = 200   # Deductions column starts here
_COL3_X = 370   # Totals & Balances column starts here
_Y_TOL  = 3     # tolerance for grouping words into the same row

def _extract_columns(page):
    words = page.extract_words()

    # Group words into rows by y-coordinate
    rows: dict[float, list] = {}
    for w in words:
        y = round(w['top'] / _Y_TOL) * _Y_TOL
        rows.setdefault(y, []).append(w)

    result = []
    for y in sorted(rows):
        row_words = sorted(rows[y], key=lambda w: w['x0'])
        cols = ['', '', '']
        for w in row_words:
            x = w['x0']
            idx = 0 if x < _COL2_X else (1 if x < _COL3_X else 2)
            cols[idx] = (cols[idx] + ' ' + w['text']).strip()
        result.append(tuple(cols))
    return result

Each row becomes a three-element tuple: (col1_text, col2_text, col3_text). A row like ('BASE PAY £4,500.00', 'Income Tax -£892.40', 'Taxable Gross £4,500.00') can now be parsed independently per column.

Dynamic Pay Item Discovery

Different payslips have different line items - some months include overtime, bonus, or allowances; others don’t. The parser discovers pay items dynamically rather than expecting a fixed set of columns:

seen_items: dict[str, None] = {}
for p in payslips:
    for k in p['pay_items']:
        seen_items.setdefault(k, None)
dynamic = [k for k in seen_items if k != 'BASE PAY']

Every pay item that appears in any payslip becomes a column in the output. If an item doesn’t exist in a given month, its cell is empty. This means the spreadsheet handles years of payslips - including months with bonus payments or one-off allowances - without any configuration.

Three Document Types

The parser handles three document types, each with different extraction logic:

Monthly payslips - three-column table layout. The column-splitting approach handles these.

P60 annual certificates - single-column text with regex extraction. P60s are standardised HMRC forms with consistent field labels, so regex works reliably:

m = re.search(r'In this employment\s*\*?\s*£([\d,]+\.?\d*)\s+£([\d,]+\.?\d*)', flat)
if m:
    data['pay'] = parse_amount(m.group(1))
    data['tax_deducted'] = parse_amount(m.group(2))

Pay and reward letters - structured letters with a salary table. Extracted by finding the table header and parsing the rows that follow, handling different letter types (pay review, Q1 review, benefit funding change, promotion/reward).

Excel Output

The output workbook has three sheets, each styled for readability:

Monthly Payslips - chronological rows from oldest to newest, with alternating row shading, currency formatting (£#,##0.00), and frozen header row
P60 Annual - one row per tax year with full NI breakdown
Pay & Reward Letters - salary table extracted from each letter

The styling is applied via openpyxl:

_HDR_FILL = PatternFill(start_color='1F4E79', end_color='1F4E79', fill_type='solid')
_HDR_FONT = Font(color='FFFFFF', bold=True)
_ALT_FILL = PatternFill(start_color='D6E4F0', end_color='D6E4F0', fill_type='solid')

Auto-width columns cap at 28 characters to prevent any single column from dominating the layout.

Debug Mode

When adapting to a different payslip format, the --debug flag dumps the raw extracted text from a PDF:

python3 parse.py --debug march-payslip.pdf

This shows exactly what pdfplumber sees, making it straightforward to identify the x-boundaries for the column splitting and adjust _COL2_X and _COL3_X accordingly.

Privacy

All processing happens locally. No data is sent anywhere. The payslips/ directory and the output .xlsx file are gitignored by default - there’s no way to accidentally commit your financial data to a public repository.

The project is on GitHub.