Perl-Python Tutorial: Splitting a Line by Regex

2005-04-15

Often you need to split a line by a textual pattern. This page shows you how.

Python

I have a file that is a translation of Chinese lyrics. It is formatted like this:

你是我最苦澀的等待   |   you are my hardest wait
讓我歡喜又害怕未來   |   giving me joy and also fear the future

The left side is Chinese, the right side is English. (See the file here: http://xahlee.org/Periodic_dosage_dir/sanga_pemci/klaku_canre.html) I want to write a program that splits each line, so that I get the whole Chinese part or the whole English part.

Here's the code:

# -*- coding: utf-8 -*-
# Python

import re

myText = ur'''你是我最苦澀的等待   |   you are my hardest wait
讓我歡喜又害怕未來   |   giving me joy and also fear the future'''
 
lines=myText.splitlines() # or lines=re.split(r'\n',myText)

for ln in lines:
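    # flags= must be a keyword (Python 2.7+): the 3rd positional argument of re.split is maxsplit
    fracture=re.split(r'\s*\|\s*',ln,flags=re.U)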
    print fracture[0].encode('utf-8')     # prints just the Chinese column
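
For readers on Python 3, the same idea might look like the sketch below. It is not the original script: it assumes Python 3 (where string literals are already Unicode, so the ur prefix and the .encode call are not needed) and assumes every non-empty line contains a “|”. The names SEPARATOR and split_columns are made up for this illustration.

```python
import re

# A literal "|" with optional whitespace on both sides.
SEPARATOR = re.compile(r'\s*\|\s*')

def split_columns(text):
    """Return a list of (chinese, english) pairs, one per non-empty line.

    Assumes each non-empty line contains at least one "|" separator.
    """
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # maxsplit=1 keeps any later "|" inside the English column
        chinese, english = SEPARATOR.split(line, maxsplit=1)
        pairs.append((chinese, english))
    return pairs

my_text = '''你是我最苦澀的等待   |   you are my hardest wait
讓我歡喜又害怕未來   |   giving me joy and also fear the future'''

for chinese, english in split_columns(my_text):
    print(chinese, '=>', english)
```

Compiling the pattern once avoids having to pass flags at every call, and returning both columns lets the caller pick either side.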

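Unicode chars can be included in regex patterns directly. Just make sure your pattern string starts with ur, and pass “re.U” as the flags argument so that the re functions work in Unicode mode. With re.split, write it as the keyword argument “flags=re.U” (available from Python 2.7), because the third positional argument of re.split is maxsplit, not flags. With re.search the third argument is the flags, so this works: “re.search(ur'苦',mystring,re.U)”.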
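
The difference between the third positional argument of re.split and its flags argument can be checked with a small sketch. It is written for Python 3, which is an assumption (the code on this page is Python 2); there re.U is redundant for str patterns but still has the integer value 32. The names SEP, line, limited, and full are made up for illustration.

```python
import re

SEP = r'\s*\|\s*'

# Build a line with 40 fields, i.e. 39 "|" separators.
line = '|'.join(str(i) for i in range(40))

# Passing re.U as the third positional argument is the same as maxsplit=32,
# so splitting stops after 32 separators and the rest stays in one piece.
limited = re.split(SEP, line, maxsplit=int(re.U))

# Passing it as flags= splits on every separator.
full = re.split(SEP, line, flags=re.U)

print(len(limited), len(full))
```

On short lines like the lyrics above the mistake goes unnoticed, because a limit of 32 splits is never reached.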

A Unicode character can also be written as “\u” followed by its four-digit hexadecimal code. For example, to match the Unicode “em space”, which has hexadecimal code 2003, do “re.search(ur'\u2003',mystring,re.U)”. (The “em space” char can also be included literally.)
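
Here is a short sketch of the “\u” form, again assuming Python 3 rather than the Python 2 used on this page. In Python 3.3 and later the re module itself understands \uXXXX inside a str pattern, so a raw string works. The names EM_SPACE, text, match, parts, and literal_match are made up for illustration.

```python
import re

EM_SPACE = '\u2003'                   # the em space as an actual character
text = 'left' + EM_SPACE + 'right'

# The escape written in the pattern; the re module resolves \u2003 itself.
match = re.search(r'\u2003', text)
print(match.start())

# The same escape used as a split separator.
parts = re.split(r'\u2003', text)
print(parts)

# The literal character as the pattern finds the same position.
literal_match = re.search(EM_SPACE, text)
print(literal_match.span())
```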

See also: String Pattern Matching (regex) Documentation.

Perl

To split a line into a list using a text pattern as the separator, use the function “split”. Here's a basic example:

# perl

$myText = '你是我最苦澀的等待   |   you are my hardest wait
讓我歡喜又害怕未來   |   giving me joy and also fear the future';

@lines= split (/\n/,$myText);

# uncomment the next two lines to inspect @lines
# use Data::Dumper;
# print Dumper(\@lines);

for $ln (@lines) {
    @fracture = split(/\s*\|\s*/, $ln);
    print "$fracture[0]\n";   # prints just the Chinese column
  }

Reference: perldoc -f split.

Reference: perldoc perlre.


Page created: 2005-02.
© 2005 by Xah Lee.