Perl-Python Tutorial: Splitting a Line by Regex

2005-04-15

Often you need to split a line by a textual pattern. This page shows you how.

Python

I have a file that is a translation of Chinese lyrics. It is formatted like this:

你是我最苦澀的等待   |   you are my hardest wait
讓我歡喜又害怕未來   |   giving me joy and also fear the future

The left side is Chinese and the right side is English. (See the file here: 哭沙 (Weeping Sand).) I want a program that splits each line, so that I get the whole Chinese part or the whole English part.

Here's the code:

# -*- coding: utf-8 -*-
# Python

import re

myText = ur'''你是我最苦澀的等待   |   you are my hardest wait
讓我歡喜又害怕未來   |   giving me joy and also fear the future'''
 
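lines = myText.splitlines()  # or lines = re.split(r'\n', myText)

# Compile the pattern with re.U. Passing re.U as the third argument
# of re.split would set maxsplit, not the flags.
separator = re.compile(ur'\s*\|\s*', re.U)

for ln in lines:
    fracture = separator.split(ln)
    print fracture[0].encode('utf-8')     # prints just the Chinese column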
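
The listing above is Python 2 and prints only the Chinese column. As a rough sketch of getting both columns, here is a Python 3 variant (Python 3 is an assumption here: every str is already Unicode, so the ur prefix and the encode call go away). The names split_columns and separator are made up for illustration, and the sketch assumes every non-empty line contains at least one “|”.

```python
import re

# Python 3 sketch: str is Unicode, so no u/ur prefix is needed.
my_text = '''你是我最苦澀的等待   |   you are my hardest wait
讓我歡喜又害怕未來   |   giving me joy and also fear the future'''

# Compile once. "|" is escaped because a bare "|" means alternation.
separator = re.compile(r'\s*\|\s*')

def split_columns(text):
    """Return a list of (chinese, english) tuples, one per non-empty line.

    Hypothetical helper; assumes each non-empty line has a "|" separator.
    """
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # maxsplit=1 keeps any later "|" inside the English column
        chinese, english = separator.split(line, maxsplit=1)
        pairs.append((chinese, english))
    return pairs

pairs = split_columns(my_text)
for chinese, english in pairs:
    print(english)  # prints just the English column
```

Index 0 of each pair is the Chinese part and index 1 is the English part, so either column can be pulled out of the same result.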

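Unicode characters can be included in regex patterns directly. Just make sure the pattern string starts with ur and that the “re.U” flag is set, which tells the re module to work in Unicode mode. For re.search the flag is the third argument, for example: “re.search(ur'苦',mystring,re.U)”. For re.split the third positional argument is maxsplit, so either pass the flag as “flags=re.U” (Python 2.7 and later) or compile the pattern with re.compile first.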
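
One detail worth checking when passing re.U: the signature is re.split(pattern, string, maxsplit=0, flags=0), so a flag placed in the third position is read as a split limit. A small sketch of the two safe spellings, written for Python 3 (an assumption; there re.U is already the default for str patterns, so the flag is shown only to illustrate where it goes):

```python
import re

line = 'a | b | c'
pattern = r'\s*\|\s*'

# Safe spelling 1: pass the flag by keyword.
by_keyword = re.split(pattern, line, flags=re.U)

# Safe spelling 2: attach the flag when compiling, then call .split().
by_compiled = re.compile(pattern, re.U).split(line)

# What the third position really controls: the maximum number of splits.
limited = re.split(pattern, line, maxsplit=1)

# re.U is the integer 32, so re.split(pattern, line, re.U) would mean
# "split at most 32 times" rather than "use Unicode mode".
flag_value = int(re.U)

print(by_keyword)
print(limited)
```

With a short line the mistake goes unnoticed, because 32 splits is more than enough. It only shows up on lines with more than 32 separators.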

Unicode characters can also be represented by “\u” followed by the four-digit hexadecimal code point. For example, to match the Unicode “em space”, which is hexadecimal 2003, use “re.search(ur'\u2003',mystring,re.U)”. (The “em space” character can also be included literally.)
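
To make the “\u” form concrete, here is a short sketch in Python 3 (an assumption; in Python 3 the ur prefix does not exist, and the re module itself understands \u2003 inside a raw pattern). The sample string is made up for illustration.

```python
import re

# '\u2003' is EM SPACE, written by its hexadecimal code point.
em_space = '\u2003'
sample = 'left' + em_space + 'right'

# The string escape produces the actual character before re sees it.
match = re.search('\u2003', sample)

# In a raw string, the re module interprets the \u escape itself.
raw_match = re.search(r'\u2003', sample)

# Splitting on the em space separates the two words.
parts = re.split('\u2003', sample)

# For str patterns, \s covers Unicode whitespace, including EM SPACE.
ws_parts = re.split(r'\s+', sample)

print(parts)
```

Both searches find the em space at the same position, which is the sense in which the escaped and literal forms are interchangeable.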

See also: String Pattern Matching (regex) Documentation.

Perl

To split a line into a list using a text pattern as the separator, use the function “split”. Here's a basic example:

# perl

$myText = '你是我最苦澀的等待   |   you are my hardest wait
讓我歡喜又害怕未來   |   giving me joy and also fear the future';

@lines= split (/\n/,$myText);

# use Data::Dumper;
# print Dumper(\@lines);   # uncomment both lines to inspect the list

for $ln (@lines) {
    @fracture = split(/\s*\|\s*/, $ln);
    print "$fracture[0]\n";   # prints just the Chinese column
}

See also: perldoc -f split and perldoc perlre.

2005-02
© 2005 by Xah Lee.