ML 1: String Cleaning

With Python’s split() and strip()
machine learning
Author

Tony Phung

Published

January 14, 2025

0. Import Data as a String

player_str = """
id,   name,   dob
8, iniesta  , 1984 
 10,   messi , 1987  
 16   ,pedri, 2003
 """

player_str
'\nid,   name,   dob\n8, iniesta  , 1984 \n 10,   messi , 1987  \n 16   ,pedri, 2003\n '

1. split() the players by row

  • Split strings by a chosen delimiter
  • A list is returned with each item or element as its own string
player_str.split("\n") # split on the newline character, but note the stray items at the start and end!
# i.e. the first entry is an empty string and the last entry is a single space.
['',
 'id,   name,   dob',
 '8, iniesta  , 1984 ',
 ' 10,   messi , 1987  ',
 ' 16   ,pedri, 2003',
 ' ']
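
The empty first item and the whitespace-only last item follow from how split() works: it cuts at every occurrence of the delimiter and keeps whatever lies between, even when that is nothing. A minimal sketch using made-up strings (the names `csv_like` and `wrapped` are illustrative and not part of the player data):

```python
# Illustrative strings only; not part of the player data above.
csv_like = "a,b,,c"
parts = csv_like.split(",")   # two commas in a row leave an empty string between them
print(parts)                  # ['a', 'b', '', 'c']

wrapped = "\nrow\n "
lines = wrapped.split("\n")   # nothing before the first \n, a single space after the last
print(lines)                  # ['', 'row', ' ']
```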

2. strip() whitespace before splitting

Remove whitespace at the beginning and end of a string.

player_list = player_str.strip().split("\n") # strip leading/trailing whitespace first, then split on \n
player_list # the list no longer has the empty first and whitespace-only last elements
['id,   name,   dob',
 '8, iniesta  , 1984 ',
 ' 10,   messi , 1987  ',
 ' 16   ,pedri, 2003']
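
Note that strip() only touches the two ends of the string it is called on, which is why the padding inside each row is still there. A small sketch with an illustrative string (`sample` is a made-up name), also showing the one-sided variants lstrip() and rstrip():

```python
# Illustrative string: one padded row wrapped in extra whitespace and newlines.
sample = "  \n 10,   messi , 1987  \n "

both_ends = sample.strip()    # removes whitespace (including \n) from both ends
left_only = sample.lstrip()   # removes it from the start only
right_only = sample.rstrip()  # removes it from the end only

print(repr(both_ends))   # '10,   messi , 1987'
print(repr(left_only))   # '10,   messi , 1987  \n '
print(repr(right_only))  # '  \n 10,   messi , 1987'
```

The spaces after the commas survive all three calls, so each value still has to be stripped individually later on.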

3. Iterate through players

Each line is still one long string per player, i.e. a single column.

for i,player_line in enumerate(player_list): # iterate through each item of list and do something
    print(i, repr(player_line), type(player_line)) # showing that they're indeed strings
0 'id,   name,   dob' <class 'str'>
1 '8, iniesta  , 1984 ' <class 'str'>
2 ' 10,   messi , 1987  ' <class 'str'>
3 ' 16   ,pedri, 2003' <class 'str'>

4. Split player info for each player

Split each player string into separate items, so there are 3 columns, one for each piece of the player's information.

for player_line in player_list: # recall: split() turns a string into a list, cut at the delimiter
    player_item = player_line.split(",") # str: 'some,cool,string' -> list: ['some','cool','string']
    print(player_item)
['id', '   name', '   dob']
['8', ' iniesta  ', ' 1984 ']
[' 10', '   messi ', ' 1987  ']
[' 16   ', 'pedri', ' 2003']

5. repr() shows each string's exact value

Notice the unnecessary whitespace that still needs to be removed.

for player_line in player_list:
    for player_info in player_line.split(","): # str: 'some,cool,string' -> list: ['some','cool','string']
        print(repr(player_info)) # each player_info still needs cleaning
    
'id'
'   name'
'   dob'
'8'
' iniesta  '
' 1984 '
' 10'
'   messi '
' 1987  '
' 16   '
'pedri'
' 2003'
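
repr() is useful here because plain print() renders leading and trailing spaces invisibly. A minimal sketch using one value from the output above (the variable name `field` is illustrative):

```python
field = " 1984 "

print(field)        # the surrounding spaces are printed but cannot be seen
print(repr(field))  # ' 1984 '  (the quotes make the padding visible)

# Comparing lengths is another way to confirm hidden whitespace.
print(len(field), len(field.strip()))  # 6 4
```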

6. strip() each player's information

for player_line in player_list:
    for player_info in player_line.split(","):
        print(repr(player_info.strip())) # strip() removes the padding around each value
    
'id'
'name'
'dob'
'8'
'iniesta'
'1984'
'10'
'messi'
'1987'
'16'
'pedri'
'2003'

7. Append each player to create a clean players list

players = []
for player_line in player_list:
    player = [strg.strip() for strg in player_line.split(",")] # strip() the whitespace around each value
    print(player)
    players.append(player)
['id', 'name', 'dob']
['8', 'iniesta', '1984']
['10', 'messi', '1987']
['16', 'pedri', '2003']
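
The steps above can be gathered into one small helper. This is a sketch rather than the post's own code: the function name `clean_rows`, its `delimiter` parameter, and the variable `sample_str` are assumptions made for illustration, and the sample string simply repeats the value of player_str so the block runs on its own.

```python
def clean_rows(text, delimiter=","):
    """Split a multi-line string into rows of whitespace-stripped fields."""
    rows = []
    for line in text.strip().split("\n"):  # drop outer whitespace, then one item per row
        rows.append([field.strip() for field in line.split(delimiter)])
    return rows


# Same content as player_str, written with explicit \n so trailing spaces are preserved.
sample_str = "\nid,   name,   dob\n8, iniesta  , 1984 \n 10,   messi , 1987  \n 16   ,pedri, 2003\n "

header, *records = clean_rows(sample_str)
print(header)   # ['id', 'name', 'dob']
print(records)  # [['8', 'iniesta', '1984'], ['10', 'messi', '1987'], ['16', 'pedri', '2003']]
```

Unpacking with `header, *records` separates the column names from the player rows. All values are still strings at this point.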