Data Cleaning in Excel
One of the fundamental truths of data is that it gets messier over time. It”™s unavoidable ”“ a well-used spreadsheet, even a modest one, will get messier and messier as more and more people touch it. Data gets dirty in a variety of ways ”“ from honest input mistakes to poorly-defined formulas feeding other formulas and tables, to simple aging of the information set.
Occasional data cleansing is therefore not simply recommended, but required. Microsoft Excel remains the most widely-used office desktop spreadsheet, and many businesses still rely on Excel spreadsheets with plenty of macros to run various aspects of their business. While you can always hire data cleansing companies to step in and dig you out of your hole, you don”™t have to unless your data is truly a mess. Here are some of the basic techniques for data cleansing in Excel.
Make it Uniform
One of the basic problems that develop over time in a spreadsheet is inconsistent styling of data. Text string come in lowercase, uppercase, or mixed, and numbers are entered with or without decimals, and often the individual number style of a cell gets inadvertently changed as well. Making your data uniform with a regular data cleansing is vital to keeping your spreadsheet readable:
- Spell check. It sounds so basic, but accomplishes so much.
- Eliminate unintentional duplicate rows. Sometimes duplicate rows are purposeful to weight a sample, but often they are mistakes that will skew your tables and queries. There”™s a useful ‘Remove Duplicates’ function on the ‘DATA’ tab in Excel for this purpose.
- Search and replace. A simple S&R run on common terms in your data will make them appear in a uniform manner. Variances in spelling or abbreviations can result in data being counted twice, or in the wrong categories.
- Uppercase or lowercase. Again, it seems basic, but in some spreadsheets a distinction is made between lowercase and uppercase entries.
Once your data is in a uniform state, data cleansing can help you can seek out some of the most common errors that crop up over time in just about any spreadsheet:
- Number format: Select the columns or rows that contain numbers and choose either ‘Format Styles’ from the Ribbon or right-click and select Format Cells. Then choose the appropriate number format for that column to instantly fix input errors.
- Date: Date format is one of the most common ‘dirty”™ bits of data in a spreadsheet that can be rectified by data cleansing. Selecting the columns or rows (or individual cells) that contain date data and changing them to a consistent format using the ‘Format Cells”™ function will eliminate those errors.
- Merge Columns: Sometimes columns are created that house essentially similar data, and having them as separate columns is simply confusing or, if they are situated far enough apart in a large spreadsheet (often the cause of the duplication in the first place) corrupting to your data. You can merge columns using a simple formula: If you have columns A, B, and C you wish to merge without losing data, selecting the first cell in another column (say, column ‘E”™) and paste in the formula ‘=A1&B1&C1”™. Then select the cells you want the formula to apply to, then paste that resulting formula into every corresponding cell of column E.
A regularly scheduled data cleansing for your spreadsheets will save you a lot of heartache and trouble and time, as the longer you let your Excel spreadsheets run wild, the longer it will take to clean them up.
– Data Czar @ DEO
jordan June 13, 2014 at 7:13 am