Oracle APEX - Parsing fixed-width formatted files

Introduction

This blog post focuses on how to parse a fixed-width formatted file using Oracle's APEX, SQL and PL/SQL to make it as efficient as possible.

What is a fixed-width formatted file?

The fixed-width format is a file format in which fields are not separated by a delimiter, as in a CSV, but instead have a fixed length and position within each line.

Here is a basic example

ID     FIRST_NAME          LAST_NAME           CITY                          COUNTRY
0000001LOUIS               MOREAUX             PARIS                         FRANCE                        
0000002JOHN                SMITH               CHICAGO                       UNITED STATES       
0000003JANE                DOE                 LONDON                        UNITED KINGDOM

As you can see, there are five columns, but no separators. Each column has a fixed width.

Column       Size
ID           7 characters
FIRST_NAME   20 characters
LAST_NAME    20 characters
CITY         30 characters
COUNTRY      30 characters
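
For reference, a line in this layout can be built by padding each value to its column width, for example with lpad() and rpad(). This is just an illustrative sketch using the values from the first sample record:

select lpad(1, 7, '0')
       || rpad('LOUIS', 20)
       || rpad('MOREAUX', 20)
       || rpad('PARIS', 30)
       || rpad('FRANCE', 30) as fixed_width_line
  from dual;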

How to read it?

There are many articles and examples on how to read XLSX or CSV files, and it can even be done declaratively using the data load definition. However, I haven't found anything about the fixed-width format. This is probably because it's an old format, less used today, but if you have to communicate with existing systems that only understand this format, you have no choice.

The first challenge is to extract the list of lines from the file. To do this, we'll use four different options and compare them.

APEX_DATA_PARSER

apex_data_parser is a useful package when it comes to parsing files in APEX (or in the Oracle database). It can parse XLSX, CSV, XML and JSON and has always been my first approach to reading files.

In this particular case, we'll treat our text file as a CSV, so the apex_data_parser.parse function will return the whole row in column col001 with the following query

select col001
  from table (
          apex_data_parser.parse(
             p_content   => l_blob_content,
             p_file_name => 'dataset.txt',
             p_file_type => apex_data_parser.c_file_type_csv
          )
       )

At this point, col001 will contain something like this

COL001
0000001LOUIS               MOREAUX             PARIS                         FRANCE
0000002JOHN                SMITH               CHICAGO                       UNITED STATES
0000003JANE                DOE                 LONDON                        UNITED KINGDOM

How do you divide this data into separate columns? We can use the SQL function substr() to do just that.

select rtrim(substr(column_value, 1, 7)) as id,
       rtrim(substr(column_value, 8, 20)) as first_name,
       rtrim(substr(column_value, 28, 20)) as last_name,
       rtrim(substr(column_value, 48, 30)) as city,
       rtrim(substr(column_value, 78, 30)) as country
  from (
          select col001 as column_value
            from table (
                    apex_data_parser.parse(
                       p_content   => l_blob_content,
                       p_file_name => 'dataset.txt',
                       p_file_type => apex_data_parser.c_file_type_csv
                    )
                 )
       ) raw_data;

The output of this query will be

ID       FIRST_NAME   LAST_NAME   CITY      COUNTRY
0000001  LOUIS        MOREAUX     PARIS     FRANCE
0000002  JOHN         SMITH       CHICAGO   UNITED STATES
0000003  JANE         DOE         LONDON    UNITED KINGDOM

You'll have noticed that we also use the rtrim() function to eliminate extra space characters at the end of each value.
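
These queries assume that l_blob_content already holds the uploaded file. In an APEX page process it can be fetched from apex_application_temp_files, for example like this (P1_FILE is the file browse page item also used in the page process shown later):

select blob_content
  into l_blob_content
  from apex_application_temp_files
 where application_id = :APP_ID
   and name = :P1_FILE;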

APEX_STRING

Another option is to use the apex_string.split() function. In this case, the query will be

select rtrim(substr(column_value, 1, 7)) as id,
       rtrim(substr(column_value, 8, 20)) as first_name,
       rtrim(substr(column_value, 28, 20)) as last_name,
       rtrim(substr(column_value, 48, 30)) as city,
       rtrim(substr(column_value, 78, 30)) as country
  from (
          select column_value
            from table (apex_string.split(l_clob_content))
       ) raw_data;

The main difference is that we first have to convert our file to a CLOB.
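
One way to do that conversion is dbms_lob.converttoclob. Here is a minimal sketch, assuming the file is encoded in the database character set (the page process later in this post simply applies to_clob() to the BLOB):

declare
   l_blob_content blob;
   l_clob_content clob;
   l_dest_offset  integer := 1;
   l_src_offset   integer := 1;
   l_lang_context integer := dbms_lob.default_lang_ctx;
   l_warning      integer;
begin
   -- l_blob_content is assumed to already hold the uploaded file
   dbms_lob.createtemporary(l_clob_content, true);
   dbms_lob.converttoclob(
      dest_lob     => l_clob_content,
      src_blob     => l_blob_content,
      amount       => dbms_lob.lobmaxsize,
      dest_offset  => l_dest_offset,
      src_offset   => l_src_offset,
      blob_csid    => dbms_lob.default_csid,
      lang_context => l_lang_context,
      warning      => l_warning);
end;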

JSON_TABLE

The third option evaluated here is to use json_table() to read the file. The query will be

select rtrim(substr(column_value, 1, 7)) as id,
       rtrim(substr(column_value, 8, 20)) as first_name,
       rtrim(substr(column_value, 28, 20)) as last_name,
       rtrim(substr(column_value, 48, 30)) as city,
       rtrim(substr(column_value, 78, 30)) as country
  from (
          select val as column_value
            from json_table ('["'
                    || replace(l_clob_content, chr(10), '","')
                    || '"]', '$[*]'
                    columns (
                       val varchar2(4000) path '$'
                    )
                 )
       ) raw_data;
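
Note that building the JSON array with a single replace() assumes the file contains no double quotes or backslashes and uses LF line endings. If that cannot be guaranteed, those characters have to be escaped first, along these lines (a sketch, escaping backslashes before quotes):

select val as column_value
  from json_table ('["'
          || replace(replace(replace(l_clob_content,
                '\', '\\'),
                '"', '\"'),
                chr(10), '","')
          || '"]', '$[*]'
          columns (
             val varchar2(4000) path '$'
          )
       )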

DBMS_LOB

The last option evaluated here (I'm sure there are others) is dbms_lob, a package that provides several handy subprograms for working with BLOBs, CLOBs and BFILEs. The code used to read the file is the following

declare
   l_clob_content clob;
   l_offset       number := 1;
   l_amount       number;
   l_len          number;
   l_buffer       varchar2(32767);
begin
   -- l_clob_content is assumed to already hold the file content
   l_len := dbms_lob.getlength(l_clob_content);
   if (dbms_lob.isopen(l_clob_content) != 1) then
      dbms_lob.open(l_clob_content, dbms_lob.lob_readonly);
   end if;
   -- All records have the same length, so the position of the first
   -- line feed is the number of characters to read for each line
   -- (line terminator included)
   l_amount := instr(l_clob_content, chr(10), l_offset);
   while (l_offset < l_len)
   loop
      dbms_lob.read(l_clob_content, l_amount, l_offset, l_buffer);
      dbms_output.put_line(l_buffer);
      -- Move the offset to the start of the next record
      l_offset := l_offset + l_amount;
   end loop;
   if (dbms_lob.isopen(l_clob_content) = 1) then
      dbms_lob.close(l_clob_content);
   end if;
end;

Which option is the fastest?

To answer this question, we need to compare the four approaches with the same data set. To do this, I generated a TXT file containing 100,000 rows for the 5 columns mentioned above. I created it using a simple table, the amazing Data Generator utility and the apex_data_export package. I may write another blog post about it. If you're interested, leave a comment!

Database table for the timing

All the credit for this goes to Carsten Czarski; I simply reuse what he built in this Oracle Live SQL script.

create table timings(  
    action varchar2(200) primary key,  
    start_time timestamp,  
    end_time timestamp 
);

Oracle APEX application

I created a simple Oracle APEX page to manage the file upload, containing a file browse item, the process for reading the file content and a report showing the results.

Screenshot of the Oracle APEX application used to evaluate the performance of the four options

Here is the PL/SQL code of the page process

declare
   l_clob_content clob;
   l_blob_content blob;
   type t_tab is table of fixed_width%rowtype;
   l_rec          fixed_width%rowtype;
   l_tab          t_tab;
   l_offset       number      := 1;
   l_amount       number      := 32767;
   l_len          number;
   l_buffer       varchar2(32767);
   i              pls_integer := 1;
begin
   -- Delete the timings table before each run
   delete from timings;

   -- Get the blob and the clob from the temporary table
   select blob_content, to_clob(blob_content)
     into l_blob_content, l_clob_content
     from apex_application_temp_files
    where application_id = :app_id
      and name = :p1_file;

   -- APEX_STRING
   -- Insert the start time in the timings table
   insert into timings values ('APEX_STRING', systimestamp, null); 

   -- Bulk collect the file content
   select rtrim(substr(column_value, 1, 7)) as id,
          rtrim(substr(column_value, 8, 20)) as first_name,
          rtrim(substr(column_value, 28, 20)) as last_name,
          rtrim(substr(column_value, 48, 30)) as city,
          rtrim(substr(column_value, 78, 30)) as country
     bulk collect into l_tab
     from (
             select column_value
               from table (apex_string.split(l_clob_content))
          ) raw_data;

   -- Update the timings table with the end time
   update timings set end_time = systimestamp where action = 'APEX_STRING';

   -- Clean up the collection
   l_tab := t_tab();

   -- APEX_DATA_PARSER
   -- Insert the start time in the timings table
   insert into timings values ('APEX_DATA_PARSER', systimestamp, null); 

   -- Bulk collect the file content
   select rtrim(substr(column_value, 1, 7)) as id,
          rtrim(substr(column_value, 8, 20)) as first_name,
          rtrim(substr(column_value, 28, 20)) as last_name,
          rtrim(substr(column_value, 48, 30)) as city,
          rtrim(substr(column_value, 78, 30)) as country
     bulk collect into l_tab
     from (
             select col001 as column_value
               from table (
                       apex_data_parser.parse(
                          p_content   => l_blob_content,
                          p_file_name => 'dataset.txt',
                          p_file_type => apex_data_parser.c_file_type_csv
                       )
                    )
          ) raw_data;

   -- Update the timings table with the end time
   update timings set end_time = systimestamp where action = 'APEX_DATA_PARSER';

   -- Clean up the collection
   l_tab := t_tab();

   -- JSON_TABLE
   -- Insert the start time in the timings table
   insert into timings values ('JSON_TABLE', systimestamp, null); 

   -- Bulk collect the file content
   select rtrim(substr(column_value, 1, 7)) as id,
          rtrim(substr(column_value, 8, 20)) as first_name,
          rtrim(substr(column_value, 28, 20)) as last_name,
          rtrim(substr(column_value, 48, 30)) as city,
          rtrim(substr(column_value, 78, 30)) as country
     bulk collect into l_tab
     from (
             select val as column_value
               from json_table ('["'
                       || replace(l_clob_content, chr(10), '","')
                       || '"]', '$[*]'
                       columns (
                          val varchar2(4000) path '$'
                       )
                    )
          ) raw_data;

   -- Update the timings table with the end time
   update timings set end_time = systimestamp where action = 'JSON_TABLE';

   -- Clean up the collection
   l_tab := t_tab();

   -- DBMS_LOB
   -- Insert the start time in the timings table
   insert into timings values ('DBMS_LOB', systimestamp, null);

   l_len := dbms_lob.getlength(l_clob_content);
   begin
      if (dbms_lob.isopen(l_clob_content) != 1) then
         dbms_lob.open(l_clob_content, 0);
      end if;
      l_amount := instr(l_clob_content, chr(10), l_offset);
      while (l_offset < l_len)
      loop
         dbms_lob.read(l_clob_content, l_amount, l_offset, l_buffer);
         l_rec.id         := rtrim(substr(l_buffer, 1, 7));
         l_rec.first_name := rtrim(substr(l_buffer, 8, 20));
         l_rec.last_name  := rtrim(substr(l_buffer, 28, 20));
         l_rec.city       := rtrim(substr(l_buffer, 48, 30));
         l_rec.country    := rtrim(substr(l_buffer, 78, 30));
         -- Extend the collection before filling the next slot
         l_tab.extend;
         l_tab(i)         := l_rec;
         l_offset         := l_offset + l_amount;
         i                := i + 1;
      end loop;
      if (dbms_lob.isopen(l_clob_content) = 1) then
         dbms_lob.close(l_clob_content);
      end if;
   end;
   -- Update the timings table with the end time
   update timings set end_time = systimestamp where action = 'DBMS_LOB';
end;
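
Note that this process only measures the parsing. In a real import, the collected rows would then typically be written to the target table with a single FORALL statement added at the end of the process, along these lines (l_tab and fixed_width are the collection and table from the code above):

   forall j in 1 .. l_tab.count
      insert into fixed_width values l_tab(j);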

The results are displayed using this SQL query

select action,
       to_char(extract(minute from (end_time - start_time)) * 60 + extract(second from (end_time - start_time)), '990D000')
       as elapsed_time_seconds
  from timings
 order by 2 asc

The results are

Screenshot showing the timing results, displaying that JSON_TABLE takes 0.582 seconds to read the file, DBMS_LOB 1.151 seconds, APEX_STRING 2.742 seconds and APEX_DATA_PARSER 5.524 seconds

As you can see, the clear winner is JSON_TABLE!

Conclusion

It was fun to find a solution for reading these fixed-width formatted files, and at first JSON_TABLE wasn't an option in my head. My first attempt was to use the dbms_lob package, but then I remembered Tom Kyte's mantra (slow by slow...):

My mantra, that I'll be sticking with thank you very much, is:

  • You should do it in a single SQL statement if at all possible.

  • If you cannot do it in a single SQL Statement, then do it in PL/SQL.

  • If you cannot do it in PL/SQL, try a Java Stored Procedure.

  • If you cannot do it in Java, do it in a C external procedure.

  • If you cannot do it in a C external routine, you might want to seriously think about why it is you need to do it…