Search This Site
Jul 09 2008 03:54 UTC | ||||||||||
XML ManagerStandard Edition |
• Try the Demo! • Download the FREE Trial! • Product Home Page |
• Getting Started Guide • How You Can Save Money • All Your Purchase Options |
Learn how to get CSV data into your database, and back out again:
To compile and run the Java programs in this example, either import them into your IDE
(say, Eclipse or NetBeans), or follow
the standard manual compilation instructions.
When running the programs, run them in the same folder as this README.htm file, so that
they can find their input files.
We will use a small CSV file as our initial example data, before we look at working with large files. This file contains a list of products, with data fields for name, price and code.
products.csvid,code,name,price 1,A01,Apple,99 2,O01,Orange,149 3,O02,Satsuma,219 4,B01,Banana,29 5,P01,Pear,129
We want to be able to insert this data into a database table, and to extract it from that table. The data contains string fields and number fields, so we'll have to make sure to store those properly in the database.
First, let's create our database table. In this example we are assuming that we only have to load and save data into one table. For your own projects, you will have to modify the code in this example to fit your own database. So here is our table:
product-table.sqlCREATE TABLE Product ( id integer, code varchar(20), name varchar(255), price integer );
The easiest way to handle CSV data with CSV Manager is just to load up all the data at once into memory and then use the data as need. Let's look at loading the data into the database using this approach.
Here's the code:
public void doLoadFull() throws Exception {
Connection con = getConnection();
PreparedStatement ps
= con.prepareStatement( "INSERT INTO Product VALUES (?,?,?,?)" );
CsvManager csvman = new CsvManager();
List data = csvman.load( "products.csv" );
for( int line = 1; line < data.size(); line++ ) {
String[] fields = (String[]) data.get(line);
insertRow( fields, ps );
}
con.close();
}
public void insertRow( String[] pFields,
PreparedStatement pPS )
throws Exception
{
pPS.setInt( 1, Integer.parseInt( pFields[0] ) );
pPS.setString( 2, pFields[1] );
pPS.setString( 3, pFields[2] );
pPS.setInt( 4, Integer.parseInt( pFields[3] ) );
pPS.executeUpdate();
}
(The full code for this example is shown in the ImportExport.java file)
So what's happening here? The doLoadFull method is the main one. First, we get a database connection to the
database where we'll be INSERTing the data (the code for the getConnection method is not shown, but you can see
it in the ImportExport.java file).
Second, we load up all the data at once using the CsvManager.load method. Now we have all the data lines from the CSV file in an
ArrayList as String[] arrays. The last thing to do is to loop throug this list, and insert each line into the
database. The insertRow method converts the data for each row into the right type of data for the database table. Notice that
we start at line 1, not line 0. Line 0 contains the data field header names, and we don't want to put those in the database.
This method of loading data from a CSV file into a database is simple, quick and effective. But there is one snag. What if you have lots of data?
What if you are trying to insert a really big gigabyte-sized file into the database? This method won't work as you'll run out of RAM and get a
dreaded java.lang.OutOfMemoryError. We'll look at the answer in a moment — let's look at how to save data from a database first.
Saving data using the direct approach is pretty much the same as loading data. Here's the code:
public static void doSaveFull() throws Exception {
Connection con = getConnection();
PreparedStatement ps
= con.prepareStatement( "SELECT * FROM Product" );
CsvManager csvman = new CsvManager();
csvman.getCsvSpec().setProperty("ResultSet.saveHeaders", true);
ResultSet rs = ps.executeQuery();
csvman.saveResultSet( "products-export.csv", rs );
con.close();
}
This looks even easier than loading — just call
CsvManager.saveResultSet and CSV Manager does all the work for you.
You don't even need a for loop this time.
Notice that we're using the special property
ResultSet.saveHeaders to output the data field headers as
the first line of the CSV file.
So what about those really big files. Well CSV Manager provides two ways of dealing with them. One is event-based and one is stream-based. Let's look at the event-based approach first.
The idea with the event-based approach is that you implement an interface, and CSV Manager calls the methods of that interface whenever a line of CSV data is loaded. That way you can handle lines as they are loaded, rather than waiting until the end when they are all in memory. Using this method, you can load a line of data, immediately save it to the database, and free up the memory for the next line. There is no limit to the amount of data you can process in this manner. (There may be a limit to how long you'll wait for all that data, but don't worry CSV Manager is also pretty fast!). So here's the code:
public void doLoadEvent() throws Exception {
CsvManager csvman = new CsvManager();
csvman.getCsvSpec().setStartLine(2);
csvman.load( "products.csv", new DatabaseListener() );
}
public final class DatabaseListener extends CustomLineListener {
private Connection con = null;
private PreparedStatement ps = null;
protected void startProcessImpl() throws Exception {
con = getConnection();
ps = con.prepareStatement( "INSERT INTO Product VALUES (?,?,?,?)" );
}
protected BadLine handleLineImpl( String[] pFields, int pNumFields,
long pLineNumber, String pOriginalLine)
throws Exception
{
insertRow( pFields, ps );
return null;
}
protected void endProcessImpl() throws Exception {
con.close();
}
}
As you can see, the actual loading code in the doLoadEvent method is very small — just a few lines. Notice
again that we only start on the second line to avoid the headers. The meat of the problem is attacked in the DatabaseListener class.
This extends CustomLineListener which is the recommended way to implement the LineListener interface.
What you need to do here is implement some methods so that CSV Manager can let you know what's going on during the loading process.
The startProcessImpl and endProcessImpl methods let you setup your database connection and close it again. They are called
before loading of CSV data begins, and after it finishes.
But how do you actually get the data? This happens in the handleLineImpl method.
The CSV data comes in via the pFields String[] array. In our case, we're just passing this array on to the insertRow
method that we used the last time. (We return null from handleLineImpl to indicate that there were no errors, otherwise
we can return a BadLine object to let CSV Manager know that something was wrong with the data.)
So what's happening here? Instead of loading all the data into one big data structure, we've given CSV Manager a new object,
DatabaseListener. CSV Manager knows how to use this object to pass on the data that it loads from the CSV file.
And inside DatabaseListener, we actually save the data to the database. Because all of this happens one line at a time,
we can keep processing lines for a very long time. This means that we can handle files with millions of lines (or more).
What about saving lots of data? What if your database table contains millions of rows and you need to get them into a CSV file? Well, we can use the same idea. Here's the code:
public static void doSaveEvent() throws Exception {
CsvManager csvman = new CsvManager();
csvman.save( "products-export.csv", new DatabaseProvider() );
}
public static void loadRow( String[] pFields,
ResultSet pResultSet )
throws Exception
{
pFields[0] = String.valueOf(pResultSet.getInt(1));
pFields[1] = pResultSet.getString(2);
pFields[2] = pResultSet.getString(3);
pFields[3] = String.valueOf(pResultSet.getInt(4));
}
public static final class DatabaseProvider extends CustomLineProvider {
private Connection con = null;
private ResultSet rs = null;
private boolean first = true;
protected void startProcessImpl() throws Exception {
con = getConnection();
PreparedStatement ps = con.prepareStatement( "SELECT * FROM Product" );
rs = ps.executeQuery();
}
protected boolean hasNextLineImpl() throws Exception {
if( first ) { return true; }
return rs.next();
}
protected String[] nextLineImpl() throws Exception {
String[] fields = new String[4];
if( first ) {
fields = new String[] {"id","code","name","price"};
first = false;
}
else {
loadRow( fields, rs );
}
return fields;
}
protected void endProcessImpl() throws Exception {
con.close();
}
}
Yeah, this is a bit more code than the other examples. Actually this is the longest piece of code in this example.
Anyway, the starting method doSaveEvent is pretty much the same as for event-based loading. In this case we use a
DatabaseProvider object.
We've also defined a new utility method: loadRow. This method just gets data out of a ResultSet row and
puts it into a String[] array.
So let's look at DatabaseProvider. The structure is pretty much the same as DatabaseListener.
The startProcessImpl and endProcessImpl methods do pretty much the same stuff. However,
startProcessImpl also executes the database query. Now, you might be saying to yourself, "hold on! that means that
you're loading all the data at once!" Actually we're safe. Almost all database drivers implement ResultSet using a fetch-size
concept. This means that only 10 or so rows (the fetch-size, which depends on the database) are ever loaded at one time. The database driver
goes back to the database for more when the current set runs out. It just appears from the outside that you have immediate access to all
the data. Clever, eh?
Back to DatabaseProvider. Let's look at where all the work happens,
inside the hasNextLineImpl and nextLineImpl methods. Inside hasNextLineImpl we move to the next
ResultSet row. We also check to see if this is the first line of CSV data.
If so, we don't go near the ResultSet and just return true. This allows us to output the CSV headers in
nextLineImpl.
So we get to nextLineImpl. CSV Manager is expecting to get back a String[] array of data from us.
This String[] array will be written out to the CSV file as the next line of data. We create an empty array and
get the data for the array using the loadRow utility method. The only special case is for the first line of the
CSV file. In this case we manually return the headers.
Well this approach is very nice and certainly solves the problem of handling big files. Still, you have to implement interfaces and handle callback methods. Sometimes is nicer just to use direct method calls. Let's look at that option now...
The CSV Manager streaming API works like ResultSet, so it should be fairly familiar.
Basically, you keep calling the hasNext and next
methods until there is no more data (hasNext returns false). The best way to see it is in code, so here it is:
public static void doLoadStream() throws Exception {
Connection con = getConnection();
PreparedStatement ps
= con.prepareStatement( "INSERT INTO Product VALUES (?,?,?,?)" );
CsvManager csvman = new CsvManager();
CsvLoader loader = csvman.makeLoader( "products.csv" );
boolean first = true;
loader.begin();
while( loader.hasNext() ) {
if( first ) {
first = false;
continue;
}
String[] fields = loader.next();
insertRow( fields, ps );
}
loader.end();
con.close();
}
All the database stuff is the same. The new bit is the CsvManager.makeLoader method call. This creates a CsvLoader object that is setup to load
data lines from the specified file (products.csv). You call methods on the CsvLoader in a defined sequence,
just like ResultSet.
The method calls work like so: at the start and end you call
CsvLoader.begin and CsvLoader.end. This lets CSV Manager know that
it should reserve resources for reading the CSV file, and then free them up again afterwards.
After calling begin, you then enter while loop based on the
CsvLoader.hasNext and CsvLoader.next methods. The hasNext method returns true so long as there is another
line of data to deal with. The next method returns that line of data as a String[] array.
As before, we then use the insertRow utility method inside this loop to actually insert the data into the database.
And also as before, we ignore the first line becuase it is the field name headers.
In principle the event-based and stream-based approaches are actually the same. They each load a line of data at a time. In the event-based
approach however, you must respond to a method call, so you do not determine when a line of data is loaded. In the stream-based case, you do decide.
No data is loaded until you start a hasNext, next cycle (small point:
the data is actually loaded in the hasNext method, as it has to be loaded anyway to tell if there is any data).
So you can use the stream-based API if you need exact control of the loading process. And it's also sometimes easier to think in terms of
method calls that you make, rather than callbacks that are made for you. The stream-based API is also useful if you want to chain data,
by wrapping the CsvLoader in a custom class or classes. This would be similar to the way you can chain InputStreams.
OK, let's look at saving data using the streaming API. Here's the code:
public static void doSaveStream() throws Exception {
Connection con = getConnection();
PreparedStatement ps = con.prepareStatement( "SELECT * FROM Product" );
CsvManager csvman = new CsvManager();
CsvSaver saver = csvman.makeSaver( "products-export.csv" );
ResultSet rs = ps.executeQuery();
saver.begin();
String[] fields = new String[] {"id","code","name","price"};
saver.next( fields );
while( rs.next() ) {
loadRow( fields, rs );
saver.next( fields );
}
saver.end();
con.close();
}
Saving is a little bit simpler than loading. You don't have to worry about a hasNext method,
you just keep calling CsvSaver.next while you have more data. In this example code we setup the database as usual,
open a ResultSet, and then use a CsvSaver to save the data one line at a time. As before, the header lines are a special case.
We use the loadRow utility method to get the data out of the database and into a String[] array.
Here is a list of all the files used in this example. Note that the actual source code is slightly longer than the examples above, which have been abridged for clarity.
Feel free to experiment with these files and see what happens. You can also use them as the basis for your own solutions.
Please feel free to email us at examples@ricebridge.com if you have any questions or comments about this example.
Got a question for us?
Just Ask!
$15 Gift Certificate for every bug you find.