| Oracle® Database Globalization Support Guide 10g Release 2 (10.2) Part Number B14225-02 |
|
|
View PDF |
The character set scanner utilities are tools for detecting and verifying valid and invalid data. The Language and Character Set File Scanner supports text files, while the Database Character Set Scanner scans data inside the database.
This chapter introduces the Language and Character Set File Scanner and the Database Character Set Scanner. The topics in this chapter include:
How to Handle Convertible or Lossy Data in the Data Dictionary
Storage and Performance Considerations in the Database Character Set Scanner
The Language and Character Set File Scanner (LCSSCAN) is a high-performance, statistically based utility for determining the language and character set for unknown file text. It can automatically identify a wide variety of language and character set pairs. With each text, the language and character set detection engine sets up a series of probabilities, each probability corresponding to a language and character set pair. The most statistically probable pair identifies the dominant language and character set.
The purity of the text affects the accuracy of the language and character set detection. The ideal case is literary text of one single language with no spelling or grammatical errors. These types of text may require 100 characters of data or more and can return results with a very high factor of confidence. On the other hand, some technical documents can require longer segments before they are recognized. Documents that contain a mix of languages or character sets or text such as addresses, phone numbers, or programming language code may yield poor results. For example, if a document has both French and German embedded, then the accuracy of guessing either language successfully is statistically reduced. Both plain text and HTML files are accepted. If the format is known, you should set the FORMAT parameter to improve accuracy.
This section includes the following topics:
Start the Language and Character Set File Scanner with the LCSSCAN command. Its syntax is as follows:
LCSSCAN [RESULTS=number] [FORMAT=file_type] [BEGIN=number] [END=number] FILE=file_name
The parameters are described in the rest of this section.
RESULTS
The RESULTS parameter is optional.
| Property | Description |
|---|---|
| Default value | 1 |
| Minimum value | 1 |
| Maximum value | 3 |
| Purpose | The number of language and character set pairs that are returned. They are listed in order of probability. The comparative weight of the first choice cannot be quantified. The recommended value for this parameter is the default value of 1. |
FORMAT
The FORMAT paramater is optional.
| Property | Description |
|---|---|
| Default Value | text |
| Purpose | This parameter identifies the type of file to be scanned. The possible values are html, text, and auto. |
BEGIN
The BEGIN parameter is optional.
| Property | Description |
|---|---|
| Default value | 1 |
| Minimum value | 1 |
| Maximum value | Number of bytes in file |
| Purpose | The byte of the input file where LCSSCAN begins the scanning process. The default value is the first byte of the input file. |
END
The END parameter is optional.
| Property | Description |
|---|---|
| Default value | End of file |
| Minimum value | 3 |
| Maximum value | Number of bytes in file |
| Purpose | The last byte of the input file that LCSSCAN scans. The default value is the last byte of the input file. |
FILE
The FILE parameter is required.
| Property | Description |
|---|---|
| Default value | None |
| Purpose | Specifies the name of a text file to be scanned |
Example 12-1 Specifying Only the File Name in the LCSSCAN Command
LCSSCAN FILE=example.txt
In this example, the entire example.txt file is scanned because the BEGIN and END parameters have not been specified. One language and character set pair will be returned because the RESULTS parameter has not been specified.
Example 12-2 Specifying the Format as HTML
LCSSCAN FILE=example.html FORMAT=html
In this example, the entire example.html file is scanned because the BEGIN and END parameters have not been specified. The scan will strip HTML tags before the scan, thus results are more accurate. One language and character set pair will be returned because the RESULTS parameter has not been specified.
Example 12-3 Specifying the RESULTS and BEGIN Parameters for LCSSCAN
LCSSCAN RESULTS=2 BEGIN=50 FILE=example.txt
The scanning process starts at the 50th byte of the file and continues to the end of the file. Two language and character set pairs will be returned.
To obtain a summary of the Language and Character Set File Scanner parameters, enter the following command:
LCSSCAN HELP=y
The resulting output shows a summary of the Language and Character Set Scanner parameters.
The Language and Character Set File Scanner supports several character sets for each language.
When the binary values for a language match two or more encodings that have a subset/superset relationship, the subset character set is returned. For example, if the language is German and all characters are 7-bit, then US7ASCII is returned instead of WE8MSWIN1252, WE8ISO8859P15, or WE8ISO8859P1.
When the character set is determined to be UTF-8, the Oracle character set UTF8 is returned by default unless 4-byte characters (supplementary characters) are detected within the text. If 4-byte characters are detected, then the character set is reported as AL32UTF8.
See Also:
"Language and Character Set Detection Support" for a list of supported languages and character setsThe Database Character Set Scanner assesses the feasibility of migrating an Oracle database to a new database character set. The Database Character Set Scanner checks all character data in the database and tests for the effects and problems of changing the character set encoding. A summary report is generated at the end of the scan that shows the scope of work required to convert the database to a new character set.
Based on the information in the summary report, you can decide on the most appropriate method to migrate the database's character set. The methods are:
Export and Import utilities
CSALTER script
CSALTER script with Export and Import utilities on selected tables
Note:
If the Database Character Set Scanner reports conversion exceptions, then these problems must be fixed before using any of the described methods. This may involve further data analysis and modifying the problem data to eliminate those exceptions. In extreme cases, both the database and the application might need to be modified. Oracle Corporation recommends you contact Oracle Consulting Services for services in database character set migration.The Database Character Set Scanner reads the character data and tests for the following conditions on each data cell:
Do character code points of the data cells change when converted to the new character set?
Can the data cells be successfully converted to the new character set?
Will the post-conversion data fit into the current column size?
The Database Character Set Scanner reads and tests for data in CHAR, VARCHAR2, LONG, CLOB, NCHAR, NVARCHAR2, NCLOB and VARRAY columns as well as nested tables. The Database Character Set Scanner does not perform post-conversion column size testing for LONG, CLOB, and NCLOB columns.
The Database Character Set Scanner provides four modes of database scan:
The Database Character Set Scanner reads and verifies the character data of all tables belonging to all users in the database including the data dictionary (such as SYS and SYSTEM users), and it reports on the effects of the simulated migration to the new database character set. It scans all schema objects including stored packages, procedures and functions, and object definitions stored as part of the data dictionary.
To understand the feasibility of migrating your database to a new database character set, you need to perform a full database scan.
The Database Character Set Scanner reads and verifies character data of all tables belonging to the specified user and reports on the effects on the tables of changing the character set.
The Database Character Set Scanner reads and verifies the character data of the specified tables, and reports the effects on the tables of changing the character set.
This section describes how to install and start the Database Character Set Scanner. It includes the following topics:
To use the Database Character Set Scanner, you must have DBA privileges on the Oracle database.
Before using the Database Character Set Scanner, you must run the csminst.sql script to set up the necessary system tables on the database that you plan to scan. The csminst.sql script needs to be run only once. The script performs the following tasks to prepare the database for scanning:
Assigns the necessary privileges to CSMIG
Assigns the default tablespace to CSMIG
Creates the Character Set Scanner system tables under CSMIG
You can modify the default tablespace for CSMIG by editing the csminst.sql script. Modify the following statement in csminst.sql to assign the preferred tablespace to CSMIG as follows:
ALTER USER csmig DEFAULT TABLESPACE tablespace_name;
Ensure that there is sufficient storage space available in the assigned tablespace before scanning the database. The amount of space required depends on the type of scan and the nature of the data in the database.
On UNIX platforms, run csminst.sql using these commands and SQL statement:
% cd $ORACLE_HOME/rdbms/admin % sqlplus sys/password as sysdba SQL> START csminst.sql
You can start the Database Character Set Scanner from the command line by one of these methods:
Using the Database Character Set Scanner parameter file
CSSCAN username/password PARFILE=file_name
Using the command line to specify parameter values. For example:
CSSCAN username/password FULL=y TOCHAR=al32utf8 ARRAY=10240 PROCESS=3
Using an interactive session
CSSCAN username/password
In an interactive session, the Database Character Set Scanner prompts you for the values of the following parameters:
FULL/TABLE/USER
TOCHAR
ARRAY
PROCESS
If you want to specify other parameters, then use the Database Character Set Parameter file or specify the parameter values in the CSSCAN command.
The Database Character Set Scanner parameter file enables you to specify Database Character Set Scanner parameters in a file where they can be easily modified or reused. Create a parameter file using a text editor.
Use one of the following formats to specify parameters in the Database Character Set Scanner parameter file:
parameter_name=value parameter_name=(value1, value2, ...)
You can add comments to the parameter file by preceding them with the pound sign (#). All characters to the right of the pound sign are ignored.
The following is an example of a parameter file:
USERID=system/manager USER=HR # scan HR tables TOCHAR=al32utf8 ARRAY=4096000 PROCESS=2 # use two concurrent scan processes FEEDBACK=1000
The following table shows a summary of parameters for the Database Character Set Scanner. The rest of this section contains detailed descriptions of the parameters.
| Parameter | Default | Prompt | Description |
|---|---|---|---|
| USERID | - | yes | Username/password |
| FULL | N |
yes | Scan entire database |
| USER | - |
yes | Owner of the tables to be scanned |
| TABLE | - | yes | List of tables to scan |
| EXCLUDE | - | no | List of tables to exclude |
| TOCHAR | - | yes | New database character set name |
| FROMCHAR | - | no | Current database character set name |
| TONCHAR | - | no | New national character set name |
| FROMNCHAR | - | no | Current national character set name |
| ARRAY | 1024000 |
yes | Size of array fetch buffer |
| PROCESS | 1 |
yes | Number of concurrent scan processes |
| MAXBLOCKS | - | no | The maximum number of blocks that can be in a table without the table being split |
| CAPTURE | N |
no | Capture convertible data |
| COLUMN | - | no | List of columns to scan |
| QUERY | - | no | Query to apply to restrict output before scan |
| SUPPRESS | - | no | Maximum number of exceptions logged for each table |
| FEEDBACK | - | no | Report progress every n rows |
| BOUNDARIES | - | no | List of column size boundaries for summary report |
| LASTRPT | N |
no | Generate report of the previous database scan |
| LOG | scan |
no | Base file name for report files |
| PARFILE | - | no | Parameter file name |
| PRESERVE | N |
no | Preserve existing scan results |
| LCSD | N |
no | Enable language and character set detection |
| LCSDDATA | LOSSY |
no | Define the scope of the language and character set detection |
| HELP | N |
no | Show help screen |
ARRAY
| Property | Description |
|---|---|
| Default value | 1024000 |
| Minimum value | 4096 |
| Maximum value | Unlimited |
| Purpose | Specifies the size in bytes of the array buffer used to fetch data. The size of the array buffer determines the number of rows fetched by the Database Character Set Scanner at any one time. |
The following formula estimates the number of rows fetched at one time for a given table:
rows fetched = (ARRAY value) / [(sum of all the CHAR and VARCHAR2 column sizes) + (number of CLOB columns * 4000) + (number of VARRAY columns * 4000)]
For example, suppose table A contains two CHAR columns (5 bytes and 10 bytes), two VARCHAR2 columns (100 bytes and 200 bytes), and one CLOB column. If ARRAY=1024000 (the default), then the number of rows fetched is calculated as follows:
1024000/[5 + 10 + 100 + 200 + (1*4000) + (0*4000)] = 237.3
The Database Character Set Scanner can fetch 23 rows of data at one time from table A.
If the sum in the denominator exceeds the value of the ARRAY parameter, then the Database Character Set Scanner fetches only one row at a time. Tables with LONG columns are fetched only one row at a time.
This parameter affects the duration of a database scan. In general, the larger the size of the array buffer, the shorter the duration time. Each scan process allocates the specified size of array buffer.
BOUNDARIES
| Property | Description |
|---|---|
| Default value | None |
| Purpose | Specifies the list of column boundary sizes that are used for an application data conversion summary report. This parameter is used to locate the distribution of the application data for the CHAR, VARCHAR2, NCHAR, and NVARCHAR2 datatypes. |
For example, if you specify a BOUNDARIES value of (10, 100, 1000), then the application data conversion summary report produces a breakdown of the CHAR data into the following groups by their column length, CHAR(1..10), CHAR(11..100) and CHAR(101..1000). The behavior is the same for the VARCHAR2, NCHAR, and NVARCHAR2 datatypes.
| Property | Description |
|---|---|
| Default value | N |
| Range of values | Y or N |
| Purpose | Indicates whether to capture the information on the individual convertible rows, as well as the default of storing only the exception rows. Information regarding the convertible rows is written to the CSM$ERRORS table if the CAPTURE parameter is set to Y. It records the data that needs to be converted during the conversion to the target character set. When CAPTURE is set to Y, the data dictionary CONVERTIBLE data cells are also listed in the database scan individual exception report scan.err. With CAPTURE set to Y, the amount of time required to complete the scan can increase and more storage space may be required. |
| Property | Description |
|---|---|
| Default value | None |
| Purpose | Specifies the names of the columns to be scanned |
When this parameter is specified, the Database Character Set Scanner scans the specified columns. You can specify the following when you specify the name of the column:
schemaname specifies the names of the user's schema from which to scan the table
tablename specifies the name of the table from which to scan the column
columnname specifies the name of the column to be scanned
For example, the following command scans the columns LASTNAME and FIRSTNAME in the hr sample schema:
CSSCAN system/manager COLUMN=(HR.EMPLOYEES.LASTNAME, HR.EMPLOYEES.FIRSTNAME) ...
| Property | Description |
|---|---|
| Default value | None |
| Purpose | Specifies the names of the tables to be excluded from the scan |
When this parameter is specified, the Database Character Set Scanner excludes the specified tables from the scan. You can specify the following when you specify the name of the table:
schemaname specifies the name of the user's schema from which to exclude the table
tablename specifies the name of the table or tables to be excluded
For example, the following command scans all of the tables that belong to the hr sample schema except for the employees and departments tables:
CSSCAN system/manager USER=HR EXCLUDE=(HR.EMPLOYEES , HR.DEPARTMENTS) ...
FEEDBACK
| Property | Description |
|---|---|
| Default value | None |
| Minimum value | 100 |
| Maximum value | 100000 |
| Purpose | Specifies that the Database Character Set Scanner should display a progress meter in the form of a dot for every N number of rows scanned |
For example, if you specify FEEDBACK=1000, then the Database Character Set Scanner displays a dot for every 1000 rows scanned. The FEEDBACK value applies to all tables being scanned. It cannot be set for individual tables.
FROMCHAR
| Property | Description |
|---|---|
| Default value | None |
| Purpose | Specifies the current character set name for CHAR, VARCHAR2, LONG, and CLOB datatypes in the database. By default, the Database Character Set Scanner assumes the character set for these datatypes to be the database character set. |
Use this parameter to override the default database character set definition for CHAR, VARCHAR2, LONG, and CLOB data in the database.
FROMNCHAR
| Property | Description |
|---|---|
| Default value | The current database character set. |
| Purpose | Specifies the current national database character set name for NCHAR, NVARCHAR2, and NCLOB datatypes in the database. By default, the Database Character Set Scanner assumes the character set for these datatypes to be the database national character set. |
Use this parameter to override the default database character set definition for NCHAR, NVARCHAR2, and NCLOB data in the database.
FULL
| Property | Description |
|---|---|
| Default value | N |
| Range of values | Y or N |
| Purpose | Indicates whether to perform the full database scan (that is, to scan the entire database including the data dictionary). Specify FULL=Y to scan in full database mode. |
See Also:
"Scan Modes in the Database Character Set Scanner" for more information about full database scansHELP
| Property | Description |
|---|---|
| Default value | N |
| Range of values | Y or N |
| Purpose | Displays a help message with the descriptions of the Database Character Set Scanner parameters |
LASTRPT
| Property | Description |
|---|---|
| Default value | N |
| Range of values | Y or N |
| Purpose | Indicates whether to regenerate the Database Character Set Scanner reports based on statistics gathered from the previous database scan |
If LASTRPT=Y is specified, then the Database Character Set Scanner does not scan the database, but creates the report files using the information left by the previous database scan session instead.
If LASTRPT=Y is specified, then only the USERID, BOUNDARIES, and LOG parameters take effect.
LCSD
| Property | Description |
|---|---|
| Default value | N |
| Range of values | Y or N |
| Purpose | Indicates whether to apply language and character set detection during scanning |
If LCSD=Y is specified, then the Database Character Set Scanner (CSSCAN) performs language and character set detection on the data cells categorized by the LCSDDATA parameter. The accuracy of the detection depends greatly on the size and the quality of the text being analyzed. The ideal case is literary text of one single language with no spelling or grammatical errors. Data cells that contain a mixture of languages or character sets or text such as addresses and names can yield poor results. When CSSCAN cannot determine the most likely language and character set, it may return up to three most likely languages and character sets for each cell. In some cases it may return none. CSSCAN ignores any data cells with less than 10 bytes of data and returns UNKNOWN for their language and character set.
The language and character set detection is a statistically-based technology, so its accuracy varies depending on the quality of the input data. The goal is to provide CSSCAN users with additional information about unknown data inside the database. It is important for CSSCAN users to review the detection results and the data itself before migrating the data to another character set.
Note that language and character set detection can affect the performance of the Database Character Set Scanner, depending on the amount of data that is being analyzed.
LCSDDATA
| Property | Description |
|---|---|
| Default value | LOSSY |
| Range of values | LOSSY, TRUNCATION, CONVERTIBLE, CHANGELESS, ALL |
| Purpose | Specifies the scope of the language and character set detection. The default is to apply the detection to only the LOSSY data cells. |
This parameter takes effect only when LCSD=Y is specified. For example, if LCSD=Y and LCSDDATA=LOSSY, CONVERTIBLE, then the Database Character Set Scanner tries to detect the character sets and languages of the data cells that are either LOSSY or CONVERTIBLE. Data that is classified as CHANGELESS and TRUNCATION will not be processed. Setting LCSDDATA=ALL results in language and character set detection for all data cells scanned in the current session.
After language and character set detection has been applied to CONVERTIBLE and TRUNCATION data cells, some data cells may change from their original classification to LOSSY. This occurs when the character set detection process determines that the character set of these data cells is not the character set specified in the FROMCHAR parameter.
LOG
| Property | Description |
|---|---|
| Default value | scan |
| Purpose | Specifies a base file name for the following Database Character Set Scanner report files:
|
By default, the Database Character Set Scanner generates the three text files, scan.txt, scan.err, and scan.out in the current directory.
MAXBLOCKS
| Property | Description |
|---|---|
| Default value | None |
| Minimum value | 1000 |
| Maximum value | Unlimited |
| Purpose | Specifies the maximum block size for each table, so that large tables can be split into smaller chunks for the Database Character Set Scanner to process |
For example, if the MAXBLOCKS parameter is set to 1000, then any tables that are greater than 1000 blocks in size are divided into n chunks, where n=CEIL(table block size/1000).
Dividing large tables into smaller pieces is beneficial only when the number of processes set with PROCESS is greater than 1. If the MAXBLOCKS parameter is not set, then the Database Character Set Scanner attempts to split up large tables based on its own optimization rules.
PARFILE
| Property | Description |
|---|---|
| Default value | None |
| Purpose | Specifies the name for a file that contains a list of Database Character Set Scanner parameters |
| Property | Description |
|---|---|
| Default value | N |
| Range of values | Y or N |
| Purpose | Indicates whether to preserve the statistics gathered from the previous scan session |
If PRESERVE=Y is specified, then the Database Character Set Scanner preserves all of the statistics from the previous scan. It adds (if PRESERVE=Y) or overwrites (if PRESERVE=N) the new statistics for the tables being scanned in the current scan request.
PROCESS
| Property | Description |
|---|---|
| Default value | 1 |
| Minimum value | 1 |
| Maximum value | 32 |
| Purpose | Specifies the number of concurrent scan processes to utilize for the database scan |
| Property | Description |
|---|---|
| Default value | None |
| Purpose | Applies a filter to restrict the data to be scanned by specifying a clause for a SELECT statement, which is applied to all tables and columns in the scanner session |
The value of the query parameter is a string that contains a WHERE clause for a SELECT statement that will be applied to all tables and columns listed in the TABLE and COLUMN parameters.
Only one query clause is allowed per scan session. The QUERY parameter is only applicable when performing table or column scans. The parameter will be ignored when performing a Full database or a User scan. QUERY can be applied to multiple tables and columns scans, however, the identical WHERE clause will be appended to all specified tables and columns.
For example, the following command scans the employess who were hired within the last 30 days:
CSSCAN system/manager TABLE=HR.EMPLOYEES QUERY= 'hire_date > SYSDATE - 180' ...
Note that the WHERE clause is not required inside the QUERY parameter. CSSCAN will automatically remove the WHERE clause if it is found to be the first five characters in the QUERY parameter.
SUPPRESS
| Property | Description |
|---|---|
| Default value | Unset (results in unlimited number of rows) |
| Minimum value | 0 |
| Maximum value | Unlimited |
| Purpose | Specifies the maximum number of data exceptions being logged for each table |
The Database Character Set Scanner inserts information into the CSM$ERRORS table when an exception is found in a data cell. The table grows depending on the number of exceptions reported.
This parameter is used to suppress the logging of individual exception information after a specified number of exceptions are inserted for each table. For example, if SUPPRESS is set to 100, then the Database Character Set Scanner records a maximum of 100 exception records for each table.
TABLE
| Property | Description |
|---|---|
| Default value | None |
| Purpose | Specifies the names of the tables to scan |
You can specify the following when you specify the name of the table:
schemaname specifies the name of the user's schema from which to scan the table
tablename specifies the name of the table or tables to be scanned
For example, the following command scans the employees and departments tables in the HR sample schema:
CSSCAN system/manager TABLE=(HR.EMPLOYEES, HR.DEPARTMENTS)
| Property | Description |
|---|---|
| Default value | None |
| Purpose | Specifies a target database character set name for the CHAR, VARCHAR2, LONG, and CLOB data |
TONCHAR
| Property | Description |
|---|---|
| Default value | None |
| Purpose | Specifies a target database character set name for the NCHAR, NVARCHAR2, and NCLOB data |
If you do not specify a value for TONCHAR, then the Database Character Set Scanner does not scan NCHAR, NVARCHAR2, and NCLOB data.
USER
| Property | Description |
|---|---|
| Default value | None |
| Purpose | Specifies the owner of the tables to be scanned |
If the USER parameter is specified, then the Database Character Set Scanner scans all tables belonging to the specified owner. For example, the following statement scans all tables belonging to HR:
CSSCAN system/manager USER=hr ...
USERID
| Property | Description |
|---|---|
| Default value | None |
| Purpose | Specifies the username and password (and optional connect string) of the user who scans the database. If you omit the password, then the Database Character Set Scanner prompts you for it |
The following formats are all valid:
username/password username/password@connect_string username username@connect_string
The following examples show you how to use the command-line and parameter-file methods for the Full Database, User, Single Table, and Column scan modes.
The following examples show how to scan the full database to see the effects of migrating it to AL32UTF8. This example assumes that the current database character set is WE8ISO8859P1.
% csscan system/manager parfile=param.txt
The param.txt file contains the following information:
full=y tochar=al32utf8 array=4096000 process=4
The scan.out file shows which tables were scanned. The default file name for the report can be changed by using the LOG parameter.
See Also:
"LOG"Database Character Set Scanner v2.1 : Release 10.2.0.0.0 - Production Copyright (c) 1982, 2005, Oracle. All rights reserved. Connected to: Oracle Database 10g Enterprise Edition Release 10.2.0.0.0 - Production With the Partitioning and Data Mining options Enumerating tables to scan... . process 1 scanning SYS.SOURCE$[AAAABHAABAAAAJqAAA] . process 2 scanning SYS.TAB$[AAAAACAABAAAAA0AAA] . process 2 scanning SYS.CLU$[AAAAACAABAAAAA0AAA] . process 2 scanning SYS.ICOL$[AAAAACAABAAAAA0AAA] . process 2 scanning SYS.COL$[AAAAACAABAAAAA0AAA] . process 1 scanning SYS.IND$[AAAAACAABAAAAA0AAA] . process 1 scanning SYS.TYPE_MISC$[AAAAACAABAAAAA0AAA] . process 1 scanning SYS.LOB$[AAAAACAABAAAAA0AAA] . . . . process 1 scanning IX.AQ$_ORDERS_QUEUETABLE_G . process 2 scanning IX.AQ$_ORDERS_QUEUETABLE_I Creating Database Scan Summary Report... Creating Individual Exception Report... Scanner terminated successfully.
The following example shows how to scan the user tables to see the effects of migrating them to AL32UTF8. This example assumes the current database character set is US7ASCII, but the actual data stored is in Western European WE8MSWIN1252 encoding.
% csscan system/manager parfile=param.txt
The param.txt file contains the following information:
user=hr fromchar=we8mswin1252 tochar=al32utf8 array=4096000 process=1
% csscan system/manager user=hr fromchar=we8mswin1252
tochar=al32utf8 array=4096000 process=1
The scan.out file shows which tables were scanned.
Database Character Set Scanner v2.1 : Release 10.2.0.0.0 - Production Copyright (c) 1982, 2005, Oracle. All rights reserved. Connected to: Oracle Database 10g Enterprise Edition Release 10.2.0.0.0 - Production With the Partitioning and Data Mining options Copyright (c) 1983, 2005 Oracle Corporation. All rights reserved. Enumerating tables to scan... . process 1 scanning HR.JOBS . process 1 scanning HR.DEPARTMENTS . process 1 scanning HR.JOB_HISTORY . process 1 scanning HR.EMPLOYEES Creating Database Scan Summary Report... Creating Individual Exception Report... Scanner terminated successfully.
The following example shows how to scan a single table to see the effects of migrating it to WE8MSWIN1252. This example assumes the current database character set is in US7ASCII. Language and character set detection is performed on the LOSSY data cells.
% csscan system/manager parfile=param.txt
The param.txt file contains the following information:
table=hr.employees tochar=we8mswin1252 array=4096000 process=1 suppress=100 lcsd=y
% csscan system/manager table=hr.employees tochar=we8mswin1252 array=4096000 process=1 supress=100 lcsd=y
The scan.out file shows which tables were scanned.
Database Character Set Scanner v2.1 : Release 10.2.0.0.0 - Production Copyright (c) 1982, 2005, Oracle. All rights reserved. Connected to: Oracle Database 10g Enterprise Edition Release 10.2.0.0.0 - Production With the Partitioning and Data Mining options Copyright (c) 1983, 2005 Oracle Corporation. All rights reserved. . process 1 scanning HR.EMPLOYEES Creating Database Scan Summary Report... Creating Individual Exception Report... Scanner terminated successfully.
The following example shows how to scan a single table to see the effect of migrating it to WE8MSWIN1252. Before scanning, a query is run against the table to limit the rows that will be scanned. This example assumes the current database character set is in US7ASCII. Language and character set detection is performed on the LOSSY data cells.
% csscan system/manager parfile=param.txt
The param.txt file contains the following information:
table=hr.employees query='hire_date > SYSDATE - 180' tochar=we8mswin1252 array=4096000 process=1 suppress=100 lcsd=y
% csscan system/manager table=hr.employees query='hire_date > SYSDATE - 180' tochar=we8mswin1252 array=4096000 process=1 supress=100 lcsd=y
The scan.out file shows which tables were scanned.
Database Character Set Scanner v2.1 : Release 10.2.0.0.0 - Production Copyright (c) 1982, 2005, Oracle. All rights reserved. Connected to: Oracle Database 10g Enterprise Edition Release 10.2.0.0.0 - Production With the Partitioning and Data Mining options Copyright (c) 1983, 2005 Oracle Corporation. All rights reserved. . process 1 scanning HR.EMPLOYEES Creating Database Scan Summary Report... Creating Individual Exception Report... Scanner terminated successfully.
The following example shows how to scan columns within a table to see the effects of migrating it to WE8MSWIN1252. This example assumes the current database character set is in US7ASCII. Language and character set detection is performed on the LOSSY data cells.
% csscan system/manager parfile=param.txt
The param.txt file contains the following information:
column=(hr.employees.lastname, hr.employees.firstname) tochar=we8mswin1252 array=4096000 process=1 suppress=100 lcsd=y
% csscan system/manager column=(hr.employees.lastname, hr.employees.firstname) tochar=we8mswin1252 array=4096000 process=1 supress=100 lcsd=y
The scan.out file shows which tables were scanned.
Database Character Set Scanner v2.1 : Release 10.2.0.0.0 - Production Copyright (c) 1982, 2005, Oracle. All rights reserved. Connected to: Oracle Database 10g Enterprise Edition Release 10.2.0.0.0 - Production With the Partitioning and Data Mining options Copyright (c) 1983, 2005 Oracle Corporation. All rights reserved. . process 1 scanning HR.EMPLOYEES Creating Database Scan Summary Report... Creating Individual Exception Report... Scanner terminated successfully.
The Database Character Set Scanner generates two reports for each scan:
The Database Scan Summary Report is found in the scan.txt file. The Database Scan Individual Exception Report is found in the scan.err file.
The default file names for the reports can be changed by using the LOG parameter.
See Also:
"LOG"The scan.txt file contains the Database Scan Summary Report. The following output is an example of the report header. This section contains the time when each process of the scan was performed.
Database Scan Summary Report
Time Started : 2002-12-16 20:35:56
Time Completed: 2002-12-16 20:37:31
Process ID Time Started Time Completed
---------- -------------------- --------------------
1 2002-12-16 20:36:07 2002-12-16 20:37:30
2 2002-12-16 20:36:07 2002-12-16 20:37:30
---------- -------------------- --------------------
The report consists of the following sections:
The information available for each section depends on the type of scan and the parameters you select.
This section reports on the current database size as well as identifying the amount of potential data expansion after the character set migration.
The following output is an example.
Tablespace Used Free Total Expansion ------------------------- --------------- --------------- --------------- --------------- SYSTEM 206.63M 143.38M 350.00M 588.00K SYSAUX 8.25M 131.75M 140.00M .00K ------------------------- --------------- --------------- --------------- --------------- Total 214.88M 275.13M 490.00M 588.00K The size of the largest CLOB is 57370 bytes
This section describes the parameters selected and the type of scan you chose. The following output is an example.
Parameter Value ------------------------------ ------------------------------------------------ CSSCAN Version v2.1 Instance Name rdbms06 Database Version 10.2.0.0.0 Scan type Full database Scan CHAR data? YES Database character set WE8ISO8859P1 FROMCHAR WE8ISO8859P1 TOCHAR AL32UTF8 Scan NCHAR data? NO Array fetch buffer size 1024000 Number of processes 2 Capture convertible data? NO Charset Language Detections Yes Charset Language Parameter LOSSY ------------------------------ ------------------------------------------------
This section indicates the feasibility of the database character set migration. There are two basic criteria that determine the feasibility of the character set migration of the database. One is the condition of the data dictionary and the other is the condition of the application data.
The Scan Summary section consists of two status lines: one line reports on the data dictionary, and the other line reports on the application data.
The following is sample output from the Scan Summary:
All character type data in the data dictionary are convertible to the new character set Some character type application data are not convertible to the new character set
Table 12-1 shows the types of status that can be reported for the data dictionary and application data.
Table 12-1 Possible Status of the Data Dictionary and Application Data
| Data Dictionary Status | Application Data Status |
|---|---|
|
All character-type data in the data dictionary remains the same in the new character set. |
All character-type application data remains the same in the new character set. |
|
All character-type data in the data dictionary is convertible to the new character set. |
All character-type application data is convertible to the new character set. |
|
Some character-type data in the data dictionary is not convertible to the new character set. |
Some character-type application data is not convertible to the new character set. |
When all data remains the same in the new character set, it means that the encoding values of the original character set are identical in the target character set. For example, ASCII data is stored using the same binary values in both WE8ISO8859P1 and AL32 UTF8. In this case, the database character set can be migrated using the CSALTER script.
If all the data is convertible to the new character set, then the data can be safely migrated using the Export and Import utilities. However, the migrated data will have different encoding values in the target character set.
See Also:
"Database Scan Individual Exception Report" for more information about non-convertible data
This section contains the statistics about the conversion of the data in the data dictionary. The numbers of data cells with each type of status are reported by datatype. To achieve a comprehensive data dictionary conversion summary, you need to use a full database scan.
Table 12-2 describes the possible types of status of a data cell.
Table 12-2 Possible Status of Data
| Status | Description |
|---|---|
|
|
Data remains the same in the new character set |
|
|
Data can be successfully converted to the new character set |
|
|
Data will be truncated if conversion takes place |
|
|
Character data will be lost if conversion takes place |
The following output is an example.
Datatype Changeless Convertible Truncation Lossy --------------------- ---------------- ---------------- ---------------- ---------------- VARCHAR2 1,214,557 0 0 0 CHAR 967 0 0 0 LONG 88,657 0 0 0 CLOB 138 530 0 0 VARRAY 18 0 0 0 --------------------- ---------------- ---------------- ---------------- ---------------- Total 1,304,337 530 0 0 Total in percentage 99.959% 0.041% 0.000% 0.000% The data dictionary can be safely migrated using the CSALTER script.
If the numbers of data cells recorded in the Convertible, Truncation, and Lossy columns are zero, then no data conversion is required to migrate the data dictionary from the FROMCHAR character set to the TOCHAR character set.
If the numbers in the Truncation and Lossy columns are zero and some numbers in the Convertible columns are not zero, then all data in the data dictionary is convertible to the new character set. However, it is dangerous to convert the data in the data dictionary without understanding their impact on the database. The CSALTER script can convert some of the convertible cells in the data dictionary. The message that follows the conversion summary table indicates whether this conversion can be supported by the CSALTER script.
If the numbers in the Lossy column are not zero, then there is data in the data dictionary that is not convertible. Therefore, it is not feasible to migrate the current database to the new character because the export and import processes cannot convert the data into the new character set. For example, you might have a table name with invalid characters or a PL/SQL procedure with a comment line that includes data that cannot be mapped to the new character set. These changes to schema objects must be corrected manually before migration to a new character set.
If the numbers in the Truncation column are not zero, then the export and import process would truncate the data.
This section contains the statistics on conversion summary of the application data. The numbers of data cells with each type of status are reported by datatype. Table 12-2 describes the types of status that can be reported.
The following output is an example.
Datatype Changeless Convertible Truncation Lossy --------------------- ---------------- ---------------- ---------------- ---------------- VARCHAR2 37,757 3 0 0 CHAR 6,404 0 0 0 LONG 4 0 0 0 CLOB 23 20 0 1 VARRAY 319 0 0 0 --------------------- ---------------- ---------------- ---------------- ---------------- Total 44,507 23 0 1 Total in percentage 99.946% 0.051% 0% 0.002%
This section contains the conversion summary of the CHAR and VARCHAR2 application data. The number of data cells with each type of status are reported by column size boundaries specified by the BOUNDARIES parameter. Table 12-2 describes the possible types of status.
This information is available only when the BOUNDARIES parameter is specified.
The following output is an example.
Datatype Changeless Convertible Truncation Lossy --------------------- ---------------- ---------------- ---------------- ---------------- VARCHAR2(1..30) 28,702 2 0 0 VARCHAR2(31..4000) 9,055 1 0 0 --------------------- ---------------- ---------------- ---------------- ---------------- CHAR(1..30) 6,404 0 0 0 CHAR(31..4000) 0 0 0 0 --------------------- ---------------- ---------------- ---------------- ---------------- Total 44,161 3 0 0
This section shows how Convertible, Truncation, and Lossy data is distributed within the database. The statistics are reported by table. If the list contains only a few rows, then the Convertible data is localized. If the list contains many rows, then the Convertible data occurs throughout the database.
The following output is an example.
USER.TABLE Convertible Truncation Lossy ---------------------------- ---------------- ---------------- ---------------- HR.EMPLOYEES 1 0 0 OE.CUSTOMERS 2 0 0 PM.ONLINE_MEDIA 13 0 0 PM.PRINT_MEDIA 7 0 1 SYS.EXTERNAL_TAB$ 1 0 0 SYS.METASTYLESHEET 80 0 0 ---------------------------- ---------------- ---------------- ----------------
This section shows how Convertible, Truncation, and Lossy data is distributed within the database. The statistics are reported by column. The following output is an example.
USER.TABLE|COLUMN Convertible Truncation Lossy ------------------------------------- ---------------- ---------------- ---------------- HR.EMPLOYEES|FIRST_NAME 1 0 0 OE.CUSTOMERS|CUST_EMAIL 1 0 0 OE.CUSTOMERS|CUST_FIRST_NAME 1 0 0 PM.ONLINE_MEDIA|SYS_NC00042$ 6 0 0 PM.ONLINE_MEDIA|SYS_NC00062$ 7 0 0 PM.PRINT_MEDIA|AD_FINALTEXT 3 0 1 PM.PRINT_MEDIA|AD_SOURCETEXT 4 0 0 SYS.EXTERNAL_TAB$|PARAM_CLOB 1 0 0 SYS.METASTYLESHEET|STYLESHEET 80 0 0 ------------------------------------- ---------------- ---------------- ----------------
This generates a list of all the indexes that are affected by the database character set migration. These can be rebuilt after the data has been imported. The following output is an example.
USER.INDEX on USER.TABLE(COLUMN) -------------------------------------------------------------------------------------- HR.EMP_NAME_IX on HR.EMPLOYEES(FIRST_NAME) HR.EMP_NAME_IX on HR.EMPLOYEES(LAST_NAME) OE.CUST_EMAIL_IX on OE.CUSTOMERS(CUST_EMAIL) --------------------------------------------------------------------------------------
This section applies only to columns that are defined using character semantics. The Truncation Due to Character Semantics section identifies the number of data cells that would be truncated if they were converted to the target character set (for example, by the SQL CONVERT function or another inline conversion process) before the database character set is updated with the CSALTER script. If the data conversion occurs after the database character set is changed, then this section can be ignored.
For example, a VARCHAR2(5 char) column in a WE8MSWIN1252 database can store up to 5 characters, using 5 bytes. When these characters are migrated to AL32UTF8, the same 5 characters can expand to as much as 20 bytes in length. Because the physical byte limits allocated for the column are determined by the current database character set, this column must be manually expanded to 20 bytes before the data can be converted in the target character set. Alternatively, you can apply the character set conversion to this column after the database character set has been changed. Then the same VARCHAR2(5 char) definition will automatically allocate 20 bytes, and no special handling is required.
The following output is an example of the Truncation Due To Character Semantics section of the report.
USER.TABLE|COLUMN Truncation -------------------------------------------------- ---------------- HR.EMPLOYEES|FIRST_NAME 1
This section appears when the language and character set detection is turned on by the LCSD parameter. It displays a list of character sets detected by the Database Character Set Scanner. The character sets are ordered by occurrence. NUMBER refers to the number of data cells.
The following output is an example of the Character Set Detection Result section.
CHARACTER SET NUMBER PERCENTAGE ------------------------------ ---------------- ---------------- WE8MSWIN1252 38 97.436% UNKNOWN 1 2.564% ------------------------------ ---------------- ----------------
This section appears when the language and character set detection is turned on by the LCSD parameter. It displays a list of the languages detected by the Database Character Set Scanner. The languages are ordered by occurrence.
The following output is an example of the Language Detection Result Section.
LANGUAGE NUMBER PERCENTAGE ------------------------------ ---------------- ---------------- ENGLISH 36 92.308% FRENCH 2 5.128% UNKNOWN 1 2.564% ------------------------------ ---------------- ----------------
The scan.err file contains the Individual Exception Report. It consists of the following summaries:
This section describes the parameters and the type of scan chosen. The following output is an example.
Parameter Value ------------------------------ ------------------------------------------------ CSSCAN Version v2.1 Instance Name rdbms06 Database Version 10.2.0.0.0 Scan type Full database Scan CHAR data? YES Database character set